Getting data from a web page with Python and BeautifulSoup

1 answer

Anonymous user

#1 · Posted 2024-10-02 00:38:07

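First, I'd like to point out a few problems with the code you posted. To begin with, the glob module is not normally used for making HTTP requests. It is useful for iterating over a subset of files on a specified path, and you can read more about it in its docs.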
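To illustrate the point, here is a minimal sketch (Python 3, standard library only). The temporary directory and file names are made up for the demonstration. It shows glob matching local files and finding nothing when handed a URL pattern, because the URL is treated as an ordinary local path.

```python
import glob
import os
import tempfile

# glob matches names on the local filesystem.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.txt", "b.txt", "c.log"):  # hypothetical file names
        open(os.path.join(tmp, name), "w").close()
    local_matches = sorted(
        os.path.basename(p) for p in glob.glob(os.path.join(tmp, "*.txt"))
    )

# A URL pattern is just another (nonexistent) local path, so nothing matches
# and no HTTP request is ever made.
url_matches = glob.glob("http://www.asusparts.eu/partfinder/*")

print(local_matches)  # ['a.txt', 'b.txt']
print(url_matches)    # []
```

So the loop over glob.glob("http://...") never fetches anything: its body would simply run zero times.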

The second problem is this line:

for file in glob.glob("http://www.asusparts.eu/partfinder/*"):

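You have an indentation error here, because no indented code follows the for statement. This raises an error and prevents the rest of the code from being executed.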
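A small sketch of that failure mode, assuming Python 3. The loop and the print call are placeholders, not your original code. Compiling a for statement with no indented body raises IndentationError before any line of the script runs.

```python
# A for statement whose next line is not indented has no body.
source = (
    'for path in ["a", "b"]:\n'
    'print("done")\n'  # not indented, so the for loop has no body
)

try:
    compile(source, "<example>", "exec")
    error_name = None
except IndentationError as exc:
    # Raised at compile time, which is why nothing in the file gets executed.
    error_name = type(exc).__name__

print(error_name)

# Giving the loop an indented body (even just `pass`) fixes the error.
fixed = (
    'for path in ["a", "b"]:\n'
    '    pass\n'
    'print("done")\n'
)
compile(fixed, "<example>", "exec")
```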

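Another problem is that you are using some of Python's built-in names for your variables. Never use words such as all or file as variable names, because doing so shadows the built-in.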
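As a quick illustration (a sketch in Python 3, where all is still a built-in while file is a built-in only in Python 2), rebinding the name hides the built-in function for the rest of that scope. The function names here are made up for the example.

```python
def shadowing_demo():
    all = [1, 2, 3]  # shadows the built-in all() inside this function
    try:
        all([True, False])  # now calls a list, not the built-in
    except TypeError as exc:
        return str(exc)


def renamed_demo():
    all_parts = [1, 2, 3]  # a descriptive name leaves all() usable
    return all(part > 0 for part in all_parts)


print(shadowing_demo())
print(renamed_demo())
```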

Finally, when you loop over option_tags:

for option in option_tags:
    open(url + option['value'])

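The open statement will try to open a local file whose path is url + option['value']. This will probably raise an error, since I doubt you have a file at that location. Also, be aware that you are not doing anything with this open file.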
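A minimal sketch of the difference, assuming Python 3. The option value "Asus" and the fetch helper are hypothetical. open() looks for a local file and fails, whereas a URL has to be fetched with an HTTP client such as urllib.request.

```python
from urllib.request import urlopen

url = "http://www.asusparts.eu/partfinder/"
value = "Asus"  # hypothetical option value

# open() treats the string as a local path, so this fails with an OSError
# (FileNotFoundError on most systems) instead of downloading anything.
try:
    handle = open(url + value)
except OSError as exc:
    result = type(exc).__name__
else:
    handle.close()
    result = "opened"

print(result)


def fetch(page_url, timeout=10):
    """Download a page over HTTP and return its raw bytes (needs network)."""
    with urlopen(page_url, timeout=timeout) as response:
        return response.read()
```

fetch() is defined but not called here, since calling it needs network access.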

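OK, enough with the critique. I've taken a look at the Asus page and I think I have an idea of what you want to accomplish. From what I understand, you want to scrape a list of parts (images, text, price, etc.) for each computer model on the Asus page. Each model has its parts list at a unique URL (for example: http://www.asusparts.eu/partfinder/Asus/Desktop/B%20Series/BM2220). This means you need to be able to build this unique URL for each model. To make matters more complicated, each parts category is loaded dynamically, so, for example, the parts in the "Cooling" section are not loaded until you click the "Cooling" link. This means we have a two-part problem: 1) get all valid (brand, type, family, model) combinations, and 2) figure out how to load all the parts for a given model.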
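Building that per-model URL can be sketched as follows (Python 3). The helper name build_model_url is made up for illustration, and percent-encoding every path segment is an assumption based on the %20 in the example URL above.

```python
from urllib.parse import quote

BASE_URL = "http://www.asusparts.eu/partfinder/"


def build_model_url(brand, type_, family, model):
    """Join the four parts into one URL, percent-encoding each segment."""
    segments = (brand, type_, family, model)
    return BASE_URL + "/".join(quote(segment, safe="") for segment in segments)


print(build_model_url("Asus", "Desktop", "B Series", "BM2220"))
# http://www.asusparts.eu/partfinder/Asus/Desktop/B%20Series/BM2220
```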

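I was a bit bored and decided to write a simple program that takes care of most of the heavy lifting. It isn't the most elegant thing, but it will get the job done. Step 1) is handled in get_model_information(). Step 2) is taken care of in parse_models(), but is a little less obvious. Looking at the Asus site, whenever you click on a parts subsection, a JavaScript function runs that makes an ajax call to a formatted PRODUCTS_URL (see below). The response is some JSON information that is used to populate the section you clicked.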

import urllib2
import json
import urlparse
from bs4 import BeautifulSoup

BASE_URL = 'http://www.asusparts.eu/partfinder/'
PRODUCTS_URL = 'http://json.zandparts.com/api/category/GetCategories/'\
               '44/EUR/{model}/{family}/{accessory}/{brand}/null/'
ACCESSORIES = ['Cable', 'Cooling', 'Cover', 'HDD', 'Keyboard', 'Memory',
               'Miscellaneous', 'Mouse', 'ODD', 'PS', 'Screw']


def get_options(url, select_id):
    """
    Gets all the options from a select element.
    """
    r = urllib2.urlopen(url)
    soup = BeautifulSoup(r)
    select = soup.find('select', id=select_id)
    try:
        options = [option for option in select.strings]
    except AttributeError:
        print url, select_id, select
        raise
    return options[1:]  # The first option is the menu text


def get_model_information():
    """
    Finds all the models for each family, all the families and models for each
    type, and all the types, families, and models for each brand.

    These are all added as tuples (brand, type, family, model) to the list
    model_info, which is returned.
    """
    model_info = []

    print "Getting brands"
    brand_options = get_options(BASE_URL, 'mySelectList')

    for brand in brand_options:
        print "Getting types for {0}".format(brand)
        # brand = brand.replace(' ', '%20')  # URL encode spaces
        brand_url = urlparse.urljoin(BASE_URL, brand.replace(' ', '%20'))
        types = get_options(brand_url, 'mySelectListType')

        for _type in types:
            print "Getting families for {0}->{1}".format(brand, _type)
            bt = '{0}/{1}'.format(brand, _type)
            type_url = urlparse.urljoin(BASE_URL, bt.replace(' ', '%20'))
            families = get_options(type_url, 'myselectListFamily')

            for family in families:
                print "Getting models for {0}->{1}->{2}".format(brand,
                                                                _type, family)
                btf = '{0}/{1}'.format(bt, family)
                fam_url = urlparse.urljoin(BASE_URL, btf.replace(' ', '%20'))
                models = get_options(fam_url, 'myselectListModel')

                model_info.extend((brand, _type, family, m) for m in models)

    return model_info


def parse_models(model_information):
    """
    Get all the information for each accessory type for every
    (brand, type, family, model). accessory_info will be the python formatted
    json results. You can parse, filter, and save this information or use
    it however suits your needs.
    """

    for brand, _type, family, model in model_information:
        for accessory in ACCESSORIES:
            r = urllib2.urlopen(PRODUCTS_URL.format(model=model, family=family,
                                                 accessory=accessory,
                                                 brand=brand,))
            accessory_info = json.load(r)
            # Do something with accessory_info
            # ...


def main():
    models = get_model_information()
    parse_models(models)


if __name__ == '__main__':
    main()
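One detail of get_options() worth a closer look: select.strings also yields the whitespace between tags when the markup is indented. The sketch below (Python 3, bs4 installed) is an alternative that reads the option elements directly. The sample HTML and the parse_options name are made up for the demonstration, and the real page's markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, shaped like a brand drop-down.
SAMPLE_HTML = """
<select id="mySelectList">
  <option value="">Select brand</option>
  <option value="Asus">Asus</option>
  <option value="Example Brand">Example Brand</option>
</select>
"""


def parse_options(html, select_id):
    """Return the label of every option in a select, minus the menu text."""
    soup = BeautifulSoup(html, "html.parser")
    select = soup.find("select", id=select_id)
    if select is None:
        raise ValueError("no <select> with id %r" % select_id)
    labels = [option.get_text(strip=True) for option in select.find_all("option")]
    return labels[1:]  # the first option is the menu text


print(parse_options(SAMPLE_HTML, "mySelectList"))
```

Raising ValueError when the select is missing plays the same role as the AttributeError handler in get_options().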

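Finally, one thing to note. I have moved away from urllib2 in favor of the requests library. I personally think it offers much more functionality and better semantics, but you can use whatever you like.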
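For anyone who wants to go the requests route, here is a rough sketch of the JSON step in Python 3. The helper names are made up, percent-encoding the path segments is an assumption about what the API expects, and the shape of the JSON response is not known from this answer. The function only needs an object with a get() method, which requests.Session provides.

```python
from urllib.parse import quote

PRODUCTS_URL = ("http://json.zandparts.com/api/category/GetCategories/"
                "44/EUR/{model}/{family}/{accessory}/{brand}/null/")


def build_products_url(brand, family, model, accessory):
    """Fill in PRODUCTS_URL, percent-encoding each value (an assumption)."""
    parts = {"model": model, "family": family,
             "accessory": accessory, "brand": brand}
    encoded = {key: quote(value, safe="") for key, value in parts.items()}
    return PRODUCTS_URL.format(**encoded)


def fetch_accessory_info(session, brand, family, model, accessory, timeout=10):
    """Fetch and decode the JSON for one accessory category of one model."""
    url = build_products_url(brand, family, model, accessory)
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # turn HTTP error codes into exceptions
    return response.json()


# Usage (needs the third-party requests package and network access):
#     import requests
#     with requests.Session() as session:
#         info = fetch_accessory_info(session, "Asus", "B Series",
#                                     "BM2220", "Cooling")
```

Passing the session in keeps one connection pool for all the calls and makes the function easy to exercise with a stand-in object.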
