Error when web scraping with Python and BeautifulSoup: extracting content from a github profi

2 answers

User

Answer 1 · edited 2024-05-20 16:06:09

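This happens because looking up an element with BeautifulSoup works like a dict.get() call. When you call find, it searches the element tree for a match; if nothing is found, it returns None instead of raising an exception. None has none of the attributes an element would have, such as text or attrs. So when you access element.text without a try/except or without checking the result first, you are betting that the element will always exist.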
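To make the dict.get() comparison concrete, here is a minimal sketch that uses a plain dict as a stand-in for a scraped row (the `row` contents are made up for illustration, and no BeautifulSoup call is involved). The lookup itself succeeds quietly, and the error only appears one step later, when an attribute is accessed on None.

```python
# Stand-in for a scraped row: the 'desc' key is missing, just as a <p> tag
# can be missing from a repository entry. The data here is illustrative.
row = {"name": "python_notes"}

desc = row.get("desc")  # no error here; the result is simply None
print(desc)

try:
    desc.strip()  # the failure happens here, on None, not on the lookup
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)
```

A chained lookup such as row.find('div').p.text fails the same way: the failing step is the attribute access on None, not the find call.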

I would first store the element that is causing trouble in a temporary variable so that you can check it before using it, or wrap the access in try/except.

Type check

for row in table.find_all('li', attrs = {'itemprop':'owns'}): 
    repo = {}
    repo['name'] = row.find('div').find('h3').a.text


    p = row.find('div').p
    if p is not None:
        repo['desc'] = p.text
    else:
        repo['desc'] = None

    lang = row.find('div', attrs = {'class':'f6 text-gray mt-2'}).find('span', attrs = {'class':'mr-3'})

    if lang is not None:
        # The element was found, so reading .text is safe
        repo['lang'] = lang.text
    else:
        repo['lang'] = None
    repos.append(repo)

Try/except

for row in table.find_all('li', attrs = {'itemprop':'owns'}): 
    repo = {}
    repo['name'] = row.find('div').find('h3').a.text
    # First place where a missing element causes an error
    try:
        repo['desc'] = row.find('div').p.text
    except AttributeError:
        # Accessing .text on None raises AttributeError, not TypeError
        repo['desc'] = None
    # Second place where a missing element causes an error
    try:
        repo['lang'] = row.find('div', attrs = {'class':'f6 text-gray mt-2'}).find('span', attrs = {'class':'mr-3'}).text
    except AttributeError:
        repo['lang'] = None
    repos.append(repo)

I personally prefer try/except because it is more concise, and catching exceptions is good practice for making a program more robust.
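If the same try/except is needed for several fields, it may be worth moving it into a small helper. The sketch below is one possible way to do that; `text_or_none` is a hypothetical name, and `SimpleNamespace` is used only as a stand-in for an element with a `.text` attribute, so the example runs without fetching a page.

```python
from types import SimpleNamespace


def text_or_none(element):
    """Return element.text with surrounding whitespace removed, or None if
    the element is None (or otherwise has no usable .text attribute)."""
    try:
        return element.text.strip()
    except AttributeError:
        return None


# Stand-ins for the result of a find call: one hit and one miss.
found = SimpleNamespace(text="  Python \n")
missing = None

repo = {"lang": text_or_none(found), "desc": text_or_none(missing)}
print(repo)
```

With a helper like this, each field becomes a single line, and the decision about what a missing element means lives in one place.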

User
Answer 2 · edited 2024-05-20 16:06:09

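Your find calls are imprecise and chained, so when you reach a <div> tag that has no p child, you get None, yet you go on to access the .text attribute on None, which crashes your program.
Try the following set of .find calls, which use the itemprop attributes you are looking for and a try-except block to null-coalesce any missing fields: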
import csv

import requests
from bs4 import BeautifulSoup

URL = "https://github.com/DURGESHBARWAL?tab=repositories"
r = requests.get(URL)
soup = BeautifulSoup(r.text, 'html.parser')

repos = []
table = soup.find('ul', attrs = {'data-filterable-for': 'your-repos-filter'})
for row in table.find_all('li', {'itemprop': 'owns'}):
    # Each value is either a Tag or None at this point
    repo = {
        'name': row.find('a', {'itemprop' : 'name codeRepository'}),
        'desc': row.find('p', {'itemprop' : 'description'}),
        'lang': row.find('span', {'itemprop' : 'programmingLanguage'})
    }
    for k, v in repo.items():
        try:
            repo[k] = v.text.strip()
        except AttributeError:
            # v is None, so the field stays None
            pass
    repos.append(repo)

filename = 'extract.csv'
# newline='' is what the csv module recommends for files passed to a writer
with open(filename, 'w', newline='') as f:
    w = csv.DictWriter(f, ['name', 'desc', 'lang'])
    w.writeheader()
    for repo in repos:
        w.writerow(repo)
Debug output (in addition to the CSV that is written):
[   {   'desc': 'This a Django-Python Powered a simple functionality based '
                'Bot application',
        'lang': 'Python',
        'name': 'Sandesh'},
    {'desc': None, 'lang': 'Jupyter Notebook', 'name': 'python_notes'},
    {   'desc': 'Installing DSpace using docker',
        'lang': 'Java',
        'name': 'DSpace-Docker-Installation-1'},
    {   'desc': 'This Repo Contains the DSpace Installation Steps',
        'lang': None,
        'name': 'DSpace-Installation'},
    {   'desc': '(Official) The DSpace digital asset management system that '
                'powers your Institutional Repository',
        'lang': 'Java',
        'name': 'DSpace'},
    {   'desc': 'This Repo contain the DSpace installation steps with '
                'docker.',
        'lang': None,
        'name': 'DSpace-Docker-Installation'},
    {   'desc': 'This Repository contain the Intermediate system for the '
                'Collaboration and DSpace System',
        'lang': 'Python',
        'name': 'Community-OER-Repository'},
    {   'desc': 'A class website to share the knowledge and expanding the '
                'productivity through digital communication.',
        'lang': 'PHP',
        'name': 'class-website'},
    {   'desc': 'This is a POC for the Voting System. It is a precise '
                'design and implementation of Voting System based on the '
                'features of Blockchain which has the potential to '
                'substitute the traditional e-ballet/EVM system for voting '
                'purpose.',
        'lang': 'Python',
        'name': 'Blockchain-Based-Ballot-System'},
    {   'desc': 'It is a short describtion of Modern Django',
        'lang': 'Python',
        'name': 'modern-django'},
    {   'desc': 'It is just for the sample work.',
        'lang': 'HTML',
        'name': 'Task'},
    {   'desc': 'This Repo contain the sorting algorithms in C,predefiend '
                'function of C, C++ and Java',
        'lang': 'C',
        'name': 'Sorting_Algos_Predefined_functions'},
    {   'desc': 'It is a arduino program, for monitor the temperature and '
                'humidity from sensor DHT11.',
        'lang': 'C++',
        'name': 'DHT_11_Arduino'},
    {   'desc': "This is a registration from,which collect data from user's "
                'desktop and put into database after validation.',
        'lang': 'PHP',
        'name': 'Registration_Form'},
    {   'desc': 'It is a dynamic multi-part data driven search engine in '
                'PHP & MySQL from absolutely scratch for the website.',
        'lang': 'PHP',
        'name': 'search_engine'},
    {   'desc': 'It is just for learning github.',
        'lang': None,
        'name': 'Hello_world'}]
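Fields that stay None end up as empty cells in the CSV, because the csv module writes None as an empty string. The sketch below shows this with two shortened sample rows (illustrative data, not the real scrape) and an in-memory buffer in place of extract.csv.

```python
import csv
import io

# Shortened sample rows for illustration; 'desc' is missing in the second one.
repos = [
    {"name": "Sandesh", "desc": "Bot application", "lang": "Python"},
    {"name": "python_notes", "desc": None, "lang": "Jupyter Notebook"},
]

# An in-memory buffer stands in for the output file.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["name", "desc", "lang"])
writer.writeheader()
writer.writerows(repos)

csv_text = buffer.getvalue()
print(csv_text)
```

If an explicit placeholder is preferred over an empty cell, the None values can be replaced in the dict before the rows are written.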
