Error when web scraping with Python BeautifulSoup: extracting content from github profi

Published 2024-05-20 16:06:09


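This is Python code that uses the BeautifulSoup library to scrape web content from a GitHub repository listing. I am getting this error:

AttributeError: 'NoneType' object has no attribute 'text'

in this simple code. The error occurs on the two lines marked with comments.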

import requests 
from bs4 import BeautifulSoup 
import csv 

URL = "https://github.com/DURGESHBARWAL?tab=repositories"
r = requests.get(URL) 

soup = BeautifulSoup(r.text, 'html.parser') 

repos = []
table = soup.find('ul', attrs = {'data-filterable-for':'your-repos-filter'}) 

for row in table.find_all('li', attrs = {'itemprop':'owns'}): 
    repo = {}
    repo['name'] = row.find('div').find('h3').a.text
    # First error position
    repo['desc'] = row.find('div').p.text
    # Second error position
    repo['lang'] = row.find('div', attrs = {'class':'f6 text-gray mt-2'}).find('span', attrs = {'class':'mr-3'}).text
    repos.append(repo) 

filename = 'extract.csv'
with open(filename, 'w') as f: 
    w = csv.DictWriter(f,['name','desc','lang'])
    w.writeheader() 
    for repo in repos: 
        w.writerow(repo)

Output

Traceback (most recent call last):
  File "webscrapping.py", line 16, in <module>
    repo['desc'] = row.find('div').p.text
AttributeError: 'NoneType' object has no attribute 'text'


2 Answers

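This happens because looking up an element with BeautifulSoup behaves like a dict.get() call. When you call find, it fetches the element from the element tree. If it cannot find one, it returns None instead of raising an exception. None does not have the attributes a Tag has, such as text or attrs. So when you access .text without a try/except and without checking the type first, you are betting that the element will always be there.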
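The behaviour can be reproduced offline with a few lines. This is a minimal sketch; the HTML fragment is made up for illustration and is not GitHub's actual markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second <li> has no <p> description.
html = """
<ul>
  <li><div><h3><a>repo-one</a></h3><p>First description</p></div></li>
  <li><div><h3><a>repo-two</a></h3></div></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("li")

first_p = rows[0].find("div").p    # a Tag, because the <p> exists
second_p = rows[1].find("div").p   # None, because this <div> has no <p>

print(first_p.text)
print(second_p is None)

# Accessing .text on None raises the same AttributeError as in the question.
try:
    second_p.text
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)
```

Note that the exception raised is AttributeError, so that is the type a try/except around the lookup has to catch.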

I would probably first store the element that is causing trouble in a temporary variable so that you can check its type. Alternatively, use try/except.

Type checking

for row in table.find_all('li', attrs = {'itemprop':'owns'}): 
    repo = {}
    repo['name'] = row.find('div').find('h3').a.text


    p = row.find('div').p
    if p is not None:
        repo['desc'] = p.text
    else:
        repo['desc'] = None

    lang = row.find('div', attrs = {'class':'f6 text-gray mt-2'}).find('span', attrs = {'class':'mr-3'})

    if lang is not None:
        # The language <span> exists, so reading .text is safe
        repo['lang'] = lang.text
    else:
        repo['lang'] = None
    repos.append(repo)

Try/except

for row in table.find_all('li', attrs = {'itemprop':'owns'}): 
    repo = {}
    repo['name'] = row.find('div').find('h3').a.text
    # First error position
    try:
        repo['desc'] = row.find('div').p.text
    except AttributeError:  # .text on None raises AttributeError, not TypeError
        repo['desc'] = None
    # Second error position
    try:
        repo['lang'] = row.find('div', attrs = {'class':'f6 text-gray mt-2'}).find('span', attrs = {'class':'mr-3'}).text
    except AttributeError:
        repo['lang'] = None
    repos.append(repo)

I personally prefer try/except because it is more concise, and catching exceptions is good practice for making a program more robust.
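Whichever style is preferred, the check can be moved into a small helper so it is not repeated for every field. The sketch below uses the explicit None check; text_or_none is a hypothetical name chosen for illustration, not part of BeautifulSoup, and the sample row is made up.

```python
from bs4 import BeautifulSoup


def text_or_none(tag):
    """Return the stripped text of a Tag, or None if the lookup found nothing."""
    return tag.get_text(strip=True) if tag is not None else None


# Hypothetical row that has a name link but no description paragraph.
row = BeautifulSoup(
    "<li><div><h3><a> demo-repo </a></h3></div></li>", "html.parser"
).li

repo = {
    "name": text_or_none(row.find("a")),
    "desc": text_or_none(row.find("p")),  # find() returns None here
}
print(repo)
```

A missing field then ends up as None in the dictionary instead of stopping the loop.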

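Your find calls are imprecise and chained, so when you look up a <div> tag that has no <p> child you get None, yet you go on to access the .text attribute on None, which crashes your program.

Try the following set of .find calls instead. They use the itemprop attributes of the elements you are looking for, and a try/except block that leaves any missing field as None: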

import requests 
from bs4 import BeautifulSoup 
import csv 

URL = "https://github.com/DURGESHBARWAL?tab=repositories"
r = requests.get(URL) 

soup = BeautifulSoup(r.text, 'html.parser') 

repos = []
table = soup.find('ul', attrs = {'data-filterable-for': 'your-repos-filter'}) 

for row in table.find_all('li', {'itemprop': 'owns'}): 
    repo = {
        'name': row.find('a', {'itemprop' : 'name codeRepository'}),
        'desc': row.find('p', {'itemprop' : 'description'}),
        'lang': row.find('span', {'itemprop' : 'programmingLanguage'})
    }

    for k, v in repo.items():
        try: 
            repo[k] = v.text.strip()
        except AttributeError: pass

    repos.append(repo)

filename = 'extract.csv'
with open(filename, 'w', newline='') as f:  # newline='' avoids blank rows on Windows
    w = csv.DictWriter(f,['name','desc','lang'])
    w.writeheader() 
    for repo in repos: 
        w.writerow(repo)

Debug output (in addition to the CSV that is written):

[   {   'desc': 'This a Django-Python Powered a simple functionality based '
                'Bot application',
        'lang': 'Python',
        'name': 'Sandesh'},
    {'desc': None, 'lang': 'Jupyter Notebook', 'name': 'python_notes'},
    {   'desc': 'Installing DSpace using docker',
        'lang': 'Java',
        'name': 'DSpace-Docker-Installation-1'},
    {   'desc': 'This Repo Contains the DSpace Installation Steps',
        'lang': None,
        'name': 'DSpace-Installation'},
    {   'desc': '(Official) The DSpace digital asset management system that '
                'powers your Institutional Repository',
        'lang': 'Java',
        'name': 'DSpace'},
    {   'desc': 'This Repo contain the DSpace installation steps with '
                'docker.',
        'lang': None,
        'name': 'DSpace-Docker-Installation'},
    {   'desc': 'This Repository contain the Intermediate system for the '
                'Collaboration and DSpace System',
        'lang': 'Python',
        'name': 'Community-OER-Repository'},
    {   'desc': 'A class website to share the knowledge and expanding the '
                'productivity through digital communication.',
        'lang': 'PHP',
        'name': 'class-website'},
    {   'desc': 'This is a POC for the Voting System. It is a precise '
                'design and implementation of Voting System based on the '
                'features of Blockchain which has the potential to '
                'substitute the traditional e-ballet/EVM system for voting '
                'purpose.',
        'lang': 'Python',
        'name': 'Blockchain-Based-Ballot-System'},
    {   'desc': 'It is a short describtion of Modern Django',
        'lang': 'Python',
        'name': 'modern-django'},
    {   'desc': 'It is just for the sample work.',
        'lang': 'HTML',
        'name': 'Task'},
    {   'desc': 'This Repo contain the sorting algorithms in C,predefiend '
                'function of C, C++ and Java',
        'lang': 'C',
        'name': 'Sorting_Algos_Predefined_functions'},
    {   'desc': 'It is a arduino program, for monitor the temperature and '
                'humidity from sensor DHT11.',
        'lang': 'C++',
        'name': 'DHT_11_Arduino'},
    {   'desc': "This is a registration from,which collect data from user's "
                'desktop and put into database after validation.',
        'lang': 'PHP',
        'name': 'Registration_Form'},
    {   'desc': 'It is a dynamic multi-part data driven search engine in '
                'PHP & MySQL from absolutely scratch for the website.',
        'lang': 'PHP',
        'name': 'search_engine'},
    {   'desc': 'It is just for learning github.',
        'lang': None,
        'name': 'Hello_world'}]
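The itemprop-based lookups can also be tried without a network request. The sketch below runs the same selectors against a small hand-written fragment; the markup and repository names are hypothetical, and GitHub's real page structure may differ or change over time.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment using the same attributes as the lookups above.
html = """
<ul data-filterable-for="your-repos-filter">
  <li itemprop="owns">
    <a itemprop="name codeRepository"> repo-a </a>
    <p itemprop="description"> A sample description </p>
    <span itemprop="programmingLanguage">Python</span>
  </li>
  <li itemprop="owns">
    <a itemprop="name codeRepository"> repo-b </a>
  </li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("ul", attrs={"data-filterable-for": "your-repos-filter"})

repos = []
for row in table.find_all("li", {"itemprop": "owns"}):
    fields = {
        "name": row.find("a", {"itemprop": "name codeRepository"}),
        "desc": row.find("p", {"itemprop": "description"}),
        "lang": row.find("span", {"itemprop": "programmingLanguage"}),
    }
    # Keep None for fields that were not found; strip text for the rest.
    repos.append(
        {k: (v.text.strip() if v is not None else None) for k, v in fields.items()}
    )

print(repos)
```

Testing against a fixed fragment like this makes it easier to tell a selector mistake apart from a change in the live page.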
