Python: scraping only the text from multiple web pages

Published 2024-09-23 04:26:34


I'm new to Python and currently working on a web scraping task: scraping the first 5 pages of Inspiron questions from the Dell Community forum. I have code that runs and returns the information I need; however, I can't get just the text. My current code returns text plus HTML. I've tried placing .text in different spots in the code, but doing so only produces errors.

The most common error is: "AttributeError: ResultSet object has no attribute 'text'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?"

Here is my code:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import os, csv
from time import sleep



pages = ['https://www.dell.com/community/Inspiron/bd-p/Inspiron',
        'https://www.dell.com/community/Inspiron/bd-p/Inspiron/page/2',
        'https://www.dell.com/community/Inspiron/bd-p/Inspiron/page/3',
        'https://www.dell.com/community/Inspiron/bd-p/Inspiron/page/4',
        'https://www.dell.com/community/Inspiron/bd-p/Inspiron/page/5'

    ]
import requests
data = []

for page in pages:
    r = requests.get(page)
    soup = BeautifulSoup(r.content, 'html.parser')
    rows = soup.select('tbody tr')

    for row in rows:
        d = dict()
        d['title'] = soup.find_all ('a', attrs = {'class': 'page-link lia-link-navigation lia-custom-event'})
        d['author'] = soup.find_all ('span', attrs = {'class': 'login-bold'})
        d['time'] = soup.find_all ('span', attrs = {'class': 'local-time'})
        d['kudos'] = soup.find_all ('div', attrs = {'class': 'lia-component-messages-column-message-kudos-count'})
        d['messages'] = soup.find_all ('div', attrs = {'class': 'lia-component-messages-column-message-replies-count'})
        d['views'] = soup.find_all ('div', attrs = {'class': 'lia-component-messages-column-topic-views-count'})
        d['solved'] = soup.find_all ('td', attrs = {'aria-label': 'triangletop lia-data-cell-secondary lia-data-cell-icon'})
        d['latest']= soup.find_all ('span', attrs = {'cssclass': 'lia-info-area-item'})
        data.append(d)

    sleep(10)
print(data[0])

Any help is greatly appreciated. Thanks!


2 Answers

find_all returns a list of HTML elements. If you want to print each element's text, you need to loop over each list created with find_all and access the .text attribute on each entry. For example:

titles = soup.find_all ('a', attrs = {'class': 'page-link lia-link-navigation lia-custom-event'})
for title in titles:
    print(title.text)
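
To see more concretely why the original code raised that AttributeError, here is a minimal sketch of find() versus find_all() (the HTML snippet is made up for illustration):

from bs4 import BeautifulSoup

html = '<div><a class="t">First</a><a class="t">Second</a></div>'  # hypothetical snippet
soup = BeautifulSoup(html, 'html.parser')

tag = soup.find('a')             # find() returns a single Tag, so .text works
print(tag.text)                  # First

tags = soup.find_all('a')        # find_all() returns a ResultSet (a list of Tags)
print([t.text for t in tags])    # ['First', 'Second']

# tags.text would raise:
# AttributeError: ResultSet object has no attribute 'text'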

As Joseph mentioned, find_all returns a list of HTML elements; loop over each element in those lists and then access .text on each item.

Below I use a list comprehension to do the looping and apply .text, and use strip() to remove any leading or trailing characters such as \t, \n, and so on.
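
For instance, strip() with no arguments removes surrounding whitespace, including tabs and newlines (the string here is a made-up example of scraped text):

raw = '\n\t  42 Kudos  \t\n'
print(raw.strip())    # 42 Kudos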

Final code:

import requests
from bs4 import BeautifulSoup
from time import sleep


pages = ['https://www.dell.com/community/Inspiron/bd-p/Inspiron',
         'https://www.dell.com/community/Inspiron/bd-p/Inspiron/page/2',
         'https://www.dell.com/community/Inspiron/bd-p/Inspiron/page/3',
         'https://www.dell.com/community/Inspiron/bd-p/Inspiron/page/4',
         'https://www.dell.com/community/Inspiron/bd-p/Inspiron/page/5']

data = []

for page in pages:
    r = requests.get(page)
    soup = BeautifulSoup(r.content, 'html.parser')

    # Build one dict of column lists per page; the original code built the
    # same page-level dict once per table row, appending duplicates.
    d = dict()
    d['title'] = [i.text.strip() for i in soup.find_all('a', attrs={'class': 'page-link lia-link-navigation lia-custom-event'})]
    d['author'] = [i.text.strip() for i in soup.find_all('span', attrs={'class': 'login-bold'})]
    d['time'] = [i.text.strip() for i in soup.find_all('span', attrs={'class': 'local-time'})]
    d['kudos'] = [i.text.strip() for i in soup.find_all('div', attrs={'class': 'lia-component-messages-column-message-kudos-count'})]
    d['messages'] = [i.text.strip() for i in soup.find_all('div', attrs={'class': 'lia-component-messages-column-message-replies-count'})]
    d['views'] = [i.text.strip() for i in soup.find_all('div', attrs={'class': 'lia-component-messages-column-topic-views-count'})]
    d['solved'] = [i.text.strip() for i in soup.find_all('td', attrs={'aria-label': 'triangletop lia-data-cell-secondary lia-data-cell-icon'})]
    d['latest'] = [i.text.strip() for i in soup.find_all('span', attrs={'cssclass': 'lia-info-area-item'})]
    data.append(d)

    sleep(10)  # pause between requests to be polite to the server

print(data[0])
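
As a side note, since the page URLs only differ by a trailing /page/N, the pages list could also be built with a list comprehension instead of being typed out by hand (this produces the same five URLs as above):

base = 'https://www.dell.com/community/Inspiron/bd-p/Inspiron'
pages = [base] + [f'{base}/page/{n}' for n in range(2, 6)]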

Edit: add this to the code to save the dictionaries as a CSV.

import pandas as pd

df = pd.DataFrame.from_dict(data)
print(df.head())    # confirm the data looks correct
df.to_csv('name.csv', index=False)
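
If you prefer not to depend on pandas, a standard-library alternative using csv.DictWriter would look roughly like this (assuming data is the list of dicts built above; note each value is itself a list, which csv will write as its string representation):

import csv

with open('name.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=data[0].keys())
    writer.writeheader()
    writer.writerows(data)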
