How do I parse out unwanted text in BeautifulSoup?

Posted 2024-09-28 20:59:37


I'm trying to scrape the articles and their titles, but there is one part I can't figure out how to parse out.

import re

import requests
from bs4 import BeautifulSoup

url = "http://insideevs.com/"
page = requests.get(url)
data = page.text
soup = BeautifulSoup(data, "lxml")
latest = []
# every div whose class attribute contains "content"
b = soup.find_all('div', class_=re.compile("content"))
for a in b:
    latest.append(a.get_text(strip=True))

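For example, every item in the list ends with a timestamp and a comment count attached to the article, such as "2 weeks ago, 574 comments". Can anyone tell me how to exclude these trailing fragments?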


2 answers

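First use BeautifulSoup to get the collection of h3 elements, which either contain the data items you want or sit next to them. I say "next to" because the abbreviated article text is, in every case, a sibling of the h3.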

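Now, within each h3 item, you can use the select method again to find the a link element inside it and get its text. The summary text you want is a sibling of the h3 that holds the link; however, it is only one of several sibling elements, so I use :nth-of-type(1) to ask for the first one. Almost forgot: ~ p says "give me my sibling p elements", relative to whatever I'm calling from, which happens to be the h3.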
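To make the two selector pieces concrete, here is a minimal sketch on made-up HTML (the markup is an assumption for illustration, not the real insideevs.com structure). It writes the h3 explicitly on the left of the combinator and selects from the whole document.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; the real page may be laid out differently.
html = """
<article><div>
  <h3><a href="/post-1/">First title</a></h3>
  <p>First summary</p>
  <p>2 weeks ago, 574 comments</p>
</div></article>
"""
soup = BeautifulSoup(html, "html.parser")

# "h3 ~ p" matches every <p> that follows an <h3> under the same parent.
all_siblings = [p.get_text() for p in soup.select("h3 ~ p")]

# :nth-of-type(1) keeps only a <p> that is the first <p> of its parent.
first_only = [p.get_text() for p in soup.select("h3 ~ p:nth-of-type(1)")]

print(all_siblings)
print(first_only)
```

Without :nth-of-type(1) the timestamp paragraph comes back as well, which is exactly the fragment the question wants to drop.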

Then we can get the link to the full article by asking for the link's href attribute, where before we asked for the link's text.

I put all of this inside an enumerate so that I can neatly cut the output from the page down to five items.

>>> import requests
>>> import bs4
>>> page = requests.get('http://insideevs.com/').content
>>> soup = bs4.BeautifulSoup(page, 'lxml')
>>> for i, item in enumerate(soup.select('article div h3')):
...     title = item.select('a')[0].text
...     text = item.select('~ p:nth-of-type(1)')
...     url = item.select('a')[0].attrs['href']
...     if i < 5:
...         title
...         text[0].text
...         url
...         
'Plug-In Volvo XC60 T8 Enters U.S. Next Month With 10.4 kWh Battery'
'Volvo latest plug-in hybrid, the\xa0premium mid-sized SUV XC60 T8 Twin Engine, debuted in March at the Geneva Motor Show. The car is based on the company’s\xa0SPA vehicle architecture, first used in the 90 series (XC90 and S90). Production of the XC60 actually began in mid-April at the Torslanda Plant in…'
'http://insideevs.com/plug-in-volvo-xc60-t8-enters-u-s-next-month-with-10-4-kwh-battery/'
'Examining Tesla Model 3 Production Goals – Are Targets Even Feasible?'
'Tesla CEO Elon Musk notes a potential of factory production speed improvement by a factor of 10. Where does this put Model 3 production, and at what point might Tesla achieve this monumentally lofty goal? The real answer may be “never”, that is until Tesla has more than a single…'
'http://insideevs.com/examining-tesla-model-3-production-goals/'
'All-Electric Class 5 Work Truck With 100 Miles Range To Arrive This Fall'
'Chanje is a new company based out of Los Angeles,\xa0California, that intends to introduce an all-electric medium-duty vehicle on a mass scale in the U.S., promising first deliveries in 2017. The company is related to Hong Kong based FDG Electric Vehicles, which together with other partners have reportedly invested nearly…'
'http://insideevs.com/all-electric-class-5-work-truck-with-100-miles-range-to-arrive-this-fall/'
'Volkswagen CEO Admits Tesla Has Abilities It Lacks'
'It seems Volkswagen CEO Herbert Diess isn’t quite sure what to say about Tesla. About a month ago, we shared that Diess (whose personal car is a VW eGolf) believes Volkswagen can stop Tesla. His reasoning behind the statement was simply\xa0VW has abilities that Tesla doesn’t possess. Of course, this…'
'http://insideevs.com/volkswagen-ceo-admits-tesla-ahead/'
'Tesla Model 3 Sighting In New Zealand – Video'
'It’s winter over there, so why not conduct some winter testing? This isn’t the first time we’ve seen a Model 3 in New Zealand and likely won’t be the last. Imagine being in New Zealand and spotting a Model 3 prior to anyone outside of the U.S. That’s brag-worthy for…'
'http://insideevs.com/tesla-model-3-sighting-new-zealand-video/'
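One caveat about the transcript: newer Beautiful Soup releases hand CSS selection to the soupsieve package, which, as far as I know, rejects a selector that starts with a combinator such as ~. If the loop raises a selector syntax error for you, a possible alternative is find_next_sibling. The sketch below runs on made-up HTML (the markup and the example.com URLs are assumptions) so it works without network access.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for the page; the real structure may differ.
html = """
<article><div>
  <h3><a href="http://example.com/post-1/">First title</a></h3>
  <p>First summary</p>
  <p class="details">2 weeks ago, 574 comments</p>
</div></article>
<article><div>
  <h3><a href="http://example.com/post-2/">Second title</a></h3>
  <p>Second summary</p>
  <p class="details">3 weeks ago, 12 comments</p>
</div></article>
"""
soup = BeautifulSoup(html, "html.parser")

results = []
for item in soup.select("article div h3"):
    link = item.find("a")
    summary = item.find_next_sibling("p")  # first <p> sibling after the h3
    if link is None or summary is None:
        continue  # skip headings that do not follow the assumed layout
    results.append((link.get_text(strip=True),
                    summary.get_text(strip=True),
                    link.get("href")))

# Same five-item cap as the transcript, done with a slice.
for row in results[:5]:
    print(row)
```

Because only the first p after each h3 is read, the timestamp and comment-count paragraph never enters the results.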

Use extract to remove the tag you don't want.

Code example:

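for a in b:
    details = a.find('p', {"class": "details"})
    if details is not None:  # not every matched div has a details paragraph
        details.extract()    # detach the time/comment line from the tree
    latest.append(a.get_text(strip=True))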
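To try the extract approach without hitting the site, here is a minimal self-contained sketch. The HTML is made up; only the "details" class name and the "content" pattern come from the snippets in this thread. It also passes a separator to get_text so that the title and the summary do not run together.

```python
import re

from bs4 import BeautifulSoup

# Made-up markup; the second block has no details paragraph on purpose.
html = """
<div class="content"><h3>First title</h3><p>First summary</p>
  <p class="details">2 weeks ago, 574 comments</p></div>
<div class="content"><h3>Second title</h3><p>Second summary</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

latest = []
for block in soup.find_all("div", class_=re.compile("content")):
    details = block.find("p", class_="details")
    if details is not None:
        details.extract()  # remove the tag from the tree before reading text
    # Join the remaining text pieces with a space, stripping each piece.
    latest.append(block.get_text(" ", strip=True))

print(latest)
```

Note that extract changes the parsed tree in place, so run it only after you have read anything else you need from that paragraph.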
