How do I parse only the quotes with BeautifulSoup?

Posted 2024-09-26 17:43:30


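I'm trying to parse quotes from a website, but each element with the class result contains several paragraphs. Is there a way to ignore the date and the author and select only the quoted text, so that I end up with just a list of quotes? I'm using BeautifulSoup, by the way. Thanks.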

<div class="result">
  <p><strong>Date:</strong> February 2, 2019</p>
  <p>"My mind had no choice but to drift into an elaborate fantasy realm."</p>

  <blockquote>
    <p class="attribution">&mdash; Pamela, Paul</p>
  </blockquote>
  <a href="/metaphors/25249" class="load_details">preview</a> |
  <a href="/metaphors/25249" title="Let Children Get Bored Again [from The New York Times]">full record</a>
  <div class="details_container"></div>
</div>
<div class="result">
  <p><strong>Date:</strong> February 2, 2019</p>
  <p>"You let your mind wander and follow it where it goes."</p>
  <blockquote>
    <p class="attribution">&mdash; Pamela, Paul</p>
  </blockquote>
  <a href="/metaphors/25250" class="load_details">preview</a> |
  <a href="/metaphors/25250" title="Let Children Get Bored Again [from The New York Times]">full record</a>

  <div class="details_container"></div>
</div>

My current code is:

import bs4 as bs
import urllib.request

sauce = urllib.request.urlopen('URLHERE').read()
soup = bs.BeautifulSoup(sauce,'lxml')

body = soup.body
for paragraph in body.find_all('p'):
    print(paragraph.text)

Tags: div, date, body, result, details, class, strong, href
2 Answers

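If I understand your question correctly, you want to print only the quotes. In the HTML shown, each result contains three <p> elements (date, quote, attribution), so the quotes are every third <p>, starting from the second: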

quotes = soup.find_all('p')

for i in range(1, len(quotes), 3):
    print(quotes[i].text)

There may be a cleaner way to do this, but this should work.
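One possibly cleaner variant, as a minimal sketch: instead of counting every third paragraph across the whole page, walk each div.result and keep the direct-child <p> that has no <strong> label. The inline sample HTML and the helper name extract_quotes are assumptions for illustration, and html.parser is used only so the sketch does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question, inlined for illustration.
SAMPLE_HTML = """
<div class="result">
  <p><strong>Date:</strong> February 2, 2019</p>
  <p>"My mind had no choice but to drift into an elaborate fantasy realm."</p>
  <blockquote>
    <p class="attribution">&mdash; Pamela, Paul</p>
  </blockquote>
</div>
<div class="result">
  <p><strong>Date:</strong> February 2, 2019</p>
  <p>"You let your mind wander and follow it where it goes."</p>
  <blockquote>
    <p class="attribution">&mdash; Pamela, Paul</p>
  </blockquote>
</div>
"""


def extract_quotes(html):
    """Return the quote text from each div.result (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    quotes = []
    for result in soup.find_all("div", class_="result"):
        # recursive=False skips the attribution <p> nested inside <blockquote>.
        for p in result.find_all("p", recursive=False):
            # The date paragraph is the one carrying a <strong> label.
            if p.find("strong") is None:
                quotes.append(p.get_text(strip=True))
    return quotes


quotes = extract_quotes(SAMPLE_HTML)
print(quotes)
```

Because this filters on structure inside each result rather than on position in the whole page, extra <p> elements elsewhere on the page should not shift the selection.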

You can also query with XPath, for example:

import requests

from lxml import html

page = requests.get('enter_your_url')
tree = html.fromstring(page.content)
data = tree.xpath('//div[@class="result"]//p[2]/text()')

print(data)
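The same positional idea can probably be expressed without leaving BeautifulSoup by using a CSS selector. This is a sketch under the assumption that the quote is always the second <p> directly inside each div.result; the inline sample HTML is illustrative, not the real page.

```python
from bs4 import BeautifulSoup

# Small inline sample modelled on the markup in the question.
sample_html = (
    '<div class="result">'
    "<p><strong>Date:</strong> February 2, 2019</p>"
    '<p>"You let your mind wander and follow it where it goes."</p>'
    '<blockquote><p class="attribution">&mdash; Pamela, Paul</p></blockquote>'
    "</div>"
)

soup = BeautifulSoup(sample_html, "html.parser")

# "div.result > p" restricts the match to direct children, so the attribution
# <p> inside <blockquote> is ignored; :nth-of-type(2) picks the second <p>.
selected = [p.get_text(strip=True) for p in soup.select("div.result > p:nth-of-type(2)")]
print(selected)
```

Like the XPath version, this depends on the order of the paragraphs, so it would need adjusting if the page layout changes.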
