在2个ul标签之间刮取数据

2024-10-04 05:33:46 发布

您现在位置:Python中文网/ 问答频道 /正文

嗨,我正试着在标签之间刮擦。下面我附上一部分的来源,我想刮。如果你仔细看,有3个ul标签。第一个ul标签具有class=“listGroup”。我试图提取第二个“ul”标记的文本,其思想是后面跟着另一个具有类“listGroup”的“ul”标记。请分享我如何做到这一点

<ul class="listGroup" id="ul_e6d09fbd-19fe-49ac-9b47-bd857c0d411b"><li class="acces-listitems"><a href="https://order.store.mayoclinic.com/books/gnweb43?utm_source=MC-DotOrg-PS&amp;utm_medium=Link&amp;utm_campaign=FamilyHealth-Book&amp;utm_content=FHB">Book: Mayo Clinic Family Health Book, 5th Edition</a></li><li class="acces-hide-listitems"><a href="https://order.store.mayoclinic.com/hl/hldiged?utm_source=MC-DotOrg-PS&amp;utm_medium=Link&amp;utm_campaign=HealthLetter-Digital&amp;utm_content=HLDE">Newsletter: Mayo Clinic Health Letter — Digital Edition</a></li></ul>
<ul>
<li>Osteoporosis</li>
<li>Kidney stones</li>
<li>Excessive urination</li>
<li>Abdominal pain</li>
<li>Tiring easily or weakness</li>
<li>Depression or forgetfulness</li>
<li>Bone and joint pain</li>
<li>Frequent complaints of illness with no apparent cause</li>
<li>Nausea, vomiting or loss of appetite</li>
</ul>
<ul>
<li>A noncancerous growth (adenoma) on a gland is the most common cause.</li>
<li>Enlargement (hyperplasia) of two or more parathyroid glands accounts for most other cases.</li>
<li>A cancerous tumor is a very rare cause of primary hyperparathyroidism.</li>
</ul>

我附上我到目前为止所做的简短脚本。请帮忙

import requests
import pandas
from bs4 import BeautifulSoup
for link in ['/diseases-conditions/hyperparathyroidism/symptoms-causes/syc-20356194']:
    page = requests.get(f"https://www.mayoclinic.org{link}")
    soup = BeautifulSoup(page.content, "html.parser")
    for each in soup.find_all("ul"):
        print(each)

Tags: orofhttpsforli标签contentul
3条回答
也许你应该考虑使用正则表达式来捕获。

这似乎是CSS选择器的自然用例,即:

ul.listGroup + ul li将选择类listGroup的每个ul标记后面的第一个ul标记中的所有li标记。将+替换为~将取而代之的是选择所有li标记中的所有li标记(在本例中为2)ul标记,每个标记后面都有类listGroup

要在脚本中实现此答案,请将find_all替换为select,并使用相关CSS选择器更新选择器

import requests
import pandas
from bs4 import BeautifulSoup
for link in ['/diseases-conditions/hyperparathyroidism/symptoms-causes/syc-20356194']:
    page = requests.get(f"https://www.mayoclinic.org{link}")
    soup = BeautifulSoup(page.content, "html.parser")
    for each in soup.select("ul.listGroup + ul li"):
        print(each.text)

您可以使用CSS选择器ul.listGroup + ul li->;这将选择类为"listGroup"<ul>标签旁边的所有<li>标签:

txt = '''<ul class="listGroup" id="ul_e6d09fbd-19fe-49ac-9b47-bd857c0d411b"><li class="acces-listitems"><a href="https://order.store.mayoclinic.com/books/gnweb43?utm_source=MC-DotOrg-PS&amp;utm_medium=Link&amp;utm_campaign=FamilyHealth-Book&amp;utm_content=FHB">Book: Mayo Clinic Family Health Book, 5th Edition</a></li><li class="acces-hide-listitems"><a href="https://order.store.mayoclinic.com/hl/hldiged?utm_source=MC-DotOrg-PS&amp;utm_medium=Link&amp;utm_campaign=HealthLetter-Digital&amp;utm_content=HLDE">Newsletter: Mayo Clinic Health Letter — Digital Edition</a></li></ul>

<ul>
<li>Osteoporosis</li>
<li>Kidney stones</li>
<li>Excessive urination</li>
<li>Abdominal pain</li>
<li>Tiring easily or weakness</li>
<li>Depression or forgetfulness</li>
<li>Bone and joint pain</li>
<li>Frequent complaints of illness with no apparent cause</li>
<li>Nausea, vomiting or loss of appetite</li>
</ul>
<ul>
<li>A noncancerous growth (adenoma) on a gland is the most common cause.</li>
<li>Enlargement (hyperplasia) of two or more parathyroid glands accounts for most other cases.</li>
<li>A cancerous tumor is a very rare cause of primary hyperparathyroidism.</li>
</ul>'''

soup = BeautifulSoup(txt, 'html.parser')

for li in soup.select('ul.listGroup + ul li'):
    print(li.text)

印刷品:

Osteoporosis
Kidney stones
Excessive urination
Abdominal pain
Tiring easily or weakness
Depression or forgetfulness
Bone and joint pain
Frequent complaints of illness with no apparent cause
Nausea, vomiting or loss of appetite

相关问题 更多 >