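Hi, I'm trying to scrape the text between tags. Below I've attached part of the source I want to scrape. If you look closely, there are 3 ul tags. The first ul tag has class="listGroup". I'm trying to extract the text of the second ul tag, the idea being that it comes right after another ul tag with the class "listGroup". Please share how I can do this.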
<ul class="listGroup" id="ul_e6d09fbd-19fe-49ac-9b47-bd857c0d411b"><li class="acces-listitems"><a href="https://order.store.mayoclinic.com/books/gnweb43?utm_source=MC-DotOrg-PS&utm_medium=Link&utm_campaign=FamilyHealth-Book&utm_content=FHB">Book: Mayo Clinic Family Health Book, 5th Edition</a></li><li class="acces-hide-listitems"><a href="https://order.store.mayoclinic.com/hl/hldiged?utm_source=MC-DotOrg-PS&utm_medium=Link&utm_campaign=HealthLetter-Digital&utm_content=HLDE">Newsletter: Mayo Clinic Health Letter — Digital Edition</a></li></ul>
<ul>
<li>Osteoporosis</li>
<li>Kidney stones</li>
<li>Excessive urination</li>
<li>Abdominal pain</li>
<li>Tiring easily or weakness</li>
<li>Depression or forgetfulness</li>
<li>Bone and joint pain</li>
<li>Frequent complaints of illness with no apparent cause</li>
<li>Nausea, vomiting or loss of appetite</li>
</ul>
<ul>
<li>A noncancerous growth (adenoma) on a gland is the most common cause.</li>
<li>Enlargement (hyperplasia) of two or more parathyroid glands accounts for most other cases.</li>
<li>A cancerous tumor is a very rare cause of primary hyperparathyroidism.</li>
</ul>
I'm attaching the short script I have written so far. Please help.
import requests
import pandas
from bs4 import BeautifulSoup

for link in ['/diseases-conditions/hyperparathyroidism/symptoms-causes/syc-20356194']:
    page = requests.get(f"https://www.mayoclinic.org{link}")
    soup = BeautifulSoup(page.content, "html.parser")
    # Prints every ul on the page, not just the one after ul.listGroup
    for each in soup.find_all("ul"):
        print(each)
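
This seems like a natural use case for CSS selectors, namely:
ul.listGroup + ul li
This selects all li tags in the first ul tag that immediately follows each ul tag with class listGroup. Replacing + with ~ instead selects the li tags in every ul tag (2 in this case) that follows a ul tag with class listGroup.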
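
The difference between the two combinators can be checked offline. The sketch below uses a trimmed-down stand-in for the markup in the question (the shortened li texts and the wrapping div are assumptions made for illustration, not the real page) and runs both selectors through BeautifulSoup's select:

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the markup in the question (assumed structure).
html = """
<div>
<ul class="listGroup"><li>Book</li><li>Newsletter</li></ul>
<ul><li>Osteoporosis</li><li>Kidney stones</li></ul>
<ul><li>Adenoma</li><li>Hyperplasia</li></ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# "+" (adjacent sibling): only the ul directly after ul.listGroup
adjacent = [li.get_text(strip=True) for li in soup.select("ul.listGroup + ul li")]

# "~" (general sibling): every later ul that shares the same parent
general = [li.get_text(strip=True) for li in soup.select("ul.listGroup ~ ul li")]

print(adjacent)  # ['Osteoporosis', 'Kidney stones']
print(general)   # ['Osteoporosis', 'Kidney stones', 'Adenoma', 'Hyperplasia']
```

Both combinators only look at siblings, so they match only when the ul tags share a parent with ul.listGroup, as they appear to in the snippet above.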
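To implement this answer in your script, replace find_all with select and pass it the relevant CSS selector.

You can use the CSS selector
ul.listGroup + ul li
which selects all <li> tags in the <ul> tag right after the <ul> tag with class "listGroup".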
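
As a rough sketch of how the selector could slot into the script, the helper below takes the page HTML and returns the li texts. The function name items_after_list_group and the sample markup are hypothetical, and the find_next fallback is an addition of this sketch rather than part of either answer: it covers the case where some other element sits between the two ul tags, which the + combinator would not match.

```python
from bs4 import BeautifulSoup


def items_after_list_group(html):
    """Return the text of each li in the ul that follows ul.listGroup."""
    soup = BeautifulSoup(html, "html.parser")
    # select() takes a CSS selector where find_all() takes a tag name
    items = soup.select("ul.listGroup + ul li")
    if not items:
        # Fallback (an assumption of this sketch): "+" only matches an
        # immediate sibling, so walk forward in document order instead.
        group = soup.select_one("ul.listGroup")
        next_ul = group.find_next("ul") if group else None
        items = next_ul.find_all("li") if next_ul else []
    return [li.get_text(strip=True) for li in items]


# Hypothetical sample markup standing in for the downloaded page
sample = (
    '<ul class="listGroup"><li>Book</li></ul>'
    "<ul><li>Osteoporosis</li><li>Kidney stones</li></ul>"
    "<ul><li>Adenoma</li></ul>"
)
print(items_after_list_group(sample))  # ['Osteoporosis', 'Kidney stones']
```

In the original loop, this would be called as items_after_list_group(page.content) in place of the find_all("ul") loop.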