How do I extract the correct tags with BeautifulSoup in Python?

import requests
from bs4 import BeautifulSoup
import pandas as pd

PageParagraphs = []
url = 'https://www.aacr.org/patients-caregivers/cancer/breast-cancer/'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
section = soup.find_all('div', class_='first-section section clearfix')
for item in section:
    paragraphs = soup.find_all('p')
    print(paragraphs)

2 Answers

User

Answer 1 · edited 2024-09-27 23:22:42

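If you want only the paragraphs that have content, you can use the following with bs4 4.7.1+. I use :not to exclude the empty paragraphs and the right-hand box paragraphs. I assumed you did not want the Source line at the end. If you do want the Source paragraph, remove , :has(span) from the selector.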

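from bs4 import BeautifulSoup as bs  # r is the requests.get(url) response from the question

soup = bs(r.content, 'lxml')  # requires the lxml parser to be installed
print('\n'.join([i.text for i in soup.select(".content p:not(:empty, :has(span))")]))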
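To see what this selector filters out without fetching the page, here is a minimal sketch on a made-up HTML fragment. The markup is an assumption for illustration only and is not the real AACR page; it uses the built-in html.parser so lxml is not needed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a content block with an empty paragraph and a
# trailing "Source" paragraph whose text is wrapped in a <span>.
html = """
<div class="content">
  <p>First paragraph.</p>
  <p></p>
  <p>Second paragraph.</p>
  <p><span>Source: Example</span></p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# :empty drops <p></p>; :has(span) drops paragraphs that contain a <span>.
kept = [p.text for p in soup.select(".content p:not(:empty, :has(span))")]

# Without ", :has(span)" the Source paragraph is kept as well.
without_has = [p.text for p in soup.select(".content p:not(:empty)")]

print(kept)
print(without_has)
```

On this fragment the first list holds the two content paragraphs, and the second list additionally holds the Source line.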

User

Answer 2 · edited 2024-09-27 23:22:42

To get only the first <p> tag using a CSS selector, do the following:

import requests
from bs4 import BeautifulSoup

url = "https://www.aacr.org/patients-caregivers/cancer/breast-cancer/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

print(soup.select_one('.first-section p').text)

Or, keeping your original code, use find() instead of find_all() to get only the first <p> tag:

import requests
from bs4 import BeautifulSoup

url = "https://www.aacr.org/patients-caregivers/cancer/breast-cancer/"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
section = soup.find_all("div", class_="first-section section clearfix")[0]
print(section.find("p").text)
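Note that find() returns None when nothing matches, and indexing [0] on an empty find_all() result raises IndexError, so the snippet above fails if the page layout changes. A defensive variant might look like the sketch below; the helper name first_paragraph_text and the sample markup are hypothetical, not part of the original answer.

```python
from bs4 import BeautifulSoup

def first_paragraph_text(html):
    """Return the first <p> text inside .first-section, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    # Matching on the single class "first-section" does not depend on
    # the order or number of the other classes on the <div>.
    section = soup.find("div", class_="first-section")
    if section is None:
        return None
    p = section.find("p")
    return p.get_text(strip=True) if p is not None else None

# Hypothetical sample markup for illustration.
sample = '<div class="first-section section clearfix"><p> Hello </p><p>World</p></div>'
print(first_paragraph_text(sample))
print(first_paragraph_text("<div><p>No section here</p></div>"))
```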

Edit: to get all the paragraph tags:

import requests
from bs4 import BeautifulSoup

url = "https://www.aacr.org/patients-caregivers/cancer/breast-cancer/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

for tag in soup.select('.first-section p'):
    print(tag.text)

Output:

There are a number of different types of breast cancer. The most common form of breast cancer is ductal carcinoma, which begins in the cells of the ducts. Cancer that begins in the lobes or lobules is called lobular carcinoma and is more often found in both breasts than are other types of breast cancer. Inflammatory breast cancer is an uncommon type of breast cancer in which the breast is warm, red, and swollen.
Hereditary breast cancer makes up from 5 percent to 10 percent of all breast cancer diagnoses. Women who have certain gene mutations, such as a BRCA1 or BRCA2 mutation, have an increased risk of developing breast cancer and are also at increased risk of ovarian cancer. Other risk factors include estrogen (made in the body), dense breast tissue, age at menstruation and first birth, taking hormones for symptoms of menopause, obesity, and not getting enough exercise.
The National Cancer Institute’s Surveillance, Epidemiology, and End Results (SEER) Program estimates that in 2020 276,480 women in the United States will be diagnosed with breast cancer and 42,170 will die of the disease. From 2010 to 2016, the five-year survival rate for those diagnosed with breast cancer was 90.0 percent.  
Men can also develop breast cancer, making up slightly less than 1 percent of those diagnosed each year. Radiation exposure, high levels of estrogen, and a family history of breast cancer can increase a man’s risk of the disease.

Source: National Cancer Institute

Edit 2: without CSS selectors:

for tag in soup.find_all(class_='first-section section clearfix'):
    for p in tag.find_all('p'):
        print(p.text)
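The important difference from the code in the question is that the inner search is called on the matched section (tag.find_all) rather than on the whole document (soup.find_all). A minimal sketch of that difference, on hypothetical markup that is not taken from the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one target section plus an unrelated paragraph.
html = """
<div class="first-section section clearfix">
  <p>Inside A</p>
  <p>Inside B</p>
</div>
<div class="sidebar"><p>Outside</p></div>
"""

soup = BeautifulSoup(html, "html.parser")

for section in soup.find_all(class_="first-section section clearfix"):
    # Calling find_all on soup ignores the loop variable and
    # searches the entire document.
    whole_page = [p.text for p in soup.find_all("p")]
    # Calling find_all on the section searches only its descendants.
    scoped = [p.text for p in section.find_all("p")]

print(whole_page)
print(scoped)
```

Here whole_page also contains the sidebar paragraph, while scoped contains only the two paragraphs inside the section.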
