How do I extract the correct tags with BeautifulSoup in Python?

Posted 2024-09-27 23:22:42


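I am scraping this site: https://www.aacr.org/patients-caregivers/cancer/breast-cancer/

I only want the paragraph text on this page (starting from "There are a number of different types of breast cancer..." and so on).

My understanding is that once you have selected a div by its class, any further searching should "drill down" and return only content from inside that section/class.

My code pulls paragraph text from other areas as well. When I run it, I get the output I want, but I also get paragraphs from a different class/section of the HTML (<p class='desc'>) that is not inside the section I specified (class='first-section clearfix').

How can I get only the output I want?

Here is the code: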

import requests
from bs4 import BeautifulSoup
import pandas as pd


PageParagraphs = []
url='https://www.aacr.org/patients-caregivers/cancer/breast-cancer/'
r=requests.get(url)
soup=BeautifulSoup(r.content,'html.parser')
section=soup.find_all('div',class_='first-section section clearfix')
for item in section:
      paragraphs=soup.find_all('p')
      print(paragraphs)

2 Answers

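If you want only the paragraphs that have content, you can use the following with bs4 4.7.1+. I use :not to exclude the empty paragraphs and the paragraphs in the right-hand box. I assumed you did not want the source line at the end. If you do want the source paragraph, remove `, :has(span)` from the selector.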

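from bs4 import BeautifulSoup as bs  # `r` is the response object from the question's code

soup = bs(r.content, 'lxml')  # the 'lxml' parser requires the lxml package to be installed
print('\n'.join([i.text for i in soup.select(".content p:not(:empty, :has(span))")]))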
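
To see what the :not(:empty, :has(span)) filter does without fetching the page, here is a small sketch on made-up markup. The HTML string below is hypothetical and only loosely imitates the real page; it uses html.parser so that lxml is not needed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page (an assumption for illustration)
html = """
<div class="content">
  <p>First paragraph.</p>
  <p></p>
  <p><span>Source: example</span></p>
  <p>Second paragraph.</p>
</div>
<div class="sidebar"><p class="desc">Sidebar text.</p></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep <p> tags under .content that are neither empty nor contain a <span>
kept = [p.get_text() for p in soup.select(".content p:not(:empty, :has(span))")]
print(kept)
```

The empty paragraph and the one wrapping a <span> are dropped, and the sidebar paragraph never matches because it is not under .content.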

To get only the first <p> tag using a CSS selector:

import requests
from bs4 import BeautifulSoup

url = "https://www.aacr.org/patients-caregivers/cancer/breast-cancer/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

print(soup.select_one('.first-section p').text)

Or, keeping your code above, use find() instead of find_all() to get only the first <p> tag:

import requests
from bs4 import BeautifulSoup

url = "https://www.aacr.org/patients-caregivers/cancer/breast-cancer/"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
section = soup.find_all("div", class_="first-section section clearfix")[0]
print(section.find("p").text)
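
One thing to keep in mind with class_="first-section section clearfix": when the string contains several class names, BeautifulSoup compares it against the exact value of the class attribute, so the order of the names matters. A CSS selector that chains the classes does not care about order. A minimal sketch on a hypothetical one-line document:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: same three classes as on the page, but in a different order
html = '<div class="section clearfix first-section"><p>Hello</p></div>'
soup = BeautifulSoup(html, "html.parser")

# Exact-string match on the class attribute: order-sensitive, so nothing is found
exact = soup.find_all("div", class_="first-section section clearfix")

# Chained class selector: matches regardless of the order in the attribute
css = soup.select("div.first-section.section.clearfix")

print(len(exact), len(css))
```

This is one reason the selector-based versions in this answer can be a bit more robust if the site reorders its class names.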

Edit: to get all of the paragraph tags:

import requests
from bs4 import BeautifulSoup

url = "https://www.aacr.org/patients-caregivers/cancer/breast-cancer/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

for tag in soup.select('.first-section p'):
    print(tag.text)

Output:

There are a number of different types of breast cancer. The most common form of breast cancer is ductal carcinoma, which begins in the cells of the ducts. Cancer that begins in the lobes or lobules is called lobular carcinoma and is more often found in both breasts than are other types of breast cancer. Inflammatory breast cancer is an uncommon type of breast cancer in which the breast is warm, red, and swollen.
Hereditary breast cancer makes up from 5 percent to 10 percent of all breast cancer diagnoses. Women who have certain gene mutations, such as a BRCA1 or BRCA2 mutation, have an increased risk of developing breast cancer and are also at increased risk of ovarian cancer. Other risk factors include estrogen (made in the body), dense breast tissue, age at menstruation and first birth, taking hormones for symptoms of menopause, obesity, and not getting enough exercise.
The National Cancer Institute’s Surveillance, Epidemiology, and End Results (SEER) Program estimates that in 2020 276,480 women in the United States will be diagnosed with breast cancer and 42,170 will die of the disease. From 2010 to 2016, the five-year survival rate for those diagnosed with breast cancer was 90.0 percent.  
Men can also develop breast cancer, making up slightly less than 1 percent of those diagnosed each year. Radiation exposure, high levels of estrogen, and a family history of breast cancer can increase a man’s risk of the disease.

Source: National Cancer Institute

Edit 2: without CSS selectors:

for tag in soup.find_all(class_='first-section section clearfix'):
    for p in tag.find_all('p'):
        print(p.text)
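
As for why the loop in the question also printed the <p class='desc'> paragraphs: inside the loop it calls soup.find_all('p'), which searches the whole document again, instead of item.find_all('p'), which searches only inside the matched div. The sketch below shows the difference on a hypothetical snippet of HTML (not the real page):

```python
from bs4 import BeautifulSoup

# Hypothetical markup with one target section and one unrelated paragraph
html = """
<div class="first-section section clearfix">
  <p>Inside one.</p>
  <p>Inside two.</p>
</div>
<div class="other"><p class="desc">Outside.</p></div>
"""

soup = BeautifulSoup(html, "html.parser")
section = soup.find("div", class_="first-section section clearfix")

# Searching from the soup object ignores the section and scans everything
whole_document = [p.get_text() for p in soup.find_all("p")]

# Searching from the section tag only looks at its descendants
section_only = [p.get_text() for p in section.find_all("p")]

print(whole_document)
print(section_only)
```

Calling find_all() on the tag you already found is what gives the "drill down" behavior described in the question.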
