How to use Python to extract web page text from a paragraph element under a specific heading in a div

Posted 2024-09-24 22:30:50


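The title says it all. Taking the IL-6 gene as an example, I am trying to extract the paragraph text from the area below "GeneCards Summary for name_of_gene" at https://www.genecards.org/cgi-bin/carddisp.pl?gene=IL6&keywords=il6. The text I want is: "IL6 (Interleukin 6) is a Protein Coding gene. Diseases associated with IL6 include Kaposi Sarcoma and Rheumatoid Arthritis, Systemic Juvenile. Among its related pathways are IL-1 Family Signaling Pathways and Immune response IFN alpha/beta signaling pathway. Gene Ontology (GO) annotations related to this gene include signaling receptor binding and growth factor activity."

I have been trying to use beautifulsoup4 with Python. My problem is that I don't know how to specify which text I want to pull from the site.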

from bs4 import BeautifulSoup

from urllib.request import Request, urlopen

baseURL = "https://www.genecards.org/cgi-bin/carddisp.pl?gene="
GeneToSearch = input("Gene of Interest: ")
updatedURL = baseURL + GeneToSearch
print(updatedURL)

req = Request(updatedURL, headers={'User-Agent': 'Mozilla/5.0'})
response = urlopen(req).read()

soup = BeautifulSoup(response, 'lxml')

for tag in soup.find_all(['script', 'style']):
   tag.decompose()
soup.get_text(strip=True)
VALID_TAGS = ['div', 'p']

for tag in soup.findAll('GeneCards Summary for '+ GeneToSearch +    'Gene'):
    if tag.name not in VALID_TAGS:
        tag.replaceWith(tag.renderContents())

print(soup.text)

This gives me the text of every element on the site.


2 Answers

Try navigating between the tags, something like:

soup.select('.gc-subsection-header')[1].next_sibling.next_sibling.text
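
The doubled `.next_sibling` in the line above is likely needed because the first sibling of the header is the whitespace text node between the tags, not the paragraph itself. A minimal offline sketch of that behaviour follows; the markup in `SAMPLE_HTML` is a hypothetical stand-in, not the real GeneCards page, and `html.parser` is used only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two headers, each followed by a paragraph.
SAMPLE_HTML = """<div>
<div class="gc-subsection-header"><h3>Other Summary</h3></div>
<p>Not this one.</p>
<div class="gc-subsection-header"><h3>GeneCards Summary for IL6 Gene</h3></div>
<p>IL6 (Interleukin 6) is a Protein Coding gene.</p>
</div>"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
header = soup.select(".gc-subsection-header")[1]

# The first sibling is the newline between the tags,
# so a second .next_sibling is needed to reach the <p>.
first_sibling = header.next_sibling
second_sibling = first_sibling.next_sibling

# find_next_sibling("p") skips text nodes, so it does not
# depend on how much whitespace separates the tags.
paragraph = header.find_next_sibling("p")
sibling_text = paragraph.get_text(strip=True)

print(repr(first_sibling))
print(sibling_text)
```

Using `find_next_sibling("p")` instead of chained `.next_sibling` calls may be more robust if the page's whitespace changes, though the index `[1]` still assumes the wanted header is the second one on the page.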

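Reference: Beautiful Soup

With a recent version of BeautifulSoup, you can use the CSS pseudo-selector `:contains` to search for a tag that contains specific text. You can then navigate to the next p tag and extract its text. (In newer releases of the underlying soupsieve library, this selector is spelled `:-soup-contains` and the plain `:contains` form is deprecated.)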

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

baseURL = "https://www.genecards.org/cgi-bin/carddisp.pl?gene="
GeneToSearch = input("Gene of Interest: ")
updatedURL = baseURL + GeneToSearch
print(updatedURL)

req = Request(updatedURL, headers={'User-Agent': 'Mozilla/5.0'})
response = urlopen(req).read()

soup = BeautifulSoup(response, 'lxml')

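# Heading text to look for, e.g. "GeneCards Summary for IL6 Gene"
text_find = 'GeneCards Summary for ' + GeneToSearch + ' Gene'

# Find the <h3> whose text contains the heading
el = soup.select_one('h3:contains("' + text_find + '")')
# From the heading's parent, take the next <p> in the document
summary = el.parent.find_next('p').text.strip()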

print(summary)

Output:

IL6 (Interleukin 6) is a Protein Coding gene.
Diseases associated with IL6 include Kaposi Sarcoma and Rheumatoid Arthritis, Systemic Juvenile.
Among its related pathways are IL-1 Family Signaling Pathways and Immune response IFN alpha/beta signaling pathway.
Gene Ontology (GO) annotations related to this gene include signaling receptor binding and growth factor activity.
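
The same selector approach can be tried without a network request. The sketch below is only an illustration under stated assumptions: `SAMPLE_HTML` is hypothetical markup rather than the real page, `extract_summary` is a helper name introduced here, and the `:-soup-contains` spelling assumes a soupsieve release that supports it. It also guards against `select_one` returning `None`, which happens when no heading matches the gene name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the downloaded page.
SAMPLE_HTML = """
<section>
  <div class="gc-subsection-header"><h3>GeneCards Summary for IL6 Gene</h3></div>
  <p> IL6 (Interleukin 6) is a Protein Coding gene. </p>
</section>
"""


def extract_summary(html, gene):
    """Return the paragraph after the summary heading, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    selector = f'h3:-soup-contains("GeneCards Summary for {gene} Gene")'
    heading = soup.select_one(selector)
    if heading is None:
        # No matching heading, e.g. a misspelled or differently cased gene name.
        return None
    paragraph = heading.parent.find_next("p")
    return paragraph.get_text(strip=True) if paragraph else None


found = extract_summary(SAMPLE_HTML, "IL6")
missing = extract_summary(SAMPLE_HTML, "TP53")
print(found)
print(missing)
```

Returning `None` rather than raising keeps the caller in control of what to do when a gene page lacks the expected heading.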
