How to use Python to extract web page text from the paragraph element under a specific heading in a div

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

baseURL = "https://www.genecards.org/cgi-bin/carddisp.pl?gene="
GeneToSearch = input("Gene of Interest: ")
updatedURL = baseURL + GeneToSearch
print(updatedURL)

req = Request(updatedURL, headers={'User-Agent': 'Mozilla/5.0'})
response = urlopen(req).read()
soup = BeautifulSoup(response, 'lxml')

for tag in soup.find_all(['script', 'style']):
    tag.decompose()
soup.get_text(strip=True)

VALID_TAGS = ['div', 'p']
for tag in soup.findAll('GeneCards Summary for '+ GeneToSearch + 'Gene'):
    if tag.name not in VALID_TAGS:
        tag.replaceWith(tag.renderContents())

print(soup.text)

2 Answers

User

#1 · edited 2024-09-24 22:30:50

Try navigating between the tags, for example:

soup.select('.gc-subsection-header')[1].next_sibling.next_sibling.text

Reference: Beautiful Soup
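To see why the one-liner above hops over two siblings, here is a minimal offline sketch. The HTML snippet is hypothetical (only the `gc-subsection-header` class name comes from the answer; the real GeneCards markup may differ), and it uses the built-in `html.parser` so that no extra parser is needed. It also shows `find_next_sibling`, which skips text nodes and may be less brittle than counting hops.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a header div followed by a paragraph.
# The real GeneCards page structure may differ.
html = """
<div class="gc-subsection">
  <div class="gc-subsection-header"><h3>Entrez Gene Summary for IL6 Gene</h3></div>
  <p>First summary text.</p>
</div>
<div class="gc-subsection">
  <div class="gc-subsection-header"><h3>GeneCards Summary for IL6 Gene</h3></div>
  <p>Second summary text.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
header = soup.select(".gc-subsection-header")[1]

# The first sibling is the whitespace text node between the two tags,
# so a second hop is needed to reach the <p> element.
first_hop = header.next_sibling
second_hop = first_hop.next_sibling

# find_next_sibling ignores text nodes and returns the next matching tag.
paragraph = header.find_next_sibling("p")

print(repr(first_hop))
print(second_hop.get_text(strip=True))
print(paragraph.get_text(strip=True))
```

Note that indexing with `[1]` depends on the order of sections on the page, so it can break if the site adds or reorders sections.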

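User

#2 · edited 2024-09-24 22:30:50

With a recent version of BeautifulSoup, you can use the CSS pseudo-selector `:contains` to find a tag that contains specific text (soupsieve 2.1 and later spell it `:-soup-contains` and deprecate the old name). From there you can navigate to the next p tag and extract its text: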

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

baseURL = "https://www.genecards.org/cgi-bin/carddisp.pl?gene="
GeneToSearch = input("Gene of Interest: ")
updatedURL = baseURL + GeneToSearch
print(updatedURL)

req = Request(updatedURL, headers={'User-Agent': 'Mozilla/5.0'})
response = urlopen(req).read()

soup = BeautifulSoup(response, 'lxml')

text_find = 'GeneCards Summary for ' + GeneToSearch + ' Gene'

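# ':-soup-contains' is the current spelling; older soupsieve versions only accept ':contains'
el = soup.select_one('h3:-soup-contains("' + text_find + '")')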
summary = el.parent.find_next('p').text.strip()

print(summary)

Output:

IL6 (Interleukin 6) is a Protein Coding gene.
Diseases associated with IL6 include Kaposi Sarcoma and Rheumatoid Arthritis, Systemic Juvenile.
Among its related pathways are IL-1 Family Signaling Pathways and Immune response IFN alpha/beta signaling pathway.
Gene Ontology (GO) annotations related to this gene include signaling receptor binding and growth factor activity.
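If the heading is missing (for example, an unknown gene symbol), `select_one` returns `None` and the `.parent` access above raises an `AttributeError`. The sketch below is one way to guard against that. It swaps the pseudo-selector for a plain loop over `h3` tags so it does not depend on the soupsieve version. The function name `extract_summary` and the sample HTML are hypothetical and only loosely mirror the page; `html.parser` is used so the example runs offline without lxml.

```python
from bs4 import BeautifulSoup


def extract_summary(html, gene):
    """Return the text of the first <p> after the summary heading, or None."""
    soup = BeautifulSoup(html, "html.parser")
    heading_text = "GeneCards Summary for " + gene + " Gene"

    # Find the first <h3> whose text contains the heading we want.
    heading = None
    for h3 in soup.find_all("h3"):
        if heading_text in h3.get_text(" ", strip=True):
            heading = h3
            break
    if heading is None:
        return None

    # find_next walks forward in document order to the next <p>.
    paragraph = heading.find_next("p")
    return paragraph.get_text(strip=True) if paragraph else None


# Hypothetical sample markup; the live page may be structured differently.
sample_html = """
<div class="gc-subsection">
  <div class="gc-subsection-header"><h3>GeneCards Summary for IL6 Gene</h3></div>
  <p>IL6 (Interleukin 6) is a Protein Coding gene.</p>
</div>
"""

print(extract_summary(sample_html, "IL6"))
print(extract_summary(sample_html, "TP53"))
```

Returning `None` instead of raising lets the caller decide whether to report a missing summary or retry with a different gene symbol.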
