BeautifulSoup: extracting complete text sentences that contain arbitrary tags

Posted 2024-09-28 22:01:57

I am trying to use Beautiful Soup to extract the title text from markup like the following:

<div class="title"><a ref="ordinalpos=2&amp;ncbi_uid=5514220&amp;link_uid=5514220&amp;linksrc=docsum_title" href="/pmc/articles/PMC5514220/" class="view"><b>Sirtuins</b>, a promising target in slowing down the ageing process</a></div>

Currently, I am using a regular expression to match the tags and replace them with empty strings:

import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://www.ncbi.nlm.nih.gov/pmc/?term=sirtuins"
soup = BeautifulSoup(urlopen(url), "html.parser")
new_content = []
# Strip <b>/<em> tags, any non-word character, and trailing "PDF..." text.
pattern = re.compile(r'(<b>)|(<\/b>)|(<em>)|(<\/em>)|[^\w ]|(PDF.*)')
for view in soup.find_all('a', attrs={'class': 'view'}):
    # Each child node (tag or text) is handled separately here.
    for item in view.contents:
        new_content.append(pattern.sub('', str(item)))

However, the end result I am left with is a list of broken fragments from multiple titles:

['The Role of ',
 'Sirtuins',
 ' in Antioxidant and Redox Signaling',
 '',
 'Sirtuins',
 ' a promising target in slowing down the ageing process',
 '',
 'The NAD',
 'supsup',
 'Dependent Family of ',
 'Sirtuins',
 ' in Cerebral Ischemia and Preconditioning',
 '',

Is there any way to extract, join, or clean up this text so that I am left with whole sentences without these tags?

Thanks :)

Edit: expected output:

['The Role of Sirtuins in Antioxidant and Redox Signaling',
 'Sirtuins a promising target in slowing down the ageing process',
 'The NAD Dependent Family of Sirtuins in Cerebral Ischemia and Preconditioning']

1 Answer
User
#1 · Posted 2024-09-28 22:01:57

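Can't you just grab the text inside the div tags with class=title? The nice thing about BeautifulSoup is that it understands how tags and markup work, so you don't need to filter out the markup tags to get the text.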
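
As a small offline illustration of that point, the sketch below parses the single title snippet from the question (pasted in as a string rather than fetched from the site, with the long `ref` attribute left out) and compares iterating over `.contents` with calling `get_text()`. The `html.parser` backend is an assumption; any installed parser should behave the same way on this snippet.

```python
from bs4 import BeautifulSoup

# Title markup from the question; the long "ref" attribute is omitted here.
html = (
    '<div class="title"><a href="/pmc/articles/PMC5514220/" class="view">'
    '<b>Sirtuins</b>, a promising target in slowing down the ageing process'
    '</a></div>'
)

soup = BeautifulSoup(html, "html.parser")
link = soup.find("a", class_="view")

# .contents yields each child node separately (a <b> tag, then a text node),
# which is why looping over it produces one list entry per fragment.
fragments = [str(node) for node in link.contents]

# get_text() concatenates the text of all descendants and drops the tags.
title = link.get_text()

print(fragments)
print(title)
```

Here `fragments` has two entries while `title` is a single string, so no regular expression is needed just to get rid of the `<b>` tags.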

That said, it is hard to tell exactly what you want. I'm guessing you want to pull out the titles:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://www.ncbi.nlm.nih.gov/pmc/?term=sirtuins"
soup = BeautifulSoup(urlopen(url), "html.parser")
new_content = []

# .text returns the combined text of each title div, with all tags removed.
for view in soup.find_all('div', attrs={'class': 'title'}):
    new_content.append(view.text)

Output:

print(new_content)
['The Role of Sirtuins in Antioxidant and Redox Signaling', 'Sirtuins, a promising target in slowing down the ageing process', 'The NAD+-Dependent Family of Sirtuins in Cerebral Ischemia and Preconditioning', 'Sirtuins and the metabolic hurdles in cancer', 'Mitochondrial Sirtuins and Molecular Mechanisms of Aging', 'Sirtuins in the Cardiovascular System: Potential Targets in Pediatric Cardiology', 'Sirtuins at the crossroads of stemness, aging, and cancer', 'Sirtuins of parasitic protozoa: In search of function(s)', 'Genealogy of an ancient protein family: the Sirtuins, a family of disordered members', 'Sirtuins and Their Roles in Brain Aging and Neurodegenerative Disorders', 'The controversial world of sirtuins', 'Potential Modulation of Sirtuins by Oxidative Stress', 'Sirtuins in Skin and Skin Cancers', 'Schistosoma mansoni Sirtuins: Characterization and Potential as Chemotherapeutic Targets', 'Sirtuins, aging, and cardiovascular risks', 'Controversial Impact of Sirtuins in Chronic Non-Transmissible Diseases and Rehabilitation Medicine', 'Sirtuins Link Inflammation and Metabolism', 'Sirtuins in metabolism, DNA repair and cancer', 'Sirtuins Expression and Their Role in Retinal Diseases', 'Application of Targeted Mass Spectrometry for the Quantification of Sirtuins in the Central Nervous System']
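
The expected output in the question also drops punctuation, such as the comma after "Sirtuins" and the `+-` in `NAD+-Dependent`, which the output above keeps. If that is really wanted, one possible post-processing step is to replace runs of non-word characters with a single space. This is only a sketch and not part of the original answer; the helper name `strip_punctuation` is made up for illustration, and the three sample titles are copied from the output above.

```python
import re

def strip_punctuation(title):
    """Replace each run of non-word characters with one space, then trim."""
    return re.sub(r"[^\w]+", " ", title).strip()

titles = [
    "The Role of Sirtuins in Antioxidant and Redox Signaling",
    "Sirtuins, a promising target in slowing down the ageing process",
    "The NAD+-Dependent Family of Sirtuins in Cerebral Ischemia and Preconditioning",
]

cleaned = [strip_punctuation(t) for t in titles]
print(cleaned)
```

Because the cleanup runs on the complete title rather than on each fragment, words are never split across list entries, and `NAD+-Dependent` becomes `NAD Dependent` instead of running together.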
