BeautifulSoup: extracting complete text sentences that contain arbitrary tags

<div class="title"><a ref="ordinalpos=2&ncbi_uid=5514220&link_uid=5514220&linksrc=docsum_title" href="/pmc/articles/PMC5514220/" class="view"><b>Sirtuins</b>, a promising target in slowing down the ageing process</a></div>

import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://www.ncbi.nlm.nih.gov/pmc/?term=sirtuins"
soup = BeautifulSoup(urlopen(url))
new_content = []
pattern = re.compile(r'(<b>)|(<\/b>)|(<em>)|(<\/em>)|[^\w ]|(PDF.*)')

for view in soup.findAll('a', attrs={'class':'view'}):
    for item in view.contents:
        new_content.append(re.sub(pattern,'', str(item), flags=0))

['The Role of ', 'Sirtuins', ' in Antioxidant and Redox Signaling', '', 'Sirtuins', ' a promising target in slowing down the ageing process', '', 'The NAD', 'supsup', 'Dependent Family of ', 'Sirtuins', ' in Cerebral Ischemia and Preconditioning', '',

['The Role of Sirtuins in Antioxidant and Redox Signaling', 'Sirtuins a promising target in slowing down the ageing process', 'The NAD Dependent Family of Sirtuins in Cerebral Ischemia and Preconditioning']

1 Answer

User

#1 · Posted 2024-09-28 22:01:57

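Can't you just grab the text from the div tags with class=title? The beauty of BeautifulSoup is that it understands how tags and markup work, so to get the text you don't need to filter out the markup tags yourself.

That said, it's hard to tell exactly what you want. I'm guessing you want to extract the titles.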

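from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://www.ncbi.nlm.nih.gov/pmc/?term=sirtuins"
# Name the parser explicitly to avoid bs4's "no parser specified" warning
soup = BeautifulSoup(urlopen(url), "html.parser")
new_content = []

# Each result title sits in a <div class="title">; .text joins the text of all nested tags
for view in soup.find_all('div', attrs={'class': 'title'}):
    new_content.append(view.text)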

Output:

print(new_content)
['The Role of Sirtuins in Antioxidant and Redox Signaling', 'Sirtuins, a promising target in slowing down the ageing process', 'The NAD+-Dependent Family of Sirtuins in Cerebral Ischemia and Preconditioning', 'Sirtuins and the metabolic hurdles in cancer', 'Mitochondrial Sirtuins and Molecular Mechanisms of Aging', 'Sirtuins in the Cardiovascular System: Potential Targets in Pediatric Cardiology', 'Sirtuins at the crossroads of stemness, aging, and cancer', 'Sirtuins of parasitic protozoa: In search of function(s)', 'Genealogy of an ancient protein family: the Sirtuins, a family of disordered members', 'Sirtuins and Their Roles in Brain Aging and Neurodegenerative Disorders', 'The controversial world of sirtuins', 'Potential Modulation of Sirtuins by Oxidative Stress', 'Sirtuins in Skin and Skin Cancers', 'Schistosoma mansoni Sirtuins: Characterization and Potential as Chemotherapeutic Targets', 'Sirtuins, aging, and cardiovascular risks', 'Controversial Impact of Sirtuins in Chronic Non-Transmissible Diseases and Rehabilitation Medicine', 'Sirtuins Link Inflammation and Metabolism', 'Sirtuins in metabolism, DNA repair and cancer', 'Sirtuins Expression and Their Role in Retinal Diseases', 'Application of Targeted Mass Spectrometry for the Quantification of Sirtuins in the Central Nervous System']
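The same idea can be tried without touching the network. The following is a minimal sketch that parses the sample snippet from the question as an inline string; the variable names `html`, `titles`, and `plain_titles` are made up for illustration, and the final punctuation-stripping step is only an assumption about what the comma-free desired output in the question was after.

```python
import re
from bs4 import BeautifulSoup

# Sample markup taken from the question (the `ref` attribute is omitted for brevity).
html = (
    '<div class="title"><a href="/pmc/articles/PMC5514220/" class="view">'
    '<b>Sirtuins</b>, a promising target in slowing down the ageing process'
    '</a></div>'
)

soup = BeautifulSoup(html, "html.parser")

# get_text() concatenates the text of every nested tag,
# so the <b>...</b> wrapper needs no special handling.
titles = [div.get_text() for div in soup.find_all("div", class_="title")]

# Optional (assumption): drop everything except word characters and spaces
# to get the punctuation-free form shown in the question.
plain_titles = [re.sub(r"[^\w ]", "", title) for title in titles]

print(titles)
print(plain_titles)
```

Because the text is collected per `div` rather than per child node, each title comes back as one string instead of the fragments that iterating over `view.contents` produces.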
