I'm trying to use Beautiful Soup to extract the title text from markup like this:
<div class="title"><a ref="ordinalpos=2&ncbi_uid=5514220&link_uid=5514220&linksrc=docsum_title" href="/pmc/articles/PMC5514220/" class="view"><b>Sirtuins</b>, a promising target in slowing down the ageing process</a></div>
Currently, I'm using a regular expression to match the tags and replace them with an empty string:
import re
from urllib.request import urlopen

from bs4 import BeautifulSoup

url = "https://www.ncbi.nlm.nih.gov/pmc/?term=sirtuins"
soup = BeautifulSoup(urlopen(url), "html.parser")
new_content = []
# Strip <b>/<em> tags, punctuation, and anything from "PDF" onward
pattern = re.compile(r'(<b>)|(<\/b>)|(<em>)|(<\/em>)|[^\w ]|(PDF.*)')
for view in soup.findAll('a', attrs={'class': 'view'}):
    # Each child node (text or tag) is cleaned and appended separately
    for item in view.contents:
        new_content.append(re.sub(pattern, '', str(item)))
However, what I end up with is a list of fragments, with each title broken across several entries:
['The Role of ',
'Sirtuins',
' in Antioxidant and Redox Signaling',
'',
'Sirtuins',
' a promising target in slowing down the ageing process',
'',
'The NAD',
'supsup',
'Dependent Family of ',
'Sirtuins',
' in Cerebral Ischemia and Preconditioning',
'',
Is there any way to extract/join/clean this text so that I'm left with whole sentences without these tags?
Thanks :)
Edit: expected output:
['The Role of Sirtuins in Antioxidant and Redox Signaling',
'Sirtuins a promising target in slowing down the ageing process',
'The NAD Dependent Family of Sirtuins in Cerebral Ischemia and Preconditioning']
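Can't you just grab the text of the div tags with class=title? The beauty of BeautifulSoup is that it understands how tags and markup work, so to get the text you don't need to filter out the markup tags yourself. That said, as mentioned, it's hard to tell exactly what you want. I'm guessing you want to pull out the titles.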
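A minimal sketch of that idea, assuming the page structure matches the snippet in the question. To keep it runnable offline, it parses an inline sample instead of fetching the URL. The second sample title is reconstructed from the fragments in the question, and extract_titles is a hypothetical helper name, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the snippet in the question (assumption: the
# live page uses the same <div class="title"> structure for every result).
html = """
<div class="title"><a href="/pmc/articles/PMC5514220/" class="view"><b>Sirtuins</b>, a promising target in slowing down the ageing process</a></div>
<div class="title"><a href="#" class="view">The Role of <b>Sirtuins</b> in Antioxidant and Redox Signaling</a></div>
"""

soup = BeautifulSoup(html, "html.parser")


def extract_titles(soup):
    """Return the full text of every <div class="title">, ignoring inline tags."""
    titles = []
    for div in soup.find_all("div", class_="title"):
        # get_text() concatenates all nested text nodes, so <b>, <em>, etc.
        # disappear without any regex work
        text = div.get_text()
        # Collapse runs of whitespace into single spaces
        titles.append(" ".join(text.split()))
    return titles


titles = extract_titles(soup)
print(titles)
```

Because get_text() works on the whole element rather than on each child node, every title comes back as one string. Punctuation such as the comma after "Sirtuins" is kept; if it really needs to go, as in the expected output, re.sub(r"[^\w ]", "", title) could be applied to each title afterwards.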