python中提取作者姓名的复杂正则表达式

2024-06-18 13:03:28 发布

您现在位置:Python中文网/ 问答频道 /正文

我试图创建一个regex,但没有成功,我要做的是获取任何一个类为(author | byline | writer)的html元素的内容

这是我到目前为止的情况

<([A-Z][A-Z0-9]*)class=\"(byLineTag|byline|author|by)\"[^>]*>(.*?)</\1>

我需要匹配的示例

^{pr2}$

或者

<div class="noindex"><span class="by">By </span><span class="byline"><a href="javascript:NewWindow(575,480,'/apps/pbcs.dll/personalia?ID=sshemkus',0)" title="Email Reporter">Sarah Shemkus</a></span></div>

任何帮助都将不胜感激。 -斯特凡


Tags: div元素示例内容byhtml情况class
3条回答

试试这个:

<([A-Z][A-Z0-9]*).*?class=\"(byLineTag|byline|author|by)\"[^>]*?>(.*?)</\1>

我添加的内容:
-*?,以防class属性没有出现在起始标记之后。
-*?,将*运算符设置为非贪心,以便查找结束符>;

由于前面提到的原因,强烈建议使用regexp解析html。使用现有的HTML解析器。作为一个如何简单的例子,我提供了一个使用lxml及其CSS选择器的示例。在

from lxml import etree
from lxml.cssselect import CSSSelector

## Your html string
html_string = '''<h6 class="byline">By <a rel="author" href="http://topics.nytimes.com/top/reference/timestopics/people/e/jack_ewing/index.html?inline=nyt-per" title="More Articles by Jack Ewing" class="meta-per">JACK EWING</a> and <a rel="author" href="http://topics.nytimes.com/top/reference/timestopics/people/t/landon_jr_thomas/index.html?inline=nyt-per" title="More Articles by Landon Thomas Jr." class="meta-per">LANDON THOMAS Jr.</a></h6>'''

## lxml html parser
html = etree.HTML(html_string)

## lxml CSS selector
sel = CSSSelector('.author, .byline, .writer')

## Call the selector to get matches
matching_elements = sel(html)

for elem in matching_elements:
    primt elem.text

Regex并不特别适合解析HTML。
值得庆幸的是,有专门为解析HTML而创建的工具,例如BeautifulSoup和{};后者如下所示:

markup = '''<h6 class="byline">By <a rel="author" href="http://topics.nytimes.com/top/reference/timestopics/people/e/jack_ewing/index.html?inline=nyt-per" title="More Articles by Jack Ewing" class="meta-per">JACK EWING</a> and <a rel="author" href="http://topics.nytimes.com/top/reference/timestopics/people/t/landon_jr_thomas/index.html?inline=nyt-per" title="More Articles by Landon Thomas Jr." class="meta-per">LANDON THOMAS Jr.</a></h6><div class="noindex"><span class="by">By </span><span class="byline"><a href="javascript:NewWindow(575,480,'/apps/pbcs.dll/personalia?ID=sshemkus',0)" title="Email Reporter">Sarah Shemkus</a></span></div>'''

import lxml.html

import lxml.html
doc = lxml.html.fromstring(markup)
for a in doc.cssselect('.author, .by, .byline, .byLineTag'):
    print a.text_content()
# By JACK EWING and LANDON THOMAS Jr.
# By 
# Sarah Shemkus

相关问题 更多 >