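<p>Regex is not well suited to parsing HTML.<br/>
Fortunately, there are tools built specifically for parsing HTML, such as <code>BeautifulSoup</code> and <code>lxml</code>; the latter is shown below:</p>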
<pre><code>markup = '''<h6 class="byline">By <a rel="author" href="http://topics.nytimes.com/top/reference/timestopics/people/e/jack_ewing/index.html?inline=nyt-per" title="More Articles by Jack Ewing" class="meta-per">JACK EWING</a> and <a rel="author" href="http://topics.nytimes.com/top/reference/timestopics/people/t/landon_jr_thomas/index.html?inline=nyt-per" title="More Articles by Landon Thomas Jr." class="meta-per">LANDON THOMAS Jr.</a></h6><div class="noindex"><span class="by">By </span><span class="byline"><a href="javascript:NewWindow(575,480,'/apps/pbcs.dll/personalia?ID=sshemkus',0)" title="Email Reporter">Sarah Shemkus</a></span></div>'''
import lxml.html
doc = lxml.html.fromstring(markup)
# Note: in recent lxml releases, cssselect() relies on the separate `cssselect` package.
for a in doc.cssselect('.author, .by, .byline, .byLineTag'):
    print(a.text_content())
# By JACK EWING and LANDON THOMAS Jr.
# By
# Sarah Shemkus
</code></pre>
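<p>If neither library can be installed, the same idea can be roughed out with the standard library's <code>html.parser</code>. This is only a minimal sketch, not part of the approach above: the names <code>BylineExtractor</code>, <code>extract_bylines</code>, <code>BYLINE_CLASSES</code> and <code>VOID_TAGS</code> are hypothetical, the sample markup is a shortened stand-in with placeholder <code>href</code> values, and it matches on class names only rather than supporting full CSS selectors.</p>

```python
from html.parser import HTMLParser

# Class names treated as "byline-like" (same set as the CSS selector above).
BYLINE_CLASSES = {"author", "by", "byline", "byLineTag"}
# Void elements never get a closing tag, so they are not tracked as open.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class BylineExtractor(HTMLParser):
    """Collect the text content of every element carrying a byline-like class."""

    def __init__(self):
        super().__init__()
        self.results = []  # one list of text fragments per matching element
        self._open = []    # stack of (tag, index into results or None)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = set((dict(attrs).get("class") or "").split())
        if classes & BYLINE_CLASSES:
            self.results.append([])
            self._open.append((tag, len(self.results) - 1))
        else:
            self._open.append((tag, None))

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag name, plus anything
        # left unclosed inside it.
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i][0] == tag:
                del self._open[i:]
                break

    def handle_data(self, data):
        # Text belongs to every matching element that is currently open.
        for _, index in self._open:
            if index is not None:
                self.results[index].append(data)


def extract_bylines(markup):
    """Return the text of each matching element, in document order."""
    parser = BylineExtractor()
    parser.feed(markup)
    parser.close()
    return ["".join(parts) for parts in parser.results]


markup = (
    '<h6 class="byline">By <a rel="author" href="#" class="meta-per">JACK EWING</a>'
    ' and <a rel="author" href="#" class="meta-per">LANDON THOMAS Jr.</a></h6>'
    '<div class="noindex"><span class="by">By </span>'
    '<span class="byline"><a href="#" title="Email Reporter">Sarah Shemkus</a></span></div>'
)

for text in extract_bylines(markup):
    print(text)
```

<p>For real pages, <code>lxml</code> or <code>BeautifulSoup</code> remains the more robust choice, since both cope with malformed markup and full selector syntax.</p>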