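<p>For the reasons mentioned earlier, parsing HTML with regular expressions is strongly discouraged. Use an existing HTML parser instead. To show how simple this can be, here is an example using lxml and its CSS selectors.</p>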
<pre><code>from lxml import etree
from lxml.cssselect import CSSSelector
## Your html string
html_string = '''<h6 class="byline">By <a rel="author" href="http://topics.nytimes.com/top/reference/timestopics/people/e/jack_ewing/index.html?inline=nyt-per" title="More Articles by Jack Ewing" class="meta-per">JACK EWING</a> and <a rel="author" href="http://topics.nytimes.com/top/reference/timestopics/people/t/landon_jr_thomas/index.html?inline=nyt-per" title="More Articles by Landon Thomas Jr." class="meta-per">LANDON THOMAS Jr.</a></h6>'''
## lxml html parser
html = etree.HTML(html_string)
## lxml CSS selector
sel = CSSSelector('.author, .byline, .writer')
## Call the selector to get matches
matching_elements = sel(html)
for elem in matching_elements:
    print(elem.text)  # .text holds only the text before the first child element, here "By "
</code></pre>
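<p>If lxml is not installed, roughly the same extraction can be sketched with the standard library's <code>html.parser</code>. This is only an illustrative alternative, not part of the lxml approach above: the class name <code>BylineParser</code>, its <code>wanted</code> parameter, and the shortened <code>href</code> values are assumptions made for the example, and the void-element handling is deliberately simplified.</p>

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; skipped so the nesting counter stays balanced.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class BylineParser(HTMLParser):
    """Collect the full text of every element carrying one of the wanted classes.

    Hypothetical helper for illustration; assumes reasonably well-formed HTML.
    """

    def __init__(self, wanted=("author", "byline", "writer")):
        super().__init__()
        self.wanted = set(wanted)
        self.depth = 0      # > 0 while we are inside a matching element
        self.chunks = []    # text pieces of the match being collected
        self.results = []   # finished matches, one string per element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        # Start counting at a matching element, keep counting for its descendants.
        if self.depth or self.wanted.intersection(classes):
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            # The matching element just closed: store its concatenated text.
            self.results.append("".join(self.chunks))
            self.chunks = []

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


# Shortened version of the byline markup (href values replaced with "#").
html_string = (
    '<h6 class="byline">By '
    '<a rel="author" href="#" class="meta-per">JACK EWING</a> and '
    '<a rel="author" href="#" class="meta-per">LANDON THOMAS Jr.</a></h6>'
)

parser = BylineParser()
parser.feed(html_string)
print(parser.results)
```

<p>Unlike <code>elem.text</code> in the lxml example, this collects the text of nested elements as well, so the whole byline comes back as one string.</p>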