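<p>Since most HTML is basically XML (or can easily be trimmed down to something most XML parsers will accept), I'd suggest using an XML parser. Python's HTML parser works along much the same lines as its XML parsers anyway.</p>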
<p>Check out: <a href="http://oreilly.com/catalog/pythonxml/chapter/ch01.html" rel="nofollow">Python and XML</a>.</p>
<p>There is a good tutorial here: <a href="http://www.travisglines.com/web-coding/python-xml-parser-tutorial" rel="nofollow">Python XML Parser Tutorial</a>.</p>
<p>Also, <a href="http://docs.python.org/library/xml.dom.minidom.html" rel="nofollow">xml.dom.minidom Class</a> has been very useful to me personally.</p>
<p>Another, similar approach is explained here: <a href="http://docs.python.org/library/xml.etree.elementtree.html" rel="nofollow">xml.etree.ElementTree</a>.</p>
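<p>As a rough illustration of the ElementTree route, here is a minimal sketch that assumes the markup is already well-formed XML. The sample fragment and the variable names are made up for this example; real-world HTML often needs cleaning up before <code>fromstring</code> will accept it.</p>

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed XHTML fragment used only for illustration.
markup = """\
<html>
  <body>
    <h1>Demo page</h1>
    <ul>
      <li><a href="http://example.com/a">First link</a></li>
      <li><a href="http://example.com/b">Second link</a></li>
    </ul>
  </body>
</html>
"""

# fromstring() parses the text and returns the root element (<html> here).
root = ET.fromstring(markup)

# iter() walks the whole tree and yields every element with the given tag.
links = [(a.get("href"), a.text) for a in root.iter("a")]

# find() takes a simple path relative to the element it is called on.
heading = root.find("body/h1").text

print(heading)
for href, text in links:
    print(href, text)
```

<p>If the input is not well-formed, <code>fromstring</code> raises <code>xml.etree.ElementTree.ParseError</code>, so this only works once the HTML has been trimmed into valid XML.</p>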
<p>Here is a good example from the <a href="http://docs.python.org/library/xml.dom.minidom.html" rel="nofollow">xml.dom.minidom reference page</a>:</p>
<pre><code>import xml.dom.minidom

document = """\
<slideshow>
<title>Demo slideshow</title>
<slide><title>Slide title</title>
<point>This is a demo</point>
<point>Of a program for processing slides</point>
</slide>
<slide><title>Another demo slide</title>
<point>It is important</point>
<point>To have more than</point>
<point>one slide</point>
</slide>
</slideshow>
"""

# Parse the XML string into a DOM tree.
dom = xml.dom.minidom.parseString(document)

def getText(nodelist):
    # Concatenate the data of the text nodes directly inside nodelist.
    rc = []
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc.append(node.data)
    return ''.join(rc)

def handleSlideshow(slideshow):
    print("<html>")
    # The first <title> in document order is the slideshow's own title.
    handleSlideshowTitle(slideshow.getElementsByTagName("title")[0])
    slides = slideshow.getElementsByTagName("slide")
    handleToc(slides)
    handleSlides(slides)
    print("</html>")

def handleSlides(slides):
    for slide in slides:
        handleSlide(slide)

def handleSlide(slide):
    handleSlideTitle(slide.getElementsByTagName("title")[0])
    handlePoints(slide.getElementsByTagName("point"))

def handleSlideshowTitle(title):
    print("<title>%s</title>" % getText(title.childNodes))

def handleSlideTitle(title):
    print("<h2>%s</h2>" % getText(title.childNodes))

def handlePoints(points):
    print("<ul>")
    for point in points:
        handlePoint(point)
    print("</ul>")

def handlePoint(point):
    print("<li>%s</li>" % getText(point.childNodes))

def handleToc(slides):
    # Table of contents: one paragraph per slide title.
    for slide in slides:
        title = slide.getElementsByTagName("title")[0]
        print("<p>%s</p>" % getText(title.childNodes))

handleSlideshow(dom)
</code></pre>
<p>If you absolutely must use a regex instead of a parser, take a look at the <a href="http://docs.python.org/library/re.html" rel="nofollow">re module</a>.</p>
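<p>A minimal sketch of the regex route, assuming the goal is simply to pull <code>href</code> values out of anchor tags. The sample markup and the <code>HREF_RE</code> name are invented for this example, and the pattern is deliberately simple: it handles quoted attribute values only.</p>

```python
import re

# Hypothetical input; one link uses double quotes, the other single quotes.
html = (
    '<p>See <a href="http://example.com/a">first</a> and '
    "<a class='x' href='http://example.com/b'>second</a>.</p>"
)

# Match an <a ...> tag, skip any attributes before href, then capture the
# quoted value. \1 requires the closing quote to match the opening one.
HREF_RE = re.compile(r'<a\s[^>]*?href\s*=\s*(["\'])(.*?)\1', re.IGNORECASE)

# group(1) is the quote character, group(2) is the URL itself.
hrefs = [m.group(2) for m in HREF_RE.finditer(html)]

print(hrefs)
```

<p>A pattern like this will miss unquoted attribute values and can be fooled by comments or script blocks, which is the main reason to prefer a parser when you can.</p>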