Answers to the question: Python, using regex to search for HTML tags in a file



1 Answer

Anonymous, 1 day ago

Most HTML is close to XML (or can easily be tidied up until most XML parsers accept it), so I would suggest using an XML parser rather than regex. Python's HTML parser works along much the same lines as its XML parsers anyway.

Check out <a href="http://oreilly.com/catalog/pythonxml/chapter/ch01.html" rel="nofollow">Python and XML</a>. There is a good tutorial here: <a href="http://www.travisglines.com/web-coding/python-xml-parser-tutorial" rel="nofollow">Python XML Parser Tutorial</a>. I have also personally found the <a href="http://docs.python.org/library/xml.dom.minidom.html" rel="nofollow">xml.dom.minidom Class</a> very useful. A similar alternative is documented here: <a href="http://docs.python.org/library/xml.etree.elementtree.html" rel="nofollow">xml.etree.ElementTree</a>.

Here is a good example from the <a href="http://docs.python.org/library/xml.dom.minidom.html" rel="nofollow">xml.dom.minidom reference page</a>:

<pre><code>import xml.dom.minidom

document = """\
<slideshow>
<title>Demo slideshow</title>
<slide><title>Slide title</title>
<point>This is a demo</point>
<point>Of a program for processing slides</point>
</slide>
<slide><title>Another demo slide</title>
<point>It is important</point>
<point>To have more than</point>
<point>one slide</point>
</slide>
</slideshow>
"""

dom = xml.dom.minidom.parseString(document)

def getText(nodelist):
    # Join the data of all direct text-node children.
    rc = []
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc.append(node.data)
    return ''.join(rc)

def handleSlideshow(slideshow):
    print("<html>")
    handleSlideshowTitle(slideshow.getElementsByTagName("title")[0])
    slides = slideshow.getElementsByTagName("slide")
    handleToc(slides)
    handleSlides(slides)
    print("</html>")

def handleSlides(slides):
    for slide in slides:
        handleSlide(slide)

def handleSlide(slide):
    handleSlideTitle(slide.getElementsByTagName("title")[0])
    handlePoints(slide.getElementsByTagName("point"))

def handleSlideshowTitle(title):
    print("<title>%s</title>" % getText(title.childNodes))

def handleSlideTitle(title):
    print("<h2>%s</h2>" % getText(title.childNodes))

def handlePoints(points):
    print("<ul>")
    for point in points:
        handlePoint(point)
    print("</ul>")

def handlePoint(point):
    print("<li>%s</li>" % getText(point.childNodes))

def handleToc(slides):
    # Print each slide's title as a table-of-contents entry.
    for slide in slides:
        title = slide.getElementsByTagName("title")[0]
        print("%s" % getText(title.childNodes))

handleSlideshow(dom)
</code></pre>

If you absolutely must use regex instead of a parser, take a look at the <a href="http://docs.python.org/library/re.html" rel="nofollow">re module</a>.
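
For the regex route, the following is only a minimal sketch and not the code from the original answer. The pattern `TAG_RE`, the helpers `find_tags` and `find_tags_in_file`, and the `sample` string are all hypothetical names made up for illustration. It assumes fairly simple markup: it will be fooled by a `>` inside an attribute value and by tag-like text inside scripts or comments, which is exactly why a parser is usually the safer choice.

```python
import re

# Hypothetical pattern (an assumption, not from the original answer):
# group 1 is "/" for a closing tag, group 2 is the tag name.
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)\b[^>]*>")


def find_tags(text):
    """Return (tag_name, is_closing) pairs in order of appearance."""
    return [(m.group(2).lower(), m.group(1) == "/")
            for m in TAG_RE.finditer(text)]


def find_tags_in_file(path, encoding="utf-8"):
    """Read a whole file and search it with the same pattern."""
    with open(path, encoding=encoding) as fh:
        return find_tags(fh.read())


# Made-up sample input for illustration.
sample = '<html><body><a href="x.html">link</a><br/></body></html>'

for name, is_closing in find_tags(sample):
    print(("closing " if is_closing else "opening ") + name)
```

Reading the whole file at once keeps tags that span several lines matchable; for very large files a streaming parser would probably be a better fit than this approach.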
