Python web scraping

Govt has nothing to do with former CAG official RP Singh: Sibal</span></a></h2></div><div class="esc-lead-article-source-wrapper"> <table class="al-attribution single-line-height" cellspacing="0" cellpadding="0"> <tbody><tr><td class="al-attribution-cell source-cell"> <span class='al-attribution-source'>Times of India</span></td> <td class="al-attribution-cell timestamp-cell"> <span class='dash-separator'> - </span> <span class='al-attribution-timestamp'>&lrm;46 minutes ago&lrm;

3 answers

User

Answer 1 · edited 2024-10-02 10:25:45

.* is a greedy match for any characters: it consumes as many characters as it can. Use the non-greedy version .*? instead, for example:

pathstring = '<span class="titletext">(.*?)</span>'
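To see the difference, here is a small sketch that uses only the standard re module. The sample string is made up for illustration (it is not the markup from the question), and the names sample, greedy and lazy are my own.

```python
import re

# Hypothetical sample with two title spans (not the question's real markup).
sample = (
    '<span class="titletext">First headline</span>'
    '<span class="titletext">Second headline</span>'
)

# Greedy: .* runs to the LAST </span>, swallowing everything in between.
greedy = re.findall(r'<span class="titletext">(.*)</span>', sample)

# Non-greedy: .*? stops at the FIRST </span> after each opening tag.
lazy = re.findall(r'<span class="titletext">(.*?)</span>', sample)

print(greedy)
print(lazy)
```

The greedy pattern yields a single match that still contains the intermediate tags, while the non-greedy pattern yields one match per span.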

User

Answer 2 · edited 2024-10-02 10:25:45

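.* also matches </span>, so the match keeps going until the last </span> in the input.

The best answer is: don't parse HTML with regular expressions. Use the lxml library (or something similar).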

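from lxml import html

# Placeholder: substitute the HTML of the page you fetched.
html_string = '<blah>'
tree = html.fromstring(html_string)
# Select every <span> whose class attribute is exactly "titletext".
titles = tree.xpath("//span[@class='titletext']")
for title in titles:
    print(title.text)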

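Using a proper XML/HTML parser will save you a lot of time and trouble. If you roll your own parser, you have to deal with malformed markup, comments, and countless other things. Don't reinvent the wheel.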
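If installing lxml is not an option, the same idea can be sketched with the standard library's html.parser module. This is a swapped-in alternative to lxml, not a replacement for it; the class name TitleTextParser and the sample markup are hypothetical, chosen to show that nested tags and commented-out markup are handled without any regex work.

```python
from html.parser import HTMLParser


class TitleTextParser(HTMLParser):
    """Collects the text of every <span class="titletext"> element."""

    def __init__(self):
        super().__init__()
        self.titles = []      # finished title strings
        self._depth = 0       # >0 while inside a titletext span
        self._chunks = []     # text pieces of the current title

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._depth:
            self._depth += 1  # nested span inside a title
        elif "titletext" in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "span" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.titles.append("".join(self._chunks))

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


# Hypothetical sample: a nested tag and a commented-out span.
sample = (
    '<div><span class="titletext">First <b>bold</b> headline</span>'
    '<!-- <span class="titletext">commented out</span> -->'
    '<span class="titletext">Second headline</span></div>'
)
parser = TitleTextParser()
parser.feed(sample)
parser.close()
print(parser.titles)
```

The span inside the comment is skipped because the parser reports comments separately from tags, which is exactly the kind of case a regex tends to get wrong.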

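User

Answer 3 · edited 2024-10-02 10:25:45

I'd suggest using pyquery instead of going crazy with regular expressions. It is built on lxml and makes HTML parsing as easy as using jQuery.

Something like this is all you need: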

from pyquery import PyQuery
doc = PyQuery(html)  # html is the page source as a string
doc('span.titletext').text()

You could also use BeautifulSoup, but the conclusion is always the same: don't parse HTML with regular expressions. There are tools that make your life easier.
