Regex multiline: how to get part of a page's source

<div id="viewed"><div class="shortstory-block"> <div class="shortstoey-block-image"> <a href="...."><img src="/uploads/posts/cov.jpg" alt="instance 1"/></a> <span class="format"><a href="http://www..../">something</a></span> </div> <a href="http://....."><span class="shortstory-block-title" style="text-decoration:none !important;"> Something </span> </a> </div><div class="shortstory-block"> <div class="shortstoey-block-image"> <a href="...."><img src="/uploads/posts/cov.jpg" alt="something 2"/></a> <span class="format"><a href="http://www.website/xfsearch/smth/">something</a></span> </div> <a href="http://web.html"><span class="shortstory-block-title" style="text-decoration:none !important;"> Something </span> </a> </div> (* x times) <div id="rated">....

2 answers

User

#1 · edited 2024-09-26 18:03:53

The re.DOTALL flag makes . match any character. Without that flag, . does not match newlines.

(DOTALL can also be spelled as (?s) inside the regexp itself.)
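
A minimal sketch of the difference. The `text` string and the pattern here are made-up illustrations, not taken from the question's page:

```python
import re

# Hypothetical sample: the opening and closing markers sit on different lines.
text = '<div id="viewed">\nfirst block\n</div>'

pattern = r'<div id="viewed">(.*?)</div>'

# Without DOTALL, "." cannot cross the newline, so nothing matches.
no_flag = re.search(pattern, text)

# With DOTALL, "." also matches "\n", so the capture spans several lines.
with_flag = re.search(pattern, text, flags=re.DOTALL)

# The inline form (?s) at the start of the pattern behaves the same way.
inline = re.search(r'(?s)' + pattern, text)

print(no_flag)                    # None
print(repr(with_flag.group(1)))   # '\nfirst block\n'
print(repr(inline.group(1)))      # '\nfirst block\n'
```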

For a similar question, with code examples and a better approach, see: Python's "re" module not working?

User

#2 · edited 2024-09-26 18:03:53

If you really just want to find whatever lies between two elements in the text, you can use the following regular expression:

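import re

# Read the whole page into one string so the pattern can span lines.
with open('yourfile') as fin:
    page_source = fin.read()

# Escape the markers so they are matched literally.
start_text = re.escape('<div id="viewed">')
until_text = re.escape('<div id="rated">')
# Non-greedy capture between the markers; DOTALL lets "." cross newlines.
match_text = re.search('{}(.*?){}'.format(start_text, until_text), page_source, flags=re.DOTALL)
if match_text:
    print(match_text.group(1))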
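
To try the same idea without a file on disk, the approach can be wrapped in a small helper. This is only a sketch: `extract_between` is a hypothetical name, and `sample` is a shortened, made-up stand-in for the page source in the question.

```python
import re

def extract_between(source, start, end):
    """Return the text between the first `start` and the next `end`, or None."""
    # re.escape keeps characters such as quotes and dots in the markers literal.
    pattern = "{}(.*?){}".format(re.escape(start), re.escape(end))
    match = re.search(pattern, source, flags=re.DOTALL)
    return match.group(1) if match else None

# Made-up, shortened page source that spans several lines.
sample = (
    '<div id="viewed">\n'
    '<div class="shortstory-block">first</div>\n'
    '<div class="shortstory-block">second</div>\n'
    '<div id="rated">'
)

section = extract_between(sample, '<div id="viewed">', '<div id="rated">')
print(section)

# If a marker is absent, the helper returns None instead of raising.
missing = extract_between(sample, '<div id="viewed">', '<div id="absent">')
print(missing)
```

The non-greedy `(.*?)` stops at the first occurrence of the end marker, which is what keeps the capture from running on to a later `<div id="rated">` if one exists.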
