Regex for multiple lines and HTML tags

Posted 2024-10-02 18:17:20

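I am writing a script that should open 10 text files in turn (they are the source code of different web pages). I then want the script to go through each one and replace every instance of <br /> with \n. After that I want it to strip out the entire header, basically. In every case the document starts with DOCTYPE, and the last line before the information I want ends with: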

"decoration:underline">no year</span><br />

As far as I know, the regex /.../s means "ignore line breaks", and I have escaped the HTML / that appears in the </span> tag. So far I have the following:

[The asker's code block was not preserved in this copy of the page.]

However, all I get back is the same, unchanged string.

The expected output is as follows:

"""  <b><a target="_blank" href="http://movies.netflixable.com/224599">Beautiful Game, The</a> (no year)</b>&nbsp;&nbsp;<i style="font-size:small"> 3.5 stars, 1hr 24m&nbsp;&nbsp;<a target="_blank" href="http://www.imdb.com/search/title?title=The Beautiful Game">imdb</a></i>  <img class="cc_img" src="http://bit.ly/VqRKtD" border="0" style="padding:0px !important;" title="Closed Captions: ---  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  - English  " alt="Closed Captions: ---  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  - English  " />  

<b><a target="_blank" href="http://movies.netflixable.com/224278">Brave Miss World</a> (no year)</b>&nbsp;&nbsp;<i style="font-size:small"> 3.7 stars, 1hr 28m&nbsp;&nbsp;<a target="_blank" href="http://www.imdb.com/search/title?title=Brave Miss World">imdb</a></i>  <img class="cc_img" src="http://bit.ly/VqRKtD" border="0" style="padding:0px !important;" title="Closed Captions: ---  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  " alt="Closed Captions: ---  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  " />  """

1 answer
User
Answer 1 · Posted 2024-10-02 18:17:20

The main problem is that your regex pattern is not valid Python regex syntax.

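In r'/^<!DOCTYPE.+no year<\/span>/s', the leading / and the trailing /s are treated as part of the pattern, not as delimiters or modifiers of its behavior. This looks like PCRE regex syntax à la PHP, which Python does not support. Instead, to make . match any character (including newlines), you need to set the re.DOTALL flag, as shown below.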
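
A minimal sketch of the difference, using a shortened, made-up sample string (`sample` is an assumption for illustration, not the asker's data). With the PHP-style delimiters the pattern requires a literal / before the start-of-string anchor and so can never match; without the flag, . stops at the newline; with re.DOTALL the header is removed.

```python
import re

# Hypothetical, shortened stand-in for the page source (note the newline).
sample = "<!DOCTYPE html>\nheader junk no year</span>KEEP"

# PHP-style delimiters: the slashes and the trailing "s" are literal pattern text.
php_style = re.sub(r'/^<!DOCTYPE.+no year<\/span>/s', '', sample)

# Same pattern without delimiters but without a flag: "." does not match "\n".
no_flag = re.sub(r'^<!DOCTYPE.+no year</span>', '', sample)

# With re.DOTALL, "." also matches "\n", so the whole header is matched.
dotall = re.sub(r'^<!DOCTYPE.+no year</span>', '', sample, flags=re.DOTALL)

print(php_style == sample)  # True: nothing was replaced
print(no_flag == sample)    # True: nothing was replaced
print(dotall)               # KEEP
```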

Another problem is that the return values of create_linebreaks() and clean_up() are not assigned back to data, so the changes are lost.
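
This follows from Python strings being immutable: str.replace() and re.sub() return a new string and leave the original untouched. A small sketch with a made-up two-word input:

```python
def create_linebreaks(text):
    # str.replace returns a new string; the argument itself is not modified.
    return text.replace('<br />', '\n')

data = "one<br />two"

create_linebreaks(data)          # result discarded, data is unchanged
unchanged = data

data = create_linebreaks(data)   # result assigned back, data now holds the new string

print(repr(unchanged))  # 'one<br />two'
print(repr(data))       # 'one\ntwo'
```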

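Also, you do not want a raw string for the newline in create_linebreaks(); a normal string is fine. Otherwise you would replace <br /> with the two literal characters \ and n (that is, '\\n') instead of an actual line break.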
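
To make the raw-string point concrete, here is a short sketch (the `html` value is a made-up example):

```python
html = "a<br />b"

normal = html.replace('<br />', '\n')   # inserts a real newline character
raw = html.replace('<br />', r'\n')     # inserts a backslash followed by the letter n

print(len('\n'), len(r'\n'))  # 1 2
print(repr(normal))           # 'a\nb'
print(repr(raw))              # 'a\\nb'
```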

import re

def create_linebreaks(l):
    l = l.replace('<br />', '\n')
    return l

def clean_up(line):
    line = re.sub(r'^<!DOCTYPE.+no year<\/span>', '', line, flags=re.DOTALL)
    return line

data = """<!DOCTYPE html><html class='v2' dir='ltr' xmlns='http://www.w3.org/1999/xhtml' xmlns:b='http://www.google.com/2005/gml/b' movie/file/show/episodes is 2763.</p>A LOAD OF OTHER HTML I DON'T WANT TO BE IN THE OUTPUT
<!  google_ad_section_start(weight=ignore)  ><span class="listings"><span style="font-size:large;font-weight:bold; text-decoration:underline">no year</span><br />  <b><a target="_blank" href="http://movies.netflixable.com/224599">Beautiful Game, The</a> (no year)</b>&nbsp;&nbsp;<i style="font-size:small"> 3.5 stars, 1hr 24m&nbsp;&nbsp;<a target="_blank" href="http://www.imdb.com/search/title?title=The Beautiful Game">imdb</a></i>  <img class="cc_img" src="http://bit.ly/VqRKtD" border="0" style="padding:0px !important;" title="Closed Captions:  -  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  - English  " alt="Closed Captions:  -  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  - English  " />  <br />  <br />  <b><a target="_blank" href="http://movies.netflixable.com/224278">Brave Miss World</a> (no year)</b>&nbsp;&nbsp;<i style="font-size:small"> 3.7 stars, 1hr 28m&nbsp;&nbsp;<a target="_blank" href="http://www.imdb.com/search/title?title=Brave Miss World">imdb</a></i>  <img class="cc_img" src="http://bit.ly/VqRKtD" border="0" style="padding:0px !important;" title="Closed Captions:  -  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  " alt="Closed Captions:  -  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  " />  <br />  <br />"""

data = create_linebreaks(data)
data = clean_up(data)

>>> print(data)

  <b><a target="_blank" href="http://movies.netflixable.com/224599">Beautiful Game, The</a> (no year)</b>&nbsp;&nbsp;<i style="font-size:small"> 3.5 stars, 1hr 24m&nbsp;&nbsp;<a target="_blank" href="http://www.imdb.com/search/title?title=The Beautiful Game">imdb</a></i>  <img class="cc_img" src="http://bit.ly/VqRKtD" border="0" style="padding:0px !important;" title="Closed Captions:  -  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  - English  " alt="Closed Captions:  -  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  - English  " />  

  <b><a target="_blank" href="http://movies.netflixable.com/224278">Brave Miss World</a> (no year)</b>&nbsp;&nbsp;<i style="font-size:small"> 3.7 stars, 1hr 28m&nbsp;&nbsp;<a target="_blank" href="http://www.imdb.com/search/title?title=Brave Miss World">imdb</a></i>  <img class="cc_img" src="http://bit.ly/VqRKtD" border="0" style="padding:0px !important;" title="Closed Captions:  -  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  " alt="Closed Captions:  -  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  " />  


>>> 
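
The question mentions running this over 10 text files. One way that might look is sketched below; the helper names `clean_page` and `clean_files`, the file names, and the UTF-8 encoding are all assumptions for illustration, and the demo writes two throwaway files so that it runs on its own.

```python
import re
import tempfile
from pathlib import Path

# Compile once; re.DOTALL lets "." span the line breaks in the header.
HEADER = re.compile(r'^<!DOCTYPE.+no year</span>', flags=re.DOTALL)

def clean_page(text):
    """Turn <br /> into newlines, then drop everything up to the 'no year' heading."""
    text = text.replace('<br />', '\n')
    return HEADER.sub('', text)

def clean_files(paths):
    """Return a dict mapping each path to its cleaned contents (UTF-8 assumed)."""
    return {path: clean_page(Path(path).read_text(encoding='utf-8')) for path in paths}

# Demo with two temporary files; names and contents are made up.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i in range(2):
        p = Path(tmp) / f"page{i}.txt"
        p.write_text(f"<!DOCTYPE html>\njunk no year</span><br />movie {i}<br />",
                     encoding='utf-8')
        paths.append(p)
    results = clean_files(paths)
    cleaned = [results[p] for p in paths]

print(cleaned)  # ['\nmovie 0\n', '\nmovie 1\n']
```

For the real pages, `paths` would be the list of the 10 saved source files, read with whatever encoding they were actually saved in.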
