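I am converting a book from PDF to EPUB. The chapter titles are not in header tags, so I am trying to add them with a Python function that does a regex replacement.
Sample text: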
<p class="calibre1"><a id="p1"></a>Chapter 370: Slamming straight on</p>
<p class="softbreak"> </p>
<p class="calibre1">Hearing Yan Zhaoge’s suggestion, the Jade Sea City martial practitioners here were all stunned.</p>
<p class="calibre1"><a id="p7"></a>Chapter 372: Yan Zhaoge’s plan</p>
<p class="softbreak"> </p>
<p class="calibre1">Yan Zhaoge and Ah Hu sat on Pan-Pan’s back, black water swirling about Pan-Pan’s entire body, keeping away the seawater as he shot forward at lightning speed.</p>
I tried using regex with:
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    pattern = r"</a>(?i)chapter [0-9]+: [\w\s]+(.*)<br>"
    list = re.findall(pattern, match.group())
    for x in list:
        x = "</a>(?i)chapter [0-9]+: [\w\s]+(.?)<br>"
        x = s.split("</a>", 1)[0] + '</a><h2>' + s.split("a>",1)[1]
        x = s.split("<br>", 1)[0] + '</h2><br>' + s.split("<br>",1)[1]
    return match.group()
and also:
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    pattern = r"</a>(?i)chapter [0-9]+: [\w\s]+(.*)<br>"
    s.replace(re.match(pattern, s), r'<h2>$0')
Neither attempt gives the expected result. What I want is to turn
</a>Chapter 370: Slamming straight on</p>
into
</a><h2>Chapter 370: Slamming straight on</h2></p>
with the h2 tags added in all similar cases.
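Jean-François's comment would be the better route, but if we have to do it this way, I'm guessing we could start with an expression like the following:
and replace with: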
Test
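The expression and replacement from this answer did not survive in the text here, so the following is only a minimal sketch of what such a substitution might look like, not the answerer's actual pattern. It assumes every chapter title sits directly between `</a>` and `</p>`, as in the sample text; `CHAPTER_RE` and `add_h2` are hypothetical names. Two things in the question's attempts are worth noting: the sample contains no `<br>`, so a pattern ending in `<br>` cannot match it, and Python's `re` writes backreferences as `\1` or `\g<0>`, not `$0`. From Python 3.11 an inline `(?i)` is also only accepted at the very start of a pattern, so the flag is passed separately here.

```python
import re

# Hypothetical pattern (an assumption, not the answer's original expression):
#   group 1: the closing anchor tag
#   group 2: "Chapter <number>: <title>", where the title runs up to the next "<"
#   group 3: the closing paragraph tag
CHAPTER_RE = re.compile(r"(</a>)(chapter [0-9]+: [^<]+)(</p>)", re.IGNORECASE)


def add_h2(html: str) -> str:
    """Wrap each chapter title that directly follows </a> in <h2> tags."""
    return CHAPTER_RE.sub(r"\1<h2>\2</h2>\3", html)


sample = (
    '<p class="calibre1"><a id="p1"></a>Chapter 370: Slamming straight on</p>\n'
    '<p class="softbreak"> </p>\n'
    '<p class="calibre1">Hearing Yan Zhaoge’s suggestion, the Jade Sea City '
    'martial practitioners here were all stunned.</p>\n'
)

print(add_h2(sample))
```

Inside a `replace(match, ...)` function like the one in the question, the same substitution could presumably be applied to `match.group()` and returned, although that depends on what the search expression passed to the function actually matches.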
Regex should not be used to parse XML. See: Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms ("Why shouldn't you..." would be a better title). However, you can use BeautifulSoup instead:
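The code that accompanied this answer is not preserved here, so what follows is a hedged sketch of how the BeautifulSoup approach might look rather than the answerer's own code. It assumes the `bs4` package is installed and that each chapter title is the text node immediately after an `<a id="...">` anchor, as in the sample; `CHAPTER_RE` and `add_h2_with_soup` are hypothetical names.

```python
import re

from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Assumed title shape, taken from the sample text: "Chapter <number>: ..."
CHAPTER_RE = re.compile(r"chapter [0-9]+:", re.IGNORECASE)


def add_h2_with_soup(html: str) -> str:
    """Wrap chapter-title text nodes that follow an <a id=...> anchor in <h2>."""
    soup = BeautifulSoup(html, "html.parser")
    for anchor in soup.find_all("a", id=True):
        node = anchor.next_sibling
        # Only touch plain text that looks like a chapter title.
        if isinstance(node, NavigableString) and CHAPTER_RE.match(node):
            node.wrap(soup.new_tag("h2"))
    return str(soup)


sample = (
    '<p class="calibre1"><a id="p1"></a>Chapter 370: Slamming straight on</p>'
    '<p class="softbreak"> </p>'
    '<p class="calibre1">Hearing Yan Zhaoge’s suggestion, the Jade Sea City '
    'martial practitioners here were all stunned.</p>'
)

print(add_h2_with_soup(sample))
```

Working on the parsed tree means the title is located by its position next to the anchor rather than by the surrounding tag text, so changes in attributes or whitespace elsewhere in the paragraph should not affect the match.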
Output