Removing HTML tag content from a page with Python

<!DOCTYPE HTML>
<html>
<head>
  <title>Sezione microbiologia</title>
  <link rel="stylesheet" src="./style.css">
</head>
<body>
  <div id="content">
    <section id="main">
      <h1>Prima diluizione</h1>
      <p>Some content including "prima diluizione"...</p>
      <h1>Seconda diluizione</h1>
      <p>Some content including "seconda diluizione"...</p>
      <h1>Terza diluizione</h1>
      <p>Some content including "terza diluizione"...</p>
    </section>
    <section id="second">  </section>
    <section id="third">  </section>
    <section id="footer">  </section>
  </div>
</body>
</html>

<div id="content">
  <section id="main">
    <h1>Prima diluizione seriale</h1>
    <p>Some content including "prima diluizione seriale"...</p>
    <h1>Seconda diluizione seriale</h1>
    <p>Some content including "seconda diluizione seriale"...</p>
    <h1>Terza diluizione seriale</h1>
    <p>Some content including "terza diluizione seriale"...</p>
  </section>

<div id="content">
  <section id="main">
    <h1>Diluizione seriale</h1>
    <p>Some content including "prima diluizione"...</p>
    <h1>Diluizione seriale</h1>
    <p>Some content including "seconda diluizione"...</p>
    <h1>Diluizione seriale</h1>
    <p>Some content including "terza diluizione"...</p>
  </section>

2 answers

User
#1 · edited 2024-09-28 01:28:53

Take a look at html.parser. Rather than attempting string interpolation, parse the HTML into a structure and then traverse it from there.
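As a rough illustration of that suggestion, here is a minimal sketch built on the standard library's html.parser.HTMLParser. The class name H1Rewriter and the helper replace_h1_text are made up for this example and are not part of any library. The sketch assumes each h1 contains plain text only (no nested tags) and re-emits everything else as it was parsed.

```python
from html.parser import HTMLParser


class H1Rewriter(HTMLParser):
    """Re-emit the document, swapping the text of every <h1> for a fixed string.

    Hypothetical helper for illustration; assumes <h1> holds plain text only.
    """

    def __init__(self, replacement):
        # convert_charrefs=False so entities such as &amp; reach the
        # handle_*ref hooks and can be written back unchanged.
        super().__init__(convert_charrefs=False)
        self.replacement = replacement
        self.out = []
        self.in_h1 = False

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the tag exactly as it appeared in the source.
        self.out.append(self.get_starttag_text())
        if tag == "h1":
            self.in_h1 = True
            self.out.append(self.replacement)

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through as-is.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Drop the original heading text; keep all other text.
        if not self.in_h1:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.in_h1:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.in_h1:
            self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def replace_h1_text(html, replacement):
    """Return html with the text of every <h1> replaced by replacement."""
    parser = H1Rewriter(replacement)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


sample = '<section id="main"> <h1>Prima diluizione</h1> <p>Some content including "prima diluizione"...</p> </section>'
print(replace_h1_text(sample, "Diluizione seriale"))
```

Because the parser tracks whether it is inside an h1, the number of words in the heading does not matter, and text in p elements that happens to contain the same words is left alone.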

User
#2 · edited 2024-09-28 01:28:53

You can do this with regular expressions through Python's re module. To change only the text inside the h1 tags, use a positive lookbehind together with a positive lookahead.
Code:
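import re

# Read the whole page into a single string.
with open("path/to/home.html") as file:
    text = file.read()

# Replace the two-word text between <h1> and </h1>. The lookbehind and
# lookahead keep the tags themselves out of the match. A raw string avoids
# invalid-escape warnings for \w.
text = re.sub(r"(?<=<h1>)\w+ \w+(?=</h1>)", "Diluizione seriale", text)
print(text)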
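Explanation:
The regular expression (?<=<h1>)\w+ \w+(?=</h1>) matches exactly two words (runs of word characters) separated by a single space, and only when they are immediately preceded by <h1> and immediately followed by </h1>. The tags are not part of the match, so only the heading text is replaced.
Output: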
<! SOME CONTENT... > <h1>Diluizione seriale</h1> <p>Some content including "prima diluizione"...</p> <h1>Diluizione seriale</h1> <p>Some content including "seconda diluizione"...</p> <h1>Diluizione seriale</h1> <p>Some content including "terza diluizione"...</p>
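Note that the pattern above matches exactly two words, so a three-word heading such as "Prima diluizione seriale" from the question would be left unchanged. A possible variation, sketched below, swaps \w+ \w+ for [^<]+ so that any heading text up to the next tag is matched. The names H1_TEXT and replace_h1_text_regex are invented for this example, and it still assumes a bare <h1> tag with no attributes and no nested markup.

```python
import re

# Lookbehind and lookahead keep the tags out of the match; [^<]+ accepts any
# run of characters up to the next tag, so the word count no longer matters.
# Assumption: headings are written as a bare <h1> with no attributes.
H1_TEXT = re.compile(r"(?<=<h1>)[^<]+(?=</h1>)")


def replace_h1_text_regex(html, replacement):
    """Return html with the text of every bare <h1>...</h1> replaced."""
    # A function replacement inserts the string literally, so backslashes
    # in `replacement` are not treated as group references.
    return H1_TEXT.sub(lambda match: replacement, html)


sample = "<h1>Prima diluizione seriale</h1> <p>Some content...</p> <h1>Seconda diluizione</h1>"
print(replace_h1_text_regex(sample, "Diluizione seriale"))
```

For anything beyond simple, regular markup like this, parsing the HTML is generally the more robust route.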
