<p>You can write a function that takes raw HTML and strips out all of the HTML tags:</p>
<pre><code>import re

def cleanhtml(raw_html):
    # Match tags and character entities (named, decimal, or hex)
    cleanr = re.compile(r"<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});")
    cleantext = re.sub(cleanr, " ", raw_html)
    return cleantext
</code></pre>
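<p>As a quick check of what the function returns, here is a minimal sketch. The sample string is a hypothetical input, and collapsing whitespace afterwards is an optional extra step that is not part of the original function.</p>

```python
import re

def cleanhtml(raw_html):
    # Same pattern as above: tags plus named, decimal, or hex entities
    cleanr = re.compile(r"<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});")
    return re.sub(cleanr, " ", raw_html)

sample = "<p>Fish &amp; <b>chips</b></p>"  # hypothetical input
cleaned = cleanhtml(sample)

# Every tag and entity becomes a single space, so runs of spaces pile up;
# splitting and re-joining collapses them and trims the ends.
normalized = " ".join(cleaned.split())
print(normalized)
```

<p>Note that entities such as <code>&amp;amp;</code> are removed rather than decoded, so the ampersand does not appear in the result.</p>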
<p>Some alternative patterns for <code>cleanr</code>:</p>
<ul>
<li><code>cleanr = re.compile(r"<[A-Za-z/][^>]*>")</code></li>
<li><code>cleanr = re.compile(r"<[^>]*>")</code></li>
<li><code>cleanr = re.compile(r"</?\w+\s*[^>]*?/?>")</code></li>
</ul>
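<p>The patterns differ in how strictly they decide what counts as a tag. The sketch below runs the tag part of the first pattern and the three alternatives over one hypothetical sample string; the dictionary keys are illustrative labels only.</p>

```python
import re

# The tag patterns discussed above, written as raw strings
patterns = {
    "non_greedy": re.compile(r"<.*?>"),
    "tag_like": re.compile(r"<[A-Za-z/][^>]*>"),
    "any_angle": re.compile(r"<[^>]*>"),
    "word_tag": re.compile(r"</?\w+\s*[^>]*?/?>"),
}

# Hypothetical input: a stray "<" in the text plus an HTML comment
sample = "if a < b then <b>bold</b> <!-- note -->"

results = {name: p.sub("", sample) for name, p in patterns.items()}
for name, cleaned in results.items():
    print(f"{name:>10}: {cleaned!r}")
```

<p>On this input the two looser patterns treat everything from the stray <code><</code> up to the next <code>></code> as a tag and delete it, while the two stricter patterns leave both the stray <code><</code> and the comment in place.</p>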
<p>However, a better and simpler approach is to use BeautifulSoup:</p>
<pre><code>import requests
from bs4 import BeautifulSoup

def clean_with_soup(url: str) -> str:
    # Download the page, parse it, and return only its text content
    r = requests.get(url).text
    soup = BeautifulSoup(r, "html.parser")
    return soup.get_text()
</code></pre>
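<p>If the HTML is already in memory, the same idea works without a network request. This is a minimal sketch on a hypothetical sample string, assuming the <code>bs4</code> package is installed. Removing <code>script</code> and <code>style</code> elements first is an added precaution, since some BeautifulSoup versions may include their contents in the extracted text.</p>

```python
from bs4 import BeautifulSoup

# Hypothetical input with an entity and an inline script
html = "<p>Fish &amp; <b>chips</b></p><script>var x = 1;</script>"

soup = BeautifulSoup(html, "html.parser")

# Drop elements whose contents are code rather than readable text
for tag in soup(["script", "style"]):
    tag.decompose()

# separator puts a space between text fragments; strip trims each fragment
text = soup.get_text(separator=" ", strip=True)
print(text)
```

<p>Unlike the regex version, the parser decodes entities, so <code>&amp;amp;</code> comes back as a literal ampersand.</p>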