<p>Here is a simple example showing how to use <a href="https://www.crummy.com/software/BeautifulSoup/" rel="nofollow noreferrer">BeautifulSoup</a> to extract the body text of an HTML file and <a href="https://github.com/Mimino666/langdetect" rel="nofollow noreferrer">langdetect</a> to detect its language:</p>
<pre><code>from bs4 import BeautifulSoup
from langdetect import detect

# Parse the file; the "lxml" parser requires the lxml package
with open("foo.html", "rb") as f:
    soup = BeautifulSoup(f, "lxml")

# Remove script elements so their code does not end up in the text
for s in soup("script"):
    s.decompose()

body_text = soup.body.get_text()
print(detect(body_text))  # prints a language code
</code></pre>
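<p>If installing BeautifulSoup and lxml is not an option, the text-extraction step can be approximated with the standard library alone. The following is only a minimal sketch, not the method used above: it swaps in Python's built-in html.parser, and the names BodyTextExtractor and extract_body_text are hypothetical helpers introduced for illustration. It assumes reasonably well-formed HTML and also skips style elements, which the snippet above does not.</p>

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Hypothetical helper: collects text inside body, skipping script/style."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.in_body = False   # True while between the body start and end tags
        self.skip_depth = 0    # > 0 while inside a script or style element
        self.chunks = []       # text fragments collected so far

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when inside body and outside script/style
        if self.in_body and self.skip_depth == 0:
            self.chunks.append(data)


def extract_body_text(html):
    """Return the body text of an HTML string with whitespace collapsed."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


sample = (
    "<html><head><title>t</title></head><body>"
    "<p>Hello world</p><script>var x = 1;</script>"
    "<p>Second paragraph</p></body></html>"
)
print(extract_body_text(sample))  # prints: Hello world Second paragraph
```

<p>The returned string can then be passed to langdetect's detect() in the same way as body_text above. Keep in mind that very short inputs tend to give unreliable detection results.</p>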