How do I extract the text of a book from a web page?

import re
import string

import requests
from bs4 import BeautifulSoup  # missing from the original snippet

result = requests.get("https://www.hplovecraft.com/writings/texts/fiction/bws.aspx")
src = result.content
soup = BeautifulSoup(src, "lxml")
oldbook = soup.find("div", {"align": "justify"})
book = oldbook.text.replace('s%_', " ")

2 answers

User

Answer 1 · edited 2024-09-30 01:25:39

To clean up the text, you can call .get_text() with strip=True and separator='\n':

import requests
from bs4 import BeautifulSoup

result = requests.get(
    "https://www.hplovecraft.com/writings/texts/fiction/bws.aspx"
)
soup = BeautifulSoup(result.content, "lxml")

oldbook = soup.find("div", {"align": "justify"})
print(oldbook.get_text(strip=True, separator="\n"))

Prints:

“I have an exposition of sleep come upon me.”
—Shakespeare.
I have frequently wondered if the majority of mankind ever pause to reflect upon the occasionally
titanic significance of dreams, and of the obscure world to which they belong. Whilst the greater
number of our nocturnal visions are perhaps no more than faint and fantastic reflections of
our waking experiences—Freud to the contrary with his puerile symbolism—there are

...

User

Answer 2 · edited 2024-09-30 01:25:39

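The \n characters (newlines) are there because they are present in the HTML source.

The browser applies the page's styling separately and does not necessarily render that "extra space". In fact, in the sentence you mention there is no "extra whitespace" in the text itself; that part sits inside a separate HTML tag and is styled with CSS.

You can try to identify those special tags and extract the text from each of them separately, then use a regex to collapse runs of consecutive whitespace.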
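For the regex step, here is a minimal sketch of one way it could look. The helper name normalize_whitespace and the sample string are made up for illustration, and it assumes you want to keep blank lines as paragraph breaks while squashing everything else to single spaces. It only uses the standard re module, so it works on whatever string you get back from .text or .get_text().

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse whitespace runs but keep paragraph breaks (blank lines)."""
    # Normalise Windows/old-Mac line endings to plain "\n" first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Treat one or more blank lines as a paragraph boundary.
    paragraphs = re.split(r"\n\s*\n", text)
    # Inside a paragraph, any run of whitespace (spaces, tabs,
    # single newlines) becomes one space.
    cleaned = [re.sub(r"\s+", " ", p).strip() for p in paragraphs]
    # Drop paragraphs that were whitespace only and rejoin.
    return "\n\n".join(p for p in cleaned if p)


# Hypothetical sample, standing in for text scraped from the page.
sample = "I have  frequently\nwondered   if the\tmajority\n\n\n  of mankind ever pause"
print(normalize_whitespace(sample))
```

If you would rather keep every original line break, replace the split/join with a single re.sub(r"[ \t]+", " ", text) so that only spaces and tabs are collapsed.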
