How do I extract the text of a book from a web page?

Posted 2024-09-30 01:25:39


How can I extract the text of this book from the web page, keeping the extra spaces and punctuation in the text? (https://www.hplovecraft.com/writings/texts/fiction/bws.aspx)

I wrote this code:

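import requests
import string
import re
from bs4 import BeautifulSoup  # required for BeautifulSoup below

result = requests.get('https://www.hplovecraft.com/writings/texts/fiction/bws.aspx')
src = result.content
soup = BeautifulSoup(src, 'lxml')
# The story text sits in the <div align="justify"> element
oldbook = soup.find("div", {"align": "justify"})
# str.replace() looks for the literal string 's%_', not a whitespace pattern
book = oldbook.text.replace('s%_', " ")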

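But the output contains \n\n\n\n instead of the extra spaces, as well as \r\n.

Part of the output:

'\n\n\n\n\n\n\n“I have an exposition of sleep come upon me.”\n—Shakespeare.\n\n\n\n\n\r\n\r\nI have frequently wondered if the majority of mankind ever pause to reflect upon the occasionally titanic significance of dreams, and of the obscure world to which they belong'

How can I fix this?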


2 answers

To clean up the text, you can use .get_text() with strip=True and separator='\n':

import requests
from bs4 import BeautifulSoup

result = requests.get(
    "https://www.hplovecraft.com/writings/texts/fiction/bws.aspx"
)
soup = BeautifulSoup(result.content, "lxml")

oldbook = soup.find("div", {"align": "justify"})
print(oldbook.get_text(strip=True, separator="\n"))

Prints:

“I have an exposition of sleep come upon me.”
—Shakespeare.
I have frequently wondered if the majority of mankind ever pause to reflect upon the occasionally
titanic significance of dreams, and of the obscure world to which they belong. Whilst the greater
number of our nocturnal visions are perhaps no more than faint and fantastic reflections of
our waking experiences—Freud to the contrary with his puerile symbolism—there are

...

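The \n characters are there because they are present in the HTML (they are line breaks in the page source).

The browser styles the page separately and does not necessarily rely on "extra spaces". In fact, in the sentence you mention there is no extra whitespace in the text itself; the text sits inside a separate HTML tag and is styled with CSS.

You can try to identify these special tags and extract the text from each of them separately, then use a regex to remove runs of consecutive whitespace.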
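
The regex clean-up could look something like the sketch below. The function name collapse_whitespace and the choice to keep a single blank line between paragraphs are assumptions for illustration, not something the page requires:

```python
import re


def collapse_whitespace(text: str) -> str:
    """Collapse runs of whitespace, keeping at most one blank line."""
    # Normalize Windows line endings so only "\n" remains
    text = text.replace("\r\n", "\n")
    # Squeeze runs of spaces and tabs into a single space
    text = re.sub(r"[ \t]+", " ", text)
    # Drop a space that sits directly before or after a newline
    text = re.sub(r" ?\n ?", "\n", text)
    # Reduce three or more newlines to one paragraph break
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


if __name__ == "__main__":
    # Made-up input shaped like the output shown in the question
    raw = "\n\n\n\n\n\n\nAn epigraph.\nAuthor name.\n\n\n\n\n\r\n\r\nFirst   paragraph."
    print(repr(collapse_whitespace(raw)))
```

If you would rather have one line per paragraph, replace the last substitution with re.sub(r"\n+", "\n", text).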
