Using BeautifulSoup to get clean text from a text/html document

Published 2024-10-03 02:43:20


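My documents have two content types: text/xml and text/html. I want to parse the documents with BeautifulSoup and end up with a clean text version. Each document starts out as a tuple, so I have been using repr to turn it into something BeautifulSoup recognizes, then using find_all to locate the text/html part of the document by searching for div elements, like so: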

soup = BeautifulSoup(repr(msg_data))
text = soup.html.find_all("div")

Then I convert the text back to a string, save it to a variable, turn that back into a soup object, and call get_text on it, like so:

^{pr2}$

However, this changes the encoding to Unicode, like so:

u'[9:16 PM\xa0Erica: with images, \xa0\xa0and that seemed long to me anyway, 9:17     
PM\xa0me: yeah, \xa0Erica: so feel free to make it shorter, \xa0\xa0or rather, please do, 
9:18 PM\xa0nobody wants to read about that shit for 2 pages, \xa0me: :), \xa0Erica: while 
browsing their site, \xa0me: srsly, \xa0Erica: unless of course your writing is magic, 
\xa0me: My writing saves drowning puppies, \xa0\xa0Just plucks him right out and gives 
them a scratch behind the ears and some kibble, \xa0Erica: Maine is weird, \xa0me: haha]'

When I try to re-encode it as UTF-8, like so:

soup.encode('utf-8')

I am back to a NoneType.

What I want is to reach the point where I have clean text saved as a string, so that I can then find specific things in it (for example, "puppies" in the text above).

Basically, I'm going around in circles here. Can anyone help? As always, any help is much appreciated.


1 Answer
User
#1 · Posted 2024-10-03 02:43:20

The encoding isn't broken; it is exactly what it should be. '\xa0' is the Unicode no-break space (U+00A0).
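One way to check this, sketched here for Python 3 (the session in this thread is Python 2) with a shortened excerpt of the text from the question. The variable names are introduced only for illustration:

```python
import unicodedata

# Shortened excerpt of the text from the question (an assumption for brevity)
text = u'9:16 PM\xa0Erica: with images, \xa0me: My writing saves drowning puppies'

# U+00A0 is a legitimate Unicode character, not encoding damage
nbsp_name = unicodedata.name(u'\xa0')
print(nbsp_name)

# Searching the Unicode string works directly, with no re-encoding needed
has_puppies = u'puppies' in text
position = text.find(u'puppies')
print(has_puppies, position)
```

So if the goal is only to find words such as "puppies", the Unicode string can be searched as it is.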

If you want to encode this (Unicode) string as ASCII, you can tell the codec to ignore any characters it doesn't understand:

>>> x = u'[9:16 PM\xa0Erica: with images, \xa0\xa0and that seemed long to me anyway, 9:17 PM\xa0me: yeah, \xa0Erica: so feel free to make it shorter, \xa0\xa0or rather, please do,  9:18 PM\xa0nobody wants to read about that shit for 2 pages, \xa0me: :), \xa0Erica: while  browsing their site, \xa0me: srsly, \xa0Erica: unless of course your writing is magic,  \xa0me: My writing saves drowning puppies, \xa0\xa0Just plucks him right out and gives  them a scratch behind the ears and some kibble, \xa0Erica: Maine is weird, \xa0me: haha]'
>>> x.encode('ascii', 'ignore')
'[9:16 PMErica: with images, and that seemed long to me anyway, 9:17 PMme: yeah, Erica: so feel free to make it shorter, or rather, please do,  9:18 PMnobody wants to read about that shit for 2 pages, me: :), Erica: while  browsing their site, me: srsly, Erica: unless of course your writing is magic,  me: My writing saves drowning puppies, Just plucks him right out and gives  them a scratch behind the ears and some kibble, Erica: Maine is weird, me: haha]'
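Note that 'ignore' drops the no-break spaces entirely, which is why words run together in the output above ("PMErica"). If the word boundaries matter, one option is to swap each no-break space for an ordinary space before encoding. A minimal sketch, assuming Python 3 and a shortened excerpt of the string; `clean` and `ascii_bytes` are names introduced here:

```python
# Shortened excerpt of the string from the session above (an assumption for brevity)
x = u'[9:16 PM\xa0Erica: with images, \xa0\xa0and that seemed long to me anyway]'

# Replace each no-break space (U+00A0) with a regular ASCII space
clean = x.replace(u'\xa0', u' ')

# Every remaining character is ASCII, so a strict encode succeeds
ascii_bytes = clean.encode('ascii')
print(ascii_bytes)
```

This only handles U+00A0; text containing other non-ASCII characters would still need 'ignore' or 'replace' as the error handler.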

If you have the time, you should watch Ned Batchelder's recent talk, Pragmatic Unicode. It will make everything simple and clear!
