Getting clean text from a text/html document with BeautifulSoup

u'[9:16 PM\xa0Erica: with images, \xa0\xa0and that seemed long to me anyway, 9:17 PM\xa0me: yeah, \xa0Erica: so feel free to make it shorter, \xa0\xa0or rather, please do, 9:18 PM\xa0nobody wants to read about that shit for 2 pages, \xa0me: :), \xa0Erica: while browsing their site, \xa0me: srsly, \xa0Erica: unless of course your writing is magic, \xa0me: My writing saves drowning puppies, \xa0\xa0Just plucks him right out and gives them a scratch behind the ears and some kibble, \xa0Erica: Maine is weird, \xa0me: haha]'

1 answer

User

#1 · Posted 2024-10-03 02:43:20

The encoding isn't broken; it is exactly what it should be. '\xa0' is the Unicode code point for a non-breaking space.
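A quick way to confirm what the character is, sketched here under the assumption of Python 3 (the transcript in this answer uses Python 2 syntax), is to ask the standard `unicodedata` module for its name:

```python
import unicodedata

nbsp = u'\xa0'

# Look up the official Unicode name of the code point U+00A0.
name = unicodedata.name(nbsp)
print(name)

# The character counts as whitespace, even though it is not ASCII.
is_space = nbsp.isspace()
print(is_space)
```

So the string contains legitimate whitespace that simply has no ASCII equivalent.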

If you want to encode this (Unicode) string as ASCII, you can tell the codec to ignore any characters it can't represent:

>>> x = u'[9:16 PM\xa0Erica: with images, \xa0\xa0and that seemed long to me anyway, 9:17 PM\xa0me: yeah, \xa0Erica: so feel free to make it shorter, \xa0\xa0or rather, please do,  9:18 PM\xa0nobody wants to read about that shit for 2 pages, \xa0me: :), \xa0Erica: while  browsing their site, \xa0me: srsly, \xa0Erica: unless of course your writing is magic,  \xa0me: My writing saves drowning puppies, \xa0\xa0Just plucks him right out and gives  them a scratch behind the ears and some kibble, \xa0Erica: Maine is weird, \xa0me: haha]'
>>> x.encode('ascii', 'ignore')
'[9:16 PMErica: with images, and that seemed long to me anyway, 9:17 PMme: yeah, Erica: so feel free to make it shorter, or rather, please do,  9:18 PMnobody wants to read about that shit for 2 pages, me: :), Erica: while  browsing their site, me: srsly, Erica: unless of course your writing is magic,  me: My writing saves drowning puppies, Just plucks him right out and gives  them a scratch behind the ears and some kibble, Erica: Maine is weird, me: haha]'
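Note that 'ignore' drops the non-breaking spaces entirely, which is why the output above runs words together ("PMErica"). If that is not what you want, one possible alternative is to replace each '\xa0' with an ordinary space before encoding. The sketch below assumes Python 3, and `clean_text` is a hypothetical helper name, not part of BeautifulSoup or the original answer:

```python
def clean_text(text):
    """Hypothetical helper: swap non-breaking spaces for plain spaces,
    then collapse any run of whitespace into a single space."""
    with_plain_spaces = text.replace(u'\xa0', u' ')
    return u' '.join(with_plain_spaces.split())

# Shortened sample taken from the string in the question.
sample = u'[9:16 PM\xa0Erica: with images, \xa0\xa0and that seemed long to me anyway]'

cleaned = clean_text(sample)
print(cleaned)

# The result is now pure ASCII, so a strict encode no longer raises.
ascii_bytes = cleaned.encode('ascii')
```

This keeps the word boundaries intact at the cost of losing the distinction between breaking and non-breaking spaces.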

If you have the time, you should watch Ned Batchelder's recent talk, Pragmatic Unicode. It makes all of this simple and clear!
