Cannot correctly convert HTML from a website to text

import urllib2
from bs4 import BeautifulSoup

useragent = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11'}
request = urllib2.Request('SomeURL',None,useragent)
myreq = urllib2.urlopen(request, timeout = 5)
html = myreq.read()

#get paragraphs
soup = BeautifulSoup(html)
textList = soup.find_all('p')
mytext = ""
for par in textList:
    if len(str(par))<2000:
        print par
        mytext +=" " + str(par)

print "the text is ", mytext

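3 Answers

User

#1 · edited 2024-09-25 00:34:24

I believe the problem is your system's output encoding: the console cannot display the encoded characters correctly because they fall outside the range of characters it can show.

BeautifulSoup 4 is designed to fully support HTML entities.

Note the odd behavior of these commands: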

>python temp.py
...
ed a blackhead. The plural of ÔÇ£comedoÔÇØ is comedomesÔÇØ.</p>
...

>python temp.py > temp.txt

>cat temp.txt
....
ed a blackhead. The plural of "comedo" is comedomes".</p> <p> </p> <p>Blackheads is an open and wide
....

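I suggest writing the output to a text file, or using a different terminal (or changing the terminal settings) so that a wider range of characters is supported.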
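
The garbled quotes in the console output above are consistent with UTF-8 bytes being displayed through a legacy console code page. The sketch below reproduces the effect; code page 850 is an assumption, chosen only because it yields the same characters, since the answer does not name the console encoding. It is written for Python 3.

```python
# Sketch: reproduce the garbled quotes seen in the console output.
# Assumption: the console used code page 850 (not stated in the answer).

text = u"The plural of \u201ccomedo\u201d"  # curly quotes, as on the web page

# The bytes a script writes when it prints UTF-8 encoded text.
utf8_bytes = text.encode("utf-8")

# A cp850 console treats every UTF-8 byte as one character of its own,
# so each curly quote (three bytes) turns into three unrelated characters.
garbled = utf8_bytes.decode("cp850")

# Redirecting to a file leaves the bytes untouched; a UTF-8 aware viewer
# decodes them back to the original text.
restored = utf8_bytes.decode("utf-8")

print(restored == text)   # the bytes themselves were never damaged
print(garbled == text)    # only the console's interpretation differs
```

The bytes are identical in both cases; only the decoding applied by the viewer changes, which is why redirecting to a file appears to fix the text.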

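User

#2 · edited 2024-09-25 00:34:24

Since this is Python 2, the urllib2.urlopen().read() call returns a byte string, most likely encoded in UTF-8. You can check the HTTP headers to see whether the encoding is specified. I am assuming UTF-8.

You never decode this external representation before you start working with the content, and that will only end in tears. General rule: decode input immediately, and encode only on output.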
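
The header check and the "decode early, encode late" rule could be sketched as follows. This is a minimal Python 3 illustration, not the answerer's code: `charset_from_content_type` and `decode_body` are hypothetical helpers, the sample bytes stand in for `myreq.read()`, and falling back to UTF-8 is the same assumption the answer makes.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type header value, else `default`."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default


def decode_body(raw_bytes, content_type):
    """Decode a response body right after reading it (hypothetical helper)."""
    charset = charset_from_content_type(content_type)
    # "replace" substitutes U+FFFD for malformed bytes instead of raising.
    return raw_bytes.decode(charset, "replace")


# Stand-ins for myreq.read() and the response's Content-Type header.
raw = u"\u201ccomedo\u201d".encode("utf-8")
text = decode_body(raw, "text/html; charset=UTF-8")  # text from here on

# Encode only at the output boundary, for example when writing a file.
out_bytes = text.encode("utf-8")
```

In the Python 2 code of this thread, the header value would presumably come from `myreq.info().getheader('Content-Type')`.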

Here is your code with only two modifications:

import urllib2
from BeautifulSoup import BeautifulSoup

useragent = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11'}
request = urllib2.Request('SomeURL',None,useragent)
myreq = urllib2.urlopen(request, timeout = 5)
html = unicode(myreq.read(), "UTF-8")

#get paragraphs
soup = BeautifulSoup(html)
textList = soup.findAll('p')
mytext = ""
for par in textList:
    if len(str(par))<2000:
        print par
        mytext +=" " +  str(par)

print "the text is ", mytext

All I did was add Unicode decoding of html and use soup.findAll() rather than soup.find_all().

User

#3 · edited 2024-09-25 00:34:24

Here is a solution based on the other answers and my own research.

import html2text
import urllib2
import re
import nltk

useragent = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11'}
request = urllib2.Request('SomeURL',None,useragent)
myreq = urllib2.urlopen(request, timeout = 5)
html = myreq.read()
html = html.decode("utf-8")


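# Grab the contents of every plain <p>...</p> pair
textList = re.findall(r'(?<=<p>).*?(?=</p>)',html, re.MULTILINE|re.DOTALL)
mytext = u""
for par in textList:
    # par is already unicode; str(par) would raise UnicodeEncodeError on non-ASCII text
    if len(par)<2000:
        par = re.sub('<[^<]+?>', '', par)  # strip nested tags
        mytext += u" " + html2text.html2text(par)

print "the text is ", mytext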
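
To see what the two regular expressions in this answer do, the sketch below runs them on a small hand-written sample instead of a downloaded page. The sample strings are made up for illustration, html2text is left out so that only the standard library is needed, and the code is written for Python 3.

```python
import re

# Hypothetical sample standing in for the decoded page.
sample = (u"<p>The plural of \u201ccomedo\u201d is <b>comedones</b>.</p>\n"
          u"<p>Second\nparagraph.</p>")

# Step 1: capture everything between a literal <p> and the next </p>.
# DOTALL lets "." span the line break inside the second paragraph.
paragraphs = re.findall(r'(?<=<p>).*?(?=</p>)', sample, re.MULTILINE | re.DOTALL)

# Step 2: drop any tags nested inside each paragraph.
plain = [re.sub(r'<[^<]+?>', '', par) for par in paragraphs]

# Limitation: the lookbehind needs the exact text "<p>", so an opening
# tag that carries attributes is skipped.
missed = re.findall(r'(?<=<p>).*?(?=</p>)', u'<p class="intro">Hi</p>', re.DOTALL)

print(len(paragraphs), len(missed))
```

Because paragraphs whose opening tag has attributes are silently skipped, a real HTML parser is likely the safer choice for arbitrary pages.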
