I am extracting text from a web page and writing the content to a file:
import requests
from inscriptis import get_text
from bs4 import BeautifulSoup
page = requests.get(r'http://www3.asiainsurancereview.com//News/View-NewsLetter-Article/id/42528/Type/eDaily/Technology-First-round-of-the-pre-launch-of-the-Ydentity-ICO-starts-today')
soup = BeautifulSoup(page.text, 'lxml')
html = soup.find(class_='article-wrap')
text = get_text(html.text)
print(text)
articleFile = open('test.txt', 'w')
articleFile.write(text)
articleFile.close()
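It prints the content to the screen just fine, but writing the content to the file raises a Unicode error (the same UnicodeEncodeError shown in the last traceback below).
After printing the content to the console, I noticed that the article contains some trademark (™) symbols. So I tried this: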
text=text.encode("utf-8")
But I still get an error, albeit a different one:
TypeError Traceback (most recent call last)
<ipython-input-68-3f30355ab29c> in <module>()
12 text=text.encode("utf-8")
13
---> 14 articleFile.write(text)
15
16 articleFile.close()
TypeError: write() argument must be str, not bytes
I also tried the following, but it did not work:
text = get_text(html.text)
from unidecode import unidecode
def remove_non_ascii(text):
    return unidecode(str(text, encoding = "utf-8"))
articleFile = open('test.txt', 'w')
articleFile.write(remove_non_ascii(text))
articleFile.close()
It gives the following error:
TypeError Traceback (most recent call last)
<ipython-input-70-ff7e6a098308> in <module>()
20
21
---> 22 articleFile.write(remove_non_ascii(text))
23
24 articleFile.close()
<ipython-input-70-ff7e6a098308> in remove_non_ascii(text)
9 from unidecode import unidecode
10 def remove_non_ascii(text):
---> 11 return unidecode(str(text, encoding = "utf-8"))
12
13 articleFile = open('test.txt', 'w')
TypeError: decoding str is not supported
I also tried:
if isinstance(text, str):
    text = text
else:
    text = text.decode(encoding)
    decoded = True
articleFile.write(text)
articleFile.close()
This gives the original error (so, basically, it does nothing):
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-71-f0c817f013af> in <module>()
20
21
---> 22 articleFile.write(text)
23
24 articleFile.close()
UnicodeEncodeError: 'cp950' codec can't encode character '\u2122' in position 51: illegal multibyte sequence
How do I fix this?
I found that the solution is to open the output file in binary mode and then encode the Unicode text before writing it.
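The answer's original snippet is not preserved here, so the following is only a minimal sketch of that approach. It assumes the text is a `str` containing the trademark sign and that UTF-8 is the desired file encoding; the sample string is hypothetical and stands in for the scraped article.

```python
# Hypothetical sample standing in for the scraped article text;
# "\u2122" is the trademark sign that triggered the error.
text = "Sample\u2122 article text"

# Open the file in binary mode ("wb") and write UTF-8 encoded bytes.
# A binary-mode file accepts bytes, so no implicit encoding takes place.
with open("test.txt", "wb") as article_file:
    article_file.write(text.encode("utf-8"))
```

Using `with` closes the file automatically, even if the write fails.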
Apparently, Python cannot write encoded text (that is, bytes) to a file unless the file was opened in binary mode.
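An alternative that may be worth considering, and that is not part of the answer above, is to keep the file in text mode and pass an explicit `encoding` to `open()`. Without that argument Python falls back to the locale's default encoding, which the traceback shows to be cp950 on this machine. The sketch below assumes UTF-8 is acceptable for the output file and uses a hypothetical sample string.

```python
# Hypothetical sample standing in for the scraped article text.
text = "Sample\u2122 article text"

# Reproduce the failure in isolation: cp950 has no mapping for U+2122.
try:
    text.encode("cp950")
    cp950_ok = True
except UnicodeEncodeError:
    cp950_ok = False

# Text mode with an explicit encoding: write() takes a str and the
# file object encodes it as UTF-8 instead of the locale default.
with open("test.txt", "w", encoding="utf-8") as article_file:
    article_file.write(text)
```

With this variant the `text.encode("utf-8")` call is unnecessary, because the file object does the encoding.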