How do I fix a trademark symbol that causes a "cp950" error when writing to a file in Python 3?

Posted 2024-05-05 05:24:38

I am extracting text from a web page and writing the content to a file:

import requests
from inscriptis import get_text
from bs4 import BeautifulSoup

page = requests.get(r'http://www3.asiainsurancereview.com//News/View-NewsLetter-Article/id/42528/Type/eDaily/Technology-First-round-of-the-pre-launch-of-the-Ydentity-ICO-starts-today')
soup = BeautifulSoup(page.text, 'lxml')
html = soup.find(class_='article-wrap')
text = get_text(html.text)
print(text)

articleFile = open('test.txt', 'w')
articleFile.write(text)
articleFile.close()

It prints the content to the screen just fine, but writing the content to the file raises a Unicode error:

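UnicodeEncodeError: 'cp950' codec can't encode character '\u2122' in position 51: illegal multibyte sequence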

After printing the content to the console, I noticed that the article contains some trademark (™) symbols. So I tried this:

text=text.encode("utf-8")

But I still get an error, although a different one:

TypeError                                 Traceback (most recent call last)
<ipython-input-68-3f30355ab29c> in <module>()
     12 text=text.encode("utf-8")
     13 
---> 14 articleFile.write(text)
     15 
     16 articleFile.close()

TypeError: write() argument must be str, not bytes

I also tried the following, without success:

text = get_text(html.text)

from unidecode import unidecode
def remove_non_ascii(text):
    return unidecode(str(text, encoding = "utf-8"))

articleFile = open('test.txt', 'w')
articleFile.write(remove_non_ascii(text))
articleFile.close()

It gives the following error:

TypeError                                 Traceback (most recent call last)
<ipython-input-70-ff7e6a098308> in <module>()
     20 
     21 
---> 22 articleFile.write(remove_non_ascii(text))
     23 
     24 articleFile.close()

<ipython-input-70-ff7e6a098308> in remove_non_ascii(text)
      9 from unidecode import unidecode
     10 def remove_non_ascii(text):
---> 11     return unidecode(str(text, encoding = "utf-8"))
     12 
     13 articleFile = open('test.txt', 'w')

TypeError: decoding str is not supported

I also tried this:

if isinstance(text, str):
    text = text
else:
    text = text.decode(encoding)
    decoded = True

articleFile.write(text)
articleFile.close()

This produces the original error (so, basically, it does nothing):

UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-71-f0c817f013af> in <module>()
     20 
     21 
---> 22 articleFile.write(text)
     23 
     24 articleFile.close()

UnicodeEncodeError: 'cp950' codec can't encode character '\u2122' in position 51: illegal multibyte sequence

How can I fix this?


1 answer
User
#1 · Posted 2024-05-05 05:24:38

I found that the solution is to open the file for writing in binary mode and then encode the Unicode text to bytes:

articleFile = open('test.txt', 'wb')
text=text.encode("utf-8")
articleFile.write(text)
articleFile.close()

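Apparently, a file opened in text mode accepts only str and encodes it itself using the file's encoding, which here defaults to cp950. Text that has already been encoded to bytes can only be written to a file opened in binary mode.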
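An alternative that may be simpler, sketched here under assumptions rather than taken from the original post: keep the file in text mode and pass `encoding="utf-8"` to `open()`, so Python encodes the `str` as UTF-8 instead of falling back to the locale's cp950. The sample string and the temporary file path below are made up for illustration; in the question, `text` would come from the scraped article.

```python
import os
import tempfile

# Hypothetical sample text containing the trademark sign (U+2122).
text = "Ydentity\u2122 ICO"

# Hypothetical output path inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "test.txt")

# Text mode with an explicit encoding: write() still takes str,
# and the file object encodes it as UTF-8 on the way out.
with open(path, "w", encoding="utf-8") as article_file:
    article_file.write(text)

# Read it back with the same encoding to confirm the round trip.
with open(path, "r", encoding="utf-8") as article_file:
    round_trip = article_file.read()

# For comparison: cp950 has no mapping for U+2122, which is what
# the original UnicodeEncodeError reported.
try:
    text.encode("cp950")
    cp950_ok = True
except UnicodeEncodeError:
    cp950_ok = False
```

With this approach there is no manual `text.encode(...)` step, and the `with` block closes the file even if the write fails. Anything that reads the file later needs to open it as UTF-8 as well.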
