How to convert Unicode to Latin characters in Python

Posted 2024-10-02 20:38:27


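I want to use Python to convert Unicode escapes to their Latin characters. I have a large text file of tweets that contain Unicode escapes, and I only want to replace four of them, such as \u00f6, \u015f, ... I just want to see each tweet as it was originally tweeted (in the original language). Below is the code that actually collects the tweets and saves them to a text file. I added

#!/usr/bin/python
# -*- coding: ISO 8859-9 -*-

but I get this error: "Non-ASCII character '\xf6' in file turkey_code.py on line 21, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details".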

class listener(StreamListener):

    def on_data(self,data):
        try:
            dirty = open('turkeyjson28.txt','a')
            encode = data.encode('ascii','ignore')
            dirty.write(encode)
            good = tweet.decode("utf-8") """
            better = good.decode("utf=8").replace(u"\u00f6", "ö")
            print better    
            dirty.write('\n')
            dirty.close()
            tweet = data.split(',"text":"')[1].split('","source')[0]
            #saveThis = str(time.time())+'::'+tweet
            saveFile = open('turkey_clean28.txt','a')
            saveFile.write(better)
            saveFile.write('\n')
            saveFile.write('\n')
            saveFile.close()
            return True
        except BaseException, e:
            print 'failed ondata,',str(e)
            time.sleep(5)
    def on_error(self, status):
        print status

auth = OAuthHandler(ckey,csecret)
auth.set_access_token(atoken,asecret)
twitterStream = Stream(auth,listener())
twitterStream.filter(track = ["turkey"])

1 answer
User
#1 · Posted 2024-10-02 20:38:27
better = good.decode("utf-8").replace(u"\u00f6", "ö")

Change it to

[replacement snippet not preserved in the source]

Alternatively, you need these as the first lines of your file:

#!/usr/bin/python
# -*- coding: utf8 -*-

In general, I would avoid relying on the encoding declaration and would instead write the character you want as its Unicode escape.
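As a rough illustration of the escape-based approach, here is a minimal sketch. It is written for Python 3 (the question uses Python 2), and it assumes the saved tweets contain literal backslash-u sequences, as raw JSON does. The names `restore_escapes` and `REPLACEMENTS` are hypothetical, and only the two code points named in the question are listed.

```python
# ASCII-only source file: no coding declaration is needed because the
# non-ASCII characters are written as escapes, not as literal glyphs.
# Keys are the literal six-character sequences found in the raw text
# (the backslash is doubled so Python does not interpret the escape);
# values are the actual characters.
REPLACEMENTS = {
    "\\u00f6": "\u00f6",  # o with diaeresis
    "\\u015f": "\u015f",  # s with cedilla
}

def restore_escapes(text):
    """Replace the listed literal escape sequences with their characters."""
    for literal, char in REPLACEMENTS.items():
        text = text.replace(literal, char)
    return text

sample = "g\\u00f6z ve ku\\u015f"  # what the raw file would contain
print(restore_escapes(sample))
```

Any sequence not listed in `REPLACEMENTS` is left untouched, which matches the stated goal of replacing only a handful of characters.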

I often write a pair of helper functions to assist with this:

# Python 2 helpers: `str` is a byte string, `unicode` is text.
def decode(byte_str, encodings=("utf8", "cp1252", "latin1")):
    # Already text: nothing to decode.
    if isinstance(byte_str, unicode):
        return byte_str
    # Try the strict encodings first; latin1 accepts any byte sequence,
    # so it must come last or the others are never tried.
    for enc in encodings:
        try:
            return byte_str.decode(enc)
        except UnicodeDecodeError:
            continue

def encode(unicode_txt, encodings=("latin1", "utf8", "cp1252")):
    # Byte strings are decoded first so they can be re-encoded.
    if isinstance(unicode_txt, str):
        unicode_txt = decode(unicode_txt)
    for enc in encodings:
        try:
            return unicode_txt.encode(enc)
        except UnicodeEncodeError:  # failed encodes raise UnicodeEncodeError
            continue

# then you can just do something like
decode(good).replace(u"\u00f6", decode(u"\u00f6", encodings=["utf8", "latin1", "ascii"]))
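
For readers on Python 3, the same fallback idea can be sketched as follows. This is not the answerer's code: `decode_with_fallback` is a hypothetical name, the sample tweet is made up, and parsing the line with `json.loads` (instead of splitting on `"text":"`) is an assumption about what the stream delivers, namely one JSON object per tweet.

```python
import json

def decode_with_fallback(data, encodings=("utf-8", "cp1252", "latin-1")):
    """Return text for `data`, trying each encoding in order."""
    if isinstance(data, str):  # already text in Python 3
        return data
    for enc in encodings:
        try:
            return data.decode(enc)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the given encodings could decode the data")

# Made-up raw tweet: the bytes contain literal backslash-u sequences,
# which the JSON parser turns into the real characters.
raw = b'{"text": "T\\u00fcrkiye g\\u00f6r\\u00fc\\u015f", "source": "web"}'
tweet = json.loads(decode_with_fallback(raw))
print(tweet["text"])
```

When writing the result to a file in Python 3, opening it with `open(path, "a", encoding="utf-8")` keeps the characters intact without any manual replacement.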
