Python: reading a CSV file of Unicode replacements

1 answer

User

Answer 1 · Posted 2024-09-28 01:29:25

I don't think the problem you describe actually exists:

Ok, now self.mapping[example][0] = u'\xe0'. So yeah, that's the character that I need to replace...but the string that I need to call the replace_UTF8() function on looks like u'\u00e0'.

These are just two different representations of the same string. You can test this yourself:

>>> u'\xe0' == u'\u00e0'
True
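
The same check can be sketched in Python 3, where every str is already Unicode and the u prefix is optional. The variable names below are made up for illustration. The sketch also shows the case that does differ: a CSV cell holding the six literal characters backslash, u, 0, 0, e, 0, which only becomes the single character after the escape is decoded.

```python
# Python 3 sketch (assumption: the original code is Python 2).
a = '\xe0'
b = '\u00e0'

# Two spellings of the same one-character string, code point U+00E0.
same = (a == b)
code_point = ord(a)

# A raw string keeps the backslash escape as six separate characters,
# which is what a CSV reader hands back if the file contains "\u00e0".
raw = r'\u00e0'
raw_length = len(raw)

# Decoding the escape sequence turns it into the real character.
decoded = raw.encode('ascii').decode('unicode_escape')
```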

The actual problem is that you aren't replacing anything. In this code:

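def replace_UTF8(self, string):
    for old, new in self.mapping:
        print new
        string.replace(old, new)
    return string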

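You're just calling string.replace over and over. Each call returns a new string but does nothing to string itself. (It can't do anything to string itself; strings are immutable.) You need to assign the result back: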

def replace_UTF8(self, string):
    for old, new in self.mapping:
        print new
        string = string.replace(old, new)
    return string
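
As a minimal Python 3 sketch of the same point (replace_all and the sample values are hypothetical names, not from the question), a discarded replace call leaves the original untouched, while rebinding the name inside the loop accumulates every substitution:

```python
def replace_all(text, mapping):
    """Apply each (old, new) pair in order, rebinding text each time."""
    for old, new in mapping:
        # str.replace returns a new string, so keep the result.
        text = text.replace(old, new)
    return text

original = 'voil\u00e0'

# The return value is thrown away here, so original does not change.
original.replace('\u00e0', 'a')
unchanged = original

fixed = replace_all(original, [('\u00e0', 'a')])
```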

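However, if string really is a UTF-8-encoded str, as the function name implies, this still won't work. When you encode u'\u00e0' as UTF-8, you get the two bytes '\xc3\xa0'. There is no \u00e0 in there to replace. So what you really need to do is decode, replace, and then re-encode. Like this: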

def replace_UTF8(self, string):
    u = string.decode('utf-8')
    for old, new in self.mapping:
        print new
        u = u.replace(old, new)
    return u.encode('utf-8')
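
A rough Python 3 equivalent, assuming the input arrives as a bytes object holding UTF-8 data (replace_utf8 and the sample bytes are illustrative only). It also shows why searching the encoded bytes for the character cannot work: U+00E0 is stored as the two bytes 0xC3 0xA0.

```python
def replace_utf8(data, mapping):
    """Decode UTF-8 bytes, apply (old, new) replacements, re-encode."""
    text = data.decode('utf-8')
    for old, new in mapping:
        text = text.replace(old, new)
    return text.encode('utf-8')

# U+00E0 becomes two bytes once encoded.
encoded = 'voil\u00e0'.encode('utf-8')

result = replace_utf8(encoded, [('\u00e0', 'a')])
```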

Or, better, keep everything as unicode rather than encoded str throughout your program, so you don't have to worry about any of this.

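Finally, this is a very slow and complicated way to do replacements when strings (both str and unicode) have a built-in translate method that does exactly what you want.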

Instead of building the table as a list of pairs of Unicode strings, build it as a dict mapping ordinals to ordinals:

mapping = {}
for row in reader:
    mapping[ord(row[0].decode("unicode_escape"))] = ord(row[1])

Now the whole thing is a one-liner, even with your encodings being a mess:

def replace_UTF8(self, string):
    return string.decode('utf-8').translate(self.mapping).encode('utf-8')
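
For readers on Python 3, the same idea might look like the sketch below. The CSV layout (an escaped code point in column 0, its replacement in column 1), the build_table helper, and the sample data are all assumptions for illustration. One deliberate difference from the code above: the table maps each ordinal to a replacement string rather than to another ordinal, which str.translate also accepts and which allows multi-character replacements.

```python
import csv
import io

def build_table(csv_text):
    """Build a str.translate table from rows like: \\u00e0,a"""
    table = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        # Turn the literal escape text into the real character.
        old = row[0].encode('ascii').decode('unicode_escape')
        table[ord(old)] = row[1]
    return table

# Hypothetical mapping file contents: two rows, one per replacement.
sample_csv = '\\u00e0,a\n\\u00e9,e\n'
table = build_table(sample_csv)

# One call replaces every mapped character in a single pass.
cleaned = 'voil\u00e0 caf\u00e9'.translate(table)
```

If the data arrives as UTF-8 bytes rather than str, decode it first and encode the translated result, as in the one-liner above.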
