Parsing Unicode in tweets with Python

399491624029274112,Kyle aka K-LO,I unlocked 2 Xbox Live achievements in WWE 2K14! http://t.co/wRIxZTjYWg,False,0,Raptr,,,,,2013,11,10,11,0,0,0,0,1,0,0,0,0,0
399491626584014848,Dots Group LLC,GeekWire Radio: Amazon vs. author Xbox One first take and favorite iPad apps - GeekWire http://t.co/jbbryoHpHe,False,0,IFTTT,,,,,2013,11,10,11,0,0,0,0,1,0,0,0,0,2
399491630149169152,BETTINGGENIUS!,RT @xJohn69: Sergio Ramos giveaway!; XBOX + PS3; ; -RT; -Follow me and @NeillWagers; -S/Os appreciated; ; Goodluck http://t.co/D997faGSB5,False,0,Twitter for iPad,,,,,2013,11,10,11,0,1,1,0,1,0,0,0,0,2
399491635735953408,Princess of TV,Toy Story of Terror is amaze balls. Thanks Xbox for the free NowTV #disneyweekend,False,0,Twitter for iPhone,,,,,2013,11,10,11,0,2,0,0,1,0,0,0,0,2
399491654136369152,Sam Hambre,'9 Things You Should Know Before Buying a PlayStation 4' http://t.co/Q3Ma1R83cF,False,0,Buffer,,,,,2013,11,10,11,0,7,0,1,0,0,0,0,0,0
399491655780167680,Rhi ✌,@Escape2theMoon that's done what? im not on rn obvs i dont even have access to an xbox :c ?,False,0,web,399490703761223680,Escape2theMoon,1404625770,,2013,11,10,11,0,7,0,0,1,0,0,0,0,0

UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-114-fd9b136abd74> in <module>()
----> 1 for record in data:
      2     tweets = tweets + ' ' + record[2].encode('utf-8', 'replace')

UnicodeEncodeError: 'ascii' codec can't encode character u'\u270c' in position 23: ordinal not in range(128)

1 Answer

Anonymous user

#1 · Posted 2024-05-18 14:30:05

The problem is csv.reader: in Python 2 it tries to convert the unicode input back to ASCII. A note from the csv docs:

This version of the csv module doesn’t support Unicode input. Also, there are currently some issues regarding ASCII NUL characters. Accordingly, all input should be UTF-8 or printable ASCII to be safe; see the examples in section Examples.
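To see the failure in isolation, here is a minimal sketch that encodes the start of the last sample row with the ASCII codec. It only mimics the implicit conversion described above and is not a trace of what csv does internally; the variable names are made up for illustration. The character ✌ (U+270C) sits at index 23 of that row, which matches the position reported in the traceback.

```python
# -*- coding: utf-8 -*-
# Start of the last sample row (truncated here for brevity).
line = u"399491655780167680,Rhi \u270c,@Escape2theMoon that's done what?"

try:
    # Roughly what an implicit unicode-to-ASCII conversion amounts to.
    line.encode('ascii')
except UnicodeEncodeError as exc:
    error = exc

print(error.start)  # 23, the index of the offending character in the line

# Encoding explicitly as UTF-8 succeeds and round-trips without loss.
utf8_bytes = line.encode('utf-8')
print(utf8_bytes.decode('utf-8') == line)
```

This is also why the 'replace' argument in the question's loop body has no effect: the error is raised while iterating over the reader, before record[2].encode(...) is ever reached.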

As the docs suggest, you can use this recipe from their examples section:

import csv

def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    # csv.py doesn't do Unicode; encode temporarily as UTF-8:
    csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
                            dialect=dialect, **kwargs)
    for row in csv_reader:
        # decode UTF-8 back to Unicode, cell by cell:
        yield [unicode(cell, 'utf-8') for cell in row]

def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')

With the unicode_csv_reader helper, your code could look like this (slightly modified to use a closure and join instead of a loop):

[code snippet not preserved in the source]
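As a rough sketch of the same idea, and not the answerer's original code: on Python 3 the csv module works on text strings directly, so the Python 2 helper above is not needed. The example below assumes the tweet text is the third column, as in the sample data; join_tweet_text and the in-memory sample are hypothetical names introduced for illustration.

```python
import csv
import io

# Two rows from the sample data, standing in for the real file.
# For a file on disk you would instead open it as text, for example
# open('tweets.csv', encoding='utf-8', newline='') with a hypothetical file name.
sample = (
    "399491654136369152,Sam Hambre,'9 Things You Should Know Before Buying "
    "a PlayStation 4' http://t.co/Q3Ma1R83cF,False,0,Buffer,,,,,"
    "2013,11,10,11,0,7,0,1,0,0,0,0,0,0\n"
    "399491655780167680,Rhi \u270c,@Escape2theMoon that's done what? im not on rn "
    "obvs i dont even have access to an xbox :c ?,False,0,web,399490703761223680,"
    "Escape2theMoon,1404625770,,2013,11,10,11,0,7,0,0,1,0,0,0,0,0\n"
)


def join_tweet_text(lines, text_column=2):
    """Parse CSV lines and join one column of every row with single spaces."""
    reader = csv.reader(lines)
    return ' '.join(row[text_column] for row in reader)


with io.StringIO(sample) as data:
    tweets = join_tweet_text(data)

print(tweets)
```

Using ' '.join over a generator avoids building a new string on every iteration, which is what repeated tweets = tweets + ' ' + ... does.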
