Pandas won't load my data, mysterious CSV encoding

Posted 2024-09-29 00:12:44


I'm trying to load a dataset into pandas and can't get past step one. I'm new to this, so apologies if this is obvious; I've searched previous topics and haven't found an answer. The data is mostly Chinese characters, which may be the source of the problem.

The .csv is very large and can be found here: http://weiboscope.jmsc.hku.hk/datazip/ (I'm working with the week 1 file).

In the code below I've laid out the three decoding approaches I tried, including an attempt to check what encoding is actually used:

import pandas
import chardet
import os


# This is what I tried to start
data = pandas.read_csv('week1.csv', encoding="utf-8")

# Spits out: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9a in position 69: invalid start byte

# Code to check the encoding -- this spits out ascii
n_bytes = min(32, os.path.getsize('week1.csv'))  # renamed from "bytes" to avoid shadowing the builtin
raw = open('week1.csv', 'rb').read(n_bytes)
chardet.detect(raw)

# So I tried this! It also fails, which isn't that surprising, since I don't
# know how you'd represent Chinese characters in ASCII anyway
data = pandas.read_csv('week1.csv', encoding="ascii")

# Spits out: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0: ordinal not in range(128)

# For god knows what reason this lets me load the data into pandas, but it's
# definitely not the correct encoding, because the first 5 rows print as
# gibberish instead of Chinese characters
data = pandas.read_csv('week1.csv', encoding="latin1")

Any help would be greatly appreciated!

Edit: the answer @Kristof provided below does work, as did a program one of my colleagues put together yesterday.

I'd also like to add, for future searchers, that this is the 2012 Weibo open data.


Tags: csv, data, import, pandas, read
1 Answer

#1 · Posted 2024-09-29 00:12:44

There seems to be something off with the input file; it has encoding errors throughout.

One thing you can do is read the CSV file as a binary file, decode the byte string, and replace the faulty characters.

Example (source of the chunked-read code):

import codecs
from functools import partial

in_filename = 'week1.csv'
out_filename = 'repaired.csv'

chunksize = 100 * 1024 * 1024  # read 100 MB at a time

# Decode as UTF-8, replacing undecodable bytes with the U+FFFD replacement
# character. An incremental decoder avoids mangling a multi-byte character
# that happens to be split across a chunk boundary.
decoder = codecs.getincrementaldecoder('utf_8')(errors='replace')

with open(in_filename, 'rb') as in_file:
    with open(out_filename, 'w', encoding='utf-8') as out_file:
        for byte_fragment in iter(partial(in_file.read, chunksize), b''):
            out_file.write(decoder.decode(byte_fragment))
        out_file.write(decoder.decode(b'', final=True))  # flush trailing bytes

# Now read the repaired file into a dataframe
import pandas as pd
df = pd.read_csv(out_filename)

df.shape
>> (4790108, 11)

df.head()

(sample output image)
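
A follow-up note for readers on newer pandas (separate from the answer above): since pandas 1.3, read_csv accepts an encoding_errors parameter, so the same replace-the-bad-bytes behaviour is available without writing an intermediate file:

import pandas as pd

# encoding_errors requires pandas >= 1.3; 'replace' substitutes U+FFFD
# for undecodable bytes instead of raising UnicodeDecodeError
df = pd.read_csv('week1.csv', encoding='utf-8', encoding_errors='replace')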
