Problem statement
How can I use pandas to read a CSV file in chunks, with overlap between consecutive chunks?
For example, suppose the list indexes represents the index of some dataframe I wish to read in:
indexes = [0,1,2,3,4,5,6,7,8,9]
read_csv(filename, chunksize=None):
indexes = [0,1,2,3,4,5,6,7,8,9] # read in all indexes at once
read_csv(filename, chunksize=5):
indexes = [[0,1,2,3,4], [5,6,7,8,9]] # iteratively read in mutually exclusive index sets
read_csv(filename, chunksize=5, overlap=2):
indexes = [[0,1,2,3,4], [3,4,5,6,7], [6,7,8,9]] # iteratively read in indexes sets with overlap size 2
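The chunk boundaries wanted here can be pinned down without reading any file. Note that overlap is not an actual read_csv parameter; the call above only describes the desired behaviour. As a minimal sketch, assuming the last chunk is simply allowed to be shorter (overlapping_index_sets is a hypothetical helper name, not part of pandas):

```python
def overlapping_index_sets(indexes, chunksize, overlap):
    """Split `indexes` into chunks of `chunksize` that share `overlap` items."""
    if not 0 <= overlap < chunksize:
        raise ValueError("overlap must satisfy 0 <= overlap < chunksize")
    step = chunksize - overlap  # distance between consecutive chunk starts
    chunks = []
    for start in range(0, len(indexes), step):
        chunks.append(indexes[start:start + chunksize])
        # Stop once a chunk reaches the end, so no trailing chunk
        # consisting only of already-seen items is produced.
        if start + chunksize >= len(indexes):
            break
    return chunks


print(overlapping_index_sets([*range(10)], chunksize=5, overlap=2))
# [[0, 1, 2, 3, 4], [3, 4, 5, 6, 7], [6, 7, 8, 9]]
```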
Working solution
I have a hacky solution using skiprows and nrows, but it gets progressively slower as it reads through the CSV file.
import pandas as pd

indexes = [*range(10)]
chunksize = 5
overlap_count = 2
row_count = len(indexes)  # this I can work out before reading the whole file in rather cheaply
# (start, stop) bounds per chunk; final chunk here may be janky, assume it works for now (it's more about the logic)
chunked_indexes = [(i, i + chunksize) for i in range(0, row_count, chunksize - overlap_count)]
for start, stop in chunked_indexes:
    # skip the data rows before this chunk; file line 0 is the header, so it is kept
    skiprows = [*range(1, start + 1)]
    chunk = pd.read_csv(filename, skiprows=skiprows, nrows=chunksize)
Does anyone have any insight into this problem, or an improved solution?
I think you should pass a number to skiprows rather than a list. Try:
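The code that accompanied this suggestion is not preserved here. As a rough sketch only, and not necessarily what the answerer had in mind, passing an integer to skiprows could look like the first function below. It reads the header once and then passes header=None with names=..., because an integer skiprows also skips the header line. Each call still has to scan past the skipped lines, so the second function shows an alternative that reads the file a single time with chunksize and carries the overlapping rows forward. Both function names are hypothetical, and both assume a CSV with a single header row.

```python
import pandas as pd


def read_overlapping_by_offset(filename, chunksize, overlap):
    """Yield overlapping chunks by re-reading with an integer skiprows."""
    if not 0 <= overlap < chunksize:
        raise ValueError("overlap must satisfy 0 <= overlap < chunksize")
    columns = pd.read_csv(filename, nrows=0).columns  # header only
    step = chunksize - overlap
    start = 0
    while True:
        try:
            chunk = pd.read_csv(
                filename,
                header=None,
                names=columns,
                skiprows=start + 1,  # header line plus `start` data rows
                nrows=chunksize,
            )
        except pd.errors.EmptyDataError:
            break  # nothing left to parse
        # Stop when there are no rows, or only rows already yielded.
        if chunk.empty or (start > 0 and len(chunk) <= overlap):
            break
        chunk.index = pd.RangeIndex(start, start + len(chunk))
        yield chunk
        if len(chunk) < chunksize:
            break  # short chunk means the end of the file was reached
        start += step


def read_overlapping_single_pass(filename, chunksize, overlap):
    """Yield overlapping chunks while reading the file only once."""
    if not 0 <= overlap < chunksize:
        raise ValueError("overlap must satisfy 0 <= overlap < chunksize")
    step = chunksize - overlap
    buffer = None
    yielded = False
    for block in pd.read_csv(filename, chunksize=step):
        buffer = block if buffer is None else pd.concat([buffer, block])
        while len(buffer) >= chunksize:
            yield buffer.iloc[:chunksize]
            yielded = True
            buffer = buffer.iloc[step:]  # keep the last `overlap` rows
    # Emit the remainder only if it contains rows not yet yielded.
    if buffer is not None and len(buffer) and (len(buffer) > overlap or not yielded):
        yield buffer
```

Usage would be along the lines of for chunk in read_overlapping_single_pass(filename, chunksize=5, overlap=2): ..., with each chunk keeping the original row numbers as its index.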