Pandas: chunked read_csv with overlap between chunks

Posted 2024-04-19 16:13:31


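Problem statement

How can I read a CSV file in chunks with pandas so that consecutive chunks overlap?

For example, suppose the list indexes represents the index of some DataFrame I want to read in.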

indexes = [0,1,2,3,4,5,6,7,8,9]

read_csv(filename, chunksize=None):

indexes = [0,1,2,3,4,5,6,7,8,9]  # read in all indexes at once

read_csv(filename, chunksize=5):

indexes = [[0,1,2,3,4], [5,6,7,8,9]]  # iteratively read in mutually exclusive index sets

read_csv(filename, chunksize=5, overlap=2):  (the behavior I want; overlap is not an actual read_csv parameter)

indexes = [[0,1,2,3,4], [3,4,5,6,7], [6,7,8,9]]  # iteratively read in index sets with overlap size 2
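The desired windowing can be written down independently of pandas. The following is a minimal sketch, not a pandas feature: `overlapping_windows` is a hypothetical helper name, and it assumes rows are addressed by position 0..row_count-1 and that `overlap` is smaller than `chunksize`.

```python
def overlapping_windows(row_count, chunksize, overlap):
    """Return lists of row positions, each up to `chunksize` long.

    Consecutive lists share `overlap` positions. Hypothetical helper for
    illustration only.
    """
    if not 0 <= overlap < chunksize:
        raise ValueError("need 0 <= overlap < chunksize")
    if row_count <= 0:
        return []
    step = chunksize - overlap  # distance between consecutive window starts
    # A window is only started if it contributes at least one new row.
    last_start_bound = max(row_count - overlap, 1)
    return [
        list(range(start, min(start + chunksize, row_count)))
        for start in range(0, last_start_bound, step)
    ]


windows = overlapping_windows(10, chunksize=5, overlap=2)
print(windows)  # [[0, 1, 2, 3, 4], [3, 4, 5, 6, 7], [6, 7, 8, 9]]
```

With `overlap=0` the same function reduces to the mutually exclusive chunks of the `chunksize=5` case.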

Working solution

I have a hacky solution that uses skiprows and nrows, but it gets progressively slower as it moves through the CSV file.

indexes = [*range(10)]
chunksize = 5
overlap_count = 2
row_count = len(indexes)  # this I can work out before reading the whole file in rather cheaply

chunked_indexes = [(i, i + chunksize) for i in range(0, row_count, chunksize - overlap_count)]  # final chunk here may be janky, assume it works for now (it's more about the logic)
for chunk in chunked_indexes:
    skiprows = [*range(chunk[0], chunk[1])]
    pd.read_csv(filename, skiprows=skiprows, nrows=chunksize)
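The comment on `row_count` above says it can be worked out cheaply before reading the file. One way this might be done is to count lines without parsing any fields. This is only a sketch: `count_data_rows` is a hypothetical helper, and it assumes one record per line (no quoted fields containing newlines) and a single header row.

```python
import os
import tempfile


def count_data_rows(path, has_header=True):
    """Count data rows by counting lines, without parsing any fields.

    Hypothetical helper. Assumes every record occupies exactly one line,
    which does not hold if quoted fields contain embedded newlines.
    """
    with open(path, "rb") as f:
        n_lines = sum(1 for _ in f)
    return max(n_lines - 1, 0) if has_header else n_lines


# Small demonstration on a throwaway file with a header and 10 data rows.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("idx,value\n")
    tmp.writelines(f"{i},{i * 10}\n" for i in range(10))
print(count_data_rows(tmp.name))  # 10
os.remove(tmp.name)
```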

Does anyone have any insight into this problem, or a better solution?


1 answer

Answer #1 · Posted 2024-04-19 16:13:31

I think you should pass a number to skiprows rather than a list. Try:

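import pandas as pd

# row_count, chunksize and overlap_count are defined as in the question.
# Each chunk starts (chunksize - overlap_count) rows after the previous one.
for i in range(0, row_count - overlap_count, chunksize - overlap_count):
    print(pd.read_csv('test.csv',
                      skiprows=i + 1,  # +1 because the first row is the header
                      nrows=chunksize,
                      index_col=0,  # the index is the first column in how I saved my CSV
                      header=None)  # the header row is skipped, so you may need to read it separately first
            .index)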
Int64Index([0, 1, 2, 3, 4], dtype='int64', name=0)
Int64Index([3, 4, 5, 6, 7], dtype='int64', name=0)
Int64Index([6, 7, 8, 9], dtype='int64', name=0)
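The loop above calls read_csv once per chunk, so every call has to scan past the skipped rows again, which is probably why the approach slows down further into the file. A possible single-pass alternative is sketched below: it uses the standard `chunksize` iterator of `pd.read_csv` and keeps a small buffer so that the last rows of one window are reused in the next. `read_csv_overlapping` is a hypothetical helper, not a pandas function, and it assumes `overlap` is smaller than `chunksize`.

```python
import io

import pandas as pd


def read_csv_overlapping(source, chunksize, overlap, **read_csv_kwargs):
    """Yield DataFrames of up to `chunksize` rows from one pass over the CSV.

    Consecutive frames share `overlap` rows. Hypothetical helper; extra
    keyword arguments are forwarded to pd.read_csv.
    """
    if not 0 <= overlap < chunksize:
        raise ValueError("need 0 <= overlap < chunksize")
    step = chunksize - overlap  # number of new rows per window
    buffer = None
    emitted = False
    # Read the file once, in pieces of `step` rows, and assemble windows.
    for piece in pd.read_csv(source, chunksize=step, **read_csv_kwargs):
        buffer = piece if buffer is None else pd.concat([buffer, piece])
        while len(buffer) >= chunksize:
            yield buffer.iloc[:chunksize]
            emitted = True
            buffer = buffer.iloc[step:]  # keep the overlapping tail
    # Emit a final, shorter window only if it contains rows not yet seen.
    if buffer is not None and (len(buffer) > overlap or (not emitted and len(buffer) > 0)):
        yield buffer


if __name__ == "__main__":
    # In-memory CSV with a header and an index column holding 0..9.
    csv_text = "idx,value\n" + "".join(f"{i},{i * 10}\n" for i in range(10))
    for window in read_csv_overlapping(io.StringIO(csv_text), chunksize=5, overlap=2, index_col=0):
        print(window.index.tolist())
    # [0, 1, 2, 3, 4]
    # [3, 4, 5, 6, 7]
    # [6, 7, 8, 9]
```

Because the header is parsed once by the iterator, this version keeps the column names and does not need `row_count` in advance. The trade-off is a `pd.concat` per piece, which copies at most one window's worth of rows.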
