Pandas: reading a CSV in chunks with read_csv, with overlap between chunks

indexes = [*range(10)]
chunksize = 5
overlap_count = 2
# this I can work out before reading the whole file in rather cheaply
row_count = len(indexes)
# final chunk here may be janky, assume it works for now (it's more about the logic)
chunked_indexes = [(i, i + chunksize) for i in range(0, row_count, chunksize - overlap_count)]
for chunk in chunked_indexes:
    skiprows = [*range(chunk[0], chunk[1])]
    pd.read_csv(filename, skiprows=skiprows, nrows=chunksize)

1 Answer

User

#1 · Posted 2024-04-19 16:13:31

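I think you should pass a number to skiprows rather than a list. A list tells read_csv to skip exactly those line numbers, whereas an integer skips that many lines from the top of the file. Try: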

import pandas as pd

# chunksize, overlap_count and row_count are defined as in the question
for i in range(0, row_count - overlap_count, chunksize - overlap_count):
    print(pd.read_csv('test.csv',
                      skiprows=i + 1,  # +1 because the first line of the file is the header
                      nrows=chunksize,
                      index_col=0,     # this is how I saved my csv (index in the first column)
                      header=None)     # you may need to read the header separately beforehand
            .index)
Int64Index([0, 1, 2, 3, 4], dtype='int64', name=0)
Int64Index([3, 4, 5, 6, 7], dtype='int64', name=0)
Int64Index([6, 7, 8, 9], dtype='int64', name=0)
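
The loop above can be wrapped into a small reusable generator. This is only a sketch under some assumptions: the helper name `read_overlapping_chunks`, the column names `idx` and `value`, and the temporary demo file are all made up for illustration. Unlike the snippet above, it keeps the header row by passing `skiprows` a range that starts at line 1 instead of an integer, so each chunk comes back with its column names.

```python
import os
import tempfile

import pandas as pd


def read_overlapping_chunks(path, row_count, chunksize, overlap_count):
    """Yield DataFrames of up to `chunksize` rows, each sharing
    `overlap_count` rows with the previous one (hypothetical helper)."""
    step = chunksize - overlap_count
    if step <= 0:
        raise ValueError("chunksize must be larger than overlap_count")
    for start in range(0, max(row_count - overlap_count, 1), step):
        # Skip file lines 1..start (data rows already consumed) but keep
        # line 0, so pandas still parses the header for every chunk.
        yield pd.read_csv(path, skiprows=range(1, start + 1), nrows=chunksize)


# Demo on a throwaway 10-row file (column names are illustrative only).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.csv")
    pd.DataFrame({"idx": range(10), "value": range(100, 110)}).to_csv(path, index=False)
    chunks = list(read_overlapping_chunks(path, row_count=10, chunksize=5, overlap_count=2))

for chunk in chunks:
    print(chunk["idx"].tolist())
```

Note that every call to read_csv here starts again from the top of the file, so for very large files the skipped lines are re-scanned on each chunk. That is probably acceptable for moderate sizes, but worth measuring before relying on it.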
