How to pull X previous data items into a CSV row



I have a very large CSV file. For every row, I need to append earlier data for the same name (the Name column), taken only from dates before that row's date (the Date column). The easiest way to show the problem is a detailed example that resembles my real data but is heavily scaled down:
<pre><code>Datatitle,Date,Name,Score,Parameter
data,01/09/13,george,219,dataa,text
data,01/09/13,fred,219,datab,text
data,01/09/13,tom,219,datac,text
data,02/09/13,george,229,datad,text
data,02/09/13,fred,239,datae,text
data,02/09/13,tom,219,dataf,text
data,03/09/13,george,209,datag,text
data,03/09/13,fred,217,datah,text
data,03/09/13,tom,213,datai,text
data,04/09/13,george,219,dataj,text
data,04/09/13,fred,212,datak,text
data,04/09/13,tom,222,datal,text
data,05/09/13,george,319,datam,text
data,05/09/13,fred,225,datan,text
data,05/09/13,tom,220,datao,text
data,06/09/13,george,202,datap,text
data,06/09/13,fred,226,dataq,text
data,06/09/13,tom,223,datar,text
data,06/09/13,george,219,dataae,text
</code></pre>
For the first three rows of this CSV there is no previous data. So if we want to pull columns 3 and 4 (Score and Parameter, counting from 0) for george's last 3 occurrences on dates before the current one, then for george in row 1 the result should be:

^{pr2}$

Once previous data becomes available, however, we want to produce a CSV like this:
<pre><code>Datatitle,Date,Name,Score,Parameter,LTscore,LTParameter,LTscore+1,LTParameter+1,LTscore+2,LTParameter+3,
data,01/09/13,george,219,dataa,text,x,y,x,y,x,y
data,01/09/13,fred,219,datab,text,x,y,x,y,x,y
data,01/09/13,tom,219,datac,text,x,y,x,y,x,y
data,02/09/13,george,229,datad,text,219,dataa,x,y,x,y
data,02/09/13,fred,239,datae,text,219,datab,x,y,x,y
data,02/09/13,tom,219,dataf,text,219,datac,x,y,x,y
data,03/09/13,george,209,datag,text,229,datad,219,dataa,x,y
data,03/09/13,fred,217,datah,text,239,datae,219,datab,x,y
data,03/09/13,tom,213,datai,text,219,dataf,219,datac,x,y
data,04/09/13,george,219,dataj,text,209,datag,229,datad,219,dataa
data,04/09/13,fred,212,datak,text,217,datah,239,datae,219,datab
data,04/09/13,tom,222,datal,text,213,datai,219,dataf,219,datac
data,05/09/13,george,319,datam,text,219,dataj,209,datag,229,datad
data,05/09/13,fred,225,datan,text,212,datak,217,datah,239,datae
data,05/09/13,tom,220,datao,text,222,datal,213,datai,219,dataf
data,06/09/13,george,202,datap,text,319,datam,219,dataj,209,datag
data,06/09/13,fred,226,dataq,text,225,datan,212,datak,217,datah
data,06/09/13,tom,223,datar,text,220,datao,222,datal,213,datai
data,06/09/13,george,219,datas,text,319,datam,219,dataj,209,datag
</code></pre>
You will notice that george appears twice on 06/09/13, and both times the same string <code>319,datam,219,dataj,209,datag</code> is appended to his row. His second occurrence gets the same string because the george row 3 lines above it falls on the same date. (This just emphasises "on a date before the current date".)

As the column headers show, we collect the last 3 scores and their 3 associated parameters and append them to each row. Note that this is a very simplified example. In reality each date contains a couple of thousand rows, and the names follow no pattern in the real data, so we should not expect fred, tom and george to sit next to each other in a repeating pattern.

If anyone can help me work out how best (most efficiently) to achieve this, I would be very grateful. If anything is unclear, please let me know and I will add more detail. Any constructive comments are appreciated. Thank you.


1 Answer

Anonymous, 1 day ago


Your file appears to be in date order. If we take the last entry for each name on each date and push it onto a fixed-size deque kept per name, writing out each row as we go, that should do the trick:
<pre><code>import csv
from collections import deque, defaultdict
from itertools import chain, islice, groupby
from operator import itemgetter

# defaultdict whose first access of a key will create a deque of size 3
# defaulting to [['x', 'y'], ['x', 'y'], ['x', 'y']]
# Since deques are efficient at head/tail manipulation, an insert at
# the start is cheap, and because the size is fixed it will cause extra
# elements to "fall off" the end...
names_previous = defaultdict(lambda: deque([['x', 'y']] * 3, 3))

# Python 3: open CSV files in text mode with newline='' as the csv module expects.
with open('sample.csv', newline='') as fin, open('sample_new.csv', 'w', newline='') as fout:
    csvin = csv.reader(fin)
    csvout = csv.writer(fout)
    # Use groupby to detect changes in the date column. Since the data is always
    # ascending, the items within the same date are contiguous in the data. We use
    # this to identify the rows within the *same* date.
    # date=date we're looking at, rows=an iterable of rows that are in that date...
    # islice(csvin, 1, None) skips the header row.
    for date, rows in groupby(islice(csvin, 1, None), itemgetter(1)):
        # After we've processed entries in this date, we need to know what items of data
        # should be considered for the names we've seen inside this date. Currently the
        # data is taken from the last occurring row for the name.
        to_add = {}
        for row in rows:
            # Output the row present in the file with a *flattened* version of the extra
            # data (previous items) that we wish to apply. eg:
            # [['x', 'y'], ['x', 'y'], ['x', 'y']] becomes ['x', 'y', 'x', 'y', 'x', 'y']
            # So we're easily able to store 3 pairs of data, but flatten it into one long
            # list of 6 items...
            # If the name (row[2]) doesn't exist yet, then by trying to do this, defaultdict
            # will automatically create the default value as above.
            csvout.writerow(row + list(chain.from_iterable(names_previous[row[2]])))
            # Here, we store for the name any additional data that should be included for
            # the name on the next date group. In this instance we store the information
            # seen for the last occurrence of that name in this date. eg: If we've seen it
            # more than once, then we only include data from the last occurrence.
            # NB: If you wanted to include more than one item of data for the name, then
            # you could use a deque again by building it within this date group.
            to_add[row[2]] = row[3:5]
        for key, val in to_add.items():
            # We've finished the date, so before processing the next one, update the
            # previous data for the names. In this case, we push a single item of data to
            # the front of the deque. If we were storing multiple items in the date loop,
            # then we could .extendleft() instead to insert > 1 set of data from above.
            names_previous[key].appendleft(val)
</code></pre>
This keeps only the names and their last 3 values in memory while it runs. You will probably want to adjust it to write a correct new header row, rather than just skipping the header on input.
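One way to handle the header, as the answer suggests, is sketched below. It wraps the same groupby/deque approach in a function that reads the input header and writes an extended one. The function name `append_previous`, its parameters `n_prev` and `filler`, and the generated column names are assumptions made for illustration (the names are modelled on the headers in the question), and the sketch still relies on the input being sorted by date.

```python
import csv
import io
from collections import defaultdict, deque
from itertools import chain, groupby
from operator import itemgetter


def append_previous(fin, fout, n_prev=3, filler=("x", "y")):
    """Copy CSV rows from fin to fout, appending each name's previous values.

    Assumes rows are in ascending date order, with the date at index 1,
    the name at index 2 and the score/parameter pair at indexes 3-4.
    """
    reader = csv.reader(fin)
    writer = csv.writer(fout)

    # Read the existing header and extend it instead of discarding it.
    header = next(reader)
    extra = []
    for i in range(n_prev):
        suffix = "" if i == 0 else "+%d" % i
        # Hypothetical column names; rename to suit the real data.
        extra += ["LTscore" + suffix, "LTParameter" + suffix]
    writer.writerow(header + extra)

    # One fixed-size deque per name, pre-filled with placeholder pairs.
    previous = defaultdict(lambda: deque([list(filler)] * n_prev, n_prev))
    for _date, rows in groupby(reader, itemgetter(1)):
        latest = {}
        for row in rows:
            writer.writerow(row + list(chain.from_iterable(previous[row[2]])))
            latest[row[2]] = row[3:5]  # last occurrence on this date wins
        # Only after the whole date is written do its values become "previous".
        for name, values in latest.items():
            previous[name].appendleft(values)


# Small in-memory demonstration using rows shaped like the question's data.
sample = """Datatitle,Date,Name,Score,Parameter
data,01/09/13,george,219,dataa,text
data,02/09/13,george,229,datad,text
data,03/09/13,george,209,datag,text
data,03/09/13,george,219,datah,text
"""
out = io.StringIO()
append_previous(io.StringIO(sample), out)
result = out.getvalue().splitlines()
```

For real files, the same function could be called with two handles opened with `newline=''`, for example `open('sample.csv', newline='')` and `open('sample_new.csv', 'w', newline='')`. Because values are pushed onto the deques only after a date group is finished, both george rows on the last date in the demonstration receive the same appended values.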
