How to pull X previous data entries into a CSV row

Published 2024-09-27 07:22:33


I have a very large CSV file, and for each row I need to append previous data for that row's name (column 2), taken only from dates before the current row's date. I think the easiest way to present the problem is with a detailed example that mirrors my real data, but much smaller:

Datatitle,Date,Name,Score,Parameter
data,01/09/13,george,219,dataa,text
data,01/09/13,fred,219,datab,text
data,01/09/13,tom,219,datac,text
data,02/09/13,george,229,datad,text
data,02/09/13,fred,239,datae,text
data,02/09/13,tom,219,dataf,text
data,03/09/13,george,209,datag,text
data,03/09/13,fred,217,datah,text
data,03/09/13,tom,213,datai,text
data,04/09/13,george,219,dataj,text
data,04/09/13,fred,212,datak,text
data,04/09/13,tom,222,datal,text
data,05/09/13,george,319,datam,text
data,05/09/13,fred,225,datan,text
data,05/09/13,tom,220,datao,text
data,06/09/13,george,202,datap,text
data,06/09/13,fred,226,dataq,text
data,06/09/13,tom,223,datar,text
data,06/09/13,george,219,dataae,text

So for the first three rows of this CSV there is no previous data. So if we say we want columns 3 and 4 from the last 3 occurrences of george (row 1) before the current date, it would be:

^{pr2}$

However, once previous data becomes available, we want to produce a CSV like:

Datatitle,Date,Name,Score,Parameter,LTscore,LTParameter,LTscore+1,LTParameter+1,LTscore+2,LTParameter+2
data,01/09/13,george,219,dataa,text,x,y,x,y,x,y
data,01/09/13,fred,219,datab,text,x,y,x,y,x,y
data,01/09/13,tom,219,datac,text,x,y,x,y,x,y
data,02/09/13,george,229,datad,text,219,dataa,x,y,x,y
data,02/09/13,fred,239,datae,text,219,datab,x,y,x,y
data,02/09/13,tom,219,dataf,text,219,datac,x,y,x,y
data,03/09/13,george,209,datag,text,229,datad,219,dataa,x,y
data,03/09/13,fred,217,datah,text,239,datae,219,datab,x,y
data,03/09/13,tom,213,datai,text,219,dataf,219,datac,x,y
data,04/09/13,george,219,dataj,text,209,datag,229,datad,219,dataa
data,04/09/13,fred,212,datak,text,217,datah,239,datae,219,datab
data,04/09/13,tom,222,datal,text,213,datai,219,dataf,219,datac
data,05/09/13,george,319,datam,text,219,dataj,209,datag,229,datad
data,05/09/13,fred,225,datan,text,212,datak,217,datah,239,datae
data,05/09/13,tom,220,datao,text,222,datal,213,datai,219,dataf
data,06/09/13,george,202,datap,text,319,datam,219,dataj,209,datag
data,06/09/13,fred,226,dataq,text,225,datan,212,datak,217,datah
data,06/09/13,tom,223,datar,text,220,datao,222,datal,213,datai
data,06/09/13,george,219,datas,text,319,datam,219,dataj,209,datag

You will notice that george appears twice on 06/09/13, and both times he has the same string 319,datam,219,dataj,209,datag appended to his row. The second occurrence of george gets the same string because the george row 3 lines above it falls on the same date. (This just highlights "dates before the current date.")

As you can see from the column headers, we collect the last 3 scores and their 3 associated parameters and append them to each row. Please note this is a very simplified example. In reality each date will contain several thousand rows, and in the real data there is no pattern to the names, so we would not expect to see fred, tom and george sitting next to each other in a repeating pattern. If anyone can help me work out how best (most efficiently) to achieve this I would be very grateful. If anything is unclear, let me know and I will add more detail. Any constructive comments appreciated. Thank you.
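For comparison only, and separate from the answers below: if pandas is an option, the same "previous three dates per name" lookup can be sketched with `drop_duplicates`, `groupby`/`shift` and a merge. This is an assumption about tooling, not the asker's code; the mini dataset and column names simply mirror the sample header.

```python
import pandas as pd
from io import StringIO

# Made-up miniature version of the data in the question.
csv_text = """Date,Name,Score,Parameter
01/09/13,george,219,dataa
01/09/13,fred,219,datab
02/09/13,george,229,datad
02/09/13,fred,239,datae
03/09/13,george,209,datag
03/09/13,george,219,datah
"""
df = pd.read_csv(StringIO(csv_text))

# Keep one row per (Name, Date) -- the last occurrence, matching the rule
# that a name repeated within one date contributes only its final row.
last = df.drop_duplicates(["Name", "Date"], keep="last")

# Shift the per-name history back by 1..3 dates, then merge it onto every
# original row so same-date duplicates receive identical history.
hist = last[["Name", "Date"]].copy()
for i, tag in enumerate(["LTscore", "LTscore+1", "LTscore+2"], start=1):
    hist[tag] = last.groupby("Name")["Score"].shift(i)
    hist[tag.replace("score", "Parameter")] = (
        last.groupby("Name")["Parameter"].shift(i)
    )
out = df.merge(hist, on=["Name", "Date"], how="left")
# Missing history becomes the 'x'/'y' placeholders from the question.
out = out.fillna({c: "x" if "score" in c else "y" for c in hist.columns[2:]})
```

Both 03/09/13 george rows end up with the same appended history, as the expected output requires.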


3 Answers

Your file appears to be in date order. If we take the last entry for each name within each date, add it to a fixed-size deque for that name, and write out each row as we go, then that should do it:

import csv
from collections import deque, defaultdict
from itertools import chain, islice, groupby
from operator import itemgetter

# defaultdict whose first access of a key will create a deque of size 3
# defaulting to [['x', 'y'], ['x', 'y'], ['x', 'y']]
# Since deques are efficient at head/tail manipulation, an insert at the
# start is cheap, and when the size is fixed it will cause extra
# elements to "fall off" the end...
names_previous = defaultdict(lambda: deque([['x', 'y']] * 3, 3))
with open('sample.csv', 'rb') as fin, open('sample_new.csv', 'wb') as fout:
    csvin = csv.reader(fin)
    csvout = csv.writer(fout)
    # Use groupby to detect changes in the date column. Since the data is always
    # ascending, the items within the same date are contiguous in the data. We use
    # this to identify the rows within the *same* date.
    # date=date we're looking at, rows=an iterable of rows that are in that date...
    for date, rows in groupby(islice(csvin, 1, None), itemgetter(1)):
        # After we've processed entries in this date, we need to know what items of data should
        # be considered for the names we've seen inside this date. Currently the data
        # is taken from the last occurring row for the name.
        to_add = {}
        for row in rows:
            # Output the row present in the file with a *flattened* version of the
            # extra data (previous items) that we wish to apply. eg:
            # [['x', 'y'], ['x', 'y'], ['x', 'y']] becomes ['x', 'y', 'x', 'y', 'x', 'y']
            # So we're easily able to store 3 pairs of data, but flatten it into one
            # long list of 6 items...
            # If the name (row[2]) doesn't exist yet, then by trying to do this, defaultdict
            # will automatically create the default key as above.
            csvout.writerow(row + list(chain.from_iterable(names_previous[row[2]])))
            # Here, we store for the name any additional data that should be included for the name
            # on the next date group. In this instance we store the information seen for the last
            # occurrence of that name in this date. eg: If we've seen it more than once, then
            # we only include data from the last occurrence. 
            # NB: If you wanted to include more than one item of data for the name, then you could
            # utilise a deque again by building it within this date group
            to_add[row[2]] = row[3:5]            
        for key, val in to_add.iteritems():
            # We've finished the date, so before processing the next one, update the
            # previous data for the names. Here we push a single item of data to the
            # front of the deque. If we were storing multiple items in the date loop,
            # we could .extendleft() instead to insert > 1 set of data from above.
            names_previous[key].appendleft(val)

This keeps only the names and their last 3 values in memory while it runs.

You may want to adjust it to include/write a proper header row rather than just skipping the input header.
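The fixed-size deque behaviour this answer leans on can be seen in isolation. A minimal sketch (the score/parameter values are taken from the sample data):

```python
from collections import deque
from itertools import chain

# A deque with maxlen=3: pushing onto one end silently drops items off the
# other end, which is exactly the "last three scores" window needed here.
history = deque([['x', 'y']] * 3, 3)
history.appendleft(['219', 'dataa'])   # after date 01/09/13
history.appendleft(['229', 'datad'])   # after date 02/09/13

# Flatten the pairs into the 6 extra CSV columns, newest pair first.
flat = list(chain.from_iterable(history))
print(flat)
```

After two pushes, one `'x', 'y'` placeholder pair still survives at the tail, matching the partial-history rows in the expected output.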

My two cents:

- Python 2.7.5.
- I used a defaultdict to hold the previous rows for each name.
- I used bounded-length deques to hold the previous rows, because I like the FIFO behaviour of a full deque. It makes it easy to reason about: just keep stuffing things in.
- I used operator.itemgetter() for indexing and slicing, because it just reads better.
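As a quick illustration of how those itemgetter helpers behave (the row here is a made-up example in the shape the answer processes: 6 original columns plus one appended score/parameter pair):

```python
from operator import itemgetter

# Hypothetical row: 6 input columns followed by appended history.
row = ['data', '02/09/13', 'george', '229', 'datad', 'text', '219', 'dataa']

name = itemgetter(2)
date = itemgetter(1)
sixplus = itemgetter(slice(6, None))   # everything past the 6 input columns

print(name(row), date(row), sixplus(row))
```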

from collections import deque, defaultdict
import csv
from functools import partial
from operator import itemgetter

# use a 3 item deque to hold the 
# previous three rows for each name
deck3 = partial(deque, maxlen = 3)
data = defaultdict(deck3)


name = itemgetter(2)
date = itemgetter(1)
sixplus = itemgetter(slice(6,None))

fields = ['Datatitle', 'Date', 'Name', 'Score', 'Parameter',
          'LTscore', 'LTParameter', 'LTscore+1', 'LTParameter+1',
          'LTscore+2', 'LTParameter+2']
with open('data.txt') as infile, open('processed.txt', 'wb') as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    writer.writerow(fields)
    # comment out the next line if the data file does not have a header row
    reader.next()
    for row in reader:
        default = deque(['x', 'y', 'x', 'y', 'x', 'y'], maxlen = 6)
        try:
            previous_row = data[name(row)][-1]
            previous_date = date(previous_row)
        except IndexError:
            previous_date = None
        if  previous_date == date(row):
            # use the xtra stuff from last time
            row.extend(sixplus(previous_row))
            # discard the previous row because
            # there is a new row with the same date
            data[name(row)].pop()
        else:
            # add columns 3 and 4 from each previous row
            for deck in data[name(row)]:
                # adding new items to a full deque causes
                # items to drop off the other end
                default.appendleft(deck[4])
                default.appendleft(deck[3])
            row.extend(default)
        writer.writerow(row)
        data[name(row)].append(row)

After thinking this solution over with a glass of port, I realised it is too complicated, which tends to happen when I try to get fancy. Not too sure about the etiquette, so I'll leave it up; it does have one possible advantage of keeping the previous 3 rows for each name.

Here is a solution using slices and a regular dictionary. It only keeps the previously processed row. Much simpler. I kept the itemgetters, again for readability.

^{pr2}$

I have found, for similar kinds of processing, that accumulating rows and writing them out in chunks rather than individually can improve performance quite a bit. Also, if possible, reading the whole data file at once helps.
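A minimal sketch of that batching idea, assuming `writerows` on accumulated chunks (the rows and chunk size here are made up; it writes to an in-memory buffer just to keep the example self-contained):

```python
import csv
import io

# Dummy rows; in practice these would be the processed CSV rows.
rows = [[str(i), 'name%d' % i, str(i * 2)] for i in range(10)]

buf = io.StringIO()
writer = csv.writer(buf)

CHUNK = 4  # tune to taste; larger chunks mean fewer write calls
for start in range(0, len(rows), CHUNK):
    writer.writerows(rows[start:start + CHUNK])

n_lines = buf.getvalue().count('\r\n')  # csv's default line terminator
```

For real files, writing to a file object opened with `newline=''` (Python 3) works the same way.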

Below is a code example demonstrating what you are looking for, using the sample data attached to your question. I named my input file input.csv and read/write in the working directory; output.csv goes to the same folder. I used comments in the code to explain: previous records are stored in a dictionary, looked up by name, with a list of scores per name; current-date records are stored in a new buffer dictionary and folded into the main dictionary each time the input date changes. Let me know if you have any questions; the code is a bit rough, just a quick example. The [:6] slice gives the 6 most recent list items for the current name (the previous three score/parameter pairs).

import csv

myInput = open('input.csv','rb')
myOutput = open('output.csv','wb')
myFields = ['Datatitle','Date','Name','Score','Parameter','Text',
            'LTscore','LTParameter','LTscore+1','LTParameter+1',
            'LTscore+2','LTParameter+2']
inCsv = csv.DictReader(myInput,myFields)
outCsv = csv.writer(myOutput)
outCsv.writerow(myFields) # Write header row

previous_dict = dict() # store scores from previous dates
new_dict = dict() # buffer for records on current-date only

def add_new():
    # merge new_dict into previous_dict
    global new_dict, previous_dict
    for k in new_dict:
        if k not in previous_dict:
            previous_dict[k] = list()
        # put new items first
        previous_dict[k] = new_dict[k] + previous_dict[k]
    new_dict = dict() # reset buffer

old_date = '00/00/00' # start with bogus *oldest* date string
inCsv.next() # skip header row
for row in inCsv:
    myTitle = row['Datatitle']
    myDate = row['Date']
    myName = row['Name']
    myScore = row['Score']
    myParameter = row['Parameter']
    myText = row['Text']
    if old_date != myDate:
        add_new() # store new_dict buffer with previous data
        old_date = myDate
    if myName not in new_dict:
        new_dict[myName] = []
    # put new scores first
    new_dict[myName] = [myScore,myParameter] + new_dict[myName]
    if myName not in previous_dict:
        previous_dict[myName] = []
    outCsv.writerow([myTitle,myDate,myName,myScore,myParameter,myText] \
                     + previous_dict[myName][:6])
# end loop for each row

myInput.close()
myOutput.close()

My solution should work well for large data sets. If memory consumption is a concern, each name's list could be capped at 3 scores; currently I keep all previous scores and only display 3, in case you need more later. If the data size becomes unmanageable, you could always use a sqlite file database instead of the dicts, looking up the data on disk temporarily rather than holding it all in memory. With 8 GB of RAM and 2 GB of data you should be fine with the in-memory Python dictionaries used here. Make sure you use a 64-bit build of Python on a 64-bit OS. My example doesn't output anything to the screen, but for a big file you may want a print statement that shows progress every N rows (say every 100 or 1,000 rows, chosen for your system's speed). Note that screen output slows down processing of the file data.
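If the dicts did need to spill to disk, the sqlite swap could look like this minimal sketch (table and column names are made up for illustration; the standard-library `sqlite3` module is all that's required):

```python
import sqlite3

# In-memory for the demo; pass a file path instead to keep data on disk.
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE history (name TEXT, score TEXT, parameter TEXT)')
con.executemany('INSERT INTO history VALUES (?, ?, ?)', [
    ('george', '219', 'dataa'),
    ('george', '229', 'datad'),
    ('fred', '239', 'datae'),
])

# Most recent 3 score/parameter pairs for one name;
# rowid preserves insertion (i.e. date) order.
recent = con.execute(
    'SELECT score, parameter FROM history '
    'WHERE name = ? ORDER BY rowid DESC LIMIT 3', ('george',)
).fetchall()
```

An index on `name` would keep per-row lookups fast once the table grows to millions of rows.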
