How to pull X previous data entries into a CSV row

Published 2024-09-27 07:22:33


I have a very large CSV file, and for each row I need to append previous data for that row's name (column 2), taken only from dates before the current row's date. I think the easiest way to present the problem is with a detailed example that mirrors my real data, but much smaller:

Datatitle,Date,Name,Score,Parameter
data,01/09/13,george,219,dataa,text
data,01/09/13,fred,219,datab,text
data,01/09/13,tom,219,datac,text
data,02/09/13,george,229,datad,text
data,02/09/13,fred,239,datae,text
data,02/09/13,tom,219,dataf,text
data,03/09/13,george,209,datag,text
data,03/09/13,fred,217,datah,text
data,03/09/13,tom,213,datai,text
data,04/09/13,george,219,dataj,text
data,04/09/13,fred,212,datak,text
data,04/09/13,tom,222,datal,text
data,05/09/13,george,319,datam,text
data,05/09/13,fred,225,datan,text
data,05/09/13,tom,220,datao,text
data,06/09/13,george,202,datap,text
data,06/09/13,fred,226,dataq,text
data,06/09/13,tom,223,datar,text
data,06/09/13,george,219,dataae,text

So for the first three rows of this CSV there is no previous data. So if we say we want columns 3 and 4 from the last 3 occurrences of george (row 1) before the current date, it would be:

^{pr2}$

However, once previous data becomes available, we want to produce a CSV like:

Datatitle,Date,Name,Score,Parameter,LTscore,LTParameter,LTscore+1,LTParameter+1,LTscore+2,LTParameter+2
data,01/09/13,george,219,dataa,text,x,y,x,y,x,y
data,01/09/13,fred,219,datab,text,x,y,x,y,x,y
data,01/09/13,tom,219,datac,text,x,y,x,y,x,y
data,02/09/13,george,229,datad,text,219,dataa,x,y,x,y
data,02/09/13,fred,239,datae,text,219,datab,x,y,x,y
data,02/09/13,tom,219,dataf,text,219,datac,x,y,x,y
data,03/09/13,george,209,datag,text,229,datad,219,dataa,x,y
data,03/09/13,fred,217,datah,text,239,datae,219,datab,x,y
data,03/09/13,tom,213,datai,text,219,dataf,219,datac,x,y
data,04/09/13,george,219,dataj,text,209,datag,229,datad,219,dataa
data,04/09/13,fred,212,datak,text,217,datah,239,datae,219,datab
data,04/09/13,tom,222,datal,text,213,datai,219,dataf,219,datac
data,05/09/13,george,319,datam,text,219,dataj,209,datag,229,datad
data,05/09/13,fred,225,datan,text,212,datak,217,datah,239,datae
data,05/09/13,tom,220,datao,text,222,datal,213,datai,219,dataf
data,06/09/13,george,202,datap,text,319,datam,219,dataj,209,datag
data,06/09/13,fred,226,dataq,text,225,datan,212,datak,217,datah
data,06/09/13,tom,223,datar,text,220,datao,222,datal,213,datai
data,06/09/13,george,219,datas,text,319,datam,219,dataj,209,datag

You will notice that george appears twice on 06/09/13, and both times he has the same string 319,datam,219,dataj,209,datag appended to his row. The second occurrence of george gets the same string because the george row 3 lines above it falls on the same date. (This just highlights "dates before the current date.")

As you can see from the column headers, we collect the last 3 scores and their 3 associated parameters and append them to each row. Please note this is a very simplified example. In reality each date will contain several thousand rows, and in the real data there is no pattern to the names, so we would not expect to see fred, tom and george sitting next to each other in a repeating pattern. If anyone can help me work out how best (most efficiently) to achieve this I would be very grateful. If anything is unclear, let me know and I will add more detail. Any constructive comments appreciated. Thank you.
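For comparison only, and separate from the answers below: if pandas is an option, the same "previous three dates per name" lookup can be sketched with `drop_duplicates`, `groupby`/`shift` and a merge. This is an assumption about tooling, not the asker's code; the mini dataset and column names simply mirror the sample header.

```python
import pandas as pd
from io import StringIO

# Made-up miniature version of the data in the question.
csv_text = """Date,Name,Score,Parameter
01/09/13,george,219,dataa
01/09/13,fred,219,datab
02/09/13,george,229,datad
02/09/13,fred,239,datae
03/09/13,george,209,datag
03/09/13,george,219,datah
"""
df = pd.read_csv(StringIO(csv_text))

# Keep one row per (Name, Date) -- the last occurrence, matching the rule
# that a name repeated within one date contributes only its final row.
last = df.drop_duplicates(["Name", "Date"], keep="last")

# Shift the per-name history back by 1..3 dates, then merge it onto every
# original row so same-date duplicates receive identical history.
hist = last[["Name", "Date"]].copy()
for i, tag in enumerate(["LTscore", "LTscore+1", "LTscore+2"], start=1):
    hist[tag] = last.groupby("Name")["Score"].shift(i)
    hist[tag.replace("score", "Parameter")] = (
        last.groupby("Name")["Parameter"].shift(i)
    )
out = df.merge(hist, on=["Name", "Date"], how="left")
# Missing history becomes the 'x'/'y' placeholders from the question.
out = out.fillna({c: "x" if "score" in c else "y" for c in hist.columns[2:]})
```

Both 03/09/13 george rows end up with the same appended history, as the expected output requires.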


3 Answers

Your file appears to be in date order. If we take the last entry for each name within each date, add it to a fixed-size deque for that name, and write out each row as we go, then that should do it:

import csv
from collections import deque, defaultdict
from itertools import chain, islice, groupby
from operator import itemgetter

# defaultdict whose first access of a key will create a deque of size 3
# defaulting to [['x', 'y'], ['x', 'y'], ['x', 'y']]
# Since deques are efficient at head/tail manipulation, an insert at the
# start is cheap, and when the size is fixed it will cause extra
# elements to "fall off" the end...
names_previous = defaultdict(lambda: deque([['x', 'y']] * 3, 3))
with open('sample.csv', 'rb') as fin, open('sample_new.csv', 'wb') as fout:
    csvin = csv.reader(fin)
    csvout = csv.writer(fout)
    # Use groupby to detect changes in the date column. Since the data is always
    # ascending, the items within the same date are contiguous in the data. We use
    # this to identify the rows within the *same* date.
    # date=date we're looking at, rows=an iterable of rows that are in that date...
    for date, rows in groupby(islice(csvin, 1, None), itemgetter(1)):
        # After we've processed entries in this date, we need to know what items of data should
        # be considered for the names we've seen inside this date. Currently the data
        # is taken from the last occurring row for the name.
        to_add = {}
        for row in rows:
            # Output the row present in the file with a *flattened* version of the
            # extra data (previous items) that we wish to apply. eg:
            # [['x', 'y'], ['x', 'y'], ['x', 'y']] becomes ['x', 'y', 'x', 'y', 'x', 'y']
            # So we're easily able to store 3 pairs of data, but flatten it into one
            # long list of 6 items...
            # If the name (row[2]) doesn't exist yet, then by trying to do this, defaultdict
            # will automatically create the default key as above.
            csvout.writerow(row + list(chain.from_iterable(names_previous[row[2]])))
            # Here, we store for the name any additional data that should be included for the name
            # on the next date group. In this instance we store the information seen for the last
            # occurrence of that name in this date. eg: If we've seen it more than once, then
            # we only include data from the last occurrence. 
            # NB: If you wanted to include more than one item of data for the name, then you could
            # utilise a deque again by building it within this date group
            to_add[row[2]] = row[3:5]            
        for key, val in to_add.iteritems():
            # We've finished the date, so before processing the next one, update the
            # previous data for the names. Here we push a single item of data to the
            # front of the deque. If we were storing multiple items in the date loop,
            # we could .extendleft() instead to insert > 1 set of data from above.
            names_previous[key].appendleft(val)

This keeps only the names and their last 3 values in memory while it runs.

You may want to adjust it to include/write a proper header row rather than just skipping the input header.
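The fixed-size deque behaviour this answer leans on can be seen in isolation. A minimal sketch (the score/parameter values are taken from the sample data):

```python
from collections import deque
from itertools import chain

# A deque with maxlen=3: pushing onto one end silently drops items off the
# other end, which is exactly the "last three scores" window needed here.
history = deque([['x', 'y']] * 3, 3)
history.appendleft(['219', 'dataa'])   # after date 01/09/13
history.appendleft(['229', 'datad'])   # after date 02/09/13

# Flatten the pairs into the 6 extra CSV columns, newest pair first.
flat = list(chain.from_iterable(history))
print(flat)
```

After two pushes, one `'x', 'y'` placeholder pair still survives at the tail, matching the partial-history rows in the expected output.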

My two cents:

- Python 2.7.5.
- I used a defaultdict to hold the previous rows for each name.
- I used bounded-length deques to hold the previous rows, because I like the FIFO behaviour of a full deque. It makes it easy to reason about: just keep stuffing things in.
- I used operator.itemgetter() for indexing and slicing, because it just reads better.
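As a quick illustration of how those itemgetter helpers behave (the row here is a made-up example in the shape the answer processes: 6 original columns plus one appended score/parameter pair):

```python
from operator import itemgetter

# Hypothetical row: 6 input columns followed by appended history.
row = ['data', '02/09/13', 'george', '229', 'datad', 'text', '219', 'dataa']

name = itemgetter(2)
date = itemgetter(1)
sixplus = itemgetter(slice(6, None))   # everything past the 6 input columns

print(name(row), date(row), sixplus(row))
```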

from collections import deque, defaultdict
import csv
from functools import partial
from operator import itemgetter

# use a 3 item deque to hold the 
# previous three rows for each name
deck3 = partial(deque, maxlen = 3)
data = defaultdict(deck3)


name = itemgetter(2)
date = itemgetter(1)
sixplus = itemgetter(slice(6,None))

fields = ['Datatitle', 'Date', 'Name', 'Score', 'Parameter',
          'LTscore', 'LTParameter', 'LTscore+1', 'LTParameter+1',
          'LTscore+2', 'LTParameter+2']
with open('data.txt') as infile, open('processed.txt', 'wb') as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    writer.writerow(fields)
    # comment out the next line if the data file does not have a header row
    reader.next()
    for row in reader:
        default = deque(['x', 'y', 'x', 'y', 'x', 'y'], maxlen = 6)
        try:
            previous_row = data[name(row)][-1]
            previous_date = date(previous_row)
        except IndexError:
            previous_date = None
        if  previous_date == date(row):
            # use the xtra stuff from last time
            row.extend(sixplus(previous_row))
            # discard the previous row because
            # there is a new row with the same date
            data[name(row)].pop()
        else:
            # add columns 3 and 4 from each previous row
            for deck in data[name(row)]:
                # adding new items to a full deque causes
                # items to drop off the other end
                default.appendleft(deck[4])
                default.appendleft(deck[3])
            row.extend(default)
        writer.writerow(row)
        data[name(row)].append(row)

After thinking this solution over with a glass of port, I realised it is too complicated, which tends to happen when I try to get fancy. Not too sure about the etiquette, so I'll leave it up; it does have one possible advantage of keeping the previous 3 rows for each name.

Here is a solution using slices and a regular dictionary. It only keeps the previously processed row. Much simpler. I kept the itemgetters, again for readability.

^{pr2}$

I have found, for similar kinds of processing, that accumulating rows and writing them out in chunks rather than individually can improve performance quite a bit. Also, if possible, reading the whole data file at once helps.
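A minimal sketch of that batching idea, assuming `writerows` on accumulated chunks (the rows and chunk size here are made up; it writes to an in-memory buffer just to keep the example self-contained):

```python
import csv
import io

# Dummy rows; in practice these would be the processed CSV rows.
rows = [[str(i), 'name%d' % i, str(i * 2)] for i in range(10)]

buf = io.StringIO()
writer = csv.writer(buf)

CHUNK = 4  # tune to taste; larger chunks mean fewer write calls
for start in range(0, len(rows), CHUNK):
    writer.writerows(rows[start:start + CHUNK])

n_lines = buf.getvalue().count('\r\n')  # csv's default line terminator
```

For real files, writing to a file object opened with `newline=''` (Python 3) works the same way.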

Below is a code example demonstrating what you are looking for, using the sample data attached to your question. I named my input file input.csv and read/write in the working directory; output.csv goes to the same folder. I used comments in the code to explain: previous records are stored in a dictionary, looked up by name, with a list of scores per name; current-date records are stored in a new buffer dictionary and folded into the main dictionary each time the input date changes. Let me know if you have any questions; the code is a bit rough, just a quick example. The [:6] slice gives the 6 most recent list items for the current name (the previous three score/parameter pairs).

import csv

myInput = open('input.csv','rb')
myOutput = open('output.csv','wb')
myFields = ['Datatitle','Date','Name','Score','Parameter','Text',
            'LTscore','LTParameter','LTscore+1','LTParameter+1',
            'LTscore+2','LTParameter+2']
inCsv = csv.DictReader(myInput,myFields)
outCsv = csv.writer(myOutput)
outCsv.writerow(myFields) # Write header row

previous_dict = dict() # store scores from previous dates
new_dict = dict() # buffer for records on current-date only

def add_new():
    # merge new_dict into previous_dict
    global new_dict, previous_dict
    for k in new_dict:
        if k not in previous_dict:
            previous_dict[k] = list()
        # put new items first
        previous_dict[k] = new_dict[k] + previous_dict[k]
    new_dict = dict() # reset buffer

old_date = '00/00/00' # start with bogus *oldest* date string
inCsv.next() # skip header row
for row in inCsv:
    myTitle = row['Datatitle']
    myDate = row['Date']
    myName = row['Name']
    myScore = row['Score']
    myParameter = row['Parameter']
    myText = row['Text']
    if old_date != myDate:
        add_new() # store new_dict buffer with previous data
        old_date = myDate
    if myName not in new_dict:
        new_dict[myName] = []
    # put new scores first
    new_dict[myName] = [myScore,myParameter] + new_dict[myName]
    if myName not in previous_dict:
        previous_dict[myName] = []
    outCsv.writerow([myTitle,myDate,myName,myScore,myParameter,myText] \
                     + previous_dict[myName][:6])
# end loop for each row

myInput.close()
myOutput.close()

My solution should work well for large data sets. If memory consumption is a concern, each name's list could be capped at 3 scores; currently I keep all previous scores and only display 3, in case you need more later. If the data size becomes unmanageable, you could always use a sqlite file database instead of the dicts, looking up the data on disk temporarily rather than holding it all in memory. With 8 GB of RAM and 2 GB of data you should be fine with the in-memory Python dictionaries used here. Make sure you use a 64-bit build of Python on a 64-bit OS. My example doesn't output anything to the screen, but for a big file you may want a print statement that shows progress every N rows (say every 100 or 1,000 rows, chosen for your system's speed). Note that screen output slows down processing of the file data.
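If the dicts did need to spill to disk, the sqlite swap could look like this minimal sketch (table and column names are made up for illustration; the standard-library `sqlite3` module is all that's required):

```python
import sqlite3

# In-memory for the demo; pass a file path instead to keep data on disk.
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE history (name TEXT, score TEXT, parameter TEXT)')
con.executemany('INSERT INTO history VALUES (?, ?, ?)', [
    ('george', '219', 'dataa'),
    ('george', '229', 'datad'),
    ('fred', '239', 'datae'),
])

# Most recent 3 score/parameter pairs for one name;
# rowid preserves insertion (i.e. date) order.
recent = con.execute(
    'SELECT score, parameter FROM history '
    'WHERE name = ? ORDER BY rowid DESC LIMIT 3', ('george',)
).fetchall()
```

An index on `name` would keep per-row lookups fast once the table grows to millions of rows.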
