<p>确保你有0.11,阅读这些文档:<a href="http://pandas.pydata.org/pandas-docs/dev/io.html#hdf5-pytables" rel="noreferrer">http://pandas.pydata.org/pandas-docs/dev/io.html#hdf5-pytables</a>,和这些配方:<a href="http://pandas.pydata.org/pandas-docs/dev/cookbook.html#hdfstore" rel="noreferrer">http://pandas.pydata.org/pandas-docs/dev/cookbook.html#hdfstore</a>(尤其是“合并数百万行”</p>
<p>这里有一个似乎有效的解决方案。工作流程如下:</p>
<p>1)从csv中按块读取数据并附加到HDF存储
2) 对存储区的迭代,它创建另一个执行合并器的存储区</p>
<p>本质上,我们从表中获取一个块,并与文件其他部分的一个块进行组合。combiner函数不会减少,而是计算该块中所有元素之间的函数(以天为单位的差异),消除重复项,并在每次循环后获取最新数据。有点像递归的reduce。在</p>
<p>这应该是O(num_of_chunks**2)内存和计算时间
在你的情况下,chunksize可以说是1m(或更多)</p>
<pre><code>processing [0] [datastore.h5]
processing [1] [datastore_0.h5]
count date diff email
4 1 2011-06-24 00:00:00 0 0000.ANU@GMAIL.COM
1 1 2011-06-24 00:00:00 0 00000.POO@GMAIL.COM
0 1 2010-07-26 00:00:00 0 00000000@11111.COM
2 1 2013-01-01 00:00:00 0 0000650000@YAHOO.COM
3 1 2013-01-26 00:00:00 0 00009.GAURAV@GMAIL.COM
5 1 2011-10-29 00:00:00 0 0000MANNU@GMAIL.COM
6 1 2011-11-21 00:00:00 0 0000PRANNOY0000@GMAIL.COM
7 1 2011-06-26 00:00:00 0 0000PRANNOY0000@YAHOO.CO.IN
8 1 2012-10-25 00:00:00 0 0000RAHUL@GMAIL.COM
9 1 2011-05-10 00:00:00 0 0000SS0@GMAIL.COM
12 1 2010-12-09 00:00:00 0 0001HARISH@GMAIL.COM
11 2 2010-12-12 00:00:00 3 0001HARISH@GMAIL.COM
10 3 2010-12-22 00:00:00 13 0001HARISH@GMAIL.COM
14 1 2012-11-28 00:00:00 0 000AYUSH@GMAIL.COM
15 2 2012-11-29 00:00:00 1 000AYUSH@GMAIL.COM
17 3 2012-12-08 00:00:00 10 000AYUSH@GMAIL.COM
18 4 2012-12-12 00:00:00 14 000AYUSH@GMAIL.COM
13 5 2013-01-25 00:00:00 58 000AYUSH@GMAIL.COM
import pandas as pd
import StringIO
import numpy as np
from time import strptime
from datetime import datetime
# your data
data = """
"DF","00000000@11111.COM","FLTINT1000130394756","26JUL2010","B2C","6799.2"
"Rail","00000.POO@GMAIL.COM","NR251764697478","24JUN2011","B2C","2025"
"DF","0000650000@YAHOO.COM","NF2513521438550","01JAN2013","B2C","6792"
"Bus","00009.GAURAV@GMAIL.COM","NU27012932319739","26JAN2013","B2C","800"
"Rail","0000.ANU@GMAIL.COM","NR251764697526","24JUN2011","B2C","595"
"Rail","0000MANNU@GMAIL.COM","NR251277005737","29OCT2011","B2C","957"
"Rail","0000PRANNOY0000@GMAIL.COM","NR251297862893","21NOV2011","B2C","212"
"DF","0000PRANNOY0000@YAHOO.CO.IN","NF251327485543","26JUN2011","B2C","17080"
"Rail","0000RAHUL@GMAIL.COM","NR2512012069809","25OCT2012","B2C","5731"
"DF","0000SS0@GMAIL.COM","NF251355775967","10MAY2011","B2C","2000"
"DF","0001HARISH@GMAIL.COM","NF251352240086","22DEC2010","B2C","4006"
"DF","0001HARISH@GMAIL.COM","NF251742087846","12DEC2010","B2C","1000"
"DF","0001HARISH@GMAIL.COM","NF252022031180","09DEC2010","B2C","3439"
"Rail","000AYUSH@GMAIL.COM","NR2151120122283","25JAN2013","B2C","136"
"Rail","000AYUSH@GMAIL.COM","NR2151213260036","28NOV2012","B2C","41"
"Rail","000AYUSH@GMAIL.COM","NR2151313264432","29NOV2012","B2C","96"
"Rail","000AYUSH@GMAIL.COM","NR2151413266728","29NOV2012","B2C","96"
"Rail","000AYUSH@GMAIL.COM","NR2512912359037","08DEC2012","B2C","96"
"Rail","000AYUSH@GMAIL.COM","NR2517612385569","12DEC2012","B2C","96"
"""
# read in and create the store
data_store_file = 'datastore.h5'
store = pd.HDFStore(data_store_file,'w')
def dp(x, **kwargs):
return [ datetime(*strptime(v,'%d%b%Y')[0:3]) for v in x ]
chunksize=5
reader = pd.read_csv(StringIO.StringIO(data),names=['x1','email','x2','date','x3','x4'],
header=0,usecols=['email','date'],parse_dates=['date'],
date_parser=dp, chunksize=chunksize)
for i, chunk in enumerate(reader):
chunk['indexer'] = chunk.index + i*chunksize
# create the global index, and keep it in the frame too
df = chunk.set_index('indexer')
# need to set a minimum size for the email column
store.append('data',df,min_itemsize={'email' : 100})
store.close()
# define the combiner function
def combiner(x):
# given a group of emails (the same), return a combination
# with the new data
# sort by the date
y = x.sort('date')
# calc the diff in days (an integer)
y['diff'] = (y['date']-y['date'].iloc[0]).apply(lambda d: float(d.item().days))
y['count'] = pd.Series(range(1,len(y)+1),index=y.index,dtype='float64')
return y
# reduce the store (and create a new one by chunks)
in_store_file = data_store_file
in_store1 = pd.HDFStore(in_store_file)
# iter on the store 1
for chunki, df1 in enumerate(in_store1.select('data',chunksize=2*chunksize)):
print "processing [%s] [%s]" % (chunki,in_store_file)
out_store_file = 'datastore_%s.h5' % chunki
out_store = pd.HDFStore(out_store_file,'w')
# iter on store 2
in_store2 = pd.HDFStore(in_store_file)
for df2 in in_store2.select('data',chunksize=chunksize):
# concat & drop dups
df = pd.concat([df1,df2]).drop_duplicates(['email','date'])
# group and combine
result = df.groupby('email').apply(combiner)
# remove the mi (that we created in the groupby)
result = result.reset_index('email',drop=True)
# only store those rows which are in df2!
result = result.reindex(index=df2.index).dropna()
# store to the out_store
out_store.append('data',result,min_itemsize={'email' : 100})
in_store2.close()
out_store.close()
in_store_file = out_store_file
in_store1.close()
# show the reduced store
print pd.read_hdf(out_store_file,'data').sort(['email','diff'])
</code></pre>