"DF","00000000@11111.COM","FLTINT1000130394756","26JUL2010","B2C","6799.2"
"Rail","00000.POO@GMAIL.COM","NR251764697478","24JUN2011","B2C","2025"
"DF","0000650000@YAHOO.COM","NF2513521438550","01JAN2013","B2C","6792"
"Bus","00009.GAURAV@GMAIL.COM","NU27012932319739","26JAN2013","B2C","800"
"Rail","0000.ANU@GMAIL.COM","NR251764697526","24JUN2011","B2C","595"
"Rail","0000MANNU@GMAIL.COM","NR251277005737","29OCT2011","B2C","957"
"Rail","0000PRANNOY0000@GMAIL.COM","NR251297862893","21NOV2011","B2C","212"
"DF","0000PRANNOY0000@YAHOO.CO.IN","NF251327485543","26JUN2011","B2C","17080"
"Rail","0000RAHUL@GMAIL.COM","NR2512012069809","25OCT2012","B2C","5731"
"DF","0000SS0@GMAIL.COM","NF251355775967","10MAY2011","B2C","2000"
"DF","0001HARISH@GMAIL.COM","NF251352240086","22DEC2010","B2C","4006"
"DF","0001HARISH@GMAIL.COM","NF251742087846","12DEC2010","B2C","1000"
"DF","0001HARISH@GMAIL.COM","NF252022031180","09DEC2010","B2C","3439"
"Rail","000AYUSH@GMAIL.COM","NR2151120122283","25JAN2013","B2C","136"
"Rail","000AYUSH@GMAIL.COM","NR2151213260036","28NOV2012","B2C","41"
"Rail","000AYUSH@GMAIL.COM","NR2151313264432","29NOV2012","B2C","96"
"Rail","000AYUSH@GMAIL.COM","NR2151413266728","29NOV2012","B2C","96"
"Rail","000AYUSH@GMAIL.COM","NR2512912359037","08DEC2012","B2C","96"
"Rail","000AYUSH@GMAIL.COM","NR2517612385569","12DEC2012","B2C","96"
The above is sample data. The data is sorted by email address, and the file is very large, about 1.5 GB. I want output along these lines in another csv file: if an entry occurs for the first time I need to append 1, if it occurs a second time I need to append 2, and so on. In other words, I need to count the occurrences of each email address in the file, and if an email exists twice or more I also want the difference between the dates. Keep in mind that the dates are not sorted, so they too have to be sorted per email address. I am looking for a solution in Python using numpy or pandas, or any other library that can handle this kind of huge data without throwing an out-of-memory exception. I have a dual-core processor with CentOS 6.3 and 4 GB of RAM.
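For instance, the three rows for 0001HARISH@GMAIL.COM above would presumably come out roughly like this, date-sorted within the email, with the occurrence number and the day gap appended (the exact shape of the appended fields is a guess from the description):

```
"DF","0001HARISH@GMAIL.COM","NF252022031180","09DEC2010","B2C","3439",1,0
"DF","0001HARISH@GMAIL.COM","NF251742087846","12DEC2010","B2C","1000",2,3
"DF","0001HARISH@GMAIL.COM","NF251352240086","22DEC2010","B2C","4006",3,10
```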
Another possible (sysadmin's) way, avoiding databases and SQL queries as well as heavy demands on runtime processes and hardware resources:

1. Sort using the email and the date (i.e.: sort -k2 -k4 -n -t, < converted_input_file > output_file). An alternative for 1. is to add a new timestamp field, since a raw date like 26JUL2010 does not sort chronologically as text, and delete that field again when printing the line back out.
2. Keep track of three variables: EMAIL, PREV_TIME and COUNT.
3. Whenever a new email is seen, reset the state: EMAIL=new_email, COUNT=1, PREV_TIME=timestamp.
4. Whenever the email is the same as on the previous line, increment COUNT and output the difference between the current timestamp and PREV_TIME.

You can pipe output_file to a Perl, Python or AWK script to process steps 2. to 4. (see the sketch below).

Note: if 1.5 GB is too large to sort in one go, split it into smaller chunks, using the email as the split point. You can run these chunks in parallel on different machines.
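For illustration, a minimal version of steps 2. to 4. in Python (a sketch, not the answer's own script; it assumes step 1. has already produced a file sorted by email and date, with the six-column layout of the sample data):

```python
#!/usr/bin/env python
# Sketch of steps 2. to 4.: read the sorted CSV from stdin and append
# an occurrence counter plus the gap in days since the previous row
# with the same email.
import csv
import sys
from datetime import datetime

email, prev_time, count = None, None, 0
writer = csv.writer(sys.stdout, quoting=csv.QUOTE_ALL)
for row in csv.reader(sys.stdin):
    timestamp = datetime.strptime(row[3], "%d%b%Y")  # e.g. 26JUL2010
    if row[1] != email:
        # step 3: new email -> reset the state
        email, prev_time, count = row[1], timestamp, 1
        diff_days = 0
    else:
        # step 4: repeated email -> count it and take the date difference
        count += 1
        diff_days = (timestamp - prev_time).days
        prev_time = timestamp
    writer.writerow(row + [count, diff_days])
```

Because it only ever holds one line of state, memory use stays constant regardless of file size.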
Use the built-in sqlite3 database: you can insert the data, then sort and group as needed, and there is no problem using a file larger than the available RAM.
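A rough sketch of that idea (the table and column names are made up for the example; the dates are converted to ISO while loading so sqlite's date functions can work on them):

```python
# Load the CSV into an on-disk sqlite3 database, then let SQL do the
# grouping and the date arithmetic.
import csv
import sqlite3
from datetime import datetime

def rows(path):
    # convert DDMONYYYY (e.g. 26JUL2010) to ISO YYYY-MM-DD
    with open(path, newline="") as f:
        for row in csv.reader(f):
            row[3] = datetime.strptime(row[3], "%d%b%Y").strftime("%Y-%m-%d")
            yield row

conn = sqlite3.connect("bookings.db")  # on disk, so data > RAM is fine
conn.execute("CREATE TABLE bookings (product TEXT, email TEXT, ref TEXT,"
             " booked TEXT, channel TEXT, amount REAL)")
conn.executemany("INSERT INTO bookings VALUES (?,?,?,?,?,?)", rows("input.csv"))
conn.commit()

# occurrences per email and days between first and last booking
for email, n, span in conn.execute(
        "SELECT email, COUNT(*),"
        " julianday(MAX(booked)) - julianday(MIN(booked))"
        " FROM bookings GROUP BY email"):
    print(email, n, span)
```

This reports the span between the first and last booking; per-occurrence differences could be produced by iterating over SELECT ... ORDER BY email, booked with the same running-state logic as the sort-based answer above.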
Make sure you have pandas 0.11. Read these docs: http://pandas.pydata.org/pandas-docs/dev/io.html#hdf5-pytables, and these recipes: http://pandas.pydata.org/pandas-docs/dev/cookbook.html#hdfstore (especially "merging on millions of rows").
Here is a solution that seems to work. The workflow is:

1) read the data from the csv chunk-by-chunk and append it to an HDF store
2) iterate over that store, which creates another store that does the combining
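Step 1) might look roughly like this (a sketch, not the original answer's code; the column names are invented to match the sample data):

```python
# Stream the 1.5 GB csv into an HDFStore chunk-by-chunk so that only
# one chunk is ever held in memory.
import pandas as pd

cols = ["product", "email", "ref", "date", "channel", "amount"]
store = pd.HDFStore("bookings.h5")

for chunk in pd.read_csv("input.csv", names=cols, header=None,
                         chunksize=1000000):
    chunk["date"] = pd.to_datetime(chunk["date"], format="%d%b%Y")
    # data_columns lets later selects filter on email inside the store
    store.append("bookings", chunk, data_columns=["email"])
store.close()
```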
Essentially we take a chunk from the table and combine it with a chunk from every other part of the file. The combiner function does not reduce; instead it calculates the function (the difference in days) between all elements in that chunk, eliminating duplicates and taking the latest data after each loop. Kind of like a recursive reduce.
This should be O(num_of_chunks**2) in memory and computation time. In your case the chunksize could be, say, 1m rows (or more).
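As a simplified illustration of the second pass (not the original combiner: occurrence counts and first/last dates combine associatively, so a plain linear fold over the chunks is enough here, avoiding the pairwise O(num_of_chunks**2) combine):

```python
# Fold each chunk of the store into a running per-email aggregate
# (count, earliest date, latest date), then derive the day span.
import pandas as pd

store = pd.HDFStore("bookings.h5")
nrows = store.get_storer("bookings").nrows
chunksize = 1000000
result = None  # running aggregate, one row per email

for start in range(0, nrows, chunksize):
    chunk = store.select("bookings", start=start, stop=start + chunksize)
    agg = chunk.groupby("email")["date"].agg(["count", "min", "max"])
    if result is None:
        result = agg
    else:
        # merge the new partial aggregate into the running one
        both = pd.concat([result, agg])
        result = both.groupby(level=0).agg(
            {"count": "sum", "min": "min", "max": "max"})

result["days"] = (result["max"] - result["min"]).dt.days
print(result.head())
store.close()
```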