<p>Another possible (sysadmin) approach, which avoids databases and SQL queries plus their heavy demands on runtime processes and hardware resources.</p>
<p><strong>Added</strong> a more simplified approach:</p>
<ol>
<li><a href="https://stackoverflow.com/questions/6819601/convert-datetime-in-to-epoch-using-awk-piped-data">Convert the timestamp</a> to seconds (since the Epoch), and use UNIX <code>sort</code> on the email and this new field (i.e.: <code>sort -k2 -k4 -n -t, < converted_input_file > output_file</code>)</li>
<li>Initialize 3 variables: <code>EMAIL</code>, <code>PREV_TIME</code> and <code>COUNT</code></li>
<li>Iterate over each line; when a new email is encountered, append "1,0 day". Update <code>PREV_TIME=timestamp</code>, <code>COUNT=1</code>, <code>EMAIL=new_email</code></li>
<li>Next line: 3 possible cases
<ul>
<li>a) same email, different timestamp: compute the difference in days, reset <code>COUNT=1</code>, update the previous time, append "COUNT,Difference_in_days"</li>
<li>b) same email, same timestamp: increment <code>COUNT</code>, append "COUNT,0 day"</li>
<li>c) a new email: start again from step 3.</li>
</ul></li>
</ol>
<p>An alternative for step 1. is to append a new timestamp field and strip it again when printing the line back out.</p>
<p>Note: if 1.5GB is too large to sort in one pass, split it into smaller chunks, using the email as the split point. You can run these chunks in parallel on different machines.</p>
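<p>A minimal sketch of such a split by email before sorting; this is an illustration only, and the <code>chunk_*.txt</code> names and the first-character split key are my own convention, not part of the answer:</p>
<pre><code># illustrative sample rows (same quoted CSV layout as the input above)
cat > input.txt <<'EOF'
"DF","00000000@11111.COM","FLTINT1000130394756","26JUL2010","B2C","6799.2"
"DF","0001HARISH@GMAIL.COM","NF252022031180","09DEC2010","B2C","3439"
EOF

# route each row to a chunk keyed by the first character of the email:
# character 2 of field 2, because character 1 is the opening quote,
# so every row for a given email lands in the same chunk
awk -F',' '{ print > ("chunk_" substr($2, 2, 1) ".txt") }' input.txt
</code></pre>
<p>Each resulting <code>chunk_*.txt</code> can then be converted and sorted independently, even on a different machine.</p>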
<pre><code>/usr/bin/gawk -F'","' 'BEGIN {
    # build a month-abbreviation -> month-number lookup table once
    split("JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC", month, " ");
    for (i = 1; i <= 12; i++) mdigit[month[i]] = i;
}
{
    # $4 is the date, e.g. 26JUL2010: day = chars 1-2, month = 3-5, year = 6-9;
    # append its epoch-seconds value as an extra field
    print $0 "," mktime(substr($4,6,4) " " mdigit[substr($4,3,3)] " " substr($4,1,2) " 00 00 00")
}' < input.txt | /usr/bin/sort -k2 -k7 -n -t, > output_file.txt
</code></pre>
<p><code>output_file.txt</code> looks like:</p>
<blockquote>
<p>"DF","00000000@11111.COM","FLTINT1000130394756","26JUL2010","B2C","6799.2",1280102400
"DF","0001HARISH@GMAIL.COM","NF252022031180","09DEC2010","B2C","3439",1291852800
"DF","0001HARISH@GMAIL.COM","NF251742087846","12DEC2010","B2C","1000",1292112000
"DF","0001HARISH@GMAIL.COM","NF251352240086","22DEC2010","B2C","4006",1292976000<br/>
...</p>
</blockquote>
<p>You can pipe the output into a Perl, Python, or AWK script to handle steps 2. to 4.</p>
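<p>Steps 2. to 4. might be sketched as an AWK filter like the one below. A few assumptions: it reads the <code>output_file.txt</code> produced above, uses plain <code>-F','</code> (the appended epoch field carries no quotes, so <code>$7</code> is the timestamp), and resets <code>COUNT</code> whenever the timestamp changes, which is my reading of case a).</p>
<pre><code>awk -F',' '
$2 != EMAIL {                      # step 3: a new email starts over
    EMAIL = $2; PREV_TIME = $7; COUNT = 1
    print $0 ",1,0 day"; next
}
$7 != PREV_TIME {                  # case a): same email, later timestamp
    days = ($7 - PREV_TIME) / 86400
    COUNT = 1; PREV_TIME = $7
    print $0 "," COUNT "," days " days"; next
}
{                                  # case b): same email, same timestamp
    COUNT++
    print $0 "," COUNT ",0 day"
}' < output_file.txt
</code></pre>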