I have this dataset (the data is fictional):
$ cat sample.csv
id,fname,lname,education,gradyear,attributes
"6F9619FF-8B86-D011-B42D-00C04FC964FF",john,smith,mit,2003,qa
"6F9619FF-8B86-D011-B42D-00C04FC964FF",john,smith,harvard,2007,"test|admin,test"
"6F9619FF-8B86-D011-B42D-00C04FC964FF",john,smith,harvard,2007,"test|admin,test"
"6F9619FF-8B86-D011-B42D-00C04FC964FF",john,smith,ft,2012,NULL
"6F9619FF-8B86-D011-B42D-00C04FC964F1",john,doe,htw,2000,dev
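When I run the following script, it parses the CSV and finds the unique rows; when several rows match, it merges them into one row, column by column:
parse-csv.py
[code not preserved]
I then run a second script, which makes sure every column contains only unique values separated by "|":
unique.py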
import argparse
import csv
import sys
from collections import OrderedDict

# Allow very large fields (e.g. long merged cells).
csv.field_size_limit(sys.maxsize)

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='sql dump parser - unique')
    parser.add_argument('-i', '--input', help='input file', required=True)
    parser.add_argument('-o', '--output', help='output file', required=True)
    args = parser.parse_args()
    inputf = args.input
    outputf = args.output
    # Python 3: open CSV files in text mode with newline='' (Python 2 used 'wb').
    with open(inputf, newline='') as fin, open(outputf, 'w', newline='') as fout:
        csvin = csv.DictReader(fin)
        csvout = csv.DictWriter(fout, fieldnames=csvin.fieldnames, quoting=csv.QUOTE_ALL, lineterminator='\n')
        csvout.writeheader()
        for row in csvin:
            for k, v in row.items():
                # Keep only the first occurrence of each '|'-separated value.
                row[k] = '|'.join(OrderedDict.fromkeys(v.split('|')))
            csvout.writerow(row)
It works for sample.csv.
Output:
$ python parse-csv.py -i sample.csv -o sample-out.csv
$ python unique.py -i sample-out.csv -o sample-final.csv
$ cat sample-final.csv
"id","fname","lname","education","gradyear","attributes"
"6F9619FF-8B86-D011-B42D-00C04FC964FF","john","smith","mit|harvard|ft","2003|2007|2012","qa|test|admin,test|NULL"
"6F9619FF-8B86-D011-B42D-00C04FC964F1","john","doe","htw","2000","dev"
But when I try it with this file:
(the dataset is fictional)
sample2.csv
id,lastname,firstname,middlename,address1,address2,city,zipcode,city2,zipcode2,emailaddress,website
"E387F3C1-F6E9-40DD-86AB-A7149C67F61C","Technical Support",NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL
"648EEB5D-0586-444A-B86F-4EB2446BBC93","Palm","Samuel","J",NULL,NULL,NULL,NULL,NULL,NULL,"",NULL
"A94FAD4E-27DB-48FE-B89E-C37B408C5DD5","Mait","A.V.",NULL,NULL,NULL,NULL,NULL,NULL,NULL,"mait@yahoo.com",NULL
"E387F3C1-F6E9-40DD-86AB-A7149C67F61C","Technical Support",NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL
"648EEB5D-0586-444A-B86F-4EB2446BBC93","Palm","Samuel","J",NULL,NULL,NULL,NULL,NULL,NULL,"",NULL
"A94FAD4E-27DB-48FE-B89E-C37B408C5DD5","Mait","A.V.",NULL,NULL,NULL,NULL,NULL,NULL,NULL,"mait@yahoo.com",NULL
"FDFCA22A-EE19-4997-B892-90B2006FE328","Drago","Paul",NULL,"","","","",NULL,NULL,"psd@gmail.com",NULL
"FDFCA22A-EE19-4997-B892-90B2006FE328","Drago","Paul",NULL,"","","","",NULL,NULL,"psd@gmail.com",NULL
"FDFCA22A-EE19-4997-B892-90B2006FE328","Drago","Paul",NULL,"","","","",NULL,NULL,"psd@gmail.com",NULL
"FDFCA22A-EE19-4997-B892-90B2006FE328","Drago","Paul",NULL,"","","","",NULL,NULL,"psd@gmail.com",NULL
"FDFCA22A-EE19-4997-B892-90B2006FE328","Drago","Paul",NULL,"","","","",NULL,NULL,"psd@gmail.com",NULL
The output is:
$ python parse-csv.py -i sample2.csv -o sample2-out.csv
$ python unique.py -i sample2-out.csv -o sample2-final.csv
$ cat sample2-final.csv
"id","lastname","firstname","middlename","address1","address2","city","zipcode","city2","zipcode2","emailaddress","website"
"E387F3C1-F6E9-40DD-86AB-A7149C67F61C","Technical Support","NULL","NULL","NULL","NULL","NULL","NULL","NULL","NULL","NULL","NULL"
"648EEB5D-0586-444A-B86F-4EB2446BBC93","Palm","Samuel","J","NULL","NULL","NULL","NULL","NULL","NULL","","NULL"
"A94FAD4E-27DB-48FE-B89E-C37B408C5DD5","Mait","A.V.","NULL","NULL","NULL","NULL","NULL","NULL","NULL","mait@yahoo.com","NULL"
"E387F3C1-F6E9-40DD-86AB-A7149C67F61C","Technical Support","NULL","NULL","NULL","NULL","NULL","NULL","NULL","NULL","NULL","NULL"
"648EEB5D-0586-444A-B86F-4EB2446BBC93","Palm","Samuel","J","NULL","NULL","NULL","NULL","NULL","NULL","","NULL"
"A94FAD4E-27DB-48FE-B89E-C37B408C5DD5","Mait","A.V.","NULL","NULL","NULL","NULL","NULL","NULL","NULL","mait@yahoo.com","NULL"
"FDFCA22A-EE19-4997-B892-90B2006FE328","Drago","Paul","NULL","","","","","NULL","NULL","psd@gmail.com","NULL"
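Why doesn't it produce unique rows and columns the way it did for sample.csv?
Does anyone have any ideas?
Thanks in advance! I have been puzzling over this for a while.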
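The first file is sorted by id, while the second one is not. In sample2.csv the repeated ids are not adjacent, and the output shows that only the adjacent FDFCA22A-EE19-4997-B892-90B2006FE328 rows were merged. See this discussion.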
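To illustrate why sort order matters, here is a minimal sketch. It assumes the merge step groups consecutive rows, for example with itertools.groupby; parse-csv.py itself is not shown here, so that is an assumption, and the shortened ids are hypothetical stand-ins for the ones in sample2.csv.

```python
from itertools import groupby

# Shortened, hypothetical stand-ins for the ids in sample2.csv, in file order.
ids = ["E387", "648E", "A94F", "E387", "648E", "A94F", "FDFC", "FDFC", "FDFC"]


def group_sizes(keys):
    """Return (key, count) for each run of equal *adjacent* keys."""
    return [(key, len(list(group))) for key, group in groupby(keys)]


# groupby only merges neighbours, so unsorted input yields repeated groups.
unsorted_groups = group_sizes(ids)

# Sorting first makes equal ids adjacent, so each id forms exactly one group.
sorted_groups = group_sizes(sorted(ids))

print(len(unsorted_groups), unsorted_groups)
print(len(sorted_groups), sorted_groups)
```

If the merge script does work this way, sorting the rows by id before grouping would likely be enough to fix the output.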
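You only need this:
Here is my simple solution to your problem (as I understand it), using a dictionary:
This should work regardless of whether the dataset is sorted.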
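The answer's own code is not preserved, so the following is only a sketch of what a dictionary-based merge could look like, not the original solution. The function name merge_rows and the small inline dataset are hypothetical; cells are split on "|" to mirror what unique.py does.

```python
import csv
import io


def merge_rows(rows, key="id"):
    """Merge rows sharing `key`; other columns keep unique values in first-seen order."""
    merged = {}  # key value -> {column name -> {unique value: None}}
    for row in rows:
        cols = merged.setdefault(row[key], {})
        for name, value in row.items():
            if name == key:
                continue
            # Split on '|' so already-merged cells are de-duplicated too.
            for part in value.split("|"):
                cols.setdefault(name, {})[part] = None
    return [
        dict([(key, k)] + [(name, "|".join(vals)) for name, vals in cols.items()])
        for k, cols in merged.items()
    ]


# Hypothetical unsorted input: the two "A1" rows are not adjacent.
sample = """id,fname,education
A1,john,mit
B2,jane,htw
A1,john,harvard
A1,john,mit
"""

rows = list(csv.DictReader(io.StringIO(sample)))
result = merge_rows(rows)
for merged_row in result:
    print(merged_row)
```

Because rows are looked up by id in a dictionary rather than compared with their neighbours, the input order does not affect which rows get merged. The whole file is held in memory, which may matter for very large dumps.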
Let me know if it helps!