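<p>Here is one way to do it using only tools from the standard library while preserving column order. The file <code>messy_data.txt</code> contains the raw data, and <code>cleaner_data.txt</code> is where the cleaned data is saved:</p>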
<pre><code>from collections import defaultdict, OrderedDict

with open('messy_data.txt') as infile, open('cleaner_data.txt', 'w') as outfile:
    # Split every line into its "key=value" fields.
    whole_data = [x.strip().split("||") for x in infile]

    # Collect the keys in the order they are first seen.
    headers = []
    for x in whole_data:
        for k in [y.split("=")[0] for y in x]:
            if k not in headers:
                headers.append(k)

    # Turn each line into a dict of key -> value.
    whole_data = [dict(y.split("=") for y in x) for x in whole_data]

    # Build one column per header, filling gaps with 'NULL'.
    output = defaultdict(list)
    for header in headers:
        for d in whole_data:
            output[header].append(d.get(header, 'NULL'))
    output = OrderedDict((x, output.get(x)) for x in headers)

    # Write the header row, then transpose the columns back into rows.
    outfile.write("||".join(list(output.keys())) + "\n")
    for row in zip(*output.values()):
        outfile.write("||".join(row) + "\n")
</code></pre>
<p>This will produce:</p>
^{pr2}$
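<p>To make the input and output shapes concrete, here is a minimal in-memory sketch of the same idea. The function name <code>normalize_records</code> and the sample records are hypothetical and are not taken from the original data. It assumes Python 3.7+ (dicts keep insertion order), and it splits on the first <code>=</code> only, so a value that itself contains <code>=</code> stays intact, which differs slightly from the script above. A field with no <code>=</code> at all still raises <code>ValueError</code>.</p>

```python
def normalize_records(lines, sep="||", missing="NULL"):
    """Align key=value records into rows that share one ordered header list."""
    records = []
    headers = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines instead of treating them as records
        # Split on the first "=" only, so values may contain "=".
        record = dict(field.split("=", 1) for field in line.split(sep))
        records.append(record)
        for key in record:
            if key not in headers:
                headers.append(key)  # first-seen order defines column order
    # Fill every missing key with the placeholder value.
    rows = [[rec.get(h, missing) for h in headers] for rec in records]
    return headers, rows


# Hypothetical sample input, for illustration only.
sample = [
    "name=Ann||age=31",
    "age=27||city=Oslo",
    "name=Bo||city=Rome||age=40",
]

headers, rows = normalize_records(sample)
print("||".join(headers))
for row in rows:
    print("||".join(row))
```

<p>For this sample the header row is <code>name||age||city</code>, and the second record becomes <code>NULL||27||Oslo</code> because it has no <code>name</code> field.</p>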
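<h3>Edit:</h3>
<p>A version of the script that is easier to debug. Lines that cannot be converted to a dict are printed and skipped:</p>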
<pre><code>from collections import defaultdict, OrderedDict

with open('messy_data.txt') as infile, open('cleaner_data.txt', 'w') as outfile:
    # Split every line into its "key=value" fields.
    whole_data = [x.strip().split("||") for x in infile]

    # Collect the keys in the order they are first seen.
    headers = []
    for x in whole_data:
        for k in [y.split("=")[0] for y in x]:
            if k not in headers:
                headers.append(k)

    #whole_data = [dict(y.split("=") for y in x) for x in whole_data]
    # Same conversion as the one-liner above, but one line at a time so
    # that malformed lines can be printed and skipped.
    whole_data2 = []
    for x in whole_data:
        temp_list = [y.split("=") for y in x]
        try:
            temp_dict = dict(temp_list)
            whole_data2.append(temp_dict)
        except ValueError:
            # Raised when a field does not split into exactly two parts.
            print(temp_list)
            continue

    # Build one column per header, filling gaps with 'NULL'.
    output = defaultdict(list)
    for header in headers:
        for d in whole_data2:
            output[header].append(d.get(header, 'NULL'))
    output = OrderedDict((x, output.get(x)) for x in headers)
    print(output)

    # Write the header row, then transpose the columns back into rows.
    outfile.write("||".join(list(output.keys())) + "\n")
    for row in zip(*output.values()):
        outfile.write("||".join(row) + "\n")
</code></pre>
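<p>If the printed lists are hard to trace back to the file, a small helper that reports line numbers may be useful. This is only a sketch: <code>find_malformed</code> and the sample lines are hypothetical. It flags every field that does not contain exactly one <code>=</code>, which is the same condition that makes <code>dict(...)</code> fail in the script above, so blank lines are reported as well.</p>

```python
def find_malformed(lines, sep="||"):
    """Return (line_number, field) pairs for fields without exactly one '='."""
    problems = []
    for lineno, line in enumerate(lines, start=1):
        for field in line.strip().split(sep):
            # dict(field.split("=")) needs exactly two parts per field.
            if field.count("=") != 1:
                problems.append((lineno, field))
    return problems


# Hypothetical sample input, for illustration only.
sample = ["a=1||b=2", "a=3||broken", "", "c=x=y"]

problems = find_malformed(sample)
for lineno, field in problems:
    print(f"line {lineno}: {field!r}")
```

<p>To check the real file, pass the open file object instead of the sample list, for example <code>find_malformed(open('messy_data.txt'))</code>.</p>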
<p>I hope this proves useful.</p>