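<p>This looks like a job for <code>pandas</code>! Unfortunately, pandas only takes us so far, and after that we have to do some of the manipulation ourselves. The code below is neither fast nor particularly efficient, but it gets the job done.</p>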
<pre><code>import pandas as pd
import json
from collections import defaultdict
# here we import the tsv files as pandas df
f1 = pd.read_table('f1.tsv', sep=r'\s+')
f2 = pd.read_table('f2.tsv', sep=r'\s+')
# we then let pandas merge them
newframe = f1.merge(f2, how='outer', on=['gene', 'sample'])
# have pandas write them out to a json, and then read them back in as a
# python object (a list of dicts)
pythonList = json.loads(newframe.to_json(orient='records'))
newDict = {}
for d in pythonList:
    gene = d['gene']
    sample = d['sample']
    sampleDict = {'sample': sample,
                  'extras': []}
    # collects the 'other*' columns of this row, grouped by suffix
    extrasdict = defaultdict(dict)
    if gene not in newDict:
        newDict[gene] = {'gene': gene, 'samples': []}
    for key, value in d.items():
        # skip non-'other' columns and cells that were missing after the merge
        if 'other' not in key or value is None:
            continue
        else:
            suffix = key.split('other')[-1]
            if len(suffix) == 1:
                extrasdict['1'][key] = value
            else:
                extrasdict['{}'.format(suffix[0])][key] = value
    for value in extrasdict.values():
        sampleDict['extras'].append(value)
    newDict[gene]['samples'].append(sampleDict)
# one entry per gene, each holding its list of samples
newList = list(newDict.values())
print(json.dumps(newList))
</code></pre>
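<p>To see what the merge and the JSON round trip produce before the nesting loop runs, here is a small sketch on inline data. The column names <code>other1a</code> and <code>other2a</code> and all the values are made up for illustration; the real files may well look different.</p>

```python
import json
import pandas as pd

# Hypothetical toy data standing in for f1.tsv and f2.tsv.
f1 = pd.DataFrame({
    'gene': ['g1', 'g1', 'g2'],
    'sample': ['s1', 's2', 's1'],
    'other1a': [1, 2, 3],
})
f2 = pd.DataFrame({
    'gene': ['g1', 'g2', 'g3'],
    'sample': ['s1', 's1', 's1'],
    'other2a': [10, 30, 50],
})

# An outer merge keeps every (gene, sample) pair found in either frame;
# cells with no match are filled with NaN.
merged = f1.merge(f2, how='outer', on=['gene', 'sample'])

# to_json writes NaN as null, so json.loads hands it back as None.
# That is why the nesting loop checks "value is None".
records = json.loads(merged.to_json(orient='records'))
```

<p>Each element of <code>records</code> is a flat dict with one key per column, which is the shape the loop above expects.</p>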
<p>If this looks like a solution that will work for you, I'm happy to spend some time cleaning it up to make it a bit more readable and efficient.</p>
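<p>As a rough idea of what such a cleanup might look like, the nesting step could be pulled into a function that works on the list of dicts. This is only a sketch: <code>nest_records</code> is a hypothetical name, the sample rows are invented, and the grouping rule is copied from the loop above, except that an empty suffix is also put into group <code>'1'</code> instead of raising an error.</p>

```python
import json
from collections import defaultdict


def nest_records(records):
    """Group flat records into one entry per gene, with per-sample extras."""
    genes = {}
    for rec in records:
        entry = genes.setdefault(rec['gene'],
                                 {'gene': rec['gene'], 'samples': []})
        extras = defaultdict(dict)
        for key, value in rec.items():
            # Only 'other*' columns with a real value become extras.
            if 'other' not in key or value is None:
                continue
            suffix = key.split('other')[-1]
            # A suffix of at most one character goes to group '1';
            # otherwise the first character names the group.
            group = suffix[0] if len(suffix) > 1 else '1'
            extras[group][key] = value
        entry['samples'].append({'sample': rec['sample'],
                                 'extras': list(extras.values())})
    return list(genes.values())


# Invented rows in the shape produced by json.loads(df.to_json(orient='records')).
rows = [
    {'gene': 'g1', 'sample': 's1', 'other1a': 1, 'other2a': 10},
    {'gene': 'g1', 'sample': 's2', 'other1a': 2, 'other2a': None},
]
nested = nest_records(rows)
print(json.dumps(nested))
```

<p>Keeping the function free of pandas makes it easy to test on hand-written rows like the ones above.</p>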
<p>PS: If you like R, then pandas is the way to go (it was created to give Python an R-like interface to data).</p>