How to recast a nested dict into a long-format Pandas DataFrame

Posted 2024-10-03 21:31:09


I am trying to create a box plot with multiple data series and categories, like the one in the linked question.

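My data is spread over several files, each holding one series (e.g. "high" and "low"). Each file contains a few thousand lines of tuples, each with a sequence string and an integer (written as a quoted string), for example: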

('HHFRVEHAVAEGAK', '3')
('MPHGYDTQVGER', '3')
('MPHGYDTQVGER', '3')
('MPHGYDTQVGER', '3')
('KYNYVAMDTEFPGVVARPIGEFR', '3')
('KYNYVAMDTEFPGVVARPIGEFR', '3')
('KYNYVAMDTEFPGVVARPIGEFR', '3')
('IKEEAVKEKSPSLGK', '3')
('ALLHTVTSILPAEPEAE', '2')
('VAVPTGPTPLDSTPPGGAPHPLTGQEEARAVEK', '5')

I want to plot the distribution of character occurrences in these sequences.

import collections
from ast import literal_eval

import pandas as pd

class MyObj(object):

    __slots__ = ['name', 'seqs', 'charges']

    def __init__(self, name, tuples):
        self.name = name
        self.seqs = set()

        seqs, zs = zip(*tuples)
        self.seqs.update(seqs)
        #self.charges = collections.Counter(zs)
        self.charges = zs

data = {}
inf = ['high_corr.txt', 'low_corr.txt']
names = ['high', 'low']
for i, somefile in enumerate(inf):
    with open(somefile, 'r') as f:
        entries = [literal_eval(line.strip()) for line in f]
        index = names[i] if names else f"File{i}"
        data[index] = MyObj(index, entries)

def getCounts(seq):
    # Map each character of the sequence to its number of occurrences
    c = collections.Counter(seq)
    return {aa: c[aa] for aa in seq}

d = {name: [getCounts(s) for s in pc.seqs] for name, pc in data.items()} # <- tried dict comprehension as well
df = pd.DataFrame.from_dict(d, orient='index')
df = df.transpose()

After reading the files, I end up with a DataFrame that has one column per file, where every cell holds a dict of character counts.

The problem is that I cannot get at the individual characters: each cell is read as a dict, so the values cannot be plotted.

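Is there a way to split the letters out into a third column, as in the example from the linked question? To restate, what I want is a box plot with the letters on the x-axis and two boxes per letter (high and low).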


1 answer
User
#1 · Posted 2024-10-03 21:31:09

I am not sure this is the best approach, but a list comprehension is one possibility:

import string

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Simulate your data: three dicts of random per-letter counts for each series
def random_counts():
    return dict(zip(string.ascii_uppercase, np.random.randint(1, 27, size=26)))

d = {'high': [random_counts() for _ in range(3)],
     'low': [random_counts() for _ in range(3)]}
df = pd.DataFrame(d)
print(df.head())

# “Unpivots” your data
l = [(col, letter, count) 
     for col, series in df.items() 
     for _, dd in series.to_dict().items() 
     for letter, count in dd.items()]
new_df = pd.DataFrame(l)
new_df.columns = ['variable', 'letter', 'count']
print(new_df.head())

# Boxplot with seaborn
sns.boxplot(x='letter',y='count',data=new_df,hue='variable')
plt.show()
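
As an alternative to the list comprehension, the same unpivot can be sketched with pandas reshaping calls only. This is not the answerer's method, just a minimal illustration on a tiny hand-made frame; the names `wide`, `cells`, `letters` and `long_df` and all the counts are made up for the example.

```python
import pandas as pd

# Hypothetical stand-in for the question's frame: one column per series,
# each cell a dict of per-letter counts (values invented for illustration).
wide = pd.DataFrame({
    "high": [{"A": 2, "B": 1}, {"A": 1, "C": 3}],
    "low": [{"A": 4}, {"B": 2, "C": 1}],
})

# Step 1: one row per cell; "variable" is the series name, "counts" the dict.
cells = wide.melt(var_name="variable", value_name="counts")

# Step 2: expand each dict into one column per letter (absent letters -> NaN).
letters = pd.DataFrame(cells["counts"].tolist(), index=cells.index)

# Step 3: melt the letter columns and drop the NaN padding from step 2.
long_df = (
    pd.concat([cells[["variable"]], letters], axis=1)
    .melt(id_vars="variable", var_name="letter", value_name="count")
    .dropna(subset=["count"])
    .astype({"count": int})
    .reset_index(drop=True)
)
print(long_df)
```

The resulting `long_df` has the same three columns as `new_df` above, so it can be passed to `sns.boxplot` in the same way. For frames with many distinct letters the intermediate `letters` table is mostly NaN, which is one reason the plain list comprehension may still be the simpler choice.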

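For the larger problem you describe, I think it may be better to "unpivot" before creating the DataFrame, i.e. to use a list comprehension instead of the dict comprehension on the line you commented. I don't have your data, so I can only guess that it would look something like this: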

d = [(name, letter, count)
     for name, pc in data.items()
     for s in pc.seqs
     for letter, count in getCounts(s).items()]  # .items() yields (letter, count) pairs
df = pd.DataFrame(d)
df.columns = ['variable', 'letter', 'count']
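
To make the guess above checkable without the original files, here is a self-contained sketch that builds the long-format frame directly from a few of the tuples shown in the question. Which sequences belong to "high" and which to "low" is an assumption made purely for illustration, and `samples` and `to_long` are hypothetical names that do not appear in the original code.

```python
import collections

import pandas as pd

# A few (sequence, charge) tuples copied from the question. Their split into
# "high" and "low" is assumed; the question reads them from two files.
samples = {
    "high": [("HHFRVEHAVAEGAK", "3"), ("MPHGYDTQVGER", "3"),
             ("MPHGYDTQVGER", "3")],
    "low": [("IKEEAVKEKSPSLGK", "3"), ("ALLHTVTSILPAEPEAE", "2")],
}

def to_long(samples):
    """Return one row per (series, unique sequence, letter) with its count."""
    rows = [
        (name, letter, count)
        for name, tuples in samples.items()
        # De-duplicate sequences, mirroring the set stored in MyObj.seqs;
        # sorting only makes the row order reproducible.
        for seq in sorted({s for s, _ in tuples})
        for letter, count in collections.Counter(seq).items()
    ]
    return pd.DataFrame(rows, columns=["variable", "letter", "count"])

long_df = to_long(samples)
print(long_df.head())
```

Each row holds the count of one letter in one unique sequence, so a call such as `sns.boxplot(x='letter', y='count', hue='variable', data=long_df)` should draw one box per letter and series. Letters that never occur in a sequence produce no row at all; if zero counts ought to appear in the boxes, they would have to be added explicitly.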
