Python: building a matrix (2D array) from the values in a DataFrame (ignoring NaN values)

Posted 2024-09-30 16:37:15

I have a DataFrame with 12 columns (drug categories), where the same value (a drug category name) may appear in different columns:

                             DRG01                     DRG02  ...   DRG11 DRG12
0          AMOXYCILLIN ORAL SOLIDS   AMOEBICIDES ORAL SOLIDS  ...   NaN   NaN
1                    VITAMIN DROPS                       NaN  ...   NaN   NaN
2          AMOXYCILLIN ORAL SOLIDS   ANTIHISTAMINES ORAL LIQ  ...   NaN   NaN
3          AMOEBICIDES ORAL LIQUID                       NaN  ...   NaN   NaN
...                            ...                       ...  ...   ...   ...
81531                          NaN                       NaN  ...   NaN   NaN
[81532 rows x 12 columns]

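My goal is to create a matrix (2D array) whose rows and columns are the unique drug category names (ignoring/dropping NaN values). Each cell holds the number of times the two drug category names appear together in the same row. Basically, I am trying to achieve the following: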

                        AMOXYCILLIN ORAL SOLIDS  AMOEBICIDES ORAL SOLIDS  ANTIHISTAMINES ORAL LIQ  VITAM..
AMOXYCILLIN ORAL SOLIDS      0                         1                       1                    0
AMOEBICIDES ORAL SOLIDS      1                         1                       0                    0
ANTIHISTAMINES ORAL LIQ      1                         0                       0                    0
VITAMIN DROPS                0                         0                       0                    1
.....
.....

2 answers

Using itertools.combinations and a few pandas functions, you can do this quite neatly:

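from itertools import combinations
import pandas as pd

# One row per sorted pair of drugs that share a row of df (NaN dropped first, since NaN cannot be sorted with strings).
pairs = [sorted(p) for _, row in df.iterrows() for p in combinations(row.dropna(), 2)]
pairs_df = pd.DataFrame(pairs, columns=[0, 1])  # pairs_df has a row for every pair of drugs (in columns 0, 1).
pairs_df["occurrences"] = 1
pairs_df = pairs_df.groupby([0, 1]).sum()  # Group identical combinations and count occurrences.
# Pivot to create the requested shape; each pair is stored once, under its alphabetically first name as the row label.
result_df = pairs_df.reset_index(level=1).pivot(columns=1, values="occurrences")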
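
To make the pair-counting idea concrete, here is a minimal sketch on a made-up four-row sample shaped like the table in the question. The sample data and the helper name `cooccurrence_from_pairs` are assumptions for illustration, not part of the original answer. It counts each unordered pair with a `Counter` and then mirrors the counts so that the matrix is symmetric, which is roughly what the question asks for.

```python
from collections import Counter
from itertools import combinations

import numpy as np
import pandas as pd


def cooccurrence_from_pairs(frame: pd.DataFrame) -> pd.DataFrame:
    """Count how often two distinct values share a row (hypothetical helper)."""
    pair_counts = Counter()
    all_names = set()
    for _, row in frame.iterrows():
        # Drop NaN cells and duplicates so every drug counts once per row.
        drugs = sorted(set(row.dropna()))
        all_names.update(drugs)
        pair_counts.update(combinations(drugs, 2))

    names = sorted(all_names)
    matrix = pd.DataFrame(0, index=names, columns=names)
    for (first, second), count in pair_counts.items():
        matrix.loc[first, second] = count
        matrix.loc[second, first] = count  # mirror so the matrix is symmetric
    return matrix


# Made-up sample in the shape of the question's table (assumption).
sample = pd.DataFrame(
    {
        "DRG01": ["AMOXYCILLIN ORAL SOLIDS", "VITAMIN DROPS",
                  "AMOXYCILLIN ORAL SOLIDS", "AMOEBICIDES ORAL LIQUID"],
        "DRG02": ["AMOEBICIDES ORAL SOLIDS", np.nan,
                  "ANTIHISTAMINES ORAL LIQ", np.nan],
    }
)
pair_matrix = cooccurrence_from_pairs(sample)
print(pair_matrix)
```

In this sketch the diagonal stays at 0, and a drug that never shares a row with another one (such as VITAMIN DROPS in the sample) still gets a row and a column, filled with zeros.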

Like this:

from collections import Counter
from collections import defaultdict as dd
import pandas as pd

# For every drug, count how many times it appears in the same row as every other drug.
connection_counter = dd(Counter)

def to_counter(row):
    """Add a connection between each drug in the row and all other drugs in that row."""
    drugs = row.dropna()  # ignore NaN cells
    for drug_name in drugs:
        connection_counter[drug_name].update(drugs)
        connection_counter[drug_name].pop(drug_name, None)  # so it won't count an appearance with itself

df.apply(to_counter, axis=1)  # df is the table you have

# The table you want: one row per drug, one column per co-occurring drug, 0 where a pair never occurs.
# DataFrame.append was removed in pandas 2.0, so the frame is built in one step instead of in a loop.
df1 = pd.DataFrame.from_dict(connection_counter, orient="index").fillna(0).astype(int)
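
As a further alternative that is not taken from either answer, the same matrix can be sketched with an indicator (one-hot) table and a matrix product: if `indicator[r, d]` is 1 when drug `d` appears in row `r`, then `indicator.T @ indicator` counts the rows in which two drugs appear together. The helper name `cooccurrence_matrix`, the temporary column names `row_id` and `drug`, and the sample data are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def cooccurrence_matrix(frame: pd.DataFrame) -> pd.DataFrame:
    """Build a symmetric co-occurrence matrix (hypothetical helper)."""
    # Long format: one (row_id, drug) record per non-NaN cell.
    long = (
        frame.rename_axis("row_id")
        .reset_index()
        .melt(id_vars="row_id", value_name="drug")
        .dropna(subset=["drug"])
    )
    # Indicator table: 1 if the drug appears in the row at least once, else 0.
    indicator = pd.crosstab(long["row_id"], long["drug"]).clip(upper=1)
    values = indicator.to_numpy()
    # Entry (i, j) is the number of rows that contain both drug i and drug j.
    counts = values.T @ values
    np.fill_diagonal(counts, 0)  # a drug is not counted as co-occurring with itself
    return pd.DataFrame(counts, index=indicator.columns, columns=indicator.columns)


# Made-up sample in the shape of the question's table (assumption).
sample_df = pd.DataFrame(
    {
        "DRG01": ["AMOXYCILLIN ORAL SOLIDS", "VITAMIN DROPS",
                  "AMOXYCILLIN ORAL SOLIDS", "AMOEBICIDES ORAL LIQUID"],
        "DRG02": ["AMOEBICIDES ORAL SOLIDS", np.nan,
                  "ANTIHISTAMINES ORAL LIQ", np.nan],
    }
)
cooc = cooccurrence_matrix(sample_df)
print(cooc)
```

This version avoids a Python-level loop over the rows, which may matter for a table with about 80,000 rows, at the cost of materialising a rows-by-drugs indicator table in memory.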
