Pairwise matrix from a pandas DataFrame

           Al01  BBR60  CA07  NL219
AAEAMEVAT    MP    NaN    MP     MP
AAFEDLRLL   NaN    NaN   NaN    NaN
AAGAAVKGV    NP    NaN    NP     NP
ADRGLLRDI   NaN     NP   NaN    NaN
AEIMKICST   PB1    NaN   NaN    PB1
AFDERRAGK   NaN    NaN    NP     NP
AFDERRAGK    NP    NaN   NaN    NaN

2 answers

User

#1 · edited 2024-05-19 12:51:44

This is just matrix multiplication:

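import pandas as pd
# Whitespace-delimited file; the first column becomes the index.
df = pd.read_csv('data.csv', index_col=0, sep=r'\s+')
# 1 where a value is present, 0 where it is NaN (newer pandas: DataFrame.map).
df2 = df.applymap(lambda x: int(not pd.isnull(x)))
# Entry (a, b) counts rows where columns a and b are both non-null.
print(df2.T.dot(df2))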

Output:

       Al01  BBR60  CA07  NL219
Al01      4      0     2      3
BBR60     0      1     0      0
CA07      2      0     3      3
NL219     3      0     3      4
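To see why a matrix product gives the pairwise counts, here is a minimal sketch that rebuilds the sample table in memory and cross-checks one entry by counting rows directly. The `peptide` header label and the variable names are assumptions made for illustration; they are not in the question. The product is taken on the underlying NumPy array, so no label alignment is involved (the sample has a repeated row label).

```python
import io
import pandas as pd

# Sample rows from the question. The "peptide" header label is hypothetical;
# it is added only so that the first field of every row has a column name.
raw = """peptide Al01 BBR60 CA07 NL219
AAEAMEVAT MP NaN MP MP
AAFEDLRLL NaN NaN NaN NaN
AAGAAVKGV NP NaN NP NP
ADRGLLRDI NaN NP NaN NaN
AEIMKICST PB1 NaN NaN PB1
AFDERRAGK NaN NaN NP NP
AFDERRAGK NP NaN NaN NaN
"""
df = pd.read_csv(io.StringIO(raw), sep=r"\s+", index_col=0)

# m[i, j] is 1 when row i has a value in column j, else 0.
m = df.notna().to_numpy().astype(int)

# (m.T @ m)[j, k] = sum over rows i of m[i, j] * m[i, k],
# i.e. the number of rows where columns j and k are both non-null.
counts = pd.DataFrame(m.T @ m, index=df.columns, columns=df.columns)

# Cross-check one entry by counting the qualifying rows directly.
direct = int((df["Al01"].notna() & df["NL219"].notna()).sum())

print(counts)
print(direct == counts.loc["Al01", "NL219"])
```

The diagonal holds the number of non-null values in each column, since every column co-occurs with itself wherever it has a value.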

User

#2 · edited 2024-05-19 12:51:44

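The operation you are performing can be expressed as an application of np.einsum, which here computes the inner product between each pair of columns: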

import numpy as np
import pandas as pd

df = pd.read_table('data', sep=r'\s+')
print(df)
#   Al01 BBR60 CA07 NL219
# 0   MP   NaN   MP    MP
# 1  NaN   NaN  NaN   NaN
# 2   NP   NaN   NP    NP
# 3  NaN    NP  NaN   NaN
# 4  PB1   NaN  NaN   PB1
# 5  NaN   NaN   NP    NP
# 6   NP   NaN  NaN   NaN

arr = (~df.isnull()).values.astype('int')
print(arr)
# [[1 0 1 1]
#  [0 0 0 0]
#  [1 0 1 1]
#  [0 1 0 0]
#  [1 0 0 1]
#  [0 0 1 1]
#  [1 0 0 0]]

result = pd.DataFrame(np.einsum('ij,ik', arr, arr),
                      columns=df.columns, index=df.columns)
print(result)

which yields

       Al01  BBR60  CA07  NL219
Al01      4      0     2      3
BBR60     0      1     0      0
CA07      2      0     3      3
NL219     3      0     3      4
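The subscript string 'ij,ik' passed to np.einsum above omits the output indices. In that implicit mode the indices that appear only once (here j and k) are kept in alphabetical order and the repeated index i is summed over, so the call should be equivalent to 'ij,ik->jk' and to the plain product arr.T @ arr. A small sketch that checks this on the 0/1 array printed in the answer (the variable names are illustrative):

```python
import numpy as np

# The 0/1 presence array printed in the answer above.
arr = np.array([[1, 0, 1, 1],
                [0, 0, 0, 0],
                [1, 0, 1, 1],
                [0, 1, 0, 0],
                [1, 0, 0, 1],
                [0, 0, 1, 1],
                [1, 0, 0, 0]])

implicit = np.einsum('ij,ik', arr, arr)      # output indices inferred as 'jk'
explicit = np.einsum('ij,ik->jk', arr, arr)  # same contraction, spelled out
matmul = arr.T @ arr                         # ordinary matrix product

# All three should describe the same 4x4 matrix of pairwise counts.
print(np.array_equal(implicit, explicit))
print(np.array_equal(implicit, matmul))
```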

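In general, when a computation boils down to a numerical operation that does not depend on the index, NumPy is faster than pandas. That appears to be the case here: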

In [130]: %timeit df2 = df.applymap(lambda x: int(not pd.isnull(x)));  df2.T.dot(df2)
1000 loops, best of 3: 1.12 ms per loop

In [132]: %timeit arr = (~df.isnull()).values.astype('int'); pd.DataFrame(np.einsum('ij,ik', arr, arr), columns=df.columns, index=df.columns)
10000 loops, best of 3: 132 µs per loop
