Determine whether a DataFrame with missing values is a subset of another DataFrame

1 answer

User

#1 · Posted 2024-10-04 07:39:07

Thanks to @user202729 for suggesting that I look into the maximum matching problem. This is the solution I ended up using.

TL;DR:

import pandas as pd
import numpy as np
from scipy.sparse.csgraph import maximum_bipartite_matching
from scipy.sparse import csr_matrix

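def is_match(df_partial, df_full):
    """Return True if every row of df_partial can be paired with a distinct row of df_full."""
    full = df_full.to_numpy()
    partial = df_partial.to_numpy()
    nans = df_partial.isna().to_numpy()
    # matches[i, j, k]: column k of full row i equals partial row j, or partial row j is NaN there
    matches = (full[:, np.newaxis, :] == partial) | nans
    # adjacency_matrix[i, j]: full row i is compatible with partial row j
    adjacency_matrix = matches.all(axis=2)
    # One entry per partial row: index of the matched full row, or -1 if unmatched
    matching = maximum_bipartite_matching(csr_matrix(adjacency_matrix))
    return (matching >= 0).all()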

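Below, I walk through these steps in more detail, using the first example given in the question.

First, we create a matrix whose element (i, j) is True if row i of df_full matches row j of df_partial, and False otherwise. A NaN in df_partial is treated as a wildcard that matches any value.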

full = df_full.to_numpy()
partial = df_partial.to_numpy()
nans = df_partial.isna().to_numpy()

# Use numpy broadcasting to get a pairwise row comparison
matches = (full[:, np.newaxis, :] == partial) | nans
adjacency_matrix = matches.all(axis=2)

[[ True  True]
 [ True  True]
 [False False]]
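
To make the shapes in the broadcasting step concrete, here is a minimal sketch on made-up data. The two frames below are hypothetical and are not the ones from the question; only the comparison logic is taken from the answer.

```python
import numpy as np
import pandas as pd

# Hypothetical data, chosen only for illustration.
df_full = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})
df_partial = pd.DataFrame({"a": [1, np.nan], "b": [np.nan, 4]})

full = df_full.to_numpy()            # shape (3, 2)
partial = df_partial.to_numpy()      # shape (2, 2)
nans = df_partial.isna().to_numpy()  # shape (2, 2)

# full[:, np.newaxis, :] has shape (3, 1, 2). Comparing it with partial (2, 2)
# broadcasts to (3, 2, 2): full row i against partial row j, column by column.
matches = (full[:, np.newaxis, :] == partial) | nans

# Collapse the column axis: a pair of rows matches only if every column matches.
adjacency_matrix = matches.all(axis=2)  # shape (3, 2)

print(matches.shape, adjacency_matrix.shape)
print(adjacency_matrix)
```

Rows of the result correspond to rows of df_full and columns to rows of df_partial, which is the orientation the matching step relies on.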

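We can treat this as the adjacency matrix of a bipartite graph whose vertices are the rows of the two DataFrames and whose edges connect matching rows. We want to know whether every row in df_partial can be matched to a distinct row in df_full. A more general question is: what is the maximum number of rows in df_partial that we can match?

This is known as the maximum bipartite matching problem, and it can be solved with the Hopcroft–Karp algorithm. As far as I know, this is the most efficient way to solve it. There is an implementation in scipy: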

from scipy.sparse.csgraph import maximum_bipartite_matching
from scipy.sparse import csr_matrix


matching = maximum_bipartite_matching(csr_matrix(adjacency_matrix))

[0 1]
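
As a rough illustration of how the returned array is laid out, the sketch below runs the same scipy call on a hypothetical adjacency matrix (not derived from the question's data) in which two df_partial rows compete for a single df_full row. With `perm_type="row"`, which is the default, the result has one entry per column of the input.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import maximum_bipartite_matching

# Hypothetical adjacency matrix: rows are df_full rows, columns are df_partial rows.
# Both columns are compatible with full row 0 only.
adjacency_matrix = np.array([[True, True],
                             [False, False],
                             [False, False]])

# One entry per column: the index of the matched row, or -1 if the column is unmatched.
matching = maximum_bipartite_matching(csr_matrix(adjacency_matrix), perm_type="row")

print(matching)
print("matched partial rows:", int((matching >= 0).sum()))
```

Only one of the two columns can claim row 0, so exactly one entry is -1. Which column ends up unmatched depends on the algorithm's traversal order, so the code should not rely on it.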

The scipy function maximum_bipartite_matching uses -1 to mark vertices that could not be matched, so if the result contains no -1 values, df_partial is a "subset" of df_full.

is_subset = (matching >= 0).all()

True
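
The following sketch shows why a matching is needed rather than a simpler per-row check. It assumes hypothetical data, and `naive_check` is an illustrative helper that is not part of the answer: it only asks whether each df_partial row is compatible with some df_full row, so it cannot detect two partial rows claiming the same full row.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import maximum_bipartite_matching

def adjacency(df_partial, df_full):
    # adjacency[i, j] is True when full row i is compatible with partial row j.
    full = df_full.to_numpy()
    partial = df_partial.to_numpy()
    nans = df_partial.isna().to_numpy()
    return ((full[:, np.newaxis, :] == partial) | nans).all(axis=2)

def is_match(df_partial, df_full):
    # Same logic as the answer: every partial row needs its own full row.
    matching = maximum_bipartite_matching(csr_matrix(adjacency(df_partial, df_full)))
    return bool((matching >= 0).all())

def naive_check(df_partial, df_full):
    # Hypothetical helper: ignores that a full row can be used only once.
    return bool(adjacency(df_partial, df_full).any(axis=0).all())

# Hypothetical data, chosen only for illustration.
df_full = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
ok = pd.DataFrame({"a": [1, np.nan], "b": [np.nan, 4]})  # rows map to full rows 0 and 1
clash = pd.DataFrame({"a": [1, 1], "b": [np.nan, 3]})    # both rows need full row 0

print(is_match(ok, df_full), naive_check(ok, df_full))
print(is_match(clash, df_full), naive_check(clash, df_full))
```

For `clash`, the naive check accepts the input even though df_full contains only one row with a == 1, whereas the matching-based check rejects it.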
