Finding the intersection of words in DataFrame strings, whole words only

Bus #  DESCRIPTION
Bus1   RICE MILLS MANUFACTURER
Bus2   LICORICE CANDY RETAIL
Bus3   LICORICE CANDY WHOLESALE
Bus4   RICE RETAIL

df[df['DESCRIPTION'].str.contains(df['DESCRIPTION'][0].split()[0])]
df[df['DESCRIPTION'].str.contains(df['DESCRIPTION'][0].split()[1])]
df[df['DESCRIPTION'].str.contains(df['DESCRIPTION'][0].split()[2])]
df[df['DESCRIPTION'].str.contains(df['DESCRIPTION'][1].split()[0])]
df[df['DESCRIPTION'].str.contains(df['DESCRIPTION'][1].split()[1])]
df[df['DESCRIPTION'].str.contains(df['DESCRIPTION'][1].split()[2])]
df[df['DESCRIPTION'].str.contains(df['DESCRIPTION'][2].split()[0])]
df[df['DESCRIPTION'].str.contains(df['DESCRIPTION'][2].split()[1])]
df[df['DESCRIPTION'].str.contains(df['DESCRIPTION'][2].split()[2])]
df[df['DESCRIPTION'].str.contains(df['DESCRIPTION'][3].split()[0])]
df[df['DESCRIPTION'].str.contains(df['DESCRIPTION'][3].split()[1])]
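
The `str.contains` calls above do substring matching, so searching for `RICE` also hits `LICORICE`. Here is a minimal sketch of the whole-word variant, assuming the sample table is loaded into a DataFrame named `df` with columns `Bus #` and `DESCRIPTION` (that construction is an assumption for illustration). It wraps the escaped word in `\b` word boundaries:

```python
import re

import pandas as pd

# Assumed reconstruction of the sample data from the question.
df = pd.DataFrame({
    "Bus #": ["Bus1", "Bus2", "Bus3", "Bus4"],
    "DESCRIPTION": [
        "RICE MILLS MANUFACTURER",
        "LICORICE CANDY RETAIL",
        "LICORICE CANDY WHOLESALE",
        "RICE RETAIL",
    ],
})

word = "RICE"

# Plain substring search: LICORICE contains RICE, so every row matches.
substring_hits = df[df["DESCRIPTION"].str.contains(word, regex=False)]

# Whole-word search: \b anchors the match at word boundaries.
pattern = r"\b" + re.escape(word) + r"\b"
whole_word_hits = df[df["DESCRIPTION"].str.contains(pattern, regex=True)]

print(substring_hits["Bus #"].tolist())
print(whole_word_hits["Bus #"].tolist())
```

This still needs one call per word, which is why the answers below compare sets of words instead.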

2 Answers

User

Answer 1 · edited 2024-09-30 08:36:48

This is still O(n^2), but it is highly vectorized.

import numpy as np
import pandas as pd

# get values of DESCRIPTION as a NumPy string array
s = df.DESCRIPTION.values.astype(str)

# split each string on whitespace and turn the word lists into sets
sets = np.array([set(l) for l in np.char.split(s).tolist()], dtype=object)

# get upper triangle indices for all combinations of DESCRIPTION
r, c = np.triu_indices(len(sets), 1)

# sets[r] - sets[c] is a proper subset of sets[r] exactly when the two rows share a word
i = sets[r] - sets[c] < sets[r]

# keep the index pairs that intersect, then mirror them so the relation is symmetric
r, c = r[i], c[i]
r, c = np.append(r, c), np.append(c, r)

Result


# build truth matrix, initialised to False
t = np.zeros((s.size, s.size), dtype=bool)

# mark every pair of rows that shares a word
t[r, c] = True

pd.DataFrame(t, df.index, df.index)

       0      1      2      3
0  False  False  False   True
1  False  False   True   True
2  False   True  False  False
3   True   True  False  False
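
As a cross-check on the matrix above, the same pairwise test can be written with plain Python sets and `itertools.combinations`. This is only a sketch: the `descriptions` list is an assumed copy of the question's sample data, and the loop is not vectorized.

```python
from itertools import combinations

import numpy as np

# Assumed copy of the DESCRIPTION column from the question.
descriptions = [
    "RICE MILLS MANUFACTURER",
    "LICORICE CANDY RETAIL",
    "LICORICE CANDY WHOLESALE",
    "RICE RETAIL",
]

# One set of whole words per row.
word_sets = [set(d.split()) for d in descriptions]

# Symmetric truth matrix; the diagonal stays False, as in the answer.
truth = np.zeros((len(word_sets), len(word_sets)), dtype=bool)
for a, b in combinations(range(len(word_sets)), 2):
    if word_sets[a] & word_sets[b]:  # non-empty intersection
        truth[a, b] = truth[b, a] = True

print(truth)
```

Rows 0 and 3 share RICE, rows 1 and 2 share LICORICE and CANDY, and rows 1 and 3 share RETAIL, which agrees with the table above.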


User

Answer 2 · edited 2024-09-30 08:36:48

def match_word(ref_row, series):
    """
    Inputs:
        ref_row (str): the reference string
        series (pandas.Series): the strings to cross-check against ref_row
    Output:
        pandas.Series of booleans, True where a row shares a word with ref_row
    """
    # convert ref_row into a set of words; split() with no argument also drops leading and trailing whitespace
    ref_row = set(ref_row.split())
    # convert each string in the series into a set of words
    series = series.apply(lambda x: set(x.split()))
    # count the words each row has in common with the reference row
    series = series.apply(lambda x: len(x & ref_row))
    # a count greater than zero means the row shares at least one word with ref_row
    series = series > 0
    return series

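You can now call the function above with one DESCRIPTION value as ref_row and the whole DESCRIPTION column as series.

Note that this is not the most optimized algorithm, but it is a quick-and-dirty approach. It is O(n^2).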
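
One way `match_word` might be applied to every row is sketched below. The sample DataFrame and the dictionary comprehension are assumptions for illustration, not the answer's original call, and `match_word` is restated in compact form so the snippet runs on its own.

```python
import pandas as pd


def match_word(ref_row, series):
    """Return a boolean Series: True where a row shares a whole word with ref_row."""
    ref_words = set(ref_row.split())
    return series.apply(lambda text: len(set(text.split()) & ref_words) > 0)


# Assumed sample data, taken from the question's table.
df = pd.DataFrame({
    "DESCRIPTION": [
        "RICE MILLS MANUFACTURER",
        "LICORICE CANDY RETAIL",
        "LICORICE CANDY WHOLESALE",
        "RICE RETAIL",
    ]
})

# Hypothetical call pattern: one boolean column per reference row.
matches = pd.DataFrame(
    {i: match_word(ref, df["DESCRIPTION"]) for i, ref in df["DESCRIPTION"].items()}
)
print(matches)
```

Because each row is also compared with itself, the diagonal of `matches` is True. Mask it out if only matches between different rows are wanted.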
