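In pandas, check whether a main string contains any string from a list and, if so, remove the substring from the main string and add it to a new column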

Posted 2024-06-14 10:52:49

I have two dataframes:

df1=
    A    
0   Black Prada zebra leather Large   
1   green Gucci striped Canvas small   
2   blue Prada Monogram calf leather XL

df2=
    color    pattern   material     size
0   black    zebra     leather      small
1   green    striped   canvas       xl
2   yellow   checkered calf leather medium
3   orange   monogram
4   white    plain
5            pinstripe

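I want to compare the columns in df2 against df1 (controlling for inconsistent capitalization and whitespace) and, where there is a match, put the match into a new column in df1 and remove it from A. The match should be exact, so that "calf leather" is not mistakenly matched as "leather", and the result leaves only the unmatched substrings in A: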

^{pr2}$

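I tried using for loops, but my dataset is fairly large and I feel that does not make full use of pandas. I also tried contains and isin, without success. Is the only workable solution to extract the df2 columns and turn them into regular expressions? Thanks!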


1 Answer
User
#1 · Posted 2024-06-14 10:52:49

Update

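You may want to search against the columns, as below.

Here, for each search string, the code computes the largest fraction of its words that match the words in any one row of df2 (all columns of the row joined together). If that fraction exceeds a chosen threshold, the search string is removed.

I have tested this and it works, but you may need to adjust the regular expression matching for your data.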

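import re

import pandas


def perc_match(src, s):
    '''Return the fraction of words in s that are found in src'''
    # http://stackoverflow.com/a/26985301/943773
    words = set(s.lower().split())
    if not words:
        return 0.0
    # Escape each word and require whole-word, case-insensitive matches
    pattern = '|'.join(r'\b{}\b'.format(re.escape(w)) for w in words)
    r = re.compile(pattern, flags=re.I)

    # Count each word once, and divide by the word count of s
    # (not by the character length of src)
    found = {m.lower() for m in r.findall(src)}
    return len(found) / len(words)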


search = ['Black Prada zebra leather Large',
          'green Gucci striped Canvas small',
          'blue Prada Monogram calf leather XL']

d2 = {'color':['black', 'green', 'yellow', 'orange', 'white',''],
      'pattern':['zebra', 'striped', 'checkered', 'monogram', 'plain',
                 'pinstripe'],
      'material':['leather', 'canvas', 'calf leather','','',''],
      'size':['small', 'xl', 'medium','','','']}

df2 = pandas.DataFrame(d2)

# Strip whitespace and make all lower case
strip_lower = lambda x: x.strip().lower()
search = list(map(strip_lower, search))
# Apply column by column; DataFrame.applymap is deprecated in recent pandas
df2 = df2.apply(lambda col: col.map(strip_lower))

# Combine all columns to single string for each row
df2['full_str'] = df2.apply(lambda row: ' '.join(row), axis=1)

# Min percent matching
min_thresh = 0.1

# Calculate the percentage match for each row of dataframe
rm_ind = list()
for i in range(len(search)):
    s = search[i]
    # If you want you could save these `perc_matches` for later
    perc_matches = df2['full_str'].apply(perc_match, args=(s,))
    # Mark for removal if above threshold
    if perc_matches.max() > min_thresh:
        rm_ind.append(i)

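# Remove indices from `search`, highest index first so that earlier
# deletions do not shift the positions still waiting to be removed
for i in reversed(rm_ind):
    del search[i]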
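
The code above only decides which search strings to drop. It does not build the per-column output the question asks for. As a rough sketch of the regex route the asker mentions, one could build one whole-word pattern per df2 column and use `Series.str.extract` and `Series.str.replace` on df1. The helper name `column_pattern`, the `result` frame, and the choices to lower-case the extracted values and to take only the first match per column are assumptions made for illustration, not part of the original answer. The sketch also assumes every df2 column holds at least one non-empty term.

```python
import re

import pandas as pd

df1 = pd.DataFrame({'A': ['Black Prada zebra leather Large',
                          'green Gucci striped Canvas small',
                          'blue Prada Monogram calf leather XL']})

df2 = pd.DataFrame({
    'color': ['black', 'green', 'yellow', 'orange', 'white', ''],
    'pattern': ['zebra', 'striped', 'checkered', 'monogram', 'plain',
                'pinstripe'],
    'material': ['leather', 'canvas', 'calf leather', '', '', ''],
    'size': ['small', 'xl', 'medium', '', '', ''],
})


def column_pattern(values):
    """Hypothetical helper: whole-word, case-insensitive regex for one column."""
    terms = {v.strip() for v in values if v.strip()}
    # Longest terms first, so 'calf leather' is tried before 'leather'
    ordered = sorted(terms, key=lambda t: (-len(t), t))
    # Allow any run of whitespace between the words of a multi-word term
    alternation = '|'.join(r'\s+'.join(map(re.escape, t.split()))
                           for t in ordered)
    return r'(?i)\b(' + alternation + r')\b'


result = df1.copy()
for col in df2.columns:
    pat = column_pattern(df2[col])
    # First match in each row (NaN when nothing matches), normalised to lower case
    result[col] = result['A'].str.extract(pat, expand=False).str.lower()
    # Remove that first match from A, then collapse leftover whitespace
    result['A'] = (result['A'].str.replace(pat, '', n=1, regex=True)
                              .str.replace(r'\s+', ' ', regex=True)
                              .str.strip())

print(result)
```

Because the loop runs once per df2 column rather than once per row of df1, the string work stays inside pandas' vectorised `.str` methods. Rows with no match for a column (for example "Large", which is not in the size list) keep that word in A and get NaN in the new column.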
