Check whether a column value is in other columns, ignoring special characters and case

Posted 2024-09-22 14:32:58


This is a slight variation on the question "Check if column value is in other columns in pandas".

I have a DataFrame called test:

name_0        name_1    overall_name
Asda          Nan       Tesco
Asda          Nan       ASDA
LIDL 1        Asda      Lidl
AAA           Asda      ASDA
AAA           Asda      ASDA
Sainsbury     Nan       Lidl

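How can I check whether test.overall_name appears in any of the other columns (['name_0', 'name_1', etc.]), ignoring case (lower/upper) and any special characters?

So my ideal DataFrame would look like this: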

name_0        name_1    overall_name   namematch 
Asda          Nan       Tesco          no match 
Asda          Nan       ASDA           match
LIDL 1        Asda      Lidl           match
AAA           Asda      ASDA           match
AAA           Asda      ASDA           match
Sainsbury     Nan       Lidl           no match

3 answers

IIUC

Make df use a common case (lower case here). Then use boolean indexing together with np.where to check each row for a match and assign the label.

DataFrame used: (image in the original post, not preserved)

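import numpy as np

# Lower-case the candidate columns, compare them row by row with the
# lower-cased overall_name, and label rows where any column agrees
others = df.drop(columns="overall_name").apply(lambda x: x.str.lower())
df["namematch"] = np.where(others.isin(df["overall_name"].str.lower()).any(axis=1), 'match', 'nomatch')

# Alternative: make the whole df lower case first, then compare
# df = df.apply(lambda x: x.str.lower())
# df["namematch"] = np.where(df.drop(columns="overall_name").isin(df["overall_name"]).any(axis=1), 'match', 'nomatch')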

Result: (image in the original post, not preserved)
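The np.where expression above only normalizes case. As a minimal sketch of how special characters might be handled in the same row-wise style, the snippet below strips everything except the letters a-z before comparing. The helper name `normalize`, the `no match` label, and the decision to drop digits (so that "LIDL 1" matches "Lidl", as in the desired output) are assumptions, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample data from the question
test = pd.DataFrame({
    "name_0": ["Asda", "Asda", "LIDL 1", "AAA", "AAA", "Sainsbury"],
    "name_1": ["Nan", "Nan", "Asda", "Asda", "Asda", "Nan"],
    "overall_name": ["Tesco", "ASDA", "Lidl", "ASDA", "ASDA", "Lidl"],
})

def normalize(col: pd.Series) -> pd.Series:
    """Hypothetical helper: lower-case a column and keep only the letters a-z."""
    return col.astype(str).str.lower().str.replace(r"[^a-z]", "", regex=True)

others = test.drop(columns="overall_name").apply(normalize)
target = normalize(test["overall_name"])

# eq(..., axis=0) aligns target on the row index, so each column is
# compared with overall_name from the same row
test["namematch"] = np.where(others.eq(target, axis=0).any(axis=1), "match", "no match")
print(test)
```

Using `eq` with `axis=0` makes the same-row comparison explicit, which may be easier to read than passing a Series to `isin`.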

Recreate the example DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'name_0': ['Asda', 'AS-DA', 'Asda', 'LIDL1', 'AAA', 'AAA', 'Sainsbury'],
                   'name_1': [np.nan, np.nan, 'Asda', 'As da', 'Asda', 'Asda', np.nan],
                   'overall_name': ['Tesco', 'ASDA', 'Lidl1', 'ASDA', 'ASDA', 'Lid1', 'As da']})

Convert the float NaN values to strings:

df=df.fillna('nan')

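Remove the special characters "-" and " ". Note: this requires importing the standard-library re (regular expression) module.

import re

# Strip hyphens and spaces from every cell
# (on pandas >= 2.1, DataFrame.map replaces the deprecated applymap)
df = df.applymap(lambda x: re.sub(r'[- ]', '', x))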

Build a single list of all candidate names:

name_0=df['name_0'].tolist()
name_1=df['name_1'].tolist()
name_concat=name_0+name_1

Get the result:

# Note: isin() checks against the names from every row, not only the same row
df['namematch'] = df['overall_name'].str.lower().isin([x.lower() for x in name_concat])
df['namematch'] = np.where(df['namematch'], 'match', 'nomatch')
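Because `isin` receives one flat list, the check above accepts a name found in any row. If a same-row comparison is what the question needs, the two behaviours could be compared with a sketch like the one below. The helper `clean` and the variable names `global_match` and `row_match` are illustrative assumptions, and NaN is treated as an empty string here rather than the text 'nan'.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name_0": ["Asda", "AS-DA", "Asda", "LIDL1", "AAA", "AAA", "Sainsbury"],
    "name_1": [np.nan, np.nan, "Asda", "As da", "Asda", "Asda", np.nan],
    "overall_name": ["Tesco", "ASDA", "Lidl1", "ASDA", "ASDA", "Lid1", "As da"],
})

def clean(s: pd.Series) -> pd.Series:
    """Hypothetical helper: NaN -> '', drop '-' and ' ', lower-case."""
    return s.fillna("").str.replace(r"[- ]", "", regex=True).str.lower()

cleaned = df[["name_0", "name_1"]].apply(clean)
target = clean(df["overall_name"])

# Any-row check: equivalent in spirit to the flat-list isin() above
global_match = target.isin(cleaned.to_numpy().ravel())
# Same-row check: each overall_name is compared only with its own row
row_match = cleaned.eq(target, axis=0).any(axis=1)

df["namematch_any_row"] = np.where(global_match, "match", "nomatch")
df["namematch_same_row"] = np.where(row_match, "match", "nomatch")
print(df)
```

On this sample the two columns differ for the 'Lidl1' row and for the last row, whose names only appear in other rows.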

Take a look at this:

This function normalizes the values and compares them:

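import pandas as pd
import re

def match(first, second, overall):
    # Lower-case each value, replace non-letters with spaces, then trim the ends
    f = re.sub(r"[^a-zA-Z]", " ", first.lower()).strip()
    s = re.sub(r"[^a-zA-Z]", " ", second.lower()).strip()
    o = re.sub(r"[^a-zA-Z]", " ", overall.lower()).strip()
    # Return 1 if either name equals overall_name after normalization, else 0
    if f == o:
        return 1
    elif s == o:
        return 1
    else:
        return 0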

This line adds the match column by applying the function to each row:

df['match'] = df.apply(lambda x: match(x['name_0'],x['name_1'],x['overall_name']),axis=1)

The result looks like this:

     name_0     name_1  overall_name  match
  0  Asda       Nan     Tesco         0
  1  Asda       Nan     ASDA          1
  2  LIDL 1     Asda    Lidl          1
  3  AAA        Asda    ASDA          1
  4  AAA        Asda    ASDA          1
  5  Sainsbury  Nan     Lidl          0

Let me know if it works for you.
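The match function above is tied to exactly two name columns, while the question mentions ['name_0', 'name_1' etc]. As a rough sketch, the same idea could be extended to any number of such columns. The helpers `normalize` and `match_row`, the `^name_\d+$` column pattern, and the choice to delete non-letters instead of replacing them with spaces are all assumptions made for illustration.

```python
import re
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "name_0": ["Asda", "Asda", "LIDL 1", "AAA", "AAA", "Sainsbury"],
    "name_1": ["Nan", "Nan", "Asda", "Asda", "Asda", "Nan"],
    "overall_name": ["Tesco", "ASDA", "Lidl", "ASDA", "ASDA", "Lidl"],
})

def normalize(value) -> str:
    """Hypothetical helper: keep only a-z, lower-cased; non-strings (e.g. NaN) become ''."""
    return re.sub(r"[^a-z]", "", value.lower()) if isinstance(value, str) else ""

# Every column named name_<number>, however many there are
name_cols = df.filter(regex=r"^name_\d+$").columns

def match_row(row) -> int:
    """Return 1 if overall_name equals any name_* value in the same row after normalization."""
    candidates = {normalize(row[c]) for c in name_cols}
    target = normalize(row["overall_name"])
    # An empty target (missing or all special characters) never counts as a match
    return int(bool(target) and target in candidates)

df["match"] = df.apply(match_row, axis=1)
print(df)
```

Deleting the special characters outright means a value such as "AS-DA" would normalize to "asda", which replacing them with spaces would not do.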
