How do I check whether a dataframe column contains a string from another dataframe's column and return the adjacent cell in Python?

Posted 2024-09-28 23:37:14

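I have two dataframes: one contains a column of strings that I need to classify (df=data), and the other contains the possible categories and their search terms (df=categories). I want to add a column to the "data" dataframe that returns the category based on the search term. For example: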

data:

**RepairName**
A/C is not cold
flat tyre is c
the tyre needs a repair on left side
the aircon is not cold

categories:

**Category**      **SearchTerm**
A/C               aircon
A/C               A/C
Tyre              repair
Tyre              flat

Desired result for data:

**RepairName**                        **Category**
A/C is not cold                         A/C
flat tyre is c                          Tyre
the tyre needs a repair on left side    Tyre
the aircon is not cold                  A/C

I tried the following lambda function with apply. I am not sure whether my column references are in the right place:

data['Category'] = data['RepairName'].apply(lambda x: categories['Category'] if categories['SearchTerm'] in x else "")
data['Category'] = [categories['Category'] if categories['SearchTerm'] in data['RepairName'] else 0]

But I keep getting this error message:

TypeError: 'in <string>' requires string as left operand, not Series
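
The message points at `categories['SearchTerm'] in x`: a whole Series sits on the left of `in`, while `x` is a single string, and Python's substring test needs a string on the left. A minimal sketch, using the sample categories from the question, that reproduces the error and then tests one term at a time instead:

```python
import pandas as pd

categories = pd.DataFrame({
    "Category": ["A/C", "A/C", "Tyre", "Tyre"],
    "SearchTerm": ["aircon", "A/C", "repair", "flat"],
})

x = "the aircon is not cold"

# A Series on the left of `in` with a str on the right raises TypeError.
try:
    categories["SearchTerm"] in x
    raised = False
except TypeError:
    raised = True

# Testing each search term (a plain str) individually works.
hits = [term for term in categories["SearchTerm"] if term in x]
```

Here `hits` holds only the terms that occur in `x`, which can then be looked up in the Category column.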

The following gives a correct True/False for whether a search term is present, but I cannot return the category associated with that search term:

data['containName']=data['RepairName'].str.contains('|'.join(categories['SearchTerm']),case=False)

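Both of the following approaches work some of the time, but not consistently (possibly because some of my search terms are more than one word?):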

data['Category'] = [
    next((c for c, k in categories.values if k in s), None) for s in data['RepairName']] 

d = dict(zip(categories['SearchTerm'], categories['Category']))
data['CategoryCheck'] = [next((d[y] for y in x.split() if y in d), None) for x in data['RepairName']]
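
Two likely reasons these fail intermittently, offered as guesses rather than a diagnosis: `in` is case-sensitive, and the `split()` version compares only whole whitespace-separated tokens, so a multi-word search term can never match. The sketch below uses made-up rows (a hypothetical `"flat tyre"` term and a capitalised repair name) to show both effects:

```python
import pandas as pd

# Hypothetical data chosen to trigger both failure modes.
categories = pd.DataFrame({
    "Category": ["Tyre", "A/C"],
    "SearchTerm": ["flat tyre", "aircon"],
})
data = pd.DataFrame({"RepairName": ["flat tyre on left side", "Aircon is not cold"]})

# Substring scan: handles the multi-word term, but is case-sensitive.
by_substring = [
    next((c for c, k in categories.values if k in s), None)
    for s in data["RepairName"]
]

# Token lookup: a multi-word key never equals a single token.
d = dict(zip(categories["SearchTerm"], categories["Category"]))
by_token = [
    next((d[y] for y in x.split() if y in d), None)
    for x in data["RepairName"]
]

# Lowercasing both sides fixes the case problem for the substring scan.
by_substring_ci = [
    next((c for c, k in categories.values if k.lower() in s.lower()), None)
    for s in data["RepairName"]
]
```

Only the last variant classifies both rows, which suggests normalising case before matching and preferring the substring scan when terms may span several words.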


2 answers

We can use str.findall and then map:

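# Find every search term in each repair name, keep the first hit,
# then map that term to its category.
s = (
    df.RepairName.str.findall('|'.join(cat.SearchTerm.tolist()))
    .str[0]
    .map(cat.set_index('SearchTerm').Category)
)
print(s)
# 0     A/C
# 1    Tyre
# 2    Tyre
# 3     A/C
# Name: RepairName, dtype: object
df['Category'] = s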
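
One caveat with the pattern above: the search terms are joined into a regular expression unescaped and matched case-sensitively. A hedged variant, assuming the terms are meant as literal text and that case should be ignored, escapes each term with `re.escape` and looks up the lowercased match. The `lookup`, `pattern`, and `matched` names are only illustrative, and one repair name is capitalised on purpose:

```python
import re
import pandas as pd

df = pd.DataFrame({"RepairName": [
    "A/C is not cold",
    "flat tyre is c",
    "the tyre needs a repair on left side",
    "the Aircon is not cold",  # capitalised on purpose
]})
cat = pd.DataFrame({
    "Category": ["A/C", "A/C", "Tyre", "Tyre"],
    "SearchTerm": ["aircon", "A/C", "repair", "flat"],
})

# Lowercased term -> category lookup (illustrative helper name).
lookup = dict(zip(cat["SearchTerm"].str.lower(), cat["Category"]))

# Escape each term so characters such as "(" or "." are matched literally.
pattern = "(" + "|".join(re.escape(t) for t in cat["SearchTerm"]) + ")"

# First match per row, lowercased, then mapped to its category.
matched = df["RepairName"].str.extract(pattern, flags=re.IGNORECASE, expand=False)
df["Category"] = matched.str.lower().map(lookup)
```

Rows with no matching term end up as NaN in `Category`, so they are easy to spot afterwards.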

Once I made sure all columns were lowercase (for good measure I also removed hyphens and brackets), this approach worked:

print("All lowercase")
data = data.apply(lambda x: x.astype(str).str.lower())
categories = categories.apply(lambda x: x.astype(str).str.lower())

print("Remove double spacing")
data = data.replace(r'\s+', ' ', regex=True)  # raw string avoids an invalid-escape warning

print('Remove hyphens')
data["RepairName"] = data["RepairName"].str.replace('-', '')

print('Remove brackets')
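# regex=False treats the brackets as literal characters, not regex syntax
data["RepairName"] = data["RepairName"].str.replace('(', '', regex=False)
data["RepairName"] = data["RepairName"].str.replace(')', '', regex=False)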

data['Category'] = [
    next((c for c, k in categories.values if k in s), None) for s in data['RepairName']]
