Query for an exact word in a DataFrame column with str.contains

Posted 2024-09-29 20:23:55


I have a DataFrame df with the following columns:

Index(['category', 'synonyms_text', 'enabled', 'stems_text'], dtype='object')

I want only the rows whose synonyms_text contains the word food on its own, not words such as seafood. For example:

df_text= df_syn.loc[df_syn['synonyms_text'].str.contains('food')]

gives the following result (which includes seafood, foodlocker, and other unwanted entries):

           category   synonyms_text  \
130          Fishing  seafarm, seafood, shellfish, sportfish   
141   Refrigeration   coldstorage, foodlocker, freeze, fridge, ice, refrigeration   
183     Food Service  cook, fastfood, foodserve, foodservice, foodtruck, mealprep   
200       Restaurant  expresso, food, galley, gastropub, grill, java, kitchen
377         fastfood  carryout, fastfood, takeout
379  Animal Supplies  feed, fodder, grain, hay, petfood   
613            store  convenience, food, grocer, grocery, market

Then I convert the result to a list so that I can pick out only the word food:

food_l=df_text['synonyms_text'].str.split().tolist()

However, the values in the list look like this:

['carryout,', 'fastfood,', 'takeout']

So I strip the commas:

food_l= [[x.replace(",","") for x in l]for l in food_l]

Next, I pick the word food out of the list of lists:

food_l= [[l for x in l if "food"==x]for l in food_l]

After that, I drop the empty lists:

food_l= [x for x in food_l if x != []]

Finally, I flatten the list of lists to get the final result:

food_l = [item for sublist in food_l for item in sublist]

The final result is:

[['bar', 'bistro', 'breakfast', 'buffet', 'cabaret', 'cafe', 'cantina', 'cappuccino', 'chai', 'coffee', 'commissary', 'cuisine', 'deli', 'dhaba', 'dine', 'diner', 'dining', 'eat', 'eater', 'eats', 'edible', 'espresso', 'expresso', 'food', 'galley', 'gastropub', 'grill', 'java', 'kitchen', 'latte', 'lounge', 'pizza', 'pizzeria', 'pub', 'publichouse', 'restaurant', 'roast', 'sandwich', 'snack', 'snax', 'socialhouse', 'steak', 'sub', 'sushi', 'takeout', 'taphouse', 'taverna', 'tea', 'tiffin', 'trattoria', 'treat', 'treatery'], ['convenience', 'food', 'grocer', 'grocery', 'market', 'mart', 'shop', 'store', 'variety']]

@Erfan this DataFrame can be used for testing:

df= pd.DataFrame({'category':['Fishing','Refrigeration','store'],'synonyms_text':['seafood','foodlocker','food']})

Both of these return an empty result:

df_tmp=  df.loc[df['synonyms_text'].str.match('\bfood\b')]
df_tmp= df.loc[df['synonyms_text'].str.contains(pat='\bfood\b', regex= True)]

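Do you know a better way to get only the rows that contain the single word food, without going through this whole painful process? Is there a function other than contains that looks for an exact match of a value in a DataFrame?

Thanks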


1 answer
User
Answer 1 · posted 2024-09-29 20:23:55

Example DataFrame:

df = pd.DataFrame({'category':['Fishing','Refrigeration','store'],
                   'synonyms_text':['seafood','foodlocker','food']})

print(df)
        category synonyms_text
0        Fishing       seafood
1  Refrigeration    foodlocker
2          store          food # <  we want only the rows with exact "food"

There are three ways to do this:

  1. str.match
  2. str.contains
  3. str.extract (not very useful here)
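# 1: str.match tests the pattern at the start of each string
df['synonyms_text'].str.match(r'\bfood\b')
# 2: str.contains searches anywhere in each string
df['synonyms_text'].str.contains(r'\bfood\b')
# 3: expand=False makes str.extract return a Series rather than a DataFrame
df['synonyms_text'].str.extract(r'(\bfood\b)', expand=False).eq('food')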

Output

0    False
1    False
2     True
Name: synonyms_text, dtype: bool
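Two details probably explain why the attempts in the question came back empty, and why str.match and str.contains can disagree on longer strings. The sketch below uses a hypothetical Series of comma-separated synonyms (made up for illustration, modeled on the question's data). Without the r prefix, Python turns "\b" into a backspace character before the regex engine sees it, and str.match only tests the start of each string.

```python
import pandas as pd

# Hypothetical sample, modeled on the comma-separated synonyms in the question.
s = pd.Series([
    "seafood, shellfish",
    "expresso, food, galley",
    "food, grocer",
    "foodlocker",
])

# Non-raw string: "\b" is a backspace character, so nothing matches.
no_raw = s.str.contains("\bfood\b")

# Raw string: \b reaches the regex engine as a word boundary.
anywhere = s.str.contains(r"\bfood\b")

# str.match anchors at the start, so "food" later in the list is missed.
at_start = s.str.match(r"\bfood\b")

print(no_raw.tolist())
print(anywhere.tolist())
print(at_start.tolist())
```

For a column holding several synonyms per row, str.contains with a raw-string pattern is likely the variant you want.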

Finally, we pass the boolean Series to .loc to filter the DataFrame:

m = df['synonyms_text'].str.match(r'\bfood\b')
df.loc[m]

Output

  category synonyms_text
2    store          food
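If you would rather avoid regular expressions, one possible alternative is to split each cell into tokens and test for list membership. This is only a sketch: the DataFrame below is a hypothetical three-row sample based on the question's output, and it assumes the synonyms are always separated by commas.

```python
import pandas as pd

# Hypothetical sample rows, based on the output shown in the question.
df = pd.DataFrame({
    "category": ["Fishing", "Restaurant", "store"],
    "synonyms_text": [
        "seafarm, seafood, shellfish",
        "expresso, food, galley",
        "convenience, food, grocer",
    ],
})

# Split each cell on commas; every row becomes a list of raw tokens.
tokens = df["synonyms_text"].str.split(",")

# Strip the surrounding whitespace and keep rows where "food" is a whole token.
mask = tokens.apply(lambda words: "food" in [w.strip() for w in words])

result = df.loc[mask]
print(result)
```

This replaces the split, replace, filter, and flatten steps from the question with a single boolean mask, and it keeps the matching rows rather than bare lists.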

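Bonus

To match case-insensitively, put the inline flag (?i) at the start of the pattern. Recent Python versions reject a global flag that appears anywhere else in the expression.

For example:

df['synonyms_text'].str.match(r'(?i)\bfood\b')

This matches: food, Food, FOOD, fOoD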
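The same effect should also be available without an inline flag, through the case and flags parameters of str.contains. A minimal sketch on a hypothetical Series:

```python
import re
import pandas as pd

# Hypothetical values covering several capitalizations plus one non-match.
s = pd.Series(["food", "Food", "FOOD", "fOoD", "seafood"])

# case=False makes the regex search ignore case.
by_case = s.str.contains(r"\bfood\b", case=False)

# Passing re.IGNORECASE through flags is equivalent here.
by_flag = s.str.contains(r"\bfood\b", flags=re.IGNORECASE)

print(by_case.tolist())
print(by_flag.tolist())
```

Both calls keep the pattern itself free of flags, which can be easier to read when the pattern is built from a variable.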
