Labeling a DataFrame column with regular expressions

Posted 2024-10-01 07:38:14


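I have a DataFrame (original_df) with a column description, and I want to create another column, Label, by searching each description for keywords with regular expressions.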

description                 Label
fund trf 0049614823         transfers
alat transfer               transfers
data purchase via           airtime
alat pos buy                pos
alat web buy                others
atm wd rch debit money      withdrawals
alat pos buy                pos
salar alert charges         salary
mtn purchase via            airtime
top- up purchase via        airtime

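The approach I came up with is:

  1. Input: the description column and the regular expressions
  2. Search the description column for the patterns using the regular expressions
  3. Loop through description and assign a label based on the keyword found in each description
  4. Return the full DataFrame with the Label column

I tried to implement this below, but I can't get the logic right and I get a KeyError. I have tried everything I can think of and still can't work out the correct logic.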

df = original_df['description'].sample(100)

position = 0


while position < len(df):
    
    if any(re.search(r"(tnf|trsf|trtr|trf|transfer)",df[position])):
        original_df['Label'] == 'transfers'
    
    
        
    elif any(re.search(r'(airtime|data|vtu|recharge|mtn|glo|top-up)',df[position])):
         original_df['Label'] == 'aitime
    
    
    elif  any(re.search(r'(pos|web pos|)',df[position])):
        original_df['Label'] == 'pos
                   
    elif  any(re.search(r'(salary|sal|salar|allow|allowance)',df[position])):
         original_df['Label'] == 'salary'
    
                    
    elif  any(re.search(r'(loan|repayment|lend|borrow)',df[position])):
        original_df['Label'] == 'loan'
        
                    
    elif  any(re.search(r'(withdrawal|cshw|wdr|wd|wdl|withdraw|cwdr|cwd|cdwl|csw)',df[position])):
        return 'withdrawals'
    
    position += 1

    return others

                    
print(df_sample)

1 answer
User
#1 · Posted 2024-10-01 07:38:14

You can put the regex logic into a function and then apply() it to the DataFrame. That avoids the manual loop in your pseudocode.
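As for why the loop in the question fails, here is a minimal sketch of the two runtime problems I can see. It assumes a small hypothetical Series named sampled standing in for original_df['description'].sample(100); the index labels are made up for illustration. Separately, `original_df['Label'] == 'transfers'` is a comparison rather than an assignment, and `return` outside a function is a syntax error.

```python
import re
import pandas as pd

# Hypothetical stand-in for original_df['description'].sample(100):
# sampling keeps the original index labels, so they are not 0, 1, 2, ...
sampled = pd.Series(
    ["fund trf 0049614823", "alat pos buy", "mtn purchase via"],
    index=[17, 4, 42],
)

# 1. With an integer index, sampled[0] is a label lookup, and no row is labelled 0.
try:
    sampled[0]
    lookup_error = None
except KeyError as exc:
    lookup_error = exc

# 2. Positional access goes through .iloc instead.
first_description = sampled.iloc[0]

# 3. re.search returns a Match object or None; neither is iterable,
#    so wrapping the call in any() raises TypeError.
try:
    any(re.search(r"trf", first_description))
    any_error = None
except TypeError as exc:
    any_error = exc

# Testing the search result directly is enough.
is_transfer = re.search(r"trf", first_description) is not None

print(type(lookup_error).__name__, type(any_error).__name__, is_transfer)
```

So the KeyError most likely comes from indexing the sampled Series with df[position]: the position counter runs 0, 1, 2, ... while the sampled rows keep their original labels.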

Code:

import pandas as pd
df = pd.DataFrame({ 'description': [
    'fund trf 0049614823',
    'alat transfer',
    'data purchase via',
    'alat pos buy',
    'alat web buy',
    'atm wd rch debit money',
    'alat pos buy',
    'salar alert charges',
    'mtn purchase via',
    'top- up purchase via',
]})

Create a label() function based on your regex code:

import re
def label(row):
    if re.search(r'(tnf|trsf|trtr|trf|transfer)', row.description):
        result = 'transfers'
    elif re.search(r'(airtime|data|vtu|recharge|mtn|glo|top-up)', row.description):
        result = 'airtime'
    elif re.search(r'(pos|web pos)', row.description):
        result = 'pos'
    elif re.search(r'(salary|sal|salar|allow|allowance)', row.description):
        result = 'salary'
    elif re.search(r'(loan|repayment|lend|borrow)', row.description):
        result = 'loan'
    elif re.search(r'(withdrawal|cshw|wdr|wd|wdl|withdraw|cwdr|cwd|cdwl|csw)', row.description):
        result = 'withdrawals'
    else:
        result = 'other'
    return result
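Two things to be aware of with the patterns in label(): they match anywhere in the string, so a short keyword such as sal or pos also matches inside a longer word, and top-up does not match the sample row 'top- up purchase via' (there is a space after the hyphen), so that row falls through to 'other'. If that matters, one possible variant is sketched below. The RULES table, the label_text() name, the \b word boundaries and the top-\s*up pattern are my own assumptions, not part of the function above, and word boundaries also stop the keywords from matching plurals or other longer forms.

```python
import re

# Hypothetical rule table: (label, compiled pattern), checked in order,
# so the first matching rule wins, just like the if/elif chain.
# The \b word boundaries and "top-\s*up" are assumptions added for illustration.
RULES = [
    ("transfers", re.compile(r"\b(tnf|trsf|trtr|trf|transfer)\b")),
    ("airtime", re.compile(r"\b(airtime|data|vtu|recharge|mtn|glo|top-\s*up)\b")),
    ("pos", re.compile(r"\bpos\b")),
    ("salary", re.compile(r"\b(salary|sal|salar|allow|allowance)\b")),
    ("loan", re.compile(r"\b(loan|repayment|lend|borrow)\b")),
    ("withdrawals", re.compile(
        r"\b(withdrawal|cshw|wdr|wd|wdl|withdraw|cwdr|cwd|cdwl|csw)\b")),
]


def label_text(description, default="other"):
    """Return the label of the first rule whose pattern is found in description."""
    for name, pattern in RULES:
        if pattern.search(description):
            return name
    return default


print(label_text("top- up purchase via"))
print(label_text("alat web buy"))
```

Because label_text() takes a plain string, it would be applied to the column rather than to whole rows, for example df['description'].apply(label_text).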

Then apply the label() function to the rows of df:

df['label'] = df.apply(label, axis=1)
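If the DataFrame is large, a row-wise apply() can be slow. A possible alternative, offered only as a sketch, is to build one boolean mask per label with Series.str.contains and let numpy.select pick the first mask that is true for each row, which mirrors the if/elif order. The patterns dictionary below is a hypothetical name; its regexes are the same keywords as in label(), written without the capturing parentheses because str.contains warns about match groups.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"description": [
    "fund trf 0049614823",
    "alat transfer",
    "data purchase via",
    "alat pos buy",
    "alat web buy",
    "atm wd rch debit money",
    "alat pos buy",
    "salar alert charges",
    "mtn purchase via",
    "top- up purchase via",
]})

# Hypothetical mapping of label -> regex; dict order is the priority order.
patterns = {
    "transfers": r"tnf|trsf|trtr|trf|transfer",
    "airtime": r"airtime|data|vtu|recharge|mtn|glo|top-up",
    "pos": r"pos|web pos",
    "salary": r"salary|sal|salar|allow|allowance",
    "loan": r"loan|repayment|lend|borrow",
    "withdrawals": r"withdrawal|cshw|wdr|wd|wdl|withdraw|cwdr|cwd|cdwl|csw",
}

# One boolean mask per label; na=False treats missing descriptions as no match.
conditions = [
    df["description"].str.contains(regex, regex=True, na=False)
    for regex in patterns.values()
]

# np.select takes the choice of the first condition that is True for each row.
df["label"] = np.select(conditions, list(patterns.keys()), default="other")

print(df)
```

With these unchanged patterns the last sample row, 'top- up purchase via', ends up as 'other' rather than 'airtime', the same result the label() function gives for it.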
