Python: the most efficient way to categorize transactions

Published 2024-06-26 02:27:57


I have a large list of transactions that I want to categorize. It looks like this:

transactions: [
    {
        "id": "20200117-16045-0",
        "date": "2020-01-17",
        "creationTime": null,
        "text": "SuperB Vesterbro T 74637",
        "originalText": "SuperB Vesterbro T 74637",
        "details": null,
        "category": null,
        "amount": {
            "value": -160.45,
            "currency": "DKK"
        },
        "balance": {
            "value": 12572.68,
            "currency": "DKK"
        },
        "type": "Card",
        "state": "Booked"
    },
    {
        "id": "20200117-4800-0",
        "date": "2020-01-17",
        "creationTime": null,
        "text": "Rent        45228",
        "originalText": "Rent        45228",
        "details": null,
        "category": null,
        "amount": {
            "value": -48.00,
            "currency": "DKK"
        },
        "balance": {
            "value": 12733.13,
            "currency": "DKK"
        },
        "type": "Card",
        "state": "Booked"
    },
    {
        "id": "20200114-1200-0",
        "date": "2020-01-14",
        "creationTime": null,
        "text": "Superbest          86125",
        "originalText": "SUPERBEST          86125",
        "details": null,
        "category": null,
        "amount": {
            "value": -12.00,
            "currency": "DKK"
        },
        "balance": {
            "value": 12781.13,
            "currency": "DKK"
        },
        "type": "Card",
        "state": "Booked"
    }
]

I load the data like this:

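import json
import pandas as pd
from pandas import json_normalize

def load_transactions(path='transactions.json'):
    with open(path) as transactions:
        file = json.load(transactions)
    # The 'transactions' column holds the full list of records in its first row
    data = json_normalize(file)['transactions'][0]
    return pd.DataFrame(data)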

So far I have the following categories, which I want to use to group the transactions:

CATEGORIES = {
    'Groceries': ['SuperB', 'Superbest'],
    'Housing': ['Insurance', 'Rent']
}

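Now I want to loop over every row in the DataFrame and assign each transaction to a category. I want to do this by checking whether text contains one of the values in the CATEGORIES dictionary.

If it does, the transaction should be labeled with the corresponding key of the CATEGORIES dictionary, for example Groceries.

How can I do this most efficiently?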


1 answer
Answer 1 · posted 2024-06-26 02:27:57

IIUC

We can build a pipe-delimited pattern from each list of values in the dictionary and assign the category with .loc:

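for k, v in CATEGORIES.items():
    # Join the keywords into one regex alternation, e.g. 'SuperB|Superbest'
    pat = '|'.join(v)
    # Label every row whose text matches any of the keywords with the category name
    df.loc[df['text'].str.contains(pat), 'category'] = k
print(df[['text', 'category']])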
                       text   category
0  SuperB Vesterbro T 74637  Groceries
1         Rent        45228    Housing
2  Superbest          86125  Groceries
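
The loop above treats each keyword as a regular expression and matches case-sensitively. A minimal sketch of a more defensive variant follows, assuming keywords should be matched literally and regardless of case; the row with a missing text is hypothetical and only there to show the effect of na=False.

```python
import re
import pandas as pd

CATEGORIES = {
    'Groceries': ['SuperB', 'Superbest'],
    'Housing': ['Insurance', 'Rent'],
}

# First three texts come from the question; the last (missing) one is hypothetical
df = pd.DataFrame({
    'text': ['SuperB Vesterbro T 74637', 'Rent        45228',
             'Superbest          86125', None],
    'category': [None] * 4,
})

for name, keywords in CATEGORIES.items():
    # re.escape keeps characters such as '.' or '+' in a keyword from acting as regex syntax
    pat = '|'.join(re.escape(k) for k in keywords)
    # case=False ignores letter case; na=False treats missing text as "no match"
    mask = df['text'].str.contains(pat, case=False, na=False)
    df.loc[mask, 'category'] = name

print(df)
```

Rows that match no keyword, including the one with missing text, keep their original empty category.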

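A more efficient solution, which scans the column with one regular expression instead of one per category:

We build a single list of all the keywords and, at the same time, an inverted dictionary in which each keyword is a key that maps to its category. str.extract then pulls the matching keyword out of each text, and mapping that keyword through the inverted dictionary gives the category for the target DataFrame.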

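words = []
mapping_dict = {}
for k, v in CATEGORIES.items():
    for item in v:
        words.append(item)
        # Inverted lookup: keyword -> category name
        mapping_dict[item] = k

# One capture group containing every keyword as an alternative
ext = df['text'].str.extract(f"({'|'.join(words)})")
# Column 0 holds the keyword found in each row; translate it to its category
df['category'] = ext[0].map(mapping_dict)
print(df[['text', 'category']])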
                       text   category
0  SuperB Vesterbro T 74637  Groceries
1         Rent        45228    Housing
2  Superbest          86125  Groceries
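
The same idea can be written as a self-contained sketch that also covers two edge cases the sample data does not exercise. It assumes keywords are literal strings, tries longer keywords first so that a keyword is never shadowed by a shorter one that is its prefix, and labels unmatched rows with a placeholder. The 'Other' label and the 'Netflix 12345' row are hypothetical and not part of the original question.

```python
import re
import pandas as pd

CATEGORIES = {
    'Groceries': ['SuperB', 'Superbest'],
    'Housing': ['Insurance', 'Rent'],
}

# Invert the dictionary: keyword -> category
keyword_to_category = {kw: cat for cat, kws in CATEGORIES.items() for kw in kws}

# Regex alternation takes the first alternative that matches, so list longer keywords first
ordered = sorted(keyword_to_category, key=len, reverse=True)
pattern = '(' + '|'.join(re.escape(kw) for kw in ordered) + ')'

# First three texts come from the question; the last row is a hypothetical non-match
df = pd.DataFrame({'text': ['SuperB Vesterbro T 74637', 'Rent        45228',
                            'Superbest          86125', 'Netflix 12345']})

# expand=False returns a Series holding the matched keyword (NaN where nothing matched)
matched = df['text'].str.extract(pattern, expand=False)
# 'Other' is a placeholder label chosen for this sketch
df['category'] = matched.map(keyword_to_category).fillna('Other')

print(df)
```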
