替代Pandas数据帧中的嵌套for循环,用于大数据处理

2024-09-29 01:35:19 发布

您现在位置:Python中文网/ 问答频道 /正文

共有2个.csv文件-文件1(带有表头订单号、金额、佣金、GST等)和文件2(带有表头hbx号、金额、状态等)

我必须指出,文件2中的“金额”与文件1中的(“金额”–“佣金”–“GST”)相匹配。必须根据文件1中的“订单ID”和文件2中的“hbx ID”执行此操作。 我被要求将状态设置为“匹配”与匹配的项目相对应,否则设置为“不匹配”。 对于文件2中不在文件1中的项目,我应该在文件2中的状态列中输入状态“not present”

我使用嵌套for循环解决了这个问题,我的代码如下所示-

for i in file2.index:
    flag = 0
    for j in file1.index:
        if (file1['Order_ID'][j]).count(file2['hbx_id'][i])>0:
            flag = 1
            if file1['Amount'][j]-file1['Commission'][j]-file1['GST'][j]==file2['amount'][i]:
                file2.at[i,'Status'] = 'matched'
            else:
                file2.at[i,'Status'] = 'not matched'
    if flag == 0:
        file2.at[i,'Status'] = 'not present'

这很好,但我的问题是,如果数据集很大(因为for循环迭代),这将不是一个可行的选择。这个方法、pandas函数或任何其他库或其他东西有什么替代品吗

编辑:-预期输出如下所示-

       hbx_id      tran_date    amount  Status
0   AMBLEY127928'   01-09-2019  36.00   matched
1   SIPTBT18469'    01-09-2019  65.50   not matched
2   'KITJUJ272150'  01-09-2019  37.28   matched
3   'LEGNOE40184'   01-09-2019  20.00   matched
4   hb-pay_9492291-1'   01-09-2019  313.77  not matched
5   'SF3C145362'    01-09-2019  25.00   matched
6   hb-AMBLEY127931-1'  01-09-2019  75.50   not matched
7   'HEANEM26099'   01-09-2019  19.62   matched
8   hb-COOINT242360-1'  01-09-2019  44.15   not present
9   'MUNDCT3880'    01-09-2019  142.00  matched
10  'KDA178133' 01-09-2019  24.53   matched
11  'KSCA1492699'   01-09-2019  9.81    matched
12  'FOROCH372249'  01-09-2019  14.72   matched
13  'CHAMXN202625'  01-09-2019  29.43   matched
14  'DOSFSX48229'   01-09-2019  49.06   matched
15  'ARAEZG242324'  01-09-2019  25.00   matched
16  hb-SHRWOG391499-1'  01-09-2019  19.56   matched
17  'MALSJU13377'   01-09-2019  16.50   matched
18  hb-OMKXDT420930-1'  01-09-2019  10.00   matched
19  'SCCA528997'    01-09-2019  20.00   not matched

file1.head()

       Transaction_ID                          Order_ID Transaction_Date    Updated_Date    Transaction_Type    Status  MID Customer_ID Customer_Nickname   Customer_Phone_No.  ... Commission_Rate Product_Code    GMV_Tier    Transaction_Slab    Request_Type    Refund_Type Refund_Actor    Split_Flag  Split_MID   Split_Id
0   '20190901111212800100168044885072100'   hb-SIPJZO32656-1'   '2019-09-01 00:54:14'   '2019-09-01 00:54:33'   'ACQUIRING' 'SUCCESS'   'EatGoo08603428284199'  NaN NaN 9804817853' ... NaN '51051000100000000010'  NaN NaN 'SEAMLESS'  NaN NaN NaN NaN NaN
1   '20190901111212800110168956985465456'   'SHRWOG391493'  '2019-09-01 02:14:43'   '2019-09-01 02:14:46'   'ACQUIRING' 'SUCCESS'   'EatGoo08603428284199'  NaN NaN NaN ... NaN '51051000100000000001'  NaN NaN 'SEAMLESS_3D_FORM'  NaN NaN NaN NaN NaN
2   '20190901111212800110168835885639302'   'MIDYEY245461'  '2019-09-01 00:23:36'   '2019-09-01 00:23:39'   'ACQUIRING' 'SUCCESS'   'EatGoo08603428284199'  NaN NaN NaN ... NaN '51051000100000000001'  NaN NaN 'SEAMLESS_3D_FORM'  NaN NaN NaN NaN NaN
3   '20190901111212800110168061185352899'   'KITJUJ272194'  '2019-09-01 03:40:00'   '2019-09-01 03:40:03'   'ACQUIRING' 'SUCCESS'   'EatGoo08603428284199'  NaN NaN NaN ... NaN '51051000100000000001'  NaN NaN 'SEAMLESS_3D_FORM'  NaN NaN NaN NaN NaN
4   '20190901111212800110168140285515665'   'EXLRYB3995'    '2019-09-01 00:08:21'   '2019-09-01 00:08:21'   'ACQUIRING' 'SUCCESS'   'EatGoo08603428284199'  '216810000026794940140' NaN '9515053146'    ... NaN '51051000100000000001'  NaN NaN 'AUTO_DEBIT'    NaN NaN NaN NaN NaN

file2.head()

        hbx_id      tran_date   amount
0   AMBLEY127928'   01-09-2019  36.00
1   SIPTBT18469'    01-09-2019  65.50
2   'KITJUJ272150'  01-09-2019  37.28
3   'LEGNOE40184'   01-09-2019  20.00
4   hb-pay_9492291-1'   01-09-2019  313.77

Tags: 文件id状态statusnotnan金额file1
1条回答
网友
1楼 · 发布于 2024-09-29 01:35:19

首先,我会简化您的数据,以便为file1创建一个新的数据框,它只是您正在使用的列

df1 = file1[['Order_ID', 'Amount', 'Commission', 'GST']]

有了它,您就可以在第一个if语句中进行计算,并用这些值创建一个新列

df1['amount_check'] = df1['Amount'] - df1['Commission'] - df1['GST']

假设df1['Order_ID']是唯一的值,其中一些在file2['hbx_id']中,您可以合并这两列上的两个数据帧,保持所有内容,这样您就有一个数据帧,所有数据都对齐

df1 = df1.merge(file1, left_on='Order_ID', right_on='hbx_id') # might need to add more parameters...

您可以简单地创建一个新列df1['Status'],其中如果df1['amount_check'] == df1['amount'],那么您可以给出值'matched'等等

相关问题 更多 >