当索引仅存在于一个数据帧中时,两个数据帧之间存在差异

2024-09-29 06:23:25 发布

您现在位置:Python中文网/ 问答频道 /正文

我试图创建一个基于单个数据框的报告,该报告显示数据之间的数学差异

单个数据帧: single df

差异报告 diff report

我使用EvaluatePoint列将数据拆分为两个数据帧,然后使用一天结束数据帧创建一个差异报告数据帧

endofday_df = df[df.EvaluatePoint == 'EndOfDay']
intraday_df = df[df.EvaluatePoint == 'IntraDay']
report_df = endofday_df[['EvaluateDate', 'InvoiceNumber', 'InvoiceItem']]

我将所有3个上的索引设置为EvaluateDate、InvoiceNumber、InvoiceItem

endofday_df = endofday_df.set_index(['EvaluateDate', 'InvoiceNumber', 'InvoiceItem'])
intraday_df = endofday_df.set_index(['EvaluateDate', 'InvoiceNumber', 'InvoiceItem'])
report_df = endofday_df.set_index(['EvaluateDate', 'InvoiceNumber', 'InvoiceItem'])

从这里我可以创建附加的差异列

report_df['CostDiff'] = endofday_df['Cost'] - intraday_df['Cost']

除了索引(EvaluateDate、InvoiceNumber、InvoiceItem)仅存在于endofday_df或intraday_df中的情况外,这在大多数情况下都有效

  1. 这是创建报告的最佳方法吗
  2. 当我减去数据时,如何将索引不存在于两个数据帧中的情况包括在内?我是否应该在索引只存在于一个数据帧中的情况下创建另外两个数据帧

谢谢

编辑

在下面添加完整代码。希望这能让我更清楚地知道我在做什么,以及我想做什么

问题是:

  1. endofday_df-发票号123699在日内_df中不存在,因此这些返回的差异为NaN
  2. 当日日内日内日内日内日内日内日内日内日内日内日内日内日内日内日内日内日内日内日内日内日内日内日内日内日内日内日内日内日内日内日内日内日内日内日内日内日内日内日内日内日内日内

我的想法是在创建endofday_df和intrady_df之后再创建4个数据帧

  1. endofday_common_df/intrady_common_df-这些将只包含两个数据帧共用的索引(“EvaluateDate”、“InvoiceNumber”、“InvoiceItem”)。这样任何数学运算都能正常工作
  2. endofday_only_df/盘中_only_df-这些将只包含仅在endofday和仅在盘中的索引

对于如何做到这一点有点困惑,欢迎提出更好的方法

import pandas as pd

Invoices = {'EvaluatePoint': ['EndOfDay', 'EndOfDay', 'EndOfDay', 'EndOfDay', 'EndOfDay', 'EndOfDay', 'IntraDay', 'IntraDay', 'IntraDay', 'IntraDay', 'IntraDay'],
            'EvaluateDate':  ['08/06/2021','08/06/2021','08/06/2021','08/06/2021','08/06/2021','08/06/2021','08/06/2021','08/06/2021','08/06/2021','08/06/2021','08/06/2021'],
            'InvoiceNumber': [123697, 123697, 123697, 123698, 123699, 123699, 123696, 123697, 123697, 123697, 123698],
            'InvoiceItem': [0,1,2,0,0,1,0,0,1,2,0],
            'Cost': [-3569,-3745,-3921,-4097,-4273,-4449,-4625,-3569,-3745,-4678,-5329],
            'Proceeds': [7000,7569,8138,8707,9276,9845,10414,7000,7569,8138,12690],
            'NetAmount': [3431,3824,4217,4610,5003,5396,5789,3431,3824,3460,7361]
        }

df = pd.DataFrame(Invoices, columns = ['EvaluatePoint', 'EvaluateDate','InvoiceNumber','InvoiceItem','Cost','Proceeds','NetAmount'])
#print(df)

endofday_df = df[df.EvaluatePoint == 'EndOfDay']
intraday_df = df[df.EvaluatePoint == 'IntraDay']
#print(endofday_df)
#print(intraday_df)

#create base output report from endofday_df
report_df = endofday_df[['EvaluateDate','InvoiceNumber','InvoiceItem']]
#print(report_df)

df_index = ['EvaluateDate','InvoiceNumber','InvoiceItem']

endofday_df = endofday_df.set_index(df_index)
intraday_df = intraday_df.set_index(df_index)
report_df = report_df.set_index(df_index)

DiffColumns = ['Cost','Proceeds','NetAmount']

for col in DiffColumns:
    report_df['DiffOf' + str(col)] = endofday_df[col] - intraday_df[col]

print(report_df)
report_df.to_csv("DiffReport.csv",index=True, header=True)

Tags: 数据reportdfindex报告差异setintraday
2条回答

3个数据帧适合您的情况。这两个具有相同的行:endofday_common_dfintraday_common_df

通过分别执行对齐步骤,可以分隔3组行:

eod, intrad = endofday_df.align(intraday_df)

获取公共行,只需选择任何带有数字的列,不管是哪一列,这里我选择了“成本”:

common = (~eod["Cost"].isna()) & (~intrad["Cost"].isna())

那么具有公共行的报告是:

DiffColumns = ["Cost", "Proceeds", "NetAmount"]
report_df = pd.DataFrame(
    {f"DiffOf{col}": eod[common][col] - intrad[common][col] for col in DiffColumns}
)

对于给定的数据,它如下所示:

                                        DiffOfCost  DiffOfProceeds  DiffOfNetAmount
EvaluateDate InvoiceNumber InvoiceItem                                             
08/06/2021   123697        0                   0.0             0.0              0.0
                           1                   0.0             0.0              0.0
                           2                 757.0             0.0            757.0
             123698        0                1232.0         -3983.0          -2751.0

仅限EOD的行是(同样,请选择数字列):

only_in_eod = (~eod["Cost"].isna()) & (intrad["Cost"].isna())
print(eod[only_in_eod])

看起来是这样的:

                                       EvaluatePoint    Cost  Proceeds  NetAmount
EvaluateDate InvoiceNumber InvoiceItem                                           
08/06/2021   123699        0                EndOfDay -4273.0    9276.0     5003.0
                           1                EndOfDay -4449.0    9845.0     5396.0

只有当天的行是:

only_in_intrad = (eod["Cost"].isna()) & (~intrad["Cost"].isna())
print(intrad[only_in_intrad])

这看起来像:

                                       EvaluatePoint    Cost  Proceeds  NetAmount
EvaluateDate InvoiceNumber InvoiceItem                                           
08/06/2021   123696        0                IntraDay -4625.0   10414.0     5789.0

谢谢你的帮助,在让它完全工作时遇到了问题。通过进一步搜索,我最终使用了intersection()和difference()来分割索引。最终代码如下

我还尝试在reset_index和set_index上使用for循环,但它不会停留在for循环之外,我应该在这里做什么来修复它

DF_Set_Index_List = [report_same_df, report_eod_df, report_intra_df, endofday_same_df, intraday_same_df, endofday_only_df, intraday_only_df]
for df_sil in DF_Set_Index_List:
    print(df_sil,'\n\n\n')
    df_sil = df_sil.set_index(df_default_index)
    print('INDEX: \n', df_sil.index)

print('\n\n', 'Outside Loop')
print(report_same_df)
print('\n\n')
print(report_same_df.index)

关于以下最终代码、清理、更有效的方法等的任何反馈

#!/usr/bin/python

import pandas as pd
from sys import exit

Invoices = {'EvaluatePoint': ['EndOfDay', 'EndOfDay', 'EndOfDay', 'EndOfDay', 'EndOfDay', 'EndOfDay', 'IntraDay', 'IntraDay', 'IntraDay', 'IntraDay', 'IntraDay'],
            'EvaluateDate':  ['08/06/2021','08/06/2021','08/06/2021','08/06/2021','08/06/2021','08/06/2021','08/06/2021','08/06/2021','08/06/2021','08/06/2021','08/06/2021'],
            'InvoiceNumber': [123697, 123697, 123697, 123698, 123699, 123699, 123696, 123697, 123697, 123697, 123698],
            'InvoiceItem': [0,1,2,0,0,1,0,0,1,2,0],
            'Cost': [-3569,-3745,-3921,-4097,-4273,-4449,-4625,-3569,-3745,-4678,-5329],
            'Proceeds': [7000,7569,8138,8707,9276,9845,10414,7000,7569,8138,12690],
            'NetAmount': [3431,3824,4217,4610,5003,5396,5789,3431,3824,3460,7361]
        }

df = pd.DataFrame(Invoices, columns = ['EvaluatePoint', 'EvaluateDate','InvoiceNumber','InvoiceItem','Cost','Proceeds','NetAmount'])

endofday_df = df[df.EvaluatePoint == 'EndOfDay']
intraday_df = df[df.EvaluatePoint == 'IntraDay']

# Set Default Index to compare DataFrames
df_default_index = ['EvaluateDate','InvoiceNumber','InvoiceItem']
endofday_df = endofday_df.set_index(df_default_index)
intraday_df = intraday_df.set_index(df_default_index)

# Create var where index exists in both EndOfDay and IntraDay
idx_same = endofday_df.index.intersection(intraday_df.index)
# Create var where index exists only in EndOfDay or IntraDay
idx_eod_only = endofday_df.index.difference(intraday_df.index)
idx_intra_only = intraday_df.index.difference(endofday_df.index)

#create DataFrames where index is the same between EndOfDay and IntraDay
endofday_same_df = endofday_df.loc[idx_same]
intraday_same_df = intraday_df.loc[idx_same]
#create DataFrames where index is the only in EndOfDay or IntraDay
endofday_only_df = endofday_df.loc[idx_eod_only]
intraday_only_df = intraday_df.loc[idx_intra_only]

# Reset Index so we have only raw data
endofday_same_df = endofday_same_df.reset_index()
intraday_same_df = intraday_same_df.reset_index()
endofday_only_df = endofday_only_df.reset_index()
intraday_only_df = intraday_only_df.reset_index()

# Create Base Report
report_same_df = endofday_same_df[['EvaluateDate','InvoiceNumber','InvoiceItem']]
report_eod_df = endofday_only_df[['EvaluateDate','InvoiceNumber','InvoiceItem']]
report_intra_df = intraday_only_df[['EvaluateDate','InvoiceNumber','InvoiceItem']]

report_same_df = report_same_df.set_index(df_default_index)
report_eod_df = report_eod_df.set_index(df_default_index)
report_intra_df = report_intra_df.set_index(df_default_index)
endofday_same_df = endofday_same_df.set_index(df_default_index)
intraday_same_df = intraday_same_df.set_index(df_default_index)
endofday_only_df = endofday_only_df.set_index(df_default_index)
intraday_only_df = intraday_only_df.set_index(df_default_index)

DiffColumns = ['Cost','Proceeds','NetAmount']
for col in DiffColumns:
    report_same_df['DiffOf' + str(col)] = endofday_same_df[col] - intraday_same_df[col]
    report_eod_df['DiffOf' + str(col)] = endofday_only_df[col]
    report_intra_df['DiffOf' + str(col)] = -intraday_only_df[col]

report_df = pd.concat([report_same_df, report_eod_df, report_intra_df])
print(report_df)
report_df.to_csv("DiffReport.csv", index=True, header=True)

相关问题 更多 >