Python打印2个数据帧的比较结果的最快方法

2024-09-28 17:31:31 发布

您现在位置:Python中文网/ 问答频道 /正文

我有两个数据帧,每个数据帧有300万条记录。我需要比较并显示每个匹配索引、发生不匹配的列名以及这些不匹配列中的数据

下面是dataframe和预期输出的示例:

df1=pd.DataFrame({
    'Date':['2013-11-23','2013-11-24','2013-11-25','2013-11-26','2013-11-27'],
    'Fruit':['Banana','Orange','Apple','Celery','Apple'],
    'Num':[10.2,22.1,8.6,7.6,10.2],
    'Color':['Green','Yellow','Orange','Green','Green']
    })

df2=pd.DataFrame({
    'Date':['2013-11-23','2013-11-24','2013-11-25','2013-11-26','2013-11-27'],
    'Fruit':['Banana1','Orange','Apple','Celery','Apple'],
    'Num':[10.2,22.12,8.60,7.6,10.2],
    'Color':['Green','Yellow1','Orange','Green','Green']
    })

df1.set_index("Date",inplace = True)
df2.set_index("Date",inplace = True)

Dataframe 1:
             Fruit   Num   Color
Date                            
2013-11-23  Banana  10.2   Green
2013-11-24  Orange  22.1  Yellow
2013-11-25   Apple   8.6  Orange
2013-11-26  Celery   7.6   Green
2013-11-27   Apple  10.2   Green

DataFrame 2:
              Fruit    Num    Color
Date                               
2013-11-23  Banana1  10.21    Green
2013-11-24   Orange  22.12  Yellow1
2013-11-25    Apple   8.60   Orange
2013-11-26   Celery   7.60    Green
2013-11-27    Apple  10.20    Green

Expected OUTPUT:
Date 2013-11-23 has mismatch in columns ['Fruit' , 'Num'].
DF1: ['Banana',10.2]
DF2: ['Banana1', 10.21]
--------------------------------------------------------
Date 2013-11-24 has mismatch in columns ['Num' , 'Color'].
DF1: [22.1,'Yellow']
DF2: [22.12,'Yellow1']
----------------------------------------------------------
And so on for every Date index

Tags: 数据appledataframedateindexgreennumcolor
1条回答
网友
1楼 · 发布于 2024-09-28 17:31:31

将文件另存为.py并运行它

import multiprocessing as mp
import pandas as pd


def compare(data):
    df1_data, df2_data = data
    date = df1_data[0]
    d1 = df1_data[1]
    d2 = df2_data[1]

    # print(d1)
    # print(d2)
    cmp = d1 != d2
    if any(cmp):
        print('Date ', date, end=' ')

        diff_cols = cmp[cmp].index.tolist()
        d1_diff = d1[diff_cols]
        d2_diff = d2[diff_cols]

        print('has mismatch in columns', diff_cols, '.')
        print('DF1:', d1_diff.tolist())
        print('DF2:', d2_diff.tolist())
        print('-'*20)


if __name__ == "__main__":
    df1 = pd.DataFrame({
        'Date': ['2013-11-23', '2013-11-24', '2013-11-25', '2013-11-26', '2013-11-27'],
        'Fruit': ['Banana', 'Orange', 'Apple', 'Celery', 'Apple'],
        'Num': [10.2, 22.1, 8.6, 7.6, 10.2],
        'Color': ['Green', 'Yellow', 'Orange', 'Green', 'Green']
    })

    df2 = pd.DataFrame({
        'Date': ['2013-11-23', '2013-11-24', '2013-11-25', '2013-11-26', '2013-11-27'],
        'Fruit': ['Banana1', 'Orange', 'Apple', 'Celery', 'Apple'],
        'Num': [10.2, 22.12, 8.60, 7.6, 10.2],
        'Color': ['Green', 'Yellow1', 'Orange', 'Green', 'Green']
    })

    df1.set_index("Date", inplace=True)
    df2.set_index("Date", inplace=True)

    n_cpu = mp.cpu_count()
    pool = mp.Pool(n_cpu-2)
    results = pool.map(compare, zip(df1.iterrows(), df2.iterrows()))

输出

Date  2013-11-23 has mismatch in columns ['Fruit'] .
DF1: ['Banana']
DF2: ['Banana1']
          
Date  2013-11-24 has mismatch in columns ['Num', 'Color'] .
DF1: [22.1, 'Yellow']
DF2: [22.12, 'Yellow1']
          

相关问题 更多 >