Fuzzy numeric comparison join on Python/pandas DataFrames

MarketID  Origin  Dest  FltNum  DepCentral  ArrCentral  PK
1DFWSJC   DFW     SJC   444     0645        1015        A
2DFWSJC   DFW     SJC   555     1300        1630        X
3DFWSJC   DFW     SJC   666     2040        2410        B

def schdChanges(base, new):
    merged = pd.merge(base, new, on=['MktSegID'])
    diffDf = pd.DataFrame()
    diffDf['MktSegID'] = merged['MktSegID']
    diffDf['Cdep'] = merged['Cdep_y'] - merged['Cdep_x']
    diffDf['Carr'] = merged['Carr_y'] - merged['Carr_x']
    diffDf['Block'] = merged['Block_y'] - merged['Block_x']
    diffDf['Turn'] = merged['Turn_y'] - merged['Turn_x']
    return diffDf

Draft1Cap = Draft1.groupby('DirMkt')['MktSegID'].nunique()
FinalCap = Final.groupby('DirMkt')['MktSegID'].nunique()
FinalCap.subtract(Draft1Cap, fill_value=0)[FinalCap.subtract(Draft1Cap, fill_value=0) != 0]

1 Answer

User

#1 · Posted 2024-06-24 13:51:41

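You have 6 columns. Let's break them down by how useful they are in this case.

MarketID and FltNum can change arbitrarily, so they don't help us. Origin and Dest, I assume, must stay the same and cannot change, so we'll check that. DepCentral and ArrCentral are the most important; per the problem statement they can change by up to 50 minutes.

You have some complicated business logic, so there probably isn't a simple solution. So here comes the fun part: coding something up to handle that logic!

This program will find the matches; I'll leave it up to you to decide how to get the output you want.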

import pandas as pd

If you have dates, or are sure you will never need to match something at 23:55 with something at 00:04, you can simplify or replace this logic.

def time_change(old_time, new_time):
    # times are zero-padded "HHMM" strings, e.g. "0645"
    old_hrs, new_hrs = int(old_time[0:2]), int(new_time[0:2])
    old_mins, new_mins = int(old_time[2:]), int(new_time[2:])

    old_total = 60 * old_hrs + old_mins
    new_total = 60 * new_hrs + new_mins

    # note, this may make incorrect assumptions since we don't have the day.
    # If you have the day in your actual data, there are better ways of comparing the times
    diff = abs(new_total - old_total) % (24 * 60)
    # take the shorter way around midnight, so 2355 vs 0004 is 9 minutes, not 1431
    return min(diff, 24 * 60 - diff)
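For larger tables, the same HHMM-to-minutes conversion can be done on a whole column at once. The sketch below is one way to do that, assuming the times are stored as "HHMM" strings. `to_minutes` and `circular_diff` are hypothetical helper names, not part of the answer's code, and the sample values are made up for illustration.

```python
import pandas as pd

MINUTES_PER_DAY = 24 * 60


def to_minutes(hhmm: pd.Series) -> pd.Series:
    """Convert a column of 'HHMM' strings to minutes after midnight."""
    # Assumption: values are strings such as "0645"; zfill guards against lost leading zeros.
    hhmm = hhmm.astype(str).str.zfill(4)
    return hhmm.str[:2].astype(int) * 60 + hhmm.str[2:].astype(int)


def circular_diff(old: pd.Series, new: pd.Series) -> pd.Series:
    """Shortest distance in minutes between two clock times, either way around midnight."""
    diff = (to_minutes(new) - to_minutes(old)).abs() % MINUTES_PER_DAY
    # Keep diff where it is already the shorter way around, otherwise use the complement.
    return diff.where(diff <= MINUTES_PER_DAY - diff, MINUTES_PER_DAY - diff)


# Made-up sample values, including one pair that straddles midnight.
old_times = pd.Series(["0645", "2355", "2040"])
new_times = pd.Series(["0700", "0004", "2040"])
print(circular_diff(old_times, new_times).tolist())  # [15, 9, 0]
```

Without dates this still cannot tell a 9-minute shift across midnight from a shift of almost a full day, so the caveat in the comments above applies here too.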

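Now, to the heart of the problem: checking for a match. You listed what you want, so this is just some logic to implement it. This function takes any two rows to compare.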

def check_match(old, new):
    #["MarketID", "Origin", "Dest", "FltNum", "DepCentral", "ArrCentral"]
    if old['Origin'] != new['Origin']:
        return False, "", ""
    if old['Dest'] != new['Dest']:
        return False, "", ""
    total_time_change = time_change(old["DepCentral"], new["DepCentral"]) + \
        time_change(old["ArrCentral"], new["ArrCentral"])
    if abs(total_time_change) <= 50:
        return True, total_time_change, "other fields that changed"
    else:
        return False, "", ""

Loop over all rows in the old and new tables and compare them. Given the data size you mentioned, looking at everything should be fine.

def compare_tables(old, new_df):
    if 'PK' not in old.columns:
        # use the current index as the starting PK
        old['PK'] = old.index
    for _, old_row in old.iterrows():
        print("Looking at row:")
        print(old_row.T)
        pk = old_row['PK']
        best_match = None
        best_match_time_change = float('inf')
        for _, new_row in new_df.iterrows():
            # named "change" so it does not shadow the time_change() function
            is_match, change, _ = check_match(old_row, new_row)
            if is_match and (change < best_match_time_change):
                best_match_time_change = change
                best_match = new_row
        print("The best match is:")
        if best_match is not None:
            print(best_match.T, best_match_time_change)
        else:
            print("no matches found")
        print()
        print()

Loop over all pairs of tables:

all_tables = [table_1, table_2, table_3, table_4]
for old, new_df in zip(all_tables, all_tables[1:]):
    compare_tables(old, new_df)
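If the row-by-row loop gets slow, one alternative (not what the code above does) is to let `pd.merge` pair up rows on Origin and Dest and then filter on the total time change. This is only a minimal sketch, assuming "HHMM" strings and a `PK` column on the old table. The `old` table is the sample from the question, the `new` table is invented for illustration, and `best_matches`, `to_minutes` and `circular_diff` are hypothetical names.

```python
import pandas as pd

MINUTES_PER_DAY = 24 * 60


def to_minutes(hhmm):
    # Assumption: times are "HHMM" strings such as "0645".
    hhmm = hhmm.astype(str).str.zfill(4)
    return hhmm.str[:2].astype(int) * 60 + hhmm.str[2:].astype(int)


def circular_diff(a, b):
    # Shortest distance in minutes, going either way around midnight.
    diff = (a - b).abs() % MINUTES_PER_DAY
    return diff.where(diff <= MINUTES_PER_DAY - diff, MINUTES_PER_DAY - diff)


def best_matches(old, new, max_change=50):
    # Pair every old row with every new row that shares Origin and Dest.
    pairs = old.merge(new, on=["Origin", "Dest"], suffixes=("_old", "_new"))
    pairs["TimeChange"] = circular_diff(
        to_minutes(pairs["DepCentral_old"]), to_minutes(pairs["DepCentral_new"])
    ) + circular_diff(
        to_minutes(pairs["ArrCentral_old"]), to_minutes(pairs["ArrCentral_new"])
    )
    # Keep only candidates within the allowed total change.
    pairs = pairs[pairs["TimeChange"] <= max_change]
    # For each old row (identified by PK), keep the closest candidate.
    best = pairs.sort_values("TimeChange").drop_duplicates("PK")
    return best.sort_values("PK").reset_index(drop=True)


old = pd.DataFrame({
    "MarketID": ["1DFWSJC", "2DFWSJC", "3DFWSJC"],
    "Origin": ["DFW", "DFW", "DFW"],
    "Dest": ["SJC", "SJC", "SJC"],
    "FltNum": ["444", "555", "666"],
    "DepCentral": ["0645", "1300", "2040"],
    "ArrCentral": ["1015", "1630", "2410"],
    "PK": ["A", "X", "B"],
})
# Invented "new" schedule: two flights shifted slightly, one moved by hours.
new = pd.DataFrame({
    "MarketID": ["1DFWSJC", "2DFWSJC", "3DFWSJC"],
    "Origin": ["DFW", "DFW", "DFW"],
    "Dest": ["SJC", "SJC", "SJC"],
    "FltNum": ["444", "557", "888"],
    "DepCentral": ["0700", "1310", "1800"],
    "ArrCentral": ["1030", "1645", "2130"],
})

result = best_matches(old, new)
print(result[["PK", "FltNum_old", "FltNum_new", "TimeChange"]])
```

Like the loop, this picks the closest new row for each old row independently, so two old rows can end up matched to the same new row. Old rows with no candidate within 50 minutes (PK B in this sample) simply do not appear in the result.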
