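Iterate over each row to update values from the first DataFrame into the second based on a unique value (the two frames have different indexes); otherwise append the row and assign a new ID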

  unique_value     Status  Price    ID
0       xyz123       good   1.25  1000
1       xyz123       good   1.25  1000
2       xyz123       good   1.25  1000
3       xyz123       good   1.25  1000
4       xyz985        bad   1.31  1001
5       abc987       okay   4.56  1002
6       eff987       good   9.85  1003
7       asd541  excellent   8.85  1004

  unique_value     Status  Price    ID
0       xyz123        bad   6.67  1000  <-updated
1       xyz123        bad   6.67  1000  <-updated
2       xyz123        bad   6.67  1000  <-updated
3       xyz123        bad   6.67  1000  <-updated
4       xyz985        bad   1.31  1001
5       abc987       okay   4.56  1002
6       eff987        bad   1.75  1003  <-updated
7       asd541  excellent   8.85  1004
8       efg125       okay   5.77  1005  <-appended
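To make the example reproducible, the two frames could be built roughly as follows. This is only a sketch: df2 is copied from the first table above, but df1 is not shown in the question, so its contents here are an assumption inferred from the rows marked "<-updated" and "<-appended".

```python
import pandas as pd

# df2: the frame to be updated (values copied from the first table above)
df2 = pd.DataFrame({
    "unique_value": ["xyz123"] * 4 + ["xyz985", "abc987", "eff987", "asd541"],
    "Status": ["good"] * 4 + ["bad", "okay", "good", "excellent"],
    "Price": [1.25] * 4 + [1.31, 4.56, 9.85, 8.85],
    "ID": [1000] * 4 + [1001, 1002, 1003, 1004],
})

# df1: ASSUMED contents (not given in the question), inferred from the
# rows flagged as updated or appended in the expected result
df1 = pd.DataFrame({
    "unique_value": ["xyz123", "eff987", "efg125"],
    "Status": ["bad", "bad", "okay"],
    "Price": [6.67, 1.75, 5.77],
})
```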

for i in range(0, len(df1)):
    if df1['unique_value'].isin(df2['unique_value'])[i] == True:
        ... update row in df2
    else:
        df2 = df2.append(i)
        ... assign row with new ID using pd.factorize and ID value at df2['ID'].max()+1

1 answer

User

#1 · posted 2024-09-28 17:19:53

My strategy for implementing the two parts is explained below.

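Updating existing rows: df2 can be updated by broadcasting, provided the row taken from df1 is reshaped to the correct shape, (1, 3). The concept of broadcasting in {} is the same as in {}.
Appending new rows: assuming a consecutive index that counts from 0, a new row can be appended simply by calling df2.loc[len(df2), :] = ..., where len(df2) is the next unused natural number in the index. See, for example, this answer.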
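The two ideas above can be tried in isolation on a small frame. The following is a minimal sketch; the frame `toy` and its values are made up for illustration and are not part of the question.

```python
import pandas as pd

# Hypothetical toy frame with the same four columns as df2
toy = pd.DataFrame({
    "unique_value": ["a", "a", "b"],
    "Status": ["good", "good", "okay"],
    "Price": [1.0, 1.0, 2.0],
    "ID": [10, 10, 11],
})

# 1. Update by broadcasting: one incoming row, reshaped to (1, 3),
#    is written into every row whose unique_value matches.
new_row = pd.Series({"unique_value": "a", "Status": "bad", "Price": 9.5})
cols = ["unique_value", "Status", "Price"]
toy.loc[toy["unique_value"] == "a", cols] = new_row.values.reshape((1, 3))

# 2. Append by enlargement: with a 0..n-1 index, len(toy) is the next
#    unused label, so assigning to it creates a new row.
toy.loc[len(toy), :] = ["c", "fine", 3.0, 12]
```

Both matching rows for "a" receive the same new Status and Price, while the row for "b" is left alone. The enlargement step only behaves like an append when the index really is 0, 1, 2, ...; with any other index, len(toy) could collide with an existing label and overwrite that row instead.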

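In addition, my solution builds two extra state variables, because I think they are more efficient than searching the whole of df2 on every iteration. If that is not a concern, they can of course be dropped.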

Code:

# additional state variables
# 1. for the ID to be added
current_max_id = df2["ID"].max()
# 2. for matching unique_values, avoiding searching df2["unique_value"] every time
current_value_set = set(df2["unique_value"].values)

# match unique_value's using the state variable instead of `df2`
mask = df1["unique_value"].isin(current_value_set)

for i in range(len(df1)):
    
    # current unique_value from df1 (positional access, so df1's index labels do not matter)
    uv1 = df1["unique_value"].iloc[i]
    
    # 1. update existing
    if mask.iloc[i]:
        
        # broadcast df1 into the matched rows in df2 (mind the shape)
        df2.loc[df2["unique_value"] == uv1, ["unique_value", "Status", "Price"]] = df1.iloc[i, :].values.reshape((1, 3))
        
    # 2. append new
    else:
        # update state variables
        current_max_id += 1
        current_value_set.add(uv1)
        # append the row (assumes df2.index=[0,1,2,3,...])
        df2.loc[len(df2), :] = [df1.iloc[i, 0], df1.iloc[i, 1], df1.iloc[i, 2], current_max_id]

Output:

df2
Out[45]: 
  unique_value     Status  Price      ID
0       xyz123        bad   6.67  1000.0
1       xyz123        bad   6.67  1000.0
2       xyz123        bad   6.67  1000.0
3       xyz123        bad   6.67  1000.0
4       xyz985        bad   1.31  1001.0
5       abc987       okay   4.56  1002.0
6       eff987        bad   1.75  1003.0
7       asd541  excellent   8.85  1004.0
8       efg125       okay   5.77  1005.0

Tested with Python 3.7, 1.1.2, OS = Debian 10 64-bit.
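For larger frames, the same result can probably be reached without a Python-level loop. The sketch below is an alternative to the loop above, not the answer's method: it maps the new values onto df2 by key and appends unmatched rows with pd.concat (DataFrame.append, used in the question's pseudocode, was removed in pandas 2.0). The contents of df1 are again an assumption inferred from the expected result, and the names `lookup`, `new_rows` and `result` are illustrative.

```python
import pandas as pd

df2 = pd.DataFrame({
    "unique_value": ["xyz123"] * 4 + ["xyz985", "abc987", "eff987", "asd541"],
    "Status": ["good"] * 4 + ["bad", "okay", "good", "excellent"],
    "Price": [1.25] * 4 + [1.31, 4.56, 9.85, 8.85],
    "ID": [1000] * 4 + [1001, 1002, 1003, 1004],
})
# ASSUMED df1 (not shown in the question)
df1 = pd.DataFrame({
    "unique_value": ["xyz123", "eff987", "efg125"],
    "Status": ["bad", "bad", "okay"],
    "Price": [6.67, 1.75, 5.77],
})

# One row per key; if df1 repeats a key, the last occurrence wins
lookup = df1.drop_duplicates("unique_value", keep="last").set_index("unique_value")

# 1. update: look up each df2 key in df1 and keep the old value where there
#    is no match (caveat: a genuine NaN in df1 is also treated as "no match")
for col in ["Status", "Price"]:
    df2[col] = df2["unique_value"].map(lookup[col]).fillna(df2[col])

# 2. append: keys present only in df1 receive consecutive new IDs
new_rows = lookup.loc[~lookup.index.isin(df2["unique_value"])].reset_index()
next_id = int(df2["ID"].max()) + 1
new_rows["ID"] = range(next_id, next_id + len(new_rows))

result = pd.concat([df2, new_rows], ignore_index=True)
```

Because the new rows are built as a separate frame with an integer ID column, the ID column stays integer here, whereas the row-wise output above shows it as float.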
