Pandas: conditionally merging/joining DataFrames on repeated IDs

Posted 2024-06-01 09:15:55


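Apologies for the title; I'm not sure how best to describe my problem. I believe what I'm after is something like a conditional outer join/merge. I think I either need to set the condition up front, or merge everything and then drop the unneeded information. I have an example that will hopefully help explain my situation.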

I start with the following DataFrames:

DataFrame 1

+--------+------------+
| GlobID | Issue      |
+--------+------------+
| 1      | Building M |
+--------+------------+
| 2      | Building V |
+--------+------------+
| 3      | Building H |
+--------+------------+

DataFrame 2

+----+---------+---------+------------+---------+---------+------------+
| ID | Issue_A | Note_A  | Location_A | Issue_B | Note_B  | Location_B |
+----+---------+---------+------------+---------+---------+------------+
| 1  | Y       | broken  | bathroom   | N       |         |            |
+----+---------+---------+------------+---------+---------+------------+
| 2  | Y       | stained | bedroom    | Y       | rusty   | basement   |
+----+---------+---------+------------+---------+---------+------------+
| 3  | Y       | missing | kitchen    | Y       | cracked | attic      |
+----+---------+---------+------------+---------+---------+------------+
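In DataFrame 2, the values of "Note_A" and "Location_A" depend on "Issue_A": if there is an issue, they are filled in. If not, "Issue_A" is marked "N" and the other columns are left empty (the same applies to the "_B" columns). Essentially, what I want is to merge the data so that, for each ID, the issues are broken out into their own rows. Ideally, the result would exclude records where no issue was recorded.

Desired result: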

+--------+------------+---------+----------+
| GlobID | Name       | Issue   | Location |
+--------+------------+---------+----------+
| 1      | Building M | broken  | bathroom |
+--------+------------+---------+----------+
| 2      | Building V | stained | bedroom  |
+--------+------------+---------+----------+
| 2      | Building V | rusty   | basement |
+--------+------------+---------+----------+
| 3      | Building H | missing | kitchen  |
+--------+------------+---------+----------+
| 3      | Building H | cracked | attic    |
+--------+------------+---------+----------+

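As I mentioned, I'm not sure whether an outer join, combined with ffill to fill in the IDs, is what I want here. Any help would be greatly appreciated.

Edit:

I forgot to mention, this is my current code: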

pd.merge(df1, df2.set_index('ID'), left_on='GlobID', right_index=True)

This only joins df1 and df2. I still need to break the issues out into their own rows.
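To make the starting point concrete, here is a minimal sketch that rebuilds the two example tables and runs the merge from the question. The empty cells are assumed to be empty strings; the original data may use NaN instead.

```python
import pandas as pd

# Example tables from the question (blank cells assumed to be "")
df1 = pd.DataFrame(
    [[1, "Building M"], [2, "Building V"], [3, "Building H"]],
    columns=["GlobID", "Issue"],
)
df2 = pd.DataFrame(
    [
        [1, "Y", "broken", "bathroom", "N", "", ""],
        [2, "Y", "stained", "bedroom", "Y", "rusty", "basement"],
        [3, "Y", "missing", "kitchen", "Y", "cracked", "attic"],
    ],
    columns=["ID", "Issue_A", "Note_A", "Location_A",
             "Issue_B", "Note_B", "Location_B"],
)

# The merge from the question: one wide row per building
merged = pd.merge(df1, df2.set_index("ID"), left_on="GlobID", right_index=True)
print(merged.shape)
```

The result still has one row per building, with the A and B issues side by side in separate columns, so a reshaping step is needed before or after the merge.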


2 answers

You can use an approach like this:

import pandas as pd

df1 = pd.DataFrame([[1,"Building M"],[2,"Building V"], [3, "Building H"]], columns=["GlobID","Issue"])
df2 = pd.DataFrame([[1,"Y","broken","bathroom","N","",""],
                    [2,"Y","stained","bedroom","Y","rusty","basement"],
                    [3,"Y","missing","kitchen","Y","cracked","attic"]], 
                   columns=["ID","Issue_A","Note_A", "Location_A", "Issue_B", "Note_B", "Location_B"])

df1 = df1.set_index("GlobID")
df2 = df2.set_index("ID")

# divide our df2 to list of data frames
issues = ["A", "B"]
description = ["Issue", "Note", "Location"]
delimiter = "_"
issues_df_list = []
for issue in issues:
    # prepare concrete issue description fields
    issue_labels = [descr + delimiter + issue for descr in description]
    # select sub df for each issue
    df = df2[issue_labels]
    # rename and unify columns labels
    df.columns = description
    # then add sub df to the df list
    issues_df_list.append(df)

# then concat list of dfs to one big df
issues_df = pd.concat(issues_df_list,sort=False) # some kind of reshaping

# drop rows with "N" values
issues_df = issues_df[issues_df["Issue"] != "N"]

# drop Issue column
issues_df = issues_df.loc[:,issues_df.columns != "Issue"]

# rename Note column label to the Issue 
issues_df = issues_df.rename(columns={"Note":"Issue"})

issues_df

This gives you:

+----+---------+----------+
|    | Issue   | Location |
+----+---------+----------+
| ID |         |          |
| 1  | broken  | bathroom |
| 2  | stained | bedroom  |
| 3  | missing | kitchen  |
| 2  | rusty   | basement |
| 3  | cracked | attic    |
+----+---------+----------+

Then you can do a simple merge:

pd.merge(df1.rename(columns={"Issue":"Name"}), issues_df, left_index=True, right_index=True)

+---+------------+---------+----------+
|   | Name       | Issue   | Location |
+---+------------+---------+----------+
| 1 | Building M | broken  | bathroom |
| 2 | Building V | stained | bedroom  |
| 2 | Building V | rusty   | basement |
| 3 | Building H | missing | kitchen  |
| 3 | Building H | cracked | attic    |
+---+------------+---------+----------+
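As an alternative sketch of the same reshaping (not part of the answer above), `pd.wide_to_long` can split the `_A`/`_B` column groups in one call. The column name `Slot` and the suffix pattern `[A-Z]+` are assumptions chosen for this example, and blank cells are again assumed to be empty strings.

```python
import pandas as pd

df1 = pd.DataFrame(
    [[1, "Building M"], [2, "Building V"], [3, "Building H"]],
    columns=["GlobID", "Issue"],
)
df2 = pd.DataFrame(
    [
        [1, "Y", "broken", "bathroom", "N", "", ""],
        [2, "Y", "stained", "bedroom", "Y", "rusty", "basement"],
        [3, "Y", "missing", "kitchen", "Y", "cracked", "attic"],
    ],
    columns=["ID", "Issue_A", "Note_A", "Location_A",
             "Issue_B", "Note_B", "Location_B"],
)

# Reshape wide -> long: one row per (ID, Slot), where Slot is "A" or "B"
long_df = pd.wide_to_long(
    df2,
    stubnames=["Issue", "Note", "Location"],
    i="ID",
    j="Slot",          # hypothetical name for the A/B suffix column
    sep="_",
    suffix=r"[A-Z]+",  # suffixes are letters, not the default digits
).reset_index()

# Keep only recorded issues, then let the note text become the "Issue" column
issues = (
    long_df[long_df["Issue"] == "Y"]
    .drop(columns=["Issue"])
    .rename(columns={"Note": "Issue"})
)

# Attach building names and order rows by building, then by slot
result = (
    df1.rename(columns={"Issue": "Name"})
    .merge(issues, left_on="GlobID", right_on="ID")
    .sort_values(["GlobID", "Slot"])
    .drop(columns=["ID", "Slot"])
    .reset_index(drop=True)
)
print(result)
```

The explicit `sort_values` is there because the row order coming out of the reshape should not be relied on.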

Here is a simple way to solve the problem:

import numpy as np
import pandas as pd

df1 = pd.DataFrame([[1, "Building M"], [2, "Building V"], [3, "Building H"]], columns=["id", "Issue"])
df2 = pd.DataFrame([[1, "Y", "broken", "bathroom", "N", np.nan, np.nan], [2,"Y", "stained", "bedroom", "Y", "rusty", "basement"], [3, "Y", "missing", "kitchen", "Y", "cracked", "attic"]], columns=["id", "Issue_A", "Note_A", "Location_A", "Issue_B", "Note_B", "Location_B"])

# Stack the A and B note/location pairs under the same column names;
# dropna() removes the rows where no issue was recorded.
# The Note columns are used (not the Y/N flags) so the result carries the issue text.
df2 = pd.concat([df2[["id", "Note_A", "Location_A"]],
                 df2[["id", "Note_B", "Location_B"]].rename(columns={"Note_B": "Note_A", "Location_B": "Location_A"})]).dropna()

df_result = pd.merge(df1, df2, on="id", how="left")

print(df_result)
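One point worth checking with `how="left"`: a building with no issue rows at all would still appear in the output, with NaN in the issue columns, whereas the question asks for such records to be left out. The sketch below uses hypothetical data (a building 4 with no issues, and invented column names `Name`, `Issue`, `Location`) to compare the two join types.

```python
import pandas as pd

# Hypothetical data: building 4 has no recorded issues
buildings = pd.DataFrame(
    {"id": [1, 2, 4], "Name": ["Building M", "Building V", "Building X"]}
)
issues = pd.DataFrame(
    {
        "id": [1, 2, 2],
        "Issue": ["broken", "stained", "rusty"],
        "Location": ["bathroom", "bedroom", "basement"],
    }
)

# Left join keeps building 4 with NaN issue fields
left = pd.merge(buildings, issues, on="id", how="left")

# Inner join keeps only buildings that have at least one issue
inner = pd.merge(buildings, issues, on="id", how="inner")

print(len(left), len(inner))
```

With the example data in the question every building has at least one issue, so both join types would return the same five rows there.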
