如何在两个数据帧中查找值并在一个数据帧中的匹配行上添加新值

2024-10-03 23:26:02 发布

您现在位置:Python中文网/ 问答频道 /正文

以下是我试图实现的目标:我希望在df1和df2中找到匹配的“记录ID”,当找到匹配项时,在df1中创建一个新列“new Record ID”,并将“new Record ID”从df2复制到具有匹配“Record ID”的行中

df1
+---+----+---------------------+---------------------+---------+------------+-------------+------------+--------------------+-------------+-----------+------------+-----------------------------------+
|   | ID | Start time          | Completion time     | Email   | Name       | Client Name | Date Due   | Narrative          | Role        | Name ID   | Role ID    | Record ID                         |
+---+----+---------------------+---------------------+---------+------------+-------------+------------+--------------------+-------------+-----------+------------+-----------------------------------+
| 0 | 1  | 2020-05-07 19:35:13 | 2020-05-07 19:36:29 | a@a.com | John Smith | HPOT        | 2020-05-01 | HPOT TS Narrative  | Supervisor  | JohnSmith | Supervisor | hpotjohnsmithsupervisor2020-05-01 |
+---+----+---------------------+---------------------+---------+------------+-------------+------------+--------------------+-------------+-----------+------------+-----------------------------------+
| 1 | 2  | 2020-05-07 19:36:31 | 2020-05-07 19:37:23 | b@b.com | John Smith | SMAN        | 2020-05-01 | SMAN IP Narrative  | Line Worker | JohnSmith | LineWorker | smanjohnsmithlineworker2020-05-01 |
+---+----+---------------------+---------------------+---------+------------+-------------+------------+--------------------+-------------+-----------+------------+-----------------------------------+
| 2 | 3  | 2020-05-07 19:37:25 | 2020-05-07 19:38:57 | c@c.com | John Smith | THUL        | 2020-05-01 | THUL TS Narrative  | Supervisor  | JohnSmith | Supervisor | thuljohnsmithsupervisor2020-05-01 |
+---+----+---------------------+---------------------+---------+------------+-------------+------------+--------------------+-------------+-----------+------------+-----------------------------------+
| 3 | 4  | 2020-05-07 19:38:59 | 2020-05-07 19:40:30 | d@d.com | John Smith | HPOT        | 2020-05-01 | HPOT IP Narrative  | Line Worker | JohnSmith | LineWorker | hpotjohnsmithlineworker2020-05-01 |
+---+----+---------------------+---------------------+---------+------------+-------------+------------+--------------------+-------------+-----------+------------+-----------------------------------+
| 4 | 5  | 2020-05-07 19:40:32 | 2020-05-07 19:41:11 | e@e.com | John Smith | SMAN        | 2020-05-01 | SMAN TS Narrative  | Supervisor  | JohnSmith | Supervisor | smanjohnsmithsupervisor2020-05-01 |
+---+----+---------------------+---------------------+---------+------------+-------------+------------+--------------------+-------------+-----------+------------+-----------------------------------+


df2

+---+----+------------+---------------------+---------+------------+-----------------------------------+-------------+------------+--------------------------------+------------+----------------------------------------------+
|   | ID | Start time | Completion time     | Email   | Name       | Record ID                         | Client Name | Due Date   | Revised Narrative              | Role       | New Record ID                                |
+---+----+------------+---------------------+---------+------------+-----------------------------------+-------------+------------+--------------------------------+------------+----------------------------------------------+
| 0 | 1  | 2020-05-09 | 2020-05-09 10:57:24 | a@a.com | John Smith | hpotjohnsmithsupervisor2020-05-01 | HPOT        | 2020-05-01 | HPOT Revised Summary Narrative | Supervisor | hpotjohnsmithsupervisor2020-05-01-2020-05-06           |
+---+----+------------+---------------------+---------+------------+-----------------------------------+-------------+------------+--------------------------------+------------+----------------------------------------------+
| 1 | 2  | 2020-05-09 | 2020-05-07 19:38:57 | c@c.com | John Smith | thuljohnsmithsupervisor2020-05-01 | THUL        | 2020-05-01 | THUL Revised Summary Narrative | Supervisor | thuljohnsmithsupervisor2020-05-01-2020-05-06 |
+---+----+------------+---------------------+---------+------------+-----------------------------------+-------------+------------+--------------------------------+------------+----------------------------------------------+

所需输出:df1

+---+----+---------------------+---------------------+---------+------------+-------------+------------+--------------------+-------------+-----------+------------+-----------------------------------+------------------------------------+
|   | ID | Start time          | Completion time     | Email   | Name       | Client Name | Date Due   | Narrative          | Role        | Name ID   | Role ID    | Record ID                         | New Record ID                      |
+---+----+---------------------+---------------------+---------+------------+-------------+------------+--------------------+-------------+-----------+------------+-----------------------------------+------------------------------------+
| 0 | 1  | 2020-05-07 19:35:13 | 2020-05-07 19:36:29 | a@a.com | John Smith | HPOT        | 2020-05-01 | HPOT TS Narrative  | Supervisor  | JohnSmith | Supervisor | hpotjohnsmithsupervisor2020-05-01 | hpotjohnsmith2020-05-01-2020-05-06 |
+---+----+---------------------+---------------------+---------+------------+-------------+------------+--------------------+-------------+-----------+------------+-----------------------------------+------------------------------------+
| 1 | 2  | 2020-05-07 19:36:31 | 2020-05-07 19:37:23 | b@b.com | John Smith | SMAN        | 2020-05-01 | SMAN IP Narrative  | Line Worker | JohnSmith | LineWorker | smanjohnsmithlineworker2020-05-01 |                                    |
+---+----+---------------------+---------------------+---------+------------+-------------+------------+--------------------+-------------+-----------+------------+-----------------------------------+------------------------------------+
| 2 | 3  | 2020-05-07 19:37:25 | 2020-05-07 19:38:57 | c@c.com | John Smith | THUL        | 2020-05-01 | THUL TS Narrative  | Supervisor  | JohnSmith | Supervisor | thuljohnsmithsupervisor2020-05-01 |                                    |
+---+----+---------------------+---------------------+---------+------------+-------------+------------+--------------------+-------------+-----------+------------+-----------------------------------+------------------------------------+
| 3 | 4  | 2020-05-07 19:38:59 | 2020-05-07 19:40:30 | d@d.com | John Smith | HPOT        | 2020-05-01 | HPOT IP Narrative  | Line Worker | JohnSmith | LineWorker | hpotjohnsmithlineworker2020-05-01 |                                    |
+---+----+---------------------+---------------------+---------+------------+-------------+------------+--------------------+-------------+-----------+------------+-----------------------------------+------------------------------------+
| 4 | 5  | 2020-05-07 19:40:32 | 2020-05-07 19:41:11 | e@e.com | John Smith | SMAN        | 2020-05-01 | SMAN TS Narrative  | Supervisor  | JohnSmith | Supervisor | smanjohnsmithsupervisor2020-05-01 |                                    |
+---+----+---------------------+---------------------+---------+------------+-------------+------------+--------------------+-------------+-----------+------------+-----------------------------------+------------------------------------+

我试过使用isin和loc,但都没能让它工作。最近的尝试是:

df2.loc[df2['Record ID'] == df1['Record ID']]

但我得到以下错误:“ValueError:只能比较标记相同的系列对象”

提前感谢您提供的任何帮助

根据要求,以下是剪贴板中的数据: df1

ID,Start time,Completion time,Email,Name,Client Name,Date Due,Narrative,Role,Name ID,Role ID,Record ID
1,2020-05-07 19:35:13,2020-05-07 19:36:29,johns@abcdefcompany.com,John Smith,HPOT,2020-05-01,"HPOT TS Narrative
Sed pretium, nunc sit amet finibus finibus, sem tortor fringilla arcu, ac vehicula massa tellus id purus. Proin luctus sodales lacus sed tempus.",Supervisor,JohnSmith,Supervisor,hpotjohnsmithsupervisor2020-05-01
2,2020-05-07 19:36:31,2020-05-07 19:37:23,johns@abcdefcompany.com,John Smith,SMAN,2020-05-01,"SMAN IP Narrative - Sed pretium, nunc sit amet finibus finibus, sem tortor fringilla arcu, ac vehicula massa tellus id purus. Proin luctus sodales lacus sed tempus.
",Worker,JohnSmith,Worker,smanJohnSmithWorker2020-05-01
3,2020-05-07 19:37:25,2020-05-07 19:38:57,johns@abcdefcompany.com,John Smith,THUL,2020-05-01,"THUL TS Narrative
Entry 3 - Sed pretium, nunc sit amet finibus finibus, sem tortor fringilla arcu, ac vehicula massa tellus id purus. Proin luctus sodales lacus sed tempus..
",Supervisor,JohnSmith,Supervisor,thulJohnSmithSupervisor2020-05-01
4,2020-05-07 19:38:59,2020-05-07 19:40:30,johns@abcdefcompany.com,John Smith,HPOT,2020-05-01,HPOT IP Narrative,Worker,JohnSmith,Worker,hpotJohnSmithWorker2020-05-01
5,2020-05-07 19:40:32,2020-05-07 19:41:11,johns@abcdefcompany.com,John Smith,SMAN,2020-05-01,SMAN TS Narrative,Supervisor,JohnSmith,Supervisor,smanJohnSmithSupervisor2020-05-01
6,2020-05-07 19:41:12,2020-05-07 19:41:40,johns@abcdefcompany.com,John Smith,THUL,2020-05-01,THUL IP Narrative,Worker,JohnSmith,Worker,thulJohnSmithWorker2020-05-01

df2

ID,Start time,Completion time,Email,Name,Record ID,Client Name,Due Date,Revised Narrative,Role,New Record ID
1,2020-05-09,2020-05-09 10:57:24,johns@abcdefcompany.com,John Smith,hpotjohnsmithsupervisor2020-05-01,HPOT,2020-05-01,HPOT Revised Summary Narrative,Supervisor,hpotjohnsmithsupervisor2020-05-012020-05-09

如果有人试图回答我的问题,可以将其复制到剪贴板,然后

df = pd.read_clipboard(sep=',')

尝试:所以我尝试了“加入”建议:

df1.set_index('Record ID').join(df2.set_index('Record ID'))

出现此错误:“ValueError:列重叠但未指定后缀:索引(['ID'、'Start time'、'Completion time'、'Email'、'Name'、'Client Name', “角色”], dtype='object')”


Tags: namecomidtimerecordjohnworkersmith
1条回答
网友
1楼 · 发布于 2024-10-03 23:26:02

使用pandas.DataFrame.merge

  • df1['Records ID']df2['Records ID']之间没有匹配值,因为一列包含大写,而另一列不包含大写。必须解决此问题才能找到匹配项
# fix Record ID for test data, this may not be required with the real data
df2['Record ID'] = df2['Record ID'].str.lower()
df['Record ID'] = df['Record ID'].str.lower()

# removed Narrative for cosmetic reasons on SO. Not necessary otherwise
df.drop(columns=['Narrative'], inplace=True)

# merge the two dataframes is the main thing
df3 = df.merge(df2[['Record ID', 'New Record ID']], how='left', on='Record ID')

print(df3)

 ID           Start time      Completion time                    Email        Name Client Name    Date Due        Role    Name ID     Role ID                          Record ID                                New Record ID
  1  2020-05-07 19:35:13  2020-05-07 19:36:29  johns@abcdefcompany.com  John Smith        HPOT  2020-05-01  Supervisor  JohnSmith  Supervisor  hpotjohnsmithsupervisor2020-05-01  hpotjohnsmithsupervisor2020-05-012020-05-09
  2  2020-05-07 19:36:31  2020-05-07 19:37:23  johns@abcdefcompany.com  John Smith        SMAN  2020-05-01      Worker  JohnSmith      Worker      smanJohnSmithWorker2020-05-01                                          NaN
  3  2020-05-07 19:37:25  2020-05-07 19:38:57  johns@abcdefcompany.com  John Smith        THUL  2020-05-01  Supervisor  JohnSmith  Supervisor  thulJohnSmithSupervisor2020-05-01                                          NaN
  4  2020-05-07 19:38:59  2020-05-07 19:40:30  johns@abcdefcompany.com  John Smith        HPOT  2020-05-01      Worker  JohnSmith      Worker      hpotJohnSmithWorker2020-05-01                                          NaN
  5  2020-05-07 19:40:32  2020-05-07 19:41:11  johns@abcdefcompany.com  John Smith        SMAN  2020-05-01  Supervisor  JohnSmith  Supervisor  smanJohnSmithSupervisor2020-05-01                                          NaN
  6  2020-05-07 19:41:12  2020-05-07 19:41:40  johns@abcdefcompany.com  John Smith        THUL  2020-05-01      Worker  JohnSmith      Worker      thulJohnSmithWorker2020-05-01                                          NaN                                     NaN

相关问题 更多 >