Unusual table reduction in Pandas or SQL

Posted 2024-09-28 19:21:48


I have a database table similar to the pseudo-example below (but with thousands of records):

LastName FirstName MiddleInit Company   Team             LogDate
   smith    joseph          a   compA  teama 2015-09-18 23:00:00
   smith    joseph          a   compA  teamb 2015-08-15 04:00:00
   smith    joseph          a   compA  teamc 2015-10-01 02:15:00
    lamb      mary          b   compA  teama 2015-09-15 23:00:00
    lamb      mary          b   compA  teamb 2015-10-04 01:15:00
    lamb      mary          b   compA  teamc 2015-05-01 07:00:00
   brady    thomas          c   compB  teama 2015-10-02 03:15:00
   brady    thomas          c   compB  teamb 2015-09-11 20:00:00
   brady    thomas          c   compB  teamc 2015-06-02 05:00:00
   smith      john          d   compB  teama 2015-09-12 08:00:00
   smith      john          d   compB  teamb 2015-10-05 09:15:00
   smith      john          d   compB  teamc 2015-07-03 15:00:00

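The people in the table stay at the same company, but they move from one team to another. I am interested in finding the team each person was most recently on, based on the LogDate field. Below is the output I want to produce (the order of the names does not matter):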

LastName FirstName MiddleInit Company   Team             LogDate
   smith    joseph          a   compA  teamc 2015-10-01 02:15:00
    lamb      mary          b   compA  teamb 2015-10-04 01:15:00
   brady    thomas          c   compB  teama 2015-10-02 03:15:00
   smith      john          d   compB  teamb 2015-10-05 09:15:00

I would like to know whether there is a way to derive this output using a single SQL query or Pandas (preferably without loops).

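Attempt using SQL: At the time of writing I do not have access to the database, but I have to join two tables to get the Team field, and I believe I tried the following without success (this may not be exactly right, but it should be close to what I tried):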

SELECT a.LastName, a.FirstName, a.MiddleInit, a.Company, b.Team, max(b.LogDate)
FROM table1 AS a
JOIN table2 AS b
  ON a.LastName=b.LastName
 AND a.FirstName=b.FirstName
 AND a.MiddleInit=b.MiddleInit
 AND a.Company=b.Company
GROUP BY a.LastName, a.FirstName, a.MiddleInit, a.Company
ORDER BY a.LastName, a.FirstName, a.MiddleInit, a.Company;

Attempt using Pandas: As an experiment, I loaded the table above from a CSV file and wrote a quick Python script:

import pandas as pd
td1 = pd.read_csv('teamdata.csv',parse_dates=['LogDate'])
td2 = td1.groupby(['LastName','FirstName','MiddleInit','Company']).max().reset_index()

But this always returns "teamc" as the team, rather than the team from the row with the latest LogDate:

In [1]: import pandas as pd

In [2]: td1 = pd.read_csv('teamdata.csv',parse_dates=['LogDate'])

In [3]: td2 = td1.groupby(['LastName','FirstName','MiddleInit','Company']).max().reset_index()

In [4]: td2
Out[4]:
  LastName FirstName MiddleInit Company   Team             LogDate
0    brady    thomas          c   compB  teamc 2015-10-02 03:15:00
1     lamb      mary          b   compA  teamc 2015-10-04 01:15:00
2    smith      john          d   compB  teamc 2015-10-05 09:15:00
3    smith    joseph          a   compA  teamc 2015-10-01 02:15:00

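Even if I use a custom function with groupby, my understanding is that it only operates on the column being processed (LogDate), and I do not know how to reference the team in that particular record. Any ideas are appreciated, preferably without resorting to loops or multiple SQL queries, but if that is the only way, I will take whatever I can get at this point. Thanks in advance for your help.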


2 Answers

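In Pandas, you can use idxmax to get the index label of the row holding the maximum value in each group, then select those rows with loc: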

In [17]: df.loc[df.groupby(['LastName','FirstName','MiddleInit','Company'])['LogDate'].idxmax(), :]
Out[17]: 
   LastName FirstName MiddleInit Company   Team             LogDate
6     brady    thomas          c   compB  teama 2015-10-02 03:15:00
4      lamb      mary          b   compA  teamb 2015-10-04 01:15:00
10    smith      john          d   compB  teamb 2015-10-05 09:15:00
2     smith    joseph          a   compA  teamc 2015-10-01 02:15:00
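
For anyone who wants to try this without the CSV file, here is a minimal self-contained sketch. It rebuilds the sample table in memory (the `rows`, `keys`, `colwise`, `latest`, and `latest_alt` names are only illustrative and not from the original post). It also shows why the `max()` attempt in the question returns "teamc" everywhere, and one possible alternative based on `sort_values` and `drop_duplicates`.

```python
import pandas as pd

# In-memory copy of the sample table from the question.
rows = [
    ("smith", "joseph", "a", "compA", "teama", "2015-09-18 23:00:00"),
    ("smith", "joseph", "a", "compA", "teamb", "2015-08-15 04:00:00"),
    ("smith", "joseph", "a", "compA", "teamc", "2015-10-01 02:15:00"),
    ("lamb", "mary", "b", "compA", "teama", "2015-09-15 23:00:00"),
    ("lamb", "mary", "b", "compA", "teamb", "2015-10-04 01:15:00"),
    ("lamb", "mary", "b", "compA", "teamc", "2015-05-01 07:00:00"),
    ("brady", "thomas", "c", "compB", "teama", "2015-10-02 03:15:00"),
    ("brady", "thomas", "c", "compB", "teamb", "2015-09-11 20:00:00"),
    ("brady", "thomas", "c", "compB", "teamc", "2015-06-02 05:00:00"),
    ("smith", "john", "d", "compB", "teama", "2015-09-12 08:00:00"),
    ("smith", "john", "d", "compB", "teamb", "2015-10-05 09:15:00"),
    ("smith", "john", "d", "compB", "teamc", "2015-07-03 15:00:00"),
]
columns = ["LastName", "FirstName", "MiddleInit", "Company", "Team", "LogDate"]
df = pd.DataFrame(rows, columns=columns)
df["LogDate"] = pd.to_datetime(df["LogDate"])

# Columns that identify one person.
keys = ["LastName", "FirstName", "MiddleInit", "Company"]

# groupby(...).max() takes the maximum of every column independently, so Team
# becomes the alphabetically largest value ("teamc"), unrelated to LogDate.
colwise = df.groupby(keys).max().reset_index()

# idxmax returns, per group, the index label of the row with the latest
# LogDate; loc then pulls those complete rows, Team included.
latest = df.loc[df.groupby(keys)["LogDate"].idxmax()]

# Alternative sketch: sort by date and keep the last row for each person.
latest_alt = df.sort_values("LogDate").drop_duplicates(subset=keys, keep="last")

print(latest)
```

Both `latest` and `latest_alt` select the same four rows here. If two rows for one person shared the same latest LogDate, each approach would still keep only one of them, so ties may need separate handling.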

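The simplest way to do what you want in Postgres is to use distinct on. It keeps the first row of each group, so ordering by logdate desc within the group selects the most recent record: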

select distinct on (lastname, firstname, middleinit, company) t.*
from table1 t
order by lastname, firstname, middleinit, company, logdate desc;
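
Since distinct on is specific to Postgres, a more portable pattern is to join the table against a subquery that holds each person's maximum LogDate. The sketch below is only an illustration under some assumptions: it runs against an in-memory SQLite database through Python's sqlite3 module, uses a hypothetical single table named `team_log` in place of the two joined tables from the question, loads only two of the four people to stay short, and stores LogDate as ISO-formatted text so that string comparison matches chronological order.

```python
import sqlite3

# Subset of the sample data; LogDate is ISO-formatted text, which sorts
# chronologically when compared as strings.
rows = [
    ("smith", "joseph", "a", "compA", "teama", "2015-09-18 23:00:00"),
    ("smith", "joseph", "a", "compA", "teamb", "2015-08-15 04:00:00"),
    ("smith", "joseph", "a", "compA", "teamc", "2015-10-01 02:15:00"),
    ("lamb", "mary", "b", "compA", "teama", "2015-09-15 23:00:00"),
    ("lamb", "mary", "b", "compA", "teamb", "2015-10-04 01:15:00"),
    ("lamb", "mary", "b", "compA", "teamc", "2015-05-01 07:00:00"),
]

conn = sqlite3.connect(":memory:")
# team_log is a hypothetical table name used only for this sketch.
conn.execute(
    "CREATE TABLE team_log (LastName TEXT, FirstName TEXT, MiddleInit TEXT,"
    " Company TEXT, Team TEXT, LogDate TEXT)"
)
conn.executemany("INSERT INTO team_log VALUES (?, ?, ?, ?, ?, ?)", rows)

# The subquery m finds the latest LogDate per person; joining back on the
# person columns plus LogDate recovers the full row, including Team.
query = """
SELECT t.LastName, t.FirstName, t.MiddleInit, t.Company, t.Team, t.LogDate
FROM team_log AS t
JOIN (
    SELECT LastName, FirstName, MiddleInit, Company, MAX(LogDate) AS LogDate
    FROM team_log
    GROUP BY LastName, FirstName, MiddleInit, Company
) AS m
  ON t.LastName = m.LastName
 AND t.FirstName = m.FirstName
 AND t.MiddleInit = m.MiddleInit
 AND t.Company = m.Company
 AND t.LogDate = m.LogDate
ORDER BY t.LastName, t.FirstName
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
conn.close()
```

Unlike distinct on, this join returns more than one row for a person if several of their records share the same latest LogDate.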
