How can I convert this SQL code into equivalent pandas code involving a lag function?

case
    -- mark the first hospital adm
    when dense_rank() over (partition by adm.subject_id order by adm.admittime) = 1 then true
    -- mark subsequent hospital adms if it's been at least a month since previous admission.
    when round((cast(extract(epoch from adm.admittime - lag(adm.admittime, 1) over (partition by adm.subject_id order by adm.admittime))/(60*60*24) as numeric)), 2) >= 30.0 then true
    else false
end as include_adm

   id  admit_time           note
0  30  2018-10-03  note_content1
1  30  2018-10-03  note_content2
2  30  2018-10-29  note_content1
3  30  2018-10-29  note_content2
4  13  2017-11-01  note_content1
5  13  2018-02-27  note_content2
6  13  2018-02-27  note_content2

2 Answers

User

Answer #1 · edited 2024-10-01 22:37:38

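We use a groupwise diff() on the sorted dataframe to compute the difference between consecutive admit_time values within each id group, and keep any row whose difference is NaT (i.e. the first row of each group) or is at least 30 days. Finally, we drop the helper column delta: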

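# Time since the previous admission within each id (NaT for the first row of a group)
df['delta'] = df.sort_values(['id', 'admit_time']).groupby('id')['admit_time'].transform(lambda x: x.diff())
# Keep first admissions and admissions at least 30 days after the previous one
df = df[df.delta.isna() | (df.delta >= pd.Timedelta(days=30))].drop(columns='delta')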

Output:

   id admit_time
0  30 2018-10-03
2  13 2017-11-01
3  13 2018-02-27
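
Since the question asks specifically about lag, here is a minimal sketch of the same logic written with groupby(...).shift(1), which plays the role of LAG(..., 1) OVER (PARTITION BY ... ORDER BY ...). It builds a boolean include_adm column like the SQL does instead of filtering rows. The four-row sample and the names prev_admit and days_since_prev are assumptions made for illustration. One caveat: if an id has several rows tied on its earliest admit_time, dense_rank() = 1 would flag all of them, while this sketch flags only the first.

```python
import pandas as pd

# Assumed sample: four admissions consistent with the output shown above.
df = pd.DataFrame({
    "id": [30, 30, 13, 13],
    "admit_time": pd.to_datetime(
        ["2018-10-03", "2018-10-29", "2017-11-01", "2018-02-27"]
    ),
})

# Window functions need an ordering; sort within each id by admit_time.
df = df.sort_values(["id", "admit_time"])

# LAG(admit_time, 1) OVER (PARTITION BY id ORDER BY admit_time)
prev_admit = df.groupby("id")["admit_time"].shift(1)

# Days since the previous admission; NaN for the first admission of each id.
days_since_prev = (df["admit_time"] - prev_admit) / pd.Timedelta(days=1)

# True for the first admission of an id, or when at least 30 days have passed.
df["include_adm"] = prev_admit.isna() | (days_since_prev.round(2) >= 30.0)

# Restore the original row order.
df = df.sort_index()
print(df)
```

Filtering with df[df["include_adm"]] should then select the same rows as the delta-based filter.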

Update for the revised question:

Group by ['id','note'] instead of only 'id':

df['delta'] = df.sort_values(['id', 'admit_time']).groupby(['id','note'])['admit_time'].transform(lambda x: x.diff())
df = df[df.delta.isna() | (df.delta >= pd.Timedelta(days=30))].drop(columns='delta')

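Result (this output corresponds to row 5 having note_content1; in the sample as posted in the question, rows 5 and 6 both have note_content2, in which case one of them would get a 0-day delta and be dropped):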

   id admit_time           note
0  30 2018-10-03  note_content1
1  30 2018-10-03  note_content2
4  13 2017-11-01  note_content1
5  13 2018-02-27  note_content1
6  13 2018-02-27  note_content2
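
Both snippets above compute delta on a sorted copy and assign it back to the unsorted df. That works because pandas aligns on index labels during assignment, not on position. A small sketch of this, using a hypothetical shuffled version of the four-row sample and calling diff() directly on the groupby as a shorter spelling of the transform (both names delta and kept are illustrative):

```python
import pandas as pd

# Hypothetical unsorted input: the same four admissions in shuffled row order.
df = pd.DataFrame({
    "id": [13, 30, 13, 30],
    "admit_time": pd.to_datetime(
        ["2018-02-27", "2018-10-29", "2017-11-01", "2018-10-03"]
    ),
})

# diff() runs on a sorted copy; the result keeps the original index labels.
delta = df.sort_values(["id", "admit_time"]).groupby("id")["admit_time"].diff()

# Assignment aligns on index labels, so each delta lands on its own row
# even though df itself was never reordered.
df["delta"] = delta

# Keep first admissions (NaT) and admissions at least 30 days after the previous one.
kept = df[df["delta"].isna() | (df["delta"] >= pd.Timedelta(days=30))]
print(kept.drop(columns="delta"))
```

Here the row with id 30 and admit_time 2018-10-29 is the only one filtered out, because it falls 26 days after the previous admission for that id.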

User

Answer #2 · edited 2024-10-01 22:37:38

Try this:

>>> import pandas as pd
>>> import numpy as np
>>> df=df.sort_values(by=["id", "admit_time"]) #in case your data is not sorted
>>> df_2=df.join(df.groupby("id").min(), on="id", how="left", rsuffix="_min")
>>> df_2["time_diff"]=np.where(df_2["id"]==df_2["id"].shift(), (pd.to_datetime(df_2["admit_time"])-pd.to_datetime(df_2["admit_time"].shift()))/pd.Timedelta(days=1), 0) #days as float; .astype('timedelta64[D]') may raise in recent pandas
>>> df_2
   admit_time  id admit_time_min  time_diff
0  2018-10-03  30     2018-10-03        0.0
1  2018-10-29  30     2018-10-03       26.0
2  2017-11-01  13     2017-11-01        0.0
3  2018-02-27  13     2017-11-01      118.0
>>> df_2[(df_2["admit_time"]==df_2["admit_time_min"])  | (df_2["time_diff"]>=30)]
   admit_time  id admit_time_min  time_diff
0  2018-10-03  30     2018-10-03        0.0
2  2017-11-01  13     2017-11-01        0.0
3  2018-02-27  13     2017-11-01      118.0

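Two notes:

(1) You need to sort the data by id, admit_time first.

(2) I did not find an equivalent of dense_rank, so this behaves like a regular rank (the first admission per id is found by comparing against the group minimum admit_time).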
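
On note (2): pandas does offer a dense ranking through the rank method with method="dense", which can be applied per group. The following is a minimal sketch of DENSE_RANK() OVER (PARTITION BY id ORDER BY admit_time); the sample data (including a deliberately duplicated first admission for id 13) and the column names adm_rank and is_first_adm are assumptions for illustration.

```python
import pandas as pd

# Assumed sample; id 13 has two rows tied on its earliest admit_time.
df = pd.DataFrame({
    "id": [30, 30, 13, 13, 13],
    "admit_time": pd.to_datetime(
        ["2018-10-03", "2018-10-29", "2017-11-01", "2017-11-01", "2018-02-27"]
    ),
})

# DENSE_RANK() OVER (PARTITION BY id ORDER BY admit_time):
# ties share a rank and the next distinct value gets the next integer.
df["adm_rank"] = df.groupby("id")["admit_time"].rank(method="dense").astype(int)

# Equivalent of "dense_rank() ... = 1": flags every row tied for first.
df["is_first_adm"] = df["adm_rank"] == 1
print(df)
```

Unlike the group-minimum comparison, this also gives the rank of later admissions, and unlike a plain first-row check it flags all rows tied on the earliest admit_time, as the SQL would.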
