Answers to: How to create a new column by using groupby and filtering a DataFrame

How to create a new column by using groupby and filtering a DataFrame


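Suppose I have a dataset containing time series of heart rates for patients admitted to the intensive care unit (ICU).

I would like to add some inclusion criteria. For example, I only want to consider ICU stays in which the patient's heart rate was at least 90 for at least one hour. If the heart rate of the first measurement after that hour (counted from the reading that was at least 90) is unknown, we assume it is above 90 and include the ICU stay.

The entries of an ICU stay should be included starting from the first measurement that belongs to the "at least 1 hour" interval.

Note that once an ICU stay has been included, it is never excluded again, even if the heart rate drops below 90 at some point.

So the DataFrame below, where "Icustay" is the unique ID of an ICU stay and "Hours" is the time spent in the ICU since admission,

<pre><code>   Heart Rate  Hours  Icustay  Inclusion Criteria
0          79    0.0     1001                   0
1          91    1.5     1001                   0
2         NaN    2.7     1001                   0
3          85    3.4     1001                   0
4          90    0.0     2010                   0
5          94   29.4     2010                   0
6          68    0.0     3005                   0
</code></pre>

should become

<pre><code>   Heart Rate  Hours  Icustay  Inclusion Criteria
0          79    0.0     1001                   0
1          91    1.5     1001                   1
2         NaN    2.7     1001                   1
3          85    3.4     1001                   1
4          90    0.0     2010                   1
5          94   29.4     2010                   1
6          68    0.0     3005                   0
</code></pre>

I have written code for this, and it works. However, it is very slow: on my full dataset it takes up to a few seconds per patient (my real dataset contains more than 6 fields, but I have simplified it here for readability). Since there are 40,000 patients, I would like to speed it up.

This is the code I am currently using, together with the toy dataset shown above.

<pre><code>import numpy as np
import pandas as pd

d = {'Icustay': [1001, 1001, 1001, 1001, 2010, 2010, 3005],
     'Hours': [0, 1.5, 2.7, 3.4, 0, 29.4, 0],
     'Heart Rate': [79, 91, np.nan, 85, 90, 94, 68],
     'Inclusion Criteria': [0, 0, 0, 0, 0, 0, 0]}
all_records = pd.DataFrame(data=d)

for curr in np.unique(all_records['Icustay']):
    print(curr)
    curr_stay = all_records[all_records['Icustay'] == curr]
    indexes = curr_stay['Hours'].index
    heart_rate_flag = False        # True while inside a run that started at >= 90
    heart_rate_begin_time = 0
    heart_rate_begin_index = 0
    for i in indexes:
        if curr_stay['Heart Rate'][i] >= 90 and not heart_rate_flag:
            # A new run starts here
            heart_rate_flag = True
            heart_rate_begin_time = curr_stay['Hours'][i]
            heart_rate_begin_index = i
        elif curr_stay['Heart Rate'][i] < 90:
            # A reading below 90 ends the run
            heart_rate_flag = False
        elif heart_rate_flag and curr_stay['Hours'][i] - heart_rate_begin_time >= 1.0:
            # Run has lasted at least one hour: include the stay from its start
            rows = indexes[indexes >= heart_rate_begin_index]
            all_records.loc[rows, 'Inclusion Criteria'] = 1
            break
</code></pre>

Note that the dataset is sorted by patient and by hours.

Is there a way to speed this up? I have thought about built-in functions such as groupby, but I am not sure whether they help in this particular case.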


1 answer

Anonymous, 1 day ago


This looks a bit ugly, but it avoids loops and <code>apply</code> (which is essentially just a loop under the hood). I haven't tested it on a large dataset, but I suspect it will be much faster than your current code.

First, create some extra columns holding details of the next/previous row, since these are relevant to some of your conditions:

<pre><code>all_records['PrevHeartRate'] = all_records['Heart Rate'].shift()
all_records['NextHours'] = all_records['Hours'].shift(-1)
all_records['PrevICU'] = all_records['Icustay'].shift()
all_records['NextICU'] = all_records['Icustay'].shift(-1)
</code></pre>

Next, create a DataFrame containing the first qualifying record for each ID (this is quite messy because of the amount of logic involved):

<pre><code>first_per_id = (all_records[((all_records['Heart Rate'] >= 90) |
                             ((all_records['Heart Rate'].isnull()) &
                              (all_records['PrevHeartRate'] >= 90) &
                              (all_records['Icustay'] == all_records['PrevICU']))) &
                            ((all_records['Hours'] >= 1) |
                             ((all_records['NextHours'] >= 1) &
                              (all_records['NextICU'] == all_records['Icustay'])))]
                .drop_duplicates(subset='Icustay', keep='first')[['Icustay']]
                .reset_index()
                .rename(columns={'index': 'first_index'}))
</code></pre>

This gives us:

<pre><code>   first_index  Icustay
0            1     1001
1            4     2010
</code></pre>

The new columns can now be dropped from the original DataFrame:

<pre><code>all_records.drop(['PrevHeartRate', 'NextHours', 'PrevICU', 'NextICU'], axis=1, inplace=True)
</code></pre>

We can then merge this with the original DataFrame:

<pre><code>new = pd.merge(all_records, first_per_id, how='left', on='Icustay')
</code></pre>

Giving:

<pre><code>   Heart Rate  Hours  Icustay  Inclusion Criteria  first_index
0        79.0    0.0     1001                   0          1.0
1        91.0    1.5     1001                   0          1.0
2        97.0    2.7     1001                   0          1.0
3         NaN    3.4     1001                   0          1.0
4        90.0    0.0     2010                   0          4.0
5        94.0   29.4     2010                   0          4.0
6        68.0    0.0     3005                   0          NaN
</code></pre>

From here we can compare <code>first_index</code> (the first qualifying index for that ID) with the actual index:

<pre><code>new['Inclusion Criteria'] = new.index >= new['first_index']
</code></pre>

This gives:

<pre><code>   Heart Rate  Hours  Icustay  Inclusion Criteria  first_index
0        79.0    0.0     1001               False          1.0
1        91.0    1.5     1001                True          1.0
2        97.0    2.7     1001                True          1.0
3         NaN    3.4     1001                True          1.0
4        90.0    0.0     2010                True          4.0
5        94.0   29.4     2010                True          4.0
6        68.0    0.0     3005               False          NaN
</code></pre>

From here we just need to tidy up (convert the result column to integers and drop the <code>first_index</code> column):

<pre><code>new.drop('first_index', axis=1, inplace=True)
new['Inclusion Criteria'] = new['Inclusion Criteria'].astype(int)
</code></pre>

Giving the final expected result:

<pre><code>   Heart Rate  Hours  Icustay  Inclusion Criteria
0        79.0    0.0     1001                   0
1        91.0    1.5     1001                   1
2        97.0    2.7     1001                   1
3         NaN    3.4     1001                   1
4        90.0    0.0     2010                   1
5        94.0   29.4     2010                   1
6        68.0    0.0     3005                   0
</code></pre>
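Since the question asks about groupby specifically, here is a rough sketch of a groupby-based variant. It is not the approach from the answer above: that answer tests `Hours` (time since admission) directly, whereas this sketch tries to follow the loop in the question more literally and measures the hour from the start of the current run of readings that have not dropped below 90. The function name `flag_inclusion` and its parameters `threshold` and `min_hours` are made up for illustration, the rows are assumed to be sorted by `Icustay` and `Hours` as stated in the question, and it has not been benchmarked on a large dataset.

```python
import numpy as np
import pandas as pd


def flag_inclusion(df, threshold=90, min_hours=1.0):
    """Hypothetical helper: return a 0/1 Series marking included rows.

    Assumes df is sorted by 'Icustay' and then by 'Hours'.
    """
    stay = df['Icustay']
    high = df['Heart Rate'] >= threshold   # NaN compares as False
    low = df['Heart Rate'] < threshold     # NaN compares as False

    # Every reading below the threshold ends the current run, so a running
    # count of such readings within each stay labels the runs.
    run_id = low.astype(int).groupby(stay).cumsum()
    keys = [stay, run_id]

    # Start time of each run: Hours of its first reading at/above threshold.
    run_start = df['Hours'].where(high).groupby(keys).transform('first')

    # A run qualifies if some row in it lies at least min_hours after the
    # run start (that row may hold a high or an unknown heart rate).
    long_enough = (df['Hours'] - run_start) >= min_hours
    run_ok = long_enough.groupby(keys).transform('any')

    # Include the stay from the first high reading of its first qualifying
    # run onwards; cummax keeps the flag set for the rest of the stay.
    return (high & run_ok).astype(int).groupby(stay).cummax()


# Toy data from the question
all_records = pd.DataFrame({
    'Icustay': [1001, 1001, 1001, 1001, 2010, 2010, 3005],
    'Hours': [0, 1.5, 2.7, 3.4, 0, 29.4, 0],
    'Heart Rate': [79, 91, np.nan, 85, 90, 94, 68],
})
all_records['Inclusion Criteria'] = flag_inclusion(all_records)
print(all_records['Inclusion Criteria'].tolist())
# [0, 1, 1, 1, 1, 1, 0]
```

On the toy data this reproduces the expected column from the question. Whether it matches the original loop on every edge case of a real dataset (for example, duplicate timestamps) would need to be checked against the loop's output on a sample of stays.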
