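<p><em><strong>This is a great question.</strong></em> I want to emphasize the power of the time intervals between events: a sequence like the one you posted carries a lot of insight into behavior and predictability. With that in mind, I wrote a fairly long answer that I hope explains some core principles of data manipulation.</p>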
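<p><strong>1. Create a custom function to perform the calculation:</strong><br/>
<em>(This assumes you are applying it to a single list; I recommend extracting one list while debugging or testing.)</em></p>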
<pre class="lang-py prettyprint-override"><code>import numpy as np


def event_metrics(my_list, look_for="Accept", exclude_zeros=True, simple=True):
    """
    Simple mode:
        Returns the average number of items before `look_for`
    Non-simple mode:
        Returns a dictionary with the mean, median, and max number of items
        before `look_for`

    my_list: a list of values
    look_for: an item in the list which constitutes the "event"
        Example: "Accept" from a list of "Accept" and "Reject"
    exclude_zeros: exclude metrics for when `look_for` occurs back to back
    simple: operate in simple mode or non-simple mode
    """
    # Instantiate a counter list
    my_counter = []
    n = 0
    # Loop through the list
    for x in my_list:
        # If a match, add n to the list and reset
        if x == look_for:
            my_counter.append(n)
            n = 0
        # Otherwise, keep counting
        else:
            n += 1
    # Sometimes you might want to append the final n at conclusion of the loop
    # You could do that with the following code:
    # if x != look_for:
    #     my_counter.append(n)
    # You may not want to include back-to-back events
    if exclude_zeros:
        my_counter = [c for c in my_counter if c > 0]
    # With no qualifying events np.max would raise, so fall back to NaN
    if not my_counter:
        my_counter = [np.nan]
    # You can return a specific metric such as mean
    if simple:
        return np.mean(my_counter)
    # Or you can pass several metrics as a dictionary and convert to a series
    my_metrics = {
        "mean": np.mean(my_counter),
        "median": np.median(my_counter),
        "max": np.max(my_counter),
    }
    return my_metrics
</code></pre>
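<p>To make the counting logic concrete, here is a minimal sketch on a made-up list. The helper name <code>gaps_before_event</code> and the sample values are my own assumptions for illustration, not part of the question's data; it only reproduces the loop from the function above, without the metrics.</p>
<pre class="lang-py prettyprint-override"><code>from statistics import mean


def gaps_before_event(values, look_for="Accept"):
    """Hypothetical helper: count the items seen before each `look_for`."""
    gaps = []
    n = 0
    for value in values:
        if value == look_for:
            gaps.append(n)  # record the run length, then start a new run
            n = 0
        else:
            n += 1
    # The trailing run (items after the last event) is intentionally dropped
    return gaps


# Made-up sample sequence
sample = ["Reject", "Reject", "Accept", "Accept", "Reject", "Accept", "Reject"]
gaps = gaps_before_event(sample)
nonzero_gaps = [g for g in gaps if g > 0]  # same idea as exclude_zeros=True

print(gaps)                # [2, 0, 1]
print(nonzero_gaps)        # [2, 1]
print(mean(nonzero_gaps))  # 1.5
</code></pre>
<p>The back-to-back "Accept" produces the <code>0</code>, which is what <code>exclude_zeros</code> filters out, and the final "Reject" is never counted unless you append the last <code>n</code> after the loop.</p>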
<p><strong>2. Apply this custom function to the df:</strong><br/></p>
<ul>
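<li><strong>Simple mode:</strong> returns an array of single values; treat it as a new column.<br/></li>
<li><strong>Non-simple mode:</strong> returns an array of dictionaries; convert it to multiple columns with <code>.apply(pd.Series)</code>, then use <code>df.merge</code> to add them to the original <code>df</code>.</li>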
</ul>
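<pre class="lang-py prettyprint-override"><code># Assumes pandas is imported as pd and df already exists

# Simple mode: one value per row
avg_before_accept = df["sequence_of_selection"].apply(event_metrics, simple=True)

# Non-simple mode: one dictionary per row, expanded into columns
temp_df = (
    df["sequence_of_selection"]
    .apply(event_metrics, simple=False)
    .apply(pd.Series)    # Convert to its own df
    .add_prefix("rej_")  # Add a prefix to your column names
)
# merge returns a new DataFrame, so keep the result
merged_df = df.merge(temp_df, left_index=True, right_index=True)
</code></pre>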
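<p>If you want to try the non-simple workflow end to end, the following is a rough, self-contained sketch. The toy DataFrame and the compact stand-in function <code>rejects_before_accept</code> are assumptions I made up for the example; only the column name <code>sequence_of_selection</code> and the <code>rej_</code> prefix come from the code above.</p>
<pre class="lang-py prettyprint-override"><code>import numpy as np
import pandas as pd


def rejects_before_accept(values, look_for="Accept"):
    """Hypothetical compact stand-in for event_metrics in non-simple mode."""
    gaps, n = [], 0
    for value in values:
        if value == look_for:
            gaps.append(n)
            n = 0
        else:
            n += 1
    gaps = [g for g in gaps if g > 0]  # drop back-to-back events
    if not gaps:
        # No qualifying event in this row: report NaN instead of raising
        return {"mean": np.nan, "median": np.nan, "max": np.nan}
    return {
        "mean": float(np.mean(gaps)),
        "median": float(np.median(gaps)),
        "max": float(np.max(gaps)),
    }


# Made-up data: each row holds one list of selections
toy_df = pd.DataFrame({
    "sequence_of_selection": [
        ["Reject", "Accept", "Reject", "Reject", "Reject", "Accept"],
        ["Accept", "Accept"],
    ]
})

metrics_df = (
    toy_df["sequence_of_selection"]
    .apply(rejects_before_accept)  # Series of dictionaries
    .apply(pd.Series)              # one column per dictionary key
    .add_prefix("rej_")
)
result = toy_df.merge(metrics_df, left_index=True, right_index=True)
print(result[["rej_mean", "rej_median", "rej_max"]])
</code></pre>
<p>The first row has gaps of 1 and 3, so its mean and median are 2.0 and its max is 3.0; the second row has only back-to-back events, so all three metrics come out as NaN.</p>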