<p>I wrote a function that selects the last valid record of a week; it is meant to be applied to a weekly groupby:</p>
<pre><code>def last_valid_report(recs):
    if len(recs) == 1:
        return recs
    recs = recs.copy()
    # recs = recs[recs['dates'].dt.weekday <= 4].nlargest(1, recs['dates'].dt.weekday)  # doesn't work
    recs['weekday'] = recs['dates'].dt.weekday  # because nlargest() needs a column name
    recs = recs[recs['weekday'] <= 4].nlargest(1, 'weekday')
    del recs['weekday']
    return recs
    # could have also done:
    # return recs[recs['weekday'] <= 4].nlargest(1, 'weekday').drop('weekday', axis=1)
</code></pre>
<p>Calling it with the correct grouping, I get:</p>
<pre><code>In [155]: df2 = df.groupby(df['dates'].dt.week).apply(last_valid_report)
In [156]: df2
Out[156]:
              dates  nums
dates
45    4  2018-11-09    63
46    8  2018-11-15    90
47    10 2018-11-19    80
48    11 2018-12-01    94
</code></pre>
<hr/>
<p>A few issues:</p>
<ol>
<li><p>If I leave out <code>recs.copy()</code>, I get <code>ValueError: Shape of passed values is (3, 12), indices imply (3, 4)</code></p></li>
<li><p><a href="https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.DataFrame.nlargest.html" rel="nofollow noreferrer">pandas' <code>nlargest()</code></a> only accepts column names, not an expression</p>
<ul>
<li>so I need to create an extra column inside the function and drop it before returning. <em>I could also create it in the original df and drop it after the <code>.apply()</code></em></li>
</ul></li>
<li><p>I get an extra index level 'dates' from the <a href="https://stackoverflow.com/a/12411852/1431750">groupby+apply</a>, <em>which needs to be <a href="https://stackoverflow.com/a/42124685/1431750">explicitly dropped</a></em>:</p>
<pre><code>In [157]: df2.index = df2.index.droplevel(); df2
Out[157]:
        dates  nums
4  2018-11-09    63
8  2018-11-15    90
10 2018-11-19    80
11 2018-12-01    94
</code></pre></li>
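<li><p>If a group contains only Saturday and Sunday data (2 days), I would need to add a check for whether <code>recs[recs['weekday'] <= 4]</code> is empty and, if so, use just <code>.nlargest(1, 'weekday')</code> without the <code>weekday <= 4</code> filter; but that is beside the point of this question</p></li>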
</ol>
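<p>For reference, the issues listed above can be sidestepped in one possible way, sketched below. This is only an illustration under stated assumptions, not necessarily the best approach: the sample <code>df</code> is hypothetical (it is not the data from the session above), the index is assumed to be unique, and <code>.dt.isocalendar().week</code> is used in place of <code>.dt.week</code> because the latter was removed in later pandas versions. The weekday values are kept in a separate Series, so no helper column or <code>recs.copy()</code> is needed; <code>group_keys=False</code> is intended to avoid the extra index level; and the weekend-only case falls back to the unfiltered weekdays.</p>

```python
import pandas as pd

def last_valid_report(recs):
    """Return the one row of `recs` with the latest weekday, preferring Mon-Fri."""
    weekday = recs['dates'].dt.weekday   # Series aligned with recs; no helper column
    workdays = weekday[weekday <= 4]     # keep Monday (0) through Friday (4)
    # Fallback for weekend-only groups: use all weekdays if no workday exists.
    candidates = workdays if not workdays.empty else weekday
    # idxmax() gives the index label of the largest weekday; wrapping it in a
    # list keeps the result a one-row DataFrame (assumes a unique index).
    return recs.loc[[candidates.idxmax()]]

# Hypothetical sample data, NOT the original df from the question.
df = pd.DataFrame({
    'dates': pd.to_datetime(['2018-11-05', '2018-11-09', '2018-11-10',
                             '2018-11-15', '2018-11-17',
                             '2018-11-24', '2018-11-25']),
    'nums': [10, 63, 20, 90, 30, 40, 50],
})

# Group by ISO week number; group_keys=False keeps the week out of the result index.
week = df['dates'].dt.isocalendar().week
df2 = df.groupby(week, group_keys=False).apply(last_valid_report)
print(df2)
```

<p>With this sample data, the first two weeks yield their latest Monday-to-Friday rows, while the last week contains only a Saturday and a Sunday and therefore falls back to the Sunday row.</p>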