回答此问题可获得 20 贡献值,回答如果被采纳可获得 50 分。
<p>我希望统计连续+重复日期(但唯一发票)的发票数量。我尝试将datetime转换为字符串,然后计算序列,但它只计算每个发票的唯一日期</p>
<pre><code># fix time format
df['Invoice Date'] = pd.to_datetime(df['Invoice Date'])
# column to find sequences
df = df.sort_values(['Vendor ID', 'Vendor Name', 'Invoice Date'])
df['continue'] = df['Invoice Date'].diff().dt.days.ne(1).cumsum()
# find time sequences
val_count = df['continue'].value_counts()
val_count = val_count[val_count.gt(1)].index.tolist()
#filter and group by Vendor
output = df[df['continue'].isin(val_count)].groupby(['Vendor ID'])['Vendor Name'].count()
</code></pre>
<p>电流输入:</p>
<div class="s-table-container">
^{tb1}$
</div>
<p>预期产出:</p>
<div class="s-table-container">
^{tb2}$
</div>
<p>**8月24日:*<strong><strong>我正在编辑我的输入和预期输出。每个人建议的代码都适用于我前面给出的示例,但不适用于我的实际数据。加上它,它反映了我遇到了什么。我已经使用了所有的代码建议,没有任何东西能给Macys 5、JC Penny 5带来任何好处。</strong></strong></p>
<ul>
<li>梅西百货总共有5英镑。发票1-5是连续的,包含重复项</李>
<li>JC Penny的总数是5。发票7-11包括连续两天的发票,并且包含副本</李>
</ul>
<p>**8月25日:*<strong><strong>添加一个新示例,其中包含当前代码(借助@correlian)和数据帧</p>
<pre><code>d = {'Vendor ID':
{0: 8555, 1: 8555, 2: 8555, 3: 8555, 4: 8555, 5: 8555, 6: 8555,7: 8555, 8: 8555,9: 8555, 10: 8555, 11: 8555, 12: 8555, 13: 8555,
14: 1044, 15: 1044, 16: 1044, 17: 1044, 18: 1044, 19: 1044, 20: 1044, 21: 1044},
'Vendor Name':
{0: 'Macys', 1: 'Macys', 2: 'Macys', 3: 'Macys', 4: 'Macys',5: 'Macys', 6: 'Macys', 7: 'Macys', 8: 'Macys', 9: 'Macys', 10: 'Macys', 11: 'Macys', 12: 'Macys', 13: 'Macys',
14: 'JC Penny', 15: 'JC Penny', 16: 'JC Penny', 17: 'JC Penny', 18: 'JC Penny', 19: 'JC Penny', 20: 'JC Penny',21: 'JC Penny'},
'Invoice Date':
{0: '7/31/2020', 1: '7/31/2020', 2: '7/31/2020', 3: '7/31/2020', 4: '7/31/2020', 5: '2/14/2020', 6: '12/31/2019',7: '12/23/2019', 8: '7/14/2020', 9: '6/30/2020' , 10: '6/30/2020',
11: '6/30/2020', 12: '6/30/2020', 13: '6/30/2020', 14: '3/30/2020', 15: '3/20/2020', 16: '3/20/2020', 17: '3/20/2020', 18: '3/20/2020', 19: '2/27/2020', 20: '2/26/2020', 21: '2/26/2020'},
'Invoice Number':
{0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7}}
df = pd.DataFrame(d)
# fix time format
df['Invoice Date'] = pd.to_datetime(df['Invoice Date'])
def count_consecutive_invoice(sr):
# Find days that are less than or equal to 1 day (m1)
m1 = sr.diff().le(pd.Timedelta(days=1))
# find first days of consecutive days (which is False in m1)
m2 = (m1.eq(False) & m1.shift(-1).eq(True))
return m1.sum() + m2.sum()
output = df.groupby(['Vendor ID', 'Vendor Name'])['Invoice Date'] \
.apply(count_consecutive_invoice) \
.rename('Count of Consecutive Invoices') \
.reset_index()
</code></pre>
<p>电流输出:</p>
<pre><code> Vendor ID Vendor Name Count of Consecutive Invoices
0 1044 JC Penny 8
1 8555 Macys 14
</code></pre>
<p>预期产量</p>
<pre><code> Vendor ID Vendor Name Count of Consecutive Invoices
0 1044 JC Penny 7
1 8555 Macys 10
</code></pre>