计数重复记录（但唯一记录）+连续记录问题的回答

计数重复记录（但唯一记录）+连续记录

回答此问题可获得 20 贡献值，回答如果被采纳可获得 50 分。

我希望统计连续+重复日期（但唯一发票）的发票数量。我尝试将datetime转换为字符串，然后计算序列，但它只计算每个发票的唯一日期 <pre><code># fix time format df['Invoice Date'] = pd.to_datetime(df['Invoice Date']) # column to find sequences df = df.sort_values(['Vendor ID', 'Vendor Name', 'Invoice Date']) df['continue'] = df['Invoice Date'].diff().dt.days.ne(1).cumsum() # find time sequences val_count = df['continue'].value_counts() val_count = val_count[val_count.gt(1)].index.tolist() #filter and group by Vendor output = df[df['continue'].isin(val_count)].groupby(['Vendor ID'])['Vendor Name'].count() </code></pre> 电流输入： <div class="s-table-container"> ^{tb1}$ </div> 预期产出： <div class="s-table-container"> ^{tb2}$ </div> **8月24日：*我正在编辑我的输入和预期输出。每个人建议的代码都适用于我前面给出的示例，但不适用于我的实际数据。加上它，它反映了我遇到了什么。我已经使用了所有的代码建议，没有任何东西能给Macys 5、JC Penny 5带来任何好处。 <ul> <li>梅西百货总共有5英镑。发票1-5是连续的，包含重复项</李> <li>JC Penny的总数是5。发票7-11包括连续两天的发票，并且包含副本</李> </ul> **8月25日：*添加一个新示例，其中包含当前代码（借助@correlian）和数据帧 <pre><code>d = {'Vendor ID': {0: 8555, 1: 8555, 2: 8555, 3: 8555, 4: 8555, 5: 8555, 6: 8555,7: 8555, 8: 8555,9: 8555, 10: 8555, 11: 8555, 12: 8555, 13: 8555, 14: 1044, 15: 1044, 16: 1044, 17: 1044, 18: 1044, 19: 1044, 20: 1044, 21: 1044}, 'Vendor Name': {0: 'Macys', 1: 'Macys', 2: 'Macys', 3: 'Macys', 4: 'Macys',5: 'Macys', 6: 'Macys', 7: 'Macys', 8: 'Macys', 9: 'Macys', 10: 'Macys', 11: 'Macys', 12: 'Macys', 13: 'Macys', 14: 'JC Penny', 15: 'JC Penny', 16: 'JC Penny', 17: 'JC Penny', 18: 'JC Penny', 19: 'JC Penny', 20: 'JC Penny',21: 'JC Penny'}, 'Invoice Date': {0: '7/31/2020', 1: '7/31/2020', 2: '7/31/2020', 3: '7/31/2020', 4: '7/31/2020', 5: '2/14/2020', 6: '12/31/2019',7: '12/23/2019', 8: '7/14/2020', 9: '6/30/2020' , 10: '6/30/2020', 11: '6/30/2020', 12: '6/30/2020', 13: '6/30/2020', 14: '3/30/2020', 15: '3/20/2020', 16: '3/20/2020', 17: '3/20/2020', 18: '3/20/2020', 19: '2/27/2020', 20: '2/26/2020', 21: '2/26/2020'}, 'Invoice Number': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7}} df = pd.DataFrame(d) # fix time format df['Invoice Date'] = pd.to_datetime(df['Invoice Date']) def count_consecutive_invoice(sr): # Find days that are less than or equal to 1 day (m1) m1 = sr.diff().le(pd.Timedelta(days=1)) # find first days of consecutive days (which is False in m1) m2 = (m1.eq(False) & m1.shift(-1).eq(True)) return m1.sum() + m2.sum() output = df.groupby(['Vendor ID', 'Vendor Name'])['Invoice Date'] \ .apply(count_consecutive_invoice) \ .rename('Count of Consecutive Invoices') \ .reset_index() </code></pre> 电流输出： <pre><code> Vendor ID Vendor Name Count of Consecutive Invoices 0 1044 JC Penny 8 1 8555 Macys 14 </code></pre> 预期产量 <pre><code> Vendor ID Vendor Name Count of Consecutive Invoices 0 1044 JC Penny 7 1 8555 Macys 10 </code></pre>

0 条评论
分类：Python问答

默认排序时间排序

1 个回答

匿名 1天前

　擅长：python、mysql、java

计数重复记录（但唯一记录）+连续记录

1 个回答

相关Python问题