计数重复记录（但唯一记录）+连续记录

# fix time format df['Invoice Date'] = pd.to_datetime(df['Invoice Date']) # column to find sequences df = df.sort_values(['Vendor ID', 'Vendor Name', 'Invoice Date']) df['continue'] = df['Invoice Date'].diff().dt.days.ne(1).cumsum() # find time sequences val_count = df['continue'].value_counts() val_count = val_count[val_count.gt(1)].index.tolist() #filter and group by Vendor output = df[df['continue'].isin(val_count)].groupby(['Vendor ID'])['Vendor Name'].count()

d = {'Vendor ID': {0: 8555, 1: 8555, 2: 8555, 3: 8555, 4: 8555, 5: 8555, 6: 8555,7: 8555, 8: 8555,9: 8555, 10: 8555, 11: 8555, 12: 8555, 13: 8555, 14: 1044, 15: 1044, 16: 1044, 17: 1044, 18: 1044, 19: 1044, 20: 1044, 21: 1044}, 'Vendor Name': {0: 'Macys', 1: 'Macys', 2: 'Macys', 3: 'Macys', 4: 'Macys',5: 'Macys', 6: 'Macys', 7: 'Macys', 8: 'Macys', 9: 'Macys', 10: 'Macys', 11: 'Macys', 12: 'Macys', 13: 'Macys', 14: 'JC Penny', 15: 'JC Penny', 16: 'JC Penny', 17: 'JC Penny', 18: 'JC Penny', 19: 'JC Penny', 20: 'JC Penny',21: 'JC Penny'}, 'Invoice Date': {0: '7/31/2020', 1: '7/31/2020', 2: '7/31/2020', 3: '7/31/2020', 4: '7/31/2020', 5: '2/14/2020', 6: '12/31/2019',7: '12/23/2019', 8: '7/14/2020', 9: '6/30/2020' , 10: '6/30/2020', 11: '6/30/2020', 12: '6/30/2020', 13: '6/30/2020', 14: '3/30/2020', 15: '3/20/2020', 16: '3/20/2020', 17: '3/20/2020', 18: '3/20/2020', 19: '2/27/2020', 20: '2/26/2020', 21: '2/26/2020'}, 'Invoice Number': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7}} df = pd.DataFrame(d) # fix time format df['Invoice Date'] = pd.to_datetime(df['Invoice Date']) def count_consecutive_invoice(sr): # Find days that are less than or equal to 1 day (m1) m1 = sr.diff().le(pd.Timedelta(days=1)) # find first days of consecutive days (which is False in m1) m2 = (m1.eq(False) & m1.shift(-1).eq(True)) return m1.sum() + m2.sum() output = df.groupby(['Vendor ID', 'Vendor Name'])['Invoice Date'] \ .apply(count_consecutive_invoice) \ .rename('Count of Consecutive Invoices') \ .reset_index()

3条回答

网友
1楼 · 编辑于 2024-10-03 02:38:17

看看这是否是您想要的：
与您的代码类似，下面的代码计数为[Vendor IDVendor Name]组中不同的Invoice Date：
(df.groupby(['Vendor ID', 'Vendor Name'], as_index=False)['Invoice Date'] .nunique() .rename({'Invoice Date': 'Count of Consecutive Invoices'}, axis=1) )
结果：
Vendor ID Vendor Name Count of Consecutive Invoices 0 1044 JC Penny 4 1 8555 Macys 3

网友
2楼 · 编辑于 2024-10-03 02:38:17

每组：
m1：查找天数小于或等于1天（m1）
m2：查找连续天数的前几天（在m1中为False）
然后对2个级数求和
def count_consecutive_invoice(sr): m1 = sr.sort_values().diff().le(pd.Timedelta(days=1)) m2 = (m1.eq(False) & m1.shift(-1).eq(True)) return m1.sum() + m2.sum()
# Ensure your 'Invoice Date' column is datetime64 # df['date'] = pd.to_datetime(df['date']) >>> df.groupby(['Vendor ID', 'Vendor Name'])['Invoice Date'] \ .apply(count_consecutive_invoice) \ .rename('Count of Consecutive Invoices') \ .reset_index() Vendor ID Vendor Name Count of Consecutive Invoices 0 1044 JC Penny 5 1 8555 Macys 5
根据您的实际数据，现在输出为：
Vendor ID Vendor Name Count of Consecutive Invoices 0 1044 JC Penny 7 1 8555 Macys 10

网友
3楼 · 编辑于 2024-10-03 02:38:17

我认为你需要两次，除非你想一行一行地做一个不丑的（也许表现不太好的）。主要问题是您没有将groupby与diff一起使用

df['Count of Consecutive Invoices'] = df.groupby('Vendor ID')['Invoice Date'].diff().dt.days.le(1)
df.groupby(['Vendor ID', 'Vendor Name'], sort=False)['Count of Consecutive Invoices'].sum().add(1).reset_index()
Out[1]: 
   Vendor ID Vendor Name  Count of Consecutive Invoices
0       1044    JC Penny                              4
1       8555       Macys                              3

相关问题更多 >

编程相关推荐

热门问题

热门文章