Finding the number of repeated values with pandas

import pandas as pd

df = pd.DataFrame({
    'id': ['1', '2', '3', '4', '5', '6', '7', '8'],
    'datetime': ['24.06.2013 00:13:49', '24.06.2013 00:14:27', '24.06.2013 00:17:45', '24.06.2013 00:21:54',
                 '24.06.2013 00:21:59', '24.06.2013 00:22:05', '24.06.2013 00:25:14', '24.06.2013 00:26:04'],
    'card_num': ['10', '10', '27', '10', '34', '10', '7', '3'],
    'type': ['cash_withdrawal', 'cash_withdrawal', 'refill', 'cash_withdrawal',
             'payment', 'cash_withdrawal', 'payment', 'cash_withdrawal'],
    'result': ['refusal', 'refusal', 'successful', 'refusal',
               'successful', 'successful', 'successful', 'successful'],
    'summ': [10000, 8000, 42431, 4000, 2347, 3500, 105, 999],
})

# Keep the rows where the type is not 'refill' and the result is 'successful'
df_report = df[(df.type != 'refill') & (df.result == 'successful')]
# Get the card numbers from those rows
card = df_report.card_num
# Filter the main dataframe down to the rows whose card number
# appears among the filtered cards
suspicious = df[df.card_num.isin(card)]
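For the counting part of the question, one possible approach is sketched below. It uses a reduced copy of the sample frame, and the names `counts`, `repeated`, `n_duplicate_rows` and `refusals_per_card` are illustrative, not part of the original code. It assumes "repeated values" means card numbers that occur more than once.

```python
import pandas as pd

# Reduced copy of the sample data: only the columns needed for counting.
df = pd.DataFrame({
    'card_num': ['10', '10', '27', '10', '34', '10', '7', '3'],
    'result': ['refusal', 'refusal', 'successful', 'refusal',
               'successful', 'successful', 'successful', 'successful'],
})

# Number of rows per card number.
counts = df['card_num'].value_counts()

# Card numbers that occur more than once, with their row counts.
repeated = counts[counts > 1]

# Number of rows that repeat an earlier card number.
n_duplicate_rows = int(df['card_num'].duplicated().sum())

# Number of refused operations per card number.
refusals_per_card = df[df['result'] == 'refusal'].groupby('card_num').size()

print(repeated)
print(n_duplicate_rows)
print(refusals_per_card)
```

`value_counts` gives totals per value, while `duplicated` marks every occurrence after the first, so the two answer slightly different questions.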

1 answer

User

#1 · Posted on 2024-09-29 23:18:39

To find consecutive runs of a repeated categorical value, you can do the following:

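import pandas as pd

# Set up data: convert the numeric string columns explicitly
# (errors='ignore' is deprecated in pd.to_numeric since pandas 2.2)
df[['id', 'card_num']] = df[['id', 'card_num']].apply(pd.to_numeric)
# The timestamps are day-first, so give the format explicitly
df['datetime'] = pd.to_datetime(df['datetime'], format='%d.%m.%Y %H:%M:%S')

# Start filtering: drop refills
df1 = df[df['type'] != 'refill']

# One-hot encode 'type' (as integers), then take a rolling sum over 4 rows;
# a sum of 4 means the last 4 rows all have that type
df1 = pd.concat([df1, pd.get_dummies(df1['type'], dtype=int)], axis=1)
df1['withd>3'] = df1['cash_withdrawal'].rolling(4).sum()
df1['payt>3'] = df1['payment'].rolling(4).sum()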
df1 output:
   id             datetime  card_num             type      result    summ  cash_withdrawal  payment  withd>3  payt>3
0   1  2013-06-24 00:13:49        10  cash_withdrawal     refusal   10000                1        0      NaN     NaN
1   2  2013-06-24 00:14:27        10  cash_withdrawal     refusal    8000                1        0      NaN     NaN
2   3  2013-06-24 00:17:45        10  cash_withdrawal     refusal    4000                1        0      NaN     NaN
3   4  2013-06-24 00:21:54        10  cash_withdrawal  successful    3500                1        0      4.0     0.0
5   6  2013-06-24 00:42:05        34          payment     refusal  124125                0        1      3.0     1.0
6   7  2013-06-24 00:45:14         7          payment  successful     105                0        1      2.0     2.0
7   8  2013-06-24 00:49:04         3  cash_withdrawal  successful     999                1        0      2.0     2.0

# Keep only the rows that end a run of 4 consecutive operations of the same type
df1 = df1[(df1['withd>3'] > 3) | (df1['payt>3'] > 3)]

df1 output:
   id             datetime  card_num             type      result  summ  cash_withdrawal  payment  withd>3  payt>3
3   4  2013-06-24 00:21:54        10  cash_withdrawal  successful  3500                1        0      4.0     0.0
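The rolling window above looks at consecutive rows regardless of which card they belong to. If runs should only count when the card number is also the same, a different technique (run labelling with `shift` and `cumsum`, not the rolling sum used above) might look like the sketch below. The names `run_id`, `streak`, `run_length` and the threshold of 3 are assumptions for illustration only, and the frame is a reduced copy of the question's sample data.

```python
import pandas as pd

# Reduced copy of the question's sample data.
df = pd.DataFrame({
    'card_num': ['10', '10', '27', '10', '34', '10', '7', '3'],
    'type': ['cash_withdrawal', 'cash_withdrawal', 'refill', 'cash_withdrawal',
             'payment', 'cash_withdrawal', 'payment', 'cash_withdrawal'],
})

# Drop refills, as in the answer above.
ops = df[df['type'] != 'refill'].copy()

# A new run starts whenever the (card, type) pair differs from the previous row.
key = ops['card_num'] + '|' + ops['type']
run_id = (key != key.shift()).cumsum()

# 1-based position of each row inside its run, and the total length of that run.
ops['streak'] = ops.groupby(run_id).cumcount() + 1
ops['run_length'] = ops.groupby(run_id)['type'].transform('count')

# Hypothetical threshold: keep rows that belong to a run of at least 3.
long_runs = ops[ops['run_length'] >= 3]
print(long_runs)
```

Unlike a fixed rolling window, this labels runs of any length, so the threshold can be changed without recomputing the run columns.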
