Finding the number of repeated values with pandas

Posted 2024-09-29 23:18:39


I have a DataFrame:

import pandas as pd

df = pd.DataFrame(
{'id': ['1', '2', '3', '4', '5', '6', '7', '8'],
 'datetime': ['24.06.2013 00:13:49',
  '24.06.2013 00:14:27',
  '24.06.2013 00:17:45',
  '24.06.2013 00:21:54',
  '24.06.2013 00:21:59',
  '24.06.2013 00:22:05',
  '24.06.2013 00:25:14',
  '24.06.2013 00:26:04'],
 'card_num': ['10', '10', '27', '10', '34', '10', '7', '3'],
 'type': ['cash_withdrawal',
  'cash_withdrawal',
  'refill',
  'cash_withdrawal',
  'payment',
  'cash_withdrawal',
  'payment',
  'cash_withdrawal'],
 'result': ['refusal',
  'refusal',
  'successful',
  'refusal',
  'successful',
  'successful',
  'successful',
  'successful'],
 'summ': [10000, 8000, 42431, 4000, 2347, 3500, 105, 999]})

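I need to find transactions that look fraudulent, using the following criteria:

  • the card's transactions fall within 20 minutes
  • the card's transactions are withdrawals or payments
  • the card has more than three transactions
  • the first three or more transactions on the card have the status "refusal", and the fourth or later ones have the status "successful"
  • each transaction on the card is smaller than the previous one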

This is what I have done so far:

df_report = df[(df.type != 'refill') & (df.result == 'successful')]
# keep the rows where the type is not 'refill' and the result is 'successful'
card = df_report.card_num
# get the card numbers from those rows
suspicious = df[df.card_num.isin(card)]
# filter the main DataFrame down to the rows whose card
# appears among the filtered card numbers
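Next, I need to remove the cards that have too few operations, but I don't know how to do that. Can you tell me? In addition, this DataFrame needs to be filtered on the result column so that only cards with both successful and refused transactions remain.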
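A minimal sketch of those two filters, not a definitive solution. It assumes the threshold is the "more than three transactions" criterion (the name `MIN_OPS` is made up for illustration) and uses a reduced copy of the data with only the columns it needs:

```python
import pandas as pd

# Reduced copy of the question's data: only the columns used here
df = pd.DataFrame({
    'card_num': ['10', '10', '27', '10', '34', '10', '7', '3'],
    'type': ['cash_withdrawal', 'cash_withdrawal', 'refill', 'cash_withdrawal',
             'payment', 'cash_withdrawal', 'payment', 'cash_withdrawal'],
    'result': ['refusal', 'refusal', 'successful', 'refusal',
               'successful', 'successful', 'successful', 'successful'],
})

MIN_OPS = 4  # assumption: "more than three" transactions per card

# Only withdrawals and payments count as operations
ops = df[df['type'] != 'refill']

# Number of operations per card, repeated on every row of that card
n_ops = ops.groupby('card_num')['type'].transform('size')

# True on every row of a card that has at least one refusal / one success
has_refusal = ops['result'].eq('refusal').groupby(ops['card_num']).transform('any')
has_success = ops['result'].eq('successful').groupby(ops['card_num']).transform('any')

suspicious = ops[(n_ops >= MIN_OPS) & has_refusal & has_success]
print(suspicious)
```

With this data only card 10 remains, because it is the only card with at least four non-refill operations and both result values.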


1 answer
User
#1 · Posted 2024-09-29 23:18:39

To find consecutive runs of a repeated categorical value, you can do the following:

import pandas as pd

# Set up data
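# convert the numeric-looking string columns (errors='ignore' is deprecated in recent pandas)
df[['id', 'card_num']] = df[['id', 'card_num']].apply(pd.to_numeric)
# the timestamps are day-first, so give the format explicitly
df['datetime'] = pd.to_datetime(df['datetime'], format='%d.%m.%Y %H:%M:%S')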

# Start filtering
df1 = df[df['type'] != 'refill']

# one-hot encode 'type' (as 0/1 integers), then calculate the rolling sum to find repeated sequences
df1 = pd.concat([df1, pd.get_dummies(df1['type'], dtype=int)], axis=1)
df1['withd>3'] = df1['cash_withdrawal'].rolling(4).sum()
df1['payt>3'] = df1['payment'].rolling(4).sum()

df1 output:
   id             datetime  card_num             type      result    summ  cash_withdrawal  payment  withd>3  payt>3
0   1  2013-06-24 00:13:49        10  cash_withdrawal     refusal   10000                1        0      NaN     NaN
1   2  2013-06-24 00:14:27        10  cash_withdrawal     refusal    8000                1        0      NaN     NaN
2   3  2013-06-24 00:17:45        10  cash_withdrawal     refusal    4000                1        0      NaN     NaN
3   4  2013-06-24 00:21:54        10  cash_withdrawal  successful    3500                1        0      4.0     0.0
5   6  2013-06-24 00:42:05        34          payment     refusal  124125                0        1      3.0     1.0
6   7  2013-06-24 00:45:14         7          payment  successful     105                0        1      2.0     2.0
7   8  2013-06-24 00:49:04         3  cash_withdrawal  successful     999                1        0      2.0     2.0

# filter for consecutive thresholds
df1 = df1[(df1['withd>3'] > 3) | (df1['payt>3'] > 3)]

df1 output:
   id             datetime  card_num             type      result    summ  cash_withdrawal  payment  withd>3  payt>3
3   4  2013-06-24 00:21:54        10  cash_withdrawal  successful    3500                1        0      4.0     0.0
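The rolling sums above run over all rows in order, regardless of which card each row belongs to. As a rough sketch of a per-card check, the listed criteria could be tested one card at a time. The helper name `looks_fraudulent` is hypothetical, and the sketch rests on two assumptions: that all of a card's transactions must fit inside one 20-minute span, and that the refusals must form an unbroken run at the start, followed only by successes.

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['1', '2', '3', '4', '5', '6', '7', '8'],
    'datetime': ['24.06.2013 00:13:49', '24.06.2013 00:14:27',
                 '24.06.2013 00:17:45', '24.06.2013 00:21:54',
                 '24.06.2013 00:21:59', '24.06.2013 00:22:05',
                 '24.06.2013 00:25:14', '24.06.2013 00:26:04'],
    'card_num': ['10', '10', '27', '10', '34', '10', '7', '3'],
    'type': ['cash_withdrawal', 'cash_withdrawal', 'refill', 'cash_withdrawal',
             'payment', 'cash_withdrawal', 'payment', 'cash_withdrawal'],
    'result': ['refusal', 'refusal', 'successful', 'refusal',
               'successful', 'successful', 'successful', 'successful'],
    'summ': [10000, 8000, 42431, 4000, 2347, 3500, 105, 999],
})
df['datetime'] = pd.to_datetime(df['datetime'], format='%d.%m.%Y %H:%M:%S')


def looks_fraudulent(g, window=pd.Timedelta(minutes=20)):
    """Hypothetical helper: test one card's operations against the criteria."""
    g = g.sort_values('datetime')
    if len(g) <= 3:  # needs more than three transactions
        return False
    # Assumption: first and last transaction lie within one 20-minute span
    within_window = g['datetime'].iloc[-1] - g['datetime'].iloc[0] <= window
    # Length of the unbroken run of refusals at the start
    refused = g['result'].eq('refusal').astype(int)
    n_leading = int(refused.cummin().sum())
    # At least three refusals first, then at least one row, all successful
    refusals_then_success = (
        3 <= n_leading < len(g)
        and g['result'].iloc[n_leading:].eq('successful').all()
    )
    # Every amount is smaller than the previous one
    decreasing = g['summ'].diff().dropna().lt(0).all()
    return bool(within_window and refusals_then_success and decreasing)


# Only withdrawals and payments are considered
ops = df[df['type'].isin(['cash_withdrawal', 'payment'])]
flagged = [card for card, g in ops.groupby('card_num') if looks_fraudulent(g)]
print(flagged)
```

On the question's data this flags only card 10: three refusals followed by one success, all within about eight minutes, with amounts 10000, 8000, 4000, 3500.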
