Benford's law test function with groupby.agg

In [58]: df
Out[58]:
   Send_Agent  Send_Amount
0   ADR000264   361.940000
1   ADR000264    12.930000
2   ADR000264    11.630000
3   ADR000264    12.930000
4   ADR000264    64.630000
5   ADR000264    12.930000
6   ADR000264    77.560000
7   ADR000264   145.010000
8   API185805       112.34
9   API185805        56.45
10  API185805        48.97
11  API185805        85.44
12  API185805        94.33
13  API185805       116.45

In [59]: grouped = df.groupby('Send_Agent')

In [60]: a = grouped.agg({'Send_Amount':leading_digit})

In [61]: a
Out[61]:
            Send_Amount
Send_Agent
ADR000264             0
API185805             6

In [16]: result = df.assign(Leading_Digit = df['Send_Amount'].astype(str).str[0]).groupby('Send_Agent')['Leading_Digit'].value_counts(sort=False)

In [17]: result
Out[17]:
Send_Agent  Leading_Digit
ADR000264   1                5509
            2                4748
            3                2090
            4                2497
            5                 979
            6                1206
            7                 529
            8                 549
            9                 729
API185805   1                1707
            2                1966
            3                 744
            4                1218
            5                 306
            6                 605
            7                 138
            8                 621
            9                  76

In [22]: result = result.to_frame()

In [29]: result.columns = ['Count']

In [32]: result
Out[32]:
                          Count
Send_Agent Leading_Digit
ADR000264  1               5509
           2               4748
           3               2090
           4               2497
           5                979
           6               1206
           7                529
           8                549
           9                729
API185805  1               1707
           2               1966
           3                744
           4               1218
           5                306
           6                605
           7                138
           8                621
           9                 76

In [33]: result['Count'] = (result['Count'])/(result['Count'].sum())

In [34]: result
Out[34]:
                             Count
Send_Agent Leading_Digit
ADR000264  1              0.210131
           2              0.181104
           3              0.079719
           4              0.095244
           5              0.037342
           6              0.046001
           7              0.020178
           8              0.020941
           9              0.027806
API185805  1              0.065110
           2              0.074990
           3              0.028379
           4              0.046458
           5              0.011672
           6              0.023077
           7              0.005264
           8              0.023687
           9              0.002899

In [35]: result.unstack()
Out[35]:
                  Count                                                    \
Leading_Digit         1         2         3         4         5         6
Send_Agent
ADR000264      0.210131  0.181104  0.079719  0.095244  0.037342  0.046001
API185805      0.065110  0.074990  0.028379  0.046458  0.011672  0.023077

Leading_Digit         7         8         9
Send_Agent
ADR000264      0.020178  0.020941  0.027806
API185805      0.005264  0.023687  0.002899

So, the Benford values for digits 1 to 9 are as follows:

d = 0.30103, 0.176091, 0.124939, 0.09691, 0.0791812, 0.0669468, 0.0579919, 0.0511525, 0.0457575

2 Answers

User

Answer 1 · edited 2024-10-04 09:28:01

Cool project. I'll illustrate with a randomly generated dataset:

import numpy as np
import pandas as pd
np.random.seed(0)
Send_Amount = 10**(np.random.randint(1, 9, 10**6)) * \
                  (np.random.choice(np.arange(1, 10), 
                                    p=np.log10(1+(1/np.arange(1, 10))), 
                                    size=10**6) + 
                   np.random.rand(10**6))
Send_Agent = np.random.choice(['ADR000264', 'API185805'], 10**6)
df = pd.DataFrame({'Send_Agent': Send_Agent, 'Send_Amount': Send_Amount.astype(int)})

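The result is a DataFrame with one million rows and two columns: Send_Agent (one of the two agent IDs) and Send_Amount (an integer amount).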

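Now, if you apply that function to the Series Send_Amount, it returns another Series holding the leading digits. If you group first, you need to specify what kind of result you want for each group. The function is not designed to take a group and return a result for that group. It only returns the leading digit of a single number.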
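To make that point concrete, here is a minimal sketch. The question's leading_digit function is not shown, so the version below is a hypothetical stand-in that handles one number at a time, and the tiny frame is made-up illustration data.

```python
import pandas as pd

def leading_digit(x):
    # Hypothetical stand-in for the question's function:
    # first character of a single number, as an int
    return int(str(x)[0])

# Made-up illustration data, not the question's real transactions
small = pd.DataFrame({
    'Send_Agent': ['A', 'A', 'A', 'B', 'B'],
    'Send_Amount': [361, 12, 145, 56, 94],
})

# Element-wise on the Series: one leading digit per row
per_row = small['Send_Amount'].map(leading_digit)
print(per_row.tolist())  # [3, 1, 1, 5, 9]

# Per group: count the leading digits within each agent
counts = per_row.groupby(small['Send_Agent']).value_counts(sort=False)
print(counts)
```

The function stays a per-row operation; the grouping only enters when the digits are counted.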

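Instead, to verify Benford's law you need to check the frequency distribution of the leading digits. Once you have a column holding the leading digit, you can group by Send_Agent and call value_counts on that column. All in all, it looks like this: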

result = df.assign(Leading_Digit = df['Send_Amount'].astype(str).str[0]).groupby('Send_Agent')['Leading_Digit'].value_counts(sort=False)
print(result)
Out[105]: 
Send_Agent  Leading_Digit
ADR000264   1                150522
            2                 87739
            3                 62460
            4                 48204
            5                 39757
            6                 33791
            7                 29024
            8                 25567
            9                 23044
API185805   1                150575
            2                 87994
            3                 62173
            4                 48323
            5                 39452
            6                 33720
            7                 29141
            8                 25538
            9                 22976
Name: Leading_Digit, dtype: int64

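You can also do this with df.groupby('Send_Agent')['Leading_Digit'].value_counts(sort=False) (after creating the column). I just did it in one step. In the end, the distribution will (hopefully) look like this: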

result.unstack(level=0).plot.bar(subplots=True)
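The two-step variant mentioned above (create the column first, then group) can be sketched on a tiny made-up frame. The frame and the names small, two_step and wide are illustrative assumptions, not part of the original data.

```python
import pandas as pd

# Made-up illustration data, not the question's real transactions
small = pd.DataFrame({
    'Send_Agent': ['A', 'A', 'A', 'B', 'B'],
    'Send_Amount': [361, 12, 145, 56, 94],
})

# Step 1: store the leading digit as its own column (a string such as '3')
small['Leading_Digit'] = small['Send_Amount'].astype(str).str[0]

# Step 2: count leading digits within each agent
two_step = small.groupby('Send_Agent')['Leading_Digit'].value_counts(sort=False)

# One agent per column, one leading digit per row; absent digits become 0
wide = two_step.unstack(level=0, fill_value=0)
print(wide)
```

The wide layout is the shape that plot.bar(subplots=True) draws, one panel per agent.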

To find the difference between the theoretical probabilities and the observed frequencies, you can do the following:

result = df.assign(Leading_Digit = df['Send_Amount'].astype(str).str[0]).groupby('Send_Agent')['Leading_Digit'].value_counts(sort=False, normalize=True)

Note that I passed normalize=True so that it computes proportions within each group instead of raw counts.
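The difference from the question's approach, which divides every count by result['Count'].sum() (the total over both agents), can be seen in a small sketch. The data below is made up for illustration.

```python
import pandas as pd

# Made-up illustration data: agent A has 4 rows, agent B has 2
small = pd.DataFrame({
    'Send_Agent': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Send_Amount': [361, 12, 145, 18, 56, 94],
})
digits = small['Send_Amount'].astype(str).str[0]

# normalize=True divides by each group's own size
per_group = digits.groupby(small['Send_Agent']).value_counts(normalize=True)

# Dividing raw counts by the overall total mixes the agents together
counts = digits.groupby(small['Send_Agent']).value_counts()
overall = counts / counts.sum()

print(per_group.groupby(level=0).sum())  # every agent sums to 1
print(overall.groupby(level=0).sum())    # every agent sums to its share of all rows
```

Only the per-group proportions are directly comparable with the Benford probabilities, since those also sum to 1.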

Now you can take the difference like this:

result.unstack(level=0).subtract(np.log10(1+(1/np.arange(1, 10))), axis=0).abs()
Out[16]: 
Send_Agent     ADR000264  API185805
Leading_Digit                      
1               0.000051   0.000185
2               0.000651   0.000065
3               0.000046   0.000566
4               0.000523   0.000243
5               0.000316   0.000260
6               0.000621   0.000508
7               0.000044   0.000303
8               0.000030   0.000065
9               0.000321   0.000204

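Here, unstack moves Send_Agent to the columns. np.log10(1+(1/np.arange(1, 10))) computes the theoretical probabilities. You can also pass the array you defined earlier. Since we want to subtract the elements row by row, the subtract method is given axis=0. Finally, .abs() takes the absolute value of the result.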
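Subtracting a plain NumPy array matches rows by position, which relies on all nine digits being present and in order. A slightly more defensive sketch, under the assumption that some digits may be missing for an agent, gives the theoretical probabilities P(d) = log10(1 + 1/d) a labelled index and subtracts by label. The observed proportions below are made up.

```python
import numpy as np
import pandas as pd

digits = np.arange(1, 10)
# Benford's law: P(d) = log10(1 + 1/d); string labels match the output of str[0]
benford = pd.Series(np.log10(1 + 1 / digits), index=digits.astype(str))

# Made-up observed proportions for two hypothetical agents
observed = pd.DataFrame(
    {'A': [0.5, 0.3, 0.2], 'B': [0.4, 0.4, 0.2]},
    index=pd.Index(['1', '2', '9'], name='Leading_Digit'),
)

# Reindex so digits that never occur count as 0, then subtract by label
full = observed.reindex(benford.index, fill_value=0.0)
deviation = full.subtract(benford, axis=0).abs()
print(deviation.round(4))
```

With the full random dataset above, every digit occurs for both agents, so the positional version and this one should agree.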

User

Answer 2 · edited 2024-10-04 09:28:01

You can use transform together with str[0], because agg aggregates its output to one value per group:

print (df['Send_Amount'].astype(str).str[0].astype(int))
0     3
1     1
2     1
3     1
4     6
5     1
6     7
7     1
8     1
9     5
10    4
11    8
12    9
13    1
Name: Send_Amount, dtype: int32

print (df.groupby('Send_Agent')['Send_Amount'].transform(lambda x: x.astype(str).str[0])
         .astype(int))
0     3
1     1
2     1
3     1
4     6
5     1
6     7
7     1
8     1
9     5
10    4
11    8
12    9
13    1
Name: Send_Amount, dtype: int32

If you need numbers greater than 9 (that is, the first two digits), use str[:2] in place of str[0].
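As a rough sketch of the str[:2] variant, assuming integer amounts (a float such as 5.5 would yield '5.' and fail the int conversion), using made-up data:

```python
import pandas as pd

# Made-up illustration data with integer amounts
small = pd.DataFrame({
    'Send_Agent': ['A', 'A', 'B', 'B'],
    'Send_Amount': [361, 12, 5648, 94],
})

# str[:2] keeps the first two characters, so the results range up to 99
first_two = (small.groupby('Send_Agent')['Send_Amount']
                  .transform(lambda x: x.astype(str).str[:2])
                  .astype(int))
print(first_two.tolist())  # [36, 12, 56, 94]
```

Because transform returns one value per row, the result can be assigned straight back as a new column.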

