Below is a small sample of my DataFrame, which is more than 25,000 rows long:
In [58]: df
Out[58]:
Send_Agent Send_Amount
0 ADR000264 361.940000
1 ADR000264 12.930000
2 ADR000264 11.630000
3 ADR000264 12.930000
4 ADR000264 64.630000
5 ADR000264 12.930000
6 ADR000264 77.560000
7 ADR000264 145.010000
8 API185805 112.34
9 API185805 56.45
10 API185805 48.97
11 API185805 85.44
12 API185805 94.33
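13 API185805 116.45
There are two Send_Agents, ADR000264 and API185805. I am trying to apply Benford's law to Send_Amount. When I run it on all Send_Amount values regardless of Send_Agent, it works. Below is my function for extracting the leading digit.
def leading_digit(x, dig=1):
    # Convert the value to a string and return its dig-th character as an int
    x = str(x)
    out = int(x[dig-1])
    return out
Applying this function to the Send_Amount column works fine: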
In [75]: df['Send_Amount'].apply(leading_digit)
Out[75]:
0 3
1 1
2 1
3 1
4 6
5 1
6 7
7 1
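8 1
The output is a Series containing the leading digit of each value in the Send_Amount column.
But when I try the same function after grouping by Send_Agent, I get the wrong result: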
In [74]: df['Send_Amount'].groupby(df['Send_Agent']).apply(leading_digit)
Out[74]:
Send_Agent
ADR000264 0
API185805 6
dtype: int64
The same happens with groupby.agg:
In [59]: grouped = df.groupby('Send_Agent')
In [60]: a = grouped.agg({'Send_Amount':leading_digit})
In [61]: a
Out[61]:
Send_Amount
Send_Agent
ADR000264 0
API185805 6
Edit:
So now we have the counts of the leading digits.
In [16]: result = df.assign(Leading_Digit = df['Send_Amount'].astype(str).str[0]).groupby('Send_Agent')['Leading_Digit'].value_counts(sort=False)
In [17]: result
Out[17]:
Send_Agent Leading_Digit
ADR000264 1 5509
2 4748
3 2090
4 2497
5 979
6 1206
7 529
8 549
9 729
API185805 1 1707
2 1966
3 744
4 1218
5 306
6 605
7 138
8 621
9 76
dtype: int64
In [18]: type(result)
Out[18]: pandas.core.series.Series
I don't need to plot a chart. I only need to subtract the Benford values from the counts.
In [22]: result = result.to_frame()
In [29]: result.columns = ['Count']
In [32]: result
Out[32]:
Count
Send_Agent Leading_Digit
ADR000264 1 5509
2 4748
3 2090
4 2497
5 979
6 1206
7 529
8 549
9 729
API185805 1 1707
2 1966
3 744
4 1218
5 306
6 605
7 138
8 621
9 76
In [33]: result['Count'] = (result['Count'])/(result['Count'].sum())
In [34]: result
Out[34]:
Count
Send_Agent Leading_Digit
ADR000264 1 0.210131
2 0.181104
3 0.079719
4 0.095244
5 0.037342
6 0.046001
7 0.020178
8 0.020941
9 0.027806
API185805 1 0.065110
2 0.074990
3 0.028379
4 0.046458
5 0.011672
6 0.023077
7 0.005264
8 0.023687
9 0.002899
In [35]: result.unstack()
Out[35]:
Count \
Leading_Digit 1 2 3 4 5 6
Send_Agent
ADR000264 0.210131 0.181104 0.079719 0.095244 0.037342 0.046001
API185805 0.065110 0.074990 0.028379 0.046458 0.011672 0.023077
Leading_Digit 7 8 9
Send_Agent
ADR000264 0.020178 0.020941 0.027806
API185805 0.005264 0.023687 0.002899
So, the Benford values for digits 1 to 9 are as follows:
d = 0.30103, 0.176091, 0.124939, 0.09691, 0.0791812, 0.0669468, 0.0579919, 0.0511525, 0.0457575
All I want to do is subtract them from the Count column of result.
I am still new to Pandas and Python. How can I do this?
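For illustration, here is one possible sketch of the subtraction, not a definitive answer. It rests on a few assumptions that are not in the post above: the tiny `df` is made-up stand-in data, the names `benford`, `proportions`, and `difference` are hypothetical, every amount is assumed to be positive and at least 1 (so the first character of its string form is a digit from 1 to 9), and the proportions are computed per agent rather than against the grand total used in In [33]. The leading digit is extracted element-wise before grouping, because `groupby(...).apply` and `agg` pass each whole group (a Series) to the function, not one value at a time. That is presumably why `leading_digit` returned a single, unrelated digit per agent: it read the first character of the Series' printed form.

```python
import numpy as np
import pandas as pd

# Made-up miniature data standing in for the 25,000-row frame (assumption).
df = pd.DataFrame({
    "Send_Agent": ["ADR000264"] * 4 + ["API185805"] * 4,
    "Send_Amount": [361.94, 12.93, 11.63, 64.63, 112.34, 56.45, 48.97, 85.44],
})

# Benford's expected proportion for leading digit d is log10(1 + 1/d).
digits = np.arange(1, 10)
benford = pd.Series(np.log10(1 + 1 / digits), index=digits, name="Benford")

# Extract the leading digit of every amount BEFORE grouping.
# Assumes amounts are positive and >= 1, as in the sample above.
leading = df["Send_Amount"].astype(str).str[0].astype(int).rename("Leading_Digit")

# Count digits per agent: one row per agent, one column per digit 1..9.
counts = pd.crosstab(df["Send_Agent"], leading).reindex(columns=digits, fill_value=0)

# Normalise within each agent so that every row sums to 1.
proportions = counts.div(counts.sum(axis=1), axis=0)

# Subtract the Benford proportions, aligned on the digit columns.
difference = proportions.sub(benford, axis=1)

print(difference.round(4))
```

Positive entries in `difference` mark digits that occur more often than Benford's law predicts for that agent, negative entries mark digits that occur less often. If the comparison really should be against the grand total, as in In [33], the per-agent normalisation line would need to change accordingly.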
https://stackoverflow.com/questions/38338864