I'm facing a small challenge with pandas that I'm having trouble figuring out.
I created two DataFrames with the following code:
df5 = dataFrame[['PdDistrict' , 'Category']]
df5 = df5[pd.notnull(df5['PdDistrict'])]
df5 = df5.groupby(['Category', 'PdDistrict']).size()
df5 = df5.reset_index()
df5 = df5.sort_values(['PdDistrict',0], ascending=False)
df6 = df5.groupby('PdDistrict')[0].sum()
df6 = df6.reset_index()
This gives me two DataFrames. df5 contains the number of times a specific category occurs in a given district. For example:
'Category' 'PdDistrict' 'count'
Drugs Bayview 200
Theft Bayview 200
Gambling Bayview 200
Drugs CENTRAL 300
Theft CENTRAL 300
Gambling CENTRAL 300
df6 contains the total count across all categories for a given PdDistrict, so it looks like this:
'PdDistrict' 'total count'
Bayview 600
CENTRAL 900
Now what I want is for df5 to look like this:
'Category' 'PdDistrict' 'count' 'Average'
Drugs Bayview 200 0.33
Theft Bayview 200 0.33
Gambling Bayview 200 0.33
Drugs CENTRAL 200 0.22
Theft CENTRAL 200 0.22
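Gambling CENTRAL 200 0.22
So basically it takes the count from df5 and divides it by the total from df6 for the same district. How can I do this?
res = df5.set_index('PdDistrict', append = False) / df6.set_index('PdDistrict', append = False)
The above gives me NaN for Category.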
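Answered on 2016-02-22 14:07:46
You can add the total count values to the first DataFrame and then perform the calculation (here df and df1 appear to stand for the question's df5 and df6, with the count columns named count and total count):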
In [45]:
df['total count'] = df['PdDistrict'].map(df1.set_index('PdDistrict')['total count'])
df
Out[45]:
Category PdDistrict count total count
0 Drugs Bayview 200 600
1 Theft Bayview 200 600
2 Gambling Bayview 200 600
3 Drugs CENTRAL 300 900
4 Theft CENTRAL 300 900
5 Gambling CENTRAL 300 900
In [46]:
df['Average'] = df['count']/df['total count']
df
Out[46]:
Category PdDistrict count total count Average
0 Drugs Bayview 200 600 0.333333
1 Theft Bayview 200 600 0.333333
2 Gambling Bayview 200 600 0.333333
3 Drugs CENTRAL 300 900 0.333333
4 Theft CENTRAL 300 900 0.333333
5 Gambling CENTRAL 300 900 0.333333
https://stackoverflow.com/questions/35555474
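As a self-contained sketch of the approach above, the following rebuilds the question's sample data and computes the per-district share in two ways: the map lookup from the answer, and a variant using groupby(...).transform('sum') that does not need the second DataFrame. The column names count and total count follow the answer's output, and Average_transform is a hypothetical column name added here only for comparison.

```python
import pandas as pd

# Sample data mirroring the question's df5 (the size column is named "count" here).
df5 = pd.DataFrame({
    "Category": ["Drugs", "Theft", "Gambling", "Drugs", "Theft", "Gambling"],
    "PdDistrict": ["Bayview"] * 3 + ["CENTRAL"] * 3,
    "count": [200, 200, 200, 300, 300, 300],
})

# Per-district totals, equivalent to the question's df6.
df6 = df5.groupby("PdDistrict")["count"].sum().reset_index(name="total count")

# Approach from the answer: look up each row's district total with map.
totals = df6.set_index("PdDistrict")["total count"]
df5["Average"] = df5["count"] / df5["PdDistrict"].map(totals)

# Variant without df6: transform broadcasts each group's sum back onto its rows.
# "Average_transform" is an illustrative name, not part of the original answer.
group_totals = df5.groupby("PdDistrict")["count"].transform("sum")
df5["Average_transform"] = df5["count"] / group_totals

print(df5)
```

Both columns should agree row by row. The transform variant may be handy when the totals table is not needed for anything else.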