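I am looking for a more elegant way to get the list of unique winners (max votes) for each pandas group.
I have downloaded the California election results and load the data I want to use in a function called create_df.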
df = create_df()
df.head()
  candidate   county  district    office party  precinct  votes
0  JOHN COX  ALAMEDA       NaN  GOVERNOR   REP    200100   49.0
1  JOHN COX  ALAMEDA       NaN  GOVERNOR   REP    200200   55.0
2  JOHN COX  ALAMEDA       NaN  GOVERNOR   REP    200300   26.0
3  JOHN COX  ALAMEDA       NaN  GOVERNOR   REP    200600   28.0
4  JOHN COX  ALAMEDA       NaN  GOVERNOR   REP    200700   35.0
My current implementation looks like this:
county_votes = df.query("office == 'GOVERNOR'")\
    .groupby(["county", "party"], as_index=False)\
    .votes.sum()
winners = county_votes.reindex(
    county_votes.groupby("county").votes.idxmax().values
)[["county", "party"]]
winners.head()
      county party
0    ALAMEDA   DEM
2     ALPINE   DEM
5     AMADOR   REP
7      BUTTE   REP
9  CALAVERAS   REP
Is there a better way to do this?
Answered 2019-08-02 02:34:50
I found another approach, and it also appears to be faster.
%%timeit
county_votes = df.query("office == 'GOVERNOR'")\
    .groupby(["county", "party"], as_index=False)\
    .votes.sum()
county_votes.reindex(
    county_votes.groupby("county").votes.idxmax().values
)[["county", "party"]].head()
42.4 ms ± 97 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
df.query("office == 'GOVERNOR'")\
    .groupby(["county", "party"], as_index=False)\
    .votes.sum()\
    .sort_values(['county', 'votes'], ascending=[True, False])\
    .drop_duplicates(subset="county").head()
31.6 ms ± 60.9 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
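To check that the two approaches pick the same winners, here is a minimal sketch on a small made-up frame. The toy data is hypothetical and only stands in for whatever create_df() returns; it contains no tied vote totals, since the two approaches are not guaranteed to break ties the same way.

```python
import pandas as pd

# Hypothetical toy data standing in for create_df(); the numbers are made up.
df = pd.DataFrame({
    "county": ["ALAMEDA", "ALAMEDA", "ALAMEDA", "ALPINE", "ALPINE", "AMADOR", "AMADOR"],
    "office": ["GOVERNOR"] * 6 + ["SENATOR"],
    "party": ["REP", "DEM", "DEM", "REP", "DEM", "REP", "DEM"],
    "votes": [49.0, 55.0, 26.0, 10.0, 12.0, 40.0, 99.0],
})

# Total votes per (county, party) for the governor race only.
county_votes = (
    df.query("office == 'GOVERNOR'")
    .groupby(["county", "party"], as_index=False)["votes"]
    .sum()
)

# Approach from the question: look up the row label of the max per county.
winners_idxmax = county_votes.reindex(
    county_votes.groupby("county")["votes"].idxmax().values
)[["county", "party"]]

# Approach from the answer: sort so the top party comes first in each county,
# then keep only the first row per county.
winners_sorted = (
    county_votes
    .sort_values(["county", "votes"], ascending=[True, False])
    .drop_duplicates(subset="county")[["county", "party"]]
)

# Both results as plain {county: party} dicts for easy comparison.
result_idxmax = dict(zip(winners_idxmax["county"], winners_idxmax["party"]))
result_sorted = dict(zip(winners_sorted["county"], winners_sorted["party"]))
print(result_idxmax == result_sorted)
```

On real data the timings above suggest the sort-based version is somewhat faster, but that is worth re-measuring on your own frame.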
https://stackoverflow.com/questions/57314439