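I want to compute, for each row of a DataFrame, its percentile within a given group. For example, consider a dataset of athletes from different sports.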
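import pandas as pd

df = pd.DataFrame({"name": ["Joe", "Bob", "Susan", "Kate", "Sam", "Shawn"],
                   "sport": ["hockey", "hockey", "hockey", "baseball", "baseball", "baseball"],
                   "points": [1, 2, 3, 1, 4, 9]})

I want to compare each athlete's scoring with that of athletes from the same sport. Comparing baseball players directly with hockey players would be unfair, so I want to see where each hockey player falls relative to the other hockey players. This is the desired output: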
pd.DataFrame({"name": ["Joe", "Bob", "Susan", "Kate", "Sam", "Shawn"],
              "sport": ["hockey", "hockey", "hockey", "baseball", "baseball", "baseball"],
              "points": [1, 2, 3, 1, 4, 9],
              "percentile": [0, .5, 1, 0, .5, 1]})

My real dataset has thousands of groups and hundreds of thousands of rows.
Answered 2022-05-26 20:36:18
# Rank points within each sport and express the rank as a fraction of the group size
df['percentile'] = df.groupby(['sport'])['points'].rank(pct=True)
print(df)

Output:
    name     sport  points  percentile
0    Joe    hockey       1    0.333333
1    Bob    hockey       2    0.666667
2  Susan    hockey       3    1.000000
3   Kate  baseball       1    0.333333
4    Sam  baseball       4    0.666667
5  Shawn  baseball       9    1.000000

Answered 2022-05-26 20:36:33
Answered 2022-05-26 20:42:17
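To get the expected output, use groupby.rank and rescale the result.

For groups of size 3:

# rank(pct=True) yields 1/3, 2/3, 1; subtracting 1/3 and multiplying by 3/2 maps these to 0, 0.5, 1
df['percentile'] = (df.groupby('sport')['points']
                      .rank(pct=True)
                      .sub(1/3).mul(3/2)
                   )

Generic: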
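# transform keeps the original row index, so the result aligns with df;
# with apply, recent pandas versions may prepend the group key to the index
df['percentile'] = (df.groupby('sport')['points']
                      .transform(lambda g: g.rank(pct=True)
                                            .sub(1/len(g))
                                            # a single-member group is scaled to 0
                                            .mul(len(g)/(len(g)-1) if len(g)>1 else 0))
                   )

Output: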
    name     sport  points  percentile
0    Joe    hockey       1         0.0
1    Bob    hockey       2         0.5
2  Susan    hockey       3         1.0
3   Kate  baseball       1         0.0
4    Sam  baseball       4         0.5
5  Shawn  baseball       9         1.0

https://stackoverflow.com/questions/72397643
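The rescaling in the generic version simplifies algebraically: with default ranking, rank(pct=True) is rank/n, so (rank/n - 1/n) * n/(n - 1) equals (rank - 1)/(n - 1). Since the question mentions thousands of groups, the following is a minimal sketch of that formula without a per-group lambda. The helper name group_percentile is hypothetical and not part of pandas, and whether it is actually faster on a given dataset is an assumption that should be measured.

```python
import pandas as pd


def group_percentile(df, group_col, value_col):
    """Hypothetical helper: rescale within-group ranks to the range [0, 1]."""
    grouped = df.groupby(group_col)[value_col]
    rank = grouped.rank()              # 1-based rank within each group; ties share the average rank
    size = grouped.transform("count")  # number of non-null values in each row's group
    # (rank - 1) / (size - 1) maps the lowest value to 0 and, when there are
    # no ties, the highest value to 1. A single-member group gives 0/0 (NaN),
    # which is replaced with 0 to match the generic version's behavior.
    return ((rank - 1) / (size - 1)).where(size > 1, 0.0)


df = pd.DataFrame({"name": ["Joe", "Bob", "Susan", "Kate", "Sam", "Shawn"],
                   "sport": ["hockey", "hockey", "hockey", "baseball", "baseball", "baseball"],
                   "points": [1, 2, 3, 1, 4, 9]})
df["percentile"] = group_percentile(df, "sport", "points")
print(df)
```

Both rank and transform("count") are built-in groupby operations, so no Python function is called once per group.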