I have a DataFrame (similar to the one below) that I'm trying to process.
| user | weight | height | timestamp |
| ------ | ----- |------ |---------------------------------|
| user 1 | NA | 180 | 2017-11-09 17:21:49.025259+00:00|
| user 1 | 90 | NA | 2017-11-09 17:21:49.025259+00:00|
| user 1 | 89 | NA | 2017-11-10 17:21:49.025259+00:00|
| user 1 | 88 | NA | 2017-11-11 17:21:49.025259+00:00|
| user 2 | NA | 175 | 2017-11-09 17:21:49.025259+00:00|
| user 2 | 95 | NA | 2017-11-09 17:21:49.025259+00:00|
| user 3 | NA | 170 | 2017-12-09 17:21:49.025259+00:00|
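| user 4 | 70 | NA | 2017-08-09 17:21:49.025259+00:00|

My end goal is to calculate each user's BMI from the most recent weight and the most recent height, judged by timestamp. Some users have only a single record; many have multiple weights, multiple heights, or both. Since the data is in kg and cm, the BMI would be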
BMI = (weight (kg) x 10000) / (height (cm))^2
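As a quick sanity check of the formula, here is a minimal sketch in plain Python. The function name `bmi_from_kg_cm` is hypothetical and only used for illustration; the inputs are user 1's latest weight and height from the table above.

```python
def bmi_from_kg_cm(weight_kg: float, height_cm: float) -> float:
    """Return BMI from a weight in kilograms and a height in centimetres."""
    # Same as weight_kg / (height_cm / 100) ** 2:
    # the factor 10000 converts cm^2 to m^2.
    return weight_kg * 10000 / height_cm ** 2


# user 1: latest weight 88 kg, height 180 cm
print(round(bmi_from_kg_cm(88, 180), 4))  # 27.1605
```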
I'd like to get back something like the following. (A dictionary would also be fine.)
| user | BMI |
| ------ | ------------------|
| user 1 | 27.1604938272 |
| user 2 | 31.0204081633 |
| user 3 | NA |
| user 4 | NA |

Posted on 2018-08-01 18:43:45
Use:
import pandas as pd

# convert the timestamp column to datetimes
df['timestamp'] = pd.to_datetime(df['timestamp'])
#sorting df by datetimes
df = df.sort_values('timestamp')
#for each group get last non NaN value
df1 = df.groupby('user').last()
# apply the BMI formula (weight in kg, height in cm)
BMI = (df1['weight'] * 10000 / df1['height'] ** 2).reset_index(name='BMI')
print(BMI)
user BMI
0 user 1 27.160494
1 user 2 31.020408
2 user 3 NaN
3 user 4 NaN
# for a dictionary keyed by user
d = (df1['weight'] * 10000 / df1['height'] ** 2).to_dict()
print(d)
{'user 1': 27.160493827160494,
'user 2': 31.020408163265305,
'user 3': nan,
'user 4': nan}

https://stackoverflow.com/questions/51631063
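To try the steps above end to end, here is a self-contained sketch that rebuilds the question's sample table and applies the same sort, `groupby(...).last()`, and formula. It assumes the `NA` cells are real missing values (`NaN`) rather than the string `"NA"`. The variable names `latest` and `bmi` are mine, and the stable `mergesort` is an addition that keeps the original row order when timestamps tie.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question's table; NA is represented as NaN.
ts = "17:21:49.025259+00:00"
df = pd.DataFrame(
    {
        "user": ["user 1"] * 4 + ["user 2"] * 2 + ["user 3", "user 4"],
        "weight": [np.nan, 90, 89, 88, np.nan, 95, np.nan, 70],
        "height": [180, np.nan, np.nan, np.nan, 175, np.nan, 170, np.nan],
        "timestamp": [
            f"2017-11-09 {ts}",
            f"2017-11-09 {ts}",
            f"2017-11-10 {ts}",
            f"2017-11-11 {ts}",
            f"2017-11-09 {ts}",
            f"2017-11-09 {ts}",
            f"2017-12-09 {ts}",
            f"2017-08-09 {ts}",
        ],
    }
)

# Parse the timestamps and sort chronologically (stable sort for ties).
df["timestamp"] = pd.to_datetime(df["timestamp"])
df = df.sort_values("timestamp", kind="mergesort")

# last() returns, per column, the last non-null value in each group,
# so the latest weight and the latest height can come from different rows.
latest = df.groupby("user").last()

# BMI with weight in kg and height in cm; NaN where either value is missing.
bmi = latest["weight"] * 10000 / latest["height"] ** 2
print(bmi.to_dict())
```

Users with only a weight or only a height (user 3 and user 4 here) come out as `NaN`, since the division propagates the missing value.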