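To make meaningful comparisons across regions, I want to normalize the confirmed COVID-19 case counts of different countries by the date each outbreak started. For any territory, the date on which it reaches or exceeds 10 confirmed cases is treated as "day 0 of the outbreak".
Example dataframe: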
[in]
import pandas as pd
confirmed_cases = {'Date':['1/22/20', '1/23/20', '1/24/20', '1/25/20', '1/26/20'], 'Australia':[0, 0, 0, 30, 50], 'Albania':[0, 20, 25, 30, 50], 'Algeria':[25, 40, 50, 50, 70]}
df = pd.DataFrame(confirmed_cases)
df
[out]
Date Australia Albania Algeria
0 1/22/20 0 0 25
1 1/23/20 0 20 40
2 1/24/20 0 25 50
3 1/25/20 30 30 50
4 1/26/20 50 50 70
Expected result:
Day Since Outbreak Australia Albania Algeria
0 0 30 20 25
1 1 50 25 40
2 2 NaN 30 50
3 3 NaN 50 50
4 4 NaN NaN 70
Is there a simple way to do this with Python/pandas?
Posted on 2020-04-29 19:17:58
For each country, find the index of the first value that crosses the threshold (10), then shift that column up by that many rows.
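The idea can also be spelled out step by step. The following is only a sketch, not the answerer's code: the helper names `first_day_at_threshold` and `align_to_outbreak` and the extra `Quiet` column are made up for illustration. It uses `>=` so that a count of exactly 10 counts as day 0, as the question defines it, and it handles one case the question does not cover: a column that never reaches the threshold is returned as all NaN instead of being left unshifted.

```python
import pandas as pd

THRESHOLD = 10  # day-0 threshold taken from the question


def first_day_at_threshold(col, threshold=THRESHOLD):
    """Return the position of the first value >= threshold, or None if never reached."""
    mask = (col >= threshold).to_numpy()
    if not mask.any():
        return None
    # argmax on a boolean array returns the position of the first True
    return int(mask.argmax())


def align_to_outbreak(df, threshold=THRESHOLD):
    """Shift every column up so that row 0 is its first day at or above threshold."""
    aligned = {}
    for name, col in df.items():
        col = col.reset_index(drop=True)  # work with positions 0..n-1
        start = first_day_at_threshold(col, threshold)
        if start is None:
            # Assumption: a column that never reaches the threshold has no day 0
            aligned[name] = pd.Series(float("nan"), index=col.index)
        else:
            aligned[name] = col.shift(-start)
    out = pd.DataFrame(aligned)
    out.index.name = "Day Since Outbreak"
    return out


cases = pd.DataFrame({
    "Australia": [0, 0, 0, 30, 50],
    "Albania": [0, 20, 25, 30, 50],
    "Algeria": [25, 40, 50, 50, 70],
    "Quiet": [0, 1, 2, 3, 4],  # hypothetical column that never reaches 10
})
aligned = align_to_outbreak(cases)
print(aligned)
```

Without the explicit check, `idxmax` on an all-False mask returns the first index label, so such a column would stay in place and look as if its outbreak began on the first date in the data.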
# >= rather than > so that a count of exactly 10 also counts as day 0
df2 = df[['Australia', 'Albania', 'Algeria']].apply(lambda x: x.shift(-(x >= 10).idxmax()))
# df2
Australia Albania Algeria
0 30.0 20.0 25
1 50.0 25.0 40
2 NaN 30.0 50
3 NaN 50.0 50
4 NaN NaN 70
Reset the index to get the days-since column.
df2.reset_index().rename(columns={'index': 'Day Since Outbreak'})
Day Since Outbreak Australia Albania Algeria
0 0 30.0 20.0 25
1 1 50.0 25.0 40
2 2 NaN 30.0 50
3 3 NaN 50.0 50
4 4 NaN NaN 70
Posted on 2020-04-29 19:12:33
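Determine how many rows each column needs to be shifted from its initial run of values < 10, then shift the columns. cummin ensures that an intermittent value < 10 later in the column is not counted toward the shift.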
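To see what `cummin` contributes, here is a small sketch on a made-up series. The values are hypothetical and chosen so that the count dips below 10 again after first reaching it, as might happen after a data correction.

```python
import pandas as pd

# Hypothetical column: reaches 12, dips back to 8, then rises again
cases = pd.Series([0, 3, 12, 8, 15], name="Example")

below = cases.lt(10)      # True wherever the value is < 10, including the later dip
leading = below.cummin()  # stays True only until the first False

naive_shift = int(below.sum())      # counts every value < 10
leading_shift = int(leading.sum())  # counts only the initial run of values < 10

aligned = cases.shift(-leading_shift)
print(naive_shift, leading_shift)
print(aligned)
```

Here `naive_shift` is 3 and `leading_shift` is 2. Shifting by 3 would drop the real day 0 (the 12), while shifting by 2 keeps it in row 0.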
df = df.drop(columns='Date') # Not needed for the result
s = df.lt(10).cummin().sum() # Length of the initial run of values < 10, per column
for col, shift in s.items(): # Series.iteritems() was removed in pandas 2.0
    df[col] = df[col].shift(-shift)
df['Days Since'] = range(len(df)) # Duplicates the index
Australia Albania Algeria Days Since
0 30.0 20.0 25 0
1 50.0 25.0 40 1
2 NaN 30.0 50 2
3 NaN 50.0 50 3
4 NaN NaN 70 4
https://stackoverflow.com/questions/61509734