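I have a DataFrame df, shown below. I want to compute the mean of the last 3 non-NaN columns in each row. If a row has fewer than three non-missing columns, the mean should be missing.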
name  day1  day2  day3  day4  day5  day6  day7
A     1     1     nan   2     3     0     3
B     nan   nan   nan   nan   nan   nan   3
C     1     1     0     1     1     1     1
D     1     1     0     1     nan   1     4
The expected output is:
name  day1  day2  day3  day4  day5  day6  day7  expected
A     1     1     nan   2     3     0     3     2    <- 1/3*(day5 + day6 + day7)
B     nan   nan   nan   nan   nan   nan   3     nan  <- less than 3 non-missing
C     1     1     0     1     1     1     1     1    <- 1/3*(day5 + day6 + day7)
D     1     1     0     1     nan   1     4     2    <- 1/3*(day4 + day6 + day7)
I know how to compute the mean of the last three columns and how to count the non-missing observations in them:
df.iloc[:, -3:].mean(axis=1)   # mean of the last three columns
df.iloc[:, -3:].count(axis=1)  # number of non-NaN values in the last three columns
I also know how to set the mean to missing when a row has fewer than 3 non-missing observations, using df.iloc[:, 1:].count(axis=1) < 3.
But I am struggling to find a way to compute the mean of the last three non-missing columns. Can anyone show me how to do this?
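To make the problem reproducible, the sample data can be built as in the sketch below. It assumes that name is an ordinary column rather than the index, and the variable names days, n_valid and last3_mean are introduced here purely for illustration.

```python
import numpy as np
import pandas as pd

# Sample data from the question; `name` is assumed to be an ordinary column
df = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "day1": [1, np.nan, 1, 1],
    "day2": [1, np.nan, 1, 1],
    "day3": [np.nan, np.nan, 0, 0],
    "day4": [2, np.nan, 1, 1],
    "day5": [3, np.nan, 1, np.nan],
    "day6": [0, np.nan, 1, 1],
    "day7": [3, 3, 1, 4],
})

# Work on the numeric day columns only
days = df.drop(columns="name")

# Number of non-missing values per row
n_valid = days.count(axis=1)

# Plain mean of the last three columns (NaN values are skipped by default)
last3_mean = days.iloc[:, -3:].mean(axis=1)
```

For row D the plain mean of day5 to day7 comes out as 2.5 (the NaN in day5 is skipped), not the desired 2, which is exactly the gap the question is about.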
Answered 2018-12-27 04:58:44
A vectorized approach using a justify helper function:
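justify is not part of NumPy or pandas, so it has to be defined before the snippet in this answer can run. The following is a minimal sketch of what such a helper could look like, assuming the signature implied by the call in this answer (invalid_val, axis, side). It is an illustration, not necessarily the answerer's exact code, and the sample array is introduced here only to exercise it.

```python
import numpy as np

def justify(a, invalid_val=np.nan, axis=1, side="right"):
    """Push the valid entries of a 2-D array to one side along `axis`."""
    a = np.asarray(a, dtype=float)
    # True where an entry is valid (not NaN, or not equal to invalid_val)
    if isinstance(invalid_val, float) and np.isnan(invalid_val):
        mask = ~np.isnan(a)
    else:
        mask = a != invalid_val
    # Sorting booleans moves the True entries to the end (right / bottom)
    justified_mask = np.sort(mask, axis=axis)
    if side in ("left", "up"):
        justified_mask = np.flip(justified_mask, axis=axis)
    # Start from an array full of invalid_val and fill in the valid entries
    out = np.full(a.shape, invalid_val, dtype=float)
    if axis == 1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out

# The day columns from the question as a plain float array
sample = np.array([
    [1, 1, np.nan, 2, 3, 0, 3],
    [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 3],
    [1, 1, 0, 1, 1, 1, 1],
    [1, 1, 0, 1, np.nan, 1, 4],
], dtype=float)

right = justify(sample, invalid_val=np.nan, axis=1, side="right")
# Any NaN left in the last three columns means fewer than 3 valid values,
# so np.mean propagates NaN for that row
avg = right[:, -3:].mean(axis=1)
```

Because the valid values are pushed to the right, the last N columns hold the last N non-NaN values of each row, and rows with fewer than N valid values keep a NaN there, which turns their mean into NaN without a separate check.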
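import numpy as np
N = 3  # number of trailing non-NaN entries to average
# assumes df holds only the numeric day columns (e.g. name is the index)
avg = np.mean(justify(df.values, invalid_val=np.nan, axis=1, side='right')[:, -N:], axis=1)
df['expected'] = avg
Answered 2018-12-27 04:56:07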
You can use pd.DataFrame.apply with a custom function. This is only partially vectorized.
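import numpy as np

def mean_calculator(row):
    # Boolean mask of the non-missing entries in this row
    non_nulls = row.notnull()
    # Fewer than 3 valid values: the mean is undefined
    if non_nulls.sum() < 3:
        return np.nan
    # Mean of the last 3 non-missing values
    return row[non_nulls].values[-3:].mean()

# Skip the first column (name) and apply the function row by row
df['expected'] = df.iloc[:, 1:].apply(mean_calculator, axis=1)
print(df)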
  name  day1  day2  day3  day4  day5  day6  day7  expected
0    A   1.0   1.0   NaN   2.0   3.0   0.0     3       2.0
1    B   NaN   NaN   NaN   NaN   NaN   NaN     3       NaN
2    C   1.0   1.0   0.0   1.0   1.0   1.0     1       1.0
3    D   1.0   1.0   0.0   1.0   NaN   1.0     4       2.0
Answered 2018-12-27 04:58:07
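You can start by computing the expected column with the following function:
expected = df.apply(lambda x: x[~x.isnull()][-3:].mean(), axis=1)
Then insert these values only for the rows that have at least 3 valid values:
m = df.notnull().sum(axis=1) < 3  # True for rows with fewer than 3 valid values
df.loc[~m, 'expected'] = expected.mask(m)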
      day1  day2  day3  day4  day5  day6  day7  expected
name
A      1.0   1.0   NaN   2.0   3.0   0.0     3       2.0
B      NaN   NaN   NaN   NaN   NaN   NaN     3       NaN
C      1.0   1.0   0.0   1.0   1.0   1.0     1       1.0
D      1.0   1.0   0.0   1.0   NaN   1.0     4       2.0
https://stackoverflow.com/questions/53936985