I have a pandas DataFrame with multi-level columns, indexed by appid, as shown below:
year |2016 2017 2018 2019 2016 2017 2018 2019
|ttl ttl ttl ttl tta tta tta tta
-----------------------------------------------------------------
appid |
75787 |NaN 227.0 470.0 426.0 NaN 25.0 23.0 21.0
146306 |NaN 858.0 226.0 NaN NaN 14.0 35.0 NaN
159479 |NaN NaN 0.0 NaN NaN NaN 3.5 NaN
163618 |NaN 0.0 650.0 100.0 NaN 12.0 14.6 123.0
215968 |23.0 0.0 NaN NaN 45.0 2.0 NaN NaN
I want to transform this df so that each row is aligned on its most recent year that has an entry. For example:
Year |P2Y PY LY P2Y PY LY
|ttl ttl ttl tta tta tta
----------------------------------------------------
appid |
75787 |227.0 470.0 426.0 25.0 23.0 21.0
146306 |NaN 858.0 226.0 NaN 14.0 35.0
159479 |NaN NaN 0.0 NaN NaN 3.5
163618 |0.0 650.0 100.0 12.0 14.6 123.0
215968 |NaN 23.0 0.0 NaN 45.0 2.0
Posted on 2020-05-04 09:37:28
You can transpose the DataFrame and use shift:
df.T \
.apply(lambda x: x.shift(len(x) - x.index.get_loc(x.last_valid_index()) - 1)) \
.T \
.dropna(how='all', axis='columns')
Explanation
1. Transpose the DataFrame with `.T`.
2. Shift each column down by the number of `NaN` values at its end:
   1. Use [`apply`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html) on each column.
   2. Find the position of the last non-`NaN` value using [`last_valid_index`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.last_valid_index.html) with [`get_loc`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.get_loc.html). For more detail on this step, see [Locate first and last non NaN values in a Pandas DataFrame](https://stackoverflow.com/questions/22403469/locate-first-and-last-non-nan-values-in-a-pandas-dataframe).
   3. Compute the number of rows to shift from the position found in step 2.2 and `len(x)`. Subtract 1 because the position from step 2.2 is zero-based.
   4. Use [`shift`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.shift.html) to shift the column.
3. Transpose back with `.T`.
4. Remove the columns that are entirely `NaN` using `dropna` with `how='all', axis='columns'`.
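The steps above can be put together as a self-contained sketch. The frame is rebuilt from the question's sample data; the helper name `shift_to_latest` and the final relabelling to `P2Y`/`PY`/`LY` are additions for illustration and are not part of the answer's code. Note that the shift runs over the whole transposed column, so this relies on the `ttl` and `tta` blocks having the same number of trailing `NaN` values for a given appid, which holds for the sample.

```python
import numpy as np
import pandas as pd

nan = np.nan

# Rebuild the question's frame: columns are (year, measure) pairs.
columns = pd.MultiIndex.from_tuples(
    [(year, measure) for measure in ["ttl", "tta"] for year in range(2016, 2020)]
)
df = pd.DataFrame(
    [
        [nan, 227.0, 470.0, 426.0, nan, 25.0, 23.0, 21.0],
        [nan, 858.0, 226.0, nan, nan, 14.0, 35.0, nan],
        [nan, nan, 0.0, nan, nan, nan, 3.5, nan],
        [nan, 0.0, 650.0, 100.0, nan, 12.0, 14.6, 123.0],
        [23.0, 0.0, nan, nan, 45.0, 2.0, nan, nan],
    ],
    index=pd.Index([75787, 146306, 159479, 163618, 215968], name="appid"),
    columns=columns,
)


def shift_to_latest(col):
    """Shift one transposed column down by its number of trailing NaNs."""
    # Zero-based position of the last non-NaN entry.
    last_pos = col.index.get_loc(col.last_valid_index())
    return col.shift(len(col) - last_pos - 1)


result = df.T.apply(shift_to_latest).T.dropna(how="all", axis="columns")

# Assumption: exactly three year columns survive, so they can be relabelled
# oldest to newest as P2Y, PY, LY (the labels used in the question).
years = sorted(result.columns.get_level_values(0).unique())
result = result.rename(columns=dict(zip(years, ["P2Y", "PY", "LY"])), level=0)
print(result)
```

If more or fewer than three year columns remain after `dropna`, the relabelling step would need a different set of labels.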
Code + illustration
# Step 1
print(df.T)
# 75787 146306 159479 163618 215968
# year appid
# 2016 ttl NaN NaN NaN NaN 23.0
# 2017 ttl 227.0 858.0 NaN 0.0 0.0
# 2018 ttl 470.0 226.0 0.0 650.0 NaN
# 2019 ttl 426.0 NaN NaN 100.0 NaN
# 2016 tta NaN NaN NaN NaN 45.0
# 2017 tta 25.0 14.0 NaN 12.0 2.0
# 2018 tta 23.0 35.0 3.5 14.6 NaN
# 2019 tta 21.0 NaN NaN 123.0 NaN
# Step 2.2.1
print(df.T.apply(lambda x: x.last_valid_index()))
# 75787 (2019, tta)
# 146306 (2018, tta)
# 159479 (2018, tta)
# 163618 (2019, tta)
# 215968 (2017, tta)
# dtype: object
# Step 2.2.2
print(df.T.apply(lambda x: x.index.get_loc(x.last_valid_index())))
# 75787 7
# 146306 6
# 159479 6
# 163618 7
# 215968 5
# dtype: int64
# Step 2
print(df.T.apply(lambda x: x.shift(
len(x) - x.index.get_loc(x.last_valid_index()) - 1)))
# 75787 146306 159479 163618 215968
# year appid
# 2016 ttl NaN NaN NaN NaN NaN
# 2017 ttl 227.0 NaN NaN 0.0 NaN
# 2018 ttl 470.0 858.0 NaN 650.0 23.0
# 2019 ttl 426.0 226.0 0.0 100.0 0.0
# 2016 tta NaN NaN NaN NaN NaN
# 2017 tta 25.0 NaN NaN 12.0 NaN
# 2018 tta 23.0 14.0 NaN 14.6 45.0
# 2019 tta 21.0 35.0 3.5 123.0 2.0
# Step 3
print(df.T.apply(lambda x: x.shift(
len(x) - x.index.get_loc(x.last_valid_index()) - 1)).T)
# year 2016 2017 2018 2019 2016 2017 2018 2019
# appid ttl ttl ttl ttl tta tta tta tta
# 75787 NaN 227.0 470.0 426.0 NaN 25.0 23.0 21.0
# 146306 NaN NaN 858.0 226.0 NaN NaN 14.0 35.0
# 159479 NaN NaN NaN 0.0 NaN NaN NaN 3.5
# 163618 NaN 0.0 650.0 100.0 NaN 12.0 14.6 123.0
# 215968 NaN NaN 23.0 0.0 NaN NaN 45.0 2.0
# Step 4
print(df.T.apply(lambda x: x.shift(
len(x) - x.index.get_loc(x.last_valid_index()) - 1)).T
.dropna(how='all', axis='columns'))
# year 2017 2018 2019 2017 2018 2019
# appid ttl ttl ttl tta tta tta
# 75787 227.0 470.0 426.0 25.0 23.0 21.0
# 146306 NaN 858.0 226.0 NaN 14.0 35.0
# 159479 NaN NaN 0.0 NaN NaN 3.5
# 163618 0.0 650.0 100.0 12.0 14.6 123.0
# 215968 NaN 23.0 0.0 NaN 45.0 2.0
Posted on 2020-05-04 09:49:00
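You can first use DataFrame.stack so that only the years remain as columns, then apply justify, keep the last 3 columns, build a new DataFrame, and, if necessary, reshape it with DataFrame.unstack and DataFrame.reindex to change the order of the column names: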
import numpy as np
import pandas as pd

# justify() is defined at the end of this answer
df1 = df.stack()
arr = justify(df1.to_numpy(), invalid_val=np.nan, side='right')[:, -3:]
print(arr)
[[ 25. 23. 21. ]
[227. 470. 426. ]
[ nan 14. 35. ]
[ nan 858. 226. ]
[ nan nan 3.5]
[ nan nan 0. ]
[ 12. 14.6 123. ]
[ 0. 650. 100. ]
[ nan 45. 2. ]
[ nan 23. 0. ]]
mux = pd.MultiIndex.from_product([df.columns.levels[1], ['P2Y','PY','LY']])
df2 = (pd.DataFrame(arr, index=df1.index, columns=['P2Y','PY','LY'])
.unstack()
.swaplevel(1,0, axis=1)
.reindex(mux, axis=1))
print (df2)
tta ttl
P2Y PY LY P2Y PY LY
75787 25.0 23.0 21.0 227.0 470.0 426.0
146306 NaN 14.0 35.0 NaN 858.0 226.0
159479 NaN NaN 3.5 NaN NaN 0.0
163618 12.0 14.6 123.0 0.0 650.0 100.0
215968 NaN 45.0 2.0 NaN 23.0 0.0
Function:
# https://stackoverflow.com/a/44559180/2901002
def justify(a, invalid_val=0, axis=1, side='left'):
    """
    Justifies a 2D array

    Parameters
    ----------
    a : ndarray
        Input array to be justified
    axis : int
        Axis along which justification is to be made
    side : str
        Direction of justification. It could be 'left', 'right', 'up', 'down'
        It should be 'left' or 'right' for axis=1 and 'up' or 'down' for axis=0.
    """
    if invalid_val is np.nan:
        mask = ~np.isnan(a)
    else:
        mask = a != invalid_val
    justified_mask = np.sort(mask, axis=axis)
    if (side == 'up') | (side == 'left'):
        justified_mask = np.flip(justified_mask, axis=axis)
    out = np.full(a.shape, invalid_val)
    if axis == 1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out
https://stackoverflow.com/questions/61588450
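To see what `justify(..., side='right')` does to each row, here is a stripped-down sketch of the same mask-sorting idea, restricted to `NaN` as the invalid value and to right-justification along rows. The helper name `right_justify_nan` is made up for this illustration, and the two input rows are the `ttl` rows for appid 146306 and 215968 from the question.

```python
import numpy as np


def right_justify_nan(a):
    """Push the non-NaN values of each row to the right, keeping their order."""
    mask = ~np.isnan(a)
    # Sorting booleans puts False (NaN slots) first and True (value slots) last.
    target = np.sort(mask, axis=1)
    out = np.full(a.shape, np.nan)
    # Boolean indexing walks both masks row by row, so each row's values
    # land in that row's right-most slots in their original order.
    out[target] = a[mask]
    return out


a = np.array([
    [np.nan, 858.0, 226.0, np.nan],  # appid 146306, ttl
    [23.0, 0.0, np.nan, np.nan],     # appid 215968, ttl
])
justified = right_justify_nan(a)[:, -3:]
print(justified)
```

Slicing with `[:, -3:]` afterwards keeps the three most recent entries per row, which is what the answer labels `P2Y`, `PY` and `LY`.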