Q: Arrange data based on the existence of data in columns in multi-level data

Stack Overflow user

Asked on 2020-05-04 08:54:40

2 answers · 87 views · 0 followers · 1 vote

I have a pandas DataFrame with multi-level columns, indexed by `appid`, as shown below:

year   |2016    2017    2018    2019    2016  2017   2018   2019
       |ttl     ttl     ttl     ttl     tta   tta    tta    tta
-----------------------------------------------------------------
appid  |
75787  |NaN     227.0   470.0   426.0   NaN   25.0   23.0   21.0
146306 |NaN     858.0   226.0   NaN     NaN   14.0   35.0   NaN
159479 |NaN     NaN     0.0     NaN     NaN   NaN    3.5    NaN
163618 |NaN     0.0     650.0   100.0   NaN   12.0   14.6   123.0
215968 |23.0    0.0     NaN     NaN     45.0  2.0    NaN    NaN

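I want to transform this DataFrame so that each row is aligned on its latest year that has data, with the columns relabelled relative to that year. For example: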

Year   |P2Y      PY      LY    P2Y    PY    LY
       |ttl     ttl     ttl    tta    tta   tta

----------------------------------------------------
appid  |
75787  |227.0   470.0   426.0  25.0   23.0   21.0
146306 |NaN     858.0   226.0  NaN    14.0   35.0
159479 |NaN     NaN     0.0    NaN    NaN    3.5
163618 |0.0     650.0   100.0  12.0   14.6   123.0
215968 |NaN     23.0    0.0    NaN    45.0   2.0

python

pandas

sorting

multilevel-analysis

2 Answers

Stack Overflow user

Accepted answer

Posted on 2020-05-04 09:37:28

You can try transposing the dataset and using `shift`.

df.T \
  .apply(lambda x: x.shift(len(x) - x.index.get_loc(x.last_valid_index()) - 1)) \
  .T \
  .dropna(how='all', axis='columns')

Explanation

1. Transpose the dataset using `.T`.
2. Shift each column down by the number of `NaN` values at its end:
   1. Use [`apply`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html) on each column.
   2. Find the position of the last non-`NaN` value using [`last_valid_index`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.last_valid_index.html) with [`get_loc`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.get_loc.html). For more detail on this step, see [Locate first and last non NaN values in a Pandas DataFrame](https://stackoverflow.com/questions/22403469/locate-first-and-last-non-nan-values-in-a-pandas-dataframe).
   3. Compute the number of rows to shift from the position found in step 2.2 and `len(x)`. Subtract 1 as well, because the position from step 2.2 is zero-based.
   4. Use [`shift`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.shift.html) to shift the column.
3. Transpose the dataset back using `.T`.
4. Remove the columns that are entirely `NaN` using `dropna` with `how='all', axis='columns'`.
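Step 2 may be easier to follow on a single column. The sketch below uses a small hand-made Series (the values mirror the `ttl` entries of appid 146306; the variable names are only illustrative) to show how the shift amount is derived:

```python
import numpy as np
import pandas as pd

# One column of yearly values with a single trailing NaN.
s = pd.Series([np.nan, 858.0, 226.0, np.nan], index=[2016, 2017, 2018, 2019])

# Zero-based position of the last non-NaN value (label 2018 -> position 2).
last_pos = s.index.get_loc(s.last_valid_index())

# Number of NaN values after that position: 4 - 2 - 1 = 1.
n_trailing = len(s) - last_pos - 1

# Shift down so the last valid value lands on the final row (2019).
shifted = s.shift(n_trailing)

print(last_pos, n_trailing)   # 2 1
print(shifted.tolist())       # [nan, nan, 858.0, 226.0]
```

The `- 1` is what turns a zero-based position into a count of the rows that follow it.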

Code + illustration

# Step 1
print(df.T)
#             75787   146306  159479  163618  215968
# year appid
# 2016 ttl       NaN     NaN     NaN     NaN    23.0
# 2017 ttl     227.0   858.0     NaN     0.0     0.0
# 2018 ttl     470.0   226.0     0.0   650.0     NaN
# 2019 ttl     426.0     NaN     NaN   100.0     NaN
# 2016 tta       NaN     NaN     NaN     NaN    45.0
# 2017 tta      25.0    14.0     NaN    12.0     2.0
# 2018 tta      23.0    35.0     3.5    14.6     NaN
# 2019 tta      21.0     NaN     NaN   123.0     NaN


# Step 2.2.1
print(df.T.apply(lambda x: x.last_valid_index()))
# 75787     (2019, tta)
# 146306    (2018, tta)
# 159479    (2018, tta)
# 163618    (2019, tta)
# 215968    (2017, tta)
# dtype: object


# Step 2.2.2
print(df.T.apply(lambda x: x.index.get_loc(x.last_valid_index())))
# 75787     7
# 146306    6
# 159479    6
# 163618    7
# 215968    5
# dtype: int64


# Step 2
print(df.T.apply(lambda x: x.shift(
    len(x) - x.index.get_loc(x.last_valid_index()) - 1)))
#             75787   146306  159479  163618  215968
# year appid
# 2016 ttl       NaN     NaN     NaN     NaN     NaN
# 2017 ttl     227.0     NaN     NaN     0.0     NaN
# 2018 ttl     470.0   858.0     NaN   650.0    23.0
# 2019 ttl     426.0   226.0     0.0   100.0     0.0
# 2016 tta       NaN     NaN     NaN     NaN     NaN
# 2017 tta      25.0     NaN     NaN    12.0     NaN
# 2018 tta      23.0    14.0     NaN    14.6    45.0
# 2019 tta      21.0    35.0     3.5   123.0     2.0


# Step 3
print(df.T.apply(lambda x: x.shift(
    len(x) - x.index.get_loc(x.last_valid_index()) - 1)).T)
# year   2016   2017   2018   2019 2016  2017  2018   2019
# appid   ttl    ttl    ttl    ttl  tta   tta   tta    tta
# 75787   NaN  227.0  470.0  426.0  NaN  25.0  23.0   21.0
# 146306  NaN    NaN  858.0  226.0  NaN   NaN  14.0   35.0
# 159479  NaN    NaN    NaN    0.0  NaN   NaN   NaN    3.5
# 163618  NaN    0.0  650.0  100.0  NaN  12.0  14.6  123.0
# 215968  NaN    NaN   23.0    0.0  NaN   NaN  45.0    2.0



# Step 4
print(df.T.apply(lambda x: x.shift(
    len(x) - x.index.get_loc(x.last_valid_index()) - 1)).T
    .dropna(how='all', axis='columns'))

# year     2017   2018   2019  2017  2018   2019
# appid     ttl    ttl    ttl   tta   tta    tta
# 75787   227.0  470.0  426.0  25.0  23.0   21.0
# 146306    NaN  858.0  226.0   NaN  14.0   35.0
# 159479    NaN    NaN    0.0   NaN   NaN    3.5
# 163618    0.0  650.0  100.0  12.0  14.6  123.0
# 215968    NaN   23.0    0.0   NaN  45.0    2.0
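For anyone who wants to run the whole thing, here is a self-contained sketch that rebuilds the sample data from the question, applies the same transpose/shift/dropna chain, and then relabels the remaining years as `P2Y`, `PY`, `LY` as the question asked. The helper name `trailing_nans`, the level name `metric`, and the relabelling step are my own additions, not part of the answer above. Note that the shift is computed over the whole transposed column, so this relies on `ttl` and `tta` having the same trailing `NaN` pattern for each appid, which holds for the sample data.

```python
import numpy as np
import pandas as pd

nan = np.nan
years = [2016, 2017, 2018, 2019]

# Columns ordered as in the question: all ttl years first, then all tta years.
columns = pd.MultiIndex.from_tuples(
    [(year, metric) for metric in ("ttl", "tta") for year in years],
    names=["year", "metric"],
)
data = [
    [nan, 227.0, 470.0, 426.0, nan, 25.0, 23.0, 21.0],
    [nan, 858.0, 226.0, nan, nan, 14.0, 35.0, nan],
    [nan, nan, 0.0, nan, nan, nan, 3.5, nan],
    [nan, 0.0, 650.0, 100.0, nan, 12.0, 14.6, 123.0],
    [23.0, 0.0, nan, nan, 45.0, 2.0, nan, nan],
]
index = pd.Index([75787, 146306, 159479, 163618, 215968], name="appid")
df = pd.DataFrame(data, index=index, columns=columns)


def trailing_nans(col):
    """Count the NaN values after the last valid entry of a Series."""
    return len(col) - col.index.get_loc(col.last_valid_index()) - 1


aligned = (
    df.T
    .apply(lambda col: col.shift(trailing_nans(col)))
    .T
    .dropna(how="all", axis="columns")
)

# Relabel the surviving years (oldest to newest) as P2Y, PY, LY.
remaining_years = sorted(aligned.columns.get_level_values(0).unique())
labels = dict(zip(remaining_years, ["P2Y", "PY", "LY"]))
aligned = aligned.rename(columns=labels, level=0)

print(aligned)
```

If the two metrics could end in different years for the same appid, the shift would have to be computed per metric instead of over the whole column.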

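1 vote

Stack Overflow user

Posted on 2020-05-04 09:49:00

You can first use `DataFrame.stack` so that the years become the columns, then apply the `justify` function, keep the last 3 columns, build a new DataFrame, and reshape it with `DataFrame.unstack`, using `DataFrame.reindex` if necessary to change the order of the column names: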

import numpy as np
import pandas as pd

# `justify` is defined at the end of this answer
df1 = df.stack()

arr = justify(df1.to_numpy(),invalid_val=np.nan, side='right')[:, -3:]
print (arr)
[[ 25.   23.   21. ]
 [227.  470.  426. ]
 [  nan  14.   35. ]
 [  nan 858.  226. ]
 [  nan   nan   3.5]
 [  nan   nan   0. ]
 [ 12.   14.6 123. ]
 [  0.  650.  100. ]
 [  nan  45.    2. ]
 [  nan  23.    0. ]]


mux = pd.MultiIndex.from_product([df.columns.levels[1], ['P2Y','PY','LY']])
df2 = (pd.DataFrame(arr, index=df1.index, columns=['P2Y','PY','LY'])
         .unstack()
         .swaplevel(1,0, axis=1)
         .reindex(mux, axis=1))
print (df2)
         tta                 ttl              
         P2Y    PY     LY    P2Y     PY     LY
75787   25.0  23.0   21.0  227.0  470.0  426.0
146306   NaN  14.0   35.0    NaN  858.0  226.0
159479   NaN   NaN    3.5    NaN    NaN    0.0
163618  12.0  14.6  123.0    0.0  650.0  100.0
215968   NaN  45.0    2.0    NaN   23.0    0.0

Function:

#https://stackoverflow.com/a/44559180/2901002
def justify(a, invalid_val=0, axis=1, side='left'):
    """
    Justifies a 2D array

    Parameters
    ----------
    a : ndarray
        Input array to be justified
    invalid_val : scalar
        Value treated as missing and used to fill the remaining positions
    axis : int
        Axis along which justification is to be made
    side : str
        Direction of justification. It could be 'left', 'right', 'up', 'down'
        It should be 'left' or 'right' for axis=1 and 'up' or 'down' for axis=0.

    """

    if invalid_val is np.nan:
        mask = ~np.isnan(a)
    else:
        mask = a!=invalid_val
    justified_mask = np.sort(mask,axis=axis)
    if (side=='up') | (side=='left'):
        justified_mask = np.flip(justified_mask,axis=axis)
    out = np.full(a.shape, invalid_val) 
    if axis==1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out
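To see why this works, here is a minimal sketch of the core trick for the `side='right'`, `axis=1` case only, on a small made-up array. Sorting a boolean mask along a row pushes the `True` entries to the right, and boolean indexing then copies the valid values into those slots in their original order:

```python
import numpy as np

a = np.array([
    [np.nan, 1.0, 2.0, np.nan],
    [3.0, np.nan, np.nan, np.nan],
])

mask = ~np.isnan(a)                     # True where a value is present
justified_mask = np.sort(mask, axis=1)  # False sorts before True, so True moves right

out = np.full(a.shape, np.nan)
out[justified_mask] = a[mask]           # same number of True per row in both masks

print(out)
# rows become [nan, nan, 1, 2] and [nan, nan, nan, 3]
```

Because each row has the same number of `True` entries in `mask` and `justified_mask`, the values never cross rows and keep their left-to-right order.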

1 vote

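The original content of this page is provided by Stack Overflow. Translation support is provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/61588450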
