I have the following problem. My DataFrame looks like this (except that it has 100,000 entries):
col_1 col_2 col_3
green yellow red
yellow green purple
green yellow red
yellow brown green
red yellow purple
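red green yellow
However, what I want is for all the greens to be listed in one column, all the reds in another, all the yellows in another, and so on, so it should look like this: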
col_1  col_2   col_3  col_4   col_5
green  yellow  red
green  yellow         purple
green  yellow  red
green  yellow                 brown
       yellow  red    purple
green  yellow  red
How can I do this? Thanks in advance.
Posted on 2022-02-17 10:20:58
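Here is one approach: use get_dummies to convert the data into one-hot encoded columns, sum the indicator columns that share a colour, and use np.where to fill the DataFrame with the column names. Finally, fix the column names: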
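import numpy as np
import pandas as pd
# One-hot encode every cell; the columns are named like 'col_1_green'.
s = pd.get_dummies(df)
# Keep only the colour part of each column name.
s.columns = [c.split('_')[-1] for c in s.columns]
# Sum indicators per colour; groupby(level=0, axis=1) is deprecated in recent pandas, so transpose instead.
s = s.T.groupby(level=0).sum().T
# Put the colour name where the count is non-zero, then rename and sort the columns.
out = (s.apply(lambda c: np.where(c, c.name, ''))
        .rename(columns=dict(zip(s.columns, ['col5','col1','col4','col3','col2'])))
        .sort_index(axis=1))
The same code using method chaining: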
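out = (pd.get_dummies(df.set_axis(['0']*3, axis=1))
         # Strip the '0_' prefix so each column is named after its colour.
         .pipe(lambda x: x.set_axis([c.split('_')[1] for c in x], axis=1))
         # Sum indicators per colour (transpose instead of the deprecated axis=1 groupby).
         .T.groupby(level=0).sum().T
         .apply(lambda c: np.where(c, c.name, ''))
         .set_axis(['col5','col1','col4','col3','col2'], axis=1)
         .sort_index(axis=1)
      )
Output: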
   col1   col2    col3  col4    col5
0  green  yellow  red
1  green  yellow        purple
2  green  yellow  red
3  green  yellow                brown
4         yellow  red   purple
5  green  yellow  red
Posted on 2022-02-17 10:21:22
Here is an approach using pandas.get_dummies or str.get_dummies:
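# credit https://stackoverflow.com/a/71143503
# Join each row into one 'a|b|c' string, then one-hot encode it per colour.
df2 = df.apply('|'.join, axis=1).str.get_dummies()
# Multiplying a 0/1 indicator by the column name yields the name or ''.
out = df2*df2.columns
or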
df2 = (
    df.apply(lambda c: pd.get_dummies(c).stack())
      .max(axis=1)
      .unstack()
      .astype(int)
)
out = df2*df2.columns
Output:
   brown  green  purple  red  yellow
0         green          red  yellow
1         green  purple       yellow
2         green          red  yellow
3  brown  green               yellow
4                purple  red  yellow
5         green          red  yellow
Alternative output:
df2 = df.apply('|'.join, axis=1).str.get_dummies()
out = df2*df2.columns
# Rename the colour columns to col_1 ... col_5, keeping their alphabetical order.
out.columns = [f'col_{i}' for i,_ in enumerate(out, start=1)]
Output:
   col_1  col_2  col_3   col_4  col_5
0         green          red    yellow
1         green  purple         yellow
2         green          red    yellow
3  brown  green                 yellow
4                purple  red    yellow
5         green          red    yellow
https://stackoverflow.com/questions/71156108
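For anyone who wants to try this end to end, here is a minimal self-contained sketch that rebuilds the six sample rows from the question and produces the requested layout. Note that it swaps get_dummies for a plain per-colour membership test (`DataFrame.eq` plus `any`), which is a different technique from the answers above. The `order` list is a hypothetical choice made only to match the column order in the question's desired output; it does not come from either answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "col_1": ["green", "yellow", "green", "yellow", "red", "red"],
    "col_2": ["yellow", "green", "yellow", "brown", "yellow", "green"],
    "col_3": ["red", "purple", "red", "green", "purple", "yellow"],
})

# Assumption: the target column order, taken from the question's desired output.
order = ["green", "yellow", "red", "purple", "brown"]

# For each colour, mark the rows that contain it anywhere, then write the
# colour name into those rows and an empty string into the others.
out = pd.DataFrame({
    color: df.eq(color).any(axis=1).map({True: color, False: ""})
    for color in order
})

# Rename the colour columns to col_1 ... col_5.
out.columns = [f"col_{i}" for i in range(1, len(order) + 1)]
print(out)
```

This assumes, as in the sample data, that a colour appears at most once per row. If the set of colours is not known in advance, `pd.unique(df.to_numpy().ravel())` is one way to collect them, though the resulting order then follows first appearance rather than the layout in the question.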