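I have a dataset that I want to one-hot encode (using sklearn). One of the columns is age. Ideally I would like its categories to cover 0-100. However, my dataset only contains a scattered subset of ages, so when I run the sklearn one-hot encoder it only creates categories for the ages that actually appear in the samples.
The column looks like this: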
Age
55
8
26
40
45
...
25
36
28
50
35
Here is my sklearn snippet and its result:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(sparse=False)
enc.fit_transform(X_train)
enc.categories_
(print)
...(other encoded columns)
Age column:
array([ 0, 7, 8, 9, 10, 12, 13, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24,
25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41,
42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 55, 56, 57, 58, 60,
61, 62, 63, 64, 65, 67, 70, 75])
Is there a way to encode it as:
array([0,1,2,3,4,5,6,7,...,70,71,72,73,74,75])
hopefully with even more categories (up to 100)? Thanks in advance!
Posted on 2020-11-11 16:44:36
You can cast the column to a categorical type that lists every age you want before calling pd.get_dummies:
import pandas as pd
age = pd.Series([ 0, 7, 8, 9, 10, 12, 13, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 55, 56, 57, 58, 60, 61, 62, 63, 64, 65, 67, 70, 75])
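# Declare every age from 0 to 100 as a category so get_dummies emits a column for each
age = age.astype(pd.CategoricalDtype(categories=list(range(0, 101))))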
df = pd.get_dummies(age)
Output:
df.columns
# CategoricalIndex([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
...
91, 92, 93, 94, 95, 96, 97, 98, 99, 100],
categories=[0, 1, 2, 3, 4, 5, 6, 7, ...], ordered=False, dtype='category', length=101)
https://stackoverflow.com/questions/64783033
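If you would rather stay within sklearn, as the question asks, one possible approach is to pass the full list of ages through the `categories` parameter of `OneHotEncoder` instead of letting it infer them from the data. The sketch below is a minimal illustration, not the answer's original method; the `ages` array is a hypothetical stand-in for the Age column of `X_train`.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample standing in for the Age column of X_train (assumption).
ages = np.array([55, 8, 26, 40, 45, 25, 36, 28, 50, 35]).reshape(-1, 1)

# `categories` takes one list per input column; here a single column covering 0..100.
enc = OneHotEncoder(categories=[list(range(101))])

# The default output is a sparse matrix; convert it to a dense array for inspection.
encoded = enc.fit_transform(ages).toarray()

print(encoded.shape)  # (10, 101): one column per age from 0 to 100
```

Note that `categories` expects one list per column, so if `X_train` contains other columns to encode, you would need to supply a category list for each of them as well.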