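I am trying to extract each skill from job_skills into its own attribute (column) and encode it as 0 or 1. How can I do this?
Note: I tried to build a DataFrame by hand (code below), but filling it in manually is not worthwhile; I need a method that extracts the lists from the column. I need to apply ML algorithms to this data.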
import pandas as pd

data = [['a', ['Python', 'UI',' Information Technology (IT)','Software Development','GTK','English',' Software Engineering']],
        ['b', ['Python', 'Relational Databases',' Celery',' VMWare','Django','Continous Integration',' Test Driven Development',' HTTP']],
        ['c', ['Flask', 'Python',' Celery',' Software Development',' Computer Science','Information Technology (IT)']],
        ['c', ['Flask', 'Python',' Celery',' Software Development',' Computer Science','Information Technology (IT)']]
]
df1 = pd.DataFrame(data, columns=['col1', 'col2'])
pd.get_dummies(df1['col2'].explode()).groupby(level=0).sum()

Posted on 2022-11-06 13:41:41
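I can't think of anything in pandas that does this directly out of the box. If I understand correctly, you want a one-hot variable for every skill of every person (row). Is there a unique identifier for each job? If not, you need one. In the example below, I use the row index.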
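skills = []
rows = []  # separate name so the loop variable `row` does not overwrite the list
for index, row in df.iterrows():
    for item in row['job_skills']:  # assumes each cell holds a Python list
        rows.append(index)  # use the row index as the job identifier
        skills.append(item)
skills_df = pd.DataFrame({'row': rows, 'skills': skills})

Once you have skills_df, you can follow the same logic as here: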
How can I one hot encode in Python?
If you need the data on the original df, join/merge it back afterwards.
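As a rough sketch of that last step (not the answerer's code): assuming a long table with the columns row and skills like the one built above, pd.crosstab can turn it into 0/1 indicator columns, and DataFrame.join attaches them to the original frame by index. The small df and the name long_df are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the original frame; job_skills holds real Python lists.
df = pd.DataFrame({
    "job_title": ["a", "b", "c"],
    "job_skills": [["Python", " UI"], ["Python", "Django"], ["Flask", "Python"]],
})

# Long format: one (row, skill) pair per line, with stray whitespace stripped.
long_df = (
    df["job_skills"].explode().str.strip()
    .rename("skills").rename_axis("row").reset_index()
)

# Count each skill per row, then clip to 0/1 in case a skill is listed twice.
one_hot = pd.crosstab(long_df["row"], long_df["skills"]).clip(upper=1)

# Join the indicator columns back onto the original frame by index.
result = df.join(one_hot)
print(result)
```

Clipping to 1 is a design choice for a strict 0/1 encoding; drop it if per-row counts are wanted instead.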
Posted on 2022-11-06 13:45:07
Here is a proposition using standard pandas functions:
def create_dummies(df, col):
    # Add one 0/1 indicator column per distinct value of `col`
    dummies = pd.get_dummies(df[col])
    df[dummies.columns] = dummies
    return df

out = (
    df.assign(skill=df["job_skills"].str.strip("[]")  # job_skills is a string here
                                    .str.replace("'", "")
                                    .str.split(","))
      .explode("skill")
      .pipe(create_dummies, 'skill')
      .iloc[:, 5:]  # drop the four original columns and `skill`
      .groupby(level=0)
      .sum()
)
# Output:
display(out)
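If job_skills arrives as a string such as "['Python', ' UI']", one alternative sketch is to parse each cell with ast.literal_eval, strip stray whitespace from every skill, and let Series.str.get_dummies build the 0/1 columns. The sample frame and the parse_skills helper are hypothetical, and the cleanup rule (a plain strip) is an assumption; messier values such as a leading '. ' would need their own rule.

```python
import ast
import pandas as pd

# Hypothetical sample: each job_skills cell is the string form of a list.
df = pd.DataFrame({
    "job_title": ["a", "b"],
    "job_skills": ["['Python', ' UI']", "['Flask', 'Python', ' Celery']"],
})

def parse_skills(cell):
    """Parse a list-like string and strip whitespace around each skill."""
    return [skill.strip() for skill in ast.literal_eval(cell)]

# Join each parsed list with '|' so str.get_dummies can split it into 0/1 columns.
# This assumes no skill name contains the '|' character.
encoded = df["job_skills"].map(parse_skills).str.join("|").str.get_dummies(sep="|")

out = df[["job_title"]].join(encoded)
print(out)
```

Because the skills are stripped before encoding, ' Celery' and 'Celery' end up in the same column rather than two separate ones.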

# Input used:
print(df.to_string())
  job_title   company    location    job_skills
0 Python Or   ItsTime    Oakville,   ['Python', 'UI', 'Computer Science', '. Information Technology (IT)', 'Software Development']
1 Senior Pyt  CLOUDSIG   Sofia, Bul  ['Python3', 'Relational Databases', '. Celery', 'VMWare', '. Django',' Continous Integration']
2 Flask Pyth  Cyber sec  Cairo, Egy  ['Flask', 'Python', '. Software Development', '. Computer Science', '. Information Technology (IT)']

https://stackoverflow.com/questions/74336130