I have a df:
name sample
1 a Category 1: qwe, asd (line break) Category 2: sdf, erg
2 b Category 2: sdf, erg(line break) Category 5: zxc, eru
...
30 p Category 1: asd, Category PE: 2134, EFDgh, Pdr tke, err
In the end I want:
name qwe asd sdf erg zxc eru 2134 EFDgh Pdr tke err
1 a 1 1 1 1 0 0 0 0 0 0
2 b 0 0 1 1 1 1 0 0 0 0
...
30 p 0 1 0 0 0 0 0 1 1 0
Honestly, I don't even know where to start with this. My first thought was to split the text, but then I got a bit lost.
Posted on 2016-03-31 07:10:01
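You can use str.findall with a regex pattern to find all three-character words, using a negative lookbehind and lookahead to reject matches that touch another word character. Then you can join the resulting lists with str.join and use str.get_dummies to get your dummy columns. After that you can drop the extra columns: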
import pandas as pd
# raw string so that \w reaches the regex engine unchanged
df['new'] = df['sample'].str.findall(r'(?<!\w)\w{3}(?!\w)')
df_dummies = df['new'].str.join('_').str.get_dummies(sep='_')
df = pd.concat([df, df_dummies], axis=1)
In [215]: df['new']
Out[215]:
1 [qwe, asd, sdf, erg]
2 [sdf, erg, zxc, eru]
Name: new, dtype: object
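To see what the pattern picks up outside the session above, here is a minimal sketch. The frame is my own reconstruction of rows 1, 2 and 30 from the question, with "\n" standing in for "(line break)", so treat the literal strings as an assumption about the real data.

```python
import pandas as pd

# Hypothetical reconstruction of rows 1, 2 and 30 from the question;
# "\n" stands in for "(line break)".
df = pd.DataFrame(
    {
        "name": ["a", "b", "p"],
        "sample": [
            "Category 1: qwe, asd\nCategory 2: sdf, erg",
            "Category 2: sdf, erg\nCategory 5: zxc, eru",
            "Category 1: asd, Category PE: 2134, EFDgh, Pdr tke, err",
        ],
    },
    index=[1, 2, 30],
)

# (?<!\w) and (?!\w) require that no word character sits directly before
# or after the match, so \w{3} only matches whole three-character words.
pattern = r"(?<!\w)\w{3}(?!\w)"
matches = df["sample"].str.findall(pattern)

print(matches.loc[1])   # ['qwe', 'asd', 'sdf', 'erg']
print(matches.loc[30])  # ['asd', 'Pdr', 'tke', 'err']
```

Note that on row 30 the pattern skips 2134 and EFDgh and splits "Pdr tke" in two, because it only accepts words of exactly three characters.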
In [216]: df
Out[216]:
name sample new asd erg eru qwe sdf zxc
1 a Category 1: qwe, asd (line break) Category 2: ... [qwe, asd, sdf, erg] 1 1 0 1 1 0
2 b Category 2: sdf, erg(line break) Category 5: z... [sdf, erg, zxc, eru] 0 1 1 0 1 1
After dropping the extra columns, you get the result:
df = df.drop(['sample', 'new'], axis=1)
In [218]: df
Out[218]:
name asd erg eru qwe sdf zxc
1 a 1 1 0 1 1 0
2 b 0 1 1 0 1 1
https://stackoverflow.com/questions/36323030
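The three-character pattern fits rows 1 and 2, but row 30 in the question has values of other lengths (2134, EFDgh, "Pdr tke"). A possible variation, not part of the answer above, is to treat the "Category X:" labels as separators and split on commas and line breaks instead. This is only a sketch: the sample frame is my reconstruction of the question's data, and it assumes every label looks like "Category <word>:" and that values never contain commas.

```python
import pandas as pd

# Hypothetical reconstruction of rows 1, 2 and 30 from the question;
# "\n" stands in for "(line break)".
df = pd.DataFrame(
    {
        "name": ["a", "b", "p"],
        "sample": [
            "Category 1: qwe, asd\nCategory 2: sdf, erg",
            "Category 2: sdf, erg\nCategory 5: zxc, eru",
            "Category 1: asd, Category PE: 2134, EFDgh, Pdr tke, err",
        ],
    },
    index=[1, 2, 30],
)

cleaned = (
    df["sample"]
    # assumed label shape "Category <word>:"; each label becomes a separator
    .str.replace(r"Category\s+\w+:", ",", regex=True)
    # commas and line breaks, with surrounding whitespace, become one comma
    .str.replace(r"\s*[,\n]\s*", ",", regex=True)
    # collapse runs of commas left behind by removed labels
    .str.replace(r",+", ",", regex=True)
    .str.strip(",")
)

# One 0/1 column per distinct value
dummies = cleaned.str.get_dummies(sep=",")
result = pd.concat([df[["name"]], dummies], axis=1)
print(result)
```

With this version "Pdr tke" stays a single column, since only commas and line breaks separate values.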