Suppose I have a DataFrame and a list of words:

toxic = ['bad', 'horrible', 'disguisting']
df = pd.DataFrame({'text': ['You look horrible',
                            'You are good',
                            'you are bad and disguisting']})

main = pd.concat([df, pd.DataFrame(columns=toxic)]).fillna(0)
samp = main['text'].str.split().apply(lambda x: [i for i in toxic if i in x])
for i, j in enumerate(samp):
    for k in j:
        main.loc[i, k] = 1

This results in:
   bad  disguisting  horrible                         text
0    0            0         1            You look horrible
1    0            0         0                 You are good
2    1            1         0  you are bad and disguisting

This is a bit faster than get_dummies, but the Python-level loop in pandas becomes a bottleneck when there is a lot of data.
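As a side note (a sketch not in the original post, with variable names of my own choosing): the explicit Python loop can also be replaced by one vectorized str.contains pass per toxic word, which avoids iterating over rows entirely:

```python
import pandas as pd

toxic = ['bad', 'horrible', 'disguisting']
df = pd.DataFrame({'text': ['You look horrible',
                            'You are good',
                            'you are bad and disguisting']})

flags = df.copy()
for word in toxic:
    # whole-word regex match, so e.g. 'bad' does not fire inside 'badly'
    flags[word] = df['text'].str.contains(rf'\b{word}\b').astype(int)
```

This loops over the (usually short) word list rather than over the rows, so the per-row work stays vectorized.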
I tried str.get_dummies, which one-hot encodes every word in the Series, and that makes it somewhat slower:

pd.concat([df, main['text'].str.get_dummies(' ')[toxic]], axis=1)
                          text  bad  horrible  disguisting
0            You look horrible    0         1            0
1                 You are good    0         0            0
2  you are bad and disguisting    1         0            1

If I try to do the same with sklearn's LabelEncoder:
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(toxic)
main['text'].str.split().apply(le.transform)

This raises ValueError: y contains new labels. Is there any way to ignore the error in LabelEncoder?
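One way to sidestep that ValueError (a hedged sketch, not from the original question): filter out-of-vocabulary words before calling transform, since LabelEncoder only accepts labels it saw at fit time:

```python
import pandas as pd
from sklearn import preprocessing

toxic = ['bad', 'horrible', 'disguisting']
df = pd.DataFrame({'text': ['You look horrible',
                            'You are good',
                            'you are bad and disguisting']})

le = preprocessing.LabelEncoder()
le.fit(toxic)
known = set(le.classes_)  # labels the encoder has seen

# Keep only in-vocabulary words, so transform never raises
codes = df['text'].str.split().apply(
    lambda words: le.transform([w for w in words if w in known]).tolist()
)
```

Note that le.classes_ is stored sorted, so 'bad' maps to 0, 'disguisting' to 1, and 'horrible' to 2 here.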
How can I make this faster, and are there any other quick ways to achieve the same result?
Posted on 2018-01-12 13:13:09
Use sklearn.feature_extraction.text.CountVectorizer:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(vocabulary=toxic)
r = pd.SparseDataFrame(cv.fit_transform(df['text']),
                       df.index,
                       cv.get_feature_names(),
                       default_fill_value=0)

Result:
In [127]: r
Out[127]:
bad horrible disguisting
0 0 1 0
1 0 0 0
2 1 0 1
In [128]: type(r)
Out[128]: pandas.core.sparse.frame.SparseDataFrame
In [129]: r.info()
<class 'pandas.core.sparse.frame.SparseDataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
bad 3 non-null int64
horrible 3 non-null int64
disguisting 3 non-null int64
dtypes: int64(3)
memory usage: 104.0 bytes
In [130]: r.memory_usage()
Out[130]:
Index 80
bad 8 # <--- NOTE: it's using 8 bytes (1x int64) instead of 24 bytes for three values (3x8)
horrible 8
disguisting 8
dtype: int64

Joining the SparseDataFrame with the original DataFrame:
In [137]: r2 = df.join(r)
In [138]: r2
Out[138]:
text bad horrible disguisting
0 You look horrible 0 1 0
1 You are good 0 0 0
2 you are bad and disguisting 1 0 1
In [139]: r2.memory_usage()
Out[139]:
Index 80
text 24
bad 8
horrible 8
disguisting 8
dtype: int64
In [140]: type(r2)
Out[140]: pandas.core.frame.DataFrame
In [141]: type(r2['horrible'])
Out[141]: pandas.core.sparse.series.SparseSeries
In [142]: type(r2['text'])
Out[142]: pandas.core.series.Series

PS: In older pandas versions, sparse columns lost their sparseness (became dense) after joining a SparseDataFrame with a regular DataFrame; now we can have a mixture of regular Series (columns) and SparseSeries in one DataFrame, which is a really nice feature!
Posted on 2020-07-11 10:48:10
The accepted answer is deprecated; see the release notes:

SparseSeries and SparseDataFrame were removed in pandas 1.0.0. This migration guide is there to help migrate from earlier versions.

Solution for pandas 1.0.5:
r = pd.DataFrame.sparse.from_spmatrix(cv.fit_transform(df['text']),
                                      df.index,
                                      cv.get_feature_names())

https://stackoverflow.com/questions/48226506