
Help creating a bag-of-words model

Stack Overflow user
Asked 2018-06-21 16:48:47
1 answer · 121 views · 0 followers · score 1

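Disclaimer: this is part of a homework assignment.

I have a set of tweets, and I need to build a classifier that tries to predict their sentiment. I plan to do this by creating a bag-of-words model and applying a support vector machine with a radial kernel to the data.

Here is the raw data, to give you an idea: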

> original_tweets
# A tibble: 2,385 x 3
   tweet_id sentiment text                                                                                                                      
      <int> <chr>     <chr>                                                                                                                     
 1        1 positive  @TylerSkewes: It is almost 2014. Where are the self-driving cars so we don't have to worry about a DD tonight. Forreal tho
 2        2 positive  @WIRED: BMW builds a self-driving car -- that drifts I love this technology. Drive me to work baby!
 3        3 positive  Google better hurry up with that driverless car. Watching grandma do an 8 point turn to get in a parking spot is horrific.
 4        4 positive  I just waved thank you to this lady that let me merge on the highway and she gave me the finger. Need my self driving car.
 5        5 positive  I might be the only person who starts #cheering in their car when they see a @google car :) #happiness #feelslikeChristmas
 6        6 positive  I want the driverless car, and BAD. Seriously I would be happy if tomorrow morning there were no drivers behind the wheel.
 7        7 positive  I'm over here writing a 2000 word essay while *****s at Google are on driverless cars making ground breaking shit. Damn. _
 8        8 positive  Is it crazy to think that self driving cars will be the biggest innovation of the last few decades? 
 9        9 positive  Its very nice!RT @cdixon: It's awesome that Google is investing in futuristic stuff like AR glasses and self-driving cars.
10       10 positive  Look closely you will see the reflection of a google car !!!! Screen shot from google maps !!!!!
# ... with 2,375 more rows
> 

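I have lightly edited a few of the tweets because they had URLs in them, but you get the idea.

I have put the data into a tidy format and computed a TF-IDF score for each term. For my feature space, I am taking the 1,000 terms with the highest IDF scores.

Here is a sample of my data: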

> feature_space
# A tibble: 3,000 x 7
   tweet_id sentiment word                   n     tf   idf tf_idf
      <int> <chr>     <chr>              <int>  <dbl> <dbl>  <dbl>
 1        1 positive  forreal                1 0.0435  7.78  0.338
 2        2 positive  drifts                 1 0.0476  7.78  0.370
 3        2 positive  rprjtelkg6             1 0.0476  7.78  0.370
 4        5 positive  cheering               1 0.0455  7.78  0.353
 5        5 positive  feelslikechristmas     1 0.0455  7.78  0.353
 6        7 positive  2000                   1 0.0476  7.78  0.370
 7        7 positive  *****s                 1 0.0476  7.78  0.370
 8        8 positive  decades                1 0.0417  7.78  0.324
 9        8 positive  vltlymug89             1 0.0417  7.78  0.324
10        9 positive  ar                     1 0.0476  7.78  0.370
# ... with 2,990 more rows
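
For reference, the tf, idf, and tf_idf columns above are consistent with the usual definitions: tf is a term's count divided by the total number of terms in the tweet, idf is the natural log of the number of tweets divided by the number of tweets containing the term, and tf_idf is their product. For example, ln(2385) ≈ 7.78 for a word that occurs in only one tweet. A minimal sketch on a made-up three-tweet corpus, assuming dplyr is available; `toy_counts` and its contents are hypothetical and not part of the original data:

```r
library(dplyr)

# Hypothetical toy data: one row per (tweet, word) pair with its count n
toy_counts <- tibble(
  tweet_id = c(1L, 1L, 2L, 2L, 3L),
  word     = c("car", "love", "car", "drifts", "car"),
  n        = c(1L, 1L, 2L, 1L, 1L)
)

n_tweets <- n_distinct(toy_counts$tweet_id)

toy_tf_idf <- toy_counts %>%
  group_by(tweet_id) %>%
  mutate(tf = n / sum(n)) %>%                             # share of the tweet's terms
  group_by(word) %>%
  mutate(idf = log(n_tweets / n_distinct(tweet_id))) %>%  # ln(tweets / tweets containing the word)
  ungroup() %>%
  mutate(tf_idf = tf * idf)
```

In this toy corpus "car" appears in every tweet, so its idf (and therefore its tf_idf) is zero, which is why ranking by IDF favours rare words.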

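I want to build a bag-of-words model that uses these TF-IDF scores to create a sentiment classifier. For this model, I understand that I need to set up my data frame so that each tweet is one row and there is one column for every term in my feature space, holding that term's TF-IDF weight.

I am having a hard time figuring out how best to mutate a tibble or data frame to get the data into this format. I have tried various combinations of mutate() and join(), but it never comes out the way I want.

How can I quickly add 3,000 or more columns to a data frame or tibble, based on a set of feature words, and fill this sparse data structure with their TF-IDF values? I don't necessarily need a direct code answer, but a step in the right direction in R would help me a lot.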
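
One possible direction, offered as a sketch rather than a definitive approach: reshape the long feature table to wide with `tidyr::pivot_wider()` (available in tidyr 1.0.0 and later), then join it back onto the tweets so that tweets without any feature word still get a row of zeros. The toy tibbles `tweets` and `features` below are hypothetical stand-ins for `original_tweets` and `feature_space`, and the `across()` call assumes dplyr 1.0.0 or later.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-ins for original_tweets (minus text) and feature_space
tweets <- tibble(
  tweet_id  = 1:3,
  sentiment = c("positive", "positive", "negative")
)
features <- tibble(
  tweet_id = c(1L, 2L, 2L),
  word     = c("forreal", "drifts", "rprjtelkg6"),
  tf_idf   = c(0.338, 0.370, 0.370)
)

# Long -> wide: one column per word, missing (tweet, word) pairs filled with 0
wide <- features %>%
  pivot_wider(
    id_cols     = tweet_id,
    names_from  = word,
    values_from = tf_idf,
    values_fill = list(tf_idf = 0)
  )

# Keep every tweet, including those with no feature word (tweet 3 here),
# and turn the NAs introduced by the join into zeros
bag <- tweets %>%
  left_join(wide, by = "tweet_id") %>%
  mutate(across(-c(tweet_id, sentiment), ~ replace_na(.x, 0)))
```

The join step matters because a tweet that contains none of the feature words never appears in the long table, so pivoting alone would drop it.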

Update: I now have an empty tibble, and I just need to fill in the non-zero TF-IDF values from my data. Here it is:

> bag_of_words
# A tibble: 2,385 x 3,002
   tweet_id sentiment forreal drifts rprjtelkg6 cheering feelslikechristmas `2000` *****s decades vltlymug89    ar closely reflection zg7hvvfgpn
      <int> <chr>       <dbl>  <dbl>      <dbl>    <dbl>              <dbl>  <dbl>  <dbl>   <dbl>      <dbl> <dbl>   <dbl>      <dbl>      <dbl>
 1        1 positive        0      0          0        0                  0      0      0       0          0     0       0          0          0
 2        2 positive        0      0          0        0                  0      0      0       0          0     0       0          0          0
 3        3 positive        0      0          0        0                  0      0      0       0          0     0       0          0          0
 4        4 positive        0      0          0        0                  0      0      0       0          0     0       0          0          0
 5        5 positive        0      0          0        0                  0      0      0       0          0     0       0          0          0
 6        6 positive        0      0          0        0                  0      0      0       0          0     0       0          0          0
 7        7 positive        0      0          0        0                  0      0      0       0          0     0       0          0          0
 8        8 positive        0      0          0        0                  0      0      0       0          0     0       0          0          0
 9        9 positive        0      0          0        0                  0      0      0       0          0     0       0          0          0
10       10 positive        0      0          0        0                  0      0      0       0          0     0       0          0          0
# ... with 2,375 more rows, and 2,987 more variables

1 Answer

Stack Overflow user

Accepted answer

Posted 2018-06-21 19:04:51

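OK, I think I have a solution. I am still definitely curious how to do this without a for loop, though; I am not yet comfortable with the apply() style of coding.

Here is what I came up with: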

library(dplyr)

# create bag of words model
# keep tweet_id and sentiment, drop the raw text
bag_of_words <- original_tweets %>%
  select(-text)

# get the distinct words from the feature space
feature_words <- unique(feature_space$word)

# generate one zero-filled column per feature word
for (w in feature_words) {
  bag_of_words[[w]] <- 0
}

# fill in columns with values from the feature space
for (i in seq_len(nrow(feature_space))) {
  word <- feature_space$word[i]
  # look up the row by tweet_id instead of assuming tweet_id equals the row number
  row <- match(feature_space$tweet_id[i], bag_of_words$tweet_id)
  bag_of_words[[word]][row] <- feature_space$tf_idf[i]
}

Checking the output, it looks good:

> bag_of_words
# A tibble: 2,385 x 3,002
   tweet_id sentiment forreal drifts rprjtelkg6 cheering feelslikechristmas `2000` *****s decades vltlymug89    ar closely reflection zg7hvvfgpn
      <int> <chr>       <dbl>  <dbl>      <dbl>    <dbl>              <dbl>  <dbl>  <dbl>   <dbl>      <dbl> <dbl>   <dbl>      <dbl>      <dbl>
 1        1 positive    0.338  0          0        0                  0      0      0       0          0     0       0          0          0    
 2        2 positive    0      0.370      0.370    0                  0      0      0       0          0     0       0          0          0    
 3        3 positive    0      0          0        0                  0      0      0       0          0     0       0          0          0    
 4        4 positive    0      0          0        0                  0      0      0       0          0     0       0          0          0    
 5        5 positive    0      0          0        0.353              0.353  0      0       0          0     0       0          0          0    
 6        6 positive    0      0          0        0                  0      0      0       0          0     0       0          0          0    
 7        7 positive    0      0          0        0                  0      0.370  0.370   0          0     0       0          0          0    
 8        8 positive    0      0          0        0                  0      0      0       0.324      0.324 0       0          0          0    
 9        9 positive    0      0          0        0                  0      0      0       0          0     0.370   0          0          0    
10       10 positive    0      0          0        0                  0      0      0       0          0     0       0.370      0.370      0.370
# ... with 2,375 more rows, and 2,987 more variables

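In hindsight, I probably made this harder than it needed to be, but I would definitely like to see any more efficient way of doing this that experienced R vets can think of. Cheers.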
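
Since almost every cell in the table is zero, one alternative that may be worth considering is to skip the wide tibble and build a sparse matrix directly from the long table. A minimal sketch using `Matrix::sparseMatrix()`, assuming the Matrix package is installed (it ships with standard R distributions); the toy vectors below are hypothetical stand-ins for `original_tweets$tweet_id` and the `feature_space` columns:

```r
library(Matrix)

# Hypothetical stand-ins for original_tweets$tweet_id and the feature_space columns
all_tweet_ids <- c(1L, 2L, 3L)
tweet_id <- c(1L, 2L, 2L)
word     <- c("forreal", "drifts", "rprjtelkg6")
tf_idf   <- c(0.338, 0.370, 0.370)

vocab <- unique(word)

# Row index: position of each tweet; column index: position of each word.
# Only the non-zero cells are stored.
bow <- sparseMatrix(
  i        = match(tweet_id, all_tweet_ids),
  j        = match(word, vocab),
  x        = tf_idf,
  dims     = c(length(all_tweet_ids), length(vocab)),
  dimnames = list(as.character(all_tweet_ids), vocab)
)
```

This involves no loop and stores only the three non-zero values. `as.matrix(bow)` converts to a dense matrix if a modelling function requires one; whether a particular SVM implementation accepts a sparse matrix directly is something to confirm in its documentation.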

Score: 1
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/50973763
