
Q: Writing an R function to count unigrams in a data frame

Stack Overflow user

Asked 2021-12-02 13:45:01

Answers: 2 · Views: 46 · Followers: 0 · Votes: 1

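I want to write a function that returns the number of unigrams (single words) in a text. However, my current function does not behave the way I want.

Here are my function and a sample dataset: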

library(ngram)  # ngram() and get.ngrams() come from the 'ngram' package
library(tidyverse)

#dataframe
df<-tribble(~text,
            "This sentence",
            "I am going to luch",
            "This is a really nice and sunny day")

#function
get_unigrams <- function(text) {
  
  unigram<-  ngram(text, n = 1) %>% get.ngrams() %>% length()

  return(unigram)
}

However, calling it inside mutate() gives a very strange result:

df %>% mutate(n = get_unigrams(text))

# A tibble: 3 x 2
  text                                    n
  <chr>                               <int>
1 This sentence                          14
2 I am going to luch                     14
3 This is a really nice and sunny day    14

Every sentence gets the same count. I think this is because all three rows of text are put together and treated as a single text.
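The suspicion above can be checked without the ngram package. This is a minimal base R sketch, assuming that ngram() pools the elements of a character vector into one text and that get.ngrams() returns each distinct n-gram once; neither point is confirmed in the thread. The variable names are illustrative only.

```r
# The same three sentences as in the sample data frame
text <- c("This sentence",
          "I am going to luch",
          "This is a really nice and sunny day")

# Pool every row into one bag of words, as mutate() effectively does
# when it hands the whole column to a non-vectorized function
all_words <- unlist(strsplit(text, " ", fixed = TRUE))

n_total    <- length(all_words)          # 2 + 5 + 8 = 15 words in total
n_distinct <- length(unique(all_words))  # "This" occurs twice, so 14 distinct

print(c(total = n_total, distinct = n_distinct))
```

The 14 distinct words across all rows match the value reported for every row, which is consistent with the whole column being processed as one text.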

What I would like to get instead is this:

# A tibble: 3 x 2
  text                                    n
  <chr>                               <int>
1 This sentence                           2
2 I am going to luch                      5
3 This is a really nice and sunny day     8

Can anyone help me?

I cannot see the mistake in my function.

Thanks in advance!

Update:

I found a (temporary) solution:

get_unigrams <- function(text) {
  sapply(text, function(text){
  unigram<-  ngram(text, n = 1) %>% get.ngrams() %>% length()
  
  return(unigram)
  }
  )
}

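However, the sapply() solution is very slow, because it processes each row separately, and my data frame has more than 100k rows.

Can anyone help me speed this up, for example with a vectorized function?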

Tags: function

2 Answers

Stack Overflow user

Answered 2021-12-02 14:00:12

Use rowwise(), so that mutate() evaluates the function once per row. See ?rowwise for more information.

df %>% rowwise() %>% 
  mutate(n=get_unigrams(text))

  text                                    n
  <chr>                               <int>
1 This sentence                           2
2 I am going to luch                      5
3 This is a really nice and sunny day     8

Another solution (using base R) is:

df$n <- apply(df, 1, get_unigrams)
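One caveat with apply(df, 1, ...): it passes the whole row to the function, so it only gives the intended counts while text is the only column. The sketch below illustrates this with a hypothetical stand-in, count_unigrams(), that counts distinct whitespace-separated tokens so the example runs without the ngram package; the id column is also made up for illustration.

```r
# Hypothetical stand-in for get_unigrams(): distinct whitespace-separated tokens
count_unigrams <- function(x) {
  length(unique(unlist(strsplit(x, "\\s+"))))
}

# Sample data with one extra (made-up) column
df2 <- data.frame(
  text = c("This sentence", "I am going to luch"),
  id   = c("a", "b"),
  stringsAsFactors = FALSE
)

# apply() coerces the data frame to a character matrix and passes each
# full row, so the id value is counted as an extra word
n_apply <- unname(apply(df2, 1, count_unigrams))

# vapply() over the text column alone counts only the sentence
n_vapply <- vapply(df2$text, count_unigrams, integer(1), USE.NAMES = FALSE)

print(n_apply)
print(n_vapply)
```

Iterating over df$text directly keeps the result independent of whatever other columns the data frame has.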

Votes: 1

Stack Overflow user

Answered 2021-12-02 16:06:49

Another solution, based on stringr::str_count, which is vectorized over the column:

library(tidyverse)

df<-tribble(~text,
            "This sentence",
            "I am going to luch",
            "This is a really nice and sunny day")

df %>% 
  mutate(n = str_count(text, "\\w+"))

#> # A tibble: 3 × 2
#>   text                                    n
#>   <chr>                               <int>
#> 1 This sentence                           2
#> 2 I am going to luch                      5
#> 3 This is a really nice and sunny day     8
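For comparison, here is a base R sketch of the same vectorized idea. It also shows a difference that the sample data hides: counting every word is not the same as counting distinct words once a sentence repeats a word. Which of the two the asker needs is not stated in the thread; the fourth sentence and all variable names are assumptions added for illustration.

```r
text <- c("This sentence",
          "I am going to luch",
          "This is a really nice and sunny day",
          "the cat and the dog")   # added example with a repeated word

# Every run of word characters, counted per element
n_words <- lengths(regmatches(text, gregexpr("\\w+", text)))

# Distinct whitespace-separated tokens per element
tokens     <- strsplit(trimws(text), "\\s+")
n_distinct <- lengths(lapply(tokens, unique))

print(data.frame(text, n_words, n_distinct))
```

Both versions work on the whole column in a single call instead of building an ngram object for each row.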

Votes: 0

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/70200602
