Q: How do I create a quanteda corpus from a data.frame with multiple text columns?

Stack Overflow user

Asked 2018-02-06 18:09:31

1 answer · 2.4K views · 0 followers · 3 votes

Say I have the following:

x10 = data.frame(id = c(1,2,3),vars =c('top','down','top'), 
     text1=c('this is text','so is this','and this is too.'),
     text2=c('we have more text here','and here too','and look at this, more text.'))

I would like to create a dfm/corpus in quanteda using something like this:

x1 = corpus(x10,docid_field='id',text_field=c(3:4),tolower=T)

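Obviously this fails, because text_field accepts only a single column. Is there a better way to handle this than building two corpora? Could I build two and then merge them on id? Is that a thing?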

quanteda

1 Answer

Stack Overflow user

Accepted answer

Answered 2018-02-06 19:17:06

First, let's recreate the data.frame so that the character values are not converted to factors:

x10 = data.frame(id = c(1,2,3), vars = c('top','down','top'), 
                 text1 = c('this is text', 'so is this', 'and this is too.'),
                 text2 = c('we have more text here', 'and here too', 'and look at this, more text.'),
                 stringsAsFactors = FALSE)

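Then we have two options.

Method 1: Reshape to "long" format and create a single corpus

First "melt" the data so that all of the text sits in a single column, then import the result as one corpus. (An alternative is tidyr::gather().)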

x10b <- reshape2::melt(x10, id.vars = c("id", "vars"), 
                       measure.vars = c("text1", "text2"),
                       variable.name = "doc_id", value.name = "text")

# because corpus() takes document names from row names, by default 
row.names(x10b) <- paste(x10b$doc_id, x10b$id, sep = "_")

x10b
#         id vars doc_id                         text
# text1_1  1  top  text1                 this is text
# text1_2  2 down  text1                   so is this
# text1_3  3  top  text1             and this is too.
# text2_1  1  top  text2       we have more text here
# text2_2  2 down  text2                 and here too
# text2_3  3  top  text2 and look at this, more text.

x10_corpus <- corpus(x10b)
summary(x10_corpus)
# Corpus consisting of 6 documents:
#     
#    Text Types Tokens Sentences id vars doc_id
# text1_1     3      3         1  1  top  text1
# text1_2     3      3         1  2 down  text1
# text1_3     5      5         1  3  top  text1
# text2_1     5      5         1  1  top  text2
# text2_2     3      3         1  2 down  text2
# text2_3     8      8         1  3  top  text2
# 
# Source:  /Users/kbenoit/Dropbox (Personal)/GitHub/lse-my459/assignment-2/* on x86_64 by kbenoit
# Created: Tue Feb  6 19:06:07 2018
# Notes:
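
The tidyr::gather() alternative mentioned for Method 1 could look roughly like the sketch below. The names source, x10_long, and x10_corpus_tidy are illustrative and do not come from the original answer. The sketch assumes tidyr is installed and that the installed quanteda version's corpus() method for data frames accepts docid_field and text_field. Note that gather() has been superseded by pivot_longer() in newer tidyr releases, although it still works.

```r
library(quanteda)

x10 <- data.frame(id = c(1, 2, 3), vars = c("top", "down", "top"),
                  text1 = c("this is text", "so is this", "and this is too."),
                  text2 = c("we have more text here", "and here too",
                            "and look at this, more text."),
                  stringsAsFactors = FALSE)

# stack text1 and text2 into one "text" column; "source" (a hypothetical
# name) records which original column each row came from
x10_long <- tidyr::gather(x10, key = "source", value = "text", text1, text2)

# build explicit, unique document ids instead of relying on row names
x10_long$doc_id <- paste(x10_long$source, x10_long$id, sep = "_")

# assumption: this quanteda version supports docid_field / text_field
x10_corpus_tidy <- corpus(x10_long, docid_field = "doc_id", text_field = "text")
docnames(x10_corpus_tidy)
```

Passing the id column explicitly avoids depending on how a given quanteda version derives document names from row names.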

Method 2: Create two corpus objects and combine them

Here we create the two corpus objects separately and combine them with the + operator.

x10_corpus2 <- 
    corpus(x10[, -which(names(x10)=="text2")], text_field = "text1") +
    corpus(x10[, -which(names(x10)=="text1")], text_field = "text2")
summary(x10_corpus2)
# Corpus consisting of 6 documents:
#     
#   Text Types Tokens Sentences id vars
#  text1     3      3         1  1  top
#  text2     3      3         1  2 down
#  text3     5      5         1  3  top
# text11     5      5         1  1  top
# text21     3      3         1  2 down
# text31     8      8         1  3  top
# 
# Source:  Combination of corpuses corpus(x10[, -which(names(x10) == "text2")], text_field = "text1") and corpus(x10[, -which(names(x10) == "text1")], text_field = "text2")
# Created: Tue Feb  6 19:14:14 2018
# Notes:

At this stage you can also reassign the document names with docnames(x10_corpus2) <- to make them look more like those from the first method.
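
As a hedged sketch of that renaming step, the code below gives each part "column_id" style names through docnames<- and then combines the parts. It differs slightly from the text above in that it renames before combining rather than after. That choice is an assumption on my part: some quanteda versions may refuse to combine corpora whose document names collide, so check this against your installed version. The object names corp1 and corp2 are illustrative.

```r
library(quanteda)

x10 <- data.frame(id = c(1, 2, 3), vars = c("top", "down", "top"),
                  text1 = c("this is text", "so is this", "and this is too."),
                  text2 = c("we have more text here", "and here too",
                            "and look at this, more text."),
                  stringsAsFactors = FALSE)

# one corpus per text column, each keeping id and vars as document variables
corp1 <- corpus(x10[, c("id", "vars", "text1")], text_field = "text1")
corp2 <- corpus(x10[, c("id", "vars", "text2")], text_field = "text2")

# name documents "<column>_<id>" so they match the Method 1 naming
docnames(corp1) <- paste("text1", x10$id, sep = "_")
docnames(corp2) <- paste("text2", x10$id, sep = "_")

# combine the two corpora into one
x10_corpus2 <- corp1 + corp2
docnames(x10_corpus2)
```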

3 votes

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/48649343
