Q: Using content_transformer with udpipe_annotate

Stack Overflow user

Asked on 2018-08-02 21:04:52

1 answer, 369 views, 0 followers, 0 votes

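I just discovered that udpipe has a great way of showing correlations, so I started working with it. The code from this site works perfectly if I use it on a csv file right after importing it, without making any changes to it.

But my problem appears as soon as I create a corpus and change or remove some words. I'm no expert in R, but I have googled a lot and can't seem to find an answer.

Here is my code: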

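library(readr)   # read_delim
library(tm)      # Corpus, tm_map, content_transformer
library(udpipe)  # txt_freq, keywords_*, cooccurrence

txt <- read_delim(fileName, ";", escape_double = FALSE, trim_ws = TRUE)

# Create the corpus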
docs <- Corpus(VectorSource(txt))
docs <- tm_map(docs, tolower)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, removeWords, stopwords('nl'))
docs <- tm_map(docs, removeWords, myWords())
docs <- tm_map(docs, content_transformer(gsub), pattern = "afspraak|afspraken|afgesproken", replacement = "afspraak")
docs <- tm_map(docs, content_transformer(gsub), pattern = "communcatie|communiceren|communicatie|comminicatie|communiceer|comuniseren|comunuseren|communictatie|comminiceren|comminisarisacie|communcaite", replacement = "communicatie")
docs <- tm_map(docs, content_transformer(gsub), pattern = "contact|kontact|kontakt", replacement = "contact")

comments <- docs

library(lattice)
stats <- txt_freq(x$upos)
stats$key <- factor(stats$key, levels = rev(stats$key))
#barchart(key ~ freq, data = stats, col = "cadetblue", main = "UPOS (Universal Parts of Speech)\n frequency of occurrence", xlab = "Freq")

## NOUNS (zelfstandige naamwoorden)
stats <- subset(x, upos %in% c("NOUN")) 
stats <- txt_freq(stats$token)
stats$key <- factor(stats$key, levels = rev(stats$key))
barchart(key ~ freq, data = head(stats, 20), col = "cadetblue", main = "Most occurring nouns", xlab = "Freq")

## ADJECTIVES (bijvoeglijke naamwoorden)
stats <- subset(x, upos %in% c("ADJ")) 
stats <- txt_freq(stats$token)
stats$key <- factor(stats$key, levels = rev(stats$key))
barchart(key ~ freq, data = head(stats, 20), col = "cadetblue", main = "Most occurring adjectives", xlab = "Freq")

## Using RAKE (harkjes)
stats <- keywords_rake(x = x, term = "lemma", group = "doc_id", relevant = x$upos %in% c("NOUN", "ADJ"))
stats$key <- factor(stats$keyword, levels = rev(stats$keyword))
barchart(key ~ rake, data = head(subset(stats, freq > 3), 20), col = "cadetblue", main = "Keywords identified by RAKE", xlab = "Rake")

## Using Pointwise Mutual Information Collocations
x$word <- tolower(x$token)
stats <- keywords_collocation(x = x, term = "word", group = "doc_id")
stats$key <- factor(stats$keyword, levels = rev(stats$keyword))
barchart(key ~ pmi, data = head(subset(stats, freq > 3), 20), col = "cadetblue", main = "Keywords identified by PMI Collocation", xlab = "PMI (Pointwise Mutual Information)")

## Using a sequence of POS tags (noun phrases / verb phrases)
x$phrase_tag <- as_phrasemachine(x$upos, type = "upos")
stats <- keywords_phrases(x = x$phrase_tag, term = tolower(x$token), pattern = "(A|N)*N(P+D*(A|N)*N)*", is_regex = TRUE, detailed = FALSE)
stats <- subset(stats, ngram > 1 & freq > 3)
stats$key <- factor(stats$keyword, levels = rev(stats$keyword))
barchart(key ~ freq, data = head(stats, 20), col = "cadetblue", main = "Keywords - simple noun phrases", xlab = "Frequency")


cooc <- cooccurrence(x = subset(x, upos %in% c("NOUN", "ADJ")),
                     term = "lemma",
                     group = c("doc_id", "paragraph_id", "sentence_id"))
head(cooc)
library(igraph)
library(ggraph)
library(ggplot2)
wordnetwork <- head(cooc, 30)
wordnetwork <- graph_from_data_frame(wordnetwork)
ggraph(wordnetwork, layout = "fr") +
    geom_edge_link(aes(width = cooc, edge_alpha = cooc), edge_colour = "pink") +
    geom_node_text(aes(label = name), col = "darkgreen", size = 4) +
    theme_graph(base_family = "Arial Narrow") +
    theme(legend.position = "none") +
    labs(title = "Cooccurrences within sentence", subtitle = "Nouns & Adjective")

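As soon as I convert the imported file into a corpus, it fails. Does anyone know how I can still apply the tm_map functions and then run the udpipe code?

Thanks in advance!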

udpipe

1 Answer

Stack Overflow user

Accepted answer

Answered on 2018-08-02 22:16:24

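There are several solutions for what you want. But since your corpus was created with VectorSource, it is just one long input vector. You can easily turn it back into a vector so that udpipe can take over.

In the udpipe example documentation everything is defined as x, so I will do the same. After cleaning your corpus, just do: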

x <- as.character(docs[1])

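The [1] after docs is important; otherwise you get some unwanted extra characters. Once that is done, run the udpipe commands to turn the vector into the data.frame you need.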

x <- udpipe_annotate(ud_model, x)
x <- as.data.frame(x)
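
To make the round trip concrete, here is a minimal sketch of cleaning with tm and then handing plain text to udpipe. It rests on a few assumptions that are not in the question: the toy `comments` vector stands in for the csv column, `annotate_texts` is a hypothetical helper name, and `content()` is used as an alternative way to pull the character vector out of a corpus built with `Corpus(VectorSource(...))`.

```r
library(tm)

# Toy input standing in for the imported csv column (hypothetical data).
comments <- c("Wij hebben 3 Afspraken gemaakt.",
              "Goede communicatie is belangrijk!")

# Clean with tm as in the question; base functions go through content_transformer().
docs <- Corpus(VectorSource(comments))
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, stripWhitespace)

# Pull the cleaned texts back out as a plain character vector.
cleaned <- content(docs)

# Hypothetical helper: annotate a character vector and return a data.frame.
# It needs a loaded udpipe model, so it is only defined here, not called.
annotate_texts <- function(texts, model) {
  as.data.frame(udpipe::udpipe_annotate(model, x = texts))
}

# Usage (downloads the Dutch model on first run):
# model_info <- udpipe::udpipe_download_model(language = "dutch")
# ud_model   <- udpipe::udpipe_load_model(model_info$file_model)
# x <- annotate_texts(cleaned, ud_model)
```

Passing one element per comment, rather than one long string, should also keep a separate doc_id per comment in the annotated data.frame, which the grouping by "doc_id" in the question's code relies on.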

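Another option is to first write the corpus to disk (see ?writeCorpus for more information), then read the cleaned files back in and feed them to udpipe. This is very much a workaround, but it might result in a better workflow.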
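
A rough sketch of that disk round trip, under some assumptions: the toy `comments` vector, the temporary folder, and the `doc1.txt`-style file names are made up for illustration, and the corpus is built with `VCorpus` here rather than `Corpus`.

```r
library(tm)

# Toy texts standing in for the cleaned comments (hypothetical data).
comments <- c("eerste afspraak gemaakt", "goede communicatie is belangrijk")
docs <- VCorpus(VectorSource(comments))

# Write one plain-text file per document into a temporary folder.
out_dir <- file.path(tempdir(), "cleaned_corpus")
dir.create(out_dir, showWarnings = FALSE)
file_names <- sprintf("doc%d.txt", seq_along(docs))
writeCorpus(docs, path = out_dir, filenames = file_names)

# Read the files back into a character vector, one element per document.
cleaned <- vapply(
  file.path(out_dir, file_names),
  function(f) paste(readLines(f, warn = FALSE), collapse = " "),
  character(1),
  USE.NAMES = FALSE
)
```

The resulting `cleaned` vector can then be passed to udpipe_annotate in place of the corpus.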

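Also, udpipe handles punctuation itself: it puts it in a special upos class called PUNCT, with xpos descriptions (in Dutch if you use a Dutch model) such as Punc|komma or Punc|punt. If a noun starts with a capital letter, its lemma will be lowercase.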
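
Because punctuation ends up in its own class, it can be dropped after annotation instead of being stripped beforehand. The sketch below shows the idea on a hand-written data.frame that only mimics the column layout of the annotated output; the rows are illustrative assumptions, not real model output.

```r
# Hand-written stand-in for as.data.frame(udpipe_annotate(...)).
# The rows are illustrative only, not produced by a real model.
x <- data.frame(
  doc_id = "doc1",
  token  = c("Goede", "Communicatie", ",", "altijd", "."),
  lemma  = c("goed", "communicatie", ",", "altijd", "."),
  upos   = c("ADJ", "NOUN", "PUNCT", "ADV", "PUNCT"),
  xpos   = c(NA, NA, "Punc|komma", NA, "Punc|punt"),
  stringsAsFactors = FALSE
)

# Keep everything except the punctuation rows.
x_no_punct <- subset(x, upos != "PUNCT")
```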

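In your case I would just use basic regex options to go through the data instead of using tm. The Dutch stopwords only remove some verbs such as "zijn", "worden" and "kunnen", some adjectives such as "te", and pronouns such as "ik" and "we". Since you only look at nouns and adjectives, you can filter these out in your udpipe code anyway.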
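
As a sketch of the regex-only route, the cleanup could look something like this in base R. The function name `clean_text` and the sample sentence are made up for illustration, and only a few of the spelling variants from the question are included.

```r
# Hypothetical helper: normalise a character vector with base R only.
clean_text <- function(x) {
  x <- tolower(x)
  # Remove digits; punctuation is left in place for udpipe to tag.
  x <- gsub("[[:digit:]]+", "", x)
  # Map a few spelling variants from the question onto one form.
  x <- gsub("\\b(afspraken|afgesproken)\\b", "afspraak", x, perl = TRUE)
  x <- gsub("\\b(kontact|kontakt)\\b", "contact", x, perl = TRUE)
  # Collapse repeated whitespace and trim the ends.
  x <- gsub("[[:space:]]+", " ", x)
  trimws(x)
}

cleaned <- clean_text("We hebben 2 Afspraken  gemaakt, via Kontakt.")
```

The word boundaries (`\\b`) keep the replacements from touching longer words that merely contain one of the variants, which the plain alternation patterns in the question would not prevent.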

Votes: 1

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/51654426
