
Q: Invalid input '§Ÿ”§' in 'utf8towcs' when using tm and pdftools

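Stack Overflow user

Asked 2017-05-17 03:55:36

1 answer · 1.5K views · 0 followers · 0 votes

My work is going well, but I have run into a problem because some of my PDF files contain odd symbols ("§Ÿ”§").

I reviewed an earlier discussion, but none of the solutions there worked: R tm package invalid input in 'utf8towcs'

This is my code so far: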

setwd("E:/OneDrive/Thesis/Received comments document/Consultation 50")
getwd()
library(tm)
library(NLP)
library(tidytext)
library(dplyr)
library(pdftools)
files <- list.files(pattern = "pdf$")
comments <- lapply(files, pdf_text)
corp <- VCorpus(VectorSource(comments)); names(corp) <- files
Comments.tdm <- TermDocumentMatrix(corp, control = list(removePunctuation = TRUE,
                                                        stopwords = TRUE,
                                                        tolower = TRUE,
                                                        stemming = TRUE,
                                                        removeNumbers = TRUE,
                                                        bounds = list(global = c(3, Inf))))

Result: Error in .tolower(txt) : invalid input '§Ÿ”§' in 'utf8towcs'

inspect(Comments.tdm[1:32,])

ap_td <- tidy(Comments.tdm)
write.csv(ap_td, file = "Terms 50.csv")

Any help is greatly appreciated. Note that this code runs fine on other PDFs.

xpdf

pdf

1 Answer

Stack Overflow user

Answered 2017-05-18 16:47:26

I went back over the earlier discussion, and this solution finally worked for me:

myCleanedText <- sapply(myText, function(x) iconv(enc2utf8(x), sub = "byte"))

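Remember to follow Fransisco's note: "Chad's solution wasn't working for me. I had it embedded in a function, and it gave an error about iconv needing a vector as input. So I decided to do the conversion before creating the corpus."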

My code now looks like this:

files <- list.files(pattern = "pdf$")
comments <- lapply(files, pdf_text)
# Convert before building the corpus. lapply (rather than sapply) keeps one
# character vector per file even when every PDF has the same number of pages.
comments <- lapply(comments, function(x) iconv(enc2utf8(x), sub = "byte"))

corp <- VCorpus(VectorSource(comments)); names(corp) <- files
Comments.tdm <- TermDocumentMatrix(corp, control = list(removePunctuation = TRUE,
                                                        stopwords = TRUE,
                                                        tolower = TRUE,
                                                        stemming = TRUE,
                                                        removeNumbers = TRUE,
                                                        bounds = list(global = c(3, Inf))))

inspect(Comments.tdm[1:28,])

ap_td <- tidy(Comments.tdm)
write.csv(ap_td, file = "Terms 44.csv")
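
The conversion step can be checked in isolation with base R alone, which may help when deciding whether a given file needs it. The sketch below is only an illustration under stated assumptions: the text is assumed to be UTF-8 (so from and to are given explicitly, whereas the call in the answer relies on the session's native encoding), the byte 0xA7 is a made-up stand-in for whatever the PDF actually contains, and clean_text is a hypothetical helper name, not a function from tm or pdftools.

```r
# Build a string containing the lone byte 0xA7, which is not valid UTF-8 by itself.
bad <- rawToChar(as.raw(c(0x61, 0xA7, 0x62)))  # bytes for "a", 0xA7, "b"
Encoding(bad) <- "UTF-8"                       # mark it as UTF-8 anyway

validUTF8(bad)  # FALSE: the kind of input that can trip tolower()

# Hypothetical helper (not part of tm or pdftools): replace every invalid
# byte with its hex code in angle brackets so later steps see valid text.
clean_text <- function(x) iconv(x, from = "UTF-8", to = "UTF-8", sub = "byte")

# pdf_text() returns one character vector per file, so the input is modelled
# here as a list of character vectors (made-up data, no PDFs needed).
pages <- list(c("plain page", bad), "another file")
cleaned <- lapply(pages, clean_text)

all(validUTF8(unlist(cleaned)))  # TRUE after the conversion
tolower(cleaned[[1]][2])         # now runs on valid text
```

Running validUTF8() on the real pdf_text() output before and after the conversion is one way to find which files carry the problematic bytes. Using sub = "" instead of sub = "byte" would drop the invalid bytes rather than keep a visible marker for them.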

1 vote

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/44010481
