I am using R's stm package to analyze parsed/segmented foreign-language (Simplified Chinese) text documents, in order to take advantage of the package's plotting facilities. I am not using the package's built-in text-processing functions because they currently do not support Chinese text. However, after I successfully prepared the data (stm requires lda-format documents and vocab, plus raw metadata with the same number of rows) and fitted the model, the plot() function threw an error that appears to stem from an encoding problem introduced in the preprocessing stage:
Error in nchar(text) : invalid multibyte string, element 1
Following suggestions from a few earlier threads, I applied the encoding functions from stringi and utf8 to convert vocab to UTF-8 and re-plotted the estimates, but the same error came back. I would like to understand what is going on with the encoding and whether this error is fixable, since stm relies on base R's plotting functions, which should have no problem displaying foreign-language text. (Note that before preprocessing the raw text I had already reset the locale to Chinese (Simplified)_China.936.)
I would greatly appreciate it if someone could shed some light on this. My code is provided below.
Sys.setlocale("LC_ALL","Chinese") # set locale to simplified Chinese to render the text file
# install.packages("stm")
require(stm)
con1 <- url("https://www.dropbox.com/s/tldmo7v9ssuccxn/sample_dat.RData?dl=1")
load(con1)
names(sample_dat) # sample_dat is the original metadata and is reduced to only 3 columns
con2 <- url("https://www.dropbox.com/s/za2aeg0szt7nssd/blog_lda.RData?dl=1")
load(con2)
names(blog_lda) # blog_lda is an lda-type object consisting of documents and vocab
# using the script from stm vignette to prepare the data
out <- prepDocuments(blog_lda$documents, blog_lda$vocab, sample_dat)
docs <- out$documents
vocab <- out$vocab
meta <- out$meta
# estimate a 10-topic model for the ease of exposition
PrevFit <- stm(documents = docs, vocab = vocab, K = 10, prevalence =~ sentiment + s(day), max.em.its = 100, data = meta, init.type = "Spectral")
# the model converged at the 65th EM iteration
# plot the model
par(mar=c(1,1,1,1))
plot(PrevFit, type = "summary", xlim = c(0, 1))
Error in nchar(text) : invalid multibyte string, element 1
#check vocab
head(vocab)
# returning some garbled text
[1] "\"�\xf3½\"," "\"���\xfa\xe8�\","
[3] "\"�\xe1\"," "\"\xc8\xcb\","
[5] "\"\u02f5\"," "\"��\xca\xc7\","
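Before re-encoding, it may help to confirm which vocab entries are actually invalid. A minimal diagnostic sketch, using a small illustrative vector (an assumption here) in place of the real out$vocab:

```r
# Sketch: locate strings that are not valid UTF-8 in a character vector.
# `demo_vocab` is an illustrative stand-in for out$vocab from the question.
demo_vocab <- c("\u4e2d\u56fd", "\u535a\u5ba2")  # two Chinese words via \u escapes

bad_idx <- which(!validUTF8(demo_vocab))  # indices of entries that break nchar()
Encoding(demo_vocab)                      # declared encodings ("unknown"/"UTF-8"/...)
length(bad_idx)                           # 0 here; non-zero on the garbled vocab above
```

Any index reported in bad_idx points at an entry that nchar() (and hence plot.STM) will choke on.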
Posted on 2019-08-04 00:57:48
Please use
vocab <- iconv(out$vocab)
or
vocab <- iconv(out$vocab, to = "UTF-8")
instead.
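As a sketch of what the conversion does (the byte pair below is an illustrative example, not taken from the question's data, and assumes the platform's iconv knows the "GBK" encoding name): iconv() reinterprets bytes from the native code page (CP936/GBK, matching the Chinese locale set above) as UTF-8, after which nchar() and base plotting work again:

```r
# Sketch: convert GBK (CP936) bytes to UTF-8.
# 0xD6 0xD0 is the GBK encoding of one Chinese character (illustrative).
gbk_str <- rawToChar(as.raw(c(0xd6, 0xd0)))

utf8_str <- iconv(gbk_str, from = "GBK", to = "UTF-8")
identical(utf8_str, "\u4e2d")  # TRUE: now a valid UTF-8 string
nchar(utf8_str)                # 1, no "invalid multibyte string" error
```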
https://stackoverflow.com/questions/57330542