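I am new to word embeddings and wrote a simple program that reads my WhatsApp messages so I could try the word2vec function in R. Everything works well: I can generate an embedding matrix that displays the Chinese characters correctly. However, when I call predict with type = "nearest", the program reports that the Chinese character is not in the dictionary (there is no such problem when the term is English). Is this an encoding-related issue? My code is as follows: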
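library(tidyverse)
library(dplyr)
library(rwhatsapp)
library(word2vec)
# Read the exported WhatsApp chat
chat <- rwa_read("C:/Users/peace/Desktop/_chat.txt")
# post_seg is created elsewhere (that step is not shown in the question)
temp <- post_seg$text
# Train a 15-dimensional word2vec model on the message text
words <- word2vec(temp, dim = 15, encoding = "UTF-8")
embedding <- as.matrix(words)
# Nearest neighbours of an English term (works) and a Chinese term (fails)
nn1 <- predict(words, c("cpc"), type = "nearest", top_n = 5, encoding = "UTF-8")
nn2 <- predict(words, c("夠"), type = "nearest", top_n = 5, encoding = "UTF-8")

Error message shown when running nn2:
Error in w2v_nearest(object$model, x = x, top_n = top_n, ...) : Could not find the word in the dictionary: 夠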
However, printing the embedding matrix and running nn1 both work fine:
方猛 -0.1368161887 -1.1562500000 -1.461319923
夠 -0.8252676129 -1.5346769094 -1.077145815
cpc -0.1976414174 0.3481757045 0.275686920
[ reached getOption("max.print") -- omitted 2410 rows ]
> nn1
$cpc
term1 term2 similarity rank
1 cpc storeid 0.9780686 1
2 cpc ns 0.9569275 2
3 cpc term 0.8783157 3

Answered on 2022-06-13 06:25:29
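One possible cause, offered here as an assumption rather than a confirmed diagnosis, is that the query string typed in the session is not stored as UTF-8 while the model vocabulary is. The sketch below uses base R only to inspect the bytes of the term. The variable names are illustrative, and the Unicode escape "\u5920" stands for 夠 so that the result does not depend on the encoding of the script file.

```r
# Illustrative query term: "\u5920" is the Unicode escape for the character
# from the question, so the string is UTF-8 regardless of the file encoding.
query <- "\u5920"

# Encoding mark that R has attached to the string.
print(Encoding(query))

# Bytes of the term once it is forced to UTF-8: e5 a4 a0.
utf8_bytes <- charToRaw(enc2utf8(query))
print(utf8_bytes)

# Bytes of the term after conversion to the session's native encoding.
# In a non-UTF-8 locale these differ from the UTF-8 bytes above.
native_bytes <- charToRaw(enc2native(query))
print(identical(utf8_bytes, native_bytes))
```

If the last line prints FALSE in your session, a term typed directly as a literal may reach the model in a different byte sequence than the one stored in a UTF-8 vocabulary, which would be consistent with the "not in the dictionary" error.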
Try this:
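library(tidyverse)
library(dplyr)
library(rwhatsapp)
library(word2vec)
chat <- rwa_read("C:/Users/peace/Desktop/_chat.txt")
# post_seg is created elsewhere (that step is not shown in the question)
temp <- post_seg$text
words <- word2vec(temp, dim = 15, encoding = "UTF-8")
# Switch the session to the C locale before looking up the Chinese term
Sys.setlocale(category = 'LC_ALL', locale = 'C')
embedding <- as.matrix(words)
nn2 <- predict(words, c("夠"), type = "nearest", top_n = 5, encoding = "UTF-8")
# Restore the default locale and print it to confirm
Sys.setlocale(); Sys.getlocale()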
nn2

Source: https://stackoverflow.com/questions/71093431
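Before changing the locale, it may help to confirm whether the term is in the vocabulary at all once both sides are treated as UTF-8. The following is a minimal sketch: `in_vocab_utf8` is a hypothetical helper, not part of the word2vec package, and the vocabulary here is a small stand-in vector. With a trained model, the vocabulary could be taken from `summary(words, type = "vocabulary")`.

```r
# Hypothetical helper (not part of word2vec): TRUE if `term` matches an entry
# of `vocab` after both sides have been converted to UTF-8.
in_vocab_utf8 <- function(term, vocab) {
  enc2utf8(term) %in% enc2utf8(vocab)
}

# Stand-in vocabulary for illustration only. With a real model this could be
# replaced by: vocab <- summary(words, type = "vocabulary")
vocab <- c("cpc", "storeid", "\u5920")

# Look up the Chinese term from the question and a term that is absent.
found_chinese <- in_vocab_utf8("\u5920", vocab)
found_missing <- in_vocab_utf8("missing", vocab)
print(found_chinese)
print(found_missing)
```

If the term is found this way but predict still fails, passing `enc2utf8("夠")` as the query instead of the bare literal is one thing to try. That suggestion is an untested assumption here, not documented behaviour of the package.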