Using text2vec in a loop for tokenization
Asked by a Stack Overflow user on 2018-07-18 02:59:55

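Edited to shorten and to provide sample data.

I have text data consisting of 8 questions that were put to many participants twice. I want to use text2vec to compare how similar each participant's answers to these questions are across the two time points (duplicate detection). Below is the structure of my initial data (in this example there are only 3 participants, 4 questions instead of 8, and 2 quarters/time periods). I want to compare each participant's first-quarter and second-quarter responses for similarity, and I plan to use the psim2 function from the text2vec package to do this.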

df<-read.table(text="ID,Quarter,Question,Answertext
               Joy,1,And another question,adsfjasljsdaf jkldfjkl
               Joy,2,And another question,dsadsj jlijsad jkldf 
               Paul,1,And another question,adsfj aslj sd afs dfj ksdf
               Paul,2,And another question,dsadsj jlijsad
               Greg,1,And another question,adsfjasljsdaf
               Greg,2,And another question, asddsf asdfasd sdfasfsdf
               Joy,1,this is the first question that was asked,this is joys answer to this question
               Joy,2,this is the first question that was asked,this is joys answer to this question
               Paul,1,this is the first question that was asked,this is Pauls answer to this question
               Paul,2,this is the first question that was asked,Pauls answer is different 
               Greg,1,this is the first question that was asked,this is Gregs answer to this question nearly the same
               Greg,2,this is the first question that was asked,this is Gregs answer to this question
               Joy,1,This is the text of another question,more random text
               Joy,2,This is the text of another question, adkjjlj;ds sdafd
               Paul,1,This is the text of another question,more random text
               Paul,2,This is the text of another question, adkjjlj;ds sdafd
               Greg,1,This is the text of another question,more random text
               Greg,2,This is the text of another question,sdaf asdfasd asdff
               Joy,1,this was asked second.,some random text
               Joy,2,this was asked second.,some random text that doesn't quite match joy's response the first time around
               Paul,1,this was asked second.,some random text
               Paul,2,this was asked second.,some random text that doesn't quite match Paul's response the first time around
               Greg,1,this was asked second.,some random text
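               Greg,2,this was asked second.,ada dasdffasdf asdf  asdfa fasd sdfadsfasd fsdas asdffasd
", header=TRUE, sep=',', quote="", strip.white=TRUE, stringsAsFactors=FALSE) # quote="" keeps the apostrophes in the answers from being read as quote characters; strip.white drops the indentation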

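I thought about this some more, and I believe the right approach is to split the data frame into a list of data frames rather than into separate objects.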

questlist<-split(df,f=df$Question)

Then write a function that creates the vocabulary for each question.

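library(text2vec)
library(dplyr) # provides %>%, plus filter() and full_join() used further down

# Build a vectorizer from the vocabulary of one question's answers
vocabmkr<-function(x) {
  itoken(as.character(x$Answertext), ids=x$ID) %>% # the column is Answertext, not AnswerText
    create_vocabulary() %>%
    prune_vocabulary(term_count_min = 2) %>%
    vocab_vectorizer()
}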

test<-lapply(questlist, vocabmkr)

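However, I think I need to split the original data frame into question-quarter combinations and apply the vocabularies from the other list to them, and I'm not sure how to go about it.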

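Ultimately, I want a similarity score that tells me whether a participant copied some or all of their first-quarter answer in the second quarter.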
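To make the target score concrete, here is a minimal base-R sketch of the Jaccard index on whitespace-separated tokens. The helper name jaccard_tokens is hypothetical, and this only illustrates the measure; it is not text2vec's implementation.

```r
# Hypothetical helper: Jaccard index between the token sets of two answers.
# Tokens are split on single spaces; repeated tokens count once.
jaccard_tokens <- function(a, b) {
  ta <- unique(strsplit(a, " ", fixed = TRUE)[[1]])
  tb <- unique(strsplit(b, " ", fixed = TRUE)[[1]])
  length(intersect(ta, tb)) / length(union(ta, tb))
}

# Identical answers share every token, so the score is 1.
same <- jaccard_tokens("this is joys answer to this question",
                       "this is joys answer to this question")

# Three shared tokens out of eight distinct tokens overall: 3/8 = 0.375.
partial <- jaccard_tokens("some random text",
                          "some random text that does not quite match")
```

A score of 1 would indicate a word-for-word repeat (ignoring word order), while values closer to 0 indicate that little was reused.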

Edit: here is how I would get the answer for a single question, starting from the data frame above.

quest1 <- filter(df,Question=="this is the first question that was asked")
quest1vocab <- itoken(as.character(quest1$Answertext), ids=quest1$ID) %>% create_vocabulary()%>% prune_vocabulary(term_count_min = 1) %>% vocab_vectorizer()

quest1q1<-filter(quest1,Quarter==1)
quest1q1<-itoken(as.character(quest1q1$Answertext),ids=quest1q1$ID) # tokenize question1 quarter 1

quest1q2<-filter(quest1,Quarter==2) 
quest1q2<-itoken(as.character(quest1q2$Answertext),ids=quest1q2$ID) # tokenize question1 quarter 2

#now apply the vocabulary to the two matrices
quest1q1<-create_dtm(quest1q1,quest1vocab)
quest1q2<-create_dtm(quest1q2,quest1vocab)

similarity<-psim2(quest1q1, quest1q2, method="jaccard", norm="none") #row by row similarity.

b<-data.frame(ID=names(similarity),Similarity=similarity,row.names=NULL) #make dataframe of similarity scores
endproduct<-full_join(b,quest1,by="ID") #join the scores back onto the answers by participant

Edit: OK, I've gotten a bit further using lapply.

df1<-split.data.frame(df,df$Question) #now we have 4 dataframes in the list, 1 for each question

vocabmkr<-function(x) {
  itoken(as.character(x$Answertext), ids=x$ID) %>% create_vocabulary()%>% prune_vocabulary(term_count_min = 1) %>% vocab_vectorizer()
}

vocab<-lapply(df1,vocabmkr) #this gets us another list and in it are the 4 vocabularies.

dfqq<-split.data.frame(df,list(df$Question,df$Quarter)) #and now we have 8 items in the list - each list is a combination of question and quarter (4 questions over 2 quarters)

How do I apply the vocab list (4 elements) to the dfqq list (8 elements)?
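One possible way around the 4-versus-8 mismatch, sketched under assumptions rather than offered as a definitive answer: loop over questions only, and split by quarter inside the function, so each vocabulary is used where it is built. The helper name score_question and the small stand-in data frame are hypothetical; the sketch also assumes every participant answered each question in both quarters.

```r
library(text2vec)

# Small stand-in for the question's df (same column names, made-up answers).
df <- data.frame(
  ID = rep(c("Joy", "Paul"), times = 4),
  Quarter = rep(c(1, 1, 2, 2), times = 2),
  Question = rep(c("first question", "second question"), each = 4),
  Answertext = c(
    "this is joys answer", "this is pauls answer",
    "this is joys answer", "pauls answer is different",
    "some random text", "some random text",
    "some random text that differs", "unrelated words here"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: takes one question's rows, returns one Jaccard
# score per participant (quarter 1 answer vs quarter 2 answer).
score_question <- function(q) {
  q <- q[order(q$Quarter, q$ID), ]  # same participant order in both quarters
  tokens <- function(d) {
    itoken(as.character(d$Answertext), ids = as.character(d$ID),
           progressbar = FALSE)
  }
  # Vocabulary built from both quarters of this question.
  vectorizer <- vocab_vectorizer(create_vocabulary(tokens(q)))
  dtm1 <- create_dtm(tokens(q[q$Quarter == 1, ]), vectorizer)
  dtm2 <- create_dtm(tokens(q[q$Quarter == 2, ]), vectorizer)
  # Assumption: each participant appears once in each quarter.
  stopifnot(identical(rownames(dtm1), rownames(dtm2)))
  psim2(dtm1, dtm2, method = "jaccard", norm = "none")
}

# One element per question, each holding one score per participant.
scores <- lapply(split(df, df$Question), score_question)
```

Keeping the quarter split inside the per-question function avoids having to line up the names of the 8-element list with the 4 vocabularies at all.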

Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/51388538
