编辑以缩短并提供样本数据。
我有由8个问题组成的文本数据,这些问题被许多参与者问了两次。我想使用text2vec来比较他们在这两个时间点对这些问题的回答的相似性(重复检测)。以下是我的初始数据的结构(在本例中,只有3个参与者,4个问题而不是8个问题,以及2个季度/时间段)。我想对每个参与者在第一季度和第二季度的反应进行相似性比较。我打算使用包text2vec的psim命令来完成此操作。
df<-read.table(text="ID,Quarter,Question,Answertext
Joy,1,And another question,adsfjasljsdaf jkldfjkl
Joy,2,And another question,dsadsj jlijsad jkldf
Paul,1,And another question,adsfj aslj sd afs dfj ksdf
Paul,2,And another question,dsadsj jlijsad
Greg,1,And another question,adsfjasljsdaf
Greg,2,And another question, asddsf asdfasd sdfasfsdf
Joy,1,this is the first question that was asked,this is joys answer to this question
Joy,2,this is the first question that was asked,this is joys answer to this question
Paul,1,this is the first question that was asked,this is Pauls answer to this question
Paul,2,this is the first question that was asked,Pauls answer is different
Greg,1,this is the first question that was asked,this is Gregs answer to this question nearly the same
Greg,2,this is the first question that was asked,this is Gregs answer to this question
Joy,1,This is the text of another question,more random text
Joy,2,This is the text of another question, adkjjlj;ds sdafd
Paul,1,This is the text of another question,more random text
Paul,2,This is the text of another question, adkjjlj;ds sdafd
Greg,1,This is the text of another question,more random text
Greg,2,This is the text of another question,sdaf asdfasd asdff
Joy,1,this was asked second.,some random text
Joy,2,this was asked second.,some random text that doesn't quite match joy's response the first time around
Paul,1,this was asked second.,some random text
Paul,2,this was asked second.,some random text that doesn't quite match Paul's response the first time around
Greg,1,this was asked second.,some random text
Greg,2,this was asked second.,ada dasdffasdf asdf asdfa fasd sdfadsfasd fsdas asdffasd
", header=TRUE,sep=',')我做了更多的思考,我相信正确的方法是将数据帧拆分成一个数据帧列表,而不是单独的项目。
questlist<-split(df,f=df$Question)
然后编写一个函数来创建每个问题的词汇表。
library(text2vec)
vocabmkr<-function(x) { itoken(x$AnswerText, ids=x$ID) %>% create_vocabulary()%>% prune_vocabulary(term_count_min = 2) %>% vocab_vectorizer() }
test<-lapply(questlist, vocabmkr)
但是,我认为我需要将原始数据帧拆分成问题-季度组合,并将其他列表中的词汇应用于它,但不确定如何进行。
最终,我想要一个相似性得分,告诉我参与者是否复制了第一季度和第二季度的部分或全部回答。
编辑:下面是我如何从上面的数据帧开始回答一个问题。
quest1 <- filter(df,Question=="this is the first question that was asked")
quest1vocab <- itoken(as.character(quest1$Answertext), ids=quest1$ID) %>% create_vocabulary()%>% prune_vocabulary(term_count_min = 1) %>% vocab_vectorizer()
quest1q1<-filter(quest1,Quarter==1)
quest1q1<-itoken(as.character(quest1q1$Answertext),ids=quest1q1$ID) # tokenize question1 quarter 1
quest1q2<-filter(quest1,Quarter==2)
quest1q2<-itoken(as.character(quest1q2$Answertext),ids=quest1q2$ID) # tokenize question1 quarter 2
#now apply the vocabulary to the two matrices
quest1q1<-create_dtm(quest1q1,quest1vocab)
quest1q2<-create_dtm(quest1q2,quest1vocab)
similarity<-psim2(quest1q1, quest1q2, method="jaccard", norm="none") #row by row similarity.
b<-data.frame(ID=names(similarity),Similarity=similarity,row.names=NULL) #make dataframe of similarity scores
endproduct<-full_join(b,quest1)编辑:好的,我已经使用了更多的lapply。
df1<-split.data.frame(df,df$Question) #now we have 4 dataframes in the list, 1 for each question
vocabmkr<-function(x) {
itoken(as.character(x$Answertext), ids=x$ID) %>% create_vocabulary()%>% prune_vocabulary(term_count_min = 1) %>% vocab_vectorizer()
}
vocab<-lapply(df1,vocabmkr) #this gets us another list and in it are the 4 vocabularies.
dfqq<-split.data.frame(df,list(df$Question,df$Quarter)) #and now we have 8 items in the list - each list is a combination of question and quarter (4 questions over 2 quarters)如何将vocab列表(由4个元素组成)应用到dfqq列表(由8个元素组成)?
https://stackoverflow.com/questions/51388538
复制相似问题