文章/答案/技术大牛

发布

社区首页 >问答首页 >spimi算法误区

问spimi算法误区
EN

Stack Overflow用户

提问于 2016-12-06 00:58:43

回答 1查看 1.1K关注 0票数 1

我试图在C++中实现一个单通道内存索引器。

但是在算法中，我认为有问题，或者(很可能)我有误解。

SPIMI-INVERT(token_stream)
  output_file = NEWFILE() 
  dictionary = NEWHASH() 

  while (free memory available) 
     token ← next(token_stream)
     if term(token) ∈ dictionary
        then postings_list = ADDTODICTIONARY(dictionary, term(token)) 
        else postings_list=GETPOSTINGSLIST(dictionary,term(token))
     if full(postings_list)
        then postings_list = DOUBLEPOSTINGSLIST(dictionary, term(token))

     ADDTOPOSTINGSLIST(postings_list, docID(token)) 
  sorted_terms ← SORTTERMS(dictionary)  
  WRITEBLOCKTODISK(sorted_terms,dictionary,output_file) 
  return output_file

假设我执行了所有的解析，并将所有文档转换为令牌流，其中令牌是term,doc_id对

http://nlp.stanford.edu/IR-book/html/htmledition/single-pass-in-memory-indexing-1.html说，对每个块都调用SPIMI函数。

好吧，那我们开始吧，

我们逐块读取流，所以现在我有一个块，并将它作为参数通过SPIMI-INVERT函数发送。
该函数对字典的标记进行了一些处理。

不知怎么的(也许因为字典太大了)，当我们处于while循环中时，我们就不再有空闲的内存了。

该算法打破循环并将当前字典写入磁盘。

但是从外部世界(作为函数的调用者)，我不知道作为参数发送的块是否完全处理了。你不觉得这里有什么不对劲吗？

c++

algorithm

dictionary

indexing

information-retrieval

回答 1

Stack Overflow用户

回答已采纳

发布于 2016-12-23 22:54:56

因为到目前为止还没有答案，在和我的教授交谈之后，我正在回答我的问题。

我必须说，算法不是很清楚，因为我的教授也不确定。我回答这个问题就像我怎么解释它一样。

令牌流是一个包含令牌的文件(术语，doc-id对)

while (there is token in token stream)
      dict = new dictionary()
      while(there is free memory available)
            token = next_token()
            dict.add_to_postingList(token)
      write_dict_to_file(dict)
      delete dict

//Assuming posting list is dynamically sized and dictionary knows if a term exist.

这里 i在C++中实现了它，可能是有用的

票数 0

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/40986075

复制

相似问题

问spimi算法误区
EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问spimi算法误区EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问spimi算法误区
EN