I have been doing similar work and found that Redis's zset (sorted set) is pretty much what I need (though I am not actually using it right now; I have rolled my own solution based on memory-mapped files).

Basically, a zset is a set of unique members, each with an associated score, kept ordered by score.

So you can have one sorted set per feature, where each feature maps to a list of scored documents:
feature -> [ {docid, score}, {docid, score}, ... ]
i.e.
zadd feature score docid

Redis then has some nice operators to merge sets, extract ranges, etc. See zunionstore and zrange (http://redis.io/commands/zunionstore).

Very fast (supposedly) and all in memory, though Redis is not an embedded DB.
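
To make the layout above concrete, here is a minimal sketch in plain Python that imitates the idea with dictionaries. It does not talk to Redis; the helper names `zadd`, `zunion` and `zrevrange` are only borrowed from the Redis commands for readability, and the feature names and scores are made up. The merge sums scores per document, which is the default aggregate of zunionstore; the tie-break by docid is my own arbitrary choice.

```python
from collections import defaultdict

# feature -> {docid: score}; a plain-dict stand-in for one sorted set per feature
index = defaultdict(dict)

def zadd(feature, score, docid):
    """Same argument order as `zadd feature score docid`."""
    index[feature][docid] = score

def zunion(features):
    """Sum each document's scores across the given features."""
    merged = defaultdict(float)
    for feature in features:
        for docid, score in index[feature].items():
            merged[docid] += score
    return dict(merged)

def zrevrange(scores, start, stop):
    """Return (docid, score) pairs by descending score; stop is inclusive."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[start:stop + 1]

# Hypothetical features and documents, for illustration only
zadd("color:red", 0.5, "doc1")
zadd("color:red", 0.25, "doc2")
zadd("shape:round", 1.0, "doc2")

top = zrevrange(zunion(["color:red", "shape:round"]), 0, 0)
print(top)
```

With a real Redis server the same steps would be one zadd per (feature, docid) pair followed by a zunionstore over the query's features.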

Have you looked at <a href="http://terrier.org/" rel="nofollow">Terrier</a>? I'm not sure whether it has in-memory indexes, but it is far more extensible than Lucene when it comes to indexing and scoring.

Lucene lets you store pretty much any data associated with a document. It also has a feature called "payloads" that allows you to store arbitrary data in the index, attached to an occurrence of a term in a document. So I think what you want is to store your "features" as terms in the index and the weights as payloads, and you should be able to make Lucene do what you want. It does have an in-memory index implementation.
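
As a rough picture of what "features as terms, weights as payloads" amounts to, here is a small sketch in plain Python. It is not Lucene code and uses no Lucene API; `add_document` and `dot_scores` are hypothetical helper names, and the documents and weights are invented. Each posting carries the document id plus the precomputed weight, and scoring is a dot product accumulated one query feature at a time.

```python
from collections import defaultdict

# feature (term) -> list of (doc_id, weight); the weight plays the role of the payload
postings = defaultdict(list)

def add_document(doc_id, features):
    """Index one document given as {feature: precomputed_weight}."""
    for feature, weight in features.items():
        postings[feature].append((doc_id, weight))

def dot_scores(query):
    """Accumulate query_weight * doc_weight over the postings of each query feature."""
    scores = defaultdict(float)
    for feature, q_weight in query.items():
        for doc_id, d_weight in postings.get(feature, []):
            scores[doc_id] += q_weight * d_weight
    return dict(scores)

# Hypothetical data, for illustration only
add_document("a", {"f1": 2.0, "f2": 1.0})
add_document("b", {"f2": 3.0})

result = dot_scores({"f2": 0.5, "f3": 1.0})
print(result)
```

Only documents that share at least one feature with the query are ever touched, which is where an inverted index saves work compared with scanning every vector.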

If the pairs of entities you want to compare are already given in advance, and you are interested in the pairwise scores, I don't think Lucene will give you any advantage. Just look up the vectors in some key-value store and compute the similarity. Consider using a sparse vector representation for space and time efficiency.
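
A minimal sketch of that lookup-and-compute approach, assuming the key-value store is just a Python dict and the similarity is cosine (any other measure would slot in the same way). The entity ids and weights are invented for illustration. Each vector stores only its non-zero features.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {feature: weight}."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    dot = sum(w * v[f] for f, w in u.items() if f in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Stand-in for the key-value store: entity id -> sparse feature vector
store = {
    "e1": {"f1": 3.0, "f2": 4.0},
    "e2": {"f1": 6.0, "f2": 8.0},
    "e3": {"f3": 1.0},
}

same_direction = cosine(store["e1"], store["e2"])
no_overlap = cosine(store["e1"], store["e3"])
print(same_direction, no_overlap)
```

The cost per pair is proportional to the number of non-zero features in the shorter vector, not to the size of the feature space.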

If only one entity is given in advance, and you are more interested in a ranking-like scenario, Lucene may be worth a try.
The right place to look would be

<pre><code>org.apache.lucene.search.Similarity
</code></pre>

You should be able to adapt it to your needs and set your version as the default with

<pre><code>setDefault(Similarity similarity) 
</code></pre>

I would be careful with expectations of speed gains (relative to iterating through everything), however, as they depend largely on the sparsity of the query and on the scoring function you choose to implement. Also note that Lucene uses a two-stage retrieval scheme: first a boolean match ("are all of the AND terms contained? any of the OR terms?"), then scoring of whatever passes. With tf.idf you lose nothing along the way, but with other scoring functions you might.
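
Here is a small sketch of why the boolean stage can cost you results with other scoring functions. It is plain Python, not Lucene, and the documents, weights, and the choice of Euclidean distance are assumptions made for illustration. The query's nearest document shares no feature with it, so the OR filter drops it before scoring ever happens.

```python
import math
from collections import defaultdict

# Hypothetical documents as sparse vectors
docs = {
    "a": {"f1": 5.0},
    "b": {"f2": 0.1},
}

# feature -> set of doc ids containing it
inverted = defaultdict(set)
for doc_id, vec in docs.items():
    for feature in vec:
        inverted[feature].add(doc_id)

def candidates(query):
    """Stage 1: boolean OR, keep any document sharing at least one query feature."""
    found = set()
    for feature in query:
        found |= inverted.get(feature, set())
    return found

def euclidean(u, v):
    keys = set(u) | set(v)
    return math.sqrt(sum((u.get(k, 0.0) - v.get(k, 0.0)) ** 2 for k in keys))

def two_stage(query):
    """Stage 2: score only what stage 1 let through, nearest first."""
    return sorted(candidates(query), key=lambda d: euclidean(query, docs[d]))

def exhaustive(query):
    """Score every document, nearest first."""
    return sorted(docs, key=lambda d: euclidean(query, docs[d]))

query = {"f1": 0.1}
filtered = two_stage(query)
full = exhaustive(query)
print(filtered, full)
```

Document "b" is closest to the query in Euclidean terms, yet only "a" survives the filter. With a dot-product style score such as tf.idf, a document with no shared feature scores zero anyway, so nothing is lost.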

For a more general approach to efficient approximate nearest-neighbor search, it might be worthwhile to look into locality-sensitive hashing (LSH):

<a href="http://en.wikipedia.org/wiki/Locality-sensitive_hashing" rel="nofollow">http://en.wikipedia.org/wiki/Locality-sensitive_hashing</a>
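
To give a flavour of LSH, here is a minimal sketch of one common variant, random-hyperplane hashing, which targets cosine similarity. The choice of variant, the function names, the number of bits, and the seed are all my assumptions. Each bit records which side of a random hyperplane the vector falls on, so vectors separated by a small angle tend to agree on most bits.

```python
import random

def make_hyperplanes(features, num_bits, seed=0):
    """One random Gaussian normal vector per signature bit."""
    rng = random.Random(seed)
    return [{f: rng.gauss(0.0, 1.0) for f in features} for _ in range(num_bits)]

def signature(vec, hyperplanes):
    """Bit i is 1 when the sparse vector lies on the non-negative side of hyperplane i."""
    bits = []
    for plane in hyperplanes:
        dot = sum(w * plane.get(f, 0.0) for f, w in vec.items())
        bits.append(1 if dot >= 0 else 0)
    return tuple(bits)

def hamming(a, b):
    """Number of differing bits; a small distance suggests a small angle."""
    return sum(x != y for x, y in zip(a, b))

planes = make_hyperplanes(["f1", "f2", "f3"], num_bits=8)

sig_a = signature({"f1": 1.0, "f2": 2.0}, planes)
sig_b = signature({"f1": 2.0, "f2": 4.0}, planes)  # same direction as a, twice the length
print(sig_a, sig_b, hamming(sig_a, sig_b))
```

In practice the signatures are used as bucket keys, so that only entities landing in the same bucket (or a nearby one) are compared exactly.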

I am looking for a fast in-memory implementation of a generic inverted index. All I need is to store features with weights for a couple million entities and use the inverted index to compute similarities between entities using various distance functions.

All other attributes of entities I can store in some fast key-value store.

I hoped I could use Lucene purely as an inverted index, but I cannot see how to attach my own custom feature vector, with precomputed weights, to a document. Any recommendations would be much appreciated!

Thank you.

Fast in-memory inverted index
