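I am working with text embeddings stored in sparse format as a csr_matrix (generated by TfidfVectorizer). I would like to insert them into an NMSLIB cosine/HNSW index and run nearest-neighbor searches against it.
My problem is that with more than 1M embeddings to insert, inserting embeddings.toarray() does not scale. I noticed that inserting a csr_matrix directly, without calling toarray(), is supported here: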
test_features = sparse.csr_matrix(test_features)
train_features = sparse.csr_matrix(train_features)
nsw = nmslib.init(method = 'sw-graph', space = 'cosinesimil_sparse', data_type=nmslib.DataType.SPARSE_VECTOR)
nsw.addDataPointBatch(train_features)

However, when I try to insert my embeddings, I get the following error:
self.similar_items_index = nmslib.init(space='cosinesimil', method='hnsw')
self.similar_items_index.addDataPointBatch(self.embeddings)
Traceback (most recent call last):
File "/home/pln/.local/share/JetBrains/Toolbox/apps/PyCharm-P/ch-0/213.7172.26/plugins/python/helpers/pydev/pydevd.py", line 1483, in _exec
pydev_imports.execfile(file, globals, locals) # execute the script
File "/home/pln/.local/share/JetBrains/Toolbox/apps/PyCharm-P/ch-0/213.7172.26/plugins/python/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "/home/pln/Work/Recommend/python/projects/ai_recommendations/related_products/dev.py", line 140, in <module>
cbf_model.train()
File "/home/pln/Work/Recommend/python/projects/utils/structured_logging.py", line 152, in timing_wrapper
value = func(*args, **kwargs)
File "/home/pln/Work/Recommend/python/projects/ai_recommendations/related_products/algorithms/content_based_filtering.py", line 130, in train
self.insert_datapoints()
File "/home/pln/Work/Recommend/python/projects/utils/structured_logging.py", line 152, in timing_wrapper
value = func(*args, **kwargs)
File "/home/pln/Work/Recommend/python/projects/ai_recommendations/related_products/algorithms/content_based_filtering.py", line 159, in insert_datapoints
self.similar_items_index.addDataPointBatch(self.embeddings)
ValueError: setting an array element with a sequence.
python-builtins.ValueError

Is this expected, or should I be able to insert a csr_matrix as-is into such an index?

Posted on 2022-06-17 09:03:14
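The problem with your code is the space you are using: as you can see in the example you cited, the correct way to insert a csr_matrix is to use a sparse space. Your index is created with space='cosinesimil', which is a dense space, whereas the example uses space='cosinesimil_sparse' together with data_type=nmslib.DataType.SPARSE_VECTOR.
See the NMSLIB documentation on spaces, in particular the section on input format:
"For sparse spaces that include the Lp-spaces, the sparse cosine similarity, and the maximum-inner product space, the input data is a sparse scipy matrix. An example can be found here."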
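As a minimal sketch of the fix, assuming the hnsw method accepts the sparse cosine space in your NMSLIB version, the index from the question could be built as follows. The names build_sparse_index and SPARSE_INDEX_SETTINGS are hypothetical helpers introduced only for illustration; the nmslib calls are the ones already shown in the question, plus createIndex.

```python
# Settings mirror the example quoted in the question, with the method
# switched from 'sw-graph' to 'hnsw' (assumption: your NMSLIB build
# supports hnsw with the sparse cosine space).
SPARSE_INDEX_SETTINGS = {"method": "hnsw", "space": "cosinesimil_sparse"}


def build_sparse_index(embeddings, settings=SPARSE_INDEX_SETTINGS):
    """Build an NMSLIB index from a scipy csr_matrix without calling toarray().

    `embeddings` is expected to be a scipy.sparse.csr_matrix, for example
    the output of TfidfVectorizer.fit_transform.
    """
    # Imported inside the function so that this helper can be defined
    # even on a machine where nmslib is not installed.
    import nmslib

    index = nmslib.init(
        method=settings["method"],
        space=settings["space"],  # sparse space instead of 'cosinesimil'
        data_type=nmslib.DataType.SPARSE_VECTOR,  # tells NMSLIB the input is sparse
    )
    index.addDataPointBatch(embeddings)  # the csr_matrix is passed as-is
    index.createIndex(print_progress=True)
    return index
```

With the embeddings from the question, this might be called as index = build_sparse_index(self.embeddings). Queries would then also be passed as sparse rows, for example index.knnQueryBatch(self.embeddings[:5], k=10), which should return one pair of neighbor ids and distances per query row.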
https://stackoverflow.com/questions/72656867