当我试图在Server 2008 BM25中实现R2算法时,我遇到了麻烦。我知道Server包含全文搜索选项,它已经实现了BM25的一个变体,但是我想做一些参数优化测试,而且由于FTS过程是不可编辑的(据我所知),我决定自己实现它。
我有两个表,TF (术语频率)和DF (文档频率),其结构如下:
*注:重量栏表示单词的重要性(通常为1)
ID | Term | DocumentID | Count | TermID | Weight*ID | Term | CountTF表包含术语和文档之间的关系;也就是说,术语在文档中的频率。DF表包含关于一个术语包含多少文档的信息。可以使用DF.ID和TF.TermID链接这两个表。使用这两个表,我现在要根据BM25中的公式计算两个文档之间的维基百科文章相似度值(其中一个文档充当查询)。表TF和DF分别转换为函数f(q,D)和n(q):


我希望结果是这样的:
DocumentA_ID | DocumentB_ID | BM25_Value下面是一些我到目前为止所掌握的代码:
DECLARE @N FLOAT;
DECLARE @AVGDL FLOAT;
DECLARE @K1 FLOAT;
DECLARE @B FLOAT;
SET @K1 = 1.2;
SET @B = 0.75;
-- number of all documents
SELECT @N = COUNT(DISTINCT DocumentID) FROM TF;
-- average document length (in words)
SELECT @AVGDL = AVG(DocumentLength) FROM (SELECT DocumentID, SUM(TF.Count) AS DocumentLength FROM TF WHERE Weight = 3 GROUP BY DocumentID) A;
-- BM25 implementation
SELECT D.DocumentID AS DocumentA,
Q.DocumentID AS DocumentB,
*,
-- need help here (SUM or something ...)
LOG((@N - DF.Count + 0.5)/(DF.Count + 0.5)) AS IDF
FROM TF AS D
INNER JOIN TF AS Q ON D.Term = Q.Term
INNER JOIN DF ON D.TermID = DF.ID
WHERE D.DocumentID <> Q.DocumentID 在最后一节(BM25实现)中,我很难构造查询以获得所需的结果格式。任何帮助都将不胜感激。
发布于 2013-05-15 06:25:09
经过几天的摆弄,我终于完成了这件事。下面是我最后得到的代码,打包到一个存储过程中。为此,我添加了另一个表,它只保存文档ID及其长度。
ALTER PROCEDURE [dbo].[BuildIndexBM25]
-- default parameters K1=> [1.2 - 2.0], B => [0.0 - 1.0], Weight => 3 (e.g. titles only)
@K1 FLOAT = 1.2,
@B FLOAT = 0.75,
@Weight FLOAT = 3
AS
BEGIN
SET NOCOUNT ON;
DECLARE @N FLOAT;
DECLARE @AVGDL FLOAT;
-- number of all documents
SELECT @N = COUNT(DISTINCT DocumentID) FROM TF;
-- average document length (in words)
SELECT @AVGDL = AVG(Length) FROM DocumentLength;
-- BM25 implementation
-- result set: | DocumentA | DocumentB | BM25 |
SELECT
DL.ID AS DocumentA,
BM.DocumentID AS DocumentB,
BM.BM25 AS BM25
FROM (
SELECT ID FROM DocumentLength
) DL
CROSS APPLY (
SELECT
Q.DocumentID,
SUM(LOG((@N - X.Count + 0.5) / (X.Count + 0.5)) * (Q.Count * (@K1 + 1)) / (Q.Count + @K1 * (1 - @B + (@B * L.Length / @AVGDL)))) AS BM25
FROM TF D, TF Q, DF X, DocumentLength L
WHERE D.TermID = Q.TermID
AND D.DocumentID = DL.ID
AND Q.TermID = X.ID
AND D.DocumentID = L.ID
AND Q.Weight = @Weight
AND D.Weight = @Weight
GROUP BY Q.DocumentID
) BM
WHERE DL.ID != BM.DocumentID
END我希望这个答案能帮助那些想自己实现全文搜索算法的人。Server 2008 R2中的默认算法与此实现只有很小的区别。
https://dba.stackexchange.com/questions/42023
复制相似问题