Q: Find records with an empty full-text index

Stack Overflow user

Asked 2014-07-28 23:26:47

1 answer · 163 views · 0 followers · 2 votes

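I store document binaries (mostly PDF files) in a SQL Server database and use the Acrobat IFilter with full-text indexing to search the contents of the files.

However, some of these PDFs were scanned with really cheap software that did no OCR, so they are images of documents rather than proper documents with searchable text. I would like to identify which records in the database have no searchable text so that they can be OCRed and re-uploaded.

I can get the IDs of documents that have at least one full-text entry using sys.dm_fts_index_keywords_by_document. I tried joining the distinct list of IDs against the document table to find the records that don't match, but it turned out to be incredibly slow: I have about 20,000 documents (some of them several hundred pages long), and the query ran for over 20 minutes before I cancelled it.

Is there a better way?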

full-text-indexing

sql-server

1 Answer

Stack Overflow user

Accepted answer

Posted 2014-11-17 19:03:22

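I came up with a solution that runs in only about 2 minutes on 40,000 documents.

1) Create a temp table to store the document_id values from sys.dm_fts_index_keywords_by_document.

2) Populate it by grouping on document_id. Almost all documents have at least some entries, so choose a keyword-count threshold below which the full-text index holds no meaningful information (I used 30, but most of the "bad" documents had only 3-5). In my particular case, the table that stores the PDF binaries is PhysicalFile.

3) If needed, join the temp table to whatever other tables hold the information you need. In my particular case, MasterDocument contains the document title, and I also included some lookup tables.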

-- Step 1: temp table for the IDs of sparsely indexed files
create table #PhysicalFileIDs (PhysicalFileID int, KeywordCount int)

-- Step 2: count indexed keywords per document and keep only those below the threshold (30 here)
insert into #PhysicalFileIDs (PhysicalFileID, KeywordCount)
    select document_id, count(keyword)
    from sys.dm_fts_index_keywords_by_document (db_id(), object_id('PhysicalFile'))
    group by document_id
    having count(keyword) < 30

-- Step 3: join to the tables that hold the descriptive information
select MasterDocument.DocumentID, MasterDocument.Title, ProfileType.ProfileTypeDisplayName, #PhysicalFileIDs.KeywordCount
    from MasterDocument
    inner join #PhysicalFileIDs on MasterDocument.PhysicalFileID = #PhysicalFileIDs.PhysicalFileID
    inner join DocumentType on MasterDocument.DocumentTypeID = DocumentType.DocumentTypeID
    inner join ProfileType on ProfileType.ProfileTypeID = DocumentType.ProfileTypeID

drop table #PhysicalFileIDs
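
Two follow-up checks may help when adapting this. The first query below is a rough way to see how keyword counts are distributed before settling on a threshold. The second lists files that never appear in the index at all, which a `having count(keyword) < 30` filter cannot return because such files produce no rows in the DMV. This is only a sketch under assumptions that are not stated in the answer: that PhysicalFile has an integer column named PhysicalFileID which is its full-text key (so document_id maps directly to it), and that 100 is a reasonable upper bound for the histogram. The temp table name #KeywordCounts is made up for illustration.

```sql
-- Sketch only. Assumes PhysicalFile.PhysicalFileID is the integer full-text key.
create table #KeywordCounts (PhysicalFileID int primary key, KeywordCount int not null);

-- Materialize the DMV output once so later queries do not re-evaluate it.
insert into #KeywordCounts (PhysicalFileID, KeywordCount)
    select document_id, count(keyword)
    from sys.dm_fts_index_keywords_by_document(db_id(), object_id('PhysicalFile'))
    group by document_id;

-- Histogram of low keyword counts, to help choose a threshold (100 is an arbitrary cutoff).
select KeywordCount, count(*) as DocumentCount
    from #KeywordCounts
    where KeywordCount < 100
    group by KeywordCount
    order by KeywordCount;

-- Anti-join: files with no entry in the full-text index at all.
select pf.PhysicalFileID
    from PhysicalFile as pf
    left join #KeywordCounts as kc on kc.PhysicalFileID = pf.PhysicalFileID
    where kc.PhysicalFileID is null;

drop table #KeywordCounts;
```

If the histogram shows a clear gap between the "bad" documents and the rest, a threshold inside that gap is probably safer than a fixed value such as 30.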

1 vote

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/25005926
