
To support searching over a set of documents, building a reverse index (more commonly called an inverted index) is generally the best solution. Here I assume you want to support <a href="http://en.wikipedia.org/wiki/Full_text_search" rel="nofollow">fast full-text search</a> operations, like those provided by Google or Bing, but on your own data.

Building a reverse index generally involves splitting the documents into words and adding each word to the index individually. Each index entry has a word as its key and, as its value, the document name (or some other identifier of the document) together with the locations of the word in that document.
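
As a rough illustration of the structure described above, here is a minimal Python sketch of a positional reverse (inverted) index. The function names `build_index` and `search`, the file names, and the plain whitespace tokenization are assumptions made for this example; they do not come from any particular product.

```python
from collections import defaultdict


def build_index(documents):
    """Map each word to {document id: [positions of the word in that document]}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(position)
    return index


def search(index, word):
    """Return the ids of the documents containing the word, sorted for stable output."""
    return sorted(index.get(word.lower(), {}))


# Hypothetical OCR output, keyed by an identifier of the original image.
documents = {
    "scan_0001.tif": "invoice for office paper",
    "scan_0002.tif": "paper contract and invoice copy",
}
index = build_index(documents)
```

With this layout, a lookup is a single dictionary access on the word, and the document identifier stored in the value is what lets a result point back to the original image.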

You can do this manually, but parsing documents, extracting words, eliminating non-significant words (stop words), and indexing them is not trivial. It is easier to use a dedicated product.
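
To give an idea of what the manual route involves, the sketch below covers only the word-extraction step in Python. The stop-word list, the regular expression, and the name `extract_terms` are illustrative assumptions. A real pipeline would also need to deal with stemming, multiple languages, and OCR misreads, which is a large part of why a dedicated product is easier.

```python
import re

# A tiny, illustrative stop-word list; real engines ship much larger, per-language lists.
STOP_WORDS = {"a", "an", "and", "of", "the", "to", "in", "is"}

# Runs of ASCII letters and digits count as words; everything else is a separator.
WORD_PATTERN = re.compile(r"[a-z0-9]+")


def extract_terms(text):
    """Lowercase the text, split it into alphanumeric words, and drop stop words."""
    return [w for w in WORD_PATTERN.findall(text.lower()) if w not in STOP_WORDS]
```

For example, `extract_terms("The Contract of Sale, signed in 1987.")` keeps `contract`, `sale`, `signed`, and `1987`, and discards the rest.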

Most RDBMSs support extensions providing full-text indexing capabilities. For instance:

<ul>
<li><a href="http://dev.mysql.com/doc/refman/5.5/en/fulltext-natural-language.html" rel="nofollow">MySQL</a></li>
<li><a href="http://www.postgresql.org/docs/9.1/static/textsearch-intro.html" rel="nofollow">PostgreSQL</a></li>
<li><a href="http://www.oracle.com/technetwork/database/enterprise-edition/index-098492.html" rel="nofollow">Oracle</a></li>
<li><a href="http://msdn.microsoft.com/en-us/library/ms142571.aspx" rel="nofollow">MS SQL Server</a></li>
<li><a href="http://www.ibm.com/developerworks/data/tutorials/dm-0810shettar/index.html" rel="nofollow">IBM DB2</a></li>
</ul>

Generally, these RDBMS extensions are less efficient than specialized engines. I would recommend one of the following products:

<ul>
<li><a href="http://www.elasticsearch.org/overview/elasticsearch/" rel="nofollow">ElasticSearch</a>, based on Lucene</li>
<li><a href="http://lucene.apache.org/solr/" rel="nofollow">Apache Solr</a>, based on Lucene</li>
<li><a href="http://sphinxsearch.com/" rel="nofollow">Sphinx</a></li>
</ul>
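
As a hedged sketch of what feeding OCR output to one of these engines might look like, the Python below only builds request bodies for Elasticsearch: a newline-delimited payload for its `_bulk` endpoint and a `match` query. Nothing is sent over the network. The index name `pages`, the field names `text`, `page`, and `image_url`, and the helper names are assumptions chosen for this example.

```python
import json


def bulk_payload(index_name, pages):
    """Build the newline-delimited JSON body expected by the _bulk endpoint."""
    lines = []
    for page in pages:
        # Action line: index this document under the given id.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": page["id"]}}))
        # Source line: the fields to store and index for this page.
        lines.append(json.dumps({
            "text": page["text"],
            "page": page["page"],
            "image_url": page["image_url"],
        }))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline


def search_body(terms, size=20):
    """Build a full-text match query that returns only the page number and image link."""
    return {
        "size": size,
        "_source": ["page", "image_url"],
        "query": {"match": {"text": terms}},
    }
```

Storing one entry per page, with the link to the original image next to the OCR text, means each search hit already carries what is needed to offer a download link.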

I think any of these products can index a few million documents.

I've been provided with approximately 4-5 million images of old documents that my company has decided to delete. We're trying to go paperless, but I'm facing an issue I've been unable to fully comprehend.
I've always used SQL for this amount of data, but now I only have images. I've already bought ABBYY FineReader OCR, and it's currently OCRing all the files to Word or PDF. My problem is that they'd like to search within this massive amount of data in less than 7-10 seconds and get all the results with a download link to the original image of the file.

I read about NoSQL, but it seems to me it's not the best approach, as I'd have to create a table with no schema whatsoever and just add the entire text of each image with a corresponding page number and a link to the original file. As far as I know, this will take ages.
What other solutions can I use?

NoSQL for searching millions of pages?

