In my opinion, this is a pretty significant hole that is holding back the widespread adoption of Solr. The new DataImportHandler is a good first step for importing structured data, but Solr still lacks a good document ingestion pipeline. Nutch does work, but the integration between the Nutch crawler and Solr is somewhat clumsy.
I've tried every open-source crawler I could find, and none of them integrates with Solr out of the box.
Keep an eye on OpenPipeline and Apache Tika.

I've tried Nutch, but it was very difficult to integrate with Solr. I would take a look at Heritrix. It has an extensive plugin system that makes it easy to integrate with Solr, and it is much faster at crawling. It makes extensive use of threads to speed up the process.

I suggest you check out <a href="http://lucene.apache.org/nutch/about.html" rel="nofollow noreferrer">Nutch</a> for some inspiration:

<blockquote>
 Nutch is open source web-search software. It builds on Lucene Java, adding web-specifics, such as a crawler, a link-graph database, parsers for HTML and other document formats, etc.
</blockquote>

Also check out Apache Droids (<a href="http://incubator.apache.org/droids/" rel="nofollow noreferrer">http://incubator.apache.org/droids/</a>), which hopes to be a simple spider/crawler/worker framework.

It is new and not yet easy to use off the shelf (it will take some tweaking to get running), but it is a good thing to keep your eye on.

Nutch might be your closest match, but it's not very flexible.

If you need something more, you will pretty much have to hack together your own crawler. It's not as bad as it sounds: every language has web libraries, so you just need to connect a task queue manager to an HTTP downloader and an HTML parser, which is not really that much work. You can most likely get away with a single box, as crawling is mostly bandwidth-intensive, not CPU-intensive.
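To give a rough idea of how little glue that takes, here is a minimal single-machine sketch in Python using only the standard library. It is an illustration under stated assumptions, not a production crawler: the names `LinkParser`, `fetch`, and `crawl` are made up for this example, it stays on the start URL's host, and it ignores robots.txt, politeness delays, and content types. Handing each page to Lucene or Solr is left out.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """HTML parser step: collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """HTTP downloader step: returns the page body as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch, max_pages=50):
    """Breadth-first crawl restricted to the start URL's host.

    Returns a dict mapping each downloaded URL to its HTML.
    fetch_page is injectable (hypothetical hook) so the downloader
    can be swapped out, for example in tests.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])  # task queue of URLs still to download
    seen = {start_url}          # URLs already queued, to avoid repeats
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except (OSError, ValueError):
            continue  # skip pages that fail to download
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment.
            link, _fragment = urldefrag(urljoin(url, href))
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Calling `crawl("http://example.com/")` would return the downloaded pages keyed by URL, and each one could then be parsed further and sent to your indexer. A single thread like this is usually limited by the network rather than the CPU, which is the point made above.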

<a href="http://arachnode.net" rel="nofollow">http://arachnode.net</a>

It is written in C#, but it produces index files that Lucene (Java and C#) can consume.

Has anyone tried Xapian? It seems much quicker than Solr and is written in C++.

What is a good crawler (spider) to use against HTML and XML documents (local or web-based) and that works well in the Lucene / Solr solution space? Could be Java-based but does not have to be.

Recommendations for a spidering tool to use with Lucene or Solr?
