HTTrack -- <a href="http://www.httrack.com/" rel="noreferrer">http://www.httrack.com/</a> -- is a very good website copier. It works well, and I have been using it for a long time.

Nutch is a web crawler (a crawler is the type of program you're looking for) -- <a href="http://lucene.apache.org/nutch/" rel="noreferrer">http://lucene.apache.org/nutch/</a> -- which uses Lucene, a top-notch search library.

<a href="http://code.google.com/p/crawler4j/" rel="nofollow">Crawler4j</a> is an open source Java crawler which provides a simple interface for crawling the Web. You can set up a multi-threaded web crawler in 5 minutes.

You can define your own filter to decide which pages (URLs) to visit, and define an operation to run on each crawled page according to your own logic.

Some reasons to choose crawler4j:

<ol>
<li>Multi-threaded structure,</li>
<li>You can set the depth to be crawled,</li>
<li>It is Java-based and open source,</li>
<li>Control over redundant links (URLs),</li>
<li>You can set the number of pages to be crawled,</li>
<li>You can set the size of pages to be crawled,</li>
<li>Adequate documentation</li>
</ol>
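
To make a few of those knobs concrete, here is a minimal sketch of a crawl loop with a depth limit, a page limit, a URL filter, and duplicate-URL control. It is written in Python with only the standard library and is not crawler4j's API: the names `crawl`, `LinkParser`, `should_visit` and `fetch`, and the `example.test` pages, are all made up for illustration. It is single-threaded and does not limit page size. The `fetch` function is passed in, so the demo runs against an in-memory site instead of the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, should_visit, max_depth=2, max_pages=100):
    """Breadth-first crawl; returns {url: html} for every page fetched.

    fetch(url) returns the page HTML as a string, or None on failure.
    should_visit(url) is the caller's filter for which URLs to follow.
    """
    seen = {start_url}               # duplicate-URL control
    queue = deque([(start_url, 0)])  # (url, depth) pairs
    pages = {}
    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html            # the per-page operation: store the content
        if depth >= max_depth:
            continue                 # do not follow links past the depth limit
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment.
            link, _fragment = urldefrag(urljoin(url, href))
            if link not in seen and should_visit(link):
                seen.add(link)
                queue.append((link, depth + 1))
    return pages


# Hypothetical in-memory site standing in for real HTTP requests.
SITE = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a> '
                            '<a href="http://other.test/">X</a>',
    "http://example.test/a": '<a href="/">home</a> <a href="/c#top">C</a>',
    "http://example.test/b": "no links",
    "http://example.test/c": "deep page",
}


def same_site(url):
    """Filter: only follow links that stay on example.test."""
    return urlparse(url).netloc == "example.test"


pages = crawl("http://example.test/", SITE.get, same_site, max_depth=1)
print(sorted(pages))
```

With `max_depth=1` the crawl stores the start page plus `/a` and `/b`, and never reaches `/c` or the off-site link. To crawl a real site, `fetch` could be replaced by a function that downloads the URL, for example with `urllib.request.urlopen`.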

<a href="http://Searcharoo.net" rel="nofollow noreferrer">Searcharoo.NET</a> contains a spider that crawls and indexes content, and a search engine to use it. You should be able to find your way around the Searcharoo.Indexer.EXE code to trap the content as it's downloaded, and add your own custom code from there...

It's very basic (all the source code is included, and is explained in six CodeProject articles, the most recent of which is <a href="http://www.codeproject.com/KB/IP/Searcharoo_6.aspx" rel="nofollow noreferrer">Searcharoo v6</a>): the spider follows links, imagemaps, and images, obeys ROBOTS directives, and parses some non-HTML file types. It is intended for single websites (not the entire web).
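
As a rough illustration of what "follows links, imagemaps, images" and "obeys ROBOTS directives" can mean in practice, here is a small Python sketch using only the standard library. It is not Searcharoo's code (Searcharoo is .NET): the names `ResourceParser` and `allowed_urls`, the `ExampleSpider` user agent, and the sample HTML and robots.txt are assumptions made up for this example. It handles robots.txt rules only, not ROBOTS meta tags.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


class ResourceParser(HTMLParser):
    """Collects URLs from <a href>, <area href> (imagemaps) and <img src>."""

    WANTED = {("a", "href"), ("area", "href"), ("img", "src")}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if (tag, name) in self.WANTED and value:
                # Resolve relative URLs against the page's own URL.
                self.urls.append(urljoin(self.base_url, value))


def allowed_urls(html, base_url, robots_txt, user_agent="ExampleSpider"):
    """Return the URLs found in html that robots_txt lets user_agent fetch."""
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    parser = ResourceParser(base_url)
    parser.feed(html)
    return [u for u in parser.urls if robots.can_fetch(user_agent, u)]


# Hypothetical robots.txt and page content.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

HTML = """
<a href="/docs/intro.html">Intro</a>
<a href="/private/admin.html">Admin</a>
<map name="nav"><area href="/docs/map.html" shape="rect" coords="0,0,10,10"></map>
<img src="/images/logo.png">
"""

urls = allowed_urls(HTML, "http://example.test/", ROBOTS_TXT)
print(urls)
```

The link under `/private/` is dropped because the sample robots.txt disallows it; the anchor, imagemap, and image URLs outside that path are kept in document order.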

Nutch/Lucene is almost certainly a more robust/commercial-grade solution - but I have not looked at their code. I'm not sure what you want to accomplish, but have you also seen <a href="http://www.microsoft.com/enterprisesearch/serverproducts/searchserverexpress/default.aspx" rel="nofollow noreferrer">Microsoft Search Server Express</a>?

Disclaimer: I am the author of Searcharoo; just offering it here as an option.

<a href="http://www.sphider.eu/" rel="nofollow noreferrer">Sphider</a> is pretty good. It's PHP, but it might be of some help.

I use <a href="http://www.mozenda.com" rel="nofollow noreferrer">Mozenda's Web Scraping software</a>. You could easily have it crawl all of the links and grab all of the information you need, and it's great software for the money.

I haven't used this yet, but <a href="http://www.vsj.co.uk/dotnet/display.asp?id=407" rel="nofollow noreferrer">this</a> looks interesting. The author wrote it from scratch and posted how he did it. The code for it is available for download as well.

I need to index a whole lot of webpages. What good web crawler utilities are there? Ideally I'm after something that .NET can talk to, but that's not a showstopper.

What I really need is something that I can give a site URL to, and it will follow every link and store the content for indexing.

What's a good web crawler tool?
