You might try looking at the algorithms behind this bookmarklet, <a href="http://lab.arc90.com/2009/03/02/readability/" rel="nofollow noreferrer">readability</a>. It has a decent success rate at extracting the content from among all the rubbish on a web page.

A friend of mine made it, which is why I'm recommending it: I know it works, and I'm aware of the many techniques he uses to parse the data. You could apply these techniques to what you're asking.
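To give a feel for the kind of technique involved, here is a minimal Python sketch of one common content-scoring heuristic: favour containers with a lot of text and few links. This is not Readability's actual code. The container tag list, the comma bonus of 25, and the names `BlockScorer`, `score` and `main_text` are all assumptions made for illustration.

```python
from html.parser import HTMLParser

CONTAINERS = {"div", "article", "section", "td"}  # assumed candidate blocks
IGNORED = {"script", "style"}                     # never count as content


class BlockScorer(HTMLParser):
    """Collect text and link-text length for each container element."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # one {"text": [...], "link_chars": int} per container
        self.open = []     # stack of (tag, index into self.blocks)
        self.in_link = 0
        self.ignored = 0

    def handle_starttag(self, tag, attrs):
        if tag in CONTAINERS:
            self.blocks.append({"text": [], "link_chars": 0})
            self.open.append((tag, len(self.blocks) - 1))
        elif tag == "a":
            self.in_link += 1
        elif tag in IGNORED:
            self.ignored += 1

    def handle_endtag(self, tag):
        if tag in CONTAINERS:
            # Pop back to the matching open tag, tolerating sloppy markup.
            if any(t == tag for t, _ in self.open):
                while self.open.pop()[0] != tag:
                    pass
        elif tag == "a":
            self.in_link = max(0, self.in_link - 1)
        elif tag in IGNORED:
            self.ignored = max(0, self.ignored - 1)

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text or self.ignored or not self.open:
            return
        # Text is credited to the innermost open container only.
        block = self.blocks[self.open[-1][1]]
        block["text"].append(text)
        if self.in_link:
            block["link_chars"] += len(text)


def score(block):
    """Long, comma-rich text scores high; link-heavy text scores low."""
    text = " ".join(block["text"])
    if not text:
        return 0.0
    link_density = block["link_chars"] / len(text)
    return (len(text) + 25 * text.count(",")) * (1.0 - link_density)


def main_text(html):
    """Return the text of the highest-scoring container, or ''."""
    parser = BlockScorer()
    parser.feed(html)
    parser.close()
    best = max(parser.blocks, key=score, default=None)
    return " ".join(best["text"]) if best else ""
```

Navigation and footer blocks tend to be short and made up mostly of link text, so they score near zero, while an article body with full sentences wins. Real pages need more care (nested containers, comments sections, malformed markup), so treat this as a starting point only.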

You can take a look at the source of Goose. It already does a lot of this, such as Instapaper-style text extraction.

<a href="https://github.com/jiminoc/goose/wiki" rel="nofollow">https://github.com/jiminoc/goose/wiki</a>

Have a look at the ExtractContent code from Shuyo Nakatani. 

See the original Ruby source at <a href="http://rubyforge.org/projects/extractcontent/" rel="nofollow">http://rubyforge.org/projects/extractcontent/</a> or a Perl port of it at <a href="http://metacpan.org/pod/HTML::ExtractContent" rel="nofollow">http://metacpan.org/pod/HTML::ExtractContent</a>

You really should consider using an <a href="http://simplehtmldom.sourceforge.net/" rel="nofollow noreferrer">HTML parser</a> for this. Gather similar pages and compare the DOM trees to find the differing nodes.
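As a rough illustration of the comparison idea, here is a Python sketch that uses only the standard library's `html.parser` rather than the PHP parser linked above. It keys each piece of text by the tag path leading to it, then keeps the paths whose text differs between two pages from the same site. The names `PathTextCollector`, `text_blocks` and `differing_blocks`, and the path format (`tag#id.class`), are assumptions for this example.

```python
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
IGNORED = {"script", "style"}


class PathTextCollector(HTMLParser):
    """Collect visible text keyed by the element path that contains it."""

    def __init__(self):
        super().__init__()
        self.stack = []    # (tag, label) for each open element
        self.blocks = {}   # path string -> text found at that path

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never get a closing tag
        attrs = dict(attrs)
        label = tag
        if attrs.get("id"):
            label += "#" + attrs["id"]
        if attrs.get("class"):
            label += "." + attrs["class"]
        self.stack.append((tag, label))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy markup.
        if any(t == tag for t, _ in self.stack):
            while self.stack.pop()[0] != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text or any(t in IGNORED for t, _ in self.stack):
            return
        path = "/".join(label for _, label in self.stack)
        self.blocks[path] = (self.blocks.get(path, "") + " " + text).strip()


def text_blocks(html):
    collector = PathTextCollector()
    collector.feed(html)
    collector.close()
    return collector.blocks


def differing_blocks(page_a, page_b):
    """Text blocks of page_a that are missing or different in page_b."""
    a, b = text_blocks(page_a), text_blocks(page_b)
    return {path: text for path, text in a.items() if b.get(path) != text}
```

Headers, navigation and footers usually repeat verbatim across pages of one site, so they drop out, and what remains is mostly the article. The sketch merges sibling elements that share a path, and it will also keep per-page widgets such as "related posts", so comparing more than two pages helps.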

This <a href="http://tomazkovacic.com/blog/122/evaluating-text-extraction-algorithms/" rel="nofollow">article</a> provides a comparison of different approaches. The Java library <a href="http://code.google.com/p/boilerpipe/" rel="nofollow">boilerpipe</a> was rated highly. On the boilerpipe site you can find the author's scientific paper, which compares it to other algorithms.

Not all algorithms suit all purposes. The biggest application of such tools is simply getting the raw text for a search engine to index, the idea being that you don't want search results to be messed up by adverts. Such extraction can be destructive, meaning that it won't give you "the best reading area", which is what people want from Instapaper or Readability.

I'm trying to write a text parser in PHP, like Instapaper's. What I want to do is fetch a webpage and parse it in text-only mode.

It's simple to get the webpage with cURL and strip the HTML tags. But every webpage has some common areas, like the header, navigation, sidebar, footer, banners, etc. I only want to get the article as text and exclude all the other parts. It's also simple to exclude those parts if I know their "id" or "class". But I'm trying to automate this process and apply it to any page, like Instapaper does.
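For reference, the manual approach described above (strip the tags, and skip elements whose "id" or "class" is already known) could be sketched as follows. This is Python rather than PHP, it leaves out the fetching step, and the names `TextOnly` and `strip_page` and the default list of skipped tags are assumptions for illustration.

```python
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TextOnly(HTMLParser):
    """Keep text, dropping subtrees matched by tag name, id or class."""

    def __init__(self, skip_ids=(), skip_classes=(),
                 skip_tags=("script", "style", "header", "nav",
                            "aside", "footer")):
        super().__init__()
        self.skip_ids = set(skip_ids)
        self.skip_classes = set(skip_classes)
        self.skip_tags = set(skip_tags)
        self.stack = []     # (tag, starts_skipped_subtree) per open element
        self.skipping = 0   # number of open skipped subtrees
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never get a closing tag
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        skipped = (tag in self.skip_tags
                   or attrs.get("id") in self.skip_ids
                   or bool(classes & self.skip_classes))
        self.stack.append((tag, skipped))
        self.skipping += skipped

    def handle_endtag(self, tag):
        if not any(t == tag for t, _ in self.stack):
            return  # stray closing tag
        while True:
            open_tag, skipped = self.stack.pop()
            self.skipping -= skipped
            if open_tag == tag:
                break

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and not self.skipping:
            self.parts.append(text)


def strip_page(html, skip_ids=(), skip_classes=()):
    """Return the page text, one line per text run, minus skipped parts."""
    parser = TextOnly(skip_ids=skip_ids, skip_classes=skip_classes)
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```

For example, `strip_page(html, skip_ids=["sidebar"])` drops the element with `id="sidebar"` and everything inside it. The limitation is the one stated above: the ids and classes must be known per site.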

I get all the content between the body tags, but I don't know how to exclude the header, sidebar or footer and keep only the main article body. I have to develop some logic to get only the main article part.

It's not important for me to find the exact code. It would also be useful to understand how to exclude the unnecessary parts, so that I can try to write my own code in PHP. Examples in other languages would be useful too.

Thanks for helping.

Text Parser with PHP, like Instapaper
