
See this article:

<a href="http://net.tutsplus.com/tutorials/php/html-parsing-and-screen-scraping-with-the-simple-html-dom-library/" rel="nofollow">http://net.tutsplus.com/tutorials/php/html-parsing-and-screen-scraping-with-the-simple-html-dom-library/</a>

It will get you on your feet in no time if you are using PHP or are undecided.


I suggest regular expressions, but you'll need to write a separate expression for each website.

Or you can use DOM.

Either way, you'll always need to keep track of changes on every site you want to parse.
And you'll need a different set of rules for each website.
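
A minimal sketch of the "one set of rules per website" idea, in Python. The host names, the field names, and the `extract_fields` helper are all made up for illustration, and the patterns only fit the sample markup shown here. Any change to a site's markup would mean updating its entry.

```python
import re
from urllib.parse import urlparse

# Hypothetical per-site rules: each host maps a field name to a regex
# with exactly one capture group. None of these hosts are real.
SITE_RULES = {
    "news.example.com": {
        "title": re.compile(r"<h1[^>]*>(.*?)</h1>", re.S | re.I),
        "image": re.compile(r'<img[^>]*\bsrc="([^"]+)"', re.I),
    },
    "daily.example.org": {
        "title": re.compile(r'<h2 class="headline">(.*?)</h2>', re.S | re.I),
        "image": re.compile(r'<meta property="og:image" content="([^"]+)"', re.I),
    },
}


def extract_fields(url, html):
    """Apply the rules registered for the URL's host.

    Fields whose pattern does not match are returned as None.
    """
    rules = SITE_RULES.get(urlparse(url).hostname)
    if rules is None:
        raise KeyError(f"no rules registered for {url}")
    result = {}
    for field, pattern in rules.items():
        match = pattern.search(html)
        result[field] = match.group(1).strip() if match else None
    return result


sample = '<html><body><h1 class="t"> Markets rally </h1><img src="/img/a.jpg"></body></html>'
print(extract_fields("http://news.example.com/story/1", sample))
```

Keeping the rules in one table at least makes it obvious which entry to touch when a site changes its layout.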


Use a DOM parser to get your content.
Do NOT use regex. <a href="https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454">RegEx match open tags except XHTML self-contained tags</a> explains why very well.

Find a DOM parser for the language of your choice, then use XPath or similar to query the DOM object. Another nice option for people with experience manipulating the DOM in JavaScript is PhantomJS. It's awesome, and it's what I use as the backend of all my content scrapers nowadays.
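
To make "parse into a tree, then query it with XPath" concrete, here is a rough Python sketch using the standard library's `xml.etree.ElementTree`, which supports a limited XPath subset. It assumes the page is well-formed XHTML with no namespace declaration, which real news pages usually are not; for real-world HTML you would swap in a lenient parser (lxml is one option). The sample page, the `attr_of` helper, and the choice of `<meta>` tags are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed XHTML article page (no xmlns, to keep paths short).
page = """<html>
  <head>
    <title>Storm hits coast</title>
    <meta name="description" content="Heavy rain expected overnight." />
    <meta property="og:image" content="http://news.example.com/storm.jpg" />
  </head>
  <body>
    <div class="article"><h1>Storm hits coast</h1><p>First paragraph.</p></div>
  </body>
</html>"""

root = ET.fromstring(page)


def attr_of(node, path, attr):
    """Return attribute `attr` of the first element matching `path`, or None."""
    found = node.find(path)
    return found.get(attr) if found is not None else None


article = {
    "title": root.findtext("./head/title"),
    "description": attr_of(root, ".//meta[@name='description']", "content"),
    "image": attr_of(root, ".//meta[@property='og:image']", "content"),
    "headline": root.findtext(".//div[@class='article']/h1"),
}
print(article)
```

The queries describe where the data lives in the tree instead of what the surrounding text looks like, which is why they tend to survive small markup changes better than a regex does.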

Cheers


I'd recommend a DOM parser as well. I've used <a href="http://simplehtmldom.sourceforge.net/" rel="nofollow">PHP Simple HTML DOM Parser</a> in the past and would recommend it. It's fairly fast and also handles broken HTML, something regular expressions would struggle with.

However, if you are willing to do away with images where RSS does not provide them, parsing an RSS feed should be a lot easier since it is a valid XML document.
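
As a rough illustration of the RSS route, the Python sketch below parses a small inline feed with the standard library and leaves the image empty when an item has none. The feed content is invented; it assumes images, when present, arrive as a Media RSS `media:thumbnail` element or as an `enclosure` with an image MIME type, which will not hold for every feed. `parse_items` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace used by the Media RSS extension.
MEDIA_NS = {"media": "http://search.yahoo.com/mrss/"}

# Invented sample feed: the first item carries an image, the second does not.
feed = """<?xml version="1.0"?>
<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <title>Example News</title>
    <item>
      <title>Storm hits coast</title>
      <description>Heavy rain expected overnight.</description>
      <pubDate>Mon, 06 Sep 2010 16:20:00 GMT</pubDate>
      <media:thumbnail url="http://news.example.com/storm.jpg" />
    </item>
    <item>
      <title>Council approves budget</title>
      <description>No image is attached to this item.</description>
      <pubDate>Mon, 06 Sep 2010 17:05:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_items(xml_text):
    """Return one dict per <item>; 'image' is None when the feed has none."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iterfind("./channel/item"):
        thumb = item.find("media:thumbnail", MEDIA_NS)
        enclosure = item.find("enclosure")
        image = None
        if thumb is not None:
            image = thumb.get("url")
        elif enclosure is not None and enclosure.get("type", "").startswith("image/"):
            image = enclosure.get("url")
        items.append({
            "title": item.findtext("title"),
            "description": item.findtext("description"),
            "date": item.findtext("pubDate"),
            "image": image,
        })
    return items


for entry in parse_items(feed):
    print(entry["title"], "->", entry["image"])
```

Items that come back with `image` set to `None` are the ones where you would either show a placeholder or fall back to parsing the article page itself.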

Hi, I have a task to build an application that displays news from various websites (BBC News, CNN, etc.).

I came up with two ideas: either parse an RSS feed of the news site, or parse the HTML pages of each news article.

However, after researching RSS feeds a bit, I found that it is hard to get an image from them, mainly because not all RSS feeds include images.

Therefore, what do you recommend as a good HTML document parser with which I can extract the Title, Description, Data and Image of a news article?

Parsing a HTML Page?
