Q: Parsing HTML -> XML and querying it with XPath

Stack Overflow user

Asked 2011-03-19 03:10:45

2 answers · 2.5K views · 0 followers · 7 votes

I want to parse an HTML page to extract some data. First, I convert it to an XML document with SgmlReader. Then I load the result into an XmlDocument and navigate it with XPath:

//contains html document
var loadedFile = LoadWebPage();

...

Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader();
sgmlReader.DocType = "HTML";
sgmlReader.WhitespaceHandling = WhitespaceHandling.All;
sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;

sgmlReader.InputStream = new StringReader(loadedFile);

XmlDocument doc = new XmlDocument();
doc.PreserveWhitespace = true;
doc.XmlResolver = null;
doc.Load(sgmlReader);

This code works fine in most cases, except for this site: www.arrow.com (try searching for something like OP295GS). I can get the table with the results using the following XPath:

var node = doc.SelectSingleNode(".//*[@id='results-table']");

This gives a node with several child nodes:

[0]         {Element, Name="thead"}  
[1]         {Element, Name="tbody"}  
[2]         {Element, Name="tbody"}  
FirstChild   {Element, Name="thead"}

OK, let's try to get some of the child nodes using XPath. But this doesn't work:

var childNodes = node.SelectNodes("tbody");
// childNodes.Count == 0

Neither does this:

var childNode = node.SelectSingleNode("thead");
// childNode = null

Nor even this:

var childNode = doc.SelectSingleNode(".//*[@id='results-table']/thead");

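What could be wrong with the XPath queries?

I have just tried parsing that HTML page with Html Agility Pack, and my XPath queries work fine there. But my application uses XmlDocument internally, so Html Agility Pack doesn't suit me.

I even tried the following trick with Html Agility Pack, but the XPath queries don't work that way either: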

//let's parse and convert HTML document using HTML Agility Pack and then load
//the result to XmlDocument
HtmlDocument xmlDocument = new HtmlDocument();
xmlDocument.OptionOutputAsXml = true;
xmlDocument.Load(new StringReader(webPage));

XmlDocument document = new XmlDocument();
document.LoadXml(xmlDocument.DocumentNode.InnerHtml);

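Perhaps the web page contains errors (not all tags are closed, and so on), but even so I can see the child nodes (through Quick Watch in Visual Studio) and yet cannot reach them through XPath.

My XPath queries work correctly in Firefox + FirePath + XPather plugins, but not in .NET XmlDocument :(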

html-parsing

.net

xml

2 Answers

Stack Overflow user

Answered 2011-07-02 15:14:11

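I haven't used SgmlReader, but every time I have seen this problem it was caused by namespaces. A quick look at the HTML on www.arrow.com shows that this node has a namespace (note the xmlns:javaurlencoder):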

<form name="CatSearchForm" method="post" action="http://components.arrow.com/part/search/OP295GS" xmlns:javaurlencoder="java.net.URLEncoder">

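The snippet below that fills the `namespaces` dictionary is how I loop through all the nodes in the document to see which ones have namespaces and which don't. If the node you are looking for, or any of its parents, has a namespace, you must create an XmlNamespaceManager and pass it to your call to SelectNodes().

This is kind of annoying, so another idea may be to strip the xmlns: attributes from the XML before loading it into the XmlDocument. Then you won't need to fool with XmlNamespaceManager!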
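The namespace-stripping idea could look roughly like the sketch below. It is only an illustration under assumptions: the `NamespaceStripper` class and `LoadWithoutNamespaces` helper are hypothetical names, the sample markup is made up rather than taken from www.arrow.com, and it uses LINQ to XML (`System.Xml.Linq`) instead of anything from the question. It would throw if an element carried two attributes with the same local name in different namespaces.

```csharp
using System;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

public static class NamespaceStripper
{
    // Hypothetical helper: parses the XML, moves every element and attribute
    // into "no namespace", drops the xmlns declarations, and returns the
    // result as an XmlDocument.
    public static XmlDocument LoadWithoutNamespaces(string xml)
    {
        XDocument xdoc = XDocument.Parse(xml);

        foreach (XElement el in xdoc.Descendants().ToList())
        {
            // Keep only the local part of the element name.
            el.Name = el.Name.LocalName;

            // Rebuild the attribute list without xmlns / xmlns:prefix declarations.
            el.ReplaceAttributes(
                el.Attributes()
                  .Where(a => !a.IsNamespaceDeclaration)
                  .Select(a => new XAttribute(a.Name.LocalName, a.Value))
                  .ToList());
        }

        var doc = new XmlDocument();
        doc.LoadXml(xdoc.ToString());
        return doc;
    }

    public static void Main()
    {
        // Made-up sample: a table inside a document with a default namespace.
        string xml =
            "<html xmlns='http://www.w3.org/1999/xhtml'><body>" +
            "<table id='results-table'><thead/><tbody/><tbody/></table>" +
            "</body></html>";

        XmlDocument doc = LoadWithoutNamespaces(xml);
        XmlNode table = doc.SelectSingleNode("//*[@id='results-table']");

        // With the namespaces gone, plain element names match again.
        Console.WriteLine(table.SelectNodes("tbody").Count); // 2
    }
}
```

The trade-off is that the namespace information is lost for good, so this only makes sense when nothing downstream depends on it.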

XmlDocument doc = new XmlDocument();
doc.Load(@"C:\temp\X.loadtest.xml");

Dictionary<string, string> namespaces = new Dictionary<string, string>();
XmlNodeList nlAllNodes = doc.SelectNodes("//*");
foreach (XmlNode n in nlAllNodes)
{
    if (n.NodeType != XmlNodeType.Element) continue;

    if (!String.IsNullOrEmpty(n.NamespaceURI) && !namespaces.ContainsKey(n.Name))
    {
        namespaces.Add(n.Name, n.NamespaceURI);
    }
}

// Inspect the namespaces dictionary to write the code below

XmlNamespaceManager nMgr = new XmlNamespaceManager(doc.NameTable);
// Sometimes this works
nMgr.AddNamespace("ns1", doc.DocumentElement.NamespaceURI); 
// You can make the first param whatever you want, it just must match in XPath queries
nMgr.AddNamespace("javaurlencoder", "java.net.URLEncoder"); 

XmlNodeList iter = doc.SelectNodes("//ns1:TestProfile", nMgr);
foreach (XmlNode n in iter)
{
    // Do stuff
}
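Applied to the table from the question, the same approach might look like the following sketch. It assumes, without having verified it against the real page, that the converted document puts its elements in a default namespace (the XHTML one is used here purely as a stand-in). The `NamespaceQueryDemo` class, the `x` prefix, and the sample markup are all made up for illustration.

```csharp
using System;
using System.Xml;

public static class NamespaceQueryDemo
{
    // Assumed namespace; the real page may declare a different one.
    public const string Xhtml = "http://www.w3.org/1999/xhtml";

    // Builds a small stand-in for the converted page.
    public static XmlDocument LoadSample()
    {
        var doc = new XmlDocument();
        doc.LoadXml(
            "<html xmlns='" + Xhtml + "'><body>" +
            "<table id='results-table'><thead/><tbody/><tbody/></table>" +
            "</body></html>");
        return doc;
    }

    public static void Main()
    {
        XmlDocument doc = LoadSample();

        // "*" matches elements in any namespace, so this query succeeds.
        XmlNode table = doc.SelectSingleNode(".//*[@id='results-table']");

        // An unprefixed name in XPath 1.0 only matches elements that are
        // in no namespace, so nothing is found here.
        Console.WriteLine(table.SelectNodes("tbody").Count); // 0

        // Bind a prefix of our choosing to the namespace the table is in.
        var nsMgr = new XmlNamespaceManager(doc.NameTable);
        nsMgr.AddNamespace("x", table.NamespaceURI);

        Console.WriteLine(table.SelectNodes("x:tbody", nsMgr).Count);        // 2
        Console.WriteLine(table.SelectSingleNode("x:thead", nsMgr) != null); // True

        // Namespace-agnostic alternative that needs no manager.
        Console.WriteLine(table.SelectNodes("*[local-name()='tbody']").Count); // 2
    }
}
```

This would also explain why the debugger shows the children while `SelectNodes("tbody")` returns nothing: the nodes are there, but their names are namespace-qualified. Checking `node.NamespaceURI` in Quick Watch is a quick way to confirm whether that is the case for a given page.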

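Votes: 1

Stack Overflow user

Answered 2011-03-19 03:14:56

To be honest, when I try to get information from a website, I use regex. OK, Kore Nordmann (on his PHP blog) argues that this is bad. But some of the comments there disagree.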

html.html

regexp.html

It's in PHP though, sorry about that =) Hope it helps anyway.

Votes: 0

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/5359805
