
How to change LINQ to include a table cell's href when scraping a table with HtmlAgilityPack

Stack Overflow user
Asked 2018-03-27 07:25:36
2 answers, 135 views, 0 followers, 1 vote

Suppose I have a table like this:

<table class="MyClass" width="100%" cellspacing="0" cellpadding="0">
    <thead>
        <tr>
            <th class="releaseCol">Name</th>
            <th class="typeCol">Type</th>
        </tr>
    </thead>
    <tbody>
        <tr>
                <td><a href="https://www.somescrapypage.com/x/x/x/644892" class="demo">one</a></td>
                <td class="demo">Demo</td>
        </tr>
        <tr>
                <td><a href="https://www.somescrapypage.com/x/x/x/6876" class="other">two</a></td>
                <td class="other">Compilation</td>
        </tr>
        <tr>
                <td><a href="https://www.somescrapypage.com/x/x/x/8440" class="album">three</a></td>
                <td class="album">Full-length</td>
        </tr>
        <tr>
                <td><a href="https://www.somescrapypage.com/x/x/x/610225" class="single">four</a></td>
                <td class="single">Single</td>
        </tr>
    </tbody>
</table>

#Current code

var doc = new HtmlAgilityPack.HtmlDocument
{
   OptionFixNestedTags = true,
   OptionCheckSyntax = true,
   OptionAutoCloseOnEnd = true
};
doc.LoadHtml(html);
List<List<string>> parsedTbl = 
  doc.DocumentNode.SelectSingleNode("//table[@class='MyClass']")
  .Descendants("tr")
  .Skip(1) //To Skip Table Header Row
  .Where(tr => tr.Elements("td").Count() > 1)
  .Select(tr => tr.Elements("td").Select(td => td.InnerText.Trim()).ToList())
  .ToList();
foreach (var r in parsedTbl)
{
  Console.WriteLine($"{r[0]} {r[1]} "); //HOW TO INCLUDE HREF INFO?
}

How should I edit .Select(td => td.InnerText.Trim()) so that it also includes the href of the first cell?

#Expected result:

https://www.somescrapypage.com/x/x/x/644892  one  Demo
https://www.somescrapypage.com/x/x/x/6876    two  Compilation...

2 Answers

Stack Overflow user

Accepted answer

Answered 2018-03-27 08:05:30

Tested and working.

var doc = new HtmlAgilityPack.HtmlDocument
{
   OptionFixNestedTags = true,
   OptionCheckSyntax = true,
   OptionAutoCloseOnEnd = true
};
doc.LoadHtml(html);

List<List<HtmlAgilityPack.HtmlNode>> parsedTbl =
              doc.DocumentNode.SelectSingleNode("//table[@class='MyClass']")
              .Descendants("tr")
              .Skip(1)
              .Where(tr => tr.Elements("td").Count() > 1)
              .Select(tr => tr.Elements("td").ToList())
              .ToList();

foreach (var r in parsedTbl)
{
   Console.WriteLine(r[0].FirstChild.Attributes["href"].Value + "  " + r[0].InnerText + "  " + r[1].InnerText); // FirstChild is the <a> element here because nothing precedes it inside the <td>
}

Output

https://www.somescrapypage.com/x/x/x/644892  one  Demo
https://www.somescrapypage.com/x/x/x/6876  two  Compilation
https://www.somescrapypage.com/x/x/x/8440  three  Full-length
https://www.somescrapypage.com/x/x/x/610225  four  Single
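
The code above reads the link through `FirstChild`, which works for the sample table because the `<a>` is the first node inside the `<td>`. If the markup had whitespace or other nodes before the link, `FirstChild` would likely be a text node rather than the anchor. A minimal sketch of a more defensive variant follows, keeping the question's `List<List<string>>` shape and prepending the href to each row. `TableScraper` and `ParseRows` are hypothetical names introduced for illustration, and the empty-string fallback for cells without a link is an assumption, not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class TableScraper
{
    // Hypothetical helper: returns one list per data row, with the first
    // cell's href ("" when the cell has no <a> child) placed before the cell texts.
    public static List<List<string>> ParseRows(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var table = doc.DocumentNode.SelectSingleNode("//table[@class='MyClass']");
        if (table == null)
        {
            return new List<List<string>>(); // no matching table in the document
        }

        return table
            .Descendants("tr")
            .Select(tr => tr.Elements("td").ToList())
            // Rows made only of <th> cells have zero <td> children,
            // so this filter also drops the header row.
            .Where(cells => cells.Count > 1)
            .Select(cells =>
            {
                // Element("a") returns null when the cell has no direct <a> child.
                var link = cells[0].Element("a");
                var href = link?.GetAttributeValue("href", "") ?? "";
                return new[] { href }
                    .Concat(cells.Select(td => td.InnerText.Trim()))
                    .ToList();
            })
            .ToList();
    }
}
```

Each row can then be printed with something like `Console.WriteLine(string.Join("  ", row))`, which puts the href first, followed by the text of every cell.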
Votes: 1

Stack Overflow user

Answered 2018-03-27 08:01:33

This isn't super pretty, but it should get you started:

using System;
using System.Linq;

class Program
{
    static void Main(string[] args)
    {
        var html = System.IO.File.ReadAllText(@"index.html");
        var doc = new HtmlAgilityPack.HtmlDocument
        {
            OptionFixNestedTags = true,
            OptionCheckSyntax = true,
            OptionAutoCloseOnEnd = true
        };
        doc.LoadHtml(html);

        var results =
        doc.DocumentNode.SelectSingleNode("//table[@class='MyClass']")
        .Descendants("tr")
        .Skip(1) //To Skip Table Header Row
        .Where(tr => tr.Elements("td").Count() > 1)
        .Select(tr =>
        {

            return new Result
            {
                link = tr.Elements("td").Select(td => td.Elements("a").FirstOrDefault().Attributes["href"].Value).FirstOrDefault(),
                inner = tr.Elements("td").Select(td => td.Elements("a").FirstOrDefault().InnerText).FirstOrDefault(),
                name = tr.Elements("td").Skip(1).FirstOrDefault().InnerText
            };

        });

        foreach (var result in results)
        {
            Console.WriteLine($"Link: {result.link} InnerText: {result.inner} Name: {result.name}");
        }
    }
}

class Result
{
    public string link { get; set; }
    public string inner { get; set; }
    public string name { get; set; }
}
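
The projection above enumerates the cells three times per row and would throw a `NullReferenceException` if the first cell had no `<a>` element. As a rough sketch of the same idea, the version below materializes the cells once and uses null-conditional access for the link. `ReleaseRow`, `ReleaseTable`, and the property names are hypothetical, and leaving `Link` as `null` when no anchor is found is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical row type; the property names are illustrative only.
public sealed class ReleaseRow
{
    public string Link { get; set; }
    public string Name { get; set; }
    public string Type { get; set; }
}

public static class ReleaseTable
{
    public static List<ReleaseRow> Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        // The tr[td] predicate keeps only rows that contain at least one <td>.
        var rows = doc.DocumentNode.SelectNodes("//table[@class='MyClass']//tr[td]");
        if (rows == null)
        {
            return new List<ReleaseRow>();
        }

        return rows
            .Select(tr => tr.Elements("td").ToList())
            .Where(cells => cells.Count > 1)
            .Select(cells =>
            {
                // Find the first <a href> anywhere inside the first cell.
                var anchor = cells[0].SelectSingleNode(".//a[@href]");
                return new ReleaseRow
                {
                    Link = anchor?.GetAttributeValue("href", ""), // null when there is no link
                    Name = cells[0].InnerText.Trim(),
                    Type = cells[1].InnerText.Trim()
                };
            })
            .ToList();
    }
}
```

Returning a typed row instead of a positional list means callers write `row.Link` rather than remembering which index holds the href.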
Votes: 0
Original content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/49506711
