Q: Problem parsing an HTML page with C#

Stack Overflow user

Asked 2016-05-03 12:22:29

Answers: 1 · Views: 86 · Followers: 0 · Votes: 0

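I have been trying to get this done with HtmlAgilityPack, Fizzler, and regular expressions, without success.

The page I am trying to scrape and parse into elements is here: http://www.sczg.unizg.hr/student-servis/vijest/2015-04-14-poslovi-u-administraciji/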

Example of an item in item list:
<p> 
  <b>1628/ SomeBoldedTitle
  </b> 
    Some Description. 
    Some price 20,00kuna. 
  <strong>Contact somenumber
       098/1234-567 some mail
  </strong> 
</p>

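I want to parse this item into:

4/5-digit ID > 1628/ inside the b element
Title > SomeBoldedTitle inside the b element
Description > the text after the closing /b
Contact number and link > sometimes inside a strong element, sometimes inside a b element

Below is some code with which I tried to get at least some output. I expected it to return every p element that contains a b, but nothing comes out.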

using System;
using HtmlAgilityPack;
using Fizzler.Systems.HtmlAgilityPack;

namespace Sample
{
    class Program
    {
        static void Main(string[] args)
        {
            var web = new HtmlWeb();
            var document = web.Load("http://www.sczg.unizg.hr/student-servis/vijest/2015-04-14-poslovi-u-administraciji/");
            var page = document.DocumentNode;
            foreach (var item in page.QuerySelectorAll("p.item"))
            {
                Console.WriteLine(item.QuerySelector("p:has(b)").InnerHtml);
            }
        }
    }
}

Here is the link to the Fizzler "documentation" I used to write this code: https://fizzlerex.codeplex.com/

parsing

html-agility-pack

fizzler

regex

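1 Answer

Stack Overflow user

Answered 2016-05-04 01:29:58

I recommend using an HTML parsing module, because HTML can produce some crazy edge cases that will skew your data. But if you control the source text and still want or need to use a regex, I offer this possible solution.

Description

Given the following text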

Example of an item in item list:
<p> 
  <b>1628/ SomeBoldedTitle
  </b> 
    Some Description. 
    Some price 20,00kuna. 
  <strong>Contact somenumber
       098/1234-567 some mail
  </strong> 
</p>

This regex

<p>(?:(?!<p>).)*<b>([0-9]+)/\s*((?:(?!</b>).)*?)\s*</b>\s*((?:(?!<strong>|<b>).)*?)\s*<(?:strong|b)>\s*((?:(?!</).)*?)\s*</

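will parse your text into the following capture groups:

Group 0 will be the full match, which covers most of the string.
Group 1 will be the multi-digit code.
Group 2 will be the title.
Group 3 will be the description.
Group 4 will be the phone number.

Capture groups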

[0][0] = <p> 
  <b>1628/ SomeBoldedTitle
  </b> 
    Some Description. 
    Some price 20,00kuna. 
  <strong>Contact somenumber
       098/1234-567 some mail
  </
[0][1] = 1628
[0][2] = SomeBoldedTitle
[0][3] = Some Description. 
    Some price 20,00kuna.
[0][4] = Contact somenumber
       098/1234-567 some mail
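
Since the question is about C#, here is a minimal sketch of how the regex above might be applied with System.Text.RegularExpressions. The class name ItemRegexDemo and the helpers Parse and Collapse are made up for illustration. It also assumes RegexOptions.Singleline is wanted, so that '.' matches the line breaks inside the <p> block; the answer itself does not state which options to use.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ItemRegexDemo
{
    // The pattern from the answer, copied unchanged (split over two literals).
    private const string Pattern =
        @"<p>(?:(?!<p>).)*<b>([0-9]+)/\s*((?:(?!</b>).)*?)\s*</b>\s*" +
        @"((?:(?!<strong>|<b>).)*?)\s*<(?:strong|b)>\s*((?:(?!</).)*?)\s*</";

    // Assumption: Singleline is needed so that '.' also matches the
    // line breaks inside a <p> block.
    private static readonly Regex ItemRegex =
        new Regex(Pattern, RegexOptions.Singleline);

    // The sample item from the question.
    public const string Sample = @"<p>
  <b>1628/ SomeBoldedTitle
  </b>
    Some Description.
    Some price 20,00kuna.
  <strong>Contact somenumber
       098/1234-567 some mail
  </strong>
</p>";

    // Hypothetical helper: returns { id, title, description, contact },
    // or null when the input contains no matching item.
    public static string[] Parse(string html)
    {
        Match m = ItemRegex.Match(html);
        if (!m.Success)
        {
            return null;
        }
        return new[]
        {
            m.Groups[1].Value,
            m.Groups[2].Value,
            Collapse(m.Groups[3].Value),
            Collapse(m.Groups[4].Value),
        };
    }

    // Squeezes runs of whitespace (including line breaks) into single spaces.
    private static string Collapse(string text)
    {
        return Regex.Replace(text, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        string[] fields = Parse(Sample);
        if (fields == null)
        {
            Console.WriteLine("No item found.");
            return;
        }
        Console.WriteLine("ID:          " + fields[0]);
        Console.WriteLine("Title:       " + fields[1]);
        Console.WriteLine("Description: " + fields[2]);
        Console.WriteLine("Contact:     " + fields[3]);
    }
}
```

Collapsing the whitespace is optional. It is only there because groups 3 and 4 span several lines in the sample.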

Explanation

NODE                     EXPLANATION
----------------------------------------------------------------------
  <p>                      '<p>'
----------------------------------------------------------------------
  (?:                      group, but do not capture (0 or more times
                           (matching the most amount possible)):
----------------------------------------------------------------------
    (?!                      look ahead to see if there is not:
----------------------------------------------------------------------
      <p>                      '<p>'
----------------------------------------------------------------------
    )                        end of look-ahead
----------------------------------------------------------------------
    .                        any character
----------------------------------------------------------------------
  )*                       end of grouping
----------------------------------------------------------------------
  <b>                      '<b>'
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    [0-9]+                   any character of: '0' to '9' (1 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  /                        '/'
----------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  (                        group and capture to \2:
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the least amount
                             possible)):
----------------------------------------------------------------------
      (?!                      look ahead to see if there is not:
----------------------------------------------------------------------
        </b>                     '</b>'
----------------------------------------------------------------------
      )                        end of look-ahead
----------------------------------------------------------------------
      .                        any character
----------------------------------------------------------------------
    )*?                      end of grouping
----------------------------------------------------------------------
  )                        end of \2
----------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  </b>                     '</b>'
----------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  (                        group and capture to \3:
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the least amount
                             possible)):
----------------------------------------------------------------------
      (?!                      look ahead to see if there is not:
----------------------------------------------------------------------
        <strong>                 '<strong>'
----------------------------------------------------------------------
       |                        OR
----------------------------------------------------------------------
        <b>                      '<b>'
----------------------------------------------------------------------
      )                        end of look-ahead
----------------------------------------------------------------------
      .                        any character
----------------------------------------------------------------------
    )*?                      end of grouping
----------------------------------------------------------------------
  )                        end of \3
----------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  <                        '<'
----------------------------------------------------------------------
  (?:                      group, but do not capture:
----------------------------------------------------------------------
    strong                   'strong'
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    b                        'b'
----------------------------------------------------------------------
  )                        end of grouping
----------------------------------------------------------------------
  >                        '>'
----------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  (                        group and capture to \4:
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the least amount
                             possible)):
----------------------------------------------------------------------
      (?!                      look ahead to see if there is not:
----------------------------------------------------------------------
        </                       '</'
----------------------------------------------------------------------
      )                        end of look-ahead
----------------------------------------------------------------------
      .                        any character
----------------------------------------------------------------------
    )*?                      end of grouping
----------------------------------------------------------------------
  )                        end of \4
----------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  </                       '</'
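
The breakdown above can also be kept next to the pattern itself. The sketch below is one possible way to do that in C#, using RegexOptions.IgnorePatternWhitespace for inline comments and named groups instead of numbered ones. The names Item, ItemScanner, ParseAll, and the two-item sample are invented for illustration, and the second sample item puts the contact in a b element to exercise that branch. As before, RegexOptions.Singleline is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical container for one parsed item.
public sealed class Item
{
    public string Id { get; set; }
    public string Title { get; set; }
    public string Description { get; set; }
    public string Contact { get; set; }
}

public static class ItemScanner
{
    // Same structure as the regex in the answer, with named groups.
    // IgnorePatternWhitespace allows the layout and the '#' comments.
    private static readonly Regex ItemRegex = new Regex(@"
        <p>                                         # opening tag of one item
        (?:(?!<p>).)*                               # skip ahead without crossing another <p>
        <b>(?<id>[0-9]+)/                           # numeric ID, then '/'
        \s*(?<title>(?:(?!</b>).)*?)\s*</b>         # title, up to </b>
        \s*(?<description>(?:(?!<strong>|<b>).)*?)  # text before the next <strong> or <b>
        \s*<(?:strong|b)>                           # contact wrapper, either tag
        \s*(?<contact>(?:(?!</).)*?)\s*</           # contact, up to the next closing tag
        ", RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace);

    // Made-up input with two items; the second uses <b> for the contact.
    public const string Sample =
        "<p><b>1628/ First job</b> Filing. 20,00kuna. <strong>Contact 098/1234-567</strong></p>\n" +
        "<p><b>20417/ Second job</b> Data entry. <b>Contact 091/765-4321</b></p>";

    // Returns one Item per match, in document order.
    public static List<Item> ParseAll(string html)
    {
        var items = new List<Item>();
        foreach (Match m in ItemRegex.Matches(html))
        {
            items.Add(new Item
            {
                Id = m.Groups["id"].Value,
                Title = m.Groups["title"].Value,
                Description = m.Groups["description"].Value,
                Contact = m.Groups["contact"].Value,
            });
        }
        return items;
    }

    public static void Main()
    {
        foreach (Item item in ParseAll(Sample))
        {
            Console.WriteLine(item.Id + " | " + item.Title + " | "
                + item.Description + " | " + item.Contact);
        }
    }
}
```

Named groups keep the calling code readable if the pattern is later reordered. The caveat from the start of this answer still applies: markup that departs from this shape will not match.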

Votes: 1

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/37004036
