
How to stop at the first occurrence of a string that is used multiple times

Stack Overflow user
Asked 2015-05-18 17:00:32
Answers: 2 · Views: 69 · Followers: 0 · Votes: 0

I'm currently writing a script to parse part of an HTML document.

Here is a sample of the code I'm parsing:

<div class="tab-content">
<div class="tab-pane fade in active" id="how-to-take">
<div class="panel-body">
<h3>What is Pantoprazole?</h3>
Pantoprazole is a generic drug used to treat certain conditions where there is too much acid in the stomach. It is
used to treat gastric and duodenal ulcers, erosive esophagitis, and gastroesophageal reflux disease (GERD). GERD is
a condition where the acid in the stomach washes back up into the esophagus. <br/> Pantoprazole is a proton pump
inhibitor (PPI). It works by decreasing the amount of acid produced by the stomach.
<h3>How To Take</h3>
Take the tablets 1 hour before a meal without chewing or breaking them and swallow them whole with some water
</div>
</div>
<div class="tab-pane fade" id="alternative-treatments">
<div class="panel-body">
<h3>Alternatives</h3>
Antacids taken as required Antacids are alkali liquids or tablets
that can neutralise the stomach acid. A dose may give quick relief.
There are many brands which you can buy. You can also get some on
prescription. If you have mild or infrequent bouts of dyspepsia you
may find that antacids used as required are all that you need.<br/>
</div>
</div>
<div class="tab-pane fade" id="side-effects">
<div class="panel-body">
<p>Most people who take acid reflux medication do not have any side-effects.
However, side-effects occur in a small number of users. The most
common side-effects are:</p>
<ul>

I'm trying to parse everything between:

<div class="tab-pane fade in active" id="how-to-take">
<div class="panel-body">

and the first closing tag that follows:

</div>

I have written the following regex:

<div class="tab-pane fade in active" id="how-to-take">\n<div class="panel-body">\n(.*?[\s\S]+)\n(?:<\/div>)

I have also tried:

<div class="tab-pane fade in active" id="how-to-take">\n<div class="panel-body">\n(.*?[\s\S]+)\n<\/div>

However, it doesn't stop at the first </div>; it keeps going until the last </div> in the code.
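
The behaviour described here can be reproduced in isolation. In both patterns the `[\s\S]+` part is greedy: it first consumes the rest of the input and then backtracks only as far as the last `\n</div>`, and the lazy `.*?` in front of it does not change that. The following is a minimal sketch, not a recommended way to parse HTML. It assumes .NET's `System.Text.RegularExpressions` and LF line endings, and the class and method names are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class FirstClosingDivDemo
{
    // Opening part copied from the question's pattern. Verbatim string:
    // the \n sequences reach the regex engine unchanged and match a line feed.
    private const string Prefix =
        @"<div class=""tab-pane fade in active"" id=""how-to-take"">\n<div class=""panel-body"">\n";

    // Greedy: [\s\S]+ grabs everything, then backtracks to the LAST "\n</div>".
    // Groups[1].Value is an empty string when there is no match.
    public static string Greedy(string html) =>
        Regex.Match(html, Prefix + @"([\s\S]+)\n</div>").Groups[1].Value;

    // Lazy: [\s\S]+? grows one character at a time and stops at the FIRST "\n</div>".
    public static string Lazy(string html) =>
        Regex.Match(html, Prefix + @"([\s\S]+?)\n</div>").Groups[1].Value;
}
```

Called on input that contains two panels, `Lazy` returns only the body of the first one, while `Greedy` runs through to the final closing tag. The lazy version still breaks as soon as the captured body contains a nested `</div>` on its own line.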

2 Answers

Stack Overflow user

Answered 2015-05-18 17:08:42

Don't use regex to parse HTML. You can use HtmlAgilityPack instead.

With it, this works as expected:

var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(File.ReadAllText("Path"));
var divPanelBody = doc.DocumentNode.SelectSingleNode("//div[@class='panel-body']");
string text = divPanelBody.InnerText.Trim();  // null check omitted

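Result:

What is Pantoprazole? Pantoprazole is a generic drug used to treat certain conditions where there is too much acid in the stomach. It is used to treat gastric and duodenal ulcers, erosive esophagitis, and gastroesophageal reflux disease (GERD). GERD is a condition where the acid in the stomach washes back up into the esophagus. Pantoprazole is a proton pump inhibitor (PPI). It works by decreasing the amount of acid produced by the stomach. How To Take Take the tablets 1 hour before a meal without chewing or breaking them and swallow them whole with some water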
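
The XPath above returns the first `panel-body` div in document order, which in the sample happens to be the one inside the `how-to-take` tab. If the goal is that specific tab regardless of its position, the `id` can be included in the query. This is a minimal sketch assuming the HtmlAgilityPack NuGet package is referenced; the class and method names are hypothetical.

```csharp
using HtmlAgilityPack;

public static class HowToTakeExtractor
{
    // Hypothetical helper: returns the trimmed text of the panel-body that is a
    // direct child of the div with id "how-to-take", or null when no such node exists.
    public static string Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // The id pins the search to one tab instead of relying on document order.
        HtmlNode node = doc.DocumentNode.SelectSingleNode(
            "//div[@id='how-to-take']/div[@class='panel-body']");

        // SelectSingleNode returns null when nothing matches.
        return node?.InnerText.Trim();
    }
}
```

Because the parser tracks nesting, the result ends at the `</div>` that actually closes the panel, even if the panel contains further `div` elements.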

Here is another approach using LINQ, which I prefer over the XPath syntax:

var divPanelBody = doc.DocumentNode.Descendants("div")
    .FirstOrDefault(d => d.GetAttributeValue("class", "") == "panel-body");

Note that both approaches are case-sensitive, so they won't find Panel-Body. You can easily make the latter approach case-insensitive:

var divPanelBody = doc.DocumentNode.Descendants("div")
    .FirstOrDefault(d => d.GetAttributeValue("class", "").Equals("panel-body", StringComparison.InvariantCultureIgnoreCase));
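
A related caveat: the comparisons above test the whole `class` attribute, so a div declared as `class="panel-body wide"` would not match either. The following sketch compares individual class tokens instead, ignoring case. It assumes the HtmlAgilityPack NuGet package is referenced, and the helper names are hypothetical.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class ClassTokenMatch
{
    // Hypothetical helper: true when the class attribute contains `name` as one
    // of its whitespace-separated tokens, compared case-insensitively.
    public static bool HasClass(HtmlNode node, string name) =>
        node.GetAttributeValue("class", "")
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries) // null separator = split on whitespace
            .Any(c => c.Equals(name, StringComparison.OrdinalIgnoreCase));

    // First div carrying the panel-body class, or null when there is none.
    public static HtmlNode FirstPanelBody(HtmlDocument doc) =>
        doc.DocumentNode.Descendants("div")
            .FirstOrDefault(d => HasClass(d, "panel-body"));
}
```
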
Votes: 3

Stack Overflow user

Answered 2015-05-18 17:27:48

You can do this easily with HtmlAgilityPack:

public string GetInnerHtml(string html)
{
      HtmlDocument doc = new HtmlDocument();
      doc.LoadHtml(html);
      var nodes = doc.DocumentNode.SelectNodes("//div[@class=\"panel-body\"]");
      if (nodes == null) return string.Empty;  // SelectNodes returns null when nothing matches
      StringBuilder sb = new StringBuilder();
      foreach (var n in nodes)
      {
            sb.Append(n.InnerHtml);
      }
      return sb.ToString();
}
Votes: 0

Original content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/30298946
