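I have about 2,500 HTML files written to varying standards, and I need to remove the footer section from each of them. The HTML below is the footer from one of my files; I need to delete the two outer hr elements and everything between them.
So far I have only tried to target the hr elements with XPath (and HTML Agility Pack), using SelectSingleNode and DocumentNode.SelectNodes("//hr");, and then iterating over the result with a foreach. But I'm a newbie: I can't use XPath properly, and I don't know how to select the nodes and their siblings (?) so that I can delete them.
With the help of this community, this is what I have so far. :)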
private static void RemoveHR(IEnumerable<string> files)
{
    var document = new HtmlDocument();
    List<string> hr = new List<string>();
    List<string> errors = new List<string>();
    int i = 0;
    foreach (var file in files)
    {
        try
        {
            document.Load(file);
            i++;
            // SelectNodes returns null when a file has no hr element;
            // the resulting exception is caught and logged below.
            var hrs = document.DocumentNode.SelectNodes("//hr");
            // Loop variable renamed from hr, which clashed with the hr list declared above.
            foreach (var hrNode in hrs) hrNode.Remove();
            document.Save(file);
        }
        catch (Exception Ex)
        {
            errors.Add(file + "|" + Ex.Message);
        }
    }
    using (StreamWriter logger = File.CreateText(@"D:\websites\dev.openjournal.tld\public\arkivet\ErrorLogs\hr_error_log.txt"))
    {
        foreach (var file in errors)
        {
            logger.WriteLine(file);
        }
    }
    int nrOfHr = hr.Count();
    int nrOfErrors = errors.Count();
    Console.WriteLine("Number of hr elements collected: {0}", nrOfHr);
    Console.WriteLine("Number of files missing hr element: {0}", nrOfErrors);
}
HTML source:
<hr color=#ff00ff SIZE=3> //start element
<p style="text-align : center; color : Red; font-weight : bold;">How to cite this paper:</i></p>
<p style="text-align : left; color : black;">Ekmekçioglu, F. Çuna, Lynch, Michael F. & Willett, Peter (1996) "Stemming and N-gram matching for term conflation in Turkish texts" <em>Information Research</em>, <strong>1</strong>(1) Available at: http://informationr.net/ir/2-2/paper13.html</p>
<p style="text-align : center">© the authors, 1996.</p>
<hr color="#ff00ff" size="1"><div align="center">Check for citations, <a href="http://scholar.google.co.uk/scholar?hl=en&q=http://informationr.net/ir/2-2/paper13.html&btnG=Search&as_sdt=2000">using Google Scholar</a></div>
<hr color="#ff00ff" size="1">
<table border="0" cellpadding="15" cellspacing="0" align="center">
<tr>
<td><a href="infres22.html"><h4>Contents</h4></a></td>
<td align="center" valign="top"><h5 align="center"><IMG SRC="http://counter.digits.net/wc/-d/-z/6/-b/FF0033/paper13" ALIGN=middle WIDTH=60 HEIGHT=20 BORDER=0 HSPACE=4 VSPACE=2><br><a href="http://www.digits.net/ ">Web Counter</a><br>Counting only since 13 December 2002</h5></td>
<td><a href="http://InformationR.net/ir/"><h4>Home</h4></a></td>
</tr>
</table>
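<hr color=#ff00ff SIZE=3> //end element
Edit: I experimented with preceding-sibling and following-sibling relative to a target node. Unfortunately, the result does not include the target nodes themselves.
var footerTags = document.DocumentNode.SelectNodes("//*[preceding-sibling::p[contains(text(),'How to cite this')] and following-sibling::hr[@color = '#ff00ff']]");
It finds the paragraph containing the text "How to cite this" and then selects every node between it and an hr whose color is "#ff00ff". The boundary nodes themselves are not in the resulting list, though, and they need to be deleted along with the selected nodes.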
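Answered 2018-02-23 18:40:48
Assuming the start and end nodes are identical (same tag name, attributes, and attribute values), as mentioned in the comments above, this is not hard:
Sample HTML: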
var html =
@"<!doctype html system 'html.dtd'>
<html><head></head>
<body>
<div>DO NOT DELETE</div>
<hr color=""#ff00ff"" SIZE='3'> //start element
<p style='text-align : center; color : Red; font-weight : bold;'>How to cite this paper:</i></p>
<p style='text-align : left; color : black;'>Ekmekçioglu, F. Çuna, Lynch, Michael F. & Willett, Peter (1996) ""Stemming and N-gram matching for term conflation in Turkish texts"" <em>Information Research</em>, <strong>1</strong>(1) Available at: http://informationr.net/ir/2-2/paper13.html</p>
<p style='text-align : center'>© the authors, 1996.</p>
<hr color='#ff00ff' size='1'><div align='center'>Check for citations, <a href='http://scholar.google.co.uk/scholar?hl=en&q=http://informationr.net/ir/2-2/paper13.html&btnG=Search&as_sdt=2000'>using Google Scholar</a></div>
<hr color='#ff00ff' size='1'>
<table border='0' cellpadding='15' cellspacing='0' align='center'>
<tr>
<td><a href='infres22.html'><h4>Contents</h4></a></td>
<td align='center' valign='top'><h5 align='center'><IMG SRC='http://counter.digits.net/wc/-d/-z/6/-b/FF0033/paper13' ALIGN=middle WIDTH=60 HEIGHT=20 BORDER=0 HSPACE=4 VSPACE=2><br><a href='http://www.digits.net/'>Web Counter</a><br>Counting only since 13 December 2002</h5></td>
<td><a href='http://InformationR.net/ir/'><h4>Home</h4></a></td>
</tr>
</table>
<hr COLOR='#ff00ff' SIZE=""3""> //end element
<div>DO NOT DELETE</div>
</body>
</html>";
Parse it:
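// requires: using HtmlAgilityPack; using System.Text.RegularExpressions;
var document = new HtmlDocument();
document.LoadHtml(html);
var startNode = document.DocumentNode.SelectSingleNode("//hr[@size='3'][@color='#ff00ff']");
// account for mismatched quotes in HTML source
var quotesRegex = new Regex("[\"']");
var startNodeNoQuotes = quotesRegex.Replace(startNode.OuterHtml, "");
HtmlNode siblingNode;
// remove each following sibling in turn until the end node has been removed
while ((siblingNode = startNode.NextSibling) != null)
{
    siblingNode.Remove();
    // compare case-insensitively as well, since the sample mixes color/COLOR
    if (string.Equals(quotesRegex.Replace(siblingNode.OuterHtml, ""), startNodeNoQuotes, StringComparison.OrdinalIgnoreCase))
    {
        break; // end node
    }
}
startNode.Remove();
Resulting output: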
<!doctype html system 'html.dtd'>
<html><head></head>
<body>
<div>DO NOT DELETE</div>
//end element
<div>DO NOT DELETE</div>
</body>
</html>
Answered 2018-02-20 19:00:16
I think this is what you have in mind.
Code:
// requires: using System.Text.RegularExpressions;
string content = System.IO.File.ReadAllText(@"D:\New Text Document.txt");
string html = Regex.Replace(content, "<hr.*?>", "", RegexOptions.Singleline);
Result:
//start element
<p style="text-align : center; color : Red; font-weight : bold;">How to cite this paper:</i></p>
<p style="text-align : left; color : black;">Ekmekçioglu, F. Çuna, Lynch, Michael F. & Willett, Peter (1996) "Stemming and N-gram matching for term conflation in Turkish texts" <em>Information Research</em>, <strong>1</strong>(1) Available at: http://informationr.net/ir/2-2/paper13.html</p>
<p style="text-align : center">© the authors, 1996.</p>
<div align="center">Check for citations, <a href="http://scholar.google.co.uk/scholar?hl=en&q=http://informationr.net/ir/2-2/paper13.html&btnG=Search&as_sdt=2000">using Google Scholar</a></div>
<table border="0" cellpadding="15" cellspacing="0" align="center">
<tr>
<td><a href="infres22.html"><h4>Contents</h4></a></td>
<td align="center" valign="top"><h5 align="center"><IMG SRC="http://counter.digits.net/wc/-d/-z/6/-b/FF0033/paper13" ALIGN=middle WIDTH=60 HEIGHT=20 BORDER=0 HSPACE=4 VSPACE=2><br><a href="http://www.digits.net/ ">Web Counter</a><br>Counting only since 13 December 2002</h5></td>
<td><a href="http://InformationR.net/ir/"><h4>Home</h4></a></td>
</tr>
</table>
//end element
https://stackoverflow.com/questions/48891744
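Note that the regex approach above strips only the hr tags and leaves the footer content in place. For the batch job described in the question, the sibling walk from the 2018-02-23 answer can be combined with the per-file loop. The following is only a minimal sketch under stated assumptions: the footer always starts and ends with an hr whose size is 3 and whose color is #ff00ff, both hr elements are siblings under the same parent, and HTML Agility Pack is installed. The names FooterRemover, IsFooterBoundary, RemoveFooter, and RemoveFooters are hypothetical and do not appear in the thread.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class FooterRemover
{
    // Hypothetical helper: true for an <hr> carrying the footer's color and size.
    // Attribute values are compared, so quote style and attribute-name case do not matter.
    static bool IsFooterBoundary(HtmlNode node) =>
        node.NodeType == HtmlNodeType.Element
        && node.Name == "hr"
        && string.Equals(node.GetAttributeValue("color", ""), "#ff00ff", StringComparison.OrdinalIgnoreCase)
        && node.GetAttributeValue("size", "") == "3";

    // Removes the start <hr>, the matching end <hr>, and every sibling between them.
    // Returns false and leaves the document untouched if either boundary is missing.
    public static bool RemoveFooter(HtmlDocument document)
    {
        var start = document.DocumentNode.SelectSingleNode("//hr[@size='3'][@color='#ff00ff']");
        if (start == null) return false;

        // Collect first and delete afterwards, so nothing is lost when no end node exists.
        var toRemove = new List<HtmlNode> { start };
        var foundEnd = false;
        for (var node = start.NextSibling; node != null; node = node.NextSibling)
        {
            toRemove.Add(node);
            if (IsFooterBoundary(node)) { foundEnd = true; break; }
        }
        if (!foundEnd) return false;

        foreach (var node in toRemove) node.Remove();
        return true;
    }

    // Applies RemoveFooter to each file and saves only the files that changed.
    public static void RemoveFooters(IEnumerable<string> files)
    {
        foreach (var file in files)
        {
            var document = new HtmlDocument();
            document.Load(file);
            if (RemoveFooter(document)) document.Save(file);
            else Console.WriteLine("No footer found: {0}", file);
        }
    }
}
```

Because the nodes are collected before anything is deleted, a file with only one matching hr is reported rather than truncated. With 2,500 files written to different standards, it is probably worth running this on copies first and checking the files that are reported as having no footer.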