Q: XSLT: How to filter content from a complex generated HTML page?

Stack Overflow user

Asked 2021-08-22 13:54:49

3 answers, 56 views, 0 followers, 0 votes

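There are some very good examples here of how to use XSLT to filter and merge simple HTML pages.

I have a large number of saved HTML pages (generated with ASP), like the example below. They should be filtered and merged into a single HTML file so that a book can be generated from it.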

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="../../../../external.html?link=http://www.w3.org/1999/xhtml" >
<head id="Head1"><title>
    2021_0623.aspx
</title>
</style></head>
<body>
<div align="center">
   
<div class="aspNetHidden">
</div>
    <table width="95%" id="table1" cellspacing="0" cellpadding="0" border="0" >   
    <tr>
        <td>
            &nbsp;
        </td>
        <td width="100%" bgcolor="black" style="padding: 10px;">
        <div align="center">
            
        </div>    
        </td>
    </tr>
    <tr>
        <td>
            &nbsp;
        </td>
        <td bgcolor="black" width="100%" height="20px" style="padding-left: 20px; padding-right: 20px; padding-bottom: 10px;">
        <div class="align-left">
        </div>
        </td>    
    </tr>
    <tr>
        <td align="right" valign="top" style="padding-right: 10px">
            <a href="" /></a><div id="Menu1">
    <ul class="level1">
        <li>Recent Updates</li>
    </ul>
</div><a id="Menu1_SkipLink"></a>
        </td>
        <td width="100%" valign="top" bgcolor="white" style="padding: 20px;"> 
            
<p class="page-title">Library</p>
<p class="page-title-2">Library Text</p>
<div class="nav">
    <table class="nav">
    <tr class="nav">
    <td class="nav-title">Some unneeded navigation</td>
    <td class="nav">
    </td>
    </tr>
    </table>
</div>
<p class="copyright">Copyright © 2021</p>
<p class="about"><strong>ABOUT THE CONTENTS.</strong></p>
<p class="text-title">Title of text</p>
<p class="text-date">August 22, 2021</p>
<p>text of interest.</p>
<p>more text of interest.</p>

<p class="separator-left-33">&nbsp;</p>
<p class="footnote"><a id="_ftn1" href="#_ftnref1" name="_ftn1">[1]</a> a footnote of interest</p>
<p class="footnote"><a id="_ftn2" href="#_ftnref2" name="_ftn1">[2]</a> one more footnote of interest</p>

<div class="nav">
<table class="nav">
</table>
</div>
    </td>
    </tr>   
    <tr>
        <td>
            &nbsp;
        </td>
        <td width="100%" height="45" align="left" valign="top" style="padding-left: 20px; padding-top: 5px;" bgcolor="black">
            </td>
    </tr>     
    </table>
    </form>
</div>
</body>
</html>

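The result should keep everything from the title <p class="page-title">Library</p> onwards, including the footnotes.

Is this possible with XSLT? What might such an approach look like?

Ideally the unneeded navigation would be filtered out too, and perhaps the class="about" paragraph, which is always the same. But that could be done afterwards in a few further steps.

The expected output should look like this, or could be a well-formed HTML page: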

<p class="page-title">Library</p>
<p class="page-title-2">Library Text</p>
<p class="copyright">Copyright © 2021</p>
<p class="text-title">Title of text</p>
<p class="text-date">August 22, 2021</p>
<p>text of interest.</p>
<p>more text of interest.</p>
<p class="separator-left-33">&nbsp;</p>
<p class="footnote"><a id="_ftn1" href="#_ftnref1" name="_ftn1">[1]</a> a footnote of interest</p>
<p class="footnote"><a id="_ftn2" href="#_ftnref2" name="_ftn1">[2]</a> one more footnote of interest</p>

xslt

filter

merge

3 Answers

Stack Overflow user

Accepted answer

Answered 2021-08-22 16:33:46

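xsltproc seems to have an option, --html, to process HTML documents instead of XML documents. Assuming that option lets you parse your input as HTML with elements in no namespace, the XSLT 1 code could look like this: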

<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="1.0">

  <xsl:output method="html" indent="yes" version="5" doctype-system="about:legacy-doctype"/>

  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="body">
      <xsl:copy>
          <xsl:variable name="start-element" select="//p[@class = 'page-title']"/>
          <xsl:apply-templates select="$start-element | $start-element/following-sibling::p"/>
      </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
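
A minimal sketch of how the stylesheet above might be run over a folder of saved pages with `xsltproc --html`. The function name `filter_pages`, the stylesheet file name `filter.xsl`, and the directory names `pages` and `filtered` are assumptions made for illustration; they are not part of the original answer.

```shell
#!/bin/sh
# Sketch: apply a stylesheet to every saved page in a directory.
# filter_pages, filter.xsl, pages and filtered are hypothetical names.

# Usage: filter_pages STYLESHEET INPUT_DIR OUTPUT_DIR
filter_pages() {
    stylesheet=$1
    input_dir=$2
    output_dir=$3
    mkdir -p "$output_dir"
    for page in "$input_dir"/*.html; do
        [ -e "$page" ] || continue   # the glob matched nothing
        # --html parses the input with the lenient HTML parser;
        # --output names the file the result is written to.
        xsltproc --html --output "$output_dir/$(basename "$page")" \
            "$stylesheet" "$page"
    done
}

# Example call (commented out so that sourcing this file has no side effects):
# filter_pages filter.xsl pages filtered
```

Each page is written to its own result file, so combining the filtered pages into one book file remains a separate step.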

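If the HTML document ends up in that odd namespace your input has, you will have to bind a prefix to that namespace in XSLT 1 and select and match element nodes by qualified name, using prefix:local-name (for example xhtml:body or xhtml:p, with the namespace declared as xmlns:xhtml="../../../../external.html?link=http://www.w3.org/1999/xhtml").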
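
The namespace-qualified variant described above can be sketched as follows, assuming the page is parsed as XML (xsltproc without --html) so that its elements carry the namespace URI declared in the input. The helper name `write_ns_stylesheet` and the file names in the example are hypothetical, and the xsl:output line is left out, so the default output method applies.

```shell
#!/bin/sh
# Sketch: namespace-aware variant of the stylesheet, for pages parsed as XML.
# write_ns_stylesheet is a hypothetical helper that writes the stylesheet.

# Usage: write_ns_stylesheet OUTPUT_FILE
write_ns_stylesheet() {
    cat > "$1" <<'EOF'
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xhtml="../../../../external.html?link=http://www.w3.org/1999/xhtml"
    version="1.0">

  <!-- Identity template: copy everything no other template handles. -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Same logic as the no-namespace version, with every element name
       qualified by the xhtml prefix bound above. -->
  <xsl:template match="xhtml:body">
    <xsl:copy>
      <xsl:variable name="start-element" select="//xhtml:p[@class = 'page-title']"/>
      <xsl:apply-templates select="$start-element | $start-element/following-sibling::xhtml:p"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
EOF
}

# Example (hypothetical file names; the input must be well-formed XML):
# write_ns_stylesheet filter-ns.xsl
# xsltproc filter-ns.xsl page.xhtml > out.xhtml
```

The prefix itself is arbitrary; what has to match is the namespace URI, which must be character for character the one used in the input. Because that URI is relative, the parser may print a warning about it.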

Votes: 0

Stack Overflow user

Answered 2021-08-22 15:30:25

Here is a basic solution, a Perl script that performs the required extraction:

#!/usr/bin/perl
use strict;
use warnings;

my $LCount = 0;   # lines read
my $ICount = 0;   # lines ignored
my $DCount = 0;   # lines extracted
my $Line;         # current line

if (@ARGV == 0) {   # no parameters: print usage
    print "\n";
    print "extract.pl [input-file] [output-file]\n";
    print "\n";
    exit;
}

# Both file names are required.
if (@ARGV < 2) { die "Too few parameters!\n"; }
if (@ARGV > 2) { die "Too many parameters!\n"; }

my $InputFile = $ARGV[0];
my $OutputFile = $ARGV[1];


###############################################################################
# Main program
###############################################################################

open(my $InFile, '<', $InputFile) or die "Error opening '$InputFile': $!\n";
open(my $OutFile, '>', $OutputFile) or die "Error opening '$OutputFile': $!\n";

while (defined($Line = <$InFile>)) {
    $LCount++;

    if ($Line =~ /^<p/) {                 # keep only lines that start with <p
        if ($Line =~ /class="about"/) {   # skip the fixed "about" paragraph
            $ICount++;
        } else {
            $DCount++;
            print $OutFile $Line;
        }
    } else {
        $ICount++;
    }
}

close($InFile) or die "Error closing '$InputFile': $!\n";
close($OutFile) or die "Error closing '$OutputFile': $!\n";

print "\n$LCount lines from $InputFile processed.\n";
print "$DCount lines extracted.\n";
print "$ICount lines ignored.\n\n";

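With a few more lines, further content can be filtered out, and an HTML frame around the output is optional too.

Still, it would be interesting to know whether something similarly simple can be done with XSLT.

Votes: 0

Stack Overflow user

Answered 2021-08-22 15:50:36

In this particular case, basic filtering can be done in the shell with a simple grep: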

grep "<p" 1.html > out.html

The Perl solution is preferred, because more options for behaviour and filtering can be implemented in it.
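
As a rough sketch, the grep approach can be brought closer to what the Perl script does by dropping the class="about" paragraph and appending the result of several pages to one file. The function names `extract_paragraphs` and `merge_pages` and the file names in the example are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: line-based filtering and merging with grep.
# extract_paragraphs and merge_pages are hypothetical helper names.

# Print the lines of one page that start with a <p> tag,
# except the fixed class="about" paragraph.
extract_paragraphs() {
    grep '^<p[ >]' "$1" | grep -v 'class="about"'
}

# Usage: merge_pages OUTPUT_FILE PAGE...
# Writes the extracted lines of all pages, in argument order, to OUTPUT_FILE.
merge_pages() {
    merged_file=$1
    shift
    : > "$merged_file"   # create or truncate the output file
    for page in "$@"; do
        extract_paragraphs "$page" >> "$merged_file"
    done
}

# Example call (commented out so that sourcing this file has no side effects):
# merge_pages book.html 1.html 2.html
```

The pattern `^<p[ >]` is slightly stricter than `"<p"`: it only accepts lines that begin with the tag, and it does not match other tags whose names start with p, such as `<pre>`. Like the Perl script, this only works as long as each paragraph sits on a single line.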

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/68881985
