
Parsing content in a memory-efficient way, with a few exceptions

Asked by a Stack Overflow user on 2021-07-05 15:09:17
1 answer, 57 views, 0 followers, 0 votes

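Below is my sample data and what I tried with XPath. My goal is to modify all text in the HTML, excluding the script and style tags and a few classes: noparse and generic.

Here is a link to my sample input and PHP script:

https://3v4l.org/urIBl#v7.4.21

Can anyone point me in the right direction?

My input: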

$html=<<<doc
<html>
<head>
<title>My page</title>
<script>
//<![CDATA[
$(function(){
   $('.ajax').trigger('change');
})
//]]></script>
<style>ul li ol li{color;red;}</style>
</head>
<body>
<div>
<ul>
    <li>Languages
        <ol>
        <li>PHP</li>
        <li class='noparse'>C++</li>
        </ol>
    </li>
</ul>
<span>inline text</span>
<p class="generic">some long text data</p>
<a href="https://stackoverflow.com" title="resource hub">Stack Overflow</a>
<a href="https://google.nl" title="Google" class="inline-a noparse otherclass">Google</a>
<img class="img-responsive parse round red" src="" alt="round image" />
<img class="img-responsive noparse round red" src="" alt="square image" />
</div>
</body>
</html>
doc;

And this is what I tried:

<?php


libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->loadHTML($html, LIBXML_SCHEMA_CREATE);
$xpath = new DOMXPath($dom);

$exclude='.generic,.noparse';

foreach ($xpath->query("//*/text()[not(@class='$exclude')]|//a/@title[not(@class='$exclude')]|//img/@alt[not(@class='$exclude')]")  as $node)
{ 
  $node->textContent=$node->textContent.' powered by sometext';
} 

echo $dom->saveHTML();

?>

Expected result:

<html>
<head>
<title>My page powered by sometext</title>
<script>
//<![CDATA[
$(function(){
   $('.ajax').trigger('change');
})
//]]></script>
<style>ul li ol li{color;red;}</style>
</head>
<body>
<div>
<ul>
    <li>Languages  powered by sometext
        <ol>
        <li>PHP  powered by sometext</li>
        <li class='noparse'>C++</li>
        </ol>
    </li>
</ul>
<span>inline text  powered by sometext</span>
<p class="generic">some long text data</p>
<a href="https://stackoverflow.com" title="resource hub  powered by sometext">Stack Overflow  powered by sometext</a>
<a href="https://google.nl" title="Google" class="inline-a noparse otherclass">Google</a>
<img class="img-responsive parse round red" src="" alt="round image  powered by sometext" />
<img class="img-responsive noparse round red" src="" alt="square image" />
</div>
</body>
</html>

This is what I actually get from my script (not the desired output):

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
 powered by sometext<head>
 powered by sometext<title>My page powered by sometext</title>
 powered by sometext<script>
//<![CDATA[

$(function(){
   $('.ajax').trigger('change');
})
//]]> powered by sometext</script>
 powered by sometext<style>ul li ol li{color;red;} powered by sometext</style>
 powered by sometext</head>
 powered by sometext<body>
 powered by sometext<div>
 powered by sometext<ul>
     powered by sometext<li>Languages
         powered by sometext<ol>
         powered by sometext<li>PHP powered by sometext</li>
         powered by sometext<li class="noparse">C++ powered by sometext</li>
         powered by sometext</ol>
     powered by sometext</li>
 powered by sometext</ul>
 powered by sometext<span>inline text powered by sometext</span>
 powered by sometext<p class="generic">some long text data powered by sometext</p>
 powered by sometext<a href="https://stackoverflow.com" title>Stack Overflow powered by sometext</a>
 powered by sometext<a href="https://google.nl" title class="inline-a noparse otherclass">Google powered by sometext</a>
 powered by sometext<img class="img-responsive parse round red" src="" alt>
 powered by sometext<img class="img-responsive noparse round red" src="" alt>
 powered by sometext</div>
 powered by sometext</body>
 powered by sometext</html>
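
Two things in the attempted query account for most of this output. A text node or attribute node has no class attribute of its own, so not(@class='...') is true for every node the query selects, and '.generic,.noparse' is a CSS-style selector string that XPath only compares literally. The whitespace between tags also counts as text nodes, which is where the stray suffix in front of each tag comes from. Below is a minimal sketch of the first point, using a made-up two-paragraph document rather than the input above:

```php
<?php
// Illustration only: the tiny document below is an assumption, not the asker's input.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML('<html><body><p class="noparse">skip me</p><p>keep me</p></body></html>');
$xpath = new DOMXPath($dom);

// Text nodes carry no attributes, so this predicate is true for every text node.
$naive = $xpath->query("//text()[not(@class='noparse')]");

// Testing the parent element instead filters the way the question intends.
$byParent = $xpath->query("//text()[not(parent::*[@class='noparse'])]");

echo $naive->length, ' vs ', $byParent->length, PHP_EOL;
```

The second query still compares the whole class attribute, so it would miss a multi-class value such as "inline-a noparse otherclass".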

1 Answer

Answered by a Stack Overflow user on 2021-07-05 17:19:57

Edited

The edited script is below, after a few notes.

Notes:

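  1. Your input contains the code below. I am not sure what it is; I searched online but found nothing. Parsing and output get messed up because of this syntax: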

  //<![CDATA[
  <script>

If you know what it is but not how to fix the parsing, please reply.

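  2. I am not sure whether you also want the attributes changed. I see some inconsistencies in your expected output, so I did not spend more time on the attribute side: first, the a href has no excluded class, yet its class attribute is expected to change, while for img it is not.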

<a href="https://google.nl" title="Google" class="inline-a noparse otherclass">Google</a>
<img class="img-responsive parse round red" src="" alt="round image  powered by sometext" />
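
On note 1: //<![CDATA[ and //]]> are an XHTML-era idiom. They wrap the script body in a CDATA section for XML parsers, and the leading // hides the markers from the JavaScript engine. In the question's input the opening marker sits inside the script tag; in the copy used in this answer it comes before the tag, which fits the script body landing in a p element in the output further down. The following is a small sketch, assuming a cut-down document, that checks the marker is harmless to loadHTML when it stays inside the script tag:

```php
<?php
// Illustration only: a cut-down document, not the full input from the question.
libxml_use_internal_errors(true);

$html = <<<'HTML'
<html><head><title>t</title>
<script>
//<![CDATA[
var x = 1;
//]]></script>
</head><body><p>hi</p></body></html>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);

// With the marker inside <script>, the element should stay a child of <head>
// and its body should be kept as raw text.
$script = $dom->getElementsByTagName('script')->item(0);
echo $script->parentNode->nodeName, PHP_EOL;
echo $script->textContent, PHP_EOL;
```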

Edited script:

<?php

  $html=<<<doc
  <html>
  <head>
  <title>My page</title>
  //<![CDATA[
  <script>
  $(function(){
     $('.ajax').trigger('change');
  })
  //]]></script>
  <style>ul li ol li{color;red;}</style>
  </head>
  <body>
  <div>
  <ul>
      <li>Languages
          <ol>
          <li>PHP</li>
          <li class='noparse'>C++</li>
          </ol>
      </li>
  </ul>
  <span>inline text</span>
  <p class="generic">some long text data</p>
  <a href="https://stackoverflow.com" title="resource hub">Stack Overflow</a>
  <a href="https://google.nl" title="Google" class="inline-a noparse otherclass">Google</a>
  <img class="img-responsive parse round red" src="" alt="round image" />
  <img class="img-responsive noparse round red" src="" alt="square image" />
  </div>
  </body>
  </html>
  doc;

  libxml_use_internal_errors(true);
  $dom = new DOMDocument();
  $dom->preserveWhiteSpace = false;
  $dom->loadHTML($html, LIBXML_SCHEMA_CREATE);
  $xpath = new DOMXPath($dom);

  $excluded_tags = array("script", "style");
  $excluded_classes=array('generic', 'noparse');

  // Visit every element; text and attributes are changed through their element.
  $nodes = $xpath->query("//*");
  foreach ($nodes as $node)
  {
  
     if ($node && $node->nodeName) {
        // Skip <script> and <style> elements entirely.
        if (!in_array($node->nodeName, $excluded_tags)) {
           // Only the first child is examined, and only if it is a text node.
           if (0 < $node->childNodes->count() && "#text" === $node->childNodes[0]->nodeName) {
             // Note: the whole class attribute is compared, so a multi-class value
             // such as "inline-a noparse otherclass" is not treated as excluded.
             if (!$node->hasAttribute('class') || !in_array($node->getAttribute('class'), $excluded_classes)) {
                // Skip text nodes that contain only whitespace.
                $nodeValue = preg_replace('/\s+$/', '', $node->childNodes[0]->nodeValue);
                if (0 != strlen($nodeValue)) {
                  $node->childNodes[0]->nodeValue = $node->childNodes[0]->nodeValue.' powered by sometext';
                  
                   // Every attribute except href gets the suffix, including class.
                   if ($node->attributes) {
                      foreach ($node->attributes as $attribute) {
                        if ('href' != $attribute->nodeName) {
                           $attribute->nodeValue = $attribute->nodeValue.' powered by sometext';
                        }
                      }
                   }
                }
             }
           }
        }
     }
  } 

  echo $dom->saveHTML();

Output

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<title>My page powered by sometext</title></head><body><p>
//
$(function(){
   $('.ajax').trigger('change');
})
//]]&gt;
 powered by sometext<style>ul li ol li{color;red;}</style>

</p>
<div>
<ul>
    <li>Languages
         powered by sometext<ol>
        <li>PHP powered by sometext</li>
        <li class="noparse">C++</li>
        </ol>
    </li>
</ul>
<span>inline text powered by sometext</span>
<p class="generic">some long text data</p>
<a href="https://stackoverflow.com" title="resource hub powered by sometext">Stack Overflow powered by sometext</a>
<a href="https://google.nl" title="Google powered by sometext" class="inline-a noparse otherclass powered by sometext">Google powered by sometext</a>
<img class="img-responsive parse round red" src="" alt="round image">
<img class="img-responsive noparse round red" src="" alt="square image">
</div>

</body></html>
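
The remaining gaps against the expected result (the multi-class noparse link, the untouched alt, the suffix added to class) could be closed by moving the exclusion logic into the XPath query. The following is only a sketch under stated assumptions: the function name appendSuffix and its parameters are made up for illustration, and class names are assumed to contain no quote characters. It matches classes token by token, skips whitespace-only text nodes, and writes attributes through nodeValue, which is the assignment that kept attribute values intact in the script above, whereas the question's textContent assignment left them empty.

```php
<?php
// Hypothetical helper for illustration; not part of the original answer.
function appendSuffix(string $html, array $excludedClasses, string $suffix): string
{
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($dom);

    // Token-wise class test: finds "noparse" inside "inline-a noparse otherclass"
    // but does not confuse it with "parse".
    $tests = [];
    foreach ($excludedClasses as $class) {
        $tests[] = "contains(concat(' ', normalize-space(@class), ' '), ' {$class} ')";
    }
    $excluded = implode(' or ', $tests);

    $query = implode('|', [
        // Non-blank text outside script/style and outside any excluded element.
        "//text()[normalize-space()][not(ancestor::script or ancestor::style)]"
            . "[not(ancestor::*[{$excluded}])]",
        // title on links and alt on images, unless the element is excluded.
        "//a[not(ancestor-or-self::*[{$excluded}])]/@title",
        "//img[not(ancestor-or-self::*[{$excluded}])]/@alt",
    ]);

    foreach ($xpath->query($query) as $node) {
        // Insert the suffix before any trailing whitespace so line breaks survive.
        $value = $node->nodeValue;
        $trimmed = rtrim($value);
        $node->nodeValue = $trimmed . $suffix . substr($value, strlen($trimmed));
    }

    return (string) $dom->saveHTML();
}
```

With the $html from the question, a call might look like echo appendSuffix($html, ['generic', 'noparse'], ' powered by sometext');. The output will still start with the DOCTYPE line that saveHTML adds.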


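Votes: 1
The original page content is provided by Stack Overflow; the Chinese translation was provided by Tencent Cloud's Xiaowei IT-domain engine.
Original link: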

https://stackoverflow.com/questions/68258380
