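Below is my sample data and what I have tried with XPath. My goal is to modify all text in the HTML while excluding the script and style tags and a few classes (noparse, generic).
Here is a link to my sample input and PHP script:
https://3v4l.org/urIBl#v7.4.21
Can someone point me in the right direction?

My input: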
$html=<<<doc
<html>
<head>
<title>My page</title>
<script>
//<![CDATA[
$(function(){
$('.ajax').trigger('change');
})
//]]></script>
<style>ul li ol li{color;red;}</style>
</head>
<body>
<div>
<ul>
<li>Languages
<ol>
<li>PHP</li>
<li class='noparse'>C++</li>
</ol>
</li>
</ul>
<span>inline text</span>
<p class="generic">some long text data</p>
<a href="https://stackoverflow.com" title="resource hub">Stack Overflow</a>
<a href="https://google.nl" title="Google" class="inline-a noparse otherclass">Google</a>
<img class="img-responsive parse round red" src="" alt="round image" />
<img class="img-responsive noparse round red" src="" alt="square image" />
</div>
</body>
</html>
doc;

This is what I have tried:
<?php
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->loadHTML($html, LIBXML_SCHEMA_CREATE);
$xpath = new DOMXPath($dom);
$exclude='.generic,.noparse';
foreach ($xpath->query("//*/text()[not(@class='$exclude')]|//a/@title[not(@class='$exclude')]|//img/@alt[not(@class='$exclude')]") as $node)
{
    $node->textContent=$node->textContent.' powered by sometext';
}
echo $dom->saveHTML();
?>

Expected result:
<html>
<head>
<title>My page powered by sometext</title>
<script>
//<![CDATA[
$(function(){
$('.ajax').trigger('change');
})
//]]></script>
<style>ul li ol li{color;red;}</style>
</head>
<body>
<div>
<ul>
<li>Languages powered by sometext
<ol>
<li>PHP powered by sometext</li>
<li class='noparse'>C++</li>
</ol>
</li>
</ul>
<span>inline text powered by sometext</span>
<p class="generic">some long text data</p>
<a href="https://stackoverflow.com" title="resource hub powered by sometext">Stack Overflow powered by sometext</a>
<a href="https://google.nl" title="Google" class="inline-a noparse otherclass">Google</a>
<img class="img-responsive parse round red" src="" alt="round image powered by sometext" />
<img class="img-responsive noparse round red" src="" alt="square image" />
</div>
</body>
</html>

This is what I get from my script (not the desired output):
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
powered by sometext<head>
powered by sometext<title>My page powered by sometext</title>
powered by sometext<script>
//<![CDATA[
$(function(){
$('.ajax').trigger('change');
})
//]]> powered by sometext</script>
powered by sometext<style>ul li ol li{color;red;} powered by sometext</style>
powered by sometext</head>
powered by sometext<body>
powered by sometext<div>
powered by sometext<ul>
powered by sometext<li>Languages
powered by sometext<ol>
powered by sometext<li>PHP powered by sometext</li>
powered by sometext<li class="noparse">C++ powered by sometext</li>
powered by sometext</ol>
powered by sometext</li>
powered by sometext</ul>
powered by sometext<span>inline text powered by sometext</span>
powered by sometext<p class="generic">some long text data powered by sometext</p>
powered by sometext<a href="https://stackoverflow.com" title>Stack Overflow powered by sometext</a>
powered by sometext<a href="https://google.nl" title class="inline-a noparse otherclass">Google powered by sometext</a>
powered by sometext<img class="img-responsive parse round red" src="" alt>
powered by sometext<img class="img-responsive noparse round red" src="" alt>
powered by sometext</div>
powered by sometext</body>
powered by sometext</html>发布于 2021-07-05 17:19:57
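Edited
Here is the edited script.
Notes:
//<![CDATA[ comes before <script> in this input, so parsing and output go wrong. If you know what it is but not how to fix the parsing, please answer.
The a href link has no excluded class, yet its class attribute is expected to change, whereas for img it is not:
<a href="https://google.nl" title="Google" class="inline-a noparse otherclass">Google</a>
<img class="img-responsive parse round red" src="" alt="round image powered by sometext" />

<?php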
$html=<<<doc
<html>
<head>
<title>My page</title>
//<![CDATA[
<script>
$(function(){
$('.ajax').trigger('change');
})
//]]></script>
<style>ul li ol li{color;red;}</style>
</head>
<body>
<div>
<ul>
<li>Languages
<ol>
<li>PHP</li>
<li class='noparse'>C++</li>
</ol>
</li>
</ul>
<span>inline text</span>
<p class="generic">some long text data</p>
<a href="https://stackoverflow.com" title="resource hub">Stack Overflow</a>
<a href="https://google.nl" title="Google" class="inline-a noparse otherclass">Google</a>
<img class="img-responsive parse round red" src="" alt="round image" />
<img class="img-responsive noparse round red" src="" alt="square image" />
</div>
</body>
</html>
doc;
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->loadHTML($html, LIBXML_SCHEMA_CREATE);
$xpath = new DOMXPath($dom);
$excluded_tags = array("script", "style");
$excluded_classes=array('generic', 'noparse');
$nodes = $xpath->query("//*");
foreach ($nodes as $node)
{
    if ($node && $node->nodeName) {
        if (!in_array($node->nodeName, $excluded_tags)) {
            if (0 < $node->childNodes->count() && "#text" === $node->childNodes[0]->nodeName) {
                if (!$node->hasAttribute('class') || !in_array($node->getAttribute('class'), $excluded_classes)) {
                    $nodeValue = preg_replace('/\s+$/', '', $node->childNodes[0]->nodeValue);
                    if (0 != strlen($nodeValue)) {
                        $node->childNodes[0]->nodeValue = $node->childNodes[0]->nodeValue.' powered by sometext';
                        //echo "Node Name: ", $node->nodeName, " Node Child Count: ", $node->childNodes->count(), " Node Child Name: ", $node->childNodes[0]->nodeName, " Node Child Value: ", preg_replace('/\s+$/', '', $node->childNodes[0]->nodeValue), PHP_EOL;
                        if ($node->attributes) {
                            foreach ($node->attributes as $attribute) {
                                if ('href' != $attribute->nodeName) {
                                    $attribute->nodeValue = $attribute->nodeValue.' powered by sometext';
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
echo $dom->saveHTML();

Output:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<title>My page powered by sometext</title></head><body><p>
//
$(function(){
$('.ajax').trigger('change');
})
//]]>
powered by sometext<style>ul li ol li{color;red;}</style>
</p>
<div>
<ul>
<li>Languages
powered by sometext<ol>
<li>PHP powered by sometext</li>
<li class="noparse">C++</li>
</ol>
</li>
</ul>
<span>inline text powered by sometext</span>
<p class="generic">some long text data</p>
<a href="https://stackoverflow.com" title="resource hub powered by sometext">Stack Overflow powered by sometext</a>
<a href="https://google.nl" title="Google powered by sometext" class="inline-a noparse otherclass powered by sometext">Google powered by sometext</a>
<img class="img-responsive parse round red" src="" alt="round image">
<img class="img-responsive noparse round red" src="" alt="square image">
</div>
</body></html>
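
For comparison, here is a minimal sketch of how the exclusion could be pushed into the XPath query itself. It is not the method used in either script above. In the question's query, [not(@class='...')] is evaluated on text and attribute nodes, which never carry a class attribute, so the test is always true. The script above compares the whole class string with in_array, so a value such as "inline-a noparse otherclass" never matches. The sketch tests class tokens on the ancestor elements instead. The function name appendSuffix and the shortened sample input are hypothetical, and it assumes that anything nested inside an excluded element should be skipped as well, which the expected result does not state explicitly.

```php
<?php
// Hypothetical helper (not from the original post): appends $suffix to text
// nodes, a/@title and img/@alt, skipping anything inside an excluded tag or class.
function appendSuffix(string $html, string $suffix, array $excludedTags, array $excludedClasses): string
{
    libxml_use_internal_errors(true);             // collect parser warnings silently
    $dom = new DOMDocument();
    $dom->loadHTML($html, LIBXML_HTML_NODEFDTD);  // do not add a default doctype
    libxml_clear_errors();
    $xpath = new DOMXPath($dom);

    // Build one "this element is excluded" test.
    // Assumption: tag and class names are plain identifiers (no quotes or spaces).
    $tests = [];
    foreach ($excludedTags as $tag) {
        $tests[] = "self::$tag";
    }
    foreach ($excludedClasses as $class) {
        // Token match: "inline-a noparse otherclass" matches "noparse",
        // while "img-responsive parse round" does not.
        $tests[] = "contains(concat(' ', normalize-space(@class), ' '), ' $class ')";
    }
    $skip = $tests ? implode(' or ', $tests) : 'false()';

    // For an attribute node the ancestor axis starts at its owning element,
    // so the same predicate works for text nodes and attributes.
    $keep  = "[not(ancestor::*[$skip])]";
    $query = "//text()[normalize-space()]$keep | //a/@title$keep | //img/@alt$keep";

    foreach (iterator_to_array($xpath->query($query)) as $node) {
        if ($node instanceof DOMAttr) {
            $node->ownerElement->setAttribute($node->name, $node->value . $suffix);
        } else {
            // Insert the suffix before any trailing whitespace of the text node.
            $text    = $node->nodeValue;
            $trimmed = rtrim($text);
            $node->nodeValue = $trimmed . $suffix . substr($text, strlen($trimmed));
        }
    }
    return $dom->saveHTML();
}

// Shortened, hypothetical version of the question's input.
$html = <<<'HTML'
<html><head><title>My page</title>
<script>$(function(){ $('.ajax').trigger('change'); })</script>
<style>ul li ol li{color:red;}</style>
</head><body><div>
<ul><li>Languages<ol><li>PHP</li><li class="noparse">C++</li></ol></li></ul>
<p class="generic">some long text data</p>
<a href="https://stackoverflow.com" title="resource hub">Stack Overflow</a>
<a href="https://google.nl" title="Google" class="inline-a noparse otherclass">Google</a>
<img class="img-responsive parse round red" src="" alt="round image">
<img class="img-responsive noparse round red" src="" alt="square image">
</div></body></html>
HTML;

$out = appendSuffix($html, ' powered by sometext', ['script', 'style'], ['generic', 'noparse']);
echo $out;
```

Whitespace-only text nodes are filtered out by [normalize-space()], which is what keeps the suffix from being appended between tags as in the question's output.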

https://stackoverflow.com/questions/68258380