Q: PHP nodeValue and character count

Stack Overflow user

Asked 2012-10-11 01:31:32

Answers: 2 · Views: 281 · Followers: 0 · Votes: 1

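I have a string variable holding the contents of a text file (such as .html) that I read with fopen(). I then run it through strip_tags() so that I can use the untagged text for article previews. Before doing that, though, I need to get the h1 nodeValue and count its characters, so that I can replace the zero in the code below with that count and end at that count plus 150.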

$f = fopen($filepath,"r");
$WholeFile = fread($f, filesize($filepath));
fclose($f);
$StrippedFile=strip_tags($WholeFile);
$TextExtract = mb_substr("$StrippedFile", 0,150);

What is the best way for me to do this? Is a parser the answer? So far this is the only case where I will be extracting values from HTML tags.

php

2 Answers

Stack Overflow user

Accepted answer

Posted 2012-10-11 01:43:12

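If you are sure about the content of the files you are processing and know that the title is in an H1, you could split the string into two at the position of </h1> (for example with strstr(), although there are many ways to do this).

You can then strip the tags from the first part to get the title and from the second part to get the content. This assumes that the only markup preceding the article content in your file is the h1 element containing the title.

Keep in mind that this is not the best way to parse large numbers of online articles. For a more general solution, I would look into a dedicated parser class.

Here is a code sample: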

$f = fopen($filepath,"r");
$WholeFile = fread($f, filesize($filepath));
fclose($f);
// Modified part
$content = strip_tags(strstr($WholeFile, '</h1>'));
$title = strip_tags(strstr($WholeFile, '</h1>', true)); // The third argument (before_needle) requires PHP 5.3.0 or later
$TextExtract = mb_substr($content, 0,150);
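The snippet above does not yet count the title's characters, which is what the question asked for, and it does not handle a file without an h1. A minimal sketch of how that could look follows. It is not part of the original answer: the function name `extractTitleAndPreview` and the sample string are hypothetical, and it assumes UTF-8 input and the mbstring extension.

```php
<?php
// Hypothetical helper (not from the original answer): split an HTML string at
// the first closing </h1> tag and return the title, its length, and a preview.
function extractTitleAndPreview(string $html, int $previewLength = 150): ?array
{
    // stripos() is case-insensitive, so </H1> is matched as well.
    $pos = stripos($html, '</h1>');
    if ($pos === false) {
        return null; // no closing h1 tag; the caller decides on a fallback
    }

    // Byte offsets are safe here because the split happens at an ASCII tag.
    $title   = trim(strip_tags(substr($html, 0, $pos)));
    $content = trim(strip_tags(substr($html, $pos + strlen('</h1>'))));

    return [
        'title'        => $title,
        // mb_strlen() counts characters, not bytes.
        'title_length' => mb_strlen($title, 'UTF-8'),
        'preview'      => mb_substr($content, 0, $previewLength, 'UTF-8'),
    ];
}

// Assumed sample input, for illustration only.
$sample = '<h1>Hello World</h1><p>This is the <b>article</b> body.</p>';
$result = extractTitleAndPreview($sample);

echo $result['title'], ' (', $result['title_length'], " characters)\n";
echo $result['preview'], "\n";
```

Like the answer it extends, this still treats everything before </h1> as the title, so it only suits files whose layout you control.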

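Votes: 0

Stack Overflow user

Posted 2012-10-11 03:01:33

When you have structured text (HTML, XML, JSON, YAML, and so on), you should always use a proper parser unless you have a very good reason not to.

In this case you might be able to get away with regular expressions, but your solution would be very brittle and could run into problems with character encoding, entities, or whitespace. The solutions above will all break in subtle ways. For example, suppose you have input like this: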

<html><head><meta http-equiv="content-type" content="text/html;charset=utf-8" />
<title>Page title</title></head>
<body><div><h1 title="attributes or the space in the closing tag may confuse code"
>Title &mdash;    maybe emdash counted as 7 characters</h1 >
<p> and      whitespace counted excessively too. And here's
a utf-8 character that may get split in the middle: ©; creating  
an invalid string.</p></div></body></html>
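To make two of these pitfalls concrete, here is a small sketch of how entities inflate a character count and how byte-based truncation can cut a multi-byte character in half. The sample string and variable names are assumptions chosen for illustration, and the mbstring extension is assumed to be available.

```php
<?php
// Assumed sample text containing two HTML entities.
$raw = 'Title &mdash; caf&eacute;';

// Counting before decoding treats "&mdash;" as 7 characters and "&eacute;" as 8.
$undecodedLength = mb_strlen($raw, 'UTF-8');

// Decode entities to real UTF-8 characters first.
$decoded       = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');
$decodedLength = mb_strlen($decoded, 'UTF-8'); // characters
$byteLength    = strlen($decoded);             // bytes

// substr() works on bytes: cutting at byte 7 lands inside the 3-byte em dash.
$byteCut = substr($decoded, 0, 7);
// mb_substr() works on characters, so the em dash stays whole.
$charCut = mb_substr($decoded, 0, 7, 'UTF-8');

var_dump($undecodedLength, $decodedLength, $byteLength);
var_dump(mb_check_encoding($byteCut, 'UTF-8')); // false: invalid UTF-8
var_dump(mb_check_encoding($charCut, 'UTF-8')); // true
```

Here the same text measures 25 characters undecoded, 12 characters decoded, and 15 bytes, which is why a count taken from raw markup rarely matches what the reader sees.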

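Here is a solution using DOMDocument and DOMXPath that should work on all but the worst HTML. It returns a UTF-8 excerpt of at most 150 characters (characters, not bytes), with all entities normalized to their character values.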

$html = '<html><head><meta http-equiv="content-type" content="text/html;charset=utf-8" />
<title>Page title</title></head>
<body><div><h1 title="attributes or the space in the closing tag may confuse code"
>Title &mdash;    maybe emdash counted as 7 characters</h1 >
<p> and      whitespace counted excessively too. And here\'s
a utf-8 character that may get split in the middle: ©; creating  
an invalid string.</p></div></body></html>';


$doc = new DOMDocument();
$doc->loadHTML($html);
// if you have a url or filename, you can use this instead:
// $doc->loadHTMLFile($url);
$xp = new DOMXPath($doc);

// you can easily modify the XPath query to match the "title" of different documents
$titlenode = $xp->query('/html/body//h1[1]');

$xpath = 'normalize-space(substring(
        concat(
            normalize-space(.),
            " ",
            normalize-space(./following-sibling::*)
        ), 1, 150))'; // XPath string positions are 1-based; starting at 0 would yield only 149 characters


$excerpt = null;
if ($titlenode->length) {
    $excerpt = $xp->evaluate($xpath, $titlenode->item(0));
}

var_export($excerpt);

This code will output:

'Title — maybe emdash counted as 7 characters and whitespace counted excessively too. And here\'s a utf-8 character that may get split in the middle: ©;'

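The basic idea is to match the title (or any heading element) with XPath, then take the string value of that element together with the element that follows it and truncate the result to 150 characters, also in XPath. Note that normalize-space() applied to a node-set uses only its first node, so only the first following sibling contributes to the excerpt. Keeping everything in XPath avoids all the messy character-set and entity issues you would otherwise have to handle in PHP.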
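If you also need the title's character count on its own, as the question asks, the same DOM objects can supply it through nodeValue. The following is a minimal sketch rather than part of the original answer: the sample markup and variable names are assumptions, and it gathers every following sibling in PHP instead of only the first one.

```php
<?php
// Assumed sample document, for illustration only.
$html = '<html><head><meta http-equiv="content-type" content="text/html;charset=utf-8" /></head>'
      . '<body><h1>Title &mdash; demo</h1><p>First paragraph.</p><p>Second   paragraph.</p></body></html>';

// Collect parser warnings instead of printing them for imperfect markup.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xp = new DOMXPath($doc);
// (//h1)[1] selects the first h1 in document order.
$h1 = $xp->query('(//h1)[1]')->item(0);

$title       = '';
$titleLength = 0;
$preview     = '';

if ($h1 !== null) {
    // For element nodes, PHP's nodeValue returns the text content with entities decoded.
    $title       = trim(preg_replace('/\s+/u', ' ', $h1->nodeValue));
    $titleLength = mb_strlen($title, 'UTF-8');

    // Join the text of all following siblings, collapsing runs of whitespace.
    $parts = [];
    foreach ($xp->query('following-sibling::*', $h1) as $sibling) {
        $parts[] = trim(preg_replace('/\s+/u', ' ', $sibling->textContent));
    }
    $preview = mb_substr(implode(' ', $parts), 0, 150, 'UTF-8');
}

var_export([$title, $titleLength, $preview]);
```

Doing the truncation with mb_substr() is safe here because the DOM extension always hands strings back to PHP as UTF-8.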

Votes: 2

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/12825042
