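We have thousands of closed caption XML files that have to be imported into a database as plain text, while also preserving the HTML markup for conversion to another CC format. I've been able to pull out the plain text quite easily, but I can't seem to find the correct way to pull out the raw HTML as well.
Is there a way to get something like "->htmlContent", in the same way that ->textContent works below?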
// Fetch the caption file with a 60-second HTTP timeout.
$ctx = stream_context_create(array('http' => array('timeout' => 60)));
$xml = @file_get_contents('http://blah-blah-blah/16TH.xml', false, $ctx);
$dom = new DOMDocument;
$dom->loadXML($xml);
$ptags = $dom->getElementsByTagName("p");
foreach ($ptags as $p) {
    $text = $p->textContent; // plain text only; the markup is dropped
}
The <p> being processed:
<p begin="00:00:14.83" end="00:00:18.83" tts:textAlign="left">
<metadata ccrow="12" cccol="8"/>
(male narrator)<br></br> THE 16TH AND 17TH CENTURIES<br></br> WERE THE FORMATIVE 200 YEARS
</p>
->textContent result:
(male narrator) THE 16TH AND 17TH CENTURIES WERE THE FORMATIVE 200 YEARS
Desired HTML result:
(male narrator)<br></br> THE 16TH AND 17TH CENTURIES<br></br> WERE THE FORMATIVE 200 YEARS
Posted on 2015-11-03 09:47:17
In other words, you want to keep specific nodes: the br elements and the text nodes. You can do that with DOM + XPath:
$document = new DOMDocument();
$document->preserveWhiteSpace = false; // drop whitespace-only text nodes
$document->loadXML($html);
$xpath = new DOMXPath($document);
foreach ($xpath->evaluate('//p') as $p) {
    $content = '';
    // br elements and text nodes below this p
    foreach ($xpath->evaluate('.//br|.//text()', $p) as $node) {
        $content .= $document->saveHTML($node);
    }
    var_dump($content);
}
Output:
string(86) "
(male narrator)<br> THE 16TH AND 17TH CENTURIES<br> WERE THE FORMATIVE 200 YEARS
"
XPath expressions
Any descendant br: .//br
Any descendant text node: .//text()
Union of both: .//br|.//text()
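To make the union expression concrete, here is a minimal sketch that lists which node types it selects. The inline XML is a shortened, namespace-free stand-in for the caption paragraph, and the $kinds array is only there for illustration.

```php
<?php
// Shortened stand-in for one caption paragraph (no namespaces here).
$xml = '<p><metadata ccrow="12"/>(male narrator)<br></br> THE 16TH AND 17TH CENTURIES<br></br> WERE THE FORMATIVE 200 YEARS</p>';

$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXPath($document);

// Record the kind of every node matched by the union expression.
$kinds = [];
foreach ($xpath->evaluate('.//br|.//text()', $document->documentElement) as $node) {
    $kinds[] = $node instanceof DOMText ? 'text' : $node->nodeName;
}

echo implode(', ', $kinds), "\n";
```

The metadata element never shows up in the list because neither branch of the union matches it, which is why it disappears from the collected content.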
Namespaces
If the XML uses namespaces, you have to register them and use them in the expressions.
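A small sketch of why the registration matters, assuming a document that declares the TTAF namespace as its default namespace (the inline XML is made up for the demonstration): an unprefixed name in XPath only matches elements in no namespace, so '//p' finds nothing until a prefix is bound to the namespace URI.

```php
<?php
// Made-up document with a default namespace, as caption files typically have.
$xml = '<tt xmlns="http://www.w3.org/2006/04/ttaf1"><body><div><p>ONE<br/>TWO</p></div></body></tt>';

$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXPath($document);

// The prefix "tt" is our own choice; only the URI has to match the document.
$xpath->registerNamespace('tt', 'http://www.w3.org/2006/04/ttaf1');

$plain = $xpath->evaluate('//p')->length;       // unprefixed: no-namespace elements only
$prefixed = $xpath->evaluate('//tt:p')->length; // prefixed: elements in the TTAF namespace

var_dump($plain, $prefixed);
```

Applied to the caption files, the same registration looks like this: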
$document = new DOMDocument();
$document->preserveWhiteSpace = false;
$document->loadXML($html);
$xpath = new DOMXPath($document);
// Bind the prefix "tt" to the document's namespace URI for use in XPath.
$xpath->registerNamespace('tt', 'http://www.w3.org/2006/04/ttaf1');
foreach ($xpath->evaluate('//tt:p') as $p) {
    $content = '';
    foreach ($xpath->evaluate('.//tt:br|.//text()', $p) as $node) {
        $content .= $document->saveHTML($node);
    }
    var_dump($content);
}
Posted on 2015-11-01 18:13:43
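I couldn't see the forest for the trees. Once I realized that strip_tags() was failing because of the closing tag on the BR markup, the solution turned out to be quite simple: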
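foreach ($ptags as $p) {
    $text = $p->textContent;
    $html = $p->ownerDocument->saveXML($p); // Raw markup of the node, including the <p> tag itself
    $html = str_ireplace(array('<br></br>', '<br/>'), '<br>', $html); // Normalize the BR variants (saveXML may emit <br/>)
    $html = strip_tags($html, '<br>'); // Strip the tags I don't need
}
There may be a more elegant solution using DOM or regex, but this gets the job done.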
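One possible DOM-only variant, as a rough sketch rather than a drop-in replacement: serialize each child of the p node with saveXML() and skip the elements that are not wanted. The function name innerMarkup and its $skip parameter are hypothetical, and the inline XML is a shortened stand-in for the real caption paragraph.

```php
<?php
/**
 * Hypothetical helper: concatenate the serialized children of $node,
 * leaving out child elements whose local name is listed in $skip.
 */
function innerMarkup(DOMNode $node, array $skip = []): string
{
    $out = '';
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMElement && in_array($child->localName, $skip, true)) {
            continue; // e.g. the metadata element
        }
        $out .= $node->ownerDocument->saveXML($child);
    }
    return $out;
}

// Shortened stand-in for one caption paragraph.
$xml = '<p begin="00:00:14.83"><metadata ccrow="12" cccol="8"/>(male narrator)<br></br> THE 16TH AND 17TH CENTURIES</p>';

$dom = new DOMDocument();
$dom->loadXML($xml);
$p = $dom->getElementsByTagName('p')->item(0);

$html = innerMarkup($p, ['metadata']);
echo $html, "\n";
```

Because the children are serialized one by one, the surrounding p tag is never part of the result, so no strip_tags() pass is needed. The exact spelling of the empty br element in the output depends on the serializer, so normalize it afterwards if the target format needs a specific form.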
https://stackoverflow.com/questions/33463559