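Hi all!
How do I parse a well-formed XML file with the Symfony2 DomCrawler component?
I need to split the book into its sections and collect the inner tags (epigraph, p, poem, and so on), so that each section holds only the content that belongs to it.
Below is a sample of the standard FB2 book XML format: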
<?xml version="1.0" encoding="utf-8"?>
<FictionBook xmlns="http://www.gribuser.ru/xml/fictionbook/2.0" xmlns:l="http://www.w3.org/1999/xlink">
    <description></description>
    <body>
        <section>
            <title><p><strong>Level 1, section 1</strong></p></title>
            <section>
                <title><p><strong>Level 2, section 2</strong></p></title>
                <section>
                    <title><p><strong>Level 3, section 3</strong></p></title>
                    <p>Level 3, section 3, paragraph 1</p>
                    <poem>
                        <stanza>
                            <v>bla-bla-bla 1</v>
                            <v>bla-bla-bla 2</v>
                            <v>bla-bla-bla 3</v>
                        </stanza>
                    </poem>
                    <p>Level3, section 3, paragraph 2</p>
                    <subtitle><strong>x x x</strong></subtitle>
                </section>
                <section>
                    <title><p><strong>Level 3, section 4</strong></p></title>
                    <p>Level 3, section 4, paragraph 1</p>
                    <p>Level 3, section 4, paragraph 2</p>
                    <subtitle><strong>x x x</strong></subtitle>
                </section>
                <section>
                    <title><p><strong>Level 3, section 5</strong></p></title>
                    <p>Level 3, section 5, paragraph 1</p>
                    <p>Level 3, section 5, paragraph 2</p>
                    <p>Level 3, section 5, paragraph 3</p>
                    <empty-line/>
                    <subtitle>This file was created</subtitle>
                    <subtitle>with BookDesigner program</subtitle>
                    <subtitle>bookdesigner@the-ebook.org</subtitle>
                    <subtitle>22.04.2004</subtitle>
                </section>
            </section>
        </section>
    </body>
</FictionBook>
The code below does not work. Can anyone help me fix it? By the way, the titles are parsed correctly, but the tags inside the sections are not.
private function loadBookSections(Crawler $crawler)
{
    $sections = $crawler->filter('section')->each(function (Crawler $node) {
        // Keep only the first node matched by 'section' from the current node.
        $c = $node->filter('section')->reduce(function (Crawler $node, $i) {
            return ($i == 0);
        });

        return array(
            'title' => $node->filter('title')->text(),
            'inner' => $c->html(),
        );
    });

    echo "*******************************************\n";
    foreach ($sections as $section) {
        echo ">>> ".$section['title']."\n";
        echo "!!! ".$section['inner']."\n";
    }
}
Thanks for your help!
Posted 2013-11-20 15:12:34
Four days later, I found a solution using XPath:
private function loadBookSections(Crawler $crawler)
{
    $sections = $crawler->filter('section')->each(function (Crawler $node) {
        return array(
            'title' => $node->filter('title')->text(),
            // The XPath matches elements that have no <section> child;
            // html() returns the inner markup of the first match.
            'inner' => $node->filterXPath("//*[not(section)]")->html(),
        );
    });

    foreach ($sections as $section) {
        echo "TITLE: ".$section['title']."\n";
        echo "INNER: ".$section['inner']."\n";
    }
}
Posted 2013-11-18 12:57:48
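If you condense your XML file considerably, you get something like this:
<section>
    <section>
        <!-- ... -->
    </section>
    <section>
        <!-- ... -->
    </section>
    <section>
        <!-- ... -->
    </section>
</section>
You want to capture the child section elements, not the parent one.
Currently you iterate only over the list of parent section elements, which means you only get the HTML of the parent section element.
To iterate over the children, select "section section" instead of "section".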
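Since the sample nests sections three levels deep, it may also help to see the "own content only" idea spelled out. The following is a minimal sketch, not a DomCrawler solution: it assumes PHP's built-in DOM extension, and the function name collectOwnContent() and the "fb" prefix are made up for illustration. It visits every section at any depth and keeps only the direct children that are neither the title nor a nested section.

```php
<?php
// Sketch only: plain DOM extension instead of DomCrawler.
// collectOwnContent() and the "fb" prefix are illustrative names.
function collectOwnContent($xml)
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);

    $xpath = new DOMXPath($doc);
    // FB2 declares a default namespace, so XPath needs an explicit prefix for it.
    $xpath->registerNamespace('fb', 'http://www.gribuser.ru/xml/fictionbook/2.0');

    $result = array();
    // "//fb:section" matches sections at every nesting level.
    foreach ($xpath->query('//fb:section') as $section) {
        // Text of this section's own title (a direct child), not a nested one.
        $title = trim($xpath->evaluate('string(./fb:title)', $section));

        // Direct child elements that are neither the title nor a nested section.
        $own = $xpath->query('./*[not(self::fb:section) and not(self::fb:title)]', $section);
        $inner = '';
        foreach ($own as $child) {
            $inner .= $doc->saveXML($child);
        }

        $result[] = array('title' => $title, 'inner' => $inner);
    }

    return $result;
}
```

Called with the contents of the FB2 file, this returns one entry per section; a section that holds only a title and nested sections gets an empty 'inner'. How DomCrawler treats the default namespace may depend on its version, so the XPath expressions here are not necessarily drop-in arguments for filterXPath().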
A side note to improve your code further: instead of the ugly reduce call, just use ->first() to get the first element of the node list.
All together, your code would be:
$sections = $crawler->filter('section section')->each(function (Crawler $node) {
    $c = $node->filter('section')->first();

    return array(
        'title' => $node->filter('title')->text(),
        'inner' => $c->html(),
    );
});
https://stackoverflow.com/questions/19979488