Given the following HTML, is there an XPath query that will extract all tagged and untagged text between two <h2> tags? (I am using an R package in RStudio.)
<html>
<h2 id="section1" class="article">Heading 1</h2>
<h3 id="section1.1" class="article">Subheading 1</h3>
<p id="para001" class="article section clear">
Paragraph text 1.</p>
<div id="formula1" class="formula">...<img />...</div>
Untagged text 1.
<sub> Subscripted text. </sub>
Untagged text 2.
<em> Emphasized text. </em>
Untagged text 3.
<span id="bib"> Bibliography text. </span>
Untagged text 4.
<p id="para002" class="article section clear">
Paragraph text 2.</p>
<h3 id="section1.2" class="article">Subheading 2</h3>
<p id="para003" class="article section clear">
Paragraph 3 text.</p>
<h3 id="section1.3" class="article">Subheading 3</h3>
<p id="para004" class="article section clear">
Paragraph 4 text.</p>
<h2 id="section2" class="article">Heading 2</h2>
</html>
I am trying to come up with a query that will return:
Paragraph text 1.
Untagged text 1.
Subscripted text.
Untagged text 2.
Emphasized text.
Untagged text 3.
Bibliography text.
Untagged text 4.
Paragraph text 2.
Paragraph 3 text.
Paragraph 4 text.
What I have tried so far is:
//p[preceding-sibling::h2[@id='section1']
and following-sibling::h2[@id='section2']
and descendant::node()]
which returns:
Paragraph text 1.
Paragraph text 2.
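Paragraph 3 text.
Paragraph 4 text.
I tried using the solution from this question, but my problem is slightly more complicated. I also tried adding following-sibling::text()[1], but it did not extract the untagged text. If there is no good XPath solution, I would happily welcome alternatives such as CSS selectors.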
Posted on 2016-02-04 18:52:20
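First, you do not want to filter only for p tags (that is what the p as the third character of your query does); you need all tags after section1 and before section2. Second, you are looking for all descendants of the tags in between that are text nodes.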
So: find all tags that have both a preceding-sibling::h2[@id='section1'] and a following-sibling::h2[@id='section2']:
//*[preceding-sibling::h2[@id='section1'] and following-sibling::h2[@id='section2']]
Then select all text() nodes beneath those tags:
//*[preceding-sibling::h2[@id='section1'] and following-sibling::h2[@id='section2']]//text()
https://stackoverflow.com/questions/35205168
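One caveat worth checking: `*` matches element nodes only, so text that sits directly between elements (the "Untagged text" lines) is not a descendant of any matched element. An XPath variant that steps through node() instead of *, such as //h2[@id='section1']/following-sibling::node()[following-sibling::h2[@id='section2']]/descendant-or-self::text(), should pick those up as well. Since the question welcomes non-XPath alternatives, here is a minimal sketch of the same idea in Python using only the standard library. The function name `text_between` is hypothetical, and the sketch assumes the markup is well-formed XML (as the sample happens to be) and that both headings are direct children of the root; real-world HTML would need a proper HTML parser instead.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<html>
<h2 id="section1" class="article">Heading 1</h2>
<h3 id="section1.1" class="article">Subheading 1</h3>
<p id="para001" class="article section clear">
Paragraph text 1.</p>
<div id="formula1" class="formula">...<img />...</div>
Untagged text 1.
<sub> Subscripted text. </sub>
Untagged text 2.
<em> Emphasized text. </em>
Untagged text 3.
<span id="bib"> Bibliography text. </span>
Untagged text 4.
<p id="para002" class="article section clear">
Paragraph text 2.</p>
<h3 id="section1.2" class="article">Subheading 2</h3>
<p id="para003" class="article section clear">
Paragraph 3 text.</p>
<h3 id="section1.3" class="article">Subheading 3</h3>
<p id="para004" class="article section clear">
Paragraph 4 text.</p>
<h2 id="section2" class="article">Heading 2</h2>
</html>"""


def text_between(root, start_id, end_id):
    """Collect tagged and untagged text between two sibling <h2> headings.

    Hypothetical helper: assumes both headings are direct children of `root`.
    """
    chunks = []
    inside = False
    for child in root:
        if child.tag == "h2" and child.get("id") == end_id:
            break  # reached the closing heading
        if inside:
            # Tagged text: the element's own text plus that of its descendants.
            chunks.extend(child.itertext())
        elif child.tag == "h2" and child.get("id") == start_id:
            inside = True  # start collecting after this heading
        else:
            continue  # still before the opening heading
        # Untagged text: ElementTree stores it as the "tail" of the
        # element that precedes it.
        if child.tail:
            chunks.append(child.tail)
    # Drop whitespace-only pieces and trim the rest.
    return [c.strip() for c in chunks if c.strip()]


for line in text_between(ET.fromstring(SAMPLE), "section1", "section2"):
    print(line)
```

Like the XPath above, this also returns the subheading text and the "..." pieces from the formula div. If those are unwanted, one option is to skip children whose `child.tag` is "h3" or "div" before collecting their text.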