XPath: how do I get inner text data that has <br> tags scattered through it?

Stack Overflow user

Asked 2015-07-27 14:02:48

Answers: 2, Views: 2.7K, Followers: 0, Votes: 0

I have HTML text like this:

<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>


<othertag>
 data
</othertag>
<moretag>
 data
</moretag>

I am trying to query it with the following XPath:

//p//text() | //othertag//text() | //moretag//text()

It gives me text that is broken up at every <br> tag.

Like this:

('This is some important data','Even this is data','this is useful too','othertag text data','moretag text data')

I want the <p> content as one complete string,

('This is some important data Even this is data this is useful too')

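because I will be querying other elements with the | union operator, and it is very important that the text content is divided up correctly, one string per element.

How can I do this?

If this is not possible,

can I at least get the inner HTML of <p>?

That way I could store the text as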

This is some important data<br>Even this is data<br>this is useful too

I am using lxml in Python 2.7.

python

xml

xpath

2 Answers

Stack Overflow user

Answered 2015-07-27 14:09:56

Update

Based on your edit, perhaps you can use the XPath string() function. For example:

>>> doc.xpath('string(//p)')
'\n    This is some important data\n    \n    Even this is data\n    \n    this is useful too\n  '
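Note that in XPath 1.0, string() applied to a node-set only uses the first node in document order, so it cannot be applied directly to a union such as //p | //othertag | //moretag. One possible workaround, sketched below, is to select the elements with the union and then evaluate normalize-space(.) on each one. This is a minimal sketch, assuming Python 3 and the markup from the question wrapped in html/body; the variable names are illustrative only.

```python
from lxml import etree

# Sample markup mirroring the question, wrapped in html/body (an assumption).
html = """<html><body>
<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>
<othertag>
 data
</othertag>
<moretag>
 data
</moretag>
</body></html>"""

doc = etree.HTML(html)

# Select whole elements with the union instead of their text nodes.
elements = doc.xpath('//p | //othertag | //moretag')

# normalize-space(.) takes the element's full string value (all descendant
# text, ignoring <br>) and collapses runs of whitespace into single spaces.
texts = [el.xpath('normalize-space(.)') for el in elements]
print(texts)
```

Each entry in texts then corresponds to exactly one matched element, which keeps the text divided per element rather than per <br>.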

(Original answer follows)

If you are getting back the text you want in multiple pieces:

('This is some important data','Even this is data','this is useful too')

Why not just join those pieces?

>>> ' '.join(doc.xpath('//p/text()'))
'\n    This is some important data\n     \n    Even this is data\n     \n    this is useful too\n  '

You can even strip out the line breaks:

>>> ' '.join(x.strip() for x in doc.xpath('//p/text()'))
'This is some important data Even this is data this is useful too'

If you want the "inner HTML" of the p element, you can call lxml.etree.tostring on all of its children:

>>> ''.join(etree.tostring(x) for x in doc.xpath('//p')[0].getchildren())
'<br/>\n    Even this is data\n    <br/>\n    this is useful too\n  '
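The output above does not include the text before the first <br>, because in lxml that text is stored on the parent as p.text rather than on a child element. A minimal sketch of a helper that prepends it is shown below. It assumes Python 3, where etree.tostring returns bytes unless encoding="unicode" is passed; inner_html is a hypothetical helper name, not part of lxml.

```python
from lxml import etree

# Sample markup mirroring the question, wrapped in html/body (an assumption).
html = """<html><body><p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p></body></html>"""

doc = etree.HTML(html)


def inner_html(element):
    """Return the leading text of `element` plus its serialized children."""
    # Text before the first child lives on the parent element itself.
    parts = [element.text or ""]
    for child in element:
        # tostring() includes each child's tail text by default.
        parts.append(etree.tostring(child, method="html", encoding="unicode"))
    return "".join(parts)


p = doc.xpath("//p")[0]
result = inner_html(p)
print(repr(result))
```

Iterating over the element directly is used here in place of getchildren(), which lxml documents as deprecated.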

Note: all of these examples assume:

>>> from lxml import etree
>>> doc = etree.parse(open('myfile.html'),
...    parser=etree.HTMLParser())

Votes: 2

Stack Overflow user

Answered 2015-07-27 14:24:27

You can also expose your own functions to XPath:

import lxml.html, lxml.etree

raw_doc = '''
<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>
'''

doc = lxml.html.fromstring(raw_doc)
ns = lxml.etree.FunctionNamespace(None)

def cat(context, a):
    return [''.join(a)]
ns['cat'] = cat

print repr(doc.xpath('cat(//p/text())'))

which prints

'\nThis is some important data\n\nEven this is data\n\nthis is useful too\n'

You can use this approach to perform any transformation you like.
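As one illustration of such a transformation, the sketch below registers an extension function that joins the text nodes and also collapses the whitespace, so the result is the single clean string the question asks for. It assumes Python 3; the function name squash is hypothetical and chosen only for this example.

```python
import lxml.html
from lxml import etree

raw_doc = '''
<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>
'''

doc = lxml.html.fromstring(raw_doc)


def squash(context, nodes):
    # `nodes` is the list of text nodes matched by the XPath argument.
    # Join them, then collapse every run of whitespace to a single space.
    return ' '.join(' '.join(nodes).split())


# Register the function in the default (empty) XPath function namespace.
ns = etree.FunctionNamespace(None)
ns['squash'] = squash

result = doc.xpath('squash(//p/text())')
print(repr(result))
```

Because the function returns a plain string, the XPath call evaluates to a string rather than a list.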

Votes: 2

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/31655262
