I have HTML text like this:
<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>
<othertag>
data
</othertag>
<moretag>
data
</moretag>
I tried querying it with the following XPath:
//p//text() | //othertag//text() | //moretag//text()
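It gives me the text broken up at every <br> tag, like this:
('This is some important data','Even this is data','this is useful too','othertag text data','moretag text data')
I want it as one complete string,
('This is some important data Even this is data this is useful too')
because I will be querying other elements with the | union operator, and it is very important that the text content is divided up correctly.
How can I do this?
If this is not possible, can I at least get the inner HTML of <p>,
so that I can store the text as
This is some important data<br>Even this is data<br>this is useful too
I'm using lxml in Python 2.7.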
Posted on 2015-07-27 14:09:56
Update
Based on your edit, perhaps you can use the XPath string() function. For example:
>>> doc.xpath('string(//p)')
'\n This is some important data\n \n Even this is data\n \n this is useful too\n '
(Original answer follows)
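As a sketch building on the string() update above: string(//p) only converts the first node of a node-set, so it does not combine directly with the | union from the question. One possible way around this, assuming the sample markup from the question is available as a string (the variable names raw, root, elements and texts are illustrative), is to select the elements themselves and evaluate normalize-space(.) on each one, which also collapses the line breaks and indentation:

```python
from lxml import etree

raw = """
<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>
<othertag>
data
</othertag>
<moretag>
data
</moretag>
"""

# Parse the fragment with the lenient HTML parser.
root = etree.fromstring(raw, etree.HTMLParser())

# Select the elements (not their text nodes) with the union operator,
# then ask XPath for each element's whitespace-normalized string value.
elements = root.xpath('//p | //othertag | //moretag')
texts = [el.xpath('normalize-space(.)') for el in elements]
print(texts)
```

This should give one string per matched element, in document order, so the text of the p element is no longer split at the <br> tags.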
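If you are getting back the multiple pieces of text you want:
('This is some important data','Even this is data','this is useful too')
why not just join those pieces? The query returns them as a list:
>>> doc.xpath('//p/text()')
['\n This is some important data\n ', '\n Even this is data\n ', '\n this is useful too\n ']
You can join them and strip the line breaks as you go: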
>>> ' '.join(x.strip() for x in doc.xpath('//p/text()'))
'This is some important data Even this is data this is useful too'
If you want the "inner HTML" of the p element, you can call lxml.etree.tostring on all of its children:
>>> ''.join(etree.tostring(x) for x in doc.xpath('//p')[0].getchildren())
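'<br/>\n Even this is data\n <br/>\n this is useful too\n '
Note that this covers only the child elements and their tail text; the text before the first <br> is stored in the p element's .text attribute and is not included in this result.
Note: all of these examples assume:
>>> from lxml import etree
>>> doc = etree.parse(open('myfile.html'),
...                   parser=etree.HTMLParser())
Posted on 2015-07-27 14:24:27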
You can also expose your own functions to XPath:
import lxml.html, lxml.etree
raw_doc = '''
<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>
'''
doc = lxml.html.fromstring(raw_doc)
ns = lxml.etree.FunctionNamespace(None)
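def cat(context, a):
    # Concatenate the text nodes passed in as the node-set argument.
    return [''.join(a)]
ns['cat'] = cat
print(repr(doc.xpath('cat(//p/text())')))
Which prints:
'\nThis is some important data\n\nEven this is data\n\nthis is useful too\n'
You can use this approach to perform any transformation you like.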
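As a variation on the same idea, here is a minimal sketch that strips each piece and joins them with spaces, which is the single string the question asks for. It is not the code from the answer above: it passes the function through the extensions argument of xpath(), so it is registered for that one call only rather than globally through FunctionNamespace. The name join_text is a hypothetical helper chosen for illustration.

```python
from lxml import etree

raw = """
<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>
"""
root = etree.fromstring(raw, etree.HTMLParser())

def join_text(context, nodes):
    # `nodes` is the evaluated node-set argument; for //p/text() it is a
    # list of text strings. Strip each one and drop any empty pieces.
    return ' '.join(s.strip() for s in nodes if s.strip())

# join_text is a hypothetical name; (None, 'join_text') registers it
# without a namespace prefix, for this xpath() call only.
result = root.xpath('join_text(//p/text())',
                    extensions={(None, 'join_text'): join_text})
print(result)
```

Keeping the registration local to the call avoids changing XPath behaviour elsewhere in the program, at the cost of passing extensions every time the function is needed.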
https://stackoverflow.com/questions/31655262