
XPath on an XML RSS feed does not work as expected

Stack Overflow user
Asked 2014-10-23 16:26:55

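I am trying to parse this RSS feed in the Scrapy (0.16) shell, but it does not work as expected and I cannot tell what is wrong. It seems that only attributes such as @href are accessible: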

>>> fetch('http://www2c.cdc.gov/podcasts/feed.asp?feedid=183')
2014-10-23 12:20:54-0400 [default] DEBUG: Crawled (200) <GET http://www2c.cdc.go
v/podcasts/feed.asp?feedid=183> (referer: None)
[s] Available Scrapy objects:
[s]   item       {}
[s]   request    <GET http://www2c.cdc.gov/podcasts/feed.asp?feedid=183>
[s]   response   <200 http://www2c.cdc.gov/podcasts/feed.asp?feedid=183>
[s]   settings   <CrawlerSettings module=<module 'ebola.scraper.scrape.settings'
 from 'ebola\scraper\scrape\settings.pyc'>>
[s]   spider     <BaseSpider 'default' at 0x3efc130>
[s]   xxs        <XmlXPathSelector xpath=None data=u'<feed xmlns="http://www.w3.
org/2005/Atom'>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
>>> xxs.select("//entry").extract()
[]
>>> xxs.select("//link").extract()
[]
>>> xxs.select("//link/text()").extract()
[]
>>> xxs.select("//title").extract()
[]
>>> xxs.select("//title/text()").extract()
[]
>>> xxs.select("//link/@href").extract()
[]
>>> xxs.select("//@href").extract()
[u'http://www2c.cdc.gov/podcasts/feed.asp?feedid=183', u'http://www.cdc.gov/medi
a/index.html', u'http://www2c.cdc.gov/podcasts/download.asp?af=h&f=8634459', u'h
ttp://www.cdc.gov/media/releases/2014/images/p1022-post-arrival-monitoring-300x2
00.jpg', u'http://www2c.cdc.gov/podcasts/download.asp?af=h&f=8634458', u'http://
www2c.cdc.gov/podcasts/download.asp?af=h&f=8634453', u'http://www2c.cdc.gov/podc
asts/download.asp?af=h&f=8634436', u'http://www2c.cdc.gov/podcasts/download.asp?
af=h&f=8634435', u'http://www2c.cdc.gov/podcasts/download.asp?af=h&f=8634434', u
'http://www2c.cdc.gov/podcasts/download.asp?af=h&f=8634417', u'http://www2c.cdc.
gov/podcasts/download.asp?af=h&f=8634403', u'http://www2c.cdc.gov/podcasts/downl
oad.asp?af=h&f=8634373', u'http://www2c.cdc.gov/podcasts/download.asp?af=h&f=863
4367', u'http://www2c.cdc.gov/podcasts/download.asp?af=h&f=8634365', u'http://ww
w2c.cdc.gov/podcasts/download.asp?af=h&f=8634362', u'http://www2c.cdc.gov/podcas
ts/download.asp?af=h&f=8634361', u'http://www2c.cdc.gov/podcasts/download.asp?af
=h&f=8634355', u'http://www2c.cdc.gov/podcasts/download.asp?af=h&f=8634350', u'h
ttp://www2c.cdc.gov/podcasts/download.asp?af=h&f=8634349', u'http://www2c.cdc.go
v/podcasts/download.asp?af=h&f=8634330', u'http://www2c.cdc.gov/podcasts/downloa
d.asp?af=h&f=8634329', u'http://www2c.cdc.gov/podcasts/download.asp?af=h&f=86343
28', u'http://www2c.cdc.gov/podcasts/download.asp?af=h&f=8634325', u'http://www2
c.cdc.gov/podcasts/download.asp?af=h&f=8634324', u'http://www2c.cdc.gov/podcasts
/download.asp?af=h&f=8634322', u'http://www2c.cdc.gov/podcasts/download.asp?af=h
&f=8634283', u'http://www2c.cdc.gov/podcasts/download.asp?af=h&f=8634278', u'htt
p://www2c.cdc.gov/podcasts/download.asp?af=h&f=8634277', u'http://www2c.cdc.gov/
podcasts/download.asp?af=h&f=8634273', u'http://www2c.cdc.gov/podcasts/download.
asp?af=h&f=8634265', u'http://www2c.cdc.gov/podcasts/download.asp?af=h&f=8634262
', u'http://www2c.cdc.gov/podcasts/download.asp?af=h&f=8634250', u'http://www2c.
cdc.gov/podcasts/download.asp?af=h&f=8634251', u'http://www.cdc.gov/media/DPK/20
14/images/vs-crash-injuries/fb.jpg', u'http://www2c.cdc.gov/podcasts/download.as
p?af=h&f=8634248', u'http://www2c.cdc.gov/podcasts/download.asp?af=h&f=8634234',
 u'http://www2c.cdc.gov/podcasts/download.asp?af=h&f=8634233', u'http://www2c.cd
c.gov/podcasts/download.asp?af=h&f=8634225', u'http://www2c.cdc.gov/podcasts/dow
nload.asp?af=h&f=8634224', u'http://www2c.cdc.gov/podcasts/download.asp?af=h&f=8
634222', u'http://www2c.cdc.gov/podcasts/download.asp?af=h&f=8634221', u'http://
www2c.cdc.gov/podcasts/download.asp?af=h&f=8634323', u'http://www2c.cdc.gov/podc
asts/download.asp?af=h&f=8634217', u'http://www2c.cdc.gov/podcasts/download.asp?
af=h&f=8634214', u'http://www2c.cdc.gov/podcasts/download.asp?af=h&f=8634178', u
'http://www2c.cdc.gov/podcasts/download.asp?af=h&f=8634174', u'http://www.cdc.go
v/media/images/L2/p1002-smoke-free-housing.jpg', u'http://www2c.cdc.gov/podcasts
/download.asp?af=h&f=8634173', u'http://www2c.cdc.gov/podcasts/download.asp?af=h
&f=8634211', u'http://www2c.cdc.gov/podcasts/download.asp?af=h&f=8634164', u'htt
p://www2c.cdc.gov/podcasts/download.asp?af=h&f=8634157', u'http://www2c.cdc.gov/
podcasts/download.asp?af=h&f=8634160', u'http://www2c.cdc.gov/podcasts/download.
asp?af=h&f=8634161', u'http://www2c.cdc.gov/podcasts/download.asp?af=h&f=8634146
', u'http://www2c.cdc.gov/podcasts/download.asp?af=h&f=8634073']
>>>

Please keep in mind that changing the Scrapy version is not an option; I am locked to 0.16. Any ideas? Thanks.


1 Answer

Stack Overflow user

Accepted answer

Answered 2014-10-23 18:25:41

When you view the feed's source in a browser, you can see that the feed XML is in a default namespace:

<feed xmlns="http://www.w3.org/2005/Atom">

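All descendant elements of feed also belong to this namespace, which is why your selectors return no results. The one exception is selecting an attribute:

"It seems that only attributes such as @href are accessible."

That works because attributes do not take on the default namespace; they are in no namespace at all.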
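
The same behavior can be reproduced outside Scrapy. Below is a minimal sketch that uses Python's standard-library xml.etree.ElementTree rather than Scrapy 0.16's XmlXPathSelector; SAMPLE_FEED is a made-up miniature Atom document standing in for the CDC feed, and the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature Atom feed (an assumption, not the real CDC feed).
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Sample feed</title>
  <entry>
    <title>First entry</title>
    <link href="http://example.com/1"/>
  </entry>
</feed>"""

root = ET.fromstring(SAMPLE_FEED)

# An unprefixed name refers to an element in no namespace, so nothing matches.
unqualified_links = root.findall(".//link")

# Spelling out the namespace URI does match the element.
link = root.find(".//{http://www.w3.org/2005/Atom}link")

# The element's tag carries the namespace URI; its attribute names do not.
link_tag = link.tag
attribute_names = list(link.attrib)

print(unqualified_links)
print(link_tag)
print(attribute_names)
```

The element name is stored together with the Atom namespace URI, while the href attribute keeps its plain name, which is why an attribute-only query still finds it.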

To access elements that are in a namespace, you must first register that namespace and choose a prefix for it:

xxs.register_namespace("atom", "http://www.w3.org/2005/Atom")

Then prefix the element names with atom: (or whatever prefix you registered):

xxs.select("//atom:link").extract()
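
For comparison, here is a rough sketch of the same prefix-to-URI idea using Python's standard-library xml.etree.ElementTree, which takes the mapping as a dictionary instead of a register_namespace call on the selector. It is not the Scrapy 0.16 API; SAMPLE_FEED and its example.com URLs are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature Atom feed (an assumption, not the real CDC feed).
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Sample feed</title>
  <entry>
    <title>First entry</title>
    <link href="http://example.com/1"/>
  </entry>
  <entry>
    <title>Second entry</title>
    <link href="http://example.com/2"/>
  </entry>
</feed>"""

# The prefix is arbitrary; only the URI has to match the document.
NAMESPACES = {"atom": "http://www.w3.org/2005/Atom"}

root = ET.fromstring(SAMPLE_FEED)

# Every element step in the path needs the prefix.
entry_titles = [
    title.text
    for title in root.findall(".//atom:entry/atom:title", NAMESPACES)
]

# Attributes are in no namespace, so href is read without a prefix.
hrefs = [link.get("href") for link in root.findall(".//atom:link", NAMESPACES)]

print(entry_titles)
print(hrefs)
```

Note that every element step in a path needs the prefix, while the attribute name stays unprefixed.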

You can find more information in the relevant section of the Scrapy documentation.

Original page content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/26532797
