I am writing a news spider and want to extract the pubtime value from a script in a page's source code. So far I can get the contents of the script, shown below:
{
site:'sports',
site_cname:'体育',
site_url:'',
title:'球爹喊话詹皇:想拿更多冠军 那就和我儿子搭档 ',
id:'20170802002470',
pubtime:'2017-08-02 06:22',
type:'2',
article_url:'',
sosokeys:{key1:'NBA',key2:'湖人',key3:'球爹',key4:'詹姆斯'},
tags:['NBA','湖人','球爹','詹姆斯'],
catalog:'basket',
catalog_full:'sports-basket-nba',
sub_nav:'nba',
topic:{name:'',cname:'',ztcatalog:''},
subName:{name:'basket',url:'', cname:'篮球'},
isShowLastAD:'',
tpl:
{dev:'nba',ver:'1.0.0.0',time:'20150512',type:'1',stype:''}
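}
I tried using json.loads() to turn the string into a JSON object, but it failed with this error:
ValueError: Expecting property name enclosed in double quotes.
Before this error was raised, I had already replaced every ' with ". I understand the likely cause is that every key must be enclosed in double quotes, but there are too many keys here, and I don't think wrapping each one in double quotes by hand is the best option. So I still cannot get the pubtime value. Any suggestions are welcome. Thanks in advance.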
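Answered 2017-08-03 03:27:23
There are tools that can parse JavaScript variables and the like, notably js2xml, which is developed by the same people who make Scrapy.
However, a simple regular expression is usually enough: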
>>> text = "pubtime:'2017-08-02 06:22',"
>>> import re
>>> re.findall("pubtime:'(.+?)'", text)
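['2017-08-02 06:22']
Of course, in your case you would search the whole HTML body using response.body_as_unicode() (response.text in newer Scrapy versions) instead of the predefined text variable.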
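As a rough sketch of how the regex approach might be wrapped up for reuse: the helper below pulls the first pubtime value out of a script body and parses it into a datetime. The name extract_pubtime and the sample script_text are made up for illustration, and the 'YYYY-MM-DD HH:MM' format is an assumption based on the single sample in the question.

```python
import re
from datetime import datetime

# Matches pubtime:'...' and tolerates whitespace around the colon.
PUBTIME_RE = re.compile(r"pubtime\s*:\s*'([^']*)'")


def extract_pubtime(script_text):
    """Return pubtime as a datetime, or None if it is missing or malformed."""
    match = PUBTIME_RE.search(script_text)
    if match is None:
        return None
    try:
        # Assumed format, inferred from the one sample in the question.
        return datetime.strptime(match.group(1), "%Y-%m-%d %H:%M")
    except ValueError:
        return None


# Hypothetical stand-in for the script body taken from the response.
script_text = "id:'20170802002470',\npubtime:'2017-08-02 06:22',\ntype:'2',"
print(extract_pubtime(script_text))  # 2017-08-02 06:22:00
```

In a Scrapy callback, the same function would presumably be called on the response text rather than on a hard-coded string.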
Answered 2017-08-05 01:09:24
Here is one way to do this with js2xml:
First, get the JavaScript code you are interested in:
$ scrapy shell http://sports.qq.com/a/20170802/002470.htm
2017-08-04 18:41:23 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
(...)
2017-08-04 18:41:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://sports.qq.com/a/20170802/002470.htm> (referer: None)
>>> js = response.xpath('//script/text()').get()
>>> print(js)
ARTICLE_INFO = window.ARTICLE_INFO || {
site:'sports',
site_cname:'体育',
site_url:'http://sports.qq.com',
title:'球爹喊话詹皇:想拿更多冠军 那就和我儿子搭档 ',
id:'20170802002470',
pubtime:'2017-08-02 06:22',
type:'2',
article_url:'http://sports.qq.com/a/20170802/002470.htm',
sosokeys:{key1:'NBA',key2:'湖人',key3:'球爹',key4:'詹姆斯'},
tags:['NBA','湖人','球爹','詹姆斯'],
catalog:'basket',
catalog_full:'sports-basket-nba',
sub_nav:'nba',
topic:{name:'',cname:'',ztcatalog:''},
subName:{name:'basket',url:'http://sports.qq.com/nba/', cname:'篮球'},
isShowLastAD:'',
tpl:{dev:'nba',ver:'1.0.0.0',time:'20150512',type:'1',stype:''}
}
Then feed this code to js2xml.parse() to get a parse tree:
>>> import js2xml
>>> tree = js2xml.parse(js)
You can use js2xml.pretty_print() to inspect what js2xml parsed:
>>> print(js2xml.pretty_print(tree))
<program>
  <assign operator="=">
    <left>
      <identifier name="ARTICLE_INFO"/>
    </left>
    <right>
      <binaryoperation operation="||">
        <left>
          <dotaccessor>
            <object>
              <identifier name="window"/>
            </object>
            <property>
              <identifier name="ARTICLE_INFO"/>
            </property>
          </dotaccessor>
        </left>
        <right>
          <object>
            <property name="site">
              <string>sports</string>
            </property>
            <property name="site_cname">
              <string>体育</string>
            </property>
            <property name="site_url">
              <string>http://sports.qq.com</string>
            </property>
            <property name="title">
              <string>球爹喊话詹皇:想拿更多冠军 那就和我儿子搭档 </string>
            </property>
            <property name="id">
              <string>20170802002470</string>
            </property>
            <property name="pubtime">
              <string>2017-08-02 06:22</string>
            </property>
            <property name="type">
              <string>2</string>
            </property>
            <property name="article_url">
              <string>http://sports.qq.com/a/20170802/002470.htm</string>
            </property>
            <property name="sosokeys">
              <object>
                <property name="key1">
                  <string>NBA</string>
                </property>
                <property name="key2">
                  <string>湖人</string>
                </property>
                <property name="key3">
                  <string>球爹</string>
                </property>
                <property name="key4">
                  <string>詹姆斯</string>
                </property>
              </object>
            </property>
            <property name="tags">
              <array>
                <string>NBA</string>
                <string>湖人</string>
                <string>球爹</string>
                <string>詹姆斯</string>
              </array>
            </property>
            <property name="catalog">
              <string>basket</string>
            </property>
            <property name="catalog_full">
              <string>sports-basket-nba</string>
            </property>
            <property name="sub_nav">
              <string>nba</string>
            </property>
            <property name="topic">
              <object>
                <property name="name">
                  <string></string>
                </property>
                <property name="cname">
                  <string></string>
                </property>
                <property name="ztcatalog">
                  <string></string>
                </property>
              </object>
            </property>
            <property name="subName">
              <object>
                <property name="name">
                  <string>basket</string>
                </property>
                <property name="url">
                  <string>http://sports.qq.com/nba/</string>
                </property>
                <property name="cname">
                  <string>篮球</string>
                </property>
              </object>
            </property>
            <property name="isShowLastAD">
              <string></string>
            </property>
            <property name="tpl">
              <object>
                <property name="dev">
                  <string>nba</string>
                </property>
                <property name="ver">
                  <string>1.0.0.0</string>
                </property>
                <property name="time">
                  <string>20150512</string>
                </property>
                <property name="type">
                  <string>1</string>
                </property>
                <property name="stype">
                  <string></string>
                </property>
              </object>
            </property>
          </object>
        </right>
      </binaryoperation>
    </right>
  </assign>
</program>
The data you need is the right operand of the || binary operation. You can get it with XPath on the parse tree:
>>> o = tree.xpath('//binaryoperation/right/object')[0]
>>> o
<Element object at 0x7f6c8c7967e8>
Use js2xml.utils.objects.make to build a Python object from that element:
>>> from pprint import pprint
>>> from js2xml.utils.objects import make
>>> data = make(o)
>>> pprint(data)
{'article_url': 'http://sports.qq.com/a/20170802/002470.htm',
 'catalog': 'basket',
 'catalog_full': 'sports-basket-nba',
 'id': '20170802002470',
 'isShowLastAD': '',
 'pubtime': '2017-08-02 06:22',
 'site': 'sports',
 'site_cname': '体育',
 'site_url': 'http://sports.qq.com',
 'sosokeys': {'key1': 'NBA', 'key2': '湖人', 'key3': '球爹', 'key4': '詹姆斯'},
 'subName': {'cname': '篮球',
             'name': 'basket',
             'url': 'http://sports.qq.com/nba/'},
 'sub_nav': 'nba',
 'tags': ['NBA', '湖人', '球爹', '詹姆斯'],
 'title': '球爹喊话詹皇:想拿更多冠军 那就和我儿子搭档 ',
 'topic': {'cname': '', 'name': '', 'ztcatalog': ''},
 'tpl': {'dev': 'nba',
         'stype': '',
         'time': '20150512',
         'type': '1',
         'ver': '1.0.0.0'},
 'type': '2'}
As @Granitosaurus mentioned, this may look like a bit "too much" for a task like this, but it can prove useful when the data is not 100% JSON (for example, when it uses single quotes).
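For comparison, the manual-quoting idea from the question can be automated with the standard library alone. This is not what either answer recommends, and it is only a fragile sketch: it assumes that no string value contains a quote character or text that looks like ", key:". The helper name loose_object_to_dict and the trimmed sample are made up for illustration.

```python
import json
import re

# Matches an unquoted key that directly follows '{' or ','.
UNQUOTED_KEY_RE = re.compile(r"([{,]\s*)([A-Za-z_$][\w$]*)\s*:")


def loose_object_to_dict(js_object):
    """Convert a simple JS object literal to a dict (fragile, see lead-in)."""
    # Assumption: values never contain ' or ", so a blind swap is safe.
    quoted = js_object.replace("'", '"')
    # Wrap every bare key in double quotes so json.loads() accepts it.
    quoted = UNQUOTED_KEY_RE.sub(r'\1"\2":', quoted)
    return json.loads(quoted)


# Trimmed, hypothetical sample in the same shape as the question's data.
sample = """{
site:'sports',
site_url:'http://sports.qq.com',
pubtime:'2017-08-02 06:22',
tags:['NBA','湖人'],
tpl:
{dev:'nba',ver:'1.0.0.0'}
}"""
data = loose_object_to_dict(sample)
print(data["pubtime"])  # 2017-08-02 06:22
```

Because the key pattern only fires after "{" or ",", the "http:" inside the URL value is left alone. A real JavaScript parser such as js2xml remains the safer choice once the values get more varied.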
https://stackoverflow.com/questions/45468445