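I'm new to Python and to the Scrapy web crawler. I want to grab the first 10 characters of the description string and use them as the title.
The Python snippet below produces the JSON below: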
item['image'] = img.xpath('@src').extract()
item_desc = img.xpath('@title').extract()
print(item_desc)
item['description'] = item_desc
item['title'] = item_desc[:10]
item['parentUrl'] = response.url
{'description': [u'CHAR-BROIL Tru-Infrared 350 IR Gas Grill - SportsAuthority.com '],
'image': [u'http://www.sportsauthority.com/graphics/product_images/pTSA-10854895t130.jpg'],
'parentUrl': 'http://www.sportsauthority.com/category/index.jsp?categoryId=3077576&clickid=topnav_Jerseys+%26+Fan+Shop',
'title': [u'CHAR-BROIL Tru-Infrared 350 IR Gas Grill - SportsAuthority.com ']}
What I want is the following. The slice is not doing what I expected.
{'description': [u'CHAR-BROIL Tru-Infrared 350 IR Gas Grill - SportsAuthority.com '],
'image': [u'http://www.sportsauthority.com/graphics/product_images/pTSA-10854895t130.jpg'],
'parentUrl': 'http://www.sportsauthority.com/category/index.jsp?categoryId=3077576&clickid=topnav_Jerseys+%26+Fan+Shop',
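'title': [u'CHAR-BROIL']}

Posted on 2013-11-26 12:06:06
item_desc is a list containing one element, and that element is a unicode string. It is not itself a unicode string. The [...] in the output is a big hint: slicing a list with [:10] keeps up to the first 10 elements, not the first 10 characters.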
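To see the difference concretely, here is a minimal sketch that uses a hard-coded list in place of the real .extract() result. The sample value is copied from the output in the question, and no Scrapy is needed to run it:

```python
# Stand-in for what img.xpath('@title').extract() returned in the question:
# a list holding a single string.
item_desc = ['CHAR-BROIL Tru-Infrared 350 IR Gas Grill - SportsAuthority.com ']

# Slicing the list keeps up to 10 *elements*; there is only one, so nothing changes.
list_slice = item_desc[:10]

# Slicing the string inside the list keeps the first 10 *characters*.
string_slice = item_desc[0][:10]

print(list_slice == item_desc)  # True
print(string_slice)             # CHAR-BROIL
```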
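Take the element out, slice it, and put it back into a list:
item['title'] = [item_desc[0][:10]]
Evidently, .extract() can return multiple matches; if you expect only one, you can also select the first match directly: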
item['image'] = img.xpath('@src').extract()[0]
item_desc = img.xpath('@title').extract()[0]
item['description'] = item_desc
item['title'] = item_desc[:10]
If your XPath query does not always return a result, test for an empty list first:
img_match = img.xpath('@src').extract()
item['image'] = img_match[0] if img_match else ''
item_desc = img.xpath('@title').extract()
item['description'] = item_desc[0] if item_desc else ''
item['title'] = item_desc[0][:10] if item_desc else ''
https://stackoverflow.com/questions/20216600
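If the empty-list check is needed for several fields, it can be pulled into a small helper. The sketch below is only an illustration: first_or_default is a hypothetical name (it is not part of Scrapy), and hard-coded lists stand in for the .extract() results. Newer Scrapy releases also appear to offer a built-in for this on selectors (extract_first() with a default argument), so check the documentation for your version before writing your own.

```python
def first_or_default(matches, default=''):
    """Return the first item of a list of matches, or `default` if it is empty."""
    return matches[0] if matches else default


# Hard-coded lists stand in for .extract() results (assumed sample data).
found = ['CHAR-BROIL Tru-Infrared 350 IR Gas Grill - SportsAuthority.com ']
missing = []

description = first_or_default(found)
title = description[:10]                      # slicing a string: 10 characters
empty_title = first_or_default(missing)[:10]  # slicing '' is safe and gives ''

print(title)
print(repr(empty_title))
```

Because the helper always returns a string here, the title slice works the same way whether or not the XPath query matched anything.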