I am trying to scrape this page with Scrapy 0.20.2: http://www.dmoz.org/Computers/Programming/Languages/Python/Books
I can do everything I need, such as extracting the information and sorting it...
However, I still get \r, \t and \n in the results. For example, this is one JSON item: {"desc": ["\r\n\t\t\t\r\n ", " \r\n\t\t\t\r\n - The primary goal of this book is to promote object-oriented design using Python and to illustrate the use of the emerging object-oriented design patterns.\r\nA secondary goal of the book is to present mathematical tools just in time. Analysis techniques and proofs are presented as needed and in the proper context.\r\n \r\n "], "link": ["http://www.brpreiss.com/books/opus7/html/book.html"], "title": ["Data Structures and Algorithms with Object-Oriented Design Patterns in Python"]}
The data is correct, but I don't want \t, \r and \n in the results.
My spider is:
from scrapy.spider import BaseSpider
from scrapy.selector import Selector

from dirbot.items import DmozItem


class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//ul[@class="directory-url"]/li')
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.xpath('a/text()').extract()
            item['link'] = site.xpath('a/@href').extract()
            item['desc'] = site.xpath('text()').extract()
            items.append(item)
        return items
Posted on 2014-03-31 07:40:13
I used:
def parse(self, response):
    sel = Selector(response)
    sites = sel.xpath('//ul/li')
    items = []
    for site in sites:
        item = DmozItem()
        # unicode.strip (Python 2) trims leading/trailing whitespace from each extracted string
        item['title'] = map(unicode.strip, site.xpath('a/text()').extract())
        item['link'] = map(unicode.strip, site.xpath('a/@href').extract())
        item['desc'] = map(unicode.strip, site.xpath('text()').extract())
        items.append(item)
    print "hello"  # leftover debug output
    return items
And it works. I'm not sure exactly how, and I'm still reading up on unicode.strip. I hope this helps.
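For readers wondering what `map(unicode.strip, ...)` does: it calls `.strip()` on every extracted string, which removes whitespace (including \r, \n and \t) from both ends but leaves any line breaks inside the text alone. Below is a minimal Python 3 sketch of the same idea. The helper name `clean_strings` and the shortened sample data are made up for illustration, and dropping the empty strings is an extra step that the answer above does not perform.

```python
def clean_strings(values):
    """Strip surrounding whitespace from each string and drop empty results."""
    stripped = (value.strip() for value in values)
    # Strings that were whitespace only become '' and are filtered out here.
    return [value for value in stripped if value]


# Shortened, made-up stand-in for the 'desc' list shown in the question.
raw_desc = ["\r\n\t\t\t\r\n ", " \r\n\t\t\t\r\n - The primary goal of this book.\r\n \r\n "]
print(clean_strings(raw_desc))  # ['- The primary goal of this book.']
```

In a spider this could be applied as `item['desc'] = clean_strings(site.xpath('text()').extract())`, assuming the extracted values are plain strings.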
Posted on 2014-01-14 01:44:05
Here is another approach (I used your JSON data):
>>> data = {"desc": ["\r\n\t\t\t\r\n ", " \r\n\t\t\t\r\n - The primary goal of this book is to promote object-oriented design using Python and to illustrate the use of the emerging object-oriented design patterns.\r\nA secondary goal of the book is to present mathematical tools just in time. Analysis techniques and proofs are presented as needed and in the proper context.\r\n \r\n "], "link": ["http://www.brpreiss.com/books/opus7/html/book.html"], "title": ["Data Structures and Algorithms with Object-Oriented Design Patterns in Python"]}
>>> clean_data = ''.join(data['desc'])
>>> print clean_data.strip(' \r\n\t')
Output:
- The primary goal of this book is to promote object-oriented design using Python and to illustrate the use of the emerging object-oriented design patterns.
A secondary goal of the book is to present mathematical tools just in time. Analysis techniques and proofs are presented as needed and in the proper context.
Instead of:
['\r\n\t\t\t\r\n ', ' \r\n\t\t\t\r\n - The primary goal of this book is to promote object-oriented design using Python and to illustrate the use of the emerging object-oriented design patterns.\r\nA secondary goal of the book is to present mathematical tools just in time. Analysis techniques and proofs are presented as needed and in the proper context.\r\n \r\n ']
Posted on 2014-01-14 01:57:04
Assuming you want to remove all \r, \n and \t (not just those at the edges) while still keeping the JSON structure, you can try the following:
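def normalize_whitespace(json):
    if isinstance(json, str):
        # Collapse every run of whitespace to one space and trim both ends
        return ' '.join(json.split())
    if isinstance(json, dict):
        it = json.items()  # iteritems in Python 2
    elif isinstance(json, list):
        it = enumerate(json)
    else:
        # Numbers, booleans and None contain no whitespace to normalize
        return json
    for k, v in it:
        json[k] = normalize_whitespace(v)
    return json
Usage: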
>>> normalize_whitespace({"desc": ["\r\n\t\t\t\r\n ", " \r\n\t\t\t\r\n - The primary goal of this book is to promote object-oriented design using Python and to illustrate the use of the emerging object-oriented design patterns.\r\nA secondary goal of the book is to present mathematical tools just in time. Analysis techniques and proofs are presented as needed and in the proper context.\r\n \r\n "], "link": ["http://www.brpreiss.com/books/opus7/html/book.html"], "title": ["Data Structures and Algorithms with Object-Oriented Design Patterns in Python"]})
{'title': ['Data Structures and Algorithms with Object-Oriented Design Patterns in Python'], 'link': ['http://www.brpreiss.com/books/opus7/html/book.html'], 'desc': ['', '- The primary goal of this book is to promote object-oriented design using Python and to illustrate the use of the emerging object-oriented design patterns. A secondary goal of the book is to present mathematical tools just in time. Analysis techniques and proofs are presented as needed and in the proper context.']}
As https://stackoverflow.com/a/10711166/138772 points out, the split-join approach may be better than a regex substitution because it combines stripping with whitespace normalization.
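To illustrate the difference between the two approaches, here is a minimal sketch. The sample string is a shortened, made-up stand-in for the description text, and the variable names are only for illustration.

```python
import re

text = " \r\n\t\t\t\r\n - first line\r\nsecond line\r\n \r\n "

# Regex substitution collapses each whitespace run to one space,
# but the runs at the start and end still leave a single space behind.
with_regex = re.sub(r"\s+", " ", text)

# split() with no arguments discards leading/trailing whitespace and splits
# on any whitespace run, so joining the pieces also trims the edges.
with_split_join = " ".join(text.split())

print(repr(with_regex))       # ' - first line second line '
print(repr(with_split_join))  # '- first line second line'
```

The regex version needs an extra `.strip()` call to match the split-join result.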
https://stackoverflow.com/questions/21091501