Python scraping of a page still contains characters like \r \n \t

Stack Overflow user
Asked 2014-01-13 20:37:11
Answers: 3, Views: 3.4K, Followers: 0, Votes: 4

I am trying to scrape this page with scrapy 0.20.2: http://www.dmoz.org/Computers/Programming/Languages/Python/Books

I can do everything I need, such as extracting the information and categorizing it...

However, I still get \r, \t and \n in the results. For example, this is one JSON item: {"desc": ["\r\n\t\t\t\r\n ", " \r\n\t\t\t\r\n - The primary goal of this book is to promote object-oriented design using Python and to illustrate the use of the emerging object-oriented design patterns.\r\nA secondary goal of the book is to present mathematical tools just in time. Analysis techniques and proofs are presented as needed and in the proper context.\r\n \r\n "], "link": ["http://www.brpreiss.com/books/opus7/html/book.html"], "title": ["Data Structures and Algorithms with Object-Oriented Design Patterns in Python"]},

The data is correct, but I don't want \t, \r and \n in the results.

My spider is:

from scrapy.spider import BaseSpider
from scrapy.selector import Selector

from dirbot.items import DmozItem

class DmozSpider(BaseSpider):
   name = "dmoz"
   allowed_domains = ["dmoz.org"]
   start_urls = [
       "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
   ]

   def parse(self, response):
       sel = Selector(response)
       sites = sel.xpath('//ul[@class="directory-url"]/li')
       items = []
       for site in sites:
           item = DmozItem()
           item['title'] = site.xpath('a/text()').extract()
           item['link'] = site.xpath('a/@href').extract()
           item['desc'] = site.xpath('text()').extract()
           items.append(item)
       return items

3 Answers

Stack Overflow user

Posted 2014-03-31 07:40:13

I use:

def parse(self, response):
    sel = Selector(response)
    sites = sel.xpath('//ul/li')
    items = []
    for site in sites:
        item = DmozItem()
        item['title'] = map(unicode.strip,site.xpath('a/text()').extract())
        item['link'] = map(unicode.strip, site.xpath('a/@href').extract())
        item['desc'] = map(unicode.strip, site.xpath('text()').extract())
        items.append(item)
    print "hello"
    return items

And it works. I'm not sure exactly what it does, as I'm still reading up on unicode.strip. I hope this helps.
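For reference, `unicode.strip` is the Python 2 string method that removes leading and trailing whitespace (including `\r`, `\n` and `\t`), and `map` applies it to every extracted string. Below is a minimal Python 3 sketch of the same idea. The helper name `strip_all` is hypothetical, the sample list only stands in for what `.extract()` returns, and dropping the strings that end up empty is an extra assumption that the answer above does not make.

```python
def strip_all(values):
    """Strip leading/trailing whitespace from each string and drop empty results."""
    stripped = (value.strip() for value in values)
    return [value for value in stripped if value]


# Sample data shaped like the list returned by .extract() in the question.
desc = ["\r\n\t\t\t\r\n ", " \r\n\t\t\t\r\n - The primary goal of this book.\r\n \r\n "]
cleaned = strip_all(desc)
print(cleaned)
```

In the spider this would be used as `item['desc'] = strip_all(site.xpath('text()').extract())`. Whitespace inside a string (such as a `\r\n` between two sentences) is left untouched by `strip`.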

Votes: 3

Stack Overflow user

Posted 2014-01-14 01:44:05

Here is another approach (I used your JSON data):

>>> data = {"desc": ["\r\n\t\t\t\r\n ", " \r\n\t\t\t\r\n - The primary goal of this book is to promote object-oriented design using Python and to illustrate the use of the emerging object-oriented design patterns.\r\nA secondary goal of the book is to present mathematical tools just in time. Analysis techniques and proofs are presented as needed and in the proper context.\r\n \r\n "], "link": ["http://www.brpreiss.com/books/opus7/html/book.html"], "title": ["Data Structures and Algorithms with Object-Oriented Design Patterns in Python"]}

>>> clean_data = ''.join(data['desc'])

>>> print clean_data.strip(' \r\n\t')

Output:

- The primary goal of this book is to promote object-oriented design using Python and to illustrate the use of the emerging object-oriented design patterns.
A secondary goal of the book is to present mathematical tools just in time. Analysis techniques and proofs are presented as needed and in the proper context.

Instead of:

['\r\n\t\t\t\r\n ', ' \r\n\t\t\t\r\n - The primary goal of this book is to promote object-oriented design using Python and to illustrate the use of the emerging object-oriented design patterns.\r\nA secondary goal of the book is to present mathematical tools just in time. Analysis techniques and proofs are presented as needed and in the proper context.\r\n \r\n ']
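One thing worth noting about this approach: `strip` only trims characters at the two ends of the joined string, so a `\r\n` in the middle of the text survives (it is what produces the line break in the output above). A small Python 3 sketch with shortened sample data, which is an assumption standing in for the real description:

```python
# Shortened stand-in for the "desc" list from the question.
desc = ["\r\n\t\t\t\r\n ", " \r\n\t\t\t\r\n - First sentence.\r\nSecond sentence.\r\n \r\n "]

# Join the fragments, then trim whitespace characters from both ends only.
joined = "".join(desc).strip(" \r\n\t")
print(repr(joined))

# If separate lines are wanted, splitlines() treats "\r\n" as a single break.
lines = joined.splitlines()
print(lines)
```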
Votes: 0

Stack Overflow user

Posted 2014-01-14 01:57:04

Assuming you want to remove all of the \r\n\t (not just what is at the edges) while still keeping the JSON structure, you can try the following:

def normalize_whitespace(json):
    if isinstance(json, str):
        return ' '.join(json.split())

    if isinstance(json, dict):
        it = json.items() # iteritems in Python 2
    elif isinstance(json, list):
        it = enumerate(json)
    else:
        return json  # numbers, booleans, None: nothing to normalize

    for k, v in it:
        json[k] = normalize_whitespace(v)

    return json

Usage:

>>> normalize_whitespace({"desc": ["\r\n\t\t\t\r\n ", " \r\n\t\t\t\r\n - The primary goal of this book is to promote object-oriented design using Python and to illustrate the use of the emerging object-oriented design patterns.\r\nA secondary goal of the book is to present mathematical tools just in time. Analysis techniques and proofs are presented as needed and in the proper context.\r\n \r\n "], "link": ["http://www.brpreiss.com/books/opus7/html/book.html"], "title": ["Data Structures and Algorithms with Object-Oriented Design Patterns in Python"]})
{'title': ['Data Structures and Algorithms with Object-Oriented Design Patterns in Python'], 'link': ['http://www.brpreiss.com/books/opus7/html/book.html'], 'desc': ['', '- The primary goal of this book is to promote object-oriented design using Python and to illustrate the use of the emerging object-oriented design patterns. A secondary goal of the book is to present mathematical tools just in time. Analysis techniques and proofs are presented as needed and in the proper context.']}

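As https://stackoverflow.com/a/10711166/138772 points out, the split-join approach may be better than a regular-expression substitution, because it combines the effect of strip with whitespace normalization.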
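To make that comparison concrete, here is a short Python 3 sketch that runs both techniques on the same string. The sample text is a shortened stand-in for the description in the question, and the variable names are only illustrative.

```python
import re

text = " \r\n\t\t\t\r\n - The primary goal.\r\nA secondary goal.\r\n \r\n "

# Regex substitution collapses each whitespace run into one space,
# which leaves a single space at the start and at the end.
via_regex = re.sub(r"\s+", " ", text)

# split() with no arguments splits on whitespace runs and ignores
# leading/trailing whitespace, so the joined result is already trimmed.
via_split_join = " ".join(text.split())

print(repr(via_regex))
print(repr(via_split_join))
```

With the regex version an extra `.strip()` call would be needed to get the same result as the split-join version.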

Votes: 0

Original page content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/21091501
