Scrapy "Quotes Tutorial" - Unicode in extracted text

Stack Overflow user

Asked 2016-12-17 13:36:17

Answers: 1 · Views: 472 · Followers: 0 · Votes: 0

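I extracted the text of a quote by following the tutorial in the Scrapy documentation. The problem is that the result has two Unicode escape sequences, one at the beginning and one at the end of the text.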

>>> quote = response.css("div.quote")[0]
>>> quote
<Selector xpath=u"descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data=u'<div class="quote" itemscope itemtype="h'>
>>> title = quote.css("span.text::text").extract_first()
>>> title
u'\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d'
>>>

In the documentation, the extracted text looks like this:

>>> title
'"The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking."'
>>>

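I don't know what I am doing wrong here; I simply followed the documentation. Is there something that has to be set in a config file, or how else can I fix this? The documentation does not mention decoding or encoding Unicode.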

Another example

I continued with the Scrapy documentation, and here is another example:

Scrapy shell input:

>>> for quote in response.css("div.quote"):
...     text = quote.css("span.text::text").extract_first()
...     author = quote.css("small.author::text").extract_first()
...     tags = quote.css("div.tags a.tag::text").extract()
...     print(dict(text=text, author=author, tags=tags))

Output snippet:

{'text': u'\u201cTry not to become a man of success. Rather become a man of value.\u201d', 'tags': [u'humor', u'obvious', u'simile'], 'author': u'Albert Einstein'}
{'text': u'\u201cIt is better to be hated for what you are than to be loved for what you are not.\u201d', 'tags': [u'humor', u'obvious', u'simile'], 'author': u'Albert Einstein'}
{'text': u"\u201cI have not failed. I've just found 10,000 ways that won't work.\u201d", 'tags': [u'humor', u'obvious', u'simile'], 'author': u'Albert Einstein'}
{'text': u"\u201cA woman is like a tea bag; you never know how strong it is until it's in hot water.\u201d", 'tags': [u'humor', u'obvious', u'simile'], 'author': u'Albert Einstein'}
{'text': u'\u201cA day without sunshine is like, you know, night.\u201d', 'tags': [u'humor', u'obvious', u'simile'], 'author': u'Albert Einstein'}

The site I am scraping:

[http://quotes.toscrape.com]

Scrapy documentation (p. 20):

https://media.readthedocs.org/pdf/scrapy/1.2/scrapy.pdf

System:

macOS, Darwin Kernel Version 16.3.0: Thu Nov 17 20:23:58 PST 2016; root:xnu-3789.31.2~1/RELEASE_X86_64

virtualenv scrapy Python 2.7.10

Update

I tried the same thing with Python 3.5.2 and finally got the correct result, without the Unicode problem I saw in the other setup.

python-2.7

unicode

scrapy

scrapy-spider

1 Answer

Stack Overflow user

Accepted answer

Answered 2016-12-17 19:16:39

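What you are seeing is the debug representation of the string, because you are only inspecting the variable in the interpreter instead of printing it. On Python 2.7, all non-printable and non-ASCII characters are displayed as escape codes. On Python 3, only characters that are non-printable or cannot be displayed in the current terminal encoding are displayed as escape codes.

Print the string to force the characters to be displayed.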

>>> s=u'\u201cThe world\u201d'
>>> s
u'\u201cThe world\u201d'
>>> print s
“The world”

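If the terminal you are printing to uses an encoding that does not support the non-ASCII characters, you may get a UnicodeEncodeError, but since Python 3.5 works for you, your terminal must support them.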
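The encoding failure described above can be reproduced without changing any terminal settings by encoding the string explicitly. This is only a minimal sketch, assuming Python 3; the helper name `can_encode` is made up for illustration, and a real terminal's encoding is whatever `sys.stdout.encoding` reports on your machine.

```python
import sys

s = "\u201cThe world\u201d"


def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# The curly quotes U+201C / U+201D are not part of ASCII, but UTF-8 covers them.
ascii_ok = can_encode(s, "ascii")
utf8_ok = can_encode(s, "utf-8")

# An error handler replaces unencodable characters instead of raising.
fallback = s.encode("ascii", errors="backslashreplace")

# The encoding print() would use for this process (may be None if stdout is replaced).
print(sys.stdout.encoding)
print(ascii_ok, utf8_ok)
print(fallback)
```

If `ascii_ok` were the situation for your terminal's encoding, `print` would raise UnicodeEncodeError for this string; a UTF-8 terminal does not have that problem.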

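Note that the debug display also shows the u prefix that marks a Unicode string, and puts quotes around the output. print shows only the string content.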
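The difference between the debug representation and the printed content can also be checked directly. The following is a rough sketch that assumes Python 3, where `ascii()` gives an escaped view similar to what the Python 2.7 interpreter echoes (without the u prefix); the variable names are illustrative only. It also shows why the dict printed in the question's loop still contains escapes on 2.7: containers format their elements with repr().

```python
s = "\u201cThe world\u201d"

# ascii() escapes every non-ASCII character and adds quotes.
escaped_view = ascii(s)

# repr() on Python 3 adds quotes but keeps printable non-ASCII characters.
debug_view = repr(s)

# str() is what print() writes: the content only, with no surrounding quotes.
content = str(s)

# A dict (or list) formats each element with repr(), so printing the
# container shows the debug form of the strings inside it.
record = {"text": s}
record_view = str(record)

# Only the ASCII-safe view is printed here so the sketch runs on any terminal.
print(escaped_view)
```

To see the readable text of each scraped item, print the individual values (for example `print(record["text"])`) rather than the whole dict.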

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/41199142
