How do I initialize a `Doc` in textacy 0.6.2?

Stack Overflow user

Asked 2018-07-19 20:21:06

Answers 2 · Views 693 · Followers 0 · Votes 3

Trying to follow the initialization shown in the documentation does not work in Python 2:

>>> import textacy
>>> content = '''
...     The apparent symmetry between the quark and lepton families of
...     the Standard Model (SM) are, at the very least, suggestive of
...     a more fundamental relationship between them. In some Beyond the
...     Standard Model theories, such interactions are mediated by
...     leptoquarks (LQs): hypothetical color-triplet bosons with both
...     lepton and baryon number and fractional electric charge.'''
>>> metadata = {
...     'title': 'A Search for 2nd-generation Leptoquarks at √s = 7 TeV',
...     'author': 'Burton DeWilde',
...     'pub_date': '2012-08-01'}
>>> doc = textacy.Doc(content, metadata=metadata)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/a/anaconda/envs/env1/lib/python2.7/site-packages/textacy/doc.py", line 120, in __init__
    {compat.unicode_, SpacyDoc}, type(content)))
ValueError: `Doc` must be initialized with set([<type 'unicode'>, <type 'spacy.tokens.doc.Doc'>]) content, not "<type 'str'>"

What should this simple initialization look like for a string or a sequence of strings?

Update

Passing unicode(content) to textacy.Doc() produces:

ImportError: 'cld2-cffi' must be installed to use textacy's automatic language detection; you may do so via 'pip install cld2-cffi' or 'pip install textacy[lang]'.

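IMO, it would be nice to have this dependency in place from the moment textacy is installed.

Even after installing cld2-cffi, running the code above throws: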

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/a/anaconda/envs/env1/lib/python2.7/site-packages/textacy/doc.py", line 114, in __init__
    self._init_from_text(content, metadata, lang)
  File "/Users/a/anaconda/envs/env1/lib/python2.7/site-packages/textacy/doc.py", line 136, in _init_from_text
    spacy_lang = cache.load_spacy(langstr)
  File "/Users/a/anaconda/envs/env1/lib/python2.7/site-packages/cachetools/__init__.py", line 46, in wrapper
    v = func(*args, **kwargs)
  File "/Users/a/anaconda/envs/env1/lib/python2.7/site-packages/textacy/cache.py", line 99, in load_spacy
    return spacy.load(name, disable=disable)
  File "/Users/a/anaconda/envs/env1/lib/python2.7/site-packages/spacy/__init__.py", line 21, in load
    return util.load_model(name, **overrides)
  File "/Users/a/anaconda/envs/env1/lib/python2.7/site-packages/spacy/util.py", line 120, in load_model
    raise IOError("Can't find model '%s'" % name)
IOError: Can't find model 'en'

textacy

python

nlp

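2 Answers

Stack Overflow user

Answered 2018-08-03 20:55:22

The issue, as the traceback shows, is in the _init_from_text() function in textacy/doc.py, which tries to detect the language and then calls cache.load_spacy() with the string 'en' on line 136. (The spacy repo touches on this in the comments of one of its issues.)

I solved this by supplying a valid lang string, u'en_core_web_sm', and by using unicode for both the content and the lang argument strings.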

import textacy

content = u'''
    The apparent symmetry between the quark and lepton families of
    the Standard Model (SM) are, at the very least, suggestive of
    a more fundamental relationship between them. In some Beyond the
    Standard Model theories, such interactions are mediated by
    leptoquarks (LQs): hypothetical color-triplet bosons with both
    lepton and baryon number and fractional electric charge.'''

metadata = {
    'title': 'A Search for 2nd-generation Leptoquarks at √s = 7 TeV',
    'author': 'Burton DeWilde',
    'pub_date': '2012-08-01'}

doc = textacy.Doc(content, metadata=metadata, lang=u'en_core_web_sm')

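In my opinion, the problems are that passing a byte string instead of a unicode string changes the behavior (with a cryptic error message), that a required package is missing, and that the way spacy language strings are used is possibly outdated or incomplete.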
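The two missing pieces from the tracebacks (the language-detection package and the spacy model) can be checked up front. The following is only a minimal sketch: the helper name missing_modules is hypothetical, and the import names "cld2" (for the cld2-cffi package) and "en_core_web_sm" (for the spacy model package) are assumptions that may differ in your environment.

```python
import importlib


def missing_modules(module_names):
    """Return the names from module_names that cannot be imported."""
    missing = []
    for name in module_names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


# Assumed import names: "cld2" for the cld2-cffi package and
# "en_core_web_sm" for the spacy model package.
required = ["cld2", "en_core_web_sm"]

for name in missing_modules(required):
    print("missing: " + name)
```

If anything is reported missing, `pip install cld2-cffi` (as suggested by the ImportError above) and `python -m spacy download en_core_web_sm` are the usual ways to install them.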

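Votes 1

Stack Overflow user

Answered 2018-08-03 15:41:03

It looks like you are using Python 2 and getting a unicode error. The textacy documentation has a note about some unicode nuances when using Python 2:

Note: In almost all cases, textacy (as well as spacy) expects to be working with unicode text data. Throughout the code, this is indicated as str to be consistent with Python 3's default string type; users of Python 2, however, must be mindful to use unicode, and convert from the default (bytes) string type as needed.

So I would give this a try (note the u'''):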

content = u'''
          The apparent symmetry between the quark and lepton families of
          the Standard Model (SM) are, at the very least, suggestive of
          a more fundamental relationship between them. In some Beyond the
          Standard Model theories, such interactions are mediated by
          leptoquarks (LQs): hypothetical color-triplet bosons with both
          lepton and baryon number and fractional electric charge.'''

This produced a Doc object for me (on Python 3).
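When the text is not a literal (for example, it was read from a file as bytes), the u''' prefix is not available and the conversion has to happen explicitly. Here is a minimal sketch of such a conversion that runs on both Python 2 and Python 3; the helper name to_unicode and the UTF-8 default are assumptions for illustration, not part of textacy.

```python
import sys

# The unicode text type is `unicode` on Python 2 and `str` on Python 3.
text_type = str if sys.version_info[0] >= 3 else unicode  # noqa: F821


def to_unicode(content, encoding="utf-8"):
    """Return content as unicode text, decoding byte strings if needed."""
    if isinstance(content, text_type):
        return content
    if isinstance(content, bytes):  # `bytes` is an alias of `str` on Python 2
        return content.decode(encoding)
    raise TypeError("expected text or bytes, got %r" % type(content))


# Hypothetical usage, mirroring the call from the other answer:
# doc = textacy.Doc(to_unicode(content), metadata=metadata, lang=u'en_core_web_sm')
```

Unlike a bare unicode(content) call on Python 2, which decodes with the default ASCII codec, this decodes with an explicit encoding, so non-ASCII bytes do not raise a UnicodeDecodeError as long as the encoding is right.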

Votes 0

Original content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/51431112
