Q: Preprocessing of labeled text data

Stack Overflow user

Asked 2020-07-22 08:03:53

1 answer · 826 views · 0 followers · 0 votes

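To train NLP models, you need text data with named-entity labels on the text. In many cases these are given as character offsets (e.g. ("Android Pay expands to Canada", [(0, 11, 'PRODUCT'), (23, 30, 'GPE')])), in BILUO format (e.g. (["Facebook", "released", "React", "in", "2014"], ["U-ORG", "O", "U-TECHNOLOGY", "O", "U-DATE"])), or something similar (examples taken from spaCy 101). When preprocessing such data, it is important to keep the labels in the right positions, for example when removing stop words or manipulating whitespace, since characters and tokens get removed. My question is: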

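Is there a data structure that allows preprocessing labeled text data while moving the labels to the correct positions in the new text?

If there is already an implementation in Python, I would be interested in that as well. Otherwise, I would also be willing to code it myself.

I use NER as the motivation for this question, but I would also be interested in whether there is a more general data structure that can store different kinds of labels at the same time.

For example, I would like to do something like the following to lowercase all my documents.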

import spacy
nlp = spacy.blank('en')
doc = nlp('Hello World! This is an amazing day!')
doc.text = doc.text.lower()

This is not possible and raises an error.

AttributeError: attribute 'text' of 'spacy.tokens.doc.Doc' objects is not writable

Tags: nlp, preprocessor, named-entity-recognition, python, data-structures

1 answer

Stack Overflow user

Answered 2020-07-23 17:40:12

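If I'm not mistaken, spaCy is designed so that its Token class keeps track of each token's character offset relative to the original document (e.g., token.idx).

These offsets are still tracked if you need to do some custom preprocessing. For example, if you take the example here, whose Doc has the tokens 'hello', '-', 'world.', ':)':

print([t.text for t in doc])
print([t.idx for t in doc])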

This should give you:

['hello', '-', 'world.', ':)']
[0, 5, 6, 13]
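The lowercasing attempted in the question can be sketched along the same lines. Since Doc.text is read-only, one option is to build a new Doc from the transformed token texts. This is a minimal sketch, assuming a blank English pipeline (tokenizer only) and that the transform keeps each token's length, as lowercasing does for this ASCII text; the variable names are illustrative.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")  # tokenizer only, no trained model required
doc = nlp("Hello World! This is an amazing day!")

# Doc.text cannot be assigned to, so rebuild a Doc from the new token texts.
words = [token.text.lower() for token in doc]
spaces = [bool(token.whitespace_) for token in doc]  # keep original spacing
doc_lower = Doc(nlp.vocab, words=words, spaces=spaces)

print(doc_lower.text)

# Lowercasing does not change token lengths here, so every token keeps
# the same character offset as in the original Doc.
print([t.idx for t in doc] == [t.idx for t in doc_lower])
```

Because the offsets are unchanged, character-offset labels made for the original text still point at the right characters in the lowercased text. A transform that changes token lengths needs the offsets to be recomputed.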

If you do want or need your own custom attributes, spaCy lets you set custom attributes and methods through the ._ namespace, which you can read more about here.
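As a rough illustration of the ._ namespace, the sketch below registers a token extension and fills it from character-offset labels. The attribute name custom_label is a hypothetical name chosen for this example, and a blank English pipeline is assumed so that no trained model is needed.

```python
import spacy
from spacy.tokens import Token

# "custom_label" is a hypothetical attribute name used only in this sketch.
if not Token.has_extension("custom_label"):
    Token.set_extension("custom_label", default=None)

nlp = spacy.blank("en")
doc = nlp("Sam is going to the store.")

# (start_char, end_char, label) with exclusive end offsets
labels = [(0, 3, "PER"), (20, 25, "LOC")]

for start, end, label in labels:
    span = doc.char_span(start, end)
    if span is None:
        # the offsets do not line up with token boundaries
        continue
    for token in span:
        token._.custom_label = label

print([(token.text, token._.custom_label) for token in doc])
```

Keeping the labels in a separate attribute leaves spaCy's built-in entity fields untouched, which matters when a trained pipeline also writes to them.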

Hopefully one of these options meets your needs. If not, perhaps you can draw inspiration from spaCy's source code and documentation?

Edit 1: Example

import spacy

# requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')

text = "Sam is going to the store."

# (start_char, end_char, label); end offsets are exclusive
labels = [(0, 3, "PER"),
          (20, 25, "LOC")]

extra_stopwords = ['hi']

# mark additional words as stop words in the vocabulary
for word in extra_stopwords:
    nlp.vocab[word].is_stop = True

doc = nlp(text)

# map each character-offset label onto the tokens it covers
for start, end, label in labels:
    span = doc.char_span(start, end, label=label)

    # char_span returns None when the offsets do not align with token boundaries
    if span is None:
        continue

    # I forgot char_span only assigns the label to the span
    # might be safer to use the ._ namespace I mentioned earlier
    # instead of .ent_type_, but the idea should be the same
    for token in span:
        token.ent_type_ = label

# examples of things you can access
print([(token,
        token.ent_type_,
        token.lemma_.lower(),
        token.is_stop)
      for token in doc])
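The question also mentions the BILUO format. To show how the two representations relate, here is a minimal pure-Python sketch that converts character-offset labels to one BILUO tag per token. The helper names tokenize_with_offsets and offsets_to_biluo and the simple regex tokenizer are assumptions made for this example, not spaCy functions; spaCy v3 ships its own helper for this, offsets_to_biluo_tags in spacy.training. The end offsets below are exclusive, so "Canada" ends at 29.

```python
import re


def tokenize_with_offsets(text):
    """Naive tokenizer: returns (token_text, start_char) pairs."""
    return [(m.group(), m.start()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def offsets_to_biluo(tokens, labels):
    """Convert (start, end, label) character spans to one BILUO tag per token."""
    tags = ["O"] * len(tokens)
    for start, end, label in labels:
        # indices of the tokens that lie completely inside the span
        inside = [i for i, (tok, pos) in enumerate(tokens)
                  if pos >= start and pos + len(tok) <= end]
        if not inside:
            continue  # the span does not cover any whole token
        if len(inside) == 1:
            tags[inside[0]] = "U-" + label       # single-token entity
        else:
            tags[inside[0]] = "B-" + label       # first token
            tags[inside[-1]] = "L-" + label      # last token
            for i in inside[1:-1]:
                tags[i] = "I-" + label           # tokens in between
    return tags


text = "Android Pay expands to Canada"
labels = [(0, 11, "PRODUCT"), (23, 29, "GPE")]  # text[23:29] == "Canada"

tokens = tokenize_with_offsets(text)
tags = offsets_to_biluo(tokens, labels)
print(tags)
# ['B-PRODUCT', 'L-PRODUCT', 'O', 'O', 'U-GPE']
```

Token-level tags like these survive token removal more easily than character offsets, because dropping a token drops its tag with it and nothing else has to shift.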

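Edit 2:

If you really want the shifted offsets, you can use the code above and loop through the tokens, keeping track of where you are, and construct a new Doc object along the way. You will have to decide how to handle the added whitespace; for simplicity, I use a naive approach below.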

from spacy.tokens import Doc


# example
def preprocessing(token):
    """Takes a token and returns the processed string."""
    return token.lemma_.lower()


labels_new = []
words = []
spaces = []
offset = 0


for token in doc:
    token_new = preprocessing(token)
    words.append(token_new)

    if token.ent_type_:
        labels_new.append((offset,
                           offset + len(token_new),
                           token.ent_type_))

    # advance past the new token text and any trailing whitespace,
    # so the offsets match the text of the new Doc
    offset += len(token_new) + len(token.whitespace_)
    spaces.append(bool(token.whitespace_))


doc_new = Doc(nlp.vocab, words=words, spaces=spaces)

Then you would have:

print([(token, token.idx) for token in doc_new])

# [(sam, 0), (be, 4), (go, 7), (to, 10), (the, 13), (store, 17), (., 22)]

and:

print(labels_new)

# [(0, 3, 'PER'), (17, 22, 'LOC')]

(Of course, you would then have to reassign these to each token, as before.)
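For the more general data structure the question asks about, one possible direction is to attach labels to tokens instead of character positions, apply arbitrary per-token transforms, and only recompute offsets when the text is rebuilt. The following is a minimal pure-Python sketch under those assumptions. LabeledToken, preprocess, render and the toy stop-word list are hypothetical names invented for this example, and whitespace handling is deliberately naive.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class LabeledToken:
    text: str
    space_after: bool
    label: Optional[str] = None    # entity label, None for unlabeled tokens
    ent_id: Optional[int] = None   # tokens of one entity share the same id


def preprocess(tokens: List[LabeledToken],
               transform: Callable[[LabeledToken], Optional[str]]) -> List[LabeledToken]:
    """Apply transform to every token; returning None drops the token."""
    result = []
    for tok in tokens:
        new_text = transform(tok)
        if new_text is not None:
            result.append(LabeledToken(new_text, tok.space_after, tok.label, tok.ent_id))
    return result


def render(tokens: List[LabeledToken]) -> Tuple[str, List[Tuple[int, int, str]]]:
    """Rebuild the text and recompute (start, end, label) spans from scratch."""
    parts, spans = [], []
    offset, prev_id = 0, None
    for tok in tokens:
        start, end = offset, offset + len(tok.text)
        if tok.label is not None:
            if tok.ent_id is not None and tok.ent_id == prev_id:
                spans[-1] = (spans[-1][0], end, tok.label)  # extend current entity
            else:
                spans.append((start, end, tok.label))
        prev_id = tok.ent_id if tok.label is not None else None
        parts.append(tok.text + (" " if tok.space_after else ""))
        offset = end + (1 if tok.space_after else 0)
    return "".join(parts), spans


STOP_WORDS = {"to"}  # toy stop-word list for this example

tokens = [
    LabeledToken("Android", True, "PRODUCT", 0),
    LabeledToken("Pay", True, "PRODUCT", 0),
    LabeledToken("expands", True),
    LabeledToken("to", True),
    LabeledToken("Canada", False, "GPE", 1),
]


def lower_and_drop_stop_words(tok: LabeledToken) -> Optional[str]:
    lowered = tok.text.lower()
    return None if lowered in STOP_WORDS else lowered


text, spans = render(preprocess(tokens, lower_and_drop_stop_words))
print(text)   # android pay expands canada
print(spans)  # [(0, 11, 'PRODUCT'), (20, 26, 'GPE')]
```

Because offsets are derived only at render time, transforms can change token lengths or remove tokens without any bookkeeping. Other kinds of labels could be stored as extra fields on the token in the same way.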

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud Xiaowei's IT-domain engine.

Original link:

https://stackoverflow.com/questions/63029584
