Q: How do I reconstruct text entities with Hugging Face's transformers pipeline without IOB tags?

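Stack Overflow user

Asked 2020-03-30 18:58:03

3 answers · 7.9K views · 0 followers · 9 votes

I have been looking at using Hugging Face's pipelines for NER (named entity recognition). However, the pipeline returns entity labels in inside-outside-beginning (IOB) format but without the IOB tags needed to group them, so I cannot map the pipeline's output back to my original text. In addition, the output is split into BERT WordPiece tokens (the default model is BERT-large).

For example: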

from transformers import pipeline
nlp_bert_lg = pipeline('ner')
print(nlp_bert_lg('Hugging Face is a French company based in New York.'))

The output is:

[{'word': 'Hu', 'score': 0.9968873858451843, 'entity': 'I-ORG'},
{'word': '##gging', 'score': 0.9329522848129272, 'entity': 'I-ORG'},
{'word': 'Face', 'score': 0.9781811237335205, 'entity': 'I-ORG'},
{'word': 'French', 'score': 0.9981815814971924, 'entity': 'I-MISC'},
{'word': 'New', 'score': 0.9987512826919556, 'entity': 'I-LOC'},
{'word': 'York', 'score': 0.9976728558540344, 'entity': 'I-LOC'}]

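As you can see, New York is split into two tags, and Hugging Face into three.

How can I map the output of Hugging Face's NER pipeline back to my original text?

Transformers version: 2.7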

nlp

tokenize

transformer-model

named-entity-recognition

huggingface-transformers

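3 Answers

Stack Overflow user

Accepted answer

Answered 2020-05-20 09:07:36

On May 17, a new pull request, https://github.com/huggingface/transformers/pull/3957, was merged with exactly what you are asking for, so our lives are now much easier. You can enable it in the pipeline like this: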

ner = pipeline('ner', grouped_entities=True)

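Your output will then be as expected. At the moment you have to install from the master branch, because there is no new release yet. You can do that with: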

pip install git+https://github.com/huggingface/transformers.git@48c3a70b4eaedab1dd9ad49990cfaa4d6cb8f6a0

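Votes: 19

Stack Overflow user

Answered 2020-04-01 08:41:19

Unfortunately, as of now (version 2.6, and I think even 2.7), you cannot do this with the pipeline feature alone, because the __call__ function invoked by the pipeline just returns a list (see the pipeline source code). This means you would have to run a second tokenization step with an "external" tokenizer, which defeats the purpose of the pipeline altogether.

Instead, you can use the second example posted in the documentation, just below the sample that resembles yours. For future completeness, here is the code: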

from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

label_list = [
    "O",       # Outside of a named entity
    "B-MISC",  # Beginning of a miscellaneous entity right after another miscellaneous entity
    "I-MISC",  # Miscellaneous entity
    "B-PER",   # Beginning of a person's name right after another person's name
    "I-PER",   # Person's name
    "B-ORG",   # Beginning of an organisation right after another organisation
    "I-ORG",   # Organisation
    "B-LOC",   # Beginning of a location right after another location
    "I-LOC"    # Location
]

# Note the trailing space after "very": adjacent string literals are concatenated as-is
sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very " \
           "close to the Manhattan Bridge."

# Bit of a hack to get the tokens with the special tokens
tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sequence)))
inputs = tokenizer.encode(sequence, return_tensors="pt")

outputs = model(inputs)[0]
predictions = torch.argmax(outputs, dim=2)

print([(token, label_list[prediction]) for token, prediction in zip(tokens, predictions[0].tolist())])

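This returns exactly what you are looking for. Note that the CoNLL annotation scheme is described as follows in its original paper:

Each line contains four fields: the word, its part-of-speech tag, its chunk tag and its named entity tag. Words tagged with O are outside of named entities and the I-XXX tag is used for words inside a named entity of type XXX. Whenever two entities of type XXX are immediately next to each other, the first word of the second entity will be tagged B-XXX in order to show that it starts another entity. The data contains entities of four types: persons (PER), organizations (ORG), locations (LOC) and miscellaneous names (MISC). This tagging scheme is the IOB scheme originally put forward by Ramshaw and Marcus (1995).

This means that if you are unhappy with the (still split) entities, you can concatenate every run of consecutive I- tagged tokens (or a B- tag followed by I- tags) into one entity. Under this scheme, two different entities of the same type that are immediately adjacent can never both be tagged with only I- tags, so the runs are unambiguous.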
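The concatenation described above can be sketched as follows. This is a minimal illustration rather than part of the transformers library: `merge_iob` is a hypothetical helper name, it assumes the input is a list of (token, label) pairs like the one printed by the code above, and it assumes BERT-style WordPiece continuations marked with `##`. Words are rejoined with single spaces, which only approximates the original spacing.

```python
from typing import List, Optional, Tuple


def merge_iob(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Merge (token, label) pairs into (entity_text, entity_type) pairs.

    Hypothetical helper: a run of I-XXX tags (or B-XXX followed by I-XXX)
    becomes one entity; a B- tag or a change of type starts a new one.
    """
    entities: List[Tuple[str, str]] = []
    words: List[str] = []          # words of the entity being built
    current: Optional[str] = None  # its type, e.g. "ORG"

    def flush() -> None:
        nonlocal words, current
        if words:
            entities.append((" ".join(words), current))
        words, current = [], None

    for token, label in tagged:
        if token.startswith("##"):
            # WordPiece continuation: glue it onto the previous word, if any
            if words:
                words[-1] += token[2:]
            continue
        if label == "O":
            flush()
            continue
        prefix, _, etype = label.partition("-")
        if prefix == "B" or etype != current:
            flush()
        current = etype
        words.append(token)
    flush()
    return entities


# Hand-written sample in the shape printed by the code above (assumed labels)
tagged = [
    ("[CLS]", "O"), ("Hu", "I-ORG"), ("##gging", "I-ORG"), ("Face", "I-ORG"),
    ("Inc", "I-ORG"), (".", "O"), ("is", "O"), ("based", "O"), ("in", "O"),
    ("New", "I-LOC"), ("York", "I-LOC"), ("City", "I-LOC"), (".", "O"),
    ("[SEP]", "O"),
]
entities = merge_iob(tagged)
print(entities)
```

Because the merge works on tokens only, it cannot recover character positions in the original string; for that you would need offsets from the tokenizer.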

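Votes: 6

Stack Overflow user

Answered 2022-03-25 14:40:06

If you are reading this in 2022:

The grouped_entities keyword is now deprecated.
You should use aggregation_strategy instead: the default is None, and you are looking for simple, first, average or max -> (see the `AggregationStrategy` class)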

from transformers import pipeline
import pandas as pd

text = 'Hugging Face is a French company based in New York.'

tagger = pipeline(task='ner', aggregation_strategy='simple')
named_ents = tagger(text)
pd.DataFrame(named_ents)

[{'entity_group': 'ORG',
  'score': 0.96934015,
  'word': 'Hugging Face',
  'start': 0,
  'end': 12},
 {'entity_group': 'MISC',
  'score': 0.9981816,
  'word': 'French',
  'start': 18,
  'end': 24},
 {'entity_group': 'LOC',
  'score': 0.9982121,
  'word': 'New York',
  'start': 42,
  'end': 50}]
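The `start` and `end` values in this output appear to be character offsets into the input string, which is what the original question needs for mapping entities back to the text. The sketch below checks that by slicing the text with them. It copies the output above as a literal so that it runs without downloading a model; `spans_from_offsets` is a hypothetical helper name, not a transformers function.

```python
text = 'Hugging Face is a French company based in New York.'

# Copied from the pipeline output shown above
named_ents = [
    {'entity_group': 'ORG', 'score': 0.96934015, 'word': 'Hugging Face', 'start': 0, 'end': 12},
    {'entity_group': 'MISC', 'score': 0.9981816, 'word': 'French', 'start': 18, 'end': 24},
    {'entity_group': 'LOC', 'score': 0.9982121, 'word': 'New York', 'start': 42, 'end': 50},
]


def spans_from_offsets(text, ents):
    """Return (surface_text, entity_type, start, end) for each entity,
    taking the surface text from the original string via the offsets."""
    return [
        (text[e['start']:e['end']], e['entity_group'], e['start'], e['end'])
        for e in ents
    ]


spans = spans_from_offsets(text, named_ents)
for surface, etype, start, end in spans:
    print(f'{etype:5} [{start}:{end}] {surface!r}')
```

Slicing the original text is preferable to using the `word` field when exact casing and spacing matter, since `word` is rebuilt from tokens.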

Votes: 5

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/60937617
