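The Stanford NLP Group recently released Stanza, a new Python natural language processing toolkit supporting 60+ languages. Stanza supports running various accurate natural language processing tools from Python, as well as accessing the Java Stanford CoreNLP software from Python.
A new collection of biomedical and clinical English model packages is now available, supporting syntactic analysis and named entity recognition (NER) on biomedical and clinical text. Stanza supports Python 3.6 or later.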

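Advantages

From raw text to annotations: Stanza provides a full neural pipeline that takes raw text as input and produces annotations for any specific purpose.

Multilingual: Stanza's architecture is language-agnostic and data-driven, which makes it possible to release models supporting 60+ languages by training the pipeline on the Universal Dependencies (UD) treebanks and other multilingual corpora.
State-of-the-art performance: Stanza's neural pipeline adapts well to different types of text, achieving state-of-the-art or competitive performance at each step of the pipeline.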

Install the required library. The package can be installed easily with pip:

pip install stanza

Import the required library

import stanza

Stanza downloads its English models from the following links:

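stanza.download('en')  # download the English models (only needed once); this fetches:
# http://nlp.stanford.edu/software/stanza/1.1.0/en/default.zip
# https://raw.githubusercontent.com/stanfordnlp/stanza-resources/master/resources_1.1.0.json
nlp = stanza.Pipeline(lang='en', processors='tokenize,ner')  # initialize the English neural pipeline
doc = nlp("John F. Kennedy International Airport is an international airport in Queens, New York, USA, and one of the primary airports serving New York City.")  # run annotation over a sentence
print(*[f'entity: {ent.text}\ttype: {ent.type}' for sent in doc.sentences for ent in sent.ents], sep='\n')

Log output during initialization:

    2021-01-21 11:36:23 INFO: Loading these models for language: en (English):
    =========================
    | Processor | Package   |
    -------------------------
    | tokenize  | ewt       |
    | ner       | ontonotes |
    =========================

    2021-01-21 11:36:23 INFO: Use device: cpu
    2021-01-21 11:36:23 INFO: Loading: tokenize
    2021-01-21 11:36:23 INFO: Loading: ner
    2021-01-21 11:36:26 INFO: Done loading processors!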

Stanza output

    entity: John F. Kennedy International Airport   type: FAC
    entity: Queens  type: GPE
    entity: New York    type: GPE
    entity: USA type: GPE
    entity: one type: CARDINAL
    entity: New York City   type: GPE

ne_chunk

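A POS-tagged sentence is parsed into a chunk tree with ordinary chunks, but the tree's labels can be entity tags rather than chunk phrase tags.
NLTK uses these chunks as part of a tree-based tagging system, although it also has a tagger that follows the IOB scheme.
NLTK already ships a pre-trained named entity chunker, available through the ne_chunk() function in the nltk.chunk module.

Below are some code snippets that show how to use both approaches and how to convert between them: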
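
To make the difference between the tree-based and IOB representations concrete, here is a minimal pure-Python sketch that does not call NLTK. The nested-list format and the helper name `chunks_to_iob` are assumptions made for illustration only; on real `Tree` objects, NLTK's own `tree2conlltags` plays a similar role.

```python
# Hypothetical stand-in for a chunk tree: plain tokens are strings,
# entity chunks are (label, [tokens]) tuples.
chunked = [
    ("GPE", ["New", "York"]),
    "is",
    "in",
    ("GPE", ["USA"]),
]


def chunks_to_iob(chunks):
    """Flatten tree-style chunks into (token, IOB tag) pairs."""
    tagged = []
    for item in chunks:
        if isinstance(item, str):
            tagged.append((item, "O"))  # token outside any entity
            continue
        label, tokens = item
        for i, token in enumerate(tokens):
            prefix = "B" if i == 0 else "I"  # B opens a chunk, I continues it
            tagged.append((token, f"{prefix}-{label}"))
    return tagged


iob = chunks_to_iob(chunked)
print(iob)
```

In the tree form an entity is a single node holding its tokens, while in the IOB form the same information is spread over per-token tags.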

from nltk.chunk import conlltags2tree, tree2conlltags
from nltk import pos_tag
from nltk import word_tokenize
from nltk.chunk import ne_chunk
from nltk.tree import Tree  # needed for the subtree type check below

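These models are trained on the CoNLL corpus (from the CoNLL conference) in NLTK. Since tokenization, POS tagging, and chunking are already done, all that is needed for tree-based tagging is conlltags2tree.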

sentence = """John F. Kennedy International Airport is an international airport in Queens, New York, USA, and one of the primary airports serving New York City."""

ne_tree = ne_chunk(pos_tag(word_tokenize(sentence))) 
ne_in_sent = []
for subtree in ne_tree:
    # Named-entity chunks are Tree nodes (requires `from nltk.tree import Tree`);
    # plain (token, pos) tuples are outside any entity, i.e. NE == "O".
    if isinstance(subtree, Tree):
        # print(subtree.label())
        # print(subtree.leaves())
        ne_label = subtree.label()
        ne_string = " ".join(token for token, pos in subtree.leaves())
        ne_in_sent.append((ne_string, ne_label))

print(ne_in_sent)

NLTK output

[('John', 'PERSON'), ('Kennedy International Airport', 'PERSON'), ('Queens', 'GPE'), ('New York', 'GPE'), ('USA', 'ORGANIZATION'), ('New York City', 'GPE')]

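In my opinion, Stanza is more advanced than NLTK.

The reason is how each handles ambiguous or confusing examples such as John F. Kennedy (airport vs. person). Stanza NER identified the entity: John F. Kennedy International Airport, type: FAC (facility: buildings, airports, highways, bridges, etc.)

Stanza identified it correctly.

NLTK NER identified the entities: ('John', 'PERSON'), ('Kennedy International Airport', 'PERSON')

NLTK identified it only partially.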
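
The difference can also be checked mechanically by diffing the two result lists. The sketch below only copies the entities from the Stanza and NLTK outputs above into plain Python lists; the variable names are illustrative and nothing here calls either library.

```python
# Results copied from the Stanza and NLTK outputs shown above.
stanza_ents = [
    ("John F. Kennedy International Airport", "FAC"),
    ("Queens", "GPE"),
    ("New York", "GPE"),
    ("USA", "GPE"),
    ("one", "CARDINAL"),
    ("New York City", "GPE"),
]
nltk_ents = [
    ("John", "PERSON"),
    ("Kennedy International Airport", "PERSON"),
    ("Queens", "GPE"),
    ("New York", "GPE"),
    ("USA", "ORGANIZATION"),
    ("New York City", "GPE"),
]

stanza_map = dict(stanza_ents)
nltk_map = dict(nltk_ents)

# Same span and same label in both tools.
agree = sorted(t for t in stanza_map if nltk_map.get(t) == stanza_map[t])
# Same span, different label.
label_conflict = sorted(
    t for t in stanza_map if t in nltk_map and nltk_map[t] != stanza_map[t]
)
# Spans reported by only one of the tools.
only_stanza = sorted(set(stanza_map) - set(nltk_map))
only_nltk = sorted(set(nltk_map) - set(stanza_map))

print("agree:", agree)
print("label conflict:", label_conflict)
print("only Stanza:", only_stanza)
print("only NLTK:", only_nltk)
```

The two tools agree on Queens, New York, and New York City, disagree on the label for USA, and split the airport name differently.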

I also cross-checked the result with AllenNLP.

AllenNLP

import allennlp
from allennlp.predictors.predictor import Predictor
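# Note: the archive name below refers to a BERT-based SRL model. The conversion
# function that follows reads the "words" and "tags" fields that an NER tagger
# returns, so an NER model archive is presumably needed to reproduce the output shown.
predictor = Predictor.from_path("https://storage.googleapis.com/allennlp-public-models/bert-base-srl-2020.09.03.tar.gz")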
document = """John F. Kennedy International Airport is an international airport in Queens, New York, USA, and one of the primary airports serving New York City. """
def convert_results(allen_results):
  ents = set()
  for word, tag in zip(allen_results["words"], allen_results["tags"]):
    if tag != "O":
      ent_position, ent_type = tag.split("-")
      if ent_position == "U":
        ents.add((word,ent_type))
      else:
        if ent_position == "B":
          w = word
        elif ent_position == "I":
          w += " " + word
        elif ent_position == "L":
          w += " " + word
          ents.add((w,ent_type))
  return ents

def allennlp_ner(document):
  return convert_results(predictor.predict(sentence=document))

allennlp_ner(document)

AllenNLP output

{('John F. Kennedy International Airport', 'LOC'),  ('New York', 'LOC'),  ('New York City', 'LOC'),  ('Queens', 'LOC'),  ('USA', 'LOC')}
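
The conversion step can be tried without downloading any model. The following is a minimal sketch of BIOUL decoding (B = begin, I = inside, L = last, U = unit, O = outside), run on a hand-written sample that only mimics the `words` and `tags` fields used above; it is not real model output, and `decode_bioul` is a hypothetical helper rather than part of AllenNLP.

```python
def decode_bioul(words, tags):
    """Collect (entity text, type) pairs from BIOUL tags, in order."""
    entities = []
    current = []  # tokens of the entity currently being built
    for word, tag in zip(words, tags):
        if tag == "O":
            current = []
            continue
        position, ent_type = tag.split("-", 1)
        if position == "U":      # single-token entity
            entities.append((word, ent_type))
        elif position == "B":    # first token of a multi-token entity
            current = [word]
        elif position == "I":    # middle token
            current.append(word)
        elif position == "L":    # last token: emit the full span
            current.append(word)
            entities.append((" ".join(current), ent_type))
            current = []
    return entities


# Hand-written sample, not real model output.
sample = {
    "words": ["airport", "in", "Queens", ",", "New", "York", "City"],
    "tags": ["O", "O", "U-LOC", "O", "B-LOC", "I-LOC", "L-LOC"],
}
result = decode_bioul(sample["words"], sample["tags"])
print(result)
```

Unlike the set-based version above, this variant returns a list so that entities keep their order of appearance.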


Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/64671500


Q: Difference between NLTK's ne_chunk and Stanza's NER?
