
Extracting nationalities and countries from text

Stack Overflow user
Asked 2016-06-18 00:44:31
Answers: 4 · Views: 9.7K · Followers: 0 · Votes: 12

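I want to extract all mentions of countries and nationalities from a text using nltk. I used POS tagging to extract all GPE-labeled tokens, but the results are unsatisfying.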

import nltk

abstract = "Thyroid-associated orbitopathy (TO) is an autoimmune-mediated orbital inflammation that can lead to disfigurement and blindness. Multiple genetic loci have been associated with Graves' disease, but the genetic basis for TO is largely unknown. This study aimed to identify loci associated with TO in individuals with Graves' disease, using a genome-wide association scan (GWAS) for the first time to our knowledge in TO.Genome-wide association scan was performed on pooled DNA from an Australian Caucasian discovery cohort of 265 participants with Graves' disease and TO (cases) and 147 patients with Graves' disease without TO (controls). "

sent = nltk.tokenize.wordpunct_tokenize(abstract)
pos_tag = nltk.pos_tag(sent)
nes = nltk.ne_chunk(pos_tag)
places = []
for ne in nes:
    # Named entities come back as subtrees; plain tokens are (word, tag) tuples
    if isinstance(ne, nltk.tree.Tree) and ne.label() == 'GPE':
        places.append(u' '.join(i[0] for i in ne.leaves()))
# Add the placeholder only after the whole text has been scanned
if len(places) == 0:
    places.append("N/A")

The result I get is:

['Thyroid', 'Australian', 'Caucasian', 'Graves']

Some of these are nationalities, but others are just nouns.

So what am I doing wrong, or is there another way to extract this information?


4 Answers

Stack Overflow user

Answered 2016-06-22 20:40:01

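After the helpful comments, I dug deeper into different NER tools to find the one that best recognizes mentions of nationalities and countries. I found that spaCy has a NORP entity type that extracts nationalities effectively. https://spacy.io/docs/usage/entity-recognition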
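A rough sketch of that spaCy approach might look like the following. It assumes spaCy is installed together with the `en_core_web_sm` English model (the model name is an assumption, the answer does not name one), and it keeps only entities labelled `NORP` (nationalities, religious or political groups) or `GPE` (countries, cities, states). The helper names are made up for illustration.

```python
def group_by_label(entities, labels=("NORP", "GPE")):
    """Group (text, label) pairs by label, keeping only the labels of interest."""
    grouped = {label: [] for label in labels}
    for text, label in entities:
        if label in grouped:
            grouped[label].append(text)
    return grouped


def extract_nationalities_and_places(text, model_name="en_core_web_sm"):
    """Run spaCy NER over text and return its NORP and GPE mentions.

    Assumes the model was installed beforehand, e.g. with
    `python -m spacy download en_core_web_sm`.
    """
    import spacy  # imported lazily so the pure-Python helper works without spaCy

    nlp = spacy.load(model_name)
    doc = nlp(text)
    return group_by_label((ent.text, ent.label_) for ent in doc.ents)
```

Calling `extract_nationalities_and_places(abstract)` returns a dict with one list per label. What ends up in those lists depends on the model, and `NORP` also covers religious and political groups, so some post-filtering may still be needed.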

Votes: 6

Stack Overflow user

Answered 2016-06-18 15:12:28

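If you want to extract country names, what you need is an NER tagger, not a POS tagger.

Named-entity recognition (NER) is a subtask of information extraction that seeks to locate elements in text and classify them into predefined categories such as names of persons, organizations, locations, expressions of time, quantities, monetary values, percentages, etc.

Check out the Stanford NER tagger!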

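# Recent NLTK releases name this class StanfordNERTagger (older ones called it NERTagger)
from nltk.tag.stanford import StanfordNERTagger

# Paths to the serialized model and the Stanford NER jar; adjust to your installation
st = StanfordNERTagger('../ner-model.ser.gz', '../stanford-ner.jar')
# text is the input string; tag() returns a list of (token, tag) tuples
tagging = st.tag(text.split())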
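The tagger returns one `(token, tag)` pair per token, so multi-word names such as "New Zealand" arrive split in two. One possible way to merge them is sketched below. It assumes the model emits a `LOCATION` tag for place names, as the standard 3-class English model does. The helper name and the sample input are made up for illustration, not real tagger output.

```python
from itertools import groupby


def collect_entities(tagged_tokens, wanted=("LOCATION",)):
    """Merge consecutive tokens that share a wanted tag into single phrases."""
    phrases = []
    for tag, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag in wanted:
            phrases.append(" ".join(token for token, _ in group))
    return phrases


# Hand-written sample in the (token, tag) shape the tagger returns
sample = [("born", "O"), ("in", "O"), ("New", "LOCATION"), ("Zealand", "LOCATION"),
          ("and", "O"), ("Australia", "LOCATION")]
print(collect_entities(sample))  # ['New Zealand', 'Australia']
```

Two different locations that sit directly next to each other would be merged into one phrase, so treat this as a starting point only.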
Votes: 3

Stack Overflow user

Answered 2016-06-21 19:13:40

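Here is geograpy, which uses NLTK to perform entity extraction. It stores all places and locations in a gazetteer, then looks up the extracted entities in that gazetteer to find the relevant places and locations. See the documentation for more usage details: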

from geograpy import extraction

# Adjacent string literals are concatenated, so the text stays one valid string
text = (
    "Thyroid-associated orbitopathy (TO) is an autoimmune-mediated orbital "
    "inflammation that can lead to disfigurement and blindness. Multiple genetic "
    "loci have been associated with Graves' disease, but the genetic basis for TO "
    "is largely unknown. This study aimed to identify loci associated with TO in "
    "individuals with Graves' disease, using a genome-wide association scan "
    "(GWAS) for the first time to our knowledge in TO.Genome-wide association "
    "scan was performed on pooled DNA from an Australian Caucasian discovery "
    "cohort of 265 participants with Graves' disease and TO (cases) and 147 "
    "patients with Graves' disease without TO (controls)."
)
e = extraction.Extractor(text=text)

e.find_entities()
# places is a list attribute filled in by find_entities()
print(e.places)
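The gazetteer lookup described above can be illustrated with a few lines of plain Python. This is only a sketch of the general idea, not geograpy's actual implementation. The tiny `GAZETTEER` dict and the `lookup_mentions` helper are hypothetical, and a real gazetteer would list every country and demonym.

```python
import re

# Tiny illustrative gazetteer; a real one would hold every country and demonym
GAZETTEER = {
    "australia": "country",
    "australian": "nationality",
    "new zealand": "country",
    "french": "nationality",
}


def lookup_mentions(text, gazetteer=GAZETTEER):
    """Return (mention, kind) pairs for gazetteer entries found in text."""
    found = []
    for name, kind in gazetteer.items():
        # Word boundaries stop "australia" from matching inside "australian"
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            found.append((match.start(), match.group(0), kind))
    found.sort()  # report mentions in the order they appear in the text
    return [(mention, kind) for _, mention, kind in found]


print(lookup_mentions("pooled DNA from an Australian Caucasian discovery cohort"))
# [('Australian', 'nationality')]
```

A lookup like this never returns words such as "Thyroid" or "Graves", but it only finds names that are already in the list.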
Votes: 3
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/37886534
