Question: Extracting nationalities and countries from text

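Stack Overflow user

Asked 2016-06-18 00:44:31

Answers: 4 | Views: 9.7K | Followers: 0 | Votes: 12

I want to use nltk to extract every mention of a country or nationality from a text. I POS-tagged the tokens, ran the chunker, and kept everything labeled GPE, but the results are not satisfactory.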

    import nltk

    abstract = "Thyroid-associated orbitopathy (TO) is an autoimmune-mediated orbital inflammation that can lead to disfigurement and blindness. Multiple genetic loci have been associated with Graves' disease, but the genetic basis for TO is largely unknown. This study aimed to identify loci associated with TO in individuals with Graves' disease, using a genome-wide association scan (GWAS) for the first time to our knowledge in TO.Genome-wide association scan was performed on pooled DNA from an Australian Caucasian discovery cohort of 265 participants with Graves' disease and TO (cases) and 147 patients with Graves' disease without TO (controls). "

    sent = nltk.tokenize.wordpunct_tokenize(abstract)
    pos_tag = nltk.pos_tag(sent)
    nes = nltk.ne_chunk(pos_tag)
    places = []
    for ne in nes:
        # Named entities come back as subtrees; ordinary tokens are (word, tag) tuples.
        if isinstance(ne, nltk.tree.Tree) and ne.label() == 'GPE':
            places.append(' '.join(i[0] for i in ne.leaves()))
    # Add the placeholder only once, after the whole text has been scanned.
    if len(places) == 0:
        places.append("N/A")

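The result is:

['Thyroid', 'Australian', 'Caucasian', 'Graves']

Some of these are nationalities, but others are just nouns.

So what am I doing wrong, or is there another way to extract this information?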

Tags: pos-tagger, python, nlp, nltk

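4 Answers

Stack Overflow user

Answered 2016-06-22 20:40:01

After the helpful comments, I dug into several NER tools to find the one that best recognizes mentions of nationalities and countries. It turns out spaCy has a NORP entity type that extracts nationalities effectively. https://spacy.io/docs/usage/entity-recognition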
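As a rough sketch of how the NORP label might be used in practice: the snippet below filters entities by label, keeping NORP (nationalities, religious or political groups) and GPE (countries, cities, states). The helper names, the `FakeEnt` stand-in, and the choice of the `en_core_web_sm` model are assumptions for illustration; they do not come from the answer. The spaCy call itself is wrapped in a function so the filtering step can be tried without a model installed.

```python
from collections import namedtuple

# Labels as used by spaCy's English models:
# NORP = nationalities, religious or political groups; GPE = countries, cities, states.
WANTED_LABELS = {"NORP", "GPE"}


def pick_nationalities_and_countries(entities, wanted=WANTED_LABELS):
    """Return the text of entities whose label is in `wanted`, without duplicates."""
    seen = []
    for ent in entities:
        if ent.label_ in wanted and ent.text not in seen:
            seen.append(ent.text)
    return seen


def extract_with_spacy(text, model_name="en_core_web_sm"):
    """Run a spaCy pipeline over `text` and filter its entities.

    Assumes spaCy and the named model are installed
    (python -m spacy download en_core_web_sm).
    """
    import spacy  # imported lazily so the helper above works without spaCy

    nlp = spacy.load(model_name)
    return pick_nationalities_and_countries(nlp(text).ents)


# Hypothetical stand-in for spaCy Span objects, used only to show the filter.
FakeEnt = namedtuple("FakeEnt", ["text", "label_"])
sample = [
    FakeEnt("Australian", "NORP"),
    FakeEnt("265", "CARDINAL"),
    FakeEnt("Australia", "GPE"),
    FakeEnt("Australian", "NORP"),
]
print(pick_nationalities_and_countries(sample))  # ['Australian', 'Australia']
```

With a model installed, `extract_with_spacy(abstract)` would apply the same filter to real predictions; what it returns depends on the model and its version.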

Votes: 6

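Stack Overflow user

Answered 2016-06-18 15:12:28

If you want to extract country names, what you need is an NER tagger, not a POS tagger.

Named entity recognition (NER) is a subtask of information extraction that seeks to locate elements in text and classify them into predefined categories such as person names, organizations, locations, time expressions, quantities, monetary values, percentages, and so on.

Take a look at the Stanford NER tagger!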

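    from nltk.tag import StanfordNERTagger  # called NERTagger in older NLTK releases

    # Paths to a downloaded Stanford NER model and jar; adjust them to your setup.
    st = StanfordNERTagger('../ner-model.ser.gz', '../stanford-ner.jar')
    # text is the input string; tag() returns a list of (token, label) pairs.
    tagging = st.tag(text.split())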
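The tagger returns one label per token, so multi-word names still have to be stitched together. A minimal sketch of that post-processing step, assuming the model uses a LOCATION label (as the Stanford 3-class English model does) and using a hand-written `tagging` list in place of real tagger output; `collect_entities` is a hypothetical helper, not part of NLTK:

```python
from itertools import groupby


def collect_entities(tagged_tokens, wanted_label="LOCATION"):
    """Join runs of consecutive tokens that carry `wanted_label`."""
    entities = []
    # groupby yields one group per run of adjacent tokens with the same label.
    for label, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if label == wanted_label:
            entities.append(" ".join(token for token, _ in group))
    return entities


# Hand-written example in the (token, label) shape that st.tag() returns.
tagging = [
    ("patients", "O"), ("from", "O"),
    ("New", "LOCATION"), ("Zealand", "LOCATION"),
    ("and", "O"), ("Australia", "LOCATION"),
]
print(collect_entities(tagging))  # ['New Zealand', 'Australia']
```

One limitation of this grouping: two different locations that appear directly next to each other, with no other token in between, would be merged into a single entry.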

Votes: 3

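Stack Overflow user

Answered 2016-06-21 19:13:40

Here is geograpy, which uses NLTK to perform entity extraction. It stores all places and locations in a gazetteer, then looks up the extracted entities in that gazetteer to find the relevant places and locations. See its documentation for more usage details.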

    from geograpy import extraction

    # Adjacent string literals are concatenated, so the abstract stays one string.
    e = extraction.Extractor(text=(
        "Thyroid-associated orbitopathy (TO) is an autoimmune-mediated orbital "
        "inflammation that can lead to disfigurement and blindness. Multiple "
        "genetic loci have been associated with Graves' disease, but the genetic "
        "basis for TO is largely unknown. This study aimed to identify loci "
        "associated with TO in individuals with Graves' disease, using a "
        "genome-wide association scan (GWAS) for the first time to our knowledge "
        "in TO.Genome-wide association scan was performed on pooled DNA from an "
        "Australian Caucasian discovery cohort of 265 participants with Graves' "
        "disease and TO (cases) and 147 patients with Graves' disease without TO "
        "(controls)."
    ))

    e.find_entities()
    # places is a list attribute filled in by find_entities(), not a method.
    print(e.places)
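To illustrate the gazetteer lookup idea this answer describes, here is a minimal sketch that matches tokens against a small hand-made dictionary. The `GAZETTEER` entries and the `lookup_places` helper are hypothetical and far smaller than any real gazetteer; this is not how geograpy itself is implemented.

```python
import re

# Tiny hypothetical gazetteer: lowercase surface form -> kind of mention.
GAZETTEER = {
    "australia": "country",
    "australian": "nationality",
    "france": "country",
    "french": "nationality",
}


def lookup_places(text, gazetteer=GAZETTEER):
    """Return (token, kind) pairs for tokens found in the gazetteer, in order."""
    hits = []
    for token in re.findall(r"[A-Za-z]+", text):
        kind = gazetteer.get(token.lower())
        if kind is not None and (token, kind) not in hits:
            hits.append((token, kind))
    return hits


sample = "DNA from an Australian Caucasian cohort recruited in Australia"
print(lookup_places(sample))
# [('Australian', 'nationality'), ('Australia', 'country')]
```

A plain dictionary lookup like this never mistakes "Thyroid" or "Graves" for a place, but it only finds names that are already listed and cannot handle multi-word names without extra matching logic.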

Votes: 3

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link: https://stackoverflow.com/questions/37886534
