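I want to extract all mentions of countries and nationalities from a text using nltk. I used POS tagging to extract all GPE-labeled tokens, but the results are not satisfactory.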
import nltk
abstract="Thyroid-associated orbitopathy (TO) is an autoimmune-mediated orbital inflammation that can lead to disfigurement and blindness. Multiple genetic loci have been associated with Graves' disease, but the genetic basis for TO is largely unknown. This study aimed to identify loci associated with TO in individuals with Graves' disease, using a genome-wide association scan (GWAS) for the first time to our knowledge in TO.Genome-wide association scan was performed on pooled DNA from an Australian Caucasian discovery cohort of 265 participants with Graves' disease and TO (cases) and 147 patients with Graves' disease without TO (controls). "
# Tokenize, POS-tag, then run NLTK's named-entity chunker.
sent = nltk.tokenize.wordpunct_tokenize(abstract)
pos_tag = nltk.pos_tag(sent)
nes = nltk.ne_chunk(pos_tag)
places = []
for ne in nes:
    # Named entities come back as subtrees; ordinary tokens are (word, tag) tuples.
    if type(ne) is nltk.tree.Tree:
        if (ne.label() == 'GPE'):
            places.append(u' '.join([i[0] for i in ne.leaves()]))
if len(places) == 0:
    places.append("N/A")
The result I get is:
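['Thyroid', 'Australian', 'Caucasian', 'Graves']
Some of these are nationalities or ethnicities, but others are just nouns.
So what am I doing wrong, or is there another way to extract this information?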
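Posted on 2016-06-22 20:40:01
So, after the helpful comments, I dug deeper into different NER tools to find the best one for recognizing mentions of nationalities and countries, and found that spaCy has a NORP entity type that extracts nationalities effectively. https://spacy.io/docs/usage/entity-recognition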
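A minimal sketch of what that could look like, assuming spaCy is installed and an English model has been downloaded. The model name `en_core_web_sm` and both function names are my own choices for illustration, not something the answer specifies. In spaCy's label scheme, GPE covers countries, cities and states, while NORP covers nationalities and religious or political groups.

```python
def filter_entities(doc, labels=("GPE", "NORP")):
    """Return (text, label) pairs for the entities whose label is in `labels`.

    `doc` is expected to look like a spaCy Doc: it exposes `.ents`, and each
    entity exposes `.text` and `.label_`.
    """
    return [(ent.text, ent.label_) for ent in doc.ents if ent.label_ in labels]


def extract_places_and_nationalities(text, model_name="en_core_web_sm"):
    """Run a spaCy pipeline over `text` and keep only GPE and NORP entities.

    `model_name` is an assumption; any English pipeline with an NER component
    should work in its place.
    """
    import spacy  # imported lazily so filter_entities can be used on its own

    nlp = spacy.load(model_name)  # raises OSError if the model is not installed
    return filter_entities(nlp(text))
```

Calling `extract_places_and_nationalities(abstract)` on the abstract from the question returns a list of `(text, label)` tuples. Which entities are actually found depends on the model and its version.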
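Posted on 2016-06-18 15:12:28
If you want to extract country names, what you need is an NER tagger, not a POS tagger.
Named-entity recognition (NER) is a subtask of information extraction that seeks to locate elements in text and classify them into predefined categories such as names of persons, organizations, locations, expressions of time, quantities, monetary values, percentages, and so on.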
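To make the definition concrete: token-level NER taggers typically return one `(token, label)` pair per token, with a filler label for tokens outside any entity. The sketch below assumes that output shape, with `LOCATION` and `O` as label names. The sample data is hand-written for illustration, not real tagger output, and `group_entities` is a hypothetical helper. It merges runs of consecutive tokens that share a wanted label into a single entity.

```python
from itertools import groupby


def group_entities(tagged_tokens, wanted=("LOCATION",)):
    """Merge consecutive tokens that share a wanted label into one entity."""
    entities = []
    # groupby only groups *adjacent* items with the same key, which is what
    # turns ("New", "LOCATION"), ("Zealand", "LOCATION") into "New Zealand".
    for label, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if label in wanted:
            entities.append((" ".join(token for token, _ in run), label))
    return entities


# Hand-written sample in the assumed (token, label) format.
tagged = [
    ("She", "O"), ("moved", "O"), ("from", "O"),
    ("New", "LOCATION"), ("Zealand", "LOCATION"),
    ("to", "O"), ("Australia", "LOCATION"),
]
print(group_entities(tagged))
# [('New Zealand', 'LOCATION'), ('Australia', 'LOCATION')]
```

One limitation of this simple grouping: two distinct entities of the same type that sit directly next to each other would be merged into one.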
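Take a look at the Stanford NER tagger!
# Older NLTK releases called this class NERTagger; newer ones name it StanfordNERTagger.
from nltk.tag.stanford import StanfordNERTagger
# Paths to the Stanford NER model and jar, as given in this answer.
st = StanfordNERTagger('../ner-model.ser.gz','../stanford-ner.jar')
# text is the input string, for example the abstract from the question.
tagging = st.tag(text.split())
Posted on 2016-06-21 19:13:40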
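Here is geograpy, which uses NLTK to perform entity extraction. It stores all places and locations in the form of a gazetteer, then performs lookups against that gazetteer to retrieve the relevant places and locations. See the documentation for more usage details.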
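The gazetteer idea can be sketched in a few lines of plain Python. This is a deliberately simplified illustration, not how geograpy is implemented: the tiny `GAZETTEER` dictionary and the `lookup_places` function are hypothetical, there is no NER step, and only single-word names are matched.

```python
import re

# Hypothetical miniature gazetteer mapping lowercase names to a category.
# A real gazetteer would hold many thousands of entries.
GAZETTEER = {
    "australia": "country",
    "australian": "nationality",
    "france": "country",
    "french": "nationality",
}


def lookup_places(text, gazetteer=GAZETTEER):
    """Return (token, category) pairs for every word found in the gazetteer."""
    found = []
    for token in re.findall(r"[A-Za-z]+", text):
        category = gazetteer.get(token.lower())  # case-insensitive lookup
        if category is not None:
            found.append((token, category))
    return found


print(lookup_places("an Australian Caucasian discovery cohort"))
# [('Australian', 'nationality')]
```

Unlike the NER chunker in the question, a lookup of this kind never returns words such as "Thyroid" or "Graves", because only names present in the gazetteer can match. With geograpy itself, the call looks like this: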
from geograpy import extraction
# Adjacent string literals are joined by Python, so the text can span several lines.
text = (
    "Thyroid-associated orbitopathy (TO) is an autoimmune-mediated orbital "
    "inflammation that can lead to disfigurement and blindness. Multiple genetic "
    "loci have been associated with Graves' disease, but the genetic basis for TO "
    "is largely unknown. This study aimed to identify loci associated with TO in "
    "individuals with Graves' disease, using a genome-wide association scan (GWAS) "
    "for the first time to our knowledge in TO.Genome-wide association scan was "
    "performed on pooled DNA from an Australian Caucasian discovery cohort of 265 "
    "participants with Graves' disease and TO (cases) and 147 patients with "
    "Graves' disease without TO (controls)."
)
e = extraction.Extractor(text=text)
e.find_entities()
# places is an attribute filled in by find_entities(), not a method.
print(e.places)
https://stackoverflow.com/questions/37886534