Extracting full names using ne_chunk

Stack Overflow user
Asked 2020-11-25 09:35:15
Answers: 1 · Views: 81 · Followers: 0 · Score: 1

Newbie here. I'm trying to extract the full names of people and organizations using the following code.

def get_continuous_chunks(text):
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    continuous_chunk = []
    current_chunk = []
    for i in chunked:
        if type(i) == Tree:
            current_chunk.append(' '.join([token for token, pos in i.leaves()]))
            if current_chunk:
                named_entity = ' '.join(current_chunk)
                if named_entity not in continuous_chunk:
                    continuous_chunk.append(named_entity)
                    current_chunk = []
                else:
                    continue
                return continuous_chunk

            
>>> my_sent = "Toni Morrison was the first black female editor in fiction at Random House in New York City."
>>> get_continuous_chunks(my_sent)
['Toni']

As you can see, it returns only the first proper noun: not the full name, and none of the other proper nouns in the string.

What am I doing wrong?


1 Answer

Stack Overflow user

Accepted answer

Posted 2020-11-25 11:28:00

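Here is some code that works.

The best approach is to step through the code and put plenty of print statements in different places. You'll see where I print the type() and str() value of each item you're iterating over. I find that seeing the items listed out helps me visualize, and think harder about, the loops and conditionals I'm writing.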

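Also, oops, I inadvertently named all the variables "contiguous" instead of "continuous"... not sure why. "Contiguous" is probably more accurate anyway.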
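To make the tracing advice concrete, here is a minimal sketch that does not need NLTK. It assumes the control flow shown in the question, where `return` sits inside the loop body, and uses plain lists as a stand-in for `Tree` chunks. The names `first_only` and `all_matches` are hypothetical, introduced only for illustration.

```python
def first_only(items):
    """Mimics the question's control flow: `return` sits inside the loop."""
    found = []
    for item in items:
        # Trace every iteration so the early exit becomes visible.
        print(f"{type(item).__name__}: {item!r}")
        if isinstance(item, list):  # stand-in for the Tree check
            found.append(' '.join(item))
            return found  # exits on the first match
    return found


def all_matches(items):
    """Same loop, but the result is returned only after the loop finishes."""
    found = []
    for item in items:
        if isinstance(item, list):
            found.append(' '.join(item))
    return found


# Lists play the role of named-entity chunks, strings the role of other tokens.
items = [["Toni"], ["Morrison"], "was", ["Random", "House"]]
print(first_only(items))
print(all_matches(items))
```

The trace from `first_only` prints a single line before the function returns, which is the symptom in the question: the loop never reaches the second item.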

Code:

from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree


def get_continuous_chunks(text):
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    current_chunk = []
    contiguous_chunk = []
    contiguous_chunks = []

    for i in chunked:
        print(f"{type(i)}: {i}")
        if isinstance(i, Tree):
            # A Tree is a named-entity chunk; join its tokens into one string.
            current_chunk = ' '.join([token for token, pos in i.leaves()])
            # Apparently, Toni and Morrison are two separate items,
            # but "Random House" and "New York City" are single items.
            contiguous_chunk.append(current_chunk)
        else:
            # discontiguous, append to known contiguous chunks.
            if len(contiguous_chunk) > 0:
                contiguous_chunks.append(' '.join(contiguous_chunk))
                contiguous_chunk = []
                current_chunk = []

    # Flush a trailing entity when the text ends without a non-entity token.
    if contiguous_chunk:
        contiguous_chunks.append(' '.join(contiguous_chunk))

    return contiguous_chunks

my_sent = "Toni Morrison was the first black female editor in fiction at Random House in New York City."


print()
contig_chunks = get_continuous_chunks(my_sent)
print(f"INPUT: My sentence: '{my_sent}'")
print(f"ANSWER: My contiguous chunks: {contig_chunks}")

Execution:

(venv) [ttucker@zim stackoverflow]$ python contig.py 

<class 'nltk.tree.Tree'>: (PERSON Toni/NNP)
<class 'nltk.tree.Tree'>: (PERSON Morrison/NNP)
<class 'tuple'>: ('was', 'VBD')
<class 'tuple'>: ('the', 'DT')
<class 'tuple'>: ('first', 'JJ')
<class 'tuple'>: ('black', 'JJ')
<class 'tuple'>: ('female', 'NN')
<class 'tuple'>: ('editor', 'NN')
<class 'tuple'>: ('in', 'IN')
<class 'tuple'>: ('fiction', 'NN')
<class 'tuple'>: ('at', 'IN')
<class 'nltk.tree.Tree'>: (ORGANIZATION Random/NNP House/NNP)
<class 'tuple'>: ('in', 'IN')
<class 'nltk.tree.Tree'>: (GPE New/NNP York/NNP City/NNP)
<class 'tuple'>: ('.', '.')
INPUT: My sentence: 'Toni Morrison was the first black female editor in fiction at Random House in New York City.'
ANSWER: My contiguous chunks: ['Toni Morrison', 'Random House', 'New York City']

I'm not entirely sure what exactly you're looking for, but from the description, this seems to be it.
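As an alternative sketch of the same grouping idea, runs of adjacent entity chunks can be merged with `itertools.groupby`. This is only an illustration under stated assumptions: `FakeTree` is a hypothetical stand-in for `nltk.tree.Tree` so the example runs without NLTK or its models, and `group_adjacent_entities` is a made-up name. The check relies on entity chunks having a `leaves()` method, which plain `(token, tag)` tuples lack.

```python
from itertools import groupby


class FakeTree:
    """Hypothetical stand-in for nltk.tree.Tree, so the sketch runs without NLTK."""

    def __init__(self, label, leaves):
        self._label = label
        self._leaves = leaves

    def label(self):
        return self._label

    def leaves(self):
        return self._leaves


def group_adjacent_entities(chunked):
    """Join each run of neighbouring entity chunks into one string."""
    entities = []
    # groupby splits the sequence into runs of entity / non-entity items.
    for is_entity, run in groupby(chunked, key=lambda item: hasattr(item, "leaves")):
        if is_entity:
            tokens = [token for chunk in run for token, pos in chunk.leaves()]
            entities.append(" ".join(tokens))
    return entities


# Hand-built data shaped like the printed ne_chunk output above.
chunked = [
    FakeTree("PERSON", [("Toni", "NNP")]),
    FakeTree("PERSON", [("Morrison", "NNP")]),
    ("was", "VBD"),
    ("at", "IN"),
    FakeTree("ORGANIZATION", [("Random", "NNP"), ("House", "NNP")]),
    ("in", "IN"),
    FakeTree("GPE", [("New", "NNP"), ("York", "NNP"), ("City", "NNP")]),
]
print(group_adjacent_entities(chunked))
```

Because `groupby` closes the final run on its own, an entity at the very end of the input is still collected. Like the loop above, this sketch merges neighbouring chunks regardless of label, so a PERSON directly followed by an ORGANIZATION would be joined into one string; adding the label to the grouping key would be one way to avoid that.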

Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/64997336
