
Optimizing maximum similarity for WSD in Python

Code Review user
Asked 2014-07-24 08:30:55
1 answer · 926 views · 0 followers · score 3

I have a simple word sense disambiguation (WSD) library.

I have a WSD function based on maximizing the sum of maximum similarity scores for each word. It is slow, however, because it loops over every word in the input sentence and then computes the maximum similarity score between every sense of each word.

How can I speed up the max_similarity function? (Cython tips welcome.)

#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""
User requested feature. WSD by maximizing similarity. 
"""

from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic as wnic
from nltk.tokenize import word_tokenize

def similarity_by_path(sense1, sense2, option="path"):
    """ Returns maximum path similarity between two senses. """
    if option.lower() in ["path", "path_similarity"]: # Path similarity
        return max(wn.path_similarity(sense1, sense2),
                   wn.path_similarity(sense2, sense1))
    elif option.lower() in ["wup", "wupa", "wu-palmer"]: # Wu-Palmer
        return wn.wup_similarity(sense1, sense2)
    elif option.lower() in ['lch', "leacock-chodorow", "leacock-chordorow"]: # Leacock-Chodorow
        if sense1.pos != sense2.pos: # lch can't do diff POS
            return 0
        return wn.lch_similarity(sense1, sense2)

def similarity_by_infocontent(sense1, sense2, option):
    """ Returns similarity scores by information content. """
    if sense1.pos != sense2.pos: # infocontent sim can't do diff POS.
        return 0

    info_contents = ['ic-bnc-add1.dat', 'ic-bnc-resnik-add1.dat', 
                     'ic-bnc-resnik.dat', 'ic-bnc.dat', 

                     'ic-brown-add1.dat', 'ic-brown-resnik-add1.dat', 
                     'ic-brown-resnik.dat', 'ic-brown.dat', 

                     'ic-semcor-add1.dat', 'ic-semcor.dat',

                     'ic-semcorraw-add1.dat', 'ic-semcorraw-resnik-add1.dat', 
                     'ic-semcorraw-resnik.dat', 'ic-semcorraw.dat', 

                     'ic-shaks-add1.dat', 'ic-shaks-resnik.dat', 
                     'ic-shaks-resnink-add1.dat', 'ic-shaks.dat', 

                     'ic-treebank-add1.dat', 'ic-treebank-resnik-add1.dat', 
                     'ic-treebank-resnik.dat', 'ic-treebank.dat']

    if option in ['res', 'resnik']:
        return wn.res_similarity(sense1, sense2, wnic.ic('ic-bnc-resnik-add1.dat'))
    #return min(wn.res_similarity(sense1, sense2, wnic.ic(ic)) \
    #             for ic in info_contents)

    elif option in ['jcn', "jiang-conrath"]:
        return wn.jcn_similarity(sense1, sense2, wnic.ic('ic-bnc-add1.dat'))

    elif option in ['lin']:
        return wn.lin_similarity(sense1, sense2, wnic.ic('ic-bnc-add1.dat'))

def sim(sense1, sense2, option="path"):
    """ Calculates similarity based on user's choice. """
    option = option.lower()
    if option in ["path", "path_similarity",
                  "wup", "wupa", "wu-palmer",
                  "lch", "leacock-chodorow", "leacock-chordorow"]:
        return similarity_by_path(sense1, sense2, option)
    elif option in ["res", "resnik",
                    "jcn", "jiang-conrath",
                    "lin"]:
        return similarity_by_infocontent(sense1, sense2, option)

def max_similarity(context_sentence, ambiguous_word, option="path", 
                   pos=None, best=True):
    """
    Perform WSD by maximizing the sum of maximum similarity between possible 
    synsets of all words in the context sentence and the possible synsets of the 
    ambiguous words (see http://goo.gl/XMq2BI):
    \argmax_{synset(a)} \sum_{i}^{n} \max_{synset(i)} sim(i, a)
    """
    result = {}
    for i in wn.synsets(ambiguous_word):
        try:
            if pos and pos != str(i.pos()): # newer NLTK: pos is a method
                continue
        except TypeError: # older NLTK: pos is a plain attribute
            if pos and pos != str(i.pos):
                continue
        result[i] = sum(max([sim(i,k,option) for k in wn.synsets(j)]+[0]) \
                        for j in word_tokenize(context_sentence))

    if option in ["res","resnik"]: # lower score = more similar
        result = sorted([(v,k) for k,v in result.items()])
    else: # higher score = more similar
        result = sorted([(v,k) for k,v in result.items()],reverse=True)
    print result
    if best: return result[0][1]
    return result


bank_sents = ['I went to the bank to deposit my money',
              'The river bank was full of dead fishes']
ans = max_similarity(bank_sents[0], 'bank', pos="n", option="res")
print ans           # best=True, so ans is a single Synset
print ans.definition
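Two easy wins are visible in the code above: `similarity_by_infocontent` re-parses the information-content file (`wnic.ic(...)`) on every call, so loading it once at module level helps, and `max_similarity` recomputes the same sense-pair similarity whenever a token repeats in the sentence. A minimal sketch of the second idea, memoizing pairwise scores with Python 3's `functools.lru_cache` (a stand-in similarity function keeps the sketch runnable without NLTK; on Python 2 a plain dict cache works the same way):

```python
from functools import lru_cache

calls = {"count": 0}

@lru_cache(maxsize=None)
def cached_sim(sense1, sense2):
    """Stand-in for a pairwise similarity such as wn.path_similarity."""
    calls["count"] += 1                      # counts real computations only
    return abs(len(sense1) - len(sense2))    # dummy score, not real WSD

# A repeated token in the sentence repeats the same sense pair; the cache
# computes it once and reuses the result afterwards.
pairs = [("bank.n.01", "money.n.01")] * 3 + [("bank.n.01", "river.n.01")]
scores = [cached_sim(a, b) for a, b in pairs]
print(calls["count"])  # 2 -- only two distinct pairs were computed
```

The same decorator can wrap `sim` directly, since NLTK Synset objects are hashable.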

1 Answer

Code Review user

Answered 2014-07-31 09:05:35

To improve performance, remove stopwords from the context sentence.

from nltk.corpus import stopwords

def not_stopword(x): return x not in stopwords.words('english')

words = filter(not_stopword, word_tokenize(context_sentence))
result[i] = sum(max([sim(i,k,option) for k in wn.synsets(j)]+[0]) \
    for j in words)
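One further micro-optimization: `stopwords.words('english')` returns a fresh list on every call, so the filter above re-reads it once per token and does a linear scan for membership. Building a set once gives O(1) lookups. A minimal sketch with a stand-in stopword list (the real one would be `set(stopwords.words('english'))`):

```python
# Stand-in for set(stopwords.words('english')); the real list is larger.
STOPWORDS = {"i", "the", "to", "my", "was", "of"}

def content_words(sentence):
    """Drop stopword tokens before the expensive synset loop."""
    return [w for w in sentence.lower().split() if w not in STOPWORDS]

print(content_words("I went to the bank to deposit my money"))
# ['went', 'bank', 'deposit', 'money']
```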

NLTK currently has the following English stopwords:

'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now'

Whether stopwords affect accuracy depends on the algorithm you are using. I suggest you test both the stopwords and the algorithms against a test set.
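A minimal harness for that comparison might look like the sketch below; the gold labels and the always-one-sense predictor are hypothetical stand-ins, not real WSD output:

```python
def accuracy(predict, test_set):
    """Fraction of (sentence, word, gold_label) items predicted correctly."""
    hits = sum(predict(sent, word) == gold for sent, word, gold in test_set)
    return hits / float(len(test_set))

# Hypothetical gold labels; a real set would use WordNet synset names and
# would compare max_similarity with and without stopword filtering.
test_set = [
    ("I went to the bank to deposit my money", "bank", "financial"),
    ("The river bank was full of dead fishes", "bank", "river"),
]

always_financial = lambda sent, word: "financial"   # stand-in predictor
print(accuracy(always_financial, test_set))  # 0.5
```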

Score: 1
Original content provided by Code Review users.
Original link:
https://codereview.stackexchange.com/questions/57903
