NLTK path_similarity raises ValueError

Stack Overflow user
Asked on 2020-04-24 18:20:24

I am currently working on a Coursera assignment with NLTK that computes the path similarity between two documents, and I am stuck.

import numpy as np
import nltk
from nltk.corpus import wordnet as wn
import pandas as pd

nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('punkt')

def convert_tag(tag):
    """Convert the tag given by nltk.pos_tag to the tag used by wordnet.synsets"""
    
    tag_dict = {'N': 'n', 'J': 'a', 'R': 'r', 'V': 'v'}
    try:
        return tag_dict[tag[0]]
    except KeyError:
        return None


def doc_to_synsets(doc):
    """
    Returns a list of synsets in document.

    Tokenizes and tags the words in the document doc.
    Then finds the first synset for each word/tag combination.
    If a synset is not found for that combination it is skipped.

    Args:
        doc: string to be converted

    Returns:
        list of synsets

    Example:
        doc_to_synsets('Fish are nvqjp friends.')
        Out: [Synset('fish.n.01'), Synset('be.v.01'), Synset('friend.n.01')]
    """
    tokens = nltk.word_tokenize(doc)
    tagged = nltk.pos_tag(tokens)

    # Map each Penn Treebank tag to a WordNet POS (None if there is no mapping)
    pairs = [(word, convert_tag(tag)) for word, tag in tagged]

    synsets = []
    for word, pos in pairs:
        found = wn.synsets(word, pos=pos)
        if found:  # skip word/tag combinations that have no synset
            synsets.append(found[0])
    return synsets


def similarity_score(s1, s2):
    """
    Calculate the normalized similarity score of s1 onto s2

    For each synset in s1, finds the synset in s2 with the largest similarity value.
    Sum of all of the largest similarity values and normalize this value by dividing it by the
    number of largest similarity values found.

    Args:
        s1, s2: list of synsets from doc_to_synsets

    Returns:
        normalized similarity score of s1 onto s2

    Example:
        synsets1 = doc_to_synsets('I like cats')
        synsets2 = doc_to_synsets('I like dogs')
        similarity_score(synsets1, synsets2)
        Out: 0.73333333333333339
    """
    maxscore=[]
    for x in s1:
        dis=[]
        for y in s2:
            dis.append(x.path_similarity(y))
            dis=[z for z in dis if z!=None]
        maxscore.append(max(dis))
    return sum(maxscore)/len(maxscore)
    
    
def document_path_similarity(doc1,doc2):
    """Finds the symmetrical similarity between doc1 and doc2"""

    synsets1 = doc_to_synsets(doc1)
    synsets2 = doc_to_synsets(doc2)

    return (similarity_score(synsets1, synsets2)+similarity_score(synsets2, synsets1))/ 2

# This is a test function to check whether the above function is correct or not
def test_document_path_similarity():
    doc1 = 'This is a function to test document_path_similarity.'
    doc2 = 'Use this function to see if your code in doc_to_synsets \
    and similarity_score is correct!'
    return document_path_similarity(doc1, doc2)

My problem is that this test function does not return a float; instead it raises this ValueError:

ValueError                                Traceback (most recent call last)
<ipython-input-61-6c20d7dcffc6> in <module>()
----> 1 test_document_path_similarity()

<ipython-input-60-9639d05f12da> in test_document_path_similarity()
      2     doc1 = 'This is a function to test document_path_similarity.'
      3     doc2 = 'Use this function to see if your code in doc_to_synsets     and similarity_score is correct!'
----> 4     return document_path_similarity(doc1, doc2)

<ipython-input-59-849dd19f38dc> in document_path_similarity(doc1, doc2)
     89     synsets2 = doc_to_synsets(doc2)
     90 
---> 91     return (similarity_score(synsets1, synsets2)+similarity_score(synsets2, synsets1))/ 2

<ipython-input-59-849dd19f38dc> in similarity_score(s1, s2)
     79             dis.append(x.path_similarity(y))
     80             dis=[z for z in dis if z!=None]
---> 81         maxscore.append(max(dis))
     82     return sum(maxscore)/len(maxscore)
     83 

ValueError: max() arg is an empty sequence

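This basically happens because the path similarity between synsets2-1 and every synset in synsets1 comes back as None. According to the instructions that should not happen. I have spent a lot of time on this but cannot figure out how to avoid it and get a float.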
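The failure can be reproduced without WordNet. The following is a minimal sketch, assuming every path_similarity call for one synset returned None; the `scores` list is made up for illustration and does not come from the question.

```python
# Hypothetical results of x.path_similarity(y) for one x against all y in s2.
scores = [None, None, None]

# Same filtering step as in similarity_score: drop the None entries.
filtered = [s for s in scores if s is not None]

# With nothing left, max() has no element to return and raises ValueError.
try:
    best = max(filtered)
except ValueError as exc:
    best = None
    error_message = str(exc)

# max() accepts a `default` keyword that is returned for an empty iterable.
safe_best = max(filtered, default=None)
```

Here `filtered` ends up empty, so the plain `max(filtered)` call fails in the same way as in the traceback, while the `default=None` form returns None and lets the caller decide what to do with a synset that has no comparable partner.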

Since the instructor may take several days to respond, I came here for help; please look into this if you can. Edit: these are synsets2 and synsets1.

synsets2=[Synset('use.v.01'),
  Synset('function.n.01'),
  Synset('see.v.01'),
  Synset('code.n.01'),
  Synset('inch.n.01'),
  Synset('be.v.01'),
  Synset('correct.a.01')]


synsets1=[Synset('be.v.01'),
  Synset('angstrom.n.01'),
  Synset('function.n.01'),
  Synset('test.v.01')]

1 Answer

Stack Overflow user
Answered on 2020-12-17 06:08:54

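Your code looks fine; you just need to add one statement, if(dis):, so that maxscore.append(max(dis)) runs only when dis contains at least one value. Otherwise that synset is left out of the score.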

So the part of the code you need to change will look like this:

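maxscore=[]
for x in s1:
    dis=[]
    for y in s2:
        dis.append(x.path_similarity(y))
        # Drop pairs for which WordNet reports no path (None)
        dis=[z for z in dis if z is not None]
    # Only record a maximum when at least one score survived the filter
    if(dis):
        maxscore.append(max(dis))
return sum(maxscore)/len(maxscore)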
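For reference, here is one way the whole function could look with the guard in place. This is a sketch, not the assignment's reference solution. It makes two assumptions that are not in the original thread: it returns 0.0 when no pair at all is comparable (the snippet above would divide by zero in that case), and it uses a hypothetical FakeSynset class so the example runs without WordNet data.

```python
def similarity_score(s1, s2):
    """Average, over s1, of the best path similarity found in s2."""
    maxscore = []
    for x in s1:
        # Keep only real scores; path_similarity may return None.
        dis = [score for score in (x.path_similarity(y) for y in s2)
               if score is not None]
        # Same effect as the `if(dis):` guard: skip x when dis is empty.
        best = max(dis, default=None)
        if best is not None:
            maxscore.append(best)
    # Assumption: fall back to 0.0 when nothing was comparable.
    if not maxscore:
        return 0.0
    return sum(maxscore) / len(maxscore)


class FakeSynset:
    """Hypothetical stand-in for an NLTK Synset, used only for this demo."""

    def __init__(self, name, scores):
        self.name = name
        self.scores = scores  # maps other synset names to a similarity

    def path_similarity(self, other):
        # Returns None for unknown pairs, as NLTK does when no path exists.
        return self.scores.get(other.name)


a = FakeSynset("a", {"c": 0.5, "d": 0.25})
b = FakeSynset("b", {})  # comparable with nothing
c = FakeSynset("c", {})
d = FakeSynset("d", {})

result = similarity_score([a, b], [c, d])
```

In this demo only `a` contributes a maximum, so `b` is skipped in the same way the answer's guard skips a synset whose `dis` list is empty.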

Original content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/61406125
