ValueError with multilingual pretrained wiki word vectors

Stack Overflow user

Asked 2021-08-10 15:30:45

Answers: 1 | Views: 47 | Followers: 0 | Votes: 0

I am trying to use the multilingual pretrained wiki word vectors from FastText (https://fasttext.cc/docs/en/pretrained-vectors.html).

I scraped the vectors from the website as follows:

import requests

# link to vector file for German
url = 'https://dl.fbaipublicfiles.com/fasttext/vectors-aligned/wiki.de.align.vec'
r = requests.get(url, stream = True)

if r.encoding is None:
    r.encoding = 'utf-8'

with open('/Users/LNV/OneDrive/Desktop/Jupiter_Notebook/Intro to ML/vector-biases/data/extract_DE.txt', 'w', encoding="utf-8") as fp:
    for line_num, vector in enumerate(r.iter_lines(decode_unicode = True)):
        fp.write(vector)
        fp.write('\n')
        # first 20,000 words
        if line_num == 20_001:
            break

and removed the first line:

deu_input = open('/Users/LNV/OneDrive/Desktop/Jupiter_Notebook/Intro to ML/vector-biases/data/extract_DE.txt', 'r', encoding="utf-8").readlines()
with open('/Users/LNV/OneDrive/Desktop/Jupiter_Notebook/Intro to ML/vector-biases/data/extract_DE_nofirstline.txt', 'w', encoding="utf-8") as deu_output:
    for index, line in enumerate(deu_input):
        if index != 0:
            deu_output.write(line)

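What I am doing works fine for some languages, or up to a certain number of vectors, but for other languages, or beyond a certain number of elements, I get the following error: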

Traceback (most recent call last):
  File "explorer_ES.py", line 22, in <module>
    ns = neighbours(vectors,w,20)  # neighbours is what I imported from utils, w is the word I entered, and I get 20 examples of nearest neighbours
  File "/mnt/c/Users/LNV/OneDrive/Desktop/Jupiter_Notebook/Intro to ML/vector-biases/utils.py", line 31, in neighbours
    cos = cosine_similarity(dm, w, k)
  File "/mnt/c/Users/LNV/OneDrive/Desktop/Jupiter_Notebook/Intro to ML/vector-biases/utils.py", line 21, in cosine_similarity
    num = np.dot(dm[w1],dm[w2])
  File "<__array_function__ internals>", line 5, in dot
ValueError: shapes (300,) and (299,) not aligned: 300 (dim 0) != 299 (dim 0)

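For example, I get this error when I run the code below on the German file I scraped earlier (with the first line removed). I get the same error for some other languages, but not for others.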

from utils import readDM, cosine_similarity, neighbours  
import sys

fasttext_vecs="./data/extract_DE_nofirstline.txt"  
print("Reading vectors...")
vectors = readDM(fasttext_vecs)


f = ""

while f != 'q':
    f = input("\nWhat would you like to do? (n = nearest neighbours, s=similarity, q=quit) ")

    while f == 'n':
        w = input("Enter a word or 'x' to exit nearest neighbours: ")

        if w == 'x':
            f = 'x'
        else:
            ns = neighbours(vectors,w,20)  # neighbours is what I imported from utils, w is the word I entered, and I get 20 examples of nearest neighbours
            print(ns)

    while f == 's':
        w = input("Input two words separated by a space or 'x' to exit similarity: ")
        
        if w == 'x':
            f = 'x'
        else:
            w1,w2 = w.split()   # splits a string into a list
            if w1 in vectors and w2 in vectors:
                sim = cosine_similarity(vectors,w1,w2)
                print("SIM",w1,w2,sim)
            else:
                print("Word(s) not found in space.")
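
The traceback above says one vector has 299 values where 300 are expected, which suggests that at least one row in the extracted file is malformed. A minimal sketch for locating such rows follows. It assumes each row has the form `word v1 v2 ... vN` separated by single spaces, and that the header line has already been removed. The helper name `find_bad_rows` is hypothetical and is not part of `utils.py`.

```python
def find_bad_rows(path, dim):
    """Return (line_number, first_field, n_values) for every row whose
    number of values after the first field differs from `dim`."""
    bad = []
    with open(path, encoding="utf-8") as fp:
        for line_no, line in enumerate(fp, start=1):
            # Drop the newline and any trailing space, then split on single spaces.
            fields = line.rstrip("\n").rstrip(" ").split(" ")
            n_values = len(fields) - 1  # everything after the word itself
            if n_values != dim:
                bad.append((line_no, fields[0], n_values))
    return bad
```

Calling `find_bad_rows("./data/extract_DE_nofirstline.txt", 300)` would list the offending rows, if any. Looking at those rows directly should show whether a word was cut in two, contains a space-like character, or lost part of its vector.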

Tags: vector, multilingual, fasttext, python

1 Answer

Stack Overflow user

Answered 2021-08-11 17:38:00

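Since you are only using the plain-text full-word vectors, you can read them with an off-the-shelf library such as Gensim.

Its loading function has a `limit` option that reads only the first N vectors from the front of the file, which saves memory. That way you do not have to modify any files yourself, which carries a risk of encoding or rewriting problems.

For example: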

from gensim.models import KeyedVectors

# read 1st 20k word vectors
vecs_de_align = KeyedVectors.load_word2vec_format('wiki.de.align.vec', binary=False, limit=20000)

# get 20 nearest-neighbors of a word (most_similar returns 10 by default)
similars = vecs_de_align.most_similar('Apfel', topn=20)
print(similars)
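
To illustrate one kind of rewriting risk, here is a small sketch under stated assumptions. Python's `str.splitlines()` breaks not only on `\n` but also on Unicode separators such as U+2028 and U+0085. To my understanding, `iter_lines(decode_unicode=True)` in requests splits decoded text this way, so a token containing such a character could end up spread over two rows. Whether the German file actually contains such a token among its first 20,000 rows is not confirmed here. The sample data and both helper names are hypothetical.

```python
def split_str_lines(raw: bytes) -> list:
    """Decode first, then use str.splitlines(), which also breaks on
    Unicode separators such as U+2028 and U+0085."""
    return raw.decode("utf-8").splitlines()


def split_newline_only(raw: bytes) -> list:
    """Split on the newline byte before decoding, so only real newlines end a row."""
    return [row.decode("utf-8") for row in raw.split(b"\n") if row]


# Hypothetical two-row sample; the first token contains U+2028.
sample = "foo\u2028bar 0.1 0.2\nbaz 0.3 0.4\n".encode("utf-8")

print(len(split_str_lines(sample)))     # 3 rows: the first token was cut in two
print(len(split_newline_only(sample)))  # 2 rows: as written in the file
```

Loading the unmodified `.vec` file with Gensim, as in the answer above, avoids the hand-written download-and-rewrite step altogether.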

Original content from Stack Overflow.

Original link: https://stackoverflow.com/questions/68729653
