I am doing some behavior analysis where I track behaviors over time and then create n-grams of those behaviors.
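
sample_n_gram_list = [['scratch', 'scratch', 'scratch', 'scratch', 'scratch'],
                      ['scratch', 'scratch', 'scratch', 'scratch', 'smell/sniff'],
                      ['scratch', 'scratch', 'scratch', 'sit', 'stand']]

I want to be able to cluster these n-grams, but I need to create a precomputed distance matrix using a custom metric. My metric appears to work fine, but when I try to create the distance matrix using the sklearn function, I get an error:
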
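ValueError: could not convert string to float: 'scratch'

I've looked at the distances.html documentation, and it isn't particularly clear on this topic.
Does anyone know how to use this properly?
The full code is below:
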
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.mlab as mlab
import math
import hashlib
import networkx as nx
import itertools
import hdbscan
from sklearn.metrics.pairwise import pairwise_distances

def get_levenshtein_distance(path1, path2):
    """
    Compute the Levenshtein (edit) distance between two sequences.
    https://en.wikipedia.org/wiki/Levenshtein_distance
    :param path1: first sequence
    :param path2: second sequence
    :return: minimum number of insertions, deletions and substitutions
    """
    matrix = [[0 for x in range(len(path2) + 1)] for x in range(len(path1) + 1)]
    for x in range(len(path1) + 1):
        matrix[x][0] = x
    for y in range(len(path2) + 1):
        matrix[0][y] = y
    for x in range(1, len(path1) + 1):
        for y in range(1, len(path2) + 1):
            if path1[x - 1] == path2[y - 1]:
                matrix[x][y] = min(
                    matrix[x - 1][y] + 1,
                    matrix[x - 1][y - 1],
                    matrix[x][y - 1] + 1
                )
            else:
                matrix[x][y] = min(
                    matrix[x - 1][y] + 1,
                    matrix[x - 1][y - 1] + 1,
                    matrix[x][y - 1] + 1
                )
    return matrix[len(path1)][len(path2)]

sample_n_gram_list = [['scratch', 'scratch', 'scratch', 'scratch', 'scratch'],
['scratch', 'scratch', 'scratch', 'scratch', 'smell/sniff'],
['scratch', 'scratch', 'scratch', 'sit', 'stand']]
print("should be 0")
print(get_levenshtein_distance(sample_n_gram_list[1],sample_n_gram_list[1]))
print("should be 1")
print(get_levenshtein_distance(sample_n_gram_list[1],sample_n_gram_list[0]))
print("should be 2")
print(get_levenshtein_distance(sample_n_gram_list[0],sample_n_gram_list[2]))
clust_number = 2
distance_matrix = pairwise_distances(sample_n_gram_list, metric=get_levenshtein_distance)
clusterer = hdbscan.HDBSCAN(metric='precomputed')
clusterer.fit(distance_matrix)
clusterer.labels_

Posted on 2018-12-17 09:21:14

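This happens because pairwise_distances in sklearn is designed to work on numerical arrays (so that all the different built-in distance functions can work properly), but you are passing it a list of strings. If you convert the strings to numbers (encoding each string as a specific number) and pass that instead, it will work.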
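
To make the encoding idea concrete, here is a minimal library-free sketch. The helper names `build_label_codes` and `encode_n_grams` are hypothetical and not part of sklearn; only `sample_n_gram_list` comes from the question.

```python
# Sample data copied from the question.
sample_n_gram_list = [
    ['scratch', 'scratch', 'scratch', 'scratch', 'scratch'],
    ['scratch', 'scratch', 'scratch', 'scratch', 'smell/sniff'],
    ['scratch', 'scratch', 'scratch', 'sit', 'stand'],
]


def build_label_codes(n_grams):
    """Map each distinct behavior label to an integer code (hypothetical helper).

    Labels are sorted first so the codes are reproducible between runs.
    """
    labels = sorted({label for n_gram in n_grams for label in n_gram})
    return {label: code for code, label in enumerate(labels)}


def encode_n_grams(n_grams, codes):
    """Replace every label with its integer code, keeping the row structure."""
    return [[codes[label] for label in n_gram] for n_gram in n_grams]


label_codes = build_label_codes(sample_n_gram_list)
encoded = encode_n_grams(sample_n_gram_list, label_codes)
print(label_codes)
print(encoded)
```

Because the mapping is one-to-one, two codes are equal exactly when the underlying labels are equal, so an edit distance computed on the encoded rows should match the one computed on the original strings.
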
A quick and dirty way to do the encoding is:

# Get all the unique strings in the input data
uniques = np.unique(sample_n_gram_list)
# Output:
# array(['scratch', 'sit', 'smell/sniff', 'stand'])

# Encode the strings to numbers according to the indices in "uniques" array
X = np.searchsorted(uniques, sample_n_gram_list)
# Output:
# array([[0, 0, 0, 0, 0],    <= scratch is assigned 0, sit = 1 and so on
#        [0, 0, 0, 0, 2],
#        [0, 0, 0, 1, 3]])

# Now this works
distance_matrix = pairwise_distances(X, metric=get_levenshtein_distance)
# Output
# array([[0., 1., 2.],
#        [1., 0., 2.],
#        [2., 2., 0.]])

https://stackoverflow.com/questions/53808957
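
The encoding approach relies on the n-grams forming a rectangular array, which holds here because every n-gram has the same length. If that ever stops being true, one alternative (a different technique from the answer above, not something sklearn provides) is to skip pairwise_distances and fill the matrix directly. The sketch below uses only the standard library; `levenshtein` and `precomputed_distances` are hypothetical names, and `levenshtein` is a compact two-row rewrite of the question's function rather than the original code.

```python
from itertools import combinations


def levenshtein(seq_a, seq_b):
    """Edit distance between two sequences, keeping only two rows in memory."""
    previous = list(range(len(seq_b) + 1))
    for i, item_a in enumerate(seq_a, start=1):
        current = [i]
        for j, item_b in enumerate(seq_b, start=1):
            # Substitution costs 1 only when the two items differ.
            substitution = previous[j - 1] + (item_a != item_b)
            current.append(min(previous[j] + 1,      # deletion
                               current[j - 1] + 1,   # insertion
                               substitution))
        previous = current
    return previous[-1]


def precomputed_distances(items, metric):
    """Build a symmetric distance matrix as a list of lists of floats."""
    size = len(items)
    matrix = [[0.0] * size for _ in range(size)]
    # The metric is symmetric, so each unordered pair is evaluated once.
    for i, j in combinations(range(size), 2):
        matrix[i][j] = matrix[j][i] = float(metric(items[i], items[j]))
    return matrix


# Sample data copied from the question.
sample_n_gram_list = [
    ['scratch', 'scratch', 'scratch', 'scratch', 'scratch'],
    ['scratch', 'scratch', 'scratch', 'scratch', 'smell/sniff'],
    ['scratch', 'scratch', 'scratch', 'sit', 'stand'],
]
distance_rows = precomputed_distances(sample_n_gram_list, levenshtein)
print(distance_rows)
```

Converting the result with something like np.asarray(distance_rows, dtype=float) should give an array suitable for hdbscan.HDBSCAN(metric='precomputed') as used in the question, though that last step is an assumption and is not exercised here.
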