首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >Python中更快的kNN算法

Python中更快的kNN算法
EN

Stack Overflow用户
提问于 2018-08-05 02:39:24
回答 2查看 11.1K关注 0票数 9

我想从头开始编写我自己的kNN算法,原因是我需要对特征进行加权。问题是,尽管删除了for循环并使用了内置的numpy功能,我的程序仍然非常慢。

有没有人能建议一种加速这个过程的方法?我没有使用np.sqrt作为L2距离,因为它是不必要的,而且实际上会使整个过程变慢很多。

代码语言:javascript
复制
class GlobalWeightedKNN:
    """
    A k-NN classifier with feature weights

    Returns: predictions of k-NN.
    """

    def __init__(self):
        self.X_train = None
        self.y_train = None
        self.k = None
        self.weights = None
        self.predictions = list()

    def fit(self, X_train, y_train, k, weights):        
        self.X_train = X_train
        self.y_train = y_train
        self.k = k
        self.weights = weights

    def predict(self, testing_data):
        """
        Takes a 2d array of query cases.

        Returns a list of predictions for k-NN classifier
        """

        np.fromiter((self.__helper(qc) for qc in testing_data), float)  
        return self.predictions


    def __helper(self, qc):
        neighbours = np.fromiter((self.__weighted_euclidean(qc, x) for x in self.X_train), float)
        neighbours = np.array([neighbours]).T 
        indexes = np.array([range(len(self.X_train))]).T
        neighbours = np.append(indexes, neighbours, axis=1)

        # Sort by second column - distances
        neighbours = neighbours[neighbours[:,1].argsort()]  
        k_cases = neighbours[ :self.k]
        indexes = [x[0] for x in k_cases]

        y_answers = [self.y_train[int(x)] for x in indexes]
        answer = max(set(y_answers), key=y_answers.count)  # get most common value
        self.predictions.append(answer)


    def __weighted_euclidean(self, qc, other):
        """
        Custom weighted euclidean distance

        returns: floating point number
        """

        return np.sum( ((qc - other)**2) * self.weights )
EN

回答 2

Stack Overflow用户

发布于 2018-08-05 02:54:23

Scikit-learn使用KD树或球树在O[N log(N)]时间内计算最近邻居。您的算法是一种直接的方法,需要O[N^2]时间,并且还在Python生成器表达式中使用嵌套的for循环,与优化的代码相比,这将增加大量的计算开销。

如果您想要使用快速O[N log(N)]实现来计算加权k邻居分类,您可以将sklearn.neighbors.KNeighborsClassifier与加权minkowski度量一起使用,设置p=2 (对于欧几里德距离)并将w设置为您想要的权重。例如:

代码语言:javascript
复制
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(metric='wminkowski', p=2,
                             metric_params=dict(w=weights))
model.fit(X_train, y_train)
y_predicted = model.predict(X_test)
票数 13
EN

Stack Overflow用户

发布于 2021-01-31 18:51:50

修改您的类并使用BallTree数据结构(使用build time O(n.(log n)^2),请参阅https://arxiv.org/ftp/arxiv/papers/1210/1210.6122.pdf)和自定义DistanceMetric (因为KDTree不支持度量参数中的可调用函数,这里作为注释:https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.BallTree.html),您也可以使用以下代码(还可以删除预测的循环):

代码语言:javascript
复制
from sklearn.neighbors import BallTree
from sklearn.neighbors import DistanceMetric
from scipy.stats import mode

class GlobalWeightedKNN:
    """
    A k-NN classifier with feature weights

    Returns: predictions of k-NN.
    """

    def __init__(self):
        self.X_train = None
        self.y_train = None
        self.k = None
        self.weights = None
        self.tree = None
        self.predictions = list()

    def fit(self, X_train, y_train, k, weights):        
        self.X_train = X_train
        self.y_train = y_train
        self.k = k
        self.weights = weights
        self.tree = BallTree(X_train, \
                             metric=DistanceMetric.get_metric('wminkowski', p=2, w=weights))

    def predict(self, testing_data):
        """
        Takes a 2d array of query cases.

        Returns a list of predictions for k-NN classifier
        """
        indexes = self.tree.query(testing_data, self.k, return_distance=False)
        y_answers = self.y_train[indexes]
        self.predictions = np.apply_along_axis(lambda x: mode(x)[0], 1, y_answers)
        return self.predictions

培训:

代码语言:javascript
复制
from time import time
n, d = 10000, 2
begin = time()
cls = GlobalWeightedKNN()
X_train = np.random.rand(n,d)
y_train = np.random.choice(2,n, replace=True)
cls.fit(X_train, y_train, k=3, weights=np.random.rand(d))
end = time()
print('time taken to train {} instances = {} s'.format(n, end - begin))
# time taken to train 10000 instances = 0.01998615264892578 s

测试/预测:

代码语言:javascript
复制
begin = time()
X_test = np.random.rand(n,d)
cls.predict(X_test)
end = time()
print('time taken to predict {} instances  = {} s'.format(n, end - begin))
#time taken to predict 10000 instances  = 3.732935905456543 s
票数 1
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/51688568

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档