Q: Optimizing code that mimics the kNN learning algorithm

Stack Overflow user

Asked 2017-11-17 06:57:17

1 answer, 218 views, 0 followers, 0 votes

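I have written a script that performs kNN classification using a homemade function. I compared its performance with a similar script that uses the sklearn package instead.

Result: homemade ~20 seconds, sklearn ~2 seconds

So now I am wondering whether the performance difference is mainly because sklearn runs at a lower level (as far as I understand), or because my script is inefficient.

If any of you have references on writing efficient Python scripts and programs, I am all ears.

Here is the data file: DataFile

In both scripts, the filename, os.environ['R_HOME'] and os.environ['R_USER'] must be adapted to your own directory structure.

My code using the homemade kNN classification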

#Start Timer
import time
tic = time.time() 

# Begin Script
import os
os.environ['R_HOME'] = r'C:\Users\MyUser\Documents\R\R-3.4.1' #setting temporary PATH variables : R_HOME
                                                                    #a permanent solution could be achieved but more complicated
os.environ['R_USER'] = r'C:\Users\MyUser\AppData\Local\Programs\Python\Python36\Lib\site-packages\rpy2'
                                                                    #same story
import rpy2.robjects as robjects
import numpy as np
import matplotlib.pyplot as plt

## Read R data from ESLII book
dir = os.path.dirname(__file__)
filename = os.path.join(dir, '../ESL.mixture.rda')
robjects.r['load'](filename) #load rda file in R workspace
rObject = robjects.r['ESL.mixture'] #read variable in R workspace and save it into python workspace

#Extract Blue and Orange classes data
classes = np.array(rObject[0]) #note that information about rObject is obtained by outputting the object to the console
                                #numpy is able to convert R data natively
BLUE = classes[0:100,:]
BLUE = np.concatenate((BLUE,np.zeros(np.size(BLUE,axis=0))[:,None]),axis=1) 
        #the [:,None] is necessary to make the 1D array 2D. 
        #Indeed concatenate requires identical dimensions
        #other functions exist such as np.column_stack but they take more time to execute than basic concatenate
ORANGE = classes[100:200]
ORANGE = np.concatenate((ORANGE,np.ones(np.size(ORANGE,axis=0))[:,None]),axis=1)
trainingSet = np.concatenate((BLUE,ORANGE),axis=0)

##create meshgrid
minBound = -3
maxBound = 4.5
xmesh = np.linspace(minBound, maxBound, 100)
ymesh = np.linspace(minBound, maxBound, 100)
xv, yv = np.meshgrid(xmesh, ymesh)
gridSet =np.stack((xv.ravel(),yv.ravel())).T

def predict(trainingSet, queryPoint, k):
    # create list for distances and targets
    distances = []
        # compute euclidean distance
    for i in range (np.size(trainingSet,0)):
        distances.append(np.sqrt(np.sum(np.square(trainingSet[i,:-1]-queryPoint))))
    #find k nearest neighbors to the query point and compute its outcome
    distances=np.array(distances)
    indices = np.argsort(distances) #provides indices, sorted from short to long distances
    kindices = indices[0:k]
    kNN = trainingSet[kindices,:]
    queryOutput = np.average(kNN[:,2])
    return queryOutput

k = 1
gridSet = np.concatenate((gridSet,np.zeros(np.size(gridSet,axis=0))[:,None]),axis=1)
i=0
for point in gridSet[:,:-1]:
    gridSet[i,2] = predict(trainingSet, point, k)
    i+=1


#k = 1
#test = predict(trainingSet, np.array([4.0, 1.2]), k)

col = np.where(gridSet[:,2]<0.5,'b','r').flatten() #flatten is necessary. 2D arrays are only accepted with RGBA colors
plt.scatter(gridSet[:,0],gridSet[:,1],c=col,s=0.2)
col = np.where(trainingSet[:,2]<0.5,'b','r').flatten() #flatten is necessary. 2D arrays are only accepted with RGBA colors
plt.scatter(trainingSet[:,0],trainingSet[:,1],c=col,s=1.0)
plt.contour(xv,yv,gridSet[:,2].reshape(xv.shape),levels=[0.5]) #decision boundary at 0.5; levels must be a sequence
plt.savefig('kNN_homeMade.png', dpi=600)
plt.show()
#
#Stop timer
toc = time.time()
print(toc-tic, 'sec Elapsed')

My code using sklearn

#Start Timer
import time
tic = time.time() 

# Begin Script
import os
os.environ['R_HOME'] = r'C:\Users\MyUser\Documents\R\R-3.4.1' #setting temporary PATH variables : R_HOME
                                                                    #a permanent solution could be achieved but more complicated
os.environ['R_USER'] = r'C:\Users\MyUser\AppData\Local\Programs\Python\Python36\Lib\site-packages\rpy2'
                                                                    #same story
import rpy2.robjects as robjects
import numpy as np
import matplotlib.pyplot as plt
from sklearn import neighbors

## Read R data from ESLII book
dir = os.path.dirname(__file__)
filename = os.path.join(dir, '../ESL.mixture.rda')
robjects.r['load'](filename) #load rda file in R workspace
rObject = robjects.r['ESL.mixture'] #read variable in R workspace and save it into python workspace

#Extract Blue and Orange classes data
classes = np.array(rObject[0]) #note that information about rObject is obtained by outputting the object to the console
                                #numpy is able to convert R data natively
BLUE = classes[0:100,:]
BLUE = np.concatenate((BLUE,np.zeros(np.size(BLUE,axis=0))[:,None]),axis=1) 
        #the [:,None] is necessary to make the 1D array 2D. 
        #Indeed concatenate requires identical dimensions
        #other functions exist such as np.column_stack but they take more time to execute than basic concatenate
ORANGE = classes[100:200]
ORANGE = np.concatenate((ORANGE,np.ones(np.size(ORANGE,axis=0))[:,None]),axis=1)
trainingSet = np.concatenate((BLUE,ORANGE),axis=0)

##create meshgrid
minBound = -3
maxBound = 4.5
xmesh = np.linspace(minBound, maxBound, 100)
ymesh = np.linspace(minBound, maxBound, 100)
xv, yv = np.meshgrid(xmesh, ymesh)
gridSet =np.stack((xv.ravel(),yv.ravel())).T
gridSet = np.concatenate((gridSet,np.zeros(np.size(gridSet,axis=0))[:,None]),axis=1)

##classify using kNN
k = 1
clf = neighbors.KNeighborsClassifier(k, weights='uniform',algorithm='brute')
clf.fit(trainingSet[:,:-1],trainingSet[:,-1:].ravel()) #learn, ravel necessary to obtain (n,) shape instead of a vector (n,1)
gridSet[:,2]  = clf.predict(np.c_[xv.ravel(), yv.ravel()])

#Plot
col = np.where(gridSet[:,2]<0.5,'b','r').flatten() #flatten is necessary. 2D arrays are only accepted with RGBA colors
plt.scatter(gridSet[:,0],gridSet[:,1],c=col,s=0.2)
col = np.where(trainingSet[:,2]<0.5,'b','r').flatten() #flatten is necessary. 2D arrays are only accepted with RGBA colors
plt.scatter(trainingSet[:,0],trainingSet[:,1],c=col,s=1.0)
plt.contour(xv,yv,gridSet[:,2].reshape(xv.shape),levels=[0.5]) #decision boundary at 0.5; levels must be a sequence
plt.savefig('kNN_sciKit.png', dpi=600)
plt.show()
#
#Stop timer
toc = time.time()
print(toc-tic, 'sec Elapsed')

python

scikit-learn

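1 Answer

Stack Overflow user

Accepted answer

Posted 2017-11-19 18:46:25

Following andrew_reece's suggestion and after profiling my code, I was able to reduce the computation time to about 2 seconds (down from 20 seconds).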
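
For readers who want to reproduce the profiling step, here is a minimal sketch using the standard-library cProfile module. The answer does not say which profiler was used, so this is only one possible approach. The names predict_loop and classify_grid and the synthetic data are assumptions made for illustration; they stand in for the original predict function and the ESL.mixture data.

```python
import cProfile
import io
import pstats

import numpy as np


def predict_loop(training_set, query_point, k):
    """Loop-based kNN prediction, mirroring the structure of the original predict()."""
    distances = []
    for i in range(training_set.shape[0]):
        distances.append(np.sqrt(np.sum(np.square(training_set[i, :-1] - query_point))))
    nearest = np.argsort(np.array(distances))[:k]
    return np.average(training_set[nearest, -1])


def classify_grid(training_set, grid_points, k):
    """Call predict_loop once per grid point, as the original outer loop does."""
    return np.array([predict_loop(training_set, p, k) for p in grid_points])


# Synthetic stand-in data (assumption): 200 labelled 2-D points, 400 query points.
rng = np.random.default_rng(0)
training_set = np.column_stack((rng.normal(size=(200, 2)), np.repeat([0.0, 1.0], 100)))
grid_points = rng.uniform(-3, 4.5, size=(400, 2))

profiler = cProfile.Profile()
profiler.enable()
result = classify_grid(training_set, grid_points, k=1)
profiler.disable()

# Collect the report in a string, sorted by cumulative time per function.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(10)
print(report.getvalue())
```

In a report like this, the call counts are usually the telling figure: the NumPy calls inside the inner loop run once per (query point, training point) pair, so per-call overhead dominates the arithmetic.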

The culprits were the two for loops in the following code:

def predict(trainingSet, queryPoint, k):
    # create list for distances and targets
    distances = []
    #targets = []

        # compute euclidean distance
    for i in range (np.size(trainingSet,0)):
        distances.append(np.sqrt(np.sum(np.square(trainingSet[i,:-1]-queryPoint))))
    #find k nearest neighbors to the query point and compute its outcome
    distances=np.array(distances)
    indices = np.argsort(distances) #provides indices, sorted from short to long distances
    kindices = indices[0:k]
    kNN = trainingSet[kindices,:]
    queryOutput = np.average(kNN[:,2])
    return queryOutput

k = 1
gridSet = np.concatenate((gridSet,np.zeros(np.size(gridSet,axis=0))[:,None]),axis=1)
i=0
for point in gridSet[:,:-1]:
    gridSet[i,2] = predict(trainingSet, point, k)
    i+=1
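
As an intermediate step, the inner loop alone can be removed with NumPy broadcasting while still handling one query point per call. The sketch below is not the answer's code: predict_one and the four-point training set are hypothetical, chosen so the result can be checked by hand.

```python
import numpy as np


def predict_one(training_set, query_point, k):
    """kNN prediction for a single query point without the inner Python loop."""
    # Broadcasting subtracts query_point from every training row at once.
    diffs = training_set[:, :-1] - query_point
    distances = np.sqrt(np.sum(diffs ** 2, axis=1))
    nearest = np.argsort(distances)[:k]
    # Average the labels (last column) of the k nearest training points.
    return np.average(training_set[nearest, -1])


# Tiny hand-made training set (assumption): columns are x, y, label.
training_set = np.array([
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [5.0, 5.0, 1.0],
    [6.0, 5.0, 1.0],
])

near_origin = predict_one(training_set, np.array([0.2, 0.1]), k=1)
near_far_cluster = predict_one(training_set, np.array([5.4, 5.0]), k=3)
print(near_origin, near_far_cluster)
```

With k=3 the second query picks up two points labelled 1 and one labelled 0, so the averaged output is 2/3. The outer loop over query points remains in this variant.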

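While reading the sklearn classes in the neighbors package, I noticed that they compute the Euclidean distances in a fully vectorized way. I have read that code and now understand the function, but I was too lazy to rewrite it completely. Instead, I simply imported the function euclidean_distances and used it directly on my data to demonstrate the improvement. The modified part is as follows: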

from sklearn.metrics.pairwise import euclidean_distances

def predict(trainingSet, queryPoints, k):
    # distances[i, j] is the distance between training point i and query point j
    distances = euclidean_distances(trainingSet[:,:-1],queryPoints[:,:-1])
    # row i holds the distances from training point i to every query point (in columns),
    # so the k neighbors of query j are the first k rows of column j after sorting
    indices = np.argsort(distances,axis=0) #indices sorted from short to long distances, per query point
    kNN = trainingSet[indices[:k,:],2] #labels of the k nearest neighbors, shape (k, nQueries)
    queryOutput = np.average(kNN,0) #average over the k neighbors
    return queryOutput

k = 1
gridSet = np.concatenate((gridSet,np.zeros(np.size(gridSet,axis=0))[:,None]),axis=1)
gridSet[:,2] = predict(trainingSet, gridSet, k)
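
For comparison, the same idea can be written with NumPy alone. This is a sketch of one possible full rewrite, not what sklearn does internally and not the answer's code. The function name predict_all is hypothetical, and unlike the predict above it expects query_points to hold only the coordinate columns. It also uses np.argpartition instead of a full sort, since an unweighted average only needs the k nearest points in any order.

```python
import numpy as np


def predict_all(training_set, query_points, k):
    """Average label of the k nearest training points for every query point."""
    features = training_set[:, :-1]            # shape (n_train, n_features)
    labels = training_set[:, -1]               # shape (n_train,)
    # Pairwise squared distances via broadcasting: shape (n_query, n_train).
    diffs = query_points[:, None, :] - features[None, :, :]
    sq_dist = np.sum(diffs ** 2, axis=2)
    # argpartition places the k smallest distances in the first k columns
    # (in arbitrary order), which is enough for an unweighted average.
    nearest = np.argpartition(sq_dist, k - 1, axis=1)[:, :k]
    return labels[nearest].mean(axis=1)


# Small illustrative data set (assumption): columns are x, y, label.
training_set = np.array([
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [5.0, 5.0, 1.0],
    [6.0, 5.0, 1.0],
])
query_points = np.array([[0.2, 0.1], [5.4, 5.0]])
print(predict_all(training_set, query_points, k=1))
```

The square root is skipped because it does not change the ordering of distances. The intermediate diffs array has shape (n_query, n_train, n_features), so memory use grows with the product of the two set sizes; for very large grids the queries could be processed in chunks.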

0 votes

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/47344959
