This is my first attempt at writing some code in Python. I think it gives the right answer, but it could probably use some "vectorization".
import numpy as np
import math
import operator

data = np.genfromtxt("KNNdata.csv", delimiter = ',', skip_header = 1)
data = data[:,2:]
np.random.shuffle(data)
X = data[:, range(5)]
Y = data[:, 5]

def distance(instance1, instance2):
    dist = 0.0
    for i in range(len(instance1)):
        dist += pow((instance1[i] - instance2[i]), 2)
    return math.sqrt(dist)

# Calculating distances between all data, return sorted k-elements list (whole element and output)
def getNeighbors(trainingSetX, trainingSetY, testInstance, k):
    distances = []
    for i in range(len(trainingSetX)):
        dist = distance(testInstance, trainingSetX[i])
        distances.append((trainingSetX[i], dist, trainingSetY[i]))
    distances.sort(key=operator.itemgetter(1))
    neighbour = []
    for elem in range(k):
        neighbour.append((distances[elem][0], distances[elem][2]))
    return neighbour

#return answer
def getResponse(neighbors):
    classVotes = {}
    for x in range(len(neighbors)):
        response = int(neighbors[x][-1])
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse = True)
    return sortedVotes[0][0]

#return accuracy, your predictions and actual values
def getAccuracy(testSetY, predictions):
    correct = 0
    for x in range(len(predictions)):
        if testSetY[x] == predictions[x]:
            correct += 1
    return (correct / (len(predictions))) * 100.0

def start():
    trainingSetX = X[:2000]
    trainingSetY = Y[:2000]
    testSetX = X[2000:]
    testSetY = Y[2000:]
    # generate predictions
    predictions = []
    k = 4
    for x in range(len(testSetX)):
        neighbors = getNeighbors(trainingSetX, trainingSetY, testSetX[x], k)
        result = getResponse(neighbors)
        predictions.append(result)
    accuracy = getAccuracy(testSetY, predictions)
    print('Accuracy: ' + str(accuracy))

start()

Answered 2017-02-07 11:42:19
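First, a style nitpick: Python has an official style guide, PEP8, which recommends lower_case_with_underscores for variable and function names rather than camelCase.
Second, the comments above your functions should become docstrings. These show up, for example, when you call help(your_function) in an interactive session. A docstring is simply a string placed as the first line below the function header, like so:

def f(a, b):
    """Returns the sum of `a` and `b`"""
    return a + b

It is recommended to always use triple double quotes (i.e. """).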
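To see what a docstring buys you, here is a minimal sketch. The function name add is just an illustration and not part of your code; inspect.getdoc is the standard-library helper that returns the cleaned-up docstring text.

```python
import inspect


def add(a, b):
    """Return the sum of `a` and `b`."""
    return a + b


# The docstring is stored on the function object itself.
print(add.__doc__)

# inspect.getdoc returns the same text with indentation cleaned up;
# help(add) displays it in an interactive session.
print(inspect.getdoc(add))
```

A comment placed above the function, by contrast, is discarded by the interpreter and never reaches help().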
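Now I will concentrate on the distance calculation.
First, the getNeighbors function can be simplified considerably with a generator expression and a list comprehension:

def getNeighbors(trainingSetX, trainingSetY, testInstance, k):
    distances = sorted((distance(testInstance, x), x, y)
                       for x, y in zip(trainingSetX, trainingSetY))
    return [(d[1], d[2]) for d in distances[:k]]

Here I used the fact that tuples already have a natural ordering: the first elements are compared first, then (if they are equal) the second, and so on. So I put the distance first in the tuple and you no longer need a key function. sorted can take a generator expression directly and sort it. We can also use zip to iterate over several iterables at the same time. One caveat: if two distances are exactly equal, the comparison falls through to the NumPy arrays and raises a ValueError, so keep key=operator.itemgetter(0) if ties are possible.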
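A small sketch of the tuple ordering described above, using made-up values rather than your data. It also shows what a tie on the first element does when the next element is a NumPy array.

```python
import numpy as np

# Tuples compare element by element, so sorting (distance, label) pairs
# orders them by distance first and by label only on ties.
pairs = [(2.5, "b"), (0.5, "c"), (2.5, "a")]
sorted_pairs = sorted(pairs)
print(sorted_pairs)  # [(0.5, 'c'), (2.5, 'a'), (2.5, 'b')]

# sorted() also accepts a generator expression, and zip() walks
# several iterables in lockstep.
dists = [3.0, 1.0, 2.0]
labels = ["x", "y", "z"]
by_distance = sorted((d, lab) for d, lab in zip(dists, labels))
print(by_distance)  # [(1.0, 'y'), (2.0, 'z'), (3.0, 'x')]

# With a NumPy array as the second element, a tie on the distance makes
# Python compare the arrays, and that comparison raises ValueError.
tied = [(1.0, np.array([0.0, 1.0])), (1.0, np.array([1.0, 0.0]))]
try:
    sorted(tied)
    tie_is_safe = True
except ValueError:
    tie_is_safe = False
print(tie_is_safe)  # False
```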
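Since your variables are all numpy arrays, more vectorization is possible as well. To that end, I would first redefine the distance function to use numpy operations:

def distance(x, y):
    return np.sqrt(((x - y)**2).sum())

Then put the distances into a numpy array as well. Picking the k nearest neighbours then becomes an indexing operation with np.argsort, and each returned row holds the features followed by the label:

def getNeighbors(trainingSetX, trainingSetY, testInstance, k):
    distances = np.array([distance(testInstance, x) for x in trainingSetX])
    nearest = np.argsort(distances)[:k]  # indices of the k smallest distances
    return np.column_stack((trainingSetX[nearest], trainingSetY[nearest]))

This could probably be improved further by vectorizing the distance call itself.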
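One way that last step might look is sketched below, assuming the training data is a 2-D float array with one sample per row. The function name get_neighbor_labels and the tiny data set are made up for illustration; the only numpy features used are broadcasting, sum(axis=1) and np.argsort.

```python
import numpy as np


def get_neighbor_labels(training_x, training_y, test_instance, k):
    """Return the labels of the k training rows closest to test_instance."""
    # Broadcasting subtracts test_instance from every row at once,
    # so no Python-level loop over the training set is needed.
    diffs = training_x - test_instance
    distances = np.sqrt((diffs ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    return training_y[nearest]


# Tiny made-up data set: two points near the origin, two near (5, 5).
training_x = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
training_y = np.array([0, 0, 1, 1])

near_far_corner = get_neighbor_labels(training_x, training_y, np.array([5.2, 5.0]), 2)
near_origin = get_neighbor_labels(training_x, training_y, np.array([0.1, 0.0]), 1)
print(near_far_corner, near_origin)
```

This version returns only the labels, because that is all the voting step needs. Since the square root does not change the ordering, it could also be dropped when only the ranking matters.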
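Your function getResponse can be simplified using the collections.Counter class, which implements exactly the counting you are doing by hand:

from collections import Counter

def getResponse(neighbors):
    classVotes = Counter(int(neighbor[-1]) for neighbor in neighbors)
    return classVotes.most_common(1)[0][0]

Your function getAccuracy can be simplified slightly using a generator expression and sum: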

def getAccuracy(testSetY, predictions):
    correct = sum(y == p for y, p in zip(testSetY, predictions))
    return correct * 100.0 / len(predictions)

Finally, in the start function you can iterate directly over the elements of testSetX, build the predictions with a list comprehension (a list rather than a generator expression, because getAccuracy calls len on it), and use the fact that print can take multiple arguments:

def start():
    trainingSetX = X[:2000]
    trainingSetY = Y[:2000]
    testSetX = X[2000:]
    testSetY = Y[2000:]
    # generate predictions
    k = 4
    predictions = [getResponse(getNeighbors(trainingSetX, trainingSetY, x, k))
                   for x in testSetX]
    accuracy = getAccuracy(testSetY, predictions)
    print('Accuracy:', accuracy)

https://codereview.stackexchange.com/questions/154609