I am working through Nvidia's "Fundamentals of Accelerated Computing with CUDA Python" course and have completed the task of refactoring a simple version of some code that does the work needed to create a hidden layer in a neural network:
import numpy as np
from numba import cuda, vectorize
n = 1000000
greyscales = np.floor(np.random.uniform(0, 255, n).astype(np.float32))
weights = np.random.normal(.5, .1, n).astype(np.float32)
from numpy import exp
def normalize(grayscales):
    return grayscales / 255

def weigh(values, weights):
    return values * weights

def activate(values):
    return ( exp(values) - exp(-values) ) / ( exp(values) + exp(-values) )

def create_hidden_layer(n, greyscales, weights, exp, normalize, weigh, activate):
    normalized = normalize(greyscales)
    weighted = weigh(normalized, weights)
    activated = activate(weighted)
    return activated
arguments = {"n": n,
             "greyscales": greyscales,
             "weights": weights,
             "exp": exp,
             "normalize": normalize,
             "weigh": weigh,
             "activate": activate}
a = create_hidden_layer(**arguments)
print(a)

I applied some transformations to the code; after the changes it looks like this:
from math import exp
@vectorize(['float32(float32)'], target='cuda')
def normalize(grayscales):
    return grayscales / 255

@vectorize(['float32(float32,float32)'], target='cuda')
def weigh(values, weights):
    return values * weights

@vectorize(['float32(float32)'], target='cuda')
def activate(values):
    return ( exp(values) - exp(-values) ) / ( exp(values) + exp(-values) )

def create_hidden_layer(n, greyscales, weights, exp, normalize, weigh, activate):
    normalized = normalize(greyscales)
    weighted = weigh(normalized, weights)
    activated = activate(weighted)
    return activated
greyscales = cuda.to_device(greyscales)
weights = cuda.to_device(weights)
normalized = cuda.device_array(shape=(n,), dtype=np.float32)
weighted = cuda.device_array(shape=(n,), dtype=np.float32)
activated = cuda.device_array(shape=(n,), dtype=np.float32)
activated = activated.copy_to_host()
arguments = {"n": n,
             "greyscales": greyscales,
             "weights": weights,
             "exp": exp,
             "normalize": normalize,
             "weigh": weigh,
             "activate": activate}
a = create_hidden_layer(**arguments)
print(a)

After all these transformations the code seems to work fine, but it still isn't fast enough. The assignment requires it to run in under 1 s, and mine takes 1.23 s.
Maybe someone can see how I could refactor my code further, or spot a silly mistake I made in it? I would really appreciate your help!
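One observation worth checking (not part of the course material): the activation `( exp(values) - exp(-values) ) / ( exp(values) + exp(-values) )` is exactly tanh(x), so the four transcendental calls per element can be replaced by a single `math.tanh` call, which numba's CUDA target also supports. A quick CPU check of the equivalence with plain NumPy:

```python
import numpy as np

# activate() as written in the question: four exp calls per element
def activate(values):
    return (np.exp(values) - np.exp(-values)) / (np.exp(values) + np.exp(-values))

x = np.random.normal(0.0, 1.0, 1000).astype(np.float32)

# tanh is mathematically identical and needs a single call per element
assert np.allclose(activate(x), np.tanh(x), atol=1e-6)
```

Inside the `@vectorize(..., target='cuda')` function the same change would be `return math.tanh(values)`.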
Posted on 2022-09-19 23:36:30
Here are a few things you can try to speed up the code:
- Compile a custom kernel with @cuda.jit instead of three separate @vectorize ufuncs. Inside the kernel, compute a 1D index with cuda.grid(1) (or from cuda.threadIdx.x, cuda.blockIdx.x and cuda.blockDim.x) so that each thread handles one element, and fuse normalize, weigh and activate into a single pass over the data instead of three kernel launches.
- If the threads of a block need to cooperate, stage data in a shared-memory array created with cuda.shared.array() and synchronize the block with cuda.syncthreads() before reading what other threads wrote. For a purely element-wise workload like this one, shared memory is not required.
- Keep the data on the GPU between steps: copy the inputs once with cuda.to_device(), allocate outputs on the device with cuda.device_array() or cuda.device_array_like(), and copy only the final result back with copy_to_host(). Note that in your code activated.copy_to_host() runs before create_hidden_layer() is ever called, so the preallocated device arrays are never actually used.
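The preallocated buffers from the question can be wired in because, as far as I know, numba's CUDA ufuncs accept an out= argument just like NumPy ufuncs, so each stage writes into an existing device array instead of allocating a new one. The pattern, sketched here with plain NumPy so it runs without a GPU (the normalized/weighted buffers stand in for the cuda.device_array allocations):

```python
import numpy as np

n = 1_000_000
greyscales = np.floor(np.random.uniform(0, 255, n)).astype(np.float32)
weights = np.random.normal(.5, .1, n).astype(np.float32)

# Preallocate once; with numba these would be cuda.device_array(...) buffers
normalized = np.empty(n, dtype=np.float32)
weighted = np.empty(n, dtype=np.float32)

# Writing into out= reuses the buffers instead of allocating per call;
# with @vectorize(target='cuda') ufuncs the calls look the same
np.divide(greyscales, 255, out=normalized)
np.multiply(normalized, weights, out=weighted)
activated = np.tanh(weighted)  # on the GPU, call copy_to_host() once here
```

With device arrays this keeps every intermediate on the GPU, so the only host/device transfers are the initial to_device() copies and the single copy_to_host() at the end.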
https://stackoverflow.com/questions/73778660