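I'm new to using OpenCL (with the OpenCL.NET library) in Visual Studio C#, and I'm currently working on an application that computes a large 3D matrix. At each pixel in the matrix, 192 unique values are computed and then summed to give the final value for that pixel. So, functionally, it is like a 4-D matrix, (161x161x161)x192.
Right now I'm calling the kernel from my host code like this: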
    //C# host code
    ...
    float[] BigMatrix = new float[161*161*161]; //1-D result array
    CLCalc.Program.Variable dev_BigMatrix = new CLCalc.Program.Variable(BigMatrix);
    CLCalc.Program.Variable dev_OtherArray = new CLCalc.Program.Variable(otherArray);
    //...load some other variables here too.
    CLCalc.Program.Variable[] args = new CLCalc.Program.Variable[7] { /* stuff... */ };
    //Here, I execute the kernel, with a 2-dimensional worker pool:
    BigMatrixCalc.Execute(args, new int[2]{N*N*N,192});
    dev_BigMatrix.ReadFromDeviceTo(BigMatrix);
Sample kernel code is posted below.
    __kernel void MyKernel(
        __global float * BigMatrix,
        __global float * otherArray
        //various other variables...
    )
    {
        int N = 161; //Size of matrix edges
        int pixel_id = get_global_id(0); //The location of the pixel in the 1D array
        int array_id = get_global_id(1); //The location within the otherArray
        //Finding the x,y,z values of the pixel_id.
        float3 p;
        p.x = pixel_id % N;
        p.y = ((pixel_id % (N*N))-p.x)/N;
        p.z = (pixel_id - p.x - p.y*N)/(N*N);
        float result;
        //...
        //Some long calculation for 'result' involving otherArray and p...
        //...
        BigMatrix[pixel_id] += result;
    }
My code currently works, but I'm looking for speed in this application, and I'm not sure whether my worker/group setup is the best approach (i.e. 161*161*161 and 192 for the dimensions of the worker pool).
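I've seen other examples that organize the global worker pool into local work groups to increase efficiency, but I'm not quite sure how to implement that in OpenCL.NET. I'm also not sure how this differs from just creating another dimension in the worker pool.
So, my question is: can I use local groups here, and if so, how would I organize them? In general, how does using local groups differ from just calling an n-dimensional worker pool? (i.e. calling Execute(args, new int[]{(N*N*N),192}), versus having a local work group size of 192?)
Thanks for the help!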
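Posted on 2012-05-02 23:11:07
I think a lot of performance is being lost while waiting on memory access. I have answered a similar SO question, and I hope that post helps you out. Please ask if you have any questions.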
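To make the memory-access cost concrete, here is a rough count of global-memory reads of otherArray, sketched in plain C rather than OpenCL. It assumes each of the 192 terms needs one read of otherArray per pixel (the actual calculation is elided in the question, so this is only an assumption), and compares that with a scheme in which each work group copies otherArray into local memory once. The helper names are hypothetical, as is the work group size of 64 used in the note below.

```c
#include <assert.h>
#include <stdint.h>

#define N 161          /* matrix edge length, from the question */
#define OTHER_SIZE 192 /* number of entries in otherArray */

/* Integer ceiling division for a positive divisor. */
static uint64_t ceil_div(uint64_t a, uint64_t b) {
    assert(b > 0);
    return (a + b - 1) / b;
}

/* Global reads of otherArray if every pixel reads all entries directly. */
static uint64_t reads_without_cache(void) {
    return (uint64_t)N * N * N * OTHER_SIZE;
}

/* Global reads of otherArray if each work group loads it into local
   memory exactly once and every work item handles pixels_per_item pixels. */
static uint64_t reads_with_cache(uint64_t pixels_per_item, uint64_t local_size) {
    uint64_t pixels = (uint64_t)N * N * N;
    uint64_t items = ceil_div(pixels, pixels_per_item);
    uint64_t groups = ceil_div(items, local_size);
    return groups * OTHER_SIZE;
}
```

For N = 161, reads_without_cache() gives 801,269,952 reads, while reads_with_cache(4, 64) gives 3,129,984. Since 161^3 = 4,173,281 is not a multiple of 4, the work item count has to be rounded up, so the last work item needs a bounds check before it touches BigMatrix.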
Optimizations:
...
    #define N 161
    #define Nsqr (N*N)
    #define Ncub (N*N*N)
    #define otherSize 192
    __kernel void MyKernel(__global float * BigMatrix, __global float * otherArray)
    {
        //using 1 quarter of the total size of the matrix
        //this work item will be responsible for computing 4 consecutive values in BigMatrix
        //also reduces global size to (N^3)/4 ~= 1043000 for N=161
        int global_id = get_global_id(0) * 4; //The location of the first pixel in the 1D array
        int pixel_id;
        //array_id won't be used anymore. work items will process BigMatrix[pixel_id] entirely
        int local_id = get_local_id(0); //work item id within the group
        int local_size = get_local_size(0); //size of group
        float result[4]; //result cached for 4 global values
        int i, j;
        float3 p;
        //cache the values in otherArray to local memory
        //now each work item in the group will be able to read the values efficiently
        //each element in otherArray will be read a total of N^3 times, so this is important
        //opencl specifies at least 16kb of local memory, so up to 4k floats will work fine
        __local float otherValues[otherSize];
        for(i=local_id; i<otherSize; i+= local_size){
            otherValues[i] = otherArray[i];
        }
        //wait until every work item in the group has finished filling otherValues
        barrier(CLK_LOCAL_MEM_FENCE);
        //now this work item can compute the complete result for pixel_id
        for(j=0;j<4;j++){
            result[j] = 0;
            pixel_id = global_id + j;
            //Finding the x,y,z values of the pixel_id.
            //TODO: optimize the calculation of p.y and p.z
            //they will be the same most of the time for a given work item
            p.x = pixel_id % N;
            p.y = ((pixel_id % Nsqr)-p.x)/N;
            p.z = (pixel_id - p.x - p.y*N)/Nsqr;
            for(i=0;i<otherSize;i++){
                //...
                //Some long calculation for 'result' involving otherValues[i] and p...
                //...
                //result[j] += ...
            }
        }
        //4 consecutive writes to BigMatrix will fall in the same cacheline (faster)
        BigMatrix[global_id] += result[0];
        BigMatrix[global_id + 1] += result[1];
        BigMatrix[global_id + 2] += result[2];
        BigMatrix[global_id + 3] += result[3];
    }
Notes:
    __kernel void MyKernel(__global float * BigMatrix, __global float * otherArray) {
        int global_id = get_global_id(0) * 4; //The location of the first pixel in the 1D array
        int pixel_id = global_id;
        int local_id = get_local_id(0); //work item id within the group
        int local_size = get_local_size(0); //size of group
        float result[4]; //result cached for 4 global values
        int i, j;
        float3 p;
        //Finding the initial x,y,z values of the pixel_id.
        p.x = pixel_id % N;
        p.y = ((pixel_id % Nsqr)-p.x)/N;
        p.z = (pixel_id - p.x - p.y*N)/Nsqr;
        //cache the values here. same as above...
        //now this work item can compute the complete result for pixel_id
        for(j=0;j<4;j++){
            result[j] = 0;
            //increment the x, y and z values instead of computing them all from scratch
            p.x += 1;
            if(p.x >= N){
                p.x = 0;
                p.y += 1;
                if(p.y >= N){
                    p.y = 0;
                    p.z += 1;
                }
            }
            for(i=0;i
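The incremental update of x, y and z in the kernel above can be checked on the host. The following is a minimal sketch in plain C, using integer coordinates instead of float3; the names Coord, coord_from_index and coord_next are hypothetical and appear nowhere in the original code. It assumes the step is applied after a pixel's inner loop has finished, so that the first of the four pixels still uses the directly computed coordinates.

```c
#include <assert.h>

#define N 161 /* matrix edge length, from the question */

/* Integer grid coordinates of one pixel (hypothetical helper type). */
typedef struct { int x, y, z; } Coord;

/* Direct decomposition of a flat index, where id = x + y*N + z*N*N. */
static Coord coord_from_index(int pixel_id) {
    assert(pixel_id >= 0);
    Coord c;
    c.x = pixel_id % N;
    c.y = (pixel_id / N) % N;
    c.z = pixel_id / (N * N);
    return c;
}

/* Advance to the next flat index: carry x into y, and y into z. */
static Coord coord_next(Coord c) {
    c.x += 1;
    if (c.x >= N) {
        c.x = 0;
        c.y += 1;
        if (c.y >= N) {
            c.y = 0;
            c.z += 1;
        }
    }
    return c;
}
```

In a kernel loop this would mean computing the coordinates once from global_id, running the inner loop for the current pixel, and only then stepping to the next pixel. Stepping before the inner loop would shift every result by one pixel.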
Posted on 2012-04-30 14:44:01
I have a few suggestions for you:
Hope this helps!
https://stackoverflow.com/questions/10378068