Improving a Matlab + CUSP MEX solution on CUDA GPUs

Stack Overflow user

Asked 2013-04-09 16:53:08

Answers: 1 · Views: 1.7K · Followers: 0 · Votes: 1

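Matlab still cannot compute with sparse matrices on CUDA GPUs, and there is no toolbox for it either (Jacket has been discontinued). That is why I am using CUSP, integrated into Matlab through a MEX file. However, the tool I developed has two problems:

1. It is very unstable for large systems of equations (in fact starting from as few as 100 elements).
2. It is tens or hundreds of times slower than the Matlab alternative.

I am solving A*x=b, where A is a sparse symmetric matrix and b is a vector.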
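For reference, the kind of Krylov iteration involved can be sketched on the host. The example below is a minimal conjugate gradient (CG) loop over a CSR matrix, not the BiCGStab routine that the MEX code calls. CG is swapped in only because it is shorter, and it assumes A is symmetric positive definite, which the question does not state (only symmetry is given). The names csr_matvec, dot and cg_solve are hypothetical helpers, and the stopping rule (residual norm at most relative_tolerance times the norm of b) is an assumption chosen to resemble a relative-tolerance monitor.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// y = A * x for A stored in CSR form (hypothetical helper).
static void csr_matvec(const std::vector<int>& row_offsets,
                       const std::vector<int>& column_indices,
                       const std::vector<double>& values,
                       const std::vector<double>& x,
                       std::vector<double>& y)
{
    for (std::size_t i = 0; i + 1 < row_offsets.size(); ++i) {
        double sum = 0.0;
        for (int k = row_offsets[i]; k < row_offsets[i + 1]; ++k)
            sum += values[k] * x[column_indices[k]];
        y[i] = sum;
    }
}

// Euclidean inner product of two equally sized vectors.
static double dot(const std::vector<double>& a, const std::vector<double>& b)
{
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Unpreconditioned CG starting from x = 0.
// Assumes A is symmetric positive definite. Returns the iterations used.
int cg_solve(const std::vector<int>& row_offsets,
             const std::vector<int>& column_indices,
             const std::vector<double>& values,
             const std::vector<double>& b,
             std::vector<double>& x,
             int max_iterations,
             double relative_tolerance)
{
    const std::size_t n = b.size();
    assert(row_offsets.size() == n + 1);
    x.assign(n, 0.0);
    std::vector<double> r(b), p(b), Ap(n);
    double rs_old = dot(r, r);
    const double stop = relative_tolerance * std::sqrt(rs_old);
    if (std::sqrt(rs_old) <= stop) return 0;  // only true when b is zero
    for (int it = 1; it <= max_iterations; ++it) {
        csr_matvec(row_offsets, column_indices, values, p, Ap);
        const double alpha = rs_old / dot(p, Ap);
        for (std::size_t i = 0; i < n; ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        const double rs_new = dot(r, r);
        if (std::sqrt(rs_new) <= stop) return it;  // converged
        const double beta = rs_new / rs_old;
        for (std::size_t i = 0; i < n; ++i) p[i] = r[i] + beta * p[i];
        rs_old = rs_new;
    }
    return max_iterations;
}
```

CUSP also provides cusp::krylov::cg, which may be worth trying in place of BiCGStab if the matrices turn out to be positive definite.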

Hardware: Intel i7 3630QM, GT640M 2G, 8 GB DDR3. Software: Windows 8 64-bit, Matlab R2012b 64-bit, CUDA 5.0 64-bit, CUSP 0.3.1, Windows SDK v7.0, VS2010 compiler.

MEX code:

#include <cusp/csr_matrix.h>
#include <cusp/krylov/bicgstab.h>
#include <matrix.h>
#include <mex.h> 
#include <time.h>

void mexFunction(int nlhs,mxArray *plhs[],int nrhs,const mxArray *prhs[])
{
        double t1 =  clock();
          // data from Matlab       
        double *b = mxGetPr(prhs[1]);
        double *A = mxGetPr(prhs[0]);
        int n = mxGetM(prhs[0]);
        mwIndex *ir = mxGetIr(prhs[0]);
        mwIndex *jc = mxGetJc(prhs[0]);
        int N = jc[n];
        t1 = clock() - t1;

        double t2 =  clock();
          // initialization of matrix A in CSR format (jc and ir are exchanged, because Matlab uses CSC format)
        cusp::csr_matrix<int,float,cusp::device_memory> Ag(n,n,3*n-2);
        thrust::copy(jc, jc + n + 1, Ag.row_offsets.begin());
        thrust::copy(ir, ir + N,     Ag.column_indices.begin());
        thrust::copy(A,  A  + N,     Ag.values.begin()); 
          // initialization of vector b
        cusp::array1d<float, cusp::device_memory> bg (b, b+n);
        cusp::array1d<float, cusp::device_memory> xg (n, 0);
        t2 = clock() - t2;

        double t3 =  clock();
          // BiCGStab solve for vector x, using a relative tolerance of 0.001 and preconditioner M (identity)
          // this is the slowest part, much slower than the others
        cusp::verbose_monitor<float> monitor(bg, 5000, 1e-3);
        cusp::identity_operator<float, cusp::device_memory> M(n, n);
        cusp::krylov::bicgstab(Ag, xg, bg, monitor, M);        
        t3 = clock() - t3;

        double t4 =  clock();     
          // gathering solution vector back on host into Matlab array T
        mxArray *T = mxCreateDoubleMatrix(n, 1, mxREAL);
        double *x  = mxGetPr(T);
        thrust::copy(xg.begin(), xg.end(), x);
        t4 = clock() - t4;

          // gathering execution times to Matlab array times
        mxArray *times=mxCreateDoubleMatrix(5, 1, mxREAL);
        double *timesb=mxGetPr(times);
        timesb[0]=t1; timesb[1]=t2; timesb[2]=t3; timesb[3]=t4; timesb[4]=monitor.iteration_count();

          // sending data back to Matlab
        plhs[0] = times; 
        plhs[1] = T;
}

Compile this code (saved as ex.cu) into a MEX file from Matlab with the following commands (adjust the second command for 32-bit if necessary):

>> !nvcc -c -arch sm_20 ex.cu -Xcompiler -fPIC -I "C:\Program Files\MATLAB\R2012b\extern\include" -I "C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\include"
>> mex ex.obj -L"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\lib\x64" -lcudart

Sample matrices, vectors, and the compiled 64-bit MEX function: http://www.failai.lt/3fqkhvoslxyt/sampleData.7z.htm

Usage:

tic; [times,x]=ex(K',F); toc;   %K has to be transposed for CSR

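Here, times holds the separate execution times, with its last element being the number of iterations (reported by the bicgstab monitor) used for the solution, and x is the result, the solution of K*x=F.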
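The transposition works because Matlab stores sparse matrices in compressed sparse column (CSC) form, and the CSC arrays of K' are exactly the CSR arrays of K. A minimal host-side sketch of that reinterpretation follows. HostCsr and csc_as_csr_of_transpose are hypothetical names, and std::size_t stands in for mwIndex on a 64-bit build (an assumption). The sketch takes the number of stored entries from jc[n]; the MEX code allocates 3*n-2 entries, a count that matches a tridiagonal matrix only, and whether that mismatch contributes to the wrong results is not established here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side CSR container (hypothetical type, not part of CUSP or the MEX API).
struct HostCsr {
    int n;                            // number of rows (and columns)
    std::vector<int> row_offsets;     // n + 1 entries
    std::vector<int> column_indices;  // nnz entries
    std::vector<double> values;       // nnz entries
};

// Reinterpret the CSC arrays of an n-by-n matrix B as the CSR arrays of B^T.
// jc has n + 1 entries; ir and vals have jc[n] entries (Matlab's layout).
HostCsr csc_as_csr_of_transpose(std::size_t n,
                                const std::size_t* jc,
                                const std::size_t* ir,
                                const double* vals)
{
    const std::size_t nnz = jc[n];  // actual number of stored entries
    HostCsr out;
    out.n = static_cast<int>(n);
    out.row_offsets.resize(n + 1);
    out.column_indices.resize(nnz);
    out.values.resize(nnz);
    for (std::size_t i = 0; i <= n; ++i) {
        assert(i == 0 || jc[i] >= jc[i - 1]);  // offsets must not decrease
        out.row_offsets[i] = static_cast<int>(jc[i]);
    }
    for (std::size_t k = 0; k < nnz; ++k) {
        assert(ir[k] < n);  // index must lie inside the matrix
        out.column_indices[k] = static_cast<int>(ir[k]);
        out.values[k] = vals[k];
    }
    return out;
}
```

The explicit casts make the narrowing from 64-bit indices to int visible, so a matrix too large for int indices would need a different index type.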

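Results (http://www.failai.lt/rupaliln7kfb/results.7z.htm):

K_int_6, F_int_6: OK
K_11, F_11: x(1) is wrong, the other elements are OK
K_100000, F_100000: x(1) is wrong; the other elements are OK at the beginning, but further on they degrade compared with the correct result.
K_100000, F_100000: execution takes 0.6 s on the GPU (MEX) versus 0.014 s on the CPU (tic;xcpu=K\F;toc;)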
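One way to quantify how far off a returned vector is, rather than comparing entries such as x(1) by eye, is to compute its relative residual on the host in double precision. The sketch below does this for a matrix in CSR form. relative_residual is a hypothetical helper, and b is assumed to be nonzero. Because the MEX code converts the data to float and stops at a relative tolerance of 1e-3, some difference from the K\F result is to be expected even when the solver converges.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Relative residual ||b - A*x||_2 / ||b||_2 for A stored in CSR form.
// Assumes b is not the zero vector.
double relative_residual(const std::vector<int>& row_offsets,
                         const std::vector<int>& column_indices,
                         const std::vector<double>& values,
                         const std::vector<double>& x,
                         const std::vector<double>& b)
{
    const std::size_t n = b.size();
    assert(row_offsets.size() == n + 1 && x.size() == n);
    double r2 = 0.0;  // squared norm of the residual
    double b2 = 0.0;  // squared norm of the right-hand side
    for (std::size_t i = 0; i < n; ++i) {
        double ax = 0.0;  // i-th entry of A*x
        for (int k = row_offsets[i]; k < row_offsets[i + 1]; ++k)
            ax += values[k] * x[column_indices[k]];
        const double r = b[i] - ax;
        r2 += r * r;
        b2 += b[i] * b[i];
    }
    return std::sqrt(r2) / std::sqrt(b2);
}
```

A value well above the requested tolerance would point at the solve itself, while a small residual together with visibly different entries would point at conditioning or precision.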

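Could you take a look at this code, perhaps try the MEX function, report your results, and suggest how to improve it? Maybe you know of alternative ways to do sparse computations on the GPU? I hope this will be useful to everyone until Matlab releases its own support for sparse matrices on the GPU :)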

gpu

mex

sparse-matrix

cula

matlab

1 Answer

Stack Overflow user

Answered 2013-12-02 19:08:04

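Take a look at the sparse class for GPUs on the Matlab File Exchange, which supports single precision and real/complex data: http://www.mathworks.com/matlabcentral/fileexchange/44423-gpu-sparse-accumarray-non-uniform-grid.

Sparse matrix-vector multiplication is overloaded using CUSP.

Votes: 0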

Original page content provided by Stack Overflow.

Original link: https://stackoverflow.com/questions/15907874
