Does my code run 1000 iterations of a nonlinear recursive equation 1000 times?

Stack Overflow user

Asked 2012-08-13 18:03:06

1 answer · 144 views · 0 followers · 0 votes

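As I understand CUDA C, each thread executes one instance of the equation. But how do I print out all the values? The code actually runs, but I really need someone to review it and confirm that my results are in line with what I set out to design.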

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <conio.h>
#include <cuda.h>
#include <cutil.h>

__global__ void my_compute(float *y_d,float *theta_d,float *u_d)
{
    int idx=threadIdx.x+blockIdx.x*gridDim.x;

    for (idx=7;idx<1000;idx++)
    {
        y_d[idx]=theta_d[0]*y_d[idx-1]+theta_d[1]*y_d[idx-3]+
            theta_d[2]*u_d[idx-5]*u_d[idx-4]+theta_d[3]+
            theta_d[4]*u_d[idx-6]+theta_d[5]*u_d[idx-4]*y_d[idx-6]+
            theta_d[6]*u_d[idx-7]+theta_d[7]*u_d[idx-7]*u_d[idx-6]+
            theta_d[8]*y_d[idx-4]+theta_d[9]*y_d[idx-5]+
            theta_d[10]*u_d[idx-4]*y_d[idx-5]+theta_d[11]*u_d[idx-4]*y_d[idx-2]+
            theta_d[12]*u_d[idx-7]*u_d[idx-3]+theta_d[13]*u_d[idx-5]+
            theta_d[14]*u_d[idx-4];
    }
}

int main(void)
{   
    float y[1000];
    FILE *fpoo;
    FILE *u;
    float theta[15];
    float u_data[1000];
    float *y_d;
    float *theta_d;
    float *u_d;

    cudaEvent_t start,stop;
    float time;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // memory allocation

    cudaMalloc((void**)&y_d,1000*sizeof(float));
    cudaMalloc((void**)&theta_d,15*sizeof(float));
    cudaMalloc((void**)&u_d,1000*sizeof(float));
    cudaEventRecord( start, 0 );

    // importing data for theta and input of model//

    fpoo= fopen("c:\\Fly_theta.txt","r");
    u= fopen("c:\\Fly_u.txt","r");

    for (int k=0;k<15;k++)
    {
        fscanf(fpoo,"%f\n",&theta[k]);
    }
    for (int k=0;k<1000;k++)
    {
        fscanf(u,"%f\n",&u_data[k]);
    }

    //NB: pls does this for loop below make my equation run 1000000
    // instances as oppose to the 1000  instances i desire?
    for (int i=0;i<1000;i++)  
    {
        //i initialised the first 7 values of y because the equation output
        //starts form y(8)

        for (int k=0;k<8;k++)
        {
            y[k]=0;

            cudaMemcpy(y_d,y,1000*sizeof(float),cudaMemcpyHostToDevice);
            cudaMemcpy(theta_d,theta,15*sizeof(float),cudaMemcpyHostToDevice);
            cudaMemcpy(u_d,u_data,1000*sizeof(float),cudaMemcpyHostToDevice);

            //calling kernel function//
            my_compute<<<200,5>>>(y_d,theta_d,u_d);
            cudaMemcpy(y,y_d,1000*sizeof(float),cudaMemcpyDeviceToHost);
        }
        printf("\n\n*******Iteration %i*******\n", i);
        //does this actually print all the values from the threads? 

        for(int i=0;i<1000;i++)
        {
            printf("%f",y[i]);
        }
    }
    cudaEventRecord( stop, 0 );
    cudaEventSynchronize( stop );
    cudaEventElapsedTime( &time, start, stop );

    cudaEventDestroy( start );
    cudaEventDestroy( stop );
    printf("Time to generate:  %3.1f ms \n", time);

    cudaFree(y_d);
    cudaFree(theta_d);
    cudaFree(u_d);
    fclose(u);
    fclose(fpoo);
    //fclose();
    _getche();

    return (0);

}

cuda

1 Answer

Stack Overflow user

Answered 2012-08-13 18:45:57

"How do I print out all the values?"

Well, you can copy them to the host (which you already do) and then print them as usual.
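A minimal sketch of that copy-then-print pattern, assuming a hypothetical kernel `fill_squares`, a hypothetical host helper `compute_on_device`, and an array size `N` that are not part of the question's code. Each thread writes one element, the host copies the array back with `cudaMemcpy`, and the values are printed one per line.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define N 16  // hypothetical array size, chosen only for illustration

// Hypothetical kernel: each thread writes the square of its global index.
__global__ void fill_squares(float *out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)                       // guard threads past the end of the array
        out[idx] = (float)idx * (float)idx;
}

// Runs the kernel and copies the results into host_out; returns 0 on success.
int compute_on_device(float *host_out, int n)
{
    float *dev_out = NULL;
    if (cudaMalloc((void **)&dev_out, n * sizeof(float)) != cudaSuccess)
        return 1;

    int threads = 32;                          // one full warp per block
    int blocks = (n + threads - 1) / threads;  // round up to cover n elements
    fill_squares<<<blocks, threads>>>(dev_out, n);

    // A blocking cudaMemcpy on the default stream waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(host_out, dev_out, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dev_out);
    return err == cudaSuccess ? 0 : 1;
}

int main(void)
{
    float y[N];
    if (compute_on_device(y, N) != 0) {
        fprintf(stderr, "CUDA call failed\n");
        return 1;
    }
    for (int i = 0; i < N; i++)
        printf("y[%d] = %f\n", i, y[i]);  // the newline keeps values separated
    return 0;
}
```

Note that the question's loop prints with `"%f"` and no separator, so the values run together on one line; adding a newline or space makes the output readable.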

However, I am concerned about your code for several reasons:

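Only threads that belong to the same warp truly execute in parallel. A warp is a group of 32 adjacent threads (roughly warpId = threadIdx.x/32). Threads in different warps can execute in any order unless you apply some form of synchronization. Because of that, you cannot say much about y_d[idx-1] while computing y_d[idx]: has y_d[idx-1] already been computed or overwritten by another thread?
Your blocks contain only 5 threads (<<<200,5>>>), but blocks are scheduled at warp granularity (multiples of 32), so every block you launch has 5 threads running and 27 lanes idle.
You are not using parallelism at all! You have a for loop that is executed by all 1000 threads. All 1000 threads compute exactly the same thing (modulo race conditions). You compute the thread index idx, then ignore it completely and set idx to 7 in every thread.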
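One possible way to address these points is sketched here. It assumes the goal is many independent instances of the recurrence, each with its own parameter set; the question does not state this explicitly. The time loop stays sequential inside a thread, because y[k] depends on earlier values of y, and the parallelism comes from running different instances in different threads. The names `simulate_instances`, `run_simulation`, `NUM_INSTANCES`, and the per-instance layout of `theta` and `y` are hypothetical, and the input data is a placeholder rather than the question's files.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define NUM_INSTANCES 1000  // assumed number of independent runs
#define NUM_STEPS     1000  // time steps per run, as in the question
#define NUM_THETA     15    // coefficients per instance

// One thread = one instance. The time loop is sequential within a thread.
__global__ void simulate_instances(float *y, const float *theta,
                                   const float *u, int numInstances)
{
    int inst = blockIdx.x * blockDim.x + threadIdx.x;  // blockDim, not gridDim
    if (inst >= numInstances) return;

    float *yi = y + (size_t)inst * NUM_STEPS;           // this instance's output
    const float *t = theta + (size_t)inst * NUM_THETA;  // this instance's parameters

    for (int k = 0; k < 7; k++) yi[k] = 0.0f;           // initial conditions

    for (int k = 7; k < NUM_STEPS; k++) {
        yi[k] = t[0]*yi[k-1] + t[1]*yi[k-3]
              + t[2]*u[k-5]*u[k-4] + t[3]
              + t[4]*u[k-6] + t[5]*u[k-4]*yi[k-6]
              + t[6]*u[k-7] + t[7]*u[k-7]*u[k-6]
              + t[8]*yi[k-4] + t[9]*yi[k-5]
              + t[10]*u[k-4]*yi[k-5] + t[11]*u[k-4]*yi[k-2]
              + t[12]*u[k-7]*u[k-3] + t[13]*u[k-5]
              + t[14]*u[k-4];
    }
}

// Host wrapper: y holds NUM_INSTANCES*NUM_STEPS floats, theta holds
// NUM_INSTANCES*NUM_THETA floats, u holds NUM_STEPS floats. Returns 0 on success.
int run_simulation(float *y, const float *theta, const float *u)
{
    size_t yBytes = (size_t)NUM_INSTANCES * NUM_STEPS * sizeof(float);
    size_t tBytes = (size_t)NUM_INSTANCES * NUM_THETA * sizeof(float);
    size_t uBytes = (size_t)NUM_STEPS * sizeof(float);
    float *y_d = NULL, *theta_d = NULL, *u_d = NULL;

    cudaError_t err = cudaMalloc((void **)&y_d, yBytes);
    if (err == cudaSuccess) err = cudaMalloc((void **)&theta_d, tBytes);
    if (err == cudaSuccess) err = cudaMalloc((void **)&u_d, uBytes);
    if (err == cudaSuccess)
        err = cudaMemcpy(theta_d, theta, tBytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess)
        err = cudaMemcpy(u_d, u, uBytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        int threads = 128;  // a multiple of the warp size (32)
        int blocks = (NUM_INSTANCES + threads - 1) / threads;
        simulate_instances<<<blocks, threads>>>(y_d, theta_d, u_d, NUM_INSTANCES);
        err = cudaMemcpy(y, y_d, yBytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(y_d);
    cudaFree(theta_d);
    cudaFree(u_d);
    return err == cudaSuccess ? 0 : 1;
}

int main(void)
{
    float *y = (float *)malloc((size_t)NUM_INSTANCES * NUM_STEPS * sizeof(float));
    float *theta = (float *)calloc((size_t)NUM_INSTANCES * NUM_THETA, sizeof(float));
    float *u = (float *)malloc(NUM_STEPS * sizeof(float));
    if (!y || !theta || !u) return 1;

    // Placeholder data: only the constant term theta[3] is nonzero, so every
    // y[k] with k >= 7 should equal that instance's theta[3].
    for (int i = 0; i < NUM_INSTANCES; i++) theta[i * NUM_THETA + 3] = 0.5f * i;
    for (int k = 0; k < NUM_STEPS; k++) u[k] = 1.0f;

    if (run_simulation(y, theta, u) != 0) {
        fprintf(stderr, "CUDA call failed\n");
        return 1;
    }
    printf("instance 2, step 999: %f\n", y[2 * NUM_STEPS + 999]);  // expected: 1.000000
    free(y); free(theta); free(u);
    return 0;
}
```

This layout is chosen for clarity, not speed: neighbouring threads write to addresses `NUM_STEPS` floats apart, so global memory accesses are not coalesced. If only a single instance of the recurrence is needed, the dependence of y[k] on y[k-1] leaves little to parallelize and a plain CPU loop is likely the simpler choice.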

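I strongly recommend that, as an exercise in launch configuration, synchronization, and thread indexing, you implement a parallel prefix-sum algorithm, and move on to more advanced things only after you have confirmed that it works correctly.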
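As a starting point for that exercise, here is a minimal single-block sketch of an inclusive prefix sum using the Hillis-Steele scheme. It is one of several possible scan algorithms and not necessarily the one the answer had in mind. The names `inclusive_scan_block`, `scan_on_device`, and `BLOCK_SIZE` are hypothetical, and the sketch only handles inputs that fit in one block.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define BLOCK_SIZE 256  // hypothetical: the whole input must fit in one block

// Inclusive scan of n <= BLOCK_SIZE elements within a single block.
__global__ void inclusive_scan_block(const int *in, int *out, int n)
{
    __shared__ int buf[2][BLOCK_SIZE];  // double buffer avoids read/write races
    int tid = threadIdx.x;
    int src = 0;

    buf[src][tid] = (tid < n) ? in[tid] : 0;  // pad unused slots with zero
    __syncthreads();

    for (int offset = 1; offset < BLOCK_SIZE; offset *= 2) {
        int dst = 1 - src;
        int v = buf[src][tid];
        if (tid >= offset)
            v += buf[src][tid - offset];  // add the partial sum 'offset' to the left
        buf[dst][tid] = v;
        __syncthreads();  // all writes to dst finish before any thread reads it
        src = dst;
    }
    if (tid < n)
        out[tid] = buf[src][tid];
}

// Scans n ints (0 < n <= BLOCK_SIZE) on the device; returns 0 on success.
int scan_on_device(const int *in, int *out, int n)
{
    if (n <= 0 || n > BLOCK_SIZE) return 1;
    size_t bytes = (size_t)n * sizeof(int);
    int *in_d = NULL, *out_d = NULL;

    cudaError_t err = cudaMalloc((void **)&in_d, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void **)&out_d, bytes);
    if (err == cudaSuccess)
        err = cudaMemcpy(in_d, in, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        inclusive_scan_block<<<1, BLOCK_SIZE>>>(in_d, out_d, n);
        err = cudaMemcpy(out, out_d, bytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(in_d);
    cudaFree(out_d);
    return err == cudaSuccess ? 0 : 1;
}

int main(void)
{
    int in[10], out[10];
    for (int i = 0; i < 10; i++) in[i] = 1;  // scan of all ones gives 1, 2, ..., 10

    if (scan_on_device(in, out, 10) != 0) {
        fprintf(stderr, "CUDA call failed\n");
        return 1;
    }
    for (int i = 0; i < 10; i++)
        printf("%d ", out[i]);
    printf("\n");
    return 0;
}
```

The `__syncthreads()` call is the piece the question's kernel lacks: it guarantees that every thread in the block has written its partial sum before any thread reads a neighbour's value. It synchronizes only within one block, so scanning larger arrays needs an additional step to combine per-block results.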

Votes: 1

Original page content provided by Stack Overflow.

Original link: https://stackoverflow.com/questions/11932064
