Q: What is the difference between the __ldg() intrinsic and a normal execution?

Stack Overflow user

Asked 2014-10-28 15:48:19

Answers: 2, Views: 10.5K, Followers: 0, Score: 7

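I am trying to explore the __ldg intrinsic. I have gone through NVIDIA's documentation but did not find a satisfactory answer about its use and implementation. Moreover, with reference to THIS, I tried using __ldg in a simple 1024x1024 matrix multiplication example.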

#include<stdio.h>
#include<stdlib.h>

__global__ void matrix_mul(float * ad,float * bd,float * cd,int N)
{
        float pvalue=0;
        //find Row and Column corresponding to a data element for each thread
        int Row = blockIdx.y * blockDim.y + threadIdx.y;
        int Col = blockIdx.x * blockDim.x + threadIdx.x;
        //calculate dot product of Row of First Matrix and Column of Second Matrix
        for(int i=0;i< N;++i)
        {
//   I tried with executing this first:
            float m=__ldg(&ad[Row * N+i]);
            float n=__ldg(&bd[i * N + Col]);

//Then I executed this as a normal execution:
//          float m = ad[Row * N+i];
//          float n = bd[i * N + Col];

            pvalue += m * n;
         }
        //store dot product at corresponding position in resultant Matrix
        cd[Row * N + Col] = pvalue;
}

int main()
{
    int N = 1024,i,j;               //N == size of square matrix

    float *a,*b;
    float *ad,*bd,*cd,*c;

    //open a file for outputting the result
    FILE *f;
    f=fopen("Parallel Multiply_ldg.txt","w");

    size_t size=sizeof(float)* N * N;

    //allocate host side memory
    a=(float*)malloc(size);
    b=(float*)malloc(size);
    c=(float*)malloc(size);

    for(i=0;i<N;i++)
    {
        for(j=0;j<N;j++)
        {
            a[i*N+j]=2.0;   //(float)(i*N+j);       //initializing each value with its own index
            b[i*N+j]=1.0;   //(float)(i*N+j);       //random functions can be used alternatively
        }
    }

    //allocate device memory
    cudaMalloc(&ad,size);
    //printf("\nAfter cudaMalloc for ad\n%s\n",cudaGetErrorString(cudaGetLastError()));
    cudaMalloc(&bd,size);
    //printf("\nAfter cudaMalloc bd\n%s\n",cudaGetErrorString(cudaGetLastError()));
    cudaMalloc(&cd,size);
    //printf("\nAfter cudaMalloc cd\n%s\n",cudaGetErrorString(cudaGetLastError()));

    //copy value from host to device
    cudaMemcpy(ad,a,size,cudaMemcpyHostToDevice);
    cudaMemcpy(bd,b,size,cudaMemcpyHostToDevice);

    printf("\nAfter HostToDevice Memcpy\n%s\n",cudaGetErrorString(cudaGetLastError()));

    //calculate execution configuration
    dim3 blocksize(16,16);              //each block contains 16 * 16 (=256) threads
    dim3 gridsize(N/16,N/16);           //creating just sufficient no of blocks

    //GPU timer code
    float time;
    cudaEvent_t start,stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start,0);

    matrix_mul <<< gridsize, blocksize >>> (ad,bd,cd, N);
    cudaDeviceSynchronize();
    cudaEventRecord(stop,0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&time,start,stop);         //time taken in kernel call calculated
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    //copy back results
    cudaMemcpy(c,cd,sizeof(float)* N*N,cudaMemcpyDeviceToHost);

    printf("\nAfter DeviceToHost Memcpy\n%s\n",cudaGetErrorString(cudaGetLastError()));

    //output results in output_file
    fprintf(f,"Array A was---\n");
    for(i=0;i<N;i++)
    {
        for(j=0;j<N;j++)
            fprintf(f,"%f ",a[i*N+j]);
        fprintf(f,"\n");
    }
    fprintf(f,"\nArray B was---\n");
    for(i=0;i<N;i++)
    {
        for(j=0;j<N;j++)
            fprintf(f,"%f ",b[i*N+j]);
        fprintf(f,"\n");
    }
    fprintf(f,"\nMultiplication of A and B gives C----\n");
    for(i=0;i<N;i++)
    {
        for(j=0;j<N;j++)
            fprintf(f,"%f ",c[i*N+j]);              //if correctly computed, then all values must be 2*N (a is all 2.0, b is all 1.0)
        fprintf(f,"\n");
    }
    printf("\nYou can see output in Parallel Multiply_ldg.txt file in project directory");
    printf("\n\nTime taken is %f (ms)\n",time);
    fprintf(f,"\n\nTime taken is %f (ms)\n",time);
    fclose(f);

    cudaFree(ad); cudaFree(bd); cudaFree(cd);
    cudaDeviceReset();              //replaces the deprecated cudaThreadExit()
    free(a);free(b);free(c);
    //_getch();
    return 0;                       //0 signals success to the caller
}

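I commented out the __ldg part of the kernel and ran it with the normal loads, and vice versa. In both cases it gives the correct multiplication result. I am confused by the time difference between these executions, because it is huge, more than 100x!

With __ldg it gives me: Time taken is 0.014432 (ms)

With normal execution, without __ldg, it gives me: Time taken is 36.858398 (ms)

Is this the exact way to use the __ldg intrinsic? What is the significance of the __ldg intrinsic, and what is the proper way to use it? Apparently what I did in the code above is wrong and naive. I am looking for an explanation and an example. Thanks in advance.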
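One way to sanity-check timings like the two above is sketched here. Both helpers are hypothetical and not part of the question's code: `implied_gflops` converts a measured time into arithmetic throughput, counting 2*N^3 floating-point operations for an N x N multiplication, and `kernel_ran_ok` reports whether the preceding kernel launch succeeded, since a kernel that never ran also yields a very small elapsed time.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: throughput in GFLOP/s implied by an n x n matrix
// multiplication (2 * n^3 floating-point operations) that took elapsed_ms.
static double implied_gflops(int n, double elapsed_ms)
{
    double flops = 2.0 * n * (double)n * n;
    return flops / (elapsed_ms * 1.0e-3) / 1.0e9;
}

// Hypothetical helper: call right after a kernel launch. Returns true only
// if the launch was accepted and the kernel ran to completion.
static bool kernel_ran_ok(const char *label)
{
    // Errors in the launch itself (for example, no kernel image for this GPU)
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) {
        // Errors raised while the kernel was executing
        err = cudaDeviceSynchronize();
    }
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", label, cudaGetErrorString(err));
        return false;
    }
    return true;
}
```

With N = 1024, the 36.858398 ms run corresponds to roughly 58 GFLOP/s and the 0.014432 ms run to roughly 149 TFLOP/s. The second figure is far beyond what a single GPU could sustain in 2014, so it may be worth calling kernel_ran_ok("matrix_mul") right after the launch before attributing the whole gap to caching.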

cuda

2 Answers

Stack Overflow user

Accepted answer

Answered 2014-10-28 16:36:18

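From the CUDA C Programming Guide:

Global memory accesses for devices of compute capability 3.x are cached in L2 and, for devices of compute capability 3.5, may also be cached in the read-only data cache described in the previous section; they are not cached in L1.

...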

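Data that is read-only for the entire lifetime of the kernel can also be cached in the read-only data cache described in the previous section by reading it using the __ldg() function (see Read-Only Data Cache Load Function). When the compiler detects that the read-only condition is satisfied for some data, it will use __ldg() to read it. The compiler might not always be able to detect that the read-only condition is satisfied for some data. Marking pointers used for loading such data with both the const and __restrict__ qualifiers increases the likelihood that the compiler will detect the read-only condition.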
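The two routes described in the guide can be sketched on the question's matrix multiplication. This is a minimal illustration, not the asker's code: the kernel names and the `run_matmul` host helper are hypothetical, the bounds check is an addition, and the __ldg variant assumes compilation for compute capability 3.5 or later (for example with nvcc -arch=sm_35 or newer).

```cuda
#include <vector>
#include <cuda_runtime.h>

// Route 1: const + __restrict__ tell the compiler the inputs are read-only
// and not aliased, so it may pick read-only-cache loads by itself.
__global__ void matmul_restrict(const float *__restrict__ a,
                                const float *__restrict__ b,
                                float *__restrict__ c, int n)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n || col >= n) return;       // guard for partial blocks
    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
        sum += a[row * n + i] * b[i * n + col];
    c[row * n + col] = sum;
}

// Route 2: request read-only-cache loads explicitly with __ldg().
__global__ void matmul_ldg(const float *a, const float *b, float *c, int n)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n || col >= n) return;
    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
        sum += __ldg(&a[row * n + i]) * __ldg(&b[i * n + col]);
    c[row * n + col] = sum;
}

// Hypothetical host helper: multiplies an n x n matrix of 2s by one of 1s
// and returns element (0,0) of the product, or -1 on any CUDA error.
static float run_matmul(bool use_ldg, int n)
{
    size_t bytes = sizeof(float) * n * n;
    std::vector<float> a(n * n, 2.0f), b(n * n, 1.0f), c(n * n, 0.0f);
    float *ad = nullptr, *bd = nullptr, *cd = nullptr;
    bool ok = cudaMalloc(&ad, bytes) == cudaSuccess &&
              cudaMalloc(&bd, bytes) == cudaSuccess &&
              cudaMalloc(&cd, bytes) == cudaSuccess &&
              cudaMemcpy(ad, a.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
              cudaMemcpy(bd, b.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        dim3 block(16, 16);
        dim3 grid((n + 15) / 16, (n + 15) / 16);
        if (use_ldg) matmul_ldg<<<grid, block>>>(ad, bd, cd, n);
        else         matmul_restrict<<<grid, block>>>(ad, bd, cd, n);
        // The blocking copy back also waits for the kernel to finish.
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(c.data(), cd, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(ad); cudaFree(bd); cudaFree(cd);   // cudaFree(nullptr) is a no-op
    return ok ? c[0] : -1.0f;
}
```

Calling run_matmul(true, 64) and run_matmul(false, 64) should both return 2 * 64 = 128; whether the two kernels differ in speed depends on the GPU and on what the compiler generates for each.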

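Read-only cache accesses have a much lower latency than global memory accesses. Because matrix multiplication reads the same values from memory many times, caching them in the read-only cache gives a huge speedup (in memory-bound applications).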
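Since the guide ties the read-only data cache to compute capability 3.5, it can help to confirm what the device supports before relying on __ldg(). The sketch below is an assumption-laden illustration: `supports_ldg` and `load_ro` are hypothetical names, and the fallback to a plain load on older architectures is one possible design, not something the answer prescribes.

```cuda
#include <cuda_runtime.h>

// Hypothetical host helper: true if the given device reports compute
// capability 3.5 or later, where __ldg() is available.
static bool supports_ldg(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess)
        return false;                       // invalid device or no usable GPU
    return prop.major > 3 || (prop.major == 3 && prop.minor >= 5);
}

// Hypothetical device helper: uses __ldg() when compiling for sm_35 or
// later and falls back to an ordinary global load otherwise.
__device__ __forceinline__ float load_ro(const float *p)
{
#if __CUDA_ARCH__ >= 350
    return __ldg(p);
#else
    return *p;
#endif
}
```

A kernel would then read its inputs as load_ro(&ad[Row * N + i]), and the host could call supports_ldg(0) before launching to report which path device 0 is expected to take.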

Score: 13

Stack Overflow user

Answered 2020-06-08 23:00:23

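NVIDIA GPUs have a texture memory with special logic for working with images.

This texture memory is another type of memory available on the GPU. In particular, constant memory, global memory, and the register file have no relation to this texture memory.

Kepler GPUs and later add the ability to use this memory from the "GPU texture pipeline".

But let's specify the difference between the constant cache and the read-only cache.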

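Constant cache

Data loaded through the constant cache must be relatively small and must be accessed in such a pattern that all threads of a warp access the same location at any given time.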
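The uniform access pattern described above can be sketched with a small coefficient table. Everything here is illustrative and not from the answer: `coeffs`, `poly_eval`, and `eval_poly_on_gpu` are hypothetical names, and the polynomial is arbitrary example data.

```cuda
#include <cuda_runtime.h>

// Small table placed in constant memory (hypothetical example data).
__constant__ float coeffs[4];

// In each loop iteration every thread of a warp reads coeffs[k] for the same
// k, which is the uniform pattern the constant cache is suited to. The x and
// y arrays are indexed per thread and go through ordinary global memory.
__global__ void poly_eval(const float *x, float *y, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;
    float acc = 0.0f;
    for (int k = 3; k >= 0; --k)            // Horner's rule
        acc = acc * x[idx] + coeffs[k];
    y[idx] = acc;
}

// Hypothetical host helper: evaluates 1 + 2x + 3x^2 + 4x^3 at a single
// point on the GPU and returns the result, or -1 on any CUDA error.
static float eval_poly_on_gpu(float x)
{
    const float h_coeffs[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float *xd = nullptr, *yd = nullptr, y = -1.0f;
    bool ok = cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs)) == cudaSuccess &&
              cudaMalloc(&xd, sizeof(float)) == cudaSuccess &&
              cudaMalloc(&yd, sizeof(float)) == cudaSuccess &&
              cudaMemcpy(xd, &x, sizeof(float), cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        poly_eval<<<1, 32>>>(xd, yd, 1);
        // The blocking copy back also waits for the kernel to finish.
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(&y, yd, sizeof(float), cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(xd); cudaFree(yd);             // cudaFree(nullptr) is a no-op
    return ok ? y : -1.0f;
}
```

For x = 2 the polynomial evaluates to 1 + 4 + 12 + 32 = 49, so eval_poly_on_gpu(2.0f) should return 49 on a working device.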

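Read-only cache or texture memory cache

This cache can be much larger and can be accessed in a non-uniform pattern. The read-only cache has a granularity of 32 bytes.

You can use it as a "read-only cache" for your CUDA kernel.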

1. Data stored in global memory can be cached in the GPU texture memory cache.
2. By doing so, you promise the compiler that the data is read-only for the
   duration of the kernel's execution on the GPU.

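There are two ways to achieve this.

A. Use the intrinsic function __ldg

Example: output[i] += __ldg(&input[j]);

B. Qualify the pointers to global memory

const float* __restrict__ input     // kernel parameter declaration
output[idx] += input[idx];          // plain load inside the kernel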

Comparison:

For deep compiler reasons, the __ldg intrinsic is the better choice.

Score: 1

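Original page content provided by Stack Overflow. Translation support on the source page was provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link: https://stackoverflow.com/questions/26603188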
