Timing of array multiply vs. SSE intrinsic multiply?

Stack Overflow user
Asked 2014-10-21 19:31:37
Answers: 1 · Views: 832 · Followers: 0 · Votes: 4

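To test my understanding of SSE intrinsics I wrote the code below. It compiles and runs correctly, but the improvement from SSE is small: the SSE version is only about 20% faster. Shouldn't it be roughly 4 times faster, since each SSE instruction multiplies four floats at once? Is the compiler optimizing the scalar loop? If so, how can that be disabled? Is there something wrong with the sse_mult() function I wrote?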

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <emmintrin.h>
// gcc options -mfpmath=sse -mmmx -msse -msse2 \ Not sure if any are needed have been using -msse2

/*--------------------------------------------------------------------------------------------------
 * SIMD intrinsics header files
 * 
 * <mmintrin.h>  MMX
 *
 * <xmmintrin.h> SSE
 *
 * <emmintrin.h> SSE2
 *
 * <pmmintrin.h> SSE3
 *
 * <tmmintrin.h> SSSE3
 *
 * <smmintrin.h> SSE4.1
 *
 * <nmmintrin.h> SSE4.2
 *
 * <ammintrin.h> SSE4A
 *
 * <wmmintrin.h> AES
 *
 * <immintrin.h> AVX
 *------------------------------------------------------------------------------------------------*/

#define n 1000000

// Global variables
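// a and b are read with _mm_load_ps, which requires 16-byte-aligned addresses
__declspec(align(16)) float a[n]; // array to hold random numbers
__declspec(align(16)) float b[n]; // array to hold random numbers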
float c[n]; // array to hold product a*b for scalar multiply
__declspec(align(16)) float d[n] ; // array to hold product a*b for sse multiply
// Also possible to use __attribute__((aligned(16))); to force correct alignment

// Multiply using loop
void loop_mult() {
    int i; // Loop index

    clock_t begin_loop, end_loop; // clock_t is type returned by clock()
    double time_spent_loop;

    // Time multiply operation
    begin_loop = clock();   
        // Multiply two arrays of floats
        for(i = 0; i < n; i++) {
            c[i] = a[i] * b[i];
        }
    end_loop = clock();

    // Calculate time it took to run loop. CLOCKS_PER_SEC is the number of clock ticks per second.
    time_spent_loop = (double)(end_loop - begin_loop) / CLOCKS_PER_SEC;
    printf("Time for scalar loop was %f seconds\n", time_spent_loop);
}

// Multiply using sse
void sse_mult() {
    int k,i; // Index
    clock_t begin_sse, end_sse; // clock_t is type returned by clock()
    double time_spent_sse;

    // Time multiply operation
    begin_sse = clock();    
        // Multiply two arrays of floats
        __m128 x,y,result; // __m128 is a data type, can hold 4 32 bit floating point values
        result = _mm_setzero_ps(); // set register to hold all zeros
        for(k = 0; k <= (n-4); k += 4) {
            x = _mm_load_ps(&a[k]); // Load chunk of 4 floats into register
            y = _mm_load_ps(&b[k]);
            result = _mm_mul_ps(x,y); // multiply 4 floats
            _mm_store_ps(&d[k],result); // store result in array
        }
        int extra = n%4; // If array size isn't exactly a multiple of 4 use scalar ops for remainder
        if(extra!=0) {
            for(i = (n-extra); i < n; i++) {
                d[i] = a[i] * b[i];
            }
        }
    end_sse = clock();

    // Calculate time it took to run loop. CLOCKS_PER_SEC is the number of clock ticks per second.
    time_spent_sse = (double)(end_sse - begin_sse) / CLOCKS_PER_SEC;
    printf("Time for sse was %f seconds\n", time_spent_sse);
}

int main() {
    int i; // Loop index

    srand((unsigned)time(NULL)); // initial value that rand uses, called the seed
        // the unsigned cast converts the time_t value to the type srand() expects
        // time(NULL) uses the system clock as the seed so values will be different each time

    for(i = 0; i < n; i++) {
        // Fill arrays with random numbers
        a[i] = ((float)rand()/RAND_MAX)*10; // rand() returns an integer value between 0 and RAND_MAX
        b[i] = ((float)rand()/RAND_MAX)*20;
    }

    loop_mult();
    sse_mult();
    for(i=0; i<n; i++) {
        // printf("a[%d] = %f\n", i, a[i]); // print values to check
        // printf("b[%d] = %f\n", i, b[i]);
        // printf("c[%d] = %f\n", i, c[i]);
        // printf("d[%d] = %f\n", i, d[i]);
        if(c[i]!=d[i]) {
            printf("Error with sse multiply.\n");
            break;
        }
    }


    return 0;
}

1 Answer

Stack Overflow user

Accepted answer

Posted 2014-10-21 20:41:35

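Your program is memory-bound. SSE doesn't make much of a difference because most of the time is spent reading those large arrays from RAM. Reduce the size of the arrays so that they fit in cache, and increase the number of passes over them. Once all of the data is in cache, the SSE version should perform much faster.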

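Keep in mind that other factors may also be involved:

  • GCC can (to some extent) auto-vectorize loops. (I think it needs -O3.)
  • The first method you test will be slower because the cache isn't populated yet. You may want to run both methods several times, alternating between them.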
Votes: 5
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/26494785
