NEON equivalent of SSE intrinsics

Stack Overflow user
Asked 2012-07-02 11:37:45
2 answers, 2.1K views, 0 followers, 8 votes

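I am trying to convert C code into optimized code using NEON intrinsics.

Here is the C code. It operates on two scalar operands rather than on vectors of operands.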

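uint16_t mult_z216(uint16_t a, uint16_t b) {
    /* Widen before multiplying so the product is computed as unsigned 32-bit. */
    unsigned int c1 = (unsigned int)a * b;
    if (c1)
    {
        int c1h = c1 >> 16;
        int c1l = c1 & 0xffff;
        return (c1l - c1h + ((c1l < c1h) ? 1 : 0)) & 0xffff;
    }
    return (1 - a - b) & 0xffff;
}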

An optimized SSE version of this operation has already been implemented:

#define MULT_Z216_SSE(a, b, c) \
    t0  = _mm_or_si128 ((a), (b));          /* t0 = a | b (bitwise OR of the 128-bit values) */ \
    (c) = _mm_mullo_epi16 ((a), (b));       /* c = low 16 bits of each 16-bit product */ \
    (a) = _mm_mulhi_epu16 ((a), (b));       /* a = high 16 bits of each unsigned 16-bit product */ \
    (b) = _mm_subs_epu16 ((c), (a));        /* b = c - a with unsigned saturation (0 where c <= a) */ \
    (b) = _mm_cmpeq_epi16 ((b), C_0x0_XMM); /* b = 0xFFFF where b == 0, else 0x0 */ \
    (b) = _mm_srli_epi16 ((b), 15);         /* b >>= 15 (logical), leaving 1 or 0 per lane */ \
    (c) = _mm_sub_epi16 ((c), (a));         /* c = c - a (wrapping) */ \
    (a) = _mm_cmpeq_epi16 ((c), C_0x0_XMM); /* a = 0xFFFF where c == 0, else 0x0 */ \
    (c) = _mm_add_epi16 ((c), (b));         /* c = c + b */ \
    t0  = _mm_and_si128 (t0, (a));          /* t0 = t0 & a */ \
    (c) = _mm_sub_epi16 ((c), t0);          /* c = c - t0 */
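
To make the data flow of the SSE macro easier to follow, here is a rough scalar model of what a single 16-bit lane goes through, checked against the scalar function from the question. The names `mult_z216_scalar`, `mult_z216_lanewise` and `check_lanewise_model` are made up for this illustration, and the check is a strided sanity sweep, not an exhaustive proof.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar reference from the question (product computed in 32 bits). */
static uint16_t mult_z216_scalar(uint16_t a, uint16_t b)
{
    uint32_t c1 = (uint32_t)a * b;
    if (c1) {
        int32_t c1h = (int32_t)(c1 >> 16);
        int32_t c1l = (int32_t)(c1 & 0xffff);
        return (uint16_t)((c1l - c1h + ((c1l < c1h) ? 1 : 0)) & 0xffff);
    }
    return (uint16_t)((1 - a - b) & 0xffff);
}

/* Hypothetical helper: one 16-bit lane of MULT_Z216_SSE, step by step. */
static uint16_t mult_z216_lanewise(uint16_t a, uint16_t b)
{
    uint32_t prod  = (uint32_t)a * b;
    uint16_t t0    = (uint16_t)(a | b);                       /* _mm_or_si128 */
    uint16_t lo    = (uint16_t)prod;                          /* _mm_mullo_epi16 */
    uint16_t hi    = (uint16_t)(prod >> 16);                  /* _mm_mulhi_epu16 */
    uint16_t sat   = (uint16_t)(lo > hi ? lo - hi : 0);       /* _mm_subs_epu16 */
    uint16_t carry = (uint16_t)((sat == 0 ? 0xffffu : 0u) >> 15); /* cmpeq, then srli by 15 */
    uint16_t diff  = (uint16_t)(lo - hi);                     /* _mm_sub_epi16, wraps */
    uint16_t zmask = (uint16_t)(diff == 0 ? 0xffffu : 0u);    /* cmpeq against zero */
    uint16_t sum   = (uint16_t)(diff + carry);                /* _mm_add_epi16 */
    return (uint16_t)(sum - (t0 & zmask));                    /* and, then sub */
}

/* Strided comparison of the lane model against the scalar reference. */
static void check_lanewise_model(void)
{
    for (uint32_t a = 0; a <= 0xffff; a += 251)
        for (uint32_t b = 0; b <= 0xffff; b += 257)
            assert(mult_z216_lanewise((uint16_t)a, (uint16_t)b) ==
                   mult_z216_scalar((uint16_t)a, (uint16_t)b));
}
```

The zero-product branch of the scalar code shows up here as `zmask`: when the product is 0, `lo` and `hi` are both 0, so `diff` is 0 and `t0` (which then equals `a + b`) is subtracted from 1.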

I have almost finished porting this to NEON intrinsics:

#define MULT_Z216_NEON(a, b, out) \
    temp = vorrq_u16 (*a, *b); \
    // ??
    // ??
    *b = vsubq_u16(*out, *a); \
    *b = vceqq_u16(*out, vdupq_n_u16(0x0000)); \
    *b = vshrq_n_u16(*b, 15); \
    *out = vsubq_s16(*out, *a); \
    *a = vceqq_s16(*c, vdupq_n_u16(0x0000)); \
    *c = vaddq_s16(*c, *b); \
    *temp = vandq_u16(*temp, *a); \
    *out = vsubq_s16(*out, *a);

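The only things I am missing are the NEON equivalents of _mm_mullo_epi16 ((a), (b)); and _mm_mulhi_epu16 ((a), (b));. Either I have misunderstood something or NEON has no such intrinsics. If there is no equivalent, how can these steps be achieved with NEON intrinsics?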

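Update:

I forgot to emphasize the following point: the operands of the function are uint16x8_t NEON vectors (each element is a uint16_t, i.e. an integer between 0 and 65535). One answer suggested using the intrinsic vqdmulhq_s16(). That approach does not match the given implementation, because the multiplication intrinsic interprets the vectors as signed values and produces wrong output.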


2 Answers

Stack Overflow user

Accepted answer

Answered 2012-07-02 18:30:50

You can use:

uint32x4_t vmull_u16 (uint16x4_t, uint16x4_t) 

It returns a vector of 32-bit products. If you want to split the result into high and low halves, you can use the NEON unzip intrinsics.
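
As a sketch of how this suggestion could be applied to the whole operation, the code below multiplies the low and high halves of the vectors with `vmull_u16`, de-interleaves the 16-bit halves of the 32-bit products with `vuzpq_u16`, and then follows the same steps as the SSE macro. It is not the asker's final code: `mult_z216_x8`, `mult_z216_ref` and `mult_z216_x8_selfcheck` are hypothetical names, the array-based interface is an assumption made so the example is easy to call, and the unzip step assumes the usual little-endian lane layout. On targets without NEON the block falls back to the scalar reference, so the self-check only exercises the intrinsics on ARM.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar reference from the question (product computed in 32 bits). */
static uint16_t mult_z216_ref(uint16_t a, uint16_t b)
{
    uint32_t c1 = (uint32_t)a * b;
    if (c1) {
        int32_t c1h = (int32_t)(c1 >> 16);
        int32_t c1l = (int32_t)(c1 & 0xffff);
        return (uint16_t)((c1l - c1h + ((c1l < c1h) ? 1 : 0)) & 0xffff);
    }
    return (uint16_t)((1 - a - b) & 0xffff);
}

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Hypothetical helper: eight lanes at once (little-endian lane layout assumed). */
static void mult_z216_x8(const uint16_t a[8], const uint16_t b[8], uint16_t out[8])
{
    uint16x8_t va   = vld1q_u16(a);
    uint16x8_t vb   = vld1q_u16(b);
    uint16x8_t zero = vdupq_n_u16(0);
    uint16x8_t t0   = vorrq_u16(va, vb);

    /* Widening multiplies: four 32-bit products per call. */
    uint32x4_t p03 = vmull_u16(vget_low_u16(va),  vget_low_u16(vb));
    uint32x4_t p47 = vmull_u16(vget_high_u16(va), vget_high_u16(vb));

    /* Unzip the 16-bit halves: val[0] = low halves, val[1] = high halves. */
    uint16x8x2_t halves = vuzpq_u16(vreinterpretq_u16_u32(p03),
                                    vreinterpretq_u16_u32(p47));
    uint16x8_t lo = halves.val[0];  /* plays the role of _mm_mullo_epi16 */
    uint16x8_t hi = halves.val[1];  /* plays the role of _mm_mulhi_epu16 */

    /* Same steps as the SSE macro: saturating sub, compare, shift, and so on. */
    uint16x8_t carry = vshrq_n_u16(vceqq_u16(vqsubq_u16(lo, hi), zero), 15);
    uint16x8_t diff  = vsubq_u16(lo, hi);
    uint16x8_t zmask = vceqq_u16(diff, zero);
    uint16x8_t res   = vsubq_u16(vaddq_u16(diff, carry), vandq_u16(t0, zmask));
    vst1q_u16(out, res);
}
#else
/* Fallback for targets without NEON: apply the scalar reference per lane. */
static void mult_z216_x8(const uint16_t a[8], const uint16_t b[8], uint16_t out[8])
{
    for (int i = 0; i < 8; ++i)
        out[i] = mult_z216_ref(a[i], b[i]);
}
#endif

/* Compares the eight-lane version with the scalar reference on a few inputs. */
static void mult_z216_x8_selfcheck(void)
{
    const uint16_t a[8] = {0, 0, 1, 2, 65535, 65535, 256, 12345};
    const uint16_t b[8] = {0, 7, 65535, 32768, 65535, 0, 256, 54321};
    uint16_t out[8];
    mult_z216_x8(a, b, out);
    for (int i = 0; i < 8; ++i)
        assert(out[i] == mult_z216_ref(a[i], b[i]));
}
```

Because `vmull_u16` is an unsigned widening multiply, inputs above 32767 are handled without the signedness problem raised in the question's update.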

Votes: 5

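Stack Overflow user

Answered 2012-07-02 22:02:13

vmulq_s16() is the equivalent of _mm_mullo_epi16. There is no exact equivalent of _mm_mulhi_epu16; the closest instruction is vqdmulhq_s16(), which is "saturating, doubling, multiply, return high half". It operates only on signed 16-bit values, and you need to halve either the input or the output to cancel the doubling.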
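
To illustrate why the signedness matters for the asker's unsigned inputs, here is a small scalar model of one lane of each operation. `qdmulh_s16_model` and `mulhi_u16_model` are hypothetical helpers written for this comparison, not NEON or SSE functions, and the model assumes that right-shifting a negative value is an arithmetic shift, as it is on common compilers.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one lane of vqdmulhq_s16: saturate((2 * a * b) >> 16), signed. */
static int16_t qdmulh_s16_model(int16_t a, int16_t b)
{
    int64_t doubled = 2 * (int64_t)a * (int64_t)b;
    int64_t high = doubled >> 16;            /* assumes arithmetic right shift */
    if (high > INT16_MAX)
        high = INT16_MAX;                    /* only hit for a == b == INT16_MIN */
    return (int16_t)high;
}

/* Scalar model of one lane of _mm_mulhi_epu16: high half of the unsigned product. */
static uint16_t mulhi_u16_model(uint16_t a, uint16_t b)
{
    return (uint16_t)(((uint32_t)a * b) >> 16);
}

/* Small inputs agree after halving; inputs with the top bit set do not. */
static void check_qdmulh_vs_mulhi(void)
{
    /* 4096 * 256: the doubled signed result is twice the unsigned high half. */
    assert(mulhi_u16_model(4096, 256) == 16);
    assert(qdmulh_s16_model(4096, 256) == 32);

    /* Bit pattern 0x8000 is 32768 unsigned but -32768 signed. */
    assert(mulhi_u16_model(0x8000, 4) == 2);
    assert(qdmulh_s16_model(INT16_MIN, 4) == -4);
}
```

For the bit pattern 0x8000, halving the signed result gives -2 instead of the unsigned high half 2, which is consistent with the asker's observation that this intrinsic yields wrong output for uint16_t lanes.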

Votes: 1
Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/11292884
