Question: Does x86-64 execute address-computing mov, i.e. mov i(r, r, i), r, on port 1? Or is it still p0156?

Stack Overflow user

Asked 2020-09-20 10:41:42

Answers: 1, Views: 233, Followers: 0, Votes: 2

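I am asking whether mov instructions that need to compute an address, i.e. (in AT&T syntax)

mov i(r, r, i), reg or mov reg, i(r, reg, i)

have to execute on port 1, because they are effectively an LEA with 3 operands plus a MOV, or whether they are free to execute on ports 0156.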

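If they do execute the LEA portion on port 1, is port 1 unblocked as soon as the address computation finishes, or does the entire memory load have to complete first?

On ICL, it seems that p7 can handle indexed addressing modes?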

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>


#define BENCH_ATTR __attribute__((noinline, noclone, aligned(4096)))


#define TERMS 3

void BENCH_ATTR
test_store_port() {
    const uint32_t N = (1 << 29);

    uint64_t dst, loop_cnt;
    uint64_t src[16] __attribute__((aligned(64)));

    asm volatile(
        "movl %[N], %k[loop_cnt]\n\t"
        ".p2align 5\n\t"
        "1:\n\t"

        "movl %k[loop_cnt], %k[dst]\n\t"
        "andl $15, %k[dst]\n\t"
#if TERMS == 3
        "movl %k[dst], (%[src], %[dst], 4)\n\t"
#else
        "movl %k[dst], (%[src])\n\t"
#endif


        "decl %k[loop_cnt]\n\t"
        "jnz 1b\n\t"
        : [ dst ] "+r"(dst), [ loop_cnt ] "+r"(loop_cnt)
        : [ N ] "i"(N), [ src ] "r"(src), "m"(*((const uint32_t(*)[16])src))
        : "cc");
}

int
main(int argc, char ** argv) {
    test_store_port();
}

Results with #define TERMS 3

perf stat -e uops_dispatched.port_2_3 -e uops_dispatched.port_7_8 -e uops_issued.any -e cpu-cycles ./bsf_dep

 Performance counter stats for './bsf_dep':

           297,191      uops_dispatched.port_2_3                                    
       537,039,830      uops_dispatched.port_7_8                                    
     2,149,098,661      uops_issued.any                                             
       761,661,276      cpu-cycles                                                  

       0.210463841 seconds time elapsed

       0.210366000 seconds user
       0.000000000 seconds sys

Results with #define TERMS 1

perf stat -e uops_dispatched.port_2_3 -e uops_dispatched.port_7_8 -e uops_issued.any -e cpu-cycles ./bsf_dep

 Performance counter stats for './bsf_dep':

           291,370      uops_dispatched.port_2_3                                    
       537,040,822      uops_dispatched.port_7_8                                    
     2,148,947,408      uops_issued.any                                             
       761,476,510      cpu-cycles                                                  

       0.202235307 seconds time elapsed

       0.202209000 seconds user
       0.000000000 seconds sys
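
As a rough sanity check, the raw counts above can be normalized by the loop trip count N = 1 << 29 from the benchmark. The helper below is only a sketch: the names BENCH_ITERS and per_iteration are hypothetical and not part of the original post.

```c
#include <stdint.h>

/* Loop trip count used by the benchmark above: N = 1 << 29. */
#define BENCH_ITERS ((uint64_t)1 << 29)

/* Hypothetical helper: average number of events per loop iteration,
 * given a raw counter value reported by perf stat. */
static double per_iteration(uint64_t event_count) {
    return (double)event_count / (double)BENCH_ITERS;
}
```

Fed with the numbers above, uops_dispatched.port_7_8 comes out at roughly 1 per iteration and uops_issued.any at roughly 4 per iteration for both TERMS settings, while uops_dispatched.port_2_3 stays close to 0.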

Tags: assembly, x86, intel, cpu-architecture

1 Answer

Stack Overflow user

Answered 2020-09-20 10:54:11

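All CPUs generate addresses for load/store uops on the AGUs in the load or store-address ports, not on the ALU ports. Only LEA uses ALU execution ports for that shift-and-add math.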
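
To make the distinction concrete, here is a minimal sketch that applies the same addressing mode once through LEA (the arithmetic runs on an ALU port) and once through a load (the arithmetic is done by the AGU of the load port). It assumes GCC or Clang extended asm on x86-64 and falls back to plain C elsewhere; the function names addr_via_lea and load_indexed are hypothetical.

```c
#include <stdint.h>

/* Computes base + idx*4 + 8 with LEA. No memory is accessed;
 * the shift-and-add is an ordinary ALU operation. */
static inline const uint32_t *addr_via_lea(const uint32_t *base, uint64_t idx) {
    const uint32_t *p;
#if defined(__x86_64__)
    __asm__("leaq 8(%1,%2,4), %0" : "=r"(p) : "r"(base), "r"(idx));
#else
    p = base + idx + 2;                 /* portable fallback: same address */
#endif
    return p;
}

/* Loads 4 bytes from base + idx*4 + 8. The address math is part of the
 * load uop itself and is handled by the AGU, not by an ALU port. */
static inline uint32_t load_indexed(const uint32_t *base, uint64_t idx) {
    uint32_t v;
#if defined(__x86_64__)
    __asm__("movl 8(%1,%2,4), %0"
            : "=r"(v)
            : "r"(base), "r"(idx)
            : "memory");                /* tell the compiler memory is read */
#else
    v = base[idx + 2];                  /* portable fallback: same element */
#endif
    return v;
}
```

Both functions use the addressing mode 8(base, idx, 4); only the LEA version occupies an ALU port for it.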

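If complex addressing modes needed port 1, https://uops.info/ and/or https://agner.org/optimize/ would say so in their instruction tables. But they don't: loads only need p23, and stores only need p237 for the store-address uop plus p4 for the store-data uop.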

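Actually it is just p23 for indexed stores: the simple store-address AGU on port 7 (Haswell through Skylake) can only handle reg+constant addressing. This means address generation can become a bottleneck if you use indexed addressing modes in code that could otherwise sustain 2 loads + 1 store per clock.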

(Early Sandy Bridge-family CPUs, SnB and IvB, would even un-laminate indexed stores, so there was a front-end cost as well.)

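Ice Lake changed this, with 2 dedicated store AGUs on ports 7 and 8. Store-address uops can no longer borrow the load AGUs, so the store AGUs have to be full-featured. https://uops.info/html-tp/ICL/MOV_M32_R32-Measurements.html confirms that stores with indexed addressing modes do run at 2/clock on ICL, so both store AGUs are full-featured, e.g. for mov [r14+r13*1+0x4],r8d. (uops.info did not test a scale factor greater than 1, but I would assume the two store AGUs are identical, in which case both can handle it.)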
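
The port assignments described above can be summarized as a small lookup. This is only a sketch of what the answer states, not a complete model of any CPU; the enum and function names (uarch, mem_uop, agu_ports) are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

enum uarch   { UARCH_HSW_SKL, UARCH_ICL };   /* Haswell..Skylake, Ice Lake */
enum mem_uop { UOP_LOAD, UOP_STORE_ADDR };

#define PORT(n) (1u << (n))

/* Returns a bitmask of the ports whose AGU can execute the given uop.
 * 'indexed' is true for addressing modes that use an index register. */
static unsigned agu_ports(enum uarch u, enum mem_uop kind, bool indexed) {
    if (kind == UOP_LOAD)
        return PORT(2) | PORT(3);            /* loads: p23 */
    if (u == UARCH_ICL)
        return PORT(7) | PORT(8);            /* dedicated, full-featured store AGUs */
    /* Haswell..Skylake: the port 7 AGU only handles reg+constant. */
    return indexed ? (PORT(2) | PORT(3))
                   : (PORT(2) | PORT(3) | PORT(7));
}
```

Note that port 1 never appears in any of the returned masks: address generation for loads and stores stays off the ALU ports.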

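Unfortunately, it will be many years before HSW/SKL become unimportant: Intel is still selling Skylake-derived microarchitectures, so they will remain a large part of the install base for desktop software.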

Votes: 2

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/63975049