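Is there a better way to do this, other than using SIMD instructions?
What is the best way to handle arrays whose size is not divisible by 8? In the code below, if fewer than 8 bytes are left, they are simply zeroed one by one.
Perhaps a faster method would be to check how many bytes are left and then zero them in larger steps, 2 or 4 bytes at a time?
Would the cost of the checking outweigh the cost of zeroing them one by one?
This is just a test. I'm trying to learn assembly, so any improvements and tips, even small ones, are much appreciated.
Thanks
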
.code
ZeroArray proc
cmp edx, 0
jle Finished ; Check if count is 0
cmp edx, 8
jl SetupLessThan8Bytes ; Check if counter is less than 8
mov r8d, edx ; Storing the original count
shr edx, 3 ; Bit shifts the counter to the right by 3 (equal to dividing by 8), works because 2^3 is equal to 8
mov r9d, edx ; Stores the divided count to be able to check how many single byte zeros the program has to do
MainLoop:
mov qword ptr [rcx], 0 ; Set the next 8 bytes (qword) to 0
add rcx, 8 ; Move pointer along the array by 8 bytes
dec edx ; Decrement the counter
jnz MainLoop ; If counter is not equal to 0 jump to MainLoop
shl r9d, 3 ; Bit shifts the stored divided counter to the left by 3 (equal to multiplying by 8), 2^3 again
sub r8d, r9d ; Subs the counts from eachother, if it equals zero all bytes are zeroed, otherwise r8d equals the amount of bytes left
je Finished
SetFinalBytesLoop:
mov byte ptr [rcx], 0 ; Sets the last byte of the array to 0
inc rcx
dec r8d
jnz SetFinalBytesLoop
Finished:
ret
SetupLessThan8Bytes:
mov r8d, edx ; Mov the value of edx into r8d so the same code can be used in SetFinalBytesLoop
jmp SetFinalBytesLoop
ZeroArray endp
end

Posted on 2017-10-08 18:56:06

cmp edx, 0
jle Finished ; Check if count is 0

Using cmp is certainly not wrong, but the optimal way to check for an inappropriate counter value is the test instruction.

test edx, edx
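jle Finished ; Check if count is 0

Bypassing the work when the counter is zero is fine, but perhaps a negative counter value should be considered an error and handled accordingly?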
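
As a rough illustration of that suggestion, here is a minimal C sketch that separates "nothing to do" from "invalid count". The function name `zero_array_checked` and the status codes are hypothetical and are not part of the original routine. The assembly version would need its own convention for reporting the error, for example a return value in EAX.

```c
/* Hypothetical status codes, not part of the original routine. */
enum { ZERO_OK = 0, ZERO_BAD_COUNT = -1 };

/* C model of a count check that treats a negative count as an error
   instead of silently returning, as the jle in the assembly does. */
int zero_array_checked(unsigned char *dst, int count)
{
    if (count < 0)
        return ZERO_BAD_COUNT;          /* negative count is reported */
    for (int i = 0; i < count; i++)     /* count == 0 falls straight through */
        dst[i] = 0;
    return ZERO_OK;
}
```

A caller can then distinguish an empty array (return value `ZERO_OK`, nothing written) from a bad argument (return value `ZERO_BAD_COUNT`).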
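
Jumping around

cmp edx, 8
jl SetupLessThan8Bytes ; Check if counter is less than 8
mov r8d, edx ; Storing the original count
...
...
SetupLessThan8Bytes:
mov r8d, edx
jmp SetFinalBytesLoop

When the counter in EDX is less than 8, you jump to SetupLessThan8Bytes, where you just make a copy of the counter and then jump again to SetFinalBytesLoop.
If you move the instruction that copies the original counter up so that it precedes the comparison with 8, you avoid writing 3 lines of code (the label, the mov, and the jmp). The program also becomes clearer.
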
mov r8d, edx ; Storing the original count
cmp edx, 8
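jl SetFinalBytesLoop ; Check if counter is less than 8

When you shift the counter in EDX right by 3 to find out how many qwords you have to process, you can look at the zero flag. If ZF is set (meaning there are no qwords at all), you know immediately that the counter is in the range [1,7], so the snippet above becomes:
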
mov r8d, edx ; Storing the original count
shr edx, 3 ; Equal to dividing by 8
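jz SetFinalBytesLoop ; Jump if counter is less than 8

Calculation

mov r9d, edx
...
shl r9d, 3
sub r8d, r9d
je Finished
SetFinalBytesLoop:

Your way of finding the number of remaining bytes is too complicated. It is correct, but unnecessarily involved. All it takes is ANDing the original counter with 7 to extract its lowest 3 bits. That is simpler, shorter, and uses one register less, which will always come in handy in future programs:
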
and r8d, 7
jz Finished
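SetFinalBytesLoop:

Because it carries a 32-bit immediate value, the mov instruction in MainLoop is quite long (7 bytes). You can store the zero in RAX and move that register to memory instead. This also removes the need to write "qword ptr":
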
xor rax, rax ; Equivalent to MOV EAX, 0
MainLoop:
mov [rcx], rax ; Set the next 8 bytes (qword) to 0
add rcx, 8 ; Move pointer along the array by 8 bytes
dec edx ; Decrement the counter
jnz MainLoop ; If counter is not equal to 0 jump to MainLoop

xor rax, rax
test edx, edx
jle Finished ; Check if count is LE 0
mov r8d, edx ; Copy of the original count
shr edx, 3 ; Gives number of qwords
jz SetFinalBytesLoop ; Jump if counter is less than 8
MainLoop:
mov [rcx], rax ; RAX=0 Set the next 8 bytes (qword) to 0
add rcx, 8 ; Step per 8 bytes
dec edx ; Dec the counter
jnz MainLoop
and r8d, 7 ; Remainder from division by 8
jz Finished
SetFinalBytesLoop:
mov [rcx], al ; AL=0 Sets the last bytes of the array to 0
inc rcx ; Step per 1 byte
dec r8d ; Dec counter
jnz SetFinalBytesLoop
Finished:
ret

I moved the xor rax, rax higher up in the code so that SetFinalBytesLoop can benefit from using the register AL instead of an immediate 0.
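
To check the arithmetic behind the listing above, the following C sketch models the count split. It is only a model of the logic, not a replacement for the assembly, and the function names (`qword_count`, `tail_bytes`, `tail_bytes_long`, `zero_array_model`) are made up for this illustration. It shows that `count & 7` gives the same remainder as the longer shift-and-subtract sequence from the question.

```c
#include <stdint.h>
#include <string.h>

/* Number of whole qwords in count; models `shr edx, 3`. */
uint32_t qword_count(uint32_t count) { return count >> 3; }

/* Bytes left after the qwords; models `and r8d, 7`. */
uint32_t tail_bytes(uint32_t count) { return count & 7u; }

/* The longer computation from the question: count - ((count >> 3) << 3). */
uint32_t tail_bytes_long(uint32_t count) { return count - ((count >> 3) << 3); }

/* C model of the listing: qword stores first, then single bytes. */
void zero_array_model(unsigned char *dst, int32_t count)
{
    if (count <= 0)                          /* test edx, edx / jle Finished */
        return;
    const uint64_t zero = 0;                 /* xor rax, rax */
    uint32_t q = qword_count((uint32_t)count);
    uint32_t t = tail_bytes((uint32_t)count);
    for (; q != 0; q--) {                    /* MainLoop */
        memcpy(dst, &zero, sizeof zero);     /* mov [rcx], rax */
        dst += 8;                            /* add rcx, 8 */
    }
    for (; t != 0; t--)                      /* SetFinalBytesLoop */
        *dst++ = 0;                          /* mov [rcx], al */
}
```

For a count of 13, for example, the model performs one qword store followed by five single-byte stores.
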
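The most important optimization you can apply to the program is to make sure that the qword values you write are aligned on a qword boundary, that is, at a memory address divisible by 8.
The extra alignment loop iterates at most 7 times.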
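
The bound of 7 iterations can be checked with a small C model. This is a sketch only: `head_bytes` is a hypothetical helper that takes the address as an integer, and it computes how many single bytes the alignment loop clears before the pointer reaches an 8-byte boundary or the count runs out.

```c
#include <stdint.h>

/* Bytes cleared one at a time before the address is qword aligned,
   capped at count. Models AlignLoop together with `test rcx, 7`. */
uint32_t head_bytes(uintptr_t addr, uint32_t count)
{
    uint32_t misalign = (uint32_t)(addr & 7u);  /* low 3 address bits */
    uint32_t head = (8u - misalign) & 7u;       /* 0 when already aligned */
    return head < count ? head : count;         /* loop also ends when count hits 0 */
}
```

An address ending in 1 needs 7 single-byte stores, an address ending in 7 needs 1, and an aligned address needs none. In the routine itself, the alignment step looks like this:
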
xor rax, rax
test edx, edx
jle Finished ; Check if count is LE 0
jmp TestAligned
AlignLoop:
mov [rcx], al
inc rcx
dec edx
jz Finished
TestAligned:
test rcx, 7 ; Is this a qword aligned address?
jnz AlignLoop ; Not yet!
mov r8d, edx ; Copy of the (reduced) original count
shr edx, 3 ; Gives number of qwords
jz SetFinalBytesLoop ; Jump if counter is less than 8
MainLoop:
mov [rcx], rax ; RAX=0 Set the next 8 bytes (qword) to 0
add rcx, 8 ; Step per 8 bytes
dec edx ; Dec the counter
jnz MainLoop
and r8d, 7 ; Remainder from division by 8
jz Finished
SetFinalBytesLoop:
mov [rcx], al ; AL=0 Sets the last bytes of the array to 0
inc rcx ; Step per 1 byte
dec r8d ; Dec counter
jnz SetFinalBytesLoop
Finished:
ret

https://codereview.stackexchange.com/questions/175484