我在努力确保gcc把我的循环矢量化。事实证明,通过使用-march=znver1 (或-march=native),gcc跳过了一些循环,尽管它们可以被向量化。这一切为什么要发生?
在这段代码中,第二个循环将每个元素乘以一个标量,而不是矢量法:
#include <stdio.h>
#include <inttypes.h>
int main() {
const size_t N = 1000;
uint64_t arr[N];
for (size_t i = 0; i < N; ++i)
arr[i] = 1;
for (size_t i = 0; i < N; ++i)
arr[i] *= 5;
for (size_t i = 0; i < N; ++i)
printf("%lu\n", arr[i]); // use the array so that it is not optimized away
}gcc -O3 -fopt-info-vec-all -mavx2 main.c
main.cpp:13:26: missed: couldn't vectorize loop
main.cpp:14:15: missed: statement clobbers memory: printf ("%lu\n", _3);
main.cpp:10:26: optimized: loop vectorized using 32 byte vectors
main.cpp:7:26: optimized: loop vectorized using 32 byte vectors
main.cpp:4:5: note: vectorized 2 loops in function.
main.cpp:14:15: missed: statement clobbers memory: printf ("%lu\n", _3);
main.cpp:15:1: note: ***** Analysis failed with vector mode V4DI
main.cpp:15:1: note: ***** Skipping vector mode V32QI, which would repeat the analysis for V4DIgcc -O3 -fopt-info-vec-all -march=znver1 main.c
main.cpp:13:26: missed: couldn't vectorize loop
main.cpp:14:15: missed: statement clobbers memory: printf ("%lu\n", _3);
main.cpp:10:26: missed: couldn't vectorize loop
main.cpp:10:26: missed: not vectorized: unsupported data-type
main.cpp:7:26: optimized: loop vectorized using 16 byte vectors
main.cpp:4:5: note: vectorized 1 loops in function.
main.cpp:14:15: missed: statement clobbers memory: printf ("%lu\n", _3);
main.cpp:15:1: note: ***** Analysis failed with vector mode V2DI
main.cpp:15:1: note: ***** Skipping vector mode V16QI, which would repeat the analysis for V2DI-march=znver1包括-mavx2,因此我认为gcc出于某种原因选择不将其矢量化:
~ $ gcc -march=znver1 -Q --help=target
The following options are target specific:
-m128bit-long-double [enabled]
-m16 [disabled]
-m32 [disabled]
-m3dnow [disabled]
-m3dnowa [disabled]
-m64 [enabled]
-m80387 [enabled]
-m8bit-idiv [disabled]
-m96bit-long-double [disabled]
-mabi= sysv
-mabm [enabled]
-maccumulate-outgoing-args [disabled]
-maddress-mode= long
-madx [enabled]
-maes [enabled]
-malign-data= compat
-malign-double [disabled]
-malign-functions= 0
-malign-jumps= 0
-malign-loops= 0
-malign-stringops [enabled]
-mamx-bf16 [disabled]
-mamx-int8 [disabled]
-mamx-tile [disabled]
-mandroid [disabled]
-march= znver1
-masm= att
-mavx [enabled]
-mavx2 [enabled]
-mavx256-split-unaligned-load [disabled]
-mavx256-split-unaligned-store [enabled]
-mavx5124fmaps [disabled]
-mavx5124vnniw [disabled]
-mavx512bf16 [disabled]
-mavx512bitalg [disabled]
-mavx512bw [disabled]
-mavx512cd [disabled]
-mavx512dq [disabled]
-mavx512er [disabled]
-mavx512f [disabled]
-mavx512ifma [disabled]
-mavx512pf [disabled]
-mavx512vbmi [disabled]
-mavx512vbmi2 [disabled]
-mavx512vl [disabled]
-mavx512vnni [disabled]
-mavx512vp2intersect [disabled]
-mavx512vpopcntdq [disabled]
-mavxvnni [disabled]
-mbionic [disabled]
-mbmi [enabled]
-mbmi2 [enabled]
-mbranch-cost=<0,5> 3
-mcall-ms2sysv-xlogues [disabled]
-mcet-switch [disabled]
-mcld [disabled]
-mcldemote [disabled]
-mclflushopt [enabled]
-mclwb [disabled]
-mclzero [enabled]
-mcmodel= [default]
-mcpu=
-mcrc32 [disabled]
-mcx16 [enabled]
-mdispatch-scheduler [disabled]
-mdump-tune-features [disabled]
-menqcmd [disabled]
-mf16c [enabled]
-mfancy-math-387 [enabled]
-mfentry [disabled]
-mfentry-name=
-mfentry-section=
-mfma [enabled]
-mfma4 [disabled]
-mforce-drap [disabled]
-mforce-indirect-call [disabled]
-mfp-ret-in-387 [enabled]
-mfpmath= sse
-mfsgsbase [enabled]
-mfunction-return= keep
-mfused-madd -ffp-contract=fast
-mfxsr [enabled]
-mgeneral-regs-only [disabled]
-mgfni [disabled]
-mglibc [enabled]
-mhard-float [enabled]
-mhle [disabled]
-mhreset [disabled]
-miamcu [disabled]
-mieee-fp [enabled]
-mincoming-stack-boundary= 0
-mindirect-branch-register [disabled]
-mindirect-branch= keep
-minline-all-stringops [disabled]
-minline-stringops-dynamically [disabled]
-minstrument-return= none
-mintel-syntax -masm=intel
-mkl [disabled]
-mlarge-data-threshold=<number> 65536
-mlong-double-128 [disabled]
-mlong-double-64 [disabled]
-mlong-double-80 [enabled]
-mlwp [disabled]
-mlzcnt [enabled]
-mmanual-endbr [disabled]
-mmemcpy-strategy=
-mmemset-strategy=
-mmitigate-rop [disabled]
-mmmx [enabled]
-mmovbe [enabled]
-mmovdir64b [disabled]
-mmovdiri [disabled]
-mmpx [disabled]
-mms-bitfields [disabled]
-mmusl [disabled]
-mmwaitx [enabled]
-mneeded [disabled]
-mno-align-stringops [disabled]
-mno-default [disabled]
-mno-fancy-math-387 [disabled]
-mno-push-args [disabled]
-mno-red-zone [disabled]
-mno-sse4 [disabled]
-mnop-mcount [disabled]
-momit-leaf-frame-pointer [disabled]
-mpc32 [disabled]
-mpc64 [disabled]
-mpc80 [disabled]
-mpclmul [enabled]
-mpcommit [disabled]
-mpconfig [disabled]
-mpku [disabled]
-mpopcnt [enabled]
-mprefer-avx128 -mprefer-vector-width=128
-mprefer-vector-width= 128
-mpreferred-stack-boundary= 0
-mprefetchwt1 [disabled]
-mprfchw [enabled]
-mptwrite [disabled]
-mpush-args [enabled]
-mrdpid [disabled]
-mrdrnd [enabled]
-mrdseed [enabled]
-mrecip [disabled]
-mrecip=
-mrecord-mcount [disabled]
-mrecord-return [disabled]
-mred-zone [enabled]
-mregparm= 6
-mrtd [disabled]
-mrtm [disabled]
-msahf [enabled]
-mserialize [disabled]
-msgx [disabled]
-msha [enabled]
-mshstk [disabled]
-mskip-rax-setup [disabled]
-msoft-float [disabled]
-msse [enabled]
-msse2 [enabled]
-msse2avx [disabled]
-msse3 [enabled]
-msse4 [enabled]
-msse4.1 [enabled]
-msse4.2 [enabled]
-msse4a [enabled]
-msse5 -mavx
-msseregparm [disabled]
-mssse3 [enabled]
-mstack-arg-probe [disabled]
-mstack-protector-guard-offset=
-mstack-protector-guard-reg=
-mstack-protector-guard-symbol=
-mstack-protector-guard= tls
-mstackrealign [disabled]
-mstringop-strategy= [default]
-mstv [enabled]
-mtbm [disabled]
-mtls-dialect= gnu
-mtls-direct-seg-refs [enabled]
-mtsxldtrk [disabled]
-mtune-ctrl=
-mtune= znver1
-muclibc [disabled]
-muintr [disabled]
-mvaes [disabled]
-mveclibabi= [default]
-mvect8-ret-in-mem [disabled]
-mvpclmulqdq [disabled]
-mvzeroupper [enabled]
-mwaitpkg [disabled]
-mwbnoinvd [disabled]
-mwidekl [disabled]
-mx32 [disabled]
-mxop [disabled]
-mxsave [enabled]
-mxsavec [enabled]
-mxsaveopt [enabled]
-mxsaves [enabled]
Known assembler dialects (for use with the -masm= option):
att intel
Known ABIs (for use with the -mabi= option):
ms sysv
Known code models (for use with the -mcmodel= option):
32 kernel large medium small
Valid arguments to -mfpmath=:
387 387+sse 387,sse both sse sse+387 sse,387
Known indirect branch choices (for use with the -mindirect-branch=/-mfunction-return= options):
keep thunk thunk-extern thunk-inline
Known choices for return instrumentation with -minstrument-return=:
call none nop5
Known data alignment choices (for use with the -malign-data= option):
abi cacheline compat
Known vectorization library ABIs (for use with the -mveclibabi= option):
acml svml
Known address mode (for use with the -maddress-mode= option):
long short
Known preferred register vector length (to use with the -mprefer-vector-width= option):
128 256 512 none
Known stack protector guard (for use with the -mstack-protector-guard= option):
global tls
Valid arguments to -mstringop-strategy=:
byte_loop libcall loop rep_4byte rep_8byte rep_byte unrolled_loop vector_loop
Known TLS dialects (for use with the -mtls-dialect= option):
gnu gnu2
Known valid arguments for -march= option:
i386 i486 i586 pentium lakemont pentium-mmx winchip-c6 winchip2 c3 samuel-2 c3-2 nehemiah c7 esther i686 pentiumpro pentium2 pentium3 pentium3m pentium-m pentium4 pentium4m prescott nocona core2 nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i haswell core-avx2 broadwell skylake skylake-avx512 cannonlake icelake-client rocketlake icelake-server cascadelake tigerlake cooperlake sapphirerapids alderlake bonnell atom silvermont slm goldmont goldmont-plus tremont knl knm intel geode k6 k6-2 k6-3 athlon athlon-tbird athlon-4 athlon-xp athlon-mp x86-64 x86-64-v2 x86-64-v3 x86-64-v4 eden-x2 nano nano-1000 nano-2000 nano-3000 nano-x2 eden-x4 nano-x4 k8 k8-sse3 opteron opteron-sse3 athlon64 athlon64-sse3 athlon-fx amdfam10 barcelona bdver1 bdver2 bdver3 bdver4 znver1 znver2 znver3 btver1 btver2 generic native
Known valid arguments for -mtune= option:
generic i386 i486 pentium lakemont pentiumpro pentium4 nocona core2 nehalem sandybridge haswell bonnell silvermont goldmont goldmont-plus tremont knl knm skylake skylake-avx512 cannonlake icelake-client icelake-server cascadelake tigerlake cooperlake sapphirerapids alderlake rocketlake intel geode k6 athlon k8 amdfam10 bdver1 bdver2 bdver3 bdver4 btver1 btver2 znver1 znver2 znver3我也尝试了clang,在这两种情况下,循环都是由32字节向量( remark: vectorized loop (vectorization width: 4, interleaved count: 4) )来向量化的。
我用gcc 11.2.0
编辑:根据彼得·科德斯的要求,我意识到我实际上是在用4乘一段时间进行基准测试。
Makefile:
all:
gcc -O3 -mavx2 main.c -o 3
gcc -O3 -march=znver2 main.c -o 32
gcc -O3 -march=znver2 main.c -mprefer-vector-width=128 -o 32128
gcc -O3 -march=znver1 main.c -o 31
gcc -O2 -mavx2 main.c -o 2
gcc -O2 -march=znver2 main.c -o 22
gcc -O2 -march=znver2 main.c -mprefer-vector-width=128 -o 22128
gcc -O2 -march=znver1 main.c -o 21
hyperfine -r5 ./3 ./32 ./32128 ./31 ./2 ./22 ./22128 ./21
clean:
rm ./3 ./32 ./32128 ./31 ./2 ./22 ./22128 ./21代码:
#include <stdio.h>
#include <inttypes.h>
#include <stdlib.h>
#include <time.h>
int main() {
const size_t N = 500;
uint64_t arr[N];
for (size_t i = 0; i < N; ++i)
arr[i] = 1;
for (int j = 0; j < 20000000; ++j)
for (size_t i = 0; i < N; ++i)
arr[i] *= 4;
srand(time(0));
printf("%lu\n", arr[rand() % N]); // use the array so that it is not optimized away
}N = 500, arr[i] *= 4
Benchmark 1: ./3
Time (mean ± σ): 1.780 s ± 0.011 s [User: 1.778 s, System: 0.000 s]
Range (min … max): 1.763 s … 1.791 s 5 runs
Benchmark 2: ./32
Time (mean ± σ): 1.785 s ± 0.016 s [User: 1.783 s, System: 0.000 s]
Range (min … max): 1.773 s … 1.810 s 5 runs
Benchmark 3: ./32128
Time (mean ± σ): 1.740 s ± 0.026 s [User: 1.737 s, System: 0.000 s]
Range (min … max): 1.724 s … 1.785 s 5 runs
Benchmark 4: ./31
Time (mean ± σ): 1.757 s ± 0.022 s [User: 1.754 s, System: 0.000 s]
Range (min … max): 1.727 s … 1.785 s 5 runs
Benchmark 5: ./2
Time (mean ± σ): 3.467 s ± 0.031 s [User: 3.462 s, System: 0.000 s]
Range (min … max): 3.443 s … 3.519 s 5 runs
Benchmark 6: ./22
Time (mean ± σ): 3.475 s ± 0.028 s [User: 3.469 s, System: 0.001 s]
Range (min … max): 3.447 s … 3.512 s 5 runs
Benchmark 7: ./22128
Time (mean ± σ): 3.464 s ± 0.034 s [User: 3.459 s, System: 0.001 s]
Range (min … max): 3.431 s … 3.509 s 5 runs
Benchmark 8: ./21
Time (mean ± σ): 3.465 s ± 0.013 s [User: 3.460 s, System: 0.001 s]
Range (min … max): 3.443 s … 3.475 s 5 runsN = 500, arr[i] *= 5
Benchmark 1: ./3
Time (mean ± σ): 1.789 s ± 0.004 s [User: 1.786 s, System: 0.001 s]
Range (min … max): 1.783 s … 1.793 s 5 runs
Benchmark 2: ./32
Time (mean ± σ): 1.772 s ± 0.017 s [User: 1.769 s, System: 0.000 s]
Range (min … max): 1.755 s … 1.800 s 5 runs
Benchmark 3: ./32128
Time (mean ± σ): 2.911 s ± 0.023 s [User: 2.907 s, System: 0.001 s]
Range (min … max): 2.880 s … 2.943 s 5 runs
Benchmark 4: ./31
Time (mean ± σ): 2.924 s ± 0.013 s [User: 2.921 s, System: 0.000 s]
Range (min … max): 2.906 s … 2.934 s 5 runs
Benchmark 5: ./2
Time (mean ± σ): 3.850 s ± 0.029 s [User: 3.846 s, System: 0.000 s]
Range (min … max): 3.823 s … 3.896 s 5 runs
Benchmark 6: ./22
Time (mean ± σ): 3.816 s ± 0.036 s [User: 3.812 s, System: 0.000 s]
Range (min … max): 3.777 s … 3.855 s 5 runs
Benchmark 7: ./22128
Time (mean ± σ): 3.813 s ± 0.026 s [User: 3.809 s, System: 0.000 s]
Range (min … max): 3.780 s … 3.834 s 5 runs
Benchmark 8: ./21
Time (mean ± σ): 3.783 s ± 0.010 s [User: 3.779 s, System: 0.000 s]
Range (min … max): 3.773 s … 3.798 s 5 runsN = 512, arr[i] *= 4
Benchmark 1: ./3
Time (mean ± σ): 1.849 s ± 0.015 s [User: 1.847 s, System: 0.000 s]
Range (min … max): 1.831 s … 1.873 s 5 runs
Benchmark 2: ./32
Time (mean ± σ): 1.846 s ± 0.013 s [User: 1.844 s, System: 0.001 s]
Range (min … max): 1.832 s … 1.860 s 5 runs
Benchmark 3: ./32128
Time (mean ± σ): 1.756 s ± 0.012 s [User: 1.754 s, System: 0.000 s]
Range (min … max): 1.744 s … 1.771 s 5 runs
Benchmark 4: ./31
Time (mean ± σ): 1.788 s ± 0.012 s [User: 1.785 s, System: 0.001 s]
Range (min … max): 1.774 s … 1.801 s 5 runs
Benchmark 5: ./2
Time (mean ± σ): 3.476 s ± 0.015 s [User: 3.472 s, System: 0.001 s]
Range (min … max): 3.458 s … 3.494 s 5 runs
Benchmark 6: ./22
Time (mean ± σ): 3.449 s ± 0.002 s [User: 3.446 s, System: 0.000 s]
Range (min … max): 3.446 s … 3.452 s 5 runs
Benchmark 7: ./22128
Time (mean ± σ): 3.456 s ± 0.007 s [User: 3.453 s, System: 0.000 s]
Range (min … max): 3.446 s … 3.462 s 5 runs
Benchmark 8: ./21
Time (mean ± σ): 3.547 s ± 0.044 s [User: 3.542 s, System: 0.001 s]
Range (min … max): 3.482 s … 3.600 s 5 runsN = 512, arr[i] *= 5
Benchmark 1: ./3
Time (mean ± σ): 1.847 s ± 0.013 s [User: 1.845 s, System: 0.000 s]
Range (min … max): 1.836 s … 1.863 s 5 runs
Benchmark 2: ./32
Time (mean ± σ): 1.830 s ± 0.007 s [User: 1.827 s, System: 0.001 s]
Range (min … max): 1.820 s … 1.837 s 5 runs
Benchmark 3: ./32128
Time (mean ± σ): 2.983 s ± 0.017 s [User: 2.980 s, System: 0.000 s]
Range (min … max): 2.966 s … 3.012 s 5 runs
Benchmark 4: ./31
Time (mean ± σ): 3.026 s ± 0.039 s [User: 3.021 s, System: 0.001 s]
Range (min … max): 2.989 s … 3.089 s 5 runs
Benchmark 5: ./2
Time (mean ± σ): 4.000 s ± 0.021 s [User: 3.994 s, System: 0.001 s]
Range (min … max): 3.982 s … 4.035 s 5 runs
Benchmark 6: ./22
Time (mean ± σ): 3.940 s ± 0.041 s [User: 3.934 s, System: 0.001 s]
Range (min … max): 3.890 s … 3.981 s 5 runs
Benchmark 7: ./22128
Time (mean ± σ): 3.928 s ± 0.032 s [User: 3.922 s, System: 0.001 s]
Range (min … max): 3.898 s … 3.979 s 5 runs
Benchmark 8: ./21
Time (mean ± σ): 3.908 s ± 0.029 s [User: 3.904 s, System: 0.000 s]
Range (min … max): 3.879 s … 3.954 s 5 runs我认为-O2 -march=znver1和-O3 -march=znver1一样快的运行是我在文件命名上的一个错误,当时我还没有创建makefile,我使用的是shell的历史记录。
发布于 2022-04-09 21:42:45
默认的-mtune=generic有-mprefer-vector-width=256,-mavx2不会改变这一点。
znver1意味着-mprefer-vector-width=128,因为这是HW的所有本机宽度。使用32字节的YMM向量的指令至少解码到2 uops,如果是过车道混洗,则更多.对于像这样简单的垂直SIMD,32字节向量就可以了;管道可以有效地处理2-uop指令。(我认为只有6 uop宽,但只有5条指令宽,所以仅使用1 uop指令就无法获得最大的前端吞吐量)。但是,当矢量化需要洗牌时,例如在不同单元宽度的数组中,GCC码元可能会变得更混乱(256位或更宽)。
而vmovdqa ymm0, ymm1 mov消除只适用于低128位的Zen1。此外,通常使用256位向量将意味着一个人应该使用vzeroupper之后,以避免在其他CPU(但不是Zen1)的性能问题。
我不知道Zen1如何处理错对齐的32字节加载/存储,其中每个16字节的一半是对齐的,但在单独的缓存行中。如果表现良好,GCC可能会考虑将znver1 -mprefer-vector-width提高到256个。但是,如果不知道大小是向量宽度的倍数,则更宽的向量意味着更多的清理代码。
理想情况下,GCC将能够发现这样简单的情况,并在那里使用256位向量。(纯垂直,不混合元素宽度,常数大小为32字节的倍数)。至少在CPU上,这很好: znver1,但不是bdver2,例如,由于CPU设计错误,256位存储总是很慢。
您可以通过使用vmovdqu [rdx], xmm0将第一个循环memset样循环矢量化的方式看到这种选择的结果。https://godbolt.org/z/E5Tq7Gfzc
因此,GCC决定只使用128个位向量,它只能容纳两个uint64_t元素,因此它决定不值得使用vpsllq / vpaddd将qword *5实现为(v<<2) + v,而不是在一个LEA指令中使用整数。
在这种情况下,几乎可以肯定是错误的,因为它仍然需要为每个元素或一对元素单独加载和存储。(并且循环开销,因为GCC的默认是不展开,除非用PGO,-fprofile-use。SIMD类似于循环展开,特别是在CPU上,它将256位向量作为两个单独的uop处理。)
我不太清楚GCC所说的“不向量化:不支持的数据类型”到底是什么意思。x86在AVX-512之前还没有SIMD uint64_t乘法指令,所以GCC可能会根据the general case给它分配一个成本,即必须用多个32x32 => 64位pmuludq指令和一堆洗牌来模仿它。只有在它克服了这个驼峰之后,它才意识到,对于像5这样只有2位位的常数来说,它实际上是相当便宜的?
这可以解释GCC在这里的决策过程,但我不确定这是正确的解释。然而,这类因素是在像编译器这样复杂的机器中发生的。熟练的人可以很容易地做出更聪明的选择,但是编译器只是做一系列的优化传递,并不总是同时考虑全局和所有细节。
-mprefer-vector-width=256没有帮助:
不将uint64_t *= 5矢量化似乎是GCC9回归。
(问题中的基准测试证实了实际的Zen1 CPU得到了将近2倍的加速比,就像在6 uops中执行2倍的uint64,而在5 uops中使用标量时的1倍。或4倍的uint64_t在10个uop中有256位向量,包括两个128位存储,这将成为吞吐量瓶颈和前端。)
即使使用-march=znver1 -O3 -mprefer-vector-width=256,我们也不会使用GCC9、10或11或当前主干对*= 5循环进行矢量化。就像你说的,我们使用-march=znver2。https://godbolt.org/z/dMTh7Wxcq
uint32_t的这些选项(甚至将矢量宽度保留在128位)中的确实得到了矢量化。标量将花费每个向量uop (非指令)4个操作,而不管Zen1上的128位或256位向量化,因此这并不能告诉我们是*=使成本模型决定不向量化,还是仅仅是每128位内部uop中的2对4元素。
使用uint64_t,更改为arr[i] += arr[i]<<2;仍然没有矢量化,但是arr[i] <<= 1;是矢量化的。(https://godbolt.org/z/6PMn93Y5G)。即使是arr[i] <<= 2;和arr[i] += 123在同一个循环中也是矢量化的,按照GCC认为不值得用它来向量化*= 5的指令,只是不同的操作数,常数而不是原来的向量。(标量仍然可以使用一个LEA)。显然,成本模型并没有最终的x86 asm机器指令,但我不知道为什么arr[i] += arr[i]会被认为比arr[i] <<= 1;更昂贵--这是完全相同的事情。
GCC8确实将循环矢量化,即使使用128位矢量宽度: https://godbolt.org/z/5o6qjc7f6。
# GCC8.5 -march=znver1 -O3 (-mprefer-vector-width=128)
.L12: # do{
vmovups xmm1, XMMWORD PTR [rsi] # 16-byte load
add rsi, 16 # ptr += 2 elements
vpsllq xmm0, xmm1, 2 # v << 2
vpaddq xmm0, xmm0, xmm1 # tmp += v
vmovups XMMWORD PTR [rsi-16], xmm0 # store
cmp rax, rsi
jne .L12 # } while(p != endp)在-march=znver1 -mprefer-vector-width=256中,将存储作为vmovups xmm / vextracti128的两个16字节的一半,Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd? znver1就意味着-mavx256-split-unaligned-store (当GCC不确定它是否对齐时,这会影响到每一家商店。因此,即使数据是对齐的,也需要额外的指令。
不过,znver1并不意味着-mavx256-split-unaligned-load,所以GCC愿意将负载作为内存源操作数折叠到ALU操作中,在代码中这是很有用的。
https://stackoverflow.com/questions/71811588
复制相似问题