RTX 5090: unless the build is limited with maxrregcount=102, or numThreadsPerBlock is changed from 512 to 384, the program aborts with a core dump

Compiling with -G produces a misleading message: CUDA Exception: Lane User Stack Overflow

Set the environment variables:

ulimit -c unlimited
export CUDA_COREDUMP_SHOW_PROGRESS=1
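export CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1 # generate a core dump when a GPU exception occurs
export CUDA_COREDUMP_FILE="./cuda_core.%p"  # %p is replaced with the process ID
export CUDA_LAUNCH_BLOCKING=1 # make kernel launches synchronous so errors are reported at the point where they occur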

Error:

Running `target/release/zkm-gpu-perf --program hello-world --stage wrap`
fatal runtime error: Rust cannot catch foreign exceptions, aborting
Aborted (core dumped)

Add to Cargo.toml:

[profile.release]
panic = "abort"

Error after rerunning:

     Running `target/release/zkm-gpu-perf --program hello-world --stage wrap`
terminate called after throwing an instance of 'sppark_error'
  what():  cudaGetLastError()@sppark/ntt/ntt.cuh:91 failed: "too many resources requested for launch"
Aborted (core dumped)

https://cuda-programming.blogspot.com/2013/01/handling-cuda-error-messages.html

Too Many Resources Requested for Launch - This error means that the number of
registers available on the multiprocessor is being exceeded. Reduce the number
of threads per block to solve the problem.

Adding --ptxas-options=-v at compile time shows each kernel's resource usage.

On the 5090 (sm_120) the kernel uses 168 registers:

warning: zkm-gpu-core@0.1.0: ptxas info    : Compiling entry function '_Z21bit_rev_permutation_zILj64EEvP6kb31_tPKS0_j' for 'sm_120'
warning: zkm-gpu-core@0.1.0: ptxas info    : Function properties for _Z21bit_rev_permutation_zILj64EEvP6kb31_tPKS0_j
warning: zkm-gpu-core@0.1.0:     8 bytes stack frame, 8 bytes spill stores, 8 bytes spill loads
warning: zkm-gpu-core@0.1.0: ptxas info    : Used 168 registers, used 1 barriers, 8 bytes cumulative stack size
warning: zkm-gpu-core@0.1.0: ptxas info    : Compile time = 76.787 ms

On the 4090 (sm_89) the same kernel uses 126 registers:

warning: zkm-gpu-core@0.1.0: ptxas info    : Compiling entry function '_Z21bit_rev_permutation_zILj64EEvP6kb31_tPKS0_j' for 'sm_89'
warning: zkm-gpu-core@0.1.0: ptxas info    : Function properties for _Z21bit_rev_permutation_zILj64EEvP6kb31_tPKS0_j
warning: zkm-gpu-core@0.1.0:     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
warning: zkm-gpu-core@0.1.0: ptxas info    : Used 126 registers, used 1 barriers, 372 bytes cmem[0]
warning: zkm-gpu-core@0.1.0: ptxas info    : Compile time = 40.913 ms

So the kernel compiled for the 5090 does use more registers per thread than the one compiled for the 4090.
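
The register counts in the two ptxas logs above can be pulled out mechanically when comparing builds. This is a minimal sketch in C++, assuming the log lines keep the "Used N registers" wording shown above. The function name parse_used_registers is hypothetical and not part of nvcc, ptxas, or sppark.

```cpp
#include <regex>
#include <string>

// Returns N from a ptxas line containing "Used N registers",
// or -1 if the line has no such field.
// Hypothetical helper for illustration only.
int parse_used_registers(const std::string& line) {
    static const std::regex pattern(R"(Used\s+(\d+)\s+registers)");
    std::smatch match;
    if (std::regex_search(line, match, pattern)) {
        return std::stoi(match[1].str());
    }
    return -1;
}
```

Feeding it each line of the cargo build output and keeping the non-negative results gives one register count per compiled kernel.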

But the total number of registers per SM is the same on both cards, 65536, and so is the maximum number of threads per SM, 1536:

Device Name: NVIDIA GeForce RTX 5090
Compute Capability: 12.0
Total Global Memory: 31.36 GB
Max Threads per Block: 1024
Multiprocessor Count: 170
regsPerBlock: 65536
regsPerMultiprocessor: 65536
maxThreadsPerMultiProcessor: 1536
sharedMemPerBlock: 49152
clockRate: 2407000
memoryClockRate: 14001000
memoryBusWidth: 512
l2CacheSize: 100663296
concurrentKernels: 1
computeMode: 0
Device Name: NVIDIA GeForce RTX 4090
Compute Capability: 8.9
Total Global Memory: 23.53 GB
Max Threads per Block: 1024
Multiprocessor Count: 128
regsPerBlock: 65536
regsPerMultiprocessor: 65536
maxThreadsPerMultiProcessor: 1536
sharedMemPerBlock: 49152
clockRate: 2520000
memoryClockRate: 10501000
memoryBusWidth: 384
l2CacheSize: 75497472
concurrentKernels: 1
computeMode: 0

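As a result, register resources are exhausted more easily on the 5090. With 512 threads per block, 168 registers per thread needs 168 × 512 = 86016 registers, which exceeds regsPerBlock = 65536, whereas 126 × 512 = 64512 fits.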

Fix: limit registers with maxrregcount=102, or change numThreadsPerBlock from 512 to 384 (168 × 384 = 64512, which fits within 65536)
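
The register arithmetic behind both workarounds can be checked with a few lines of host-side C++. This is a rough sketch, assuming the only constraint is registers per thread × threads per block ≤ regsPerBlock (65536 on both cards, per the device dumps above). It ignores the hardware's register allocation granularity, and the function names are hypothetical.

```cpp
#include <cstdint>

// Limits taken from the device dumps above (same for the 5090 and 4090).
constexpr std::uint32_t kRegsPerBlock = 65536;
constexpr std::uint32_t kWarpSize = 32;

// True if a block of `threads` threads, each using `regs_per_thread`
// registers, stays within the per-block register limit.
// Assumption: allocation granularity is ignored.
constexpr bool fits(std::uint32_t regs_per_thread, std::uint32_t threads) {
    return static_cast<std::uint64_t>(regs_per_thread) * threads <= kRegsPerBlock;
}

// Largest block size (rounded down to a multiple of the warp size)
// that fits for a given register count. regs_per_thread must be nonzero.
constexpr std::uint32_t max_threads(std::uint32_t regs_per_thread) {
    return (kRegsPerBlock / regs_per_thread) / kWarpSize * kWarpSize;
}

// Largest per-thread register count that fits for a given block size.
// threads must be nonzero.
constexpr std::uint32_t max_regs(std::uint32_t threads) {
    return kRegsPerBlock / threads;
}
```

Under these assumptions, fits(168, 512) is false while fits(126, 512) and fits(168, 384) are both true, max_threads(168) is 384, and max_regs(512) is 128, so maxrregcount=102 is within the limit for 512 threads.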