
Lately I've been using AWS as a training server, specifically a g5.48xlarge instance.

 

But training on a 4x A5000 server (on-premises) ran about 1.5-2x faster than on the 8x A10G server (AWS). When I had previously asked ChatGPT to compare the specs, the CUDA core counts were similar and the memory bandwidth was in the same ballpark, so I expected 8x A10G to come out ahead. Since I got the opposite result, I asked ChatGPT why.

 

In short, the causes seem to be that the A10G's FP16 throughput is about half the A5000's, and that its memory bandwidth is also lower. I don't know whether the A5000 server has NVLink bridges installed, but if it does, that would also have helped during DDP training. Still... even at half the FP16 throughput, eight of them should at least match four, shouldn't they? Then again, differences in CPU, RAM, SSD, and so on can also contribute, so I'll let it go.


Below is the answer.

 

In short, each RTX A5000 in your 4-GPU box can simply push more math per second and move data faster than each A10G in the 8-GPU box, and once you add the communication overhead that comes with doubling the GPU count (all-reduce, parameter sync, PCIe/NVLink traffic, kernel launches, CPU scheduling), the “slower-per-GPU but more GPUs” setup can end up behind the “fewer but faster” one. Key culprits are the A5000’s ~1.8× higher FP16/Tensor throughput, ~28% more memory bandwidth, and 50% higher power headroom, plus the scaling penalty of eight workers instead of four.

1 · Raw-spec side-by-side

| Spec (Ampere) | A10G | RTX A5000 | Why it matters |
| --- | --- | --- | --- |
| Launch market | Cloud / virtual GPU | Workstation / compute | Different clock-TDP trade-offs |
| CUDA cores | 9,216 | 8,192 | Core count alone ≠ speed; clock & TDP matter |
| Boost clock | ~1,530 MHz* (150 W limit) | up to 1,695 MHz (230 W) | Higher clocks drive FLOPS |
| FP16 Tensor | 125 TFLOPS | 222 TFLOPS | ~1.8× per-GPU math throughput |
| Memory size | 24 GB GDDR6 | 24 GB GDDR6 | Same capacity, but… |
| Memory bandwidth | 600 GB/s | 768 GB/s | +28% feed-and-fetch speed |
| Max board power | 150 W (single-slot) | 230 W (dual-slot) | Extra 80 W keeps clocks high under load |
| NVLink | None | 2-way bridge (112 GB/s) | Reduces inter-GPU traffic when enabled |
| Form factor | Single-slot, air-starved | Dual-slot, beefy cooler | Cooling affects sustained clocks |

*NVIDIA does not publish an official boost clock for the A10G; monitoring tools typically report ~1.5 GHz under load at the 150 W limit.

2 · Why the A5000 box trains faster

2.1 Per-GPU arithmetic & bandwidth

  • Tensor throughput: 222 TFLOPS vs 125 TFLOPS means the A5000 crunches mixed-precision GEMMs ~78% faster per chip.
  • Memory bandwidth: 768 GB/s vs 600 GB/s keeps those tensor units fed and reduces time spent in memcpyAsync/all_reduce operations.
  • Higher TDP: 230 W lets the GA102 die stay near its peak clock for long training runs; the 150 W A10G often drops to ~1.3-1.4 GHz once junction temperatures rise.
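As a sanity check, the per-chip gaps quoted above work out to roughly 1.8× on FP16 math and 1.28× on bandwidth. A quick sketch using the vendor spec numbers from the table (spec-sheet figures, not measurements from either box):

```python
# Per-GPU spec ratios, using the vendor numbers quoted above
# (spec-sheet figures, not measurements from either server).
a10g_fp16_tflops = 125.0   # A10G FP16 tensor throughput
a5000_fp16_tflops = 222.0  # RTX A5000 FP16 tensor throughput
a10g_bw_gbs = 600.0        # A10G memory bandwidth, GB/s
a5000_bw_gbs = 768.0       # RTX A5000 memory bandwidth, GB/s

fp16_ratio = a5000_fp16_tflops / a10g_fp16_tflops  # per-chip math advantage
bw_ratio = a5000_bw_gbs / a10g_bw_gbs              # per-chip bandwidth advantage
print(f"A5000 per-GPU advantage: {fp16_ratio:.2f}x FP16, {bw_ratio:.2f}x bandwidth")
```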

2.2 Scaling overhead of eight GPUs

  • Gradient synchronisation: with 8× A10G, every backward pass must all-reduce gradients across twice as many peers as with 4× A5000, and PCIe traffic grows with GPU count. On plain PCIe (no NVSwitch) this quickly dominates once per-GPU compute time shrinks.
  • Launch latency: frameworks issue kernels to twice as many devices, stressing CPU cores and NVML locks.
  • Inefficient per-GPU batch size: if you kept the same global batch, each A10G saw half the mini-batch of an A5000, so kernel occupancy and mixed-precision tensor-core utilisation drop.
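The synchronisation point can be made concrete with the ideal ring all-reduce cost model, where each GPU moves 2(N-1)/N of the gradient buffer per step. A sketch (the model size and PCIe bandwidth here are illustrative assumptions, not measurements from either box):

```python
def ring_allreduce_seconds(n_gpus: int, grad_bytes: float, link_gbs: float) -> float:
    """Ideal ring all-reduce: each GPU sends and receives
    2*(n-1)/n of the buffer over its slowest link."""
    bytes_per_gpu = 2.0 * (n_gpus - 1) / n_gpus * grad_bytes
    return bytes_per_gpu / (link_gbs * 1e9)

GRAD_BYTES = 2 * 350e6  # e.g. a 350M-parameter model in fp16 (assumption)
PCIE_GBS = 32.0         # rough PCIe 4.0 x16 effective bandwidth (assumption)

t4 = ring_allreduce_seconds(4, GRAD_BYTES, PCIE_GBS)
t8 = ring_allreduce_seconds(8, GRAD_BYTES, PCIE_GBS)
print(f"per all-reduce: {t4 * 1e3:.1f} ms on 4 GPUs, {t8 * 1e3:.1f} ms on 8 GPUs")
```

Even in this best case the 8-GPU ring pays more per step, while each GPU has less compute to overlap it with; real PCIe trees without NVSwitch do worse than the ideal model.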

2.3 Other sneaky factors

  1. Cooling & power delivery: the A10G’s single-slot cooler limits sustained clocks, especially in dense 8-way chassis; the A5000’s dual-slot blower or twin-axial fans maintain higher boost.
  2. CPU / RAM imbalance: an 8-GPU server often hangs the GPUs off two PCIe PLX switches; if the CPU can’t keep 8 queues busy, GPU utilisation dips.
  3. NVLink bridges on the A5000 rig: if two-way bridges are installed (common in 4-GPU workstations), gradients inside each bridged pair travel at 112 GB/s instead of ~32 GB/s over PCIe, cutting comm time by roughly 3× for half the transfers.
  4. Driver clocks & nvidia-smi -lgc: many cloud builds ship the A10G with a conservative 930-1,350 MHz clock lock for vGPU stability; unlocking can close ~10-15% of the gap.

3 · What you can try on the A10G box

| Quick tweak | Why it might help |
| --- | --- |
| nvidia-smi -lgc 1200,1695 plus adequate airflow | Raises the sustained clock floor persistently (watch thermals) |
| Larger per-GPU batch or gradient accumulation | Improves kernel occupancy and amortises comm overhead |
| NCCL_P2P_LEVEL=SYS + hierarchical all-reduce | Lets NCCL exploit the PCIe-switch topology more efficiently |
| 2:4 structured sparsity (A10G supports it) | Cuts math and memory load to offset the bandwidth deficit |
| Reduce sync frequency (e.g., sync gradients every N steps) | Trades a bit of convergence quality for wall-clock time |
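Two of these tweaks, larger effective batches and less frequent synchronisation, combine naturally in PyTorch DDP via no_sync(), which skips the gradient all-reduce on intermediate micro-batches. A minimal sketch (assumes a DDP-wrapped model and an already-initialised process group; accum_steps is a tuning knob, not a recommendation):

```python
import contextlib

import torch.nn.functional as F

def train_epoch(model, optimizer, loader, accum_steps: int = 4):
    """Gradient accumulation under DDP: gradients are all-reduced only
    once every accum_steps micro-batches instead of on every backward()."""
    model.train()
    for step, (x, y) in enumerate(loader):
        sync_now = (step + 1) % accum_steps == 0
        # model.no_sync() suppresses DDP's gradient all-reduce for this backward.
        ctx = contextlib.nullcontext() if sync_now else model.no_sync()
        with ctx:
            loss = F.cross_entropy(model(x), y) / accum_steps
            loss.backward()
        if sync_now:
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)
```

With accum_steps = 4, the 8-GPU box pays for one all-reduce per optimizer step rather than four, at the cost of a 4× larger effective batch.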

4 · Take-away

Even though you have twice as many A10Gs, each card is clock- and power-capped and fed by narrower DRAM, so total math-and-memory throughput for the 8-GPU box only scales to ~1.6× that of the 4-GPU A5000 box on paper, and the communication and saturation penalties scale worse than linearly, often wiping out the nominal advantage. Equalise clocks, enlarge per-GPU batches, or move to an NVLink/NVSwitch-based platform if you want eight GPUs to win consistently.
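The “on paper” figure is easy to check with the numbers from the spec table above; notably, on FP16 math alone the 8-GPU box’s nominal edge is only about 1.13×, so even modest communication overhead erases it (spec-sheet arithmetic, not a measurement):

```python
# Aggregate "on paper" throughput of each box, from the spec table above.
n_a10g, a10g_fp16, a10g_bw = 8, 125.0, 600.0     # count, TFLOPS FP16, GB/s
n_a5000, a5000_fp16, a5000_bw = 4, 222.0, 768.0

fp16_ratio = (n_a10g * a10g_fp16) / (n_a5000 * a5000_fp16)  # 1000 vs 888 TFLOPS
bw_ratio = (n_a10g * a10g_bw) / (n_a5000 * a5000_bw)        # 4800 vs 3072 GB/s
print(f"8x A10G vs 4x A5000 on paper: {fp16_ratio:.2f}x FP16, {bw_ratio:.2f}x bandwidth")
```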

 

It looks like I'd need at least a g6e-class EC2 instance, the kind with L40S GPUs that have both ample memory bandwidth and ample FP16 TFLOPS, to hit the training speed I'm after?!
