I've recently been trying out AWS as a training server, specifically the g5.48xlarge instance.

But a training run on the 4x A5000 server (on-premise) turned out to be about 1.5-2x faster than on the 8x A10G server (AWS). When I had previously asked ChatGPT to compare the specs, the CUDA core counts were about the same and the memory bandwidths were at a similar level, so I expected 8x A10G to be faster. Since the result came out the other way, I asked ChatGPT why.

To summarise: the A10G's FP16 throughput being about half the A5000's, and, again, its slower memory bandwidth, look like the causes. I don't know whether the A5000 server has NVLink bridges installed, but if it does, that probably helped during DDP training too. Still... even at half the FP16 throughput, eight of them should at least match four, shouldn't they? Then again, the CPU, RAM, SSD, and so on can also account for speed differences, so I'll let it go.
The answer is below.
In short, each RTX A5000 in your 4-GPU box can simply push more math per second and move data faster than each A10G in the 8-GPU box, and when you add the communication overhead that comes with doubling the GPU count (all-reduce, parameter sync, PCIe/NVLink traffic, kernel launches, CPU scheduling), the “slower-per-GPU but more GPUs” setup can end up behind the “fewer but faster” one. Key culprits are the A5000’s ~1.8× higher FP16/Tensor throughput, ~28% more memory bandwidth, and 50% higher power headroom, plus the scaling penalty of eight workers instead of four.
1 · Raw-spec side-by-side
| Spec (Ampere) | A10G | RTX A5000 | Why it matters |
|---|---|---|---|
| Launch market | Cloud / virtual GPU | Workstation / compute | Different clock-TDP trade-offs |
| CUDA cores | 9,216 (TechPowerUp) | 8,192 (TechPowerUp) | Core count alone ≠ speed; clock & TDP matter |
| Boost clock | ~1,530 MHz* (150 W limit) (NVIDIA) | up to 1,695 MHz (230 W) (TechPowerUp, NVIDIA) | Higher clocks drive FLOPS |
| FP16/TF32 Tensor | 125 TFLOPS FP16 (NVIDIA) | 222 TFLOPS FP16 (NVIDIA Developer Forums, CUDO Compute) | ~1.8× per-GPU math throughput |
| Memory size | 24 GB GDDR6 | 24 GB GDDR6 | Same capacity, but… |
| Memory BW | 600 GB/s (NVIDIA) | 768 GB/s (CUDO Compute) | +28% feed-and-fetch speed |
| Max board power | 150 W (single-slot) (NVIDIA, TechPowerUp) | 230 W (dual-slot) (NVIDIA, TechPowerUp) | Extra 80 W keeps clocks high under load |
| NVLink | None | 2-way bridge (112 GB/s) (NVIDIA) | Reduces inter-GPU traffic when enabled |
| Form factor | Single-slot, air-starved | Dual-slot, beefy cooler | Cooling affects sustained clocks |
*NVIDIA does not publish an official boost clock for the A10G; monitoring tools typically report ~1.5 GHz under load at the 150 W limit.
2 · Why the A5000 box trains faster
2.1 Per-GPU arithmetic & bandwidth
- Tensor throughput: 222 TFLOPS vs 125 TFLOPS means the A5000 crunches mixed-precision GEMMs ~78% faster per chip. (NVIDIA, NVIDIA Developer Forums, CUDO Compute)
- Memory bandwidth: 768 GB/s vs 600 GB/s keeps those tensor units fed and reduces time in memcpyAsync/all_reduce operations. (CUDO Compute, NVIDIA)
- Higher TDP: 230 W lets the GA102 die stay near its peak clock for long training runs; the 150 W A10G often drops to ~1.3-1.4 GHz once junction temps rise (NVIDIA). A monitoring sketch follows this list.
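To see whether that throttling is actually happening on your own box, here is a minimal sketch of mine (not from the answer above). It assumes the NVML Python bindings are installed (`pip install nvidia-ml-py`, imported as `pynvml`) and simply samples the SM clock and GPU temperature while a training job runs:

```python
# Minimal sketch: sample SM clock and temperature once a second for a minute
# while training runs, to see whether the 150 W A10G falls off its boost clock.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0; loop over indices for all
for _ in range(60):
    clock = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)
    temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
    print(f"SM clock: {clock} MHz, temp: {temp} °C")
    time.sleep(1)
pynvml.nvmlShutdown()
```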
2.2 Scaling overhead of eight GPUs
- Gradient synchronisation: With 8 × A10G every backward pass must all-reduce gradients across twice as many peers as with 4 × A5000, and PCIe traffic grows O(N log N). On plain PCIe (no NVSwitch) this quickly dominates once per-GPU compute time shrinks (a timing sketch follows this list).
- Launch latency: Frameworks issue kernels to twice as many devices, stressing CPU cores and NVML locks.
- Inefficient batch size per GPU: If you kept the same global batch, each A10G saw half the mini-batch of an A5000, so kernel occupancy and mixed-precision tensor-core utilisation drop.
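To put a number on the sync cost, here is a minimal `torch.distributed` timing sketch of mine. It times an all-reduce over a ~25 MB fp16 buffer, which mirrors DDP's default gradient-bucket size; launch with `torchrun --nproc_per_node=8 bench_allreduce.py` on the A10G box and `--nproc_per_node=4` on the A5000 box to compare:

```python
# Minimal sketch: time a DDP-bucket-sized all-reduce across all local GPUs.
# Launch: torchrun --nproc_per_node=8 bench_allreduce.py  (single node assumed)
import time
import torch
import torch.distributed as dist

dist.init_process_group("nccl")  # torchrun supplies rank/world-size env vars
rank = dist.get_rank()
torch.cuda.set_device(rank)  # single node: global rank == local GPU index

# ~25 MB of fp16 zeros (zeros so repeated summing cannot overflow)
buf = torch.zeros(13_107_200, dtype=torch.float16, device="cuda")

for _ in range(10):  # warm-up, lets NCCL build its communicators
    dist.all_reduce(buf)
torch.cuda.synchronize()

iters = 100
t0 = time.perf_counter()
for _ in range(iters):
    dist.all_reduce(buf)
torch.cuda.synchronize()

if rank == 0:
    ms = (time.perf_counter() - t0) / iters * 1e3
    print(f"all_reduce of ~25 MB: {ms:.2f} ms per call")
dist.destroy_process_group()
```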
2.3 Other sneaky factors
- Cooling & power delivery: The A10G’s single-slot cooler limits sustained clocks, especially in 8-way dense chassis; the A5000’s dual-slot blower or active twin-axial fans maintain higher boost. (TechPowerUp)
- CPU / RAM imbalance: An 8-GPU server often uses two PCIe PLX switches; if the CPU can’t keep 8 queues busy, GPU utilisation dips.
- NVLink bridges on the A5000 rig: If you installed two-way bridges (common in 4-GPU workstations), gradients inside each bridged pair travel at 112 GB/s instead of 32 GB/s PCIe, cutting comm time by ~3× for half the transfers. (NVIDIA)
- Driver clocks & nvidia-smi -lgc: Many cloud builds ship the A10G with a conservative 930-1350 MHz lock for vGPU stability; unlocking can close ~10-15% of the gap.
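Whether NVLink (or any peer-to-peer path) is actually active is quick to check: `nvidia-smi topo -m` prints the full link matrix from the shell, and from Python a small sketch is:

```python
# Minimal sketch: report which GPU pairs have a peer-to-peer path
# (NVLink or PCIe P2P); pairs without one stage transfers through host RAM.
import torch

n = torch.cuda.device_count()
for i in range(n):
    peers = [j for j in range(n)
             if j != i and torch.cuda.can_device_access_peer(i, j)]
    print(f"GPU{i} peer access: {peers or 'none'}")
```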
3 · What you can try on the A10 G box
| Quick tweak | Why it might help |
|---|---|
| nvidia-smi -lgc 1200,1695 + adequate airflow | Raises the boost limit persistently (watch thermals) |
| Larger per-GPU batch or gradient accumulation | Improves kernel occupancy and amortises comm overhead |
| NCCL_P2P_LEVEL=SYS + hierarchical all-reduce | Lets NCCL exploit the PCIe-switch topology more efficiently |
| 2:4 structured sparsity (the Ampere A10G supports it; FP8 needs Hopper/Ada) | Cuts math and memory load to offset the bandwidth deficit |
| Reduce sync frequency (e.g., all-reduce every N steps) | Trades a bit of convergence for wall-time |
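As a sketch of the batch/sync tweaks from the table, PyTorch DDP's `no_sync()` context skips the gradient all-reduce on accumulation steps, so the sync fires once every N micro-batches instead of every step. Here `model` (a `DistributedDataParallel` instance), `loader`, `criterion`, and `optimizer` are assumed to come from your existing training script:

```python
# Sketch: gradient accumulation under DDP so the all-reduce runs once per
# `accum_steps` micro-batches. Launch e.g. with the NCCL hint from the table:
#   NCCL_P2P_LEVEL=SYS torchrun --nproc_per_node=8 train.py
import contextlib

accum_steps = 4  # effective per-GPU batch grows 4x; comm frequency drops 4x
for step, (x, y) in enumerate(loader):
    sync_now = (step + 1) % accum_steps == 0
    # no_sync() suppresses DDP's gradient all-reduce on non-boundary steps
    ctx = contextlib.nullcontext() if sync_now else model.no_sync()
    with ctx:
        loss = criterion(model(x.cuda()), y.cuda()) / accum_steps
        loss.backward()
    if sync_now:
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
```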
4 · Take-away
Even though you have twice as many A10Gs, each card is clock-and-power-capped and fed by narrower DRAM, so total math-and-memory throughput for the 8-GPU box only scales to ~1.6× that of the 4-GPU A5000 box on paper, yet the communication and saturation penalties scale worse than linearly, often wiping out the nominal advantage. Equalise clocks, enlarge per-GPU batches, or move to NVLink/NVSwitch-based platforms if you want eight GPUs to win consistently.
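A quick back-of-envelope check of that "on paper" figure, using the per-GPU numbers from the spec table:

```python
# Aggregate ratios of the 8x A10G box over the 4x A5000 box, per the table above.
a10g_fp16, a5000_fp16 = 125, 222  # TFLOPS FP16 per GPU
a10g_bw, a5000_bw = 600, 768      # GB/s per GPU

print(f"FP16:      {8 * a10g_fp16 / (4 * a5000_fp16):.2f}x")  # ~1.13x
print(f"Memory BW: {8 * a10g_bw / (4 * a5000_bw):.2f}x")      # ~1.56x
```

So even before any communication cost, the eight-GPU box only buys roughly 1.1-1.6× on paper, which a worse-than-linear sync penalty can easily erase.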
It looks like I'd need at least an EC2 g6e-class instance with L40S GPUs, which have plenty of both memory bandwidth and FP16 TFLOPS, to hit the training speed I'm after?!