
Lately I've been using AWS as a training server, specifically a g5.48xlarge instance.

 

But training on a 4x A5000 server (on-premises) ran about 1.5-2x faster than on the 8x A10G server (AWS). When I had previously asked ChatGPT to compare the specs, the CUDA core counts were similar and the memory bandwidth was in the same ballpark, so I expected 8x A10G to come out ahead. Since I got the opposite result, I asked ChatGPT why.

 

In short, the causes seem to be that the A10G's FP16 throughput is about half the A5000's, and that its memory bandwidth is also lower. I don't know whether the A5000 server has NVLink bridges installed, but if it does, that would also have helped during DDP training. Still... even at half the FP16 throughput, eight of them should at least match four, shouldn't they? Then again, differences in CPU, RAM, SSD, and so on can also contribute, so I'll let it go.


Below is the answer.

 

In short, each RTX A5000 in your 4-GPU box can simply push more math per second and move data faster than each A10G in the 8-GPU box, and once you add the communication overhead that comes with doubling the GPU count (all-reduce, parameter sync, PCIe/NVLink traffic, kernel launches, CPU scheduling), the “slower-per-GPU but more GPUs” setup can end up behind the “fewer but faster” one. Key culprits are the A5000’s ~1.8× higher FP16/Tensor throughput, ~28% more memory bandwidth, and 50% higher power headroom, plus the scaling penalty of eight workers instead of four.

1 · Raw-spec side-by-side

| Spec (Ampere) | A10G | RTX A5000 | Why it matters |
| --- | --- | --- | --- |
| Launch market | Cloud / virtual GPU | Workstation / compute | Different clock-TDP trade-offs |
| CUDA cores | 9,216 | 8,192 | Core count alone ≠ speed; clock & TDP matter |
| Boost clock | ~1,530 MHz* (150 W limit) | up to 1,695 MHz (230 W) | Higher clocks drive FLOPS |
| FP16 Tensor | 125 TFLOPS | 222 TFLOPS | ~1.8× per-GPU math throughput |
| Memory size | 24 GB GDDR6 | 24 GB GDDR6 | Same capacity, but… |
| Memory bandwidth | 600 GB/s | 768 GB/s | +28% feed-and-fetch speed |
| Max board power | 150 W (single-slot) | 230 W (dual-slot) | Extra 80 W keeps clocks high under load |
| NVLink | None | 2-way bridge (112 GB/s) | Reduces inter-GPU traffic when enabled |
| Form factor | Single-slot, air-starved | Dual-slot, beefy cooler | Cooling affects sustained clocks |

*NVIDIA does not publish an official boost clock for the A10G; monitoring tools typically report ~1.5 GHz under load at the 150 W limit.

2 · Why the A5000 box trains faster

2.1 Per-GPU arithmetic & bandwidth

  • Tensor throughput: 222 TFLOPS vs 125 TFLOPS means the A5000 crunches mixed-precision GEMMs ~78% faster per chip.
  • Memory bandwidth: 768 GB/s vs 600 GB/s keeps those tensor units fed and reduces time spent in memcpyAsync/all_reduce operations.
  • Higher TDP: 230 W lets the GA102 die stay near its peak clock for long training runs; the 150 W A10G often drops to ~1.3-1.4 GHz once junction temperatures rise.
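As a sanity check, the per-chip gaps quoted above work out to roughly 1.8× on FP16 math and 1.28× on bandwidth. A quick sketch using the vendor spec numbers from the table (spec-sheet figures, not measurements from either box):

```python
# Per-GPU spec ratios, using the vendor numbers quoted above
# (spec-sheet figures, not measurements from either server).
a10g_fp16_tflops = 125.0   # A10G FP16 tensor throughput
a5000_fp16_tflops = 222.0  # RTX A5000 FP16 tensor throughput
a10g_bw_gbs = 600.0        # A10G memory bandwidth, GB/s
a5000_bw_gbs = 768.0       # RTX A5000 memory bandwidth, GB/s

fp16_ratio = a5000_fp16_tflops / a10g_fp16_tflops  # per-chip math advantage
bw_ratio = a5000_bw_gbs / a10g_bw_gbs              # per-chip bandwidth advantage
print(f"A5000 per-GPU advantage: {fp16_ratio:.2f}x FP16, {bw_ratio:.2f}x bandwidth")
```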

2.2 Scaling overhead of eight GPUs

  • Gradient synchronisation: with 8× A10G, every backward pass must all-reduce gradients across twice as many peers as with 4× A5000, and PCIe traffic grows with GPU count. On plain PCIe (no NVSwitch) this quickly dominates once per-GPU compute time shrinks.
  • Launch latency: frameworks issue kernels to twice as many devices, stressing CPU cores and NVML locks.
  • Inefficient per-GPU batch size: if you kept the same global batch, each A10G saw half the mini-batch of an A5000, so kernel occupancy and mixed-precision tensor-core utilisation drop.
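The synchronisation point can be made concrete with the ideal ring all-reduce cost model, where each GPU moves 2(N-1)/N of the gradient buffer per step. A sketch (the model size and PCIe bandwidth here are illustrative assumptions, not measurements from either box):

```python
def ring_allreduce_seconds(n_gpus: int, grad_bytes: float, link_gbs: float) -> float:
    """Ideal ring all-reduce: each GPU sends and receives
    2*(n-1)/n of the buffer over its slowest link."""
    bytes_per_gpu = 2.0 * (n_gpus - 1) / n_gpus * grad_bytes
    return bytes_per_gpu / (link_gbs * 1e9)

GRAD_BYTES = 2 * 350e6  # e.g. a 350M-parameter model in fp16 (assumption)
PCIE_GBS = 32.0         # rough PCIe 4.0 x16 effective bandwidth (assumption)

t4 = ring_allreduce_seconds(4, GRAD_BYTES, PCIE_GBS)
t8 = ring_allreduce_seconds(8, GRAD_BYTES, PCIE_GBS)
print(f"per all-reduce: {t4 * 1e3:.1f} ms on 4 GPUs, {t8 * 1e3:.1f} ms on 8 GPUs")
```

Even in this best case the 8-GPU ring pays more per step, while each GPU has less compute to overlap it with; real PCIe trees without NVSwitch do worse than the ideal model.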

2.3 Other sneaky factors

  1. Cooling & power delivery: the A10G’s single-slot cooler limits sustained clocks, especially in dense 8-way chassis; the A5000’s dual-slot blower or twin-axial fans maintain higher boost.
  2. CPU / RAM imbalance: an 8-GPU server often hangs the GPUs off two PCIe PLX switches; if the CPU can’t keep 8 queues busy, GPU utilisation dips.
  3. NVLink bridges on the A5000 rig: if two-way bridges are installed (common in 4-GPU workstations), gradients inside each bridged pair travel at 112 GB/s instead of ~32 GB/s over PCIe, cutting comm time by roughly 3× for half the transfers.
  4. Driver clocks & nvidia-smi -lgc: many cloud builds ship the A10G with a conservative 930-1,350 MHz clock lock for vGPU stability; unlocking can close ~10-15% of the gap.

3 · What you can try on the A10G box

| Quick tweak | Why it might help |
| --- | --- |
| nvidia-smi -lgc 1200,1695 plus adequate airflow | Raises the sustained clock floor persistently (watch thermals) |
| Larger per-GPU batch or gradient accumulation | Improves kernel occupancy and amortises comm overhead |
| NCCL_P2P_LEVEL=SYS + hierarchical all-reduce | Lets NCCL exploit the PCIe-switch topology more efficiently |
| 2:4 structured sparsity (A10G supports it) | Cuts math and memory load to offset the bandwidth deficit |
| Reduce sync frequency (e.g., sync gradients every N steps) | Trades a bit of convergence quality for wall-clock time |
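Two of these tweaks, larger effective batches and less frequent synchronisation, combine naturally in PyTorch DDP via no_sync(), which skips the gradient all-reduce on intermediate micro-batches. A minimal sketch (assumes a DDP-wrapped model and an already-initialised process group; accum_steps is a tuning knob, not a recommendation):

```python
import contextlib

import torch.nn.functional as F

def train_epoch(model, optimizer, loader, accum_steps: int = 4):
    """Gradient accumulation under DDP: gradients are all-reduced only
    once every accum_steps micro-batches instead of on every backward()."""
    model.train()
    for step, (x, y) in enumerate(loader):
        sync_now = (step + 1) % accum_steps == 0
        # model.no_sync() suppresses DDP's gradient all-reduce for this backward.
        ctx = contextlib.nullcontext() if sync_now else model.no_sync()
        with ctx:
            loss = F.cross_entropy(model(x), y) / accum_steps
            loss.backward()
        if sync_now:
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)
```

With accum_steps = 4, the 8-GPU box pays for one all-reduce per optimizer step rather than four, at the cost of a 4× larger effective batch.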

4 · Take-away

Even though you have twice as many A10Gs, each card is clock- and power-capped and fed by narrower DRAM, so total math-and-memory throughput for the 8-GPU box only scales to ~1.6× that of the 4-GPU A5000 box on paper, and the communication and saturation penalties scale worse than linearly, often wiping out the nominal advantage. Equalise clocks, enlarge per-GPU batches, or move to an NVLink/NVSwitch-based platform if you want eight GPUs to win consistently.
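The “on paper” figure is easy to check with the numbers from the spec table above; notably, on FP16 math alone the 8-GPU box’s nominal edge is only about 1.13×, so even modest communication overhead erases it (spec-sheet arithmetic, not a measurement):

```python
# Aggregate "on paper" throughput of each box, from the spec table above.
n_a10g, a10g_fp16, a10g_bw = 8, 125.0, 600.0     # count, TFLOPS FP16, GB/s
n_a5000, a5000_fp16, a5000_bw = 4, 222.0, 768.0

fp16_ratio = (n_a10g * a10g_fp16) / (n_a5000 * a5000_fp16)  # 1000 vs 888 TFLOPS
bw_ratio = (n_a10g * a10g_bw) / (n_a5000 * a5000_bw)        # 4800 vs 3072 GB/s
print(f"8x A10G vs 4x A5000 on paper: {fp16_ratio:.2f}x FP16, {bw_ratio:.2f}x bandwidth")
```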

 

It looks like I'd need at least a g6e-class EC2 instance, the kind with L40S GPUs that have both ample memory bandwidth and ample FP16 TFLOPS, to hit the training speed I'm after?!
