
Why You Might Not Need That GPU

The assumption that GPUs are 1000× faster than CPUs is wrong for most quant workloads. Modern CPUs with AVX-512 and adjoint differentiation close most of the gap, at roughly 30% lower infrastructure cost.

Dmitri Goloubentsev
· 3 min read

The GPU narrative in quantitative finance goes something like this: GPUs are massively parallel, CPUs are not, therefore GPUs win. The implication is that every compute-intensive workload (Monte Carlo, XVA, FRTB) should run on GPUs.

The reality is more nuanced.

The actual numbers

We benchmarked equivalent workloads on GPU and CPU:

Platform                    Time      Monthly Cost
NVIDIA V100 GPU             10.2 ms   $1,300/month
CPU, 30 threads, AVX-512    13.5 ms   $915/month

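The CPU run takes 32% longer (13.5 ms vs 10.2 ms) and costs 30% less ($915 vs $1,300 per month). And the CPU version gets adjoint Greeks without a separate implementation: the same compiled kernel that prices also differentiates. On GPU, you either write a separate CUDA kernel for Greeks or fall back to bump-and-revalue.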
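The two headline ratios follow directly from the table. The short sketch below redoes that arithmetic and adds one figure of our own for illustration: monthly cost multiplied by time per run, as a rough proxy for cost per unit of throughput. That proxy is an assumption, not something the benchmark reported.

```python
# Figures copied from the benchmark table.
gpu_time_ms, gpu_cost = 10.2, 1300.0   # NVIDIA V100
cpu_time_ms, cpu_cost = 13.5, 915.0    # 30 threads, AVX-512

# Extra time taken by the CPU run, as a fraction of the GPU time.
slowdown = cpu_time_ms / gpu_time_ms - 1.0
# Fraction of the monthly bill saved by running on CPU.
saving = 1.0 - cpu_cost / gpu_cost

# Illustrative proxy (an assumption): cost per unit of throughput
# scales with monthly cost * time per run.
relative_cost_per_run = (cpu_cost * cpu_time_ms) / (gpu_cost * gpu_time_ms)

print(f"CPU slowdown: {slowdown:.1%}")
print(f"CPU cost saving: {saving:.1%}")
print(f"CPU cost per run relative to GPU: {relative_cost_per_run:.2f}")
```

Under that proxy the CPU comes out slightly ahead on pricing alone, before any saving from computing Greeks in the same kernel is counted.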

GPU liabilities nobody talks about

  1. Vendor lock-in. CUDA only runs on NVIDIA. When NVIDIA raises prices (they do), you have no alternative without a rewrite.
  2. Memory wall. GPU memory is limited (16-80 GB per card). Large portfolios require batching and host-to-device data transfer, which erodes the theoretical throughput advantage.
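  3. Double precision. Consumer GPUs (gaming cards repurposed for compute) are fast at single precision but typically run double precision at 1/32 to 1/64 of that rate. Only data-center cards such as the V100 run double precision at half the single-precision rate. Most finance workloads require double precision.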
  4. Developer cost. CUDA developers are expensive and scarce. Maintaining parallel CUDA and C++ codebases doubles the development burden.
  5. Cloud availability. GPU instances have spotty availability and premium pricing. CPU instances are commodity.
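To make the memory-wall point concrete, here is a back-of-the-envelope sketch. Every number in it is hypothetical (path count, time steps, risk factors, a 16 GB card), and it assumes the worst case where every simulated value has to be resident on the device at once.

```python
import math

def batches_needed(n_paths, n_steps, n_factors, device_bytes, bytes_per_value=8):
    """Return (total bytes of path data, number of batches to fit on the device)."""
    total_bytes = n_paths * n_steps * n_factors * bytes_per_value
    return total_bytes, math.ceil(total_bytes / device_bytes)

# Hypothetical workload: 1M paths, 360 time steps, 10 risk factors, doubles.
total, n_batches = batches_needed(1_000_000, 360, 10, device_bytes=16 * 10**9)
print(f"{total / 1e9:.1f} GB of path data, {n_batches} batches on a 16 GB card")
```

Each extra batch adds a host-to-device transfer, which is where the theoretical throughput advantage starts to leak away.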

The CPU alternative

Modern CPUs with AVX-512 process 8 double-precision values per instruction, since a 512-bit register holds eight 64-bit doubles. With compiled kernel replay (record the calculation once, then evaluate it millions of times with SIMD vectorization), the effective throughput approaches GPU levels for the workload patterns common in finance.
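The record-once, replay-many idea can be sketched in a few lines of Python. This is a toy illustration, not AADC's API: `Tape`, `Var`, `record` and `replay` are hypothetical names, the model parameters are made up, and a Python list of 8 values stands in for one AVX-512 register. A real compiler would emit vectorized machine code instead of interpreting the tape.

```python
import math

class Tape:
    """Flat list of recorded instructions; slots hold intermediate values."""
    def __init__(self):
        self.ops = []       # (opcode, output slot, payload)
        self.n_slots = 0

    def new_slot(self):
        self.n_slots += 1
        return self.n_slots - 1

    def emit(self, opcode, payload):
        out = self.new_slot()
        self.ops.append((opcode, out, payload))
        return Var(self, out)

class Var:
    """Placeholder used while recording; arithmetic appends to the tape."""
    def __init__(self, tape, slot):
        self.tape, self.slot = tape, slot

    def _lift(self, other):
        # Plain numbers become "const" instructions on the tape.
        return other if isinstance(other, Var) else self.tape.emit("const", float(other))

    def __add__(self, other):
        return self.tape.emit("add", (self.slot, self._lift(other).slot))

    def __mul__(self, other):
        return self.tape.emit("mul", (self.slot, self._lift(other).slot))

    __radd__, __rmul__ = __add__, __mul__

def exp(x):
    return x.tape.emit("exp", (x.slot,))

def record(fn, n_inputs):
    """Run fn once on placeholders and keep the resulting instruction list."""
    tape = Tape()
    inputs = [Var(tape, tape.new_slot()) for _ in range(n_inputs)]
    return tape, fn(*inputs).slot

def replay(tape, out_slot, lanes):
    """Evaluate the tape on a batch: lanes[i] holds the values of input i."""
    width = len(lanes[0])
    slots = [None] * tape.n_slots
    slots[:len(lanes)] = lanes
    for opcode, out, payload in tape.ops:
        if opcode == "const":
            slots[out] = [payload] * width
        elif opcode == "exp":
            slots[out] = [math.exp(a) for a in slots[payload[0]]]
        else:
            a, b = slots[payload[0]], slots[payload[1]]
            if opcode == "add":
                slots[out] = [x + y for x, y in zip(a, b)]
            else:  # "mul"
                slots[out] = [x * y for x, y in zip(a, b)]
    return slots[out_slot]

# Record once: one lognormal step with hypothetical drift 0.03 and volatility 0.2.
tape, out_slot = record(lambda s0, z: s0 * exp(0.2 * z + 0.03), n_inputs=2)

# Replay on 8 lanes, the number of doubles in one AVX-512 register.
zs = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
results = replay(tape, out_slot, [[100.0] * 8, zs])
print(results)
```

The point of the split is that all the Python-level overhead (operator dispatch, object creation) is paid once in `record`; `replay` only walks a fixed instruction list, and each instruction applies to every lane.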

The GPU advantage comes from massive parallelism across simple operations. But a compiled adjoint kernel on CPU achieves parallelism too (across SIMD lanes and OpenMP threads), and it provides exact sensitivities in the same pass.
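To see why getting sensitivities in the same pass matters, compare a minimal reverse-mode (adjoint) sweep with bump-and-revalue on a toy function. This is a sketch under our own assumptions: `Node`, `backward` and `terminal_spot` are hypothetical names, the model is a single lognormal step with a fixed draw, and none of it reflects how AADC is implemented.

```python
import math

class Node:
    """Value plus links to its inputs and the local derivatives w.r.t. them."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # tuples of (input node, d(self)/d(input))
        self.adjoint = 0.0

    def __add__(self, other):
        other = other if isinstance(other, Node) else Node(other)
        return Node(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        other = other if isinstance(other, Node) else Node(other)
        return Node(self.value * other.value,
                    ((self, other.value), (other, self.value)))

    __radd__, __rmul__ = __add__, __mul__

def exp(x):
    """exp that works on both plain floats and Node objects."""
    if isinstance(x, Node):
        v = math.exp(x.value)
        return Node(v, ((x, v),))
    return math.exp(x)

def backward(out):
    """One reverse sweep: propagate adjoints from the output to all inputs."""
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)       # inputs land before the nodes using them
    visit(out)
    out.adjoint = 1.0
    for node in reversed(order):     # users first, then their inputs
        for parent, local in node.parents:
            parent.adjoint += node.adjoint * local

def terminal_spot(s0, vol, z=0.3):
    """Toy model: one lognormal step with a fixed normal draw z (hypothetical)."""
    return s0 * exp(-0.5 * vol * vol + vol * z)

# Adjoint: one forward evaluation and one backward sweep give every sensitivity.
s0, vol = Node(100.0), Node(0.2)
value = terminal_spot(s0, vol)
backward(value)

# Bump-and-revalue: one extra evaluation per input, and only approximate.
h = 1e-6
base = terminal_spot(100.0, 0.2)
bump_s0 = (terminal_spot(100.0 + h, 0.2) - base) / h
bump_vol = (terminal_spot(100.0, 0.2 + h) - base) / h

print(s0.adjoint, bump_s0)
print(vol.adjoint, bump_vol)
```

With two inputs the difference is small. With the hundreds or thousands of risk factors in an XVA run, bump-and-revalue needs one revaluation per factor, while the adjoint sweep still costs a single backward pass.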

When GPU still wins

  • Workloads with >10M independent paths and simple per-path logic
  • Neural network training (matrix multiply dominance)
  • When you already have CUDA code and no need for adjoint Greeks

When CPU wins

  • Workloads requiring adjoint Greeks (the backward pass is inherently sequential per path)
  • Complex per-path logic (branches, state machines, path dependence)
  • When infrastructure cost matters more than peak throughput
  • When you want one codebase, not two

Implemented using AADC, a commercial adjoint AD compiler (matlogica.com).
