Why You Might Not Need That GPU

The GPU pricing narrative in quantitative finance goes something like this: GPUs are massively parallel, CPUs are not, therefore GPUs win. The implication is that any compute-intensive workload (Monte Carlo, XVA, FRTB) should run on GPUs.

The reality is more nuanced.

The actual numbers

We benchmarked equivalent workloads on GPU and CPU:

Platform	Time	Monthly Cost
NVIDIA V100 GPU	10.2 ms	$1,300/month
CPU, 30 threads, AVX-512	13.5 ms	$915/month

32% slower. 30% cheaper. And the CPU version gets adjoint Greeks for free. The same compiled kernel that prices also differentiates. On GPU, you either write a separate CUDA kernel for Greeks or bump-and-revalue.

GPU liabilities nobody talks about

Vendor lock-in. CUDA only runs on NVIDIA. When NVIDIA raises prices (they do), you have no alternative without a rewrite.
Memory wall. GPU memory is limited (16-80GB). Large portfolios require batching and data transfer, which kills the theoretical throughput advantage.
Double precision. Consumer GPUs (gaming cards repurposed for compute) are fast at single precision but 2× slower at double. Finance requires double precision.
Developer cost. CUDA developers are expensive and scarce. Maintaining parallel CUDA and C++ codebases doubles the development burden.
Cloud availability. GPU instances have spotty availability and premium pricing. CPU instances are commodity.

The CPU alternative

Modern CPUs with AVX-512 process 8 double-precision values per instruction. With compiled kernel replay (record once, evaluate millions of times with SIMD vectorization), the effective throughput approaches GPU levels for the workload patterns common in finance.

GPU advantage comes from massive parallelism across simple operations. But a compiled adjoint kernel on CPU also achieves massive parallelism (across SIMD lanes and OpenMP threads) while also providing exact sensitivities in the same pass.

When GPU still wins

Workloads with >10M independent paths and simple per-path logic
Neural network training (matrix multiply dominance)
When you already have CUDA code and no need for adjoint Greeks

When CPU wins

Workloads requiring adjoint Greeks (the backward pass is inherently sequential per path)
Complex per-path logic (branches, state machines, path dependence)
When infrastructure cost matters more than peak throughput
When you want one codebase, not two

Implemented using AADC, a commercial adjoint AD compiler (matlogica.com).

Why You Might Not Need That GPU

The actual numbers

GPU liabilities nobody talks about

The CPU alternative

When GPU still wins

When CPU wins

Frequently Asked Questions

How much code change is required to run CUDA on CPU?

What is the real cost difference between GPU and CPU TFLOPS?

What are the main limitations of GPU for quantitative finance?

What performance can I expect running CUDA code on CPU with AADC?

More in How-To

Using Claude AI to Accelerate Python Models 345× with AADC

How to Accelerate Existing Quant Libraries with AADC

Accelerate QuantLib 6-100× Easily

Interested in these opportunities?