The GPU pricing narrative in quantitative finance goes something like this: GPUs are massively parallel, CPUs are not, therefore GPUs win. The implication is that any compute-intensive workload (Monte Carlo, XVA, FRTB) should run on GPUs.
The reality is more nuanced.
The actual numbers
We benchmarked equivalent workloads on GPU and CPU:
| Platform | Time | Monthly Cost |
|---|---|---|
| NVIDIA V100 GPU | 10.2 ms | $1,300/month |
| CPU, 30 threads, AVX-512 | 13.5 ms | $915/month |
32% slower. 30% cheaper. And the CPU version gets adjoint Greeks for free. The same compiled kernel that prices also differentiates. On GPU, you either write a separate CUDA kernel for Greeks or bump-and-revalue.
GPU liabilities nobody talks about
- Vendor lock-in. CUDA only runs on NVIDIA. When NVIDIA raises prices (they do), you have no alternative without a rewrite.
- Memory wall. GPU memory is limited (16-80GB). Large portfolios require batching and data transfer, which kills the theoretical throughput advantage.
- Double precision. Consumer GPUs (gaming cards repurposed for compute) are fast at single precision but 2× slower at double. Finance requires double precision.
- Developer cost. CUDA developers are expensive and scarce. Maintaining parallel CUDA and C++ codebases doubles the development burden.
- Cloud availability. GPU instances have spotty availability and premium pricing. CPU instances are commodity.
The CPU alternative
Modern CPUs with AVX-512 process 8 double-precision values per instruction. With compiled kernel replay (record once, evaluate millions of times with SIMD vectorization), the effective throughput approaches GPU levels for the workload patterns common in finance.
GPU advantage comes from massive parallelism across simple operations. But a compiled adjoint kernel on CPU also achieves massive parallelism (across SIMD lanes and OpenMP threads) while also providing exact sensitivities in the same pass.
When GPU still wins
- Workloads with >10M independent paths and simple per-path logic
- Neural network training (matrix multiply dominance)
- When you already have CUDA code and no need for adjoint Greeks
When CPU wins
- Workloads requiring adjoint Greeks (the backward pass is inherently sequential per path)
- Complex per-path logic (branches, state machines, path dependence)
- When infrastructure cost matters more than peak throughput
- When you want one codebase, not two
Implemented using AADC, a commercial adjoint AD compiler (matlogica.com).
Frequently Asked Questions
How much code change is required to run CUDA on CPU?
With AADC, minimal changes are required: override CUDA extensions using #define directives, change native types to AADC active types (double to idouble, bool to ibool), and add AADC kernel compilation. For vanilla options, no changes to kernel.cu are required. Path-dependent options need minimal modifications while maintaining GPU/CPU compatibility.
What is the real cost difference between GPU and CPU TFLOPS?
Based on cloud pricing analysis, the average cost of a CPU TFLOP is approximately 30% higher than GPU - not the 1000x often claimed. The NVIDIA V100 costs approximately 183 per TFLOP for Intel Xeon Platinum 9282. Total cost of ownership often favors CPU when accounting for development, maintenance, and specialist developer costs.
What are the main limitations of GPU for quantitative finance?
GPU limitations include strict memory volume constraints (typically 8-48GB), inability to support AAD effectively, vendor lock-in to CUDA/NVIDIA ecosystem, scarcity of CUDA specialists, higher maintenance costs, and hardware obsolescence issues as older GPU generations reach end-of-service.
What performance can I expect running CUDA code on CPU with AADC?
Real benchmarks show comparable performance: NVIDIA V100 GPU achieves 10.2ms execution time at 915/month for equity linked security pricing with 1080 timesteps and 100k Monte Carlo simulations. CPU offers better value considering AAD support and memory capacity.