Calculated Savings Overview
This result compares a full precision baseline against a mixed precision setup using your current assumptions for weights, gradients, optimizer states, activations, and runtime overhead.
| Memory Component | Baseline Training | Mixed Training |
|---|---|---|
| Weights | 26.08 GB | 13.04 GB |
| Master Weights | 0.00 GB | 26.08 GB |
| Gradients | 26.08 GB | 13.04 GB |
| Optimizer States | 52.15 GB | 52.15 GB |
| Activations | 1.72 GB | 1.09 GB |
| Runtime Overhead | 2.00 GB | 2.00 GB |
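These totals match the 7B Language Model profile in the example table below. A minimal check in Python, assuming 7 billion parameters, 4-byte FP32 baseline values, 2-byte FP16/BF16 mixed values, and an Adam-style optimizer with two FP32 states per parameter; the activation and overhead figures are taken directly from the table:

```python
GIB = 2**30  # the tables report bytes / 2**30, labeled GB

params = 7_000 * 1e6               # 7B-parameter profile
act_base, act_mixed = 1.72, 1.09   # activation totals from the table, GB
overhead = 2.00                    # runtime overhead, GB

# Baseline: FP32 (4-byte) weights and gradients, two FP32 optimizer states per parameter
weights_fp32 = params * 4 / GIB        # 26.08 GB
grads_fp32 = params * 4 / GIB          # 26.08 GB
opt_states = params * 2 * 4 / GIB      # 52.15 GB
baseline = weights_fp32 + grads_fp32 + opt_states + act_base + overhead  # ~108.03 GB

# Mixed: FP16/BF16 (2-byte) weights and gradients, plus an FP32 master copy
weights_fp16 = params * 2 / GIB        # 13.04 GB
master = weights_fp32                  # 26.08 GB
grads_fp16 = params * 2 / GIB          # 13.04 GB
mixed = weights_fp16 + master + grads_fp16 + opt_states + act_mixed + overhead  # ~107.40 GB

print(f"baseline {baseline:.2f} GB, mixed {mixed:.2f} GB")
```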
Calculator Inputs
Enter baseline and mixed precision assumptions. The calculator estimates training memory, inference memory, required GPU count, maximum batch size, and budget impact.
Example Data Table
These examples show how memory drops when weights, gradients, and activations move to lower precision while optimizer states stay in full precision.
| Profile | Parameters (M) | Batch Size | Baseline Train Memory | Mixed Train Memory | Memory Saved | Savings |
|---|---|---|---|---|---|---|
| Vision Model | 50 | 64 | 3.07 GB | 2.63 GB | 0.44 GB | 14.25% |
| Base Transformer | 355 | 16 | 7.73 GB | 7.38 GB | 0.34 GB | 4.45% |
| 7B Language Model | 7,000 | 8 | 108.03 GB | 107.40 GB | 0.63 GB | 0.58% |
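Applying the savings formula from the next section to the 7B row shows why the percentage shrinks when optimizer states dominate:

```python
baseline, mixed = 108.03, 107.40     # 7B Language Model row, GB
saved = baseline - mixed             # 0.63 GB
savings_pct = saved / baseline * 100 # ~0.58%
print(f"saved {saved:.2f} GB ({savings_pct:.2f}%)")
```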
Formula Used
- Baseline Training Memory = Weights + Gradients + Optimizer States + Activations + Runtime Overhead
- Mixed Training Memory = Mixed Weights + Master Weights + Mixed Gradients + Mixed Optimizer States + Mixed Activations + Runtime Overhead
- Inference Memory = Weights + Activations + Runtime Overhead
- Weights = Parameters × Weight Bytes
- Gradients = Parameters × Gradient Bytes
- Optimizer States = Parameters × Optimizer States Per Parameter × State Bytes
- Activations = Activation Memory Per Sample × Batch Size
- Training Savings (%) = ((Baseline Training Memory − Mixed Training Memory) ÷ Baseline Training Memory) × 100
- Required GPU Count = Ceiling(Training Memory ÷ Selected GPU Memory)
- Training Cost = Required GPU Count × GPU Hourly Rate × Training Hours
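A compact sketch of these formulas in Python. Function and parameter names are illustrative, not the calculator's actual code; memory is reported as bytes ÷ 2^30 to match the tables above. The usage lines reproduce the 7B profile from the example table.

```python
import math

GIB = 2**30  # the tables report bytes / 2**30, labeled GB

def training_memory_gb(params_m, weight_bytes, grad_bytes,
                       opt_states_per_param, state_bytes,
                       act_gb_per_sample, batch_size, overhead_gb,
                       master_weight_bytes=0):
    """Training memory in GB; master_weight_bytes=0 means no master copy is kept."""
    n = params_m * 1e6
    weights = n * weight_bytes / GIB
    master = n * master_weight_bytes / GIB
    grads = n * grad_bytes / GIB
    opt_states = n * opt_states_per_param * state_bytes / GIB
    activations = act_gb_per_sample * batch_size
    return weights + master + grads + opt_states + activations + overhead_gb

def inference_memory_gb(params_m, weight_bytes, act_gb_per_sample,
                        batch_size, overhead_gb):
    """Inference memory in GB; master weights are excluded by design."""
    n = params_m * 1e6
    return n * weight_bytes / GIB + act_gb_per_sample * batch_size + overhead_gb

def savings_pct(baseline_gb, mixed_gb):
    return (baseline_gb - mixed_gb) / baseline_gb * 100

def required_gpus(train_mem_gb, gpu_mem_gb):
    return math.ceil(train_mem_gb / gpu_mem_gb)

def training_cost(gpu_count, hourly_rate, hours):
    return gpu_count * hourly_rate * hours

# Reproduces the 7B profile: FP32 baseline vs FP16 weights/gradients plus FP32 master copy.
baseline = training_memory_gb(7000, 4, 4, 2, 4, 1.72 / 8, 8, 2.0)   # ~108.03 GB
mixed = training_memory_gb(7000, 2, 2, 2, 4, 1.09 / 8, 8, 2.0,
                           master_weight_bytes=4)                   # ~107.40 GB
print(f"{baseline:.2f} GB -> {mixed:.2f} GB ({savings_pct(baseline, mixed):.2f}%)")
```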
How to Use This Calculator
- Enter your model size in millions of parameters.
- Set baseline bytes for weights, gradients, and optimizer states.
- Enter the mixed precision bytes for each matching component.
- Choose whether training keeps FP32 master weights.
- Add activation memory per sample for both setups.
- Enter batch size, runtime overhead, GPU memory, GPU hourly rate, and training hours.
- Click Calculate Savings to generate the comparison.
- Review memory totals, GPU count, cost delta, and graph.
- Use the CSV or PDF buttons to export results.
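As a worked example of the last two formulas, here is the budget step for the 7B mixed profile; the GPU size, hourly rate, and training time are placeholder assumptions.

```python
import math

train_mem_gb = 107.40   # mixed training memory for the 7B profile
gpu_mem_gb = 80.0       # assumed 80 GB accelerator
hourly_rate = 2.50      # assumed $/GPU-hour
hours = 120             # assumed training duration

gpus = math.ceil(train_mem_gb / gpu_mem_gb)   # 2 GPUs
cost = gpus * hourly_rate * hours             # $600.00
print(f"{gpus} GPUs, ${cost:,.2f}")
```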
FAQs
1. What does mixed precision usually save?
Mixed precision often cuts weight, gradient, and activation memory. Savings depend on whether optimizer states stay larger and whether master weights are retained during training.
2. Why are optimizer states important here?
Optimizer states can dominate training memory, especially with Adam. If they remain in larger formats, total savings shrink even when weights and gradients use lower precision.
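For scale, assume Adam's two FP32 moment estimates on the 7B profile:

```python
GIB = 2**30
params = 7_000 * 1e6

adam_states = params * 2 * 4 / GIB   # two FP32 states per parameter: ~52.15 GB
fp16_weights = params * 2 / GIB      # ~13.04 GB
print(adam_states / fp16_weights)    # optimizer states are 4x the mixed weights
```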
3. Why include master weights?
Many mixed precision training pipelines keep high precision master copies for stability. That extra copy reduces apparent savings, so it should be modeled explicitly.
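Using the 7B numbers above, the master copy means weight-related memory actually grows under mixed precision, so the net saving has to come from gradients and activations:

```python
GIB = 2**30
params = 7_000 * 1e6

mixed_weights_plus_master = params * 2 / GIB + params * 4 / GIB  # ~39.12 GB
baseline_weights = params * 4 / GIB                              # ~26.08 GB
print(mixed_weights_plus_master - baseline_weights)              # ~13.04 GB extra
```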
4. Does this estimate inference memory too?
Yes. Inference memory is estimated from weights, activations, and runtime overhead. Master weights are excluded because inference typically runs only the deployed low-precision copy of the weights.
5. Can the calculator show worse results for mixed precision?
Yes. If master weights and optimizer states remain large, mixed precision may save less than expected. In unusual settings, total memory can even increase.
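A hypothetical configuration where that happens: FP16 weights with an FP32 master copy, but gradients kept in FP32 (for example, for accumulation). Plugging in the 7B numbers:

```python
GIB = 2**30
params = 7_000 * 1e6

mixed = (params * 2 / GIB        # FP16 weights        ~13.04 GB
         + params * 4 / GIB      # FP32 master weights ~26.08 GB
         + params * 4 / GIB      # FP32 gradients      ~26.08 GB
         + params * 2 * 4 / GIB  # FP32 Adam states    ~52.15 GB
         + 1.09 + 2.00)          # activations + overhead
print(f"{mixed:.2f} GB")         # ~120.44 GB, above the 108.03 GB baseline
```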
6. How should I estimate activation memory per sample?
Use profiler output, framework memory summaries, or empirical measurements from a representative batch. Divide the observed activation memory by batch size for a usable estimate.
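One way to measure it, sketched for PyTorch; the function name and measurement approach are illustrative, and a profiler trace gives finer detail:

```python
import torch

def activation_gb_per_sample(model: torch.nn.Module, batch: torch.Tensor) -> float:
    """Rough per-sample activation estimate from peak CUDA memory.
    Assumes model and batch already live on the GPU."""
    torch.cuda.reset_peak_memory_stats()
    resident = torch.cuda.memory_allocated()  # weights etc. already on the GPU
    out = model(batch)       # forward with autograd on, so activations stay live
    torch.cuda.synchronize()
    peak = torch.cuda.max_memory_allocated()
    del out
    return (peak - resident) / batch.shape[0] / 2**30
```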
7. Is the cost estimate exact?
No. It is a planning estimate based on memory-driven GPU count, hourly price, and training time. Networking, storage, utilization, and checkpoint overhead are excluded.
8. Can I use this for FP16, BF16, or FP8 planning?
Yes. Enter the byte assumptions that match your chosen format. The calculator works for any precision combination when your component sizes are realistic.
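Commonly used per-element byte widths (verify against your framework's actual dtypes), shown with the 7B weights-only footprint at each width:

```python
GIB = 2**30
BYTES = {"FP32": 4, "FP16": 2, "BF16": 2, "FP8": 1}

params = 7_000 * 1e6  # 7B example from the tables above
for fmt, width in BYTES.items():
    print(f"{fmt}: {params * width / GIB:.2f} GB")
```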