Calculated Savings Overview
This result compares a full precision baseline against a mixed precision setup using your current assumptions for weights, gradients, optimizer states, activations, and runtime overhead.
| Memory Component | Baseline Training | Mixed Training |
|---|---|---|
| Weights | 26.08 GB | 13.04 GB |
| Master Weights | 0.00 GB | 26.08 GB |
| Gradients | 26.08 GB | 13.04 GB |
| Optimizer States | 52.15 GB | 52.15 GB |
| Activations | 1.72 GB | 1.09 GB |
| Runtime Overhead | 2.00 GB | 2.00 GB |
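These totals match the 7B Language Model profile in the example table below. A minimal check in Python, assuming 7 billion parameters, 4-byte FP32 baseline values, 2-byte FP16/BF16 mixed values, and an Adam-style optimizer with two FP32 states per parameter; the activation and overhead figures are taken directly from the table:

```python
GIB = 2**30  # the tables report bytes / 2**30, labeled GB

params = 7_000 * 1e6               # 7B-parameter profile
act_base, act_mixed = 1.72, 1.09   # activation totals from the table, GB
overhead = 2.00                    # runtime overhead, GB

# Baseline: FP32 (4-byte) weights and gradients, two FP32 optimizer states per parameter
weights_fp32 = params * 4 / GIB        # 26.08 GB
grads_fp32 = params * 4 / GIB          # 26.08 GB
opt_states = params * 2 * 4 / GIB      # 52.15 GB
baseline = weights_fp32 + grads_fp32 + opt_states + act_base + overhead  # ~108.03 GB

# Mixed: FP16/BF16 (2-byte) weights and gradients, plus an FP32 master copy
weights_fp16 = params * 2 / GIB        # 13.04 GB
master = weights_fp32                  # 26.08 GB
grads_fp16 = params * 2 / GIB          # 13.04 GB
mixed = weights_fp16 + master + grads_fp16 + opt_states + act_mixed + overhead  # ~107.40 GB

print(f"baseline {baseline:.2f} GB, mixed {mixed:.2f} GB")
```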
Calculator Inputs
Enter baseline and mixed precision assumptions. The calculator estimates training memory, inference memory, required GPU count, maximum batch size, and budget impact.
Example Data Table
These examples show how memory drops when weights, gradients, and activations move to lower precision while optimizer states stay in full precision.
| Profile | Parameters (M) | Batch Size | Baseline Train Memory | Mixed Train Memory | Memory Saved | Savings |
|---|---|---|---|---|---|---|
| Vision Model | 50 | 64 | 3.07 GB | 2.63 GB | 0.44 GB | 14.25% |
| Base Transformer | 355 | 16 | 7.73 GB | 7.38 GB | 0.34 GB | 4.45% |
| 7B Language Model | 7,000 | 8 | 108.03 GB | 107.40 GB | 0.63 GB | 0.58% |
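Applying the savings formula from the next section to the 7B row shows why the percentage shrinks when optimizer states dominate:

```python
baseline, mixed = 108.03, 107.40     # 7B Language Model row, GB
saved = baseline - mixed             # 0.63 GB
savings_pct = saved / baseline * 100 # ~0.58%
print(f"saved {saved:.2f} GB ({savings_pct:.2f}%)")
```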
Formula Used
- Baseline Training Memory = Weights + Gradients + Optimizer States + Activations + Runtime Overhead
- Mixed Training Memory = Mixed Weights + Master Weights + Mixed Gradients + Mixed Optimizer States + Mixed Activations + Runtime Overhead
- Inference Memory = Weights + Activations + Runtime Overhead
- Weights = Parameters × Weight Bytes
- Gradients = Parameters × Gradient Bytes
- Optimizer States = Parameters × Optimizer States Per Parameter × State Bytes
- Activations = Activation Memory Per Sample × Batch Size
- Training Savings (%) = ((Baseline Training Memory − Mixed Training Memory) ÷ Baseline Training Memory) × 100
- Required GPU Count = Ceiling(Training Memory ÷ Selected GPU Memory)
- Training Cost = Required GPU Count × GPU Hourly Rate × Training Hours
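A compact sketch of these formulas in Python. Function and parameter names are illustrative, not the calculator's actual code; memory is reported as bytes ÷ 2^30 to match the tables above. The usage lines reproduce the 7B profile from the example table.

```python
import math

GIB = 2**30  # the tables report bytes / 2**30, labeled GB

def training_memory_gb(params_m, weight_bytes, grad_bytes,
                       opt_states_per_param, state_bytes,
                       act_gb_per_sample, batch_size, overhead_gb,
                       master_weight_bytes=0):
    """Training memory in GB; master_weight_bytes=0 means no master copy is kept."""
    n = params_m * 1e6
    weights = n * weight_bytes / GIB
    master = n * master_weight_bytes / GIB
    grads = n * grad_bytes / GIB
    opt_states = n * opt_states_per_param * state_bytes / GIB
    activations = act_gb_per_sample * batch_size
    return weights + master + grads + opt_states + activations + overhead_gb

def inference_memory_gb(params_m, weight_bytes, act_gb_per_sample,
                        batch_size, overhead_gb):
    """Inference memory in GB; master weights are excluded by design."""
    n = params_m * 1e6
    return n * weight_bytes / GIB + act_gb_per_sample * batch_size + overhead_gb

def savings_pct(baseline_gb, mixed_gb):
    return (baseline_gb - mixed_gb) / baseline_gb * 100

def required_gpus(train_mem_gb, gpu_mem_gb):
    return math.ceil(train_mem_gb / gpu_mem_gb)

def training_cost(gpu_count, hourly_rate, hours):
    return gpu_count * hourly_rate * hours

# Reproduces the 7B profile: FP32 baseline vs FP16 weights/gradients plus FP32 master copy.
baseline = training_memory_gb(7000, 4, 4, 2, 4, 1.72 / 8, 8, 2.0)   # ~108.03 GB
mixed = training_memory_gb(7000, 2, 2, 2, 4, 1.09 / 8, 8, 2.0,
                           master_weight_bytes=4)                   # ~107.40 GB
print(f"{baseline:.2f} GB -> {mixed:.2f} GB ({savings_pct(baseline, mixed):.2f}%)")
```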
How to Use This Calculator
- Enter your model size in millions of parameters.
- Set baseline bytes for weights, gradients, and optimizer states.
- Enter the mixed precision bytes for each matching component.
- Choose whether training keeps FP32 master weights.
- Add activation memory per sample for both setups.
- Enter batch size, runtime overhead, GPU memory, GPU hourly rate, and training hours.
- Click Calculate Savings to generate the comparison.
- Review memory totals, GPU count, cost delta, and graph.
- Use the CSV or PDF buttons to export results.
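As a worked example of the last two formulas, here is the budget step for the 7B mixed profile; the GPU size, hourly rate, and training time are placeholder assumptions.

```python
import math

train_mem_gb = 107.40   # mixed training memory for the 7B profile
gpu_mem_gb = 80.0       # assumed 80 GB accelerator
hourly_rate = 2.50      # assumed $/GPU-hour
hours = 120             # assumed training duration

gpus = math.ceil(train_mem_gb / gpu_mem_gb)   # 2 GPUs
cost = gpus * hourly_rate * hours             # $600.00
print(f"{gpus} GPUs, ${cost:,.2f}")
```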
FAQs
1. What does mixed precision usually save?
Mixed precision often cuts weight, gradient, and activation memory. Savings depend on whether optimizer states stay larger and whether master weights are retained during training.
2. Why are optimizer states important here?
Optimizer states can dominate training memory, especially with Adam. If they remain in larger formats, total savings shrink even when weights and gradients use lower precision.
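For scale, assume Adam's two FP32 moment estimates on the 7B profile:

```python
GIB = 2**30
params = 7_000 * 1e6

adam_states = params * 2 * 4 / GIB   # two FP32 states per parameter: ~52.15 GB
fp16_weights = params * 2 / GIB      # ~13.04 GB
print(adam_states / fp16_weights)    # optimizer states are 4x the mixed weights
```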
3. Why include master weights?
Many mixed precision training pipelines keep high precision master copies for stability. That extra copy reduces apparent savings, so it should be modeled explicitly.
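Using the 7B numbers above, the master copy means weight-related memory actually grows under mixed precision, so the net saving has to come from gradients and activations:

```python
GIB = 2**30
params = 7_000 * 1e6

mixed_weights_plus_master = params * 2 / GIB + params * 4 / GIB  # ~39.12 GB
baseline_weights = params * 4 / GIB                              # ~26.08 GB
print(mixed_weights_plus_master - baseline_weights)              # ~13.04 GB extra
```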
4. Does this estimate inference memory too?
Yes. Inference memory is estimated from weights, activations, and runtime overhead. Master weights are excluded because inference typically runs only the deployed low-precision copy of the weights.
5. Can the calculator show worse results for mixed precision?
Yes. If master weights and optimizer states remain large, mixed precision may save less than expected. In unusual settings, total memory can even increase.
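A hypothetical configuration where that happens: FP16 weights with an FP32 master copy, but gradients kept in FP32 (for example, for accumulation). Plugging in the 7B numbers:

```python
GIB = 2**30
params = 7_000 * 1e6

mixed = (params * 2 / GIB        # FP16 weights        ~13.04 GB
         + params * 4 / GIB      # FP32 master weights ~26.08 GB
         + params * 4 / GIB      # FP32 gradients      ~26.08 GB
         + params * 2 * 4 / GIB  # FP32 Adam states    ~52.15 GB
         + 1.09 + 2.00)          # activations + overhead
print(f"{mixed:.2f} GB")         # ~120.44 GB, above the 108.03 GB baseline
```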
6. How should I estimate activation memory per sample?
Use profiler output, framework memory summaries, or empirical measurements from a representative batch. Divide the observed activation memory by batch size for a usable estimate.
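One way to measure it, sketched for PyTorch; the function name and measurement approach are illustrative, and a profiler trace gives finer detail:

```python
import torch

def activation_gb_per_sample(model: torch.nn.Module, batch: torch.Tensor) -> float:
    """Rough per-sample activation estimate from peak CUDA memory.
    Assumes model and batch already live on the GPU."""
    torch.cuda.reset_peak_memory_stats()
    resident = torch.cuda.memory_allocated()  # weights etc. already on the GPU
    out = model(batch)       # forward with autograd on, so activations stay live
    torch.cuda.synchronize()
    peak = torch.cuda.max_memory_allocated()
    del out
    return (peak - resident) / batch.shape[0] / 2**30
```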
7. Is the cost estimate exact?
No. It is a planning estimate based on memory-driven GPU count, hourly price, and training time. Networking, storage, utilization, and checkpoint overhead are excluded.
8. Can I use this for FP16, BF16, or FP8 planning?
Yes. Enter the byte assumptions that match your chosen format. The calculator works for any precision combination when your component sizes are realistic.
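Commonly used per-element byte widths (verify against your framework's actual dtypes), shown with the 7B weights-only footprint at each width:

```python
GIB = 2**30
BYTES = {"FP32": 4, "FP16": 2, "BF16": 2, "FP8": 1}

params = 7_000 * 1e6  # 7B example from the tables above
for fmt, width in BYTES.items():
    print(f"{fmt}: {params * width / GIB:.2f} GB")
```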