Plan model runs without running out of memory. Adjust batch, precision, and optimizer settings instantly. See totals and margins, and download results in one click.
| Scenario | Mode | Params | Precision | Batch | Seq | Layers | Notes |
|---|---|---|---|---|---|---|---|
| LLM fine-tune | Training | 700M | FP16 | 8 | 2048 | 24 | Adam, moderate activations, typical overhead. |
| Long-context inference | Inference | 1.3B | INT8 | 2 | 8192 | 32 | KV cache dominates at high sequence lengths. |
| Vision model training | Training | 300M | BF16 | 16 | 1024 | 48 | Use manual activation elements if you profiled feature maps. |
Tip: Use the manual activation field when you have profiler numbers.
This estimator favors conservative, planning-friendly values over exact kernel-level accounting.
Memory limits decide if a run starts or fails. Training adds gradients and optimizer states. Inference adds cache at long contexts. Planning reduces restarts and wasted time.
Parameter memory equals parameters times bytes per value. A 1B model in FP16 stores about 2.00 GB for weights. In FP32 it stores about 4.00 GB. Quantized formats can reduce this further.
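As a sketch, the weight-memory rule above comes down to a few lines of Python. The byte sizes per format are common conventions, not values taken from this calculator:

```python
# Bytes per stored value for common formats (conventional sizes).
BYTES_PER_VALUE = {"FP32": 4, "FP16": 2, "BF16": 2, "INT8": 1}

def weight_memory_gb(params: float, precision: str) -> float:
    """Parameter memory = parameter count x bytes per value (1 GB = 1e9 bytes)."""
    return params * BYTES_PER_VALUE[precision] / 1e9

print(weight_memory_gb(1e9, "FP16"))  # 2.0
print(weight_memory_gb(1e9, "FP32"))  # 4.0
```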
Training stores gradients for most parameters. Adam typically keeps two FP32 state buffers (momentum and variance). That adds about 8 bytes per parameter, roughly 2× FP32 weight memory. Mixed precision may also keep an FP32 master copy of the weights. This adds about 4 more bytes per parameter.
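A minimal sketch of that training-state arithmetic, assuming FP16 gradients, two FP32 Adam buffers, and an FP32 master copy. The per-parameter byte counts are assumptions; adjust them to match your stack:

```python
def training_state_gb(params: float,
                      grad_bytes: int = 2,        # FP16 gradients (assumed)
                      adam_state_bytes: int = 8,  # two FP32 Adam buffers
                      master_bytes: int = 4) -> float:
    """Extra training state beyond the weights themselves, in GB."""
    return params * (grad_bytes + adam_state_bytes + master_bytes) / 1e9

# A 700M-parameter model carries roughly this much extra training state:
print(training_state_gb(700e6))  # 9.8
```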
Activation memory grows with batch size, sequence length, hidden size, and layer count. It often dominates large batches. Checkpointing can cut activation storage. Use manual activation elements when you have profiler numbers.
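One way to sketch that scaling, with a hypothetical hidden size and a multiplier in the 2.5–4.0 range typical for transformers. All numbers here are illustrative assumptions, not profiled values:

```python
def activation_memory_gb(batch: int, seq: int, hidden: int, layers: int,
                         multiplier: float = 3.0,    # assumed transformer factor
                         bytes_per_value: int = 2) -> float:
    """Activation storage grows with batch x seq x hidden x layers."""
    return batch * seq * hidden * layers * multiplier * bytes_per_value / 1e9

# Hypothetical 24-layer model, hidden size 1536, batch 8, seq 2048:
print(round(activation_memory_gb(8, 2048, 1536, 24), 2))  # 3.62
```

Checkpointing or a profiled manual value would replace this rough product with a measured number.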
Transformer inference stores key and value tensors. Cache scales with batch, sequence, hidden size, and layers. Doubling sequence length can nearly double cache use. Lower activation precision can reduce cache size.
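The cache scaling above can be sketched like this. The factor of 2 covers the separate key and value tensors; the hidden size and INT8 cache precision are hypothetical example values:

```python
def kv_cache_gb(batch: int, seq: int, hidden: int, layers: int,
                bytes_per_value: int = 1) -> float:  # INT8 cache assumed
    """KV cache: 2 tensors (K and V) x batch x seq x hidden x layers."""
    return 2 * batch * seq * hidden * layers * bytes_per_value / 1e9

short_ctx = kv_cache_gb(2, 4096, 2048, 32)
long_ctx = kv_cache_gb(2, 8192, 2048, 32)
print(short_ctx, long_ctx)  # doubling the sequence length doubles the cache
```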
Runtime overhead comes from buffers and fragmentation. A 10–20% overhead is common in practice. Add a safety margin to avoid edge failures. Use the graph to spot the biggest contributors and tune settings.
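Overhead and safety margin compose multiplicatively in this kind of planning math. A minimal sketch, with the 15% overhead and 10% margin chosen as illustrative defaults:

```python
def planned_total_gb(subtotal_gb: float,
                     overhead: float = 0.15,  # buffers, fragmentation
                     margin: float = 0.10) -> float:
    """Inflate the raw estimate so allocation spikes stay under the limit."""
    return subtotal_gb * (1 + overhead) * (1 + margin)

# A 10 GB raw estimate becomes about 12.65 GB of planned capacity:
print(round(planned_total_gb(10.0), 2))  # 12.65
```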
Training stores gradients and optimizer states. These can exceed weight memory. Inference usually stores weights and cache only. The gap grows with Adam and large batches.
The activation multiplier approximates the extra tensors stored per layer. It captures attention and MLP intermediates. Values from 2.5 to 4.0 are common for transformers. Use profiling to refine it.
Use the manual activation field when you have measured activations with a profiler. This works well for vision models and custom graphs. It also helps when sequence length is not meaningful.
Quantization mostly reduces weight storage. Activations can still be FP16 or FP32 in many kernels. If your stack supports lower activation precision, set it here to estimate the impact.
Allocators fragment memory. Frameworks add buffers and workspaces. Overhead models this behavior. Safety margin adds headroom for spikes. Together they reduce out-of-memory errors.
This estimator is a planning model, not a profiler. Exact usage depends on kernels, sharding, and runtime settings. Use it to compare scenarios, then validate with a small test run.
Important Note: All the calculators listed on this site are for educational purposes only, and we do not guarantee the accuracy of results. Please consult other sources as well.