Measure raw checkpoint size across weight formats. Compare gradients, optimizer states, shards, and metadata instantly. Plan storage efficiently before training, export, backup, and deployment.
| Scenario | Trainable Params | Weight Precision | Optimizer | Raw Checkpoint | Compressed Checkpoint | Per Shard |
|---|---|---|---|---|---|---|
| Large mixed-precision training run | 1.30B + 10M adapters | FP16 / BF16 | AdamW, FP32 states | 19.63 GiB | 17.07 GiB | 2.13 GiB |
| Inference-only deployable artifact | 7.00B | INT8 | None | 6.77 GiB | 6.25 GiB | 1.56 GiB |
| Small experiment with resumable state | 220M | FP32 | Momentum | 3.28 GiB | 3.04 GiB | 0.76 GiB |
The example values are planning estimates. Actual file sizes depend on framework layout, serialization method, padding, and compression behavior.
1) Total stored weights
Weight Storage = (Trainable Parameters + Frozen Parameters + Adapter Parameters) × Weight Bytes
2) Gradient storage
Gradient Storage = Trainable Parameters × Gradient Bytes, only when gradients are saved
3) Optimizer state storage
Optimizer Storage = Trainable Parameters × Optimizer State Count × Optimizer State Bytes
4) Master weight storage
Master Weight Storage = Trainable Parameters × Master Weight Bytes, when enabled
5) EMA storage
EMA Storage = (Trainable Parameters + Adapter Parameters) × EMA Bytes, when enabled
6) Raw checkpoint size
Raw Checkpoint = Weights + Gradients + Optimizer States + Master Weights + EMA + Metadata
7) Compressed checkpoint size
Compressed Checkpoint = Raw Checkpoint ÷ Compression Ratio
8) Per-shard size
Per Shard = Compressed Checkpoint ÷ Number of Shards
9) Retained storage
Retained Storage = Compressed Checkpoint × Retained Checkpoints
10) Provisioned storage target
Provisioned Capacity = Retained Storage × Copies per Checkpoint × (1 + Safety Margin ÷ 100)
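The ten formulas above can be chained into one small estimator. The following is a minimal sketch, not the calculator's own implementation: the `CheckpointPlan` container, its field names, and the `estimate` function are hypothetical, and the sample inputs (220M FP32 parameters with saved gradients, one momentum state, and an EMA copy) are illustrative assumptions rather than the settings behind any table row.

```python
from dataclasses import dataclass

GIB = 1024 ** 3  # bytes per GiB


@dataclass
class CheckpointPlan:
    """Hypothetical input container; fields mirror the formula terms."""
    trainable_params: int
    frozen_params: int = 0
    adapter_params: int = 0
    weight_bytes: float = 2.0          # e.g. 2 for FP16/BF16, 4 for FP32, 1 for INT8
    gradient_bytes: float = 0.0        # 0 when gradients are not saved
    optimizer_state_count: int = 0     # e.g. 2 for Adam/AdamW, 1 for momentum
    optimizer_state_bytes: float = 4.0
    master_weight_bytes: float = 0.0   # 0 when master weights are disabled
    ema_bytes: float = 0.0             # 0 when EMA is disabled
    metadata_bytes: float = 0.0
    compression_ratio: float = 1.0
    num_shards: int = 1
    retained_checkpoints: int = 1
    copies_per_checkpoint: int = 1
    safety_margin_pct: float = 0.0


def estimate(p: CheckpointPlan) -> dict:
    """Apply formulas 1-10 in order; every result is in bytes."""
    weights = (p.trainable_params + p.frozen_params + p.adapter_params) * p.weight_bytes
    gradients = p.trainable_params * p.gradient_bytes
    optimizer = p.trainable_params * p.optimizer_state_count * p.optimizer_state_bytes
    master = p.trainable_params * p.master_weight_bytes
    ema = (p.trainable_params + p.adapter_params) * p.ema_bytes
    raw = weights + gradients + optimizer + master + ema + p.metadata_bytes
    compressed = raw / p.compression_ratio
    per_shard = compressed / p.num_shards
    retained = compressed * p.retained_checkpoints
    provisioned = retained * p.copies_per_checkpoint * (1 + p.safety_margin_pct / 100)
    return {
        "raw": raw,
        "compressed": compressed,
        "per_shard": per_shard,
        "retained": retained,
        "provisioned": provisioned,
    }


# Illustrative inputs only (assumed, not taken from the calculator).
plan = CheckpointPlan(
    trainable_params=220_000_000,
    weight_bytes=4, gradient_bytes=4,
    optimizer_state_count=1, optimizer_state_bytes=4,
    ema_bytes=4,
    compression_ratio=1.1, num_shards=4,
    retained_checkpoints=5, copies_per_checkpoint=2, safety_margin_pct=20,
)
for name, size in estimate(plan).items():
    print(f"{name:>12}: {size / GIB:.2f} GiB")
```

Keeping every intermediate value in bytes and converting to GiB only for display avoids mixing decimal GB and binary GiB partway through the calculation.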
This calculator estimates model checkpoint storage for weights, gradients, optimizer states, master weights, EMA copies, metadata, sharding, retention, and backup planning. It is designed for engineering estimates rather than framework-specific byte-perfect output.
Actual files vary because frameworks store tensors differently, add headers, align chunks, compress unevenly, and sometimes omit temporary states. Serialization format and filesystem behavior can also change the final size.
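Include gradients only when the checkpoint must restore them exactly, for example when it is saved partway through gradient accumulation. Gradients are normally recomputed on the first step after resuming, so for ordinary resumable checkpoints, inference-only artifacts, and lightweight deployment exports they are usually unnecessary and can be excluded.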
Master weights are higher-precision copies used during mixed precision training. They help numerical stability, but they also add storage overhead when saved inside resumable checkpoints.
Adam and AdamW commonly store two optimizer state tensors per trainable parameter. That usually makes resumable training checkpoints far larger than plain weight-only exports.
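To see how much the two Adam states add, it helps to count bytes per trainable parameter. This is a rough sketch under stated assumptions: BF16 weights (2 bytes), FP32 optimizer states (4 bytes each), and no gradients, master weights, or EMA copies in the checkpoint. The helper name `bytes_per_trainable_param` is made up for illustration.

```python
def bytes_per_trainable_param(weight_bytes: float,
                              state_count: int,
                              state_bytes: float) -> float:
    """Weight bytes plus optimizer state bytes for a single parameter."""
    return weight_bytes + state_count * state_bytes


# Assumed precisions: BF16 weights, FP32 Adam/AdamW states.
weights_only = bytes_per_trainable_param(2, state_count=0, state_bytes=0)
resumable = bytes_per_trainable_param(2, state_count=2, state_bytes=4)

print(weights_only)              # 2 bytes per parameter
print(resumable)                 # 10 bytes per parameter
print(resumable / weights_only)  # 5.0
```

Under these assumptions the resumable checkpoint is about five times the size of the weight-only export, before any master weights or EMA copies are added.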
A compression ratio above 1 reduces the final file size. For example, a ratio of 1.20 means the compressed output is the raw size divided by 1.20.
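As a quick check of that division, the sketch below applies the 1.20 ratio to the 19.63 GiB raw size from the first example row. The function name `compressed_size` is hypothetical, and pairing that row with a 1.20 ratio is an illustrative assumption, not the ratio the table itself uses.

```python
def compressed_size(raw_size: float, compression_ratio: float) -> float:
    """Compressed Checkpoint = Raw Checkpoint / Compression Ratio."""
    if compression_ratio <= 0:
        raise ValueError("compression ratio must be positive")
    return raw_size / compression_ratio


raw_gib = 19.63  # raw size from the first example row
print(round(compressed_size(raw_gib, 1.20), 2))  # 16.36
print(compressed_size(raw_gib, 1.00))            # 19.63, a ratio of 1 changes nothing
```

A ratio below 1 would make the estimate larger than the raw size, which is a useful way to model formats that add overhead instead of removing it.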
Teams rarely keep only one checkpoint. Retention matters for rollback, auditing, recovery, and milestone comparison. Planning with retained copies avoids underestimating storage capacity needs.
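The retention and provisioning formulas can be combined into one step. In this sketch the function name and the policy values (5 retained checkpoints, 2 copies of each, a 20% safety margin) are assumptions chosen for illustration; only the 17.07 GiB compressed size comes from the first example row.

```python
def provisioned_capacity(compressed_gib: float,
                         retained_checkpoints: int,
                         copies_per_checkpoint: int,
                         safety_margin_pct: float) -> float:
    """Retained storage times copies, scaled up by the safety margin."""
    retained = compressed_gib * retained_checkpoints
    return retained * copies_per_checkpoint * (1 + safety_margin_pct / 100)


# Assumed policy: keep 5 checkpoints, 2 copies each, 20% headroom.
target = provisioned_capacity(17.07, retained_checkpoints=5,
                              copies_per_checkpoint=2, safety_margin_pct=20)
print(f"{target:.2f} GiB")  # 204.84 GiB
```

A single 17.07 GiB checkpoint turns into roughly 205 GiB of provisioned capacity under this policy, which is why sizing from one file alone tends to underestimate.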
Shards split one checkpoint into smaller files, which helps distribution, upload limits, parallel transfers, and recovery workflows. The total storage stays similar, but each file becomes easier to handle.
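When an upload or file-size limit drives the shard count, the per-shard formula can be run in reverse. The sketch below is an illustration only: `shards_needed` and the 5 GiB per-file limit are hypothetical, and the 8-shard count is inferred from the first example row (17.07 GiB compressed, 2.13 GiB per shard) rather than stated by the calculator.

```python
import math


def per_shard_size(compressed_gib: float, num_shards: int) -> float:
    """Per Shard = Compressed Checkpoint / Number of Shards."""
    return compressed_gib / num_shards


def shards_needed(compressed_gib: float, max_file_gib: float) -> int:
    """Smallest shard count that keeps each file at or under the limit."""
    return max(1, math.ceil(compressed_gib / max_file_gib))


compressed_gib = 17.07  # compressed size from the first example row

# Assumed shard count of 8, inferred from the table's per-shard value.
print(round(per_shard_size(compressed_gib, 8), 2))  # 2.13

# Hypothetical 5 GiB per-file limit.
count = shards_needed(compressed_gib, max_file_gib=5.0)
print(count)                                             # 4
print(round(per_shard_size(compressed_gib, count), 2))   # 4.27
```

This assumes shards are equal in size; real frameworks often split on tensor boundaries, so individual files can differ somewhat from the average.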