Checkpoint Size Calculator

Estimate raw checkpoint size across weight formats. Compare the storage contributed by gradients, optimizer states, shards, and metadata. Plan storage before training, export, backup, and deployment.

Calculator Inputs


Main trainable parameter count.
Stored in weights, but excluded from gradients.
LoRA, heads, embeddings, or custom additions.
Applies to all parameter fields.
Precision used for stored model weights.
Useful when saving resumable training states.
Applied only when gradients are included.
State count drives optimizer checkpoint storage.
Resumable training checkpoints commonly use FP32 states.
Often used in mixed precision training.
Applied only when master weights exist.
Adds another stored parameter copy.
Applied only when EMA is enabled.
Config, tokenizer, manifest, and extra files.
Used for metadata overhead only.
Example: 1.15 means the raw size is divided by 1.15, giving a file about 13% smaller.
Compressed checkpoint is divided across these shards.
How many checkpoint versions you keep.
Primary plus backups or replicas.
Extra provisioning headroom for operations.
Controls how results are shown.

Example Data Table

| Scenario | Trainable Params | Weight Precision | Optimizer | Raw Checkpoint | Compressed Checkpoint | Per Shard |
| --- | --- | --- | --- | --- | --- | --- |
| Large mixed-precision training run | 1.30B + 10M adapters | FP16 / BF16 | AdamW, FP32 states | 19.63 GiB | 17.07 GiB | 2.13 GiB |
| Inference-only deployable artifact | 7.00B | INT8 | None | 6.77 GiB | 6.25 GiB | 1.56 GiB |
| Small experiment with resumable state | 220M | FP32 | Momentum | 3.28 GiB | 3.04 GiB | 0.76 GiB |

The example values are planning estimates. Actual file sizes depend on framework layout, serialization method, padding, and compression behavior.

Formula Used

Total checkpoint size depends on parameter count, bytes per stored value, optimizer states, extra copies, metadata, compression, and retention strategy.

1) Total stored weights
Weight Storage = (Trainable Parameters + Frozen Parameters + Adapter Parameters) × Weight Bytes

2) Gradient storage
Gradient Storage = Trainable Parameters × Gradient Bytes, only when gradients are saved

3) Optimizer state storage
Optimizer Storage = Trainable Parameters × Optimizer State Count × Optimizer State Bytes

4) Master weight storage
Master Weight Storage = Trainable Parameters × Master Weight Bytes, when enabled

5) EMA storage
EMA Storage = (Trainable Parameters + Adapter Parameters) × EMA Bytes, when enabled

6) Raw checkpoint size
Raw Checkpoint = Weights + Gradients + Optimizer States + Master Weights + EMA + Metadata

7) Compressed checkpoint size
Compressed Checkpoint = Raw Checkpoint ÷ Compression Ratio

8) Per-shard size
Per Shard = Compressed Checkpoint ÷ Number of Shards

9) Retained storage
Retained Storage = Compressed Checkpoint × Retained Checkpoints

10) Provisioned storage target
Provisioned Capacity = Retained Storage × Copies per Checkpoint × (1 + Safety Margin ÷ 100)
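
The ten formulas above can be chained into a single function. The following is a minimal Python sketch, not the calculator's own implementation: the `CheckpointPlan` class, its field names, and the default values are hypothetical and chosen only to mirror the terms in the formulas. All sizes are in bytes.

```python
from dataclasses import dataclass

GIB = 1024 ** 3  # bytes in one GiB


@dataclass
class CheckpointPlan:
    """Inputs to the estimate. Field names are hypothetical and mirror the formulas."""
    trainable_params: float
    frozen_params: float = 0.0
    adapter_params: float = 0.0
    weight_bytes: float = 2.0            # assumed default: FP16 / BF16
    save_gradients: bool = False
    gradient_bytes: float = 2.0
    optimizer_state_count: int = 2       # assumed default: Adam / AdamW
    optimizer_state_bytes: float = 4.0   # assumed default: FP32 states
    save_master_weights: bool = False
    master_weight_bytes: float = 4.0
    save_ema: bool = False
    ema_bytes: float = 4.0
    metadata_bytes: float = 0.0
    compression_ratio: float = 1.0
    num_shards: int = 1
    retained_checkpoints: int = 1
    copies_per_checkpoint: int = 1
    safety_margin_pct: float = 0.0


def estimate(plan: CheckpointPlan) -> dict:
    """Apply formulas 1 to 10 in order and return every intermediate size in bytes."""
    if plan.compression_ratio <= 0 or plan.num_shards < 1:
        raise ValueError("compression_ratio must be > 0 and num_shards >= 1")

    # 1) Weights cover trainable, frozen, and adapter parameters.
    weights = (plan.trainable_params + plan.frozen_params
               + plan.adapter_params) * plan.weight_bytes
    # 2) Gradients exist only for trainable parameters, and only when saved.
    gradients = plan.trainable_params * plan.gradient_bytes if plan.save_gradients else 0.0
    # 3) Optimizer states scale with the number of state tensors per parameter.
    optimizer = (plan.trainable_params * plan.optimizer_state_count
                 * plan.optimizer_state_bytes)
    # 4) Optional higher-precision master copy of the trainable weights.
    master = plan.trainable_params * plan.master_weight_bytes if plan.save_master_weights else 0.0
    # 5) Optional EMA copy of trainable plus adapter parameters.
    ema = ((plan.trainable_params + plan.adapter_params) * plan.ema_bytes
           if plan.save_ema else 0.0)

    raw = weights + gradients + optimizer + master + ema + plan.metadata_bytes  # 6
    compressed = raw / plan.compression_ratio                                   # 7
    per_shard = compressed / plan.num_shards                                    # 8
    retained = compressed * plan.retained_checkpoints                           # 9
    provisioned = (retained * plan.copies_per_checkpoint
                   * (1 + plan.safety_margin_pct / 100))                        # 10
    return {
        "weights": weights, "gradients": gradients, "optimizer": optimizer,
        "master": master, "ema": ema, "raw": raw, "compressed": compressed,
        "per_shard": per_shard, "retained": retained, "provisioned": provisioned,
    }


# Usage: a weight-only INT8 export of a 7B-parameter model, no metadata assumed.
inference = CheckpointPlan(trainable_params=7e9, weight_bytes=1, optimizer_state_count=0)
print(f"{estimate(inference)['raw'] / GIB:.2f} GiB")
```

The usage line covers weights only, about 6.52 GiB, so it will not match a figure that also includes metadata overhead. Returning every intermediate value rather than just the final capacity makes it easier to see which term dominates a given configuration.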

How to Use This Calculator

  1. Enter the trainable, frozen, and extra parameter counts.
  2. Select the parameter unit, such as million or billion.
  3. Choose the stored precision for weights.
  4. Enable or disable gradients, master weights, and EMA copies.
  5. Select the optimizer type and optimizer state precision.
  6. Add metadata overhead for configs, tokenizers, and manifests.
  7. Set compression ratio, shard count, retention count, backup copies, and safety margin.
  8. Press the calculate button to show results above the form.
  9. Review the graph, summary table, and planning notes.
  10. Download a CSV or PDF file for documentation or budgeting.

Frequently Asked Questions

1) What does this calculator estimate?

It estimates model checkpoint storage for weights, gradients, optimizer states, master weights, EMA copies, metadata, sharding, retention, and backup planning. It is designed for engineering estimates rather than framework-specific byte-perfect output.

2) Why can real files differ from the estimate?

Actual files vary because frameworks store tensors differently, add headers, align chunks, compress unevenly, and sometimes omit temporary states. Serialization format and filesystem behavior can also change the final size.

3) When should I include gradients?

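Include gradients only if your workflow actually writes them to disk, for example when a checkpoint is taken partway through gradient accumulation and must resume exactly from that point. Most training loops recompute gradients on the next step, so typical resumable checkpoints omit them. Inference-only artifacts and lightweight deployment exports never need them.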

4) What are master weights?

Master weights are higher-precision copies used during mixed precision training. They help numerical stability, but they also add storage overhead when saved inside resumable checkpoints.

5) Why is Adam often much larger?

Adam and AdamW commonly store two optimizer state tensors per trainable parameter. That usually makes resumable training checkpoints far larger than plain weight-only exports.
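
To make the size difference concrete, here is a small Python sketch that compares optimizer overhead per parameter. The `STATE_COUNT` mapping and both function names are illustrative assumptions; the sketch ignores frozen parameters, gradients, master weights, EMA copies, and metadata.

```python
# State tensors stored per trainable parameter (illustrative mapping).
STATE_COUNT = {"sgd": 0, "momentum": 1, "adam": 2, "adamw": 2}


def optimizer_state_bytes(trainable_params, optimizer, state_bytes=4):
    """Bytes of optimizer state for the given optimizer and state precision."""
    return trainable_params * STATE_COUNT[optimizer] * state_bytes


def resumable_to_weights_ratio(optimizer, weight_bytes=2, state_bytes=4):
    """Size of (weights + optimizer states) relative to weights alone."""
    return (weight_bytes + STATE_COUNT[optimizer] * state_bytes) / weight_bytes


for name in STATE_COUNT:
    print(name, resumable_to_weights_ratio(name))
```

With FP16 weights and FP32 states, two state tensors add 8 bytes per parameter on top of 2 bytes of weights, so the weights-plus-optimizer total is five times the weight-only size under these assumptions.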

6) What does the compression ratio mean?

A compression ratio above 1 reduces the final file size. For example, a ratio of 1.20 means the compressed output is the raw size divided by 1.20.
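
A ratio is easy to misread as a percentage, so this short Python sketch converts between the two. The function names are hypothetical and the code only restates the division described above.

```python
def compressed_size(raw_bytes, ratio):
    """Compressed size is the raw size divided by the compression ratio."""
    if ratio <= 0:
        raise ValueError("compression ratio must be positive")
    return raw_bytes / ratio


def percent_saved(ratio):
    """Percentage reduction in size implied by a compression ratio."""
    if ratio <= 0:
        raise ValueError("compression ratio must be positive")
    return (1 - 1 / ratio) * 100


print(f"{percent_saved(1.20):.1f}% smaller")
```

A ratio of 1.20 therefore saves about 16.7%, not 20%, and a ratio of exactly 1 leaves the size unchanged.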

7) Why should I plan for retained checkpoints?

Teams rarely keep only one checkpoint. Retention matters for rollback, auditing, recovery, and milestone comparison. Planning with retained copies avoids underestimating storage capacity needs.

8) How do shards help in practice?

Shards split one checkpoint into smaller files, which helps distribution, upload limits, parallel transfers, and recovery workflows. The total storage stays similar, but each file becomes easier to handle.
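
The sketch below shows the per-shard division and the reverse question of how many shards a per-file size limit requires. Both function names and the idea of a maximum shard size are assumptions for illustration; the calculator itself takes the shard count as an input.

```python
import math


def per_shard_size(compressed_size, num_shards):
    """Even split of the compressed checkpoint across shards."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    return compressed_size / num_shards


def shards_needed(compressed_size, max_shard_size):
    """Smallest shard count that keeps every shard at or under max_shard_size."""
    if max_shard_size <= 0:
        raise ValueError("max_shard_size must be positive")
    return max(1, math.ceil(compressed_size / max_shard_size))


# Sizes in GiB; any unit works as long as both arguments use the same one.
print(f"{per_shard_size(17.07, 8):.2f} GiB per shard")
print(shards_needed(17.07, 5.0), "shards for a 5 GiB per-file limit")
```

Splitting 17.07 GiB across 8 shards gives about 2.13 GiB each. Real shard sizes are rarely perfectly even, since frameworks usually split on tensor boundaries.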


Important note: All calculators listed on this site are for educational purposes only, and we do not guarantee the accuracy of results. Please consult other sources as well.