Checkpoint Size Calculator

Calculator Inputs

Trainable Parameters

Main trainable parameter count.

Frozen Parameters

Stored with the weights, but excluded from gradient, optimizer state, and master weight storage.

Adapter or Extra Parameters

LoRA, heads, embeddings, or custom additions.

Parameter Unit

Applies to all parameter fields.

Weight Precision

Precision used for stored model weights.

Include Gradients

Useful when saving resumable training states.

Gradient Precision

Applied only when gradients are included.

Optimizer Type

The number of state tensors kept per trainable parameter drives optimizer checkpoint storage.

Optimizer State Precision

Resumable training checkpoints commonly store optimizer states in FP32.

Include Master Weights

Often used in mixed precision training.

Master Weight Precision

Applied only when master weights exist.

Include EMA Shadow Copy

Adds another stored parameter copy.

EMA Precision

Applied only when EMA is enabled.

Metadata Overhead

Config, tokenizer, manifest, and extra files.

Metadata Unit

Used for metadata overhead only.

Compression Ratio

Example: 1.15 divides the raw size by 1.15, giving a file about 13% smaller.

Shard Count

The compressed checkpoint is divided evenly across this many shards.

Retained Checkpoints

How many checkpoint versions you keep.

Copies per Checkpoint

Primary plus backups or replicas.

Safety Margin %

Extra provisioning headroom for operations.

Display Unit

Controls how results are shown.

Example Data Table

Scenario	Trainable Params	Weight Precision	Optimizer	Raw Checkpoint	Compressed Checkpoint	Per Shard
Large mixed-precision training run	1.30B + 10M adapters	FP16 / BF16	AdamW, FP32 states	19.63 GiB	17.07 GiB	2.13 GiB
Inference-only deployable artifact	7.00B	INT8	None	6.77 GiB	6.25 GiB	1.56 GiB
Small experiment with resumable state	220M	FP32	Momentum	3.28 GiB	3.04 GiB	0.76 GiB

The example values are planning estimates. Actual file sizes depend on framework layout, serialization method, padding, and compression behavior.
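The last three columns of the table are linked by the compression and sharding steps, so the settings behind each row can be estimated from the published sizes. The sketch below is only an illustration: the function name `implied_settings` is hypothetical, and because the table values are rounded to two decimals, the recovered ratio and shard count are approximations rather than the calculator's actual inputs.

```python
# (raw GiB, compressed GiB, per-shard GiB) copied from the example table.
rows = {
    "Large mixed-precision training run": (19.63, 17.07, 2.13),
    "Inference-only deployable artifact": (6.77, 6.25, 1.56),
    "Small experiment with resumable state": (3.28, 3.04, 0.76),
}


def implied_settings(raw, compressed, per_shard):
    """Estimate the compression ratio and shard count behind one table row."""
    ratio = round(raw / compressed, 2)       # raw divided by compressed size
    shards = round(compressed / per_shard)   # compressed divided by per-shard size
    return ratio, shards


for name, sizes in rows.items():
    ratio, shards = implied_settings(*sizes)
    print(f"{name}: ratio ~{ratio}, shards ~{shards}")
```

Under these assumptions the first row works out to a ratio of about 1.15 across 8 shards, and the other two rows to about 1.08 across 4 shards.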

Formula Used

Total checkpoint size depends on parameter count, bytes per stored value, optimizer states, extra copies, metadata, compression, and retention strategy.

1) Total stored weights
Weight Storage = (Trainable Parameters + Frozen Parameters + Adapter Parameters) × Weight Bytes

2) Gradient storage
Gradient Storage = Trainable Parameters × Gradient Bytes, only when gradients are saved

3) Optimizer state storage
Optimizer Storage = Trainable Parameters × Optimizer State Count × Optimizer State Bytes

4) Master weight storage
Master Weight Storage = Trainable Parameters × Master Weight Bytes, when enabled

5) EMA storage
EMA Storage = (Trainable Parameters + Adapter Parameters) × EMA Bytes, when enabled

6) Raw checkpoint size
Raw Checkpoint = Weights + Gradients + Optimizer States + Master Weights + EMA + Metadata

7) Compressed checkpoint size
Compressed Checkpoint = Raw Checkpoint ÷ Compression Ratio

8) Per-shard size
Per Shard = Compressed Checkpoint ÷ Number of Shards

9) Retained storage
Retained Storage = Compressed Checkpoint × Retained Checkpoints

10) Provisioned storage target
Provisioned Capacity = Retained Storage × Copies per Checkpoint × (1 + Safety Margin ÷ 100)
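The ten steps above can be sketched in a few lines of Python. This is a minimal illustration of the formulas as written, not the calculator's own code: the names `CheckpointInputs` and `estimate` are hypothetical, all sizes are in bytes, and a byte width of 0 stands in for a disabled option (gradients, master weights, or EMA not saved).

```python
from dataclasses import dataclass


@dataclass
class CheckpointInputs:
    trainable: int                   # trainable parameter count
    frozen: int = 0                  # frozen parameter count
    adapter: int = 0                 # adapter or extra parameter count
    weight_bytes: int = 2            # e.g. 2 for FP16/BF16, 4 for FP32, 1 for INT8
    gradient_bytes: int = 0          # 0 means gradients are not saved
    optimizer_states: int = 2        # state tensors per trainable parameter
    optimizer_state_bytes: int = 4   # FP32 states
    master_bytes: int = 0            # 0 means no master weights
    ema_bytes: int = 0               # 0 means no EMA shadow copy
    metadata_bytes: int = 0
    compression_ratio: float = 1.0
    shards: int = 1
    retained: int = 1
    copies: int = 1
    safety_margin_pct: float = 0.0


def estimate(c: CheckpointInputs) -> dict:
    """Apply formulas 1 to 10 in order and return every intermediate size in bytes."""
    weights = (c.trainable + c.frozen + c.adapter) * c.weight_bytes       # 1
    gradients = c.trainable * c.gradient_bytes                            # 2
    optimizer = c.trainable * c.optimizer_states * c.optimizer_state_bytes  # 3
    master = c.trainable * c.master_bytes                                 # 4
    ema = (c.trainable + c.adapter) * c.ema_bytes                         # 5
    raw = weights + gradients + optimizer + master + ema + c.metadata_bytes  # 6
    compressed = raw / c.compression_ratio                                # 7
    per_shard = compressed / c.shards                                     # 8
    retained = compressed * c.retained                                    # 9
    provisioned = retained * c.copies * (1 + c.safety_margin_pct / 100)   # 10
    return {
        "raw": raw,
        "compressed": compressed,
        "per_shard": per_shard,
        "retained": retained,
        "provisioned": provisioned,
    }


# Hypothetical run: 1B trainable parameters, FP16 weights, Adam-style FP32
# states, FP32 master weights, no gradients or EMA.
result = estimate(CheckpointInputs(trainable=1_000_000_000, master_bytes=4))
print(result["raw"] / 1e9)  # 14.0 (decimal GB)
```

Keeping every intermediate value in the returned dictionary makes it easy to see which component dominates before compression and retention are applied.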

How to Use This Calculator

1) Enter the trainable, frozen, and extra parameter counts.
2) Select the parameter unit, such as million or billion.
3) Choose the stored precision for weights.
4) Enable or disable gradients, master weights, and EMA copies.
5) Select the optimizer type and optimizer state precision.
6) Add metadata overhead for configs, tokenizers, and manifests.
7) Set compression ratio, shard count, retention count, backup copies, and safety margin.
8) Press the calculate button to show results above the form.
9) Review the graph, summary table, and planning notes.
10) Download a CSV or PDF file for documentation or budgeting.

Frequently Asked Questions

1) What does this calculator estimate?

It estimates model checkpoint storage for weights, gradients, optimizer states, master weights, EMA copies, metadata, sharding, retention, and backup planning. It is designed for engineering estimates rather than framework-specific byte-perfect output.

2) Why can real files differ from the estimate?

Actual files vary because frameworks store tensors differently, add headers, align chunks, compress unevenly, and sometimes omit temporary states. Serialization format and filesystem behavior can also change the final size.

3) When should I include gradients?

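Include gradients when your checkpoint must resume training exactly from a saved step that has in-flight gradient values, such as partway through gradient accumulation. Many frameworks recompute gradients on the next step, so resumable checkpoints often omit them. For inference-only artifacts or lightweight deployment exports, gradients are unnecessary and can be excluded.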

4) What are master weights?

Master weights are higher-precision copies used during mixed precision training. They help numerical stability, but they also add storage overhead when saved inside resumable checkpoints.

5) Why is Adam often much larger?

Adam and AdamW commonly store two optimizer state tensors per trainable parameter. That usually makes resumable training checkpoints far larger than plain weight-only exports.
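To make the size difference concrete, the sketch below compares optimizer state storage with weight storage. The state counts in `OPTIMIZER_STATE_COUNT` are typical values assumed for illustration (the calculator's own option list is not shown here), and `optimizer_storage_bytes` is a hypothetical helper that follows the optimizer storage formula.

```python
# State tensors kept per trainable parameter (assumed typical values).
OPTIMIZER_STATE_COUNT = {
    "none": 0,      # weight-only export, no optimizer saved
    "sgd": 0,       # plain SGD keeps no per-parameter state
    "momentum": 1,  # one velocity buffer
    "adam": 2,      # first and second moment estimates
    "adamw": 2,     # same state layout as Adam
}


def optimizer_storage_bytes(trainable_params, optimizer, state_bytes=4):
    """Trainable Parameters x Optimizer State Count x Optimizer State Bytes."""
    return trainable_params * OPTIMIZER_STATE_COUNT[optimizer] * state_bytes


params = 1_300_000_000
weights_fp16 = params * 2                                # FP16 weights, 2 bytes each
adam_states = optimizer_storage_bytes(params, "adamw")   # FP32 states, 4 bytes each
print(adam_states / weights_fp16)  # 4.0
```

With FP16 weights and FP32 states, the AdamW states alone take four times the space of the weights, which is why weight-only exports are so much smaller.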

6) What does the compression ratio mean?

A compression ratio above 1 reduces the final file size. For example, a ratio of 1.20 means the compressed output is the raw size divided by 1.20, which is about 17% smaller than the raw checkpoint, not 20% smaller.
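Because the ratio is a divisor, the percentage saved is smaller than the ratio might suggest. The short sketch below converts a ratio into a percentage reduction; `size_reduction_pct` is a hypothetical helper name, not part of the calculator.

```python
def size_reduction_pct(compression_ratio):
    """Percentage by which the compressed file is smaller than the raw file."""
    if compression_ratio <= 0:
        raise ValueError("compression ratio must be positive")
    # Compressed = Raw / ratio, so the saved fraction is 1 - 1/ratio.
    return (1 - 1 / compression_ratio) * 100


for ratio in (1.15, 1.20, 2.0):
    print(f"ratio {ratio:.2f} -> {size_reduction_pct(ratio):.1f}% smaller")
```

A ratio of exactly 1 leaves the size unchanged, and a ratio below 1 would model a file that grows after packaging.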

7) Why should I plan for retained checkpoints?

Teams rarely keep only one checkpoint. Retention matters for rollback, auditing, recovery, and milestone comparison. Planning with retained copies avoids underestimating storage capacity needs.

8) How do shards help in practice?

Shards split one checkpoint into smaller files, which helps distribution, upload limits, parallel transfers, and recovery workflows. The total storage stays similar, but each file becomes easier to handle.
