## Calculator inputs

### Example data table
| Example model | Architecture | Layers | Hidden size | Heads | Total parameters | Active parameters | Weight memory |
|---|---|---|---|---|---|---|---|
| Compact Dense | DENSE | 24 | 2,048 | 16 | 1.15B | 1.15B | 2.14 GiB |
| General Dense | DENSE | 32 | 4,096 | 32 | 5.80B | 5.80B | 10.81 GiB |
| Sparse MoE | MOE | 32 | 4,096 | 32 | 26.84B | 9.93B | 50.00 GiB |
These rows are generated with the same estimator used by the calculator.
## Formulas used
- Embedding parameters = vocabulary size × hidden size
- Output head parameters = 0 when embeddings are tied, otherwise vocabulary size × hidden size
- Head dimension = hidden size ÷ attention heads
- KV dimension = KV heads × head dimension
- Attention parameters per layer = Q + K + V + O, where:
  - Q = hidden size × hidden size
  - K = hidden size × KV dimension
  - V = hidden size × KV dimension
  - O = hidden size × hidden size
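The attention terms above can be sketched directly. The shapes below are illustrative (not taken from the example table), and biases are ignored:

```python
def attention_params_per_layer(hidden_size, heads, kv_heads):
    """Q and O are full hidden x hidden projections; K and V shrink
    under grouped-query attention when kv_heads < heads."""
    head_dim = hidden_size // heads   # head dimension = hidden size / attention heads
    kv_dim = kv_heads * head_dim      # KV dimension
    q = hidden_size * hidden_size
    k = hidden_size * kv_dim
    v = hidden_size * kv_dim
    o = hidden_size * hidden_size
    return q + k + v + o

mha = attention_params_per_layer(4096, 32, 32)  # 67,108,864 (standard multi-head)
gqa = attention_params_per_layer(4096, 32, 8)   # 41,943,040 (grouped-query, 8 KV heads)
```

With 8 KV heads instead of 32, K and V shrink to a quarter of their multi-head size while Q and O stay full, which is why grouped-query attention saves parameters.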
- Dense FFN block parameters = projection count × hidden size × intermediate size
  - GELU and ReLU use two projections; SwiGLU and GEGLU use three.
- MoE expert parameters = MoE layers × experts per layer × FFN block parameters
- The active-parameter estimate removes inactive experts from sparse layers.
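A sketch of the FFN and expert counts above, using illustrative shapes rather than the example-table models:

```python
def ffn_block_params(hidden_size, intermediate_size, gated=False):
    """GELU/ReLU blocks use two projections; gated SwiGLU/GEGLU use three."""
    projections = 3 if gated else 2
    return projections * hidden_size * intermediate_size

def moe_expert_params(moe_layers, experts_per_layer, hidden_size,
                      intermediate_size, gated=True):
    """Every expert in every routed layer is stored, whether or not it is active."""
    return moe_layers * experts_per_layer * ffn_block_params(
        hidden_size, intermediate_size, gated)

gelu_block = ffn_block_params(4_096, 11_008)                # 90,177,536
swiglu_block = ffn_block_params(4_096, 11_008, gated=True)  # 135,266,304
```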
- KV cache memory = batch size × context length × layers × KV heads × head dimension × 2 × bytes per value
  - The factor of two accounts for storing both keys and values.
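The KV cache formula translates directly to code. The deployment shape below is hypothetical (fp16, so 2 bytes per value):

```python
def kv_cache_bytes(batch, context, layers, kv_heads, head_dim, bytes_per_value=2):
    """The trailing factor of 2 stores both keys and values per position."""
    return batch * context * layers * kv_heads * head_dim * 2 * bytes_per_value

# Hypothetical: batch 1, 8k context, 32 layers, 8 KV heads, head dim 128, fp16
cache = kv_cache_bytes(1, 8192, 32, 8, 128)  # 1,073,741,824 bytes = exactly 1 GiB
```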
- Total parameters = embeddings + output head + attention + FFN blocks + expert weights + router weights + normalization parameters
## How to use this calculator
- Choose a dense or sparse MoE architecture.
- Enter vocabulary, hidden size, intermediate size, layers, and head counts.
- Select activation, normalization, precision, and deployment context length.
- If using MoE, set routed layers, experts per layer, and active experts.
- Click Calculate Parameters to display results above the form.
- Review total parameters, active parameters, memory estimates, and the component graph.
- Use the CSV and PDF buttons to save the result summary.
## Frequently asked questions
1. What does this calculator estimate?
It estimates major trainable parameters for transformer language models, including embeddings, attention weights, feed-forward blocks, MoE experts, routers, and memory footprints for deployment planning.
2. Why are active parameters lower in sparse models?
Sparse MoE models usually activate only a subset of experts per token. Total stored parameters stay high, but active parameters drop because inactive experts are skipped during routing.
3. What is the difference between total and active parameters?
Total parameters represent all stored trainable weights. Active parameters estimate how many weights participate in one token path after MoE routing removes unused experts.
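As a sketch of that distinction, with every number below hypothetical (a model with 1B shared weights, 32 routed layers, 8 experts per layer, and 2 experts active per token):

```python
def total_and_active_params(base_params, moe_layers, experts_per_layer,
                            active_experts, expert_block_params):
    """Total stores every expert; active keeps only the experts routed per token.
    base_params covers embeddings, attention, norms, and routers."""
    total = base_params + moe_layers * experts_per_layer * expert_block_params
    active = base_params + moe_layers * active_experts * expert_block_params
    return total, active

total, active = total_and_active_params(1_000_000_000, 32, 8, 2, 100_000_000)
# total = 26.6B stored, active = 7.4B per token path
```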
4. Why does tying embeddings reduce parameters?
Tied embeddings reuse the same matrix for token input representation and output logits. That removes a second large vocabulary projection and reduces total storage requirements.
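A minimal sketch of the saving, assuming a hypothetical 32k vocabulary and 4,096 hidden size:

```python
def embedding_params(vocab_size, hidden_size, tied=True):
    """Tied embeddings share one vocab x hidden matrix between the input
    lookup and the output logits; untied models store two such matrices."""
    matrices = 1 if tied else 2
    return matrices * vocab_size * hidden_size

saved = (embedding_params(32_000, 4_096, tied=False)
         - embedding_params(32_000, 4_096))  # 131,072,000 parameters saved
```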
5. Why do SwiGLU and GEGLU produce larger blocks?
They use gated feed-forward structures with an extra learned projection. That usually improves modeling quality, but increases parameter count compared with plain GELU or ReLU blocks.
6. Is the training memory estimate exact?
No. It is a simplified Adam-style estimate using model weights, gradients, and optimizer states. Real runs vary with sharding, checkpointing, mixed precision, and framework overhead.
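A minimal sketch of that simplified estimate, assuming fp16 weights and gradients plus two fp32 Adam moment tensors per parameter. The byte constants are illustrative assumptions, not necessarily the calculator's exact values:

```python
def training_memory_gib(params, weight_bytes=2, grad_bytes=2, optimizer_bytes=8):
    """Adam keeps two moment tensors; at fp32 that is 8 bytes per parameter.
    Ignores activations, sharding, checkpointing, and framework overhead."""
    total_bytes = params * (weight_bytes + grad_bytes + optimizer_bytes)
    return total_bytes / 2**30
```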
7. What affects KV cache memory most?
Context length, batch size, number of layers, KV heads, and precision have the largest impact. Longer contexts can dominate inference memory even when weights are quantized.
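To illustrate how context length can come to dominate, a sketch with hypothetical shapes held fixed (32 layers, 8 KV heads, head dimension 128, fp16):

```python
def kv_cache_gib(context, batch=1, layers=32, kv_heads=8,
                 head_dim=128, bytes_per_value=2):
    # batch x context x layers x KV heads x head dim x 2 (K and V) x bytes
    return batch * context * layers * kv_heads * head_dim * 2 * bytes_per_value / 2**30

short_ctx = kv_cache_gib(4_096)    # 0.5 GiB
long_ctx = kv_cache_gib(131_072)   # 16.0 GiB: 32x the context, 32x the cache
```

Because the cache scales linearly with context length while the weights do not, a long-context deployment can spend more memory on KV cache than on the model itself.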
8. Can I use this for architecture comparison?
Yes. It is useful for comparing dense and sparse designs, testing tied versus untied embeddings, studying grouped-query attention, and planning hardware budgets before training.