## Calculator inputs

### Example data table
| Example model | Architecture | Layers | Hidden size | Heads | Total parameters | Active parameters | Weight memory |
|---|---|---|---|---|---|---|---|
| Compact Dense | DENSE | 24 | 2,048 | 16 | 1.15B | 1.15B | 2.14 GiB |
| General Dense | DENSE | 32 | 4,096 | 32 | 5.80B | 5.80B | 10.81 GiB |
| Sparse MoE | MOE | 32 | 4,096 | 32 | 26.84B | 9.93B | 50.00 GiB |
These rows are generated with the same estimator used by the calculator.
## Formulas used
- Embedding parameters = vocabulary size × hidden size
- Output head parameters = 0 when embeddings are tied, otherwise vocabulary size × hidden size
- Head dimension = hidden size ÷ attention heads
- KV dimension = KV heads × head dimension
- Attention parameters per layer = Q + K + V + O, where:
  - Q = hidden size × hidden size
  - K = hidden size × KV dimension
  - V = hidden size × KV dimension
  - O = hidden size × hidden size
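The attention terms above can be sketched directly. The shapes below are illustrative (not taken from the example table), and biases are ignored:

```python
def attention_params_per_layer(hidden_size, heads, kv_heads):
    """Q and O are full hidden x hidden projections; K and V shrink
    under grouped-query attention when kv_heads < heads."""
    head_dim = hidden_size // heads   # head dimension = hidden size / attention heads
    kv_dim = kv_heads * head_dim      # KV dimension
    q = hidden_size * hidden_size
    k = hidden_size * kv_dim
    v = hidden_size * kv_dim
    o = hidden_size * hidden_size
    return q + k + v + o

mha = attention_params_per_layer(4096, 32, 32)  # 67,108,864 (standard multi-head)
gqa = attention_params_per_layer(4096, 32, 8)   # 41,943,040 (grouped-query, 8 KV heads)
```

With 8 KV heads instead of 32, K and V shrink to a quarter of their multi-head size while Q and O stay full, which is why grouped-query attention saves parameters.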
- Dense FFN block parameters = projection count × hidden size × intermediate size
  - GELU and ReLU use two projections; SwiGLU and GEGLU use three.
- MoE expert parameters = MoE layers × experts per layer × FFN block parameters
- The active-parameter estimate removes inactive experts from sparse layers.
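A sketch of the FFN and expert counts above, using illustrative shapes rather than the example-table models:

```python
def ffn_block_params(hidden_size, intermediate_size, gated=False):
    """GELU/ReLU blocks use two projections; gated SwiGLU/GEGLU use three."""
    projections = 3 if gated else 2
    return projections * hidden_size * intermediate_size

def moe_expert_params(moe_layers, experts_per_layer, hidden_size,
                      intermediate_size, gated=True):
    """Every expert in every routed layer is stored, whether or not it is active."""
    return moe_layers * experts_per_layer * ffn_block_params(
        hidden_size, intermediate_size, gated)

gelu_block = ffn_block_params(4_096, 11_008)                # 90,177,536
swiglu_block = ffn_block_params(4_096, 11_008, gated=True)  # 135,266,304
```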
- KV cache memory = batch size × context length × layers × KV heads × head dimension × 2 × bytes per value
  - The factor of two accounts for storing both keys and values.
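The KV cache formula translates directly to code. The deployment shape below is hypothetical (fp16, so 2 bytes per value):

```python
def kv_cache_bytes(batch, context, layers, kv_heads, head_dim, bytes_per_value=2):
    """The trailing factor of 2 stores both keys and values per position."""
    return batch * context * layers * kv_heads * head_dim * 2 * bytes_per_value

# Hypothetical: batch 1, 8k context, 32 layers, 8 KV heads, head dim 128, fp16
cache = kv_cache_bytes(1, 8192, 32, 8, 128)  # 1,073,741,824 bytes = exactly 1 GiB
```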
- Total parameters = embeddings + output head + attention + FFN blocks + expert weights + router weights + normalization parameters
## How to use this calculator
- Choose a dense or sparse MoE architecture.
- Enter vocabulary, hidden size, intermediate size, layers, and head counts.
- Select activation, normalization, precision, and deployment context length.
- If using MoE, set routed layers, experts per layer, and active experts.
- Click Calculate Parameters to display results above the form.
- Review total parameters, active parameters, memory estimates, and the component graph.
- Use the CSV and PDF buttons to save the result summary.
## Frequently asked questions
1. What does this calculator estimate?
It estimates major trainable parameters for transformer language models, including embeddings, attention weights, feed-forward blocks, MoE experts, routers, and memory footprints for deployment planning.
2. Why are active parameters lower in sparse models?
Sparse MoE models usually activate only a subset of experts per token. Total stored parameters stay high, but active parameters drop because inactive experts are skipped during routing.
3. What is the difference between total and active parameters?
Total parameters represent all stored trainable weights. Active parameters estimate how many weights participate in one token path after MoE routing removes unused experts.
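As a sketch of that distinction, with every number below hypothetical (a model with 1B shared weights, 32 routed layers, 8 experts per layer, and 2 experts active per token):

```python
def total_and_active_params(base_params, moe_layers, experts_per_layer,
                            active_experts, expert_block_params):
    """Total stores every expert; active keeps only the experts routed per token.
    base_params covers embeddings, attention, norms, and routers."""
    total = base_params + moe_layers * experts_per_layer * expert_block_params
    active = base_params + moe_layers * active_experts * expert_block_params
    return total, active

total, active = total_and_active_params(1_000_000_000, 32, 8, 2, 100_000_000)
# total = 26.6B stored, active = 7.4B per token path
```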
4. Why does tying embeddings reduce parameters?
Tied embeddings reuse the same matrix for token input representation and output logits. That removes a second large vocabulary projection and reduces total storage requirements.
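A minimal sketch of the saving, assuming a hypothetical 32k vocabulary and 4,096 hidden size:

```python
def embedding_params(vocab_size, hidden_size, tied=True):
    """Tied embeddings share one vocab x hidden matrix between the input
    lookup and the output logits; untied models store two such matrices."""
    matrices = 1 if tied else 2
    return matrices * vocab_size * hidden_size

saved = (embedding_params(32_000, 4_096, tied=False)
         - embedding_params(32_000, 4_096))  # 131,072,000 parameters saved
```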
5. Why do SwiGLU and GEGLU produce larger blocks?
They use gated feed-forward structures with an extra learned projection. That usually improves modeling quality, but increases parameter count compared with plain GELU or ReLU blocks.
6. Is the training memory estimate exact?
No. It is a simplified Adam-style estimate using model weights, gradients, and optimizer states. Real runs vary with sharding, checkpointing, mixed precision, and framework overhead.
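A minimal sketch of that simplified estimate, assuming fp16 weights and gradients plus two fp32 Adam moment tensors per parameter. The byte constants are illustrative assumptions, not necessarily the calculator's exact values:

```python
def training_memory_gib(params, weight_bytes=2, grad_bytes=2, optimizer_bytes=8):
    """Adam keeps two moment tensors; at fp32 that is 8 bytes per parameter.
    Ignores activations, sharding, checkpointing, and framework overhead."""
    total_bytes = params * (weight_bytes + grad_bytes + optimizer_bytes)
    return total_bytes / 2**30
```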
7. What affects KV cache memory most?
Context length, batch size, number of layers, KV heads, and precision have the largest impact. Longer contexts can dominate inference memory even when weights are quantized.
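To illustrate how context length can come to dominate, a sketch with hypothetical shapes held fixed (32 layers, 8 KV heads, head dimension 128, fp16):

```python
def kv_cache_gib(context, batch=1, layers=32, kv_heads=8,
                 head_dim=128, bytes_per_value=2):
    # batch x context x layers x KV heads x head dim x 2 (K and V) x bytes
    return batch * context * layers * kv_heads * head_dim * 2 * bytes_per_value / 2**30

short_ctx = kv_cache_gib(4_096)    # 0.5 GiB
long_ctx = kv_cache_gib(131_072)   # 16.0 GiB: 32x the context, 32x the cache
```

Because the cache scales linearly with context length while the weights do not, a long-context deployment can spend more memory on KV cache than on the model itself.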
8. Can I use this for architecture comparison?
Yes. It is useful for comparing dense and sparse designs, testing tied versus untied embeddings, studying grouped-query attention, and planning hardware budgets before training.