Plan architectures with accurate parameter and size estimates. Choose layer types, activations, and precision quickly. Download reports, share CSVs, and refine models together with confidence.
Add layers, estimate parameters, and size memory for deployment.
Sample architecture summary (illustrative values).
| # | Layer | Type | Trainable | Non-trainable |
|---|---|---|---|---|
| 1 | Token Embedding | embedding | 38,400,000 | 0 |
| 2 | Encoder Block × 12 | transformer_block | 85,054,464 | 0 |
| 3 | Final LayerNorm | layernorm | 1,536 | 0 |
Parameter totals are a proxy for model capacity, training cost, and deployment footprint. A 110M-parameter model at 32-bit weights stores 440 MB for weights alone, while a 7B-parameter model reaches about 28 GB. Early sizing prevents rework and helps you match hardware, latency, and cost constraints before code is written.
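The conversion behind those figures is simply parameters × bits ÷ 8. A minimal sketch (function name is illustrative, not part of the tool):

```python
def weight_storage_gb(params: int, bits: int) -> float:
    """Convert a parameter count and weight precision to storage in GB (10^9 bytes)."""
    return params * bits / 8 / 1e9

# 110M parameters at 32-bit weights: 0.44 GB (440 MB)
print(weight_storage_gb(110_000_000, 32))
# 7B parameters at 32-bit weights: 28 GB
print(weight_storage_gb(7_000_000_000, 32))
```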
This estimator breaks a network into building blocks: dense layers, convolution kernels, embeddings, recurrent cells, batch normalization, and transformer sublayers. For transformers, it counts token embeddings, attention projections (Q, K, V, and output), feed-forward expansions, and optional layer norms. You can mix layer types to mirror hybrid stacks used in vision-language and speech models.
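As a concrete sketch, the sample table above is consistent with a GPT-2-small-like configuration (vocabulary 50,000, model width 768, feed-forward width 3,072); a minimal Python counter under those assumptions, counting biased Q/K/V/output projections, the two feed-forward matrices, and two layer norms per block:

```python
def embedding_params(vocab_size: int, d_model: int) -> int:
    """Token embedding table: one d_model-wide vector per vocabulary entry."""
    return vocab_size * d_model

def transformer_block_params(d_model: int, d_ff: int) -> int:
    """Rough count for one transformer block. Real implementations may fuse
    projections or drop biases, so small mismatches are expected."""
    attn = 4 * (d_model * d_model + d_model)                # Q, K, V, out (weights + biases)
    ffn = d_model * d_ff + d_ff + d_ff * d_model + d_model  # up + down projections
    norms = 2 * (2 * d_model)                               # two layer norms (scale + shift)
    return attn + ffn + norms

print(embedding_params(50_000, 768))          # 38,400,000 (row 1 above)
print(12 * transformer_block_params(768, 3072))  # 85,054,464 (row 2 above)
```

The final layer norm in row 3 is likewise 2 × 768 = 1,536 scale-and-shift terms.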
Beyond counts, the calculator converts parameters into storage at your chosen precision. For example, 500M parameters at 16-bit weights occupy about 1.0 GB, and at 8-bit about 0.5 GB. During training you also hold gradients and optimizer state alongside the weights, so practical memory can reach 4× to 12× the weight size alone.
Training memory depends on how you optimize. Adam-style optimizers typically keep two additional moment tensors per weight, while SGD with momentum keeps one. With 1B parameters at 16-bit weights, Adam moments in 32-bit can add roughly 8 GB, even before activations. Mixed precision, quantization-aware finetuning, and low-rank adapters reduce memory without eliminating the need to track some states.
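The optimizer multipliers above can be sketched directly (a rough model that ignores activations; the per-optimizer state counts are the usual defaults, not a guarantee for every implementation):

```python
def training_memory_gb(params: int, weight_bits: int, grad_bits: int,
                       optimizer: str = "adam", state_bits: int = 32) -> float:
    """Rough training memory (GB) for weights + gradients + optimizer state.
    Assumes Adam keeps two moment tensors per weight and SGD-with-momentum
    keeps one; activation memory is not included."""
    states = {"adam": 2, "sgd_momentum": 1, "sgd": 0}[optimizer]
    bytes_total = params * (weight_bits + grad_bits + states * state_bits) / 8
    return bytes_total / 1e9

# 1B params, 16-bit weights and gradients, Adam moments in 32-bit:
# 2 GB weights + 2 GB gradients + 8 GB moments = 12 GB before activations
print(training_memory_gb(1_000_000_000, 16, 16, "adam"))
```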
Use the layer table to compare design choices on equal terms. Increasing hidden width grows dense and attention matrices quadratically, while increasing depth grows the total roughly linearly. For a transformer block, doubling model width raises attention and MLP parameters by about 4×, whereas doubling depth only doubles them. The per-layer breakdown highlights where growth concentrates.
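The width-versus-depth asymmetry is easy to verify numerically (a hypothetical helper with biases and norms omitted, assuming the common 4× feed-forward expansion):

```python
def block_params(d_model: int, ffn_ratio: int = 4) -> int:
    """Weight-matrix parameters in one transformer block: four d_model x d_model
    attention projections plus two d_model x (ffn_ratio * d_model) MLP matrices."""
    attn = 4 * d_model * d_model
    ffn = 2 * d_model * (ffn_ratio * d_model)
    return attn + ffn

base = block_params(768)
wide = block_params(1536)
print(wide / base)  # 4.0: doubling width quadruples per-block parameters
```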
The Plotly chart visualizes parameter contribution by component, making bottlenecks obvious at a glance. If embeddings dominate, reduce vocabulary size, tie input-output embeddings, or compress with subword strategies. If MLPs dominate, lower the expansion ratio or use gated variants selectively. Export CSV for audits and PDF for reviews, then keep estimates alongside experiment logs.
**Do more parameters always mean a better model?**
No. More parameters can help, but data quality, optimization, architecture, and regularization often matter more. Bigger models can also overfit or become unstable without the right training recipe.
**Why do my framework's counts differ from this estimator?**
Frameworks may fuse projections, tie embeddings, omit non-trainable buffers, or count certain normalization terms differently. Layer implementations can vary, so small mismatches are expected.
**Why does the optimizer add to memory?**
Many optimizers store extra tensors per weight, such as momentum and variance. These states increase memory beyond weights and gradients, especially for Adam-style methods.
**How much memory do activations need?**
Activation memory depends on batch size, sequence length, and saved tensors for backprop. Use this tool's rough estimate for planning, then validate with a small run and profiler.
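For a quick sanity check before profiling, a crude estimate can multiply the main shape factors together (the `tensors_per_layer` fudge factor below is a hypothetical placeholder; real usage depends heavily on implementation and activation checkpointing):

```python
def activation_gb(batch: int, seq_len: int, d_model: int, n_layers: int,
                  bytes_per_elem: int = 2, tensors_per_layer: int = 8) -> float:
    """Very rough activation memory (GB) for a transformer forward pass with
    tensors saved for backprop. Validate against a profiler before relying on it."""
    return (batch * seq_len * d_model * n_layers
            * tensors_per_layer * bytes_per_elem) / 1e9

# batch 8, sequence 1024, width 768, 12 layers, 16-bit activations:
print(activation_gb(8, 1024, 768, 12))
```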
**Can I model repeated blocks without entering each one?**
Yes. Add one representative block, duplicate it with the “Duplicate row” control, or copy the same settings across rows. Totals scale linearly with repeated blocks.
**Which settings drive parameter count the most?**
Model width, feed-forward expansion ratio, and number of layers drive parameters strongly. Vocab size affects embeddings, while head count mostly reshapes matrices if width stays constant.
Important Note: All the calculators listed on this site are for educational purposes only, and we do not guarantee the accuracy of results. Please consult other sources as well.