Neural Network Size Calculator

Estimate model size, memory use, and compute needs, compare options across dense and convolutional designs, and make clearer sizing decisions before you build.

Calculator Inputs

Global Settings

Batch size is used for activation checkpointing and retained-intermediate estimates. Adam typically keeps about 2 extra states per parameter (first and second moments).

Dense Stack

Use comma-separated units, for example: 1024,512,256,128

Conv2D Stack

Use filters,kernel,stride,padding; filters,kernel,stride,padding. Example: 32,3,1,1;64,3,1,1;128,3,2,1

Embedding Layer

LSTM Stack

Use comma-separated hidden units, for example: 256,128

Transformer Block Stack

Output Head

Leave as 0 to auto-use the latest valid feature size.

Example Data Table

| Example Model | Key Inputs | Main Purpose | What to Watch |
| --- | --- | --- | --- |
| Compact MLP Classifier | Input 1024, dense 512-256-128, head 10 | Structured data classification | Dense layers grow parameters quickly. |
| Image CNN | 224×224×3, conv 32-64-128, head 1000 | Image recognition pipelines | Activation memory can dominate batch usage. |
| Transformer Encoder | 6 blocks, d_model 512, d_ff 2048, seq 128 | Language and sequence tasks | Sequence length strongly affects compute. |

Formula Used

Dense layer parameters: parameters = input_units × output_units + output_units when bias is enabled.
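
As a minimal sketch of this formula (function name is illustrative, not the calculator's internals):

```python
def dense_params(input_units, output_units, bias=True):
    # weights: input_units x output_units, plus one bias per output unit
    return input_units * output_units + (output_units if bias else 0)

# e.g. a 1024 -> 512 dense layer with bias:
print(dense_params(1024, 512))  # 524800
```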

Conv2D parameters: parameters = (kernel_height × kernel_width × input_channels + bias) × filters.

Conv2D output size: output = floor(((input + 2 × padding − kernel) ÷ stride) + 1).
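
Both Conv2D formulas can be sketched together; the helpers below assume square kernels and inputs, matching the single-number kernel field in the Conv2D stack:

```python
import math

def conv2d_params(in_channels, filters, kernel, bias=True):
    # each filter holds kernel x kernel x in_channels weights (+1 bias if enabled)
    return (kernel * kernel * in_channels + (1 if bias else 0)) * filters

def conv2d_output(size, kernel, stride, padding):
    # spatial output size for one dimension
    return math.floor((size + 2 * padding - kernel) / stride) + 1

# first row of the example stack, 32,3,1,1 on a 224x224x3 input:
print(conv2d_params(3, 32, 3))      # 896
print(conv2d_output(224, 3, 1, 1))  # 224
```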

Embedding parameters: parameters = vocabulary_size × embedding_dimension.

LSTM parameters: parameters = 4 × (input_size × hidden_size + hidden_size × hidden_size + bias_term), then multiplied by the number of directions. The bias_term is typically hidden_size; frameworks that keep separate input and recurrent bias vectors double it.
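
The embedding and LSTM counts can be sketched as follows, taking bias_term as hidden_size (a single bias vector per gate group, one of the conventions mentioned above):

```python
def embedding_params(vocab_size, embed_dim):
    # one embed_dim vector per vocabulary entry
    return vocab_size * embed_dim

def lstm_params(input_size, hidden_size, bias=True, directions=1):
    # four gates, each with input weights, recurrent weights, and a bias vector
    gate = input_size * hidden_size + hidden_size * hidden_size
    if bias:
        gate += hidden_size
    return 4 * gate * directions

print(embedding_params(10000, 128))  # 1280000
print(lstm_params(128, 256))         # 394240
```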

Transformer block estimate: parameters ≈ attention projections + feedforward layers + layer normalization weights.
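
One common way to expand this estimate, assuming standard Q/K/V/output projections, a two-layer feedforward, and two layer norms per block (the calculator's exact accounting may differ slightly):

```python
def transformer_block_params(d_model, d_ff, bias=True):
    # Q, K, V, and output projections: four d_model x d_model matrices
    attn = 4 * d_model * d_model + (4 * d_model if bias else 0)
    # two feedforward layers: d_model -> d_ff -> d_model
    ffn = 2 * d_model * d_ff + ((d_ff + d_model) if bias else 0)
    # two layer norms, each with scale and shift vectors of length d_model
    norm = 2 * 2 * d_model
    return attn + ffn + norm

# the encoder row from the example table: d_model 512, d_ff 2048
print(transformer_block_params(512, 2048))  # 3152384, ~3.15M per block
```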

Parameter memory: total_parameters × bytes_per_parameter.

Inference memory: parameter_memory + activation_memory_per_batch.

Training memory: parameter_memory × (2 + optimizer_multiplier) + activation_memory × activation_multiplier.
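
The three memory formulas combine into one sketch; the default multipliers below (2 optimizer states for Adam, 2× retained activations) are illustrative assumptions, not fixed values:

```python
def memory_estimate(total_params, bytes_per_param, activation_bytes,
                    optimizer_multiplier=2, activation_multiplier=2):
    # parameter memory: raw weight storage
    param_mem = total_params * bytes_per_param
    # inference: weights plus activations for one batch
    inference = param_mem + activation_bytes
    # training: weights + gradients + optimizer states, plus retained activations
    training = (param_mem * (2 + optimizer_multiplier)
                + activation_bytes * activation_multiplier)
    return param_mem, inference, training

# e.g. 10M parameters at fp32 (4 bytes) with 100 MiB of activations per batch
p, i, t = memory_estimate(10_000_000, 4, 100 * 1024**2)
print(p, i, t)  # 40000000 144857600 369715200
```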

FLOPs estimate: the calculator adds section-level multiply-add approximations for one forward pass per sample, then scales to batch size.
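
For a dense layer, the per-section multiply-add approximation looks like this (counting each multiply-add as 2 FLOPs, a common but not universal convention):

```python
def dense_flops(input_units, output_units, batch_size=1):
    # one multiply-add per weight, counted as 2 FLOPs, per sample
    return 2 * input_units * output_units * batch_size

print(dense_flops(1024, 512, batch_size=32))  # 33554432
```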

How to Use This Calculator

Start by choosing the parameter precision, batch size, and sequence length. These settings shape storage, activation memory, and training memory.

Enter only the sections you plan to use. You can model dense, convolutional, embedding, LSTM, transformer, or mixed designs in one estimate.

Use comma-separated values for dense and LSTM layers. Use semicolon-separated convolution rows in the format filters,kernel,stride,padding.
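
The convolution input format above can be read with a small parser like this sketch (field names are illustrative, not the calculator's internal representation):

```python
def parse_conv_spec(spec):
    """Parse 'filters,kernel,stride,padding;...' into a list of dicts."""
    rows = []
    for row in spec.split(";"):
        f, k, s, p = (int(x) for x in row.split(","))
        rows.append({"filters": f, "kernel": k, "stride": s, "padding": p})
    return rows

print(parse_conv_spec("32,3,1,1;64,3,1,1"))
```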

Leave Head Input Size as zero when you want the tool to infer a reasonable final feature size from the latest valid section.

Press the calculate button to view totals above the form. Then inspect the section table, memory figures, and Plotly chart.

Use the CSV and PDF buttons to save the calculated results for documentation, design reviews, or deployment planning.

FAQs

1. What does this calculator estimate?

It estimates total parameters, parameter memory, activation memory, inference memory, training memory, FLOPs, and approximate disk size. It is designed for fast architecture planning before model training or deployment.

2. Does it support mixed architectures?

Yes. You can combine embedding, convolutional, dense, LSTM, transformer, and output head sections in the same run. That makes it useful for hybrid AI and machine learning model designs.

3. Why is training memory much larger than disk size?

Training stores weights, gradients, optimizer states, and retained activations. Disk size usually covers only parameter storage, while active training needs much more memory during forward and backward passes.

4. Why does sequence length matter so much?

Sequence length directly raises activation memory and often increases compute. In transformer models, attention complexity also grows with sequence interactions, making long contexts much more expensive.

5. Are the FLOPs exact?

No. They are engineering estimates for forward-pass cost. Real runtime also depends on kernel fusion, framework behavior, hardware efficiency, sparsity, caching, and quantization choices.

6. Should I count bias terms?

Usually yes, unless your architecture explicitly removes them. Bias terms are small compared with major weight matrices, but including them gives a more complete parameter estimate.

7. Can I use it for deployment planning?

Yes. The disk size and inference memory estimates help compare model versions, choose precision formats, and judge whether a model can fit target hardware more comfortably.

8. Why is this still an estimate?

Different frameworks store tensors differently, and actual memory depends on padding, graph execution, optimizer implementation, mixed precision, checkpointing, and custom operator behavior.

Related Calculators

neural architecture search tool
neural network memory calculator
energy consumption estimator

Important Note: All the calculators listed on this site are for educational purposes only, and we do not guarantee the accuracy of results. Please consult other sources as well.