Neural Network Size Calculator

Estimate model size, memory use, and compute needs, compare options across dense and convolutional designs, and make clearer sizing decisions before you build.

Calculator Inputs

Global Settings

Batch size is used for activation checkpointing and retained-intermediate estimates. Adam typically keeps about 2 extra states per parameter (first and second moments).

Dense Stack

Use comma-separated units, for example: 1024,512,256,128

Conv2D Stack

Use filters,kernel,stride,padding; filters,kernel,stride,padding. Example: 32,3,1,1;64,3,1,1;128,3,2,1

Embedding Layer

LSTM Stack

Use comma-separated hidden units, for example: 256,128

Transformer Block Stack

Output Head

Leave as 0 to auto-use the latest valid feature size.

Example Data Table

| Example Model | Key Inputs | Main Purpose | What to Watch |
| --- | --- | --- | --- |
| Compact MLP Classifier | Input 1024, dense 512-256-128, head 10 | Structured data classification | Dense layers grow parameters quickly. |
| Image CNN | 224×224×3, conv 32-64-128, head 1000 | Image recognition pipelines | Activation memory can dominate batch usage. |
| Transformer Encoder | 6 blocks, d_model 512, d_ff 2048, seq 128 | Language and sequence tasks | Sequence length strongly affects compute. |

Formula Used

Dense layer parameters: parameters = input_units × output_units + output_units when bias is enabled.
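
As a minimal sketch of this formula (function name is illustrative, not the calculator's internals):

```python
def dense_params(input_units, output_units, bias=True):
    # weights: input_units x output_units, plus one bias per output unit
    return input_units * output_units + (output_units if bias else 0)

# e.g. a 1024 -> 512 dense layer with bias:
print(dense_params(1024, 512))  # 524800
```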

Conv2D parameters: parameters = (kernel_height × kernel_width × input_channels + bias) × filters.

Conv2D output size: output = floor(((input + 2 × padding − kernel) ÷ stride) + 1).
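
Both Conv2D formulas can be sketched together; the helpers below assume square kernels and inputs, matching the single-number kernel field in the Conv2D stack:

```python
import math

def conv2d_params(in_channels, filters, kernel, bias=True):
    # each filter holds kernel x kernel x in_channels weights (+1 bias if enabled)
    return (kernel * kernel * in_channels + (1 if bias else 0)) * filters

def conv2d_output(size, kernel, stride, padding):
    # spatial output size for one dimension
    return math.floor((size + 2 * padding - kernel) / stride) + 1

# first row of the example stack, 32,3,1,1 on a 224x224x3 input:
print(conv2d_params(3, 32, 3))      # 896
print(conv2d_output(224, 3, 1, 1))  # 224
```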

Embedding parameters: parameters = vocabulary_size × embedding_dimension.

LSTM parameters: parameters = 4 × (input_size × hidden_size + hidden_size × hidden_size + bias_term), then multiplied by the number of directions. The bias_term is typically hidden_size; frameworks that keep separate input and recurrent bias vectors double it.
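
The embedding and LSTM counts can be sketched as follows, taking bias_term as hidden_size (a single bias vector per gate group, one of the conventions mentioned above):

```python
def embedding_params(vocab_size, embed_dim):
    # one embed_dim vector per vocabulary entry
    return vocab_size * embed_dim

def lstm_params(input_size, hidden_size, bias=True, directions=1):
    # four gates, each with input weights, recurrent weights, and a bias vector
    gate = input_size * hidden_size + hidden_size * hidden_size
    if bias:
        gate += hidden_size
    return 4 * gate * directions

print(embedding_params(10000, 128))  # 1280000
print(lstm_params(128, 256))         # 394240
```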

Transformer block estimate: parameters ≈ attention projections + feedforward layers + layer normalization weights.
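
One common way to expand this estimate, assuming standard Q/K/V/output projections, a two-layer feedforward, and two layer norms per block (the calculator's exact accounting may differ slightly):

```python
def transformer_block_params(d_model, d_ff, bias=True):
    # Q, K, V, and output projections: four d_model x d_model matrices
    attn = 4 * d_model * d_model + (4 * d_model if bias else 0)
    # two feedforward layers: d_model -> d_ff -> d_model
    ffn = 2 * d_model * d_ff + ((d_ff + d_model) if bias else 0)
    # two layer norms, each with scale and shift vectors of length d_model
    norm = 2 * 2 * d_model
    return attn + ffn + norm

# the encoder row from the example table: d_model 512, d_ff 2048
print(transformer_block_params(512, 2048))  # 3152384, ~3.15M per block
```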

Parameter memory: total_parameters × bytes_per_parameter.

Inference memory: parameter_memory + activation_memory_per_batch.

Training memory: parameter_memory × (2 + optimizer_multiplier) + activation_memory × activation_multiplier.
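
The three memory formulas combine into one sketch; the default multipliers below (2 optimizer states for Adam, 2× retained activations) are illustrative assumptions, not fixed values:

```python
def memory_estimate(total_params, bytes_per_param, activation_bytes,
                    optimizer_multiplier=2, activation_multiplier=2):
    # parameter memory: raw weight storage
    param_mem = total_params * bytes_per_param
    # inference: weights plus activations for one batch
    inference = param_mem + activation_bytes
    # training: weights + gradients + optimizer states, plus retained activations
    training = (param_mem * (2 + optimizer_multiplier)
                + activation_bytes * activation_multiplier)
    return param_mem, inference, training

# e.g. 10M parameters at fp32 (4 bytes) with 100 MiB of activations per batch
p, i, t = memory_estimate(10_000_000, 4, 100 * 1024**2)
print(p, i, t)  # 40000000 144857600 369715200
```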

FLOPs estimate: the calculator adds section-level multiply-add approximations for one forward pass per sample, then scales to batch size.
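
For a dense layer, the per-section multiply-add approximation looks like this (counting each multiply-add as 2 FLOPs, a common but not universal convention):

```python
def dense_flops(input_units, output_units, batch_size=1):
    # one multiply-add per weight, counted as 2 FLOPs, per sample
    return 2 * input_units * output_units * batch_size

print(dense_flops(1024, 512, batch_size=32))  # 33554432
```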

How to Use This Calculator

Start by choosing the parameter precision, batch size, and sequence length. These settings shape storage, activation memory, and training memory.

Enter only the sections you plan to use. You can model dense, convolutional, embedding, LSTM, transformer, or mixed designs in one estimate.

Use comma-separated values for dense and LSTM layers. Use semicolon-separated convolution rows in the format filters,kernel,stride,padding.
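
The convolution input format above can be read with a small parser like this sketch (field names are illustrative, not the calculator's internal representation):

```python
def parse_conv_spec(spec):
    """Parse 'filters,kernel,stride,padding;...' into a list of dicts."""
    rows = []
    for row in spec.split(";"):
        f, k, s, p = (int(x) for x in row.split(","))
        rows.append({"filters": f, "kernel": k, "stride": s, "padding": p})
    return rows

print(parse_conv_spec("32,3,1,1;64,3,1,1"))
```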

Leave Head Input Size as zero when you want the tool to infer a reasonable final feature size from the latest valid section.

Press the calculate button to view totals above the form. Then inspect the section table, memory figures, and Plotly chart.

Use the CSV and PDF buttons to save the calculated results for documentation, design reviews, or deployment planning.

FAQs

1. What does this calculator estimate?

It estimates total parameters, parameter memory, activation memory, inference memory, training memory, FLOPs, and approximate disk size. It is designed for fast architecture planning before model training or deployment.

2. Does it support mixed architectures?

Yes. You can combine embedding, convolutional, dense, LSTM, transformer, and output head sections in the same run. That makes it useful for hybrid AI and machine learning model designs.

3. Why is training memory much larger than disk size?

Training stores weights, gradients, optimizer states, and retained activations. Disk size usually covers only parameter storage, while active training needs much more memory during forward and backward passes.

4. Why does sequence length matter so much?

Sequence length directly raises activation memory and often increases compute. In transformer models, attention complexity also grows with sequence interactions, making long contexts much more expensive.

5. Are the FLOPs exact?

No. They are engineering estimates for forward-pass cost. Real runtime also depends on kernel fusion, framework behavior, hardware efficiency, sparsity, caching, and quantization choices.

6. Should I count bias terms?

Usually yes, unless your architecture explicitly removes them. Bias terms are small compared with major weight matrices, but including them gives a more complete parameter estimate.

7. Can I use it for deployment planning?

Yes. The disk size and inference memory estimates help compare model versions, choose precision formats, and judge whether a model can fit target hardware more comfortably.

8. Why is this still an estimate?

Different frameworks store tensors differently, and actual memory depends on padding, graph execution, optimizer implementation, mixed precision, checkpointing, and custom operator behavior.

Related Calculators

neural architecture search tool
neural network memory calculator
energy consumption estimator

Important Note: All the calculators listed on this site are for educational purposes only, and we do not guarantee the accuracy of results. Please consult other sources as well.