Inputs
Example Data Table
| Scenario | Entities | Rows/day | Features | Bytes/feature | Retention | Compression | Replication | Estimated total (with headroom) |
|---|---|---|---|---|---|---|---|---|
| Pilot | 50,000 | 200,000 | 40 | 12 | 90d | 0.40 | 2x | ~10–15 GiB |
| Production | 500,000 | 2,500,000 | 60 | 16 | 365d | 0.35 | 3x | ~1–2 TiB |
| High frequency | 2,000,000 | 25,000,000 | 120 | 16 | 365d | 0.30 | 3x | ~20–30 TiB |
Formulas Used
- Feature vector bytes = features_per_row × avg_feature_bytes.
- Offline row bytes = vector + key_bytes + timestamp_bytes + row_overhead_bytes.
- Offline raw bytes = offline_row_bytes × rows_per_day × retention_days.
- Offline compressed bytes = offline_raw_bytes × compression_ratio.
- Offline total = compressed × (1 + overhead%) × replication.
- Online total = online_row_bytes × entities × (1 + overhead%) × replication.
- Metadata total = (defs + lineage) × (1 + overhead%) × replication.
- Provisioned = combined × (1 + headroom%).
- Write MiB/s ≈ offline_row_bytes × (rows_per_day ÷ 86,400) ÷ 1,048,576 (bytes per MiB).
- Read MiB/s ≈ read_payload_bytes × reads_per_sec ÷ 1,048,576.
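The storage formulas above can be sketched directly in Python. This is a minimal sketch, not the calculator's implementation; the default key, timestamp, and row-overhead byte counts are illustrative assumptions.

```python
# Sketch of the sizing formulas above. Default key, timestamp, and
# row-overhead sizes are illustrative assumptions, not fixed defaults.

def offline_row_bytes(features, bytes_per_feature,
                      key_bytes=16, timestamp_bytes=8, row_overhead_bytes=24):
    """Feature vector bytes plus key, timestamp, and per-row overhead."""
    return (features * bytes_per_feature
            + key_bytes + timestamp_bytes + row_overhead_bytes)

def offline_total_bytes(row_bytes, rows_per_day, retention_days,
                        compression_ratio, overhead_pct, replication):
    """Raw history, compressed, with overhead and replication applied."""
    raw = row_bytes * rows_per_day * retention_days
    return raw * compression_ratio * (1 + overhead_pct) * replication

def online_total_bytes(row_bytes, entities, overhead_pct, replication):
    """Latest vector per entity, with overhead and replication."""
    return row_bytes * entities * (1 + overhead_pct) * replication

def provisioned_bytes(combined_bytes, headroom_pct):
    """Headroom for growth, seasonality, and backfills."""
    return combined_bytes * (1 + headroom_pct)

# Example: 40 features at 12 bytes each -> 480 + 16 + 8 + 24 = 528-byte rows.
row = offline_row_bytes(40, 12)
```

Each function mirrors one bullet above, so intermediate values (raw, compressed, replicated) are easy to inspect when validating against real storage metrics.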
How to Use This Calculator
- Set rows per day and retention for your training history needs.
- Estimate bytes per feature using actual types and encodings.
- Choose a realistic compression ratio after compaction and columnar formats.
- Match replication to your availability and disaster recovery goals.
- Include overhead for indexes, tombstones, and compaction slack.
- Use reads per second and features per read for serving bandwidth.
- Add headroom to cover growth, seasonality, and backfills.
- Submit to view results, then export CSV or PDF for planning.
Sizing Inputs That Matter
Accurate sizing starts with clear data assumptions. Count online entities as unique keys served. Estimate offline rows per day from event volume and refresh frequency. Multiply features per row by average bytes per feature, using real encodings for integers, floats, and hashed categories. Add key, timestamp, and row overhead to capture serialization, schema headers, and small-file effects. These inputs drive storage and bandwidth projections. Validate with pilot runs and historical retention policies.
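For example, average bytes per feature can be estimated from the type mix. The per-type sizes below assume common encodings (8-byte integers and doubles, 4-byte floats, hashed categories stored as 4-byte ints) and are assumptions to adjust for your formats.

```python
# Estimate average bytes per feature from a type mix. The per-type
# byte sizes assume typical encodings and are illustrative.
TYPE_BYTES = {"int64": 8, "float64": 8, "float32": 4, "hashed_cat": 4}

def avg_feature_bytes(type_counts):
    """Weighted average bytes per feature across the declared types."""
    total_features = sum(type_counts.values())
    total_bytes = sum(TYPE_BYTES[t] * n for t, n in type_counts.items())
    return total_bytes / total_features

# 10 int64 counters, 20 float32 signals, 10 hashed categories:
# (10*8 + 20*4 + 10*4) / 40 = 5.0 bytes per feature on average.
mix = {"int64": 10, "float32": 20, "hashed_cat": 10}
```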
Offline Store Capacity Planning
The offline store holds historical feature values for training, backfills, and audits. Raw offline bytes equal offline row size times rows per day times retention days. Apply a compression ratio reflecting columnar formats and compaction; many pipelines land between 0.25 and 0.50. Include offline overhead for indexes, partition metadata, and compaction slack. Multiply by replication to reflect zone copies or erasure-coded equivalents.
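A quick worked example of that chain, using illustrative inputs (1,008-byte rows, 2.5M rows/day, one-year retention, 20% overhead, 3x replication):

```python
# Offline sizing chain with illustrative inputs; not a recommendation.
TiB = 2**40

row_bytes = 1_008          # e.g. 60 features x 16 bytes + 48 bytes of overhead
rows_per_day = 2_500_000
retention_days = 365

raw = row_bytes * rows_per_day * retention_days
for ratio in (0.25, 0.35, 0.50):
    total = raw * ratio * 1.20 * 3   # 20% overhead, 3x replication
    print(f"compression {ratio:.2f}: {total / TiB:.2f} TiB")
```

Under these assumptions the totals span roughly 0.75–1.5 TiB, which illustrates how strongly the observed compression ratio drives the final bill.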
Online Serving Footprint
The online store keeps the latest feature vector per entity, plus serving metadata. Online raw bytes equal online row size times entity count. Overhead is often higher than offline because key-value engines maintain internal tables and tombstones. Replication aligns with availability targets and latency, so two to three copies are typical. Use reads per second and average features per read to estimate payload bytes and sustained read bandwidth.
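The read-side estimate is simple to sketch; the 64-features-at-16-bytes payload and 10,000 reads/s below are assumptions for illustration.

```python
# Serving payload size and sustained read bandwidth; inputs are illustrative.
MiB = 2**20

def read_mib_per_s(features_per_read, bytes_per_feature, reads_per_sec):
    """Payload bytes per read, scaled to sustained MiB/s."""
    payload_bytes = features_per_read * bytes_per_feature
    return payload_bytes * reads_per_sec / MiB

# 64 features x 16 bytes = 1 KiB payload; at 10,000 reads/s that is
# about 9.77 MiB/s of sustained read bandwidth.
rate = read_mib_per_s(64, 16, 10_000)
```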
Governance and Metadata Overhead
Production feature platforms store definitions, validation rules, ownership, and lineage. Definitions scale with the number of features, versions, and projects; allocate kilobytes per definition for documentation and statistics. Lineage events scale with job runs, materializations, and access logs. Total governance bytes combine definitions and lineage, then apply overhead for indexing and partitions and replication for durability. Even when small, governance data is vital for incident response and audits.
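A rough sketch of that governance math; the per-definition and per-event sizes below are assumptions to vary against your own catalog.

```python
# Governance bytes: definitions plus lineage, then overhead and replication.
def metadata_total_bytes(num_defs, bytes_per_def,
                         lineage_events, bytes_per_event,
                         overhead_pct, replication):
    """Combine definition and lineage bytes, then apply multipliers."""
    raw = num_defs * bytes_per_def + lineage_events * bytes_per_event
    return raw * (1 + overhead_pct) * replication

# 5,000 definitions at 4 KiB plus 10M lineage events at 512 bytes,
# with 20% overhead and 3x replication -> roughly 17 GiB.
total = metadata_total_bytes(5_000, 4_096, 10_000_000, 512, 0.20, 3)
```

Even the "roughly 17 GiB" here is dominated by lineage events, which is typical: definitions stay small while run and access logs grow with activity.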
Operational Bandwidth and Headroom
Storage is only half the story; ingestion and serving must move data reliably. The calculator converts rows per day to rows per second and estimates write MiB/s from offline row bytes. Batch payload size helps tune file sizes, buffering, and network windows. Add headroom to cover growth, late-arriving events, and compaction lag; 20–50% is common. With too little headroom, spikes and backfills will trip capacity alarms.
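As one example of the conversion, using illustrative inputs (2,000-byte rows at 25M rows/day):

```python
# Daily row volume -> sustained write bandwidth, plus a headroom multiplier.
MiB = 2**20

def write_mib_per_s(row_bytes, rows_per_day):
    """Average bytes per second over a day, expressed in MiB/s."""
    return row_bytes * rows_per_day / 86_400 / MiB

sustained = write_mib_per_s(2_000, 25_000_000)  # ~0.55 MiB/s sustained
peak_budget = sustained * 1.5                   # plan for 50% headroom
```

Note how small the sustained average is even for 50 GB/day; real ingestion arrives in bursts, so the headroom multiplier and batch sizing matter more than the daily average suggests.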
FAQs
What does compression ratio represent?
It is the stored size after compression divided by raw size. Lower values mean better compression. Use observed ratios from your files after compaction, not generic vendor claims.
How should I choose overhead percent?
Overhead covers indexes, partitions, tombstones, and compaction slack. Start with 10–30% for offline formats and 15–40% for online key-value stores, then tune using actual storage metrics.
Why are offline and online stores sized differently?
Offline storage tracks history for training and audits, so retention days dominate. Online storage usually holds the latest vector per entity, so entity count and replication drive most capacity.
What does replication mean in the results?
It multiplies the stored bytes to account for multiple copies across nodes or zones. Match it to your availability and recovery goals, and remember some systems use erasure coding instead of full copies.
How much headroom is reasonable?
Many teams plan 20–50% headroom to absorb growth, late data, and compaction lag. If you backfill often or face seasonal spikes, choose higher headroom and monitor usage trends weekly.
How do I interpret write and read bandwidth?
Write bandwidth estimates sustained ingestion from daily volume, while read bandwidth estimates serving throughput from requests per second and payload size. Compare these to network limits, disk throughput, and autoscaling targets.