Feature Store Sizing Calculator

Plan scalable feature storage for training and serving. Compare online snapshots, offline histories, and governance overhead, then export the results for budgeting and deployment planning.

Inputs

Fill values, then submit to estimate storage and bandwidth.
Engineering

- Online entities: unique keys whose latest features are served.
- Offline rows per day: daily feature records written to the offline store.
- Features per row: average number of features in one feature vector.
- Bytes per feature: include encoding and type width, not text labels.
- Offline key bytes: key storage for offline rows (IDs, hashes, etc.).
- Timestamp bytes: event-time column size, often 8 bytes.
- Row overhead bytes: serialization, schema, row groups, and small headers.
- Retention days: historical window kept for training and audits.
- Compression ratio: size after compression, as a fraction of raw.
- Offline replication: copies across nodes or zones.
- Offline overhead: indexes, files, compaction slack, and tombstones.
- Online key bytes: key storage for online serving rows.
- Online row overhead: metadata, TTL markers, and internal structures.
- Online replication: serving copies to meet latency and availability goals.
- Online overhead: key-value overhead, internal tables, and spare capacity.

Metadata & Governance

Optional sizing for definitions and lineage logs.
- Feature definitions: counts across teams, versions, and projects.
- Bytes per definition: schema, owners, tags, validations, docs, and stats.
- Lineage events per day: jobs, materializations, and access audit events.
- Bytes per event: event payload after storage encoding.
- Retention days: retention for governance and incident analysis.
- Replication: copies for durability of definitions and logs.
- Overhead: indexes, partitions, and query acceleration costs.

Workload

Bandwidth estimates help with networking and serving budgets.
- Reads per second: average steady-state requests, not peak bursts.
- Features per read: often fewer features are requested than are stored.
- Rows per batch: used to estimate batch payload size for ingestion.

Planning

- Headroom percent: add headroom to cover growth, spikes, and compaction lag. Common range: 20% to 50%.

Example Data Table

Illustrative scenarios to sanity-check ranges. Replace with your real traffic and schemas.
Scenario        Entities    Rows/day    Features  Bytes/feature  Retention  Compression  Replication  Estimated total (with headroom)
Pilot              50,000     200,000         40             12        90d         0.40           2x   ~1–3 TiB
Production        500,000   2,500,000         60             16       365d         0.35           3x   ~20–60 TiB
High frequency  2,000,000  25,000,000        120             16       365d         0.30           3x   ~400–900 TiB
These ranges assume moderate overhead and ~30% headroom.

Formula Used
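The exact computation runs inside the calculator; based on the section descriptions below, it takes roughly this form (variable names are illustrative):

  offline_raw    = rows_per_day × retention_days × offline_row_bytes
  offline_stored = offline_raw × compression × (1 + offline_overhead) × offline_replication
  online_stored  = entities × online_row_bytes × (1 + online_overhead) × online_replication
  governance     = (definitions × bytes_per_definition + lineage_events_per_day × bytes_per_event × lineage_retention_days) × (1 + overhead) × replication
  total          = (offline_stored + online_stored + governance) × (1 + headroom)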

How to Use This Calculator

  1. Set rows per day and retention for your training history needs.
  2. Estimate bytes per feature using actual types and encodings.
  3. Choose a realistic compression ratio after compaction and columnar formats.
  4. Match replication to your availability and disaster recovery goals.
  5. Include overhead for indexes, tombstones, and compaction slack.
  6. Use reads per second and features per read for serving bandwidth.
  7. Add headroom to cover growth, seasonality, and backfills.
  8. Submit to view results, then export CSV or PDF for planning.
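The steps above can be sketched end to end in a few lines. This is a minimal illustration, not the calculator's exact implementation, and every numeric input below is an assumption chosen for the example.

```python
# Hedged end-to-end sketch of the sizing steps above. All numeric inputs
# are illustrative assumptions, not recommendations.

def offline_row_bytes(features, bytes_per_feature,
                      key_bytes=16, ts_bytes=8, row_overhead=32):
    # Step 2 plus row overhead: feature payload, key, timestamp, and
    # serialization overhead per row.
    return features * bytes_per_feature + key_bytes + ts_bytes + row_overhead

def stored_tib(raw_bytes, compression, overhead, replication, headroom):
    # Steps 3-5 and 7: compress, add format overhead, replicate, add headroom.
    stored = raw_bytes * compression * (1 + overhead) * replication
    return stored * (1 + headroom) / 2**40

row_b = offline_row_bytes(features=50, bytes_per_feature=12)   # step 2
raw = 1_000_000 * 180 * row_b                                  # step 1: rows/day x days
print(f"offline estimate: {stored_tib(raw, 0.40, 0.25, 3, 0.30):.2f} TiB")
```

Swapping in your own schema widths and retention is usually enough to sanity-check the calculator's output before exporting it.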

Sizing Inputs That Matter

Accurate sizing starts with clear data assumptions. Count online entities as unique keys served. Estimate offline rows per day from event volume and refresh frequency. Multiply features per row by average bytes per feature, using real encodings for integers, floats, and hashed categories. Add key, timestamp, and row overhead to capture serialization, schema headers, and small-file effects. These inputs drive storage and bandwidth projections. Validate with pilot runs and historical retention policies.
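One way to estimate average bytes per feature from real encodings is to walk the schema. The dtype-to-width table below is an assumption for common encodings, not a standard:

```python
# Illustrative mapping from feature dtypes to encoded widths; the widths
# here are assumptions for common encodings.
DTYPE_BYTES = {"int64": 8, "float32": 4, "float64": 8, "hashed_category": 8}

def avg_bytes_per_feature(schema):
    """schema: list of dtype names making up one feature vector."""
    return sum(DTYPE_BYTES[t] for t in schema) / len(schema)

schema = ["float32"] * 30 + ["int64"] * 5 + ["hashed_category"] * 5
print(avg_bytes_per_feature(schema))  # (30*4 + 5*8 + 5*8) / 40 = 5.0
```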

Offline Store Capacity Planning

The offline store holds historical feature values for training, backfills, and audits. Raw offline bytes equal offline row size times rows per day times retention days. Apply a compression ratio reflecting columnar formats and compaction; many pipelines land between 0.25 and 0.50. Include offline overhead for indexes, partition metadata, and compaction slack. Multiply by replication to reflect zone copies or erasure-coded equivalents.
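The offline formula described above can be expressed directly; the parameter values in the example are illustrative:

```python
# Sketch of the offline capacity formula: raw history, then compression,
# format overhead, and replication. Inputs are illustrative assumptions.
def offline_stored_bytes(rows_per_day, retention_days, row_bytes,
                         compression, overhead, replication):
    raw = rows_per_day * retention_days * row_bytes
    return raw * compression * (1 + overhead) * replication

b = offline_stored_bytes(500_000, 365, 800, compression=0.35,
                         overhead=0.20, replication=3)
print(f"{b / 2**30:.0f} GiB stored")
```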

Online Serving Footprint

The online store keeps the latest feature vector per entity, plus serving metadata. Online raw bytes equal online row size times entity count. Overhead is often higher than offline because key-value engines maintain internal tables and tombstones. Replication aligns with availability targets and latency, so two to three copies are typical. Use reads per second and average features per read to estimate payload bytes and sustained read bandwidth.
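The online footprint and read-bandwidth estimates above reduce to two small functions; all inputs here are illustrative assumptions:

```python
# Sketch of online footprint and read bandwidth. Row bytes, overhead,
# and traffic numbers are illustrative assumptions.
def online_stored_bytes(entities, row_bytes, overhead, replication):
    return entities * row_bytes * (1 + overhead) * replication

def read_mib_per_s(reads_per_s, features_per_read, bytes_per_feature,
                   key_bytes=16):
    payload = features_per_read * bytes_per_feature + key_bytes
    return reads_per_s * payload / 2**20

stored = online_stored_bytes(500_000, row_bytes=1_000, overhead=0.30,
                             replication=3)
print(f"{stored / 2**30:.2f} GiB online, "
      f"{read_mib_per_s(5_000, 40, 8):.2f} MiB/s reads")
```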

Governance and Metadata Overhead

Production feature platforms store definitions, validation rules, ownership, and lineage. Definitions scale with the number of features, versions, and projects; allocate kilobytes per definition for documentation and statistics. Lineage events scale with job runs, materializations, and access logs. Total governance bytes combine definitions and lineage, then apply overhead for indexing and partitions and replication for durability. Even when small, governance data is vital for incident response and audits.
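The governance sizing described above combines definitions and lineage before applying overhead and replication; the counts and byte sizes below are illustrative assumptions:

```python
# Sketch of governance sizing: definitions plus lineage history, then
# indexing overhead and replication. All inputs are illustrative.
def governance_bytes(definitions, bytes_per_definition,
                     events_per_day, bytes_per_event, retention_days,
                     overhead, replication):
    defs = definitions * bytes_per_definition
    lineage = events_per_day * bytes_per_event * retention_days
    return (defs + lineage) * (1 + overhead) * replication

g = governance_bytes(2_000, 4_096, 50_000, 512, 180,
                     overhead=0.20, replication=3)
print(f"{g / 2**30:.1f} GiB for governance")
```

Note that lineage usually dwarfs definitions, so lineage retention is the lever that matters.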

Operational Bandwidth and Headroom

Storage is only half the story; ingestion and serving must move data reliably. The calculator converts rows per day to rows per second and estimates write MiB/s from offline row bytes. Batch payload size helps tune file sizes, buffering, and network windows. Add headroom to cover growth, late-arriving events, and compaction lag; 20–50% is common. Without enough headroom, spikes or backfills can push usage past capacity and trigger alarms.
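The bandwidth and headroom arithmetic is simple enough to check by hand; the inputs below are illustrative:

```python
# Sketch of sustained write bandwidth and headroom provisioning.
# Traffic and capacity figures are illustrative assumptions.
def write_mib_per_s(rows_per_day, row_bytes):
    # Sustained ingest: daily bytes spread over 86,400 seconds.
    return rows_per_day * row_bytes / 86_400 / 2**20

def with_headroom(stored_bytes, headroom=0.30):
    return stored_bytes * (1 + headroom)

print(f"{write_mib_per_s(25_000_000, 1_000):.2f} MiB/s sustained writes")
print(f"{with_headroom(10 * 2**40) / 2**40:.0f} TiB provisioned")
```

Sustained averages are often surprisingly small; peak and backfill rates, not the daily mean, usually set the network budget.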

FAQs

What does compression ratio represent?

It is the stored size after compression divided by raw size. Lower values mean better compression. Use observed ratios from your files after compaction, not generic vendor claims.

How should I choose overhead percent?

Overhead covers indexes, partitions, tombstones, and compaction slack. Start with 10–30% for offline formats and 15–40% for online key-value stores, then tune using actual storage metrics.

Why are offline and online stores sized differently?

Offline storage tracks history for training and audits, so retention days dominate. Online storage usually holds the latest vector per entity, so entity count and replication drive most capacity.

What does replication mean in the results?

It multiplies the stored bytes to account for multiple copies across nodes or zones. Match it to your availability and recovery goals, and remember some systems use erasure coding instead of full copies.

How much headroom is reasonable?

Many teams plan 20–50% headroom to absorb growth, late data, and compaction lag. If you backfill often or face seasonal spikes, choose higher headroom and monitor usage trends weekly.

How do I interpret write and read bandwidth?

Write bandwidth estimates sustained ingestion from daily volume, while read bandwidth estimates serving throughput from requests per second and payload size. Compare these to network limits, disk throughput, and autoscaling targets.

Related Calculators

- Inference Latency Calculator
- Parameter Count Calculator
- Dataset Split Calculator
- Epoch Time Estimator
- Cloud GPU Cost
- Throughput Calculator
- Memory Footprint Calculator
- Latency Budget Planner
- Model Compression Ratio
- Pruning Savings Calculator

Important Note: All calculators on this site are for educational purposes only, and we do not guarantee the accuracy of the results. Please consult other sources as well.