Inputs
Example Data Table
| Scenario | Entities | Rows/day | Features | Bytes/feature | Retention | Compression | Replication | Estimated total (with headroom) |
|---|---|---|---|---|---|---|---|---|
| Pilot | 50,000 | 200,000 | 40 | 12 | 90d | 0.40 | 2x | ~10–15 GiB |
| Production | 500,000 | 2,500,000 | 60 | 16 | 365d | 0.35 | 3x | ~1–2 TiB |
| High frequency | 2,000,000 | 25,000,000 | 120 | 16 | 365d | 0.30 | 3x | ~20–30 TiB |
Formulas Used
- Feature vector bytes = features_per_row × avg_feature_bytes.
- Offline row bytes = vector + key_bytes + timestamp_bytes + row_overhead_bytes.
- Offline raw bytes = offline_row_bytes × rows_per_day × retention_days.
- Offline compressed bytes = offline_raw_bytes × compression_ratio.
- Offline total = compressed × (1 + overhead%) × replication.
- Online total = online_row_bytes × entities × (1 + overhead%) × replication.
- Metadata total = (defs + lineage) × (1 + overhead%) × replication.
- Provisioned = combined × (1 + headroom%).
- Write MiB/s ≈ offline_row_bytes × (rows_per_day ÷ 86,400) ÷ 1,048,576 (bytes per MiB).
- Read MiB/s ≈ read_payload_bytes × reads_per_sec ÷ 1,048,576.
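The storage formulas above can be sketched directly in Python. This is a minimal sketch, not the calculator's implementation; the default key, timestamp, and row-overhead byte counts are illustrative assumptions.

```python
# Sketch of the sizing formulas above. Default key, timestamp, and
# row-overhead sizes are illustrative assumptions, not fixed defaults.

def offline_row_bytes(features, bytes_per_feature,
                      key_bytes=16, timestamp_bytes=8, row_overhead_bytes=24):
    """Feature vector bytes plus key, timestamp, and per-row overhead."""
    return (features * bytes_per_feature
            + key_bytes + timestamp_bytes + row_overhead_bytes)

def offline_total_bytes(row_bytes, rows_per_day, retention_days,
                        compression_ratio, overhead_pct, replication):
    """Raw history, compressed, with overhead and replication applied."""
    raw = row_bytes * rows_per_day * retention_days
    return raw * compression_ratio * (1 + overhead_pct) * replication

def online_total_bytes(row_bytes, entities, overhead_pct, replication):
    """Latest vector per entity, with overhead and replication."""
    return row_bytes * entities * (1 + overhead_pct) * replication

def provisioned_bytes(combined_bytes, headroom_pct):
    """Headroom for growth, seasonality, and backfills."""
    return combined_bytes * (1 + headroom_pct)

# Example: 40 features at 12 bytes each -> 480 + 16 + 8 + 24 = 528-byte rows.
row = offline_row_bytes(40, 12)
```

Each function mirrors one bullet above, so intermediate values (raw, compressed, replicated) are easy to inspect when validating against real storage metrics.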
How to Use This Calculator
- Set rows per day and retention for your training history needs.
- Estimate bytes per feature using actual types and encodings.
- Choose a realistic compression ratio after compaction and columnar formats.
- Match replication to your availability and disaster recovery goals.
- Include overhead for indexes, tombstones, and compaction slack.
- Use reads per second and features per read for serving bandwidth.
- Add headroom to cover growth, seasonality, and backfills.
- Submit to view results, then export CSV or PDF for planning.
Sizing Inputs That Matter
Accurate sizing starts with clear data assumptions. Count online entities as unique keys served. Estimate offline rows per day from event volume and refresh frequency. Multiply features per row by average bytes per feature, using real encodings for integers, floats, and hashed categories. Add key, timestamp, and row overhead to capture serialization, schema headers, and small-file effects. These inputs drive storage and bandwidth projections. Validate with pilot runs and historical retention policies.
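For example, average bytes per feature can be estimated from the type mix. The per-type sizes below assume common encodings (8-byte integers and doubles, 4-byte floats, hashed categories stored as 4-byte ints) and are assumptions to adjust for your formats.

```python
# Estimate average bytes per feature from a type mix. The per-type
# byte sizes assume typical encodings and are illustrative.
TYPE_BYTES = {"int64": 8, "float64": 8, "float32": 4, "hashed_cat": 4}

def avg_feature_bytes(type_counts):
    """Weighted average bytes per feature across the declared types."""
    total_features = sum(type_counts.values())
    total_bytes = sum(TYPE_BYTES[t] * n for t, n in type_counts.items())
    return total_bytes / total_features

# 10 int64 counters, 20 float32 signals, 10 hashed categories:
# (10*8 + 20*4 + 10*4) / 40 = 5.0 bytes per feature on average.
mix = {"int64": 10, "float32": 20, "hashed_cat": 10}
```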
Offline Store Capacity Planning
The offline store holds historical feature values for training, backfills, and audits. Raw offline bytes equal offline row size times rows per day times retention days. Apply a compression ratio reflecting columnar formats and compaction; many pipelines land between 0.25 and 0.50. Include offline overhead for indexes, partition metadata, and compaction slack. Multiply by replication to reflect zone copies or erasure-coded equivalents.
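A quick worked example of that chain, using illustrative inputs (1,008-byte rows, 2.5M rows/day, one-year retention, 20% overhead, 3x replication):

```python
# Offline sizing chain with illustrative inputs; not a recommendation.
TiB = 2**40

row_bytes = 1_008          # e.g. 60 features x 16 bytes + 48 bytes of overhead
rows_per_day = 2_500_000
retention_days = 365

raw = row_bytes * rows_per_day * retention_days
for ratio in (0.25, 0.35, 0.50):
    total = raw * ratio * 1.20 * 3   # 20% overhead, 3x replication
    print(f"compression {ratio:.2f}: {total / TiB:.2f} TiB")
```

Under these assumptions the totals span roughly 0.75–1.5 TiB, which illustrates how strongly the observed compression ratio drives the final bill.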
Online Serving Footprint
The online store keeps the latest feature vector per entity, plus serving metadata. Online raw bytes equal online row size times entity count. Overhead is often higher than offline because key-value engines maintain internal tables and tombstones. Replication aligns with availability targets and latency, so two to three copies are typical. Use reads per second and average features per read to estimate payload bytes and sustained read bandwidth.
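The read-side estimate is simple to sketch; the 64-features-at-16-bytes payload and 10,000 reads/s below are assumptions for illustration.

```python
# Serving payload size and sustained read bandwidth; inputs are illustrative.
MiB = 2**20

def read_mib_per_s(features_per_read, bytes_per_feature, reads_per_sec):
    """Payload bytes per read, scaled to sustained MiB/s."""
    payload_bytes = features_per_read * bytes_per_feature
    return payload_bytes * reads_per_sec / MiB

# 64 features x 16 bytes = 1 KiB payload; at 10,000 reads/s that is
# about 9.77 MiB/s of sustained read bandwidth.
rate = read_mib_per_s(64, 16, 10_000)
```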
Governance and Metadata Overhead
Production feature platforms store definitions, validation rules, ownership, and lineage. Definitions scale with the number of features, versions, and projects; allocate kilobytes per definition for documentation and statistics. Lineage events scale with job runs, materializations, and access logs. Total governance bytes combine definitions and lineage, then apply overhead for indexing and partitions and replication for durability. Even when small, governance data is vital for incident response and audits.
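A rough sketch of that governance math; the per-definition and per-event sizes below are assumptions to vary against your own catalog.

```python
# Governance bytes: definitions plus lineage, then overhead and replication.
def metadata_total_bytes(num_defs, bytes_per_def,
                         lineage_events, bytes_per_event,
                         overhead_pct, replication):
    """Combine definition and lineage bytes, then apply multipliers."""
    raw = num_defs * bytes_per_def + lineage_events * bytes_per_event
    return raw * (1 + overhead_pct) * replication

# 5,000 definitions at 4 KiB plus 10M lineage events at 512 bytes,
# with 20% overhead and 3x replication -> roughly 17 GiB.
total = metadata_total_bytes(5_000, 4_096, 10_000_000, 512, 0.20, 3)
```

Even the "roughly 17 GiB" here is dominated by lineage events, which is typical: definitions stay small while run and access logs grow with activity.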
Operational Bandwidth and Headroom
Storage is only half the story; ingestion and serving must move data reliably. The calculator converts rows per day to rows per second and estimates write MiB/s from offline row bytes. Batch payload size helps tune file sizes, buffering, and network windows. Add headroom to cover growth, late-arriving events, and compaction lag; 20–50% is common. With too little headroom, spikes and backfills will trip capacity alarms.
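As one example of the conversion, using illustrative inputs (2,000-byte rows at 25M rows/day):

```python
# Daily row volume -> sustained write bandwidth, plus a headroom multiplier.
MiB = 2**20

def write_mib_per_s(row_bytes, rows_per_day):
    """Average bytes per second over a day, expressed in MiB/s."""
    return row_bytes * rows_per_day / 86_400 / MiB

sustained = write_mib_per_s(2_000, 25_000_000)  # ~0.55 MiB/s sustained
peak_budget = sustained * 1.5                   # plan for 50% headroom
```

Note how small the sustained average is even for 50 GB/day; real ingestion arrives in bursts, so the headroom multiplier and batch sizing matter more than the daily average suggests.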
FAQs
What does compression ratio represent?
It is the stored size after compression divided by raw size. Lower values mean better compression. Use observed ratios from your files after compaction, not generic vendor claims.
How should I choose overhead percent?
Overhead covers indexes, partitions, tombstones, and compaction slack. Start with 10–30% for offline formats and 15–40% for online key-value stores, then tune using actual storage metrics.
Why are offline and online stores sized differently?
Offline storage tracks history for training and audits, so retention days dominate. Online storage usually holds the latest vector per entity, so entity count and replication drive most capacity.
What does replication mean in the results?
It multiplies the stored bytes to account for multiple copies across nodes or zones. Match it to your availability and recovery goals, and remember some systems use erasure coding instead of full copies.
How much headroom is reasonable?
Many teams plan 20–50% headroom to absorb growth, late data, and compaction lag. If you backfill often or face seasonal spikes, choose higher headroom and monitor usage trends weekly.
How do I interpret write and read bandwidth?
Write bandwidth estimates sustained ingestion from daily volume, while read bandwidth estimates serving throughput from requests per second and payload size. Compare these to network limits, disk throughput, and autoscaling targets.