Example Data Table
Use these scenarios to sanity-check your inputs.
| Scenario | Modality | Key Inputs | Typical Outcome |
|---|---|---|---|
| Customer churn model | Tabular | 30 features, 8 classes, 600/class, float32, 10/10 splits | ~30k–60k samples, tens of MB |
| Defect detection | Images | 224×224×3, uint8, compression 8×, 12 classes, 800/class | ~15k–40k images, few GB |
| Intent classification | Text | 256 tokens, 3 bytes/token, compression 3×, 25 classes, 400/class | ~10k–30k samples, hundreds of MB |
Formulas Used
This calculator estimates both sample counts and storage footprint.
- Classification Basis = classes × samples_per_class
- Regression Basis = features × samples_per_feature
- Unsupervised Basis = features × samples_per_feature × 0.8
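The three basis formulas above can be sketched directly. Function names are illustrative, and the example call uses the churn scenario from the table (8 classes × 600 samples per class):

```python
def classification_basis(classes: int, samples_per_class: int) -> int:
    # Basis = classes × samples_per_class
    return classes * samples_per_class

def regression_basis(features: int, samples_per_feature: int) -> int:
    # Basis = features × samples_per_feature
    return features * samples_per_feature

def unsupervised_basis(features: int, samples_per_feature: int) -> float:
    # Same proxy as regression, discounted by the 0.8 factor above.
    return features * samples_per_feature * 0.8

classification_basis(8, 600)  # → 4800
```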
How to Use This Calculator
- Select your data modality and objective.
- Set complexity and baseline sampling targets.
- Enter modality sizing values and compression ratio.
- Choose validation and test splits for evaluation.
- Add overhead, headroom, and replication if applicable.
- Press calculate, then export CSV or PDF.
Sample targets that match problem difficulty
This calculator starts with a baseline target and scales it using a complexity multiplier. For classification, the baseline is classes multiplied by samples per class, which keeps rare classes from being underrepresented. For regression and unsupervised work, features multiplied by samples per feature is used as a practical proxy for signal coverage. Moving from Moderate to Complex increases the target by roughly 67%, which often matches the jump in decision boundaries, noise, and edge cases.
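The scaling step might look like this in code. The multiplier values are assumptions, chosen so that Moderate → Complex is the roughly 67% jump described above (2.5 / 1.5 ≈ 1.67):

```python
# Hypothetical complexity multipliers; only the Moderate→Complex
# ratio (~1.67×) is taken from the text above.
MULTIPLIERS = {"Simple": 1.0, "Moderate": 1.5, "Complex": 2.5}

def recommended_samples(basis: int, complexity: str) -> int:
    # Scale the baseline target (e.g. classes × samples_per_class)
    # by the complexity multiplier.
    return round(basis * MULTIPLIERS[complexity])

recommended_samples(4800, "Complex")  # → 12000
```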
Split strategy and statistical confidence
Validation and test splits reduce the usable training pool, so the tool back-calculates total samples from your training target. A common 80/10/10 split yields 25% more total data than the training count alone. When the dataset is small, pushing test below 10% can inflate variance; when the dataset is huge, 1–5% is typically enough. For logs and sensor streams, keep splits time-aware to avoid leakage.
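The back-calculation is a one-liner: the training fraction is what remains after carving out validation and test. The 8,000-sample training target below is a made-up example:

```python
import math

def total_from_training(training_target: int,
                        val_frac: float,
                        test_frac: float) -> int:
    # Training gets whatever fraction the splits leave behind.
    train_frac = 1.0 - val_frac - test_frac
    return math.ceil(training_target / train_frac)

# 80/10/10: the total is 25% larger than the training count alone.
total_from_training(8000, 0.10, 0.10)  # → 10000
```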
Bytes per sample across modalities
Payload size differs dramatically by modality. Tabular records scale with feature count and numeric encoding, so switching from float64 to float32 cuts payload in half. Images scale with width × height × channels and then shrink by the chosen compression ratio, which approximates JPEG or PNG storage. Text uses average tokens and bytes per token, capturing multilingual effects. Audio grows with sample rate, duration, channels, and bit depth, before compression is applied.
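The per-modality sizing rules reduce to small formulas. This is a sketch under stated assumptions: the float32 default for tabular data and the defect-detection image specs (224×224×3 at 8× compression, from the table above) are illustrative:

```python
def tabular_bytes(features: int, bytes_per_value: int = 4) -> int:
    # 4 bytes/value = float32; switching to float64 (8) doubles this.
    return features * bytes_per_value

def image_bytes(width: int, height: int, channels: int,
                compression_ratio: float) -> float:
    # Raw pixel bytes (uint8) shrunk by the compression ratio.
    return width * height * channels / compression_ratio

def text_bytes(avg_tokens: int, bytes_per_token: float,
               compression_ratio: float) -> float:
    return avg_tokens * bytes_per_token / compression_ratio

def audio_bytes(sample_rate: int, seconds: float, channels: int,
                bit_depth: int, compression_ratio: float) -> float:
    return sample_rate * seconds * channels * (bit_depth / 8) / compression_ratio

image_bytes(224, 224, 3, 8)  # → 18816.0 (~18.4 KiB per image)
```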
Overheads, headroom, and replicas for real storage
Raw payload is rarely the final footprint. File formats add headers and block padding, catalogs add indices, and projects need headroom for retries, sharding, and new labels. The calculator lets you model these as percentage overhead plus a replication factor. A 5% format overhead, 3% index overhead, and 10% headroom together add 18% before replication. With three replicas for durability, storage can exceed payload by more than 3.5×.
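A minimal sketch of the footprint model, assuming the percentage overheads combine additively as in the 5% + 3% + 10% = 18% example above:

```python
def storage_footprint(payload_bytes: float,
                      format_pct: float = 0.05,
                      index_pct: float = 0.03,
                      headroom_pct: float = 0.10,
                      replicas: int = 3) -> float:
    # Overheads are modelled additively (5% + 3% + 10% = 18%),
    # then the whole copy is multiplied by the replication factor.
    per_copy = payload_bytes * (1 + format_pct + index_pct + headroom_pct)
    return per_copy * replicas

storage_footprint(1_000_000)  # → 3540000.0, i.e. 3.54× the payload
```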
Throughput planning to keep GPUs busy
Sizing storage is only half the plan; delivery rate matters. The tool estimates transfer time from bandwidth in megabits per second, converting to megabytes per second for a realistic copy window. It also estimates batch memory as bytes-per-sample times batch size, with an additional pipeline buffer for prefetch and augmentation. If batch memory is near device limits, reduce batch size, raise compression, or stream from faster local media to avoid stalls.
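Both estimates come down to simple arithmetic. In this sketch, the 25% pipeline buffer is an assumed default rather than a value the calculator specifies:

```python
def transfer_seconds(total_bytes: float, bandwidth_mbps: float) -> float:
    # Megabits/s → bytes/s: ×1,000,000 for mega, ÷8 for bits→bytes.
    bytes_per_second = bandwidth_mbps * 1_000_000 / 8
    return total_bytes / bytes_per_second

def batch_memory_bytes(bytes_per_sample: float, batch_size: int,
                       pipeline_buffer: float = 0.25) -> float:
    # Buffer covers prefetch queues and augmentation copies
    # (25% is an assumption, not a calculator-defined constant).
    return bytes_per_sample * batch_size * (1 + pipeline_buffer)

transfer_seconds(10 * 1024**3, 1000)  # ≈ 85.9 s for 10 GiB over 1 Gbps
```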
FAQs
1) How does complexity change the recommended sample count?
Complexity scales the baseline target using a multiplier. Higher settings assume more edge cases, noisier labels, and broader coverage needs, so the tool recommends more effective training samples before splits and augmentation are applied.
2) Does on-the-fly augmentation reduce the data I must collect?
Yes. The calculator divides the effective target by the augmentation multiplier to estimate raw training samples needed. If you also store augmented copies, storage increases even though raw collection needs stay lower.
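The adjustment divides the effective target by the augmentation multiplier; in this sketch the 30,000-sample target and 4× multiplier are hypothetical values:

```python
import math

def raw_training_needed(effective_target: int,
                        augmentation_multiplier: float) -> int:
    # Each raw sample yields ~multiplier effective samples,
    # so fewer raw samples need to be collected.
    return math.ceil(effective_target / augmentation_multiplier)

raw_training_needed(30000, 4)  # → 7500
```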
3) What compression ratio should I use?
Use an observed ratio from a small sample export. For images, compare raw pixel bytes to saved JPEG or PNG sizes. For text, compare uncompressed JSON to gzip. Keep the value conservative to avoid underestimating storage.
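For text, one way to measure an observed ratio is to gzip a small JSON export and compare sizes; the helper name here is illustrative:

```python
import gzip
import json

def observed_text_ratio(records: list) -> float:
    # Ratio of uncompressed JSON bytes to gzip bytes; feed this
    # (rounded down, to stay conservative) into the calculator.
    raw = json.dumps(records).encode("utf-8")
    compressed = gzip.compress(raw)
    return len(raw) / len(compressed)

# Repetitive records compress well, so the ratio comes out > 1.
observed_text_ratio([{"intent": "refund", "text": "where is my order"}] * 100)
```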
4) Why are validation and test splits included in total samples?
Splits remove samples from training, but you still need them for unbiased evaluation. The tool back-calculates the total dataset so the remaining training portion meets the recommended raw target.
5) How should I set replication factor and headroom?
Replication models copies for durability and faster access. Headroom covers growth, relabeling, and versioned datasets. If you use three replicas and 10% headroom, storage requirements can rise quickly, so validate against available capacity.
6) What does the batch memory estimate represent?
It approximates the data footprint of one training batch plus a pipeline buffer for prefetching and transforms. It is not total model memory. If it is high, reduce batch size, compress payloads, or optimize the data loader.