Indexing Time Calculator

Inputs

Total documents

Average size per document (KB)

Parse time per KB (ms)

Tokenize time per KB (ms)

Write time per KB (ms)

Batch size (docs)

Batch overhead (ms)

Parallel workers

Average CPU utilization (%)

I/O wait (%)

Throttle: pause minutes per hour

Warmup time (minutes)

Start date & time (optional)

Per‑doc time (ms) = size (KB) × (parse + tokenize + write, in ms per KB) + (batch overhead ÷ batch size). The result is then divided by CPU utilization and scaled up by I/O wait.

Results

Enter your parameters and click Calculate to see throughput, total wall time, and ETA.

Quick tips

Increase workers to scale horizontally; watch I/O wait to avoid contention.
Right‑size the batch: too small and the per‑batch overhead dominates; too large and latency and the failure blast radius both grow.
Tune per‑KB timings from profiling: parsing and tokenization often dominate CPU time.

How the math works

Per‑doc time (single, ms)	sizeKB × (parse+tokenize+write) + (batchOverhead / batchSize)
Effective per‑doc time (ms)	perDocSingle ÷ (util%/100) × (1 + ioWait%/100)
Throughput (docs/s)	(1000 / effectivePerDocMs) × workers
Raw duration (s)	totalDocs ÷ throughput
Throttle factor	60 ÷ (60 − pauseMinutesPerHour)
Wall time (s)	(rawDuration × throttleFactor) + (warmupMinutes × 60)
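
The formulas above can be sketched in a few lines of Python. This is an illustrative sketch, not the calculator's actual source: the function name `estimate_indexing` and its parameter names are made up for the example, and it assumes that warmup is entered in minutes and converted to seconds, and that the ETA is simply the start time plus the wall time.

```python
from datetime import datetime, timedelta


def estimate_indexing(total_docs, size_kb, parse_ms, tokenize_ms, write_ms,
                      batch_size, batch_overhead_ms, workers,
                      util_pct, io_wait_pct, pause_min_per_hour, warmup_min,
                      start=None):
    """Hypothetical helper mirroring the table's formulas step by step."""
    if not 0 < util_pct <= 100:
        raise ValueError("utilization must be in (0, 100]")
    if not 0 <= pause_min_per_hour < 60:
        raise ValueError("pause minutes per hour must be in [0, 60)")

    # Per-doc time on a single worker, in ms, before adjustments.
    per_doc_single_ms = (size_kb * (parse_ms + tokenize_ms + write_ms)
                         + batch_overhead_ms / batch_size)
    # Lower utilization and higher I/O wait both inflate the per-doc time.
    effective_ms = per_doc_single_ms / (util_pct / 100) * (1 + io_wait_pct / 100)
    # Documents per second across all workers.
    throughput = (1000 / effective_ms) * workers
    raw_s = total_docs / throughput
    # Stretch the run to account for scheduled pauses in every hour.
    throttle_factor = 60 / (60 - pause_min_per_hour)
    # Assumption: warmup is given in minutes, so convert it to seconds.
    wall_s = raw_s * throttle_factor + warmup_min * 60
    eta = start + timedelta(seconds=wall_s) if start is not None else None
    return {
        "effective_per_doc_ms": effective_ms,
        "throughput_docs_per_s": throughput,
        "raw_s": raw_s,
        "wall_s": wall_s,
        "eta": eta,
    }


# Example inputs chosen only to keep the arithmetic easy to follow.
result = estimate_indexing(
    total_docs=1_152_000, size_kb=8,
    parse_ms=0.25, tokenize_ms=0.5, write_ms=0.25,
    batch_size=50, batch_overhead_ms=100, workers=5,
    util_pct=80, io_wait_pct=25, pause_min_per_hour=10, warmup_min=3,
    start=datetime(2024, 1, 1, 9, 0),
)
print(result)
```

With these inputs the per‑doc time is 8 × 1.0 + 100 ÷ 50 = 10 ms, the effective time is about 15.6 ms, throughput is about 320 docs/s, and the wall time is about 75 minutes.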

FAQs

1) What does I/O wait represent?

The fraction of time workers spend stalled on disk or network rather than executing compute. Higher values inflate effective per‑document time.

2) How should I pick batch size?

Choose a size that amortizes setup overhead without risking large retries on failure. Start with a few hundred to a few thousand documents per batch, then adjust based on observed error rates and latency targets.
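
To see the trade‑off in numbers, the sketch below tabulates how much of the per‑document time is batch overhead at several batch sizes. The helper name `overhead_share` and the sample values (8 ms of per‑doc work, 100 ms of overhead per batch) are assumptions for illustration only; the batch size also doubles as the number of documents that would need retrying if one batch fails.

```python
def overhead_share(work_ms, batch_overhead_ms, batch_size):
    """Return (per-doc overhead in ms, overhead as a fraction of per-doc time)."""
    per_doc_overhead_ms = batch_overhead_ms / batch_size
    share = per_doc_overhead_ms / (work_ms + per_doc_overhead_ms)
    return per_doc_overhead_ms, share


# Hypothetical values: 8 ms of parse/tokenize/write work per document,
# 100 ms of fixed overhead per batch.
for batch_size in (10, 100, 1_000, 10_000):
    overhead_ms, share = overhead_share(8.0, 100.0, batch_size)
    print(f"batch={batch_size:>6}  overhead/doc={overhead_ms:7.3f} ms  "
          f"share={share:6.1%}  docs retried on failure={batch_size}")
```

The overhead per document shrinks quickly at first and then flattens, which is why very large batches add retry risk without buying much throughput.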

3) Why is utilization below 100%?

Background services, context switches, garbage collection (GC), and coordination between workers all reduce the CPU time actually available to indexing threads.

4) Does compression change write time?

Yes. Compression trades CPU for I/O. If you compress postings or stored fields, increase write time per KB to reflect the extra work.

5) Can I model heterogeneous documents?

Approximate by computing a weighted average document size (KB) and weighted average per‑KB timings across your corpus, or run a separate scenario for each cluster of similar documents.
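
One way to compute those averages is sketched below. The cluster data and the function name `weighted_inputs` are hypothetical. The sketch weights size by document count and weights the per‑KB timing by data volume (docs × KB), a choice made here so that the product of the two averages equals the true mean per‑document time.

```python
def weighted_inputs(clusters):
    """Collapse clusters into one average size (KB) and one ms-per-KB figure.

    Each cluster is a dict with 'docs', 'size_kb', and 'ms_per_kb'
    (parse + tokenize + write combined).
    """
    total_docs = sum(c["docs"] for c in clusters)
    total_kb = sum(c["docs"] * c["size_kb"] for c in clusters)
    if total_docs == 0 or total_kb == 0:
        raise ValueError("clusters must contain at least one non-empty document")
    # Size is averaged per document.
    avg_size_kb = total_kb / total_docs
    # Timing is averaged per KB, so large documents count proportionally more.
    total_ms = sum(c["docs"] * c["size_kb"] * c["ms_per_kb"] for c in clusters)
    avg_ms_per_kb = total_ms / total_kb
    return avg_size_kb, avg_ms_per_kb


# Hypothetical corpus: many small, cheap docs plus a few large, expensive ones.
clusters = [
    {"docs": 900, "size_kb": 4, "ms_per_kb": 1.0},
    {"docs": 100, "size_kb": 64, "ms_per_kb": 2.0},
]
avg_size_kb, avg_ms_per_kb = weighted_inputs(clusters)
print(avg_size_kb, avg_ms_per_kb, avg_size_kb * avg_ms_per_kb)
```

For this sample corpus the averages come out to roughly 10 KB and 1.64 ms per KB, or about 16.4 ms of work per document.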

6) What if workers are autoscaled?

Use the average number of workers you expect across the run, or run the calculator in phases with different worker counts and sum the durations.
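
The phased approach might look like the following sketch. It assumes each phase is described by the number of documents it processes and the worker count during that phase; `phased_duration_s` is a made‑up name, and throttling and warmup are left out to keep the example short.

```python
def phased_duration_s(effective_per_doc_ms, phases):
    """Sum raw durations over phases given as (docs, workers) pairs."""
    total_s = 0.0
    for docs, workers in phases:
        if workers <= 0:
            raise ValueError("each phase needs at least one worker")
        # Same throughput formula as the single-phase case.
        throughput = (1000 / effective_per_doc_ms) * workers
        total_s += docs / throughput
    return total_s


# Hypothetical run: 2 workers while the cluster scales up, then 8 workers.
phases = [(60_000, 2), (240_000, 8)]
total_s = phased_duration_s(10.0, phases)
print(f"{total_s:.0f} s")
```

At 10 ms per document each worker handles about 100 docs/s, so both phases take about 300 s and the run totals about 600 s.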

7) How accurate is the ETA?

It is an estimate. For better accuracy, measure per‑KB timings on a representative sample, include realistic pauses, and monitor I/O contention in staging.
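
A rough way to measure a per‑KB timing on a sample is sketched below. `measure_ms_per_kb` and `toy_tokenize` are stand‑ins invented for this example; in practice you would pass your own parse, tokenize, or write stage and a representative set of real documents.

```python
import time


def measure_ms_per_kb(stage, samples):
    """Time one pipeline stage over sample documents and return ms per KB."""
    total_bytes = sum(len(doc) for doc in samples)
    if total_bytes == 0:
        raise ValueError("samples must contain at least one non-empty document")
    started = time.perf_counter()
    for doc in samples:
        stage(doc)
    elapsed_ms = (time.perf_counter() - started) * 1000
    return elapsed_ms / (total_bytes / 1024)


def toy_tokenize(doc):
    # Placeholder stage: whitespace splitting stands in for a real tokenizer.
    return doc.split()


# Synthetic sample; replace with documents drawn from your own corpus.
samples = [b"lorem ipsum dolor sit amet " * 400 for _ in range(50)]
tokenize_ms_per_kb = measure_ms_per_kb(toy_tokenize, samples)
print(f"tokenize: {tokenize_ms_per_kb:.4f} ms/KB")
```

Timings from a single short run are noisy, so it helps to repeat the measurement several times and to run it on hardware similar to production.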