Why automated order selection matters
Auto ARIMA screens many ARIMA(p,d,q) and seasonal (P,D,Q)[m] combinations to reduce manual trial and error. With p and q each capped at 3 and seasonal P and Q each capped at 1, the search evaluates up to 4 × 4 × 2 × 2 = 64 candidates. A smaller grid often beats a single complex model because it limits overfitting on short series and keeps the winner interpretable for most teams.
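The grid arithmetic above can be sketched directly. This is a minimal illustration, not the calculator's actual search code, and the bounds are the ones stated in the text (d and D are fixed by the differencing settings, so they do not multiply the grid):

```python
from itertools import product

# p, q range over 0..3 (4 options each); seasonal P, Q over 0..1 (2 each).
candidates = list(product(range(4), range(4), range(2), range(2)))
print(len(candidates))  # 4 * 4 * 2 * 2 = 64
```

Each tuple (p, q, P, Q) is one candidate to fit and score; widening any bound multiplies the count, which is why modest caps keep the search fast.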
Inputs that shape the search space
Differencing settings control stationarity. A common starting point is d=1 for trending monthly data and D=1 when a 12‑month pattern persists after detrending. Keep the seasonal period m well below the sample length; with 36 points, m=12 leaves only 24 seasonally differenced values. The forecast horizon h should match the decision being made: 4–12 steps is typical for inventory, while budgeting may need 12–24. The compute budget caps runtime; 800–1500 ms is typical.
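Seasonal differencing at lag m consumes the first m observations, which is where the "36 points, m=12, 24 values" arithmetic comes from. A minimal sketch (the function name is illustrative, not part of the calculator):

```python
def seasonal_difference(series, m):
    """Return the lag-m differenced series: y[t] - y[t-m]."""
    return [series[t] - series[t - m] for t in range(m, len(series))]

y = list(range(36))              # toy monthly series of 36 points
diffed = seasonal_difference(y, m=12)
print(len(diffed))               # 36 - 12 = 24 values remain
```

Applying d=1 on top removes one more point, so short series can run out of usable observations quickly.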
How scoring balances fit and complexity
This calculator ranks models using AIC or BIC. Both combine the conditional log‑likelihood ℓ with the parameter count k. AIC = 2k − 2ℓ favors predictive accuracy, while BIC = k·ln(n) − 2ℓ imposes a penalty that grows with the effective number of observations n. When n is near 30, BIC’s per‑parameter penalty is roughly 3.4 (ln 30 ≈ 3.4) versus AIC’s fixed 2; at n = 120 it is about 4.8. If AIC and BIC disagree, prefer BIC for sparse data.
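The two criteria are simple enough to compute by hand. A sketch matching the formulas in the text, with a hypothetical log-likelihood value for illustration:

```python
import math

def aic(loglik, k):
    """Akaike information criterion: 2k - 2*loglik."""
    return 2 * k - 2 * loglik

def bic(loglik, k, n):
    """Bayesian information criterion: k*ln(n) - 2*loglik."""
    return k * math.log(n) - 2 * loglik

# BIC's per-parameter penalty grows with n, AIC's stays at 2.
print(round(math.log(30), 1))   # 3.4
print(round(math.log(120), 1))  # 4.8
print(aic(-10.0, 2))            # 24.0
```

Lower is better for both; the model with the smallest criterion value wins the ranking.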
Reading diagnostics before deployment
After selection, check the residual mean and standard deviation; a mean near zero indicates unbiased errors. The ACF at lags 1–10 helps detect leftover structure: values within ±0.2 (roughly the white‑noise band of ±2/√n for n near 100) are often acceptable for quick screening, and a slow decay can suggest under‑differencing. If early lags stay high, raise d, enable seasonality, or allow p to increase. Also compare σ² across the top candidates; large jumps can flag unstable fits.
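The quick ACF screen described above can be sketched with a sample-autocorrelation helper; the function names and the ±0.2 bound are taken from the text, not from any particular library:

```python
def acf(residuals, max_lag=10):
    """Sample autocorrelations of the residuals at lags 1..max_lag."""
    n = len(residuals)
    mean = sum(residuals) / n
    c0 = sum((r - mean) ** 2 for r in residuals) / n
    return [
        sum((residuals[t] - mean) * (residuals[t - lag] - mean)
            for t in range(lag, n)) / n / c0
        for lag in range(1, max_lag + 1)
    ]

def passes_quick_screen(residuals, bound=0.2):
    """True when all autocorrelations at lags 1-10 stay within the band."""
    return all(abs(a) <= bound for a in acf(residuals))
```

Strongly alternating residuals, for example, produce a lag‑1 autocorrelation near −1 and fail the screen, signaling structure the model missed.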
Practical workflow for model iteration
Start with conservative bounds, run selection, then refine by narrowing p and q around the best‑scoring values. If the best model has p=0 and q=0, try disabling the intercept to test a simpler baseline. For seasonality, confirm that m matches the real cycle (7 for daily‑weekly, 12 for monthly, 24 for hourly‑daily). Finally, backtest with a holdout window of 10–20% of observations, comparing MAE or MAPE alongside the chosen criterion, and keep the model that remains stable across at least two adjacent windows.
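The holdout backtest at the end of the workflow can be sketched as follows. This is a minimal illustration with assumed names: `backtest_mae` implements the 10–20% holdout from the text, and a naive last‑value forecaster stands in for the selected ARIMA model:

```python
def backtest_mae(series, forecast_fn, holdout_frac=0.15):
    """Hold out the last ~15% of points, forecast them, return MAE."""
    n = len(series)
    split = n - max(1, int(n * holdout_frac))
    train, test = series[:split], series[split:]
    preds = forecast_fn(train, len(test))
    return sum(abs(p - a) for p, a in zip(preds, test)) / len(test)

def naive_forecast(train, h):
    """Repeat the last observed value h steps ahead (stand-in model)."""
    return [train[-1]] * h

print(backtest_mae(list(range(20)), naive_forecast))  # 2.0
```

Running this over two adjacent holdout windows, per the text's stability check, means shifting the split point and confirming the error stays in the same range.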