
Time Series / Anomaly Detection / Tabular Data

tags
Database , Prometheus , Machine Learning , Statistics , Probability , Representing Time and Date

See TS section in Machine Learning

FAQ

What are the different methods?

Comparison Table

| Method | Category | Description / Features | Use Cases / Strengths | Weaknesses | Introduced |
| --- | --- | --- | --- | --- | --- |
| **Classical / Statistical** | | | | | |
| Naive / Seasonal Naive | Statistical | Forecast = last / last seasonal value | Baseline; simple; fast | Assumes persistence; often inaccurate | Foundational |
| Simple Exp. Smoothing (SES) | Statistical | Weighted average; models level | Univariate; no trend/season; simple | No trend/season handling | ~1950s |
| Holt’s Linear Trend | Statistical | SES + linear trend | Univariate; trend; simple | Assumes linear trend; no seasonality | ~1957 |
| Holt-Winters | Statistical | Holt + seasonality (additive/multiplicative) | Univariate; trend & season; good benchmark | Assumes fixed patterns | ~1960 |
| ETS | Statistical | State-space framework for exponential smoothing; auto-selects | General univariate; robust; auto-select; probabilistic | Univariate only; assumes state-space form | ~2002 |
| ARIMA/SARIMA | Statistical | Models autocorrelation (AR + I + MA); SARIMA adds seasonality | Univariate; models autocorrelation; benchmark; probabilistic | Requires stationarity; parameter tuning | ~1970 |
| Theta Method | Statistical | Decompose + damped linear extrapolation | Univariate; strong M3/M4 performance; simple | Less intuitive; mainly univariate | ~2000 |
| VAR | Statistical | Multivariate AR; models linear interdependence | Multivariate linear interactions; interpretable | Assumes linearity; needs stationarity | ~1980 |
| TAR/SETAR/STAR | Statistical | Threshold AR; regime-switching; nonlinear | Nonlinear univariate with regimes | Complex thresholds; mainly univariate | ~1978 |
| INLA | Bayesian Stat. | Approximate Bayesian inference; latent Gaussian models | Complex models; hierarchy; uncertainty (probabilistic) | Approximate method; learning curve | ~2009 |
| Prophet | Statistical / Curve Fit | Decomposes trend/season/holidays; Bayesian | Univariate; strong season/holidays; robust; probabilistic | Less accurate on some benchmarks | ~2017 |
| **Machine Learning & DL** | | Often need more data; less interpretable | Can model complex nonlinearity/interactions | Compute intensive; tuning crucial | |
| Tree-based (RF, XGBoost, …) | ML | Uses lagged/derived features in trees/ensembles | Nonlinearity/interactions; feature importance; robust | Needs feature engineering; no trend extrapolation | ~1984+ |
| SVR | ML | SVM for regression; uses tolerance margin | Robust to outliers; high-dimensional features | Less intuitive; kernel/parameter sensitive | ~1996 |
| Gaussian Processes (GP) | Bayesian ML | Non-parametric; models a distribution over functions | Probabilistic; complex nonlinearities; flexible | Slow (cubic); kernel tuning difficult | ~2006 |
| MLP | DL | Feedforward NN; needs lagged features | General nonlinear; covariates | Needs features; tuning; can overfit | ~1980s |
| RNN | DL | NN with loops for sequence processing | Sequential data; time dependencies | Vanishing gradients; often outperformed | ~1980s |
| LSTM | DL | RNN with gates for long dependencies | Complex sequences; long dependencies; multivariate | Needs data; slow; tuning; can overfit | ~1997 |
| GRU | DL | Simpler LSTM variant; similar performance | Like LSTM; potentially faster | Like LSTM; needs data; tuning | ~2014 |
| CNN (1D) | DL | Uses convolutions for sequence feature extraction | Feature extraction; fast pattern recognition | Less natural for long dependencies | ~1989/2012 |
| DeepAR/DeepVAR | DL | Autoregressive RNN that outputs distribution parameters | Probabilistic forecasts; covariates; global | Needs lots of data; complex; slow to train | ~2017 |
| N-BEATS | DL | Non-recurrent NN; basis expansion; interpretable | Univariate; state-of-the-art on M4/M3; interpretable | Mainly univariate; compute intensive | ~2019 |
| Transformer variants | DL | Self-attention mechanism; parallel processing | Long dependencies; parallel; multivariate | Data hungry; quadratic complexity | ~2017+ |
| Samformer | DL | Transformer variant | (Specific capabilities TBD) | (Likely transformer limitations) | Recent |
| TabPFN (Time Series) | DL | Transformer for small tabular data; zero-shot TS | Small datasets; little tuning needed | Newer; focused on a specific niche | ~2024 |
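As a concrete anchor for the first few rows of the table, the naive, seasonal-naive, and SES baselines each fit in a few lines of NumPy. This is an illustrative sketch (function names and the toy series are mine, not from any library):

```python
import numpy as np

def naive_forecast(y, h):
    # Forecast = last observed value, repeated h steps ahead.
    return np.full(h, y[-1], dtype=float)

def seasonal_naive_forecast(y, h, m):
    # Forecast = value at the same phase in the last full season of length m.
    return np.resize(y[-m:], h)

def ses_forecast(y, h, alpha=0.3):
    # Simple exponential smoothing: a recursively updated level,
    # which becomes the (flat) forecast at every horizon.
    level = float(y[0])
    for obs in y[1:]:
        level = alpha * obs + (1 - alpha) * level
    return np.full(h, level)

y = np.array([10., 12., 14., 11., 13., 15., 17., 14.])   # period-4 toy series
print(naive_forecast(y, 4))               # [14. 14. 14. 14.]
print(seasonal_naive_forecast(y, 4, 4))   # [13. 15. 17. 14.]
```

These are worth implementing even when you plan to use heavier models: if a DL model can’t beat seasonal naive, the extra complexity isn’t paying for itself.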

Additional notes

For time-series forecasting, we can use either:

  • Deep learning

    “In my projects, DL models outperform both statistical and ML methods in datasets with higher frequencies (hourly or more). I use TFT, NHITS, and a customized TSMixer. The most underrated statistical model that I often use is DynamicOptimizedTheta.”

  • Traditional ML/stats methods
  • LLM-based

    The fundamental challenge is that LLMs like o1 and Claude 3.5 simply aren’t built for the unique structure of tabular data. The inefficiencies quickly become apparent: serializing a 10,000 × 100 table as a token sequence, with every numerical value split into tokens, is massively wasteful.

    There’s some interesting work on using LLMs for tabular data (TabLLM: Few-shot Classification of Tabular Data with Large Language Models), but this only works for datasets with tens of samples rather than the thousands of rows needed in real-world applications.

    What o1 and other LLMs typically do is wrap around existing tabular tools like XGBoost or scikit-learn. While this works, they’re ultimately constrained by these tools' limitations. We’re taking a fundamentally different approach - building foundation models that natively understand tabular relationships and patterns. Our approach combines the benefits of foundation models with architectures specifically designed for tabular data structures.
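The scale mismatch in the tokenization point above is easy to make concrete with back-of-envelope arithmetic. The ~3 tokens-per-cell figure below is an assumption for illustration; real tokenizers vary by number formatting and separators:

```python
# Rough token budget for serializing a table into an LLM context window.
# Assumption (illustrative): a numeric cell costs ~3 tokens once digits
# and separators are counted.
rows, cols, tokens_per_cell = 10_000, 100, 3
total_tokens = rows * cols * tokens_per_cell
print(f"{total_tokens:,}")  # 3,000,000
```

Three million tokens for a single modest table is far beyond typical context windows, before any instructions or output are accounted for.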

Things people say

  • An aha moment for me was realizing that anomaly models can be thought of as effectively forecasting the next N steps, then noticing when the actual measured values are “different enough” from the expected ones. This is simple to draw on a whiteboard for one signal; when it’s multivariate, it’s pretty neat that it still works.
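A minimal sketch of that forecast-then-compare idea, assuming a seasonal-naive forecaster and a z-score-style threshold (the function name, the noise-estimation step, and the toy data are all illustrative, not from any specific library):

```python
import numpy as np

def flag_anomalies(history, actual, m, k=3.0):
    """Forecast the next len(actual) steps, then flag points that are
    "different enough" from the forecast.

    history: past values assumed clean; actual: newly measured values;
    m: seasonal period; k: how many noise-sigmas count as anomalous.
    """
    # 1. Forecast by repeating the last full season (seasonal-naive model).
    forecast = np.resize(history[-m:], len(actual))
    # 2. Estimate typical noise from season-apart differences in the history.
    sigma = (history[m:] - history[:-m]).std() + 1e-9
    # 3. "Different enough" = residual larger than k times the typical noise.
    return np.where(np.abs(actual - forecast) > k * sigma)[0]

history = np.array([10., 12., 14., 11.] * 4)   # four clean seasons, period 4
actual  = np.array([10., 12., 22., 11.])       # one injected spike at step 2
print(flag_anomalies(history, actual, m=4))    # [2]
```

For multivariate signals the same recipe applies per dimension, or on a joint residual score; only the forecaster and the distance measure need to change.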
