24 models explained through interactive visualizations. Real algorithms running in your browser — geometry, intuition, and math.
A linear model produces z = w·x + b — any number from -∞ to +∞. That's useless as a probability. The sigmoid squashes it into (0, 1). Drag the slider to move your input point. Watch how the curve compresses extreme values toward 0 and 1 — and how the decision flips when σ(z) crosses 0.5.
Sigmoid: σ(z) = 1 / (1 + e^-z)
Properties: σ(0) = 0.5 (uncertain)
σ(+∞) → 1.0 (class 1)
σ(-∞) → 0.0 (class 0)
σ'(z) = σ(z)·(1 - σ(z)) ← used in backprop
Decision: ŷ = 1 if σ(z) ≥ threshold
ŷ = 0 otherwise
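As a rough sketch, here is the sigmoid and its decision rule in NumPy; the sample z values are made up for illustration:

import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^-z): squashes any real z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z)), the identity used in backprop
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-5.0, 0.0, 5.0])   # illustrative inputs
p = sigmoid(z)                   # approx [0.0067, 0.5, 0.9933]
y_hat = (p >= 0.5).astype(int)   # decision: 1 if sigma(z) >= threshold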
Switch between models and watch how the decision boundary changes over the same dataset. Logistic draws a straight line. Naive Bayes draws elliptical contours. KNN bends to fit every cluster — adjust K to control how smooth or jagged it gets.
Logistic: σ(w·x + b) = P(y=1|x)
where σ(z) = 1 / (1 + e^-z)
Naive Bayes: P(C|x) ∝ P(C) · ∏ᵢ P(xᵢ|C)
P(xᵢ|C) ~ N(μᵢ꜀, σᵢ꜀²)
KNN: ŷ = mode { yᵢ : i ∈ KNN(x, K) }
distance(x, x') = √(Σᵢ (xᵢ - x'ᵢ)²)
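A quick way to reproduce this comparison off-page is with scikit-learn; this sketch assumes a made-up two-blob dataset and K=5, both illustrative choices:

from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=200, centers=2, random_state=0)  # toy 2-class data

models = {
    "logistic": LogisticRegression(),                  # straight-line boundary
    "naive bayes": GaussianNB(),                       # elliptical contours
    "knn (k=5)": KNeighborsClassifier(n_neighbors=5),  # boundary bends with the data
}
for name, m in models.items():
    print(name, m.fit(X, y).score(X, y))               # training accuracy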
Naive Bayes fits a Gaussian (bell curve) to each feature for each class. The blue bell is Class 0, the orange bell is Class 1. Drag the slider to move your test point — the bars show the posterior probability for each class in real time. The overlap zone is where the model is uncertain.
Gaussian likelihood: P(x | C) = (1/√(2πσ²)) · exp(-(x-μ)²/2σ²)
Posterior (Bayes rule): P(C | x) ∝ P(C) · P(x | C)
"Naive" assumption: P(x₁,x₂,...,xₙ | C) = ∏ᵢ P(xᵢ | C) ← features are independent given the class
Decision: ŷ = argmax_C log P(C) + Σᵢ log P(xᵢ|C)
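For intuition, here is a from-scratch sketch of that 1-D posterior computation in NumPy; the class means, variances, and equal priors are fabricated stand-ins for the demo's two bells:

import numpy as np

# fabricated 1-D samples for the two classes
x0 = np.random.default_rng(0).normal(-1.0, 1.0, 100)  # class 0 (blue bell)
x1 = np.random.default_rng(1).normal(+1.5, 1.0, 100)  # class 1 (orange bell)

def log_gaussian(x, mu, var):
    # log N(x; mu, var): the per-feature likelihood
    return -0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var)

def posterior(x):
    # log P(C) + log P(x|C) per class, then normalize (Bayes rule)
    scores = np.array([
        np.log(0.5) + log_gaussian(x, x0.mean(), x0.var()),
        np.log(0.5) + log_gaussian(x, x1.mean(), x1.var()),
    ])
    p = np.exp(scores - scores.max())
    return p / p.sum()

print(posterior(0.3))  # [P(class 0 | x), P(class 1 | x)]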
Watch the decision tree grow one level at a time. Each split is chosen to minimize Gini impurity — a measure of how mixed the classes are. Then toggle to Random Forest and see how combining many trees smooths the boundary and reduces overfitting.
Gini impurity: G = 1 - Σ pₖ²
(0 = pure, 0.5 = maximally mixed)
Best split: argmin_{(f,t)} (n_L·G_L + n_R·G_R) / n
Bootstrap: sample n points WITH replacement
→ each tree sees ~63% of data
Aggregation: ŷ = mode { ŷ_tree₁, ŷ_tree₂, ... }
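Here is a minimal NumPy sketch of the Gini computation and the split search described above; the exhaustive threshold scan over a single feature is a simplification of what a full tree implementation does:

import numpy as np

def gini(y):
    # G = 1 - sum(p_k^2): 0 = pure node, 0.5 = maximally mixed (2 classes)
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(x, y):
    # scan candidate thresholds, minimize the weighted child impurity
    best_t, best_score = None, np.inf
    for t in np.unique(x):
        left, right = y[x <= t], y[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score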
The support vectors (highlighted points) are the only ones that define the boundary — remove any other point and nothing changes. The C parameter controls the tradeoff: low C allows some misclassifications to keep the margin wide; high C forces strict separation. Toggle to RBF kernel to see how SVM handles data that no straight line can separate.
Objective: min ½||w||² + C·Σ max(0, 1-yᵢ(w·xᵢ+b))
Margin: M = 2/||w|| ← maximize this
Support vectors: points where yᵢ(w·xᵢ+b) = 1
RBF kernel: K(x,x') = exp(-γ‖x-x'‖²)
maps data to infinite-dim. space
curved boundaries in original space
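A short scikit-learn sketch of the linear-vs-RBF contrast; the concentric-circles dataset and the γ and C values are illustrative assumptions:

from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.4, noise=0.1, random_state=0)

linear = SVC(kernel="linear", C=1.0).fit(X, y)        # no straight line separates this
rbf = SVC(kernel="rbf", gamma=2.0, C=1.0).fit(X, y)   # curved boundary via the kernel

print(linear.score(X, y), rbf.score(X, y))
print("support vectors:", len(rbf.support_))  # the only points defining the boundary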
The red dashed lines are residuals — the vertical distance from each point to the regression line. OLS finds the line that minimizes their sum of squares (RSS). Toggle Show Residuals to see them, and drag any point to watch how the line reacts in real time. Watch R² — it tells you what fraction of variance the model explains: 0 means nothing, 1 means perfect.
Model: ŷ = β₀ + β₁x
OLS solution (closed form):
β₁ = Σ(xᵢ-x̄)(yᵢ-ȳ) / Σ(xᵢ-x̄)²
β₀ = ȳ - β₁·x̄
Loss: RSS = Σ(yᵢ - ŷᵢ)² ← minimize this
Metrics: R² = 1 - RSS/TSS (0=bad, 1=perfect)
TSS = Σ(yᵢ - ȳ)² (total variance)
RMSE = √(RSS/n) (avg error in y units)
R² interpretation:
0.0–0.3 → weak fit
0.3–0.7 → moderate fit
0.7–1.0 → strong fit
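The closed-form solution above translates directly to NumPy; the slope, intercept, and noise level of this fabricated dataset are arbitrary:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 2, 50)  # fabricated linear data with noise

# closed-form OLS, matching the formulas above
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

y_hat = b0 + b1 * x
rss = np.sum((y - y_hat) ** 2)   # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)
print("R^2 =", 1 - rss / tss, "RMSE =", np.sqrt(rss / len(x)))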
Two features, one outcome. The scatter shows X1 vs Y with points colored by X2. Toggle features on/off and watch how R² and the coefficients change. Adjusted R² penalizes adding useless variables — if it drops when you add X2, that feature isn't helping. This is the core of variable selection in Six Sigma's Analyze phase.
Model: ŷ = β₀ + β₁X₁ + β₂X₂ + ... + βₚXₚ
Matrix form: β = (XᵀX)⁻¹Xᵀy
Adj. R² = 1 - (1-R²)·(n-1)/(n-p-1)
penalizes adding irrelevant features
Multicollinearity problem:
if X₁ ≈ X₂ → XᵀX near-singular → unstable β
Corr(X₁,X₂) > 0.8 → use Ridge or Lasso instead
Coefficient interpretation:
β₁ = change in y per unit X₁, holding X₂ constant
β₂ = change in y per unit X₂, holding X₁ constant
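A sketch of the adjusted-R² effect, with a deliberately useless X2; the coefficients and sample size are made up:

import numpy as np

rng = np.random.default_rng(0)
n = 100
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
y = 3.0 + 2.0 * X1 + rng.normal(0, 1, n)  # X2 contributes nothing to y

def fit_report(X, y):
    X = np.column_stack([np.ones(len(y)), X])     # add intercept column
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares solution
    rss = np.sum((y - X @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    r2 = 1 - rss / tss
    p = X.shape[1] - 1                            # number of predictors
    adj = 1 - (1 - r2) * (len(y) - 1) / (len(y) - p - 1)
    return r2, adj

print(fit_report(X1, y))                          # X1 alone
print(fit_report(np.column_stack([X1, X2]), y))   # adding X2: R^2 up, adj R^2 flat or down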
8 features — only 3 are truly relevant. The bars show each coefficient's magnitude. As you increase λ, watch what happens: Lasso drives irrelevant coefficients to exactly zero (feature selection). Ridge shrinks all of them but never reaches zero. ElasticNet does both. The active features count tells you how many survived the penalty.
OLS loss: RSS = Σ(yᵢ - ŷᵢ)²
Lasso: min RSS + λ·Σ|βⱼ| ← L1
→ some βⱼ = 0 exactly (sparse)
Ridge: min RSS + λ·Σβⱼ² ← L2
→ all βⱼ shrink, none → 0
ElasticNet: min RSS + λ·[α·Σ|βⱼ| + (1-α)·Σβⱼ²]
→ α=1: pure Lasso
→ α=0: pure Ridge
→ 0 < α < 1: hybrid
λ effect: λ→0: approaches OLS (no penalty)
λ→∞: all coefficients → 0
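In scikit-learn terms (where the penalty weight λ is called alpha and the mixing α is called l1_ratio), here is a sketch mirroring the demo's 8-features-3-relevant setup; the true coefficients and the λ value are illustrative:

import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
beta_true = np.array([3.0, -2.0, 1.5, 0, 0, 0, 0, 0])  # only 3 relevant features
y = X @ beta_true + rng.normal(0, 0.5, 200)

lam = 0.5  # sklearn's alpha plays the role of lambda in the panel above
for model in (Lasso(alpha=lam), Ridge(alpha=lam),
              ElasticNet(alpha=lam, l1_ratio=0.5)):
    model.fit(X, y)
    active = np.sum(model.coef_ != 0)  # how many coefficients survived the penalty
    print(type(model).__name__, "active features:", active)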
Each bar is one boosting round. The gold bars show XGBoost error dropping with each tree added — it learns from its own mistakes. The gray line is Random Forest at the same number of trees. Watch how boosting converges faster and lower by focusing on hard examples.
Boosting update:
F_m(x) = F_{m-1}(x) + η · h_m(x)
Where h_m fits the residuals:
rᵢ = yᵢ - F_{m-1}(xᵢ) ← what was wrong
Objective (XGBoost):
L = Σ loss(yᵢ, ŷᵢ) + Σ Ω(fₖ)
Ω(f) = γT + ½λ‖w‖² ← regularization
η (learning rate): smaller = slower convergence, better generalization
Trees: more = less bias, higher risk of overfitting
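This residual-fitting loop is the core of gradient boosting; the sketch below uses plain scikit-learn trees rather than XGBoost itself, and the dataset, depth, and η are illustrative:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)  # fabricated regression target

eta, trees = 0.1, []
F = np.full_like(y, y.mean())       # F_0: constant prediction
for m in range(100):
    r = y - F                       # residuals: what the ensemble got wrong
    h = DecisionTreeRegressor(max_depth=2).fit(X, r)  # h_m fits the residuals
    F = F + eta * h.predict(X)      # F_m = F_{m-1} + eta * h_m
    trees.append(h)

print("train MSE:", np.mean((y - F) ** 2))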
Solid points are labeled — the model knows their class. Hollow points are unlabeled. Each iteration, the model assigns pseudo-labels to the unlabeled points it's most confident about, then retrains. Watch the boundary sharpen as more points get labeled.
Algorithm:
1. Train f on labeled set L
2. For each x in unlabeled set U:
if max P(y|x) ≥ threshold:
add (x, argmax P(y|x)) to L
3. Remove added points from U
4. Repeat until U is empty or no
confident predictions remain
Risk: if initial model is wrong,
errors propagate through iterations
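A direct translation of that loop, sketched with scikit-learn logistic regression; the dataset, the 20-point labeled seed, and the 0.95 confidence threshold are assumptions:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

X, y = make_blobs(n_samples=300, centers=2, random_state=0)
rng = np.random.default_rng(0)
labeled = rng.choice(len(X), size=20, replace=False)  # only 20 labeled points
L_X, L_y = X[labeled], y[labeled]
U = np.delete(np.arange(len(X)), labeled)             # indices of unlabeled pool

threshold = 0.95
while len(U) > 0:
    clf = LogisticRegression().fit(L_X, L_y)          # 1. train on labeled set
    proba = clf.predict_proba(X[U])
    confident = proba.max(axis=1) >= threshold        # 2. confident pseudo-labels only
    if not confident.any():
        break                                         # 4. stop: nothing confident left
    L_X = np.vstack([L_X, X[U[confident]]])
    L_y = np.concatenate([L_y, proba[confident].argmax(axis=1)])
    U = U[~confident]                                 # 3. remove from unlabeled pool

print("remaining unlabeled:", len(U))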
Switch between K-Means and DBSCAN on the same dataset. K-Means forces every point into a cluster — even outliers. DBSCAN identifies true cluster shapes and marks outliers as noise. Use the dataset toggle to see where each algorithm struggles.
K-Means objective:
min Σₖ Σ_{x∈Cₖ} ||x - μₖ||²
μₖ = centroid of cluster k
DBSCAN definitions:
Core point: ≥ minPts in ε-neighborhood
Border point: in ε-neighborhood of core
Noise point: neither core nor border
ε (epsilon): radius of neighborhood
minPts: minimum points to form a cluster
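A side-by-side sketch in scikit-learn; the two-moons dataset and the ε and minPts values are illustrative choices:

from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.08, random_state=0)  # two curved clusters

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

print("k-means labels:", set(km.labels_))  # every point forced into a cluster
print("dbscan labels:", set(db.labels_))   # -1 marks noise points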
The blue arrow is PC1 — the direction that captures the most variance in the data. The gray arrow is PC2 — orthogonal to PC1. Drag the correlation slider to change the data shape and watch how the principal components rotate to always point along maximum variance.
Steps:
1. Center data: X = X - mean(X)
2. Covariance matrix: C = XᵀX / (n-1)
3. Eigendecomposition: C = VΛVᵀ
4. Sort eigenvectors by eigenvalue
5. Project: Z = X · V_k
PC1 = eigenvector with largest eigenvalue
= direction of maximum variance
Explained variance ratio:
EVR_k = λₖ / Σλᵢ
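The five steps map one-to-one onto NumPy; the correlated 2-D Gaussian below is a stand-in for the demo's data:

import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[3, 2], [2, 2]], size=500)  # correlated 2-D data

X = X - X.mean(axis=0)                 # 1. center
C = X.T @ X / (len(X) - 1)             # 2. covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)   # 3. eigendecomposition (ascending order)
order = np.argsort(eigvals)[::-1]      # 4. sort by eigenvalue, descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

Z = X @ eigvecs[:, :1]                 # 5. project onto PC1
print("explained variance ratio:", eigvals / eigvals.sum())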
Same dataset, two detection methods. Z-Score flags points beyond ±Nσ from the mean — simple and fast but assumes Gaussian data. Isolation Forest builds random trees and isolates anomalies — works on any distribution and catches multivariate outliers Z-Score misses.
Z-Score: z = (x - μ) / σ
flag if |z| > threshold (typically 3)
assumes N(μ, σ²) distribution
Isolation Forest: score(x) = 2^(-E[h(x)] / c(n))
h(x) = path length to isolate x
c(n) = avg path length for n points
anomaly if score → 1 (short path)
normal if score → 0 (long path)
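A sketch of both detectors; the inlier/outlier mix is fabricated, and the forest here is scikit-learn's IsolationForest with default settings:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)),
               rng.uniform(-6, 6, (10, 2))])  # 200 inliers + 10 scattered outliers

# z-score per feature: flag anything beyond 3 sigma on either axis
z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
zscore_flags = (z > 3).any(axis=1)

iso = IsolationForest(random_state=0).fit(X)
iso_flags = iso.predict(X) == -1              # -1 = anomaly in sklearn's convention

print("z-score flagged:", zscore_flags.sum(),
      "| isolation forest flagged:", iso_flags.sum())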
The orange agent starts at top-left. The green cell is the goal (+10 reward). Red cells are penalties (-5). Watch the ε parameter — it starts at 1.0 (fully random exploration) and decays as the agent learns. The arrows show what the Q-table has learned: the best action from each cell.
Bellman equation (Q-update):
Q(s,a) ← Q(s,a) + α[r + γ·max Q(s',a') - Q(s,a)]
Where:
α = learning rate (how fast to update)
γ = discount factor (value of future rewards)
r = immediate reward
s' = next state after action a
ε-greedy policy:
with prob ε: explore (random action)
with prob 1-ε: exploit (best Q action)
ε decays over time: ε = ε · decay_rate
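A tabular Q-learning sketch on a hypothetical 4×4 grid; the goal and trap positions, the small step penalty, and the hyperparameters are assumptions chosen to echo the demo:

import numpy as np

# hypothetical layout: start at (0,0), goal at (3,3) = +10, trap at (1,2) = -5
SIZE, GOAL, TRAP = 4, (3, 3), (1, 2)
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

Q = np.zeros((SIZE, SIZE, 4))
alpha, gamma, eps, decay = 0.1, 0.9, 1.0, 0.995
rng = np.random.default_rng(0)

for episode in range(2000):
    s = (0, 0)
    while s != GOAL:
        # epsilon-greedy: explore with prob eps, otherwise take the best known action
        a = rng.integers(4) if rng.random() < eps else int(np.argmax(Q[s]))
        r_, c_ = s[0] + MOVES[a][0], s[1] + MOVES[a][1]
        s2 = (min(max(r_, 0), SIZE - 1), min(max(c_, 0), SIZE - 1))  # stay on grid
        r = 10 if s2 == GOAL else (-5 if s2 == TRAP else -0.1)
        # Q-update: Q(s,a) += alpha * [r + gamma * max Q(s',.) - Q(s,a)]
        Q[s][a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s][a])
        s = s2
    eps *= decay  # decay exploration over time

print("best action from each cell:\n", np.argmax(Q, axis=2))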
Everyone is talking about AI. Most treat it like an oracle. Here's what it actually is: Code. Statistics. Data. Things we've had for decades — we just weren't asking the right questions.
I've spent several years in Operational Excellence roles — Lean, Six Sigma, Continuous Improvement on the floor. Plants full of data. Machines sending signals. People making reactive decisions. So I decided to build the bridge.
Projects. Real use cases. Open code. Not a course. Not a pitch. Just the honest path — including the mistakes — so others can go further without hitting the same walls.
If you work in operations, maintenance, quality, or CI — let's connect, collaborate, and learn together. More projects are on the way.