Supervised Learning

Interactive ML Models

24 models explained through interactive visualizations. Real algorithms running in your browser — geometry, intuition, and math.

Before we compare the models

Understand the building block first

Think of it like...
A volume knob that only goes from 0 to 1. No matter how extreme the input signal is — very negative or very positive — the output always stays between 0 (definitely not) and 1 (definitely yes). That's the sigmoid.
σ(z) = 1 / (1 + e⁻ᶻ)
z = w·x + b
[Live readout — z: 0.00 · σ(z) = P(y=1): 0.500 · Decision: Uncertain]
Demo 00 — Logistic Regression

From a Line to a Probability

A linear model produces z = w·x + b — any number from -∞ to +∞. That's useless as a probability. The sigmoid squashes it into (0, 1). Drag the slider to move your input point. Watch how the curve compresses extreme values toward 0 and 1 — and how the decision flips when σ(z) crosses 0.5.

Input z (linear score) 0.0
Decision threshold 0.50
The math behind it
Sigmoid:    σ(z) = 1 / (1 + e^-z)

Properties: σ(0)   = 0.5   (uncertain)
            σ(+∞)  → 1.0   (class 1)
            σ(-∞)  → 0.0   (class 0)
            σ'(z)  = σ(z)·(1 - σ(z))  ← used in backprop

Decision:   ŷ = 1  if σ(z) ≥ threshold
            ŷ = 0  otherwise
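
A minimal NumPy sketch of the same computation — the z values are illustrative, not the demo's actual data:

import numpy as np

def sigmoid(z):
    # squashes any real-valued score into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])   # linear scores w·x + b
p = sigmoid(z)                               # P(y=1) for each score
y_hat = (p >= 0.5).astype(int)               # decision at threshold 0.5

print(p.round(4))   # [0.0067 0.2689 0.5    0.7311 0.9933]
print(y_hat)        # [0 0 1 1 1]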
Block 01 — Supervised Classification

Linear & Probabilistic Models

01
Ch. 01
Logistic Regression
Applies a sigmoid function to a linear combination of features to estimate class probabilities. The decision boundary is a hyperplane — a straight line in 2D. Simple, fast, and interpretable.
Supervised · Classifier · Parametric · Linear · Probabilistic
✓ Use when
Classes are linearly separable. You need probability estimates. Starting baseline.
✗ Avoid when
Boundary is non-linear. Features are highly correlated.
→ Demo 01
02
Ch. 02
Naive Bayes
Uses Bayes' theorem assuming features are conditionally independent given the class. Computes Gaussian likelihoods per class and picks the most probable one. Surprisingly powerful for its simplicity.
Supervised · Classifier · Probabilistic · Generative · Fast
✓ Use when
Small datasets. Real-time predictions. Text classification.
✗ Avoid when
Features are correlated. Independence assumption is clearly wrong.
→ Demo 01
03
Ch. 03
K-Nearest Neighbors
Stores all training points and classifies by majority vote among the K closest neighbors. No training phase — the model IS the data. Boundary adapts entirely to local density.
Supervised · Classifier · Non-parametric · Instance-based · Lazy
✓ Use when
Non-linear boundaries. Small/medium datasets. Low-dimensional data.
✗ Avoid when
Large datasets (slow predictions). High-dimensional data (curse of dimensionality).
→ Demo 01
Think of it like...
Three different artists drawing the border between two countries — each follows a different rule, but they're all looking at the same map.
[Widget — Logistic Regression: Model · live Accuracy · Boundary: Linear]
Demo 01 — Supervised Classification

How Classifiers Draw Boundaries

Switch between models and watch how the decision boundary changes over the same dataset. Logistic draws a straight line. Naive Bayes draws elliptical contours. KNN bends to fit every cluster — adjust K to control how smooth or jagged it gets.

The math behind it
Logistic:   σ(w·x + b) = P(y=1|x)
            where σ(z) = 1 / (1 + e^-z)

Naive Bayes: P(C|x) ∝ P(C) · ∏ᵢ P(xᵢ|C)
             P(xᵢ|C) ~ N(μᵢ꜀, σᵢ꜀²)

KNN:         ŷ = mode { yᵢ : i ∈ KNN(x, K) }
             distance = √(Σ (xᵢ-xⱼ)²)
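
A minimal scikit-learn sketch of the same comparison — the synthetic dataset and hyperparameters are assumptions for illustration, not the demo's actual code:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Two 2-D clusters, one per class
X, y = make_blobs(n_samples=200, centers=2, random_state=0)

models = {
    "Logistic":    LogisticRegression(),                  # linear boundary
    "Naive Bayes": GaussianNB(),                          # elliptical contours
    "KNN (K=5)":   KNeighborsClassifier(n_neighbors=5),   # local, jagged
}
for name, model in models.items():
    acc = model.fit(X, y).score(X, y)
    print(f"{name:12s} train accuracy = {acc:.2f}")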
Think of it like...
Each class leaves a fingerprint on every feature — a bell curve showing where that class typically lives. When you see a new point, you check: which class's fingerprint fits it better? The one with the taller bell at that spot wins.
[Widget — Gaussian Likelihood, Feature 1: x value · P(C0 | x) · P(C1 | x) · Predicted class]
Demo 01b — Naive Bayes

Why the Gaussians?

Naive Bayes fits a Gaussian (bell curve) to each feature for each class. The blue bell is Class 0, the orange bell is Class 1. Drag the slider to move your test point — the bars show the posterior probability for each class in real time. The overlap zone is where the model is uncertain.

Test point x 0.0
Class 0: 50% · Class 1: 50%
The math behind it
Gaussian likelihood:
  P(x | C) = (1/√(2πσ²)) · exp(-(x-μ)²/(2σ²))

Posterior (Bayes rule):
  P(C | x) ∝ P(C) · P(x | C)

"Naive" assumption:
  P(x₁,x₂,...,xₙ | C) = ∏ᵢ P(xᵢ | C)
  ← features are independent given the class

Decision:
  ŷ = argmax_C  log P(C) + Σᵢ log P(xᵢ|C)
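
A short sketch of that posterior computation, with made-up Gaussian parameters for the two classes:

import numpy as np
from scipy.stats import norm

# Class-conditional Gaussians per feature (values assumed for illustration)
mu  = {0: 2.0, 1: 5.0}     # means for Class 0 / Class 1
sig = {0: 1.0, 1: 1.2}     # standard deviations
prior = {0: 0.5, 1: 0.5}

def posterior(x):
    # P(C|x) ∝ P(C) · P(x|C), normalized over the two classes
    joint = {c: prior[c] * norm.pdf(x, mu[c], sig[c]) for c in (0, 1)}
    total = sum(joint.values())
    return {c: round(joint[c] / total, 3) for c in (0, 1)}

print(posterior(2.0))   # Class 0 dominates
print(posterior(3.5))   # overlap zone — close to 50/50
print(posterior(5.0))   # Class 1 dominates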
Block 02 — Supervised Classification

Kernel & Tree-Based Models

04
Ch. 04
Support Vector Machine
Finds the hyperplane that maximizes the margin between classes. Only the "support vectors" — the closest points to the boundary — define the model. Kernel functions allow non-linear separation by lifting data to higher dimensions.
Supervised · Classifier · Max-Margin · Kernel · High-dimensional
✓ Use when
High-dimensional data. Clear margin between classes. Small-medium datasets.
✗ Avoid when
Very large datasets. Need probability outputs. Noisy data with heavy overlap.
→ Demo 03
05
Ch. 05
Decision Tree
Recursively splits the feature space with axis-aligned cuts, choosing each split to minimize Gini impurity. The result is a set of human-readable if/then rules. You can trace every single prediction.
Supervised · Classifier · Interpretable · Rule-based · Non-parametric
✓ Use when
Rules matter. Mixed data types. Explainability is required by stakeholders.
✗ Avoid when
Data is noisy. You need stable predictions — single trees overfit easily.
→ Demo 02
06
Ch. 06
Random Forest
Builds many decision trees on bootstrap samples using random feature subsets. Each tree votes, majority wins. The variance drops dramatically compared to a single tree — without significantly increasing bias.
Supervised · Classifier · Ensemble · Bagging · Robust
✓ Use when
Accuracy is the priority. Noisy data. Large feature sets. Robust baseline.
✗ Avoid when
Need fast real-time predictions. Full interpretability is required.
→ Demo 02
Think of it like...
A game of 20 questions — the computer learns which yes/no question cuts the most confusion in half at each step. Then it invites a hundred friends to play, and takes the majority answer.
[Widget — Decision Tree: Depth 0 · Leaves 1 · Gini · Mode: Tree]
Demo 02 — Tree-Based Models

The Tree That Learns to Split

Watch the decision tree grow one level at a time. Each split is chosen to minimize Gini impurity — a measure of how mixed the classes are. Then toggle to Random Forest and see how combining many trees smooths the boundary and reduces overfitting.

Max depth 1
The math behind it
Gini impurity:  G = 1 - Σ pₖ²
                (0 = pure, 0.5 = maximally mixed)

Best split:     argmin over (f,t):  (n_L·G_L + n_R·G_R) / n

Bootstrap:     sample n points WITH replacement
               → each tree sees ~63% of data

Aggregation:   ŷ = mode { ŷ_tree₁, ŷ_tree₂, ... }
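
Both halves fit in a few lines of Python — Gini by hand, then a single tree versus a forest in scikit-learn (the dataset and depths are illustrative):

import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

def gini(p):
    # G = 1 - Σ pₖ²; 0 = pure leaf, 0.5 = 50/50 mix (two classes)
    return 1.0 - np.sum(np.asarray(p) ** 2)

print(gini([1.0, 0.0]))   # 0.0 — pure
print(gini([0.5, 0.5]))   # 0.5 — maximally mixed

X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
tree   = DecisionTreeClassifier(random_state=0).fit(X, y)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
# A single deep tree memorizes noise; bagging many trees smooths the boundary
print("tree leaves:", tree.get_n_leaves())
print("forest train accuracy:", forest.score(X, y))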
Think of it like...
Finding the widest road that separates two neighborhoods — not just any dividing line, but the one with the most breathing room on both sides. The houses right on the edge of that road are the support vectors.
[Widget — SVM: Kernel: Linear · Support Vecs · C (slack): 1.0]
Demo 03 — Support Vector Machine

The Margin That Separates Everything

The support vectors (highlighted points) are the only ones that define the boundary — remove any other point and nothing changes. The C parameter controls the tradeoff: low C allows some misclassifications to keep the margin wide; high C forces strict separation. Toggle to RBF kernel to see how SVM handles data that no straight line can separate.

Regularization C 1.0
The math behind it
Objective:  min  ½||w||² + C·Σ max(0, 1-yᵢ(w·xᵢ+b))

Margin:     M = 2/||w||   ← maximize this

Support vectors: points where yᵢ(w·xᵢ+b) = 1

RBF kernel: K(x,x') = exp(-γ‖x-x'‖²)
            maps data to infinite-dim. space
            curved boundaries in original space
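
A hedged sketch of the kernel effect in scikit-learn — same data, linear versus RBF; the dataset and C value are assumptions:

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles — no straight line can separate these classes
X, y = make_circles(n_samples=200, factor=0.4, noise=0.1, random_state=0)

for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel, C=1.0).fit(X, y)
    print(f"{kernel:6s} accuracy={clf.score(X, y):.2f} "
          f"support vectors={clf.n_support_.sum()}")
# linear fails (≈ chance); rbf separates the rings with a curved boundary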
Block 03 — Supervised Regression

Predicting Continuous Values

07
Ch. 07
Linear Regression
Fits a straight line through data by minimizing the sum of squared residuals (RSS). Every prediction is a weighted combination of inputs plus a bias term. The foundation of all regression.
Supervised · Regression · Parametric · Linear · Interpretable
✓ Use when
Linear relationship exists. Need interpretable coefficients. Baseline for any regression task.
✗ Avoid when
Non-linear patterns. Many correlated features. Outliers dominate the data.
→ Demo 04
08
Ch. 08
Multiple Regression
Extends linear regression to p features. The model fits a hyperplane in p+1 dimensions. Each coefficient β tells you the expected change in y per unit increase in that feature, holding all others constant.
Supervised · Regression · Parametric · Multivariate · OLS
✓ Use when
Multiple drivers explain the outcome. Need to isolate each variable's impact.
✗ Avoid when
Features are multicollinear. More features than observations (p > n).
→ Demo 05
09
Ch. 09
Lasso Regression
Adds an L1 penalty (λ·Σ|βⱼ|) to OLS. This forces some coefficients to exactly zero — performing automatic feature selection. The higher λ, the more features are eliminated from the model entirely.
Supervised · Regression · Regularized · L1 Penalty · Feature Selection
✓ Use when
Many features, few relevant. Need automatic feature selection. Sparse solutions preferred.
✗ Avoid when
All features are relevant. Correlated features (picks one randomly, drops others).
→ Demo 06
10
Ch. 10
Ridge Regression
Adds an L2 penalty (λ·Σβⱼ²) to OLS. Unlike Lasso, Ridge shrinks all coefficients toward zero but never to exactly zero — keeping all features in the model with reduced magnitude.
Supervised · Regression · Regularized · L2 Penalty · Stable
✓ Use when
Multicollinearity is a problem. All features are relevant. Stable, continuous shrinkage needed.
✗ Avoid when
You need feature selection (Ridge never zeroes out coefficients).
→ Demo 06
11
Ch. 11
ElasticNet
Combines L1 and L2 penalties with a mixing ratio α. When α=1 it becomes pure Lasso; when α=0 it becomes pure Ridge. Captures both benefits: sparse solutions and stability with correlated features.
Supervised · Regression · Regularized · L1+L2 · Hybrid
✓ Use when
Correlated features exist AND some are irrelevant. Best of both worlds.
✗ Avoid when
You need pure interpretability — two hyperparameters (λ, α) complicate explanation.
→ Demo 06
Coming soon
More models on the way
This page is updated continuously as new projects are published. Follow on GitHub to stay current.
→ github.com/LozanoLsa
Think of it like...
Stretching a rubber band between two poles and letting it settle where it pulls equally from all data points. The line finds the position that makes everyone equally unhappy — minimizing the total tension.
[Widget — Linear Regression (OLS): RMSE · Slope β₁ · Intercept β₀]
Demo 04 — Linear Regression

The Line That Minimizes Error

The red dashed lines are residuals — the vertical distance from each point to the regression line. OLS finds the line that minimizes their sum of squares (RSS). Toggle Show Residuals to see them, and drag any point to watch how the line reacts in real time. Watch R² — it tells you what fraction of variance the model explains: 0 means none, 1 means perfect.

Noise level low
The math behind it
Model:    ŷ = β₀ + β₁x

OLS solution (closed form):
  β₁ = Σ(xᵢ-x̄)(yᵢ-ȳ) / Σ(xᵢ-x̄)²
  β₀ = ȳ - β₁·x̄

Loss:     RSS = Σ(yᵢ - ŷᵢ)²   ← minimize this

Metrics:
  R²   = 1 - RSS/TSS    (0=bad, 1=perfect)
  TSS  = Σ(yᵢ - ȳ)²    (total variance)
  RMSE = √(RSS/n)       (avg error in y units)

R² interpretation:
  0.0–0.3  → weak fit
  0.3–0.7  → moderate fit
  0.7–1.0  → strong fit
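
The closed-form solution is a few lines of NumPy — a sketch on synthetic data with a made-up slope, intercept, and noise level:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 2.0, 50)   # true slope 2, intercept 1

# Closed-form OLS, exactly as above
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

y_hat = b0 + b1 * x
rss = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)
print(f"β₁={b1:.2f}  β₀={b0:.2f}  "
      f"R²={1 - rss/tss:.3f}  RMSE={np.sqrt(rss/len(x)):.2f}")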
Think of it like...
Machine downtime isn't caused by just one thing — temperature, speed, load, and vibration all play a role. Multiple regression isolates how much each driver contributes independently, holding the others constant.
[Widget — Multiple Regression (2 features): Adj. R² · β₁ (X1) · β₂ (X2)]
Demo 05 — Multiple Regression

When More Variables Change Everything

Two features, one outcome. The scatter shows X1 vs Y with points colored by X2. Toggle features on/off and watch how R² and the coefficients change. Adjusted R² penalizes adding useless variables — if it drops when you add X2, that feature isn't helping. This is the core of variable selection in Six Sigma MSA.

Correlation X1↔X2 low
The math behind it
Model:   ŷ = β₀ + β₁X₁ + β₂X₂ + ... + βₚXₚ

Matrix form:   β = (XᵀX)⁻¹Xᵀy

Adj. R²  = 1 - (1-R²)·(n-1)/(n-p-1)
           penalizes adding irrelevant features

Multicollinearity problem:
  if X₁ ≈ X₂ → XᵀX near-singular → unstable β
  Corr(X₁,X₂) > 0.8 → use Ridge or Lasso instead

Coefficient interpretation:
  β₁ = change in y per unit X₁, holding X₂ constant
  β₂ = change in y per unit X₂, holding X₁ constant
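
A minimal NumPy sketch of the matrix form and adjusted R², on synthetic data with assumed coefficients:

import numpy as np

rng = np.random.default_rng(1)
n = 100
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
y  = 3.0 * X1 + 1.5 * X2 + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), X1, X2])   # design matrix with intercept
beta = np.linalg.solve(X.T @ X, X.T @ y)    # (XᵀX)⁻¹Xᵀy, solved stably

y_hat = X @ beta
r2 = 1 - np.sum((y - y_hat)**2) / np.sum((y - y.mean())**2)
p = 2
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print("β:", beta.round(2), " adj R²:", round(adj_r2, 3))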
Think of it like...
You have 20 process variables but suspect only 4 actually drive defects. Regularization adds a tax on complexity — forcing the model to justify every variable it keeps. Lasso eliminates them entirely. Ridge just limits their influence. ElasticNet does both.
[Widget — Method: Lasso · λ: 0.10 · Active features: live]
Demo 06 — Lasso · Ridge · ElasticNet

The Penalty That Shapes the Model

8 features — only 3 are truly relevant. The bars show each coefficient's magnitude. As you increase λ, watch what happens: Lasso drives irrelevant coefficients to exactly zero (feature selection). Ridge shrinks all of them but never reaches zero. ElasticNet does both. The active features count tells you how many survived the penalty.

Regularization λ 0.10
The math behind it
OLS loss:     RSS = Σ(yᵢ - ŷᵢ)²

Lasso:   min RSS + λ·Σ|βⱼ|       ← L1
         → some βⱼ = 0 exactly (sparse)

Ridge:   min RSS + λ·Σβⱼ²        ← L2
         → all βⱼ shrink, none → 0

ElasticNet: min RSS + λ·[α·Σ|βⱼ| + (1-α)·Σβⱼ²]
         → α=1: pure Lasso
         → α=0: pure Ridge
         → 0 < α < 1: hybrid

λ effect:  λ→0: approaches OLS (no penalty)
           λ→∞: all coefficients → 0
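
A quick scikit-learn sketch of the three penalties on one synthetic problem. Note the naming: scikit-learn calls λ "alpha" and the mixing ratio α "l1_ratio", and its ElasticNet scales the two terms slightly differently than the formula above:

import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 8))
true_beta = np.array([3.0, -2.0, 1.5, 0, 0, 0, 0, 0])   # only 3 relevant
y = X @ true_beta + rng.normal(scale=0.5, size=200)

for name, model in [("Lasso",      Lasso(alpha=0.1)),
                    ("Ridge",      Ridge(alpha=0.1)),
                    ("ElasticNet", ElasticNet(alpha=0.1, l1_ratio=0.5))]:
    coef = model.fit(X, y).coef_
    active = np.sum(np.abs(coef) > 1e-6)
    print(f"{name:10s} active={active}  coef={coef.round(2)}")
# Lasso zeroes the 5 irrelevant features; Ridge only shrinks them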
Block 04 — Supervised · Boosting

Gradient Boosting

12
Ch. 12
XGBoost
Builds trees sequentially — each one correcting the errors of the previous. Uses gradient descent in function space to minimize loss. One of the most powerful and widely-used models in industry and competitions.
Supervised · Ensemble · Boosting · Gradient Descent · High Performance
✓ Use when
Tabular data. Competition-level accuracy needed. Mixed feature types.
✗ Avoid when
Need full interpretability. Very small datasets. Image or text data.
→ Demo 07
Coming soon
More boosting variants
LightGBM, CatBoost and other gradient boosting frameworks will be added as projects are completed.
Think of it like...
A team of students taking an exam in sequence. The first student answers. The second only focuses on what the first got wrong. The third fixes what the second missed. Each one specializes in the previous team's mistakes — together they get nearly everything right.
[Widget — XGBoost boosting steps: Trees built 0 · Train error · vs Random Forest]
Demo 07 — XGBoost

How Boosting Corrects Itself

Each bar is one boosting round. The gold bars show XGBoost error dropping with each tree added — it learns from its own mistakes. The gray line is Random Forest at the same number of trees. Watch how boosting converges faster and lower by focusing on hard examples.

Learning rate η 0.30
The math behind it
Boosting update:
  F_m(x) = F_{m-1}(x) + η · h_m(x)

Where h_m fits the residuals:
  rᵢ = yᵢ - F_{m-1}(xᵢ)   ← what was wrong

Objective (XGBoost):
  L = Σ loss(yᵢ, ŷᵢ) + Σ Ω(fₖ)
  Ω(f) = γT + ½λ‖w‖²  ← regularization

η (learning rate): smaller = slower convergence, better generalization
Trees: more = less bias, risk of overfitting
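
This sketch is not XGBoost itself — it omits the Ω regularizer and second-order gradients — but the core boosting loop with squared loss fits in a few lines:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

eta, n_rounds = 0.3, 50
F = np.full(len(y), y.mean())          # F_0: constant prediction
for m in range(n_rounds):
    r = y - F                          # residuals: what is still wrong
    h = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, r)
    F = F + eta * h.predict(X)         # F_m = F_{m-1} + η·h_m

rmse = np.sqrt(np.mean((y - F) ** 2))
print(f"train RMSE after {n_rounds} rounds: {rmse:.2f}")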
Block 05 — Semi-Supervised Learning

Learning with Few Labels

13
Ch. 13
Self Training
Trains on labeled data, then predicts labels for unlabeled data with high confidence and adds them to training. Iterates until no more confident predictions remain. The model teaches itself.
Semi-Supervised · Iterative · Label Propagation · Few Labels
✓ Use when
Few labeled examples, many unlabeled. Labeling is expensive or slow.
✗ Avoid when
Initial model is weak — wrong labels amplify errors.
→ Demo 08
14
Ch. 14
Co-Training
Uses two independent models trained on different feature views. Each model labels data for the other — they teach each other. Requires the two views to be conditionally independent given the class.
Semi-Supervised · Multi-View · Collaborative · Feature Split
✓ Use when
Data has two natural independent views (e.g., sensor A and sensor B).
✗ Avoid when
Feature views are correlated — the independence assumption breaks down.
→ Demo 08
Think of it like...
You label 10 items in a factory. A model learns from those 10. It then looks at 200 unlabeled items and says "I'm 95% sure this one is defective." You trust it, add that label, retrain. Repeat. Now you effectively have hundreds of labeled examples — from just 10.
[Widget — Self Training: Iteration 0 · Labeled 10 · Unlabeled 90 · Accuracy]
Demo 08 — Semi-Supervised

The Model That Labels Itself

Solid points are labeled — the model knows their class. Hollow points are unlabeled. Each iteration, the model assigns pseudo-labels to the unlabeled points it's most confident about, then retrains. Watch the boundary sharpen as more points get labeled.

Confidence threshold 0.80
The math behind it
Algorithm:
  1. Train f on labeled set L
  2. For each x in unlabeled set U:
       if max P(y|x) ≥ threshold:
         add (x, argmax P(y|x)) to L
  3. Remove added points from U
  4. Repeat until U is empty or no
     confident predictions remain

Risk: if initial model is wrong,
errors propagate through iterations
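
A minimal sketch of that algorithm, assuming a logistic base model and ten initial labels:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
labeled = np.zeros(len(y), dtype=bool)
labeled[np.where(y == 0)[0][:5]] = True   # start with 5 labels per class
labeled[np.where(y == 1)[0][:5]] = True
pseudo_y = np.where(labeled, y, -1)       # -1 = unknown

threshold = 0.80
for it in range(20):
    clf = LogisticRegression().fit(X[labeled], pseudo_y[labeled])
    if labeled.all():
        break
    proba = clf.predict_proba(X[~labeled])
    confident = proba.max(axis=1) >= threshold
    if not confident.any():
        break                              # no confident predictions remain
    idx = np.where(~labeled)[0][confident]
    pseudo_y[idx] = clf.classes_[proba[confident].argmax(axis=1)]
    labeled[idx] = True                    # trust the model, retrain
    print(f"iteration {it}: labeled = {labeled.sum()} / {len(y)}")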
Block 06 — Unsupervised Learning

Finding Structure Without Labels

15
Ch. 15
K-Means
Groups data into K clusters by minimizing intra-cluster distance. Assigns each point to the nearest centroid, then recalculates centroids. Repeats until convergence. Simple, fast, and widely used.
Unsupervised · Clustering · Centroid-based · Euclidean
✓ Use when
Clusters are roughly spherical and similar size. K is known or estimable.
✗ Avoid when
Clusters have irregular shapes. Outliers present. K unknown.
→ Demo 09
16
Ch. 16
DBSCAN
Density-Based Spatial Clustering. Groups points that are closely packed together and marks outliers as noise. Finds clusters of arbitrary shape — no need to specify K. Points are core, border, or noise.
Unsupervised · Clustering · Density-based · Outlier Detection
✓ Use when
Arbitrary cluster shapes. Outliers present. K unknown. Dense regions visible.
✗ Avoid when
Varying density clusters. High-dimensional data (distance loses meaning).
→ Demo 09
17
Ch. 17
PCA
Principal Component Analysis finds the directions of maximum variance in data and projects it onto fewer dimensions. Each principal component is orthogonal to the others. Reduces noise and reveals hidden structure.
Unsupervised · Dimensionality Reduction · Linear · Variance
✓ Use when
Too many features. Features are correlated. Visualization of high-dim data needed.
✗ Avoid when
Non-linear structure in data. Interpretability of original features required.
→ Demo 10
18
Ch. 18
ICA
Independent Component Analysis separates a multivariate signal into additive, statistically independent components. Unlike PCA (which finds uncorrelated components), ICA finds truly independent sources. Classic use: separating mixed audio signals.
Unsupervised · Dimensionality Reduction · Source Separation · Non-Gaussian
✓ Use when
Multiple independent signals are mixed together. Non-Gaussian sources expected.
✗ Avoid when
Sources are Gaussian — ICA cannot separate them. Small sample sizes.
→ Demo 10
19
Ch. 19
Apriori
Discovers frequent itemsets and association rules from transactional data. Uses support, confidence, and lift to find "if A then B" patterns. The algorithm behind market basket analysis and recommendation systems.
Unsupervised · Association Rules · Frequent Patterns · Transactions
✓ Use when
Transactional data. Finding co-occurrence patterns. Recommendation engines.
✗ Avoid when
Continuous data. Very large item sets (exponential search space).
→ Demo 09
Think of it like...
K-Means draws circles around groups — it works perfectly when groups are round and similar in size. DBSCAN finds dense neighborhoods — it works on any shape and marks lonely points as outliers. Same data, completely different logic.
[Widget — K-Means (K=3): Algorithm · Clusters 3 · Noise points · Inertia]
Demo 09 — Clustering

K-Means vs DBSCAN

Switch between K-Means and DBSCAN on the same dataset. K-Means forces every point into a cluster — even outliers. DBSCAN identifies true cluster shapes and marks outliers as noise. Use the dataset toggle to see where each algorithm struggles.

K clusters 3
The math behind it
K-Means objective:
  min Σₖ Σ_{x∈Cₖ} ||x - μₖ||²
  μₖ = centroid of cluster k

DBSCAN definitions:
  Core point: ≥ minPts in ε-neighborhood
  Border point: in ε-neighborhood of core
  Noise point: neither core nor border

ε (epsilon): radius of neighborhood
minPts: minimum points to form a cluster
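
A quick sketch of both algorithms on non-spherical data; eps and min_samples here are illustrative choices, not tuned values:

import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN

# Two interleaved half-moons — non-spherical clusters
X, _ = make_moons(n_samples=300, noise=0.08, random_state=0)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

print("K-Means inertia:", round(km.inertia_, 1))        # every point assigned
print("DBSCAN clusters:", len(set(db.labels_) - {-1}))  # -1 = noise label
print("DBSCAN noise points:", int(np.sum(db.labels_ == -1)))
# K-Means cuts the moons in half; DBSCAN follows their shape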
Think of it like...
A 3D object casts a 2D shadow on the wall — information is lost, but the essential shape is preserved. PCA finds the angle that preserves the most information when projecting data to fewer dimensions.
[Widget — PCA: Original dims 2 · PC1 variance · PC2 variance]
Demo 10 — PCA · ICA

Finding the Direction of Most Variance

The blue arrow is PC1 — the direction that captures the most variance in the data. The gray arrow is PC2 — orthogonal to PC1. Drag the correlation slider to change the data shape and watch how the principal components rotate to always point along maximum variance.

Feature correlation 0.80
Show projection onto PC1 ON
The math behind it
Steps:
  1. Center data: X = X - mean(X)
  2. Covariance matrix: C = XᵀX / (n-1)
  3. Eigendecomposition: C = VΛVᵀ
  4. Sort eigenvectors by eigenvalue
  5. Project: Z = X · V_k

PC1 = eigenvector with largest eigenvalue
     = direction of maximum variance

Explained variance ratio:
  EVR_k = λₖ / Σλᵢ
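
Those five steps map one-to-one onto a few lines of NumPy — a sketch on synthetic correlated data:

import numpy as np

rng = np.random.default_rng(3)
# Correlated 2-D data
X = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=500)

X = X - X.mean(axis=0)                  # 1. center
C = X.T @ X / (len(X) - 1)              # 2. covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)    # 3. eigendecomposition (symmetric C)
order = np.argsort(eigvals)[::-1]       # 4. sort by eigenvalue, descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

Z = X @ eigvecs[:, :1]                  # 5. project onto PC1
print("explained variance ratio:", (eigvals / eigvals.sum()).round(3))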
Block 07 — Anomaly Detection

Finding the Unexpected

20
Ch. 20
Z-Score
Measures how many standard deviations a point is from the mean. Points beyond a threshold (typically ±3σ) are flagged as anomalies. Assumes Gaussian distribution. Fast, interpretable, and a strong baseline.
Anomaly Detection · Statistical · Univariate · Gaussian
✓ Use when
Data is approximately Gaussian. Single variable. Need an interpretable baseline.
✗ Avoid when
Non-Gaussian distributions. Multivariate anomalies. Context-dependent outliers.
→ Demo 11
21
Ch. 21
Isolation Forest
Anomalies are easier to isolate than normal points. Builds random trees that split features randomly — anomalies need fewer splits to be isolated. The anomaly score is inversely related to the average path length.
Anomaly Detection · Tree-based · Multivariate · Unsupervised
✓ Use when
Multivariate anomalies. Non-Gaussian data. Large datasets. No labeled anomalies.
✗ Avoid when
Very high-dimensional data. Anomalies cluster together (they become hard to isolate).
→ Demo 11
Think of it like...
Z-Score is like asking "how far is this from the average?" Isolation Forest asks "how quickly can I separate this point from everyone else?" Both find outliers — but one uses statistics, the other uses trees.
[Widget — Z-Score (threshold ±3σ): Method · Anomalies · Normal · Threshold]
Demo 11 — Anomaly Detection

Z-Score vs Isolation Forest

Same dataset, two detection methods. Z-Score flags points beyond ±Nσ from the mean — simple and fast but assumes Gaussian data. Isolation Forest builds random trees and isolates anomalies — works on any distribution and catches multivariate outliers Z-Score misses.

Sigma threshold 3.0σ
The math behind it
Z-Score:
  z = (x - μ) / σ
  flag if |z| > threshold (typically 3)
  assumes N(μ, σ²) distribution

Isolation Forest:
  score(x) = 2^(-E[h(x)] / c(n))
  h(x) = path length to isolate x
  c(n) = avg path length for n points
  anomaly if score → 1 (short path)
  normal  if score → 0 (long path)
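
A short sketch of both detectors on synthetic data with planted outliers; the contamination rate is an assumption about the expected outlier fraction:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(4)
X = rng.normal(0, 1, size=(200, 2))
X[:5] += 6                                   # plant 5 obvious outliers

# Z-Score, per feature
z = (X - X.mean(axis=0)) / X.std(axis=0)
z_flags = (np.abs(z) > 3).any(axis=1)

# Isolation Forest
iso = IsolationForest(contamination=0.025, random_state=0).fit(X)
iso_flags = iso.predict(X) == -1             # -1 marks anomalies

print("Z-Score anomalies:", int(z_flags.sum()))
print("IsolationForest anomalies:", int(iso_flags.sum()))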
Block 08 — Reinforcement Learning

Learning by Trial and Error

22
Ch. 22
Q-Learning
An agent learns a Q-table — the expected reward for taking action A in state S. It explores the environment, receives rewards, and updates Q-values using the Bellman equation. Model-free: no prior knowledge of the environment needed.
Reinforcement Learning · Model-Free · Q-Table · Discrete
✓ Use when
Discrete state and action spaces. Sequential decision making. Reward signal available.
✗ Avoid when
Continuous high-dimensional state spaces (use Deep RL instead).
→ Demo 12
23
Ch. 23
Policy Optimization
Directly optimizes the policy π(a|s) — the probability of taking action A in state S. Uses gradient ascent on expected cumulative reward. REINFORCE and PPO are key algorithms. Works in continuous action spaces.
Reinforcement Learning · Policy Gradient · Continuous · REINFORCE
✓ Use when
Continuous action spaces. Stochastic policies needed. Robotics and control tasks.
✗ Avoid when
Discrete simple problems — Q-learning is more efficient there.
→ Demo 12
24
Ch. 24
Model-Based RL
The agent builds an internal model of the environment — learning transition dynamics P(s'|s,a) and reward function R(s,a). Uses the model to plan ahead before acting. More sample-efficient than model-free approaches.
Reinforcement Learning · Model-Based · Planning · Sample Efficient
✓ Use when
Simulations available. Sample efficiency critical. Environment is learnable.
✗ Avoid when
Environment is stochastic and hard to model. Model errors compound badly.
→ Demo 12
Think of it like...
A robot in a maze with no map. It moves randomly at first, bumping into walls (negative reward) and occasionally finding the exit (positive reward). Over hundreds of episodes it builds a memory — a table of which direction to go from each cell to maximize reward. That table is the Q-table.
[Widget — Q-Learning: Episode 0 · Steps · Total reward · ε (explore) 1.00]
Demo 12 — Reinforcement Learning

The Agent That Learns by Doing

The orange agent starts at top-left. The green cell is the goal (+10 reward). Red cells are penalties (-5). Watch the ε parameter — it starts at 1.0 (fully random exploration) and decays as the agent learns. The arrows show what the Q-table has learned: the best action from each cell.

Learning rate α 0.50
Discount factor γ 0.90
The math behind it
Bellman equation (Q-update):
  Q(s,a) ← Q(s,a) + α[r + γ·max Q(s',a') - Q(s,a)]

Where:
  α  = learning rate (how fast to update)
  γ  = discount factor (value of future rewards)
  r  = immediate reward
  s' = next state after action a

ε-greedy policy:
  with prob ε: explore (random action)
  with prob 1-ε: exploit (best Q action)
  ε decays over time: ε = ε · decay_rate
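
A minimal sketch of the full loop on a toy environment — a 1-D corridor instead of the demo's grid; α, γ, and the decay rate are illustrative:

import numpy as np

# States 0..4; the goal is cell 4 (+10). Actions: 0 = left, 1 = right.
n_states, n_actions, goal = 5, 2, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.5, 0.9, 1.0

rng = np.random.default_rng(0)
for episode in range(200):
    s = 0
    while s != goal:
        # ε-greedy: explore with prob ε, otherwise exploit the Q-table
        a = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())
        s2 = max(0, s - 1) if a == 0 else min(goal, s + 1)
        r = 10.0 if s2 == goal else -1.0       # step cost pushes toward goal
        # Bellman update
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
        s = s2
    eps *= 0.98                                 # decay exploration

print(np.argmax(Q[:goal], axis=1))   # learned policy: [1 1 1 1] — go right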
Not an oracle.
Not a black box.
Just code, statistics,
and the right questions.
About this project

Everyone is talking about AI. Most treat it like an oracle. Here's what it actually is: Code. Statistics. Data. Things we've had for decades — we just weren't asking the right questions.

I've spent several years in Operational Excellence roles — Lean, Six Sigma, Continuous Improvement on the floor. Plants full of data. Machines sending signals. People making reactive decisions. So I decided to build the bridge.

Projects. Real use cases. Open code. Not a course. Not a pitch. Just the honest path — including the mistakes — so others can go further without hitting the same walls.

If you work in operations, maintenance, quality, or CI — let's connect, collaborate, and learn together. More projects are on the way.