Supervised Learning

Interactive ML Models

24 models explained through interactive visualizations. Real algorithms running in your browser — geometry, intuition, and math.

Before we compare the models

Understand the building block first

Think of it like...
A volume knob that only goes from 0 to 1. No matter how extreme the input signal is — very negative or very positive — the output always stays between 0 (definitely not) and 1 (definitely yes). That's the sigmoid.
σ(z) = 1 / (1 + e⁻ᶻ)
z = w·x + b
[Live readout — z: 0.00 · σ(z) = P(y=1): 0.500 · Decision: Uncertain]
Demo 00 — Logistic Regression

From a Line to a Probability

A linear model produces z = w·x + b — any number from -∞ to +∞. That's useless as a probability. The sigmoid squashes it into (0, 1). Drag the slider to move your input point. Watch how the curve compresses extreme values toward 0 and 1 — and how the decision flips when σ(z) crosses 0.5.

Input z (linear score) 0.0
Decision threshold 0.50
The math behind it
Sigmoid:    σ(z) = 1 / (1 + e^-z)

Properties: σ(0)   = 0.5   (uncertain)
            σ(+∞)  → 1.0   (class 1)
            σ(-∞)  → 0.0   (class 0)
            σ'(z)  = σ(z)·(1 - σ(z))  ← used in backprop

Decision:   ŷ = 1  if σ(z) ≥ threshold
            ŷ = 0  otherwise
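
A minimal NumPy sketch of the same computation — the z values are illustrative, not the demo's actual data:

import numpy as np

def sigmoid(z):
    # squashes any real-valued score into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])   # linear scores w·x + b
p = sigmoid(z)                               # P(y=1) for each score
y_hat = (p >= 0.5).astype(int)               # decision at threshold 0.5

print(p.round(4))   # [0.0067 0.2689 0.5    0.7311 0.9933]
print(y_hat)        # [0 0 1 1 1]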
Block 01 — Supervised Classification

Linear & Probabilistic Models

01
Ch. 01
Logistic Regression
Applies a sigmoid function to a linear combination of features to estimate class probabilities. The decision boundary is a hyperplane — a straight line in 2D. Simple, fast, and interpretable.
Supervised · Classifier · Parametric · Linear · Probabilistic
✓ Use when
Classes are linearly separable. You need probability estimates. Starting baseline.
✗ Avoid when
Boundary is non-linear. Features are highly correlated.
→ Demo 01
02
Ch. 02
Naive Bayes
Uses Bayes' theorem assuming features are conditionally independent given the class. Computes Gaussian likelihoods per class and picks the most probable one. Surprisingly powerful for its simplicity.
Supervised · Classifier · Probabilistic · Generative · Fast
✓ Use when
Small datasets. Real-time predictions. Text classification.
✗ Avoid when
Features are correlated. Independence assumption is clearly wrong.
→ Demo 01
03
Ch. 03
K-Nearest Neighbors
Stores all training points and classifies by majority vote among the K closest neighbors. No training phase — the model IS the data. Boundary adapts entirely to local density.
Supervised · Classifier · Non-parametric · Instance-based · Lazy
✓ Use when
Non-linear boundaries. Small/medium datasets. Low-dimensional data.
✗ Avoid when
Large datasets (slow predictions). High-dimensional data (curse of dimensionality).
→ Demo 01
Think of it like...
Three different artists drawing the border between two countries — each follows a different rule, but they're all looking at the same map.
[Widget — Logistic Regression: Model · live Accuracy · Boundary: Linear]
Demo 01 — Supervised Classification

How Classifiers Draw Boundaries

Switch between models and watch how the decision boundary changes over the same dataset. Logistic draws a straight line. Naive Bayes draws elliptical contours. KNN bends to fit every cluster — adjust K to control how smooth or jagged it gets.

The math behind it
Logistic:   σ(w·x + b) = P(y=1|x)
            where σ(z) = 1 / (1 + e^-z)

Naive Bayes: P(C|x) ∝ P(C) · ∏ᵢ P(xᵢ|C)
             P(xᵢ|C) ~ N(μᵢ꜀, σᵢ꜀²)

KNN:         ŷ = mode { yᵢ : i ∈ KNN(x, K) }
             distance = √(Σ (xᵢ-xⱼ)²)
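
A minimal scikit-learn sketch of the same comparison — the synthetic dataset and hyperparameters are assumptions for illustration, not the demo's actual code:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Two 2-D clusters, one per class
X, y = make_blobs(n_samples=200, centers=2, random_state=0)

models = {
    "Logistic":    LogisticRegression(),                  # linear boundary
    "Naive Bayes": GaussianNB(),                          # elliptical contours
    "KNN (K=5)":   KNeighborsClassifier(n_neighbors=5),   # local, jagged
}
for name, model in models.items():
    acc = model.fit(X, y).score(X, y)
    print(f"{name:12s} train accuracy = {acc:.2f}")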
Think of it like...
Each class leaves a fingerprint on every feature — a bell curve showing where that class typically lives. When you see a new point, you check: which class's fingerprint fits it better? The one with the taller bell at that spot wins.
[Widget — Gaussian Likelihood, Feature 1: x value · P(C0 | x) · P(C1 | x) · Predicted class]
Demo 01b — Naive Bayes

Why the Gaussians?

Naive Bayes fits a Gaussian (bell curve) to each feature for each class. The blue bell is Class 0, the orange bell is Class 1. Drag the slider to move your test point — the bars show the posterior probability for each class in real time. The overlap zone is where the model is uncertain.

Test point x 0.0
Class 0: 50% · Class 1: 50%
The math behind it
Gaussian likelihood:
  P(x | C) = (1/√(2πσ²)) · exp(-(x-μ)²/(2σ²))

Posterior (Bayes rule):
  P(C | x) ∝ P(C) · P(x | C)

"Naive" assumption:
  P(x₁,x₂,...,xₙ | C) = ∏ᵢ P(xᵢ | C)
  ← features are independent given the class

Decision:
  ŷ = argmax_C  log P(C) + Σᵢ log P(xᵢ|C)
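
A short sketch of that posterior computation, with made-up Gaussian parameters for the two classes:

import numpy as np
from scipy.stats import norm

# Class-conditional Gaussians per feature (values assumed for illustration)
mu  = {0: 2.0, 1: 5.0}     # means for Class 0 / Class 1
sig = {0: 1.0, 1: 1.2}     # standard deviations
prior = {0: 0.5, 1: 0.5}

def posterior(x):
    # P(C|x) ∝ P(C) · P(x|C), normalized over the two classes
    joint = {c: prior[c] * norm.pdf(x, mu[c], sig[c]) for c in (0, 1)}
    total = sum(joint.values())
    return {c: round(joint[c] / total, 3) for c in (0, 1)}

print(posterior(2.0))   # Class 0 dominates
print(posterior(3.5))   # overlap zone — close to 50/50
print(posterior(5.0))   # Class 1 dominates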
Block 02 — Supervised Classification

Kernel & Tree-Based Models

04
Ch. 04
Support Vector Machine
Finds the hyperplane that maximizes the margin between classes. Only the "support vectors" — the closest points to the boundary — define the model. Kernel functions allow non-linear separation by lifting data to higher dimensions.
Supervised · Classifier · Max-Margin · Kernel · High-dimensional
✓ Use when
High-dimensional data. Clear margin between classes. Small-medium datasets.
✗ Avoid when
Very large datasets. Need probability outputs. Noisy data with heavy overlap.
→ Demo 03
05
Ch. 05
Decision Tree
Recursively splits the feature space with axis-aligned cuts, choosing each split to minimize Gini impurity. The result is a set of human-readable if/then rules. You can trace every single prediction.
Supervised · Classifier · Interpretable · Rule-based · Non-parametric
✓ Use when
Rules matter. Mixed data types. Explainability is required by stakeholders.
✗ Avoid when
Data is noisy. You need stable predictions — single trees overfit easily.
→ Demo 02
06
Ch. 06
Random Forest
Builds many decision trees on bootstrap samples using random feature subsets. Each tree votes, majority wins. The variance drops dramatically compared to a single tree — without significantly increasing bias.
Supervised · Classifier · Ensemble · Bagging · Robust
✓ Use when
Accuracy is the priority. Noisy data. Large feature sets. Robust baseline.
✗ Avoid when
Need fast real-time predictions. Full interpretability is required.
→ Demo 02
Think of it like...
A game of 20 questions — the computer learns which yes/no question cuts the most confusion in half at each step. Then it invites a hundred friends to play, and takes the majority answer.
[Widget — Decision Tree: Depth 0 · Leaves 1 · Gini · Mode: Tree]
Demo 02 — Tree-Based Models

The Tree That Learns to Split

Watch the decision tree grow one level at a time. Each split is chosen to minimize Gini impurity — a measure of how mixed the classes are. Then toggle to Random Forest and see how combining many trees smooths the boundary and reduces overfitting.

Max depth 1
The math behind it
Gini impurity:  G = 1 - Σ pₖ²
                (0 = pure, 0.5 = maximally mixed)

Best split:     argmin over (f,t):  (n_L·G_L + n_R·G_R) / n

Bootstrap:     sample n points WITH replacement
               → each tree sees ~63% of data

Aggregation:   ŷ = mode { ŷ_tree₁, ŷ_tree₂, ... }
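
Both halves fit in a few lines of Python — Gini by hand, then a single tree versus a forest in scikit-learn (the dataset and depths are illustrative):

import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

def gini(p):
    # G = 1 - Σ pₖ²; 0 = pure leaf, 0.5 = 50/50 mix (two classes)
    return 1.0 - np.sum(np.asarray(p) ** 2)

print(gini([1.0, 0.0]))   # 0.0 — pure
print(gini([0.5, 0.5]))   # 0.5 — maximally mixed

X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
tree   = DecisionTreeClassifier(random_state=0).fit(X, y)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
# A single deep tree memorizes noise; bagging many trees smooths the boundary
print("tree leaves:", tree.get_n_leaves())
print("forest train accuracy:", forest.score(X, y))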
Think of it like...
Finding the widest road that separates two neighborhoods — not just any dividing line, but the one with the most breathing room on both sides. The houses right on the edge of that road are the support vectors.
[Widget — SVM: Kernel: Linear · Support Vecs · C (slack): 1.0]
Demo 03 — Support Vector Machine

The Margin That Separates Everything

The support vectors (highlighted points) are the only ones that define the boundary — remove any other point and nothing changes. The C parameter controls the tradeoff: low C allows some misclassifications to keep the margin wide; high C forces strict separation. Toggle to RBF kernel to see how SVM handles data that no straight line can separate.

Regularization C 1.0
The math behind it
Objective:  min  ½||w||² + C·Σ max(0, 1-yᵢ(w·xᵢ+b))

Margin:     M = 2/||w||   ← maximize this

Support vectors: points where yᵢ(w·xᵢ+b) = 1

RBF kernel: K(x,x') = exp(-γ‖x-x'‖²)
            maps data to infinite-dim. space
            curved boundaries in original space
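
A hedged sketch of the kernel effect in scikit-learn — same data, linear versus RBF; the dataset and C value are assumptions:

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles — no straight line can separate these classes
X, y = make_circles(n_samples=200, factor=0.4, noise=0.1, random_state=0)

for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel, C=1.0).fit(X, y)
    print(f"{kernel:6s} accuracy={clf.score(X, y):.2f} "
          f"support vectors={clf.n_support_.sum()}")
# linear fails (≈ chance); rbf separates the rings with a curved boundary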
Block 03 — Supervised Regression

Predicting Continuous Values

07
Ch. 07
Linear Regression
Fits a straight line through data by minimizing the sum of squared residuals (RSS). Every prediction is a weighted combination of inputs plus a bias term. The foundation of all regression.
Supervised · Regression · Parametric · Linear · Interpretable
✓ Use when
Linear relationship exists. Need interpretable coefficients. Baseline for any regression task.
✗ Avoid when
Non-linear patterns. Many correlated features. Outliers dominate the data.
→ Demo 04
08
Ch. 08
Multiple Regression
Extends linear regression to p features. The model fits a hyperplane in p+1 dimensions. Each coefficient β tells you the expected change in y per unit increase in that feature, holding all others constant.
Supervised · Regression · Parametric · Multivariate · OLS
✓ Use when
Multiple drivers explain the outcome. Need to isolate each variable's impact.
✗ Avoid when
Features are multicollinear. More features than observations (p > n).
→ Demo 05
09
Ch. 09
Lasso Regression
Adds an L1 penalty (λ·Σ|βⱼ|) to OLS. This forces some coefficients to exactly zero — performing automatic feature selection. The higher λ, the more features are eliminated from the model entirely.
Supervised · Regression · Regularized · L1 Penalty · Feature Selection
✓ Use when
Many features, few relevant. Need automatic feature selection. Sparse solutions preferred.
✗ Avoid when
All features are relevant. Correlated features (picks one randomly, drops others).
→ Demo 06
10
Ch. 10
Ridge Regression
Adds an L2 penalty (λ·Σβⱼ²) to OLS. Unlike Lasso, Ridge shrinks all coefficients toward zero but never to exactly zero — keeping all features in the model with reduced magnitude.
Supervised · Regression · Regularized · L2 Penalty · Stable
✓ Use when
Multicollinearity is a problem. All features are relevant. Stable, continuous shrinkage needed.
✗ Avoid when
You need feature selection (Ridge never zeroes out coefficients).
→ Demo 06
11
Ch. 11
ElasticNet
Combines L1 and L2 penalties with a mixing ratio α. When α=1 it becomes pure Lasso; when α=0 it becomes pure Ridge. Captures both benefits: sparse solutions and stability with correlated features.
Supervised · Regression · Regularized · L1+L2 · Hybrid
✓ Use when
Correlated features exist AND some are irrelevant. Best of both worlds.
✗ Avoid when
You need pure interpretability — two hyperparameters (λ, α) complicate explanation.
→ Demo 06
Coming soon
More models on the way
This page is updated continuously as new projects are published. Follow on GitHub to stay current.
→ github.com/LozanoLsa
Think of it like...
Stretching a rubber band between two poles and letting it settle where it pulls equally from all data points. The line finds the position that makes everyone equally unhappy — minimizing the total tension.
[Widget — Linear Regression (OLS): RMSE · Slope β₁ · Intercept β₀]
Demo 04 — Linear Regression

The Line That Minimizes Error

The red dashed lines are residuals — the vertical distance from each point to the regression line. OLS finds the line that minimizes their sum of squares (RSS). Toggle Show Residuals to see them, and drag any point to watch how the line reacts in real time. Watch R² — it tells you what fraction of variance the model explains: 0 means none, 1 means perfect.

Noise level low
The math behind it
Model:    ŷ = β₀ + β₁x

OLS solution (closed form):
  β₁ = Σ(xᵢ-x̄)(yᵢ-ȳ) / Σ(xᵢ-x̄)²
  β₀ = ȳ - β₁·x̄

Loss:     RSS = Σ(yᵢ - ŷᵢ)²   ← minimize this

Metrics:
  R²   = 1 - RSS/TSS    (0=bad, 1=perfect)
  TSS  = Σ(yᵢ - ȳ)²    (total variance)
  RMSE = √(RSS/n)       (avg error in y units)

R² interpretation:
  0.0–0.3  → weak fit
  0.3–0.7  → moderate fit
  0.7–1.0  → strong fit
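
The closed-form solution is a few lines of NumPy — a sketch on synthetic data with a made-up slope, intercept, and noise level:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 2.0, 50)   # true slope 2, intercept 1

# Closed-form OLS, exactly as above
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

y_hat = b0 + b1 * x
rss = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)
print(f"β₁={b1:.2f}  β₀={b0:.2f}  "
      f"R²={1 - rss/tss:.3f}  RMSE={np.sqrt(rss/len(x)):.2f}")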
Think of it like...
Machine downtime isn't caused by just one thing — temperature, speed, load, and vibration all play a role. Multiple regression isolates how much each driver contributes independently, holding the others constant.
[Widget — Multiple Regression (2 features): Adj. R² · β₁ (X1) · β₂ (X2)]
Demo 05 — Multiple Regression

When More Variables Change Everything

Two features, one outcome. The scatter shows X1 vs Y with points colored by X2. Toggle features on/off and watch how R² and the coefficients change. Adjusted R² penalizes adding useless variables — if it drops when you add X2, that feature isn't helping. This is the core of variable selection in Six Sigma MSA.

Correlation X1↔X2 low
The math behind it
Model:   ŷ = β₀ + β₁X₁ + β₂X₂ + ... + βₚXₚ

Matrix form:   β = (XᵀX)⁻¹Xᵀy

Adj. R²  = 1 - (1-R²)·(n-1)/(n-p-1)
           penalizes adding irrelevant features

Multicollinearity problem:
  if X₁ ≈ X₂ → XᵀX near-singular → unstable β
  Corr(X₁,X₂) > 0.8 → use Ridge or Lasso instead

Coefficient interpretation:
  β₁ = change in y per unit X₁, holding X₂ constant
  β₂ = change in y per unit X₂, holding X₁ constant
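
A minimal NumPy sketch of the matrix form and adjusted R², on synthetic data with assumed coefficients:

import numpy as np

rng = np.random.default_rng(1)
n = 100
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
y  = 3.0 * X1 + 1.5 * X2 + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), X1, X2])   # design matrix with intercept
beta = np.linalg.solve(X.T @ X, X.T @ y)    # (XᵀX)⁻¹Xᵀy, solved stably

y_hat = X @ beta
r2 = 1 - np.sum((y - y_hat)**2) / np.sum((y - y.mean())**2)
p = 2
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print("β:", beta.round(2), " adj R²:", round(adj_r2, 3))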
Think of it like...
You have 20 process variables but suspect only 4 actually drive defects. Regularization adds a tax on complexity — forcing the model to justify every variable it keeps. Lasso eliminates them entirely. Ridge just limits their influence. ElasticNet does both.
[Widget — Method: Lasso · λ: 0.10 · Active features: live]
Demo 06 — Lasso · Ridge · ElasticNet

The Penalty That Shapes the Model

8 features — only 3 are truly relevant. The bars show each coefficient's magnitude. As you increase λ, watch what happens: Lasso drives irrelevant coefficients to exactly zero (feature selection). Ridge shrinks all of them but never reaches zero. ElasticNet does both. The active features count tells you how many survived the penalty.

Regularization λ 0.10
The math behind it
OLS loss:     RSS = Σ(yᵢ - ŷᵢ)²

Lasso:   min RSS + λ·Σ|βⱼ|       ← L1
         → some βⱼ = 0 exactly (sparse)

Ridge:   min RSS + λ·Σβⱼ²        ← L2
         → all βⱼ shrink, none → 0

ElasticNet: min RSS + λ·[α·Σ|βⱼ| + (1-α)·Σβⱼ²]
         → α=1: pure Lasso
         → α=0: pure Ridge
         → 0 < α < 1: hybrid

λ effect:  λ→0: approaches OLS (no penalty)
           λ→∞: all coefficients → 0
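
A quick scikit-learn sketch of the three penalties on one synthetic problem. Note the naming: scikit-learn calls λ "alpha" and the mixing ratio α "l1_ratio", and its ElasticNet scales the two terms slightly differently than the formula above:

import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 8))
true_beta = np.array([3.0, -2.0, 1.5, 0, 0, 0, 0, 0])   # only 3 relevant
y = X @ true_beta + rng.normal(scale=0.5, size=200)

for name, model in [("Lasso",      Lasso(alpha=0.1)),
                    ("Ridge",      Ridge(alpha=0.1)),
                    ("ElasticNet", ElasticNet(alpha=0.1, l1_ratio=0.5))]:
    coef = model.fit(X, y).coef_
    active = np.sum(np.abs(coef) > 1e-6)
    print(f"{name:10s} active={active}  coef={coef.round(2)}")
# Lasso zeroes the 5 irrelevant features; Ridge only shrinks them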
Block 04 — Supervised · Boosting

Gradient Boosting

12
Ch. 12
XGBoost
Builds trees sequentially — each one correcting the errors of the previous. Uses gradient descent in function space to minimize loss. One of the most powerful and widely-used models in industry and competitions.
Supervised · Ensemble · Boosting · Gradient Descent · High Performance
✓ Use when
Tabular data. Competition-level accuracy needed. Mixed feature types.
✗ Avoid when
Need full interpretability. Very small datasets. Image or text data.
→ Demo 07
Coming soon
More boosting variants
LightGBM, CatBoost and other gradient boosting frameworks will be added as projects are completed.
Think of it like...
A team of students taking an exam in sequence. The first student answers. The second only focuses on what the first got wrong. The third fixes what the second missed. Each one specializes in the previous team's mistakes — together they get nearly everything right.
[Widget — XGBoost boosting steps: Trees built 0 · Train error · vs Random Forest]
Demo 07 — XGBoost

How Boosting Corrects Itself

Each bar is one boosting round. The gold bars show XGBoost error dropping with each tree added — it learns from its own mistakes. The gray line is Random Forest at the same number of trees. Watch how boosting converges faster and lower by focusing on hard examples.

Learning rate η 0.30
The math behind it
Boosting update:
  F_m(x) = F_{m-1}(x) + η · h_m(x)

Where h_m fits the residuals:
  rᵢ = yᵢ - F_{m-1}(xᵢ)   ← what was wrong

Objective (XGBoost):
  L = Σ loss(yᵢ, ŷᵢ) + Σ Ω(fₖ)
  Ω(f) = γT + ½λ‖w‖²  ← regularization

η (learning rate): smaller = slower convergence, better generalization
Trees: more = less bias, risk of overfitting
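
This sketch is not XGBoost itself — it omits the Ω regularizer and second-order gradients — but the core boosting loop with squared loss fits in a few lines:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

eta, n_rounds = 0.3, 50
F = np.full(len(y), y.mean())          # F_0: constant prediction
for m in range(n_rounds):
    r = y - F                          # residuals: what is still wrong
    h = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, r)
    F = F + eta * h.predict(X)         # F_m = F_{m-1} + η·h_m

rmse = np.sqrt(np.mean((y - F) ** 2))
print(f"train RMSE after {n_rounds} rounds: {rmse:.2f}")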
Block 05 — Semi-Supervised Learning

Learning with Few Labels

13
Ch. 13
Self Training
Trains on labeled data, then predicts labels for unlabeled data with high confidence and adds them to training. Iterates until no more confident predictions remain. The model teaches itself.
Semi-Supervised · Iterative · Label Propagation · Few Labels
✓ Use when
Few labeled examples, many unlabeled. Labeling is expensive or slow.
✗ Avoid when
Initial model is weak — wrong labels amplify errors.
→ Demo 08
14
Ch. 14
Co-Training
Uses two independent models trained on different feature views. Each model labels data for the other — they teach each other. Requires the two views to be conditionally independent given the class.
Semi-Supervised · Multi-View · Collaborative · Feature Split
✓ Use when
Data has two natural independent views (e.g., sensor A and sensor B).
✗ Avoid when
Feature views are correlated — the independence assumption breaks down.
→ Demo 08
Think of it like...
You label 10 items in a factory. A model learns from those 10. It then looks at 200 unlabeled items and says "I'm 95% sure this one is defective." You trust it, add that label, retrain. Repeat. Now you effectively have hundreds of labeled examples — from just 10.
[Widget — Self Training: Iteration 0 · Labeled 10 · Unlabeled 90 · Accuracy]
Demo 08 — Semi-Supervised

The Model That Labels Itself

Solid points are labeled — the model knows their class. Hollow points are unlabeled. Each iteration, the model assigns pseudo-labels to the unlabeled points it's most confident about, then retrains. Watch the boundary sharpen as more points get labeled.

Confidence threshold 0.80
The math behind it
Algorithm:
  1. Train f on labeled set L
  2. For each x in unlabeled set U:
       if max P(y|x) ≥ threshold:
         add (x, argmax P(y|x)) to L
  3. Remove added points from U
  4. Repeat until U is empty or no
     confident predictions remain

Risk: if initial model is wrong,
errors propagate through iterations
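
A minimal sketch of that algorithm, assuming a logistic base model and ten initial labels:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
labeled = np.zeros(len(y), dtype=bool)
labeled[np.where(y == 0)[0][:5]] = True   # start with 5 labels per class
labeled[np.where(y == 1)[0][:5]] = True
pseudo_y = np.where(labeled, y, -1)       # -1 = unknown

threshold = 0.80
for it in range(20):
    clf = LogisticRegression().fit(X[labeled], pseudo_y[labeled])
    if labeled.all():
        break
    proba = clf.predict_proba(X[~labeled])
    confident = proba.max(axis=1) >= threshold
    if not confident.any():
        break                              # no confident predictions remain
    idx = np.where(~labeled)[0][confident]
    pseudo_y[idx] = clf.classes_[proba[confident].argmax(axis=1)]
    labeled[idx] = True                    # trust the model, retrain
    print(f"iteration {it}: labeled = {labeled.sum()} / {len(y)}")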
Block 06 — Unsupervised Learning

Finding Structure Without Labels

15
Ch. 15
K-Means
Groups data into K clusters by minimizing intra-cluster distance. Assigns each point to the nearest centroid, then recalculates centroids. Repeats until convergence. Simple, fast, and widely used.
Unsupervised · Clustering · Centroid-based · Euclidean
✓ Use when
Clusters are roughly spherical and similar size. K is known or estimable.
✗ Avoid when
Clusters have irregular shapes. Outliers present. K unknown.
→ Demo 09
16
Ch. 16
DBSCAN
Density-Based Spatial Clustering. Groups points that are closely packed together and marks outliers as noise. Finds clusters of arbitrary shape — no need to specify K. Points are core, border, or noise.
Unsupervised · Clustering · Density-based · Outlier Detection
✓ Use when
Arbitrary cluster shapes. Outliers present. K unknown. Dense regions visible.
✗ Avoid when
Varying density clusters. High-dimensional data (distance loses meaning).
→ Demo 09
17
Ch. 17
PCA
Principal Component Analysis finds the directions of maximum variance in data and projects it onto fewer dimensions. Each principal component is orthogonal to the others. Reduces noise and reveals hidden structure.
Unsupervised · Dimensionality Reduction · Linear · Variance
✓ Use when
Too many features. Features are correlated. Visualization of high-dim data needed.
✗ Avoid when
Non-linear structure in data. Interpretability of original features required.
→ Demo 10
18
Ch. 18
ICA
Independent Component Analysis separates a multivariate signal into additive, statistically independent components. Unlike PCA (which finds uncorrelated components), ICA finds truly independent sources. Classic use: separating mixed audio signals.
Unsupervised · Dimensionality Reduction · Source Separation · Non-Gaussian
✓ Use when
Multiple independent signals are mixed together. Non-Gaussian sources expected.
✗ Avoid when
Sources are Gaussian — ICA cannot separate them. Small sample sizes.
→ Demo 10
19
Ch. 19
Apriori
Discovers frequent itemsets and association rules from transactional data. Uses support, confidence, and lift to find "if A then B" patterns. The algorithm behind market basket analysis and recommendation systems.
Unsupervised · Association Rules · Frequent Patterns · Transactions
✓ Use when
Transactional data. Finding co-occurrence patterns. Recommendation engines.
✗ Avoid when
Continuous data. Very large item sets (exponential search space).
→ Demo 09
Think of it like...
K-Means draws circles around groups — it works perfectly when groups are round and similar in size. DBSCAN finds dense neighborhoods — it works on any shape and marks lonely points as outliers. Same data, completely different logic.
[Widget — K-Means (K=3): Algorithm · Clusters 3 · Noise points · Inertia]
Demo 09 — Clustering

K-Means vs DBSCAN

Switch between K-Means and DBSCAN on the same dataset. K-Means forces every point into a cluster — even outliers. DBSCAN identifies true cluster shapes and marks outliers as noise. Use the dataset toggle to see where each algorithm struggles.

K clusters 3
The math behind it
K-Means objective:
  min Σₖ Σ_{x∈Cₖ} ||x - μₖ||²
  μₖ = centroid of cluster k

DBSCAN definitions:
  Core point: ≥ minPts in ε-neighborhood
  Border point: in ε-neighborhood of core
  Noise point: neither core nor border

ε (epsilon): radius of neighborhood
minPts: minimum points to form a cluster
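
A quick sketch of both algorithms on non-spherical data; eps and min_samples here are illustrative choices, not tuned values:

import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN

# Two interleaved half-moons — non-spherical clusters
X, _ = make_moons(n_samples=300, noise=0.08, random_state=0)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

print("K-Means inertia:", round(km.inertia_, 1))        # every point assigned
print("DBSCAN clusters:", len(set(db.labels_) - {-1}))  # -1 = noise label
print("DBSCAN noise points:", int(np.sum(db.labels_ == -1)))
# K-Means cuts the moons in half; DBSCAN follows their shape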
Think of it like...
A 3D object casts a 2D shadow on the wall — information is lost, but the essential shape is preserved. PCA finds the angle that preserves the most information when projecting data to fewer dimensions.
[Widget — PCA: Original dims 2 · PC1 variance · PC2 variance]
Demo 10 — PCA · ICA

Finding the Direction of Most Variance

The blue arrow is PC1 — the direction that captures the most variance in the data. The gray arrow is PC2 — orthogonal to PC1. Drag the correlation slider to change the data shape and watch how the principal components rotate to always point along maximum variance.

Feature correlation 0.80
Show projection onto PC1 ON
The math behind it
Steps:
  1. Center data: X = X - mean(X)
  2. Covariance matrix: C = XᵀX / (n-1)
  3. Eigendecomposition: C = VΛVᵀ
  4. Sort eigenvectors by eigenvalue
  5. Project: Z = X · V_k

PC1 = eigenvector with largest eigenvalue
     = direction of maximum variance

Explained variance ratio:
  EVR_k = λₖ / Σλᵢ
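
Those five steps map one-to-one onto a few lines of NumPy — a sketch on synthetic correlated data:

import numpy as np

rng = np.random.default_rng(3)
# Correlated 2-D data
X = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=500)

X = X - X.mean(axis=0)                  # 1. center
C = X.T @ X / (len(X) - 1)              # 2. covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)    # 3. eigendecomposition (symmetric C)
order = np.argsort(eigvals)[::-1]       # 4. sort by eigenvalue, descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

Z = X @ eigvecs[:, :1]                  # 5. project onto PC1
print("explained variance ratio:", (eigvals / eigvals.sum()).round(3))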
Block 07 — Anomaly Detection

Finding the Unexpected

20
Ch. 20
Z-Score
Measures how many standard deviations a point is from the mean. Points beyond a threshold (typically ±3σ) are flagged as anomalies. Assumes Gaussian distribution. Fast, interpretable, and a strong baseline.
Anomaly Detection · Statistical · Univariate · Gaussian
✓ Use when
Data is approximately Gaussian. Single variable. Need an interpretable baseline.
✗ Avoid when
Non-Gaussian distributions. Multivariate anomalies. Context-dependent outliers.
→ Demo 11
21
Ch. 21
Isolation Forest
Anomalies are easier to isolate than normal points. Builds random trees that split features randomly — anomalies need fewer splits to be isolated. The anomaly score is inversely related to the average path length.
Anomaly Detection · Tree-based · Multivariate · Unsupervised
✓ Use when
Multivariate anomalies. Non-Gaussian data. Large datasets. No labeled anomalies.
✗ Avoid when
Very high-dimensional data. Anomalies cluster together (they become hard to isolate).
→ Demo 11
Think of it like...
Z-Score is like asking "how far is this from the average?" Isolation Forest asks "how quickly can I separate this point from everyone else?" Both find outliers — but one uses statistics, the other uses trees.
[Widget — Z-Score (threshold ±3σ): Method · Anomalies · Normal · Threshold]
Demo 11 — Anomaly Detection

Z-Score vs Isolation Forest

Same dataset, two detection methods. Z-Score flags points beyond ±Nσ from the mean — simple and fast but assumes Gaussian data. Isolation Forest builds random trees and isolates anomalies — works on any distribution and catches multivariate outliers Z-Score misses.

Sigma threshold 3.0σ
The math behind it
Z-Score:
  z = (x - μ) / σ
  flag if |z| > threshold (typically 3)
  assumes N(μ, σ²) distribution

Isolation Forest:
  score(x) = 2^(-E[h(x)] / c(n))
  h(x) = path length to isolate x
  c(n) = avg path length for n points
  anomaly if score → 1 (short path)
  normal  if score → 0 (long path)
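
A short sketch of both detectors on synthetic data with planted outliers; the contamination rate is an assumption about the expected outlier fraction:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(4)
X = rng.normal(0, 1, size=(200, 2))
X[:5] += 6                                   # plant 5 obvious outliers

# Z-Score, per feature
z = (X - X.mean(axis=0)) / X.std(axis=0)
z_flags = (np.abs(z) > 3).any(axis=1)

# Isolation Forest
iso = IsolationForest(contamination=0.025, random_state=0).fit(X)
iso_flags = iso.predict(X) == -1             # -1 marks anomalies

print("Z-Score anomalies:", int(z_flags.sum()))
print("IsolationForest anomalies:", int(iso_flags.sum()))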
Block 08 — Reinforcement Learning

Learning by Trial and Error

22
Ch. 22
Q-Learning
An agent learns a Q-table — the expected reward for taking action A in state S. It explores the environment, receives rewards, and updates Q-values using the Bellman equation. Model-free: no prior knowledge of the environment needed.
Reinforcement Learning · Model-Free · Q-Table · Discrete
✓ Use when
Discrete state and action spaces. Sequential decision making. Reward signal available.
✗ Avoid when
Continuous high-dimensional state spaces (use Deep RL instead).
→ Demo 12
23
Ch. 23
Policy Optimization
Directly optimizes the policy π(a|s) — the probability of taking action A in state S. Uses gradient ascent on expected cumulative reward. REINFORCE and PPO are key algorithms. Works in continuous action spaces.
Reinforcement Learning · Policy Gradient · Continuous · REINFORCE
✓ Use when
Continuous action spaces. Stochastic policies needed. Robotics and control tasks.
✗ Avoid when
Discrete simple problems — Q-learning is more efficient there.
→ Demo 12
24
Ch. 24
Model-Based RL
The agent builds an internal model of the environment — learning transition dynamics P(s'|s,a) and reward function R(s,a). Uses the model to plan ahead before acting. More sample-efficient than model-free approaches.
Reinforcement Learning · Model-Based · Planning · Sample Efficient
✓ Use when
Simulations available. Sample efficiency critical. Environment is learnable.
✗ Avoid when
Environment is stochastic and hard to model. Model errors compound badly.
→ Demo 12
Think of it like...
A robot in a maze with no map. It moves randomly at first, bumping into walls (negative reward) and occasionally finding the exit (positive reward). Over hundreds of episodes it builds a memory — a table of which direction to go from each cell to maximize reward. That table is the Q-table.
[Widget — Q-Learning: Episode 0 · Steps · Total reward · ε (explore) 1.00]
Demo 12 — Reinforcement Learning

The Agent That Learns by Doing

The orange agent starts at top-left. The green cell is the goal (+10 reward). Red cells are penalties (-5). Watch the ε parameter — it starts at 1.0 (fully random exploration) and decays as the agent learns. The arrows show what the Q-table has learned: the best action from each cell.

Learning rate α 0.50
Discount factor γ 0.90
The math behind it
Bellman equation (Q-update):
  Q(s,a) ← Q(s,a) + α[r + γ·max Q(s',a') - Q(s,a)]

Where:
  α  = learning rate (how fast to update)
  γ  = discount factor (value of future rewards)
  r  = immediate reward
  s' = next state after action a

ε-greedy policy:
  with prob ε: explore (random action)
  with prob 1-ε: exploit (best Q action)
  ε decays over time: ε = ε · decay_rate
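
A minimal sketch of the full loop on a toy environment — a 1-D corridor instead of the demo's grid; α, γ, and the decay rate are illustrative:

import numpy as np

# States 0..4; the goal is cell 4 (+10). Actions: 0 = left, 1 = right.
n_states, n_actions, goal = 5, 2, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.5, 0.9, 1.0

rng = np.random.default_rng(0)
for episode in range(200):
    s = 0
    while s != goal:
        # ε-greedy: explore with prob ε, otherwise exploit the Q-table
        a = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())
        s2 = max(0, s - 1) if a == 0 else min(goal, s + 1)
        r = 10.0 if s2 == goal else -1.0       # step cost pushes toward goal
        # Bellman update
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
        s = s2
    eps *= 0.98                                 # decay exploration

print(np.argmax(Q[:goal], axis=1))   # learned policy: [1 1 1 1] — go right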
Not an oracle.
Not a black box.
Just code, statistics,
and the right questions.
About this project

Everyone is talking about AI. Most treat it like an oracle. Here's what it actually is: Code. Statistics. Data. Things we've had for decades — we just weren't asking the right questions.

I've spent several years in Operational Excellence roles — Lean, Six Sigma, Continuous Improvement on the floor. Plants full of data. Machines sending signals. People making reactive decisions. So I decided to build the bridge.

Projects. Real use cases. Open code. Not a course. Not a pitch. Just the honest path — including the mistakes — so others can go further without hitting the same walls.

If you work in operations, maintenance, quality, or CI — let's connect, collaborate, and learn together. More projects are on the way.