How Machines Learn: The Science Behind Model Training

Understanding Model Training — A Step-by-Step Explanation

You have probably heard the buzzwords “Model Training”, “Machine Learning”, “Model Learning”, or “AI Model” quite often — whether in tech discussions, product demos, or data science talks.

However, when it comes to explaining what actually happens during this “training” process — in plain English or even in technical terms — most people are left guessing. Is the model memorizing data? Is it adjusting something inside? What exactly is it learning?

In this blog, let’s peel back the layers and understand what truly happens when a model is trained — step by step. We’ll start from a simple analogy and then gradually move into the math behind the learning process. The goal is to make the idea of “model training” not just familiar, but intuitively clear.

Analogy: A Child Learning to Throw a Basketball

To understand the model learning process in a simple, non-technical way, imagine a child learning to throw a basketball into a hoop.

Initially, the child doesn’t know how much force to use. On the first try, the ball falls too short or goes too far. Depending on the outcome, the child adjusts slightly and tries again. After a few attempts, the child improves and starts hitting the target consistently.

That’s exactly how a machine learning model gets trained — it starts with random guesses, measures how wrong it was, adjusts itself, and improves over many repetitions. It learns not because someone told it what’s right, but by learning from its own mistakes.

Before We Begin: A Few Important Notes

Data as Numbers

To train any model — whether for image classification, prediction, or generative AI — data must be represented numerically (as integers, decimals, or vectors). In this blog, we’ll skip the mathematical details of converting data to numeric format and work with an example where the data is already numerical.

Loss Function

A loss function is like a report card for a machine learning model. It tells the model how well or how poorly it performed on the training data by comparing its predictions with the actual answers. In simple terms, the loss function calculates the difference between what the model predicted and what it should have predicted. The bigger the difference, the higher the loss — meaning the model is doing poorly.

The whole idea of model training is to minimize the loss — that is, to reduce the gap between what the model predicted and what it should have predicted with every iteration.

Optimizer

An optimizer is the part of the training process that helps the model to learn from its mistakes. Once the loss function tells the model how wrong it was, the optimizer decides how to adjust the model’s internal parameters (like weights and biases) to reduce that error in the next round.

Think of it like the model’s coach or guide — after every attempt, it reviews the model’s performance (using the loss value) and gives it small, calculated corrections to move it closer to the right answer. Technically, an optimizer updates the model’s parameters so that, with every step, the model’s predictions improve.

Popular optimizers include Gradient Descent, Adam, RMSProp, and SGD.
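As a rough illustration (not something the example later in this post depends on), in a framework such as PyTorch the optimizer is typically picked with a single line; the tiny model below is only a placeholder:

```python
import torch

# A placeholder model: one input feature, one output.
model = torch.nn.Linear(1, 1)

# Choose an optimizer; swapping SGD for Adam or RMSprop changes how the updates are computed.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
# optimizer = torch.optim.RMSprop(model.parameters(), lr=0.01)
```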

Optimization step & Learning rate

Optimization step

An optimization step is the actual moment when the model updates its internal parameters (like weights and biases) based on what it learned from the loss function. The optimization step applies the gradients computed from the loss to make the model slightly better than before.

You can think of it as the model taking one step forward in the right direction toward minimizing the loss.
Over many such steps (iterations or epochs), the model gradually “learns” the best parameter values.

Learning Rate

The learning rate, often denoted by the Greek letter η (eta), controls how big each optimization step should be. It’s a small numerical value that determines how quickly or slowly the model updates its parameters.

If the learning rate is too high, the model might overshoot the optimal point and fail to converge. If it’s too low, the model will learn very slowly and take a long time to reach good performance.

In simple terms —
The learning rate is like the step size the model takes while learning.
A good learning rate ensures the model moves steadily toward lower loss without jumping past the goal.

Mathematically: 

w_new = w_old − η × (∂L/∂w)
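A minimal numeric sketch of this rule, with made-up values purely for illustration:

```python
# One parameter update with illustrative numbers (not from the example below).
eta = 0.1        # learning rate
w_old = 0.5      # current weight
grad_w = -2.0    # gradient of the loss with respect to w

w_new = w_old - eta * grad_w
print(w_new)     # 0.7 -> the weight moves opposite to the gradient
```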
 
Now let’s dive into the actual model training part. 

Introduction — What Happens When a Model Trains

When we call the following code in Python:

model.fit()

we are asking the model to learn patterns that map inputs to outputs. Behind this simple command lies a mathematical cycle of prediction, error measurement, and gradual improvement.
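For instance, in Keras (one library that exposes model.fit), the setup might look roughly like the sketch below; this assumes TensorFlow is installed, and the single Dense layer, the "sgd" optimizer, and the "mse" loss are illustrative choices:

```python
import numpy as np
import tensorflow as tf

x = np.array([[1.0], [2.0], [3.0]])   # inputs, one feature per row
y = np.array([2.0, 4.0, 6.0])         # outputs

# A single Dense unit is just the linear equation y_hat = w*x + b used later in this post.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
model.compile(optimizer="sgd", loss="mse")   # optimizer + loss function, as introduced above
model.fit(x, y, epochs=10)                   # predict -> loss -> gradients -> update, repeated
```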

In essence: Model training is about minimizing mistakes — by repeatedly predicting, comparing, and correcting.
 
To truly understand what “learning” means, let’s go one level deeper with a simple example: linear regression, where a line is fit to the provided data points using gradient descent.

Step 1: Getting the Historical Data and Understanding the Business Ask

To begin any model training process, we need historical records that hold the input and output values required for training.

Let’s consider the data points below as our historical records, where x is the input and y is the output for our model training. In other words, for each observed value of x, we record the corresponding value of y.

x | y
1 | 2
2 | 4
3 | 6

Business problem: Build a model that predicts y for any given x, based on the historical data.
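Represented in Python, as a small sketch the later steps will reuse (NumPy is assumed to be available):

```python
import numpy as np

# Historical records: inputs x and the observed outputs y.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
```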

Step 2: Making Predictions (Forward Pass)

Making predictions in the world of model training is also referred to as the “forward pass”: the model takes the true inputs (from the historical record samples) and produces predictions, which are then compared with the true outputs.

Since we are using linear regression as our example, we use the simple model equation:

ŷ = w·x + b

We’ll start with initial model parameters w = 0 and b = 0. The three training data points from the historical records above, written as (x, y) pairs, are:

(1, 2), (2, 4), (3, 6)

With w = 0 and b = 0, the predictions are:

x | y (Actual) | ŷ (Predicted)
1 | 2 | 0
2 | 4 | 0
3 | 6 | 0

The model predicts nothing correctly yet — it hasn’t learned.

Let’s break down one prediction for better understanding.
Consider the pair (x, y) = (1, 2), where x is the input to the model equation and y is the expected output.
Our parameters w and b are both 0.
Substituting w, x, and b into the equation above, the result is 0, since both w and b are 0.
The same happens for the other (x, y) pairs, which is why all the predicted values are 0.
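Continuing the Python sketch from Step 1, the forward pass with the initial parameters looks like this:

```python
# Forward pass with the initial parameters.
w, b = 0.0, 0.0
y_hat = w * x + b
print(y_hat)   # [0. 0. 0.] -> every prediction is 0, as in the table above
```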

Step 3: Measuring the Error (Loss Function)

We measure how wrong the predictions are using Mean Squared Error (MSE), which is given by:

L = (1/n) Σ(yᵢ − ŷᵢ)²

Substituting the numbers:

L = (1/3)[(2−0)² + (4−0)² + (6−0)²] = 18.67
So, the loss = 18.67 — quite high.
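In the running Python sketch, the same value falls out of a single line:

```python
# Mean squared error of the current predictions.
loss = np.mean((y - y_hat) ** 2)
print(round(loss, 2))   # 18.67 -> matches the hand calculation above
```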

The model now knows how badly it is doing, but not how to improve. That’s where gradients come in.

Step 4: Learning from Mistakes (Gradient Computation)

To improve, the model must figure out how changing each parameter (w, b) affects the loss.
This is done using gradients — the partial derivatives of the loss with respect to each parameter.

∂L/∂w = −(2/n) Σ xᵢ(yᵢ − ŷᵢ)
∂L/∂b = −(2/n) Σ (yᵢ − ŷᵢ)

At our current state (w = 0, b = 0):

∂L/∂w = −(2/3) × [(1)(2) + (2)(4) + (3)(6)] = −(2/3) × 28 ≈ −18.67
∂L/∂b = −(2/3) × (2 + 4 + 6) = −8

This tells the model to increase w and b to reduce the loss.
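In the running sketch, the same gradients can be computed directly from the formulas above:

```python
# Gradients of the MSE loss with respect to w and b.
n = len(x)
grad_w = -(2 / n) * np.sum(x * (y - y_hat))
grad_b = -(2 / n) * np.sum(y - y_hat)
print(round(grad_w, 2), round(grad_b, 2))   # -18.67 -8.0
```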

Step 5: Updating the Model (Optimization Step)

Now comes the optimization step — where we update parameters in the opposite direction of the gradient, scaled by the learning rate (η).

Let’s take η = 0.1.

We update parameters using the learning rate (η):

w_new = w − η(∂L/∂w)
b_new = b − η(∂L/∂b)
 
Plugging in the values:
w = 0 − 0.1(−18.67) = 1.867
b = 0 − 0.1(−8) = 0.8

After iteration 1: w = 1.867, b = 0.8
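The corresponding optimization step in the running sketch:

```python
# One optimization step: move opposite to the gradient, scaled by the learning rate.
eta = 0.1
w = w - eta * grad_w   # 0 - 0.1 * (-18.67) = 1.867
b = b - eta * grad_b   # 0 - 0.1 * (-8.0)   = 0.8
print(round(w, 3), round(b, 3))   # 1.867 0.8
```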

Training doesn’t stop after one update.
We repeat the process (forward pass → loss → gradient → update) for several epochs, each time bringing the model closer to the true pattern.

Let’s perform one more iteration to see the progression.

Iteration 2

Forward Pass
ŷ = 1.867x + 0.8
 
x | y | ŷ (Predicted)
1 | 2 | 2.667
2 | 4 | 4.534
3 | 6 | 6.401

Loss after iteration 2:

L = (1/3)[(2−2.667)² + (4−4.534)² + (6−6.401)²] ≈ 0.30

Loss dropped from 18.67 → 0.30 in just one iteration!

Compute Gradients
∂L/∂w = −(2/3) × [(1)(2−2.667) + (2)(4−4.534) + (3)(6−6.401)] = −(2/3) × [−0.667 − 1.068 − 1.203] ≈ 1.96
∂L/∂b = −(2/3) × [(2−2.667) + (4−4.534) + (6−6.401)] = −(2/3) × [−1.602] ≈ 1.07
Update Parameters
w = 1.867 − 0.1 × 1.96 ≈ 1.671
b = 0.8 − 0.1 × 1.07 ≈ 0.693
 
After iteration 2:
w = 1.671, b = 0.693
 
Loss has dropped sharply — the model is learning!
 
Loss Summary Table:

Iteration | w | b | Loss
1 | 1.867 | 0.800 | 18.67
2 | 1.671 | 0.693 | 0.30

If we finish the training process after two iterations, the model for the given data would be represented by the equation: ŷ = 1.671x + 0.693

Here, the values 1.671 and 0.693 are the learned parameters (weight and bias) that the model has adjusted during training to best fit the data.

The animation attached shows how the regression line gradually adjusts during training for 10 iterations. With each iteration, the model updates its weight (w) and bias (b) to better fit the data points — moving closer to the true relationship between x and y.
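For reference, the sketch below runs the same loop for 10 iterations; it mirrors the hand calculations above, with small differences possible due to rounding:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

w, b, eta = 0.0, 0.0, 0.1
for i in range(1, 11):
    y_hat = w * x + b                                   # forward pass
    loss = np.mean((y - y_hat) ** 2)                    # loss
    grad_w = -(2 / len(x)) * np.sum(x * (y - y_hat))    # gradients
    grad_b = -(2 / len(x)) * np.sum(y - y_hat)
    w -= eta * grad_w                                   # optimization step
    b -= eta * grad_b
    print(f"iteration {i}: w={w:.3f}, b={b:.3f}, loss={loss:.2f}")
```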

Additional Note:
For ease of explanation, we considered an example with just one input variable (x).
However, in real-world scenarios, models usually work with multiple input features, represented as x₁, x₂, x₃, …, where each represents a different attribute or factor influencing the prediction.

Intuitive Summary

Model training is a guided trial-and-error mechanism. In each iteration:
- The model guesses (forward pass).
- It checks how wrong it was (loss).
- It learns from the error (gradient).
- It updates itself slightly (optimization).
- It repeats until mistakes are minimal.

And that’s how a simple mathematical routine turns into a “learning” machine, or what we proudly call today a “Machine Learning Model”.

 

