In this lecture we introduce the concept of a predictive model, which lies at the heart of machine learning (ML).
To begin with, we observe some outputs $y_t$ and the corresponding input vectors $x_t$ that may help to predict the outputs before they are observed.
Examples:
$y_t$ is the total amount of purchases made by a customer while visiting an online shop; $x_t$ are some characteristics of the landing page that was first seen by the customer;
$y_t$ is inflation observed in month $t$ and $x_t$ is a vector of macro-economic variables known before $t$;
$y_t$ is 1 if firm $t$ defaults within a year and 0 otherwise; $x_t$ is a vector of firm $t$'s characteristics that may help to predict the default;
$y_t$ is a measure of economic activity in province $t$; $x_t$ is a vector of pixel values from a satellite image of the province.
Note: the subscript $t$ used to index the observations is not necessarily time.
We use the observed inputs and outputs to build a predictive model, that is, a function $f$ that takes new inputs $x_t$ as arguments and returns predictions $\widehat{y}_t = f(x_t)$ of previously unseen outputs $y_t$.
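To make the idea concrete, here is a minimal sketch in Python (not part of the original lecture): the data, the choice of a linear model and the use of NumPy are illustrative assumptions; the point is only that the trained model is a function that maps new inputs to predictions.

```python
import numpy as np

# Illustrative data: 100 observed input vectors (3 features each) and outputs.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Build the predictive model from the observed inputs and outputs
# (here by ordinary least squares; any other method could be used).
beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)

def f(x_new):
    """Return the prediction y_hat for a previously unseen input vector."""
    return x_new @ beta

y_hat = f(np.array([0.2, -1.0, 0.7]))  # prediction of a previously unseen output
```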
Before diving into predictive modelling, let us learn some machine learning jargon.
The problem of learning an input-output mapping is called a supervised learning problem.
The data used for learning is called labelled data and the outputs are called labels or targets.
Basically, in a supervised learning problem, the task is to learn the conditional distribution of the outputs given the inputs.
In contrast, in an unsupervised learning problem, there are no labels and the task is to learn something about the unconditional distribution of the inputs $x_t$.
The typical example is a collection of photos of cats and dogs: $x_t$ is a vector of pixel values; in supervised learning, you have labels $y_t$ (1 if dog, 0 if cat); in unsupervised learning, you have no labels, but you typically do something like clustering in the hope that the algorithm autonomously separates cats from dogs.
A supervised learning problem is called:
a classification problem if the output variable is discrete / categorical (e.g., cat vs dog);
a regression problem if the output variable is continuous (e.g., income earned).
The inputs are often called features and the vector $x_t$ is called a feature vector.
The act of using data to find the best predictive model (e.g., by optimizing the parameters of a parametric model) is called model training.
How do we assess the quality of a predictive model?
How do we compare predicted outputs $\widehat{y}_t$ with observed outputs $y_t$?
We answer these questions by specifying a loss function, an ingredient that is always required in a machine learning problem.
A loss function quantifies the losses that we incur when we make inaccurate predictions.
Examples:
Squared Error (SE): $L\left(y_t,\widehat{y}_t\right) = \left(y_t - \widehat{y}_t\right)^2$;
Absolute Error (AE): $L\left(y_t,\widehat{y}_t\right) = \left\vert y_t - \widehat{y}_t\right\vert$;
Log-loss (or cross-entropy): $L\left(y_t,\widehat{y}_t\right) = -y_t\ln\left(\widehat{y}_t\right) - \left(1-y_t\right)\ln\left(1-\widehat{y}_t\right)$ when $y_t$ is binary (i.e., it can take only two values, either 0 or 1) and $\widehat{y}_t$ is the predicted probability that $y_t=1$; the multivariate generalization is $L\left(y_t,\widehat{y}_t\right) = -\sum_{k=1}^{K} y_{t,k}\ln\left(\widehat{y}_{t,k}\right)$ when $y_t$ is a multinoulli vector (i.e., we have a categorical variable that can take only $K$ values; when it takes the $k$-th, then $y_{t,k}=1$ and all the other entries of the vector are zero).
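The losses listed above can be written as one-line functions. Below is a sketch in Python (the function and argument names are ours, not notation from the lecture); `y_hat` denotes either a point prediction or a predicted probability, depending on the loss.

```python
import numpy as np

def squared_error(y, y_hat):
    # SE: penalizes large errors much more heavily than small ones
    return (y - y_hat) ** 2

def absolute_error(y, y_hat):
    # AE: penalizes errors in proportion to their magnitude
    return np.abs(y - y_hat)

def log_loss(y, y_hat):
    # Binary case: y is 0 or 1 and y_hat is the predicted probability that y = 1
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cross_entropy(y, y_hat):
    # Multinoulli case: y is a one-hot vector and y_hat is a vector of
    # predicted probabilities summing to 1
    return -np.sum(y * np.log(y_hat))
```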
Ideally, the best predictive model is the one having the smallest statistical risk (or expected loss) $R(f) = \operatorname{E}\left[L\left(y_t, f(x_t)\right)\right]$, where the expected value is with respect to the joint distribution of $x_t$ and $y_t$.
Since the true joint distribution of $x_t$ and $y_t$ is usually unknown, the risk is approximated by the empirical risk $\widehat{R}(f) = \frac{1}{\left\vert S\right\vert}\sum_{(x_t,y_t)\in S} L\left(y_t, f(x_t)\right)$, where $S$ is a set of input-output pairs used for calculating the empirical risk and $\left\vert S\right\vert$ is its cardinality (the number of input-output pairs contained in $S$).
Thus, the empirical risk is the sample average of the losses over a set of observed data $S$. This is the reason why we sometimes call it average loss.
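In code, the empirical risk is just this sample average. A minimal sketch, assuming the model is a function that maps an input to a prediction and $S$ is a list of input-output pairs (the names `model`, `S` and `loss` are illustrative):

```python
def empirical_risk(model, S, loss):
    """Average loss of `model` over S, a list of (x, y) input-output pairs."""
    losses = [loss(y, model(x)) for x, y in S]
    return sum(losses) / len(losses)
```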
How to choose $S$ is one of the most important decisions in machine learning and we will discuss it at length.
For specific choices of the loss function, empirical risk has names that are well-known to statisticians:
if the loss is the Squared Error, then the empirical risk is the Mean Squared Error (MSE), and its square root is the Root Mean Squared Error (RMSE);
if the loss is the Absolute Error, then the empirical risk is the Mean Absolute Error (MAE);
if the loss is the Cross-Entropy, it can easily be proved that the empirical risk is equal to the negative average log-likelihood.
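In the binary case, the equivalence stated in the last item can be verified directly, using the notation introduced above:

$$\frac{1}{\left\vert S\right\vert}\sum_{(x_t,y_t)\in S} L\left(y_t,\widehat{y}_t\right) = -\frac{1}{\left\vert S\right\vert}\sum_{(x_t,y_t)\in S}\left[y_t\ln\left(\widehat{y}_t\right)+\left(1-y_t\right)\ln\left(1-\widehat{y}_t\right)\right] = -\frac{1}{\left\vert S\right\vert}\sum_{(x_t,y_t)\in S}\ln\left[\widehat{y}_t^{\,y_t}\left(1-\widehat{y}_t\right)^{1-y_t}\right]$$

where the argument of the logarithm in the last expression is the likelihood of $y_t$ under a Bernoulli distribution whose parameter is the predicted probability $\widehat{y}_t$; hence the empirical risk is minus the average log-likelihood.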
The criterion generally followed in machine learning is that of empirical risk minimization:
if we are setting the parameters of a model, we choose the parameters that minimize the empirical risk;
if we are choosing the best model in a set of models, we pick the one that has the lowest empirical risk.
Statistically speaking, it is a sound criterion because empirical risk minimizers are extremum estimators.
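As an illustration of the first case (a sketch under our own assumptions, not code from the lecture), the parameters of a linear model can be set by numerically minimizing the empirical risk computed with the squared error loss:

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative data for a linear model.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

def empirical_risk(beta):
    # Empirical risk with the squared error loss, that is, the MSE on the data.
    return np.mean((y - X @ beta) ** 2)

# Empirical risk minimization: choose the parameters with the lowest MSE.
result = minimize(empirical_risk, x0=np.zeros(3))
beta_hat = result.x
```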