Chapter 1.3-1.4: Model Selection & the Curse of Dimensionality
Christopher M. Bishop, PRML, Chapter 1 Introduction
1. Model Selection
In our example of polynomial curve fitting using least squares:
1.1 Parameters and Model Complexity
- Order of the polynomial: there was an optimal order of polynomial that gave the best generalization. The order of the polynomial controls the number of free parameters in the model and thereby governs the model complexity.
- Regularization coefficient: with regularized least squares, the regularization coefficient λ also controls the effective complexity of the model (see the sketch after this list).
- More complex models: for models such as mixture distributions or neural networks, there may be multiple parameters governing complexity.
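As a quick illustration of how both knobs interact, here is a minimal sketch (not taken from the book's text) of regularized polynomial least squares on synthetic sin(2πx) data; the helper names, data sizes, and noise level are illustrative assumptions.

```python
import numpy as np

def design_matrix(x, M):
    """Polynomial design matrix with columns x^0, x^1, ..., x^M."""
    return np.vander(x, M + 1, increasing=True)

def fit_regularized(x, t, M, lam):
    """Regularized least squares: w = (lam*I + Phi^T Phi)^{-1} Phi^T t."""
    Phi = design_matrix(x, M)
    A = lam * np.eye(M + 1) + Phi.T @ Phi
    return np.linalg.solve(A, Phi.T @ t)

def rms_error(x, t, w):
    """Root-mean-square error of the polynomial with weights w on (x, t)."""
    Phi = design_matrix(x, len(w) - 1)
    return np.sqrt(np.mean((Phi @ w - t) ** 2))

# Synthetic data in the spirit of the chapter's curve-fitting example:
# targets are sin(2*pi*x) plus Gaussian noise.
rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, 1.0, 10)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.3, x_train.size)

# Both the order M and the coefficient lam control effective complexity.
for M, lam in [(3, 0.0), (9, 0.0), (9, np.exp(-18))]:
    w = fit_regularized(x_train, t_train, M, lam)
    print(M, lam, rms_error(x_train, t_train, w))
```

With λ = 0 the training error shrinks as M grows, while λ = e⁻¹⁸ (the value ln λ = −18 used in the book's example) trades a little training error for smaller, better-behaved weights.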
1.2 Why Model Selection?
In determining the values of such parameters, the principal objective is usually to achieve the best predictive performance on new data. Model selection therefore involves
- finding the appropriate values for complexity parameters within a given model, and
- considering a range of different types of model in order to find the best one for our particular application.
1.3 Approach 1 - Training, Validation, and Test Sets
If data is plentiful, then one approach is simply to use some of the available data
- to train 1) a range of models, or 2) a given model with a range of values for its complexity parameters,
- and then to compare them on independent data (called a validation set), and select the one having the best predictive performance.
- If the model design is iterated many times using a limited-size data set, then some over-fitting to the validation data can occur, and so it may be necessary to keep aside a third test set on which the performance of the selected model is finally evaluated (see the sketch below).
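The workflow above can be sketched as follows (a rough illustration only; the synthetic data, the split sizes, and the use of np.polyfit as the trainer are assumptions, not the book's code): models of different order are fit on the training set, the order is chosen on the validation set, and the test set is touched exactly once at the end.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    """Noisy samples of sin(2*pi*x), as in the chapter's running example."""
    x = rng.uniform(0.0, 1.0, n)
    t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, n)
    return x, t

def rms(x, t, coeffs):
    """Root-mean-square error of a fitted polynomial on (x, t)."""
    return np.sqrt(np.mean((np.polyval(coeffs, x) - t) ** 2))

# Three disjoint sets: train for fitting, validation for choosing the order M,
# test for a final, untouched estimate of generalization performance.
x_tr, t_tr = make_data(30)
x_val, t_val = make_data(15)
x_te, t_te = make_data(15)

candidates = range(0, 10)                                  # candidate polynomial orders
fits = {M: np.polyfit(x_tr, t_tr, M) for M in candidates}  # one model per order
best_M = min(candidates, key=lambda M: rms(x_val, t_val, fits[M]))

print("selected order:", best_M)
print("final test RMS:", rms(x_te, t_te, fits[best_M]))
```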
1.4 Approach 2 - Cross-Validation
Problem: What if the supply of data for training and testing is limited?
- in order to build good models, we wish to use as much of the available data as possible for training.
- However, if the validation set is small, it will give a relatively noisy estimate of predictive performance.
- One solution to this dilemma is to use cross-validation, which is illustrated in Figure 1.18.
Cross-validation partitions the data into S groups and allows a proportion (S - 1)/S of the available data to be used for training while making use of all of the data to assess performance. When data is particularly scarce, it may be appropriate to consider the case S = N, where N is the total number of data points, which gives the leave-one-out technique.
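A minimal sketch of S-fold cross-validation under the same kinds of assumptions (synthetic data, np.polyfit as the trainer): each of the S groups is held out once, and setting S equal to the number of data points gives leave-one-out.

```python
import numpy as np

def s_fold_cv_rms(x, t, M, S):
    """Average held-out RMS over S folds; S == len(x) gives leave-one-out."""
    idx = np.arange(len(x))
    folds = np.array_split(idx, S)
    errs = []
    for held_out in folds:
        train = np.setdiff1d(idx, held_out)          # train on the other S-1 groups
        coeffs = np.polyfit(x[train], t[train], M)
        pred = np.polyval(coeffs, x[held_out])
        errs.append(np.sqrt(np.mean((pred - t[held_out]) ** 2)))
    return np.mean(errs)

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, 20)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, 20)

# Every point is used for assessment exactly once; each run trains on (S-1)/S of the data.
for M in range(0, 8):
    print(M, s_fold_cv_rms(x, t, M, S=4), s_fold_cv_rms(x, t, M, S=len(x)))
```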
1.5 Drawbacks of Cross-Validation
- One major drawback of cross-validation is that the number of training runs that must be performed is increased by a factor of S (the number of folds), and this can prove problematic for models in which the training is itself computationally expensive.
- Exploring combinations of settings for multiple complexity parameters for a single model could, in the worst case, require a number of training runs that is exponential in the number of parameters (the calculation after this list makes this concrete).
- Clearly, we need a better approach. Ideally, this should rely only on the training data and should allow multiple hyperparameters and model types to be compared in a single training run.
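To see why this cost matters, here is a back-of-the-envelope count of training runs (S, G, and P are illustrative values, not figures from the text): an S-fold scheme with G candidate values for each of P complexity parameters needs S·G^P runs in an exhaustive search.

```python
# Hypothetical exhaustive search: G candidate values for each of P complexity
# parameters, every combination re-trained under S-fold cross-validation.
def total_training_runs(S, G, P):
    return S * G ** P  # grows exponentially in the number of parameters P

for P in range(1, 5):
    print(P, total_training_runs(S=10, G=5, P=P))  # 50, 250, 1250, 6250
```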
2. The Curse of Dimensionality
2.1 A Simplistic Classification Approach
One very simple approach would be to divide the input space into regular cells, as indicated in Figure 1.20, and to predict the class of a new test point as the class having the largest number of training points in the same cell (a sketch follows below).
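A minimal sketch of such a cell-based classifier, assuming inputs scaled to the unit square; the helper names and the two-class toy data are illustrative, not the data set used in the figure.

```python
import numpy as np
from collections import Counter, defaultdict

def cell_index(x, bins):
    """Index of the regular cell containing x (inputs assumed to lie in the unit cube)."""
    return tuple(np.minimum((x * bins).astype(int), bins - 1))

def fit_cells(X, y, bins):
    """Record the class labels of the training points falling in each cell."""
    table = defaultdict(list)
    for x, label in zip(X, y):
        table[cell_index(x, bins)].append(label)
    return table

def predict(table, x, bins):
    """Majority vote within the cell; None if the cell holds no training data."""
    labels = table.get(cell_index(x, bins))
    return Counter(labels).most_common(1)[0][0] if labels else None

# Illustrative 2-D data on the unit square with two classes split by a diagonal.
rng = np.random.default_rng(3)
X = rng.uniform(0.0, 1.0, (200, 2))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

table = fit_cells(X, y, bins=5)
print(predict(table, np.array([0.9, 0.8]), bins=5))  # likely class 1
print(predict(table, np.array([0.1, 0.2]), bins=5))  # likely class 0
```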
2.2 Problem with This Naive Approach
The origin of the problem is illustrated in Figure 1.21, which shows that, if we divide a region of a space into regular cells, then the number of such cells grows exponentially with the dimensionality of the space. The problem with an exponentially large number of cells is that we would need an exponentially large quantity of training data in order to ensure that the cells are not empty.
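A quick computation makes the growth rate concrete (taking M = 10 divisions per dimension as an arbitrary illustrative choice):

```python
# With M divisions per dimension, a regular grid over D dimensions has M**D cells,
# so we would need at least that many training points just to avoid empty cells.
M = 10
for D in (1, 2, 3, 10, 20):
    print(D, M ** D)  # 10, 100, 1000, 10**10, 10**20
```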
- The Curse of Dimensionality: The severe difficulty that can arise in spaces of many dimensions is sometimes called the curse of dimensionality.
- The reader should be warned that not all intuitions developed in spaces of low dimensionality will generalize to high-dimensional spaces.