Course:CPSC522/Ensemble Learning

Ensemble Learning

Ensemble methods are learning algorithms that employ multiple learning classifiers for better predictive performance than what could be obtained by a single classifier.

Principal Author: Vanessa Putnam
Collaborators: Surbhi Palande

Abstract

In ensemble learning, an ensemble of classifiers are combined to make a prediction on the same problem. Unlike traditional machine learning approaches that use a single classifier, ensemble learning utilizes predictions made by multiple classifiers on the same data. This allows an ensemble learner to come to a more robust prediction by taking the perspective of multiple classifiers to make a final prediction.

An ensemble learner is composed of multiple base learners. Although an ensemble of learners can have a high predictive performance, a single base learner has weak performance in comparison to it's ensemble output. For this reason, base learners are also known as weak learners. Due to this fact, an ensemble classifier truly displays that there is power in numbers. Despite a base learners seemingly weak performance when standing alone, an ensemble of base learners result in strong predictive performance when applying any one of the representative ensemble methods described below.

Builds on

Ensemble learning builds on traditional Machine Learning algorithms, for solving Supervised and Unsupervised tasks.

Related Pages

See Decision Trees for and example of ensemble base learners, and the Random forest algorithm to understand an ensemble classifier composed of decision trees.

Content

Accuracy and Diversity

In order for an ensemble method to perform well, base learners must be designed with accuracy and diversity in mind. To understand this reasoning, consider three classifiers: $A$ , $B$ , and $C$ . When we combine these three classifiers to create an ensemble, we obviously aim to make our predictions as accurate as possible. However, it is equally important to make sure these classifiers are also diverse from one another. Consider the case in which all classifiers are designed identically using the same algorithm, inputs, etc. In this example, all classifiers $A$ , $B$ , and $C$ will output the same prediction. This does not help our model gain a stronger prediction since each classifier is using the same paradigm. As our goal is to make a better prediction than a single classifier, this example using a parallel set of base learners would not improve our model’s performance. Therefore, we want to design our base learners to be diverse from one another so each of their predictions can be combined for a better predictive performance.

Representative methods

Figure 1: Bagging method

From what we have described so far, it is clear that an ensemble method utilises multiple classifiers. Yet, we have still not discussed how the final decision is made when outputs from these classifiers are brought together. To understand this point, we will discuss a few different representations of ensemble methods that will help us determine by what means a final prediction is made.

Bagging

Bagging, also known as bootstrap aggregation, involves having each model in the ensemble make a prediction with equal weight. Once each model has made a prediction, a vote is taken among the classifiers to come to a consensus. See figure 1.

In order to account for model diversity, bagging trains each classifier in the ensemble using a randomly drawn subset of the training set. These subsets are also known as bootstrap samples. Bootstrap samples allow each of the base learners in the ensemble to obtain a different output. Intuitively this makes since, since with this method each learner has access to a different portion of the data.

A classic example of bagging is also known as Random Forests. In this algorithm, the base learners are known as random decision trees, and when combined (as described) achieve a very high prediction accuracy. The following pseudo code provides a walkthrough of the Random Forest algorithm.

Random Forest Algorithm

There are two important takeaways from this pseudo code. The first is fundamental to ensembles, in that each base learner is created from a subset of features. This ensures every tree is made up of different splitting rules and therefore promotes diversity of the overall model. Second and specific to bagging, the algorithm obtains a bootstrap sample for each classifier to be tested on. This is analogous to a bag, where each classifier is its own bag and each sample of data is bagged inside the classifier.

Consider the following Precondition: A training set S := (x1, y1), . . . ,(xn, yn), features F, and number of trees in forest B.

1 function RandomForest(S , F)
2        H ← 0
3        for i ∈ 1, . . . , B do
4                S(i) ← A bootstrap sample from S // obtain a bootstrap sample from the training data
5                hi ← RandomizedTreeLearn(S(i), F) // feed the sample into associated random tree (base learner)  
6                H ← H \in {hi} // add tree hi to forest H 
7        end for
8        return H
9 end function
10
11 function RandomizedTreeLearn(S , F)
12        At each node:
13                f ← very small subset of F  // only use subset of features for model diversity 
14                Split on best feature in f
15        return The learned tree
16 end function

Is is important to note that this example only implements the creation of the random forest and does not demonstrate the final ensemble vote among all classifiers. For a more detailed explanation of this and the algorithm in general, see Random forest.

Boosting

At its roots, boosting has a similar set-up to bagging. The main distinction between these two methods, is boosting attempts to correct misclassified points in earlier models. This is done by making sure misclassified training examples are more likely to be selected as inputs to the next model in order for their classifications to be taken into consideration again. Therefore, boosting gives the learner multiple chances to predict correctly on a misclassified point.

Although this method can perform well on training data, it is also very common for boosting to overfit. This is something to take note of when designing a model that utilizes boosting rather than bagging.

The most common implementation of boosting is known as AdaBoost. This algorithm attempts to improve initial learners, by focusing on training examples where the learner is not performing well.

AdaBoost

Precondtion: A set of N labeled examples (x1, y1),…,(xN,yN), a learning algorithm L, the number of base learners in the ensemble B, a vector w of N example weights (initially 1 ⁄ N), a vector h of B base learners, a vector z of B base learner weights

1 function AdaBoost(examples, L, B):                            
2         for b = 1 to B do
3                  h[b] ← L(examples, w)    //Train base learner with weighted sample. 
4                  error ← 0
5                  for j = 1 to N do        //Test base learner on all data. 
6                           if h[b](xj) ≠ yj then error ← error + w[j]
7                  for j = 1 to N do                                        
8                           if h[b](xj) = yj then w[j] ← w[j] · error ⁄ (1 − error)
9                  w ← normalize(w)                                
10                z[b] ← log(1 − error) ⁄ error     //Set learner weight with weighted error 
11  return weighted-majority(h, z)                  //Set example weights based on ensemble predictions

Stacking

Figure 2: Stacking Method.

A stacked ensemble makes use of multiple layers of base learners to come to a final prediction. First, a number of first-level learners are generated from the training data set. Once these outputs are produced a following level learner takes each of the previous outputs as inputs and combines them to come to a final prediction. See figure 2.

We call this second-level learner a meta-learner. A meta-learner makes a final decision by training on the outputs of the base-level classifiers. Therefore the meta-level dataset consists of examples where the features each of example are the predictions of the base-level classifiers, and the class is the correct class of the example in hand. For a more step by step understanding of this process, the next section will walk through an example of stacking in the Super Learner algorithm.

Super Learner Algorithm

1. Set up the ensemble.
a. Specify a list of L base algorithms (with a specific set of model parameters).
b. Specify a metalearning algorithm.

2. Train the ensemble.
a. Train each of the L base algorithms on the training set.
b. Perform k-fold cross-validation on each of these learners and collect the cross-validated predicted values from each of the L algorithms.
c. The N cross-validated predicted values from each of the L algorithms can be combined to form a new N x L matrix. This matrix, along with the original response vector, is called the “level-one” data. (N = number of rows (examples) in the training set.)
d. Train the metalearning algorithm on the level-one data. The “ensemble model” consists of the L base learning models and the metalearning model, which can then be used to generate predictions on a test set.

3. Predict on new data.
a. To generate ensemble predictions, first generate predictions from the base learners.
b. Feed those predictions into the metalearner to generate the ensemble prediction.

Applications

In practice, it is possible to employ ensemble learning methods almost anywhere machine learning is used. Some of the more traditional applications for ensembling are used in optical character recognition, text categorization, and face recognition. Additionally, classifiers have also been built for boosting Human Activity Recognition, as well as, gene expression analysis in Bioinformatics.

Implementations

The following statistics packages have ensemble methods for everyday use:

Python : Scikit-learn.
MATLAB : Statistics and Machine Learning Toolbox.

Annotated Bibliography

Dietterich, Thomas G. Ensemble Methods in Machine Learning. [1]
David Poole & Alan Mackworth, "Artificial Intelligence: Foundations of Computational Agents", Chapter 7.6.2
Zhou, Zhi-Hua. Ensemble Learning. [2]
"Ensemble learning." Wikipedia: The Free Encyclopedia. Wikimedia Foundation, Inc. 19 Dec. 2015. Web. 1 Feb. 2018.

To Add

Put links and content here to be added. This does not need to be organized, and will not be graded as part of the page. If you find something that might be useful for a page, feel free to put it here.

Permission is granted to copy, distribute and/or modify this document according to the terms in Creative Commons License, Attribution-NonCommercial-ShareAlike 3.0. The full text of this license may be found here: CC by-nc-sa 3.0