3. Ensemble Methods
"A bunch of silly trees learning to correct each other's mistakes"
Now used for:
Everything that fits the classical algorithm (but works better)
Search engines (★)
Computer vision
Object detection
Popular algorithms: Random Forest, Gradient Boosting
It's time for mature, modern methods. Ensembles and neural networks are our two main fighters paving the path towards the singularity. Today they produce the most accurate results and are widely used in production.
Despite all their effectiveness, the idea behind them is very simple. If you take a bunch of inefficient algorithms and force them to correct each other's mistakes, the overall quality of the system will be higher than that of even the best individual algorithm.
You'll get even better results if you take the most unstable algorithms, ones whose predictions change completely with small noise in the input data, like regression and decision trees. These algorithms are so sensitive that even a single missing bit in the input data can drive the model crazy.
In fact, this is exactly what we need.
There are three battle-tested methods for creating ensembles: stacking, bagging, and boosting.
Stacking. The output of several different models, run in parallel, is passed as input to the last one, which makes the final decision.
The emphasis here is on the word "different". Mixing similar algorithms on the same data makes no sense. The choice of algorithms is entirely up to you. However, for the final decision-making model, regression is usually a good choice.
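As a concrete illustration, here's a minimal stacking sketch with scikit-learn: three different base models feed their outputs into a logistic regression that makes the final call. The toy dataset and the particular base models are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Toy dataset, just for illustration.
X, y = make_classification(n_samples=500, random_state=0)

# Several different models working in parallel...
base_models = [
    ("tree", DecisionTreeClassifier(max_depth=5)),
    ("knn", KNeighborsClassifier()),
    ("svm", SVC()),
]

# ...and a regression model on top that turns their outputs into the final decision.
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression())
stack.fit(X, y)
print(stack.score(X, y))
```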
Bagging. We train the same algorithm on different random subsets of the original data, in which data points may be repeated. For example, from a set like "1-2-3" we can get subsets like "2-2-3", "1-2-2", "3-1-2", etc. We train the same algorithm on each of these new datasets and then get the final answer through simple majority voting, as in the sketch below.
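Here's roughly what that looks like in code - a hand-rolled bagging sketch (bootstrap sampling plus majority voting) over decision trees. The dataset and the number of trees are made-up toy values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy dataset, just for illustration.
X, y = make_classification(n_samples=500, random_state=0)

rng = np.random.RandomState(0)
trees = []
for _ in range(25):
    # Bootstrap: sample rows with replacement, so some repeat and some are left out
    # (the "2-2-3", "1-2-2" subsets from the text).
    idx = rng.randint(0, len(X), size=len(X))
    tree = DecisionTreeClassifier()
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Simple majority voting across all trees.
votes = np.stack([tree.predict(X) for tree in trees]).astype(int)  # (n_trees, n_samples)
majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```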
The most famous example of bagging is the Random Forest algorithm, which is simply bagging applied to decision trees (illustrated above). When you open your phone's camera app and see boxes around people's faces - it's probably the result of Random Forest at work. Neural networks would be too slow to run in real time, while bagging is ideal here because it can compute its trees in parallel on all the cores of a video card or on the new ML processors.
In some tasks, the ability to run Random Forest in parallel outweighs the small loss in accuracy compared to boosting. Especially in real-time processing. There's always a trade-off.
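In practice you rarely hand-roll this: something like scikit-learn's RandomForestClassifier with n_jobs=-1 spreads its trees across all available CPU cores, which is exactly the parallelism being described. A minimal sketch on assumed toy data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy dataset, just for illustration.
X, y = make_classification(n_samples=1000, random_state=0)

# n_jobs=-1 builds (and later queries) the trees on every available CPU core.
forest = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
forest.fit(X, y)
print(forest.predict(X[:5]))
```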
Boosting. Algorithms are trained sequentially, one by one. Each subsequent one pays the most attention to the data points that were misclassified by the previous one.
As in bagging, we use subsets of our data, but this time they are not generated randomly. Now, in each iteration, we take the portion of the data that the previous algorithm failed to handle. Thus, the new algorithm learns to fix the errors of the previous one.
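Here's a stripped-down sketch of the gradient-boosting flavour of this idea: each small tree is fit to the residuals - the mistakes the ensemble has made so far - on a made-up toy regression problem. The depth, learning rate, and number of rounds are arbitrary.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression problem: a noisy sine wave, just for illustration.
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 6, 200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=200)

trees, learning_rate = [], 0.1
prediction = np.zeros_like(y)
for _ in range(100):
    residuals = y - prediction                 # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2)  # a deliberately weak learner
    tree.fit(X, residuals)                     # the new tree focuses on those mistakes
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

def ensemble_predict(X_new):
    # The final answer is the sum of all the small corrections.
    return sum(learning_rate * tree.predict(X_new) for tree in trees)
```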
The main advantage here - a very high, even illegal in some countries, classification accuracy that all the cool kids can envy. The downside was already highlighted - it doesn't parallelize. But it's still faster than neural networks. It's like a race between a dump truck and a race car: the truck can do more work, but if you want to go fast, take the car.
If you want a real-life example of boosting - open Facebook or Google and start typing a search query. Can you hear the army of trees roaring and yelling at each other as they sort the results by relevance? That's because they use boosting.