Machine Learning - Part One

Why Do We Want Machines to Learn?

This is Billy. Billy wants to buy a car. He's trying to calculate how much money he needs to save each month for it. He went through dozens of ads on the internet and learned that new cars cost around $20,000, one-year-old cars around $19,000, two-year-old cars around $18,000, and so on.

Billy starts to see a pattern: the price of a car depends on its age and drops by $1,000 every year, but doesn't go below $10,000.
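Billy's rule of thumb is simple enough to write down as a tiny function. This is only a sketch of the pattern described above; the name `estimate_price` and its parameter names are made up for illustration.

```python
def estimate_price(age_years, new_price=20_000, yearly_drop=1_000, floor=10_000):
    """Billy's rule: lose a fixed amount per year, but never go below the floor."""
    return max(new_price - yearly_drop * age_years, floor)


# Try a few ages, including ones old enough to hit the $10,000 floor.
for age in (0, 1, 2, 10, 15):
    print(age, estimate_price(age))
```

With these numbers, a 10-year-old car and a 15-year-old car both come out at $10,000, because the floor kicks in.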

In terms of machine learning, Billy invented regression – he predicted a value (price) based on known historical data.
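To make the idea of regression a bit more concrete, here is a minimal sketch of how a machine could recover Billy's rule from the ads themselves, using ordinary least squares for a straight line. The data points simply extend Billy's pattern by a few years and are illustrative, and `fit_line` is a hypothetical helper written for this example, not a library function.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Illustrative data following Billy's pattern: age in years, price in dollars.
ages = [0, 1, 2, 3, 4, 5]
prices = [20_000, 19_000, 18_000, 17_000, 16_000, 15_000]

slope, intercept = fit_line(ages, prices)
predicted = slope * 6 + intercept  # estimate for a 6-year-old car
print(round(slope), round(intercept), round(predicted))
```

On this clean data the fit finds a drop of $1,000 per year starting from $20,000. Note that a straight line cannot express the $10,000 floor, so even this toy case already needs something richer than a single line.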

Yes, it would be nice to have a simple formula for every problem in the world.

The problem is that cars have different production dates, dozens of characteristics, technical conditions, and so on.

People are dumb and lazy: we need robots to do the math for us. So let's bring in some computing power. Let's give the machine data and ask it to find all the hidden patterns related to the price.

The most interesting thing is that the machine handles this much better than a real person does when carefully working through all the dependencies in their head.

This was the birth of machine learning.

Three Components of Machine Learning

In AI, the sole purpose of machine learning is to predict outcomes based on the received data.

The more variety in your samples, the easier it is to find the relevant patterns and predict the outcome. So we need three components to train the machine:

Data

Want to detect spam? Get samples of spam messages. Want to predict stocks? Find price history. Want to find out user preferences? Parse their activity on Facebook. The more diverse the data, the better the outcome.

There are two main ways to obtain data: manual and automatic. Manually collected data contains far fewer errors but takes more time to gather, which generally makes it more expensive.

The automatic approach is cheaper – you collect everything you can find and hope for the best.

Some smart companies like Google use their own customers to label data for them for free. Remember ReCaptcha, which forces you to "select all street signs"? That's exactly what they're doing. Free labor! Nice. In their place, I'd start showing captchas more and more.

Gathering a good collection of data (usually called a dataset) is very difficult. Datasets are so important that companies might even reveal their algorithms, but rarely their datasets.

Features

Also known as parameters or variables. These can be car mileage, user gender, stock price, or word frequency in a text. In other words, these are the factors the machine should look at.

Features are easy to identify when data is stored in tables: they are the column names. But what if you have 100 GB of cat pictures? We can't treat every pixel as a feature. That's why choosing the right features usually takes longer than any other part of ML. It is also the main source of errors: people are subjective and tend to pick only the features they like or consider more important.
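As a rough illustration of what a feature looks like in practice, the sketch below turns a raw message into word-frequency features, one of the examples mentioned above. The sample message and the `word_frequencies` helper are invented for this example; real pipelines usually do more cleanup than this.

```python
from collections import Counter
import re


def word_frequencies(text):
    """Count how often each lowercase word appears in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


# A made-up message that looks a lot like spam.
message = "Free money! Claim your free prize now"
features = word_frequencies(message)
print(features["free"], features["prize"], features["car"])
```

Each distinct word becomes one feature and its count becomes the value, so "free" gets 2 here while a word that never appears, such as "car", gets 0.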

Algorithms

The most visible part. Every problem can be solved in different ways. The method you choose affects the accuracy, efficiency, and size of the final model. There is one important caveat, though: even the best algorithm won't help if the data is garbage. This is sometimes called "garbage in, garbage out". So don't pay too much attention to the accuracy percentage; try to get more data first.