Machine Learning - Part Four

Estimated read time: 5 minutes

Classification

"You divide objects based on a pre-known feature. You separate socks by color, documents by language, music by genre."

Used today for:

  • Spam filtering
  • Language detection
  • Searching for similar documents
  • Sentiment analysis
  • Recognizing handwritten characters and numbers
  • Fraud detection

Popular algorithms: Naive Bayes, Decision Tree, Logistic Regression, K-Nearest Neighbors, Support Vector Machine.

Machine learning is mostly about classifying things. The machine here is like a child who can sort toys: here's a robot, here's a car, here's a robo-car... Oh, wait. Error! Error!

In classification, you always need a teacher. The data must be labeled with features so that the machine can assign classes based on them. Everything can be classified: users by interests (as algorithmic feeds do), articles by language and topic (important for search engines), music by genre (Spotify playlists), and even your emails.

In spam filtering, the Naive Bayes algorithm was widely used. The machine counts how often "viagra" appears in spam and in normal emails, turns those counts into probabilities, combines them across all the words in a message using Bayes' theorem, and voila, we have machine learning.

Later, spammers learned to cope with Bayesian filters by stuffing lots of "good" words at the end of the email. Ironically, this method was called Bayesian poisoning. Naive Bayes went down in history as one of the first and most elegant practical applications of machine learning, but other algorithms now handle spam filtering.
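The word-counting scheme above can be sketched in a few lines of Python. The corpus and counts are invented for illustration, and two standard tricks are added to keep the arithmetic stable: log-probabilities instead of raw products, and add-one (Laplace) smoothing for unseen words.

```python
import math
from collections import Counter

# Toy corpus: (text, label) pairs. Words and counts are made up for illustration.
train = [
    ("buy viagra now", "spam"),
    ("cheap viagra offer", "spam"),
    ("meeting at noon", "ham"),
    ("lunch at noon tomorrow", "ham"),
]

# Count word occurrences per class
counts = {"spam": Counter(), "ham": Counter()}
totals = {"spam": 0, "ham": 0}
for text, label in train:
    for word in text.split():
        counts[label][word] += 1
        totals[label] += 1

def classify(text, alpha=1.0):
    """Score each class with Bayes' rule (log-space, add-one smoothing)."""
    vocab = set(counts["spam"]) | set(counts["ham"])
    best, best_score = None, float("-inf")
    for label in ("spam", "ham"):
        # log P(class) -- uniform prior here, since the classes are balanced
        score = math.log(0.5)
        for word in text.split():
            # log P(word | class) with add-one smoothing for unseen words
            p = (counts[label][word] + alpha) / (totals[label] + alpha * len(vocab))
            score += math.log(p)
        if score > best_score:
            best, best_score = label, score
    return best

print(classify("viagra offer"))    # → spam
print(classify("lunch tomorrow"))  # → ham
```

Real filters do the same thing over thousands of words; the per-word probabilities simply add up in log space, so spam-looking words dominate the score.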

Here's another practical example of classification. Let's say you need a loan. How does a bank know if you'll repay it or not? There's no way to be sure. But the bank has many profiles of people who previously took money. They have information about age, education, job, salary, and most importantly, the fact of whether they repaid the money or not.

Using this data, we can teach the machine to find the patterns and get the answer. Getting an answer isn't the problem. The problem is that the bank can't blindly trust the machine's answer. What if there's a system failure, or a hacker attack?

To cope with that, we have the Decision Tree. All the data is automatically split into yes/no questions. The questions may look a bit strange from a human point of view, for example, does the borrower earn more than $128.12? However, the machine comes up with such questions to split the data in the best way at each step.

Thus, a tree is built. The higher the branch - the broader the question. Any analyst can take it and explain it afterwards. He may not understand it, but he will easily explain it! (Average analyst)
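A sketch of how such a "strange" threshold falls out naturally. For a toy set of borrowers (all numbers invented), we try every yes/no question "salary > t?" and keep the one that minimizes Gini impurity, the standard splitting criterion for classification trees:

```python
# Toy loan data: (monthly salary, repaid?) -- values invented for illustration.
borrowers = [(90, False), (110, False), (125, False),
             (130, True), (150, True), (200, True)]

def gini(rows):
    """Gini impurity of a set of (salary, repaid) rows: 0 = perfectly pure."""
    if not rows:
        return 0.0
    p = sum(1 for _, repaid in rows if repaid) / len(rows)
    return 2 * p * (1 - p)

def best_split(rows):
    """Find the yes/no question 'salary > t?' that splits the data best."""
    best_t, best_score = None, float("inf")
    salaries = sorted(s for s, _ in rows)
    for a, b in zip(salaries, salaries[1:]):
        t = (a + b) / 2          # candidate thresholds: midpoints between neighbours
        left = [r for r in rows if r[0] <= t]
        right = [r for r in rows if r[0] > t]
        # weighted impurity after the split -- lower is better
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if score < best_score:
            best_t, best_score = t, score
    return best_t

print(best_split(borrowers))  # → 127.5, an oddly specific number nobody chose by hand
```

A full tree simply repeats this greedy search on each half of the data, which is why thresholds deep in the tree get even stranger.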

Decision trees are widely used in high-responsibility areas: diagnostics, medicine, and finance.

Today, plain decision trees are rarely used on their own. However, they often set the foundation for large systems, and ensembles of trees (random forests, gradient boosting) can work even better than neural networks.

The Support Vector Machine (SVM) is rightfully the most popular classical classification method. It has been used to classify just about everything in existence: plants by appearance in photos, documents by category, and so on.

The idea of SVM is simple: it tries to draw a line between your two classes of points so that the margin, the gap to the nearest points on each side, is as large as possible. Look at the image:

Classification has a very useful application in anomaly detection: when a sample doesn't fit any of the classes, we highlight it. It's now used in medicine: on MRI scans, computers highlight all suspicious areas or deviations in test results. Stock exchanges use it to detect abnormal trader behavior and find insiders. By teaching the computer what is correct, we automatically teach it what is wrong.
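A toy sketch of that "doesn't fit any class" idea, using a simple z-score rule: a reading is flagged if it sits many standard deviations away from every known class. The class names, readings, and threshold are all invented for illustration.

```python
import statistics

# Toy readings per known class -- numbers invented for illustration.
classes = {
    "healthy": [4.9, 5.1, 5.0, 5.2, 4.8],
    "condition_a": [7.9, 8.1, 8.0, 8.2, 7.8],
}

def is_anomaly(value, threshold=3.0):
    """Flag a reading that is far (in standard deviations) from every
    known class -- i.e., it fits none of them."""
    for readings in classes.values():
        mu = statistics.mean(readings)
        sigma = statistics.stdev(readings)
        if abs(value - mu) / sigma <= threshold:
            return False   # fits at least one known class
    return True            # suspicious: highlight it

print(is_anomaly(5.05))  # → False (looks like a normal "healthy" reading)
print(is_anomaly(12.0))  # → True  (fits neither class)
```

Production systems use richer models of "normal" (one-class SVMs, density estimates), but the logic is the same: model the known classes well, and whatever falls outside all of them is the anomaly.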

Today, neural networks are more often used for classification. Well, that's what they were created for.

The rule of thumb is that the more complex the data, the more complex the algorithm. For text, numbers, and tables, I would choose a classic method: the models are smaller, they learn faster, and their results are easier to interpret. For images, videos, and other large, complex datasets, I would definitely look at neural networks.

Just five years ago, you could find a face classifier built on SVM. Today, it's easier to choose from hundreds of pre-trained networks. Nothing has changed for spam filters. They are still written with SVM. And there's no good reason to move away from it anywhere.
