Convolutional Neural Networks (CNNs)
Convolutional Neural Networks are all the rage now. They are used for object detection in photos and videos, face recognition, style transfer, image generation and enhancement, creating effects like slow motion, and improving image quality. Today, CNNs are used in virtually everything involving images and videos. Even on your iPhone, several networks analyze your photos to identify objects within them.
The problem with images has always been the difficulty of extracting features from them. Text is easy: you can split it into sentences and look up word features in specialized vocabularies. With images, features had to be designed by hand. This approach was called "hand-crafted features" and was used almost universally.
There are many problems with handcrafted features.
First, if a cat has its ears down or is turned away from the camera, you're in trouble: the classifier won't see a thing.
Second, try naming ten features on the spot that distinguish cats from other animals. I couldn't do it. Yet when a black blob rushes past me at night, even if I only catch it in the corner of my eye, I can definitely tell a cat from a rat. That's because people don't just look at the shape of the ears or count the paws; we account for lots of different features we don't even think about. And therefore we can't explain them to the machine.
So the machine must learn such features on its own, building them up from basic lines. We do the following: first, we divide the entire image into 8x8 pixel blocks and assign each one its dominant type of line: horizontal [-], vertical [|], or one of the diagonals [/]. Several types may stand out strongly at once; that happens, and we're not always entirely sure which one dominates.
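A minimal sketch of this step, using numpy. The tiny 2x2 "line detector" kernels and the way the winner is picked are illustrative assumptions, not what a real CNN does (a real CNN learns its kernels from data):

```python
import numpy as np

# Toy hand-made 2x2 kernels for each line type (illustrative only):
KERNELS = {
    "-": np.array([[1.0, 1.0], [-1.0, -1.0]]),   # horizontal edge
    "|": np.array([[1.0, -1.0], [1.0, -1.0]]),   # vertical edge
    "/": np.array([[-1.0, 1.0], [1.0, -1.0]]),   # diagonal edge
}

def dominant_line(block):
    """Slide each kernel over the block and return the symbol of the
    kernel with the strongest total response."""
    scores = {}
    for name, k in KERNELS.items():
        total = 0.0
        for i in range(block.shape[0] - 1):
            for j in range(block.shape[1] - 1):
                total += abs(np.sum(block[i:i + 2, j:j + 2] * k))
        scores[name] = total
    return max(scores, key=scores.get)

# An 8x8 block whose left half is dark and right half is bright:
# the vertical-edge kernel responds the strongest.
block = np.zeros((8, 8))
block[:, 4:] = 1.0
print(dominant_line(block))  # "|"
```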
The output is several tables of sticks, which are in fact the simplest edge features of the objects in the image. They are images in their own right, just built out of sticks. So we can take 8x8 blocks again and see how those sticks combine with each other.
This operation is called "convolution," which gave the method its name. Convolution can be represented as a layer of a neural network, since each neuron can act as any function.
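To make the "convolution as a layer" idea concrete, here is a hedged numpy sketch of a single convolutional layer: each output cell is one neuron computing a weighted sum over an image patch, passed through a ReLU. (Like most deep-learning libraries, this actually computes cross-correlation; the kernel below is a made-up vertical-edge detector, not a learned one.)

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2-D convolution (cross-correlation, the deep-learning
    convention): each output cell is a weighted sum of one image patch."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def conv_layer(image, kernels, bias=0.0):
    """One convolutional layer: every kernel produces a feature map,
    and a ReLU plays the role of the neuron's activation function."""
    return [np.maximum(conv2d(image, k) + bias, 0.0) for k in kernels]

# A dark-to-bright vertical edge lights up the vertical-edge feature map.
img = np.zeros((8, 8))
img[:, 4:] = 1.0
fmaps = conv_layer(img, [np.array([[-1.0, 1.0], [-1.0, 1.0]])])
```

Because each feature map is itself an image, you can feed it into another `conv_layer` call; that stacking is exactly how sticks combine into more complex shapes.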
When we feed our neural network a large number of cat photos, it automatically assigns bigger weights to the combinations of sticks it sees most often. It doesn't matter whether it's the straight line of a cat's back or a geometrically complex object like the cat's face: something will be highly activated.
As the output, we put a simple perceptron that looks at the most activated combinations and, based on them, tells cats apart from dogs.
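A hedged sketch of that output stage. The pooling step, the weights, and the "1 means cat" convention are all assumptions for illustration; a real network learns the weights from labeled photos:

```python
import numpy as np

def pool(feature_maps):
    """Global average pooling: squash each feature map into one activation."""
    return np.array([fm.mean() for fm in feature_maps])

def perceptron(features, weights, bias):
    """A single perceptron: weighted sum plus a step. 1 = 'cat', 0 = 'dog'."""
    return 1 if features @ weights + bias > 0 else 0

# Made-up weights (a trained network would have learned these):
weights = np.array([1.0, -1.0])   # "cat sticks" up-weighted, "dog sticks" down
bias = -0.2

# Two toy feature maps: the "cat" one is active, the "dog" one is silent.
feats = pool([np.array([[2.0, 0.0], [0.0, 0.0]]), np.zeros((2, 2))])
print(perceptron(feats, weights, bias))  # 1, i.e. "cat"
```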
The beauty of this idea is that we now have a neural network that searches for the most distinctive features of objects by itself. We don't need to choose them manually. We can feed it any number of images of any object simply by googling billions of them, and the network will build feature maps out of sticks and learn to tell each object apart on its own.
Give your neural network a fish, and it can recognize fish for the rest of its life. Give your neural network a fishing rod, and it can recognize fishing rods for the rest of its life….
Part Twelve