Recurrent Neural Networks (RNN)
The second most popular architecture today. Recurrent networks gave us useful things like neural machine translation, speech recognition, and voice synthesis in smart assistants. RNNs are best for sequential data like audio, text, or music.
Do you remember Microsoft Sam, the old-school speech synthesizer from Windows XP? That funny guy built words letter by letter, trying to glue them together. Now look at Amazon Alexa or Google Assistant: they not only pronounce words clearly but even place the accents correctly!
All because modern voice assistants are trained to speak not letter by letter but in whole phrases at once. We can prepare a dataset of texts with their corresponding audio recordings and train a neural network to generate an audio sequence as close as possible to the original speech.
In other words, we use text as input and its audio as the desired output. We want a neural network to generate audio for a given text, then compare it with the original, correct audio, and try to get as close to the ideal as possible.
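To make the objective concrete, here is a toy sketch in Python. Everything in it is made up for illustration: text_to_features, generate_audio, and the weight matrix W are hypothetical stand-ins for a real text-to-speech model, and the "recording" is just random noise. The only point is the last step, measuring how far the generated audio is from the reference.

```python
import numpy as np

rng = np.random.default_rng(0)

def text_to_features(text, length=100):
    """Encode text as a fixed-length vector of character codes (a stand-in for real features)."""
    codes = np.array([ord(c) for c in text], dtype=float)
    padded = np.zeros(length)
    n = min(len(codes), length)
    padded[:n] = codes[:n] / 255.0
    return padded

# Hypothetical "model": a single weight matrix mapping text features to one second of 16 kHz audio.
W = rng.normal(scale=0.01, size=(16000, 100))

def generate_audio(text):
    return W @ text_to_features(text)

reference_audio = rng.normal(size=16000)   # stand-in for the recorded speech from the dataset
predicted_audio = generate_audio("hello world")

# The training signal: how close is the generated audio to the original recording?
loss = np.mean((predicted_audio - reference_audio) ** 2)
print(f"mean squared error: {loss:.4f}")
```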
Sounds like a classic learning process, and even a perceptron seems suitable for the task. But how should we define its outputs? Dedicating a specific output to every possible phrase is obviously not an option.
Here we can take advantage of the fact that text, speech, and music are sequences. They consist of successive units, like syllables, each of which seems unique but depends on the ones before it. Lose that connection, and you get dubstep.
We can train a perceptron to generate these unique sounds, but how will it remember its previous responses? The idea is to add memory to each neuron and use it as an additional input at the next step. A neuron can make a note for itself: hey, we had a sound here, so the next one should sound more like this (a very simplified example).
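A minimal sketch of that feedback loop, with arbitrary sizes and random weights: the hidden state h plays the role of the neuron's note to itself and is passed back in as an extra input at every step.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 8, 4

W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # current input -> hidden state
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # previous hidden state -> hidden state
b_h = np.zeros(hidden_size)

def rnn_step(x, h_prev):
    """One time step: mix the current input with the note left at the previous step."""
    return np.tanh(W_xh @ x + W_hh @ h_prev + b_h)

# Process a short sequence, carrying the hidden state ("memory") along.
h = np.zeros(hidden_size)
for x in rng.normal(size=(5, input_size)):
    h = rnn_step(x, h)
    print(h)
```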
That’s how recurrent networks appeared.
This approach had one big problem – when all neurons remembered their past results, the number of connections in the network became so huge that adjusting all weights became technically impossible.
When a neural network cannot forget, it cannot learn new things (people have the same flaw).
The first solution was simple: limit each neuron's memory to, say, no more than 5 recent results. But this idea failed.
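For the curious, a rough sketch of what that capped memory would look like for a single toy neuron: the last 5 outputs sit in a fixed-size buffer and come back as extra inputs, and anything older is gone for good.

```python
from collections import deque

import numpy as np

rng = np.random.default_rng(0)

window = deque([0.0] * 5, maxlen=5)          # the neuron keeps only its 5 most recent outputs
weights = rng.normal(scale=0.1, size=1 + 5)  # one weight for the input, five for the remembered outputs

for x in rng.normal(size=10):
    inputs = np.concatenate([[x], list(window)])
    y = np.tanh(weights @ inputs)
    window.append(y)                         # pushes out the oldest output: forgotten forever
    print(round(float(y), 3))
```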
Later, a much better approach appeared: using special cells, similar to computer memory. Each cell can record a number, read it, or reset it. They were called Long Short-Term Memory (LSTM) cells.
Now, when a neuron needs to remember something, it sets a flag in that cell. Like, “this was a consonant, next time use different pronunciation rules”. When the flag is no longer needed, the cells are reset, and only the “long-term” connections of the classic perceptron remain. In other words, the network learns not only its weights but also when to set and reset these reminders.
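Here is a from-scratch sketch of such a cell in plain numpy (arbitrary sizes and random weights; in practice you would use a library implementation). The forget, input, and output gates are the record/reset/read flags described above, and the cell state c is the memory they control.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 8, 4

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix per gate; each gate looks at [current input, previous hidden state].
W_f = rng.normal(scale=0.1, size=(hidden_size, input_size + hidden_size))  # forget gate: reset the cell
W_i = rng.normal(scale=0.1, size=(hidden_size, input_size + hidden_size))  # input gate: record into the cell
W_o = rng.normal(scale=0.1, size=(hidden_size, input_size + hidden_size))  # output gate: read from the cell
W_c = rng.normal(scale=0.1, size=(hidden_size, input_size + hidden_size))  # candidate values to record

def lstm_step(x, h_prev, c_prev):
    xh = np.concatenate([x, h_prev])
    f = sigmoid(W_f @ xh)          # how much of the old memory to keep
    i = sigmoid(W_i @ xh)          # how much of the new candidate to write
    o = sigmoid(W_o @ xh)          # how much of the memory to expose right now
    c_tilde = np.tanh(W_c @ xh)    # the candidate "note" to store
    c = f * c_prev + i * c_tilde   # update the memory cell
    h = o * np.tanh(c)             # the visible output at this step
    return h, c

h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x in rng.normal(size=(5, input_size)):
    h, c = lstm_step(x, h, c)
print("final hidden state:", h)
```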
Conclusion: Time for a War with the Machines?
The main problem here is that the question “When will machines become smarter than us and enslave everyone?” is wrong from the start. It rests on too many hidden assumptions.
We say “become smarter than us” as if there were a single, unified scale of intelligence with humans at the top and animals somewhere below.
If that were the case, every human would beat every animal at everything, but that's not true. The average squirrel can remember a thousand hiding places for its nuts, while I can't even remember where my keys are.
So intelligence is a set of different skills, not a single measurable value? Or is remembering nut-hiding places not intelligence?
An even more interesting question for me is why we believe that the capabilities of the human brain are fixed. The Internet is full of charts that draw technological progress as an ever-rising curve while human capability stays a flat, constant line.
Okay, right now, multiply 1680 by 950 in your head. I know you won't even try. But hand you a calculator and you'll do it in two seconds. Does this mean the calculator simply expanded the capabilities of your brain?