“Alexa, what was the score of the Manchester United game yesterday?”
If you’ve ever asked a question like that of your personal assistant device (such as Amazon Echo or Google Home), you are already enjoying the fruits of the deep learning revolution. As Ralf Herbrich, Director of Machine Learning & Managing Director of Amazon Development Germany, explains, when you speak to your device, “features from a deep neural network are used to describe the audio stream in order to detect the wake-words… Once the wake-word is detected, neural networks are used to predict the sequence of phonetic states from the audio sequence of the whole microphone array.”
This same cutting-edge deep learning/neural network technology is already being used for a wide range of tasks, such as enabling the navigation of self-driving cars, colorizing old black-and-white films, assisting physicians in diagnosing illnesses, and beating the best human competitors at complex strategy games such as Go. But what, exactly, are deep learning and neural networks, and how do they work?
What Is Deep Learning?
Deep Learning (DL) is a subset of Machine Learning (ML), which in turn is a subset of Artificial Intelligence (AI). While the goal of artificial intelligence research is to create machines that can, on some level, “think,” machine learning aims at giving computers the ability to learn by recognizing patterns in their input data. They can then use that information to accomplish designated tasks without being explicitly programmed to handle each possible combination of input factors.
Most of the real-world data we want our machines to work with, whether images, speech, text, or sensor readings, is unstructured and unlabeled (that is, humans have not organized the data or identified particular features before it is presented as inputs to the system). What distinguishes deep learning from machine learning in general is that DL employs multi-layered algorithms that, through many (perhaps millions of) training iterations, tune themselves to accurately recognize and utilize significant patterns in unlabeled input data to make decisions.
For example, given the image of a street scene as its input, a deep learning system might be trained to identify which parts of the image represent cars and trucks, which are buildings, which are people, and even which are dogs as distinct from cats or pigeons.
Artificial Neural Networks (ANNs)
The tool most commonly used to implement deep learning is the artificial neural network, or ANN. These are software algorithms modeled after the biological neural networks that give the human brain its unparalleled information processing ability.
Our brains typically have roughly 86 billion neurons arranged into a highly interconnected neural network. Each individual neuron acts as a switch that changes state as a function of the inputs it receives from many other neurons. In turn, the output of any particular neuron may serve as an input for thousands of others. Learning takes place as different patterns of sensory input repeatedly stimulate and sensitize particular pathways through the network. This process allows our brains to be highly adept at recognizing and labeling patterns in what we see, hear, feel, smell, and taste.
An ANN implements a similar process in software. A large number of software neurons are arranged as nodes in a highly interconnected network. The nodes are themselves collected into several layers – typically an input layer, an output layer, and one or more internal layers in between. The in-between layers are called hidden layers since they do not directly interact with either the inputs or outputs of the network.
These stacked layers of software neurons are the key to the deep learning model. The first layer will be trained to recognize the most primitive features of the data, such as the edges in an image, or the phonemes in a speech sample. The outputs from that first layer serve as inputs to the next, which will be trained to identify higher-level structures in the data, such as simple shapes or phoneme combinations. This process proceeds through whatever number of hidden layers may be employed, with each layer being trained to recognize increasingly complex patterns in the data presented to it by the previous layer. In general, adding more layers allows the ANN to recognize structures of greater complexity.
The term “deep” refers to the depth of an ANN in terms of the number of layers it contains. In general, the “deep learning” designation is applied to systems that employ two or more hidden layers in addition to the input and output layers.
How ANNs Learn
The inputs to each node of an ANN have weighting factors (or simply, “weights”) individually assigned to them. For example, the first input to a particular node may have its value multiplied by 0.68 (so the first weight is w(1) = 0.68), while the second input value to the same node might be multiplied by w(2) = 0.23, and so on. The node then sums all its weighted inputs and passes that sum through a nonlinear transformation function (often called an activation function), thereby generating the output value for that neuron. In other words, the reactions of the neurons in each layer of the network are determined by the weights applied to the inputs received from the previous layer. (Note that a bias value may also be added at each node to shift the weighted sum in the positive or negative direction by some constant amount.) It is by setting each of the weights to its optimal value that the network is enabled to identify salient features in the input data.
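The computation at a single node can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sigmoid activation, the two input values, and the function names are made up for the example, and only the weights 0.68 and 0.23 come from the text.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias=0.0):
    """Weighted sum of the inputs plus a bias, passed through a nonlinearity."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(weighted_sum)

# Two illustrative input values, weighted by w(1) = 0.68 and w(2) = 0.23.
activation = neuron_output([1.0, 2.0], [0.68, 0.23])
print(round(activation, 4))
```

Raising the bias pushes the weighted sum, and therefore the output, upward for the same inputs, which is the "shift" mentioned in the note on bias values.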
The ANN learns by processing thousands, or even millions, of training examples, using an iterative optimization algorithm called gradient descent to adjust the weights of all the node inputs so as to minimize the discrepancy (or “error”) between the actual and expected values at the output of the network.
The most commonly used process for calculating the gradients that gradient descent needs is called backpropagation, short for “backward propagation of errors.” It works by starting at the output of the neural network and moving from back to front through the internal layers toward the input. At each layer, the rates of change of the output error function with respect to the weights applied to the inputs from the previous layer are calculated, so that the direction in which each weight must change in order to reduce the error can be identified. Each weight is then adjusted by a small amount in that direction. This process is repeated, typically many thousands of times, until the error stops decreasing appreciably, ideally at or near a minimum of the error function. At that point, the ANN is “trained.”
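To make gradient descent concrete, here is a toy sketch with a single weight. The training data, the squared-error measure, the learning rate, and the function name are all assumptions chosen for illustration; a real network adjusts a great many weights at once, but the update rule has the same shape.

```python
def train_single_weight(xs, ys, learning_rate=0.1, steps=50):
    """Fit y = w * x by gradient descent on the mean squared error."""
    w = 0.0  # arbitrary starting guess
    for _ in range(steps):
        # Gradient of mean((w*x - y)^2) with respect to w.
        gradient = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        # Step against the gradient, i.e. downhill on the error surface.
        w -= learning_rate * gradient
    return w

# Toy training set generated by the rule y = 2x, so the ideal weight is 2.
learned_w = train_single_weight([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(learned_w)
```

Each pass nudges the weight in whichever direction lowers the error, so the learned value ends up very close to 2.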
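The back-to-front flow of error can be sketched with the smallest possible "network": one hidden neuron feeding one output neuron. Everything here is an assumption made for illustration (the tanh activation, the squared-error function, the starting weights, the target value, and the function names); it shows the chain-rule bookkeeping, not any particular library's implementation.

```python
import math

def forward(x, w1, w2):
    """Two-layer chain: hidden = tanh(w1 * x), output = w2 * hidden."""
    hidden = math.tanh(w1 * x)
    return hidden, w2 * hidden

def backprop_step(x, target, w1, w2, learning_rate=0.1):
    """One forward pass, one backward pass, one weight update."""
    hidden, output = forward(x, w1, w2)
    error = 0.5 * (output - target) ** 2

    # Backward pass: start at the output and apply the chain rule layer by layer.
    d_output = output - target                # d(error)/d(output)
    d_w2 = d_output * hidden                  # d(error)/d(w2)
    d_hidden = d_output * w2                  # error signal passed back to the hidden layer
    d_w1 = d_hidden * (1 - hidden ** 2) * x   # tanh'(z) = 1 - tanh(z)^2

    # Gradient descent update: move each weight against its gradient.
    return w1 - learning_rate * d_w1, w2 - learning_rate * d_w2, error

w1, w2 = 0.5, 0.5
errors = []
for _ in range(200):
    w1, w2, error = backprop_step(x=1.0, target=0.8, w1=w1, w2=w2)
    errors.append(error)
print(errors[0], errors[-1])
```

The error recorded on the last iteration is far smaller than on the first, which is what "training" means in this miniature setting.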
Example of a Deep Learning System
(This example is based on a presentation given by Adam Coates of Baidu at the Bay Area Deep Learning School in 2016.)
We started by noting that in performing its magic, Amazon’s Alexa employs a deep neural network that parses audio inputs and converts them into text. Let’s take a brief, high-level look at how Alexa and similar devices, such as Apple’s Siri and Google Home, perform that conversion. Assume for the moment that the neural network has already been trained as described above.
The starting point is to capture and digitize the sounds the device hears. This is done by sampling the audio stream picked up by its microphones, typically at a rate of 16,000 samples per second. These samples are stored as a sequence of 16-bit numbers that represent the amplitude of the waveform at each sampling instant. The continuous stream of audio is then broken into 20-millisecond segments (320 samples each at that sampling rate), each of which is processed to determine which letter of the alphabet is being spoken at that moment.
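The digitizing and segmenting step can be sketched as follows. The synthetic 440 Hz tone standing in for microphone input, the helper names, and the choice to drop a trailing partial segment are assumptions for illustration; the sample rate, 16-bit amplitudes, and 20-millisecond segment length are the figures given above.

```python
import math

SAMPLE_RATE = 16_000   # samples per second, as described above
FRAME_MS = 20          # segment length in milliseconds

def quantize_16bit(value):
    """Map a float in [-1.0, 1.0] to a signed 16-bit integer amplitude."""
    clipped = max(-1.0, min(1.0, value))
    return int(round(clipped * 32767))

def split_into_frames(samples, sample_rate=SAMPLE_RATE, frame_ms=FRAME_MS):
    """Cut a sample sequence into consecutive, non-overlapping frames.

    Any trailing samples that do not fill a whole frame are dropped.
    """
    frame_len = sample_rate * frame_ms // 1000   # 320 samples at 16 kHz
    usable = len(samples) - len(samples) % frame_len
    return [samples[i:i + frame_len] for i in range(0, usable, frame_len)]

# One second of a synthetic 440 Hz tone stands in for microphone input.
audio = [quantize_16bit(math.sin(2 * math.pi * 440 * n / SAMPLE_RATE))
         for n in range(SAMPLE_RATE)]
frames = split_into_frames(audio)
print(len(frames), len(frames[0]))  # prints 50 320
```

One second of audio therefore yields fifty segments for the network to examine.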
It has proven very difficult for neural networks to decode speech by directly processing waveform amplitude values. Instead, the data is pre-processed using a Fast Fourier Transform operation to convert the waveform to a frequency-domain representation. This divides the signal into a number of discrete frequency bands, and calculates the strength of the signal (actually the amount of energy) in each band. This is the set of values that is presented as inputs to the neural network.
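Here is a rough sketch of that pre-processing for one segment. To stay within the standard library it uses a plain (slow) discrete Fourier transform instead of a fast one; the result is the same, and in practice a routine such as numpy.fft.rfft would be the usual choice. The number of bands, the equal-width grouping, the test tone, and the function names are assumptions for illustration.

```python
import cmath
import math

def power_spectrum(frame):
    """Energy at each frequency bin from 0 Hz up to (not including) Nyquist."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2):
        # Correlate the frame with a complex sinusoid of k cycles per frame.
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(frame))
        spectrum.append(abs(coeff) ** 2)
    return spectrum

def band_energies(frame, num_bands=8):
    """Group the frequency bins into equal-width bands and total each band.

    Bins left over when the count does not divide evenly are ignored.
    """
    spectrum = power_spectrum(frame)
    width = len(spectrum) // num_bands
    return [sum(spectrum[b * width:(b + 1) * width]) for b in range(num_bands)]

# A 20 ms frame (320 samples at 16 kHz) holding a pure 1,500 Hz tone.
frame = [math.sin(2 * math.pi * 1500 * t / 16_000) for t in range(320)]
energies = band_energies(frame)
loudest_band = energies.index(max(energies))
print(loudest_band)  # prints 1, the band covering 1,000 to 2,000 Hz
```

With 320 samples at 16 kHz each bin spans 50 Hz, so eight bands of twenty bins cover 1,000 Hz apiece, and the tone's energy lands in the second band.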
Now, the entire audio stream is presented to the neural network one 20ms segment at a time. With each segment the neural network calculates a probability distribution that reflects the likelihood that the sound represents a particular letter of the alphabet.
The output layer of the network consists of a bank of neurons, each representing a single letter of the alphabet, a space, or a blank character (for English this set is {A-Z, space, blank}). As each 20ms segment is presented at the input of the network, the output value of each of the neurons in the output layer represents the probability that the character associated with that particular neuron is being spoken at that time. The character displaying the highest probability is selected as the one most likely being spoken during that time segment.
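Picking the most likely character for one segment can be sketched like this. The use of a softmax function to turn raw output scores into probabilities is an assumption (the text says only that the outputs represent probabilities), and the scores themselves are made up for the example.

```python
import math

# The 28 output symbols named above: A-Z, space, and blank (written "_").
ALPHABET = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ") + [" ", "_"]

def softmax(scores):
    """Turn raw output-layer scores into probabilities that sum to 1."""
    peak = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def most_likely_char(scores):
    """Return the symbol with the highest probability, and that probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ALPHABET[best], probs[best]

# Made-up scores for one 20 ms segment: the "H" neuron responds most strongly.
scores = [0.0] * len(ALPHABET)
scores[ALPHABET.index("H")] = 4.0
scores[ALPHABET.index("K")] = 1.5
char, prob = most_likely_char(scores)
print(char, round(prob, 3))
```

Running this once per segment produces the character-by-character sequence that the later steps clean up.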
Typically, a recurrent neural network (RNN) is used to enhance accuracy. RNNs retain a memory of their previous states. This allows them to better predict which characters may follow the sequence of characters they have already received, thus reducing errors. For example, if the network has decoded the sequence HEL, it can infer that the next characters are much more likely to be LO than, say, QX.
The system includes an algorithm for handling the fact that words may be spoken at different rates of speed. For example, the network may determine that during the first fourteen 20ms segments, the characters in the audio stream resolve to HHH_EE_LL_L_O_ (where _ represents the blank character). The algorithm first collapses each run of consecutive repeated characters into a single character, yielding H_E_L_L_O_. Then blanks are removed, yielding HELLO. Note that the blank separating the two groups of L is what allows the double L to survive the first step.
Of course, this example is highly simplified, and doesn’t cover a number of critical issues that must be addressed in a practical deep neural network-based speech recognition system. But it should provide a good conceptual outline of how deep learning technology is being put into practice today.
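The two-step cleanup just described is short enough to write out. This sketch follows the example in the text; the function name and the use of "_" for the blank are illustrative choices. The rule appears to match the collapsing step used in Connectionist Temporal Classification (CTC), though the text does not name the algorithm.

```python
from itertools import groupby

BLANK = "_"

def collapse(frame_chars, blank=BLANK):
    """Merge runs of repeated characters, then drop the blank symbol."""
    merged = [char for char, _run in groupby(frame_chars)]
    return "".join(char for char in merged if char != blank)

print(collapse("HHH_EE_LL_L_O_"))  # prints HELLO
```

Without a blank between them, repeated letters merge into one: collapse("HHEELLLLOO") gives HELO, which is why the blank symbol is needed to spell words with double letters.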

