The purpose of this blog is to document my journey into deep learning.
Some background on myself: I’m an engineer by training, with specific interests in mechanical design, sensors, actuators, robotics, and embedded systems. After getting my Ph.D., I found that I actually had free time on weeknights and weekends, and I decided I wanted to do something more meaningful with that time than browse Reddit until my brain bleeds. So I decided to expand my knowledge of the vast, intimidating world of machine learning and data science. (Sidenote: I still browse Reddit. Just not as much.)
Full disclosure, I did take a machine learning class in graduate school, and managed to publish a paper on my final project which aligned nicely with my research objectives at the time (see link). My paper mostly focused on kernel-based ridge regression and support vector regression methods, which produced some nice results but amounted to little more than writing my own kernelized ridge regression and stochastic SVR code.
So here’s my attempt to dive into some areas of machine learning that I haven’t had the opportunity to explore through school work or research. This blog is more for me than for anybody else; what better way to hold myself accountable than posting my results on the world wide web for everybody (but probably nobody) to see? When learning algorithms and coding, it’s really easy for me to focus on implementation and lose sight of the underlying theory, so this blog is my ‘theory lifeline’ that I’ll use to remind myself of what is really going on under the hood.
I intend to learn from the ground up by coding the algorithms myself, allowing me to ‘peek inside the black box’ and understand the basic methodologies at work before taking advantage of MATLAB/Python/C++ libraries and DL frameworks that are optimized for these types of things. I feel like pretty much anybody could initialize a CNN in Keras, or use MATLAB’s built-in NN tools, without really understanding what’s going on under the hood, but that’s not the approach I want to take. I’ll start in MATLAB (as it is the language I am most comfortable with) to develop code from scratch, after which I’ll probably make the move to Python to use some of the powerful machine learning libraries like Keras and TensorFlow (this transition may or may not have something to do with the fact that my academic license expires soon). I might even dip into parallel programming with CUDA.
Alright, let’s get this party started.
Trusty ol’ Wikipedia defines an Artificial Neural Network (ANN) as:
…a collection of connected units or nodes called artificial neurons which loosely model the neurons in a biological brain. Each connection, like the synapses in a biological brain, can transmit a signal from one artificial neuron to another. An artificial neuron that receives a signal can process it and then signal additional artificial neurons connected to it.
In schematic form, this might look like the following:
A deep neural network with three hidden layers
This is an example of a ‘Deep Neural Network’ (DNN), a subset of ANNs characterized by many layers, each containing anywhere from a handful of nodes to hundreds or even thousands. An arbitrary ANN in a supervised learning task has two observable layers: the input layer and the output layer. An ANN can also have any number of hidden layers, which successively transform the values of the input layer until the result resembles the values in the output layer, but which aren’t actually observed outside the context of the network.
There are many flavors and implementations of ANNs and there’s no way I’ll be able to understand them all. So the following lists the algorithms I’ll try to implement from scratch, in the order in which I’ll attack them:
- Single-layer perceptron model (SLP)
  - Concepts introduced: feedforward activation, backpropagation, stochastic gradient descent
  - Applications: Simple classification/regression models
  - Candidate Datasets: iris, wdbc
- Multi-layer Perceptron model (MLP)/Dense Neural Network (DNN)
  - Concepts introduced: backpropagation through hidden layers, minibatch gradient descent, momentum, annealing, regularization
  - Applications: character identification, financial prediction, noise filtering, data compression
  - Candidate Datasets: MNIST, Fashion MNIST
- Recurrent Neural Networks (RNN) and Long Short Term Memory (LSTM)
  - Concepts introduced: backpropagation through time, LSTM
  - Applications: Predictive text, timeseries predictions, natural language processing
  - Candidate Datasets:
- Convolutional Neural Networks (CNN)
  - Concepts introduced: convolution, pooling
  - Applications: Image classification/captioning/recognition, natural language processing
  - Candidate Datasets: CIFAR-10
After trying to implement the above in MATLAB, at some point (either due to conceptual or computational limitations) I’ll probably make the transition to Python to take advantage of more powerful machine learning libraries and friendlier APIs.
Alright, here we go…
(1) Single Layer Perceptron Model
The simplest neural net architecture is one where the input layer feeds forward directly into the output layer through a weighted network (no hidden layers). This is the single layer perceptron model, and is fairly straightforward to implement in practice. An example is shown below which uses a simple step function for activation in the feedforward direction:
Single layer perceptron (SLP) model
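Consider a network with $n$ inputs, $x_1, \dots, x_n$, and $m$ outputs, $y_1, \dots, y_m$. For the iris dataset, $n = 4$ and $m = 3$. The outputs are generated by a weighted sum of the input nodes, plus a bias node, which is then fed (or ‘activated’) through a nonlinear function $f$, as follows (for an example output $y_j$):

$$y_j = f\left(\sum_{i=1}^{n} w_{ji}\, x_i + b_j\right)$$

This lends itself naturally to linear algebra, and we can rewrite it in more compact form:

$$\mathbf{y} = f\left(W\mathbf{x} + \mathbf{b}\right)$$

where $W$ is the $m \times n$ weighting matrix with entries $w_{ji}$, $\mathbf{b}$ collects the bias contributions $b_j$, and $f$ is applied element-wise.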
The network designer is free to choose which nonlinear function to use, although there are situations where one is preferred over the others. For example, for classification tasks, the output layer should use a bounded function: softmax or the logistic function keeps the values between 0 and 1, while tanh keeps them between -1 and 1. For a regression task, the outputs are unbounded and a linear function should be used.
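To make the weighted sum and the choice of activation concrete, here is a minimal Python sketch (standard library only). The function names, the layout of the weight rows (bias weight stored last), and the numbers in the example are all assumptions made for illustration; this is not the MATLAB code used for the results in this post.

```python
import math

def tanh_activation(net):
    """Return tanh(net) and its derivative; the output is bounded in (-1, 1)."""
    y = math.tanh(net)
    return y, 1.0 - y * y

def logistic_activation(net):
    """Return the logistic sigmoid and its derivative; the output is bounded in (0, 1)."""
    y = 1.0 / (1.0 + math.exp(-net))
    return y, y * (1.0 - y)

def linear_activation(net):
    """Return net unchanged (unbounded) with derivative 1; suited to regression."""
    return net, 1.0

def feedforward(x, weights, bias, activation):
    """Compute each output as activation(weighted sum of inputs plus bias node).

    weights holds one row per output; the last entry of each row multiplies
    the bias node (an assumed layout for this sketch).
    """
    augmented = list(x) + [bias]
    nets = [sum(w * a for w, a in zip(row, augmented)) for row in weights]
    outputs = [activation(net)[0] for net in nets]
    return outputs, nets

# Example: 4 inputs feeding 3 outputs (the iris shape), with arbitrary weights
x = [5.1, 3.5, 1.4, 0.2]
weights = [[0.1, -0.2, 0.3, 0.0, 0.05],
           [0.0, 0.1, -0.1, 0.2, 0.0],
           [-0.3, 0.2, 0.1, 0.1, -0.05]]
outputs, nets = feedforward(x, weights, 1.0, tanh_activation)
print(outputs)
```

Returning the derivative alongside the value is a convenience: the weight update needs it, so each activation hands back both at once.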
SLP algorithm, abridged
The basic steps towards implementing an SLP are nicely written up here, but for the purposes of summarizing:
1. Initialize the weighting matrices randomly
2. Feedforward: multiply the input nodes (and a bias node) by the weighting matrices at the current iteration
3. Activate through a nonlinear activation function to compute the output
4. Compute the cost function (which, for our purposes, is an L2 least squares error between the network output and target output)
5. Backpropagate and update the weights using stochastic gradient descent for the next iteration (we’ll look at minibatch descent in the next post)
6. Repeat steps 2-5 until a convergence criterion is met (usually a lower bound on the cost function, or a maximum number of iterations)
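The steps above can be sketched end to end in a few dozen lines of plain Python. Everything here is an assumption made for illustration: the function names, the toy dataset, the tanh activation, the initial weight range, and the choice to check convergence every 100 iterations. It is a sketch of the general recipe, not a port of the MATLAB implementation.

```python
import math
import random

def feedforward(x, weights, bias=1.0):
    """Steps 2-3: weighted sum of inputs and bias node, activated through tanh."""
    augmented = list(x) + [bias]
    nets = [sum(w * a for w, a in zip(row, augmented)) for row in weights]
    return [math.tanh(net) for net in nets], augmented

def mean_cost(inputs, targets, weights):
    """Average L2 cost (half the squared error) over the whole dataset."""
    total = 0.0
    for x, t in zip(inputs, targets):
        outputs, _ = feedforward(x, weights)
        total += 0.5 * sum((tj - yj) ** 2 for tj, yj in zip(t, outputs))
    return total / len(inputs)

def train_slp(inputs, targets, rate=0.05, max_iter=5000, tol=1e-3, seed=0):
    rng = random.Random(seed)
    n_in, n_out = len(inputs[0]), len(targets[0])
    # Step 1: small random initial weights (one extra column for the bias node)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in + 1)]
               for _ in range(n_out)]
    for iteration in range(max_iter):
        # Stochastic gradient descent: pick one random training sample
        k = rng.randrange(len(inputs))
        outputs, augmented = feedforward(inputs[k], weights)
        # Step 4: error between target and network output
        errors = [t - y for t, y in zip(targets[k], outputs)]
        # Step 5: delta rule; the derivative of tanh is 1 - y^2
        for j, (e, y) in enumerate(zip(errors, outputs)):
            delta = e * (1.0 - y * y)
            for i, a in enumerate(augmented):
                weights[j][i] += rate * delta * a
        # Step 6: check the convergence criterion every 100 iterations
        if iteration % 100 == 99 and mean_cost(inputs, targets, weights) < tol:
            break
    return weights

# Toy, linearly separable data (made up for illustration); targets are +1/-1
inputs = [[2, 2], [1, 3], [3, 1], [-2, -2], [-1, -3], [-3, -1]]
targets = [[1], [1], [1], [-1], [-1], [-1]]
weights = train_slp(inputs, targets)
predictions = [1 if feedforward(x, weights)[0][0] > 0 else -1 for x in inputs]
accuracy = sum(p == t[0] for p, t in zip(predictions, targets)) / len(inputs)
print(accuracy)
```

The cost on a single random sample is noisy, so this sketch checks the stopping criterion against the average cost over the whole (tiny) dataset instead.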
Coding and Testing
The SLP algorithm was implemented in MATLAB. Literally the entire feedforward and backpropagation algorithm is pasted below:
    % Randomly sample the input vector
    samp = randsample(size(data.input,1), 1);
    rand_input = data.input(samp,:);
    rand_output = data.output(samp,:);

    % Feedforward on the random sample
    [output, net] = feedforward(rand_input, weights, data.bias(samp), activation_function);

    % Compute error
    error_vector = rand_output - output;

    % Backpropagate to adjust weights
    [~, dY] = activation(net, activation_function);
    delta = error_vector .* dY;
    weights_delta = rate * kron([rand_input, data.bias(samp)]', delta);
    W = weights + weights_delta';
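For my own ‘theory lifeline’, here is where that update comes from, assuming the cost is half the squared error between target and output. The symbols are my own shorthand rather than anything defined in the code: $t_j$ is the target for output $j$, $\tilde{x}_i$ is the input vector with the bias node appended, and $\eta$ is the learning rate.

```latex
E = \tfrac{1}{2}\sum_{j}\left(t_j - y_j\right)^2,
\qquad y_j = f(\mathrm{net}_j),
\qquad \mathrm{net}_j = \sum_{i} w_{ji}\,\tilde{x}_i

\frac{\partial E}{\partial w_{ji}}
  = -\left(t_j - y_j\right) f'(\mathrm{net}_j)\,\tilde{x}_i

w_{ji} \leftarrow w_{ji}
  + \eta\,\underbrace{\left(t_j - y_j\right) f'(\mathrm{net}_j)}_{\delta_j}\,\tilde{x}_i
```

Mapping back to the snippet: `error_vector` is $t - y$, `dY` is $f'(\mathrm{net})$, `delta` is $\delta$, `rate` is $\eta$, and the `kron` call builds the outer product of the augmented input with `delta`, so every weight gets its own $\delta_j \tilde{x}_i$ term in one shot.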
Some example results of a single-layer perceptron architecture are shown below for the iris and wdbc datasets (note that the x-axis labels are incorrectly titled ‘epoch’ when they should be ‘iteration’). Both use the tanh activation function and a learning rate of 0.05. The weights are updated by sampling one random input at a time (stochastic gradient descent).
SLP iris classification: (left) RMSE, (right) classification error
Some metadata for this classification:
Training Classification Success Rate: 95.238095 percent
Testing Classification Success Rate: 77.272727 percent
Validation Classification Success Rate: 95.652174 percent
The SLP was also trained on the wdbc dataset, and the results shown below demonstrate the power of a fairly simple model in a binary classification task:
SLP wdbc classification: (left) RMSE, (right) classification error
Some metadata for this classification:
Training Classification Success Rate: 97.487437 percent
Testing Classification Success Rate: 97.647059 percent
Validation Classification Success Rate: 96.511628 percent
Given the 31 input features, the SLP correctly classifies tumors as malignant or benign in roughly 98% of cases in the test set (97.6%, per the numbers above). Pretty cool.
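For reference, here is one way the two metrics in these plots could be computed. This is a sketch under assumptions: I’m assuming the predicted class is the output node with the largest value, and the function names and sample numbers are made up for illustration. For a single-output binary problem like wdbc you would threshold the output instead of taking the largest one.

```python
import math

def rmse(outputs, targets):
    """Root-mean-square error over every output node of every sample."""
    squared = [(t - y) ** 2
               for out, tgt in zip(outputs, targets)
               for y, t in zip(out, tgt)]
    return math.sqrt(sum(squared) / len(squared))

def success_rate(outputs, targets):
    """Percentage of samples whose largest output matches the largest target."""
    def argmax(values):
        return max(range(len(values)), key=lambda i: values[i])
    hits = sum(argmax(out) == argmax(tgt) for out, tgt in zip(outputs, targets))
    return 100.0 * hits / len(outputs)

# Made-up network outputs for three samples and three classes (targets use +1/-1)
outputs = [[0.9, -0.8, -0.7], [0.1, 0.2, -0.9], [-0.5, -0.6, 0.4]]
targets = [[1, -1, -1], [1, -1, -1], [-1, -1, 1]]
print("Classification Success Rate: %f percent" % success_rate(outputs, targets))
print("RMSE: %f" % rmse(outputs, targets))
```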
(2) Limitations of SLP/Looking Forward to DNNs
The single layer model only gets us so far. While it performs pretty well for binary classification problems and problems with few (i.e. <50) input features, what if we want to do something more interesting, like classify handwritten digits? Enter MNIST:
A subset of MNIST characters and their associated labels
Classifying MNIST is known as the “Hello World!” of machine learning (which I perceive to be a slight undersell, but that’s neither here nor there). The dataset consists of 60,000 training images and 10,000 test images (each grayscale, 28×28 pixels, for a total of 784 input features), all of handwritten digits from 0 to 9 (for a total of 10 output classes). The goal of the trained algorithm is to take in a test digit and ‘spit out’ the correct label. MNIST is often used as a benchmark dataset when new algorithms are developed, and the state of the art (a committee of 6-layer CNNs) can achieve 99.79% classification accuracy.
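To show how a 28×28 image and a digit label turn into the 784 inputs and 10 outputs mentioned above, here is a small Python sketch. The helper names, the scaling of pixel values to the 0-1 range, and the 0/1 target encoding are assumptions for illustration, not necessarily the preprocessing behind the results in this post.

```python
def flatten_image(image, max_value=255.0):
    """Flatten a 2-D grid of pixel intensities into one scaled feature vector."""
    return [pixel / max_value for row in image for pixel in row]

def one_hot(label, n_classes=10, on=1.0, off=0.0):
    """Encode an integer label as a target vector with a single 'on' entry."""
    return [on if k == label else off for k in range(n_classes)]

# A blank 28x28 placeholder standing in for a real MNIST digit
image = [[0] * 28 for _ in range(28)]
features = flatten_image(image)
target = one_hot(7)
print(len(features), len(target))
```

With a tanh output layer, it may make sense to pass `off=-1.0` so the targets sit at the two ends of the activation’s range.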
SLP performance on MNIST
Now, we can’t expect a single-layer perceptron model to capture all the nuances and features needed to correctly classify the handwritten characters in MNIST. However, I was pretty surprised at how well it performed:
Training Classification Success Rate: 86.337143 percent
Testing Classification Success Rate: 86.240000 percent
Validation Classification Success Rate: 86.200000 percent
Ask anyone else even remotely practiced in ML and they’ll tell you how abysmal these numbers are. Anything below 95% is pretty much unacceptable. But as a (relative) newcomer to the field I was pleasantly surprised.
We can peek inside the network a bit by visualizing the weights associated with each output. Since we have 784 inputs and 10 classes, our weighting matrix has one row of 784 weights per class (setting the bias weights aside), so we can view it as ten separate 28×28 weight images (one for each output). The image below shows a representative test input for each output class in the top row. The middle row is a visualization of the corresponding weights, and the bottom row is the resulting prediction. If you squint, it looks sort of like the weights are blurry versions of the correct output class. This makes intuitive sense: inputs that correlate highly with the weights of a certain class will be more likely to produce a large output for that class, since all we’re doing is taking a dot product between the input and that class’s weights. Pretty interesting.
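The ‘weights as blurry templates’ idea is easy to see on a toy example. In this sketch, tiny 2×2 ‘images’ stand in for the 28×28 digits, the weights and input are made-up numbers, the bias is ignored, and the helper names are hypothetical. Because tanh is monotonic, ranking classes by the raw dot product gives the same winner as ranking by the activated output.

```python
def reshape(row, width):
    """Split a flat weight row into image rows of the given width."""
    return [row[i:i + width] for i in range(0, len(row), width)]

def class_scores(x, weights):
    """Dot product of the input with each class's weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

# Class 0 responds to the top row of pixels, class 1 to the bottom row
weights = [[1.0, 1.0, -1.0, -1.0],
           [-1.0, -1.0, 1.0, 1.0]]
x = [0.9, 0.8, 0.1, 0.0]  # bright top row, dark bottom row

scores = class_scores(x, weights)
predicted = max(range(len(scores)), key=lambda k: scores[k])
print(reshape(weights[predicted], 2))  # the 'template' for the winning class
print(scores, predicted)
```

For MNIST the same `reshape` call with `width=28` would turn each 784-long weight row into an image that can be plotted next to the digits.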
A selection of test inputs, weights associated with the correct output, and resulting prediction
So how can we improve performance? Well, by introducing more levels of abstraction in our model to capture higher-order nuances in our data.