CUDA - Neural Network Implementation

After writing the fractal renderer to familiarise myself with CUDA, I wanted to use it to implement a fast neural network. Neural networks are almost the ideal application for GPUs, as most of the computation boils down to matrix multiplications (or similar operations) across tensors. There is little data inter-dependency, and a well-implemented neural network can run 100x faster on a GPU than on a modern CPU. The emergence of GPUs as a platform for general-purpose programming is actually one of the big drivers behind the rise of “deep learning” in the last few years: training deep models is a computationally intensive process, and modern GPUs have made this far more tractable.

Initially, implementing a neural network on a GPU seemed a bit daunting. The programming model was fairly new and unfamiliar, and there seemed to be a number of things to get right at the outset to achieve good performance. I decided to approach the problem by breaking it up into smaller tasks:

  • Enumerate the individual low-level stages of back-propagation learning
  • Specify the data needed at each stage
  • Decide how to store this data in memory
  • Determine what data is output by each stage
  • Determine what computation is needed to produce that output
  • Implement the design in CUDA / C++

Back-Propagation Steps

The first thing to do was to specify the individual steps of training a neural network with back-propagation. This would then help with nailing down the data required and how it should be stored and processed. The rough outline of the back-propagation algorithm is as follows:

Step 0 - create a training mini-batch consisting of input vectors and corresponding desired outputs.

Step 1 - perform a forward pass through the network for the mini-batch input vectors:

  • at each layer, store the layer’s output for each input vector
  • store the derivative of the layer’s activation function for each input vector

Step 2 - compute the difference between the network output and the desired output, and store this as the last layer’s delta.

Step 3 - iterate backward through the layers, computing each layer’s delta from its own activation derivatives and the delta of the following layer.

Step 4 - for each layer compute the gradient of the weights with respect to the network error, using the layer delta and outputs over the mini-batch.

Step 5 - average the weight gradients across the mini-batch samples and shift the network’s weights using the gradient.
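
For reference, steps 2 to 5 can be written more compactly (a sketch, taking the output-layer delta to be simply the difference between the network output and the target as in step 2, with σ the activation function, ⊙ element-wise multiplication, B the mini-batch size, and η the learning rate):

\[
\delta^{(L)} = a^{(L)} - y, \qquad
\delta^{(l)} = \left( W^{(l+1)\top} \delta^{(l+1)} \right) \odot \sigma'\!\left(z^{(l)}\right)
\]

\[
\frac{\partial E}{\partial W^{(l)}} \approx \frac{1}{B} \sum_{b=1}^{B} \delta^{(l)}_{b} \left(a^{(l-1)}_{b}\right)^{\top}, \qquad
W^{(l)} \leftarrow W^{(l)} - \eta \, \frac{\partial E}{\partial W^{(l)}}
\]

Here a^(l) is the output of layer l, z^(l) its pre-activation input, and W^(l) the weights connecting layer l-1 to layer l.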

Data To Be Manipulated

With the algorithm roughly outlined, it’s possible to list the data that needs to be stored and manipulated by the GPU. The rough enumeration of this data is as follows:

  • Connection weights: a list of 2D matrices. Each matrix represents the connection weights of a single layer of the neural network.
  • Batch Inputs: a 2D matrix where each row represents a sample input. The number of rows is the batch size.
  • Batch Outputs: a 2D matrix of the corresponding expected output, with each row representing the desired output vector for the corresponding input row vector.
  • Layer Outputs: a list of 2D matrices, with each matrix representing the output of a layer. Each row of the matrix corresponds to the output for a single element of the batch input. The number of rows is the mini-batch size, and the number of columns is the number of nodes in the layer.
  • Layer Derivatives: similar to the layer outputs, this is a list of 2D matrices, but each matrix is the derivative of the node output of a layer with respect to its input.
  • Layer Deltas: a list of 2D matrices, with each matrix representing the delta of a layer’s output in relation to the desired output. Again, this is a 2D matrix because the network is processing a batch of inputs at a time, rather than a single input vector.
  • Batch Gradients: a list of 2D matrices representing the gradient of the layer weights with respect to the error over the mini-batch.

[Figure: Batch inputs matrix]

There are a few other auxiliary pieces of data that are manipulated by the GPU, but this covers the main chunks of data the GPU needs during neural network training with back-propagation.
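
As a rough illustration of how these chunks might be grouped, here is a sketch of a pre-allocated device-side layout (the struct and field names are hypothetical, not taken from the actual code):

```cpp
#include <vector>

// Hypothetical device-side buffer layout (not the project's actual types).
// Everything is allocated once with cudaMalloc before training starts and
// reused for every mini-batch; row-major storage is assumed throughout.
struct LayerBuffers {
    float *weights;      // numNodes x numInputs connection weights
    float *outputs;      // batchSize x numNodes layer outputs
    float *derivatives;  // batchSize x numNodes activation derivatives
    float *deltas;       // batchSize x numNodes layer deltas
    float *gradients;    // numNodes x numInputs averaged weight gradients
    unsigned numNodes;   // nodes in this layer
    unsigned numInputs;  // nodes in the previous layer
};

struct NetworkBuffers {
    float *batchInputs;   // batchSize x inputDim mini-batch inputs
    float *batchTargets;  // batchSize x outputDim desired outputs
    std::vector<LayerBuffers> layers;
    unsigned batchSize;
};
```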

Processing

Looking at the back-propagation algorithm and the enumerated data structures, the following implementation structure made the most sense:

  • Pre-allocate the necessary memory on the device and host prior to learning. Memory allocation should not be done during learning.
  • Treat each mini-batch separately.
  • For the forward pass part of the algorithm: (a) have a kernel (a function that runs on the GPU device) that computes the batch outputs of a single layer given the output of the previous layer, storing the layer output and derivative information in a given array; (b) have the host (CPU) code call this kernel in order over the layers.
  • For computing the layer delta part of the back-propagation: (a) have a kernel that computes the layer delta of a single layer based on the delta of the following layer; (b) invoke this kernel from CPU code iteratively over the layers in reverse.
  • For computing the weight gradients: (a) have a kernel that computes the weight gradients for a given layer using the delta and derivative information of the layer’s output; (b) invoke this kernel from the CPU for all the layers.
  • Invoke a kernel that updates the layer weights given the gradient information.

Breaking it down like this helped me to structure the code and not get overwhelmed. The reason I was initially overwhelmed was that I was naively trying to implement the entire back-propagation algorithm in a single kernel, which is a needlessly complex way of approaching the problem. It’s much easier to implement the different parts of the algorithm as individual kernel functions and invoke these from CPU code.
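
Given a buffer layout like the one sketched earlier, the host-side code for a single training iteration might look roughly like this (a sketch; the kernel names, signatures, and launch configuration are illustrative rather than the project’s actual API, and the kernels themselves are outlined in the sections below):

```cpp
// Illustrative host-side driver for one mini-batch iteration, using the
// hypothetical NetworkBuffers/LayerBuffers structs sketched earlier.
void trainOnMiniBatch(NetworkBuffers &net, float learningRate) {
    const dim3 block(16, 16);
    auto gridFor = [](unsigned cols, unsigned rows) {
        return dim3((cols + 15) / 16, (rows + 15) / 16);
    };

    // Forward pass: run the layer kernel in order, feeding each layer the
    // outputs of the previous one.
    const float *prevOutputs = net.batchInputs;
    unsigned prevSize = net.layers.front().numInputs;
    for (auto &layer : net.layers) {
        forwardPassKernel<<<gridFor(layer.numNodes, net.batchSize), block>>>(
            prevOutputs, layer.weights, layer.outputs, layer.derivatives,
            net.batchSize, prevSize, layer.numNodes);
        prevOutputs = layer.outputs;
        prevSize = layer.numNodes;
    }

    // Output-layer delta: difference between the network output and the target.
    LayerBuffers &last = net.layers.back();
    outputDeltaKernel<<<gridFor(last.numNodes, net.batchSize), block>>>(
        last.outputs, net.batchTargets, last.deltas,
        net.batchSize, last.numNodes);

    // Backward pass: propagate the deltas from the output layer toward the input.
    for (int i = (int)net.layers.size() - 2; i >= 0; i--) {
        LayerBuffers &cur = net.layers[i];
        LayerBuffers &next = net.layers[i + 1];
        layerDeltaKernel<<<gridFor(cur.numNodes, net.batchSize), block>>>(
            next.weights, next.deltas, cur.derivatives, cur.deltas,
            net.batchSize, cur.numNodes, next.numNodes);
    }

    // Gradients and weight update for every layer.
    for (size_t i = 0; i < net.layers.size(); i++) {
        LayerBuffers &layer = net.layers[i];
        const float *layerInputs = (i == 0) ? net.batchInputs
                                            : net.layers[i - 1].outputs;
        gradientKernel<<<gridFor(layer.numInputs, layer.numNodes), block>>>(
            layer.deltas, layerInputs, layer.gradients,
            net.batchSize, layer.numInputs, layer.numNodes);

        // The update itself could be plain SGD or a policy such as Adam/rmsprop.
        unsigned numWeights = layer.numNodes * layer.numInputs;
        updateWeightsKernel<<<(numWeights + 255) / 256, 256>>>(
            layer.weights, layer.gradients, numWeights, learningRate);
    }
}
```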

Forward Pass Kernel

This kernel computes the output of a given layer based on the output of the previous layer, the connection weights, and the activation function. The main computation is essentially a matrix multiplication of the weights and the matrix holding the mini-batch outputs of the previous layer. The weights matrix is (m rows) x (n columns), where m is the current layer size and n is the previous layer size. The input data is (a rows) x (b columns), where a is the mini-batch size and b is the size of the previous layer. Each row of this matrix is a single input vector. The kernel’s algorithm is as follows:

  • Copy a chunk of the weight matrix and input matrix into shared memory (for fast access).
  • Compute the product of the two chunks and increment the corresponding elements in the output matrix.
  • Shift the chunk window along in the weights matrix and input matrix.
  • After the two matrices have been multiplied, compute and store the activation function (eg: logistic, ReLU, etc) for each node in the output matrix. Remember the output for a layer is a matrix rather than a vector because we are processing a whole mini-batch at a time.
  • In computing and storing the activation value for each node, we also compute and store the derivative of the activation function.

[Figure: Forward pass kernel]

Really the only complex part of this code is the shared-memory based “matrix multiplication” of the mini-batch input with the weights (I didn’t implement it exactly as a matrix multiplication due to the way the mini-batch data is arranged, but it’s very close). There is a pretty good explanation of the tiled matrix multiplication algorithm for CUDA on the NVidia developer blog. The main difference is that in the neural network code the right-hand-side matrix is traversed in row-major order.
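
To make the structure concrete, here is a minimal sketch of such a kernel. It is not the project’s actual code: it assumes row-major storage, a fixed 16x16 tile, a sigmoid activation, and it omits bias terms for brevity.

```cpp
#define TILE_SIZE 16

// Hypothetical sigmoid activation; the real code supports several activation
// functions, selected per layer.
__device__ inline float sigmoidActivation(float x) {
    return 1.0f / (1.0f + expf(-x));
}

// Sketch of a forward-pass kernel: computes output = activation(input * weights^T)
// for one layer over a whole mini-batch, storing the activation derivative as it goes.
__global__ void forwardPassKernel(const float *input,    // batchSize x prevSize
                                  const float *weights,  // layerSize x prevSize
                                  float *output,         // batchSize x layerSize
                                  float *derivative,     // batchSize x layerSize
                                  int batchSize, int prevSize, int layerSize) {
    // +1 column of padding so column-wise reads do not cause shared-memory
    // bank conflicts (see the Performance section).
    __shared__ float inTile[TILE_SIZE][TILE_SIZE + 1];
    __shared__ float wTile[TILE_SIZE][TILE_SIZE + 1];

    int row = blockIdx.y * TILE_SIZE + threadIdx.y;  // sample in the mini-batch
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;  // node in the current layer

    float sum = 0.0f;
    int numTiles = (prevSize + TILE_SIZE - 1) / TILE_SIZE;

    for (int t = 0; t < numTiles; t++) {
        int k = t * TILE_SIZE;
        int wRow = blockIdx.x * TILE_SIZE + threadIdx.y;  // node whose weights this thread loads

        // Copy a chunk of the input matrix and the weight matrix into shared
        // memory. Both are walked along their rows (row-major traversal).
        inTile[threadIdx.y][threadIdx.x] =
            (row < batchSize && k + threadIdx.x < prevSize)
                ? input[row * prevSize + k + threadIdx.x] : 0.0f;
        wTile[threadIdx.y][threadIdx.x] =
            (wRow < layerSize && k + threadIdx.x < prevSize)
                ? weights[wRow * prevSize + k + threadIdx.x] : 0.0f;
        __syncthreads();

        // Multiply the two chunks and accumulate into this thread's output element.
        for (int i = 0; i < TILE_SIZE; i++)
            sum += inTile[threadIdx.y][i] * wTile[threadIdx.x][i];
        __syncthreads();
    }

    if (row < batchSize && col < layerSize) {
        float a = sigmoidActivation(sum);                     // bias terms omitted
        output[row * layerSize + col] = a;
        derivative[row * layerSize + col] = a * (1.0f - a);   // sigmoid derivative
    }
}
```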

Layer Delta Kernel

Once the forward pass kernel has been applied iteratively to each layer in order, the layer delta kernel is applied in reverse, going backward through the layers starting at the output layer. The output layer is treated as a special case; for every other layer, the delta kernel computes the delta of the current layer given the delta of the following layer, the connecting weights, and the derivatives of the current layer’s activation function for the current mini-batch. The implementation of this kernel is almost identical to the forward pass kernel in that it involves a pseudo matrix multiplication using shared memory chunks, except that in this case we are multiplying the transpose of the weight matrix by the delta of the next layer (the layer closer to the output). Once we have this product, we multiply each node value by the derivative of the activation function, which was already computed by the forward pass kernel.
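
A simplified, non-tiled sketch of this kernel is shown below (illustrative names; it indexes the transposed weights on the fly, whereas the actual code uses a separate transpose kernel and shared-memory tiling as in the forward pass):

```cpp
// Simplified sketch of the layer delta computation: one thread per
// (sample, node) element of the current layer's delta matrix.
__global__ void layerDeltaKernel(const float *nextWeights, // nextSize x layerSize
                                 const float *nextDeltas,  // batchSize x nextSize
                                 const float *derivatives, // batchSize x layerSize
                                 float *deltas,            // batchSize x layerSize
                                 int batchSize, int layerSize, int nextSize) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // sample in the mini-batch
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // node in this layer
    if (row >= batchSize || col >= layerSize) return;

    // (W^T * delta): accumulate over the nodes of the following layer.
    float sum = 0.0f;
    for (int k = 0; k < nextSize; k++)
        sum += nextWeights[k * layerSize + col] * nextDeltas[row * nextSize + k];

    // Multiply by the activation derivative stored during the forward pass.
    deltas[row * layerSize + col] = sum * derivatives[row * layerSize + col];
}
```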

Gradient Kernel

The last main part of the back-propagation algorithm is handled by the gradient kernel. This kernel takes as input the layer deltas and the layer outputs for the current mini-batch, and outputs the average gradient of the layer’s weights with respect to the error over the mini-batch. Again this is done by a process very similar to a matrix multiplication, using shared memory chunks to multiply the layer delta matrix with the layer output matrix (remember these are matrices because we are dealing with a mini-batch of deltas and layer outputs). The result is a summed gradient over the weights of the layer, which is then divided by the batch size to get the average gradient.
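
Again, a simplified, non-tiled sketch (the actual implementation tiles this with shared memory; here prevOutputs holds the outputs feeding into the layer, i.e. the previous layer’s outputs, and all names are illustrative):

```cpp
// Simplified sketch of the gradient kernel: one thread per weight of the layer.
__global__ void gradientKernel(const float *deltas,      // batchSize x layerSize
                               const float *prevOutputs, // batchSize x prevSize
                               float *gradients,         // layerSize x prevSize
                               int batchSize, int prevSize, int layerSize) {
    int node = blockIdx.y * blockDim.y + threadIdx.y;  // node in the current layer
    int in   = blockIdx.x * blockDim.x + threadIdx.x;  // node in the previous layer
    if (node >= layerSize || in >= prevSize) return;

    // Sum the per-sample gradients over the mini-batch ...
    float sum = 0.0f;
    for (int b = 0; b < batchSize; b++)
        sum += deltas[b * layerSize + node] * prevOutputs[b * prevSize + in];

    // ... then divide by the batch size to get the average gradient.
    gradients[node * prevSize + in] = sum / (float)batchSize;
}
```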

This gradient kernel is applied to each layer to get the weight gradients over the entire network. These can then be used to update the weights as part of a higher-level gradient descent policy (eg: rmsprop, Adam, etc). I won’t really cover these policies here as they are relatively straightforward to implement as kernels.

Miscellaneous Kernels

There are a bunch of other kernels that were needed for miscellaneous tasks. These include a kernel for transposing the weight matrix of a layer (which is needed for the layer delta kernel), a kernel for computing the softmax output of a layer, and kernels for applying the Adam update rule to the network weights. But for the most part these are fairly straightforward.
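
As an example of how straightforward these are, here is a minimal sketch of an Adam update kernel (hypothetical names; the bias-correction factors beta1^t and beta2^t are computed on the host and passed in):

```cpp
// Minimal sketch of an Adam weight-update kernel: one thread per weight.
// m and v are the per-weight first and second moment estimates, kept in
// device memory between iterations.
__global__ void adamUpdateKernel(float *weights, const float *gradients,
                                 float *m, float *v, int numWeights,
                                 float lr, float beta1, float beta2,
                                 float epsilon, float beta1t, float beta2t) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numWeights) return;

    float g = gradients[i];
    m[i] = beta1 * m[i] + (1.0f - beta1) * g;      // first moment (mean)
    v[i] = beta2 * v[i] + (1.0f - beta2) * g * g;  // second moment

    float mHat = m[i] / (1.0f - beta1t);           // bias-corrected estimates
    float vHat = v[i] / (1.0f - beta2t);

    weights[i] -= lr * mHat / (sqrtf(vHat) + epsilon);
}
```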

Performance

To measure the performance of the CUDA implementation, I decided to compare it against my CPU implementation on the MNIST hand-written digits task. In this case the neural network has 784 inputs and 10 output classes. I didn’t want to mess around too much with different setups, so I decided to just try two network configurations (2 hidden layers and 3) and vary the mini-batch size across 3 different values. The CPU of my laptop is a quad-core Intel i7-4710HQ @ 2.50GHz (giving 8 threads of execution due to Hyperthreading). The GPU is an NVidia 970M. The results are presented in the two diagrams below, measuring mini-batch iterations per second (so higher is better).

[Figure: Back propagation performance of CUDA vs CPU]

[Figure: Back propagation performance of CUDA vs CPU]

That’s some impressive performance compared to using the CPU: the speedup is in the 50x ballpark. Looking at the pipeline of GPUs coming up, it’ll be quite interesting to see how far this gap widens, especially with the introduction of 16-bit floating point operations on the latest NVidia cards (which should give up to 2x the performance of 32-bit FP). As a small aside, one interesting finding for me was the large effect of avoiding shared-memory bank conflicts: simply adding a little bit of padding to each row in the shared memory chunk improved performance by 2x. Note for the future.
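
Concretely, the padding amounts to something like the following fragment inside the kernels, assuming the 16x16 tiles sketched above (TILE_SIZE = 16):

```cpp
// Inside the kernel body, the only change is one extra column per tile row.
__shared__ float tile[TILE_SIZE][TILE_SIZE];            // column-wise reads can land on the same bank
__shared__ float paddedTile[TILE_SIZE][TILE_SIZE + 1];  // padded: column-wise reads spread across banks
```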

The code for this project can be found on Github: here (warning: code is experimental and may be slightly messy).

Written on June 19, 2016