## Efficient convolution on multi-dimensional data (part 2)

In part 1 of this post I talked about a general method for performing faster convolution operations on GPU hardware. This was achieved by reordering the memory operations to get a higher utilization of the floating point units.

Today we will continue by taking a look at a few specific implementations of convolution operations, complete with the OpenCL code, and see how we can go from a < 10 % utilization to closer to 65% when operating on a 3D dataset and performing convolutions with multiple different convolution kernels.

#### Basic assumptions

The input image is 256 * 256 * 256 intensity values (unsigned char). The kernel size is given as a CL compile time variable and a set of N kernels to perform convolution upon is given to the kernel as a packed array of floats in KXYZ order (K for Kernel, XYZ for the 3 dimensions).

## The first naive implementation

We can implement a naive convolution operation by using one work-item for each target pixel and using an accumulator that iterates over all source pixels within the kernel size window. In the first implementation we use an explicit array for the different accumulators.
/* Expected DEFINE's from the compilation */
/* kernsize -- the size of the filtering kernel in each direction */
/* nkernels -- the number of filters to use */

/* Result is only defined in the area defined by [0 ... (imagesize - kernsize)]
This means that the kernels are not centered around the input pixel
but rather gives offsets the output image slightly. You need to
shift it back manually afterwards if you care for this
*/

kernel void convolute(int4 imagesize, global unsigned char *input,
global unsigned char *output, global float *filter) {
int4 gid = (int4)(get_global_id(0),  get_global_id(1),  get_global_id(2),  0);
int4 lid = (int4)(get_local_id(0), get_local_id(1), get_local_id(2), 0);
int4 group = (int4)(get_group_id(0), get_group_id(1), get_group_id(2), 0);
int4 pixelid = gid;

// Starting offset of the first pixel to process
int imoffset = pixelid.s0 + imagesize.s0 * pixelid.s1 +
imagesize.s0 * imagesize.s1 * pixelid.s2;
int i;

/* The naive way of doing convolutions */
if(gid.s0 + kernsize > imagesize.s0 ||
gid.s1 + kernsize > imagesize.s1 ||
gid.s2 + kernsize > imagesize.s2) return;
int dx,dy,dz;

float val[nkernels];
for(i=0;i<nkernels;i++) val[i]=0.0;
for(dz=0;dz<kernsize;dz++)
for(dy=0;dy<kernsize;dy++)
for(dx=0;dx<kernsize;dx++) {
unsigned char raw = input[imoffset+dx+dy*imagesize.s0 +
dz*imagesize.s0*imagesize.s1];
for(i=0;i<nkernels;i++) {
val[i] += raw * filter[i+nkernels*(dx+kernsize*dy+kernsize*kernsize*dz)];
}
}

for(i=0;i<nkernels;i++)
output[imoffset*nkernels+i] = val[i];
}


As you can see by running the OpenCL algorithm above multiple times you get an increase in effective speed when performing multiple convolutions at the same time since the input variable raw does not need to be read for each convolution kernel.

When executed 1000 times on a AMD 6970 GPU hardware on a 256*256 *256 dataset with kernelsizes of 7*7*7 we gain an execution speed ranging from 0.184 seconds per convolution kernel (for one kernel) down to 0.022 seconds per convolution kernel (for 32 kernels).

Note especially the bump in the timing chart above occuring for 3 kernels. This bump is not a sampling problem but rather caused by each kernel writing out an uneven number of char values at the end of each computation. (Even though it seems like each work item is writing to memory individually the writes for a unit is coalesced together, but something seems to happen exactly for the case of writing out 3 bytes at a time).

## The second naive implementation

Another way of implementing the same naive convolution algorithm is to use the OpenCL built-in vector data types (eg. uchar2, float4, ...) and operations thereupon instead of an explicit array with for-loops. This gives the implementation below.

#### Macro for vector operations

As in the previous implementation we will use a set of macros to define the datatypes kernf for a vector of floats that matches the number of kernels, kernuc for a corresponding vector of unsigned char's. We define the macro kernstore to write a vector of unsigned chars to the memory and convert_kernuc to convert (type casting for vectors) floating point values to unsigned chars.
/* Preprocessor settings to define types that can
process multiple convolution kernels the same time. */
#if nkernels == 1
typedef float kernf;
typedef uchar kernuc;
#define kernstore(val,offset,arr) arr[offset]=val
#define convert_kernuc convert_uchar
#elif nkernels == 2
typedef float2 kernf;
typedef uchar2 kernuc;
#define kernstore vstore2
#define convert_kernuc convert_uchar2
#elif nkernels == 3
typedef float3 kernf;
typedef uchar3 kernuc;
#define kernstore vstore3
#define convert_kernuc convert_uchar3
#elif nkernels == 4
typedef float4 kernf;
typedef uchar4 kernuc;
#define kernstore vstore4
#define convert_kernuc convert_uchar4
#elif nkernels == 8
typedef float8 kernf;
typedef uchar8 kernuc;
#define kernstore vstore8
#define convert_kernuc convert_uchar8
#elif nkernels == 16
typedef float16 kernf;
typedef uchar16 kernuc;
#define kernstore vstore16
#define convert_kernuc convert_uchar16
#else
#error "nkernels should be one of: 1,2,3,4,8,16"
#endif


#### Caching the filter in local memory

Another change we can do to (attempt) to improve the performance is to load the filter values into local memory before the actual convolution operations. Although this seem to minimize the number of times the filter values are read from global memory (once for the whole filter per compute unit, as opposed to once per work-item) it has no effect at all as long as we only perform one convolution per work-item.

  /* Copy global filter to local memory */
local kernf filter[kernsize*kernsize*kernsize];
event_t event = async_work_group_copy(filter,filterG,kernsize*kernsize*kernsize,0);
wait_group_events(1, &event);


This can be seen in the run-time analysis below where the second implementation performs worse than the first implementation. The reason for this lack of improvement is again most likely due to the coalescing of memory operations from each compute device, leading to the first implementation also reading the filter data exactly once per compute unit.

Nevertheless, we choose to use this explicit load of the filter data since it will come very much in handy in the final implementation.

#### Full Kernel code

The full code of the kernel minus the macro's from above and the compile time defines.

kernel void convolute(int4 imagesize, global unsigned char *input,
global unsigned char *output, global kernf *filterG) {
int4 gid = (int4)(get_global_id(0),  get_global_id(1),  get_global_id(2),  0);
int4 lid = (int4)(get_local_id(0), get_local_id(1), get_local_id(2), 0);
int4 group = (int4)(get_group_id(0), get_group_id(1), get_group_id(2), 0);

// First (?) pixel to process with this kernel
int4 pixelid = gid;

// Starting offset of the first pixel to process
int imoffset = pixelid.s0 + imagesize.s0 * pixelid.s1 +
imagesize.s0 * imagesize.s1 * pixelid.s2;
int i;

/* Copy global filter to local memory */
local kernf filter[kernsize*kernsize*kernsize];
event_t event = async_work_group_copy(filter,filterG,kernsize*kernsize*kernsize,0);
wait_group_events(1, &event);

if(gid.s0 + kernsize > imagesize.s0 ||
gid.s1 + kernsize > imagesize.s1 ||
gid.s2 + kernsize > imagesize.s2) return;
int dx,dy,dz;

kernf val = (kernf)(0.0);
for(dz=0;dz<kernsize;dz++)
for(dy=0;dy<kernsize;dy++)
for(dx=0;dx<kernsize;dx++) {
unsigned char raw = input[imoffset+dx+dy*imagesize.s0 +
dz*imagesize.s0*imagesize.s1];
val += raw * filter[dx+kernsize*dy+dz*kernsize*kernsize];
}
kernstore( convert_kernuc(val), imoffset, output);
}


As we will see, this second implementation uses some macro definitions to use the appropriate floatX datatype (eg. float4), and it explicitly caches the filter data in local memory.  The results of these two operations give execution speeds ranging from 0.093 seconds per convolution kernel (for 1 kernel) down to 0.0365 seconds (for 16 kernels).

In the above graph of execution speed we can see that little happens after the point of using 4 kernels simultaneously. It is also at this point that the first implementation becomes faster than the second implementation.

So, are these two implementations good enough? 0.01 seconds to perform a convolution requiring 5.7 billion (256*256*256*7*7*7) operations certainly seems fast. To answer this question we will plot the execution times vs. the theoretically maximum (2.52 Tflops) performance of the target GPU.

The computation above assumes that a FMAC (Fused Multiply Add to accumulator) operation counts as a single floating point operation and demonstrates that our implementation is memory starved since we only can utilize roughly 10% of the theoretical maximum.

## Reordering the memory operations for efficiency

Next we will implement the methods described in my previous post in which we let each work item be responsible for the convolution operations for multiple pixels, and where we by reordering the convolution operations can reuse the memory fetch of the same input values for the results of multiple different output values.

Before we start, we will take a careful look at some of the macros in use.

#### Macro for vector operations

As in the previous implementation we will use a set of macros to define the datatypes kernf for a vector of floats that matches the number of kernels, kernuc for a corresponding vector of unsigned char's. We define the macro kernstore to write a vector of unsigned chars to the memory and convert_kernuc to convert (type casting for vectors) floating point values to unsigned chars.

Use one of these three definitions to perform the multiply-add operation.

Second most exact (or tied with fma), but fastest due to use of FMAC instructions.
#define mmad(x,y,z) (x+y*z)

Undefined precision (for some cases this can be very very wrong)
#define mmad(x,y,z) mad(x,y,z)

Guaranteed to be the most exact
#define mmad(x,y,z) fma(x,y,z)

#### Loop unrolling/reordering through macro expansion / optimizer

Before we give the final code for the efficient implementation we will study the last two macros used in the code. We start by noting that since the kernel will now be reasonsible for computing the outputs for ko number of destination pixels we need as many accumulators.

  kernf val[CONV_UNROLL];


Ideally we would like to completely unroll the intermost loop (dx=0 ... kernelsize) from the implementations above and to extend it by the loop unrolling factor (dx=0 ... kernelsize+ko). Furthermore, the convolution steps should now use different filter positions to increment the different accumulators - but can do so using the same raw value. In pseudo code:

  raw = pixel(x+0,y+dy,z+dz)
val[0] += filter[0][dy][dz] * raw
raw = pixel(x+1,y+dy,z+dz)
val[0] += filter[1][dy][dz] * raw
val[1] += filter[0][dy][dz] * raw
...


We note that the first raw value will be used by one target value, the second by two etc. until the 7'th (for kernelsize=7) raw value which will be used by all. After the 7'th value the first target pixel will not need the raw data and the number of used filters will diminish by one if ko<kernsize.

If we where to manually write the code that is needed for a kernelsize of 7 and a convolution unroll of 16 we would need 335 lines of code (23 raw values, and 7*16-7*8 multiplication steps). Instead of doing this by hand for each possible case of kernelsize and loop unrolling we let the macro processor expand these for us. We note that the multiplication for a single step can be written as follows if we let pos be the dx value and ko be one instance of the convolution unroll values.

      if(pos-ko >= 0 && pos-ko < kernsize) {
}

Now, since all of the expressions in the if-statement are compile time constants we can safely use macro expansion to output the lines above for all possible values of pos and ko. The optimizer will remove the actual if statements and only keep the multiplication code for the values that satisfy the if statements.

Macro expanding out the code above for all values of pos and ko is done through the two sets of macros MAD(ko,pos) and MADS(pos).

#### Final efficient implementation of convolutions

Finally the full source code for this

/* Expected DEFINE's from the compilation */
/* kernsize -- the size of the filtering kernel in each direction */
/* nkernels -- the number of filters to use */
/* CONV_UNROLL -- amount of unrolling to perform */

/* Result is only defined in the area defined by [0 ... (imagesize - kernsize - CONV_UNROLL)]
This means that the kernels are not centered around the input pixel but rather gives offsets the output image slightly.
You need to shift it back manually afterwards if you care for this.
*/

/* Preprocessor settings to define types that can process multiple convolution kernels the same time.
*/
#if nkernels == 1
typedef float kernf;
typedef uchar kernuc;
#define kernstore(val,offset,arr) arr[offset]=val
#define convert_kernuc convert_uchar
#elif nkernels == 2
typedef float2 kernf;
typedef uchar2 kernuc;
#define kernstore vstore2
#define convert_kernuc convert_uchar2
#elif nkernels == 3
typedef float3 kernf;
typedef uchar3 kernuc;
#define kernstore vstore3
#define convert_kernuc convert_uchar3
#elif nkernels == 4
typedef float4 kernf;
typedef uchar4 kernuc;
#define kernstore vstore4
#define convert_kernuc convert_uchar4
#elif nkernels == 8
typedef float8 kernf;
typedef uchar8 kernuc;
#define kernstore vstore8
#define convert_kernuc convert_uchar8
#elif nkernels == 16
typedef float16 kernf;
typedef uchar16 kernuc;
#define kernstore vstore16
#define convert_kernuc convert_uchar16
#elif nkernels == 32
typedef float32 kernf;
typedef uchar32 kernuc;
#define kernstore vstore32
#define convert_kernuc convert_uchar32
#else
#error "nkernels should be one of: 1,2,3,4,8,16,32"
#endif

/* Use one of these three definitions to perform the multiply-add operation */
#define mmad(x,y,z) (x+y*z)       // Second most exact (or tied with fma), but fastest due to use of FMAC (FMA-accumulator) instruction
//#define mmad(x,y,z) mad(x,y,z)  // Undefined precision (for some cases this can be very very wrong)
//#define mmad(x,y,z) fma(x,y,z)  // Guaranteed to be the most exact

kernel void convolute(int4 imagesize, global unsigned char *input,
global unsigned char *output, global kernf *filterG) {
int4 gid = (int4)(get_global_id(0)*CONV_UNROLL,  get_global_id(1),  get_global_id(2),  0);
int4 lid = (int4)(get_local_id(0), get_local_id(1), get_local_id(2), 0);
int4 group = (int4)(get_group_id(0), get_group_id(1), get_group_id(2), 0);
// First (?) pixel to process with this kernel
int4 pixelid = gid;

// Starting offset of the first pixel to process
int imoffset = pixelid.s0 + imagesize.s0 * pixelid.s1 + imagesize.s0 * imagesize.s1 * pixelid.s2;
int i,j;

int dx,dy,dz;

/* MAD performs a single convolution operation for each kernel,
using the current 'raw' value as the input image
'ko' as an instance of an unrolled convolution filter
'pos' as the X-offset for each of the unrolled convolution filters
Note that all the if statements dependent only on static values -
meaning that they can be optimized away by the compiler
*/
if(pos-ko >= 0 && pos-ko < kernsize) {    \
}}}
raw=input[imoffset2+pos];       \
}}

kernf val[CONV_UNROLL];
for(j=0;j<CONV_UNROLL;j++)
val[j]=(kernf)(0.0);

int localSize = get_local_size(0) * get_local_size(1) * get_local_size(2);
local kernf filter[kernsize*kernsize*kernsize];

/* Copy global filter to local memory */
event_t event = async_work_group_copy(filter,filterG,kernsize*kernsize*kernsize,0);
wait_group_events(1, &event);

if(gid.s0 + kernsize + CONV_UNROLL > imagesize.s0 ||
gid.s1 + kernsize > imagesize.s1 ||
gid.s2 + kernsize > imagesize.s2) return;

for(dz=0;dz<kernsize;dz++)
for(dy=0;dy<kernsize;dy++)  {
int offset = dy*kernsize*nkernels + dz*kernsize*kernsize*nkernels;
int imoffset2 = imoffset+dy*imagesize.s0 + dz*imagesize.s0*imagesize.s1;
unsigned char raw;

/* kernsize + convolution_unroll < 42 */
}

for(j=0;j<CONV_UNROLL;j++) {
kernstore( convert_kernuc(val[j]), imoffset+j, output);
}
}


#### Analysis of the full implementation

Now, let's take a look at what efficiency we can gain using this. We start by looking at again at the GPU utilization as a function of the level of convolution unrolling performed and with three different settings for the workgroup size. These measurements have been done using the same hardware as above and have fixed the number of convolution kernels used to 8.

As you can see we gain a speed advantage up to roughly 65% of the theoretical maximum for these settings and for this input kernel size. For other values we can gain slightly higher of lower efficiencies. For reference the optimal values occur for a convolution unroll of 16 which gives 0.004 seconds per convolution kernel.

To compare this method with the naive method I also give you the below graph which shows the efficiency as a function number of convolution kernels. With a fixed workgroup size of 16,16,1 and 4 kernels.

### Conclusions

So what conclusions can we draw from this? Well, to start with:

1) A naive convolution operation is memory starved when perform on the GPU. By rearranging the order of the innermost steps we can gain a significant performance gain.

2) Using float3/uchar3 when writing out to memory is a bad idea.

3) Performing a local cache of the filter data is pointless if the data is only used once -- even if it is used once for each work item, as long as they access each item simultaneously.

4) The optimizer can be trusted to remove dead code when using if statements with only compile time constants. This can be utilized to simplify the creation of repetitive amounts of code such as loop unrolling through the use of macro expansions.

## Efficient convolution of multi-dimensional data

For the first real content post I will introduce a method for improving the memory bandwidth efficiency when performing basic convolution operations on multidimensional data (eg. 2D/3D/4D datasets). I presented the basic method for this as a side note in the publication below, but without much technical details -- which we can look at in this post instead.

M. Broxvall, K. Emilsson, P. Thunberg. Fast GPU Based Adaptive Filtering of 4D Echocardiography. IEEE Transactions on Medical Imaging , 2011, DOI 10.1109/TMI.2011.2179308

I will assume that you know how a convolution operation is defined and the purposes it has in image processing. Furthermore, I will assume that the convolution filter size (to avoid confusion I will call convolution kernels as convolution filters as to avoid a confusion with the notion of kernels from OpenCL) is much smaller than the image size that you are trying to convolve.

Assume that we have the convolution filter $$f$$ of size $$\vec k$$, an image $$M$$ and want to calculate the convolution $$f * M$$. A typical approach to this would be to launch one OpenCL kernel for each output point $$\mbox{out}$$ and to perform the computation below:

$$\mbox{out}(\vec p) = 0$$
for $$j_1$$ = 0 ... $$k_1$$ do
...
for $$j_n$$ = 0 to $$k_n$$ do
$$\mbox{out}(\vec p) \leftarrow \mbox{out}(\vec p) + f(\vec j) M(\vec p + \vec j)$$

With this formulation we will need to perform $$\prod_i k_i$$ multipllications and additions, as well as the same number of memory fetch operations reading from the input image $$M$$. If we would perform, for instance, a convolution of size 10 over a 3D dataset we thus require 1000 memory fetches and 1000 floating point multiply-add operations for each generated output.

The multiply-add operations can often be performed faster than the equivalent two floating point operations by using dedicated hardware, and is exposed in the OpenCL standard through the operations mad or mad24.

Now, assuming that our hardware does not cache the input image we will have a problem. This is the case on many GPUs when you are using global memory buffers as opposed to OpenCL image objects + samplers. The reason for this has to do with how the hardware is optimised to texture memory accesses and a bit outside the scope of this post.

Although modern GPUs have high bandwidth to onboard memory on the order of 160 GB/s (ATI 6950) they have even higher floating point capacities (2.25Tflops) the convolution operation above is memory bound. Assuming that the $$out$$ and $$f$$ variables are stored in on-board memory the naive  algorithm above would be capable of at most $$4 \times 10^{10}$$ steps of the innermost loop and thus only utilize 1-2% of the total computational capacity.

In order to optimize on this and gain better performance we note that the same input image samples will contribute to many different output values, but will be multiplied with different filter coefficients before doing so. An obvious interpolation is thus to re-use the same input image values and use them to compute multiple output values.

The extreme case of this would be to load the whole image into high-speed memory (GPU registers or shared memory) -- but this would obviously not be possible for anything but trivial image sizes. However, we can do this by redesigning our compute kernels to compute $$k_o$$ number of outputs for each kernel and to re-use the same input image values for multiple output computations. We extend here the convolution filter $$f$$ to contain zeroes at all points outside the original convolution filter (in the code we can do this more flexible).

$$\mbox{out}(\vec p + \overline{(0,...,0,0)}) = 0$$
...
$$\mbox{out}(\vec p + \overline{(0,...,0,k_o)}) = 0$$
for $$j_1$$ = 0 ... $$k_1$$ do
...
for $$j_n$$ = 0 to $$k_{n-1}$$ do
for $$j_n$$ = 0 to $$k_n + k_o - 1$$ do
$$v \leftarrow M(\vec p+\vec j)$$
$$\mbox{out}(\vec p + \overline{(0,...,0,0)}) \leftarrow \mbox{out}(\vec p + \overline{(0,...,0,0)}) + v f(j - \overline{(0,...,0,0)})$$
...
$$\mbox{out}(\vec p + \overline{(0,...,0,k_o)}) \leftarrow \mbox{out}(\vec p + \overline{(0,...,0,0)}) + v f(j - \overline{(0,...,0,k_o)})$$

The above code requires $$k_1 ... k_n-1 (k_n + k_o - 1)$$ memory accesses to compute $$k_o$$ number of outputs. We thus gain a (theoretical) speed up of a factor of $$k_n k_o / (k_n k_o - 1)$$ times, which for large sizes of the convolution kernel goes to $$k_o$$.

So, now the obvious question: do we realy get a speedup factor of $$k_o$$ and where does the speedup tap out.

The short answer to this is: it depends on the number of available hardware registers. With larger values of $$k_o$$ we with exhaust the number of available hardware registers and the GPU will schedule fewer kernel invokations in parallel on each compute device.

Coming next: code + numbers demonstrating this speedup.

## Hello World\n

The purpose of this blog is to collect some thought, notes and algorithms of mine regarding GPGPU computations in general and some specific neat OpenCL code. We'll start with a small summary of the concepts and links to a few relevant sources to get you started with writing OpenCL code -- the intention of which is mostly for me remember the links and not to make a complete getting started tutorial in OpenCL.

First of, GPGPU stands for general purpose GPU (computation) where GPU in term stands for graphic processing unit. These units are the very powerful massively parallel processors that are in use today mainly for rendering graphics, but which incidentally also happens to be good for solving other parallelizable tasks requiring mostly floating point operations. When looking at the raw computational performance of these devices they are often one or two orders of magnitude (10-100) times faster than modern CPU's. However, the power of these devices come mainly from a higher level of parallelism rather than the clock frequency. As such, programming these devices efficiently is a challenge and require very different development tools as well as algorithms in order to be efficient.

CUDA is a language specific toolkit developed by NVidia that allows the programmer to perform general purpose computations (convolutions, fourier transforms, signal processing, bitcoin mining, ...) of NVidia cards. Since CUDA was one of the first widespread toolkits dedicated to GPGPU computations it has a large adoption in the high-performance computing community and there exists many GPU cluster machines (supercomputers build from GPU's) that perform computationally heavy tasks.

The main drawback (IMHO) of CUDA is that it restricts the user to NVidia specific platforms, and that it lacks support for some features. The main advantage is the ease of use coming from the mature toolkit and the integration into the programming language itself. (The later is also a drawback with it...)

OpenCL is an open standard maintained by Khronos (the same group that develops OpenGL) that serves to give a standardized interface for compute devices to be used from any programming language through a standardized API interfacing to one or more drivers. A compute device can here be a GPU or any other device that can perform parallel computations such as a multi-core CPU or a Cell-processor.

The main advantage (IMHO) with OpenCL is that is an Open Standard, that it works with a wide range of devices and that is (host) language agnostic. The main drawback, is that it is language agnostic. Using OpenCL directly from your C/C++/Python etc. code means a significant overhead in glue code - but once  you have this in place it gives you alot of flexibility as well as very good interfacing with OpenGL.

For the purpose of this blog I will write mainly on the development of algorithms and neat tricks for OpenCL and I use mainly AMD/ATI's OpenCL drivers and to some lesser extent Intel's OpenCL drivers.

• AMD 5870 / 5870M / 6870 cards running on AMD APP drivers
• AMD CPUs (x4, x6)  running on AMD APP drivers
• Intel core i7 CPUs (x4)  running on AMD APP drivers
• Intel core i7 CPUs (x4)  running on Intel's OpenCL drivers

In addition to the above device/driver combinations a common other combination is to use NVidia's OpenCL drivers. Currently I don't have access to modern NVidia hardware, but perhaps I'll get one for measuring differences at a later point.