Tuesday, April 6, 2010

Encapsulation on the GPU

Encapsulation is the separation of logical components by abstract interfaces. Though commonly touted as an object-oriented feature, it can be implemented in pure C and thus in pure OpenCL as well. However, encapsulation in OpenCL breaks down when arbitrary pointers are needed. The only way to obtain these pointers is to pass them as kernel arguments.

Example: Optimizers

A simplified optimizer has two components: an objective function, and an optimizer that repeatedly calls the objective function. The implementation in pseudo-code OpenCL looks like the following:



Note that OpenCL doesn't support function pointers, so the function ObjectiveFunction is used instead to dispatch to the appropriate objective function code.
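The optimizer and objective function split described above can be sketched roughly as follows. This is a minimal illustration and not the original listing: the selector constants, the two toy objectives, the fixed-step search, and the name OptimizeKernel are all assumptions made for the example. The kernel itself is wrapped in a check for __OPENCL_VERSION__ so that the helper functions can also be compiled and exercised as plain C on the host.

```c
/* Hypothetical selectors; the names are illustrative only. */
enum { OBJECTIVE_SPHERE = 0, OBJECTIVE_ABS = 1 };

float Sphere(float x) { return x * x; }
float AbsValue(float x) { return x < 0.0f ? -x : x; }

/* Stands in for a function pointer: picks the objective by selector. */
float ObjectiveFunction(int kind, float x)
{
    switch (kind) {
    case OBJECTIVE_SPHERE: return Sphere(x);
    case OBJECTIVE_ABS:    return AbsValue(x);
    default:               return 0.0f;
    }
}

/* Toy optimizer: a fixed-step search that repeatedly calls the objective
   and moves toward whichever neighbour has the lower value. */
float Optimize(int kind, float start, float step, int iterations)
{
    float x = start;
    for (int i = 0; i < iterations; ++i) {
        float here  = ObjectiveFunction(kind, x);
        float left  = ObjectiveFunction(kind, x - step);
        float right = ObjectiveFunction(kind, x + step);
        if (left < here && left <= right)
            x -= step;
        else if (right < here)
            x += step;
    }
    return x;
}

#ifdef __OPENCL_VERSION__
/* Device entry point: one independent optimization per work-item. */
__kernel void OptimizeKernel(int kind, float step, int iterations,
                             __global const float *starts,
                             __global float *results)
{
    size_t i = get_global_id(0);
    results[i] = Optimize(kind, starts[i], step, iterations);
}
#endif
```

The switch inside ObjectiveFunction is the only place that knows which objectives exist, so the optimizer loop stays unaware of them.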

The problem

In OpenCL, a pointer to device memory can only be obtained as a bare kernel argument. What would be preferable is to be able to write a kernel like the following:



On the host the equivalent of the ObjectiveFunctionOptions struct could be written using the appropriate OpenCL types like the following:



Unfortunately, OpenCL implementations aren't smart enough to introspect into a struct argument and properly set cl_mem data types to their associated buffer or image pointers. Furthermore, I can't find anything in the standard stating that they should be this smart.

Possible solution: Enumerate them as kernel arguments

The first possible solution is to forget about encapsulation and turn everything that has to be a kernel argument into one. The kernel for the above would look like the following:
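A rough sketch of what such a flattened kernel might look like, assuming a hypothetical integer bit mask called features that the host sets alongside the data arguments. The argument names component1Data and component2Data follow the text; the feature bits, the sampler, and the arithmetic are invented for illustration. The empty __global definition exists only so the helper also compiles as plain C on the host.

```c
/* Host-only shim: lets the helper below also compile as plain C. */
#ifndef __OPENCL_VERSION__
#define __global
#endif

/* Hypothetical feature bits chosen by the host. */
#define FEATURE_COMPONENT1 1
#define FEATURE_COMPONENT2 2

/* component1Data is dereferenced only when its feature bit is set, so the
   host may pass a NULL buffer when that feature is off. */
float ObjectiveFunction(int features, __global const float *component1Data,
                        float component2Value, float x)
{
    float value = x * x;
    if (features & FEATURE_COMPONENT1)
        value += component1Data[0] * x;
    if (features & FEATURE_COMPONENT2)
        value += component2Value;
    return value;
}

#ifdef __OPENCL_VERSION__
__constant sampler_t kSampler = CLK_NORMALIZED_COORDS_FALSE |
                                CLK_ADDRESS_CLAMP_TO_EDGE |
                                CLK_FILTER_NEAREST;

/* Every resource that any feature might need is its own kernel argument. */
__kernel void Optimize(int features,
                       __global const float *component1Data,
                       __read_only image2d_t component2Data,
                       __global float *results)
{
    size_t i = get_global_id(0);
    float component2Value = 0.0f;
    if (features & FEATURE_COMPONENT2)
        component2Value = read_imagef(component2Data, kSampler, (int2)(0, 0)).x;
    results[i] = ObjectiveFunction(features, component1Data,
                                   component2Value, (float)i);
}
#endif
```

Note that component2Data appears in the signature even when its feature bit is clear, which is why the host still has to bind some image to it.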


Pros

  • Easy to comprehend.
  • Easy offline compilation. 
  • The OpenCL standard allows for NULL pointers to be specified for an argument like component1Data. Therefore, there isn't much overhead for having extra arguments to global memory.
Cons

  • Another location to update when adding another feature.
  • There are limits to the number of arguments you can have of certain types. For example, the C1060 has the following limits:
    • CL_DEVICE_MAX_READ_IMAGE_ARGS = 128
    • CL_DEVICE_MAX_WRITE_IMAGE_ARGS = 8
    • CL_DEVICE_MAX_CONSTANT_ARGS = 9
  • The OpenCL standard doesn't allow for NULL image objects like the argument component2Data. Therefore, a dummy image object would have to be loaded onto the device and then not used.
  • CL_KERNEL_WORK_GROUP_SIZE is no longer accurate for a given feature set. There is already a significant amount of register pressure from the optimizer. The unused arguments and code may make it worse, though this is something that needs to be tested.
    • Note that the code for the unused features could be contained in #if preprocessor blocks. Only the code needed for the specified features would then be compiled, alleviating the superfluous register pressure. Doing this, however, would cancel out the benefit of easy offline compilation.

Possible solution: Write separate kernels

This solution would entail writing a kernel for every combination of possible features. For the above example this would look like the following:



The driver then chooses the appropriate kernel to execute based on the user specified features.
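On the host side, the selection step could be as simple as mapping a feature mask to a kernel name, which the driver would then hand to clCreateKernel. The sketch below assumes two binary features and three hypothetical kernel names; none of these identifiers come from the original code. KernelCount spells out how many kernels have to exist for N binary features when the empty feature set is excluded.

```c
#include <stddef.h>

/* Hypothetical feature bits. */
enum { FEATURE_COMPONENT1 = 1, FEATURE_COMPONENT2 = 2 };

/* Returns the name of the kernel written for this exact feature set,
   or NULL when no feature is enabled or the mask is unknown.
   The kernel names are assumptions made for this example. */
const char *SelectKernelName(unsigned features)
{
    switch (features) {
    case FEATURE_COMPONENT1:
        return "OptimizeComponent1";
    case FEATURE_COMPONENT2:
        return "OptimizeComponent2";
    case FEATURE_COMPONENT1 | FEATURE_COMPONENT2:
        return "OptimizeComponent1And2";
    default:
        return NULL;
    }
}

/* Number of kernels needed for n binary features: every non-empty subset.
   Only meaningful while n is smaller than the bit width of unsigned long. */
unsigned long KernelCount(unsigned n)
{
    return (1UL << n) - 1UL;
}
```

With two features this is three kernels; with five it is already 31.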

Pros
  • The OpenCL compiler can fully optimize for a particular feature set, hopefully counteracting the register pressure problem.
  • Easy offline compilation.
Cons
  • For N binary features there are 2^N - 1 possible kernels, which is not something to maintain by hand. It would be possible to use meta-programming to generate the desired kernel code on the fly, but offline compilation is required in order to hide the source code before release to a wider audience.
    • Theoretically I should be able to offline compile our most sensitive code and then only meta-program the high-level kernel code. Essentially, using a linker to connect the components together. Unfortunately, clCreateProgramWithBinary doesn't support this type of operation.

Possible solution: Preprocessor macros

OpenCL supports a full-fledged preprocessor. Boolean preprocessor macros can be set by some simple string replacement to effectively turn features on and off. This is pseudo-meta-programming, since all the logic is handled inside the preprocessor macros. The above example would look like the following:
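A sketch of the macro-driven variant, assuming two hypothetical boolean macros, USE_COMPONENT1 and USE_COMPONENT2, that the host would normally define through the build options. The defaults at the top are there only so the sketch builds standalone, and the arithmetic is again invented for illustration.

```c
/* Normally supplied by the host through the build options; these
   defaults exist only so the sketch compiles on its own. */
#ifndef USE_COMPONENT1
#define USE_COMPONENT1 1
#endif
#ifndef USE_COMPONENT2
#define USE_COMPONENT2 0
#endif

/* Code for a disabled feature is removed before compilation. */
float ObjectiveFunction(float x, float component1Weight, float component2Value)
{
    float value = x * x;
#if USE_COMPONENT1
    value += component1Weight * x;
#endif
#if USE_COMPONENT2
    value += component2Value;
#endif
    return value;
}

#ifdef __OPENCL_VERSION__
#if USE_COMPONENT2
__constant sampler_t kSampler = CLK_NORMALIZED_COORDS_FALSE |
                                CLK_ADDRESS_CLAMP_TO_EDGE |
                                CLK_FILTER_NEAREST;
#endif

/* Arguments for disabled features vanish from the signature, so no
   dummy image has to be bound for an unused component2Data. */
__kernel void Optimize(
#if USE_COMPONENT1
    __global const float *component1Data,
#endif
#if USE_COMPONENT2
    __read_only image2d_t component2Data,
#endif
    __global float *results)
{
    size_t i = get_global_id(0);
    float c1 = 0.0f;
    float c2 = 0.0f;
#if USE_COMPONENT1
    c1 = component1Data[0];
#endif
#if USE_COMPONENT2
    c2 = read_imagef(component2Data, kSampler, (int2)(0, 0)).x;
#endif
    results[i] = ObjectiveFunction((float)i, c1, c2);
}
#endif
```

Because the argument list itself changes with the macros, the host has to set kernel arguments by the same feature logic it used to build the program.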



Pros
  • Zero code duplication.
  • Common technique; everyone already knows about preprocessor macros.
  • The OpenCL compiler can fully optimize for a particular feature set, hopefully counteracting the register pressure problem.
Cons
  • Harder to read and thus maintain.
  • Offline compilation has to enumerate all possible feature combinations using combinations of preprocessor macros.
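To give a sense of what that enumeration involves, here is a small host-side sketch that turns a feature mask into the option string an offline build step would pass to the compiler, one -D definition per feature. The macro names and the BuildOptions helper are assumptions for this example; an offline build would loop over every mask from 1 to 2^N - 1 and compile once per string.

```c
#include <stddef.h>
#include <stdio.h>

#define FEATURE_COUNT 2

/* Hypothetical macro names, one per binary feature. */
const char *const kFeatureMacros[FEATURE_COUNT] = {
    "USE_COMPONENT1", "USE_COMPONENT2"
};

/* Writes e.g. "-D USE_COMPONENT1=1 -D USE_COMPONENT2=0" into out.
   Returns the string length, or -1 if the buffer is too small. */
int BuildOptions(unsigned mask, char *out, size_t outSize)
{
    size_t used = 0;
    if (outSize == 0)
        return -1;
    out[0] = '\0';
    for (int i = 0; i < FEATURE_COUNT; ++i) {
        int n = snprintf(out + used, outSize - used, "%s-D %s=%d",
                         i ? " " : "", kFeatureMacros[i],
                         (int)((mask >> i) & 1u));
        if (n < 0 || (size_t)n >= outSize - used)
            return -1;
        used += (size_t)n;
    }
    return (int)used;
}
```

Each resulting string corresponds to one binary that has to be built, stored, and selected at run time.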