Tuesday, April 6, 2010

Encapsulation on the GPU

Encapsulation is the separation of logical components by abstract interfaces. Though commonly touted as an object-oriented feature, it can be implemented in pure C, and thus in pure OpenCL as well. However, encapsulation in OpenCL breaks down when arbitrary pointers are needed: the only way to obtain these pointers is to have them be kernel arguments.

Example: Optimizers

A simplified optimizer has two components: an objective function, and an optimizer that repeatedly calls the objective function. The implementation in pseudo-code OpenCL looks like the following:
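The original listing is missing; the following is a reconstruction of what it plausibly looked like, with hypothetical names (Optimize, ObjectiveA, OBJECTIVE_A, and so on) standing in for the real ones:

```c
/* Pseudo-code OpenCL sketch. ObjectiveFunction is the dispatch point
 * referred to below; all other names are hypothetical. */
float ObjectiveFunction(int whichObjective, __global float* data)
{
    /* OpenCL has no function pointers, so select the objective by id. */
    switch (whichObjective)
    {
        case OBJECTIVE_A: return ObjectiveA(data);
        case OBJECTIVE_B: return ObjectiveB(data);
        default:          return 0.0f;
    }
}

__kernel void Optimize(int whichObjective, __global float* data,
                       int iterations)
{
    for (int i = 0; i < iterations; ++i)
    {
        float value = ObjectiveFunction(whichObjective, data);
        /* ... update the current parameter estimate using value ... */
    }
}
```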



Note, OpenCL doesn't support function pointers, so the function ObjectiveFunction is used instead to dispatch to the appropriate objective function code.

The problem

In order to get a pointer to memory in an OpenCL kernel, the memory object must be a bare kernel argument. What would be preferable is to be able to write a kernel like the following:
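The original listing is missing; a sketch of the desired (but unsupported) kernel might look like this, with hypothetical field names chosen to match the components mentioned later in the post:

```c
/* Hypothetical reconstruction: a struct argument that encapsulates
 * everything the objective function needs. */
typedef struct
{
    __global float* component1Data;        /* feature 1 input buffer */
    __read_only image2d_t component2Data;  /* feature 2 input image  */
    int enabledFeatures;                   /* bit mask of active features */
} ObjectiveFunctionOptions;

__kernel void Optimize(ObjectiveFunctionOptions options, int iterations)
{
    for (int i = 0; i < iterations; ++i)
    {
        /* One encapsulated argument travels through the call chain. */
        float value = ObjectiveFunction(&options);
        /* ... update the current parameter estimate using value ... */
    }
}
```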



On the host, the equivalent of the ObjectiveFunctionOptions struct could be written using the appropriate OpenCL types, like the following:
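Again the original listing is missing; a hypothetical host-side mirror of the struct, using the OpenCL host API types, might look like this:

```c
/* Hypothetical host-side mirror. The hope is that the runtime would
 * translate each cl_mem handle into the corresponding device pointer. */
typedef struct
{
    cl_mem component1Data;   /* buffer: would become __global float* */
    cl_mem component2Data;   /* image:  would become image2d_t       */
    cl_int enabledFeatures;
} ObjectiveFunctionOptions;

/* The hoped-for usage: set the whole struct as a single argument. */
ObjectiveFunctionOptions options;
options.component1Data  = component1Buffer;
options.component2Data  = component2Image;
options.enabledFeatures = features;
clSetKernelArg(kernel, 0, sizeof(options), &options);
```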



Unfortunately, OpenCL implementations aren't smart enough to introspect into a struct argument and properly set cl_mem data types to their associated buffer or image pointers. Furthermore, I can't find anything in the standard stating that they should be this smart.

Possible solution: Enumerate them as kernel arguments

The first possible solution is to forget about encapsulation and simply make everything that needs a pointer a kernel argument. The kernel for the above would look like the following:
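The original listing is missing; a reconstruction with the same hypothetical names might look like this:

```c
/* Hypothetical reconstruction: every piece of component data is
 * enumerated as a bare kernel argument. */
__kernel void Optimize(__global float* component1Data,
                       __read_only image2d_t component2Data,
                       int enabledFeatures,
                       int iterations)
{
    for (int i = 0; i < iterations; ++i)
    {
        float value = 0.0f;
        if (enabledFeatures & FEATURE_1)
            value += Component1Objective(component1Data);
        if (enabledFeatures & FEATURE_2)
            value += Component2Objective(component2Data);
        /* ... update the current parameter estimate using value ... */
    }
}
```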


Pros

  • Easy to comprehend.
  • Easy offline compilation. 
  • The OpenCL standard allows a NULL pointer to be specified for an argument like component1Data. Therefore, there isn't much overhead in having extra global memory arguments.
Cons

  • Another location to update when adding another feature.
  • There are limits to the number of arguments you can have of certain types. For example, the C1060 has the following limits:
    • CL_DEVICE_MAX_READ_IMAGE_ARGS = 128
    • CL_DEVICE_MAX_WRITE_IMAGE_ARGS = 8
    • CL_DEVICE_MAX_CONSTANT_ARGS = 9
  • The OpenCL standard doesn't allow NULL image objects for arguments like component2Data. Therefore, a dummy image object would have to be loaded onto the device and then left unused.
  • CL_KERNEL_WORK_GROUP_SIZE is no longer accurate for a given feature set. There is already a significant amount of register pressure from the optimizer, and this may make it worse, though this is something that needs to be tested.
    • Note, the code for the unused features could be contained in #if preprocessor blocks. Only the code needed for the specified features would then be compiled, alleviating the superfluous register pressure. Doing this, however, would cancel out the benefit of easy offline compilation.

Possible solution: Write separate kernels

This solution would entail writing a kernel for every combination of possible features. For the above example this would look like the following:
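The original listing is missing; with two hypothetical binary features, the combinatorial set of kernels might look like this:

```c
/* Hypothetical reconstruction: one kernel per feature combination. */
__kernel void OptimizeFeature1(__global float* component1Data,
                               int iterations)
{ /* ... objective uses only component 1 ... */ }

__kernel void OptimizeFeature2(__read_only image2d_t component2Data,
                               int iterations)
{ /* ... objective uses only component 2 ... */ }

__kernel void OptimizeFeature1And2(__global float* component1Data,
                                   __read_only image2d_t component2Data,
                                   int iterations)
{ /* ... objective uses both components ... */ }
```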



The driver then chooses the appropriate kernel to execute based on the user specified features.

Pros
  • The OpenCL compiler can fully optimize for a particular feature set, hopefully counteracting the register pressure problem.
  • Easy offline compilation.
Cons
  • For N binary features there are 2^N - 1 possible kernels, which is not something to maintain by hand. It would be possible to use meta-programming to automatically generate the desired kernel code on the fly, though offline compilation is required in order to hide source code before release to a wider audience.
    • Theoretically, I should be able to offline-compile our most sensitive code and then meta-program only the high-level kernel code, essentially using a linker to connect the components together. Unfortunately, clCreateProgramWithBinary doesn't support this type of operation.
Possible solution: Preprocessor macros

OpenCL supports a full-fledged preprocessor. Boolean preprocessor macros can be set by some simple string replacement to effectively turn features on and off. This is pseudo-meta-programming, since all the logic is handled inside the preprocessor macros. The above example would look like the following:
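The original listing is missing; a reconstruction might look like the following, where the host supplies the macros at build time (e.g. via the build-options string of clBuildProgram, such as "-DUSE_FEATURE_1") and the names remain hypothetical:

```c
/* Hypothetical reconstruction: features toggled by macros defined in
 * the clBuildProgram options string, e.g. "-DUSE_FEATURE_1". */
__kernel void Optimize(
#ifdef USE_FEATURE_1
                       __global float* component1Data,
#endif
#ifdef USE_FEATURE_2
                       __read_only image2d_t component2Data,
#endif
                       int iterations)
{
    for (int i = 0; i < iterations; ++i)
    {
        float value = 0.0f;
#ifdef USE_FEATURE_1
        value += Component1Objective(component1Data);
#endif
#ifdef USE_FEATURE_2
        value += Component2Objective(component2Data);
#endif
        /* ... update the current parameter estimate using value ... */
    }
}
```

With the arguments themselves wrapped in #ifdef blocks, unused features cost nothing: no dummy image objects and no dead arguments.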



Pros
  • Zero code duplication.
  • Common technique, everyone already knows about preprocessor macros.
  • The OpenCL compiler can fully optimize for a particular feature set, hopefully counteracting the register pressure problem.
Cons
  • Harder to read and thus harder to maintain.
  • Offline compilation has to enumerate every possible combination of the preprocessor macros.

8 comments:

  1. You wrote "Unfortunately, OpenCL implementations aren't smart enough to introspect into a struct argument and properly set cl_mem data types to their associated buffer or image pointers." Have you read about clEnqueueNativeKernel? It's not what you want, in that it's a host function and not a device function, but it does provide a form of introspection into a struct.

  2. Whoa. Cool. So clEnqueueNativeKernel is at least designed in a way to allow for what I need. But there isn't an equivalent for GPU kernels using clEnqueueNDRangeKernel.

    So what I meant by "smart" is that since OpenCL kernel code is compiled by the same runtime it has the ability to look at struct fields and translate from cl_mem objects to the appropriate pointers in the argument passed to the kernel.

    Though I would settle for the midway point of how clEnqueueNativeKernel works, by explicitly specifying the struct field locations of the cl_mem objects. This may be necessary due to sizeof(void *) being different on the host and device.

  3. Correct, that is, there isn't the equivalent for device kernels.

    How the translation of the struct's contents is done is implementation dependent, so it might be compiled into the kernel or simply built dynamically by the runtime host functions. This linkage, or special sauce, is not exposed because it is very vendor- and device-dependent, including the hardware interface, whether it is PCIe or some other connection.

    Naturally, you may file a Khronos OpenCL Bugzilla report for this capability, but it's up to the OpenCL working group to judge whether it makes sense, and is appropriate, applicable, and doable for the majority of their customers.

  4. You can do everything that you want to do in CUDA C++. You have templates, structs with __device__ methods, classes, and in CUDA 3.1, even inheritance. Encapsulate and overload to your heart's content.

  5. The "if" business doesn't scale to large numbers of possible components. Why not metaprogramming? Your secrets aren't that much safer as PTX binaries than as OpenCL source code; and OpenCL binaries seem hideously brittle across driver versions (and vendors).

    I've had very good luck with metaprogramming on the GPU. Sticking together and compiling new kernels at runtime is extraordinarily powerful, akin to LISP/Haskell type power, nastier syntax but the insane speed of the GPU...

  6. Re: Orion

    While I agree metaprogramming is very cool (my favorite language is Python :-), I don't think it's as applicable to software companies porting code to the GPU. The chief reason to port to the GPU is performance, and performance-critical sections are often the real bread and butter of a software company. We have already had an open source implementation try to compete with our software by porting to the GPU. Since they don't have as much experience with it, they ultimately failed, though it did send a signal to us that secrets are important for our well-being.

    I would disagree with "Your secrets aren't that much safer as PTX binaries". Decompiling is difficult, and has never really worked well. The OpenCL source is a lot easier to understand than PTX. I'm reminded of this quote: "There is no expedient to which a man will not resort to avoid the real labor of thinking". Basically, trying to understand what PTX is doing requires way too much thinking. However, it's quite easy to start an application in a debugger and break on clCreateProgramWithSource to investigate the strings being passed.

  7. Easy question for you

    Say I want to pass a float4 array (XYZT), but I want to overload T as an unsigned int, like a union. Is there a way to pass a struct array to the OpenCL kernel, or should you just assign it into a union within the kernel?

  8. Figured it out. You can pass it as a float4 g, then use as_uint(g.w) and as_float(g.w) to use it as either type. Slick. :)
