OpenCL in LWJGL


OpenCL is an extremely useful feature for any CPU-intensive game. Most modern CPUs have multiple cores for parallel processing, but you do not need a multi-core CPU to make effective use of OpenCL. OpenCL is designed for parallel processing and can be set up to run on a computer's GPU, which is otherwise idle whenever you are not drawing your game. OpenCL helps you unlock the GPU's parallel-processing potential for purposes other than graphics.


A Bit of Terminology

  • Kernel This is an entrypoint into the OpenCL code you write. It is what will be called from your main application to initiate a program.
  • Work-Item The smallest unit of work. Each work item runs the program individually and has its own unique id to identify it.
  • Work-Group Work items are organized into work groups. They are stored in a 1, 2 or 3 dimensional array. Work groups exist to allow sharing of memory between items and also have their own unique id.
  • N-D Range Work groups are further organized into another 1, 2 or 3 dimensional array specified by the N-D range.
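To make the terminology a little more concrete, here is a rough plain-Java sketch (no OpenCL involved) of how a 1 dimensional N-D range is split into work groups, and how a work item's global id relates to its group id and local id. The class and method names are made up for illustration, and the sketch assumes no global offset. Inside a real kernel the same values come from get_global_id(0), get_group_id(0) and get_local_id(0).

```java
// Hypothetical illustration only: these names are not part of OpenCL or LWJGL.
final class WorkItemIds {
    // Global id of a work item, given its work-group id, the work-group size,
    // and its local id inside that group (one dimension, no global offset).
    static int globalId(int groupId, int localSize, int localId) {
        return groupId * localSize + localId;
    }

    // Number of work groups needed to cover globalSize work items.
    // Assumes globalSize is an exact multiple of localSize.
    static int groupCount(int globalSize, int localSize) {
        return globalSize / localSize;
    }

    public static void main(String[] args) {
        int globalSize = 12; // total work items in the N-D range
        int localSize = 4;   // work items per work group
        for (int group = 0; group < groupCount(globalSize, localSize); group++) {
            for (int local = 0; local < localSize; local++) {
                System.out.println("group " + group + ", local " + local
                        + " -> global " + globalId(group, localSize, local));
            }
        }
    }
}
```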

The Kernel Example

The example I will be using throughout is that of adding together the contents of two arrays element by element, and returning the result in a third array. As a guide, I will first write the equivalent code in Java, then give and briefly explain the same method as an OpenCL kernel (the OpenCL language is similar to C).

In Java:

// Imagine this is in a class.
public static void sum(float[] a, float[] b, float[] result) {
    // Writing the output through an argument is an odd way to go about this in Java, but bear with me: it mirrors the kernel.
    // The caller must pass a result array at least as long as a. Reassigning result here would not be visible to the caller.
    for(int i = 0; i < a.length; i++) {
        result[i] = a[i] + b[i];
    }
}

As an OpenCL kernel:

// OpenCL uses pointers (marked by the * symbol after the data type),
// which point to a location in memory. Normally a pointer points to
// the first element of an array.
// For this example, we will assume that these variables refer
// to arrays with the same length as the 'int const size' variable.
kernel void sum(global const float* a, global const float* b, global float* result, int const size) {
   //Several things are going on in the declaration. First, the kernel keyword (also spelled __kernel) and the void return type. All kernels
   //must have these, but other functions need not. Second, the global keyword (also spelled __global) for the arrays. These arrays must be
   //accessible by all work groups and are therefore in global memory. I have made the input arrays const (final in Java) as they do not
   //need to be written to, only read. Last, size is needed as there is no way to get the length of an array from a pointer.

   //In OpenCL, each work item will only process one element from each array. To know which elements to add together, we use the id of the work
   //item, assigned to it by OpenCL. The first item receives 0 and the ids increment consecutively after that.
   const int itemId = get_global_id(0); 

   // The programmer defines how many work items are created, but you rarely create exactly the right amount, so we must check that 
   // this item's id is within the bounds of our array.
   if(itemId < size) {
      result[itemId] = a[itemId] + b[itemId];
      //The easiest part of all, adding the numbers together.
   }
}    

I realize that this is not the best explanation of the code itself, but I didn't want to spend too long on it. You are all smart people and this is essentially C syntax with the extra global and kernel keywords (which can also be written __global and __kernel).
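If it helps, the way OpenCL runs the kernel can be imitated on the CPU with a plain Java loop. This is only a mental model under simplifying assumptions: the work items run one after another here, whereas OpenCL is free to run them in parallel and in any order. The names SumKernelSimulation, sumKernel and run are made up for this sketch.

```java
// Hypothetical CPU-side simulation; not how OpenCL actually schedules work.
final class SumKernelSimulation {
    // Plays the role of the kernel body for a single work item.
    static void sumKernel(float[] a, float[] b, float[] result, int size, int itemId) {
        if (itemId < size) {
            result[itemId] = a[itemId] + b[itemId];
        }
    }

    // Plays the role of OpenCL launching globalWorkSize work items,
    // each one receiving its own id.
    static float[] run(float[] a, float[] b, int globalWorkSize) {
        float[] result = new float[a.length];
        for (int itemId = 0; itemId < globalWorkSize; itemId++) {
            sumKernel(a, b, result, a.length, itemId);
        }
        return result;
    }

    public static void main(String[] args) {
        float[] a = {0f, 1f, 2f};
        float[] b = {2f, 1f, 0f};
        // Launch 4 work items for 3 elements: the extra one fails the bounds check and does nothing.
        float[] result = run(a, b, 4);
        System.out.println(java.util.Arrays.toString(result));
    }
}
```

Launching more work items than there are elements is harmless here precisely because of the if(itemId < size) check.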


Creating the Context and Command Queue

  • Creation

In LWJGL, just as in OpenGL and OpenAL, you must call the CL create method before doing anything in OpenCL.

public static void initializeCL() throws LWJGLException { 
   //Throws LWJGLException if an error occurs. If you get a message saying could not find CL Source (or something similar), then
   //your hardware does not support OpenCL. (Try updating your drivers first, though.)
   CL.create();
}
  • Platform

The platform is a collection of the host (your program on the CPU) and a series of devices (see below). It allows for management and sharing of memory between devices. In LWJGL, a platform is represented by the CLPlatform class, which contains a very helpful static method for retrieving platforms.

   //The getPlatforms() method returns a List<CLPlatform> which contains the available platforms on your hardware. Picking 0 will suffice for now.
   CLPlatform platform = CLPlatform.getPlatforms().get(0); 
  • Device

A device is any piece of hardware that can run an OpenCL kernel, like your GPU or CPU. It is represented in LWJGL by CLDevice and you can access it through a CLPlatform.

   //A great method from the lovely people at LWJGL which gives you a list of all devices of a particular type.
   //I am using the GPU. There are other device types, including CL_DEVICE_TYPE_CPU, CL_DEVICE_TYPE_ALL and CL_DEVICE_TYPE_DEFAULT.
   List<CLDevice> devices = platform.getDevices(CL10.CL_DEVICE_TYPE_GPU);
  • Context

The context ties together the devices, kernels, memory, and queues. It is represented by the CLContext class and once more a nifty method exists to get an instance.

   //The last null argument is where an IntBuffer would be if you wanted to get an error code out of the method. I am leaving this null for simplicity.
   //This method throws an LWJGLException if "an exception occurs while creating the context". I take this to mean that it will throw an exception if the context could not be created,
   //which makes the error code useless. There are other overloaded methods taking more arguments such as a Drawable to share with OpenGL, but that is beyond this tutorial's scope. 
   CLContext context = CLContext.create(platform, devices, null);
  • Command Queue

The command queue, represented by the CLCommandQueue class, is the final object to initialize. It is created for a specific device and allows commands to be sent to the device via a queue.

   IntBuffer errorBuf = BufferUtils.createIntBuffer(1); //This time we'll check for errors
   //The first two arguments for clCreateCommandQueue() are the context and device (I just used the first device in our list).
   //The third argument is properties. This one is necessary to profile the queue later. You may not need to do this, but I like to use this property. The last argument is an IntBuffer to store an error code.
   CLCommandQueue queue = CL10.clCreateCommandQueue(context, devices.get(0), CL10.CL_QUEUE_PROFILING_ENABLE, errorBuf);
   // checkCLError throws an OpenCLException if the error code does not equal CL_SUCCESS. This exception should be caught and all currently created resources released. See later. 
   Util.checkCLError(errorBuf.get(0)); 


And now the whole code, in a class and putting the various CL objects in static fields:

import java.util.List;
import java.nio.IntBuffer;
import org.lwjgl.BufferUtils;
import org.lwjgl.LWJGLException;
import org.lwjgl.opencl.*;

public class MyClClass {
   public static CLContext context;
   public static CLPlatform platform;
   public static List<CLDevice> devices;
   public static CLCommandQueue queue;

   // For simplicity exception handling code is in the method calling this one.
   public static void initializeCL() throws LWJGLException { 
      IntBuffer errorBuff = BufferUtils.createIntBuffer(1);
      CL.create();
      platform = CLPlatform.getPlatforms().get(0); 
      devices = platform.getDevices(CL10.CL_DEVICE_TYPE_GPU);
      context = CLContext.create(platform, devices, errorBuff);
      queue = CL10.clCreateCommandQueue(context, devices.get(0), CL10.CL_QUEUE_PROFILING_ENABLE, errorBuff);
      Util.checkCLError(errorBuff.get(0)); 
   }
}

All further code is in this class allowing us to use our static CL Objects.


Allocating Memory

If all has gone well you now have a fully functional CL context ready to execute your kernels. Now we just have to load the program, allocate the memory, and run the program.

If you remember, the arrays in our kernel were global so that they could be accessed by every work item running the kernel. This means we must create a space for them in our context. Memory in an OpenCL context is represented in LWJGL by the CLMem class. There are several ways to create this object: some overloads of clCreateBuffer take a buffer of data and copy it, and others just take a size argument.

//In all these examples I use a null reference for the error IntBuffer. This is only for simplicity's sake.
//You should add error checking in final release code.

int length = 100; //100 elements.
long size = length * 4; // 4 bytes per float.

// This clCreateBuffer method doesn't put any data in the memory, it just allocates a chunk of memory with the specified number of bytes.
// CL_MEM_READ_ONLY means that kernels may only read from this memory. The host (your main application) can still fill it later with clEnqueueWriteBuffer.
CLMem memory = CL10.clCreateBuffer(context, CL10.CL_MEM_READ_ONLY, size, null);

//Create a FloatBuffer length 100, containing 0 - 99, to be copied into the memory. 
FloatBuffer dataBuff = BufferUtils.createFloatBuffer(100);
float[] dataArray = new float[100];
for(int i = 0; i < 100; i++) {
    dataArray[i] = i;
}
dataBuff.put(dataArray);
dataBuff.rewind();

//This time the memory is filled on creation. The flag CL_MEM_COPY_HOST_PTR means that the data stored in the buffer you pass to
//the method will be copied into the device's memory. LWJGL lets you skip the size argument by taking it from the input buffer.
//The kernel only needs to read this data, so the buffer is again CL_MEM_READ_ONLY (CL_MEM_WRITE_ONLY is for buffers that kernels only write to, such as results).
CLMem memory2 = CL10.clCreateBuffer(context, CL10.CL_MEM_READ_ONLY | CL10.CL_MEM_COPY_HOST_PTR, dataBuff, null); 
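One detail in the buffer code above deserves a closer look: the rewind() call. The sketch below uses only java.nio to show what happens to a buffer's position while it is being filled, and why it has to be reset before the buffer is handed on. As far as I know BufferUtils.createFloatBuffer does much the same as the directFloatBuffer helper here (a direct buffer in native byte order), but treat that helper and the BufferSketch class as stand-ins I made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

// Plain java.nio sketch; directFloatBuffer is a hypothetical stand-in for BufferUtils.createFloatBuffer.
final class BufferSketch {
    static FloatBuffer directFloatBuffer(int elements) {
        // 4 bytes per float, allocated off-heap in the platform's native byte order.
        return ByteBuffer.allocateDirect(elements * Float.BYTES)
                .order(ByteOrder.nativeOrder())
                .asFloatBuffer();
    }

    // Fills a new buffer with 0, 1, ..., elements-1 and rewinds it so it is ready to be read.
    static FloatBuffer filled(int elements) {
        FloatBuffer buffer = directFloatBuffer(elements);
        for (int i = 0; i < elements; i++) {
            buffer.put((float) i);
        }
        // After the puts the position sits at the end, so remaining() is 0.
        buffer.rewind(); // position back to 0, so remaining() equals elements again
        return buffer;
    }

    public static void main(String[] args) {
        FloatBuffer buffer = filled(100);
        System.out.println("remaining elements: " + buffer.remaining());
        System.out.println("size in bytes: " + buffer.remaining() * Float.BYTES);
    }
}
```

Forgetting the rewind() would leave nothing between the position and the limit, which is the part of the buffer that matters when its contents are copied.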

Now you can give your kernels the data they need. All that remains is to create the kernels themselves, which were explained way back at the beginning of the tutorial.


Creating Programs and Kernels

Kernels are the entry points to your OpenCL code, but what about all the other functions and variables you write? Kernels are not the whole story, in fact they are just part of a larger OpenCL program represented by a CLProgram instance. In order to create a program, we need our source which means we will have to load a file. (You do not have to store the kernel's source code in a file, but I like to). Here is a method that will load a given file from your application's directory and return a String containing the text in the file. Also the kernel code is repeated here for ease of reference.

kernel void sum(global const float* a, global const float* b, global float* result, int const size) {
    const int itemId = get_global_id(0); 
    if(itemId < size) {
        result[itemId] = a[itemId] + b[itemId];
    }
}    

IO is pretty boring at times. Use your own method if you want, or just type the kernel as a string straight into your application source.

	public static String loadText(String name) {
		if(!name.endsWith(".cls")) {
			name += ".cls";
		}
		BufferedReader br = null;
		String resultString = null;
		try {
			// Get the file containing the OpenCL kernel source code
			File clSourceFile = new File(MyClClass.class.getClassLoader().getResource(name).toURI());
			// Create a buffered file reader to read the source file
			br = new BufferedReader(new FileReader(clSourceFile));
			// Read the file's source code line by line and store it in a string buffer
			String line = null;
			StringBuilder result = new StringBuilder();
			while((line = br.readLine()) != null) {
				result.append(line);
				result.append("\n");
			}
			// Convert the string builder into a string containing the source code to return
			resultString = result.toString();
		} catch(NullPointerException npe) {
			// If there is an error finding the file
			System.err.println("Error retrieving OpenCL source file: ");
			npe.printStackTrace();
		} catch(URISyntaxException urie) {
			// If there is an error converting the file name into a URI
			System.err.println("Error converting file name into URI: ");
			urie.printStackTrace();
		} catch(IOException ioe) {
			// If there is an IO error while reading the file
			System.err.println("Error reading OpenCL source file: ");
			ioe.printStackTrace();
		} finally {
			// Finally clean up any open resources (br is still null if the file was never opened)
			try {
				if(br != null) {
					br.close();
				}
			} catch (IOException ex) {
				// If there is an error closing the file after we are done with it
				System.err.println("Error closing OpenCL source file");
				ex.printStackTrace();
			}
		}

		// Return the string read from the OpenCL kernel source code file
		return resultString;
	}
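If your kernel source lives on the file system rather than on the classpath, a shorter loader is possible. The following is only a sketch and not a drop-in replacement: KernelSource and load are names I made up, the file name is treated as an ordinary path, and unlike loadText it throws an unchecked exception instead of returning null when the file cannot be read.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical alternative to loadText(); reads from the file system rather than the classpath.
final class KernelSource {
    // Reads the whole file as UTF-8 text and returns it as one String.
    static String load(String fileName) {
        Path path = Paths.get(fileName);
        try {
            return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
        } catch (IOException e) {
            // Fail loudly: building a program from a null source would only fail later and less clearly.
            throw new UncheckedIOException("Could not read OpenCL source: " + path, e);
        }
    }
}
```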

Now we can create our OpenCL program.

//The first argument is the context which we've already created. The second is the source as a String, replace this with whatever 
//method you are using. The last argument is the error IntBuffer, null for simplicity's sake...
CLProgram sumProgram = CL10.clCreateProgramWithSource(context, loadText("sumKernel.cls"), null);

//Each hardware vendor ships its own OpenCL implementation and compiler, so programs need to be compiled at runtime. There is 
//a method for compiling the program, which also builds it for a specific device.
//First argument, the program you created but haven't compiled. Second, the device the program will run on (make sure the device is
// one you have created a command queue for.) Third, a String of compiler options. Just ignore this for now.
//The last argument is an optional callback that is notified when the build completes, not an error IntBuffer. null is fine here.
int error = CL10.clBuildProgram(sumProgram, devices.get(0), "", null);

//Instead, the clBuildProgram method returns the error code for any
//errors that may have occurred. You should check that the returned value equals CL10.CL_SUCCESS.
Util.checkCLError(error); //Handle any exceptions thrown.

We have a nice little program, but it is still useless without entry points, so we go on to creating a kernel for our program. Kernels in LWJGL are represented by an instance of CLKernel, and they are obtained like so:

//sumProgram is just the program we have created and built. The next String argument is the name of the kernel in your source. If 
//the kernel declaration was: kernel void addTogetherAFewArraysOfFloats(), then we would put "addTogetherAFewArraysOfFloats" here.
//The last argument is again an error IntBuffer.
CLKernel sumKernel = CL10.clCreateKernel(sumProgram, "sum", null);

Now we have a full OpenCL context with devices and a command queue, a simple program, and a kernel to launch that program with. All we have to do now is push the run button, sort of.


Setting Arguments For Kernels

Remember when we created CLMem objects to pass memory on to the program? Now we just have to tell the kernel where that memory is, and there is a nice and easy way of doing this in LWJGL. In this example I will first create the pieces of memory necessary for our kernel (the two inputs, the result, and the size), then tell the kernel about them.

final int size = 100; //Can be whatever size you want.
IntBuffer errorBuff = BufferUtils.createIntBuffer(1); // Error buffer

// Create float array from 0 to size-1.
FloatBuffer aBuff = BufferUtils.createFloatBuffer(size);
float[] tempData = new float[size];
for(int i = 0; i < size; i++) {
    tempData[i] = i;
}
aBuff.put(tempData);
aBuff.rewind();
// Create float array from size-1 to 0. This means that the result should be size-1 for each element.
FloatBuffer bBuff = BufferUtils.createFloatBuffer(size);
for(int j = 0, i = size-1; j < size; j++, i--) {
    tempData[j] = i;
}
bBuff.put(tempData);
bBuff.rewind();

// Create OpenCL memory object containing the first buffer's list of numbers. The kernel only reads it, so it is CL_MEM_READ_ONLY.
CLMem aMemory = CL10.clCreateBuffer(context, CL10.CL_MEM_READ_ONLY | CL10.CL_MEM_COPY_HOST_PTR, aBuff, errorBuff);
Util.checkCLError(errorBuff.get(0));

// Create OpenCL memory object containing the second buffer's list of numbers
CLMem bMemory = CL10.clCreateBuffer(context, CL10.CL_MEM_READ_ONLY | CL10.CL_MEM_COPY_HOST_PTR, bBuff, errorBuff);
Util.checkCLError(errorBuff.get(0));

// The kernel only writes to the result, so it is CL_MEM_WRITE_ONLY. Remember the length argument here is in bytes. 4 bytes per float.
CLMem resultMemory = CL10.clCreateBuffer(context, CL10.CL_MEM_WRITE_ONLY, size*4, errorBuff);
Util.checkCLError(errorBuff.get(0));

Now let's set the kernel's parameters by passing our newly created OpenCL memory objects to the kernel.

// Set the first kernel parameter, "global const float* a", to point to the first buffer of numbers
sumKernel.setArg(0, aMemory);
// Set the second kernel parameter, "global const float* b", to point to the second buffer of numbers
sumKernel.setArg(1, bMemory);
// Set the third kernel parameter, "global float* result", to point to the result memory buffer
sumKernel.setArg(2, resultMemory);
// Set the fourth kernel parameter, "int const size", to the number of elements in the arrays
// Since size is just a single constant value, we will pass it straight into the kernel rather than through a CLMem object.
sumKernel.setArg(3, size);


Running Kernels

Since memory is defined within a context, rather than a device or program, multiple kernels can share memory and work on it individually. Just one of the joys of OpenCL. Now we want to run the kernel and create some work items. If you don't remember what a work item is, jump back to the terminology at the start of this tutorial. The first thing we must do is specify how many dimensions we want to use and how many work items we want to create in each. Optionally we can also set how we want our work groups organized, which can be a little complex to begin with, so to keep it simple we will only be working in 1 dimension. Now we set up the number of work items to execute and run our kernel.

final int dimensions = 1;
PointerBuffer globalWorkSize = BufferUtils.createPointerBuffer(dimensions); //In here we put the total number of work items we want in each dimension.
globalWorkSize.put(0, size); //Size is a variable we defined a while back showing how many elements are in our arrays.

//clEnqueueNDRangeKernel() queues our kernel for execution.
//The first argument, 'queue', is a CLCommandQueue which must be created for the same device we built our program for.
//Second, 'sumKernel' is the CLKernel that we want to run.
//Third, 'dimensions' is an integer specifying the number of dimensions of our work.
//Fourth, the null value is the global work offset. In OpenCL 1.0 it MUST ALWAYS BE NULL. The good people at Khronos were thinking ahead like expert programmers, and later versions allow an offset to be added to the global ids.
//Fifth, 'globalWorkSize' has already been explained.
//Sixth, the null localWorkSize argument is not necessary but I will explain it in a moment.
//The seventh and eighth null values are for queuing events and are not necessary.
CL10.clEnqueueNDRangeKernel(queue, sumKernel, dimensions, null, globalWorkSize, null, null, null);

//One last thing to do is wait for OpenCL to finish what it is doing, which we do by calling clFinish(). This method waits (pauses the current program)
//until every event in the queue is finished running before returning. A similar effect which is more efficient in some situations would be to use CLEvent
//objects (cl_event in original OpenCL). Look up OpenCL events if you want. 
CL10.clFinish(queue);

This creates the correct number of work items in 1 dimension running the sum kernel, then waits for them to finish. If you remember, work items are also organized into work groups. We set this up with localWorkSize, a PointerBuffer of length 'dimensions' that tells OpenCL how many work items are in each dimension of each work group. There is one rule to remember: each dimension of localWorkSize must be a factor of the corresponding dimension of globalWorkSize. Imagine we wanted to create 125 work items in 3 dimensions. globalWorkSize could be 5, 5, 5, since 5 * 5 * 5 = 125. However, 5's only factors are 5 and 1, so each dimension of localWorkSize must be either 5 or 1. A localWorkSize of 5, 5, 5 gives 1 work group of size 125, while a localWorkSize of 1, 1, 1 gives 125 work groups of size 1. Mixed choices such as 5, 1, 1 are also possible, and you do not need to create exactly as many work items as you have elements (the kernel's bounds check takes care of any extras). If you leave localWorkSize null as in the example, then OpenCL will work it out for you.
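If you do pick a localWorkSize yourself, one common way to satisfy the factor rule is to round the global work size up to the next multiple of the local size and let the if(itemId < size) check in the kernel discard the surplus work items. A minimal sketch of that arithmetic follows. WorkSizes and roundUpToMultiple are hypothetical names, and the local size of 64 is an arbitrary value picked for the example, not a recommendation.

```java
// Hypothetical helper; not part of LWJGL.
final class WorkSizes {
    // Smallest multiple of localSize that is greater than or equal to workItems.
    static long roundUpToMultiple(long workItems, long localSize) {
        long remainder = workItems % localSize;
        return remainder == 0 ? workItems : workItems + localSize - remainder;
    }

    public static void main(String[] args) {
        long size = 100;      // elements in our arrays
        long localSize = 64;  // assumed work-group size, chosen only for illustration
        long globalSize = roundUpToMultiple(size, localSize);
        System.out.println("global work size: " + globalSize);
        System.out.println("work groups: " + globalSize / localSize);
    }
}
```

The two values would then go into the globalWorkSize and localWorkSize PointerBuffers in place of size and null.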

If all has gone well we have now run our kernel and the result should be in memory, but how do we get at it?


Reading and Writing to Memory

This is another job for the command queue. Since memory belongs to a context and not to a program, it does not really matter which command queue you use, provided it is in the same context. There is little point in using a second queue, though: you would still have to wait for the first to finish its work before reading the memory, to avoid memory inconsistency errors.

FloatBuffer writeTo = BufferUtils.createFloatBuffer(size); //Remember size is the number of elements.
//The first argument, 'queue', is the CLCommandQueue to use. 'resultMemory' is the CLMem to read. (CL_MEM_READ_ONLY, CL_MEM_WRITE_ONLY and CL_MEM_READ_WRITE describe how kernels access the memory, not the host, so the host can read a buffer that kernels only write to.)
//The next value (CL_TRUE) tells OpenCL whether the command is blocking or not. A blocking command will return only after the operation is complete and you can use the result immediately.
//Non-blocking (CL_FALSE) will return immediately but you will not be able to use the result until the operation is complete, which can be checked using events (which is what the last two arguments are for).
//You can also use clFinish(queue) to wait for the command to finish.
//The next 0 value is the offset in bytes into the buffer that you want to read from. The 'writeTo' parameter is a FloatBuffer to write to (the buffer can be any type). 
//It must be the correct length since LWJGL uses the buffer's length to know how much data to read. Since this is a blocking read, clFinish does not have to be called and the result can be used immediately. 
CL10.clEnqueueReadBuffer(queue, resultMemory, CL10.CL_TRUE, 0, writeTo, null, null);

And if you want to write into a CLMem object:

//This is exactly the same as reading. 'data' is a FloatBuffer containing the data you want written. Again it can be any type of buffer but must be the correct length.
CL10.clEnqueueWriteBuffer(queue, memory, CL10.CL_TRUE, 0, data, null, null); 
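Once a blocking read has returned, the host buffer can be checked like any other FloatBuffer. As a small sanity check, the result can be compared against the same sum computed on the CPU. The sketch below uses only java.nio. ResultCheck and firstMismatch are names invented for illustration, and the exact float comparison is only reasonable here because the example data are small whole numbers.

```java
import java.nio.FloatBuffer;

// Hypothetical host-side check; plain java.nio, no OpenCL calls.
final class ResultCheck {
    // Returns the index of the first element where result != a + b, or -1 if every element matches.
    // Uses absolute get(index), so the buffers' positions are left untouched.
    static int firstMismatch(FloatBuffer a, FloatBuffer b, FloatBuffer result) {
        for (int i = 0; i < result.capacity(); i++) {
            if (result.get(i) != a.get(i) + b.get(i)) {
                return i;
            }
        }
        return -1;
    }
}
```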


Cleaning up After Ourselves

This should be done at the very end of the program, or once you have no further use for OpenCL. If you are never going to use a particular program again, you can release just that program early. The only CL objects we have created that need destroying are: the kernel, the program, the memory, the command queue and the context. Since we are in LWJGL we should also call CL.destroy(). The code to do this is:

CL10.clReleaseKernel(sumKernel);
CL10.clReleaseProgram(sumProgram);

CL10.clReleaseMemObject(aMemory);
CL10.clReleaseMemObject(bMemory);
CL10.clReleaseMemObject(resultMemory);

CL10.clReleaseCommandQueue(queue);
CL10.clReleaseContext(context);
CL.destroy();


The Full Code

This is the entire class file of the code I have written here.

import java.util.List;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.net.URISyntaxException;
import java.nio.FloatBuffer;
import java.nio.IntBuffer;
import org.lwjgl.BufferUtils;
import org.lwjgl.LWJGLException;
import org.lwjgl.PointerBuffer;
import org.lwjgl.opencl.*;

public class MyClClass {
	// OpenCL variables
	public static CLContext context;
	public static CLPlatform platform;
	public static List<CLDevice> devices;
	public static CLCommandQueue queue;

	public static void doSumExample() throws LWJGLException {
		// Create our OpenCL context to run commands
		initializeCL();
		// Create an OpenCL 'program' from a source code file
		CLProgram sumProgram = CL10.clCreateProgramWithSource(context, loadText("sumKernel.cls"), null);
		// Build the OpenCL program, store it on the specified device
		int error = CL10.clBuildProgram(sumProgram, devices.get(0), "", null);
		// Check for any OpenCL errors
		Util.checkCLError(error);
		// Create a kernel instance of our OpenCL program
		CLKernel sumKernel = CL10.clCreateKernel(sumProgram, "sum", null);

		// Used to determine how many units of work to do
		final int size = 100;
		// Error buffer used to check for OpenCL error that occurred while a command was running
		IntBuffer errorBuff = BufferUtils.createIntBuffer(1);

		// Create our first array of numbers to add to a second array of numbers
		float[] tempData = new float[size];
		for(int i = 0; i < size; i++) {
			tempData[i] = i;
		}
		// Create a buffer containing our array of numbers, we can use the buffer to create an OpenCL memory object
		FloatBuffer aBuff = BufferUtils.createFloatBuffer(size);
		aBuff.put(tempData);
		aBuff.rewind();
		// Create an OpenCL memory object containing a copy of the data buffer (the kernel only reads it)
		CLMem aMemory = CL10.clCreateBuffer(context, CL10.CL_MEM_READ_ONLY | CL10.CL_MEM_COPY_HOST_PTR, aBuff, errorBuff);
		// Check if the error buffer now contains an error
		Util.checkCLError(errorBuff.get(0));

		// Create our second array of numbers
		for(int j = 0, i = size-1; j < size; j++, i--) {
			tempData[j] = i;
		}
		// Create a buffer containing our second array of numbers
		FloatBuffer bBuff = BufferUtils.createFloatBuffer(size);
		bBuff.put(tempData);
		bBuff.rewind();

		// Create an OpenCL memory object containing a copy of the data buffer (the kernel only reads it)
		CLMem bMemory = CL10.clCreateBuffer(context, CL10.CL_MEM_READ_ONLY | CL10.CL_MEM_COPY_HOST_PTR, bBuff, errorBuff);
		// Check if the error buffer now contains an error
		Util.checkCLError(errorBuff.get(0));

		// Create an empty OpenCL buffer to store the result of adding the numbers together (the kernel only writes it)
		CLMem resultMemory = CL10.clCreateBuffer(context, CL10.CL_MEM_WRITE_ONLY, size*4, errorBuff);
		// Check for any error creating the memory buffer
		Util.checkCLError(errorBuff.get(0));

		// Set the kernel parameters
		sumKernel.setArg(0, aMemory);
		sumKernel.setArg(1, bMemory);
		sumKernel.setArg(2, resultMemory);
		sumKernel.setArg(3, size);

		// Create a buffer of pointers defining the multi-dimensional size of the number of work units to execute
		final int dimensions = 1; 
		PointerBuffer globalWorkSize = BufferUtils.createPointerBuffer(dimensions);
		globalWorkSize.put(0, size);
		// Run the specified number of work units using our OpenCL program kernel
		CL10.clEnqueueNDRangeKernel(queue, sumKernel, dimensions, null, globalWorkSize, null, null, null);
		CL10.clFinish(queue);

		//This reads the result memory buffer
		FloatBuffer resultBuff = BufferUtils.createFloatBuffer(size);
		// We read the buffer in blocking mode so that when the method returns we know that the result buffer is full
		CL10.clEnqueueReadBuffer(queue, resultMemory, CL10.CL_TRUE, 0, resultBuff, null, null);
		// Print the values in the result buffer
		for(int i = 0; i < resultBuff.capacity(); i++) {
			System.out.println("result at " + i + " = " + resultBuff.get(i));
		}
		// This should print out 100 lines of result floats, each being 99.

		// Destroy our kernel and program
		CL10.clReleaseKernel(sumKernel);
		CL10.clReleaseProgram(sumProgram);
		// Destroy our memory objects
		CL10.clReleaseMemObject(aMemory);
		CL10.clReleaseMemObject(bMemory);
		CL10.clReleaseMemObject(resultMemory);
		// Destroy the OpenCL context
		destroyCL();
	}


	public static void initializeCL() throws LWJGLException { 
		IntBuffer errorBuf = BufferUtils.createIntBuffer(1);
		// Create OpenCL
		CL.create();
		// Get the first available platform
		platform = CLPlatform.getPlatforms().get(0); 
		// Run our program on the GPU
		devices = platform.getDevices(CL10.CL_DEVICE_TYPE_GPU);
		// Create an OpenCL context, this is where we could create an OpenCL-OpenGL compatible context
		context = CLContext.create(platform, devices, errorBuf);
		// Create a command queue
		queue = CL10.clCreateCommandQueue(context, devices.get(0), CL10.CL_QUEUE_PROFILING_ENABLE, errorBuf);
		// Check for any errors
		Util.checkCLError(errorBuf.get(0)); 
	}


	public static void destroyCL() {
		// Finish destroying anything we created
		CL10.clReleaseCommandQueue(queue);
		CL10.clReleaseContext(context);
		// And release OpenCL, after this method call we cannot use OpenCL unless we re-initialize it
		CL.destroy();
	}


	public static String loadText(String name) {
		if(!name.endsWith(".cls")) {
			name += ".cls";
		}
		BufferedReader br = null;
		String resultString = null;
		try {
			// Get the file containing the OpenCL kernel source code
			File clSourceFile = new File(MyClClass.class.getClassLoader().getResource(name).toURI());
			// Create a buffered file reader for the source file
			br = new BufferedReader(new FileReader(clSourceFile));
			// Read the file's source code line by line and store it in a string builder
			String line = null;
			StringBuilder result = new StringBuilder();
			while((line = br.readLine()) != null) {
				result.append(line);
				result.append("\n");
			}
			// Convert the string builder into a string containing the source code to return
			resultString = result.toString();
		} catch(NullPointerException npe) {
			// If there is an error finding the file
			System.err.println("Error retrieving OpenCL source file: ");
			npe.printStackTrace();
		} catch(URISyntaxException urie) {
			// If there is an error converting the file name into a URI
			System.err.println("Error converting file name into URI: ");
			urie.printStackTrace();
		} catch(IOException ioe) {
			// If there is an IO error while reading the file
			System.err.println("Error reading OpenCL source file: ");
			ioe.printStackTrace();
		} finally {
			// Finally clean up any open resources (br is still null if the file was never opened)
			try {
				if(br != null) {
					br.close();
				}
			} catch (IOException ex) {
				// If there is an error closing the file after we are done reading from it
				System.err.println("Error closing OpenCL source file");
				ex.printStackTrace();
			}
		}

		// Return the string read from the OpenCL kernel source code file
		return resultString;
	}
}

The OpenCL kernel source file ("sumKernel.cls") should look like this if you save it in a text file.

kernel void sum(global const float* a, global const float* b, global float* result, int const size) {
   const int itemId = get_global_id(0); 
   if(itemId < size) {
      result[itemId] = a[itemId] + b[itemId];
   }
}

Now you should be able to write your own OpenCL kernels which do something useful, create a CL context to run them in, set their arguments, and read or write any data you need. Now go and do thou likewise.