Writeup

Summary

We created a high-level Lua/Terra API which uses CUDA to execute Terra functions on the GPU. Our API can translate arbitrary Terra code to CUDA code and handles all device memory management and similar issues behind the scenes.

Background

Writing GPU code is time-consuming. CUDA, the best available language for GPU programming, requires a lot of work from the programmer in order to do even simple tasks. The programmer must remember to keep host memory and device memory in sync. He must calculate the number of threads and warps he needs (and the term "warp" is quite confusing if you haven't heard it before). He must calculate for himself the appropriate index at each instance of his kernel. And most importantly, he can only write GPU code in CUDA!

Some APIs like PyCUDA and JCuda have attempted to tackle this problem. Unfortunately, they only address the last point raised above—one can use CUDA functions in the course of writing a Python program, but it is neither easy nor intuitive. In PyCUDA, for example, one must actually embed CUDA code as a string inside of the Python program. Terracuda's goal is to abstract away all these issues and let the programmer write GPU code at his convenience.

Approach

Lua is a fast, lightweight, and embeddable scripting language found in places like Wikipedia, World of Warcraft, Photoshop Lightroom, and more. Lua's simple syntax and dynamic typing also make it an ideal language for novice programmers. Terra is a language embedded in Lua which gives easier access to low-level functions (e.g. malloc).

Terra has a few key features we used: it JIT-compiles to LLVM, it interoperates with Lua, and its types are treated as values in Lua. This combined with the recent release of NVIDIA's NVPTX LLVM backend means that we can easily translate between Terra code and CUDA code. Its interoperation means that any Lua value can be used in Terra code, and its type system means that Lua can generate dynamic types on the fly.

Before we delve into the specifics of the API, let's look at a sample implementation of the SAXPY routine in regular Lua and using Terracuda.

Lua
local P, N = {X = {}, Y = {}, a = 2.0}, 1000
-- populate X and Y

for i = 1, N do
   P.Y[i] = P.a * P.X[i] + P.Y[i]
end
Terracuda
local P, N = {X = {}, Y = {}, a = 2.0}, 1000
-- populate X and Y

local ptype = cudalib.make_struct_type(P)
local saxpy = cudalib.make_kernel(
terra(p : &ptype, i : int)
   p.Y[i] = p.a * p.X[i] + p.Y[i]
end)

saxpy(P, N)

Both pieces of code above will produce the same output (i.e. P.C will contain the correct values). You may notice, however, that device memory, thread indexing, kernel calls, etc. are all conspicuously absent from the Terracuda code. This is all handled by the API. Here's how it works: consider that you have some table P and a Terra function F that operates on P which you want to convert to GPU code. The process is as follows:

  1. Run make_struct_type(P). Because Terra types are Lua values, Lua can loop through P and generate a new struct type that corresponds exactly with the given input. So in the above SAXPY example, the generated type is struct P { X: &int, Y: &int, a: double }.
  2. Run make_kernel(F). This pre-allocates device memory for every entry inside the struct if possible (the exception being if we have variable length arrays). Then, it JIT-compiles F to CUDA code for later use. Lastly, make_kernel returns a closure over these values which will copy the input into the allocated device memory, run the kernel on the device parameters, and copy the modified memory back to the host.
  3. Run the result of make_kernel with your input parameters as well as the number of threads N to create.

This approach gives great freedom to Terracuda users, as essentially any CUDA program can be translated to Terracuda without much trouble. Note that Terracuda is not purely data-parallel—you can run your kernel over any number of threads with any input! Initially, we wrote the API just in terms of data-parallel primitives like map and reduce, but over time as we learned more about Terra we realized that it empowers us to write more flexible code, and consequently we added more features to the API like struct type creation. For example, matrix multiply looks like this:

local P = {A = {}, B = {}, C = {}, N = 1000}
-- populate A and B

local stype = cuda.make_struct_type(P)
local kernel = cuda.make_kernel(
terra(p : &stype, idx : int)
   var i, j, sum = idx / p.N, idx % p.N, 0
   for k = 0, p.N do
      sum = sum + p.A[i * p.N + k] * p.B[k * p.N + j]
   end

   p.C[idx] = sum
end)

kernel(P, P.N * P.N)

For comparison, the map map primitive looked similar but granted less flexibility. It looked like this:

local A = {}
-- populate A with integers

local kernel = cuda.map(terra(x : int) : int
   return x + 1
end)

kernel(A)

That's the basics of the API. You can find the source for the API here: https://github.com/willcrichton/terracuda

Results

The API itself works great, so we concerned ourselves mostly with performance and concision when considering results. We wrote a myriad of applications in C, CUDA, Terra, and Terracuda ranging from SAXPY to a circle renderer we wrote for class. For each application, we sampled execution time and also recorded the size of the codebase required for the CUDA implementation versus the Terracuda one. All code was tested on a Mid-2012 Macbook Pro with an NVIDIA GTX 480 GPU. Each program used the input size listed. We found the following execution times:

So far, Terracuda (despite being JIT-compiled in a dynamically typed language) is mostly competitive with C and CUDA. In some cases, like with matrix multiplication, it actually performs better than hand-written CUDA! It doesn't work as well, however, on simple programs like SAXPY as the overhead of compiling/spinning up a kernel is greater than any performance gains. It's worth noting that Terracuda spends, on average, about 3x as much time booting up its kernel than CUDA, primarily because CUDA programs are precompiled and Terracuda is JIT-compiled, so our API does bring greater overhead (noticeable in the poor results for SAXPY).

Experience leads us to believe that the remaining difference between Terracuda and CUDA could be eliminated by implementing support for constant and shared memory annotations. The circle renderer, for example, has a loop over a set of circles that updates an array data for every circle. By switching the updates to data with a temporary array not in global memory which gets copied at the end, we found a 45% decrease in execution time in the Terracuda implementation (from about 2.3s to 1.5s). However, arrays like position and radius are still stored in global memory (as opposed to constant memory as is used in the CUDA version), so CUDA still beats out Terracuda on the circle renderer.

Terracuda also performs better than the CUDA SHA1 hash generator we wrote, and while promising, doesn't necessarily mean we expect Terracuda to be faster than CUDA. Due to time constraints, the CUDA SHA1 code we wrote could be made more efficient by perhaps utilizing shared or constant memory on the GPU.

The only exception to this was the hash generator, in which both the CUDA and Terracuda versions were of similar length. We attribute this to the fact that C has much simpler ways of dealing with individual bits and bytes (Lua 5.2 does not support bitwise operations), and that when implementing the SHA1 hashing algorithm, the majority of the work is in the algorithm itself, not in the C/CUDA/Terra setup. So even though it was definitely simpler to use Terracuda in this case, the algorithm itself requires a minimum number of steps, which translate directly into lines of code.

Code base comparison also yielded favorable results. Terracuda was consistently less than 1/2 the size of its CUDA counterpart, which emphasizes the concise nature of both Lua and the API we've created. Granted, lines of code is not a perfect benchmark for good/bad code (since it also includes comments/includes/etc.), but it gives us a good impression of relative code sizes. See the results:

References

DeVito, Zachary et al. Terra: A Multi-Stage Language for High-Performance Computing. 2013.
Ierusalimschy, Roberto. Programming in Lua: Third Edition. Lua.org, Jan 2013.

Work done

Equal work was performed by both project members.

Checkpoint

The project is proceeding mostly according to plan. Here's the highlights:

Proposal

Summary

We will create a CUDA API for Lua aimed at programmers unfamiliar with GPU-level parallelism.

Background

Lua is a fast, lightweight, and embeddable scripting language found in places like Wikipedia, World of Warcraft, Photoshop Lightroom, and more. Lua's simple syntax and dynamic typing also make it an ideal language for novice programmers. Traditionally, languages like Lua find themselves abstracted miles above low-level parallel frameworks like CUDA, and consequently GPU parallelism was limited to programmers using a systems language like C++. Frameworks like Terra, however, work to close that gap, making low-level programming accessible in a high-level interface. However, these interfaces still require a number of calls to C libraries and intimate knowledge of the CUDA library. For example, the following code runs a simple CUDA kernel in Terra:

terra foo(result : &float)
    var t = tid()
    result[t] = t
end

local R = terralib.cudacompile({ bar = foo })

terra run_cuda_code(N : int)
    var data : &float
    C.cudaMalloc([&&opaque](&data),sizeof(float)*N)
    var launch = terralib.CUDAParams { 1,1,1, N,1,1, 0, nil }
    R.bar(&launch,data)
    var results : &float = [&float](C.malloc(sizeof(float)*N))
    C.cudaMemcpy(results,data,sizeof(float)*N,2)
    return results;
end

results = run_cuda_code(16)

Other high-level CUDA bindings like PyCUDA and JCuda suffer the same problem.

The Challenge

The problem is challenging foremost on the level of architecture. Designing an API is never easy, and attempting to expose GPU-level parallelism to a language as high-level as Lua requires a great deal of care to be usable while still being useful. Creating such an API requires significant knowledge of the abstraction layers between Lua, C, and CUDA as well as knowledge of the typical use cases for high-level parallelism.

My partner and I know neither Terra nor LLVM (which Terra compiles to), so creating these high-level bindings requires a great deal of initial investment. The existing interface between Terra and CUDA is sketchy at best, so we will need to implement significant new functionality into Terra in order for the Circle Renderer to function properly.

Resources

For machines, we'll just be using any computers equipped with NVIDIA GPUs (i.e. Will's laptop and the Gates 5k machines). No other special hardware/software will be needed. We'll be building upon the Terra language and also using LuaGL for some of the demos.

Goals

The project has three main areas: writing the API, creating programs using the API, and benchmarking the code against other languages/compilers.

We plan to achieve:

We hope to achieve:

Platform

CUDA makes sense as we've already learned it in class, and Lua makes sense as Terra already laid the foundation for abstracting systems-level code.

Schedule

Original schedule

New schedule