Stochastic Iterative Tile-Based Compute Ray Tracing (SITBC-RT)

(or, how I manage to make ray tracing a bit faster without having one of those fancy new GPUs…)

Crimild’s existing ray tracing solution (yes, that’s a thing) is still very basic, but it is capable of rendering complex scenes with several primitive types (e.g. spheres, cubes, planes). The problem is that it’s too slow, mainly because it’s completely software based and runs on the CPU. True, it does make use of multithreading to reduce render times, but it’s still too slow to produce nice images in minutes rather than hours.

Therefore, I wanted to take advantage of my current GPU(*) in order to gain some performance and decrease rendering times. Since I don’t own a graphics card with ray tracing support, the only option I had left was to use GPU Compute instead.

(*) My Late 2018 MacBook Pro has a Radeon Pro Vega 20 dedicated GPU, with only 4GB of VRAM.

Requirements

What I needed was a solution that provides at least the following features:

  • Physically correct rendering: Of course, the whole idea is to render photorealistic images.
  • Short render times: Ray tracing is slow by definition, so the faster an image is rendered, the better.
  • Iterative: Usually, I don’t need the final image to be completely rendered in order to understand if I like it. As a matter of fact, only a few ray bounces are required to know that. So, the final image should be composed over time. The more time passes, the better the image quality gets, of course.
  • Interactive: This is probably the most important requirement. I need the app to be interactive while the image is being rendered, not only so I can use menus/buttons, but also to allow me to reposition the camera or change the lighting conditions and then check the results of those decisions as fast as possible.
  • Scalable: I want a solution that works on any relatively modern hardware. Like I said, my setup is limited.

The current solution (the CPU-based one) is already physically correct, iterative and quite scalable too, but it falls short on the other requirements. And it is nowhere close to being interactive.

Initial attempts

My first attempts were very straightforward.

Using shaders, I started by casting one ray for each pixel in the image, making it bounce for a while and then accumulating the final color. It was, after all, the very same process I use in the CPU-based solution, but now triggering thousands of rays in parallel thanks to “The Power of the GPU”.

Render times did drop considerably with this approach, generating images in a matter of seconds instead of minutes. And it seemed like the correct solution. But it had one major flaw: it was not scalable and, most importantly, it wasn’t even stable.

For simple scenes, everything was ok. Yet, the more complex the scene became, the more computations the GPU had to do per pixel and per frame. That could stall the GPU completely, eventually crashing my computer (yes, Macs can crash).

So, a different approach was needed.

Think different

So I took a few steps back.

I knew that the most important goal was not to produce images faster (keep reading), but to keep the simulation interactive, and stable too. That meant running it at 30 fps (or more, if possible), leaving a time budget of 33ms (or less) for a single frame.

In short, I needed to constrain the amount of work done in each frame, even if that means the final image might take a bit longer to render. It will still be faster than the CPU-based approach, of course.

After some thought, I came up with a new approach: for each frame, each thread computes only a single ray. Let me repeat that: compute one ray per frame, not one pixel. Obviously, producing the final color for a single pixel will require multiple frames, but that’s fine.

How does it work?

  1. Given a pixel, we cast a ray from the camera through that pixel.
  2. We compute intersections for the current ray.
  3. If the ray hits an object, the material color is accumulated and a new ray is spawned based on reflection/refraction calculations. The new ray is saved in a bag of rays for a future iteration.
  4. On the next frame, we grab one ray from the bag and compute the intersections for it, spawning another ray if needed.
  5. If there are no more rays in the bag for that pixel, we can accumulate the final color and update the image.
  6. And then we start over for that pixel.

And that’s it.

Notice that in step 4 we grab any ray from the bag. I don’t care about the order in which the rays were created. Eventually, we’ll process all of them, so there’s really no need to handle them in order. Even if more rays are generated along the way, the final image will converge.
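To make the idea more concrete, here is a minimal CPU-side C++ sketch of one step of this loop. The names and types below are hypothetical placeholders, not Crimild’s actual code; the real version runs inside a compute shader, with the bag of rays and the accumulation image stored in GPU buffers.

```cpp
#include <cstdlib>
#include <vector>

// Hypothetical types. Ray origins/directions and the real intersection and
// scattering code are omitted for brevity.
struct Color { float r = 0, g = 0, b = 0; };

struct PendingRay {
    int pixelIndex;   // which pixel this path belongs to
    Color throughput; // accumulated attenuation along the path so far
};

struct Accumulator {
    std::vector< Color > sum; // running sum of finished samples, per pixel
    std::vector< int > count; // number of finished samples, per pixel
};

// One unit of work, executed once per thread per frame.
void step( std::vector< PendingRay > &bag, Accumulator &image )
{
    if ( bag.empty() ) {
        // Nothing pending: start a new sample by casting a primary ray
        // from the camera through the next pixel (pixel choice omitted).
        bag.push_back( { 0, { 1.0f, 1.0f, 1.0f } } );
        return;
    }

    // Grab *any* pending ray. Order is irrelevant: every ray gets processed
    // eventually and the image converges regardless.
    PendingRay ray = bag.back();
    bag.pop_back();

    // Intersect the ray against the scene (stubbed out with a coin flip here).
    const bool hitSomething = ( std::rand() % 2 ) == 0;
    if ( hitSomething ) {
        // Attenuate by the material color and push the reflected/refracted
        // ray back into the bag for some future frame.
        ray.throughput = { 0.5f * ray.throughput.r, 0.5f * ray.throughput.g, 0.5f * ray.throughput.b };
        bag.push_back( ray );
    } else {
        // No more rays for this path: accumulate the final color and bump the
        // sample count so the image can be resolved as a running average.
        image.sum[ ray.pixelIndex ].r += ray.throughput.r;
        image.sum[ ray.pixelIndex ].g += ray.throughput.g;
        image.sum[ ray.pixelIndex ].b += ray.throughput.b;
        image.count[ ray.pixelIndex ] += 1;
    }
}
```

Since each thread runs exactly one of these steps per dispatch, the per-frame cost stays bounded no matter how complex the scene is.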

Results

This is indeed a very scalable solution, since we can choose how many threads to execute on the GPU each frame: 1, 30, 500, 1000 or even one per pixel in the image if the scene is not too complex or we have a powerful setup.

Of course, the fewer threads we execute, the longer the image will take to complete. But it still takes less than a minute to produce a “good enough” result on my current computer.
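In practice, bounding the per-frame work is just a matter of how many compute invocations get dispatched each frame. The snippet below is only a generic Vulkan sketch with made-up names, not Crimild’s actual code:

```cpp
#include <vulkan/vulkan.h>

// Hypothetical sketch: dispatch only a bounded number of ray workers per frame.
// raysPerFrame can be tuned from 1 up to one per pixel, depending on the hardware.
void dispatchRayStep( VkCommandBuffer cmd, uint32_t raysPerFrame )
{
    const uint32_t workgroupSize = 64; // must match local_size_x in the compute shader
    const uint32_t groupCount = ( raysPerFrame + workgroupSize - 1 ) / workgroupSize;
    vkCmdDispatch( cmd, groupCount, 1, 1 );
}
```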

Next steps

There’s always more work to do, of course.

Always.

Beyond adding new features to the ray tracer (like new shapes or materials), there is still room for optimization. For example, I’m not using any acceleration structure (e.g. a bounding volume hierarchy) like the CPU-based approach does. Once I do that, performance will be even better.

And there is the problem of the noise in the final image, but I don’t really have a good solution for that yet.

See you next time!

A More Correct Raytracing Implementation

Happy 2021!!

I decided to start the new year by continuing to experiment with compute shaders, especially raytracing. I actually managed to fix the issues I was facing in my previous posts and added some new material properties, like reflection and refraction.

My implementation uses an iterative approach to sampling, computing only one sample per frame and accumulating the result into the final image. If the camera moves, the image is reset (set to black, basically) and the process starts again.
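The accumulation itself is just a running average. Here is a minimal, hypothetical sketch of it (not the actual Crimild types), including the reset that happens when the camera moves:

```cpp
#include <algorithm>
#include <vector>

// Minimal sketch of progressive accumulation (hypothetical names). Each frame
// adds one sample per pixel; the color shown on screen is the average so far.
struct Color { float r = 0, g = 0, b = 0; };

struct ProgressiveImage {
    std::vector< Color > accum; // sum of all samples so far, per pixel
    int sampleCount = 0;        // how many samples have been accumulated

    // Camera moved: throw everything away and start over from black.
    void reset()
    {
        std::fill( accum.begin(), accum.end(), Color {} );
        sampleCount = 0;
    }

    void addSample( int pixelIndex, const Color &sample )
    {
        accum[ pixelIndex ].r += sample.r;
        accum[ pixelIndex ].g += sample.g;
        accum[ pixelIndex ].b += sample.b;
    }

    // Called once per frame, after every pixel received its new sample.
    void endFrame() { ++sampleCount; }

    // The color actually presented on screen.
    Color resolve( int pixelIndex ) const
    {
        if ( sampleCount == 0 ) {
            return Color {};
        }
        const float inv = 1.0f / sampleCount;
        return { accum[ pixelIndex ].r * inv, accum[ pixelIndex ].g * inv, accum[ pixelIndex ].b * inv };
    }
};
```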

Here’s a video of the image sampling process in action:

You might have noticed that the glass sphere in the center is black whenever the camera moves. That is because, while the camera is moving, the number of bounces for each ray is limited to one in order to provide a smoother experience when repositioning the view. Once the camera is no longer moving, the bounce count is set to ten or more and the glass sphere is computed correctly, showing proper reflections and refractions.
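That per-frame decision could look something like this (again a hypothetical helper, reusing the ProgressiveImage type from the sketch above, not Crimild’s actual code):

```cpp
// While the camera moves, reset the accumulation and trace a single bounce for
// responsiveness; once it settles, go back to the full bounce budget.
int beginFrame( bool cameraMoved, ProgressiveImage &image )
{
    if ( cameraMoved ) {
        image.reset(); // start re-sampling from black
        return 1;      // cheap preview while repositioning the view
    }
    return 10;         // full quality once the camera is still
}
```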

Here’s another example:

In the example above, you can also see the depth of field effect in action, which is another camera property that can be tuned in real-time:

In this case, the image is reset whenever the camera’s focus changes.

I’m really happy with this little experiment. I don’t think that it’s good enough for production yet, since it’s still too slow for any interactive project. But it’s definitely something that I’m going to keep improving whenever I have the chance.

Victory!

Throughout this weird year I managed to accomplish a lot of different milestones while refactoring the rendering system in Crimild. Yet, the year was coming to an end and there was one feature in particular that was still missing: compute operations.

Then, this happened:

That, my friends, is the very first image created by using a compute pass in Crimild. The image is then used as a texture that is presented to the screen. Both compute and rendering passes are managed by the frame graph and executed every frame in real-time.

At the time of this writing I haven’t implemented true synchronization between the graphics and compute queues, meaning that the compute shader might still be writing the image by the time it is read by the rendering engine, which produces some visual artifacts every once in a while.
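For reference, when both passes are recorded on the same queue, the usual fix is a pipeline barrier between the compute write and the fragment-shader read. The snippet below is only a generic Vulkan sketch (not Crimild’s code), assuming the storage image is written in VK_IMAGE_LAYOUT_GENERAL; true cross-queue sharing would additionally need semaphores and a queue family ownership transfer:

```cpp
#include <vulkan/vulkan.h>

// Generic Vulkan sketch: make the compute shader's writes to the storage image
// visible to the fragment shader that samples it later in the frame.
void insertComputeToGraphicsBarrier( VkCommandBuffer cmd, VkImage storageImage )
{
    VkImageMemoryBarrier barrier = {};
    barrier.sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;
    barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
    barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;
    barrier.oldLayout = VK_IMAGE_LAYOUT_GENERAL;
    barrier.newLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;
    barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.image = storageImage;
    barrier.subresourceRange = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1 };

    vkCmdPipelineBarrier(
        cmd,
        VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,  // wait for the compute pass to finish writing...
        VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT, // ...before the fragment shader reads the image
        0,
        0, nullptr,
        0, nullptr,
        1, &barrier );
}
```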

Of course, I had to push forward.

A few hours passed and the next compute shader that I made was used to implement a very basic path tracer completely in the GPU:

It’s not a true real-time ray tracing solution (since I don’t have a GPU with proper RTX support), but sampling is done incrementally, allowing me to reposition the camera in real-time:

I’m still amazed about how easy it was to port my software-based path tracer to the GPU.

So much power…

So much potential…

I wanted more…

I needed more…

I became greedy.

I flew too close to the Sun.

And I got burnt.

Then I learned a valuable lesson. It turns out that if I screw up the shader code in some specific way (which I’m still trying to understand), weird things happen. Like my computer crashing… bad (as in having to turn it off and on again bad).

Next steps

I’m planning on (finally) merging the Vulkan branch at this point, since all major features are done. Sure, there are things that still need to be fixed and cleaned up, but they don’t really depend on Vulkan itself: behaviors, animations and sound, which is broken (again).

Plus, I really want to release Crimild v5.0 in the next decade.

See you next year!