Particle System Improvements

Here’s a little something that I’ve been doing on the side.

The fire effect was created with a new Particle System node that includes tons of customization controls. For example, you can choose the shape of the particle emitter from several mathematical shapes, like cylinders (the one in the video), cones, spheres, etc.
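To give an idea of what a cylinder-shaped emitter does, here is a minimal sketch of how a spawn position could be picked. This is not Crimild's actual code: `Vec3` and `sampleCylinder` are hypothetical names, and the sketch assumes the cylinder's axis is Y with its base at y = 0.

```cpp
#include <cassert>
#include <cmath>
#include <random>

struct Vec3 { float x, y, z; };

// Hypothetical helper: pick a random point inside a cylinder whose axis
// is Y and whose base sits at y = 0 (both are assumptions of this sketch).
template <class Rng>
Vec3 sampleCylinder(float radius, float height, Rng &rng)
{
    assert(radius >= 0.0f && height >= 0.0f);
    std::uniform_real_distribution<float> unit(0.0f, 1.0f);
    const float twoPi = 6.28318530718f;
    const float angle = twoPi * unit(rng);
    // The square root keeps points evenly spread over the disc instead of
    // clustering them around the axis.
    const float r = radius * std::sqrt(unit(rng));
    const float y = height * unit(rng);
    return { r * std::cos(angle), y, r * std::sin(angle) };
}
```

Calling `sampleCylinder(0.2f, 0.1f, rng)` with a `std::mt19937` would match the radius and height used for the fire effect's emitter.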

In addition, some particle properties like size and color can be interpolated between starting and ending values. The interpolation method is fixed for the moment, but it will be customizable in the near future to use different interpolation curves.
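The post does not say which fixed method is used, so the following sketch simply assumes plain linear interpolation over the particle's normalized age. `interpolate`, `normalizedAge` and `RGBA` are illustrative names, not part of Crimild.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using RGBA = std::array<float, 4>;

// Linear blend: t = 0 returns `start`, t = 1 returns `end`.
inline float interpolate(float start, float end, float t)
{
    return start + (end - start) * t;
}

// Same blend applied to each color channel.
inline RGBA interpolate(const RGBA &start, const RGBA &end, float t)
{
    RGBA out{};
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = interpolate(start[i], end[i], t);
    }
    return out;
}

// Map a particle's age to the [0, 1] range used by the blends above.
inline float normalizedAge(float age, float lifetime)
{
    assert(lifetime > 0.0f);
    const float t = age / lifetime;
    return t < 0.0f ? 0.0f : (t > 1.0f ? 1.0f : t);
}
```

Under this assumption, a particle halfway through its lifetime would have a size and color halfway between the start and end values.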

At the moment, the particle system is implemented entirely on the CPU, meaning no point sprites or any other optimization techniques are used, which leads to performance issues as the particle count gets bigger (the effect in the video has a particle count of 50).

The particle system in the video above is configured in a Lua script as follows:

{
   type = 'crimild::ParticleSystem',
   maxParticles = 50,
   particleLifetime = 2.0,
   particleSpeed = 0.75,
   particleStartSize = 0.5,
   particleEndSize = 0.15,
   particleStartColor = { 1.0, 0.9, 0.1, 1.0 },
   particleEndColor = { 1.0, 0.0, 0.0, 0.5 },
   emitter = {
      type = 'cylinder',
      height = 0.1,
      radius = 0.2,
   },
   precomputeParticles = true,
   useWorldSpace = true,
   texture = 'assets/textures/fire.tga',
},

Even though the new system is quite easy to configure, it still requires a lot of trial and error to get something nice on screen. Maybe it’s time to start thinking about a scene editor…

 

Praise the Metal – Epilogue

Oh, look, another post about Metal.

Don’t worry, I promise this will be the last one, at least for a while.

In this final post in the Metal series, I’ll make a summary of what we’ve achieved so far, the good and bad decisions, and some forecast for future improvements.

Let’s finish this…

What went right?

Working on an actual project instead of a demo. The first Metal demo was lacking a lot of features and was more of a test drive that helped me understand how the API worked. On the other hand, Le Voyage was already published by the time I started porting it to Metal, so whatever I ended up implementing had to support a minimal set of functionalities if I wanted it to be part of the game. In addition, it helped me focus on specific features rather than go with a more generic implementation. For example, there was no need for skeletal animation, so I skipped that feature completely.

Metal benefits showed up from the very beginning. Metal started to show its strength as soon as I completed the most basic implementation. The frame rate went up by at least 50% and the game felt much more fluid, even with all effects turned on and running at Retina resolutions. And this was before any optimizations were even considered.

Crimild’s design was up to the task and did not require any major modifications. Honestly, this was a nice surprise. When I started working with Metal, one of my biggest concerns had to do with what changes were required in Crimild (especially in the core-level libraries) in order to support the new paradigm. Such changes, in turn, might have had side effects on other implementations, which are [supposed to be] much more stable. But it turned out that it was possible to encapsulate all the Metal code (hacks included) within the Renderer itself. No major design changes were required and I was really happy about that.

Switching between OpenGL and Metal on iOS is done with a single flag. This was critical in order to understand how well the new renderer performed. The CrimildViewController class has a single flag that defines whether or not to use Metal. Most importantly, it will fall back to OpenGL automatically on those platforms that don’t support Metal.

Metal shaders are a win. I love the way Metal works with shaders and since they are precompiled along with the rest of the application, there’s no need for extra error handling at run-time. And the errors are really easy to understand.

GLSL translated fairly straightforwardly to Metal’s shading language. The language itself is a bit different, since MLSL is based on C++11 instead of ANSI C, yet the logic behind the shaders is pretty much the same. And the MLSL Standard Library provides many of the tools that are available in GLSL too. I’m still not sure how to achieve a level of abstraction for writing shaders regardless of the platform, though (as I’m doing with GLSL).

Image effects are easier to work with on Metal. Handling FBOs turned out to be much easier than in OpenGL. The API is a lot less verbose and error checking is much more straightforward. This led to a simpler way to achieve post-processing effects. And, as I mentioned in a previous post, I still need to use one of Metal’s features to avoid using a vertex or fragment shader function when it’s not needed, which should make things even easier.

The tools. Both Xcode and Instruments were upgraded with a set of tools that made my life a lot easier when profiling and optimizing Le Voyage. The Frame Navigator and Shader Profiler were particularly useful for finding those bits of code that were causing the most performance problems.

It took me more time to write these posts than to actually implement the whole thing. I’m not a fast writer, that’s true. Still, implementing the entire rendering solution based on Metal was a lot easier than what I originally estimated. The API is really well documented and the example code provided by Apple is clear and easy to follow.

What went wrong?

Metal only works on selected platforms. This is my main concern for Metal, since it only works on selected Apple devices. From what I’ve seen so far, working with Metal is pretty similar to working with Vulkan, but we still need to maintain both APIs. And OpenGL too, of course.

The Metal API requires us to do more things than in OpenGL. While it’s true that some functionalities were easier to implement in Metal, most of the time the API was more verbose than its counterpart in OpenGL. But this was expected, since Metal is supposed to be a lower-level API. On the plus side, the mechanisms that required more code were usually invoked only once at the beginning or at very low rates.

FBOs are required for all render passes. The Metal Renderer assumes at least a forward-like render pass, with support for offscreen frame buffers by default. This leaves out Crimild’s basic render pass that is used for rendering scenes that don’t require too much graphic complexity.

Translation between Metal and Crimild objects was not always straightforward. I said before that there was no need for major design changes in Crimild. That was one of the design goals from the beginning. But in order to keep this statement true, I ended up doing some weird conversions and hacks, like assigning render pipelines to shader programs instead of materials.

Uniform buffers could have been handled in a better way. I said it before and I’m saying it again: I really don’t like the way I ended up implementing uniform buffers in Crimild. It was probably due to a lack of experience on my side, but this is definitely a point of improvement for future versions.

No C API for Metal. The Metal API is completely written in Objective-C, which led to some weird code and argument passing in the final implementation. It never represented a problem since I’ve worked with mixed code before, but the code looks funny. The object-oriented API design is a plus, though.

What’s next?

I assume there are going to be several improvements to Crimild’s Metal support in the near future as I start using it for other projects. Again, I don’t like the way uniforms are handled, so that’s going to be my first focal point for sure.

In addition, I’m planning on supporting deferred rendering in Metal and improving the lighting model to support PBR at some point, too.

Finally, at WWDC 2016 last week Apple announced some new features for Metal that I would love to include in future versions. For example, Metal now supports dynamic polygon tessellation. In concept, it seems similar to OpenGL’s geometry shaders and I know for a fact that in order to support such a feature I need to change the shader design in Crimild. But that’s on my wish list too.

Final words

Well, that’s it. I’m done.

I’ve never written this much about anything before, so this was a clear achievement for me, beyond Metal itself. I can’t promise that I will keep this pace for future Crimild features, though. So don’t get too excited.

As usual, feel free to look at the code on GitHub and provide comments. Crimild’s Metal support is still experimental (look for the devel branch), but it will be included in the next release anyway. And if you want to see it in action, go and download Le Voyage from the App Store if you haven’t played it already.

All in all, it has been a great journey. I was hyped by Metal from the beginning but I wasn’t expecting such amazing results considering the small amount of time that I invested.

Now that I know the power of Metal, I can’t wait to get my hands on Vulkan…

 

Praise the Metal – Part 7: Profiling and Optimizing our App

As we approach the end of our journey, one question pops up in our heads: was it worth it? Why did we face all the challenges of this new paradigm introduced by Metal? What exactly did we gain from all this? What is life?

Fortunately for us, Xcode comes with several tools that will let us answer these questions (maybe not the last one). So, without further ado, let’s take a look at them…

42?

The following image shows a capture of a single frame in Le Voyage:

Screen Shot 2016-06-16 at 10.52.50 AM.png

On the left, we have the Frame Navigator displaying all states and draw calls present in the frame, grouped by type (e.g. render encoders, command buffers, data buffers, etc). There’s an option to switch between viewing the frame by draw calls or by performance. I find the latter much more useful when looking for optimization points, thanks to the timings displayed for each operation:

Screen Shot 2016-06-16 at 11.16.33 AM.png

In the center of the screen we have the Attachment Viewer. Since Crimild does not support multiple attachments for Metal at the moment, only the first color attachment is shown above.

It’s important to note that, as we move through the Frame Navigator, the color attachment can display partial results. For example, it’s possible to display the scene with and without the post-processing filter by selecting the corresponding render command encoder on the left.

Screen Shot 2016-06-16 at 11.00.18 AM.png

The Resource Inspector will show all resources currently in use for the given frame:

Screen Shot 2016-06-16 at 11.06.19 AM.png

 

Notice both source and accumulation framebuffer objects in the right hand panel, as well as textures and data buffers. For textures, we can see not only the source image, but also mipmap levels, cube maps, etc. Regarding buffers, we can check their contents as well.

Finally, the State Inspector is shown at the bottom of the screen, allowing us to inspect properties for all available Metal objects.

Moving on, there’s the GPU report providing measurements in frames per second and timings for both CPU and GPU processes. The most expensive render operations are shown as well, indicating which shaders and draw calls we should focus on when optimizing.

Screen Shot 2016-06-16 at 11.12.18 AM.png

But maybe the most interesting tool of all is the Shader Profiler and editor.

Screen Shot 2016-06-16 at 11.19.24 AM.png

Not only is it possible to edit and recompile shaders on the fly, but have you noticed the grey bars with the numbers in the image above? Those values indicate which operations are the most expensive (relative to that shader) and therefore the ones that deserve our attention. Yes, the profiler will show which lines are the slowest ones!!

Also, notice the warning marks? Good guy Xcode will tell us when we’re doing things wrong with clear messages:

Screen Shot 2016-06-16 at 1.02.32 PM.png

Did I mention that all this works on iOS? Amazing!

What about Instruments?

But wait, there’s more. All the tools in Xcode are incredibly useful, yet it’s the Metal System Trace in Instruments that really shines, allowing us to profile both the CPU and the GPU down to the microsecond level.

Screen Shot 2016-06-16 at 11.28.03 AM.png

The image above shows an in-depth look at our application’s graphics workload over time across all layers of the graphics stack. Starting with the CPU at the top, the new trace tool will let us inspect shader compilations, driver activity, each of the individual command stages and, finally, the drawables themselves.

It’s worth mentioning that this new tracer works a bit differently than the one for OpenGL, as we won’t get the profiler analysis in real time. Once we start the tracing tool, it will keep recording the app indefinitely until stopped, and only then will we be able to see the results. This is called Windowed Mode by Instruments.

In the timeline, colors are used to identify each frame so we can easily track their start and end times and how long it took until they were displayed. Probably the one that will require most of our attention is the white color, since that basically means wasted time. I’ll explain this later in this post.

The Details Panel at the bottom of the screen is also very useful. For example, the image below shows timings for each of the encoders in a very clear way:

Screen Shot 2016-06-16 at 11.46.37 AM.png

Things to look for

The tools are great, but what exactly do they show us? When looking at the timelines and traces, we should keep an eye out for the following:

  • CPU and GPU parallelism, indicated by how far apart the operations are from one another. Basically, try to minimize the white spaces in timelines. A white space may indicate that either the CPU is waiting for the GPU or vice versa. This was the very first problem I tried to solve for Le Voyage.
  • Pattern breakers. Each frame should look pretty much the same as the previous one. Therefore, any timing spike or new operation should be analyzed and refactored if needed.
  • Surfaces should not be displayed for more than one vsync interval. If they are, it means a frame is taking longer to process than expected, which could end up hurting our target FPS. For example, if a surface stays on screen for two vsync intervals, we’re running at 30fps instead of 60fps.
  • Avoid shader compilation at run-time. Shaders should be pre-compiled if possible and almost no activity should be visible in the Shader Compilation stack. In Le Voyage, all shaders are pre-compiled.
  • Aim to profile early and often.

This list is by no means complete, but it’s enough to avoid the most common performance problems with a Metal application.
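The vsync numbers in the list above come from simple arithmetic, sketched here with two hypothetical helpers (`frameBudgetMs` and `effectiveFps` are names made up for this example). The sketch assumes a fixed display refresh rate and that every frame stays on screen for the same number of vsync intervals.

```cpp
#include <cassert>

// Time available to produce one frame at a given refresh rate, in milliseconds.
inline double frameBudgetMs(double refreshHz)
{
    assert(refreshHz > 0.0);
    return 1000.0 / refreshHz;
}

// Frame rate actually seen on screen when every frame is displayed for
// `vsyncIntervalsPerFrame` consecutive vsync intervals.
inline double effectiveFps(double refreshHz, int vsyncIntervalsPerFrame)
{
    assert(refreshHz > 0.0 && vsyncIntervalsPerFrame >= 1);
    return refreshHz / vsyncIntervalsPerFrame;
}
```

On a 60Hz display the budget is roughly 16.7ms per frame, and a frame that misses it and stays up for two intervals halves the rate to 30fps.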

Best Practices and Optimizations

OK, there are a lot of best practices to follow. For the sake of brevity, I’m going to focus only on those that made the biggest impact while optimizing Le Voyage.

Expensive object creation upfront

Remember this graphic from the first post?

opengl vs metal

Well, we need to take it to heart. We should create the most expensive objects (buffers, pipelines, shader libraries, etc) as early in our application as possible and reuse them as much as we can. Avoid creating these objects in between frames, since they are the source of most performance bottlenecks.

Dynamic Shared Resources

Of course, there will be objects that simply cannot be created upfront. Think about uniform buffers, shader constants and dynamic textures, just to name a few. They may depend on which geometries are on the screen, which in turn are created dynamically too.

In these cases, the best approach is to use a pool of resources (e.g. buffers of a given size) and reuse them whenever required. The number of preallocated resources could vary depending on the requirements of our app, but can be easily adjusted on the fly. Keep in mind that you may need some sort of synchronization mechanism (such as semaphores) in order to ensure that this approach works on parallel systems.
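As a rough illustration of the pooling idea, here is a minimal sketch of a thread-safe pool. It is not how Crimild implements it: `BufferPool` is a hypothetical class, plain byte vectors stand in for GPU buffers, and a mutex plus condition variable play the role of the semaphore.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical fixed-size pool of CPU-side byte buffers (stand-ins for GPU buffers).
class BufferPool {
public:
    using Buffer = std::vector<unsigned char>;

    BufferPool(std::size_t count, std::size_t bytesPerBuffer)
    {
        assert(count > 0);
        // Pay the allocation cost once, upfront.
        for (std::size_t i = 0; i < count; ++i) {
            _free.emplace_back(bytesPerBuffer);
        }
    }

    // Take a buffer out of the pool, blocking while none is free.
    Buffer acquire()
    {
        std::unique_lock<std::mutex> lock(_mutex);
        _available.wait(lock, [this] { return !_free.empty(); });
        Buffer buffer = std::move(_free.back());
        _free.pop_back();
        return buffer;
    }

    // Hand a buffer back and wake up one waiting thread, if any.
    void release(Buffer buffer)
    {
        {
            std::lock_guard<std::mutex> lock(_mutex);
            _free.push_back(std::move(buffer));
        }
        _available.notify_one();
    }

    // Number of buffers currently available.
    std::size_t freeCount() const
    {
        std::lock_guard<std::mutex> lock(_mutex);
        return _free.size();
    }

private:
    mutable std::mutex _mutex;
    std::condition_variable _available;
    std::vector<Buffer> _free;
};
```

A renderer would call `acquire()` when it needs a buffer for a frame and `release()` once the GPU is done with it, so no allocation happens in between frames.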

Now, go back to the very first image in this post. Notice all those warning marks? Well, that’s a good indication that we’re creating too many objects during a frame and probably most of them can be replaced by object pools (spoiler alert: they can and they will).

CPU-GPU Parallelism

As I mentioned before, switching to Metal provided a great performance boost in Le Voyage from the very beginning. Even so, after executing the first trace I noticed that there was pretty much no parallelism at all between the CPU and the GPU, meaning they were waiting on each other most of the time. Look at this timeline:

Screen Shot 2016-06-16 at 12.38.19 PM.png

As we can see, the CPU works on a frame and then waits for that frame to be displayed in order to continue on the next one. Look at all that white space. This is clearly inefficient.

It turned out that there was a very simple optimization to be done here. The image above shows the app working with only one command buffer active at any given time, and therefore there was no way to achieve parallelism. All it took to improve this was to change the number of in-flight buffers from 1 to 3, and that led to a much better result:

Screen Shot 2016-06-16 at 12.47.15 PM.png

Now, as one frame is being displayed, the CPU can start processing the next ones almost immediately, ensuring parallelism in our render loop.
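A very simplified model helps to see why this works. The sketch below is an assumption-laden estimate, not a measurement from Le Voyage: `steadyStateFrameMs` is a hypothetical helper that ignores vsync, drawable availability and driver overhead.

```cpp
#include <algorithm>
#include <cassert>

// Simplified steady-state time per frame, in milliseconds.
// With a single in-flight buffer the CPU and GPU take turns, so their
// costs add up. With two or more, they overlap and the slower of the
// two sets the pace.
inline double steadyStateFrameMs(double cpuMs, double gpuMs, int inFlightBuffers)
{
    assert(inFlightBuffers >= 1);
    return inFlightBuffers == 1 ? cpuMs + gpuMs : std::max(cpuMs, gpuMs);
}
```

With made-up costs of 6ms on the CPU and 8ms on the GPU, the model gives 14ms per frame for one in-flight buffer and 8ms for three.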

Acquire drawables at the latest opportunity

So far we’ve been talking about doing most things upfront. Well, not everything should be done in this way.

As defined before, a drawable is the visible output for our app (usually, that would be the screen). A Metal layer has a limited number of drawable objects for us to use, which are returned at display intervals. That means that we need to wait for a drawable to be ready in order to start drawing into it.

Remember this line from a previous post?

_drawable = [getLayer() nextDrawable];

That’s a blocking operation. If a drawable is not ready, the app will wait for one. At 60fps that could be as long as 16ms. Not good.

In practice there’s no need to wait for drawables. After all, we first need to render the scene to an offscreen buffer in order to apply the post-processing effect. Only then do we actually need a drawable to render the resulting frame on the screen. So, the Metal Renderer will process the frame in the offscreen FBOs first, and it’s only going to request a drawable when everything is ready to be displayed. This strategy hides long latencies if no drawables are available.

Multi-threaded render passes

This is something that I have on my wish list. Although Metal allows us to dispatch command buffers on multiple threads, Crimild still implements render passes using a single-threaded approach, a fact that comes from years of working with OpenGL.

The idea is to move to a fully parallel render pass at some point in the not so distant future (maybe when I start working with Vulkan), which will bring even more benefits for Metal. But for the moment, we’re stuck with a single-threaded approach. Sorry.

Closing Comments

Phew! This was a long post with too much information in it. When I started this series, the topic of profiling and optimizations was the one that I was most excited to write about. It truly shows the power of Metal and Xcode when working with graphical applications. Too bad the OpenGL tools are not at the same level.

Don’t miss the final post sometime next week, when I’m going to do a proper post-mortem for the whole adventure as well as give my thoughts about some future upgrades in Crimild. See you soon.

To be continued…