Praise the Metal – Part 7: Profiling and Optimizing our App

As we approach the end of our journey, one question pops up in our heads: was it worth it? Why did we face all the challenges of this new paradigm introduced by Metal? What exactly did we gain from all this? What is life?

Fortunately for us, Xcode comes with several tools that will let us answer these questions (maybe not the last one). So, without further ado, let’s take a look at them…

42?

The following image shows a capture of a single frame in Le Voyage:

Screen Shot 2016-06-16 at 10.52.50 AM.png

On the left, we have the Frame Navigator, displaying all states and draw calls present in the frame, grouped by type (render command encoders, command buffers, data buffers, etc.). There’s an option to switch between viewing the frame by draw calls or by performance. I find the latter much more useful when looking for optimization points, thanks to the timings displayed for each operation:

Screen Shot 2016-06-16 at 11.16.33 AM.png

In the center of the screen we have the Attachment Viewer. Since Crimild does not support multiple attachments for Metal at the moment, only the first color attachment is shown above.

It’s important to note that, as we move through the frame navigator, the attachment view can display partial results. For example, it’s possible to display the scene with and without the post-processing filter by selecting the corresponding render command encoder on the left.

Screen Shot 2016-06-16 at 11.00.18 AM.png

The Resource Inspector shows all resources currently in use for the given frame:

Screen Shot 2016-06-16 at 11.06.19 AM.png

Notice both the source and accumulation framebuffer objects in the right-hand panel, as well as textures and data buffers. For textures, we can see not only the source image, but also mipmap levels, cube maps, etc. As for buffers, we can check their contents as well.

Finally, the State Inspector is shown at the bottom of the screen, allowing us to inspect properties for all available Metal objects.

Moving on, there’s the GPU report, providing measurements in frames per second and timings for both CPU and GPU processes. The most expensive render operations are shown as well, indicating which shaders and draw calls we should focus on when optimizing.

Screen Shot 2016-06-16 at 11.12.18 AM.png

But maybe the most interesting tool of all is the Shader Profiler and editor.

Screen Shot 2016-06-16 at 11.19.24 AM.png

Not only is it possible to edit and recompile shaders on the fly, but have you noticed the grey bars with the numbers in the image above? Those values indicate which operations are the most expensive (relative to that shader) and therefore which ones deserve our attention. Yes, the profiler will show which lines are the slowest ones!

Also, notice the warning marks? Good guy Xcode will tell us when we’re doing things wrong with clear messages:

Screen Shot 2016-06-16 at 1.02.32 PM.png

Did I mention that all this works on iOS? Amazing!

What about Instruments?

But wait, there’s more. All the tools in Xcode are incredibly useful, yet it’s the Metal System Trace in Instruments that really shines, allowing us to profile both the CPU and the GPU down to the microsecond level.

Screen Shot 2016-06-16 at 11.28.03 AM.png

The image above shows an in-depth look at our application’s graphics workload over time, across all layers of the graphics stack. Starting with the CPU at the top, the new trace tool lets us inspect shader compilations, driver activity, each of the individual command stages and, finally, the drawables themselves.

It’s worth mentioning that this new tracer works a bit differently than the one for OpenGL, as we won’t get the profiler analysis in real time. Once started, the tracing tool records the app until we stop it, and only then can we see the results. Instruments calls this Windowed Mode.

In the timeline, colors are used to identify each frame, so we can easily track their start and end times and how long it took until they were displayed. The color that probably deserves most of our attention is white, since it basically means wasted time. I’ll explain this later in this post.

The Details Panel at the bottom of the screen is also very useful. For example, the image below shows timings for each of the encoders in a very clear way:

Screen Shot 2016-06-16 at 11.46.37 AM.png

Things to look for

The tools are great, but what exactly do they show us? When looking at the timelines and traces, we should keep an eye out for the following:

  • CPU and GPU parallelism, indicated by how sparse the operations are from one another. Basically, try to minimize the white spaces in timelines. A white space may indicate that either the CPU is waiting for the GPU or vice versa. This was the very first problem I tried to solve for Le Voyage.
  • Pattern breakers. Each frame should look pretty much the same as the previous one. Therefore, any timing spike or new operation should be analyzed and refactored if needed.
  • Surfaces should not be displayed for more than one vsync interval. If they are, a frame is taking more time to process than expected, which could end up hurting our target FPS. For example, if a surface stays on screen for two vsync intervals, we’re running at 30fps instead of 60fps.
  • Avoid shader compilation at run-time. Shaders should be pre-compiled if possible and almost no activity should be visible in the Shader Compilation track. In Le Voyage, all shaders are pre-compiled.
  • Aim to profile early and often.

This list is by no means complete, but it’s enough to avoid the most common performance problems with a Metal application.

Best Practices and Optimizations

OK, there are a lot of best practices to follow. For the sake of brevity, I’m going to focus only on those that made the biggest impact while optimizing Le Voyage.

Create expensive objects upfront

Remember this graphic from the first post?

opengl vs metal

Well, we need to follow it by heart. We should create the most expensive objects (buffers, pipelines, shader libraries, etc.) as early in our application as possible and reuse them as much as we can. Avoid creating these objects in between frames, since that is the source of most performance bottlenecks.

Dynamic Shared Resources

Of course, there will be objects that simply cannot be created upfront. Think of uniform buffers, shader constants and dynamic textures, just to name a few. They may depend on which geometries are on the screen, which in turn are created dynamically too.

In these cases, the best approach is to use a pool of resources (i.e. buffers of a given size) and reuse them whenever required. The number of preallocated resources may vary depending on the requirements of our app, but it can easily be adjusted on the fly. Keep in mind that you may need some sort of synchronization mechanism (such as semaphores) in order to ensure that this approach works on parallel systems.

Now, go back to the very first image in this post. Notice all those warning marks? Well, that’s a good indication that we’re creating too many objects during a frame and probably most of them can be replaced by object pools (spoiler alert: they can and they will).

CPU-GPU Parallelism

As I mentioned before, switching to Metal provided a great performance boost in Le Voyage from the very beginning. Even so, after executing the first trace I noticed that there was pretty much no parallelism at all between the CPU and the GPU, meaning they were waiting on each other most of the time. Look at this timeline:

Screen Shot 2016-06-16 at 12.38.19 PM.png

As we can see, the CPU works on a frame and then waits for that frame to be displayed in order to continue on the next one. Look at all that white space. This is clearly inefficient.

It turned out that there was a very simple optimization to be made here. The image above shows the app working with only one command buffer active at any given time, and therefore there was no way to achieve parallelism. All it took to improve this was to change the number of in-flight buffers from 1 to 3, which led to a much better result:

Screen Shot 2016-06-16 at 12.47.15 PM.png

Now, as one frame is being displayed, the CPU can start processing the next ones almost immediately, ensuring parallelism in our render loop.

Acquire drawables at the latest opportunity

So far we’ve been talking about doing most things upfront. Well, not everything should be done in this way.

As defined before, a drawable is the visible output of our app (usually, that would be the screen). A Metal layer has a limited number of drawable objects for us to use, which are returned at display intervals. That means we need to wait for a drawable to be ready before we can start drawing into it.

Remember this line from a previous post?

_drawable = [getLayer() nextDrawable];

That’s a blocking operation. If no drawable is ready, the app will wait for one. At 60fps that could be as long as 16ms. Not good.

In practice there’s no need to wait for drawables up front. After all, we first need to render the scene to an offscreen buffer in order to apply the post-processing effect. Only then do we actually need a drawable to render the resulting frame on the screen. So, the Metal Renderer processes the frame in the offscreen FBOs first, and only requests a drawable when everything is ready to be displayed. This strategy hides long latencies when no drawables are available.

Multi-threaded render passes

This is something I have on my wish list. Although Metal allows us to dispatch command buffers on multiple threads, Crimild still implements render passes using a single-threaded approach, a fact that comes from years of working with OpenGL.

The idea is to move to a fully parallel render pass at some point in the not-so-distant future (maybe when I start working with Vulkan), which will bring even more benefits for Metal. But for the moment, we’re stuck with a single-threaded approach. Sorry.

Closing Comments

Phew! This was a long post with a lot of information in it. When I started this series, the topic of profiling and optimization was the one I was most excited to write about. It truly shows the power of Metal and Xcode when working with graphical applications. Too bad the OpenGL tools are not at the same level.

Don’t miss the final post sometime next week, when I’m going to do a proper post-mortem for the whole adventure as well as give my thoughts about some future upgrades in Crimild. See you soon.

To be continued…


Praise the Metal – Part 1: Rendering a single frame

In my previous post I gave an overview of the key concepts of Metal and how they differ from OpenGL. Now I’m going to describe the rendering process for a single frame in Le Voyage, as well as introduce the high-level architecture of the Metal Renderer implementation.

Screen Shot 2016-05-15 at 3.03.19 PM

As mentioned before, Metal lets us do the most expensive things less often, and that’s why this post is split into two big sections: Initialization and Rendering.

Initialization

Before going deeper into the rendering process, I’m going to mention a couple of things that need to be properly set up from the start in order for us to render anything on the screen. These steps should be performed as few times as possible, usually at the beginning of the program or, for example, when the encapsulating View Controller is created/loaded.

Creating a Device

In Metal, the MTLDevice protocol is an abstraction of the GPU and can be considered the root of the Metal API.

_device = MTLCreateSystemDefaultDevice();
if ( _device == nullptr ) {
   // fallback to OpenGL renderer
}

From now on, every object that we need to create (like command buffers, data buffers, textures, etc) will be done so by using this MTLDevice instance.

As a side note, if you want to know whether your system (like your iPhone) supports Metal, you only need to check if a device can be created. If not, you can assume Metal is not supported and you should fall back to, for example, OpenGL for rendering. Or, you know, exit(1)…

In Crimild, a device is created when initializing a CrimildMetalView and, if successful, it continues the configuration process by instantiating the MetalRenderer. Which brings me to the next topic.

Keep in mind that if you’re running on OS X, you may end up with more than one device depending on your system’s capabilities.

Create a Metal-based View

Much like when working with OpenGL, in order for Metal to work you need a UIView with the right layer class. iOS developers know that if you want anything drawn on the screen, it needs to be part of the Core Animation layer tree. Metal provides a new layer type for this: CAMetalLayer.

@implementation CrimildMetalView

+ (Class) layerClass
{
    return [CAMetalLayer class];
}

@end

In Crimild, the Metal view implementation is pretty straightforward. It’s the job of the encapsulating View Controller to add it to the view hierarchy and update it using the CADisplayLink class. That allows the same view controller to easily work with both Metal and OpenGL.

Renderer Configuration

The CrimildMetalView implementation is also responsible for instantiating the MetalRenderer class, provided Metal is supported, of course.

As in any other Renderer implementation, the MetalRenderer class provides a configuration mechanism, which in this case will set up both the layer and the command queue as follows:

void MetalRenderer::configure( void )
{
   Renderer::configure();
   _layer = (CAMetalLayer *) getView().layer;
   _layer.contentsScale = [[UIScreen mainScreen] scale]; // match the native screen scale (i.e. Retina)
   _layer.pixelFormat = MTLPixelFormatBGRA8Unorm;
   _layer.framebufferOnly = true;
   _layer.presentsWithTransaction = false;
   _layer.drawsAsynchronously = true;
   _layer.device = _device;

   _commandQueue = [getDevice() newCommandQueue]; // the queue used to submit command buffers
}

The MTLCommandQueue protocol defines a queue of command buffers. Each MTLCommandBuffer implementation will contain the translated GPU commands based on what we’re trying to do. Usually, we use command buffers to draw, but Metal supports other types of work too, like compute or blit operations. Additionally, implementations of the MTLCommandEncoder protocol are the ones that perform the translation of our pipelines into GPU commands. Am I going too fast? Don’t worry, I’ll talk about these guys a lot in later posts.

Ok, everything’s set up. Let’s move on to the drawing phase.

Rendering a Frame

In Le Voyage, a single frame may contain several different types of entities, such as opaque and translucent game objects, lights, textures, post-processing effects, and so on. Each of those objects needs to be processed in order, which can be summarized as follows:

  1. Preparing the frame and cleaning the screen

    Screen Shot 2016-05-15 at 2.59.59 PM

  2. Render opaque objects first

    Screen Shot 2016-05-15 at 2.58.59 PM

  3. Render transparent and overlay objects

    Screen Shot 2016-05-15 at 3.01.16 PM

  4. Composite the scene with all objects

    Screen Shot 2016-05-15 at 2.57.49 PM

  5. Apply post-processing effects and display

    Screen Shot 2016-05-15 at 3.03.19 PM

For the sake of brevity, in the rest of this post I’m going to explain only the first and last steps, leaving the details of how to render individual objects and apply image effects for later entries in this series.

Begin Frame

void MetalRenderer::beginRender( void )
{
   Renderer::beginRender();
   _commandBuffer = [_commandQueue commandBuffer]; // grab the next available command buffer from the queue
}

The very first step in our render loop is to obtain the next available command buffer from the command queue, so we can start pushing commands into it.

Rendering individual objects

Each object that is going to be rendered on screen must do so by specifying a MTLRenderPipelineState object, containing compiled shaders, vertex attributes and uniforms, among other things. Then, rendering different objects only requires switching to the corresponding pipeline, avoiding the extra overhead of changing individual states.

I’ll be talking about pipeline state objects in more detail in the next post.

Presenting the Frame

Once our objects have been drawn and all effects applied, the frame is ready to be displayed:

void MetalRenderer::presentFrame( void )
{
   _drawable = [getLayer() nextDrawable]; // acquire a drawable (may block)

   Renderer::presentFrame(); // render the frame into the drawable

   [_commandBuffer presentDrawable: _drawable]; // schedule presentation
   [_commandBuffer commit]; // submit the command buffer to the queue
}

We do so by requesting a drawable (basically, a texture) from the layer. I’ll explain why we do this at this step in the final posts, where I’m going to talk about optimizations, but for now just keep in mind that this is a blocking operation and it should be done only when you’re actually going to draw something.

Once a drawable has been acquired, we render the frame into it, then we tell Core Animation to present the drawable and finally we commit the command buffer to the command queue, to be processed as soon as possible.

And that’s it. Our frame is displayed and we’re ready to work on the next one.

Buffer Synchronization

There are several things missing from the process described above, but I want to focus on a special one right away: buffer synchronization. In a perfect world, we could create as many buffers as we want without having to worry about the actual resources. In practice, of course, we should create only a finite number of buffers and try to reuse them in order to avoid wasting memory.

Yet, a problem arises when trying to reuse buffers (or any kind of resource in a multi-threaded environment). Remember that I said Metal was multi-threaded by design? Well, that comes with a price. Check the following diagram:

metal_buffer_sync_1

The diagram above shows two buffers being reused. Can you spot the problem? (hint: it’s highlighted with a friendly red box). First, we encode buffer #0 and dispatch it to the GPU. Then, we encode and dispatch buffer #1. After that, we cannot reuse buffer #0 right away because it’s still being used by the GPU. The solution?

metal_buffer_sync_2

Simple, right? We just need to wait for at least one buffer to be released by the GPU in order to continue encoding new ones.

As suggested by the Metal documentation, I’m using a dispatch_semaphore_t instance to implement the waiting step. In the setup process, the semaphore is created with the maximum number of buffers we need:

void MetalRenderer::configure( void )
{
   /* ... */
   _inflightSemaphore = dispatch_semaphore_create( CRIMILD_METAL_IN_FLIGHT_COMMAND_BUFFERS );
}

Then, before starting a new frame, we wait for at least one buffer to be available.

void MetalRenderer::beginRender( void )
{
   /* ... */
   dispatch_semaphore_wait( _inflightSemaphore, DISPATCH_TIME_FOREVER );
}

Finally, when committing a command buffer, we can use a handler to signal the semaphore in order to indicate that the resource is no longer used:

void MetalRenderer::presentFrame( void )
{
   /* ... */
   __block dispatch_semaphore_t dispatchSemaphore = _inflightSemaphore;
   [_commandBuffer addCompletedHandler:^( id<MTLCommandBuffer> buffer ) {
       dispatch_semaphore_signal( dispatchSemaphore );
   }];

    [_commandBuffer presentDrawable: _drawable];
    [_commandBuffer commit];
}

The MetalRenderer class in Crimild uses a triple buffering technique. That is to say, at any point in time there are at most three buffers being used for rendering.

Time for a break

The above description is quite simplistic, but it highlights the most important aspects of rendering a single frame. If you have the feeling that Metal is a little more cumbersome to set up than OpenGL, that’s because it actually is. We never had to worry about devices or synchronization when using OpenGL, right? Metal provides blazing-fast context switching at draw time at the expense of a more verbose configuration and a more careful execution.

Like I said before, we need to start thinking about things differently. When shaders became mainstream (around 2004, maybe?), we needed to do all the transform and lighting calculations ourselves in order to truly take advantage of them, which added more complexity to our programs. Well, I think the same can be said for Metal (and the newest APIs, like Vulkan, too). No pain, no gain…

In the next post we’re going to start filling in the blanks in the above algorithm and present the mechanisms to actually draw some objects on the screen.

To be continued…

Crimild v4.1.0

A new year. A new version.

In the past few months I’ve been working hard on a small iOS game called “Le Voyage”. If you haven’t heard about it, you can check it out on its official website and download it for free. No ads, guaranteed.

“Le Voyage” has been a great opportunity to improve Crimild’s iOS support, particularly regarding rendering and simulation. Enhanced image effects, performance tweaks, a more robust scene builder, more tools for debugging and handling platform specifics, and a lot of bug fixing. Go to GitHub to get the full release notes.

Here’s the trailer for “Le Voyage”, where you can see the latest version in action. Enjoy!