As we approach the end of our journey, one question pops up in our head: was it worthy? Why did we face all the challenges of this new paradigm introduced by Metal? What exactly did we gain from all this? What is life?
Fortunately for us, Xcode comes with several tools that will let us answer these questions (maybe not the last one). So, without further ado, let’s take a look at them…
The following image shows a capture of a single frame in Le Voyage:
On the left, we have the Frame Navigator displaying all states and draw calls present on the frame, grouped by type (i.e. render encoder, command buffers, data buffers, etc). There’s an option to switch between viewing the frame by draw calls or by performance. I find the latter one much more useful when looking for optimization points, due to the timings displayed for each operation:
In the center of the screen we have the Attachment Viewer. Since Crimild does not support multiple attachments for Metal at the moment, only the first color attachment is shown above.
It’s important to note that, as we move in the frame navigator, the attachment color can display partial results. For example, it’s possible to display the scene with and without the post-processing filter by selecting the corresponding render command encoder on the left.
The Resource Inspector will show all resources currently in use for the given frame
Notice both source and accumulation framebuffer objects in the right hand panel, as well as textures and data buffers. For textures, we can see not only the source image, but also mipmap levels, cube maps, etc. Regarding buffers, we can check their contents as well.
Finally, the State Inspector is shown at the bottom of the screen, allowing us to inspect properties for all available Metal objects.
Moving on, there’s the GPU report providing measurements in frames per second and timings for both CPU and GPU processes. The most expensive render operations are shown as well, indicating which shaders and draw calls we should focus when optimizing.
But maybe the most interesting tool of all is the Shader Profiler and editor.
Not only it’s possible to edit and recompile shaders on the fly, but have you notice the grey bars with the numbers in the image above? Those values indicate which operations are the most expensive (relative to that shader) and the ones that should require our attention. Yes, the profiler will show which lines are the slowest ones!!
Also, notice the warning marks? Good guy Xcode will tell us when we’re doing things wrong with clear messages:
Did I mention that all this works on iOS? Amazing!
What about Instruments?
But wait, there’s more. All the tools in Xcode are incredible useful, yet it’s the Metal System Trace in Instruments the one that really shines, allowing us to profile both the CPU and the GPU down to the microsecond level.
The image above shows an in-depth look at our application’s graphic workload over time across all layers of the graphics stack. Starting with the CPU at the top, the new trace tool will let us inspect shader compilations, driver activity, each of the individual command stages and, finally, the drawables themselves.
It’s worth mentioning that this new tracer works a bit different than the one for OpenGL, as we won’t get the profiler analysis in real-time. Once we start the tracing tool, it will start recording the app indefinitely until stopped and only then we will be able to see the results. This is called Windowed Mode by Instruments.
In the timeline, colors are used to identify each frame so we can easily track their start and end times and how long it took until they were displayed. Probably the one that will require most of our attention is the white color, since that basically means wasted time. I’ll explain this later in this post.
The Details Panel at the bottom of the screen is also very useful. For example, the image below show timings for each of the encoders in a very clear way
Things to look for
The tools are great, but what exactly do they show us? When looking at the timelines and traces, we should keep an eye for the following:
- CPU and GPU parallelism, indicated by how sparse the operations are from one another. Basically, try to minimize the white spaces in timelines. A white space may indicate that either the CPU is waiting for the GPU or viceversa. This was the very first problem I try to solve for Le Voyage.
- Pattern breakers. Each frame should look pretty much the same as the previous one. Therefore, any timing spike or new operation should be analyzed and refactored if needed.
- Surfaces should not be displayed for more than one vsync operation. If so, it’s indicating that a frame is taking more time to process than what we’re expecting which could end up hurting our targeted FPS. For example, if a surface is displayed in between two vsync calls, we’re running at 30fps instead of 60fps.
- Avoid shader compilation at run-time. Shaders should be pre-compiled if posible and almost no activity should be visible in the Shader Compilation stack. In Le Voyage, all shaders are pre-compiled.
- Aim to profile early and often.
This list is by no means complete, but it’s enough to avoid the most common performance problems with a Metal application.
Best Practices and Optimizations
OK, there are a lot of best practices to follow. For the sake of brevity, I’m going to focus only on those that made the biggest impact while optimizing Le Voyage.
Expensive object creation upfront
Remember this graphic from the first post?
Well, we need to follow it by hearth. We should create the most expensive objects (buffers, pipelines, shader libraries, etc) as early in our application as possible and reuse them as much as we can. Avoid creating these objects in between frames, since they are the source of most performance bottlenecks.
Dynamic Shared Resources
Of course, there will be objects that simply cannot be create upfront. Think about uniform buffers, shader constants and dynamic textures, just to name a few. They may depend on which geometries are on the screen, which in turn are created dynamically too.
In these cases, the best approach is to use a pool of resources (i.e. buffers of a given size) and reuse them whenever required. The number of preallocated resources could vary depending on the requirements of our app, but can be easily adjusted on the fly. Keep in mind that you may need some sort of synchronization mechanism (as in semaphores) in order to ensure that this approach works on parallel systems.
Now, go back to the very first image in this post. Notice all those warning marks? Well, that’s a good indication that we’re creating too many objects during a frame and probably most of them can be replaced by object pools (spoiler alert: they can and they will).
As I mentioned before, switching to Metal provided a great performance boost in Le Voyage from the very beginning. Even so, after executing the first trace I noticed that there was pretty much no parallelism at all between the CPU and the GPU, meaning they were waiting on each other most of the time. Look at this timeline:
As we can see, the CPU works on a frame and then waits for that frame to be displayed in order to continue on the next one. Look at all that white space. This is clearly inefficient.
It turned out that there was a very simple optimization to be done here. The image above shows the app working with only one command buffer active at any given time and therefore there was no way to achieve parallelism. All it took to improve this was to change the number of in-flight buffers from 1 to 3 and that lead to a much better result:
Now, as one frame is being displayed, the CPU can start processing the next ones almost immediately, ensuring parallelism in our render loop.
Acquire drawables at the latest opportunity
So far we’ve been talking about doing most things upfront. Well, not everything should be done in this way.
As it was defined before, a drawable is the visible output for our app (usually, that would be the screen). A Metal layer has a limited number of drawable objects for us to use, which are returned at display intervals. That means that we need to wait for a drawable to be ready in order to start drawing into it.
Remember this line from a previous post?
_drawable = [getLayer() nextDrawable];
That’s a blocking operation. If a drawable is not ready, the app will wait for one. At 60fps that could be as long as 16ms. Not good.
In practice there’s no need to wait for drawables. After all, we first need to render the scene to an offscreen buffer in order to apply the post-processing effect. Only then we actually need a drawable to render the resulting frame on the screen. So, the Metal Renderer will process the frame in the offscreen FBOs first, and it’s only going to request a drawable when everything is ready to be displayed. This strategy hides long latencies if no drawables are available.
Multi-threaded render passes
This is something that I have in my wish list. Although Metal allow us to dispatch commands buffers on multiple threads, Crimild still implements render passes using a single threaded approach, a fact that comes from years of working with OpenGL.
The idea is to move to a fully parallel render pass at some point in the not so distant future (maybe when I start working with Vulkan), which will bring even more benefits for Metal. But for the moment, we’re stuck with a single-threaded approach. Sorry.
Phew! This was a long post with too much information in it. When I started this series, the topic of profiling and optimizations was the one that I was most excited to write about. It truly shows the power of Metal and Xcode when working with graphical applications. Too bad the OpenGL tools are not a the same level.
Don’t miss the final post sometime next week, when I’m going to do a proper post-mortem for the whole adventure as well as give my thoughts about some future upgrades in Crimild. See you soon.