Deferred Shading and packed Uniform Buffer Objects
This past week I did some more work on the rendering engine for Rival Fortress by adding a deferred shading path.
Deferred rendering is particularly useful when working with many dynamic lights, and since I really want to have light/dark mechanics in the game, a combination of forward and deferred shading is the right fit for the engine.
I also started laying the groundwork for the post-process chain, as well as conditional shader compilation based on user settings and hardware capabilities.
Automatic OpenGL packed uniform layouts
While working on the renderer I also extended the reflection preprocessor by adding the ability to generate OpenGL uniform buffer object metadata for both shared and packed memory layouts.
The advantage of using the shared or packed layout is better memory utilization, and potentially better performance (due to better memory alignment), compared to the std140 memory layout.
The downside is that you have to query uniform memory locations and strides manually for each uniform block, as there are no constraints on how the driver will rearrange or pad your uniforms. This means you can’t simply memcpy your struct into the uniform buffer when you want to update it.
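To make the difference from a plain memcpy concrete, here is a minimal sketch of a per-member copy. It assumes the destination offsets have already been queried from the driver (for example with glGetActiveUniformsiv and GL_UNIFORM_OFFSET); the CameraData struct, the MemberLayout table, and write_block are hypothetical names for illustration, not the engine's actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical CPU-side struct; the names are illustrative only. */
typedef struct {
    float position[4];
    float direction[4];
    float fov;
} CameraData;

/* One entry per uniform in the block. In a real renderer dst_offset would
   be filled from glGetActiveUniformsiv(..., GL_UNIFORM_OFFSET, ...). */
typedef struct {
    size_t src_offset; /* offset of the member inside the CPU struct */
    size_t dst_offset; /* offset the driver reported for the uniform */
    size_t size;       /* number of bytes to copy */
} MemberLayout;

/* Copies each member to its driver-reported offset inside dst. */
static void write_block(unsigned char *dst, size_t dst_size,
                        const void *src,
                        const MemberLayout *members, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        /* Refuse to write past the end of the uniform block. */
        assert(members[i].dst_offset + members[i].size <= dst_size);
        memcpy(dst + members[i].dst_offset,
               (const unsigned char *)src + members[i].src_offset,
               members[i].size);
    }
}
```

The point of the sketch is that the order and spacing of members in the buffer come from the table, not from the C struct, so the driver is free to reorder or pad the block.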
Preprocessing struct metadata
To avoid the error-prone process of manually fussing with the alignment code whenever I add or modify a uniform, I extended the engine preprocessor to look for structs annotated like the following:
The MREFLECT annotation is a no-op macro that tells the preprocessor that it should generate code for a uniform block named MPECamera.
The preprocessor then goes ahead and generates all the boilerplate code required to query the graphics card for uniform alignment, padding, and stride of each of the members of the struct, as well as the code needed to update the uniform.
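As a rough sketch of how such an annotation can work in C: the macro expands to nothing, so the compiler never sees it, while an external tool can scan the source text for it. The macro argument, the MPEVec4 definition, and the exact struct members below are assumptions for illustration; only the names MREFLECT, MPECamera, and MPEVec4 come from the post.

```c
#include <assert.h>
#include <stddef.h>

/* Expands to nothing: only the external preprocessor tool cares about it.
   The argument syntax is an assumption made for this sketch. */
#define MREFLECT(...)

/* Assumed definition of the engine's four-component vector. */
typedef struct { float x, y, z, w; } MPEVec4;

/* The annotation marks the struct as the source of a uniform block. */
MREFLECT(uniform_block)
typedef struct MPECamera {
    MPEVec4 position;
    MPEVec4 direction;
} MPECamera;

/* Compile-time check (C11) that the two vec4 members sit back to back. */
static_assert(sizeof(MPECamera) == 2 * sizeof(MPEVec4),
              "unexpected padding in MPECamera");
```

Because the macro is a no-op, the annotated header still compiles unchanged when the preprocessor tool is not run.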
The void* points to one of many types of packing information that is determined at runtime, based on how the driver decides to lay out the data, and is used whenever the uniform needs to be updated.
Since all the code that does the dirty work is automatically generated by the preprocessor, using a void* has more flexibility and less baggage than using templates or inheritance, as type safety is enforced at compile time.
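One way to picture this arrangement is a generic function pointer that takes the packing information as a void*, plus a generated writer that is the only place the pointer is cast back to its real type. This is a sketch under assumptions: MPECameraPacking, UniformWriteFn, and MPECamera_write are hypothetical names, and the real generated code may look quite different.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct { float x, y, z, w; } MPEVec4;            /* assumed */
typedef struct { MPEVec4 position; MPEVec4 direction; } MPECamera;

/* Hypothetical packing record, filled at startup from the offsets the
   driver reports for this uniform block. */
typedef struct {
    size_t position_offset;
    size_t direction_offset;
    size_t block_size;
} MPECameraPacking;

/* Generic signature the engine could store for every uniform block. */
typedef void (*UniformWriteFn)(void *dst, const void *src,
                               const void *packing);

/* Stand-in for a generated writer: it alone knows the concrete types. */
static void MPECamera_write(void *dst, const void *src, const void *packing)
{
    const MPECamera *cam = (const MPECamera *)src;
    const MPECameraPacking *p = (const MPECameraPacking *)packing;
    unsigned char *out = (unsigned char *)dst;

    assert(p->position_offset + sizeof cam->position <= p->block_size);
    assert(p->direction_offset + sizeof cam->direction <= p->block_size);
    memcpy(out + p->position_offset, &cam->position, sizeof cam->position);
    memcpy(out + p->direction_offset, &cam->direction, sizeof cam->direction);
}
```

Hand-written engine code only ever passes the opaque pointer along, so the casts stay confined to generated functions that always agree with the struct definition.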
Aligning for better packing
Most modern cards will pad uniform entries so that they are aligned to 16-byte boundaries when using the shared or packed memory layout.
For this reason it is often better to aggregate variables in four-component vectors like vec4 or ivec4 and use swizzling to access the individual variables in the shaders. That is also the reason why I chose an MPEVec4 (a four-component vector) to represent camera position and direction in the previous code snippet.
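A small calculation shows why aggregation helps. The sketch below assumes the behaviour described above, namely that each uniform entry is padded up to a 16-byte boundary; the actual padding is up to the driver, and align_up and padded_total are illustrative helpers.

```c
#include <assert.h>
#include <stddef.h>

/* Rounds size up to the next multiple of alignment (a power of two). */
static size_t align_up(size_t size, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (size + alignment - 1) & ~(alignment - 1);
}

/* Total bytes used when every entry is padded to the given alignment. */
static size_t padded_total(const size_t *entry_sizes, size_t count,
                           size_t alignment)
{
    size_t total = 0;
    for (size_t i = 0; i < count; ++i)
        total += align_up(entry_sizes[i], alignment);
    return total;
}
```

Under that assumption, four separate 4-byte floats occupy 4 × 16 = 64 bytes, while the same four values stored in one 16-byte vec4 occupy 16 bytes and are read in the shader with swizzles such as .x or .y.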
Pooling uniform blocks in large buffers
Another useful optimization when dealing with uniform buffer objects is pooling multiple blocks into large buffer objects, similarly to how you would pool multiple meshes in a VBO.
For example, in Rival Fortress the lighting, camera, and world data uniform blocks are all stored in a single buffer object, thus reducing buffer switches.
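When several blocks share one buffer, each block's start offset has to respect the implementation's GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT before it can be bound with glBindBufferRange. The following sketch only computes those offsets; the BlockRange type, the layout_pool helper, and the block sizes in the usage note are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t offset; /* where the block starts inside the shared buffer */
    size_t size;   /* size of the block in bytes */
} BlockRange;

/* Gives each block a start offset that is a multiple of alignment and
   returns the total number of bytes the shared buffer needs. In a real
   renderer alignment would come from
   glGetIntegerv(GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, ...). */
static size_t layout_pool(BlockRange *blocks, size_t count, size_t alignment)
{
    assert(alignment != 0);
    size_t cursor = 0;
    for (size_t i = 0; i < count; ++i) {
        size_t rem = cursor % alignment;
        if (rem != 0)
            cursor += alignment - rem; /* skip to the next aligned offset */
        blocks[i].offset = cursor;
        cursor += blocks[i].size;
    }
    return cursor;
}
```

For instance, with an assumed alignment of 256 bytes, blocks of 96, 32, and 200 bytes end up at offsets 0, 256, and 512, for a total buffer size of 712 bytes.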
Adding a path for when the ARB_buffer_storage extension is supported allows you to get a persistently mapped pointer that you can write into whenever you need to update your uniforms, without having to make any gl calls. For more information on this and other low-overhead OpenGL techniques, I suggest you take a look at the Steam Dev Days talk Beyond Porting: How Modern OpenGL Can Radically Reduce Driver Overhead (the slides are also available).