Deferred Shading and packed Uniform Buffer Objects

This past week I did some more work on the rendering engine for Rival Fortress by adding a deferred shading path.

Deferred rendering is particularly useful when working with many dynamic lights, and since I really want to have light/dark mechanics in the game, a combination of forward and deferred shading is the right fit for the engine.

I also started laying the groundwork for the post-process chain, as well as for conditional shader compilation based on user settings and hardware capabilities.

Automatic OpenGL packed uniform layouts

While working on the renderer I also extended the reflection preprocessor by adding the ability to generate OpenGL uniform buffer object metadata for both shared and packed memory layouts.

Compared to the std140 memory layout, the shared and packed layouts can give better memory utilization and potentially better performance, since the driver is free to choose the alignment that best suits the hardware.

The downside is that you have to query the offset and stride of each uniform manually for every uniform block, as there are no constraints on how the driver will rearrange or pad your uniforms. This means you can’t simply memcpy your struct into the uniform buffer when you want to update it.
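
To make the memcpy problem concrete, here is a minimal sketch, not the engine's actual code. ExampleBlock, ExampleBlockOffsets, LayoutMatchesStruct and WriteAt are hypothetical names. In a real renderer the offsets would come from glGetActiveUniformsiv queried with GL_UNIFORM_OFFSET; here they are plain inputs so the sketch runs without a GL context.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// CPU-side mirror of a small uniform block (hypothetical example).
struct ExampleBlock
{
  float Color[4];   // vec4 in the shader
  float Intensity;  // float in the shader
  float Radius;     // float in the shader
};

// Byte offsets the driver reports for each member of the block.
// Assumption: filled from glGetActiveUniformsiv(..., GL_UNIFORM_OFFSET, ...).
struct ExampleBlockOffsets
{
  std::uint32_t Color;
  std::uint32_t Intensity;
  std::uint32_t Radius;
};

// Returns true only when the driver's layout matches the C++ struct layout,
// which is the only case where one memcpy of the whole struct would be safe.
bool LayoutMatchesStruct(const ExampleBlockOffsets& o)
{
  return o.Color == offsetof(ExampleBlock, Color) &&
         o.Intensity == offsetof(ExampleBlock, Intensity) &&
         o.Radius == offsetof(ExampleBlock, Radius);
}

// Copies one member to its driver-reported offset inside the staging buffer.
void WriteAt(std::uint8_t* dst, std::size_t dstSize, std::uint32_t offset,
             const void* src, std::size_t size)
{
  assert(offset + size <= dstSize);
  std::memcpy(dst + offset, src, size);
}
```

If the driver pads Radius out to offset 32, for example, LayoutMatchesStruct returns false and every member has to be written individually with something like WriteAt.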

Preprocessing struct metadata

To avoid the error-prone process of fussing with the alignment code by hand whenever I add or modify a uniform, I extended the engine preprocessor to look for structs annotated like the following:

MREFLECT(UniformBlock, Name:"MPECamera")
struct MPECameraUniformBlock
{
  MPEMatrix4 ViewProjectionMatrix;
  MPEMatrix4 InverseProjectionMatrix;
  MPEVec4 CameraPosition;
  MPEVec4 CameraDirection;
  void* PackingInfo; // driver-dependent packing information, resolved at runtime
  u32 Offset;
  u32 Size;
};

The MREFLECT annotation is a no-op macro that tells the preprocessor that it should generate code for a uniform block named MPECamera.
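
As an illustration of the no-op idea, a variadic macro that expands to nothing is enough for the compiler to ignore the annotation while an external tool can still find it in the source text. This is only a sketch: the post does not show how MREFLECT is actually defined, and ExampleLightUniformBlock is a hypothetical struct.

```cpp
#include <cstddef>

// Assumed definition: swallow every argument and expand to nothing.
// The compiler never sees the annotation; a source-scanning tool still can.
#define MREFLECT(...)

MREFLECT(UniformBlock, Name:"ExampleLight")
struct ExampleLightUniformBlock
{
  float Position[4];
  float Color[4];
};

// The annotation adds no members and changes nothing about the struct.
static_assert(sizeof(ExampleLightUniformBlock) == 8 * sizeof(float),
              "annotation must not affect the struct layout");
```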

The preprocessor then goes ahead and generates all the boilerplate code required to query the graphics card for uniform alignment, padding, and stride of each of the members of the struct, as well as the code needed to update the uniform.

The void* points to one of many types of packing information, determined at runtime based on how the driver decides to lay out the data, and is used whenever the uniform needs to be updated.

Since all the code that does the dirty work is generated automatically by the preprocessor, using a void* offers more flexibility and less baggage than templates or inheritance, while type safety is still enforced at compile time.
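
The following sketch shows roughly what generated update code built around a void* could look like. Everything here is an assumption made for illustration: the Example* names, the fields of the packing info, and the column-major matrix type are not taken from the engine. The strides would normally come from glGetActiveUniformsiv with GL_UNIFORM_MATRIX_STRIDE.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

typedef std::uint32_t u32;
typedef std::uint8_t u8;

struct ExampleVec4 { float E[4]; };
struct ExampleMatrix4 { float E[4][4]; }; // assumed column-major: E[column][row]

// Hypothetical packing info a generator might emit for one block type.
struct ExampleCameraPackingInfo
{
  u32 ViewProjectionOffset;  // byte offset of the matrix in the buffer
  u32 ViewProjectionStride;  // bytes between consecutive matrix columns
  u32 CameraPositionOffset;  // byte offset of the vec4 in the buffer
};

struct ExampleCameraUniformBlock
{
  ExampleMatrix4 ViewProjectionMatrix;
  ExampleVec4 CameraPosition;
  void* PackingInfo; // points at an ExampleCameraPackingInfo
};

// Hypothetical generated routine: the only place that casts PackingInfo.
void ExampleCameraWrite(const ExampleCameraUniformBlock* block, u8* dst, std::size_t dstSize)
{
  assert(block->PackingInfo != nullptr);
  const ExampleCameraPackingInfo* info =
      static_cast<const ExampleCameraPackingInfo*>(block->PackingInfo);

  // Copy the matrix one column at a time, honoring the driver's stride.
  for (u32 col = 0; col < 4; ++col)
  {
    std::size_t at = info->ViewProjectionOffset + col * info->ViewProjectionStride;
    assert(at + sizeof(float) * 4 <= dstSize);
    std::memcpy(dst + at, block->ViewProjectionMatrix.E[col], sizeof(float) * 4);
  }

  assert(info->CameraPositionOffset + sizeof(ExampleVec4) <= dstSize);
  std::memcpy(dst + info->CameraPositionOffset, &block->CameraPosition, sizeof(ExampleVec4));
}
```

Because hand-written code never touches PackingInfo directly in this arrangement, the cast stays confined to code that the generator knows to be paired with the right block type.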

Aligning for better packing

Most modern cards will pad uniform entries so that they are aligned to 16-byte boundaries when using the shared or packed memory layout.

For this reason it is often better to aggregate variables into four-component vectors like vec4 or ivec4 and use swizzling to access the individual variables in the shaders. That is also the reason why I chose an MPEVec4 (a four-component vector) to represent camera position and direction in the previous code snippet.
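
A quick back-of-the-envelope calculation shows the saving. The sketch below assumes, as described above, that the driver starts every uniform entry on a 16-byte boundary. PaddedSize is a hypothetical helper, and actual padding is driver-dependent.

```cpp
#include <cstddef>

// Bytes used by `count` entries of `entrySize` bytes when every entry is
// rounded up to a multiple of `alignment` bytes.
constexpr std::size_t PaddedSize(std::size_t count, std::size_t entrySize,
                                 std::size_t alignment)
{
  return count * ((entrySize + alignment - 1) / alignment * alignment);
}

// Four separate floats, each padded out to its own 16-byte slot.
constexpr std::size_t FourFloats = PaddedSize(4, sizeof(float), 16);

// The same four values aggregated into a single vec4 entry.
constexpr std::size_t OneVec4 = PaddedSize(1, 4 * sizeof(float), 16);

// GLSL side, for reference (hypothetical names):
//   uniform Params { vec4 FogParams; };
//   float density = FogParams.x;  // swizzle out one component
```

Under that assumption, four separate floats occupy 64 bytes, while one vec4 holding the same values occupies 16.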

Pooling uniform blocks in large buffers

Another useful optimization when dealing with uniform buffer objects is pooling multiple blocks into large buffer objects, similar to how you would pool multiple meshes in a VBO.

For example, in Rival Fortress the lighting, camera, and world data uniform blocks are all stored in a single buffer object, thus reducing buffer switches.
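
One detail to watch when pooling is that the offset passed to glBindBufferRange for a uniform buffer must be a multiple of GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT. A simple bump allocator can hand out suitably aligned sub-ranges. This is a minimal sketch under that assumption; UniformPool and PoolReserve are hypothetical names and not the engine's allocator.

```cpp
#include <cassert>
#include <cstddef>

// Rounds offset up to the next multiple of alignment.
std::size_t AlignUp(std::size_t offset, std::size_t alignment)
{
  assert(alignment > 0);
  return (offset + alignment - 1) / alignment * alignment;
}

// Bump allocator over one large buffer object (hypothetical helper).
struct UniformPool
{
  std::size_t Capacity;   // total size of the buffer object in bytes
  std::size_t Alignment;  // value queried for GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT
  std::size_t Used;       // bytes handed out so far
};

// Reserves `size` bytes and returns the block's byte offset inside the pooled
// buffer, or (size_t)-1 when the pool cannot fit the request.
std::size_t PoolReserve(UniformPool* pool, std::size_t size)
{
  std::size_t offset = AlignUp(pool->Used, pool->Alignment);
  if (offset + size > pool->Capacity)
  {
    return static_cast<std::size_t>(-1);
  }
  pool->Used = offset + size;
  return offset;
}
```

With an alignment of 256, reserving blocks of 160, 300 and 64 bytes yields offsets 0, 256 and 768. Each offset and size pair can then be bound to its own binding point while all three blocks share a single buffer object.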

Adding a path for when the ARB_buffer_storage extension is supported allows you to map the buffer persistently and keep a pointer that you can write into whenever you need to update your uniforms, without having to make any GL calls. For more information on this and other low-overhead OpenGL techniques, I suggest you take a look at the Steam Dev Days talk Beyond Porting: How Modern OpenGL Can Radically Reduce Driver Overhead.

Metric Panda Games
