Miscellaneous
Perspective projection, reverse depth buffer, clipping
The transform from view-space to projection-space is used not just for mapping vertices to the screen, but for a couple of other purposes as well:
- When updating a water reflection texture, the viewing frustum is narrowed to the (possibly asymmetric) bounds of the water area for clipping.
- In temporal anti-aliasing, the viewing frustum is slightly shaken (jittered) on each frame to produce better input for the temporal anti-aliasing algorithm when the camera is not moving (a sketch of the jittering is shown below).
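For example, the jitter can be implemented by offsetting the projection's translation term by a subpixel amount. A minimal sketch using the PerspectiveProjection structure defined below; the function and its parameters are illustrative, not the project's actual code:

// Shifts the projected image uniformly by the given jitter offset (in pixels,
// e.g. from a Halton sequence): the translation term is multiplied by z and
// later divided by -z in the projection, so the offset is constant in
// normalized device coordinates, which span 2 units across the viewport.
PerspectiveProjection jitterPerspectiveProjection(
        PerspectiveProjection p, float2 jitter, float2 viewportSize) {
    p.translation += 2 * jitter / viewportSize;
    return p;
}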
A reverse depth buffer is used because it has better precision than a traditional depth buffer, especially when using a floating-point format. It also supports an infinite far-clipping-plane distance. See the links below for details.
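One consequence of this mapping (see calculatePerspectiveProjection below) is that linear view-space depth can be recovered from the depth buffer with a single division. A minimal sketch, assuming the reversed infinite-far-plane projection used in this text:

// Recovers positive view-space depth from a reversed-depth-buffer value d.
// With the projection below, d = g_nearClipDepth / -z for points in front of
// the camera, so -z = g_nearClipDepth / d. Note that d approaches 0 at the
// infinite far plane.
float viewSpaceDepth(float d) {
    return g_nearClipDepth / d;
}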
Perspective projection is calculated in HLSL by a simple function with a small structure as a parameter, as shown in the code below, instead of by a projection matrix; this is slightly faster and takes less space in constant buffers. The enable flag allows enabling/disabling perspective projection for 2D graphics in shaders that support both 2D and 3D modes.
struct PerspectiveProjection {
    float2 scale;
    float2 translation;
    float enable;
    int padding[3]; // pads the struct to 32 bytes (two float4 registers)
};

// r1 is the left-top and r2 the right-bottom corner of the viewport,
// expressed as angular coefficients of the frustum planes (x/z and y/z).
// This function would typically be executed on the CPU side.
PerspectiveProjection createPerspectiveProjection(float2 r1, float2 r2, bool enable) {
    PerspectiveProjection p;
    p.scale = 2 / (r2 - r1);
    p.translation = -(1 - 2 * r2 / (r2 - r1)); // equivalent to (r1 + r2) / (r2 - r1)
    p.enable = enable;
    return p;
}

// Transforms view-space position v to clip space. When p.enable is 1, the
// z/w components produce reversed depth with an infinite far plane; when it
// is 0, the position passes through with z = 0 and w = 1 for 2D rendering.
// g_nearClipDepth is the near-clipping-plane distance from a constant buffer.
float4 calculatePerspectiveProjection(float4 v, PerspectiveProjection p) {
    return float4(
        p.scale * v.xy + p.translation * v.z,
        lerp(float2(0.0, 1.0), float2(g_nearClipDepth * v.w, -v.z), p.enable));
}
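For illustration, a vertex shader could use the function as follows. A minimal sketch; g_modelToView and g_projection are assumed constant-buffer variables, not names from the actual project:

// Minimal vertex shader using the projection function. g_modelToView is the
// single-precision model-to-view-space matrix prepared on the CPU.
float4 vertexShaderMain(float3 modelPos : POSITION) : SV_Position {
    float4 viewPos = mul(g_modelToView, float4(modelPos, 1.0));
    return calculatePerspectiveProjection(viewPos, g_projection);
}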
Links
Premultiplied alpha
Textures and color images are kept in premultiplied alpha format, i.e. the RGB color values of each pixel have been multiplied by the alpha value. The main benefit of this is that when the GPU interpolates texels during texture sampling, the colors are automatically weighted by alpha, so that more opaque texels have more impact on the final color, as they should. The same benefit applies to mipmap generation and image resampling, although in those cases the weighting by alpha could also be done in the resampling algorithm itself. In the GPU texture sampling case, alpha premultiplication is indispensable, because the interpolation hardware cannot be modified to do the weighting.
In shaders and other pieces of code that do calculations on colors, the colors are converted back to non-premultiplied format after reading from the texture, for easier processing.
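A minimal sketch of this conversion; the texture and sampler parameters are illustrative:

// Sample a premultiplied-alpha texture and convert to non-premultiplied
// colors for processing; fully transparent texels have no meaningful color.
float4 sampleNonPremultiplied(Texture2D tex, SamplerState samp, float2 uv) {
    float4 c = tex.Sample(samp, uv);
    c.rgb /= max(c.a, 1e-6); // avoid division by zero for transparent texels
    return c;
}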
Another, smaller benefit of premultiplied alpha is that when compressing transparent textures using BC7, the average compression error seems to slightly improve with premultiplied alpha. This was measured by estimating the error as the Euclidean distance of RGB values multiplied by alpha, taking into account only partially-transparent texels: the average error dropped from 0.0144 to 0.0127 in a foliage texture of pine tree needles.
Doing the premultiplication in linear color space is more appropriate than in sRGB space and yields more accurate results when sampling textures on the GPU, because the GPU does the sRGB-to-linear conversion before interpolation. However, some libraries may still process pixels in the "old way" in sRGB space without conversion to linear, and might not work properly with images premultiplied in linear space. For example, GDI+ may fail to draw pixels at all when an input image contains pixels whose sRGB color values are greater than the alpha value, which can occur when premultiplying in linear space. In this project, such libraries were not used, and all blending, including 2D graphics, is done in linear space.
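A minimal sketch of premultiplying one pixel in linear space while keeping sRGB storage, in HLSL-like syntax as elsewhere in this text; in practice this would run on the CPU in the asset pipeline:

// Standard sRGB <-> linear conversions, applied per channel.
float3 srgbToLinear(float3 c) {
    return c <= 0.04045 ? c / 12.92 : pow((c + 0.055) / 1.055, 2.4);
}
float3 linearToSrgb(float3 c) {
    return c <= 0.0031308 ? c * 12.92 : 1.055 * pow(c, 1.0 / 2.4) - 0.055;
}
// Premultiplies a non-premultiplied sRGB pixel by its alpha in linear space
// and re-encodes the result as sRGB for storage.
float4 premultiplyInLinearSpace(float4 srgbPixel) {
    float3 linearRgb = srgbToLinear(srgbPixel.rgb) * srgbPixel.a;
    return float4(linearToSrgb(linearRgb), srgbPixel.a);
}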
Link
World coordinates and floating-point precision
With a world size of a few kilometers, single-precision floating-point numbers can in theory represent coordinates with sub-millimeter precision: for example, the spacing of representable values just below 4 km from the origin is 2^-12 m, about 0.25 mm. However, when matrix operations are applied in the traditional way to transform vertices from model-space to world-space and then from world-space to view-space, the precision of the intermediate calculations is not sufficient, and vertices wobble visibly when the camera is located a couple of kilometers from the origin. This occurs even without any error accumulating over the course of several frames.
Using double-precision for matrices and vertex transforms would solve this, but GPU support for double-precision is limited and can be much slower (e.g. 1/32 of single-precision throughput on Nvidia Maxwell). Therefore, a solution is preferred that always uses single-precision on the GPU and, where needed, higher precision on the CPU. Essentially, any situation should be avoided where large coordinate values are handled as single-precision floats, unless they are far away from the camera.
One solution is to never transform vertices to world-space on the GPU, but instead transform them directly from model-space to view-space with a single matrix transform, using a single-precision matrix formed on the CPU, and perform all calculations in view-space. The positions and orientations of the models and the camera are defined as double-precision matrices, and the CPU combines them into a model-to-view-space matrix using double-precision arithmetic before converting the result to a single-precision matrix for the GPU.
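A minimal sketch of that combination, in HLSL-like syntax for consistency with the rest of this text; in practice this is ordinary CPU matrix code, and the names are illustrative:

// Combines double-precision model-to-world and world-to-view matrices and
// rounds the result to single precision once, for the GPU. The large world
// coordinates are never handled as single-precision floats.
float4x4 createModelToViewMatrix(double4x4 viewFromWorld, double4x4 worldFromModel) {
    double4x4 viewFromModel = mul(viewFromWorld, worldFromModel);
    return (float4x4)viewFromModel;
}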
Vertex positions and other bulk coordinate data are always stored in single-precision, not just because of GPU support but because of memory usage and bandwidth. High precision is not needed for individual vertex positions, because they are relative to a 3D model origin, and models of several kilometers in size don't usually have millimeter-scale details (and if they did, the models would have to be split anyway to allow rendering different parts with different levels of detail).
Double-precision is used on the CPU for model and camera positions and in other non-performance-critical situations. This does not mean that every model's position has to be represented in double-precision, as positions can also be specified relative to e.g. a terrain tile. Desktop CPUs are relatively fast at computing with double-precision, as the arithmetic throughput is half (with SSE) or equal (without SSE) to single-precision throughput. Memory usage and bandwidth are doubled, but this is not a concern as long as the number of double-precision coordinates and matrices is small.
Camera-centered-world-space
Sometimes it is awkward for the GPU to calculate everything in view-space, e.g. when lighting data is stored in grids that are oriented along world-space axes. Some shaders will first transform vertices into a "camera-centered-world-space", which has the same orientation as the world-space and the same origin as the view-space. The single-precision matrices for these transforms are prepared on the CPU and can be derived from double-precision matrices, if necessary. The key benefit of using camera-centered-world-space over world-space is similar to that of using view-space only, i.e. precision will be good for points that are close to the camera and worse for points farther away, where the error is not noticeable. The benefit over view-space on the other hand is that it's easier/faster to perform the lighting calculations in camera-centered-world-space than to rotate the light grids into view-space.
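For example, a shader might transform positions as follows. A minimal sketch; g_modelToCameraCenteredWorld is a hypothetical matrix prepared on the CPU by subtracting the camera position from the model-to-world-space translation:

// Transforms a model-space position to camera-centered-world-space, e.g. for
// sampling a light grid that is oriented along world-space axes.
float3 toCameraCenteredWorld(float3 modelPos) {
    return mul(g_modelToCameraCenteredWorld, float4(modelPos, 1.0)).xyz;
}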
Alternatives
It might be possible to avoid double-precision altogether by carefully modifying the transformations and other calculations so that errors are not amplified from the unnoticeable sub-millimeter range to the noticeable centimeter-or-larger range. For example, instead of forming a model-to-view-space matrix by calculating a matrix product of the model and camera matrices, the camera position could first be subtracted from the model position, and only then the rotations combined, as sketched below. But trying to be so careful in all calculations, including physics simulation and procedural world generation, would be unnecessarily complicated and error-prone, given that using double-precision on the CPU did not significantly affect overall performance.
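A sketch of this rejected alternative, assuming column-vector conventions; illustrative, not the project's code:

// Forms a model-to-view-space matrix in single precision. The camera position
// is subtracted from the model position before any rotations are applied, so
// intermediate values stay small for models near the camera.
float4x4 createModelToViewMatrixSinglePrecision(
        float3x3 viewRotation, float3 cameraPosition,
        float3x3 modelRotation, float3 modelPosition) {
    float3 relativePosition = modelPosition - cameraPosition;
    float3x3 rotation = mul(viewRotation, modelRotation);
    float3 translation = mul(viewRotation, relativePosition);
    return float4x4(
        float4(rotation[0], translation.x),
        float4(rotation[1], translation.y),
        float4(rotation[2], translation.z),
        float4(0, 0, 0, 1));
}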
It would also be possible to use a higher-precision format for positions than for orientations, which could be particularly useful if even higher precision than double were needed, but this would also be more complicated. Note that it's not generally sufficient to simply use mixed-precision matrices where the position elements have higher precision than the rotation elements; at the least, extra care would have to be taken to avoid situations where some intermediate matrices require high precision also in the rotation elements.
Link
Graphics programming tips
Miscellaneous little things in GPU programming that might be surprising and could lead to bugs:
- A pixel shader is executed for 2x2 blocks of pixels, and by default, derivatives of texture coordinates and other variables are calculated from the differences of their values between pixels in the same block. It is important to be aware of this when e.g. neighboring pixels might read from different decal textures with different texture coordinate systems. In that case, the derivatives calculated from the block might not be valid and have to be calculated manually. Note that texture sampling may use pixel-block derivatives even if the derivative functions are never called explicitly, so the texture sampling function that takes manually-calculated derivatives as parameters must be invoked instead (see the sketch after this list).
- The GPU driver may optimize shaders depending on how the shader is used, and shaders should be reloaded to the GPU when certain conditions change, even if the shader code has not changed. On Nvidia with Direct3D 11, performance dropped in the following cases until shaders were reloaded:
  - Full-screen mode switch.
  - Wireframe mode switch.
  - Activating/deactivating the use of a render target slot in shaders (e.g. enabling/disabling the temporal anti-aliasing velocity buffer).
- Geometry shaders and instancing can be surprisingly slow (see this).
- HLSL syntax allows defining a SamplerState in the HLSL code, but this is meant only for the (obsolete) effects framework and won't work as one would expect with the regular HLSL compiler. If you try to define SamplerState parameters in HLSL, they will be ignored and the sampler will have default state.
- With Direct3D, you have to remember to call CheckFeatureSupport() and CheckFormatSupport() for features that are only optionally supported. There does not seem to be a test mode that would allow you to check if the application is using such features.
- When reading data back from GPU to CPU, the program should wait at least two frames to prevent stalls (see Direct3D documentation).
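Regarding the first item above, a minimal sketch of sampling with manually-calculated derivatives; the parameter names are illustrative:

// Samples a decal texture with explicitly supplied gradients. The gradients
// are derived from a coordinate that is continuous across the whole 2x2 pixel
// block (e.g. a surface-local position), scaled into the decal's UV space, so
// they stay valid even when neighboring pixels use different decal textures.
float4 sampleDecal(Texture2D tex, SamplerState samp, float2 uv,
                   float2 continuousCoord, float2 uvScale) {
    float2 gradX = ddx(continuousCoord) * uvScale;
    float2 gradY = ddy(continuousCoord) * uvScale;
    return tex.SampleGrad(samp, uv, gradX, gradY);
}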