Martin Fuller's Graphics Ramblings

Personal views, not those of my employer

Anisotropic Scaling in Indiana Jones and the Great Circle and DOOM: The Dark Ages

While working on Indiana Jones and the Great Circle and later DOOM: The Dark Ages, it became apparent that both titles are particularly sensitive to anisotropic sampling cost on the Xbox Series consoles. This is despite the two having very different renderers: Indiana Jones uses what I would best describe as a semi-forward system, while Doom uses deferred texturing, with the novel development of a barycentric buffer. (See GPC 2025: Visibility Buffer and Deferred Rendering in Doom the Dark Ages) I ascribe the fact that both are so sensitive to anisotropic tap count partly to the system of layering materials, and partly to the fact that both games use Sampler Feedback Streaming (SFS) on Xbox Series consoles. (SFS adds an additional texture fetch to discover the most detailed mip resident in memory for a given texture region, which is then used to limit the access of a subsequent texture read)

As usual, the problem arises due to variation in scene content. Some scenes could afford high levels of anisotropic sampling, while others benefitted from reducing the count to ensure a locked 60hz output. I set about one Saturday (immediately prior to code lock) to attempt to tie anisotropic sampling level to the current dynamic resolution scale, not entirely convinced this was a visually acceptable thing to do.

Most non-engineers would be amazed how much DRS resolution choice can jump around even with a steady camera. A good TAA/super resolution implementation ensures nobody notices per-frame resolution changes. In fact, I think it can help, providing additional temporal information on a macro scale, while sub-pixel jitter provides additional information on a micro scale.

The problem with changing anisotropic sample count is that it’s extremely obvious, visually nasty, and particularly distracting if the count flip flops. The difficulty therefore was to create an algorithm that will change the count, but is reluctant to do so, and will even stick with a setting it considers to be ‘wrong’ rather than change it.

The solution I arrived at has two parts. The first is a filter, which ensures the same recommendation for change has been made for a number of frames, with fewer frames required the higher the confidence. This on its own, however, was not enough. The second part of the algorithm is an aid to prevent flip flopping: if the last change was to increase the anisotropic tap count, increasing it again is easier than switching direction and decreasing the tap count.

This algorithm varied the anisotropic tap count between 1 (bilinear) and 8 on Xbox Series S and between 4 and 10 on Xbox Series X. (10 is in effect a bias towards 8)

Something I expected I might need to implement was an additional reluctance to scale up the sample count unless the camera transform had altered significantly. However, I did not find this necessary: the algorithm has enough resistance to change that, in practice, it rarely changes the sample count with a steady camera anyway, and when it does, it’s hard to argue with the choice. Anisotropic scaling wasn’t applied to the terrain system as this used different samplers, but likely could have been for additional savings. Indiana Jones does vary the sample count on terrain overlays such as road tracks, which is where changing the level was most noticeable, because these overlays often run at max sample count and cover a large screen area.

The code ported straight over from Indiana Jones to DOOM: The Dark Ages and shipped in that title also with only a minor tweak to parameters. Both games shipped with a perf or quality win depending on scene load, and nobody noticed or complained (that I noticed). Indiana Jones won Digital Foundry’s technology of the year 2024 (despite the ‘knocked up in a day’ anisotropic scaling algorithm), huge congrats to everyone involved!

The same algorithm can be used to tie any graphical setting to DRS where a frequent change or flip-flop of state is objectionable, but an infrequent change, particularly with a moving camera, may go unnoticed. For example, I fully expect the same algorithm could be used for shadow map size or shadow filtering quality. Other parameters can be tied to DRS scale directly without this algorithm, e.g. VRS shading rate, which we did on Doom. (Another shameless plug! See GPC 2025: Variable Rate Compute Shaders in Doom the Dark Ages)

Always great to work with the talented teams at Machine Games and id Software and thank you to both for allowing me to share.

The code below is of course trivial to generalise to other settings, not just anisotropic sampling. The code is modified from as-shipped only to aid portability and understanding.

/* 
    MIT License

    Copyright (c) Microsoft Corporation.

    Permission is hereby granted, free of charge, to any person obtaining a copy
    of this software and associated documentation files (the "Software"), to deal
    in the Software without restriction, including without limitation the rights
    to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
    copies of the Software, and to permit persons to whom the Software is
    furnished to do so, subject to the following conditions:

    The above copyright notice and this permission notice shall be included in all
    copies or substantial portions of the Software.

    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
    IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
    FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
    AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
    LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
    OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
    SOFTWARE.
*/

float GetAnisotropicSampleCount(
    float minAniso,          // min sample count for current platform settings
    float maxAniso,          // max sample count for current platform settings
    float currentResScale,   // current resolution scale
    float minAnisoResScale,  // resolution scale <= which we return minAniso
    float maxAnisoResScale,  // resolution scale >= which we return maxAniso
    float bias,              // prevents flip flop of aniso settings, higher numbers reduce oscillation, 1.25f default, suggested range 0..2
    float smoothFactor,      // smooth factor, closer to one takes longer to smooth, default 0.925f
    float smoothFactorPanic  // smooth factor when resolution is less than min, closer to one takes longer to smooth, default 0.8f, should be less than smoothFactor
)
{
    static float currentAniso = -1.0f;
    static float smoothedAniso = -1.0f;
    static float upBias = 0.0f;
    static float downBias = 0.0f;

    if (currentAniso < 0.0) {
        // first frame initialization
        currentAniso = minAniso;
        smoothedAniso = minAniso;
    }
    else {
        float anisoThisFrame;
        float smooth;

        if (currentResScale <= minAnisoResScale) {
            // panic, favor this frame's preferred aniso and make it easy to scale down
            anisoThisFrame = minAniso;
            upBias = 2.0f;
            downBias = 0.0f;
            smooth = smoothFactorPanic;
        }
        else {
            // ok, everything is normal, no panic
            if (currentResScale >= maxAnisoResScale) {
                anisoThisFrame = maxAniso;
            }
            else {
                // could precalculate reciprocal
                float t = (currentResScale - minAnisoResScale) / (maxAnisoResScale - minAnisoResScale);   
                anisoThisFrame = minAniso + (t * (maxAniso - minAniso));
            }
            smooth = smoothFactor;
        }
        // smooth the running target, takes out a lot of per frame noise
        smoothedAniso *= smooth;
        smoothedAniso += (1.0f - smooth) * anisoThisFrame;
    }
    // check if we should scale up or scale down, we don't want to do this often, but we need to do it sometimes!
    float candidateAniso = roundf(smoothedAniso + 0.499f);

    // Might want to prevent scaling up if no or very low camera movement, that check would go here
    if ((smoothedAniso - (upBias * bias)) > currentAniso) {		
        if (currentAniso != candidateAniso) {
            // scale up
            currentAniso = candidateAniso;
            // easy to scale up again next frame, but hard to scale down
            upBias = 0.5f;
            downBias = 2.0f;
        }
    }
    if ((smoothedAniso + (downBias * bias)) < currentAniso) {
        if (currentAniso != candidateAniso) {
            // scale down
            currentAniso = candidateAniso;
            // easy to scale down again next frame, but hard to scale up
            upBias = 2.0f;
            downBias = 0.5f;
        }
    }
    return currentAniso;
}
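
As a rough illustration of the generalisation mentioned above, the following is a minimal sketch that keeps the same filter and directional bias but moves the state out of function statics and into an instance, so that several settings (shadow quality tiers, say) could each run their own copy. The name DrsLinkedSetting and its members are hypothetical and not part of the shipped code; the constants are the defaults quoted in the parameter comments.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical per-instance variant of GetAnisotropicSampleCount.
// The logic mirrors the listing above; only the state handling differs.
struct DrsLinkedSetting {
    float minLevel, maxLevel;        // range of the setting being driven
    float minResScale, maxResScale;  // resolution scales mapped to min / max level
    float bias = 1.25f;              // resistance to changing direction
    float smoothFactor = 0.925f;     // normal smoothing
    float smoothFactorPanic = 0.8f;  // faster smoothing at or below minResScale

    float current = -1.0f;           // < 0 means "not initialised yet"
    float smoothed = -1.0f;
    float upBias = 0.0f;
    float downBias = 0.0f;

    DrsLinkedSetting(float lo, float hi, float loScale, float hiScale)
        : minLevel(lo), maxLevel(hi), minResScale(loScale), maxResScale(hiScale) {}

    float Update(float resScale) {
        assert(maxResScale > minResScale && maxLevel >= minLevel);
        if (current < 0.0f) {
            current = smoothed = minLevel;  // first frame
        } else {
            float target = maxLevel;
            float smooth = smoothFactor;
            if (resScale <= minResScale) {
                // panic: make it easy to scale down, hard to scale up
                target = minLevel;
                upBias = 2.0f;
                downBias = 0.0f;
                smooth = smoothFactorPanic;
            } else if (resScale < maxResScale) {
                const float t = (resScale - minResScale) / (maxResScale - minResScale);
                target = minLevel + t * (maxLevel - minLevel);
            }
            smoothed = smoothed * smooth + (1.0f - smooth) * target;
        }
        const float candidate = std::round(smoothed + 0.499f);
        if (smoothed - upBias * bias > current && candidate != current) {
            current = candidate;  // scaled up: easy to go up again, hard to go down
            upBias = 0.5f;
            downBias = 2.0f;
        }
        if (smoothed + downBias * bias < current && candidate != current) {
            current = candidate;  // scaled down: easy to go down again, hard to go up
            upBias = 2.0f;
            downBias = 0.5f;
        }
        return current;
    }
};
```

Calling Update once per frame with the current DRS scale returns the level to use that frame. Whether the default constants suit a setting other than anisotropic tap count would need tuning per title.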

http://martinfullerblog.wordpress.com/?p=378
Massaging the Shader Compiler to emit Optimum Instructions

Modern GPUs feature sophisticated instruction sets for executing shaders. However, this blog details how difficult and unintuitive it can be to have the shader compiler leverage the optimum instruction.

Consider the simple example of swapping the endianness of a 32bit word. (Something that was common on the PlayStation3 in particular) Below are 5 different HLSL implementations of EndianSwap along with a simple count of the number of HLSL bitwise operations, which I think of as the shader compiler’s target to beat. It looks like 9 HLSL bitwise ops ought to be the best, but what do these functions actually compile to? (With the RGA 2.6.2 backend)

Observations

We can see from the above that we have five HLSL implementations of EndianSwap and 4 very different outputs from the shader compiler. The compiler often emits v_lshl_or_b32, which combines a bitwise shift and an OR, and v_or3_b32, which performs a bitwise OR of three inputs, both of which are nice optimisations. Looking at the examples in turn:

EndianSwapA & B result in essentially the same code gen with a mostly literal translation of the HLSL, that is bitwise shifts, ANDs and ORs. The compiler spots that the AND operations with a 24bit shift are redundant and strips them, which is great, but only gets us to 7 hardware instructions, down from 10 or 11 HLSL bitwise operations.

EndianSwapC has the highest HLSL bitwise instruction count (13), however uniquely in these examples, the compiler leverages v_bfe_u32, (‘bit field extract’) and achieves a win from the fact this one instruction performs multiple bitwise operations, so EndianSwapC beats out A & B, at 6 hardware instructions, despite the higher HLSL bitwise operation count.

In EndianSwapD the compiler generates two v_perm_b32 instructions, making it our best implementation so far at 5 hardware instructions. The two 24bit shifts are executed as separate instructions and the results bitwise OR’d. It’s quite elegant but not optimal.

EndianSwapE, bingo, the GPU can implement EndianSwap in just one v_perm_b32 instruction! Not only does this implementation use the fewest hardware instructions, it also uses the fewest vector registers.

The Problem

Let’s compare EndianSwapD & E:

My expectation is that while an experienced/senior engineer might come up with EndianSwapD, they would not add the redundant AND operations seen in EndianSwapE. (The 24bit shifts both zero extend, and therefore you don’t need a bitwise AND to extract the high and low byte. And again, the compiler leveraged this optimisation when compiling EndianSwapA & B) Yet counterintuitively, adding these redundant operations produces by far the best compiled output!

For completeness, it’s a good idea to test ordering and form. Consider EndianSwapB and slight rewrites of EndianSwapE.

EndianSwapB emits 7 hardware instructions while E, F, G all emit just 1. The critical difference is to place the shift operation first, followed by a bitwise AND, which must be present even if it’s redundant. The order in which each byte is processed doesn’t appear to influence the hardware instruction used, though it does change the literal used.
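
To make the ‘shift first, then AND, even if redundant’ form concrete, here is a small sketch written in C++ rather than HLSL. It is not the author’s code (the actual functions are at the Compiler Explorer link in the credits), the function names are made up for illustration, and it says nothing about what any particular compiler will emit. It only shows that the minimal form and the fully masked form compute the same value, which is why the extra ANDs look redundant to an engineer.

```cpp
#include <cassert>
#include <cstdint>

// Minimal form: the two 24bit shifts fill with zeros, so no mask is needed
// on the outer bytes.
constexpr uint32_t EndianSwapMinimal(uint32_t x) {
    return (x << 24) | ((x << 8) & 0x00FF0000u) |
           ((x >> 8) & 0x0000FF00u) | (x >> 24);
}

// Shift first, then AND on every byte, including the two redundant masks.
// This is the general shape the text describes as matching the single
// v_perm_b32 pattern.
constexpr uint32_t EndianSwapShiftThenMask(uint32_t x) {
    return ((x << 24) & 0xFF000000u) | ((x << 8) & 0x00FF0000u) |
           ((x >> 8) & 0x0000FF00u) | ((x >> 24) & 0x000000FFu);
}

// Both forms must agree for any input; only the code generation can differ.
inline void CheckEndianSwapFormsAgree(uint32_t x) {
    assert(EndianSwapMinimal(x) == EndianSwapShiftThenMask(x));
}
```

The only reliable way to know which form a given backend prefers is to inspect the generated ISA, as done above.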

Conclusions

We can see:

  1. It is not obvious what pattern of HLSL instructions the compiler is matching to emit the GPU’s more powerful instructions.
  2. The exact form the compiler is looking for may require operations that an engineer would reasonably consider redundant and therefore not provide.
  3. Engineers failing to match the correct form results in sub-optimal code generation, and this is likely to happen a lot!
  4. It is possible that the optimum form may differ on different IHVs(!)

Some might say that this is on the compiler to do a better job, and that is a valid point. Currently, however, this is a problem that can be addressed and optimisations made, at least on AMD GPUs with public tools. Certainly this is a great example of why it’s good to validate the quality of code generated by a compiler.

Personally, I would like to see HLSL expose intrinsics for many of the more complex and powerful instructions such as v_perm_b32 and v_bfe_u32. I am not a compiler engineer, however intuitively it would seem easier for the shader compiler to decompose intrinsics into individual instructions on unsupported hardware than it is for the compiler to reduce complex sequences of HLSL to fewer instructions. From the shader author’s point of view, it might be easier to use an intrinsic than to figure out how to get the shader compiler to produce the code you want from vanilla HLSL.

Credits

Thanks to my colleagues James Stanard and in particular Adam Miles, who were riffing on this issue.

This investigation used Compiler Explorer, with the latest AMD backend (RGA). A huge thanks both to AMD for making their ISA and tooling public (via GPUOpen) and to Matt Godbolt, the author of Compiler Explorer.

The functions: https://godbolt.org/z/75c1ToYqs

http://martinfullerblog.wordpress.com/?p=291
Dynamic Resolution Scaling (DRS) on PC

DRS is one of those techniques that is a lot easier to implement on console than PC, or perhaps more accurately, it’s easier to run closer to the wire, maximising GPU use without dropping frames. My previous post on dynamic resolution scaling briefly touched on this, however I’ve received a couple of PMs asking for a more in-depth explanation. The following details the various challenges unique to PC.

A section of the very cool GPU wall at Microsoft offices in Redmond featuring GPUs from the 1990s to the present day. (actually this is an old photograph from the previous building, the display has now moved to a new office with more GPUs added) Photograph Author.

1. Graphical Options & Driver Updates

Consoles provide an extremely predictable and stable platform, with no ‘post launch’ driver updates to change performance without a corresponding title executable update. AAA console titles typically provide just two or three modes that have a material impact on GPU time, e.g. performance/quality. On console therefore it is reasonable to tune the ‘DRS regulator’, which decides what resolution to select, for each mode. Potentially this can even include pre-computing the expected frame time delta between each resolution step in a lookup table. There might be different tables for different maps, e.g. a jungle map vs a city scape.

While console has maybe 2 or 3 modes to deal with, PC titles must deal with a huge matrix of different GPUs (with overclocking, varying RAM capacity/speed etc.), and typically a large array of user graphics settings. These settings affect both the fixed frame time costs and the costs that scale with resolution. Resolution scaling in linear increments produces a non-linear change in frame time. As resolution diminishes, fixed costs increasingly dominate the frame time (e.g. vertex processing, rendering shadow maps, water physics, GI update, building BVHs etc.). Worse, these costs might change with a driver update.

The only practical solution on PC therefore is to dynamically solve for the relationship between resolution and the frame time at different resolutions given any GPU, driver, and graphics settings. This same solution is also viable on console of course, and can be particularly useful during development when code and content is changing.

2. Latency of Information from the GPU and the Presentation Queue

On console, typically the CPU can acquire GPU timing data the instant the GPU has written it, without having to transfer that information across a bus, as happens on PCs with discrete GPUs. This means that a console title can be informed earlier what the last frame time was, and has more time to correct the resolution to avoid a frame drop, or conversely, to avoid wasted GPU time if running at too low a resolution.

Perhaps more significantly consoles have a much better insight into what is happening in the presentation queue and the latencies all the way through to display out. This allows titles to run ‘closer to the wire’ on frametime when a large display latency exists. Or to be more aggressive on reducing resolution when a short display latency exists, e.g. a panic mode. Unfortunately this information is not available on PC, however it is true that not all console DRS implementations use this measurement, so it is not essential.

3. Non-exclusive GPU

PC implementations can be upset by GPU tasks executed by different processes on the PC consuming GPU time and, in the worst case, causing spikes. Placing a timing query at the start and end of your frame might capture the time taken by another process, or it might not, depending on when the workload happened.

One potential solution is to capture not just the time taken by the title, but by everything else running concurrently on the GPU. You might compare not just timing queries for the start and end of your GPU frame but also the end time of the previous frame.
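
As a minimal sketch of that comparison, assuming the start and end timestamps of each frame have already been read back as raw GPU ticks along with the tick frequency, the two measurements could be computed as follows. The struct and function names are hypothetical. Note that the end-to-end figure also includes any time the GPU simply sat idle between frames, so it is an upper bound rather than a precise measure of other processes.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for one frame's GPU timestamps, in raw ticks.
struct FrameTimestamps {
    uint64_t startTicks;
    uint64_t endTicks;
};

// Time between the title's own start and end queries.
double TitleFrameMs(const FrameTimestamps& f, uint64_t ticksPerSecond) {
    assert(ticksPerSecond > 0 && f.endTicks >= f.startTicks);
    return double(f.endTicks - f.startTicks) * 1000.0 / double(ticksPerSecond);
}

// Time from the end of the previous frame to the end of this one. This also
// covers the gap before our start query, where another process may have been
// using the GPU (or where the GPU was idle).
double EndToEndFrameMs(const FrameTimestamps& prev, const FrameTimestamps& cur,
                       uint64_t ticksPerSecond) {
    assert(ticksPerSecond > 0 && cur.endTicks >= prev.endTicks);
    return double(cur.endTicks - prev.endTicks) * 1000.0 / double(ticksPerSecond);
}
```

A regulator might use the larger of the two values, or treat a large difference between them as a hint that something else is contending for the GPU.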

On console, the additional processes which can consume GPU time while playing the game are extremely limited, e.g. an achievement popup. These are lightweight and safely ignored by console DRS implementations.

4. Mouse Look

Rapidly changing view direction is a problem for DRS. On PC for first person games in particular, mouse look can change the camera transformation faster than a controller can, reducing temporal coherence of GPU load and making it more difficult for the DRS to make a good prediction for next frame’s resolution.

This can be a particular issue for professional players who might make faster mouse moves.

5. Dynamic GPU Clock Speed

Any process that can dynamically change the GPU clock speed outside of the title’s control is also a problem for DRS, e.g. battery vs charging, or an overclocking app such as Afterburner.

So what’s the Solution on PC?

On PC a title using DRS has to dynamically adjust its measurement of what changing the resolution does to the frame time. One credible idea is to use the derivative of the last few frames choices to make a new prediction. You might also include historical data, what was the frame time last time that resolution was chosen, weighted by the time since that resolution was last chosen. This is not a bad idea on console also.
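
One way to read the derivative idea is as a local slope estimate: take two recent frames rendered at different pixel counts, estimate milliseconds per pixel from the difference, and solve for the pixel count that would hit the target GPU time. The sketch below does exactly that and nothing more. The names are hypothetical, it uses only two samples where a real regulator would likely filter over several, and it simply holds the current resolution when the samples give no usable slope.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical record of one frame: how many pixels were rendered and how
// long the GPU took.
struct DrsSample {
    float pixels;
    float gpuMs;
};

// Predict the pixel count expected to hit targetMs, from the local slope
// (ms per pixel) between two recent samples.
float PredictPixelCount(DrsSample older, DrsSample newer, float targetMs,
                        float minPixels, float maxPixels) {
    assert(minPixels <= maxPixels);
    const float hold = std::min(std::max(newer.pixels, minPixels), maxPixels);
    const float dPixels = newer.pixels - older.pixels;
    if (std::fabs(dPixels) < 1.0f) return hold;  // same resolution: no slope information
    const float msPerPixel = (newer.gpuMs - older.gpuMs) / dPixels;
    if (msPerPixel <= 0.0f) return hold;         // noise or content change: do not extrapolate
    const float predicted = newer.pixels + (targetMs - newer.gpuMs) / msPerPixel;
    return std::min(std::max(predicted, minPixels), maxPixels);
}
```

Because the slope is measured rather than assumed, the fixed (non-scaling) portion of the frame is accounted for implicitly, whatever the GPU, driver or settings.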

An effective and I suspect widely adopted ‘get out of jail’ card on PC is to run with a higher safety margin vs console, e.g. aim for 15.8ms instead of 16.3ms, especially since you don’t know how much latency exists in the presentation pipeline, though this is obviously wasteful of GPU performance.

I find it interesting that it is more difficult to have DRS run to the wire on PC and maximise the GPU, not because of the hardware, but because of the platform.

http://martinfullerblog.wordpress.com/?p=230
Dynamic Resolution Scaling (DRS) Implementation Best Practice

There was a point early in console Gen8 where I’d had some involvement in over half the published DRS implementations, anything from providing advice to hands-on coding. Subsequently DRS has become a ‘must have’ feature for AAA games, particularly on consoles. There appears to be relatively little technical material written about the technique, at least publicly, and hence perhaps the following might be of interest.

Without DRS, console games typically have to choose a resolution and level of content such that the GPU ordinarily runs under the frame time budget, with sufficient overhead to avoid dropped frames when all the action kicks off. For example, a 60fps console title might normally run with a GPU time of 14ms, leaving 2.6ms of overhead available for combat effects etc. Adhering to this budget while making good use of the GPU frequently takes a lot of development energy and discipline.

The idea of DRS is to lower resolution to avoid dropping frames, or conversely use spare GPU time to maximise image quality. DRS can also provide a development cost benefit, reducing time spent fine tuning content.

1. Viewports vs Memory Aliasing

Creating different size render targets each frame is a non-starter: the cost of runtime memory management is prohibitive, and an out-of-memory error when allocating a render target at runtime would be extremely difficult to deal with.

Very early DRS implementations used viewports to render the 3D scene to a limited portion of each render target. (In fact, this was the only way of implementing DRS with DirectX11 on PC) Issues with this approach include setting the viewport at all relevant points in the pipeline, and modifying the UVs of any shader reading the render target, to only access the portion you rendered to. This is typically invasive, error prone due to border conditions(*), has hidden performance issues(**) and introduces an ongoing maintenance cost. It might also conflict with other use of viewports, e.g. splitscreen, adding complexity to resolve.

For example, with a 3200 pixel wide viewport in a 3840 pixel wide render target, if you were accessing the pixel with X coordinate 3199 using UVs, instead of a U of 1.0, you would need a U of 0.833. (Typically you want to address pixel centres, making the calculation a little more involved, but you get the idea)
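
The arithmetic can be sketched as below, assuming the example refers to a 3200 pixel wide viewport inside a 3840 pixel wide render target (an inference from 3200 / 3840 ≈ 0.833). The function names are hypothetical. The first function rescales a viewport-relative U, the second addresses a pixel centre directly.

```cpp
#include <cassert>

// Rescale a U in [0,1] across the viewport into a U across the full target.
float ViewportUToTargetU(float u, int viewportWidth, int targetWidth) {
    assert(viewportWidth > 0 && viewportWidth <= targetWidth);
    return u * float(viewportWidth) / float(targetWidth);
}

// U that addresses the centre of pixel x in a target of the given width.
float PixelCentreToTargetU(int x, int targetWidth) {
    assert(targetWidth > 0 && x >= 0 && x < targetWidth);
    return (float(x) + 0.5f) / float(targetWidth);
}
```

With these numbers the right-hand edge of the viewport lands at roughly U = 0.8333, and the centre of pixel 3199 at roughly U = 0.8332.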

Fortunately with DirectX12, there is a far better solution leveraging memory aliasing. For each render target, instead create an array of render targets of different sizes that alias the same physical memory pages. The allocation must cover the resolution with the highest memory requirement; counterintuitively this might not be the largest size (see Adam Sawicki’s excellent blog), so take the max of the memory sizes of all supported resolutions. Selecting which render target to use from this array avoids the use of viewports, without increasing memory requirements.

See ID3D12Device::CreatePlacedResource
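
The ‘take the max memory size’ step might look like the sketch below. To keep it runnable without a D3D12 device, the per-resolution size query is passed in as a callback; in a real title that callback would presumably wrap ID3D12Device::GetResourceAllocationInfo for a resource description of that size, and the result would size the heap that every placed resource aliases. Extent, AliasedHeapBytes and sizeQuery are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical render target dimensions for one DRS step.
struct Extent {
    uint32_t width;
    uint32_t height;
};

// Returns the number of bytes the shared allocation needs so that any of the
// supported sizes can be placed in it. sizeQuery stands in for the API call
// that reports the allocation size of a render target with the given extent.
uint64_t AliasedHeapBytes(const std::vector<Extent>& sizes,
                          const std::function<uint64_t(Extent)>& sizeQuery) {
    assert(!sizes.empty());
    uint64_t maxBytes = 0;
    for (const Extent& e : sizes) {
        // Do not assume the largest extent needs the most memory.
        maxBytes = std::max(maxBytes, sizeQuery(e));
    }
    return maxBytes;
}
```

Querying every supported size, rather than only the largest, is what guards against the counterintuitive case mentioned above.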

* The DirectX specification states that any ‘out of bounds’ access of the render target returns 0, and with render targets aliasing the same memory, you get a consistent out of bounds behaviour no matter what the resolution. With viewports, except at max res, out of bounds access on two edges returns stale data from a previous frame. (It can be surprising how often shaders rely on out of bounds behaviour)

** Clears and lossless render target compression technology are often optimized for a full target clear, rather than a region clear.

2. Which Render Targets?

Typically, the main change to an engine to support DRS is modifying the handling of descriptors for each aliased render target (RTV, SRV, UAV), in such a way that the DRS implementation is hidden from higher level code. That is, a high level render target abstraction object becomes under the hood, multiple DX12 render targets, one for each resolution.

You can either create an array of descriptors, one for each resolution, or create the descriptors on demand for the next frame’s resolution. In the latter case, be extremely careful not to overwrite a descriptor the GPU is currently using! Descriptors are small and cheap to create, so there really isn’t much to choose between the two options.

It is necessary to flag which render targets should scale with DRS, which don’t (e.g. shadow maps), and which have DRS, but using last frame’s resolution choice. Those a frame behind typically store some form of history, temporal colour data for anti-aliasing perhaps.

3. Frame time and Resolution Choice

DRS normally has a large range of resolutions to pick from, 2 pixel increments can be a good choice.

Some titles choose to scale only in the horizontal axis, which can be an advantage in maintaining resolution on features that might suffer from geometry aliasing in the vertical axis, e.g. stairs. Another idea is to scale horizontally first, then scale vertically. This creates more usable increments in resolution vs scaling in both axes at the same time.
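
A minimal sketch of the ‘horizontal first, then vertical’ idea, using a fixed pixel step (2 by default, as suggested above), might build the list of selectable resolutions like this. The names and the exact stopping rule are assumptions for illustration. Index 0 is the highest resolution.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical resolution entry for one DRS step.
struct Resolution {
    uint32_t width;
    uint32_t height;
};

// Build a ladder that first reduces width from maxW towards minW, then
// reduces height from maxH towards minH, in fixed pixel steps.
std::vector<Resolution> BuildResolutionLadder(uint32_t maxW, uint32_t maxH,
                                              uint32_t minW, uint32_t minH,
                                              uint32_t step = 2) {
    assert(step > 0 && minW <= maxW && minH <= maxH);
    std::vector<Resolution> ladder;
    uint32_t w = maxW;
    ladder.push_back({w, maxH});
    while (w >= minW + step) {  // written this way to avoid unsigned underflow
        w -= step;
        ladder.push_back({w, maxH});
    }
    uint32_t h = maxH;
    while (h >= minH + step) {
        h -= step;
        ladder.push_back({w, h});
    }
    return ladder;
}
```

The regulator then only needs to pick an index into this ladder each frame.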

Of course, you need to decide what the resolution should be for a given frame time. If the GPU time runs over the target frame time, you need to reduce resolution, and vice versa. You need to drop resolution sufficiently to avoid dropped frames, but ideally without compromising visual quality by dropping further than required. Conversely, when increasing resolution, it’s important to do this quickly, as being over cautious results in the GPU being under utilised and resolution unnecessarily low. Making sure you don’t either drop resolution too low, or fail to increase resolution quickly enough, requires careful checking with a good debug mode, since the problem is non-obvious, at least compared to a frame drop.

Only marginally exceeding frame time does not immediately result in a dropped frame, instead there is a latency in the presentation pipeline you can eat into before a drop occurs. Advanced implementations of DRS might monitor this latency to implement a ‘panic mode’ when latency is low, dropping resolution more aggressively.

Exactly how changing resolution affects the GPU’s frame time unfortunately ‘depends’. Not all GPU tasks scale linearly with resolution, and many do not scale at all. For example, shadow map rendering, particle update, culling or acceleration structure building normally do not scale. There may also be passes that initially scale down nicely with resolution, e.g. g-buffer rendering, but as resolution diminishes, so vertex processing increasingly becomes the bottleneck. That is, GPU time saving from resolution drop is non-linear, varies with content, and will vary depending on technology choices.

A scientific approach for console is to have several automated test points in the game, render these at each resolution and record the GPU time taken. In this way a table can be built of the expected GPU time delta any resolution change might have. This delta will change during development, as different GPU workloads are added or optimised and content is added to the game.

On PC, due to the varying hardware, the wide array of graphics options that can be changed at any time, and variable power states, this table is perhaps best dynamically updated. The issue here is that a given resolution may not have been chosen for a while, and the scene may have changed significantly since that resolution was last chosen. A solution is to weight the frame time history with a confidence value, depending on how long it is since that resolution was last chosen. The inverse of this confidence value can be used to extrapolate from recently chosen resolutions, or select some known good default.
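
One simple way to express the confidence weighting is sketched below. The exponential decay with a half-life measured in frames is an assumption chosen for illustration, as are the function names; the text above only says that confidence should fall with time since the resolution was last chosen. The fallback value is whatever the title extrapolates from recently chosen resolutions, or a known good default.

```cpp
#include <cassert>
#include <cmath>

// Confidence in a recorded frame time: 1 when just measured, halving every
// halfLifeFrames frames (the decay shape is an assumption).
float HistoryConfidence(int framesSinceChosen, float halfLifeFrames) {
    assert(framesSinceChosen >= 0 && halfLifeFrames > 0.0f);
    return std::exp2(-float(framesSinceChosen) / halfLifeFrames);
}

// Blend the recorded GPU time for a resolution with a fallback estimate,
// trusting the record less the older it is.
float EstimateGpuMs(float recordedMs, int framesSinceChosen,
                    float fallbackMs, float halfLifeFrames) {
    const float c = HistoryConfidence(framesSinceChosen, halfLifeFrames);
    return c * recordedMs + (1.0f - c) * fallbackMs;
}
```

For instance, a record of 12ms that is exactly one half-life old, blended with a 16ms fallback, gives an estimate of 14ms.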

Typically, games might aim for a target GPU time that is slightly under the frame period, e.g. a 16ms target for a 16.6ms frame (60fps). If the game consistently runs 0.6ms under the frame time, it will progressively build the maximum possible latency in the presentation pipeline, until the GPU is throttled waiting for an available swap chain buffer. Advanced DRS implementations which monitor this latency might change the target frame time to be closer to 16.6ms when a large presentation latency exists.

It is important to ensure the CPU obtains GPU frame times with the least delay possible.

4. Frame Time Bubbles

A naive implementation for determining frame time is to add a single timestamp query (D3D12_QUERY_TYPE_TIMESTAMP) at the start and end of the frame, the difference being the time the GPU took to render the frame.

Advanced titles using multiple GPU queues (e.g. graphics and async compute) might want to time only the critical path, using a timestamp query for select ExecuteCommandLists calls, and being careful not to double count any time on different queues. Non-critical tasks that might not be timed include, for example, GI update or sky simulation.
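
One possible way to avoid double counting is to treat each timed submission as an interval and sum the union of those intervals, so that overlapping work on two queues is only counted once. The sketch below assumes the timestamps from the different queues have already been converted to a common clock, which is an assumption that needs checking on real hardware; the names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical timed span for one submission, in ticks on a common clock.
struct TickInterval {
    uint64_t begin;
    uint64_t end;
};

// Total ticks covered by at least one interval. Overlapping intervals
// (e.g. graphics and async compute running together) are counted once.
uint64_t CoveredTicks(std::vector<TickInterval> intervals) {
    std::sort(intervals.begin(), intervals.end(),
              [](const TickInterval& a, const TickInterval& b) { return a.begin < b.begin; });
    uint64_t total = 0;
    bool open = false;
    uint64_t curBegin = 0, curEnd = 0;
    for (const TickInterval& i : intervals) {
        assert(i.end >= i.begin);
        if (!open) {
            curBegin = i.begin; curEnd = i.end; open = true;
        } else if (i.begin <= curEnd) {
            curEnd = std::max(curEnd, i.end);  // overlaps: extend the current run
        } else {
            total += curEnd - curBegin;        // gap: close the current run
            curBegin = i.begin; curEnd = i.end;
        }
    }
    if (open) total += curEnd - curBegin;
    return total;
}
```

Gaps between the merged runs are left out of the total, which is also a cheap way to see how much untimed or idle time sits inside the frame.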

It’s possible a title can create a bubble between two command lists while async compute is operating. But this async compute task might not be a critical task, and in fact the title could run at a higher resolution with little or no adverse effect on frame time, by increasing resolution until the bubble collapses. PIX is great for spotting bubbles.

5. Debugging

I’ve found two debug features to be extremely useful in any implementation. Firstly, a graph over time showing resolution, GPU time and dropped frames. (very similar to the frame time graph Digital Foundry use, but with resolution graphed as well, perhaps the graph Digital Foundry wish they had!)

Naturally you want to optimise the GPU to be consistently busy, without dropping frames. Missing a presentation and dropping a frame is bad, but so is unused GPU time. Not dropping resolution fast enough and missing a frame is obvious, however less obvious is dropping resolution too far, or recovering resolution too slowly.

The second debug feature I’ve found useful is a mode which simply changes resolution every frame, increasing then decreasing, or randomly jumping around. This quickly exposes errors in shaders, or perhaps synchronisation issues which are otherwise hard to find, or hard to reproduce from a test report.
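
The ‘increasing then decreasing’ variant of that debug mode can be as simple as a triangle wave over the table of supported resolutions. A minimal sketch, with a hypothetical function name, assuming the resolutions are held in an indexable list:

```cpp
#include <cassert>
#include <cstdint>

// Debug helper: walk the resolution index up and down every frame,
// e.g. for count = 4 the sequence is 0,1,2,3,2,1,0,1,...
uint32_t PingPongResolutionIndex(uint64_t frame, uint32_t count) {
    assert(count > 0);
    if (count == 1) return 0;
    const uint32_t period = 2 * (count - 1);
    const uint32_t phase = uint32_t(frame % period);
    return phase < count ? phase : period - phase;
}
```

Swapping the frame counter for a random number gives the ‘randomly jumping around’ variant.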

A useful trick I found to surface issues more easily, particularly synchronisation issues (e.g. cache flush) is rendering down to extremely low resolutions, far below the floor the title will actually ship with.

6. Camera Movement

In early DRS implementations there was a view that DRS should be allowed to scale down at any time, but only scale up resolution if the camera is moving or rotating. The issue was that changing resolution with a static camera was considered noticeable, particularly at lower resolutions.

Titles often have infrequent GPU tasks, e.g. updating the sky, causing an occasional GPU frame time spike. This could cause a static camera to pin to a lower resolution than the GPU was normally capable of, causing some observers to wonder if a title actually had DRS at all! Resolution would only increase when the camera was on the move.

I discovered that with modern Temporal Anti-Aliasing or Super Resolution, allowing the resolution to scale even with a static camera can in fact improve image quality. These techniques all implement a sub-pixel jitter to accumulate the ground truth over time. Adding a constantly varying resolution with DRS assists this accumulation process on what you might consider a macro scale.

7. Problem Shaders

It’s common that a title implementing DRS might need to fix a few shaders which incorrectly make some assumptions about resolution. Normally, these would be in deferred passes or post processing. It might be that the shader only works correctly for resolutions that are a multiple of 8 pixels, (most common resolutions are divisible by 8) or perhaps a multiple of 2 pixels. Constraining render target sizes to be a multiple of 2 pixels is reasonable for DRS, 8 pixels less so.
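Constraining the render target size to a multiple of 2 could look something like the sketch below. `ScaledDimension` is a hypothetical helper, and rounding down (rather than to nearest) is an assumption made so the scaled target never exceeds what the DRS controller asked for.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper: scales one full-resolution dimension by the DRS scale
// and rounds down to a multiple of `multiple`, never returning less than that.
std::uint32_t ScaledDimension(std::uint32_t fullSize, float scale, std::uint32_t multiple = 2)
{
    const std::uint32_t scaled = static_cast<std::uint32_t>(static_cast<float>(fullSize) * scale);
    const std::uint32_t rounded = (scaled / multiple) * multiple;
    return std::max(rounded, multiple);
}
```

For example, a 1080 pixel high target at a scale of 0.625 gives 675, which is rounded down to 674.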

8. User Interface

Games don’t apply DRS to the user interface, and render the user interface at a fixed maximum resolution that frequently differs from the 3D render. There are many different solutions which can achieve this. You can use a second display plane on console, or alternatively upscale the image and either render the UI over the top, or composite a UI render target.

Using a display plane or compositing both have the advantage that you can render the UI at a lower framerate than the 3D scene, e.g. 30hz UI update for 60hz gameplay. You also have more choice about when to render the UI in the frame, and therefore perhaps make better use of async compute to do ‘something else’ at the same time as UI rendering. Rendering the UI over the top requires less RAM. As is often the case, there is no right or wrong solution, only trades.

9. CPU load?

One interesting idea is that if the CPU is running over the frame budget but the GPU is not, let DRS increase resolution and deliberately run over time, to match the time the CPU is taking. If doing this, be careful to filter out any big CPU spikes!
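As a rough sketch of that idea, the GPU time DRS aims for can be the larger of the frame budget and a filtered CPU frame time. Everything here is an assumption for illustration: the name `GpuTargetMs`, the use of a median over recent frames as the spike filter, and the `maxTargetMs` cap.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: picks the GPU time DRS should aim for, given the frame
// budget and the last few CPU frame times (all in milliseconds).
float GpuTargetMs(float budgetMs, std::vector<float> recentCpuMs, float maxTargetMs)
{
    if (recentCpuMs.empty())
        return budgetMs;

    // The median ignores isolated big CPU spikes in the history.
    const std::size_t mid = recentCpuMs.size() / 2;
    std::nth_element(recentCpuMs.begin(), recentCpuMs.begin() + mid, recentCpuMs.end());
    const float filteredCpuMs = recentCpuMs[mid];

    // Never aim below the normal budget, and cap how far we deliberately overrun.
    return std::min(std::max(budgetMs, filteredCpuMs), maxTargetMs);
}
```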

One issue with this trick is that if a future patch improves CPU performance, the title may be criticised for worsening GPU performance, evidenced by a lower resolution, when GPU performance is in fact unchanged. This trick can also cause issues for content optimisers who believe an area is good because the resolution is good. If you are going to implement this GPU time overrun to match CPU time, I recommend it’s in released code only, and not in development builds used for content building/optimisation, where CPU performance is not yet final.

10. The Future

One idea with some adoption is to scale settings as well as resolution, e.g. LOD swap distances, VRS tolerances etc.

A second interesting idea is to apply DRS not only to the internal resolution the title is rendered at, but to the output resolution also. ‘Dual DRS’ perhaps? For example, 1800p is often cited as ‘enough’ resolution that a player cannot notice the difference between 1800p and 4K on a console at living room distance, even with a large TV. (Of course, you can tell with a magnifying glass!) Typically, upscaling happens ‘mid frame’, meaning there are a number of expensive post processing passes that happen after upscaling, particularly any which blur the image. Running these at 1800p instead of 4K means processing only 69% of the pixels, a decent speed up that can be invested instead in increasing the internal resolution, the result of which can be a net higher image quality. However, if the title has even more GPU time spare, it could increase the 1800p output resolution towards 4K as a second DRS range, that perhaps kicks in when internal resolution exceeds say 1440p.
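The 69% figure follows if 1800p is taken as 3200×1800 and 4K as 3840×2160 (an assumption, since the exact dimensions are not stated above):

```latex
\frac{3200 \times 1800}{3840 \times 2160}
  = \frac{5\,760\,000}{8\,294\,400}
  = \left(\frac{5}{6}\right)^{2}
  \approx 0.694
```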

http://martinfullerblog.wordpress.com/?p=182
Compute Shader thread index to 2D coordinate for QuadReadAcross and HLSL derivative operations
Uncategorized

I often find it’s useful to have four threads (or lanes, depending on your preferred parlance) collaborate on 2×2 pixels. Traditionally this would be done with group shared memory. However with Shader Model 6, we have the QuadReadAcrossX/Y/Diagonal intrinsics.

The advantage of these intrinsics over group shared memory is that group shared memory is a resource, and using too much of it can restrict occupancy and reduce shader performance. Using shared memory may also come with synchronisation requirements and depending on your hardware, subtle performance penalties such as bank conflicts, which I won’t go into here. Generally speaking, I expect the QuadRead intrinsics to outperform group shared memory.

The documentation (see Quad-wide Shuffle Operations) gives the required order of the threads, the expected behaviour being:

[0, 1]
[2, 3]
QuadReadAcrossX: Thread0 will obtain Thread1's value and vice versa.
QuadReadAcrossX: Thread2 will obtain Thread3's value and vice versa.
QuadReadAcrossY: Thread0 will obtain Thread2's value and vice versa.
QuadReadAcrossY: Thread1 will obtain Thread3's value and vice versa.

This is also the exact order required in Shader Model 6.6 to issue HLSL operations that require derivatives in Compute, Mesh and Amplification shaders. See documentation.
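With that ordering, the lane each quad read pulls from can be expressed as an XOR on the lane index. This is a small CPU-side illustration rather than shader code, the function names are hypothetical, and the diagonal case (0 with 3, 1 with 2) is inferred from the X and Y cases above.

```cpp
#include <cstdint>

// Lane that QuadReadAcrossX reads from: swaps 0<->1 and 2<->3 within a quad.
std::uint32_t QuadPartnerX(std::uint32_t lane) { return lane ^ 1u; }

// Lane that QuadReadAcrossY reads from: swaps 0<->2 and 1<->3 within a quad.
std::uint32_t QuadPartnerY(std::uint32_t lane) { return lane ^ 2u; }

// Lane that QuadReadAcrossDiagonal reads from: swaps 0<->3 and 1<->2.
std::uint32_t QuadPartnerDiagonal(std::uint32_t lane) { return lane ^ 3u; }
```

Because only the bottom two bits change, the same functions work for any quad in the wave, e.g. lane 5 pairs with lane 4 in X and lane 7 in Y.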

The question then is, for a given threadIndex (obtained via WaveGetLaneIndex()), how do we generate the 2D coordinate the thread is to operate on? There is no default in the spec. Of course, you could just do a buffer read to map thread ID to a 2D coordinate, but you might want to use maths instead, depending on what your shader’s bottlenecks are.

I came up with two different layouts. A cheap version with two periods of tiling, and a more expensive version with three periods, which can be more useful for reduction operations, especially creating mip maps with wave64.

(Below: the lane to 2D coord cheap ‘rectangular’ mapping, I’ve highlighted the first 4 quads of a wave64)

In the more expensive version I was able to reduce the number of operations by packing the x,y coordinate into the top and bottom 16 bits of a dword, thus allowing me to perform the same operation concurrently on both x and y. The code might take some dissecting, so the unoptimised version is reproduced in the comments.

(Below: the more expensive lane to 2D coord ‘square’ mapping, I’ve highlighted the first 4 quads of a wave64)

The comment block shows what the tiling patterns are for wave64, but it’s easy to extrapolate what the code does for other wave sizes.

/*
    MIT License

    Copyright (c) Martin Fuller

    Permission is hereby granted, free of charge, to any person obtaining a copy
    of this software and associated documentation files (the "Software"), to deal
    in the Software without restriction, including without limitation the rights
    to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
    copies of the Software, and to permit persons to whom the Software is
    furnished to do so, subject to the following conditions:

    The above copyright notice and this permission notice shall be included in all
    copies or substantial portions of the Software.

    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
    IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
    FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
    AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
    LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
    OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
    SOFTWARE
*/

/*
    Generate correct mapping for HLSL derivatives and QuadReadAcross

    Slightly more expensive square version produces:

     0  1  4  5 16 17 20 21
     2  3  6  7 18 19 22 23
     8  9 12 13 24 25 28 29
    10 11 14 15 26 27 30 31
    32 33 36 37 48 49 52 53
    34 35 38 39 50 51 54 55
    40 41 44 45 56 57 60 61
    42 43 46 47 58 59 62 63

    // 3 periods of tiling, non optimised for clarity:
    x = ( index & 1      ) + ((index & 4) >> 1) + ((index & 16) >> 2);
    y = ((index & 2) >> 1) + ((index & 8) >> 2) + ((index & 32) >> 3);
*/
uint2 ThreadIndexToQuadCoordSquare(uint threadIndex)
{
    // duplicate index in top 16 bits, but pre-shifted right by 1
    threadIndex |= threadIndex << 15;

    // two bitwise ANDS for the price of one
    uint2 coord;
    coord.x  = threadIndex & 0x10001;
    coord.x |= (threadIndex >> 1) & 0x20002;
    coord.x |= (threadIndex >> 2) & 0x40004;

    coord.y = coord.x >> 16;
    coord.x &= 0x7;

    return coord;
}

/*
    Generate correct mapping for HLSL derivatives and QuadReadAcross

    Cheap rectangular version produces:

     0  1  4  5  8  9 12 13
     2  3  6  7 10 11 14 15
    16 17 20 21 24 25 28 29
    18 19 22 23 26 27 30 31
    32 33 36 37 40 41 44 45
    34 35 38 39 42 43 46 47
    48 49 52 53 56 57 60 61
    50 51 54 55 58 59 62 63

    // 2 periods of tiling, non optimised for clarity:
    x = (index & 1) + ((index & 12) >> 1);
    y = ((index & 2) >> 1) + ((index & 48) >> 3);
*/
uint2 ThreadIndexToQuadCoordRect(uint threadIndex)
{
    uint indexSHR1 = threadIndex >> 1;

    uint2 coord;
    coord.x = (threadIndex & 1) + (indexSHR1 & 6);
    coord.y = (indexSHR1 & 1) + ((threadIndex & 48) >> 3);

    return coord;
}

You could simplify the math a little for wave16, but these functions handle the common wave32 and wave64 sizes equally well.
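If you want to convince yourself the optimised bit tricks match the unoptimised maths in the comments, a quick CPU-side check is easy to write. The following is a C++ port for testing only: `Coord`, the two `...Reference` functions and `MappingsMatchReference` are names introduced here, not part of the shader code.

```cpp
#include <cstdint>

struct Coord { std::uint32_t x, y; };

// Port of the optimised square mapping: x and y are computed together in the
// bottom and top 16 bits of one dword.
Coord ThreadIndexToQuadCoordSquare(std::uint32_t threadIndex)
{
    threadIndex |= threadIndex << 15;
    std::uint32_t packed = threadIndex & 0x10001u;
    packed |= (threadIndex >> 1) & 0x20002u;
    packed |= (threadIndex >> 2) & 0x40004u;
    return { packed & 0x7u, packed >> 16 };
}

// Port of the cheap rectangular mapping.
Coord ThreadIndexToQuadCoordRect(std::uint32_t threadIndex)
{
    const std::uint32_t indexSHR1 = threadIndex >> 1;
    return { (threadIndex & 1u) + (indexSHR1 & 6u),
             (indexSHR1 & 1u) + ((threadIndex & 48u) >> 3) };
}

// The unoptimised forms from the comment blocks.
Coord SquareReference(std::uint32_t i)
{
    return { (i & 1u) + ((i & 4u) >> 1) + ((i & 16u) >> 2),
             ((i & 2u) >> 1) + ((i & 8u) >> 2) + ((i & 32u) >> 3) };
}

Coord RectReference(std::uint32_t i)
{
    return { (i & 1u) + ((i & 12u) >> 1),
             ((i & 2u) >> 1) + ((i & 48u) >> 3) };
}

// True if both optimised mappings agree with their reference for every lane.
bool MappingsMatchReference(std::uint32_t waveSize)
{
    for (std::uint32_t i = 0; i < waveSize; ++i)
    {
        const Coord s = ThreadIndexToQuadCoordSquare(i), sr = SquareReference(i);
        const Coord r = ThreadIndexToQuadCoordRect(i), rr = RectReference(i);
        if (s.x != sr.x || s.y != sr.y || r.x != rr.x || r.y != rr.y)
            return false;
    }
    return true;
}
```

Lane 16 is a handy spot check: it lands at (4, 0) in the square layout and (0, 2) in the rectangular one, matching the tables in the comments.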

http://martinfullerblog.wordpress.com/?p=154
PS2 Vector Unit Lighting on Shadowman2
Uncategorized

I got my break as a graphics engineer working on the first Shadowman title at Acclaim, starting before graduating, though I did graduate. Shadowman2 was my first full game, and the studio’s first PS2 title. Development was troubled to say the least. I stuck it out, writing all of the core rendering code, mostly in ASM on the vector units. However I then spent almost a year as lead fire fighting while the renderer remained basically static. This was a first generation PS2 title, unfortunately delayed by a year. We never did implement triangle strips, an essential optimisation for PS2, but the world geometry did support quads.

Early on character artist Robert Nash wanted an increase in poly budget to open up Shadowman’s chest, adding a point light and effect, which enabled the night to be truly dark without being unplayable, but allowing monsters to lurk unseen. (Since popularised by Diablo titles) Additionally we wanted to support environment and weapon based point lights, a maximum of four lights affecting any one ‘VU packet’ of world geometry at once. The problem was, we could not afford the required divide and sqrt instructions. (And this is per-vertex, the PS2 did not have pixel shaders) Four lights is also a natural number for the PS2 VU’s, since floating point SIMD instructions typically had a latency of 4 cycles between issue and being able to use the result, but you could issue an instruction every cycle.

The Playstation2 VU’s had two pipes, the upper is a SIMD4 single precision floating point unit, which operated on each of .xyzw in parallel. From memory, divides were very expensive at 7 or 13 clock cycles (the general divide instruction being faster than the reciprocal instruction!), sqrt and rsqrt were similarly expensive. Worse these operated on a single floating point value, there was no vector4 divide or vector4 sqrt instruction.

The textbook point light implementation for 4 point lights would have required well over 100 clock cycles, and of course we had to transform the vertex, apply 1/w to the UV’s etc. I needed something which ran an order of magnitude faster!

The first ‘trick’ was to not store vectors as .xyz, as it was almost impossible to use the w component of the 4 component registers effectively. (and this remained true throughout the next console generation as well, X360/PS3) So the four point light positions were stored in three vector registers by packing all four lights’ position X components in one register, all Y in the next and all Z in the last. (.xxxx, .yyyy, .zzzz) The PS2 VU’s had a great instruction set which allowed you to code the following very efficiently:

float4 Lx = lightPositionX.xyzw - worldPos.x;
float4 Ly = lightPositionY.xyzw - worldPos.y;
float4 Lz = lightPositionZ.xyzw - worldPos.z;
float4 distToLightSqr = Lx * Lx + Ly * Ly + Lz * Lz;

Quick and with no wasted register lanes. The next part was how to produce the attenuation term as cheaply as possible? This is most easily explained with a code fragment (though this isn’t what shipped)

float4 attenuation = 1.0 - distToLightSqr.xyzw * light.invRadius2.xyzw;
vertexColour += attenuation.x * lightColour0;
vertexColour += attenuation.y * lightColour1;
vertexColour += attenuation.z * lightColour2;
vertexColour += attenuation.w * lightColour3;

Here we pre-calculate 1/radius^2 on the CPU. In fact the above is a simplification for understanding. The final step was to multiply the attenuation calculation by light.colour and negate the last term, so that we can use the multiply-add instruction.

vertexColour += distToLightSqr.x * lightColour0DivRadius2 + lightColour0;
vertexColour += distToLightSqr.y * lightColour1DivRadius2 + lightColour1;
vertexColour += distToLightSqr.z * lightColour2DivRadius2 + lightColour2;
vertexColour += distToLightSqr.w * lightColour3DivRadius2 + lightColour3;

The CPU pre-calculated: -(lightColour / radius^2).

Just for completeness, here’s the whole ‘shader’ run on VU1, and with the required max check to avoid a negative contribution. This of course doesn’t look anything like modern code for point lighting.

float4 Lx = lightPositionX.xyzw - worldPos.x;
float4 Ly = lightPositionY.xyzw - worldPos.y;
float4 Lz = lightPositionZ.xyzw - worldPos.z;
float4 distToLightSqr = Lx * Lx + Ly * Ly + Lz * Lz;
vertexColour += max(distToLightSqr.x * lightColour0DivRadius2 + lightColour0, float3(0.0, 0.0, 0.0));
vertexColour += max(distToLightSqr.y * lightColour1DivRadius2 + lightColour1, float3(0.0, 0.0, 0.0));
vertexColour += max(distToLightSqr.z * lightColour2DivRadius2 + lightColour2, float3(0.0, 0.0, 0.0));
vertexColour += max(distToLightSqr.w * lightColour3DivRadius2 + lightColour3, float3(0.0, 0.0, 0.0));
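For anyone wanting to experiment with the curve, here is a scalar C++ sketch of the same maths, with the CPU-side precalculation of -(lightColour / radius^2) included. It is not the VU code: `Float3`, `FourLights`, `WithLight` and `LightVertex` are hypothetical names, and the loop stands in for the four SIMD lanes.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>

struct Float3 { float x, y, z; };

// Four point lights, with positions stored as all X, all Y, all Z.
struct FourLights
{
    std::array<float, 4> posX{}, posY{}, posZ{};
    std::array<Float3, 4> colour{};
    std::array<Float3, 4> negColourDivRadius2{}; // -(lightColour / radius^2)
};

// CPU-side setup: returns a copy of `lights` with one slot filled in.
// Unused slots keep a zero colour and so contribute nothing.
FourLights WithLight(FourLights lights, std::size_t slot, Float3 pos, Float3 colour, float radius)
{
    const float negInvRadius2 = -1.0f / (radius * radius);
    lights.posX[slot] = pos.x;
    lights.posY[slot] = pos.y;
    lights.posZ[slot] = pos.z;
    lights.colour[slot] = colour;
    lights.negColourDivRadius2[slot] = { colour.x * negInvRadius2,
                                         colour.y * negInvRadius2,
                                         colour.z * negInvRadius2 };
    return lights;
}

// Per-vertex lighting: one multiply-add and one max per light and channel,
// with no divide or sqrt.
Float3 LightVertex(const FourLights& l, Float3 worldPos)
{
    Float3 vertexColour{ 0.0f, 0.0f, 0.0f };
    for (std::size_t i = 0; i < 4; ++i)
    {
        const float Lx = l.posX[i] - worldPos.x;
        const float Ly = l.posY[i] - worldPos.y;
        const float Lz = l.posZ[i] - worldPos.z;
        const float distToLightSqr = Lx * Lx + Ly * Ly + Lz * Lz;

        vertexColour.x += std::max(distToLightSqr * l.negColourDivRadius2[i].x + l.colour[i].x, 0.0f);
        vertexColour.y += std::max(distToLightSqr * l.negColourDivRadius2[i].y + l.colour[i].y, 0.0f);
        vertexColour.z += std::max(distToLightSqr * l.negColourDivRadius2[i].z + l.colour[i].z, 0.0f);
    }
    return vertexColour;
}
```

A light of radius 2 gives an attenuation of 1 - 1/4 = 0.75 at distance 1, and the max clamps the contribution to zero once the vertex is outside the radius.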

The eagle eyed will have noticed also there’s no N.L term, I simply couldn’t afford that, so point lights attenuated with distance only, they didn’t attenuate with the surface angle. But that was it, 4 point lights without divide or sqrt instructions, efficiently utilising all four SIMD lanes and the multiply-add instruction.

This super cheap attenuation curve is not a great model of reality, somewhat the inverse of what real light does, but it worked well enough for magical ‘voodoo’ effects, with small/medium sized triangles, and the rasterizer’s non-perspective correct(!) interpolation of these radiance values across the triangle. (Only textures were perspective correct on PS2, vertex colour was not)

I shared the point lighting code with Acclaim Studios Cheltenham who used it in the 60fps racer Extreme G3 on PS2. I’m not aware of many other PS2 games doing point lighting.

A personal bug bear was that the collision system was extremely ropey, and there was no separate collision skin, so the artists kept the world geometry simple in areas the player could traverse, with a lot of the triangle budget often going to the ceiling! (Which didn’t help the point lighting look great next to the player) I eventually entirely rewrote the collision system prior to shipping, but far too late to affect the art content.

Dynamic lighting for instanced objects (as opposed to world geometry) worked differently. Here the CPU converted point lights into direction lights before uploading to VU1. That is the CPU computed a single distance attenuation value and a single light direction vector for the whole object. Meaning you could compute N.L angle attenuation very cheaply on the vector unit, but could not do per-vertex distance attenuation. I assume a lot of games did this.

Shadowman’s chest light didn’t look great automatically converted to a direction light. Being inside the model, the direction vector flicked wildly as the animation played. So I forced the light to point straight down, which helped also with his shadow and the visual aid for making jumps. The undesirable feature of using the render geometry as the collision geometry had a silver lining. I was able to use the collision geometry to project a disc between Shadowman’s feet and draw a shadow. The CPU found the subset of collision triangles overlapping the bounds of the shadow, and then rendered these with UV’s computed to render a disc, darkening the underlying geometry between Shadowman’s feet, following any undulations in the geometry perfectly. (Unlike Shadowman1, which rendered Shadowman squashed onto a single plane and in black, without any transparency)

Something I wish I’d had time to code was an option to light large instance objects the same way as world geometry, where the conversion of a point light to a direction light didn’t work well. Again I was fire fighting other problems all the way to shipping.

The live side levels (as opposed to dead side) had a sun/moon and a real time day night cycle was essential to the game. Instanced objects periodically raycast to the sun, I think every 16 frames, and would fade their contribution from the sun/moon up or down over time. Live side world geometry vertices had precalculated visibility from the sun for 16 different times of day, stored as individual bits. This same visibility was used at night for the moon. Something which really helped our RAM and storage problems was that the only world geometry that needed vertex normals was liveside outdoor sectors. So deadside levels and indoor sectors didn’t have vertex normals. We also had height fog, which was pretty unusual for the time and I don’t think even possible on PC with DirectX7.

I had a prototype of just-in-time world lighting on VU0 which was a lot faster but didn’t ship, due to there being some edge cases I never found time to code, including adding in the day night cycle shadows. We did however do just-in-time vertex skinning on VU0, so skinning and render ran in parallel on the two vector units.

Something I prototyped after Shadowman2 shipped was ‘dot3’ bump mapping on the PS2. This was not the full screen multi-pass algorithm Sony developed. Instead I uploaded 256 normals to VU0, computed N.L and point rendered the result to a texture palette as the render target. The idea was that I would do distance attenuation per vertex as in Shadowman2, and then modulate this per texel with the N.L term from the texture. This did look good for the time (I wish I had a screenshot!), however the problems here were that because the PS2 could only single texture, you needed a pass per light, and then you had to render the geometry again to blend the albedo texture on top. We were going to try this for Forsaken2, perhaps only for the ships and select models, avoiding organic shapes, so the 256 normal limitation was likely going to work out ok.

Sadly Forsaken 2 was canned, Acclaim folded, and I never found the opportunity to implement this tech in another title. Of course ‘dot3’ didn’t really take off, and was superseded by the far superior and now ubiquitous tangent space normal mapping.

http://martinfullerblog.wordpress.com/?p=108
Fast, Near Lossless ‘Compression’ of Normal Floats
Uncategorized

Here’s a trick for lossless storage of ‘normal’ floating point numbers I came up with years ago, but was only reminded of recently. Realising I haven’t seen it anywhere else since, time for a blog.

The IEEE754 single precision ‘float’ is ubiquitous in computer graphics, and much better understood than it used to be, thanks to some great blogs and engineers pushing the envelope being forced to get to grips with its limitations.

In computer graphics, it’s extremely common to store normal numbers, signed [-1..1] or unsigned [0..1], so much so, we have universal GPU support for SNORM and UNORM formats. Of course it’s also common to quantize normal numbers to use less than 32 bits, with great research in particular into high quality, compact storage of three dimensional normal vectors, for g-buffers and other applications. These are lossy, but that’s the point.

My technique stores an unsigned normal 32bit floating point number using only 24 bits with a maximum error of 5.96046448e-8 (0.0000000596046448), and with zero error at 0.0 and 1.0. This is trivially extended to signed normal numbers.

To give one use case, storing normalised linear depth after rendering, you could pack linear depth into 24 bits and stencil into the other 8 bits. Giving an old school style D24S8 surface, but with negligible loss of precision vs a full 32bit float.

There are plenty of excellent resources on how floating point storage works, I’m not going to repeat these, but I need to cover just a little of how a ‘float’ is stored to explain the technique. This is the simplistic way I think of the three components of the IEEE754 single precision float:

  • A sign bit – simple
  • An 8 bit exponent – the power of 2 range that contains the number
  • 23 bits of mantissa – an interpolation from the lower power of 2 in this range, up to but not quite including the next power of 2.

So for example, the exponent might specify ranges [0.5..1} or [1..2}, or [4..8} etc. It’s the range [1..2} which is key to this technique, since the delta of the stored numbers in this range is 1, or nearly 1 to be precise.

Dealing with unsigned normal numbers only for a moment, if we add 1 to our number, then we can store off the 23bits of mantissa and discard the rest of the floating point representation. To reconstruct we bitwise OR in the infamous 0x3f800000 (1.0) and then subtract 1 to get back into the original range. Unfortunately we also want to handle the case that the number stored is exactly 1, so we need another bit for that. This then is how we get to 24 bits, move the normal float into the [1..2} range, store the 23bit mantissa and store an extra bit to indicate if the value is exactly 1.

Here’s the code in HLSL, note there’s actually a problem with the compress function, but I’ll come to that in a bit.

// note this function has an edge case where it breaks, see below for why and a fixed version!
uint CompressNormalFloatTo24bits(float floatIn)
{
    return (floatIn == 1.0) ? (1 << 23) : asuint(floatIn + 1.0) & ((1 << 23) - 1);
}

// input needs to be low 24 bits, with 0 in the top 8 bits
float DecompressNormalFloatFrom24bits(uint uintIn)
{
    return (uintIn & (1 << 23)) ? 1.0 : asfloat(uintIn | 0x3f800000) - 1.0;
}

Clearly both ‘compression’ and ‘reconstruction’ are extremely cheap operations, especially as the compiler can resolve some of the bitwise operations to a constant. Why any error at all? The error creeps in from the fact we are manipulating the floating point number out of the [0..1} range, the storage of which uses one of many different possible exponents, then by adding 1 we move into a single exponent range that covers all of [1..2}, and this is not a lossless operation. However typically in computer graphics, an engineer is unlikely to be put off by a max floating point accuracy error of 5.96046448e-8.

So what’s the problem with the above compression function? The issue is, there is one number which can be stored in the [0..1} range, but when we add one, it cannot be represented in the [1..2} range. This is 0.99999994, and the hexadecimal 0x3f7fffff gives a clue as to the problem: all mantissa bits are set. When we add 1.0 to this, we get 2.0, not 1.99999994 (as this number is not representable). 2.0 is not covered by our chosen exponent, and so the above function breaks. Fortunately the fix for our compression function is simple and ordinarily no additional cost, at least on a GPU:

uint CompressNormalFloatTo24bits(float floatIn)
{
    // careful to ensure correct behaviour for normal numbers < 1.0 which roundup to 2.0 when one is added
    floatIn += 1.0;    
    return (floatIn >= 2.0) ? (1 << 23) : asuint(floatIn) & ((1 << 23) - 1);
}

The eagle-eyed will have noticed I changed == to >=, this is just a safety feature for bad input and not actually part of the fix, clamping our input for free, which is always nice.

Handling signed normal floats we need to store the sign bit also which is trivial, and then we can use the same functions by taking the abs of the input. Of course you might wish to keep to 24 bits, and so you might sacrifice the least significant mantissa bit.
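To try this on the CPU, the fixed functions port directly to C++. The signed pair below is only one possible layout, assuming a full 25 bits with the sign stored in bit 24. `AsUint`/`AsFloat` and the two `...Signed...25bits` names are introduced here for the sketch.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Bit casts between float and uint32_t.
std::uint32_t AsUint(float f)  { std::uint32_t u; std::memcpy(&u, &f, sizeof(u)); return u; }
float AsFloat(std::uint32_t u) { float f; std::memcpy(&f, &u, sizeof(f)); return f; }

// Unsigned normal float to 24 bits: 23 mantissa bits of (value + 1), plus
// bit 23 as the "exactly 1" flag. Anything that reaches 2.0 sets the flag.
std::uint32_t CompressNormalFloatTo24bits(float floatIn)
{
    floatIn += 1.0f;
    return (floatIn >= 2.0f) ? (1u << 23) : (AsUint(floatIn) & ((1u << 23) - 1u));
}

// Input needs to be the low 24 bits, with 0 in the top 8 bits.
float DecompressNormalFloatFrom24bits(std::uint32_t uintIn)
{
    return (uintIn & (1u << 23)) ? 1.0f : AsFloat(uintIn | 0x3f800000u) - 1.0f;
}

// Assumed signed layout: the sign bit is moved to bit 24, the magnitude uses
// the unsigned functions above.
std::uint32_t CompressSignedNormalFloatTo25bits(float floatIn)
{
    const std::uint32_t sign = (AsUint(floatIn) >> 31) << 24;
    return sign | CompressNormalFloatTo24bits(std::fabs(floatIn));
}

float DecompressSignedNormalFloatFrom25bits(std::uint32_t uintIn)
{
    const float magnitude = DecompressNormalFloatFrom24bits(uintIn & 0xffffffu);
    return (uintIn & (1u << 24)) ? -magnitude : magnitude;
}
```

Feeding in the problem value 0x3f7fffff now produces the "exactly 1" encoding, and 0.5 round trips exactly since 1.5 is representable in the [1..2} range.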

24bits is of course a bit of an odd size for a computer to deal with, so this is really a tool in your toolbox for packing with other data. The ability to drop least significant mantissa bits gives some flexibility in packing.

I’ve only used this on IEEE754 single precision floats, thinking out loud, there are some interesting possibilities for other floating point representations:

  1. Half precision floats (and NVidia’s TensorFloat) have 10 bits of mantissa. A three component signed normal vector would require 12+12+12 = 36 bits. To get into 32 bits you could either drop 1 or 2 mantissa bits from each component, or you might choose to drop the ability to store exactly -1 and 1, saving a bit from each and only having to drop 1 mantissa bit total.
  2. Brain floats have 7 bits of mantissa, so this trick for unsigned normal numbers would only require a byte.

As a bonus, here’s some functionality for C++ guys wanting to run the same functions

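#include <cstdint>
#include <cstring>

// Reinterpret the bits of a float as an unsigned integer. memcpy is used
// because reading an inactive union member is undefined behaviour in C++.
uint32_t asuint(float input)
{
    uint32_t u;
    std::memcpy(&u, &input, sizeof(u));
    return u;
}

// Reinterpret the bits of an unsigned integer as a float.
float asfloat(uint32_t input)
{
    float f;
    std::memcpy(&f, &input, sizeof(f));
    return f;
}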

http://martinfullerblog.wordpress.com/?p=75
Min/Max Buffer Precision Improvement
Uncategorized

This is a simple trick I came up with years ago. I’ve finally decided to create a blog and share a little with the community, wish me luck!

In computer graphics you often end up storing min/max data pairs. An obvious example is in deferred lighting, where a game engine will compute and store the min and max of all depth values in a screen tile, say 16×16 pixels large. Lights are then culled against this depth range. (or multiple ranges, in the case of clustered schemes)

Of course, graphics engineers are conscious of memory use and moreover implications for bandwidth and cache pressure. Therefore it’s common to quantize data to the smallest type we can get away with. So for example we might choose to store min/max linear depth values as 16bit unorms, e.g. using a 2D texture with the format DXGI_FORMAT_R16G16_UNORM. Probably as I do, converting from reverse non-linear Z during rasterization to forward linear Z for deferred passes.

The min/max texture for a terrain scene looks like this:

The red channel stores the minimum depth value, and the green channel the maximum in each screen tile. RG 1,0 (an illegal value) is being used to denote a clear tile, i.e. sky. Where min and max depth are similar, we have some shade of yellow based on distance. When there is a large depth range, the colour tends towards green, as the green channel is storing a max value substantially larger than the min value. Intuitively this occurs on silhouettes. Such storage is common, but wasteful of precision, since both min and max channels can store anything in the range 0..1.

Depth was originally a 32bit float and in converting to 16bits we lost a lot of information.

Fortunately we have an exploitable constraint in our data, i.e. min <= max and conversely max >= min. The trick is to make one of these values relative to the other. In the following I choose to gift min more precision, but it’s just as easy to do the same for max instead.

Using the same texture format, instead of storing min and max directly, I store max and a delta which interpolates between 0 and max. So as long as max is less than 1.0, we have improved the precision of min. This is trivial to code:

// encode for texture write
encodedRG = float2(min / max, max);
// decode after texture read
min = encodedRG.x * encodedRG.y;
max = encodedRG.y;

In this scheme the green channel of the texture looks exactly as it did before, however the red channel is drastically altered.

When there is very little difference between min and max, the encoded delta value is close to 1. Only when a large depth discontinuity exists do we see smaller values in the red channel.

If the max value is 0.25 for example, as it is for the far mountains in this scene, the minimum value benefits from effectively four times the precision, since the same 16 bits are now being used to store values in the range 0 – 0.25, instead of 0 – 1.
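The precision gain is easy to check on the CPU by emulating the 16bit unorm conversion. This is a sketch with hypothetical names throughout; it also makes two choices that are assumptions rather than part of the snippet above: the encoder divides by max as it will be read back from the texture, and a max of 0 is stored with a ratio of 0 to avoid a divide by zero.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Emulates writing to and reading from a 16bit unorm channel.
std::uint16_t ToUnorm16(float v)
{
    const float clamped = std::min(std::max(v, 0.0f), 1.0f);
    return static_cast<std::uint16_t>(std::lround(clamped * 65535.0f));
}

float FromUnorm16(std::uint16_t q) { return float(q) / 65535.0f; }

struct MinMax16 { std::uint16_t r, g; };

// Direct storage: min in red, max in green.
MinMax16 EncodeDirect(float minDepth, float maxDepth)
{
    return { ToUnorm16(minDepth), ToUnorm16(maxDepth) };
}

float DecodeDirectMin(MinMax16 e) { return FromUnorm16(e.r); }

// Relative storage: red holds min as a fraction of the stored max.
MinMax16 EncodeRelative(float minDepth, float maxDepth)
{
    const std::uint16_t g = ToUnorm16(maxDepth);
    const float storedMax = FromUnorm16(g);
    const float ratio = (storedMax > 0.0f) ? minDepth / storedMax : 0.0f;
    return { ToUnorm16(ratio), g };
}

float DecodeRelativeMin(MinMax16 e) { return FromUnorm16(e.r) * FromUnorm16(e.g); }
```

With direct storage the worst case error on min is half a step, about 7.6e-6. With relative storage that half step is scaled by max, so for a max of 0.25 the worst case drops to about 1.9e-6.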

(Note I have modified the histogram of the images slightly to make the colours stand out more)

This results in a ~0.4% speed up in my deferred lighting, due to fewer pixels processed with zero attenuation. Not bad for such a small change, but not earth shattering either. YMMV, and of course improvements in precision are sometimes about quality rather than optimisation.

A future extension would be to stop using a 2 channel texture and instead pack delta min and max into a single R32_UINT. The potential benefit here would be to gift a different number of bits to each of delta min and max. Say giving max 17 bits and delta min 15 bits. This of course requires the shader to perform more operations in packing and unpacking.
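A sketch of what that packing might look like, using the 17/15 bit split mentioned above. The function names and the bit layout (max in the top 17 bits, delta min in the bottom 15) are assumptions, as is dividing by the max as it will be read back.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

constexpr std::uint32_t kMaxBits = 17;   // bits given to max
constexpr std::uint32_t kDeltaBits = 15; // bits given to delta min

// Quantises a [0..1] value to an unsigned integer with the given bit count.
std::uint32_t Quantise(float v, std::uint32_t bits)
{
    const float scale = float((1u << bits) - 1u);
    const float clamped = std::min(std::max(v, 0.0f), 1.0f);
    return static_cast<std::uint32_t>(std::lround(clamped * scale));
}

float Dequantise(std::uint32_t q, std::uint32_t bits)
{
    return float(q) / float((1u << bits) - 1u);
}

// Packs max into the top 17 bits and min/max into the bottom 15 bits.
std::uint32_t PackMinMax(float minDepth, float maxDepth)
{
    const std::uint32_t qMax = Quantise(maxDepth, kMaxBits);
    const float storedMax = Dequantise(qMax, kMaxBits);
    const float ratio = (storedMax > 0.0f) ? minDepth / storedMax : 0.0f;
    return (qMax << kDeltaBits) | Quantise(ratio, kDeltaBits);
}

struct MinMax { float minDepth, maxDepth; };

MinMax UnpackMinMax(std::uint32_t packed)
{
    const float maxDepth = Dequantise(packed >> kDeltaBits, kMaxBits);
    const float delta = Dequantise(packed & ((1u << kDeltaBits) - 1u), kDeltaBits);
    return { delta * maxDepth, maxDepth };
}
```

The extra shifts, masks and int/float conversions are the additional shader cost referred to above.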

http://martinfullerblog.wordpress.com/?p=45