Tuesday, July 25, 2017

Vectorized Production Path Tracing

I'll be giving a presentation at the High-Performance Graphics 2017 conference on our adventures vectorizing our new in-house renderer. Drop by and say hi if you happen to be there.

ABSTRACT

This paper presents MoonRay, a high performance production rendering architecture using Monte Carlo path tracing developed at DreamWorks Animation. MoonRay is the first production path tracer, to our knowledge, designed to fully leverage Single Instruction/Multiple Data (SIMD) vector units throughout. To achieve high SIMD efficiency, we employ Embree for tracing rays and vectorize the remaining compute intensive components of the renderer: the integrator, the shading system and shaders, and the texturing engine. Queuing is used to help keep all vector lanes full and improve data coherency. We use the ISPC programming language to achieve improved performance across SSE, AVX/AVX2 and AVX512 instruction sets. Our system includes two functionally equivalent uni-directional CPU path tracing implementations: a C++ scalar depth-first version and an ISPC vectorized breadth-first wavefront version. Using side-by-side performance comparisons on complex production scenes and assets, we show our vectorized architecture, running on AVX2, delivers between a 1.3× and 2.3× speed-up in overall render time, and up to 3×, 6×, and 4× speed-ups within the integration, shading, and texturing components, respectively.

Link to Vectorized Production Path Tracing paper.

Tuesday, January 10, 2012

Safer Data Sharing Between Threads

[This article was originally posted on AltDevBlogADay]

Let's say we are given the task of optimizing a serial program to run on a multi-core system. In the process, we've isolated one block of code where we can achieve a very nice speedup by delegating it to a second thread, thus allowing the main thread to get a head start on the next portion of work. At some later point in time the main thread syncs with this worker thread before it uses the results. Since the original serial code is all working correctly, we're trying to be as minimally invasive as possible so as not to inadvertently break anything; we decide to keep all the data structures as they are and allow the worker thread to directly modify the single 'global' copy of the data. Also, since we're interested in performance, any fine-grained, per-variable locking is frowned upon. All of this should be fine, we reason: while the worker thread is modifying the data, the main thread won't need access to it, and we have a sync in place to guarantee the worker thread is done before the main thread does eventually need access. Here is the scenario in code:

Main thread:

    // single shared variable here, but may be hundreds of variables in practice
    int shared_var = 2;
    uint32_t task = SubmitAsyncTask(DoWorkerTask);

    // main thread goes to work on next task...

    SyncOnTask(task);

    // prints 5
    printf("%d\n", shared_var);

Worker thread:

    void DoWorkerTask()
    {
        shared_var += 3;
    }

All is well until a new feature request comes in months down the road which causes us to innocently modify some of the shared data during a portion of the frame which we really shouldn't have:

Main thread:

    // single shared variable here, but may be hundreds of variables in practice
    int shared_var = 2;
    uint32_t task = SubmitAsyncTask(DoWorkerTask);

    // main thread goes to work on next task...

    // BAD, we shouldn't be touching this variable during this portion of the frame
    shared_var = 5;

    SyncOnTask(task);

    // is this 5, or 8?
    printf("%d\n", shared_var);

Don't let the simplicity of this contrived example fool you: the same access pattern, when intermixed with another 100k lines of code, may not be quite so easy to spot.

In my experience, this is a pretty common pattern in games programming. It manages to completely avoid the overhead of any fine-grained synchronization primitives, such as critical sections, when accessing data. Instead, it depends on an implicit contract between the main thread and the worker thread, which goes something like: "I, the main thread, promise not to modify any elements belonging to a small, well defined subset of data which I will share with you, during any point in the frame at which it is possible that you, the worker thread, may be transforming it. In return, you, the worker thread, may only modify this small, well defined subset of the data and nothing else."

Already we see that this contract is pretty fragile at best, filled with good intentions, but with little in the way of actual concrete enforcement. Maybe it's documented with comments, maybe not. Either way, the code as it stands is tough to maintain and modify. Fellow co-workers are likely to dread the thought of going near it.

The problem is that it's non-obvious, from just looking at any block of code partaking in this contract, what data it's allowed to touch at that point in time. It may even be quite tough to reason about when in a frame the code gets executed, and trickier still if that execution point jumps around from frame to frame due to unpredictable events. Comments can only get you so far; ideally we'd like some compiler guarantees about what we can and can't access. I don't know how to make the compiler detect these cases, but the next best thing, runtime checking, is actually quite simple.

During some recent delving into what's been going on in the C++11 world with regards to concurrency (lambdas, you rock!), I came across this surprisingly straightforward solution to the problem (the relevant section is around 19:45). Overlooking the syntactical C++ sugar coating, the concept itself is very basic and translates to any language: simply transfer ownership of the shared data when we cross these thread boundaries. We can do this by determining what data actually needs to be shared, putting it into its own class/struct, and then ensuring that only one pointer can ever point to it at any one instant in time.

In C++11, the machinery for this is built in by making use of std::unique_ptr and the move semantics which r-value references allow (not essential for understanding this article; there's a small sketch of that version at the end of the post). It's perhaps simpler to visualize with plain C-style pointers though; taking the previous example and wrapping the shared data up, as described, might look something like the code below:

Main thread:

    struct Shared
    {
        int shared_var;
    };

    Shared shared;

    Shared *main_thread_access = &shared;
    Shared *worker_thread_access = nullptr;

    main_thread_access->shared_var = 2;

    uint32_t task = SubmitAsyncTask(DoWorkerTask);

    // ...

    // error: race condition.  If the worker thread owns the data at this
    // point, main_thread_access is null and we'll crash here during
    // development/testing rather than silently racing
    main_thread_access->shared_var = 5;


    // ...

    SyncOnTask(task);

    printf("%d\n", main_thread_access->shared_var);

Worker thread:

    void DoWorkerTask()
    {
        // take ownership of the shared data via the file-scope access pointers
        DataPointerAcquire(&main_thread_access, &worker_thread_access);

        worker_thread_access->shared_var += 3;

        DataPointerRelease(&main_thread_access, &worker_thread_access);
    }

Notice how the conditions of the contract between the main and worker thread are now more explicit. We can see which data is intended to be shared between threads since they're accessed through specific pointers. We can even search in a text editor for all the spots in a program in which we are altering any particular block of shared data. Furthermore, we've introduced the concept of data ownership. Either the main thread or the worker thread owns the data at any point in time. If the main thread makes any attempt to modify a shared variable whilst the worker thread owns it, we'll dereference a null pointer and can readily see in the debugger where the race condition happened, rather than having to backtrack from the artifacts the race leaves behind at a later part in the frame (on someone else's machine; on a bug which can only be reproduced once every 10 hours of testing; on a final code built; oh, and we have to hit gold by next week; just sayin').

The DataPointerAcquire function should be called on the thread which wants to take ownership of the shared data. Later, a matching call to DataPointerRelease should be made from the same thread when it wants to relinquish ownership. Those functions might look something like:

    // called on the thread which wants to take
    // ownership of the shared data
    template <class T>
    inline void DataPointerAcquire(T **old_owner, T **new_owner)
    {
        assert(*old_owner);
        T *ptr = *old_owner;
        *old_owner = nullptr;
        MemoryBarrier();
        *new_owner = ptr;
    }

    // called on the thread which currently owns the shared data
    // (i.e. the thread which issued the previous DataPointerAcquire)
    template <class T>
    inline void DataPointerRelease(T **orig_owner, T **thread_ptr)
    {
        assert(*orig_owner == nullptr && *thread_ptr);

        // make the computation results visible before releasing our ownership
        MemoryBarrier();

        *orig_owner = *thread_ptr;

        // make the master pointer (orig_owner) visible to all other threads
        MemoryBarrier();

        *thread_ptr = nullptr;
    }


You may have noticed there is now an extra dereference required to access the data. This, however, seems a small price to pay for the additional peace of mind and maintainability, both for you and your co-workers.
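
For completeness, here is roughly what the std::unique_ptr version mentioned earlier might look like. It's just a minimal sketch: std::thread and join() stand in for the SubmitAsyncTask/SyncOnTask calls used above so the example is self-contained, and the worker hands ownership back through an extra 'give_back' pointer purely for illustration.

    #include <cassert>
    #include <cstdio>
    #include <memory>
    #include <thread>
    #include <utility>

    struct Shared
    {
        int shared_var;
    };

    // the worker takes ownership by value and hands it back through
    // 'give_back' once the work is done
    void DoWorkerTask(std::unique_ptr<Shared> data, std::unique_ptr<Shared> *give_back)
    {
        data->shared_var += 3;
        *give_back = std::move(data);              // relinquish ownership
    }

    int main()
    {
        std::unique_ptr<Shared> main_thread_access(new Shared);
        main_thread_access->shared_var = 2;

        std::unique_ptr<Shared> returned;
        std::thread worker(DoWorkerTask, std::move(main_thread_access), &returned);

        // main thread goes to work on the next task...

        // main_thread_access is now null, so touching it here is an obvious
        // null dereference rather than a silent race
        assert(!main_thread_access);

        worker.join();                             // sync point
        main_thread_access = std::move(returned);  // take ownership back

        // prints 5
        printf("%d\n", main_thread_access->shared_var);
        return 0;
    }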

Tuesday, June 21, 2011

How a DSLR could help to take better handheld photos

I got a Canon Digital Rebel XTi back in 2007, mainly to take family photos. It wasn't a coincidence that it happened to be the year my daughter was born. More recently, however, I've started to take more interest in the subject of photography. It's a great hobby for all the graphics programmers out there to build up their rendering intuition, since there's no faking it - this is how light actually works!

In common with graphics programming, much of photography is an optimization problem. Most of the time we don't have as much light hitting the sensor as we'd like, so the problem becomes how to optimize the ISO, shutter speed, and f-number settings to give the best quality image, free of camera shake. It may be that the Rebel XTi, being a low-end camera, doesn't have all the whiz-bang features that the latest and greatest have, but I find the procedure for setting up the optimal combination of these 3 settings haphazard to say the least. How many times has what could have been a great shot turned out a little too "soft"? There's a powerful computer inside the DSLR in my hands; why can't it do a better job of helping me determine the optimal settings, much in the same spirit that it can auto-focus for us? I could use manual focus all the time, but autofocus is usually faster and more accurate.

I've been thinking a bit lately about how the situation could be improved, and here is one idea. There are many photographic scenarios, but the one I find myself in most is shooting handheld when light isn't abundant. The most common example would be taking indoor pictures. I try not to use the flash to avoid the harsh look it gives. So the question is: given that I don't have all the light I would like, and am shooting handheld, what is the optimal combination of those 3 settings (namely ISO, shutter speed, and f-number)?

What I do currently for indoor handheld photos is set the ISO to something like 400 and the camera mode to Av (aperture priority). This lets me choose the focal length and aperture based on how I want to compose the image, which fully dictates the f-number the camera uses. The only variable left is the shutter speed, which the camera picks based on how much light it meters in the scene. This is quite a suboptimal arrangement: it'll happily pick relatively long shutter speeds, leaving me with blurry images which are only fit for the trash. A much more preferable scheme would be a new "handheld" mode. It might work like this:

  1. I tell the camera I'm shooting handheld (using a new setting on the mode dial).
  2. I set up a focal length and aperture based on my composition desires. This results in an f-number which is usable by the camera.
  3. The camera computes a "smart" shutter speed which will keep the image sharp for the average handheld user. 

But what is a good value for this "smart" shutter speed? A typical heuristic for handheld shutter speeds is 1 / focal length. For example, if shooting at 100mm, then the shutter speed should be 1/100 seconds. This is generally regarded as an upper bound on the shutter speed, meaning that we should really pick a value where the shutter is open for slightly less time. The heuristic also assumes a full-frame sensor, so we need to take the crop factor of the camera into consideration. So, in this example, we'd need less than 1/160 second for a Rebel (since it has a crop factor of 1.6). Also, if we are using a lens with image stabilization or vibration reduction, that should be taken into account here too (let's see how IS really measures up to the 3 to 4 stop manufacturer claims).

Once a shutter speed is chosen that will stack the odds in our favor against camera shake, the camera picks an ISO - this fills in the last remaining parameter. The settings menu should allow the user to set the maximum permissible ISO, and if the camera needs an ISO higher than allowed, it just won't take the picture (much like it doesn't take a picture when it can't focus). So why have ISO be the final parameter as opposed to the first (as with Av mode)? Well, two reasons spring to mind:

  1. I can do more with a photo with high noise levels than with a blurry photo. In both cases we're missing information and can't fully recover the original source data, but algorithmically we can do a better job of removing a certain amount of noise from an image than we can of making a blurry image sharp, which in most cases is a lost cause.
  2. Higher-end, more expensive DSLRs respond much better at high ISO settings than cheaper ones. A blurry photo is a blurry photo on any camera, but a noisy photo on a cheap camera may be a clean photo on an expensive one.

So how would this work in practice? Wikipedia has a good page on exposure value here. When a camera meters a scene it effectively comes up with a single exposure value (EV), which is a number the 3 camera settings of interest must combine to produce. Plotted logarithmically in 2D, with time on one axis, and the f-number on the other, we get diagonal lines of constant exposure value as seen here.
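
For reference, the exposure value being plotted there (at ISO 100, as that page defines it) is

    EV = log2(N^2 / t)

where N is the f-number and t is the shutter time in seconds; raising the ISO from 100 to S lets the camera use N and t combinations that are log2(S / 100) stops higher for the same scene.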

If we extended this concept into 3D by adding an extra dimension for ISO (which is probably what cameras are doing anyway, I'd guess), we'd get planes of constant exposure instead. A given DSLR-computed EV value would naturally dictate the plane we need to use. This, together with an f-number from the user, would allow the camera to compute a "smart" shutter speed, set the resultant ISO, and voila.
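
As a rough sketch of that arithmetic (my own back-of-the-envelope version in plain C++, with a made-up metered EV value, not anything resembling real camera firmware), the handheld mode might boil down to something like:

    #include <cmath>
    #include <cstdio>

    // heuristic handheld shutter: 1 / (focal length * crop factor), nudged a
    // little faster since 1/focal-length is only an upper bound
    double SmartHandheldShutter(double focal_length_mm, double crop_factor)
    {
        return 1.0 / (focal_length_mm * crop_factor * 1.25);
    }

    // solve log2(N^2 / t) = EV100 + log2(ISO / 100) for the ISO needed to hit
    // the metered exposure value with the user's f-number and our shutter speed
    double RequiredIso(double metered_ev100, double f_number, double shutter_s)
    {
        return 100.0 * (f_number * f_number) / (shutter_s * std::pow(2.0, metered_ev100));
    }

    int main()
    {
        double t   = SmartHandheldShutter(100.0, 1.6); // 100mm on a 1.6x crop body, ~1/200s
        double iso = RequiredIso(6.0, 4.0, t);         // dim indoor scene (EV100 ~ 6) at f/4

        // a real camera would refuse to take the picture if this exceeded the
        // user's maximum permissible ISO
        printf("shutter = 1/%.0f s, ISO = %.0f\n", 1.0 / t, iso);
        return 0;
    }

If the computed ISO lands above the user's cap, the mode simply refuses to fire, exactly as described above.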

Sports photography, where we want to freeze motion, could also benefit from a variation on this mode. Instead of computing an appropriate handheld shutter speed, the camera would use a speed suitable for what's being shot - say, 1/1000th of a second for a football game. The final step would remain the same, and an appropriate ISO value would be computed.

This is a relatively new subject to me; do cameras already do this, perhaps?

[Edit: it's been pointed out that handheld skill, fatigue, and environment also affect the optimal shutter speed. I think these would best fit into a single shutter speed compensation setting (in much the same spirit as cameras currently have an EV compensation setting).]

Thursday, February 24, 2011

Improved Normal-map Distributions

[Note, this was originally posted here on AltDevBlogADay]
Many moons ago at Insomniac, we used to use a partial derivative scheme to encode normals into texture maps. Artists complained that it was having detrimental effects on their normal maps, and rightly so. Then one day, our resident math guru Mike Day turned me onto this little trick in order to make amends. I hadn’t come across it before so thought it might be worth sharing.
A common technique for storing normals in a texture map is to compact them down to a 2 component xy pair and reconstruct z in the pixel shader. This is commonly done in conjunction with DXT5 compression since you can store one component in the DXT rgb channels and the other in the DXT alpha channel, and they won’t cross pollute each other during DXT compression stages. DirectX 10 and 11 go a step further and introduce a completely new texture format, BC5, which caters for storing 2 components in a texture map explicitly.
The code to reconstruct a 3 component normal from 2 components typically looks like this:
    float3 MakeNormalHemisphere(float2 xy)
    {
        float3 n;
        n.x = xy.x;
        n.y = xy.y;
        n.z = sqrt(1 - saturate(dot(n.xy, n.xy)));
        return n;
    }
Graphically, what we’re doing is projecting points on the xy plane straight up and using the height of intersection with a unit hemisphere as our z value. Since the x, y, and z components all live on the surface of the unit sphere, no normalization is required.



Is this doing a good job of encoding our normals? Ultimately the answer depends on the source data we're trying to encode, but let's assume that we make use of the entire range of directions over the hemisphere, with a good proportion of those nearing the horizon. In the previous plot, the quads on the surface of the sphere are proportional to the solid angle covered by each texel in the normal map. As our normals near the horizon, we can see these quads getting larger and larger, meaning that we lose more and more precision as we try to encode them. This can be illustrated in 3D by plotting the ratio of the differential solid angle over the corresponding 2D area we'd get from projecting it straight down onto the xy plane. Or equivalently, since the graph is symmetrical around the z-axis, we can take a cross section and show it on a 2D plot.
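(As a quick sanity check of that ratio: for the hemisphere mapping z = sqrt(1 - x*x - y*y), the solid angle covered per unit of projected xy area works out to 1/z = 1/sqrt(1 - x*x - y*y), which grows without bound as we approach the edge of the unit circle.)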


The vertical asymptotes at -1 and 1 (or 0 and 1 in uv space if you prefer) indicate that as we near the borders of our normal map texture, our precision gets worse and worse until eventually we have none. Your mileage may vary, but in our case, we tended to use wide ranges of directions in our normal maps and would like to trade off some of the precision in the mid-regions for some extra precision at the edges. Precision in the center is the most important so let’s try and keep that intact. What options do we have? Well it turns out we can improve the precision distribution to better meet our needs with a couple of simple changes…
Nothing constrains us to using a hemisphere to generate our new z component. By experimenting with different functions which map xy to z, better options become apparent. One function we have had success with is that of a simplified inverted elliptic paraboloid (that's a mouthful, but it's simpler than it sounds).
z = 1 - (x*x + y*y)
This is essentially a paraboloid with the a and b terms both set to 1, and inverted by subtracting the result from 1. Here is a graphical view:



Notice that we don’t have the large quads at the horizon anymore and overall, the distribution looks a lot more evenly spread over the surface. Wait a second though… this doesn’t give us a normalized back vector anymore! Fortunately the math to reconstruct the normalized version is roughly the same as before in terms of cost. We need to do a normalize operation but save on the square root.
    float3 MakeNormalParaboloid(float2 xy)
    {
        float3 n;
        n.x = xy.x;
        n.y = xy.y;
        n.z = 1 - saturate(dot(n.xy, n.xy));
        return normalize(n);
    }
Below, the overlaid green plot shows the angle to texture area ratio of the new scheme. The precision is still highest in the center where we want it, and we’re trading off some precision in the mid-ranges for the ability to better store normals near the horizon.



Looking at the 3D graphs, another observation is that we're wasting chunks of valid range in our encodings. In fact, a quick calculation shows that ~21% of the possible xy combinations go to waste (the part of the [-1, 1] square which falls outside the unit circle). A straightforward extension to the scheme would be to use a function which covers the entire [-1, 1] range in x and y to make use of this lost space. There are an infinite number of such functions, but a particularly simple one is this sort of dual inverted paraboloid shape.
z = (1-x^2)(1-y^2)
which looks like:



One downside of this is that we need a little more math in the shaders in order to reconstruct the normal. Because of this, on current generation consoles, we went with the basic inverted paraboloid encoding. Going forward though, with the ratio of ALU ops to memory accesses still getting higher and higher, the latter scheme might make more sense to squeeze a little extra precision out.
Finally, here is the code used when generating the new normal map texels (the initial circular version, not the second square version). The input is the normal you want to encode and the output is the 2D xy pair which is written into the final normal map.
    Vec2 NormalToInvertedParaboloid(Vec3 const &n)
    {
        float a = (n.x * n.x) + (n.y * n.y);

        // a normal pointing straight up maps to the center of the texture;
        // handle it here to avoid the divide by zero below
        if (a < 1e-6f)
        {
            Vec2 center;
            center.x = 0.f;
            center.y = 0.f;
            return center;
        }

        // intersect the ray t*n with the paraboloid (see the note below the code)
        float b = n.z;
        float c = -1.f;
        float discriminant = b*b - 4.f*a*c;
        float t = (-b + sqrtf(discriminant)) / (2.f * a);

        Vec2 p;
        p.x = n.x * t;
        p.y = n.y * t;
        return p;                              // scale and bias as necessary
    }
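
In case the quadratic solve looks mysterious: it comes from intersecting the ray t * (n.x, n.y, n.z) with the paraboloid z = 1 - (x*x + y*y). Substituting the ray into the surface equation gives

    (n.x*n.x + n.y*n.y) * t^2 + n.z * t - 1 = 0

so with a = n.x*n.x + n.y*n.y, b = n.z and c = -1, the positive root of the quadratic formula is exactly the t computed above, and scaling n.x and n.y by it gives the point on the paraboloid to store.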