Casually PathTracing with WebGPU Part2 -

In the first part we talked about how we got started on path tracing with WebGPU. We began with simple parametric shapes just for testing, then continued with what we were really interested in, polygon meshes. We made a simple improvement to our intersection testing, saw how it work, and more importantly how it didn’t work, then concluded we need a proper acceleration structure in order to get things going. In this part we’ll be focusing on such an acceleration structure: Bounding Volume Hierarchies

Bounding Volume Hierarchy (BVH)

Normally, one would talk about the different types of acceleration structure first, the history behind them, explain the differences and the pros and cons. I won’t be doing that, since smarter people have already done it here and especially here. I strongly recommend reading these last two as they are very insightful and well written.

We’re almost done with the intro, just a quick thing first. I feel I need to explain why I’m choosing BVH over other acceleration structure, and also why I’m choosing a particular type of BVH. Since we’re developing for the web, in javascript, we don’t really have access to proper multi threading. BVHs need to be built and possibly updated in real time(-ish) so we can’t really build them fast enough on the CPU. We want to build them on the GPU. That’s why I chose to use linear BVH which can be constructed on the GPU much faster, albeit they are allegedly less performant. But we’ll see.

Starting out

Right, so what we could do, is just snatch a random BVH JS library out there and call it a day. There’s this for example, which might be good. But we don’t really want that. We’ll be building our BVH inside a (several) compute shader(s). At the time of this writing I have not found any JS libraries for WebGPU which build BVHs on the GPU. Also, the main goal here is to learn and understand, so what better way to do just that, than to implement it ourselves.

Because we want to implement it, and not re-invent it, we need a reference to start from. There might be several out there, but I chose this one. Which is inspired by ⁽⁵⁾ and ⁽⁶⁾. It seems a bit old, and it’s for DirectX, but we should manage. Right?

There are basically two parts to a BVH implementation. Building and Traversal. After studying the reference for a bit, one thing was clear. It would be really difficult to attempt both building and traversal on the GPU right from the start. There are simply too many ways things can go wrong, especially since we’re basically porting this to JS and WebGPU. That’s why I chose to first try “software” mode, were I do both building and traversal on the CPU just to make sure things do actually work.

Building and Traversing the BVH on the CPU

A good reference for the theory behind this is available at ⁽⁵⁾ and ⁽³⁾ . There’s not much to say about this, since it’s a JS port of the original reference and also more or less just a test. Here’s how a node looks:

class Node {
    parent: number; // index of parent
    childL: number; // index of left child
    childR: number; // index of right child
    public code: number; // assigned morton code
    public bbox: Box = new Box(); // bounding box of the triangle object, or if it's an internal node, a union of children bounding boxes
    public index: number; // base index of the triangle inside the indices buffer
    public color: number; // for debugging only
};

A quick run though of the steps involved in building the tree:

Morton Code Generation: I used this out of a few, which provided actual good results. The morton code is generated based on the centroid of the triangle object. In this step we also compute and store the bounding box of each triangle in it’s node, and also we store the base index of the triangle objects
Sort the nodes: The nodes need to be sorted based on their assigned morton codes since we want objects that are closer to each other position wise, to be closer in the tree hierarchy.
Build the internal nodes: We build the hierarchy of internal nodes recursively.

Porting the tree building code to JS was a bit difficult, especially the sorting step which caused some issues^*. It’s really hard to tell if BVH is correct or not just by looking at some logs. From what I could tell the hierarchy made sense so I decided to push through and see if the tree is good for rendering.

Because I wanted quick feedback, my first attempt of rendering using the BVH was “software” in javascript. After a few struggles, I got this image

It’s not much, but it’s honest work! The image is just a depth representation of the scene, and a correct one too! This meant, the BVH was working decently. Next step was to try and render using the CPU generated BVH on the GPU.

Traversing the BVH on the GPU

Before starting on this topic, I want to make a note. Several months had past before I started to implement BVH traversal in our compute shader. I didn’t have enough free time, so I had to put the project on hold. When I came back, WebGPU had update it’s API and also some of it’s feature availability. First thing I had to do was to update the current state of the project to the latest working API version. While doing this, I noticed atomic operations are no longer available. I posted this question here hoping to find out what’s happening.

Right, so back to BVH traversal. We have an allegedly^* correctly built BVH. We need to send it to the GPU, so we’re going to make a few changes to our compute shader. Previously we stored our scene geometry inside a buffer like so

layout(std430, set = 0, binding = 5) readonly buffer Meshes {
     Mesh meshes[];
};

struct Mesh {
    int offset;
    int triangle_count;
    vec3 diffuse;
    vec3 emission;
    vec3 boxMin;
    vec3 boxMax;
};

We’re going to ditch this entirely and replace it with our BVH :

// Same as before
layout(std430, set = 0, binding = 3) readonly buffer Vertices {
    Vertex vertices[];
};

// Same as before
layout(std430, set = 0, binding = 4) readonly buffer Triangles {
    int triangles[];
};

layout(std430, set = 0, binding = 5) readonly buffer BVHTree {
    Node nodes[];
};

struct Node {
    int childL; // 4 bytes. Index of left child in the tree
    int childR; // 4bytes. index of right child in the tree
    uint materialIndex; // 4 bytes. index of the material to be used ( more on this below )
    uint index; // 4 bytes. base index of the triangle inside the indices buffer
    vec3 bbMin; // 12 bytes. bounding box min
    // 4 bytes padding
    vec3 bbMax; // 12 bytes. bounding box max
    // 4 bytes padding
};

We’re also going to need at least a basic material system so we can define some material properties from the javascript app. Previously, all our material properties were stored inside the Mesh structure, but we’ve eliminated that. We’ll go ahead and define basically the same properties in their own struct

struct Material {
    vec4 baseColor;
    vec4 emission;
};

Now we want to send material information from the javascript app

layout(std140, set = 0, binding = 7) uniform Materials {
    Material materials[5];
};

For now we’re storing the material index in the above uniform block required by each triangle object at node level. It’s probably not the best approach since we’re repeating the same information for each triangle of all the meshes. But we’ll leave it at that for now. We’re using a uniform and not a storage buffer because currently WebGPU caps storage buffer usage per program at 4. We’re already using 3, so let’s keep the last one available for something more important.

What else do we need? We’ll add another general purpose uniform block where we send the following:

layout(std140, set = 0, binding = 6) uniform RayTraceBuffer {
    uint numObjects; // used in traversal
    uint screenWidth; // general purpose
    uint screenHeight; // general purpose
} rayTraceBuffer;

With all this in place we can start traversing. We’ll be replacing our previous function hit_world with a new one

struct ColTri {
	bool collision;
	float distance;
	uint index;
    uint materialIndex;
	Triangle tri;
};

#define MAX_FLOAT 3.402823466e+30F
#define STACK_SIZE 32 // Relates to max tree depth

bool findCollision(Ray ray, out ColTri colTri) {
	// set no collision
	colTri.collision = false;
	colTri.distance = 0;

	// create the stack vars (1 item is unusable in the stack
	// for controle flow reasons)
	int stackIndex = 0;
	int stack[STACK_SIZE];

	stack[0] = -1;

	// start from the root
	int nodeID = int(rayTraceBuffer.numObjects);

	// tmp vars
	int leftChildNode, rightChildNode;

	bool leftChildCollision, rightChildCollision;
	int leftLeaf, rightLeaf;

	Triangle triTmp;
	float distance;

	uint index;
    bool does_hit = false;
    float t = 0.0;
    float best_min_t = MAX_FLOAT;
    float u, v;
	do
	{
		// record the index of the children
		leftChildNode = nodes[nodeID].childL;
		rightChildNode = nodes[nodeID].childR;

		// check for leaf node
		if (leftChildNode == -1 && rightChildNode == -1)
		{
			// get the index
			index = nodes[nodeID].index;
			// get triangle
			for (uint i = 0; i < 3; i++)
                triTmp.vertices[i] = vertices[triangles[index + i]];

            // run collission test
            vec3 v0 = triTmp.vertices[0].position;
            vec3 v1 = triTmp.vertices[1].position;
            vec3 v2 = triTmp.vertices[2].position;
            distance = hit_triangle_mt(ray, v0, v1, v2, t, u, v) ? t : -1; // might want to avoid this branch here
			//  update the triangle
			if (distance != -1 && (!colTri.collision || distance < colTri.distance))
			{
				// set index for material reasons later
				colTri.index = triangles[index];
                colTri.materialIndex = nodes[nodeID].materialIndex;

				// set collision to true
				colTri.collision = true;

				// store the vertex
				colTri.tri = triTmp;
                colTri.tri.u = u; // store barycentric coords
                colTri.tri.v = v; // store barycentric coords
				colTri.distance = distance;
			}

			// remove the stack top
			nodeID = stack[stackIndex--];

			continue;
		}

		// check for collisions with children
        Box boxLeft;
        boxLeft.bbMin = nodes[leftChildNode].bbMin;
        boxLeft.bbMax = nodes[leftChildNode].bbMax;
		leftChildCollision = rayIntersectWithBox(boxLeft.bbMin, boxLeft.bbMax, ray);

        Box boxRight;
        boxRight.bbMin = nodes[rightChildNode].bbMin;
        boxRight.bbMax = nodes[rightChildNode].bbMax;
		rightChildCollision = rayIntersectWithBox(boxRight.bbMin, boxRight.bbMax, ray);

		if (!leftChildCollision && !rightChildCollision)
		{
			// remove the stack top
			nodeID = stack[stackIndex--];
		}
		else
		{
			// store the right node on the stack if both right and left
			// boxes intersect the ray
			if (leftChildCollision && rightChildCollision)
				stack[++stackIndex] = rightChildNode;
            
			// update the nodeID depending on if the left side intersected
			nodeID = leftChildCollision ? leftChildNode : rightChildNode;
		}
	} while (stackIndex != -1 && stackIndex < STACK_SIZE/* This shouldn't be here. But it helps the broken BVH work*/);
    return colTri.collision;
}

It’s more or less the same with the original reference, we just replaced the triangle and box collision function calls with our own version. Initially I could not get traversal work. The compute shader would literally crash the driver and/or the OS as well ( on Windows at least). From what I could tell, the stackIndex would reach values greater than the stack size. This is because our BVH tree build in the beginning of the article is faulty. I did notice some “dead” nodes while debugging, that’s why I previously stated that the BVH tree is allegedly correct. By adding the stackIndex < STACK_SIZE condition in the while loop we avoid the issues the faulty tree creates while traversing. However it does not fix it. I will need to fix the tree at build time, and I’m currently looking at another reference for doing so. For more information on the traversal above I strongly suggest reading ⁽⁶⁾ .

Even so, we do get some nice results.

This is the same scene (6k triangles) we used in Part 1 where with our simple bounding box acceleration we got a frame time of over 500ms. With a (albeit faulty) BVH we get a frame time of 20ms. Let’s throw more triangles at it

Ray testing with linear BVH 18k triangles

The performance hit is visible, but we’re at 18k triangles now. I actually tried to go higher, however the BVH was failing to build for more detailed meshes, or more triangles. This is proof again that we need to fix our BVH building implementation. I wanted to provide some tracing stats like I did before, however like I already mentioned, atomic operations are not supported because of some weird Tint issue, so I can’t really count rays, intersections, tests and the like.

Building the BVH on the GPU

This was one of the reasons we chose to implement linear BVHs in the first place. However, I’m not entirely sure we can actually do it without atomic operations. For now, we’ll concentrate on fixing our current BVH building CPU implementation and perhaps even try other types of BVHs, just to get a comparison.

There are a lot of things we could do. We basically haven’t done any shading at all. We haven’t defined anything for the lights in our scene, where we’re still using hard coded data and our BVH is still wonky. We’ll find something.

References:

⁽¹⁾ https://github.com/Fierykev/RayTraceBVH

⁽²⁾ https://www.scratchapixel.com/lessons/advanced-rendering/introduction-acceleration-structure/bounding-volume-hierarchy-BVH-part1

⁽³⁾ https://pbrt.org/chapters/pbrt-2ed-chap4.pdf

⁽⁴⁾ https://programmerkevin.wordpress.com/bvh-ray-tracing/

⁽⁵⁾ https://developer.nvidia.com/blog/thinking-parallel-part-iii-tree-construction-gpu/

⁽⁶⁾ https://developer.nvidia.com/blog/thinking-parallel-part-ii-tree-traversal-gpu/