I’ve won a nice little battle in my personal struggle with CUDA recently by hacking in dynamic texture memory updates.
Unfortunately for me and my little CUDA graphics prototypes, CUDA was designed for general (non-graphics) computation. Texture memory access was put in, but as something of an afterthought. Textures in CUDA are read-only: kernels cannot write to them (ok, yes, they added limited support for 2D texturing from pitch linear memory, but that’s useless), which is pretty much indefensible. Why? Because CUDA allows general memory writes! There is nothing sacred about texture memory, and the hardware certainly supports writing to it from a shader/kernel, and has since render-to-texture appeared in what, DirectX 7? But not in CUDA.
CUDA’s conception of writing to texture memory involves writing to regular memory and then calling a CUDA function to perform a sanctioned copy from the temporary buffer into the texture. That’s fine for a lot of little demos, but not for large-scale dynamic volumes. For an octree/voxel tracing app like mine, you basically fill up your GPU’s memory with a huge volume texture and accessory octree data, broken up into chunks that can be fully managed by the GPU. You then need to be able to modify these chunks as the view changes or as animation changes sections of the volume. CUDA would have you do this by double-buffering the entire thing with a big scratch buffer and copying from the linear scratchpad to the cache-tiled 3D texture every frame. That copy function, by the way, achieves a whopping 3 GB/s on my 8800. Useless.
So, after considering various radical alternatives that would avoid doing what I really want to do (which is use native 3D trilinear filtering in a volume texture I can write to anywhere), I realized I should just wait and port to DX11 compute shaders, which will hopefully allow me to do the right thing (and should also allow access to DXTC volumes, which will probably be important).
In the meantime, I decided to hack my way around the CUDA API and write to my volume texture anyway. This isn’t as bad as it sounds, because the GPU doesn’t have any fancy write-protection page faults, so a custom kernel can write anywhere in memory. But you have to know what you’re doing. The task was thus to figure out exactly where CUDA would allocate my volume in GPU memory, and exactly how the GPU’s tiled addressing scheme works.
I did this with a brute-force search. The drivers do extensive bounds checking even in release and explode when you attempt to circumvent them, so I wrote a memory-laundering routine to shuffle arbitrary, illegitimate GPU memory into legitimate allocations the driver would accept. I then used this to snoop GPU memory, which enabled the brute-force search: use CUDA’s routines to copy from CPU linear memory into the tiled volume texture, then snoop the GPU memory to find out exactly where my magic byte ended up. Each probe reveals one mapping from an XYZ coordinate and linear address to a tiled address on the GPU (or really, the inverse).
For the strangely curious, here is my currently crappy function for the inverse mapping (from GPU texture address to 3D position):
inline int3 GuessInvTile(uint outaddr)
{
    int3 outpos = int3_(0, 0, 0);
    // x bits 0-3 come straight from address bits 0-3 (the 16-wide X span)
    outpos.x |= ((outaddr >> 0) & (16 - 1)) << 0;
    outpos.y |= ((outaddr >> 4) & 1) << 0;
    outpos.x |= ((outaddr >> 5) & 1) << 4;
    outpos.y |= ((outaddr >> 6) & 1) << 1;
    outpos.x |= ((outaddr >> 7) & 1) << 5;
    outpos.y |= ((outaddr >> 8) & 1) << 2;
    outpos.y |= ((outaddr >> 9) & 1) << 3;
    outpos.y |= ((outaddr >> 10) & 1) << 4;
    // address bits 11-15 map directly to z bits 0-4
    outpos.z |= ((outaddr >> 11) & 1) << 0;
    outpos.z |= ((outaddr >> 12) & 1) << 1;
    outpos.z |= ((outaddr >> 13) & 1) << 2;
    outpos.z |= ((outaddr >> 14) & 1) << 3;
    outpos.z |= ((outaddr >> 15) & 1) << 4;
    // everything past bit 15 is volume-size dependent (see below)
    outpos.x |= ((outaddr >> 16) & 1) << 6;
    outpos.x |= ((outaddr >> 17) & 1) << 7;
    outpos.y |= ((outaddr >> 18) & 1) << 5;
    outpos.y |= ((outaddr >> 19) & 1) << 6;
    outpos.y |= ((outaddr >> 20) & 1) << 7;
    outpos.z |= ((outaddr >> 21) & 1) << 5;
    outpos.z |= ((outaddr >> 22) & 1) << 6;
    outpos.z |= ((outaddr >> 23) & 1) << 7;
    return outpos;
}
I’m sure the parts after the 15th output address bit are specific to the volume size and thus are wrong as stated (they should be done with division). So really it does a custom swizzle within a 64x32x32 chunk, and then fills the volume with these chunks in a plain X, Y, Z linear fill. It’s curious that it tiles 16 over in X first, and then fills in a 64x32 2D tile before even starting on Z. This means writing in spans of 16 aligned to the X direction is most efficient for scatter, which is actually kind of annoying; a z-curve tiling would be more convenient. The X-Y alignment is also a little strange: it means that general 3D fetches are memory-equivalent to two 2D fetches in terms of bandwidth and cache.