13 Hello Interop

Zero-copy resource interoperability between different compute and graphics APIs is essential when an application depends on algorithms or libraries tied to a specific API. For example, an application may benefit from highly optimized routines available through CUDA or ROCm libraries.

Unfortunately, achieving this interoperability typically involves a significant amount of platform- and API-specific code, especially with the Vulkan API, where even basic examples can exceed 2500 lines of code.

The good news is that Tellusim natively supports CUDA and ROCm APIs and provides a simple mechanism for Buffer, Texture, and Fence interoperability with both Vulkan and Direct3D12.

Tellusim includes several examples of CUDA and ROCm interop.

note

The ROCm API currently has limitations on Windows and is therefore available only in Linux x64 builds.

Now, let's walk through a simple application that demonstrates Vulkan-CUDA and Direct3D12-CUDA texture interoperability.

Creating the Device

To begin, we initialize a Window and Device, as in previous examples:

// create app
App app(argc, argv);
if(!app.create()) return false;

// create window
Window window(app.getPlatform(), app.getDevice());
if(!window || !window.setSize(app.getWidth(), app.getHeight())) return false;
if(!window.create("13 Hello Interop") || !window.setHidden(false)) return false;
window.setCloseClickedCallback([&]() { window.stop(); });

// create device
Device device(window);
if(!device) return false;

// device info
const Device::Features &features = device.getFeatures();
TS_LOGF(Message, "%s 0x%x\n", device.getName().get(), features.pciBusID);

You can explicitly select the rendering API by passing -vk or -d3d12 as command-line arguments.

To correctly initialize a CUDA context on multi-GPU systems, we need to provide the PCI Bus ID of the currently active graphics device:

// create CUDA context for device
CUContext cu_context = CUContext(Context(PlatformCU, features.pciBusID));
if(!cu_context || !cu_context.create()) {
    TS_LOG(Error, "Can't create CUDA context\n");
    return false;
}

// create CUDA device
Device cu_device(cu_context);
if(!cu_device) return false;

At this point, both the graphics and CUDA devices are initialized on the same GPU.
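The same PCI-bus-based matching can also be performed directly through the CUDA runtime when needed. A minimal sketch, assuming a placeholder bus-ID string in place of the value derived from features.pciBusID:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // resolve a CUDA device ordinal from a PCI bus ID string;
    // "0000:01:00.0" is a placeholder for the actual bus ID of
    // the active graphics device
    int device = 0;
    if(cudaDeviceGetByPCIBusId(&device, "0000:01:00.0") != cudaSuccess) {
        fprintf(stderr, "unknown PCI bus ID\n");
        return 1;
    }
    printf("CUDA device ordinal: %d\n", device);
    return 0;
}
```

This is how multi-GPU ambiguity is avoided: both the graphics context and the CUDA context resolve to the same physical adapter.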

info

CUDA-specific derived classes are used to access API-level resources.

Tellusim provides low-level (driver-level) access to CUDA resources, which can be easily cast to the runtime-level API and used directly in *.cu files:

// set CUDA device for runtime-level API
if(cudaSetDevice(cu_context.getDevice()) != cudaSuccess) return false;

Creating the Texture

To share a texture between the graphics API and CUDA, we create it on the graphics device using the Texture::FlagInterop flag:

// create texture
uint32_t texture_size = 2048;
Texture texture = device.createTexture2D(FormatRGBAu8n, texture_size, Texture::FlagSurface | Texture::FlagInterop);
if(!texture) return false;

We then use this texture as the source for a CUDA texture:

CUTexture cu_texture = CUTexture(cu_device.createTexture(texture));
if(!cu_texture) return false;

This cu_texture can now be passed to CUDA kernels for compute operations.

Dispatching a CUDA Kernel

Before launching the kernel, we need to wrap the CUDA texture in a surface object:

// create CUDA surface descriptor
cudaResourceDesc surface_desc = {};
surface_desc.resType = cudaResourceTypeArray;
surface_desc.res.array.array = (cudaArray_t)cu_texture.getTextureLevel(0);

// create CUDA surface
cudaSurfaceObject_t cu_surface = 0;
cudaError_t error = cudaCreateSurfaceObject(&cu_surface, &surface_desc);
if(error != cudaSuccess) return false;

We can now dispatch the CUDA kernel:

// dispatch CUDA kernel
uint32_t group_size = 8;
uint32_t num_groups = udiv(texture_size, group_size);
cudaStream_t stream = (cudaStream_t)cu_context.getStream();
kernel<<<dim3(num_groups, num_groups), dim3(group_size, group_size), 0, stream>>>(texture_size, time, cu_surface);

// check for CUDA errors
cudaError_t error = cudaGetLastError();
if(error != cudaSuccess) TS_LOGF(Error, "%s\n", cudaGetErrorString(error));

// synchronize CUDA stream
cudaStreamSynchronize(stream);
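
The kernel itself is not shown above. A minimal sketch matching the launch signature (size, time, surface) might look like the following; the animated pattern is illustrative, not the one used by the actual example:

```cuda
#include <cuda_runtime.h>

__global__ void kernel(uint32_t size, float time, cudaSurfaceObject_t surface) {

    // global texel coordinates
    uint32_t x = blockIdx.x * blockDim.x + threadIdx.x;
    uint32_t y = blockIdx.y * blockDim.y + threadIdx.y;
    if(x >= size || y >= size) return;

    // simple animated gradient (illustrative)
    float u = (float)x / size;
    float v = (float)y / size;
    uchar4 color = make_uchar4(
        (unsigned char)(255.0f * (0.5f + 0.5f * __sinf(u * 16.0f + time))),
        (unsigned char)(255.0f * (0.5f + 0.5f * __cosf(v * 16.0f + time))),
        (unsigned char)(255.0f * u * v),
        255u);

    // surface writes use a byte-addressed X coordinate
    surf2Dwrite(color, surface, x * sizeof(uchar4), y);
}
```

The bounds check at the top keeps the kernel correct even when the group count is rounded up for sizes that are not a multiple of the group size.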

Rendering the Texture

To render the texture without introducing additional resources or shaders, we'll use the built-in Canvas system with a CanvasRect element that displays the texture. The Canvas, CanvasRect, and window Target are created during initialization and are omitted from the snippet below:

// flush texture after CUDA writes
device.flushTexture(texture);

// render texture using canvas
target.begin();
{
    Command command = device.createCommand(target);
    canvas.draw(command);
}
target.end();

Conclusion

The complete example contains fewer than 170 lines of CUDA and graphics code, including comments. This demonstrates a significant reduction in complexity compared to the original CUDA-Vulkan interoperability example.

When run locally, this CUDA-based application produces the same result as the WebGL version of this example.