tgoop.com/rStableDiffusion/55851
Compile fp8 on RTX 30xx in triton-windows 3.5
I've merged the patch that lets torch.compile work with fp8 on Ampere GPUs; let's see how it rolls out: https://github.com/woct0rdho/triton-windows/pull/140
I had hoped this would be superseded by GGUF + a better torch.compile, or by Nunchaku, but as of PyTorch 2.9 I've found that fp8 + the block swap in ComfyUI-WanVideoWrapper (or ComfyUI-wanBlockswap for native workflows) runs faster and triggers fewer recompilations than GGUF + the block swap in ComfyUI-GGUF on my machine.
This is the first feature in the 'core' part (rather than the Windows support code) that deliberately differs from official Triton. It should also work on Linux, but I'm not sure of the best way to publish Linux wheels.
I'm not an expert on PTX, so help optimizing that PTX code is welcome.

triton-windows 3.2.0.post21 is also released, which supports fp8 on RTX 20xx.
https://redd.it/1o75zgt
@rStableDiffusion
BY r/StableDiffusion
