tgoop.com/rStableDiffusion/55851
Compile fp8 on RTX 30xx in triton-windows 3.5
I've merged the patch that lets torch.compile work with fp8 on Ampere GPUs; let's see how it rolls out: https://github.com/woct0rdho/triton-windows/pull/140
I had hoped this would be superseded by GGUF + a better torch.compile, or by Nunchaku, but as of PyTorch 2.9 I've found that fp8 + the block swap in ComfyUI-WanVideoWrapper (or ComfyUI-wanBlockswap for native workflows) runs faster and triggers fewer recompilations than GGUF + the block swap in ComfyUI-GGUF on my machine.
This is the first feature in the 'core' part (rather than the Windows support code) that deliberately differs from official Triton. It should also work on Linux, but I'm not sure of the best way to publish Linux wheels.
I'm not an expert on PTX, so help optimizing that PTX code is welcome.

triton-windows 3.2.0.post21 is also released, which supports fp8 on RTX 20xx.
https://redd.it/1o75zgt
@rStableDiffusion
BY r/StableDiffusion
