An experiment with "realism" in Wan2.2, using safe-for-work images
https://redd.it/1o7khg9
@rStableDiffusion
CLIPs can understand well beyond 77 tokens
A little side addendum on CLIPs after this post: https://www.reddit.com/r/StableDiffusion/comments/1o1u2zm/text_encoders_in_noobai_are_dramatically_flawed_a/
I'll keep it short this time.
While CLIPs are limited to 77 tokens, nothing is *really* stopping you from feeding them a longer context. By default this doesn't really work:
https://preview.redd.it/47d58svb5cvf1.png?width=1980&format=png&auto=webp&s=6133df9238318b35630b8a9c484988ccc94bd83c
https://preview.redd.it/ya61tmuc5cvf1.png?width=1979&format=png&auto=webp&s=23e7dbf162d378ff7610a0832b1ea1c2d349a9d3
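Roughly, "feeding them a longer context" means something like the sketch below (assuming HuggingFace transformers; linear interpolation of the learned position embeddings is just one illustrative choice, not necessarily the method used for this tune):

```python
# Illustrative sketch: stretch CLIP-L's learned position embeddings so the
# text encoder accepts prompts longer than 77 tokens. Assumes HuggingFace
# transformers; the interpolation choice is an assumption, not necessarily
# what was done for the tune described in this post.
import torch
import torch.nn.functional as F
from transformers import CLIPTextModel, CLIPTokenizer

NEW_LEN = 770  # the tune described here trains with captions of up to 770 tokens

tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
enc = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Interpolate the 77 learned position vectors up to NEW_LEN positions.
old = enc.text_model.embeddings.position_embedding.weight.data        # (77, 768)
new = F.interpolate(old.T.unsqueeze(0), size=NEW_LEN, mode="linear",
                    align_corners=True).squeeze(0).T                  # (NEW_LEN, 768)

emb = torch.nn.Embedding(NEW_LEN, old.shape[1])
emb.weight.data.copy_(new)
enc.text_model.embeddings.position_embedding = emb
# Extend the position_ids buffer used by the current transformers CLIP code
# (this detail may differ between library versions).
enc.text_model.embeddings.register_buffer(
    "position_ids", torch.arange(NEW_LEN).unsqueeze(0), persistent=False)
enc.config.max_position_embeddings = NEW_LEN

# Tokenize without the default 77-token cap; embeddings past 77 only become
# meaningful after fine-tuning, which is what the charts in this post show.
ids = tok("very long tag list goes here ...", return_tensors="pt",
          truncation=True, max_length=NEW_LEN).input_ids
pooled = enc(ids).pooler_output
```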
I tuned base CLIP L on ~10000 text-image pairs filtered by token length. Every image in the dataset has 225+ tokens of tagging. Training was performed with up to 770 tokens.
The validation split is 5%, so ~500 images.
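For illustration, the filtering and the 95/5 split amount to something like this (the pair list and file names are hypothetical placeholders, not the actual dataset pipeline):

```python
# Illustrative only: keep image-text pairs whose tag string is at least
# 225 CLIP tokens long, then hold out 5% for validation.
import random
from transformers import CLIPTokenizer

tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

def token_len(caption: str) -> int:
    # Count tokens without the 77-token cap the tokenizer applies by default
    # (the count includes the BOS/EOS tokens CLIP adds).
    return len(tok(caption, truncation=False).input_ids)

pairs = [
    ("img_000001.png", "1girl, solo, long hair, ..."),  # hypothetical entries
]

long_pairs = [(img, cap) for img, cap in pairs if token_len(cap) >= 225]

random.seed(0)
random.shuffle(long_pairs)
n_val = max(1, int(0.05 * len(long_pairs)))  # ~500 out of ~10000 in this post
val_pairs, train_pairs = long_pairs[:n_val], long_pairs[n_val:]
```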
In the length benchmark, each landmark point is the maximum allowed length at which I tested. Up to 77 tokens, both CLIPs show fairly normal performance: the more tokens you give, the better they perform. Past 77, the performance of base CLIP L drops drastically (a new chunk has entered the picture, and at 80 tokens it's mostly filled with nothing), while the tuned variant's does not. Base CLIP L then recovers to its baseline, but it can't make use of the additional information, and as more and more tokens are added to the mix it practically dies, as the signal becomes too overwhelming.
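In rough terms, the benchmark boils down to something like the sketch below (illustrative, not the exact evaluation code): truncate every validation caption to a given token budget, embed it, and score text-to-image retrieval over the validation set. The `clip_model` and precomputed `image_feats` are assumed inputs.

```python
# Illustrative retrieval benchmark: accuracy of matching each truncated
# caption to its own image among all validation images. Budgets above 77
# require the extended position embeddings from the earlier sketch.
import torch

@torch.no_grad()
def retrieval_acc(clip_model, tok, captions, image_feats, max_len):
    # captions[i] is assumed to describe the image behind image_feats[i].
    ids = tok(captions, padding=True, truncation=True,
              max_length=max_len, return_tensors="pt").input_ids
    txt = clip_model.get_text_features(input_ids=ids)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    img = image_feats / image_feats.norm(dim=-1, keepdim=True)
    top1 = (txt @ img.T).argmax(dim=-1)              # (N_text,)
    return (top1 == torch.arange(len(captions))).float().mean().item()

# Hypothetical usage, sweeping token budgets similar to the chart's landmarks:
# for max_len in (20, 40, 77, 150, 225, 300, 500, 770):
#     print(max_len, retrieval_acc(model, tok, val_captions, val_image_feats, max_len))
```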
Tuned performance peaks at ~300 tokens (~75 tags). Why? Shouldn't it be able to utilize even more tokens?
Yes, and it is able to. What you see here is data saturation: beyond 300 tokens there are very few images that can keep extending the information. The majority of the dataset is exhausted, so there is no new data to discern, and performance flatlines.
There is, however, another chart I can show, which decouples performance from the saturated data:
https://preview.redd.it/zbss7ob77cvf1.png?width=1980&format=png&auto=webp&s=6066dd804d92fe5822eb96e39a3a3268d1353892
https://preview.redd.it/hwtxboa87cvf1.png?width=1980&format=png&auto=webp&s=6c130ad94d24f5b573a399dc26082a2161deeee3
This chart removes images that are not able to saturate the tested landmark.
An important note: as images get removed, the benchmark becomes easier, since there are fewer samples to compare against, so if you want to judge performance, use the results from the first set of graphs.
With that aside, let's address this set.
It is basically the same picture, but as the sample count decreases, base CLIP L's performance "improves" proportionally through sheer chance: beyond 100 tags the data is too small, and the model can guess correctly by pure luck, so 1 in 4 correct gives 25% :D
In reality, I wouldn't consider the data in this set very reliable beyond 300 tokens, as the later sets are done on fewer than 100 images and are likely much easier to solve.
The conclusion that can be drawn is that a CLIP tuned with long captions is able to use the information in those captions to reliably discern anime images (80% on the full data is quite decent), while default CLIP L likely treats it as more or less noise.
And no, it is not usable out of the box:
https://preview.redd.it/mrjwmtg6acvf1.png?width=899&format=png&auto=webp&s=adebc48a279fbcf54f54068df7180a95a52a5d90
But patterns are nice.
I will upload it to HF if you want to experiment or something.
And node graphs for those who are interested, of course, but without explanations this time. There is nothing here that really concerns longer context.
Red - Tuned, Blue - Base
PCA:
https://preview.redd.it/9qcixd88ccvf1.png?width=2435&format=png&auto=webp&s=9114b46e0d27b74ef6c351ec94df41b3cda7fe5b
t-SNE:
https://preview.redd.it/iu9sy2eiccvf1.png?width=2128&format=png&auto=webp&s=16a0706bcdd99f35c639af4f2a48fe964569520f
PaCMAP:
https://preview.redd.it/dobrckyrccvf1.png?width=2036&format=png&auto=webp&s=001d237868df7accad8776779edcc3e3c552e943
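For anyone who wants to make similar projections of the two encoders' text embeddings, a rough sketch (illustrative; it assumes you already have the two embedding matrices, and a PaCMAP variant would use the `pacmap` package in place of t-SNE):

```python
# Illustrative: project tuned and base text embeddings into a shared 2D space.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def plot_projection(tuned_emb: np.ndarray, base_emb: np.ndarray, method: str = "pca"):
    both = np.concatenate([tuned_emb, base_emb], axis=0)
    if method == "pca":
        proj = PCA(n_components=2).fit_transform(both)
    else:  # "tsne"; a PaCMAP version would call pacmap.PaCMAP(n_components=2)
        proj = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(both)
    n = len(tuned_emb)
    plt.scatter(proj[:n, 0], proj[:n, 1], s=4, c="red", label="Tuned")
    plt.scatter(proj[n:, 0], proj[n:, 1], s=4, c="blue", label="Base")
    plt.legend()
    plt.title(method.upper())
    plt.show()
```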
HF link:
https://huggingface.co/Anzhc/SDXL-Text-Encoder-Longer-CLIP-L/tree/main
Probably don't bother downloading if you're not going to tune your model in some way to adjust to it.
https://redd.it/1o7nnc1
@rStableDiffusion
Beginner Friendly Workflow for Automatic Continuous Generation of Video Clips Using Wan 2.2
https://www.reddit.com/r/comfyui/comments/1o7pqf3/workflow_for_automatic_continuous_generation_of/
https://redd.it/1o7pwv6
@rStableDiffusion
Queen Jedi's home return: Hunyuan 3.0, Wan 2.2, Qwen, Qwen Edit 2509
https://redd.it/1o7tv41
@rStableDiffusion
Hyper-Lora/InfiniteYou hybrid faceswap workflow
Since faceCLIP was removed, I made a workflow with the next best thing (maybe better). Also, I'm tired of people messaging me to re-upload the faceCLIP models. They are unusable without the unreleased inference code anyway.
What this does is use Hyper-Lora to create a fast SDXL LoRA from a few images of the body. It also does the face, but that tends to lack detail. Populate however many (or few) full-body images of your subject on the left side. On the right side, input good-quality face images of the subject. Enter an SDXL positive and negative prompt to create the initial image. Do not remove the "fcsks fxhks fhyks" from the beginning of the positive prompts; Hyper-Lora won't work without it. Hyper-Lora is picky about which SDXL models it likes. RealVis v4.0 and Juggernaut v9 work well in my tests so far.
That image is sent to InfiniteYou and the Flux model. Only stock Flux1.D makes accurate faces from what I've tested so far. If you want NSFW, keep the Mystic v7 LoRA. You should keep it anyway, because it seems to make InfiniteYou work better for some reason. The chin-fix LoRA is also recommended for obvious reasons. JoyCaption takes the SDXL image and makes a Flux-friendly prompt.
The output is only going to be as good as your input, so use high-quality images.
You might notice a lot of VRAM Debug nodes. This workflow will use nearly every byte of a 24GB card. If you have more, use the fp16 T5 instead of the fp8 for better results.
Are the settings in this workflow optimized? Probably not. I leave it to you to fiddle around with it. If you improve it, it would be nice if you would comment your improvements.
No, I will not walk you through installing Hyper-Lora and InfiniteYou.
https://pastebin.com/he9Sbywf
https://redd.it/1o7nlyr
@rStableDiffusion
They did not release any torch2.9 wheels for nunchaku 1.0.1?
So it seems nunchakutech did not release the wheels for torch2.9 when they released nunchaku 1.0.1.
See here:
https://github.com/nunchaku-tech/nunchaku/releases
As ComfyUI (on Windows) now uses torch2.9, how would I install the Python package for nunchaku 1.0.1? There are only torch2.8 and torch2.10 wheels available!
The strange thing is that for 1.0.0 they also released torch2.9 wheels, but this time they missed it. Accidentally?
https://redd.it/1o7yn3x
@rStableDiffusion
The need for InfiniteTalk in Wan 2.2
InfiniteTalk is one of the best features out there in my opinion; it's brilliantly made.
What I'm surprised about is why more people aren't acknowledging how limited we are in 2.2 without upgraded support for it. Whilst we can feed a Wan 2.2-generated video into InfiniteTalk, doing so strips it of much of 2.2's motion, which raises the question of why you generated the video with that version in the first place...
InfiniteTalk's 2.1 architecture still excels at character speech, but the large library of 2.2 movement LoRAs is rendered redundant, because it cannot maintain those movements whilst adding lipsync.
Without 2.2's movement, the use case is actually quite limited. Admittedly it serves that use case brilliantly.
I was wondering to what extent InfiniteTalk for 2.2 may actually be possible, or whether the 2.1 VACE architecture was simply better suited to allowing for it.
https://redd.it/1o80k2i
@rStableDiffusion