Commit Graph

39 Commits

Author SHA1 Message Date
comfyanonymous 8ce2a1052c Optimizations to --fast and scaled fp8. 2024-10-22 02:12:28 -04:00
comfyanonymous 0075c6d096 Mixed precision diffusion models with scaled fp8.
This change allows supports for diffusion models where all the linears are
scaled fp8 while the other weights are the original precision.
2024-10-21 18:12:51 -04:00
comfyanonymous 83ca891118 Support scaled fp8 t5xxl model. 2024-10-20 22:27:00 -04:00
comfyanonymous f9f9faface Fixed model merging issue with scaled fp8. 2024-10-20 06:24:31 -04:00
comfyanonymous a68bbafddb Support diffusion models with scaled fp8 weights. 2024-10-19 23:47:42 -04:00
comfyanonymous 67158994a4 Use the lowvram cast_to function for everything. 2024-10-17 17:25:56 -04:00
comfyanonymous e38c94228b Add a weight_dtype fp8_e4m3fn_fast to the Diffusion Model Loader node.
This is used to load weights in fp8 and use fp8 matrix multiplication.
2024-10-09 19:43:17 -04:00
comfyanonymous 9c41bc8d10 Remove useless line. 2024-09-23 02:32:29 -04:00
comfyanonymous dc96a1ae19 Load controlnet in fp8 if weights are in fp8. 2024-09-21 04:50:12 -04:00
comfyanonymous 8ae23d8e80 Fix onnx export. 2024-08-23 17:52:47 -04:00
comfyanonymous c7ee4b37a1 Try to fix some lora issues. 2024-08-22 15:32:18 -04:00
comfyanonymous 904bf58e7d Make --fast work on pytorch nightly. 2024-08-21 14:01:41 -04:00
Svein Ove Aas 5f50263088
Replace use of .view with .reshape (#4522)
When generating images with fp8_e4_m3 Flux and batch size >1, using --fast, ComfyUI throws a "view size is not compatible with input tensor's size and stride" error pointing at the first of these two calls to view.

As reshape is semantically equivalent to view except for working on a broader set of inputs, there should be no downside to changing this. The only difference is that it clones the underlying data in cases where .view would error out. I have confirmed that the output still looks as expected, but cannot confirm that no mutable use is made of the tensors anywhere.

Note that --fast is only marginally faster than the default.
2024-08-21 11:21:48 -04:00
comfyanonymous 03ec517afb Remove useless line, adjust windows default reserved vram. 2024-08-21 00:47:19 -04:00
comfyanonymous 510f3438c1 Speed up fp8 matrix mult by using better code. 2024-08-20 22:53:26 -04:00
comfyanonymous 9953f22fce Add --fast argument to enable experimental optimizations.
Optimizations that might break things/lower quality will be put behind
this flag first and might be enabled by default in the future.

Currently the only optimization is float8_e4m3fn matrix multiplication on
4000/ADA series Nvidia cards or later. If you have one of these cards you
will see a speed boost when using fp8_e4m3fn flux for example.
2024-08-20 11:55:51 -04:00
comfyanonymous 538cb068bc Make cast_to a nop if weight is already good. 2024-08-20 10:46:36 -04:00
comfyanonymous 39f114c44b Less broken non blocking? 2024-08-18 16:53:17 -04:00
comfyanonymous 6730f3e1a3
Disable non blocking.
It fixed some perf issues but caused other issues that need to be debugged.
2024-08-18 14:38:09 -04:00
comfyanonymous 73332160c8 Enable non blocking transfers in lowvram mode. 2024-08-18 10:29:33 -04:00
comfyanonymous b85216a3c0 Lower T5 memory usage by a few hundred MB. 2024-07-31 00:52:34 -04:00
comfyanonymous 25853d0be8 Use common function for casting weights to input. 2024-07-30 10:49:14 -04:00
comfyanonymous bb1969cab7 Initial support for the stable audio open model. 2024-06-15 12:14:56 -04:00
comfyanonymous 6c23854f54 Fix OSX latent2rgb previews. 2024-05-22 13:56:28 -04:00
comfyanonymous 448d9263a2 Fix control loras breaking. 2024-03-14 09:30:21 -04:00
comfyanonymous db8b59ecff Lower memory usage for loras in lowvram mode at the cost of perf. 2024-03-13 20:07:27 -04:00
comfyanonymous 667c92814e Stable Cascade Stage B. 2024-02-16 13:02:03 -05:00
comfyanonymous 78a70fda87 Remove useless import. 2024-01-19 15:38:05 -05:00
comfyanonymous 36a7953142 Greatly improve lowvram sampling speed by getting rid of accelerate.
Let me know if this breaks anything.
2023-12-22 14:38:45 -05:00
comfyanonymous 77755ab8db Refactor comfy.ops
comfy.ops -> comfy.ops.disable_weight_init

This should make it more clear what they actually do.

Some unused code has also been removed.
2023-12-11 23:27:13 -05:00
comfyanonymous ba07cb748e Use faster manual cast for fp8 in unet. 2023-12-11 18:24:44 -05:00
comfyanonymous 57926635e8 Switch text encoder to manual cast.
Use fp16 text encoder weights for CPU inference to lower memory usage.
2023-12-10 23:00:54 -05:00
comfyanonymous af365e4dd1 All the unet ops with weights are now handled by comfy.ops 2023-12-04 03:12:18 -05:00
comfyanonymous 412d3ff57d Refactor. 2023-11-11 01:11:06 -05:00
comfyanonymous 00c0b2c507 Initialize text encoder to target dtype. 2023-08-23 21:01:15 -04:00
comfyanonymous d6e4b342e6 Support for Control Loras.
Control loras are controlnets where some of the weights are stored in
"lora" format: an up and a down low rank matrice that when multiplied
together and added to the unet weight give the controlnet weight.

This allows a much smaller memory footprint depending on the rank of the
matrices.

These controlnets are used just like regular ones.
2023-08-18 11:59:51 -04:00
comfyanonymous bb1f45d6e8 Properly disable weight initialization in clip models. 2023-06-14 20:13:08 -04:00
comfyanonymous 21f04fe632 Disable default weight values in unet conv2d for faster loading. 2023-06-14 19:46:08 -04:00
comfyanonymous 6971646b8b Speed up model loading a bit.
Default pytorch Linear initializes the weights which is useless and slow.
2023-06-14 12:09:41 -04:00