Overcoming token size limitations, custom model loading, LoRA support, textual inversion support, and more
Stable Diffusion WebUI from AUTOMATIC1111 has proven to be a powerful tool for generating high-quality images with the diffusion model. However, while the WebUI is easy to use, data scientists, machine learning engineers, and researchers often need more control over the image generation process. That is where the diffusers package from Hugging Face comes in, providing a way to run the diffusion model in Python and letting users customize their models and prompts to generate images tailored to their specific needs.
Despite its potential, the Diffusers package has several limitations that prevent it from producing images as good as those from the Stable Diffusion WebUI. The most significant of these limitations include:
- The inability to use custom models in the .safetensor file format;
- The 77 prompt tokens limitation;
- A lack of LoRA support;
- The absence of image scale-up functionality (also known as HighRes in the Stable Diffusion WebUI);
- Low performance and high VRAM usage by default.
This article aims to address these limitations and enable the Diffusers package to generate high-quality images comparable to those produced by the Stable Diffusion WebUI. With the enhancements described here, data scientists, machine learning engineers, and researchers can enjoy greater control and flexibility over their image generation processes while achieving exceptional results. In the following sections, we will explore the various techniques that can be used to overcome these limitations and unlock the full potential of the Diffusers package.
Note: please follow this link to install all required CUDA and Python packages if this is your first time running Stable Diffusion.
1. Load Local Model Files in .safetensor Format
Users can easily spin up diffusers to generate an image like this:
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipeline.to("cuda")

image = pipeline("A cute cat playing piano").images[0]
image.save("image_of_cat_playing_piano.png")
You may not be satisfied with either the output image or the performance. Let's deal with the problems one by one. First, let's load a custom model in .safetensor format located anywhere on your machine. Note that you cannot simply load the model file like this:
pipeline = DiffusionPipeline.from_pretrained("/mannequin/custom_model.safetensors")
Here are the detailed steps to convert a .safetensor file to the Diffusers format:
Step 1. Pull the diffusers code from GitHub:
git clone https://github.com/huggingface/diffusers.git
Step 2. Under the scripts folder, locate the file convert_original_stable_diffusion_to_diffusers.py.
In your terminal, run this command to convert the .safetensor file to the Diffusers format. Remember to change the --checkpoint_path value to match your case.
python convert_original_stable_diffusion_to_diffusers.py --from_safetensors --checkpoint_path="D:\stable-diffusion-webui\models\Stable-diffusion\deliberate_v2.safetensors" --dump_path="D:\sd_models\deliberate_v2" --device="cuda:0"
Step 3. Now you can load the pipeline using the newly converted model files. Here is the complete code:
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained(
    r"D:\sd_models\deliberate_v2"
)
pipeline.to("cuda")

image = pipeline("A cute cat playing piano").images[0]
image.save("image_of_cat_playing_piano.png")
You should be able to convert and use any models you download from Hugging Face or civitai.com.
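As a side note, newer diffusers releases may let you skip the conversion step entirely. The sketch below assumes a version that exposes StableDiffusionPipeline.from_ckpt (added around diffusers 0.16 and later renamed from_single_file); verify this against your installed version before relying on it:

# Assumes diffusers >= 0.16, which added from_ckpt (later renamed from_single_file)
from diffusers import StableDiffusionPipeline
import torch

pipe = StableDiffusionPipeline.from_ckpt(
    r"D:\stable-diffusion-webui\models\Stable-diffusion\deliberate_v2.safetensors",
    torch_dtype=torch.float16,
)
# Optionally persist the converted weights in Diffusers format for reuse
pipe.save_pretrained(r"D:\sd_models\deliberate_v2")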
2. Boost the Performance of Diffusers
Generating high-quality images can be a time-consuming process even for the latest 3xxx and 4xxx Nvidia RTX GPUs. By default, the Diffusers package comes with non-optimized settings. Two solutions can be applied to greatly boost performance.

Here is the iteration speed before applying the following solutions: only about 2.x iterations per second on an RTX 3070 Ti (8GB VRAM) to generate a 512×512 image.
- Use Half Precision Weights
The first solution is to use half precision weights. Half precision weights use 16-bit floating-point numbers instead of the usual 32-bit ones. This reduces the memory required for storing weights and speeds up computation, which can significantly improve the performance of the Diffusers package.
According to this video, reducing float precision from FP32 to FP16 also enables the Tensor Cores. I wrote another article testing how much GPU Tensor Cores can boost computation speed.

Here is how to enable FP16 in diffusers. Adding just two lines of code boosts performance by 500%, with almost no impact on image quality.
from diffusers import DiffusionPipeline
import torch                                 # <----- Line 1 added

pipeline = DiffusionPipeline.from_pretrained(
    r"D:\sd_models\deliberate_v2"
    ,torch_dtype = torch.float16             # <----- Line 2 added
)
pipeline.to("cuda")

image = pipeline("A cute cat playing piano").images[0]
image.save("image_of_cat_playing_piano.png")
Now the iteration speed jumps to 10.x iterations per second, about 5x faster.
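If you want to verify the speedup on your own machine, a minimal timing sketch like the one below works (it assumes the pipeline from the previous snippet is already loaded on the GPU):

import time

start = time.perf_counter()
# 25 denoising steps is a typical setting; adjust as needed
image = pipeline("A cute cat playing piano", num_inference_steps=25).images[0]
elapsed = time.perf_counter() - start
print(f"25 steps in {elapsed:.1f}s -> {25 / elapsed:.1f} it/s")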
- Use Xformers

Xformers is an open-source library that provides a set of high-performance transformer building blocks for various natural language processing (NLP) tasks. It is built on top of PyTorch and aims to provide efficient and scalable transformer models that can be easily integrated into existing NLP pipelines. (Nowadays, are there any models that don't use Transformers? :P)
Install Xformers with pip install xformers, then we can switch diffusers to use xformers with a single line of code:
...
pipeline.to("cuda")
pipeline.enable_xformers_memory_efficient_attention()  # <--- one line added
...
This one-line change boosts performance by another 20%.
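Since xformers is an optional dependency and occasionally fails to install on some CUDA/PyTorch combinations, a defensive pattern (my own habit, not an official Diffusers recommendation) is to fall back gracefully:

try:
    pipeline.enable_xformers_memory_efficient_attention()
except Exception as exc:
    # xformers missing or incompatible; the pipeline still works, just slower
    print(f"xformers unavailable, using default attention: {exc}")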
3. Remove the 77 Prompt Tokens Limitation

In the current version of Diffusers, there is a limitation of 77 prompt tokens that can be used in the generation of images.
Fortunately, there is a solution to this problem. By using the community-provided "lpw_stable_diffusion" pipeline, you can unlock the 77 prompt tokens limitation and generate high-quality images with longer prompts.

To use the "lpw_stable_diffusion" pipeline, you can use the following code:
pipeline = DiffusionPipeline.from_pretrained(
    model_path,
    custom_pipeline="lpw_stable_diffusion",  #<--- code added
    torch_dtype=torch.float16
)
In this code, we initialize a new DiffusionPipeline object using the from_pretrained method, specifying the path to the pre-trained model and setting the custom_pipeline argument to "lpw_stable_diffusion". This tells Diffusers to use the "lpw_stable_diffusion" pipeline, which unlocks the 77 prompt tokens limitation.
Now, let's try it out with a long prompt string. Here is the complete code:
from diffusers import DiffusionPipeline
import torch

pipeline = DiffusionPipeline.from_pretrained(
    r"D:\sd_models\deliberate_v2"
    ,custom_pipeline = "lpw_stable_diffusion"  #<--- code added
    ,torch_dtype = torch.float16
)
pipeline.to("cuda")
pipeline.enable_xformers_memory_efficient_attention()

prompt = """
Babel tower falling down, walking on the starlight, dreamy ultra wide shot
, atmospheric, hyper realistic, epic composition, cinematic, octane render
, artstation landscape vista photography by Carr Clifton & Galen Rowell, 16K resolution
, Landscape veduta photo by Dustin Lefevre & tdraw, detailed landscape painting by Ivan Shishkin
, DeviantArt, Flickr, rendered in Enscape, Miyazaki, Nausicaa Ghibli, Breath of The Wild
, 4k detailed post processing, artstation, rendering by octane, unreal engine
"""
image = pipeline(prompt).images[0]
image.save("goodbye_babel_tower.png")
And you will get an image like this:
If you still see a warning message like: Token indices sequence length is longer than the specified maximum sequence length for this model ( *** > 77 ). Running this sequence through the model will result in indexing errors. It's normal and you can simply ignore it.
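If you are curious how many tokens a prompt actually uses, you can ask the pipeline's own CLIP tokenizer; a quick sketch, reusing the pipeline and prompt from the example above:

# Count the CLIP tokens in a prompt (the count includes begin/end special tokens)
token_count = len(pipeline.tokenizer(prompt).input_ids)
print(f"Prompt uses {token_count} tokens; CLIP's native window is 77")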
4. Use Custom LoRA with Diffusers
Despite the claims of LoRA support in Diffusers, users still face limitations when it comes to loading local LoRA files in the .safetensor file format. This can be a significant obstacle for users who want to use LoRAs from the community.

To overcome this limitation, I created a function that allows users to load LoRA files with weight numbers in real time. This function can be used to load a LoRA file and its corresponding weight into a Diffusers model, enabling the generation of high-quality images with LoRA data.
Here is the function body:
from safetensors.torch import load_file
import torch

def __load_lora(
    pipeline
    ,lora_path
    ,lora_weight=0.5
):
    state_dict = load_file(lora_path)
    LORA_PREFIX_UNET = 'lora_unet'
    LORA_PREFIX_TEXT_ENCODER = 'lora_te'

    alpha = lora_weight
    visited = []

    # directly update weights in the diffusers model
    for key in state_dict:
        # as we have set the alpha beforehand, just skip
        if '.alpha' in key or key in visited:
            continue

        if 'text' in key:
            layer_infos = key.split('.')[0].split(LORA_PREFIX_TEXT_ENCODER+'_')[-1].split('_')
            curr_layer = pipeline.text_encoder
        else:
            layer_infos = key.split('.')[0].split(LORA_PREFIX_UNET+'_')[-1].split('_')
            curr_layer = pipeline.unet

        # find the target layer
        temp_name = layer_infos.pop(0)
        while len(layer_infos) > -1:
            try:
                curr_layer = curr_layer.__getattr__(temp_name)
                if len(layer_infos) > 0:
                    temp_name = layer_infos.pop(0)
                elif len(layer_infos) == 0:
                    break
            except Exception:
                if len(temp_name) > 0:
                    temp_name += '_'+layer_infos.pop(0)
                else:
                    temp_name = layer_infos.pop(0)

        # org_forward(x) + lora_up(lora_down(x)) * multiplier
        pair_keys = []
        if 'lora_down' in key:
            pair_keys.append(key.replace('lora_down', 'lora_up'))
            pair_keys.append(key)
        else:
            pair_keys.append(key)
            pair_keys.append(key.replace('lora_up', 'lora_down'))

        # update weight
        if len(state_dict[pair_keys[0]].shape) == 4:
            weight_up = state_dict[pair_keys[0]].squeeze(3).squeeze(2).to(torch.float32)
            weight_down = state_dict[pair_keys[1]].squeeze(3).squeeze(2).to(torch.float32)
            curr_layer.weight.data += alpha * torch.mm(weight_up, weight_down).unsqueeze(2).unsqueeze(3)
        else:
            weight_up = state_dict[pair_keys[0]].to(torch.float32)
            weight_down = state_dict[pair_keys[1]].to(torch.float32)
            curr_layer.weight.data += alpha * torch.mm(weight_up, weight_down)

        # update visited list
        for item in pair_keys:
            visited.append(item)

    return pipeline
The logic is extracted from convert_lora_safetensor_to_diffusers.py in the diffusers git repo.
Take one of the famous LoRAs, MoXin, for example. You can use the __load_lora function like this:
from diffusers import DiffusionPipeline
import torch

pipeline = DiffusionPipeline.from_pretrained(
    r"D:\sd_models\deliberate_v2"
    ,custom_pipeline = "lpw_stable_diffusion"
    ,torch_dtype = torch.float16
)
lora = (r"D:\sd_models\Lora\Moxin_10.safetensors", 0.8)
pipeline = __load_lora(pipeline=pipeline, lora_path=lora[0], lora_weight=lora[1])
pipeline.to("cuda")
pipeline.enable_xformers_memory_efficient_attention()

prompt = """
shukezouma, negative space, shuimobysim
a branch of flower, traditional chinese ink painting
"""
image = pipeline(prompt).images[0]
image.save("a branch of flower.png")
The prompt will generate an image like this:
You can call __load_lora() multiple times to load several LoRAs for one generation, as sketched below.
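For example, a minimal sketch of stacking two LoRAs by chaining calls (the second path and weight are hypothetical placeholders):

loras = [
    (r"D:\sd_models\Lora\Moxin_10.safetensors", 0.8),
    (r"D:\sd_models\Lora\another_lora.safetensors", 0.5),  # hypothetical example
]
for lora_path, lora_weight in loras:
    pipeline = __load_lora(pipeline=pipeline, lora_path=lora_path, lora_weight=lora_weight)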
With this function, you can now load LoRA files with weight numbers in real time and use them to generate high-quality images with Diffusers. The LoRA loading is pretty fast, usually taking only 1–2 seconds, far better than converting and merging (which generates another GB-sized model file).
5. Use Custom Textual Inversions with Diffusers
Using custom Textual Inversions with the Diffusers package can be a powerful way to generate high-quality images. However, the official Diffusers documentation suggests that users need to train their own Textual Inversions, which can take up to an hour on a V100 GPU. This may not be practical for many users who want to generate images quickly.

So I investigated and found a solution that enables diffusers to use a textual inversion just like in the Stable Diffusion WebUI. Below is the function I created to load a custom Textual Inversion.
def load_textual_inversion(
    learned_embeds_path
    , text_encoder
    , tokenizer
    , token = None
    , weight = 0.5
):
    '''
    Use this function to load a textual inversion model during the model
    initialization stage or the image generation stage.
    '''
    loaded_learned_embeds = torch.load(learned_embeds_path, map_location="cpu")
    string_to_token = loaded_learned_embeds['string_to_token']
    string_to_param = loaded_learned_embeds['string_to_param']

    # separate token and the embeds
    trained_token = list(string_to_token.keys())[0]
    embeds = string_to_param[trained_token]
    embeds = embeds[0] * weight

    # cast to dtype of text_encoder (.to() is not in-place, so reassign)
    dtype = text_encoder.get_input_embeddings().weight.dtype
    embeds = embeds.to(dtype)

    # add the token to the tokenizer
    token = token if token is not None else trained_token
    num_added_tokens = tokenizer.add_tokens(token)
    if num_added_tokens == 0:
        raise ValueError(f"The tokenizer already contains the token {token}. Please pass a different `token` that is not already in the tokenizer.")

    # resize the token embeddings
    text_encoder.resize_token_embeddings(len(tokenizer))

    # get the id for the token and assign the embeds
    token_id = tokenizer.convert_tokens_to_ids(token)
    text_encoder.get_input_embeddings().weight.data[token_id] = embeds

    return (tokenizer, text_encoder)
In the load_textual_inversion() function, you need to provide the following arguments:

- learned_embeds_path: path to the pre-trained textual inversion model file in .pt or .bin format.
- text_encoder: text encoder object obtained from the Diffusion Pipeline.
- tokenizer: tokenizer object obtained from the Diffusion Pipeline.
- token: optional argument specifying the prompt token. By default, it is set to None. It is the keyword that will trigger the textual inversion in your prompt.
- weight: optional argument specifying the weight of the textual inversion. By default, I set it to 0.5. You can change it to other values as needed.
You can now use the function with a diffusers pipeline like this:
from diffusers import DiffusionPipeline
import torch

pipeline = DiffusionPipeline.from_pretrained(
    r"D:\sd_models\deliberate_v2"
    ,custom_pipeline = "lpw_stable_diffusion"
    ,torch_dtype = torch.float16
    ,safety_checker = None
)

textual_inversion_path = r"D:\sd_models\embeddings\style-empire.pt"

tokenizer = pipeline.tokenizer
text_encoder = pipeline.text_encoder
load_textual_inversion(
    learned_embeds_path = textual_inversion_path
    , tokenizer = tokenizer
    , text_encoder = text_encoder
    , token = 'styleempire'
)

pipeline.to("cuda")
pipeline.enable_xformers_memory_efficient_attention()

prompt = """
styleempire, award winning beautiful street, storm, ((dark storm clouds))
, fluffy clouds in the sky, shaded flat illustration, digital art
, trending on artstation, highly detailed, fine detail, intricate
, ((lens flare)), (backlighting), (bloom)
"""
neg_prompt = """
cartoon, 3d, ((disfigured)), ((bad art)), ((deformed)), ((poorly drawn))
, ((extra limbs)), ((close up)), ((b&w)), weird colors, blurry
, hat, cap, glasses, sunglasses, lightning, face
"""

generator = torch.Generator("cuda").manual_seed(1)
image = pipeline(
    prompt
    , negative_prompt = neg_prompt
    , generator = generator
).images[0]
image.save("tv_test.png")
Here is the result of applying an Empire Style Textual Inversion. The modern street on the left turns into an old London style.
6. Upscale Images
The Diffusers package is great for generating high-quality images, but image upscaling is not its primary function. However, the Stable-Diffusion-WebUI offers a feature called HighRes, which lets users upscale their generated images by 2x or 4x. It would be great if Diffusers users could enjoy the same feature. After some research and testing, I found that the SwinIR model is an excellent option for image upscaling, and it can easily upscale images by 2x or 4x after they are generated.
To use the SwinIR model for image upscaling, we can use the code from the GitHub repository JingyunLiang/SwinIR. If you just want the code, downloading models/network_swinir.py, utils/util_calculate_psnr_ssim.py, and main_test_swinir.py is enough. Following the readme guideline, you can upscale images like magic.
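For reference, the real-world super-resolution test command looks roughly like this (flags taken from the repo's readme at the time of writing; double-check them against the version you clone, and point --folder_lq at your own images):

python main_test_swinir.py --task real_sr --scale 4 --model_path model_zoo/swinir/003_realSR_BSRGAN_DFO_s64w8_SwinIR-M_x4_GAN.pth --folder_lq path/to/your/generated/images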
Here is a sample of how well SwinIR can scale up an image.

Many other open-source solutions can be used to improve image quality. Here are three other models I tried that return wonderful results.
RealSR can scale up an image 4x almost as well as SwinIR, and its execution performance is the fastest; instead of invoking PyTorch and CUDA, the author compiles the code and CUDA usage directly into a binary. My observations show that RealSR can upscale an image in just about 2–4 seconds.
CodeFormer is good at restoring blurred or broken faces, and it can also remove noise and enhance background details. This solution and algorithm is widely used in other applications, including the Stable-Diffusion-WebUI.
GFPGAN is another powerful open-source solution that achieves amazing face restoration results, and it is fast too. It is also integrated into the Stable-Diffusion-WebUI.
7. Optimize Diffusers CUDA Memory Usage
When using Diffusers to generate images, it is important to consider CUDA memory usage, especially when you want to load other models to further process the generated images. If you try to load another model like SwinIR to upscale images, you may encounter a RuntimeError: CUDA out of memory because the Diffusers model is still occupying CUDA memory.

To mitigate this issue, there are several solutions for optimizing CUDA memory usage. The following two worked best for me:
- Sliced Attention for Additional Memory Savings
Sliced attention is a technique that reduces the memory usage of self-attention mechanisms in transformers. By partitioning the attention matrix into smaller blocks, the memory requirements are reduced. This technique can be used with the Diffusers package to reduce the memory footprint of the Diffusers model.
To use it in Diffusers, it takes just one line of code:
pipeline.enable_attention_slicing()
- Offload the Model to CPU Memory

Usually, you won't have two models running at the same time. The idea is to offload the model data to CPU memory temporarily, freeing up CUDA memory for other models, and only load it back into VRAM when you start using the model again.
To dynamically offload model data to CPU memory in Diffusers, use this line of code:
pipeline.enable_model_cpu_offload()
After applying this, whenever Diffusers finishes an image generation task, the model data is automatically offloaded to CPU memory until the next call.
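If you need to reclaim all VRAM before loading an upscaler such as SwinIR, the usual PyTorch housekeeping pattern below also works; this is generic PyTorch, not a Diffusers-specific API:

import gc
import torch

del pipeline              # drop the last reference to the pipeline
gc.collect()              # let Python free the host-side objects
torch.cuda.empty_cache()  # return cached CUDA memory to the driver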
Summary
This article discusses how to improve the performance and capabilities of the Diffusers package. It covers solutions to several common issues faced by Diffusers users, including loading local .safetensor models, boosting performance, removing the 77 prompt tokens limitation, using custom LoRA and Textual Inversion, upscaling images, and optimizing CUDA memory usage.

By applying these solutions, Diffusers users can generate high-quality images with better performance and more control over the process. The article also includes code snippets and detailed explanations for each solution.
If you can successfully apply these solutions and code to your own case, there is an additional benefit, one I gained a lot from: by reading the Diffusers source code, you may come to implement your own solutions and better understand how Stable Diffusion works. To me, learning, discovering, and implementing these solutions has been a fun journey. I hope they help you too, and I wish you joy with Stable Diffusion and the diffusers package.
Here is the prompt that generated the heading image:
Babel tower falling down, walking on the starlight, dreamy ultra wide shot
, atmospheric, hyper realistic, epic composition, cinematic, octane render
, artstation landscape vista photography by Carr Clifton & Galen Rowell, 16K resolution
, Landscape veduta photo by Dustin Lefevre & tdraw, detailed landscape painting by Ivan Shishkin
, DeviantArt, Flickr, rendered in Enscape, Miyazaki, Nausicaa Ghibli, Breath of The Wild
, 4k detailed post processing, artstation, rendering by octane, unreal engine
Size: 600 × 800
Seed: 3977059881
Scheduler (or Sampling method): DPMSolverMultistepScheduler
Sampling steps: 25
CFG Scale (or Guidance Scale): 7.5
SwinIR model: 003_realSR_BSRGAN_DFO_s64w8_SwinIR-M_x4_GAN.pth
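Putting those settings together, here is a sketch of how the heading image could be reproduced with the techniques from this article (the model path is the local Deliberate v2 folder used throughout; upscaling with SwinIR happens afterward as a separate step):

from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
import torch

pipeline = DiffusionPipeline.from_pretrained(
    r"D:\sd_models\deliberate_v2",
    custom_pipeline="lpw_stable_diffusion",
    torch_dtype=torch.float16,
)
# Swap in the scheduler listed above
pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config)
pipeline.to("cuda")

prompt = """
Babel tower falling down, walking on the starlight, dreamy ultra wide shot
, atmospheric, hyper realistic, epic composition, cinematic, octane render
, artstation landscape vista photography by Carr Clifton & Galen Rowell, 16K resolution
, Landscape veduta photo by Dustin Lefevre & tdraw, detailed landscape painting by Ivan Shishkin
, DeviantArt, Flickr, rendered in Enscape, Miyazaki, Nausicaa Ghibli, Breath of The Wild
, 4k detailed post processing, artstation, rendering by octane, unreal engine
"""
generator = torch.Generator("cuda").manual_seed(3977059881)
image = pipeline(
    prompt,
    width=600,
    height=800,
    num_inference_steps=25,
    guidance_scale=7.5,
    generator=generator,
).images[0]
image.save("heading_image.png")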
License and Code Reuse
The solutions provided in this article were achieved through extensive source reading, late-night testing, and logical design. It is important to note that at the time of writing (April 2023), the LoRA and Textual Inversion loading solutions and code included in this article are the only working versions across the internet.
If you find the code presented in this article useful and want to reuse it in your project, paper, or article, please reference back to this Medium article. The code presented here is licensed under the MIT license, which permits you to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the software, subject to the conditions of the license.
Please note that the solutions presented in this article may not be the optimal or most efficient way to achieve the desired results, and they are subject to change as new developments and improvements are made. It is always recommended to thoroughly test and validate any code before implementing it in a production environment.