References
https://www.youtube.com/watch?v=N_zhQSx2Q3c
https://www.youtube.com/watch?v=1BCYdd9r1To
Tools
- Python 3.10
- Git
- Visual Studio runtime: https://aka.ms/vs/17/release/vc_redist.x64.exe
- Kohya_ss GUI: https://github.com/bmaltais/kohya_ss
- 30X0 GPU or above: cuDNN 8.6: https://github.com/bmaltais/python-library/raw/main/cudnn_windows.zip
Install kohya_ss
- Run setup.bat
- Install kohya_ss gui (choice 1)
- Torch 2 (choice 2)
- This machine (default)
- No distributed training (default)
- Run on CPU only: NO
- Optimize with torch dynamo: NO
- DeepSpeed: NO
- GPU(s): all
- 30X0 GPU or above: bf16 (otherwise fp16)
- 30X0 GPU or above: Extract cuDNN 8.6 as kohya_ss/cudnn_windows
- 30X0 GPU or above: Install cuDNN files (choice 2)
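Several of the choices above depend on whether your GPU supports bfloat16 (Ampere / 30X0 and newer). If you are unsure, a quick check from the Python environment kohya_ss uses settles it; this is a minimal sketch assuming a CUDA build of PyTorch is already installed and a GPU is visible:

    import torch

    # Report the installed PyTorch / CUDA versions and the detected GPU.
    print("torch", torch.__version__, "| CUDA", torch.version.cuda)
    print("GPU:", torch.cuda.get_device_name(0))

    # True on Ampere (30X0) cards and newer: pick bf16 during setup,
    # otherwise fall back to fp16.
    print("bf16 supported:", torch.cuda.is_bf16_supported())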
Prepare dataset
- At least 10 images for a person, at least 40 for a style
- High resolution, sharp images only
- Optionally use https://www.google.com/advanced_image_search with an image size larger than 4MP
- Optionally use https://www.birme.net/ for cropping (not needed for SDXL when buckets are enabled, see below) or for converting image type or quality
- Preferably upscale your images with a batch process in A1111 using the R-ESRGAN 4x+ upscaler
- Reduce file size using https://compresspng.com/ (or https://compressjpeg.com/)
- If training a person, check which celebrity the subject most resembles at https://www.starbyface.com/ (and check whether SD actually knows this person by generating a few images)
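To avoid feeding low-resolution images into training, a small script can scan the dataset folder and flag anything below the SDXL training resolution. This is a sketch using Pillow; the folder path and the 1024 px threshold are assumptions to adjust to your own dataset:

    from pathlib import Path
    from PIL import Image  # pip install pillow

    IMAGE_DIR = Path("input_images")  # hypothetical path to your dataset
    MIN_SIDE = 1024                   # SDXL is trained at 1024x1024 by default

    for path in sorted(IMAGE_DIR.iterdir()):
        if path.suffix.lower() not in {".jpg", ".jpeg", ".png", ".webp"}:
            continue
        with Image.open(path) as img:
            w, h = img.size
        if min(w, h) < MIN_SIDE:
            print(f"LOW-RES  {path.name}: {w}x{h}  -> upscale or drop")
        else:
            print(f"OK       {path.name}: {w}x{h}")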
Image captioning from kohya_ss
- From Utilities -> Captioning -> BLIP Captioning
- Image folder to caption = <path to input images>
- Caption file extension = .txt
- Prefix or Postfix to add to BLIP caption = the instance prompt (see below); use a prefix for a person or object, a postfix for a style
- Click “Caption images”
- In <path to input images>, add as much accurate information as possible to the generated .txt files
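The GUI step above wraps a BLIP captioning model. If you prefer to script this part, a rough standalone equivalent with Hugging Face transformers looks like the sketch below; the model id, folder path, file extension and instance-prompt prefix are assumptions, not kohya_ss defaults:

    from pathlib import Path
    from PIL import Image
    from transformers import BlipProcessor, BlipForConditionalGeneration

    IMAGE_DIR = Path("input_images")   # hypothetical dataset folder
    PREFIX = "JOHN_LITHGOW, "          # instance prompt as prefix (person/object)

    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
    model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")

    for path in sorted(IMAGE_DIR.glob("*.jpg")):  # adjust the extension to your images
        image = Image.open(path).convert("RGB")
        inputs = processor(image, return_tensors="pt")
        out = model.generate(**inputs, max_new_tokens=50)
        caption = processor.decode(out[0], skip_special_tokens=True)
        # Write <image name>.txt next to the image, prefixed with the instance prompt.
        path.with_suffix(".txt").write_text(PREFIX + caption, encoding="utf-8")
        print(path.name, "->", PREFIX + caption)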
Configure training in kohya_ss
- From LoRA -> Training -> Source model
- Model quick pick = custom
- Pretrained model name or path = <path to sd_xl_base_1.0.safetensors>
- From LoRA -> Tools -> Dataset Preparation
- Instance prompt = the token(s) you want to associate with the subject during training / the keyword(s) you will use in prompts later on (for a person, use the resembling celebrity found while preparing the dataset)
- Class prompt = person, man, woman, style name, object name, etc.
- Training images = <path to input images and captions>
- Regularization images = <path to regularization images of the matching class prompt>
- Repeats = 20 (typically); leave it at 1 for regularization images
- Destination training directory = <path to generated model and meta-data>
- Click “Prepare training data”
- Click “Copy info to Folders Tab”
- From LoRA -> Training -> Folders
- Model output name = <subject>-<instance prompt> (e.g. BILL_BAILEY-JOHN_LITHGOW)
- From LoRA -> Training -> Parameters -> Basic
- LoRA type = Standard
- Train batch size = 1 for a person, 5 for a style (a higher batch size creates a more flexible model but significantly increases VRAM usage)
- Epoch = 10 (typically)
- Save every N epochs = 1 (an epoch is one full training cycle, so saving after every cycle gives you multiple models to choose from)
- Caption Extension = .txt
- Mixed precision = bf16 for 30X0 or higher, otherwise fp16
- Save precision = bf16 for 30X0 or higher, otherwise fp16
- Cache latents = ☑️
- Cache latents to disk = ☑️
- Learning rate = between 0.0001 and 0.0003 for person / between 0.0009 and 0.0012 for style
- LR scheduler = constant
- Optimizer = Adafactor
- Optimizer extra arguments = scale_parameter=False relative_step=False warmup_init=False (in case of Adafactor)
- Max resolution = 1024,1024 (or 768,768 if you don’t have the VRAM)
- Enable buckets = ☑️ (allows training on non-cropped images)
- Text Encoder learning rate = <Learning rate>
- Unet learning rate = <Learning rate>
- No half VAE = ☑️
- Network Rank (Dimension) = 256 or e.g. 32 (a higher number means a bigger LoRA file and more VRAM usage)
- Network Alpha = 1 or e.g. 16 (always set in combination with Network Rank)
- From LoRA -> Training -> Parameters -> Advanced
- Gradient checkpointing = ☑️
- CrossAttention = xformers
- Don’t upscale bucket resolution = ☑️
- Additional parameters = --network_train_unet_only (but in practice it is usually left blank)
- Click “Start training” (total steps = #images * repeats * epochs / batch size; see the worked example below)
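The step formula above is worth sanity-checking before a long run. A quick calculation with the typical values used in this guide (the numbers are illustrative; adding regularization images generally increases the effective step count further):

    import math

    # Typical person-LoRA values from this guide (adjust to your dataset).
    num_images = 20   # training images
    repeats    = 20   # "Repeats" from Dataset Preparation
    epochs     = 10   # "Epoch" parameter
    batch_size = 1    # "Train batch size"

    steps_per_epoch = math.ceil(num_images * repeats / batch_size)
    total_steps = steps_per_epoch * epochs
    print(f"{steps_per_epoch} steps per epoch, {total_steps} steps in total")
    # -> 400 steps per epoch, 4000 steps in total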
Selecting the best model
- From A1111 -> txt2img
- Place generated models in stable-diffusion-webui\models\Lora
- Enter a prompt describing the scene with your trained subject
- Enter a fixed seed and a high resolution
- Load all LoRAs in the prompt
- Select all LoRAs from the prompt and “copy”
- Remove all LoRAs from the prompt except the first one
- Click on “Script” and choose “X/Y/Z plot”
- For the X type choose “Prompt S/R” (stands for search / replace)
- Paste all LoRA names as X values and separate the names with commas
- Click on “Generate” (and do this multiple times to get a proper matrix to choose from)
- Don’t choose the best-looking picture by default, but look at which one best represents the prompt. Find the right balance between accuracy and flexibility! Also don’t forget that Hires. fix leads to better images.
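Since every saved epoch produces its own .safetensors file, typing all the names into the X values field by hand is tedious. The helper below prints the comma-separated list for Prompt S/R; the Lora path and the file-name pattern are assumptions based on the example output name used earlier:

    from pathlib import Path

    # Hypothetical A1111 install path and the model output name used earlier.
    LORA_DIR = Path("stable-diffusion-webui/models/Lora")
    PATTERN = "BILL_BAILEY-JOHN_LITHGOW*.safetensors"  # one file per saved epoch

    names = [p.stem for p in sorted(LORA_DIR.glob(PATTERN))]
    # Paste the output into the X values field of the X/Y/Z plot (Prompt S/R).
    print(", ".join(names))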
Cloud generation
All of the above is intended to be installed and executed locally, but doing LoRA training in the cloud is a very good, if not better, alternative. I use this provider + template: https://www.runpod.io/console/gpu-secure-cloud?template=ya6013lj5a and was able to create a Network Rank 256 / Alpha 1 LoRA in two hours (costing less than a dollar).
Notes:
- Select a GPU with high VRAM (like an RTX 3090 or A5000).
- Customize deployment and increase container disk to e.g. 50GB
- Jupyter notebook password = Jup1t3R!
- Disable A1111 and ComfyUI to free up as much VRAM as possible