Stable Diffusion is a text-to-image latent diffusion model created by researchers and engineers from CompVis, Stability AI, and LAION. It is capable of generating photo-realistic images from any text input. Stable Diffusion is a breakthrough in speed and quality, meaning it can run on consumer GPUs: it needs under 10 GB of VRAM and generates images at 512x512 or 768x768 pixels in a few seconds.
To quickly try out the model, you can use the Stable Diffusion Space. For more control and faster generation, try the Stability AI DreamStudio beta.
The purpose of this tutorial is to demonstrate how to install Stable Diffusion 2.1 on Debian 11 Linux. Before we proceed with the installation, let's take a moment to review the software and hardware specifications of my laptop (Legion 5 Pro 16IAH7H), the Debian 11 virtual machine, and the Google Colab environment.
1. Host:
Hardware:
- Laptop Legion 5 Pro 16IAH7H
- CPU: 12th Gen Intel(R) Core(TM) i9-12900H
- GPU: NVIDIA GeForce RTX 3070 Ti, 150 W, 8GB vRAM
- RAM: 32GB
Software:
- Debian 11 bullseye
- VirtualBox Version 7.0.6 r155176
- Python: 3.10.6
- Nvidia Driver Version: 470.161.03, CUDA Version: 11.4
2. Guest VM Settings:
Hardware:
- CPU: SD 1.4 - 8 x CPU, Execution Cap 100%, Enabled PAE/NX and Nested VT-x/AMD-V; SD 2.1 - 14 x CPU
- GPU: VMSVGA
- RAM: SD 1.4 - 9379 MB or SD 2.1 - 17274 MB
- Acceleration: Enabled Nested Paging
Software:
- Debian: 11 bullseye
- Python: 3.9.2
3. Google Colab:
Hardware:
- CPU: Intel(R) Xeon(R) CPU @ 2.20GHz
- GPU: NVIDIA TU104GL [Tesla T4], 16GB vRAM
- RAM: 12985 MB
Software:
- Ubuntu 20.04.5 LTS
- Python: 3.9.16
- Nvidia Driver Version: 525.85.12, CUDA Version: 12.0
Before moving on, it is worth taking a moment to explore the concept of AI models. In AI, a model refers to a computer program or algorithm that is designed to learn and make predictions based on data. We train models with datasets. Training a model involves providing it with input data and the corresponding expected output or label, and then adjusting the model's parameters or weights so that it can make accurate predictions on new, unseen data.
The performance of the model is evaluated using metrics such as accuracy, precision, recall, and F1 score, which indicate how well the model is able to make predictions on new data.
To train a text-to-image AI model, we typically use a dataset of paired text and image examples. The dataset would consist of textual descriptions of images, paired with the corresponding images themselves.
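To make the idea of paired training data concrete, here is a minimal sketch of what such a dataset could look like in PyTorch. The captions.csv layout (with image_path and caption columns) and the TextImagePairs class are hypothetical and for illustration only; they are not part of the actual Stable Diffusion training code.

# Hypothetical paired text-image dataset: captions.csv is assumed to have
# "image_path" and "caption" columns.
import csv
from PIL import Image
from torch.utils.data import Dataset

class TextImagePairs(Dataset):
    def __init__(self, csv_path):
        with open(csv_path, newline="") as f:
            self.rows = list(csv.DictReader(f))

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        row = self.rows[idx]
        image = Image.open(row["image_path"]).convert("RGB")
        caption = row["caption"]
        # A real training loop would also tokenize the caption and convert
        # the image into a normalized tensor before feeding both to the model.
        return image, caption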
Stable Diffusion Versions
Higher versions have been trained for longer and are thus usually better in terms of image-generation quality than lower versions.
1. Stable Diffusion 1.4
Stable Diffusion 1.4 is the initial version of Stable Diffusion developed by CompVis. It was continued from stable-diffusion-v1-2 and trained for a further 225,000 steps at 512x512 resolution on "laion-aesthetics v2 5+". Additionally, the text-conditioning was dropped 10% of the time to improve classifier-free guidance sampling.
2. Stable Diffusion 2.0
StabilityAI utilized a significantly larger image dataset to train SD v2.0, but excluded adult content using the NSFW filter from LAION, a non-profit organization that creates models and datasets for AI researchers. This larger dataset resulted in improved performance for v2.0 in recognizing inanimate objects such as architecture and landscapes.
However, the NSFW filter was too strict and removed many safe-for-work images of people in the dataset, leading to a shortage of such images for training the model.
3. Stable Diffusion 2.1
Stable Diffusion 2.1 was released by StabilityAI on December 7, 2022, and it still incorporates a filter to exclude adult content. However, this version of the filter is less restrictive than the previous one. This change offers the best of both worlds: it keeps the improvements in rendering inanimate objects while restoring the model's ability to generate people.
1. Generating Stable Diffusion Images Locally on Debian 11 VirtualBox VM Using CPU
To generate images, we will be using the CPU, but note that this process is very slow: it takes around 20 minutes to generate an SD 2.1 image with dimensions of 768x768 and about 5 minutes and 30 seconds to generate an SD 1.4 image with dimensions of 512x512. For this purpose, we will use a VirtualBox machine with Debian 11 installed, and we recommend checking the software and hardware specifications above to ensure optimal performance.
To get started, install the required packages by running the following commands:
$ sudo apt install python3-pip
$ sudo pip3 install diffusers transformers accelerate
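If you prefer not to install packages system-wide, the same packages can be installed in a Python virtual environment instead (a sketch, assuming the python3-venv package is available on your Debian system):

$ sudo apt install python3-venv
$ python3 -m venv sd-env
$ source sd-env/bin/activate
$ pip install diffusers transformers accelerate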
You can test the installation by launching Python3 and importing the StableDiffusionPipeline class from diffusers:
$ python3
>>> from diffusers import StableDiffusionPipeline
>>> pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
Note: Python will start downloading the selected model weights on first use. If you want to download the SD 2.1 model instead, replace the line above with:
>>> pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1")
Remove the VAE encoder, as it is not needed for text-to-image generation (only the decoder is used to turn latents into images):
>>> del pipe.vae.encoder
>>> prompt = "astronaut riding a horse on mars"
>>> image = pipe(prompt).images[0]
>>> image.save("astronaut_rides_horse.png")
Figure 1 - CPU-Generated 512x512 Image Created with Stable Diffusion 1.4
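For convenience, the interactive steps above can also be collected into a single script. This is just a sketch that mirrors the commands already shown; the file name generate.py is arbitrary. Run it with python3 generate.py.

# generate.py - the interactive steps above collected into one script (CPU generation)
from diffusers import StableDiffusionPipeline

# Use "stabilityai/stable-diffusion-2-1" here to generate with SD 2.1 instead.
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

# The VAE encoder is only needed for image-to-image tasks, so drop it to save memory.
del pipe.vae.encoder

prompt = "astronaut riding a horse on mars"
image = pipe(prompt).images[0]
image.save("astronaut_rides_horse.png")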
Comment: Is there a way to add negative prompts in the Google Colab method?
Reply: Use negative_prompt in the pipeline call, e.g.
image = pipe(prompt, negative_prompt="woods, trees, nature, bushes, grass", width=1024, height=768, num_inference_steps=20, guidance_scale=8.5).images[0]