Experiments in Stable Diffusion

Frederick Dopfel
Dec 9, 2023

When people talk about the enormous computing requirements of training or fine-tuning an AI model, one naturally imagines ultra-fast computer chips. In truth, chip speed is not the limiting factor in training: if one is willing to be a bit patient, the limiting factor is the computer's memory size instead. Although I had previously been playing around with AI applications like DeepDream and DeepStyle on my main workhorse personal computer, I was reaching the practical limits of what the system could support (it was approaching its 9th birthday, after all).
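To make the memory point concrete, here is a rough back-of-the-envelope sketch. The numbers are assumptions for illustration, not measurements from my machine: it assumes full fine-tuning in 32-bit floats with the Adam optimizer, and uses the roughly 860 million parameters commonly cited for the Stable Diffusion 1.x denoising network. It also ignores activations, which add more on top.

```python
# Back-of-the-envelope memory estimate for model weights during training.
# Assumptions (not from the post): 32-bit floats, Adam optimizer,
# and a model of roughly 860 million parameters.

BYTES_PER_FLOAT32 = 4


def inference_memory_gib(num_params: int) -> float:
    """Memory needed just to hold the weights."""
    return num_params * BYTES_PER_FLOAT32 / 2**30


def training_memory_gib(num_params: int) -> float:
    """Memory for weights, gradients, and Adam's two moment buffers."""
    weights = num_params * BYTES_PER_FLOAT32
    gradients = num_params * BYTES_PER_FLOAT32
    adam_moments = 2 * num_params * BYTES_PER_FLOAT32
    return (weights + gradients + adam_moments) / 2**30


if __name__ == "__main__":
    params = 860_000_000  # assumed parameter count
    print(f"inference: {inference_memory_gib(params):.1f} GiB")
    print(f"training:  {training_memory_gib(params):.1f} GiB")
```

Under these assumptions, training needs about four times the memory of simply running the model, before counting activations. A slower chip only makes you wait longer, but a model that does not fit in memory will not train at all.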

Fortunately, my new purpose-built AI computer was up to the task of taking on a new generation of AI training tasks, and I was excited to get started. Currently, the state of the art in generative AI is focused on Large Language Models and Image Generation. In this post, I will cover some of my early experiments with image generation and will go over large language models in a later post.

Stable Diffusion, developed by Stability AI, is the most well-established AI image generation system and the foundation on which many online AI image apps, reportedly including Midjourney, were originally built. Stable Diffusion is open source and, as a result, can be modified by anyone. Whereas most people who play with Stable Diffusion use it to create pictures, I aim to go an extra step: to fine-tune Stable Diffusion to learn new concepts, including how to make photos of myself.

There are many ways to work with Stable Diffusion, but the most common is through the open-source AUTOMATIC1111 Stable Diffusion web UI (A1111). Installation is a bit tricky, but this software is the most robust currently available.

Creating images is pretty straightforward, but fine-tuning is much more difficult. In effect, you are teaching the model the meaning of a new, empty word vector (for example, the nonsense word "zkz") by showing it images labeled with that word (for example, a picture of my face, labeled "a photo of a zkz man") and having it compare them to a data set of images without it ("a photo of a man"). Some people use existing labeled photo data sets for this comparison set, but in my case, I have Stable Diffusion generate a new set of base images for "a photo of a man" and provide my own photos for "a photo of a zkz man" (50 or so cropped photos of myself). This teaches the model specifically what makes zkz different from a normal man.
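The comparison described above resembles what is often called DreamBooth-style fine-tuning with a prior-preservation term. That label, the function names, and the `prior_weight` parameter below are assumptions for illustration, and the lists of numbers stand in for the model's real predictions. A minimal sketch of how the two data sets might be combined into one training objective:

```python
# Sketch of a combined objective: learn the new word ("zkz") from my photos
# while staying close to what the model already knows about "a man".
# All names here are hypothetical; real training operates on image tensors.

INSTANCE_PROMPT = "a photo of a zkz man"  # my own photos
CLASS_PROMPT = "a photo of a man"         # base images generated by the model


def mean_squared_error(predicted, target):
    """Average squared difference between two equal-length sequences."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


def combined_loss(instance_pred, instance_target,
                  class_pred, class_target, prior_weight=1.0):
    """Loss on the 'zkz' photos plus a weighted loss on the generic photos."""
    instance_loss = mean_squared_error(instance_pred, instance_target)
    class_loss = mean_squared_error(class_pred, class_target)
    return instance_loss + prior_weight * class_loss
```

The second term is what keeps the model from forgetting what an ordinary man looks like while it learns what makes zkz distinctive.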

Once the model is trained, I save it and load it into the inference engine. I can then prompt it with zkz and it will generate pictures of myself (or whoever I choose). I've found that results become acceptable at around 25 training images, but are better as you approach 50. Most important, however, is that the photos be taken at different times, in different lighting, and in a variety of outfits. When I showed a collection of real and fake photos to strangers (admittedly, laypeople), most would guess incorrectly which were generated by AI.
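For anyone who prefers scripting to clicking, prompting the trained model can also be done over HTTP. This is a sketch under assumptions rather than my exact workflow: it assumes A1111 was launched with its `--api` option, is listening at the default local address, and already has the fine-tuned model selected. The helper names and the default step count are made up for illustration.

```python
import base64
import json
import urllib.request


def build_txt2img_payload(prompt, steps=30, width=512, height=512):
    """Build the JSON body for a text-to-image request."""
    return {"prompt": prompt, "steps": steps, "width": width, "height": height}


def generate_images(payload, base_url="http://127.0.0.1:7860"):
    """Send the request to a locally running A1111 and return raw image bytes.

    Assumes the web UI was started with the API enabled.
    """
    request = urllib.request.Request(
        f"{base_url}/sdapi/v1/txt2img",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # The response carries the generated images as base64-encoded strings.
    return [base64.b64decode(image) for image in body["images"]]


# Example usage (requires a running server):
# images = generate_images(build_txt2img_payload("a photo of a zkz man"))
# with open("zkz.png", "wb") as f:
#     f.write(images[0])
```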

In particular, I enjoyed tinkering with prompts to make myself a Pokémon card, an admiral, an old photo, and (my favorite) a Star Trek captain, which I now use as my profile photo on a lot of social media. These prompts are deceptively complex, often longer than this paragraph, to get the photo just right. It shows the importance of prompt engineering, which is a skill that is learned over time.
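Part of what makes these prompts long is that A1111 lets you emphasize individual terms with a `(term:weight)` syntax. The helper below is a small sketch of how such a prompt might be assembled. The function names and the example terms are hypothetical, not the actual prompt I used.

```python
def weighted(term: str, weight: float = 1.0) -> str:
    """Wrap a term in A1111's (term:weight) emphasis syntax when weight != 1."""
    if weight == 1.0:
        return term
    return f"({term}:{weight:g})"


def build_prompt(subject: str, details) -> str:
    """Join a subject with (term, weight) pairs into one comma-separated prompt."""
    parts = [subject] + [weighted(term, weight) for term, weight in details]
    return ", ".join(parts)


if __name__ == "__main__":
    prompt = build_prompt(
        "a photo of a zkz man",
        [
            ("Star Trek captain uniform", 1.3),  # emphasized
            ("starship bridge", 1.0),            # default weight
            ("cinematic lighting", 1.1),
        ],
    )
    print(prompt)
```

Keeping the terms and weights in a list like this makes it easier to adjust one piece at a time and see what each change does to the image.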

This experimentation has also allowed me to upgrade my favorite wedding gift to give. Whereas previously, I would use style transfer to adjust engagement photos and print them on canvas as a gift to the newlyweds, now I can create entirely new works of art for them. Unfortunately, it requires a large number of photos of both the bride and groom in different lighting and different situations, which makes getting the training data hard, but I have already had some early success, such as this set with the couple as magical warriors / magic card characters.

I'm hoping that future versions of Stable Diffusion will let me train on a smaller data set, so that I can train models using only the photos on the couple's Facebook and Instagram accounts, rather than asking them to provide an additional set of diverse photos.

Final Thoughts:

I primarily worked with Stable Diffusion 1.5, despite it not being the most up-to-date model. I had difficulties working with Stable Diffusion 2.0 because it often ignored my prompts and underweighted the concepts I trained it with (in one comical case, it drew me as a black woman). I've read that Stable Diffusion XL is based on Stable Diffusion 1.5, and so I plan to experiment with XL in A1111.