Consistent Characters in Stable Diffusion – Part 1

Tutorial – Consistent characters

One of the challenges of using AI professionally or commercially is generating a consistent character. MidJourney has some tricks to help automate that, but then you're locked into MidJourney, and you don't have control over all the parameters. We're using Stable Diffusion – it gives us a lot more control, and once we get it set up the way we want, it can massively speed up professional workflows by letting you tune the model to exactly what you need – in this case, a character with the same face and body whether they are running through the surf on a beach in a romantic scene, hanging upside-down from a rope in a heist scene, or slugging it out for the tenth round in a boxing / sport scene.

I was originally going to write this whole process as one tutorial, but it got very long and very heavy almost immediately. So, I’ve broken it up a little.

At a high level, our workflow is going to be the following:

1 – Get 20+ headshots of our character (this tutorial)
2 – Train a model to produce just that character (next tutorial)
3 – Pose that model the way we want them to be posed (third tutorial)

Ready? Let’s begin!

In order to train SD later, we need to start with at least 20 images of the person we're aiming for, from different angles, wearing different outfits, in different lighting. This teaches SD that we're interested in the character, not the clothing, lighting, or setting. There's some flexibility here, but as long as every image shows the same person, more variety is generally better.

Here, already, we run into our first problem. How do you get varied pictures of a fake person in all of the above variations before you've created said fake person? It's a bit of a chicken-and-egg challenge.

One way to do that is to start with real people. You may think of stock photography, and that is an option; if you can find 20+ pictures of a model you want to use in different outfits, you could license those photos. That said, it’s surprisingly difficult to search for a particular model on stock photography websites, especially one in different sets of clothing. (Note: this is for the model’s safety, and is probably a good thing in general). Another option is to take pictures of a friend – just make sure to get a model release form from them first! Third, and probably most ethically questionable, is finding some random person on Instagram, and using a bunch of their pictures.
With all of these methods, starting with a real person raises some thorny legal problems. Using someone's likeness does not generally fall under fair use, so there may be legal difficulties. The way I've seen this work best is when someone uses several models' photos to make LoRAs, and then blends them together into a new and unrecognizable person.

Long story short: It's an option to start with real people, but I'd recommend you generally start from scratch. (One exception: if one of your readers wants to give their likeness to a character and is willing to pay for that privilege on a Kickstarter or Patreon or something, go for it!) Even if you do generally use real people as a starting point, eventually you're going to want to generate a frog person, or an alien, or someone from a racial group that doesn't exist in real life, or whatever. For full creative freedom, we're going to start and finish in Stable Diffusion – our consistent character is going to be 100% fictional and unique.

So, how do we get there? Start with a style.  

For this exercise, I'm going to use the mother of my (very young) main character in the upcoming visual novel. I want her to be attractive enough that she generates instinctive concern / interest / care from the audience, but I don't want her to be sexually objectified. (Note – AI models are trained on user-defined tags on images across the internet, so there's a little bit of crowd psychology and guesswork that goes into prompting.)

After some attempts (and looking at example prompts on CivitAI for images I thought were in the right kind of style), I landed on the following prompt using the model Realistic Vision v1.3:

Positive: RAW, Analog, Award Winning Photograph, beautiful Chinese woman in her late 30s, casual, perfect face, 4K, intricate, highly detailed, sharp focus

Negative:

nsfw, nudity, ((disfigured)), ((bad art)), ((deformed)),((extra limbs)),((close up)),((b&w)), blurry, (((duplicate))), ((morbid)), ((mutilated)), [out of frame], extra fingers, mutated hands, ((poorly drawn hands)), ((poorly drawn face)), (((mutation))), (((deformed))), ((ugly)), blurry, ((bad anatomy)), (((bad proportions))), ((extra limbs)), cloned face, (((disfigured))), out of frame, ugly, extra limbs, (bad anatomy), gross proportions, (malformed limbs), ((missing arms)), ((missing legs)), (((extra arms))), (((extra legs))), mutated hands, (fused fingers), (too many fingers), (((long neck))), video game, ugly, tiling, poorly drawn hands, poorly drawn feet, poorly drawn face, out of frame, mutation, mutated, extra limbs, extra legs, extra arms, disfigured, deformed, cross-eye, body out of frame, blurry, bad art, bad anatomy, 3d render,
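
(Side note for anyone who later wants to script this outside the web UI: here's a minimal sketch of the same generation using Hugging Face's diffusers library. The repo id is an assumption – point it at whichever checkpoint you actually downloaded – and vanilla diffusers doesn't parse the ((parentheses)) weighting syntax, so the negative prompt below is a simplified version of the one above.)

# Minimal sketch, NOT the web UI workflow: generate a batch of candidates with diffusers.
# "SG161222/Realistic_Vision_V1.3" is an assumed Hugging Face repo id.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "SG161222/Realistic_Vision_V1.3", torch_dtype=torch.float16
).to("cuda")

positive = ("RAW, Analog, Award Winning Photograph, beautiful Chinese woman in her late 30s, "
            "casual, perfect face, 4K, intricate, highly detailed, sharp focus")
negative = ("nsfw, nudity, disfigured, bad art, deformed, extra limbs, blurry, "
            "bad anatomy, bad proportions, cloned face, mutated hands, poorly drawn face")

# 8 images per call mirrors a batch size of 8; swap "casual" for other outfit phrases between runs.
images = pipe(positive, negative_prompt=negative, num_images_per_prompt=8,
              num_inference_steps=25, guidance_scale=7).images
for i, img in enumerate(images):
    img.save(f"candidate_{i}.png")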

I initially tried making these in the anime style that the game will eventually be in, but my co-creator and I haven't yet nailed down exactly what we want that style to be. I decided to get my characters first in a photographic style – which should be the most difficult and carry the most visual data – and then adapt that to whatever visual art style we choose after the fact. YMMV; I'll include instructions later for that kind of tweak, but if you want to start in an art style rather than a photographic one, you'll just have one fewer set of steps. (If you decide to use a photographic style, my best luck has been with Realistic Vision. FTR, I'm not affiliated with them, and again, YMMV.)

I set my seed to random and my batch size to 8, and then generated 192 images. Every couple of sets, I would vary the prompt – instead of "casual", I might put "in a t-shirt" or "in a dress" or "in a skirt and blouse" after "in her late 30s". Finally, on my 24th set, I found an image I really liked:


This is the central image for the character. Knowing that, I copied this image into a folder, and went to the “Train” tab in SD. 

The Automatic1111 GUI gives you some limited training tools to create your own embeddings and Hypernetworks. Alas, it does NOT yet let you train LoRAs like we need to in later tutorials, but we will get there in good time.

By default, you'll land on a "Create Embedding" screen. We first need to create an "embedding", and then we'll train it. An embedding is also known as a textual inversion – it's a way to teach Stable Diffusion what a certain prompt should mean. One example I saw recently was "sitting on the edge of a skyscraper" – someone had taken a bunch of different photos of people sitting on the edge of skyscrapers, and as a result, while that embedding wouldn't create consistent characters for them, it did let them consistently generate pictures of… people sitting on the edge of skyscrapers.

We’re going to use this to help get closer to our consistent characters.

I created an embedding “uighur_mother”. (Note: my upcoming game is a protest piece regarding the treatment of Uighurs by the government of China. I have no criticism, implied or intended, for anyone of Chinese descent generally; I have nothing but criticism for mass murder, genocide, etc. by any person or group towards any other. With that said, it shouldn’t come up again in the course of this tutorial – and you don’t have to agree with me in order to learn how to make consistent characters!)



Try to pick an embedding name that WON’T come up in general prompting. Character names are great here, as are phrases with underscores instead of spaces. You’re trying to teach SD that, when you enter a very specific phrase in your prompt, you want a very specific set of results.

You can actually name the embedding something different from the "initialization text", which is what actually triggers the embedding, but I would recommend NOT doing so. When you have a couple hundred embeddings installed, it gets VERY difficult / impossible to remember how to activate them all without a lengthy cheat sheet.



Now, go to the “Train” sub tab… on the “Train” super-tab you’re already on. 😝

Select the embedding you just created in the "Embedding" dropdown, and then add the filepath of the directory where you stored the image to the Dataset Directory line. In Windows, to get that filepath, open the folder in Windows Explorer, then click the address bar at the top where it shows the folder name and the hierarchy of folders it's in. Copy (Ctrl-C) and paste (Ctrl-V) that into the Dataset Directory line. Unlike most other times, in this case it's OK that it's on your machine, and not on the web!
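
(For reference, the Dataset Directory is just the plain Windows path to the folder holding your training image – something like the hypothetical layout below.)

C:\Users\YourName\Pictures\uighur_mother_dataset\    <- paste this path into Dataset Directory
    central_image.png                                <- the image you picked above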



Select “Train Embedding” in the bottom left corner, and let that finish; it may take a few minutes.
Congrats, you now have your own Textual Inversion / Embedding! This will make it easier to generate images of the character you want.

Our next goal is to get 20+ images of the "same" character. After we get those, we'll move on to making a LoRA that amalgamates them more completely and automatically, but for the moment, this is all an art, not a science.

Go back to the txt2image tab at the top of the screen, and add your initialization text (hopefully the name of your embedding, if you took my advice!) to the start of your prompt. Now, re-generate a bunch of images of your character in different clothes, environments, etc. For my embedding, here are some examples of that embedding at play, with all other variables (seed, steps, etc.) held constant.
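
(As an aside, if you ever move this step out of the web UI, the .pt file the Train tab produces can be loaded by diffusers too. Here's a sketch, with the file path and repo id both assumed; in the web UI you just type the trigger word.)

# Sketch: use a trained Automatic1111 embedding (.pt) from a diffusers script.
# The file path and the repo id are assumptions.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "SG161222/Realistic_Vision_V1.3", torch_dtype=torch.float16
).to("cuda")
pipe.load_textual_inversion("embeddings/uighur_mother.pt", token="uighur_mother")

image = pipe("uighur_mother, RAW photo, woman in a dress walking through a market, sharp focus",
             negative_prompt="nsfw, blurry, deformed, bad anatomy").images[0]
image.save("uighur_mother_market.png")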


From left to right, an image with no changes, and then with an increasing degree of the uighur_mother embedding applied. (at 1.0, with three sets of parentheses around it, and at 3.0).  These changes are subtle, but you can see the slightly wider mouth of the original image coming through in the final photo. It’s still not exact, and there’s a case to be made for image 3 being more true to the original in terms of face shape and jawline.

Here’s another set of examples.


 

From left to right, top row: no embedding, embedding trigger only, three parentheses, embedding trigger:3.0, embedding trigger:2.0
Wow, that changed things! Not only did we lose the interesting pose from the original, we made her even thinner, to the point that her proportions start to strain believability. Backing off a little helped, but I think the second or third image best catches the Mona Lisa smile kind of expression of the original image, as well as her high and prominent cheekbones.

As a tip, you can vary the strength of your embedding's influence on the outcome by putting your embedding trigger word in one or more sets of parentheses, or by just putting a colon and a number at the end of it – e.g. embedding:1.5. In my experience, using the numbers tends to be more powerful – and more under your control – than the parentheses. Again, this is an art, not a science.
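
To make that concrete, here's how the two weighting styles look side by side in an Automatic1111 prompt (each pair of parentheses multiplies the emphasis by roughly 1.1, while the colon form sets it directly):

uighur_mother, portrait photo, soft lighting          <- default emphasis (1.0)
((uighur_mother)), portrait photo, soft lighting      <- two sets of parentheses, about 1.1 x 1.1 = 1.21
(uighur_mother:1.5), portrait photo, soft lighting    <- explicit emphasis of 1.5
(uighur_mother:0.7), portrait photo, soft lighting    <- dialed down below normal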

Ok, we’re in the homestretch for this section. Go and generate a LOT more images, and then comb through them for the 20-50 you feel best capture the character you’re looking for. (I ended up generating about 416 total, 192 to get the starting image, and then another 224 for the remaining 50.) You only need 20; I grabbed more than that so I could cut back down. You’re looking for a variety of poses, facial expressions, etc. – the better mix you have, the more flexible your final model will be. My mix, at the moment, is below. I may go back and see if I can’t get a wider mix of facial expressions here for some more variety, but we’ll see what happens.  



Once you’ve got 20+ headshots, we’re ready for the next step – and the next tutorial!

P.S – You might find some other characters you really like in this process as well. I have no idea who this spunky orange hoodie-wearing girl is, but I’m going to find a place for her in this novel!

A brief guide to using Stable Diffusion


After I posted some of my results using Stable Diffusion, a few people asked me for tips on how I got there. Hence, this guide. Stable Diffusion is quite a bit different than MidJourney (and NovelAI and most other paid/hosted services out there), both in what it allows you to do, and how you control it. BUT, once you learn how it works, the world is your oyster.

(Buyer beware: SD gets used, A LOT, for nsfw image generation. That’s not my thing and I try to stay away from it as much as possible, but you WILL likely bump into inappropriate images while setting up SD and downloading the add-ons you are looking for to get the images you want. I’ll mark how to sanitize your experience as much as possible, but please go into this with a very, very wary eye.)

The main benefit of SD over MJ is that YOU control the model used to generate your art. Want to switch from a realistic style to one more suited to early 90s HeMan/SheRa style animation? No problem! Want to create a consistent character that you can then pose / use in scenes over the course of a whole story? Sure! Want to turn your character into a pixel image spritesheet for a video game, or a full character rotation for a 3D model, or a lineart drawing for a coloring book? Say no more!

First, you need to install Stable Diffusion. This is not too bad, but you need to follow these steps (and the software instructions) CAREFULLY. Note: this DOES require a pretty decent graphics card, with AT LEAST 4GB of VRAM. (4GB will function most of the time, but 6, 8, 10, or 12GB is MUCH better.) Realistically, this means you probably need an NVIDIA 3060 (or AMD equivalent) or better. There ARE ways to run this on a hosted server (so you don't need to spend a bunch on a computer upgrade), but it's a different process – I'll try to post a guide to that later. (I plan to spin one of those up myself to create some consistent characters in DreamBooth.)
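
(If you're not sure how much VRAM your card has, here's a quick check – a small sketch assuming you have Python and PyTorch installed, which the webui installer sets up for you anyway.)

# Quick VRAM check with PyTorch: prints the GPU name and memory in GB.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB VRAM")
else:
    print("No CUDA-capable GPU detected.")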

Here’s a step-by-step guide (not mine) for Windows: https://stable-diffusion-art.com/install-windows/

There’s an option listed here to install SD somewhere OTHER than in your Users directory – I REALLY wish I had done that when I first installed. I could change it now, but I’m in the middle of using the tool; I’ll get to it when I’ve finished this project.

Walk through all of those steps, then come back here.

Ok, to start up Stable Diffusion, click the "webui" Windows Batch File in your stable-diffusion-webui folder. A command prompt will run for a while – a lot longer the first time, up to 10-15 minutes – and then pause on a line that should say something like:

"Running on local URL: http://127.0.0.1:7860"

You MUST leave this command line window open; that is the stable diffusion software running. Highlight and copy the http:// address in your command line, and then go back to your web browser. Paste that address, and you should see something like the attached picture.

If you don't, you probably forgot to download the model into the stable-diffusion-webui/models/Stable-diffusion folder. Double check that, and try again.

Once you get the UI from the picture to show up in your browser, you’re ready to get started. The SINGLE most important thing about SD, and the biggest difference from all of those online tools, is that SD needs “negative prompts” to work correctly. The prompt is where you say what you want to see; the negative prompt tells SD what NOT to give you.

To get started with that, here is a starter negative prompt you can copy:

(bad_prompt_version2:0.7), flat color, flat shading, bad anatomy, disfigured, deformed, malformed, mutant, gross, disgusting, out of frame, nsfw, poorly drawn, extra limbs, extra fingers, missing limbs, blurry, out of focus, lowres, bad hands, error, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username,

There are some default things I put in the prompt as well; for example: “high detail, 4k, 3d, 2d, trending on artstation”. This is definitely something you can play with more; I haven’t nailed this down for myself yet, but getting consistent in this end-of-prompt stuff will really help you get a consistent style.

When writing what you WANT to see, I get better results by building up prompts bit by bit: start with something that works, then get more specific. For example, I started with "cactus", then didn't like the background, so went to "Cactus in a New Mexico desert", then "Cactus in a New Mexico desert in the Rocky Mountains", then "Cactus under a moonlit sky in a New Mexico desert in the Rocky Mountains."
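
(If you like scripting your experiments, the same "refine step by step" idea looks roughly like this with diffusers. The repo id is an assumption, and holding the seed constant means the prompt is the only thing that changes between images.)

# Sketch: generate one image per prompt refinement, with a fixed seed for comparison.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "SG161222/Realistic_Vision_V1.3", torch_dtype=torch.float16  # assumed repo id
).to("cuda")

prompts = [
    "cactus",
    "cactus in a New Mexico desert",
    "cactus in a New Mexico desert in the Rocky Mountains",
    "cactus under a moonlit sky in a New Mexico desert in the Rocky Mountains",
]
for i, prompt in enumerate(prompts):
    generator = torch.Generator("cuda").manual_seed(1234)  # same seed every time
    image = pipe(prompt, negative_prompt="blurry, lowres, bad anatomy", generator=generator).images[0]
    image.save(f"refine_{i}.png")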

It also helps to define your camera angle, like "wide angle shot", "low angle shot", or "cowboy shot"; here's a good website with a list of shot types to include in your prompt: https://www.studiobinder.com/…/ultimate-guide-to…/

Also: see that "(bad_prompt_version2:0.7)" in the negative prompt? This is one of the powerful things about SD: people can create what are called "textual inversions," which are basically packaged concepts that achieve a specific result. One of the most commonly used is the "bad_prompt" textual inversion (these were previously called "Styles"), so let's go download and use that. (For reference, you can control the strength of the added style by adding a colon and a number – e.g. ":0.7" or ":1.3" or ":2.1" – where values between 0 and 1 reduce the strength and anything over 1 increases it.)

Go to the following link, create a HuggingFace account (it’s free), and download bad_prompt_version2.pt into your stable-diffusion-webui/embeddings folder:

https://huggingface.co/data…/Nerfgun3/bad_prompt/tree/main

HuggingFace has hundreds of those textual inversions you can download; these all go into your “embeddings” folder as .pt or .bin files. Here’s a link to HuggingFace’s library: https://huggingface.co/sd-concepts-library

You can see samples of a lot of these using the Conceptualizer link on that page, but about half of the inversions (Concepts) don’t have demos, so do scroll through the list at the bottom of the page as well.

When you want to use these in your prompts, you activate the style/concept/inversion by dropping the NAME OF THE FILE IN THE FOLDER into your prompt (i.e. bad_prompt_version2). I would set that off with commas myself, though you can play with it both ways.
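
(Diffusers users: the same embeddings can be pulled straight from Hugging Face and referenced by their token name – a sketch, with the base checkpoint repo id assumed; the Nerfgun3/bad_prompt repo and file name come from the link above.)

# Sketch: load the bad_prompt_version2 embedding and use it in the negative prompt.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "SG161222/Realistic_Vision_V1.3", torch_dtype=torch.float16  # assumed repo id
).to("cuda")
pipe.load_textual_inversion("Nerfgun3/bad_prompt", weight_name="bad_prompt_version2.pt",
                            token="bad_prompt_version2")

image = pipe("portrait photo of a woman, sharp focus",
             negative_prompt="bad_prompt_version2, blurry, lowres").images[0]
image.save("with_bad_prompt_embedding.png")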

And now, the biggest part:

Unfortunately, the base model of SD is… not so great. Fortunately, you can update that model; there are dozens to choose from. Unfortunately, here is the first place where you REALLY need to be careful about running into nsfw content. Go to https://civitai.com/ – it's sfw by default, but be warned: even the sfw models here often have nsfw images (hidden by blur; remove at your own risk). In the filters, turn on Safe browsing if you like, and then filter to show only Checkpoints. These are all different art-generation models you can use, and each will give an entirely different flavor to your generated art. They should be downloaded into the stable-diffusion-webui/models/Stable-diffusion folder, and once you refresh your SD web browser page, they should be available in the dropdown under "Stable Diffusion Checkpoint."
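
(And for the scripted route: a checkpoint file downloaded from Civitai can be loaded directly into diffusers – a sketch, with a hypothetical file name.)

# Sketch: load a single-file .safetensors checkpoint (as downloaded from Civitai) with diffusers.
# "realisticVisionV13.safetensors" is a hypothetical file name.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_single_file(
    "models/Stable-diffusion/realisticVisionV13.safetensors", torch_dtype=torch.float16
).to("cuda")
image = pipe("portrait photo of a woman, sharp focus").images[0]
image.save("new_checkpoint_test.png")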

Find some models that look interesting, download them, and try them out! Some of my favorites are Anything V3 and Midjourneyv4, but the InkPunk models, and the 90s monochrome RPG book model, and the comicbook models… nvm, there’s a LOT of fun ones.

There’s one last item you need to do before you are *fully* up and running, and that is to enhance the colors generated in SD. Download the vae-ft-mse-840000-ema-pruned.ckpt file from here: https://huggingface.co/…/sd-vae-ft-mse-original/tree/main

into your stable-diffusion-webui/models/VAE folder. Then, go into your SD (you may have to turn it off and start it up again) and click Settings —> Stable Diffusion. In the VAE drop down, select that same 840,000 VAE, and hit Apply Settings. This will make your colors richer and more vibrant. There are other VAEs you can use – I’m using the kl-f8-anime.ckpt one – but I don’t remember where I found it, and I don’t have a good single-source repository for you in that regard.
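
(One more diffusers-side note: the equivalent of that setting is handing the pipeline a different VAE. The ft-MSE VAE is published at stabilityai/sd-vae-ft-mse; the checkpoint repo id below is an assumption.)

# Sketch: swap in the ft-MSE VAE for richer colors when scripting with diffusers.
import torch
from diffusers import AutoencoderKL, StableDiffusionPipeline

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16)
pipe = StableDiffusionPipeline.from_pretrained(
    "SG161222/Realistic_Vision_V1.3", torch_dtype=torch.float16, vae=vae  # assumed repo id
).to("cuda")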

Ok, you’re up and running! Go crank out some artwork!

When you want to see the files you've produced (every one is saved by default!), go to stable-diffusion-webui/outputs/txt2img-images, which should contain everything you've generated from your prompts. Grids and img2img files are in their respective folders; poke around and it should all be there.

Happy generating!