
Anatomy of an image: SAND DEMON





I've had a few people ask me how I go about making images using AI, and my instinct is to tell them that it's super easy. Just go to Microsoft Copilot, ask it to create an image, and tell it what you want. Done!


The truth is that this can get you spectacular results with very little work... sometimes. But the pesky part comes when you want something specific, not just something cool. A strong artistic vision makes an easy-to-use tool like DALLE (the image engine that ChatGPT and Microsoft Copilot use) less useful.


That is where a platform like Stable Diffusion becomes a powerful tool. I like to think of Stable Diffusion as the Windows of AI art, and DALLE as the Mac. It may not be as pretty or as user-friendly, but it can be tweaked to your heart's content, giving you many more ways to customize and optimize your creative world.


One of the tools available in Stable Diffusion is called Controlnet. It allows you to take a starting image and use it to shape a new image in various ways. In this case, I had a photograph I created many years ago that I liked the composition of, and I wanted to play with it:



I didn't want to make a new version of this picture (a process which is typically called image-to-image), but rather, I wanted to use some elements of the image to suggest to the AI what I wanted, while still using a text prompt for the specifics. Controlnet can analyze and map the structure of your image in a variety of ways, producing black-and-white images which distill particular aspects of the original image. In my process for this image, I used two different Controlnet preprocessors:



The white lines in the Controlnet image on the left indicate the edges of objects. The Controlnet image on the right describes the depth from the camera for any given pixel using the brightness of that pixel. White pixels are very close to the camera, and black pixels are very far from the camera, with gray pixels in between.
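If you want to experiment with these preprocessing passes outside of ComfyUI, here is a rough sketch in Python of how similar edge and depth maps could be produced, using OpenCV's Canny edge detector and a Hugging Face depth-estimation model. The file names and model choice are placeholders for illustration, not the exact preprocessors my workflow used.

```python
# Rough sketch: edge and depth maps similar to the two Controlnet preprocessor
# outputs described above. File names and the depth model are illustrative.
import cv2
import numpy as np
from PIL import Image
from transformers import pipeline

source = Image.open("ocean_photo.jpg").convert("RGB")  # hypothetical source photo

# Edge map: white lines mark object edges, everything else is black.
gray = cv2.cvtColor(np.array(source), cv2.COLOR_RGB2GRAY)
edges = cv2.Canny(gray, 100, 200)
Image.fromarray(edges).save("control_edges.png")

# Depth map: brighter pixels are closer to the camera, darker pixels are farther.
depth_estimator = pipeline("depth-estimation")  # downloads a default monocular depth model
depth = np.array(depth_estimator(source)["depth"], dtype=np.float32)
depth = (255 * (depth - depth.min()) / (depth.max() - depth.min() + 1e-8)).astype(np.uint8)
Image.fromarray(depth).save("control_depth.png")
```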


Using these preprocessed images, Stable Diffusion can match the characteristics they describe in the generated image, while still conforming to the text prompt provided by the artist. The artist controls how much effect each Controlnet image has on the generated image by assigning weights to them.
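To make the weighting idea concrete, here is a rough sketch of the same setup using the diffusers library rather than ComfyUI: two Controlnets (edges and depth), each given its own weight via controlnet_conditioning_scale. The model IDs, file names, and numbers are illustrative placeholders, not my actual settings.

```python
# Sketch: generation guided by two Controlnets with individual weights.
# Model IDs and weight values are placeholders, not my exact ComfyUI settings.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from PIL import Image

controlnets = [
    ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16),
    ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16),
]
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnets, torch_dtype=torch.float16
).to("cuda")

edge_map = Image.open("control_edges.png").convert("RGB")
depth_map = Image.open("control_depth.png").convert("RGB")

image = pipe(
    prompt="Horrible evil ocean demon laughs menacingly as it emerges from the roiling sea",
    image=[edge_map, depth_map],
    controlnet_conditioning_scale=[0.8, 0.6],  # per-Controlnet weights
    num_inference_steps=30,
).images[0]
image.save("ocean_demon_draft.png")
```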


The entire workflow to generate the first image in the process looked like this:



This user interface to Stable Diffusion is called ComfyUI, and while it may look complicated, it's actually quite straightforward for anyone who has used node-based tools before. There are much simpler interfaces to Stable Diffusion available as well, but the simpler tools tend to lose the flexibility that ComfyUI provides.


The first round of generations in my process netted this image:



The text prompt for the image was:

Horrible evil ocean demon laughs menacingly as it emerges from the roiling sea. It has blue wings and a red chest. Bright Sun backlights. Lens flare. Style of Hollywood blockbuster. Highly detailed.

Several details in the prompt were designed to match up with the original image, although the colors of the original image were not available to Stable Diffusion through the black-and-white Controlnet images. You can see that the monster's arm and leg positions lined up quite well with the original photograph, as did the general attitude of the head. The ocean splashing got more creative, but you can see how the generated image owes its overall feeling to the original image without simply being a different version of that image.


As much as I like this generation, I was going for something a bit more playful, in line with the source photograph. I also wanted to see how Stable Diffusion and Controlnet would respond to a prompt with completely different context, so I decided to place the scene in the desert instead.


To help achieve the playfulness I was going for, I made use of another tool in the Stable Diffusion toolbox called an IP-Adapter. The technology behind it can be thought of as supplying images to the text-to-image model in such a way that the images effectively become part of the prompt. I chose these two images, both of which I created using DALLE, to feed into the IP-Adapter for the ComfyUI workflow.
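For those working in Python rather than ComfyUI, the diffusers library exposes IP-Adapters in a similar spirit; a minimal sketch is below, shown with a single reference image for simplicity (my workflow used two). The model IDs, scale, and file names are placeholders.

```python
# Sketch: an IP-Adapter makes a reference image effectively part of the prompt.
# Shown with diffusers and one reference image; values are illustrative.
import torch
from diffusers import StableDiffusionPipeline
from PIL import Image

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")
pipe.set_ip_adapter_scale(0.5)  # lower = text prompt dominates, higher = reference image dominates

reference = Image.open("playful_demon_reference.png").convert("RGB")  # hypothetical file
image = pipe(
    prompt="Horrible evil desert demon laughs menacingly as it emerges from the dune-swept desert",
    ip_adapter_image=reference,
    num_inference_steps=30,
).images[0]
image.save("desert_demon_test.png")
```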



Through much experimentation with the weighting of the Controlnets and the IP-Adapter, as well as tweaking my text prompt, I came away with some images that were closer to my goal:



The text prompt for these was usually something along the lines of:

Horrible evil desert demon laughs menacingly as it emerges from the dune-swept desert. It has tan wings and a furry chest. Bright Sun backlights. Lens flare. Style of Hollywood blockbuster. Highly detailed.

I chose a couple of these renders and continued to work on them for a while before deciding that they went a little bit too far in the direction of cute and playful, and I wanted to swing the pendulum back a little bit towards scary. I backed off the strength of the IP-Adapter and tweaked some of the other workflow settings until I arrived at this:



Happy with this generation, I moved on to the next step: upscaling the image. The resolution of the original images generated by Stable Diffusion (and DALLE) is around 1K, or 1024x1024 for a square image. The image above is displayed in this blog post at its full resolution, and it lacks fine detail and sharpness if examined closely. I usually like to upscale to no less than 4K, and will often go higher if the image merits it.


This process generally requires multiple upscaling applications. In this case, I started in ComfyUI with a common upscaling tool called Ultimate SD Upscale. I have never really been happy with the tools available for upscaling, and this one is no exception. Over a year ago, when I was just starting out with AI image generation, I was using a tool called Disco Diffusion, and I would upscale to huge resolutions by cutting the image into little squares and rendering each of them out using image-to-image, with its own prompt, at the highest resolution possible. It took about a week to complete one image, but it provided detail that I haven't been able to reproduce with Stable Diffusion.
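For the curious, the core idea behind that tile-by-tile approach (which is also roughly what Ultimate SD Upscale does) can be sketched in a few lines: upscale conventionally first, then run image-to-image over each tile so the model can invent fine detail. This simplified version skips the tile overlapping and seam blending that a real tool needs, and all of the names and values are illustrative.

```python
# Simplified sketch of tiled image-to-image upscaling. Real tools overlap and
# blend tiles to hide seams; this bare version will show visible seams.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

src = Image.open("sand_demon_1k.png").convert("RGB")   # 1024x1024 source
big = src.resize((2048, 2048), Image.LANCZOS)          # naive 2x upscale first
tile_size = 512

for y in range(0, big.height, tile_size):
    for x in range(0, big.width, tile_size):
        tile = big.crop((x, y, x + tile_size, y + tile_size))
        # Low strength keeps the tile's content while letting the model add detail.
        detailed = pipe(
            prompt="desert demon, sand, highly detailed",
            image=tile,
            strength=0.3,
            num_inference_steps=30,
        ).images[0]
        big.paste(detailed, (x, y))

big.save("sand_demon_2k_tiled.png")
```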


So I created my usual 2K upscaled version in Stable Diffusion, trying a bunch of different models and settings, and being generally unhappy with the results as usual. Then, as if by a miracle, I discovered a little-used tool in Stable Diffusion called Latent Diffusion Super Resolution (LDSR), which is extraordinarily slow but blows all other free tools out of the water, and even adds much more detail to the image than the pricey Gigapixel software that I use on almost all of my images.
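I ran LDSR from the Stable Diffusion interface rather than from code, but if you want to experiment with the same family of latent-diffusion upscalers in Python, diffusers ships a related pipeline. This is just a sketch of that related tool, not my exact setup, and large inputs will need plenty of memory (or tiling).

```python
# Sketch: a latent-diffusion super-resolution pipeline related to LDSR,
# available through diffusers. It upscales 4x per pass and is very slow.
from diffusers import LDMSuperResolutionPipeline
from PIL import Image

pipe = LDMSuperResolutionPipeline.from_pretrained(
    "CompVis/ldm-super-resolution-4x-openimages"
).to("cuda")

low_res = Image.open("sand_demon_2k.png").convert("RGB")  # large inputs may need tiling
upscaled = pipe(low_res, num_inference_steps=100, eta=1.0).images[0]  # 4x output
upscaled.save("sand_demon_8k.png")
```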


With a 10K image in hand, the next stop was Photoshop. The image did not require a lot of touchup work, so I went straight to the look creation, which was completed in two main steps.


First, using Photoshop's Neural Filters, I created a shallow depth-of-field effect to simulate the image having been shot on a large format camera. I also added depth haze to add a feeling of heat in the atmosphere.



Next, I spent a long time working in Nik Color Efex Pro creating a silent movie look.



I was very happy with this result, and thought that the old-movie look merited a change in aspect ratio, so I used Photoshop's Generative Expand tool to make the image a little more square, matching the aspect ratio of old movies.



At this point, I thought I was done, and was ready to put the image out into the world. But something told me to check out the new upscaling functionality from Krea.ai, so I went ahead and took this final image into Krea to see what it would do with it.



What the Krea upscaler does has been called "creative upscaling" in that it comes up with new details to fill in the gaps based on the content of the scene, not just the pixel values. I felt that this was a tremendous improvement in the quality of the image (especially in the eyes), but the free tier of Krea's upscaler only provides a 2K image. So while the image itself was much better, the resolution of the image was actually downscaled by 80%, from 10K down to 2K.


So I ran through the whole upscaling process again, using LDSR to get to about 8K (LDSR mysteriously picks an upscale ratio for you). Then, for good measure, I upscaled that image to 32K using Topaz Gigapixel. That file was 1.5GB. Gigapixel does not really add much detail; it is more of a fancy sharpening tool. Once that was done, I scaled it back down to a reasonable 10K image in order to finish up. My theory is that the downscaled image will retain some of the sharpness provided by the massive upscale.
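The final resize back down is the simplest step in the whole chain; something like the sketch below, keeping in mind that Pillow's decompression-bomb guard has to be relaxed for files this large. File names and dimensions are just for illustration.

```python
# Sketch: scaling the massive Gigapixel output back down to ~10K wide with a
# high-quality Lanczos filter. File names and sizes are illustrative.
from PIL import Image

Image.MAX_IMAGE_PIXELS = None  # a 32K image trips Pillow's decompression-bomb guard

huge = Image.open("sand_demon_32k.png")
target_width = 10240
target_height = round(huge.height * target_width / huge.width)
final = huge.resize((target_width, target_height), Image.LANCZOS)
final.save("sand_demon_10k.png")
```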


After that, it was back to Photoshop, where I used Generative Fill to fix up the weird foreground elements that Krea had created, and also to make the spraying sand behind the monster a little nicer. Finally, another pass through Nik Color Efex Pro and some curves adjustments in Photoshop led me to the final image!



I'm super happy with the results, and though this post got very long, it wasn't really a very long process to get here. There were a lot of tools used along the way, and I expect that we'll see some consolidation of functionality as platforms mature.



