After the experience of creating a video using still images and music with AI-assisted technologies (Wings: The Unexpected History of the Gorillahawk), I thought it would be fun to try to make a video where the motion in the video was integral to the imagery. In that first video, I used DaVinci Resolve to create "motion control" movements in the style of a Ken Burns documentary: the images themselves were static, but I panned around, rotated, or zoomed into them to give them a sense of movement. This time, I wanted objects in the frame to move relative to each other. No cheating!
The AI tools to do this are far less advanced than those that create still imagery. I'd say the experience of trying to create interesting video clips was similar to what I felt about 18 months ago, when I first started making images with generative AI. Cool stuff is definitely possible, but you need a lot of patience, some luck, and the willingness to accept some weirdness.
I also created the music for the video using an AI tool available in Microsoft Copilot, but I'm going to save that discussion for another post.
As I had for the previous video, I only used tools that I could access for free or ones to which I already had paid access. That proved quite limiting, because there are much more powerful video creation tools available if you're willing to fork out some cash.
There are two main ways to generate video with AI right now. The simplest is text-to-video, which is directly analogous to still image generators like DALL-E 3. You type in a description of what you want to see, and the generative AI model spits out some video. At this point, these are pretty much useless for anything but the most typical (read: boring) clips of girls staring at the camera or pretty landscapes. So text-to-video was off the table for my project.
The other method, image-to-video, is to give the model a starting image, which it analyzes and animates based on how it expects things in the image might move. Some of the tools have very rudimentary controls meant to help you define how you want the image to move, and some let you give a descriptive prompt, as with text-to-video. None of the controls are particularly deterministic; you will still get a lot of unexpected weirdness.
All of the tools absolutely wreck the quality of the original image. Below is a gallery of the original images used in this video; you can see how heartbreaking it is to have the quality reduced so significantly in the name of creating motion. All generated clips come out at a low frame rate (usually 8 frames per second) and only 2-3 seconds long. I used DaVinci Resolve's retiming tools to smooth out the motion in the clips, and while I believe that functionality is only available in the paid version, it is amazingly good.
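Resolve's retimer uses optical flow, which I can't reproduce in a few lines, but the basic idea of synthesizing in-between frames to raise a clip's frame rate can be sketched with a much cruder method: a linear cross-fade between consecutive frames. The frame values and `factor` below are purely illustrative, not taken from my actual clips:

```python
import numpy as np

def interpolate_frames(frames, factor=3):
    """Insert (factor - 1) linearly blended frames between each consecutive
    pair, multiplying the frame rate by roughly `factor`. A crude stand-in
    for optical-flow retiming, which estimates actual pixel motion instead
    of cross-fading."""
    out = []
    for a, b in zip(frames[:-1], frames[1:]):
        for i in range(factor):
            t = i / factor  # blend weight: 0 = frame a, approaching 1 = frame b
            out.append(((1 - t) * a + t * b).astype(a.dtype))
    out.append(frames[-1])
    return out

# Tiny 2x2 grayscale "frames" as a stand-in for real video frames:
# going from 8 fps toward 24 fps means tripling the frame count.
clip = [np.zeros((2, 2), np.uint8), np.full((2, 2), 90, np.uint8)]
smooth = interpolate_frames(clip, factor=3)
print(len(smooth))      # 4 (3 blended steps + the final frame)
print(smooth[1][0, 0])  # 30 (one third of the way from 0 to 90)
```

Simple blending like this produces ghosting on fast motion, which is why optical-flow methods like Resolve's look so much better in practice.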
I tried the video creation tools from RunwayML, Pika, LeonardoAI, and several others, all of which are credit-based; that means money if you want to create a reasonable number of clips. In the end, the large majority of the clips in the video were created with Stable Video Diffusion, which can run on your local computer if you have enough GPU power, and so is completely free. It can't accept text prompts or any specific instructions about how to create a clip; instead, it exposes many technical parameters and custom models that you can explore to find the best possible results. I found that it generally wanted to create simple, slow dolly push-in shots, which are quite impressive in the way they handle perspective and parallax. But a casual viewer might not notice the difference between one of those and a Ken Burns-style zoom like the ones from my previous video. So it took thousands of generations, tweaking parameters along the way, to get the 40 or so clips you see in the video.
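For reference, Stable Video Diffusion's image-to-video model can be driven from Python with Hugging Face's diffusers library. This is a minimal sketch rather than my exact setup: it assumes a CUDA GPU with plenty of VRAM, downloads the public img2vid-xt weights on first run, and the input filename and parameter values shown (`motion_bucket_id`, `noise_aug_strength`, the output fps) are just illustrative starting points to tweak:

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

# Load the public SVD image-to-video checkpoint (downloads weights on first run).
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.to("cuda")  # requires a CUDA GPU

# The model conditions on a 1024x576 (16:9) still image.
image = load_image("my_still.png").resize((1024, 576))

# motion_bucket_id controls how much motion to add; noise_aug_strength
# controls how far the clip is allowed to drift from the source image.
frames = pipe(
    image,
    decode_chunk_size=8,
    motion_bucket_id=127,
    noise_aug_strength=0.02,
).frames[0]

# Write the generated frames out as a short, low-frame-rate clip.
export_to_video(frames, "clip.mp4", fps=8)
```

Generations like this are where the parameter sweeps come in: small changes to the motion and noise settings produce very different clips from the same still.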
As these things go, I'm sure everyone will be making videos ten times better than this one by the end of 2024. But I won't be attempting this again until the tools get quite a bit better, or until the better ones become free!
Here are the original images used in the video (scaled down for the web, but still far better quality than the resulting video). They were all created with DALL-E 3 via Microsoft Copilot, and then expanded from square to 16:9 aspect ratio in Photoshop using the AI Generative Expand tool and standard retouching. The one upside-down image was inverted in order to trick the AI into making the movement that I wanted (a complete shot in the dark, but it worked!). Click to see full-size images.