To understand AI and creativity, we need to first examine the relationship between technology and art. Creativity is vast, however, so this article will focus on film and touch on the very related impacts on design.
Technology and art have traveled hand in hand for as long as we've had both. Tools for art-making dating back 100,000 years were discovered in South Africa.[1] Some argue that the desire to make art drove the human adoption of tools, and vice versa. Contemporary art history itself grew out of the scientific approach of the 15th century, when the study of tools and techniques was as important as the art itself.
- Generative AI is a remarkable and transformative technology, but generative video remains a daunting technical challenge.
- Even if we somehow achieve art-making machines with perfect fidelity, the central need for human direction and taste will be ever present.
In film, the Academy Awards came out of the Academy of Motion Picture Arts and Sciences (emphasis on sciences). The early Oscars, from the 1930s to the 1960s, featured toolmakers as part of the main awards ceremony, and they are still recognized in an adjunct ceremony every year.[2]
Design systems have emerged to augment the more manual process of design by using tools to abstract repetitive tasks in the creative process. Extended further, these abstractions have facilitated the creation of tooling such as sketch-to-design interfaces[3] and text-to-design prototypes.[4]
The point is: Technology and art are deeply intertwined. Tools in art are a mechanism to take human intent and magnify it. Central to this is understanding that without a human to drive them, the tools aren't meaningful.
There's a persistent myth that advances in technology are going to "kill" art and artisans. As painter Paul Delaroche exclaimed at a demonstration of the daguerreotype (an early photograph) circa 1839, "From today, painting is dead!"
Photography didn't kill painting, but it would be disingenuous to deny that it caused a seismic shift in the field. In the same way that a good set of brushes, paint, and canvas don't automatically make a good artist, a good camera doesn't automatically make a good photographer. Time and again, it's the skill and expressiveness of the operator that makes compelling art. In the words of legendary street photographer Henri Cartier-Bresson, "It is an illusion that photos are made with the camera… they are made with the eye, heart and head."
More recently, the introduction of design systems has been met with skepticism. As with photography, there has been concern that design systems will replace designers. Of course, there's an argument to be made that by freeing up designers from a repetitive piece of work in an already complex workflow, they can instead focus on higher-level concerns like why a piece of work is being made—the composition rather than just the function. As Marshall McLuhan is claimed to have said, "We become what we behold. We shape our tools and then our tools shape us." In a complex creative task like image-making or design, these tools peel away the repetitive work and instead get closer to the ineffable idea of creativity.
An example in living memory, before our current AI moment, is the idea that computer graphics were ruining cinema. The combination of computer graphics and art was born from a collegial optimism. In the May 1976 issue of Computer Graphics and Art, editor Grace C. Hertlein suggested that "[i]n the very near future, computer art will become a common, accepted art form that many people will enjoy and practice. Computer art applications will be implanted in many useful, everyday functions as well."
We ended up in what has become a familiar place, with warnings that computer graphics tooling—the new technique—will destroy cinema. Despite this, it has become an ordinary and ubiquitous part of the filmmaker toolbox, applied across human-scale dramas and blockbusters alike, from Parasite to The Scorpion King.
As a computer graphics toolmaker, I am biased here, but I'd argue that blaming the tools misses the point: These complaints about computer graphics in film come down to issues of taste[5] rather than function.[6] We often talk about using the right tool for the job, but despite its versatility, sometimes a CGI approach isn't the right solution. The tools might not be up to the task, or there might not be enough time or data to use them properly. In that case, could another tool have been used instead? Could the narrative have been adjusted to tell the story best without the tools?
Unfortunately there's a pattern in the early life of new computer graphics tooling where it becomes a solution in search of a problem. It's a shiny new toy with so much potential that experimentalists will go out of their way to give it a go, whether it's appropriate for the storytelling or not.
I'm going to go with the framing assumption that AI (and the machine learning that drives it) is a tool—a powerful one, but a tool nonetheless. Is it going to replace creatives? To reflect on that, I'll start by describing my own attempt to answer that question.
ML filters. In early 2016, I was captivated by the idea of style transfer, an approach pioneered by Leon Gatys et al. in their seminal paper, "A Neural Algorithm of Artistic Style."[7] In this technique, any given image—a photo or whatever—could be redrawn in the style of another image (for example, a painting) using machine learning.
I started out by just gluing things together and seeing what happened. Style transfer worked on individual images, so to get it working on video, what if I just split a video into individual frames and style transferred each frame, and then combined the results?
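That naive glue-it-together pipeline can be sketched in a few lines of Python. Everything here is a stand-in: `stylize`, `split_into_frames`, and `reassemble` are hypothetical placeholders for a real style-transfer model and a video codec (a real pipeline would use something like ffmpeg for frame extraction and reassembly).

```python
def stylize(frame, style):
    # Placeholder: a real implementation would run a neural style-transfer
    # model here. We just tag the frame to show the data flow.
    return f"{style}({frame})"

def split_into_frames(video):
    # Stand-in for a decoder extracting individual frames.
    return list(video)

def reassemble(frames):
    # Stand-in for an encoder stitching frames back into a video.
    return frames

def stylize_video(video, style):
    frames = split_into_frames(video)
    styled = [stylize(f, style) for f in frames]  # each frame treated alone
    return reassemble(styled)

clip = ["frame0", "frame1", "frame2"]
print(stylize_video(clip, "picasso"))
```

Note that each frame is processed with no knowledge of its neighbors, which is exactly where the trouble starts.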
Turns out this was surprisingly fun. I took two things stylistically unalike—the symmetrical perfection of Stanley Kubrick's 2001: A Space Odyssey and the chaotic explosion of shape and color of Picasso—to create 2001: A Picasso Odyssey (https://vimeo.com/169187915). The short film was silly and absurd, but it happened at the right time to capture a moment in AI. It ended up showing at a few film festivals, often framed by the now-perennial question: Can computers make art?
Not long after this, I got a call from a producer in Los Angeles, David Shapiro, who was working with Kristen Stewart on her directorial debut, Come Swim. The concept for this film was based on a watercolor painting Stewart had made, and it had sparked a creative question: Could style transfer be used to reexpress some of the scenes in the movie in the style of the painting?
When you get that kind of call, you say yes and then spend the next three months working out how you're going to make it happen. Drunk on hubris from A Picasso Odyssey, I'd—incorrectly—romanticized the capabilities and idea of style transfer and AI and just assumed the task was easy: Frames and painting go in, movie comes out.
But now I was on a real movie, in real production. And it turned out I was wrong about AI being plug and play for creative work—so wrong, in fact, that we ended up writing a paper about how wrong we were.[8] So what happened?
Taking ML techniques that work on individual images and assuming they are going to work on a video sequence is a superbad assumption. We run into a key problem here that doesn't have a simple solution: poor temporal stability (or poor temporal coherence). It turns out that even minuscule changes in input images—the kind of tiny changes that happen between frames in a video—can lead to dramatically different results when an ML algorithm is applied.
This means that there's a significant amount of "sizzle" in the results; there could be significant differences in rendered color and even shape from frame to frame. It's a cool effect that in some cases works well with the narrative, but in the vast majority of cases it's a bug, an error, and we're still trying to find ways around this.
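To see why independent per-frame processing sizzles, here's a toy illustration (not a real model): `toy_stylize` is a hypothetical stand-in for an ML filter whose output depends chaotically on its exact input, the way a per-frame optimization can land in a different local minimum for each slightly different frame.

```python
import random

def toy_stylize(frame_pixels):
    # Toy stand-in for a per-frame ML filter: its output depends
    # chaotically on the exact input. Seeding a PRNG from the frame
    # contents means near-identical inputs give unrelated outputs.
    rng = random.Random(hash(tuple(frame_pixels)))
    return [p + rng.uniform(-50, 50) for p in frame_pixels]

frame_a = [100.0, 100.0, 100.0]
frame_b = [100.0, 100.0, 100.1]  # one pixel nudged, as between video frames

out_a = toy_stylize(frame_a)
out_b = toy_stylize(frame_b)

# The inputs differ by 0.1 in one pixel, but the outputs can differ by
# tens of units everywhere: frame-to-frame "sizzle."
print(max(abs(a - b) for a, b in zip(out_a, out_b)))
```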
In a more general sense, another problem with ML techniques that transform images—for argument's sake let's call them ML filters—is poor local control. These techniques take hints about what you want them to do and then maybe will do it the way you want them to. So if you want a good result, then we return to the age-old story about humans creatively using tools—that is, use human intuition and discrimination and a sense of taste and aesthetics to select the best result from among several attempts the computer has made.
We can get better results by adding more handles for directability. These handles are often priors—bits of precomputed or structured information—that help explicitly guide the machine learning process. Tools like ControlNet allow you to preprocess the video to highlight what is important (e.g., edges) and use that to guide the generation.
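As a rough sketch of what such a prior looks like: ControlNet-style conditioning is often fed a precomputed edge map (typically Canny edges). The crude gradient-threshold detector below is a hypothetical stand-in for that preprocessing step, just to show the structured hint that gets handed to the generator.

```python
def edge_map(image, threshold=10):
    # Crude gradient-magnitude edge detector: a stand-in for the Canny
    # edge preprocessing often fed to ControlNet-style models. The edge
    # map acts as a prior that pins down structure during generation.
    h, w = len(image), len(image[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx = image[y][min(x + 1, w - 1)] - image[y][x]
            gy = image[min(y + 1, h - 1)][x] - image[y][x]
            if abs(gx) + abs(gy) > threshold:
                edges[y][x] = 1
    return edges

# A bright square on a dark background: edges fire at the boundary,
# not in the flat interior.
img = [[0] * 6 for _ in range(6)]
for y in range(2, 4):
    for x in range(2, 4):
        img[y][x] = 100
for row in edge_map(img):
    print(row)
```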
It's worth noting that these still don't address the temporal sizzle. There are various ways of improving the result with artistic intervention using warping tools like EbSynth, but as the level of human artistic intervention increases, the result looks less like AI and more like traditional video animation work—and that's okay!
Deepfakes have overcome this issue with a great degree of success by using a prior that computers can now track very well: the human face. It's a very limited prior—as soon as someone turns their head to the side, it falls apart. Consumer-level deepfakes are starting to look pretty decent, but they still don't look quite right. And they still require a lot of human intuition, steering, and taste to get them there.
High-end deepfakes—like those used on Indiana Jones and the Dial of Destiny—are even further down that line. The results look great, and the fact that the starting point was ML doesn't make that any less true. But it took hundreds of hours of artist intervention to get there.
These ML filters are a long, long way from claims that AI can "make a movie." It's safe to treat any overheated hype around that with due suspicion. Sure, they might make some video jobs a bit easier to achieve, but they still require a skilled operator with taste and discretion.
Text to video. So what about text to video? Over the past year or so, there have been several machine learning models that, similar to other diffusion-based techniques, can generate videos based on a text prompt. It's astonishing that they work at all, and the promise of these techniques is that they can generate anything you can think of. The results are fun, but they're a long way from being good, especially given the problems with our old friend, poor temporal coherence.
The argument here is that even though the technology is in its early stages, it will evolve fast, like DALL-E and Stable Diffusion. It's a reasonable prediction to make, and I don't mind admitting I had a moment of existential dread when I thought moviemaking could be replaced with generative models in the near future.
But it doesn't hold up. From a technical point of view, the task of getting a machine learning model to generate video is an enormous mountain to climb. These models need to accurately generate a picture world from any precise viewpoint or angle (we're not nearly there yet). They need to learn what is in videos and how they move over short time spans (also super challenging). Over longer time spans, they need to learn sequences of actions and then situate them in a world model. Each one of these is a massive area of research on its own, and combining them is no mean feat.
That's not to say the general idea isn't possible. As with all creative tooling, there are clever trade-offs that can make a carefully staged portion of the idea possible. A 2023 paper, "Diffused Heads: Diffusion Models Beat GANs on Talking-Face Generation,"[9] trains only on audio and video of talking heads. What's remarkable is the quality of the result: There's no prior here, and it's truly generative. The faces, the mouth shapes, the timing, the expressions—all are completely generated from an audio track and a starting "identity frame" that you give it to kick off the process.
In a similar spirit, the model in the 2022 paper "Flexible Diffusion Modeling of Long Videos"[10] is trained on a set of clips of driving around a virtual, fixed-size town or a Minecraft map. Within these constraints, arbitrary views and even realistic motions of driving and movement can be generated.
There's an argument that perhaps if we make the models bigger they'll learn how to deal with time. I'm not sure that's true—video motion is so much more complex. Taking an educated guess based on the path of current discoveries, completely generative video models of any useful quality are going to be too big for practical inference for at least the short to medium term.
Instead, let's play this forward theoretically and assume that these enormous challenges around text-to-video generation are somehow solved, and we have models that can generate anything we can think of. I'd be happy to stick my neck out and say that even then we're still going to need a human in the loop. Because if you can do anything, you'll need, more than ever, someone with taste and a sense of aesthetics to tell a good story. Let's look at a solid example from 3D computer graphics in the movie industry: Pixar's Inside Out (2015).
Still from "Harry Potter by Balenciaga."
With hundreds of hours of artist labor, massive 3D worlds can be created, and the camera—it can go anywhere. But Inside Out's director of photography, Patrick Lin, made the decision to tie the camera to a model of real-world mechanics—the virtual lens was modeled precisely after real cinema lenses to capture their flaws and imperfections. The virtual camera was tied to a virtual camera crane that couldn't go anywhere that wasn't physically plausible. These creative choices were based on intuition and aesthetics and taste; these intrinsically human qualities are what make the look and feel of the movie so good.
Even when AI can create videos with minimal human supervision, whether it's 3D computer graphics or diffusion, it will still be locked in an eternal battle with the collective human ability to spot a fake. A year into so-called high-quality text-to-image generation, we've already got the collective ability to identify that washed-out Midjourney look, the mangled hands or the mismatched lighting and perspective. The quality of these models will get better, but so will our ability to pick them out.
As always, there is creative innovation happening with clever filmmakers who are able to create videos that work around the limitations of imperfect tooling. A recent favorite of mine is the series of Balenciaga spoofs (https://www.youtube.com/watch?v=iE39q-IKOzA).
Faces were generated using Midjourney and voices using ElevenLabs AI. The faces were animated using D-ID. But only the faces are animated. The bodies are rigid and lifeless; the voices, staccato and flat. And what real-life situation has that aesthetic? Haute couture, of course. All of a sudden these limitations become central to the look, and it works.
Similarly, "Pepperoni Hug Spot" (https://www.youtube.com/watch?v=qSewd6Iaj6I) works with text-to-video footage, which is low resolution, grainy, and limited to clips only seconds long because they become incoherent after that. What was low res, grainy, and full of short cuts of seconds-long clips? Ads in the 1980s! That's not exactly true, but it's how we remember it. And that's good enough. The crew nailed the look here.
Still from "Pepperoni Hug Spot," a fictional TV commercial made with text-to-video footage.
The image and video generation in these examples is important, but there's a ton of human work here. Humans need to write prompts to drive the image and video generation, and to sift through many iterations to find a set of media that work together. A story and a script that make sense need to be written. Then that needs to be edited into meaningful and aesthetic cuts; captions, titles, and music need to be added in post. The human is still the director, and the complex intuition and creativity required for that aren't likely to go anywhere.
I don't think there's any algorithm or model that can capture the subtle expressiveness of cinema, TV, or theatrical presentation. It isn't about regurgitating what is literally seen; it's about setting rules for the world of the story and knowing when to creatively break them for emphasis. It's a unique intuition that can't be described, so a computer is going to have a hard time capturing it.
Don't count out humans yet.
1. Henshilwood, C.S. et al. A 100,000-year-old ochre-processing workshop at Blombos Cave, South Africa. Science 334 (2011), 219–222; https://www.science.org/doi/10.1126/science.1211535
2. History of the Scientific & Technical Awards. Oscars website; https://www.oscars.org/sci-tech/history
3. Wilkins, B. Sketching interfaces: Generating code from low fidelity wireframes. Airbnb website; https://airbnb.design/sketching-interfaces/
4. Frost, B. Design systems in the time of AI; https://bradfrost.com/blog/post/design-systems-in-the-time-of-ai/
5. Michelle, C., Davis, C.H., Hight, C., and Hardy, A.L. The Hobbit hyperreality paradox: Polarization among audiences for a 3D high frame rate film. Convergence 23, 3 (2017), 229–250; https://doi.org/10.1177/1354856515584880
6. Hughes, M. If digital effects ruin movies, so did color and sound. New York Times. Mar. 8, 2013; https://www.nytimes.com/roomfordebate/2013/03/07/are-digital-effects-cgi-ruining-the-movies/if-digital-effects-ruin-movies-so-did-color-and-sound
7. Gatys, L.A., Ecker, A.S., and Bethge, M. A neural algorithm of artistic style. arXiv:1508.06576, 2015; https://doi.org/10.48550/arXiv.1508.06576
8. Joshi, B., Stewart, K., and Shapiro, D. Bringing impressionism to life with neural style transfer in Come Swim. Proc. of the ACM SIGGRAPH Digital Production Symposium. ACM, New York, 2017, Article 5, 1–5; https://doi.org/10.1145/3105692.3105697
9. Stypułkowski, M., Vougioukas, K., He, S., Zięba, M., Petridis, S., and Pantic, M. Diffused heads: Diffusion models beat GANs on talking-face generation. arXiv:2301.03396, 2023; https://arxiv.org/abs/2301.03396
10. Harvey, W., Naderiparizi, S., Masrani, V., Weilbach, C., and Wood, F. Flexible diffusion modeling of long videos. arXiv:2205.11495, 2022; https://arxiv.org/abs/2205.11495
Bhautik Joshi is a principal R&D engineer at Canva. He has nearly 25 years of experience in immersive tech, film, TV, games, and machine-learning and consumer-facing tools for creativity, having previously worked at CSIRO, Industrial Light & Magic, Flickr, and Adobe Research. [email protected]
- 100,000 BCE: Stone Age painting kits, South Africa
- 5,000 BCE: Bone tools for sculptures, Mesopotamia
- 100 CE: Pens with ink, Greece
- 1200: Silverpoint paper marking, Europe
- Mid-1500s: Graphite and the pencil
- 1630s: The camera obscura (allegedly used by Vermeer)
- 1826: Development of the camera and permanent photograph, France
- 1934: Development of acrylic paint, Germany
- January 1963: First computer art, by Efi Arazi, published on the title page of Computers and Automation
- 1982: First realistic fractal-generated landscape and motion blur, "the Genesis effect," by the Lucasfilm Computer Division, in Star Trek II: The Wrath of Khan
- March 1983: First offline photorealistic ray-traced render at arbitrary resolution, Road to Point Reyes, by Alvy Ray Smith
- 1989: Development of RenderMan, a sophisticated tool kit for photorealistic offline rendering of 3D scenes
- 1999: Release of the first consumer GPU, the Nvidia GeForce
Copyright held by author. Publication rights licensed to ACM.