How Generative AI Could Enable a New Era of Filmmaking

Illustration: robotic hands framing a shot of a virtual landscape (Variety VIP+; background: Adobe Stock)

Note: This article relates to the Variety VIP+ special report “Generative AI in Film & TV,” available to subscribers only.

Imagine a Hollywood production without cameras, sets, locations — even actors.

Unthinkable as it might seem, it’s a future that could begin to emerge with generative artificial intelligence. Gen AI tools and their underlying models are starting to offer capabilities that point toward synthetic production, in which certain physical production methods are replaced by rendered ones.

Video diffusion models, neural radiance fields (NeRFs) and video avatars are among the emerging AI tools that could begin to revolutionize filmmaking:

1. Video generation: Generative video tools such as Runway’s Gen-2 and Pika use video diffusion models to synthesize novel video, creating short, soundless animations from text prompts, images or video. Researchers at Meta and Google have developed unreleased models with similar capabilities, Emu Video and Imagen Video, respectively. Additionally, Google’s forthcoming multimodal model Gemini is expected to offer video generation, a capability OpenAI is also likely to bring to ChatGPT later this year with its GPT-5 model update.

Some have analogized video diffusion models to a new kind of camera, albeit one where video is rendered rather than physically recorded. Model capabilities have improved dramatically over the past year of development, allowing for longer outputs, better temporal consistency and higher fidelity. As of a recent update, Runway’s Gen-2 can generate videos up to 18 seconds long, up from four.
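
To make the “rendered camera” idea concrete, the sketch below drives a publicly released research text-to-video diffusion model from code, using the open-source Hugging Face diffusers library. The checkpoint, prompt and settings are illustrative assumptions; commercial tools such as Gen-2 and Pika expose comparable controls through their own apps and APIs, not this code path.

```python
# Minimal text-to-video sketch using the open-source Hugging Face
# "diffusers" library. The model checkpoint and settings below are
# illustrative; Runway Gen-2 and Pika are separate, proprietary tools.
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

# Load a publicly released research text-to-video diffusion model.
pipe = DiffusionPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")  # video diffusion is GPU-hungry

# A text prompt stands in for script direction; the model "renders"
# the shot rather than recording it.
prompt = "slow aerial dolly shot over a misty pine forest at dawn"
result = pipe(prompt, num_inference_steps=25, num_frames=24)
frames = result.frames[0]  # first video in the batch; the exact
                           # nesting varies slightly across versions

# Write the denoised frames out as a short, soundless clip.
video_path = export_to_video(frames, output_video_path="forest_dolly.mp4")
print(video_path)
```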


For now, raw video outputs from such tools are still far too limited to serve as usable onscreen footage for a high-production-value film or premium TV series. Questions of copyright aside, these tools are also too constrained to give professional artists the necessary control over the output, meaning the ease with which they can steer or manipulate a result to achieve a specific look.

But the quality and realism of outputs from video diffusion models are expected to keep improving, which suggests more serious potential usefulness in the future. Powerful new control parameters are also regularly being added to software tools to let users more precisely shape how the video renders, such as Runway’s Director Mode in Gen-2, which allows zooms, speed adjustments and “camera” rotations. Runway also recently released Multi Motion Brush, which lets users apply independent motion to selected areas of a video.

2. Neural radiance fields (NeRFs): NeRFs have gained attention in VFX circles for entertainment production use cases. Applications such as Luma AI and Nvidia Instant NeRF let users create NeRFs from video shot on an iPhone, though NeRFs built from production-grade camera footage will be of higher quality.

To create a NeRF, a neural network is trained on a simple recorded video from any camera, or on just a partial set of 2D images; the inputs don’t need to show every perspective, side or angle of the object or scene. The network can then generate a high-fidelity 3D representation of that object or scene, inferring even the viewpoints that were never captured in its training data.
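
As rough intuition for what that network looks like, below is a deliberately tiny PyTorch sketch of a NeRF’s core: an MLP that maps a 3D point and viewing direction to color and density, from which pixels are rendered by compositing samples along each camera ray. Production systems such as Instant NeRF add hash-grid encodings, hierarchical sampling and heavy optimization on top; everything named here is a simplified assumption for illustration.

```python
# Highly simplified sketch of the neural network at the core of a NeRF.
import torch
import torch.nn as nn

def positional_encoding(x: torch.Tensor, n_freqs: int = 10) -> torch.Tensor:
    """Map coordinates to sines/cosines of increasing frequency, which
    helps the MLP represent fine spatial detail."""
    out = [x]
    for i in range(n_freqs):
        out += [torch.sin((2.0 ** i) * x), torch.cos((2.0 ** i) * x)]
    return torch.cat(out, dim=-1)

class TinyNeRF(nn.Module):
    """Maps a 3D point (plus view direction) to color and density.
    Rendering a pixel means sampling points along the camera ray and
    compositing these outputs; training compares rendered pixels to
    the captured 2D images."""
    def __init__(self, n_freqs: int = 10):
        super().__init__()
        in_dim = 3 + 3 * 2 * n_freqs  # raw xyz + encoded xyz
        self.mlp = nn.Sequential(
            nn.Linear(in_dim + 3, 256), nn.ReLU(),  # +3 for view direction
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 4),  # RGB color + volume density
        )

    def forward(self, xyz: torch.Tensor, view_dir: torch.Tensor) -> torch.Tensor:
        feats = torch.cat([positional_encoding(xyz), view_dir], dim=-1)
        out = self.mlp(feats)
        rgb = torch.sigmoid(out[..., :3])   # colors in [0, 1]
        density = torch.relu(out[..., 3:])  # non-negative opacity
        return torch.cat([rgb, density], dim=-1)

# One batch of sample points along camera rays:
model = TinyNeRF()
pts, dirs = torch.rand(1024, 3), torch.rand(1024, 3)
print(model(pts, dirs).shape)  # torch.Size([1024, 4])
```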

In an improvement over photogrammetry, NeRFs also retain and dynamically render all reflections, lighting and qualities of different materials (e.g., transparency of glass, shininess of metal, human skin).

NeRFs present impressive potential for VFX artists and even directors. Once created, a single NeRF capture can render any number of new 3D visualizations, which can then be processed in the cloud and exported in a variety of editable 3D formats.

Analogized to a virtualized camera that operates and moves within 3D volumetric space, a single NeRF lets creators render infinite “camera” paths and framings from any angle or position, enabling physically “impossible” shots and the possibility of reframing a scene in post.
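
Concretely, those “infinite camera paths” reduce to generating camera pose matrices after the fact. The sketch below builds an orbiting crane move as a list of camera-to-world matrices that could be fed to a NeRF renderer; `render_nerf_view` and `trained_nerf` are hypothetical placeholders, since each toolchain (Instant NeRF, Luma AI, etc.) has its own rendering interface.

```python
# Sketch: a NeRF as a virtual camera. A "camera move" is just a list of
# camera-to-world pose matrices; any path, height or framing can be
# generated after the capture.
import numpy as np

def look_at_pose(eye: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Build a 4x4 camera-to-world matrix that places the camera at
    `eye`, aimed at `target` (z-up world, OpenGL-style convention)."""
    forward = target - eye
    forward /= np.linalg.norm(forward)
    right = np.cross(forward, np.array([0.0, 0.0, 1.0]))
    right /= np.linalg.norm(right)
    up = np.cross(right, forward)
    pose = np.eye(4)
    pose[:3, 0], pose[:3, 1], pose[:3, 2] = right, up, -forward
    pose[:3, 3] = eye
    return pose

# An orbiting, slowly descending crane move that never existed on set:
center = np.array([0.0, 0.0, 1.0])
poses = []
for t in np.linspace(0.0, 2.0 * np.pi, 120):  # 120 frames
    eye = center + np.array([3.0 * np.cos(t), 3.0 * np.sin(t), 1.5 - t / 8.0])
    poses.append(look_at_pose(eye, center))

# for pose in poses:
#     frame = render_nerf_view(trained_nerf, pose)  # hypothetical renderer API
```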


Attention has also gathered around using NeRFs in virtual production to replace the content that goes into LED volumes. In virtual production, in lieu of greenscreens, LED walls are constructed around a set and display fully rendered environments synchronized with the on-set camera’s movement.

Typically, the imagery displayed in these volumes requires VFX artists to build a realistic 3D model of the scene environment in Unreal Engine. NeRFs may now offer a dramatically easier, cheaper way to create these 3D scenes: a small team of photographers can simply go to a location and capture the video or images of the environment needed to render a NeRF.

3. Video avatars: Generative AI tools developed by Synthesia, Soul Machines and HeyGen can create entirely synthetic, photorealistic avatars that combine deepfake video and synthetic speech to precisely replicate a specific person’s appearance, voice, expressions and mannerisms. These unique personal AI avatars have been variously referred to as digital humans, twins, doubles or clones.

AI systems build a custom model of a person by training on varying amounts of audiovisual data, whether captured in a studio or as video footage of the person speaking directly to camera. AI avatars fall on a wide spectrum of realism: some are hyperrealistic and almost indiscernible from the real person, while others still look like 3D graphics or “gamelike.”

For their speech capabilities, avatars can be given a transcript to recite; or, to enable conversational interaction, they can be paired with a large language model (e.g., GPT-4) that effectively serves as a knowledge base or “brain,” and which can be customized to an individual’s “personality” with further training.
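
A minimal sketch of that pairing appears below: a large language model generates what the avatar says, and a separate avatar system renders it. The sketch assumes the OpenAI Python SDK (v1.x) for the “brain”; `speak_through_avatar` and the persona prompt are hypothetical placeholders, since vendors such as Synthesia and HeyGen each expose their own rendering interfaces.

```python
# Sketch of an LLM serving as an avatar's "brain." Requires the
# OPENAI_API_KEY environment variable; the persona and the rendering
# step are illustrative assumptions, not any vendor's actual API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The "personality" lives in the system prompt (real deployments may
# also train on interview footage, as Soul Machines describes below).
persona = (
    "You are the digital double of a veteran film director. "
    "Answer questions about your craft warmly and concisely."
)

def avatar_reply(question: str) -> str:
    """Ask the LLM to answer in character; the returned text would be
    handed to a speech/avatar renderer for lip-synced video."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": persona},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

line = avatar_reply("How do you block a dialogue scene?")
# speak_through_avatar(line)  # hypothetical: render lip-synced video
print(line)
```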

For example, Soul Machines trains large language model and speech systems on hundreds of hours of interviews to replicate how an individual would respond in conversation, both in what they say and how they say it. Avatars can also be rendered to speak any language the underlying language model supports.

For now, however realistic some avatars appear, many have only a limited range of motion and facial expressiveness, and overall they remain in the uncanny valley, the term for the uneasy emotional response we have toward not-quite-real humanoid figures.

As the tech advances, however, entirely synthetic, hyperrealistic avatars could conceivably bridge the uncanny valley and look, speak and behave indiscernibly from an actual person, whether trained on an actor’s data or built as brand-new virtual people.

Even so, many, including AI developers themselves, believe synthetic actors are unlikely to fully replace human performances in film and TV, at least those of principal actors. Ethical and consumer considerations aside, it would be extremely difficult to realistically replicate a human actor’s full emotional range and responsiveness in a way that captures a performance’s unquantifiable genius or magic.

The longer-term implications of synthetically rendering film elements (actors and video in particular) are potentially great. In the theorized paradigm, AI video generation, meaning the rendering of synthetic video with AI models, replaces physical production. For instance, if scene settings can be created entirely within a computer, on-set or location shoots and the camera itself become all but expendable. Through another lens, however, this would only continue a trend toward CG and virtual production, where what’s physically captured in front of the camera already differs considerably from the final look of a film.

While synthetic production techniques aren’t likely to replace the basic ingredients of traditional filmmaking anytime soon, how far production “virtualizes” with AI may increasingly depend on legal and contractual practicalities, creative principles or scruples, and consumer acceptance.

See VIP+’s full catalog of artificial intelligence articles ...

Read the Report