Since OpenAI announced its text-to-video (T2V) tool Sora this week, manic predictions have ensued with varying combinations of awe and trepidation, acclaim and dismay, over what appeared to be a massive leap forward in T2V capability compared with similar publicly available tools, such as Runway’s Gen-2 and Pika.
Judging purely by its sample outputs, OpenAI’s Sora is the most impressive video diffusion model to date. But Sora and other video diffusion models are still limited in ways that would make them inept for Hollywood filmmaking.
“Sora is a tremendous achievement in the direction of realistic content that will be useful in high-end entertainment. Today, creatives demand total control over performances and what is in a scene, so there is still a long way to go before diffusion models can generate Hollywood movies,” Tom Graham, CEO and cofounder of Metaphysic, the AI firm responsible for de-aging Tom Hanks in the Miramax film "Here," told VIP+.
But first, what makes Sora a leap forward exactly?
On paper, the model’s capabilities are largely the same as those already offered by other video diffusion models from Runway and Pika, spanning both video generation and video editing: generating a novel short video from a text prompt, generating a video from a 2D image (e.g., animating a still), inpainting (replacing or inserting new visual elements) and outpainting (extending a shot beyond its original frame, filling in with context-relevant content).
But Sora improves on or newly achieves a few things:
- Video quality and realism: Most evidently, its videos appear significantly more photorealistic — with higher fidelity — compared to outputs from other models.
- Video length: Sora’s video outputs can be up to a minute long while maintaining coherence to the prompt, significantly longer than Runway’s Gen-2, which can generate up to 18 seconds per generation as of its August 2023 update, up from just four seconds previously.
- Spatiotemporal consistency: Sora also promises the ability to extend generated videos. But the power of this capability is best understood in the context of another: by giving the model “foresight” of many frames at a time, Sora solves the problem of “making sure a subject stays the same even when it goes out of view temporarily.”
This conceivably resolves a pain point for those who would try to create an AI-generated film that stitches together multiple video outputs as “shots.” Such attempts struggle to maintain character and scene continuity because repeated generation using the same prompt wording or conditioning parameters will never result in the model producing an identical result. This “extender” capability that can maintain character or object continuity from one output to the next could enable longer AI-generated storytelling from video diffusion tools.
Yet despite these advances, the following still present real barriers for models like Sora to be used in Hollywood productions:
- Continuity: Sora’s promised improvements don’t fully guarantee the subject, object and environment continuity needed for a coherent narrative or consistent look across a film or TV show. Nor is Sora free of occasionally misconstruing how the real world looks or behaves, producing the kind of “physics fails” observed in outputs from other image and video models.
- Controllability: Some have analogized these models to a camera, albeit one where video is rendered rather than physically recorded. But as I’ve previously discussed, these tools so far don’t offer filmmakers the creative control and precision needed to direct and manipulate their outputs, meaning AI could prove more, not less, difficult and constraining than traditional methods in the near term. That is changing bit by bit as new control parameters are added to the software, but it doesn’t automatically mean AI video vastly improves on camera footage.
- Copyright: More consequentially, Hollywood productions are highly unlikely to use these outputs for on-screen footage without more clarity on all sides of copyright law and generative AI. Significant questions remain open, including whether AI-assisted works from such models will be copyright protectable and whether AI-generated material is an infringement liability due to the strong likelihood that models have trained on copyrighted material.
As I argued in our December 2023 special report, “Generative AI in Film & TV”: “At least in the near term, until legal questions are more clearly resolved, studio decisions about using AI-generated images or videos in production remain problematic and will prevent the use of generative AI tools for production assets that will show up on screen.”
As a result of these constraints, in the near term these tools are most likely to be used in the previsualization stages of a project, such as rapidly developing and iterating concept art, character designs or animatics.
But even early-stage concept work isn’t necessarily safe from infringement claims or questions of protectability if, for example, a studio, creative team or artist generates an interesting character or environment that is then used in a human-created TV show, movie or video game.
Sora isn’t publicly available (yet) and will be red-teamed to determine vulnerabilities or vectors of misuse. OpenAI also pledged to get feedback from policymakers, educators and artists around the world to understand concerns and identify beneficial use cases.
That’s similar to the conclusion reached by Google researchers who presented Lumiere, a new T2V diffusion model, in late January, saying while they believed the tool offered creative possibilities, “there is a risk of misuse for creating fake or harmful content with our technology.”
Assuming Sora does get a public release, it stands to become a tool for social media creators and average users to flex their creativity, instantly flooding platforms with generated video. Advertisers and content marketers are also likely to find uses for it. Unfortunately, deepfake disinformation is another likely outcome, watermarks notwithstanding. What comes of OpenAI’s conversations with policymakers, educators and artists remains to be seen.