Just a few weeks back, I wrote that we are probably still some way from being able to create a movie from a natural language prompt.
Now, it seems that it may happen a lot sooner than I suspected. OpenAI – creator of ChatGPT, the chatbot that started the current generative AI craze – just announced its own text-to-video model, Sora.
To say the results have stunned the AI community is an understatement. Although we can’t yet use it ourselves, the released videos show close-to-photorealistic sequences of a woman walking through a city and a gold-rush-era US town, each generated from a simple text prompt.
According to people I’ve spoken to, this puts OpenAI two or three years ahead of where it was assumed to be when it comes to generative video. This is just one more sign that the AI revolution is going to take place at a far quicker pace than many are anticipating.
But generative video – while undoubtedly technically amazing – creates ethical and societal challenges that go beyond those posed by the automated creation of text, images and sounds.
So, let’s take a look at what it is, what it does, and perhaps most importantly, what it means for a world in which it will inevitably become more and more difficult to tell the difference between the real and the digitally generated.
So What Is Sora?
Basically, Sora is to video what ChatGPT is to writing, and Dall-E 3 is to image generation. You type what you want to see, and it appears, in full motion, in front of your eyes.
None of the videos shown so far have any sound, but given advances in AI sound and music generation, we can only assume that this will be coming soon.
Generative AI video creators aren’t entirely new. I’ve outlined a number of them that have appeared in the last year or so in the piece I linked to at the start of this article. Mostly, though, they generate text overlays and effects rather than actual video animation. There are a few exceptions, however, such as Runway.
At this early stage, impressive though it is, it isn’t going to give us the next Toy Story from a prompt. But the potential is virtually unlimited. Filmmakers can use it to visualize concepts and scenes or generate special effects. Teachers can create immersive historical recreations, and manufacturers can use it to create prototypes and demonstrations.
At the moment, Sora can generate videos up to one minute long. And it’s more than simple image generation (if we have to think of that as simple now) applied to a set of consecutive frames to give the impression of movement; it’s capable of tracking the positions of objects so they move realistically and coherently with other objects, passing in front of or behind them, for example.
It can even perform complicated operations like “remembering” objects when they move off-camera so they will be recreated accurately when they move back into view.
It isn’t perfect, of course, and OpenAI admits that it will generate inconsistencies, such as objects that don’t follow the laws of physics or causality.
But from what we’ve seen, it’s an amazing technology that gives a tantalizing glimpse of what we will soon be able to do!
How Does It Work?
Like Dall-E and other image generators, Sora is essentially a diffusion model, meaning it creates images from random “noise” and gradually de-randomizes them, transforming them into an image that matches the prompt.
Over thousands or tens of thousands of steps, the images that make up the video become more defined.
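The core idea can be sketched in a few lines of code. The following is a deliberately simplified toy, not Sora’s actual implementation: the `toy_denoiser` function stands in for the trained neural network (which in reality predicts and removes noise conditioned on the text prompt), and the “prompt” is just a clean target array we try to recover from pure noise.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_denoiser(x, target, strength=0.1):
    """Stand-in for the learned model: nudge the noisy sample a small
    step toward the 'prompt-matching' target. A real diffusion model
    instead predicts the noise to subtract at each step."""
    return x + strength * (target - x)

# The "prompt": the clean signal we want to end up with
# (a real model generates image frames, not a 1-D array).
target = np.linspace(-1.0, 1.0, 8)

# Start from pure random noise, as the article describes...
x = rng.normal(size=target.shape)

# ...then gradually de-randomize over many small steps, each one
# making the sample a little more defined.
for _ in range(200):
    x = toy_denoiser(x, target)

print(np.abs(x - target).max())  # error shrinks toward 0
```

Each step shrinks the error by a constant factor, so after a few hundred iterations the sample is essentially indistinguishable from the target; a real model repeats an analogous denoising step for every frame of the video.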
What really makes it special is the ability to understand how the objects – people or anything else – in the setting would realistically interact with everything else. This could mean water making things wet when they move through it or a ball falling and moving across the floor in a realistic way when it’s dropped.
Just as ChatGPT understands words from their context, learning how they fit together with other words to communicate meaning, Sora understands how things act and behave in real-world settings. OpenAI hasn’t given details of what data it’s trained on, but it’s likely to be many, many hours of real-world video footage from which it can learn how items, people, animals, and scenery move and interact.
As well as generating entirely new footage, it can continue an existing video and recreate existing footage from new angles.
Is The World Ready For Generative Video On-Demand?
Sora offers amazing possibilities. But empowering anyone to create realistic videos of anything they want will clearly not be without dangers.
Scams and phishing attacks could become more sophisticated, for example, by using deepfake videos to make fraudulent activities seem more legitimate or plausible. We’ve already seen this with AI voiceovers overlaid on footage of celebrities to create the impression they are giving their endorsement.
It will inevitably also become easier to create non-consensual videos with convincing likenesses of real people, which could be used to cause harm or for blackmail.
I am sure that we will also see it used in attempts to subvert democratic processes and spread fake news and disinformation, with the aim of undermining trust in politicians, governments, or institutions.
OpenAI tells us it has built safeguards into its algorithms in order to prevent many of these uses and is also developing its own tools to help identify harmful content. But as we’ve seen with ChatGPT, it’s highly likely that workarounds for these will be found, or copycat products will emerge without safeguards in place.
Addressing these issues will require a concerted effort involving education, legislation and the adoption of robust frameworks around responsible, ethical AI use. Sadly, as has been the case with every transformative technology from mechanization to the automobile and computing, it seems inevitable that some harm will be caused.
But the genie is now very much out of the bottle, meaning it’s down to responsible AI users and advocates to ensure society manages these risks effectively while also allowing its transformative potential to be realized.