The term generative AI refers to a relatively new branch of AI focused on creating human-like content, from pictures and videos to poetry and even computer code.
To achieve this, several different techniques are used. Most have evolved over the last 10 years, building on earlier work in neural networks and deep learning, as well as the more recent transformer architecture.
All of them rely on data to effectively ‘learn’ how to generate content, but beyond that, they are built around quite different methodologies. Here’s my overview of some of the categories they fall into, as well as the types of content they can be used to create.
Large Language Models
Large language models (LLMs) are the foundational technology behind breakthrough generative AI tools like ChatGPT, Claude and Google Gemini. Fundamentally, they are neural networks trained on huge amounts of text data, allowing them to learn the relationships between words and then predict the next word that should appear in any given sequence. They can then be further trained on texts from specialized domains – a process known as ‘fine-tuning’ – to enable them to carry out specific tasks.
Words are broken down into ‘tokens,’ which could be small, individual words, parts of longer words, or combinations of prefixes, suffixes and other linguistic elements that frequently appear together in text. Each token is then converted into a vector of numbers – an ‘embedding’ – giving the network structured numerical data that it can analyze and transform mathematically.
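To make this concrete, here’s a minimal sketch of next-word prediction in Python. The toy vocabulary, the random weights and the simple averaging of embeddings are all stand-ins of my own – a real LLM learns billions of weights from data and uses attention rather than averaging – but the flow from tokens to a probability distribution over the next token is the same.

```python
import numpy as np

# A toy sketch of next-token prediction. The vocabulary, the random
# weights and the embedding-averaging step are illustrative stand-ins;
# a real LLM learns its weights from data and uses attention instead.
vocab = ["the", "cat", "sat", "on", "mat"]
embed_dim = 8

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), embed_dim))      # token -> vector
output_weights = rng.normal(size=(embed_dim, len(vocab)))  # vector -> scores

def next_token_probs(tokens):
    """Turn a token sequence into a probability for each possible next token."""
    ids = [vocab.index(t) for t in tokens]
    context = embeddings[ids].mean(axis=0)   # crude stand-in for attention
    logits = context @ output_weights        # one score per vocabulary entry
    exp = np.exp(logits - logits.max())      # softmax, numerically stable
    return exp / exp.sum()

probs = next_token_probs(["the", "cat", "sat"])
print(dict(zip(vocab, probs.round(3))))      # probabilities for the next token
```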
As well as creating text and computer code, LLMs have made it possible for computers to understand natural language inputs for many tasks, including language translation, sentiment analysis and other forms of generative AI such as text-to-image or text-to-voice. However, their use has created ethical concerns around bias, AI hallucination, misinformation, deepfakes and the use of intellectual property to train algorithms.
Diffusion Models
Diffusion models are widely used in image and video generation, and work via a process known as ‘iterative denoising’. Starting from a text prompt that tells the model what it needs to create an image of, random ‘noise’ is generated – you can think of this as starting to draw a picture by scribbling randomly on a piece of paper.
The model then gradually refines these scribbles, drawing on its training data to understand which features should be included in the final image. At each step, ‘noise’ is removed as the image is adjusted toward the desired characteristics. Eventually, this leads to an entirely new image that matches the text prompt but doesn’t appear anywhere in the training data.
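The loop below is a toy illustration of that denoising process. A real diffusion model trains a neural network to predict the noise at each step from the current image, the step number and the prompt; here, an ‘oracle’ that already knows the target image stands in for that network, simply to show how an image emerges from pure noise through repeated small corrections.

```python
import numpy as np

# Toy iterative denoising: start from pure noise and remove a little of
# it at every step. The 'oracle' noise prediction below cheats by using
# the target directly; a trained model would predict it from the image,
# the step number and the text prompt.
rng = np.random.default_rng(42)
target = rng.uniform(size=(8, 8))       # stand-in for the image matching the prompt
x = rng.normal(size=target.shape)       # start from pure random noise

steps = 50
for t in range(steps):
    predicted_noise = x - target        # a trained network's job: noise_net(x, t, prompt)
    x -= predicted_noise / (steps - t)  # strip away a fraction of the noise

print(np.abs(x - target).max())         # ~0.0: the noise is gone
```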
By following this process, today’s most advanced diffusion models, such as Stable Diffusion and DALL-E, can create photo-realistic images, as well as images that imitate paintings and drawings of any style. What’s more, they are increasingly able to generate videos, as recently demonstrated by OpenAI’s groundbreaking Sora model.
Generative Adversarial Networks
Generative Adversarial Networks (GANs) emerged in 2014 and quickly became one of the most effective models for generating synthetic content, both text and images. The basic principle involves pitting two algorithms against each other: a ‘generator’ and a ‘discriminator’, each tasked with out-foxing the other. The generator attempts to create realistic content, while the discriminator attempts to determine whether that content is real or fake. Each learns from the other, steadily improving at its job until the generator can create content that’s as close as possible to being ‘real.’
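Here’s a minimal sketch of that adversarial loop in Python, using PyTorch. To keep it self-contained, the ‘real’ data is just numbers drawn from a Gaussian distribution rather than images, and the network sizes and learning rates are illustrative choices of my own rather than settings from any published model.

```python
import torch
import torch.nn as nn

# Minimal GAN: the generator learns to mimic 'real' data drawn from a
# Gaussian centred on 4; the discriminator learns to spot its fakes.
torch.manual_seed(0)
generator = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))
discriminator = nn.Sequential(nn.Linear(1, 16), nn.ReLU(),
                              nn.Linear(16, 1), nn.Sigmoid())

loss_fn = nn.BCELoss()
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)

for step in range(2000):
    real = torch.randn(32, 1) + 4.0        # 'real' samples: mean 4, std 1
    fake = generator(torch.randn(32, 1))   # generated samples

    # Discriminator turn: label real samples 1 and generated samples 0.
    d_loss = (loss_fn(discriminator(real), torch.ones(32, 1))
              + loss_fn(discriminator(fake.detach()), torch.zeros(32, 1)))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator turn: try to make the discriminator label fakes as real.
    g_loss = loss_fn(discriminator(fake), torch.ones(32, 1))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()

print(generator(torch.randn(1000, 1)).mean().item())  # should drift toward 4
```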
Although they pre-date the large language models and diffusion models used in headline-grabbing tools like ChatGPT and DALL-E, GANs are still considered versatile and powerful tools for generating pictures, video, text and sound, and are widely used for computer vision and natural language processing tasks.
Neural Radiance Fields
Neural Radiance Fields (NeRFs) are the newest technology covered here, only emerging onto the scene in 2020. Unlike the other generative technologies, they are specifically used to create representations of 3D objects using deep learning. This includes recreating aspects of a scene that can’t be seen by the ‘camera’ – for example, an object in the background that’s obscured by an object in the foreground, or the rear of an object that’s been pictured from the front.
This is done by using neural networks to map 3D spatial coordinates to properties such as color and volumetric density, modeling an object’s geometry and the way light reflects off and around it.
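The skeleton below sketches this core idea in PyTorch: a small network maps a 3D coordinate and a viewing direction to a color and a volume density. The layer sizes are arbitrary choices of mine, and a real NeRF adds positional encoding and renders images by integrating these outputs along camera rays – but the coordinate-in, appearance-out mapping is the heart of the technique.

```python
import torch
import torch.nn as nn

# Skeletal NeRF: map a 3D point (and the direction it's viewed from)
# to a color and a volume density. Real NeRFs add positional encoding
# and render images by integrating these outputs along camera rays.
class TinyNeRF(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU())
        self.density_head = nn.Linear(64, 1)  # how 'solid' space is at this point
        self.color_head = nn.Sequential(      # color also depends on view direction
            nn.Linear(64 + 3, 64), nn.ReLU(), nn.Linear(64, 3))

    def forward(self, xyz, view_dir):
        features = self.backbone(xyz)
        density = torch.relu(self.density_head(features))
        color = torch.sigmoid(
            self.color_head(torch.cat([features, view_dir], dim=-1)))
        return color, density

model = TinyNeRF()
points = torch.rand(5, 3)                  # sample points along a camera ray
directions = torch.randn(5, 3)             # the direction each point is viewed from
color, density = model(points, directions)
print(color.shape, density.shape)          # (5, 3) colors and (5, 1) densities
```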
This allows, for example, a two-dimensional image of an object – say, a building or a tree – to be recreated as a three-dimensional representation that can be viewed from any angle. The technique, which Nvidia has helped to advance, is being used to create 3D worlds that can be explored in simulations and video games, as well as for visualization in robotics, architecture and urban planning.
Hybrid Models In Generative AI
One of the latest advancements in the field of generative AI is the development of hybrid models, which combine various techniques to create innovative content generation systems. These models draw on the strengths of different approaches – for example, blending the adversarial training of GANs with the iterative denoising of diffusion models to produce more refined and realistic outputs, or integrating LLMs with other neural networks to offer enhanced context and adaptability, leading to more accurate and contextually relevant results. This approach unlocks new possibilities for applications like text-to-image generation, where the fusion of different generative techniques leads to more complex and diverse outputs, as well as improved virtual environments.
For example, DeepMind’s AlphaCode pairs a large language model with large-scale sampling and filtering of candidate programs to generate high-quality computer code, demonstrating the versatility of hybrid approaches in software development. Another example is OpenAI’s CLIP, which learns a shared representation of text and images and is widely used to guide and evaluate text-to-image models. Because CLIP can understand complex relationships between text and images, it works across a variety of generative applications.
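As a rough illustration of the CLIP idea, the snippet below scores how well candidate captions match an image by comparing vectors in a shared embedding space. The embeddings here are random placeholders standing in for CLIP’s text and image encoders, which learn this shared space from hundreds of millions of image–caption pairs.

```python
import numpy as np

# CLIP-style matching sketch: text and images are embedded into the same
# vector space, and cosine similarity scores how well each pair fits.
# These random vectors are placeholders for real encoder outputs.
rng = np.random.default_rng(1)
caption_embeddings = rng.normal(size=(3, 512))  # e.g. three candidate captions
image_embedding = rng.normal(size=(512,))       # one encoded image

def cosine(a, b):
    """Cosine similarity: 1.0 means identical direction, 0 means unrelated."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

scores = [cosine(c, image_embedding) for c in caption_embeddings]
print("best caption index:", int(np.argmax(scores)))
```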
Generative AI is constantly evolving, with new methodologies and applications emerging regularly. As the field continues to grow, we can expect to see even more innovative approaches that blend different techniques to create advanced AI systems. The next decade is likely to bring groundbreaking applications that will transform industries and reshape how we interact with technology.