Sometimes we benefit from looking in detail at the terms that we use to talk about large language models and neural networks.
Things are changing rapidly, and science is moving from one groundbreaking model to another in the blink of an eye. It’s genuinely hard for most of us to keep up!
In our recent classes, I heard about some of the foundational ideas behind new types of neural networks that can increase performance without losing accuracy, and that could speed up the advances fueling the AI revolution.
(Disclaimer: one of these is familiar to me. I am involved in the work of the teams creating Liquid AI, as is MIT CSAIL Director Daniela Rus. I want to take the opportunity to reveal a little about the context of these systems and what makes them work.)
So … there’s a community of people working on this technology
As for the methodology, I thought Liquid AI CEO Ramin Hasani explained some of this pretty well in his remarks at a recent MIT gathering I hosted.
He started out with the ‘quadratic inference cost’ of traditional models, and the idea of going ‘sub-quadratic’ in order to save resources.
Here’s the thing: conventional LLMs are built on the transformer architecture, whose attention mechanism compares every element of a sequence with every other element.
This, in the words of Nathan Paull writing on Substack, leads to “quadratic growth of pairwise interactions,” which puts a strain on resources. So new sub-quadratic systems, experts suggest, could become the ‘keystone architectures’ powering the next generation of neural nets, although these new systems, Paull concedes, face a high bar:
“With the explosion in popularity of LLMs as the current peak of AI, any architecture that wants to replace the Transformer will have to perform at gigantic parameter scales, approaching tens or even hundreds of billions of parameters,” Paull writes. “And this is the ultimate barrier to entry, how can any new architecture show enough promise at the hundred million to few billion scale to be tested on a level competitive with LLMs.”
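To make the quadratic part concrete, here is a rough sketch of standard softmax self-attention in plain NumPy (not any particular production implementation). The n-by-n score matrix is exactly the pairwise-interaction term Paull describes, and it is where compute and memory grow with the square of the sequence length.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard (quadratic) self-attention over a sequence of length n.

    Q, K, V have shape (n, d). The score matrix is n x n, so compute and
    memory both grow with the square of the sequence length.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # (n, n) pairwise interactions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ V                              # (n, d)

# Toy usage: doubling n roughly quadruples the work in the (n, n) step.
rng = np.random.default_rng(0)
n, d = 8, 4
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
print(softmax_attention(Q, K, V).shape)             # (8, 4)
```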
As for specific new models, Paull mentions Mamba (which monitors and memorizes long-range dependencies) and a system called BASED, which apparently combines “short range convolution, and long range Taylor series attention” methods; these architectures are linear subclasses of liquid neural networks.
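The ‘Taylor series attention’ idea is worth a quick illustration. The sketch below is my own simplification rather than the actual BASED code: it swaps the softmax for a low-order Taylor-style feature map, which lets the model carry a small fixed-size running state instead of an n-by-n matrix. That regrouping is what makes the cost sub-quadratic in sequence length.

```python
import numpy as np

def taylor_features(x):
    """Feature map from a truncated Taylor expansion of exp(q . k).

    Keeping only the constant and linear terms (exp(s) ~ 1 + s) is enough to
    show the trick; BASED is described as using a second-order expansion.
    """
    ones = np.ones(x.shape[:-1] + (1,))
    return np.concatenate([ones, x], axis=-1)        # (n, d + 1)

def linear_attention(Q, K, V):
    """Causal attention in a single pass: O(n) in sequence length.

    Instead of forming an (n, n) score matrix, we carry a fixed-size running
    state S and normalizer z, updated one token at a time.
    """
    phi_q, phi_k = taylor_features(Q), taylor_features(K)
    n, d = V.shape
    S = np.zeros((phi_q.shape[-1], d))
    z = np.zeros(phi_q.shape[-1])
    out = np.zeros_like(V)
    for i in range(n):                               # linear, not quadratic, in n
        S += np.outer(phi_k[i], V[i])
        z += phi_k[i]
        # Real implementations use feature maps that keep this denominator
        # safely positive; the small epsilon is just for the sketch.
        out[i] = (phi_q[i] @ S) / (phi_q[i] @ z + 1e-9)
    return out
```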
As Hasani pointed out in his comments, the catch is that while these types of models perform well with less compute, they also tend to be less accurate in general.
“There’s no free lunch,” he told computer science students looking at these advances.
However, he noted, that’s where liquid neurons come in: replacing traditional activation functions with liquid ones makes these processes more interpretable, and lets us do more with the systems that we have.
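For readers who want a bit more than the metaphor, here is a rough sketch of the liquid time-constant formulation from Hasani and colleagues’ published papers, with an explicit Euler step standing in for the real solver (parameter names and values are illustrative, not Liquid AI’s code). The key feature is that the neuron’s effective time constant depends on the input, so the dynamics themselves adapt to the signal.

```python
import numpy as np

def ltc_step(x, inp, dt, tau, A, W, U, b):
    """One explicit-Euler step of a liquid time-constant (LTC) neuron layer.

    dx/dt = -(1/tau + f(x, inp)) * x + f(x, inp) * A
    where f is a bounded (sigmoid) function of the current state and input.
    Because f multiplies x, the effective time constant shifts with the
    input; that input-dependent dynamics is the 'liquid' part.
    Parameter names and shapes here are illustrative.
    """
    f = 1.0 / (1.0 + np.exp(-(W @ x + U @ inp + b)))  # gate in (0, 1)
    dxdt = -(1.0 / tau + f) * x + f * A
    return x + dt * dxdt

# Toy rollout on random inputs, just to show the call pattern.
rng = np.random.default_rng(1)
n_hidden, n_in = 4, 2
W = 0.1 * rng.standard_normal((n_hidden, n_hidden))
U = 0.1 * rng.standard_normal((n_hidden, n_in))
b = np.zeros(n_hidden)
x = np.zeros(n_hidden)
for _ in range(10):
    x = ltc_step(x, rng.standard_normal(n_in), dt=0.1,
                 tau=1.0, A=np.ones(n_hidden), W=W, U=U, b=b)
print(x)
```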
Here’s my two cents: think back to the early days of machine learning, when most projects relied on supervised learning with troves of carefully labeled data.
Eventually, we learned that we could move to an unsupervised model, where the AI or LLM interprets the data on its own and labels it accordingly.
We found surprising levels of accuracy, and so that approach started to catch on.
Using the new methods, we expanded what these systems are able to do.
That’s the value proposition of this technology: we’re doing more with fewer parameters. This combination of changing the methodology and changing the activation function seems to have vast potential. Then there’s the abstraction of weighted inputs, not to mention new hardware models, and … quantum computing? To be fair, that last one is in a league of its own, but in a way, the liquid model takes a page from the quantum playbook in replacing a type of determinism (function determinism?) with, well, something else.
Keep an eye on the blog for more on what we are doing with intelligent systems design. Many are hard at work on these and other innovations, and we’re in the loop at MIT, too.