New Approaches To Weighting Drive Innovation In Large Language Models

Experts tracking the evolving design of neural nets are expressing interest in “higher-order attention mechanisms” as a replacement for the standard attention used in AI transformers to date.

Earlier this month, a panel of academic authors unveiled what they call “Nexus,” a solution for a bottleneck in standard attention mechanisms, which they contend “struggle to capture intricate, multi-hop relationships within a single layer.”

“Unlike standard approaches that use static linear projections for Queries and Keys, Nexus dynamically refines these representations via nested self-attention mechanisms,” they wrote. “Specifically, the Query and Key vectors are themselves outputs of inner attention loops, allowing tokens to aggregate global context and model high-order correlations prior to the final attention computation.”

For those who are not academics, I ran this through ChatGPT twice to simplify it, and came up with this:

“Nexus doesn’t create Queries and Keys in one fixed step.
It runs extra mini-attention passes to improve them first.
So tokens collect more context before the main attention happens.”
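To make that description concrete, here is a minimal sketch in plain Python of the general idea: the Queries and Keys come out of their own inner attention passes instead of a single fixed projection. This is not the Nexus authors' code. The function names, the single inner pass, the identity weight matrices and the tiny made-up embeddings are all assumptions chosen for illustration.

```python
import math

def matmul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p); both are lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Turn a list of scores into positive weights that sum to 1."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, k, v):
    """Scaled dot-product attention given ready-made Q, K and V."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)

def standard_attention(x, w_q, w_k, w_v):
    """Standard approach: Q, K and V each come from one fixed linear projection."""
    return attend(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))

def nested_attention(x, inner_q, inner_k, w_v):
    """Hypothetical nested variant: Q and K are outputs of inner attention passes.

    inner_q and inner_k are each an assumed (w_q, w_k, w_v) triple of weights.
    """
    q = standard_attention(x, *inner_q)  # Queries gather context first
    k = standard_attention(x, *inner_k)  # Keys gather context first
    v = matmul(x, w_v)                   # Values use a plain projection here
    return attend(q, k, v)               # the final, "main" attention step

# Three tokens with made-up 2-number embeddings; identity matrices stand in
# for weights that a real model would learn during training.
identity = [[1.0, 0.0], [0.0, 1.0]]
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
triple = (identity, identity, identity)
out = nested_attention(x, triple, triple, identity)
```

The only difference from the standard path is where `q` and `k` come from: by the time the final attention step runs, each one has already mixed in information from every other token.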

Queries, Keys and Values

It turns out that queries, keys and values are the three parts of an attention mechanism that help a neural net “focus” on the right stuff.

This guide on Medium is a great reference. Let’s start with this:

“In AI terms, Queries ask, ‘What here is relevant?’” Thiksiga Ragulakaran writes. “Keys answer, ‘Here’s what I have.’ Values are the raw data used to build the output. All three are created by tweaking input embeddings with learned weight matrices. This lets the model ‘project’ inputs into spaces where similarities become obvious.”

So the Q, K and V sets each come out of “learned matrices” applied to the same input.

Here’s more:

“All three — Query (Q), Key (K), and Value (V) — start with the same positional embeddings. They are then transformed into unique matrices using separate trainable linear layers. These layers act as adjustable weights, updated during training, to let the model learn how to focus on different parts of the input.”
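As a rough illustration of what the guide describes, the sketch below builds Q, K and V from the same token embeddings using three separate weight matrices, then uses them to compute attention. It is plain Python for readability, not how a production model is written. The embeddings and weight values are made up, and fixed numbers stand in for weights that would normally be learned during training.

```python
import math

def matmul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p); both are lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Turn a list of scores into positive weights that sum to 1."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention over token embeddings x."""
    q = matmul(x, w_q)  # Queries: "what here is relevant?"
    k = matmul(x, w_k)  # Keys: "here's what I have"
    v = matmul(x, w_v)  # Values: the data used to build the output
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)  # how well each Query matches each Key
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

# Three tokens, each a made-up 2-number embedding.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Illustrative stand-ins for the three learned weight matrices.
w_q = [[1.0, 0.0], [0.0, 1.0]]
w_k = [[1.0, 0.0], [0.0, 1.0]]
w_v = [[2.0, 0.0], [0.0, 3.0]]

output, weights = attention(x, w_q, w_k, w_v)
for row in weights:
    print([round(w, 3) for w in row])  # one row of "focus" weights per token
```

Each printed row shows how much one token attends to every token in the input. Changing `w_q` or `w_k` changes those rows, which is what training adjusts.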

You can see how the weighting of inputs is crucial to the design of the neural net and how it functions.

Ragulakaran goes further into how these systems use multi-head attention to facilitate various perspectives. And then there’s something called matmul, which I also looked up with GPT.

“Matmul is short for matrix multiplication,” the model explained. “In AI, it’s the core math behind how neural networks combine inputs with learned weights. During training and inference, huge matmuls power operations like linear layers and attention. That’s why GPUs/TPUs are optimized for fast, parallel matmul.”
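For a sense of what that looks like in practice, here is a small sketch of matrix multiplication written out as explicit loops, applied to a toy "linear layer." The input and weight values are invented for the example. Real frameworks hand this work to optimized hardware kernels instead of running Python loops.

```python
def matmul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p); both are lists of rows."""
    n, m, p = len(a), len(b), len(b[0])
    assert all(len(row) == m for row in a), "inner dimensions must match"
    out = [[0.0] * p for _ in range(n)]
    for i in range(n):          # each input row (for example, one token)
        for j in range(p):      # each output feature
            for k in range(m):  # multiply matching entries and add them up
                out[i][j] += a[i][k] * b[k][j]
    return out

# Two inputs with three features each (made-up numbers).
inputs = [[1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0]]
# A 3 x 2 weight matrix standing in for learned weights.
layer_weights = [[1.0, 0.0],
                 [0.0, 1.0],
                 [1.0, 1.0]]

outputs = matmul(inputs, layer_weights)
print(outputs)  # [[4.0, 5.0], [10.0, 11.0]]
```

Every cell of the output can be computed independently of the others, which is why this operation suits the parallel hardware the quote mentions.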

Then I asked: do higher-order attention mechanisms use matmul?

“Yes, almost always,” GPT responded. “‘Higher-order’ variants (multi-head, tensor/outer-product, factorized/low-rank, etc.) still rely on matrix multiplies or generalized tensor contractions (often written as einsum), which hardware executes using matmul-like kernels.”

So next time you hear this bit of jargon, or you’re asked about it, you have a little ballast.

As for generalized tensor contractions often being written as einsum, I’m going to leave that one alone.

Making It Real

So what can folks do with these architectures?

Some experts contemplating neural nets equipped with this attention design talk about building richer global context for summarization or Q&A, tracking dependencies across functions and files in code, and improving reasoning. In other applications, systems like Nexus could capture higher-order structure in molecules, proteins, or knowledge graphs, or help to maintain a coherent world-state across many steps in the agentic age.

A resource from the Boston Institute of Analytics explains it this way:

“Attention mechanisms have become a key part of many of the most advanced AI models, including large language models (LLMs) like GPT or BERT. Attention mechanisms enable a model to achieve a high degree of accuracy in a variety of tasks such as translation, question answering, text summarization, image captioning and more. Attention has remained an important concept in research and development that will be critical to developing and improving the intelligence and capabilities of AI systems.”

By Any Other Name

Who knows what we will call these LLM innovations years from now? Will we see the world of AI as being composed of Markov states, or matrices, or key-value pairs? Or all of the above? And what will we use all of this for? To many people, that’s the bigger question. Stay tuned as we approach the new year.