Topline

With companies like OpenAI, Google and Meta dropping increasingly sophisticated artificial intelligence products, crowdsourced rankings have emerged as a popular—and virtually the only practical—way of determining which tool works best, and LMSYS’s Chatbot Arena has become possibly the most influential real-time gauge.

Key Facts

Most organizations measure their AI models against a set of general capability benchmarks covering tasks like solving math problems, completing programming challenges or answering multiple-choice questions across an array of university-level disciplines, but there is no industry-wide benchmark or standard practice for assessing large language models (LLMs) like OpenAI’s GPT-4o, Meta’s Llama 3, Google’s Gemini and Anthropic’s Claude.

Even small differences in factors like datasets, prompts and formatting can have a huge impact on how a model performs, and when companies choose their own evaluation criteria, it becomes hard to compare LLMs fairly, Jesse Dodge, a senior scientist at the Allen Institute for AI in Seattle, told Forbes.

The difficulty of comparing LLMs is magnified by how closely leading models score on many commonly used benchmarks, with some companies and tech executives claiming victory over rivals by margins as narrow as 0.1%, a gap so small it would likely go unnoticed by everyday users.

Community-built leaderboards that draw on human judgment have emerged to fill the gap, and in recent years their popularity has exploded in step with the steady stream of new AI tools like ChatGPT, Claude, Gemini and Mistral.

The Chatbot Arena, an open source project built by research group LMSYS and the University of California, Berkeley’s Sky Computing Lab, has proven particularly popular. It builds its AI leaderboards by asking visitors to compare responses from two anonymous AI models and vote for the one they prefer.

Its scoreboards rank more than 100 AI models based on nearly 1.5 million human votes so far, covering an array of categories including long queries, coding, instruction following, math, “hard prompts” and a variety of languages including English, French, Chinese, Japanese and Korean.
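Leaderboards like Chatbot Arena’s convert those head-to-head votes into a ranking using an Elo-style rating system (LMSYS has described using Elo and a related Bradley-Terry model). The sketch below is a minimal, illustrative Elo update in Python, using made-up model names and votes rather than LMSYS’s actual code or data:

```python
# Minimal illustrative Elo update from pairwise votes (hypothetical data,
# not LMSYS's actual implementation or ratings).
from collections import defaultdict

K = 32          # update step size; larger K means ratings move faster
BASE = 1000.0   # every model starts at the same rating

def expected_score(r_a, r_b):
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(ratings, winner, loser):
    """Shift both ratings toward the observed outcome of one vote."""
    exp_win = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += K * (1.0 - exp_win)
    ratings[loser] -= K * (1.0 - exp_win)

# Hypothetical votes: (winner, loser) pairs from anonymous side-by-side battles.
votes = [("model-a", "model-b"), ("model-a", "model-c"), ("model-b", "model-c")]

ratings = defaultdict(lambda: BASE)
for winner, loser in votes:
    update(ratings, winner, loser)

for name, score in sorted(ratings.items(), key=lambda x: -x[1]):
    print(f"{name}: {score:.1f}")
```

The real leaderboard adds statistical refinements such as confidence intervals and handling of ties, but the core idea is the same: each vote nudges the winning model’s score up and the losing model’s score down.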

What’s The Best AI Model On Chatbot Arena?

The top five AI models on Chatbot Arena’s overall leaderboard are:

  1. GPT-4o
  2. Claude 3.5 Sonnet
  3. Gemini Advanced
  4. Gemini 1.5 Pro
  5. GPT-4 Turbo

What To Watch For

Figuring out how to evaluate AI models is set to become increasingly important as more AI tools are rolled out and adopted across society. Benchmarks are not only useful for evaluation, Vanessa Parli, director of research at Stanford University’s Institute for Human-Centered AI, told Forbes, they also serve as “goals for researchers to hit when developing models.” At the same time, “not all human capabilities are quantifiable” in a way we can accurately measure, even though they are desirable traits for AI models to have, Parli said. There is also a clear need for benchmarks that assess traits like “bias, toxicity, truthfulness and other responsibility aspects,” especially for organizations dealing with sensitive information, such as healthcare companies, Parli said.

Crucial Quote

“The benchmarks aren’t perfect, but as of right now, that’s the primary mechanism we have to evaluate the models,” Parli told Forbes, cautioning that “researchers can somewhat easily game the system” today, with AI models quickly saturating benchmarks. “I think we need to get creative in the development of new ways to evaluate AI models,” Parli said.

What We Don’t Know

Measuring intelligence is tricky when we do not know what it is we are supposed to be measuring. There is no universally accepted definition of intelligence in humans, let alone a way to measure it, and the possibility, nature and scope of animal intelligence has divided scientists for centuries. While AI benchmarks have typically focused on the ability to perform a particular task, more general assessments will be required as researchers make progress toward their goal of creating artificial general intelligence (AGI): a system capable of matching or even exceeding humans across a broad set of domains, rather than excelling at just one task such as walking, moving boxes, identifying tumors on scans or playing chess.

How Useful Is Chatbot Arena For Evaluating AI Models?

“The rankings [Chatbot Arena] gives are something that I trust more than most other rankings,” Dodge told Forbes, “because it uses a real human to say whether they prefer one generation over another.” Parli suggested assessments like Chatbot Arena could “implicitly evaluate factors” we want in our AI that are less quantifiable than something like coding ability. But she stressed that something like Chatbot Arena should not be the only evaluation method used, saying there are “many factors that should be important to organizations when evaluating models” and it “does not cover all of them.”
