Alpha Leaders
Innovation

As Generative AI Models Get Bigger And Better The Reliability Veers Straight Off A Cliff — Or Maybe That’s A Mirage

By Press Room · 7 November 2024 · 14 Mins Read
In today’s column, I examine the intriguing and quite troubling possibility that as generative AI and large language models (LLMs) are devised to be bigger and better, they are also disturbingly becoming less reliable. Recent empirical studies have tried to figure out this quandary. One possibility is that the reliability drop is more due to accounting trickery and fanciful statistics rather than actual downfalls in AI.

Let’s talk about it.

This analysis of an innovative proposition is part of my ongoing Forbes.com column coverage on the latest in AI including identifying and explaining various impactful AI complexities (see the link here).

Reliability Has To Do With Consistency In Correctness

Various headlines have recently decried that the reliability of generative AI appears to be declining, which seems odd since the AI models are simultaneously getting bigger and better overall. A lot of handwringing is taking place about this disconcerting trend. It just doesn’t make sense and seems counterintuitive.

Surely, if AI is getting bigger and better, we would naturally expect reliability to at least hold steady, if not improve. How can AI that has a larger scope of capabilities, and is considered better at answering questions, fail to maintain the status quo in reliability, let alone increase it?

The hefty gut punch is that reliability seems to be declining.

Yikes.

This deserves a deep dive.

First, let’s establish what we mean by saying that AI is less reliable.

The reliability facet pertains to the consistency of correctness. It goes like this. When you log into generative AI such as ChatGPT, GPT-4o, Claude, Gemini, Llama, or any of the major AI apps, you expect that the correct answer will be reliably conveyed to you. That being said, some people falsely think that generative AI will always be correct. Nope, that’s just not the case. There are plenty of times that AI can produce an incorrect answer.

AI makers track the reliability of their AI wares. Their keystone assumption is that people want AI that is highly reliable. If AI is not consistently correct, users will get upset and undoubtedly stop using the AI. That hurts the bottom line of the AI maker.

None of us want to use generative AI that is low in reliability. This implies that one moment you might get a correct answer, and the next moment an incorrect answer. It could be like a roll of the dice or playing the slot machines in Las Vegas.

You would need to be vigorously skeptical of any answer generated and indubitably would become exasperated at the volume of wrong answers. Of course, you should already be generally skeptical of AI, partially due to the chances of a so-called AI hallucination that might arise, see my discussion at the link here.

The Counting Of Correctness Becomes A Problem

I’d like to delve next into how we might keep track of reliability as associated with generative AI. We shall first consider the counting of correctness when it comes to humans taking tests.

Hark back to your days of being in school and taking tests.

A teacher hands out a test and you earnestly start providing answers. You know that ultimately you will be graded on how many you got correct and how many you answered incorrectly. There is usually a final tally put at the top of your test that says the number of correct answers and how many questions there were on the test. Maybe if your lucky stars are aligned you get above 90% of the answers correct, possibly attaining the revered 100%.

Not all exams are limited to just a score based on the correct versus incorrect criteria alone.

Some of the nationwide exams incorporate a special provision for when you don’t answer a given question. Normally, if you skip a question, you get a flat score of 0 for that question, meaning that you got it wrong. That would seem to be appropriate scoring. You see, your determined task is to try and answer all the questions that are on the test. Skipping a question is tantamount to getting it wrong. The fact that you failed to answer the question is seen as equivalent to having picked the wrong answer. Period, end of story.

Some assert that it is unfair to say that you got the question wrong since you didn’t actually attempt to answer the question. You presumably are only correct or incorrect when you make an actual guess. Leaving a question blank suggests you did not guess at all on that question. Scoring a skipped question as a zero implies that you tried and yet failed to answer the question correctly.

Wait a second, comes a brisk retort.

If you let people get away with skipping questions and not getting penalized for doing so, they will end up skipping questions endlessly. They could just cherry-pick the few questions they are most confident in, and seemingly get a top score. That’s ridiculous. If you skip a question, then the score on that question should undeniably be the same as having gotten the question absolutely wrong.

There is an ongoing debate about the blank answer situation. It used to be that on the vaunted SAT, there was a said-to-be guessing penalty. You agonizingly had to decide whether to leave a question blank or take your best shot at selecting an answer. In 2016, the College Board removed that penalty, and it is now generally considered a wise rule of thumb to guess at every question and never leave an answer blank.

Counting Correctness Of Generative AI

Why did I drag you through those eye-rolling distant memories of your test-taking days?

Because we have a similar dilemma when it comes to scoring generative AI on the metric of correctness.

Answers by generative AI can be graded via these three categories:

  • (1) Correct answer. The answer generated by AI is a correct answer.
  • (2) Incorrect answer. The answer generated by AI is an incorrect answer.
  • (3) Avoided answering. The question was avoided in the sense that the generative AI did not provide an answer or otherwise sidestepped answering the question. This is essentially the same as leaving an answer blank.
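The three-way grading above can be sketched as a tiny scoring function. This is a minimal illustration of the two scoring schemes the article contrasts (penalizing avoidance versus giving it a free pass), not code from the study; the function and its parameter names are my own:

```python
from collections import Counter

def score(outcomes, penalize_avoidance):
    """Return an accuracy figure for a list of three-way outcomes.

    Each entry of `outcomes` is "correct", "incorrect", or "avoided".
    With penalize_avoidance=True, an avoided question counts the same
    as an incorrect one (it stays in the denominator); with False,
    avoided questions are simply excluded (the "free pass" scheme).
    """
    counts = Counter(outcomes)
    correct = counts["correct"]
    denom = len(outcomes) if penalize_avoidance else correct + counts["incorrect"]
    return correct / denom if denom else 0.0

# 60 correct, 10 incorrect, 30 avoided out of 100 questions.
outcomes = ["correct"] * 60 + ["incorrect"] * 10 + ["avoided"] * 30
print(score(outcomes, penalize_avoidance=False))  # 60/70 ≈ 0.857
print(score(outcomes, penalize_avoidance=True))   # 60/100 = 0.6
```

Notice that the same set of answers scores roughly 86% under one scheme and 60% under the other; the disagreement is entirely about the denominator.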

I ask you to mull over the following conundrum.

When giving tests to generative AI to assess reliability or consistency of correctness, how would you score the instances of the AI avoiding answering questions?

Give that a contemplative thought or two.

If you aren’t familiar with the circumstances under which generative AI refuses to answer questions, I’ve covered the range of possibilities at the link here. The AI maker can set various parameters associated with the pace or frequency of refusals. There is a tradeoff that the AI maker must wrestle with. People are irked when the AI refuses to answer questions. But if the AI opts to answer questions wrongly, and if those wrong answers can be avoided by refusing to answer, this might be more attractive to users than the AI being wrong. As you might imagine, the refusal rate raises all kinds of AI ethics and AI law issues, as noted at the link here.

All of this is rather akin to the problem with the scoring of human test-takers.

Maybe let the AI have a proverbial free pass and if an answer is avoided or refused, we won’t penalize the avoidance or refusal. Whoa, that doesn’t seem right, comes the contrarian viewpoint, an avoided answer should be held to the same standard as being a flat-out incorrect answer.

Ask any AI researcher about this testy matter and you’ll find yourself engulfed in a heated debate. Those who believe there should be no penalty will insist that this is the only rightful way to do the scoring. The other camp will bellow that you cannot let AI get away with being evasive. That is a wrongful way to go, and we are setting ourselves up for a world of hurt if that’s how AI is going to be graded. It will be a race to the bottom of the AI that we are devising and releasing to the public at large.

Research On Scoring Of Generative AI

The bottom line of generative AI becoming less reliable hinges significantly on how you decide to score the AI.

A recent research study entitled “Larger And More Instructable Language Models Become Less Reliable” by Lexin Zhou, Wout Schellaert, Fernando Martínez-Plumed, Yael Moros-Daval, Cèsar Ferri, and José Hernández-Orallo, Nature, September 25, 2024, made these salient points (excerpts):

  • “The prevailing methods to make large language models more powerful and amenable have been based on continuous scaling up (that is, increasing their size, data volume, and computational resources) and bespoke shaping up (including post-filtering, fine-tuning or use of human feedback).”
  • “It may be taken for granted that as models become more powerful and better aligned by using these strategies, they also become more reliable from a human perspective, that is, their errors follow a predictable pattern that humans can understand and adjust their queries to.”
  • “Although the models can solve highly challenging instances, they also still fail at very simple ones.”
  • “Focusing on the trend across models, we also see something more: the percentage of incorrect results increases markedly from the raw to the shaped-up models, as a consequence of substantially reducing avoidance.”
  • “We also find that early models often avoid user questions but scaled-up, shaped-up models tend to give an apparently sensible yet wrong answer much more often, including errors on difficult questions that human supervisors frequently overlook.”

Here’s the gist.

Suppose you graded generative AI by initially giving a free pass to the avoided answers. You then aren't truly capturing correctness, since the refused questions don't count against the score. The AI will appear to score higher than it really does.

With me so far?

Later, imagine that we decide to force the AI to persistently answer questions and only sparingly refuse to answer questions. We pretty much tell the AI to always guess, even if the AI is computationally unsure of what the correct answer is.

Can you predict what would happen to the measured semblance of reliability?

The chances are that reliability would lessen, since you are now forcing the AI to guess on questions it previously avoided. Assuming that some proportion of those guesses is bound to be incorrect, the volume of incorrectly answered questions will rise. In short, by moving the previously unpenalized avoided questions into a clear-cut correct/incorrect scheme, the percentage of incorrect answers is likely to climb above what it was before.

No more cherry-picking.

The AI must make guesses and suffer the consequences accordingly.
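This shift can be sketched with a few lines of arithmetic. Everything here is an assumption for illustration: the function name, the `still_avoided` parameter, and the guess-accuracy rate are mine, not figures from the study:

```python
def force_guessing(correct, incorrect, avoided, still_avoided, guess_accuracy):
    """Model what happens when formerly avoided questions must be answered.

    `guess_accuracy` is the assumed fraction of the forced guesses that
    turn out correct; `still_avoided` is how many refusals remain.
    Returns the new (correct, incorrect, avoided) tallies.
    """
    guessed = avoided - still_avoided
    hits = round(guessed * guess_accuracy)
    return correct + hits, incorrect + (guessed - hits), still_avoided

# Baseline: 60 correct, 10 incorrect, 30 avoided. Force guesses on 20 of
# the avoided questions, assuming half of those guesses land correctly.
print(force_guessing(60, 10, 30, still_avoided=10, guess_accuracy=0.5))
# (70, 20, 10): correct answers rise, yet incorrect answers double
```

The key point survives any reasonable choice of guess accuracy below 100%: correct answers go up, but incorrect answers go up too, and the error count can worsen even as the headline "correct" tally improves.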

An Illustration As Explained Via ChatGPT

I decided that I would conveniently use ChatGPT to come up with an example to illustrate the above phenomenon.

Here we go.

  • My entered prompt: “Come up with an illustrative scenario explaining the curious situation that by forcing generative AI to answer previously what was refused or avoided questions, the measure of the reliability of AI is going to take a hit.”
  • ChatGPT generated response:

“Let’s break this down with an example of 100 questions, illustrating how generative AI might give more correct answers overall, but also how incorrect answers could rise faster due to fewer avoided questions.”

“Scenario 1: Baseline — Total questions: 100. Correct answers: 60. Incorrect answers: 10. Avoided questions: 30”

“Here, the AI gets 60 correct answers, 10 wrong answers, and avoids answering 30 questions. The avoided questions might be difficult, and instead of guessing, the AI chooses not to answer. This gives the AI a 60% accuracy from the 100 questions (60 out of 100).”

“Scenario 2: New Situation”

“Now, let’s say improvements are made that encourage the AI to avoid fewer questions, but at the same time, it leads to more guesses.”

“Total questions: 100”

“Correct answers: 70 (up from 60)”

“Incorrect answers: 20 (up from 10)”

“Avoided questions: 10 (down from 30)”

“The AI improved by giving more correct answers—now 70 instead of 60. However, due to reduced avoidance, it guesses more frequently on difficult questions, resulting in more incorrect answers — now 20 instead of 10. The AI avoids fewer questions, down to 10 from 30. So, even though the AI is getting more correct answers overall (70 correct vs. 60 before), the number of incorrect answers has doubled (20 wrong vs. 10 before).”

End of ChatGPT response

I quite liked that example.

Let’s unpack it.

The example indicates that attempted questions rose from 70 of 100 (70%) to 90 of 100 (90%), which is presumably great, meaning that the AI is abiding by our instructions and answering more of the questions posed. Meanwhile, the number of avoided questions dropped from 30 to 10, a decrease of about 67%, which is also great.

Correct answers rose from 60 to 70, a rise of about 17%, which is great. We could claim that the AI is getting better at answering questions. Yes, we might decree that generative AI is 17% better than it was before. Happy face. A nifty improvement. Tell the world.

If we cleverly or sneakily decide to stop telling the story at those statistics, we could handily pull the wool over the eyes of the world. No one would realize that something else has taken a turn for the worse.

What went worse?

As vividly shown in the example, the number of incorrect answers rose from 10 to 20, so a 100% rise or doubling in being wrong, which is bad. Very bad. How did this happen? Because we are forcing the AI to now take guesses at questions that previously have been refused or avoided.
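The percentage swings quoted in this section are easy to verify with a one-line helper; a quick sketch:

```python
def pct_change(old, new):
    """Percentage change from old to new."""
    return 100 * (new - old) / old

print(round(pct_change(60, 70)))   # correct answers: about +17%
print(round(pct_change(10, 20)))   # incorrect answers: +100% (doubled)
print(round(pct_change(30, 10)))   # avoided questions: about -67%
```

The asymmetry is the whole trick: the same underlying change reads as a modest improvement on one metric and a doubling of errors on another, depending on which number you report.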

The prior scoring was letting AI off the hook.

You might openly argue that the devil finally gets its due, and we see in a sense the real scores. The quirk or trickery of refusing questions inflated or hid the truth. By no longer avoiding answering questions, this has knocked the air out of the balloon of what seemed to be consistent reliability.

Where We Are At And What Happens Next

Some suggest that we should go back to allowing AI to refuse to answer questions and continue the past assumption that no penalty should occur for those refusals. If we did that, the odds are that the reliability measures might remain as they once were. It would be easy to then ignore the reliability factor and just declare that AI reliability continues to smoothly roll along.

Another supporting viewpoint of that approach is that we as humans should be consistent about how we are measuring AI performance. If we previously let refusals go free, the same method should be carried forward. The idea is that if we brazenly otherwise move the goalposts, the changes in scoring are not reflective of the AI but instead reflective of our having changed our minds about the means of measurement.

Hogwash — announces the other side. We should have always penalized for refusals. It was a mirage that we falsely created. We knew or should have known that someday the chickens would come home to roost. In any case, the right approach is now underway and let’s not turn back the clock.

Which direction do you want things to go in?

There are those who say that we made a mistake by not suitably counting or accounting for the refusal or avoidances. Do not fall back into the mistakes of the past. The counterview is that the prior method was not a mistake and made sense for the time at which AI was initially being devised and assessed.

Let’s wrap things up for now.

I’ll give the final word to the famed Henry Ford: “The only real mistake is the one from which we learn nothing.” We can learn to do a better job at gauging progress in AI, including our measurements, how we devise them, how we apply them, and how we convey the results to insiders and the public.

That seems a rather reliable perspective.
