Alpha Leaders
Innovation

As Generative AI Models Get Bigger And Better The Reliability Veers Straight Off A Cliff — Or Maybe That’s A Mirage

By Press Room · 7 November 2024 · 14 Mins Read
In today’s column, I examine the intriguing and quite troubling possibility that as generative AI and large language models (LLMs) are devised to be bigger and better, they are also disturbingly becoming less reliable. Recent empirical studies have tried to figure out this quandary. One possibility is that the reliability drop is more due to accounting trickery and fanciful statistics rather than actual downfalls in AI.

Let’s talk about it.

This analysis of an innovative proposition is part of my ongoing Forbes.com column coverage on the latest in AI including identifying and explaining various impactful AI complexities (see the link here).

Reliability Has To Do With Consistency In Correctness

Various headlines have recently decried that the reliability of generative AI appears to be declining, which seems odd since the AI models are simultaneously getting bigger and better overall. A lot of handwringing is taking place about this disconcerting trend. It just doesn’t make sense and seems counterintuitive.

Surely, if AI is getting bigger and better, we would naturally expect reliability to at least hold steady, if not improve. How can AI that has a larger scope of capabilities, and is considered better at answering questions, fail to maintain the status quo in reliability, let alone increase it?

The hefty gut punch is that reliability seems to be declining.

Yikes.

This deserves a deep dive.

First, let’s establish what we mean by saying that AI is less reliable.

The reliability facet pertains to the consistency of correctness. It goes like this. When you log into generative AI such as ChatGPT, GPT-4o, Claude, Gemini, Llama, or any of the major AI apps, you expect that the correct answer will be reliably conveyed to you. That being said, some people falsely think that generative AI will always be correct. Nope, that’s just not the case. There are plenty of times that AI can produce an incorrect answer.

AI makers track the reliability of their AI wares. Their keystone assumption is that people want AI that is highly reliable. If AI is not consistently correct, users will get upset and undoubtedly stop using the AI. That hurts the bottom line of the AI maker.

None of us want to use generative AI that is low in reliability. This implies that one moment you might get a correct answer, and the next moment an incorrect answer. It could be like a roll of the dice or playing the slot machines in Las Vegas.

You would need to be vigorously skeptical of any answer generated and indubitably would become exasperated at the volume of wrong answers. Of course, you should already be generally skeptical of AI, partially due to the chances of a so-called AI hallucination that might arise, see my discussion at the link here.

The Counting Of Correctness Becomes A Problem

I’d like to delve next into how we might keep track of reliability as associated with generative AI. We shall first consider the counting of correctness when it comes to humans taking tests.

Hark back to your days of being in school and taking tests.

A teacher hands out a test and you earnestly start providing answers. You know that ultimately you will be graded on how many you got correct and how many you answered incorrectly. There is usually a final tally put at the top of your test that says the number of correct answers and how many questions there were on the test. Maybe if your lucky stars are aligned you get above 90% of the answers correct, possibly attaining the revered 100%.

Not all exams are limited to just a score based on the correct versus incorrect criteria alone.

Some of the nationwide exams incorporate a special provision for when you don’t answer a given question. Normally, if you skip a question, you get a flat score of 0 for that question, meaning that you got it wrong. That would seem to be appropriate scoring. You see, your determined task is to try and answer all the questions that are on the test. Skipping a question is tantamount to getting it wrong. The fact that you failed to answer the question is seen as equivalent to having picked the wrong answer. Period, end of story.

Some assert that it is unfair to say that you got the question wrong since you didn’t actually attempt to answer the question. You presumably are only correct or incorrect when you make an actual guess. Leaving a question blank suggests you did not guess at all on that question. Scoring a skipped question as a zero implies that you tried and yet failed to answer the question correctly.

Wait a second, comes a brisk retort.

If you let people get away with skipping questions and not getting penalized for doing so, they will end up skipping questions endlessly. They could just cherry-pick the few questions they are most confident in, and seemingly get a top score. That’s ridiculous. If you skip a question, then the score on that question should undeniably be the same as having gotten the question absolutely wrong.

There is an ongoing debate about the blank answer situation. It used to be that on the vaunted SAT, there was a said-to-be guessing penalty. You agonizingly had to decide whether to leave a question blank or take your best shot at selecting an answer. In 2016, the College Board removed that penalty, and it is now generally considered a wise rule of thumb to guess at every question and never leave an answer blank.

Counting Correctness Of Generative AI

Why did I drag you through those eye-rolling distant memories of your test-taking days?

Because we have a similar dilemma when it comes to scoring generative AI on the metric of correctness.

Answers by generative AI can be graded via these three categories:

  • (1) Correct answer. The answer generated by AI is a correct answer.
  • (2) Incorrect answer. The answer generated by AI is an incorrect answer.
  • (3) Avoided answering. The question was avoided in the sense that the generative AI did not provide an answer or otherwise sidestepped answering the question. This is essentially the same as leaving an answer blank.
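The three-way grading above can be sketched as a tiny scoring function. This is a minimal illustration of the two scoring schemes the article contrasts (penalizing avoidance versus giving it a free pass), not code from the study; the function and its parameter names are my own:

```python
from collections import Counter

def score(outcomes, penalize_avoidance):
    """Return an accuracy figure for a list of three-way outcomes.

    Each entry of `outcomes` is "correct", "incorrect", or "avoided".
    With penalize_avoidance=True, an avoided question counts the same
    as an incorrect one (it stays in the denominator); with False,
    avoided questions are simply excluded (the "free pass" scheme).
    """
    counts = Counter(outcomes)
    correct = counts["correct"]
    denom = len(outcomes) if penalize_avoidance else correct + counts["incorrect"]
    return correct / denom if denom else 0.0

# 60 correct, 10 incorrect, 30 avoided out of 100 questions.
outcomes = ["correct"] * 60 + ["incorrect"] * 10 + ["avoided"] * 30
print(score(outcomes, penalize_avoidance=False))  # 60/70 ≈ 0.857
print(score(outcomes, penalize_avoidance=True))   # 60/100 = 0.6
```

Notice that the same set of answers scores roughly 86% under one scheme and 60% under the other; the disagreement is entirely about the denominator.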

I ask you to mull over the following conundrum.

When giving tests to generative AI to assess reliability or consistency of correctness, how would you score the instances of the AI avoiding answering questions?

Give that a contemplative thought or two.

If you aren’t familiar with the circumstances under which generative AI refuses to answer questions, I’ve covered the range of possibilities at the link here. The AI maker can set various parameters associated with the pace or frequency of refusals. There is a tradeoff that the AI maker must wrestle with. People are irked when the AI refuses to answer questions. But if the AI opts to answer questions wrongly, and if those wrong answers can be avoided by refusing to answer, this might be more attractive to users than the AI being wrong. As you might imagine, the refusal rate raises all kinds of AI ethics and AI law issues, as noted at the link here.

All of this is rather akin to the problem with the scoring of human test-takers.

Maybe let the AI have a proverbial free pass and if an answer is avoided or refused, we won’t penalize the avoidance or refusal. Whoa, that doesn’t seem right, comes the contrarian viewpoint, an avoided answer should be held to the same standard as being a flat-out incorrect answer.

Ask any AI researcher about this testy matter and you’ll find yourself engulfed in a heated debate. Those who believe there should be no penalty will insist that this is the only rightful way to do the scoring. The other camp will bellow that you cannot let AI get away with being evasive. That is a wrongful way to go, and we are setting ourselves up for a world of hurt if that’s how AI is going to be graded. It will be a race to the bottom of the AI that we are devising and releasing to the public at large.

Research On Scoring Of Generative AI

The bottom line of generative AI becoming less reliable hinges significantly on how you decide to score the AI.

A recent research study entitled “Larger And More Instructable Language Models Become Less Reliable” by Lexin Zhou, Wout Schellaert, Fernando Martínez-Plumed, Yael Moros-Daval, Cèsar Ferri, and José Hernández-Orallo, Nature, September 25, 2024, made these salient points (excerpts):

  • “The prevailing methods to make large language models more powerful and amenable have been based on continuous scaling up (that is, increasing their size, data volume, and computational resources) and bespoke shaping up (including post-filtering, fine-tuning or use of human feedback).”
  • “It may be taken for granted that as models become more powerful and better aligned by using these strategies, they also become more reliable from a human perspective, that is, their errors follow a predictable pattern that humans can understand and adjust their queries to.”
  • “Although the models can solve highly challenging instances, they also still fail at very simple ones.”
  • “Focusing on the trend across models, we also see something more: the percentage of incorrect results increases markedly from the raw to the shaped-up models, as a consequence of substantially reducing avoidance.”
  • “We also find that early models often avoid user questions but scaled-up, shaped-up models tend to give an apparently sensible yet wrong answer much more often, including errors on difficult questions that human supervisors frequently overlook.”

Here’s the gist.

Suppose you graded generative AI by initially giving a free pass to the avoided answers. You then aren't truly capturing correctness, since the refused questions don't count against the score. The AI will appear to score higher than it really does.

With me so far?

Later, imagine that we decide to force the AI to persistently answer questions and only sparingly refuse to answer questions. We pretty much tell the AI to always guess, even if the AI is computationally unsure of what the correct answer is.

Can you predict what would happen to the measured semblance of reliability?

The chances are that reliability would lessen, since you are now forcing the AI to guess on questions it previously avoided. Assuming that some proportion of those guesses is bound to be incorrect, the volume of incorrectly answered questions will rise. In short, by moving the previously unpenalized avoided questions into a clear-cut correct/incorrect scheme, the percentage of incorrect answers is likely to climb above what it was before.

No more cherry-picking.

The AI must make guesses and suffer the consequences accordingly.
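This shift can be sketched with a few lines of arithmetic. Everything here is an assumption for illustration: the function name, the `still_avoided` parameter, and the guess-accuracy rate are mine, not figures from the study:

```python
def force_guessing(correct, incorrect, avoided, still_avoided, guess_accuracy):
    """Model what happens when formerly avoided questions must be answered.

    `guess_accuracy` is the assumed fraction of the forced guesses that
    turn out correct; `still_avoided` is how many refusals remain.
    Returns the new (correct, incorrect, avoided) tallies.
    """
    guessed = avoided - still_avoided
    hits = round(guessed * guess_accuracy)
    return correct + hits, incorrect + (guessed - hits), still_avoided

# Baseline: 60 correct, 10 incorrect, 30 avoided. Force guesses on 20 of
# the avoided questions, assuming half of those guesses land correctly.
print(force_guessing(60, 10, 30, still_avoided=10, guess_accuracy=0.5))
# (70, 20, 10): correct answers rise, yet incorrect answers double
```

The key point survives any reasonable choice of guess accuracy below 100%: correct answers go up, but incorrect answers go up too, and the error count can worsen even as the headline "correct" tally improves.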

An Illustration As Explained Via ChatGPT

I decided that I would conveniently use ChatGPT to come up with an example to illustrate the above phenomenon.

Here we go.

  • My entered prompt: “Come up with an illustrative scenario explaining the curious situation that by forcing generative AI to answer previously what was refused or avoided questions, the measure of the reliability of AI is going to take a hit.”
  • ChatGPT generated response:

“Let’s break this down with an example of 100 questions, illustrating how generative AI might give more correct answers overall, but also how incorrect answers could rise faster due to fewer avoided questions.”

“Scenario 1: Baseline — Total questions: 100. Correct answers: 60. Incorrect answers: 10. Avoided questions: 30”

“Here, the AI gets 60 correct answers, 10 wrong answers, and avoids answering 30 questions. The avoided questions might be difficult, and instead of guessing, the AI chooses not to answer. This gives the AI a 60% accuracy from the 100 questions (60 out of 100).”

“Scenario 2: New Situation”

“Now, let’s say improvements are made that encourage the AI to avoid fewer questions, but at the same time, it leads to more guesses.”

“Total questions: 100”

“Correct answers: 70 (up from 60)”

“Incorrect answers: 20 (up from 10)”

“Avoided questions: 10 (down from 30)”

“The AI improved by giving more correct answers—now 70 instead of 60. However, due to reduced avoidance, it guesses more frequently on difficult questions, resulting in more incorrect answers — now 20 instead of 10. The AI avoids fewer questions, down to 10 from 30. So, even though the AI is getting more correct answers overall (70 correct vs. 60 before), the number of incorrect answers has doubled (20 wrong vs. 10 before).”

End of ChatGPT response

I quite liked that example.

Let’s unpack it.

The example indicates that attempted questions rose from 70 of 100 (70%) to 90 of 100 (90%), which is presumably great, meaning that the AI is abiding by our instructions and answering more of the questions posed. Meanwhile, the number of avoided questions dropped from 30 to 10, a decrease of about 67%, which is also great.

Correct answers rose from 60 to 70, a rise of about 17%, which is great. We could claim that the AI is getting better at answering questions. Yes, we might decree that generative AI is 17% better than it was before. Happy face. A nifty improvement. Tell the world.

If we cleverly or sneakily decide to stop telling the story at those statistics, we could handily pull the wool over the eyes of the world. No one would realize that something else has taken a turn for the worse.

What went worse?

As vividly shown in the example, the number of incorrect answers rose from 10 to 20, so a 100% rise or doubling in being wrong, which is bad. Very bad. How did this happen? Because we are forcing the AI to now take guesses at questions that previously have been refused or avoided.
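The percentage swings quoted in this section are easy to verify with a one-line helper; a quick sketch:

```python
def pct_change(old, new):
    """Percentage change from old to new."""
    return 100 * (new - old) / old

print(round(pct_change(60, 70)))   # correct answers: about +17%
print(round(pct_change(10, 20)))   # incorrect answers: +100% (doubled)
print(round(pct_change(30, 10)))   # avoided questions: about -67%
```

The asymmetry is the whole trick: the same underlying change reads as a modest improvement on one metric and a doubling of errors on another, depending on which number you report.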

The prior scoring was letting AI off the hook.

You might openly argue that the devil finally gets its due, and we see in a sense the real scores. The quirk or trickery of refusing questions inflated or hid the truth. By no longer avoiding answering questions, this has knocked the air out of the balloon of what seemed to be consistent reliability.

Where We Are At And What Happens Next

Some suggest that we should go back to allowing AI to refuse to answer questions and continue the past assumption that no penalty should occur for those refusals. If we did that, the odds are that the reliability measures might remain as they once were. It would be easy to then ignore the reliability factor and just declare that AI reliability continues to smoothly roll along.

Another supporting viewpoint of that approach is that we as humans should be consistent about how we are measuring AI performance. If we previously let refusals go free, the same method should be carried forward. The idea is that if we brazenly otherwise move the goalposts, the changes in scoring are not reflective of the AI but instead reflective of our having changed our minds about the means of measurement.

Hogwash — announces the other side. We should have always penalized for refusals. It was a mirage that we falsely created. We knew or should have known that someday the chickens would come home to roost. In any case, the right approach is now underway and let’s not turn back the clock.

Which direction do you want things to go in?

There are those who say that we made a mistake by not suitably counting or accounting for the refusal or avoidances. Do not fall back into the mistakes of the past. The counterview is that the prior method was not a mistake and made sense for the time at which AI was initially being devised and assessed.

Let’s wrap things up for now.

I’ll give the final word to the famed Henry Ford: “The only real mistake is the one from which we learn nothing.” We can learn to do a better job at gauging progress in AI, including our measurements, how we devise them, how we apply them, and how we convey the results to insiders and the public.

That seems a rather reliable perspective.
