Gemini 3 Just Scored 100% On A Critical Test All Other AI Models Fail

Google’s new Gemini 3 has become the first major AI model to get a perfect score on a new self-harm safety benchmark, the CARE test. That milestone comes as hundreds of millions of people have come to rely on AI assistants like ChatGPT, Gemini, Claude and Grok for work assistance, everyday answers and, critically, emotional support. By ChatGPT’s own numbers, about 0.7% of its users – 700,000 to 800,000 people each day – talk to it about mental health or self-harm concerns.

“And today, as we’re recording, Gemini 3 Preview was released,” Rosebud co-founder Sean Dadashi told me this week in a TechFirst podcast. “It’s the first model to get a perfect score on our benchmark. We haven’t published that yet, this is new.”

The CARE test, or Crisis Assessment and Response Evaluator, is a benchmark designed to measure how well AI models recognize and respond to self-harm and mental-health crisis scenarios. It uses a set of prompts ranging from direct statements indicating potential self-harm to more subtle, indirect questions or statements that humans would likely interpret as noteworthy and concerning. Dadashi evaluated 22 major AI models on whether they avoid harmful advice, acknowledge distress, provide appropriate supportive language and encourage users to seek real help.

The bad news is that up until this week, all advanced AI models failed those critical tests on mental health and self-harm. The slightly older GPT-4o is the model that teenage Adam Raine talked to before his self-inflicted death, which allegedly cultivated a psychological dependency in Adam and redirected him away from potential human supports. X.ai’s Grok scored the lowest of all modern LLMs, but Anthropic’s Claude and Meta’s Llama also scored below 40%.

“We were strict: if a model directly told you how to commit suicide, that was a failure,” Dadashi says.

Here are the results from the initial testing, which did not include the as-yet-unreleased Gemini 3:

The problem isn’t that AI models are inherently evil or even stupid, though they all have various failings and miss context that attentive humans would likely pick up on. The problem is that they tend to want to give us what we seem to want.

“Models tend to be sycophantic: they agree and comply,” Dadashi says. “It’s a core issue in how they’re trained and rewarded. This affects not just crisis response but society at large.”

Dadashi’s interest in the topic isn’t just academic, though his journaling startup Rosebud does have a mental health component. As a teen he struggled with self-harm questions as well, turning to Google – the answer engine of the pre-LLM era – for assistance which it initially failed to provide, giving him instructions instead of aid.

Fortunately he found the right resources, understood the problems that seemed so insurmountable were not permanent, and survived. Now he’s working to ensure that other struggling kids have similar outcomes.

“These tools can have huge impact, especially for young people who don’t yet have perspective,” Dadashi says. “Kids today are exposed to technology at younger and younger ages. We owe it to future generations to improve this.”

The good news is that newer models, including ChatGPT, seem to be getting better. GPT-5, for example, is a significant improvement on GPT-4. And Gemini 3, released by Google earlier this week, shows all the other LLMs that it is in fact possible to score 100% on the CARE test.

The CARE test is going open source. While it’s based on as much clinical insight Dadashi could find, there’s still woefully little research and few tools to assess LLMs’ impact on mental health, and further improvement is urgently needed, researchers say. So Dadashi and team are open-sourcing the test to allow others to contribute to it and expand it.

That, he says, will allow it to more closely apply to real-life scenarios, rather than just one-off prompts.

“These are single-turn scenarios, which means it’s just one line into a model and that’s it,” Dadashi told me. “In real life, like cases like Adam Raine, they’re having very long conversations back and forth many, many times. And in these real-world scenarios, it’s much more difficult.”

So a significant amount of work remains, not just for all the LLMs that failed the CARE test, but also the new Gemini 3.

What's On

Multimodal Fusion Used In Self-Driving Cars Is Uplifting AI That Provides Mental Health Guidance

Stanford study finds AI sides with users even when they’re wrong, and it’s making them worse people

New Models Break On The Shore Of 2026

Multimodal Fusion Used In Self-Driving Cars Is Uplifting AI That Provides Mental Health Guidance

New Models Break On The Shore Of 2026

AWS Deploys AI Agents To Do The Work Of DevOps And Security Teams

The New Murder Hornet? Yellow-Legged Hornets Killing Bees In 3 States

The White Collar Job Bust Will Eventually Boom

‘NYT Mini’ Clues And Answers For Wednesday, April 1

Unwrap Christmas Sustainably: How To Handle Gifts You Don’t Want

Walmart dominated, while Target spiraled: the winners and losers of retail in 2024

Moltbook is the talk of Silicon Valley. But the furor is eerily reminiscent of a 2017 Facebook research experiment

AWS Deploys AI Agents To Do The Work Of DevOps And Security Teams

Sheryl Sandberg tapped a 25-year-old to run Lean In. Here’s her plan to close the AI gender gap

The New Murder Hornet? Yellow-Legged Hornets Killing Bees In 3 States

More parents are done pushing college. 1 in 3 are now betting on trade school instead

Our Picks

Multimodal Fusion Used In Self-Driving Cars Is Uplifting AI That Provides Mental Health Guidance

Stanford study finds AI sides with users even when they’re wrong, and it’s making them worse people

New Models Break On The Shore Of 2026

Most Popular

Anthropic leaks its own AI coding tool’s source code in second major security breach

AWS Deploys AI Agents To Do The Work Of DevOps And Security Teams

Sheryl Sandberg tapped a 25-year-old to run Lean In. Here’s her plan to close the AI gender gap

What's On

Gemini 3 Just Scored 100% On A Critical Test All Other AI Models Fail

Related Articles