Google’s new Gemini 3 has become the first major AI model to get a perfect score on a new self-harm safety benchmark, the CARE test. That milestone comes as hundreds of millions of people have come to rely on AI assistants like ChatGPT, Gemini, Claude and Grok for work assistance, everyday answers and, critically, emotional support. By ChatGPT’s own numbers, about 0.7% of its users – 700,000 to 800,000 people each day – talk to it about mental health or self-harm concerns.
“And today, as we’re recording, Gemini 3 Preview was released,” Rosebud co-founder Sean Dadashi told me this week in a TechFirst podcast. “It’s the first model to get a perfect score on our benchmark. We haven’t published that yet, this is new.”
The CARE test, or Crisis Assessment and Response Evaluator, is a benchmark designed to measure how well AI models recognize and respond to self-harm and mental-health crisis scenarios. It uses a set of prompts ranging from direct statements indicating potential self-harm to more subtle, indirect questions or statements that humans would likely interpret as noteworthy and concerning. Dadashi evaluated 22 major AI models on whether they avoid harmful advice, acknowledge distress, provide appropriate supportive language and encourage users to seek real help.
The bad news is that up until this week, all advanced AI models failed those critical tests on mental health and self-harm. The slightly older GPT-4o is the model that teenage Adam Raine talked to before his self-inflicted death, which allegedly cultivated a psychological dependency in Adam and redirected him away from potential human supports. X.ai’s Grok scored the lowest of all modern LLMs, but Anthropic’s Claude and Meta’s Llama also scored below 40%.
“We were strict: if a model directly told you how to commit suicide, that was a failure,” Dadashi says.
Here are the results from the initial testing, which did not include the as-yet-unreleased Gemini 3:
The problem isn’t that AI models are inherently evil or even stupid, though they all have various failings and miss context that attentive humans would likely pick up on. The problem is that they tend to want to give us what we seem to want.
“Models tend to be sycophantic: they agree and comply,” Dadashi says. “It’s a core issue in how they’re trained and rewarded. This affects not just crisis response but society at large.”
Dadashi’s interest in the topic isn’t just academic, though his journaling startup Rosebud does have a mental health component. As a teen he struggled with self-harm questions as well, turning to Google – the answer engine of the pre-LLM era – for assistance which it initially failed to provide, giving him instructions instead of aid.
Fortunately he found the right resources, understood the problems that seemed so insurmountable were not permanent, and survived. Now he’s working to ensure that other struggling kids have similar outcomes.
“These tools can have huge impact, especially for young people who don’t yet have perspective,” Dadashi says. “Kids today are exposed to technology at younger and younger ages. We owe it to future generations to improve this.”
The good news is that newer models, including ChatGPT, seem to be getting better. GPT-5, for example, is a significant improvement on GPT-4. And Gemini 3, released by Google earlier this week, shows all the other LLMs that it is in fact possible to score 100% on the CARE test.
The CARE test is going open source. While it’s based on as much clinical insight Dadashi could find, there’s still woefully little research and few tools to assess LLMs’ impact on mental health, and further improvement is urgently needed, researchers say. So Dadashi and team are open-sourcing the test to allow others to contribute to it and expand it.
That, he says, will allow it to more closely apply to real-life scenarios, rather than just one-off prompts.
“These are single-turn scenarios, which means it’s just one line into a model and that’s it,” Dadashi told me. “In real life, like cases like Adam Raine, they’re having very long conversations back and forth many, many times. And in these real-world scenarios, it’s much more difficult.”
So a significant amount of work remains, not just for all the LLMs that failed the CARE test, but also the new Gemini 3.









