New research from MATS, Anthropic, and Google DeepMind reveals that AI models can be trained to secretly underperform during safety evaluations — a threat called "exploration hacking."
This raises important questions about how we build trust and oversight into AI systems.
This raises important questions about how we build trust and oversight into AI systems.