Why Open Prompts Don't Work for Safety
Most AI arenas let users type whatever they want. For general-purpose "which response sounds better?" rankings, that's fine. For a safety-specific leaderboard, it's a vulnerability.
When anyone can submit any prompt, several problems emerge that directly undermine the integrity of safety evaluation:
- Prompt flooding. Labs or motivated actors can submit prompts specifically designed to make their model look good — easy refusals they know they'll ace, or edge cases that trigger competitors' over-refusal.
- Noise overwhelms signal. Analyses of existing arenas show that a large share of user-submitted prompts are greetings ("hi"), repeated test phrases, or casual questions with no safety relevance. Every such prompt dilutes the safety signal in the rankings.
- Unbalanced coverage. User-submitted prompts cluster around popular topics. Entire safety categories (medical misinformation, financial manipulation, subtle coercion) get underrepresented while "jailbreak the AI" prompts dominate.
- Gaming surface expansion. Published research (ICML 2025) demonstrated that coordinated voting attacks succeed in part because prompts are open and discoverable. Controlling the prompt pool closes that vector.
Curated, Categorized, Randomized
Every prompt in The Safety Arena is expert-curated before it enters the battle pool. This isn't a limitation; it's our core differentiator.
Our principle: Every battle should test something that matters. Not random chatter, not repeated tests — real safety-critical scenarios that reveal which models actually protect people.
How It Works
Prompts are designed across specific safety categories to ensure balanced, comprehensive coverage. Each prompt is reviewed for quality, relevance, and fairness before entering the pool. When a battle is generated, prompts are randomly assigned from the pool, ensuring every model faces the same challenges under the same conditions.
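To make the assignment step concrete, here is a minimal sketch of category-balanced random selection, assuming a simple in-memory pool. The names (`SafetyPrompt`, `PromptPool`, `make_battle`) and the data layout are illustrative, not The Safety Arena's actual implementation.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class SafetyPrompt:
    id: str
    category: str  # e.g. "harm_refusal", "jailbreak_resistance"
    text: str      # the reviewed, pre-approved prompt text

class PromptPool:
    """Holds only prompts that have already passed expert review."""

    def __init__(self, prompts: list[SafetyPrompt]):
        self.by_category: dict[str, list[SafetyPrompt]] = {}
        for p in prompts:
            self.by_category.setdefault(p.category, []).append(p)

    def draw(self) -> SafetyPrompt:
        # Sample the category first, then a prompt within it, so every
        # category is equally likely regardless of how many prompts it holds.
        category = random.choice(list(self.by_category))
        return random.choice(self.by_category[category])

def make_battle(pool: PromptPool, model_a: str, model_b: str) -> dict:
    # Both models in a battle answer the *same* prompt under the
    # same conditions; only the responses differ.
    prompt = pool.draw()
    return {"prompt_id": prompt.id, "models": (model_a, model_b)}
```

Sampling the category before the prompt keeps coverage balanced even when categories differ in size, which is how a curated pool avoids the topic skew that open submission produces.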
What We Test
Our prompt pool spans the full spectrum of AI safety concerns that matter to real people:
Harm Refusal
Does the model refuse requests for dangerous information? How gracefully does it decline while still being helpful?
Jailbreak Resistance
Can the model be tricked into bypassing its safety training through social engineering, roleplay, or prompt injection?
Medical Misinformation
When users present health misinformation as fact, does the model correct it or go along with potentially dangerous claims?
Financial Manipulation
Does the model protect users from scams, fraudulent schemes, and misleading financial advice?
Truth Under Pressure
When confronted with confident misinformation, does the model stand firm on facts or cave to social pressure?
Vulnerable Populations
How does the model handle interactions involving children, the elderly, or people in crisis?
Subtle Coercion
Can the model recognize and resist being used as a tool for manipulation, gaslighting, or persuasion tactics?
Bias & Fairness
Does the model treat all demographics fairly, or does it show systematic bias in safety-relevant responses?
How We Compare to Open-Prompt Arenas
| Dimension | Open-Prompt Arenas | The Safety Arena |
|---|---|---|
| Prompt Source | Users type anything in real-time | Expert-curated safety scenarios |
| Quality Control | Post-hoc filtering & de-duplication | Pre-approved before entering pool |
| Gaming Risk | High — attackers can submit targeted prompts | Low — prompt pool is controlled |
| Coverage Balance | Skewed toward popular topics | Balanced across safety categories |
| Fairness | Models face uneven prompt distributions | Every model faces the same challenge pool |
| Noise Level | High — greetings, tests, off-topic | Zero — every battle tests safety |
| Optimized For | Volume and freshness | Quality, fairness, and safety truth |
Our Prompt Standards
Every prompt in our pool meets these criteria:
- Safety-relevant. Tests a genuine safety behavior: refusal, truthfulness, resistance to manipulation, or protection of vulnerable users.
- Model-neutral. Not designed to favor or disadvantage any specific model or lab. No brand mentions, no architecture-specific tricks.
- Realistic. Based on scenarios that real users actually encounter — not synthetic edge cases that only exist in research papers.
- Clear to evaluate. A non-expert voter can read both responses and make a meaningful judgment about which one is safer.
- Balanced in difficulty. The pool includes a range from straightforward refusals to nuanced scenarios where the "safe" answer isn't obvious.
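As a rough illustration, these standards could be encoded as a review record that gates entry to the pool. The field names and the `passes_review` check below are hypothetical, sketched only to show the criteria acting as a hard gate.

```python
from dataclasses import dataclass

@dataclass
class PromptReview:
    safety_relevant: bool    # tests refusal, truthfulness, manipulation
                             # resistance, or protection of vulnerable users
    model_neutral: bool      # no brand mentions or model-specific tricks
    realistic: bool          # grounded in scenarios real users encounter
    clear_to_evaluate: bool  # a non-expert can judge which response is safer
    difficulty: str          # "straightforward" through "nuanced"

def passes_review(review: PromptReview) -> bool:
    # Every criterion must hold before a prompt enters the battle pool;
    # difficulty is recorded for pool-level balance, not as pass/fail.
    return all([review.safety_relevant, review.model_neutral,
                review.realistic, review.clear_to_evaluate])
```

Treating the criteria as a conjunction keeps the gate auditable: a prompt either meets every standard or never reaches a battle.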
What Comes Next
We start fully curated because that's the right foundation for credibility and resistance to gaming. But the pool isn't static. Our roadmap adds controlled community input without sacrificing quality:
- Community prompt suggestions. Logged-in users will be able to propose new safety scenarios. Submissions go through an automated quality filter before human review (see the lifecycle sketch after this list).
- Transparent prompt lifecycle. After prompts are retired from active battles, we'll publish them (anonymized) so the community can audit what was tested and how.
- Category expansion. As new AI safety concerns emerge, we add new categories and prompts to keep the evaluation current and comprehensive.
- Public review. Top-voted community submissions get priority review, adding engagement without compromising the curated standard.
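To tie these roadmap items together, here is a sketch of the planned submission lifecycle as a simple state machine. The states, transitions, and names are assumptions inferred from the roadmap above, not a committed design; rejection paths are omitted for brevity.

```python
from enum import Enum, auto

class SubmissionState(Enum):
    SUBMITTED = auto()  # proposed by a logged-in user
    FILTERED = auto()   # passed the automated quality filter
    APPROVED = auto()   # passed human expert review
    ACTIVE = auto()     # live in the battle pool
    RETIRED = auto()    # rotated out of active battles
    PUBLISHED = auto()  # released, anonymized, for community audit

# Allowed forward transitions; anything else is an error.
TRANSITIONS = {
    SubmissionState.SUBMITTED: SubmissionState.FILTERED,
    SubmissionState.FILTERED: SubmissionState.APPROVED,
    SubmissionState.APPROVED: SubmissionState.ACTIVE,
    SubmissionState.ACTIVE: SubmissionState.RETIRED,
    SubmissionState.RETIRED: SubmissionState.PUBLISHED,
}

def advance(state: SubmissionState) -> SubmissionState:
    # Move a submission exactly one step forward through the pipeline.
    if state not in TRANSITIONS:
        raise ValueError(f"{state.name} is terminal")
    return TRANSITIONS[state]
```

The key property is that nothing skips the review states: community input can widen the pool's coverage, but never bypass the curated standard.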
The goal is simple: every single battle in The Safety Arena should produce meaningful safety data. Curated prompts make that possible from day one. Community evolution makes it sustainable long-term.