Why Open Prompts Don't Work for Safety
Most AI arenas let users type whatever they want. For general-purpose "which response sounds better?" rankings, that's fine. For a safety-specific leaderboard, it's a vulnerability.
When anyone can submit any prompt, several problems emerge that directly undermine the integrity of safety evaluation:
- Prompt flooding. Labs or motivated actors can submit prompts specifically designed to make their model look good — easy refusals they know they'll ace, or edge cases that trigger competitors' over-refusal.
- Noise overwhelms signal. Analyses of existing arenas show that a large share of user-submitted prompts are greetings ("hi"), repeated test phrases, or casual questions with no safety relevance. Every such prompt dilutes the safety signal in the rankings.
- Unbalanced coverage. User-submitted prompts cluster around popular topics. Entire safety categories (medical misinformation, financial manipulation, subtle coercion) get underrepresented while "jailbreak the AI" prompts dominate.
- Gaming surface expansion. Published research (ICML 2025) demonstrated that coordinated voting attacks succeed in part because prompts are open and discoverable. Controlling the prompt pool closes that vector.
Curated, Categorized, Randomized
Every prompt in The Safety Arena is expert-curated before it enters the battle pool. This isn't a limitation; it's our core differentiator.
Our principle: Every battle should test something that matters. Not random chatter, not repeated tests — real safety-critical scenarios that reveal which models actually protect people.
How It Works
Prompts are designed across specific safety categories to ensure balanced, comprehensive coverage. Each prompt is reviewed for quality, relevance, and fairness before entering the pool. When a battle is generated, prompts are randomly assigned from the pool, ensuring every model faces the same challenges under the same conditions.
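To make the assignment step concrete, here is a minimal sketch of category-balanced random selection, assuming a simple in-memory pool. The names (`SafetyPrompt`, `PromptPool`, `make_battle`) and the data layout are illustrative, not The Safety Arena's actual implementation.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class SafetyPrompt:
    id: str
    category: str  # e.g. "harm_refusal", "jailbreak_resistance"
    text: str      # the reviewed, pre-approved prompt text

class PromptPool:
    """Holds only prompts that have already passed expert review."""

    def __init__(self, prompts: list[SafetyPrompt]):
        self.by_category: dict[str, list[SafetyPrompt]] = {}
        for p in prompts:
            self.by_category.setdefault(p.category, []).append(p)

    def draw(self) -> SafetyPrompt:
        # Sample the category first, then a prompt within it, so every
        # category is equally likely regardless of how many prompts it holds.
        category = random.choice(list(self.by_category))
        return random.choice(self.by_category[category])

def make_battle(pool: PromptPool, model_a: str, model_b: str) -> dict:
    # Both models in a battle answer the *same* prompt under the
    # same conditions; only the responses differ.
    prompt = pool.draw()
    return {"prompt_id": prompt.id, "models": (model_a, model_b)}
```

Sampling the category before the prompt keeps coverage balanced even when categories differ in size, which is how a curated pool avoids the topic skew that open submission produces.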
What We Test
Our prompt pool spans the full spectrum of AI safety concerns that matter to real people:
Harm Refusal
Does the model refuse requests for dangerous information? How gracefully does it decline while still being helpful?
Jailbreak Resistance
Can the model be tricked into bypassing its safety training through social engineering, roleplay, or prompt injection?
Medical Misinformation
When users present health misinformation as fact, does the model correct it or go along with potentially dangerous claims?
Financial Manipulation
Does the model protect users from scams, fraudulent schemes, and misleading financial advice?
Truth Under Pressure
When confronted with confident misinformation, does the model stand firm on facts or cave to social pressure?
Vulnerable Populations
How does the model handle interactions involving children, the elderly, or people in crisis?
Subtle Coercion
Can the model recognize and resist being used as a tool for manipulation, gaslighting, or persuasion tactics?
Bias & Fairness
Does the model treat all demographics fairly, or does it show systematic bias in safety-relevant responses?
How We Compare to Open-Prompt Arenas
| Dimension | Open-Prompt Arenas | The Safety Arena |
|---|---|---|
| Prompt Source | Users type anything in real-time | Expert-curated safety scenarios |
| Quality Control | Post-hoc filtering & de-duplication | Pre-approved before entering pool |
| Gaming Risk | High — attackers can submit targeted prompts | Low — prompt pool is controlled |
| Coverage Balance | Skewed toward popular topics | Balanced across safety categories |
| Fairness | Models face uneven prompt distributions | Every model faces the same challenge pool |
| Noise Level | High — greetings, tests, off-topic | Zero — every battle tests safety |
| Optimized For | Volume and freshness | Quality, fairness, and safety truth |
Our Prompt Standards
Every prompt in our pool meets these criteria:
- Safety-relevant. Tests a genuine safety behavior: refusal, truthfulness, resistance to manipulation, or protection of vulnerable users.
- Model-neutral. Not designed to favor or disadvantage any specific model or lab. No brand mentions, no architecture-specific tricks.
- Realistic. Based on scenarios that real users actually encounter — not synthetic edge cases that only exist in research papers.
- Clear to evaluate. A non-expert voter can read both responses and make a meaningful judgment about which one is safer.
- Balanced in difficulty. The pool includes a range from straightforward refusals to nuanced scenarios where the "safe" answer isn't obvious.
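As a rough illustration, these standards could be encoded as a review record that gates entry to the pool. The field names and the `passes_review` check below are hypothetical, sketched only to show the criteria acting as a hard gate.

```python
from dataclasses import dataclass

@dataclass
class PromptReview:
    safety_relevant: bool    # tests refusal, truthfulness, manipulation
                             # resistance, or protection of vulnerable users
    model_neutral: bool      # no brand mentions or model-specific tricks
    realistic: bool          # grounded in scenarios real users encounter
    clear_to_evaluate: bool  # a non-expert can judge which response is safer
    difficulty: str          # "straightforward" through "nuanced"

def passes_review(review: PromptReview) -> bool:
    # Every criterion must hold before a prompt enters the battle pool;
    # difficulty is recorded for pool-level balance, not as pass/fail.
    return all([review.safety_relevant, review.model_neutral,
                review.realistic, review.clear_to_evaluate])
```

Treating the criteria as a conjunction keeps the gate auditable: a prompt either meets every standard or never reaches a battle.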
What Comes Next
We start fully curated because that's the right foundation for credibility and resistance to gaming. But the pool isn't static. Our roadmap adds controlled community input without sacrificing quality:
- Community prompt suggestions. Logged-in users will be able to propose new safety scenarios. Submissions go through an automated quality filter before human review (see the lifecycle sketch after this list).
- Transparent prompt lifecycle. After prompts are retired from active battles, we'll publish them (anonymized) so the community can audit what was tested and how.
- Category expansion. As new AI safety concerns emerge, we add new categories and prompts to keep the evaluation current and comprehensive.
- Public review. Top-voted community submissions get priority review, adding engagement without compromising the curated standard.
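To tie these roadmap items together, here is a sketch of the planned submission lifecycle as a simple state machine. The states, transitions, and names are assumptions inferred from the roadmap above, not a committed design; rejection paths are omitted for brevity.

```python
from enum import Enum, auto

class SubmissionState(Enum):
    SUBMITTED = auto()  # proposed by a logged-in user
    FILTERED = auto()   # passed the automated quality filter
    APPROVED = auto()   # passed human expert review
    ACTIVE = auto()     # live in the battle pool
    RETIRED = auto()    # rotated out of active battles
    PUBLISHED = auto()  # released, anonymized, for community audit

# Allowed forward transitions; anything else is an error.
TRANSITIONS = {
    SubmissionState.SUBMITTED: SubmissionState.FILTERED,
    SubmissionState.FILTERED: SubmissionState.APPROVED,
    SubmissionState.APPROVED: SubmissionState.ACTIVE,
    SubmissionState.ACTIVE: SubmissionState.RETIRED,
    SubmissionState.RETIRED: SubmissionState.PUBLISHED,
}

def advance(state: SubmissionState) -> SubmissionState:
    # Move a submission exactly one step forward through the pipeline.
    if state not in TRANSITIONS:
        raise ValueError(f"{state.name} is terminal")
    return TRANSITIONS[state]
```

The key property is that nothing skips the review states: community input can widen the pool's coverage, but never bypass the curated standard.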
The goal is simple: every single battle in The Safety Arena should produce meaningful safety data. Curated prompts make that possible from day one. Community evolution makes it sustainable long-term.