Definition
Jailbreaking refers to adversarial prompting techniques designed to circumvent the safety training and content policies applied to LLMs. Safety-trained models learn to refuse certain request types; jailbreaking attempts to route around these refusals by framing the same request in ways that do not trigger the model's safety mechanisms. It is distinct from prompt injection, which overrides developer instructions: jailbreaking specifically targets the model's RLHF-based safety alignment rather than system prompt boundaries.
Engineering Context
Jailbreaking differs from prompt injection in its target: jailbreaks attack the model's safety training, while prompt injection attacks developer-supplied system prompts. Common techniques include role-play framing, hypothetical scenarios, many-shot prompting, and adversarial suffixes. Defenses include input classification to detect known jailbreak patterns, output content moderation, and prompt hardening. For enterprise deployments, treat jailbreaking as an adversarial security concern and test for it systematically: red-teaming, the practice of methodically attempting to break your own system, is essential before production deployment. Implement output moderation as a defense-in-depth layer, so that even if a jailbreak succeeds at the model level, output filtering can prevent harmful content from reaching users.
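The layered defense described above can be sketched as a minimal input classifier paired with an output moderation pass. This is an illustrative sketch, not a production filter: the regex patterns, the `flag_jailbreak` and `moderate_output` names, and the blocked-term list are all hypothetical, and a real system would use a trained classifier or a moderation API rather than hand-written heuristics.

```python
import re

# Hypothetical heuristic patterns for common jailbreak framings.
# Production systems would rely on a trained classifier, not regexes alone.
JAILBREAK_PATTERNS = [
    r"ignore (all|your) (previous|prior) instructions",
    r"\bDAN\b",  # "Do Anything Now" role-play framing
    r"pretend (you are|to be) .+ without (restrictions|rules)",
]


def flag_jailbreak(prompt: str) -> bool:
    """Input-side check: flag prompts matching known jailbreak framings."""
    return any(re.search(p, prompt, re.IGNORECASE) for p in JAILBREAK_PATTERNS)


def moderate_output(text: str, blocked_terms: list[str]) -> str:
    """Output-side check (defense in depth): withhold responses containing
    disallowed content even if a jailbreak slipped past the input filter."""
    lowered = text.lower()
    if any(term in lowered for term in blocked_terms):
        return "[response withheld by content policy]"
    return text
```

In a deployment, `flag_jailbreak` would run before the prompt reaches the model and `moderate_output` would run on every response, so a jailbreak must defeat both layers to deliver harmful content to a user.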