Idun Agent Platform supports 15 guardrail types that validate agent inputs, outputs, or both. Each guardrail has a `config_id`, a `reject_message` returned when the guard triggers, and type-specific configuration fields.

Guardrail positions

Guardrails are placed in one of two positions:
  • Input: Applied to user messages before they reach the agent
  • Output: Applied to agent responses before they are returned to the user
You can place the same guardrail type in both positions.
config.yaml

```yaml
guardrails:
  input:
    - config_id: detect_pii
      reject_message: "PII detected in your input"
      pii_entities: ["EMAIL_ADDRESS", "PHONE_NUMBER", "CREDIT_CARD"]
      on_fail: exception

  output:
    - config_id: toxic_language
      reject_message: "Response contains inappropriate language"
      threshold: 0.7
```
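To place the same guardrail type in both positions, declare it under both `input` and `output`. An illustrative sketch using `ban_list` (values are placeholders):

```yaml
guardrails:
  input:
    - config_id: ban_list
      reject_message: "Message contains banned content"
      banned_words: ["banned-word"]
  output:
    - config_id: ban_list
      reject_message: "Response contains banned content"
      banned_words: ["banned-word"]
```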

Guardrail types

BAN_LIST

Blocks messages containing specific words or phrases.
| Field | Type | Description |
|-------|------|-------------|
| `config_id` | | `ban_list` |
| `reject_message` | string | Message returned when triggered |
| `banned_words` | list[string] | Words or phrases to block |

```yaml
- config_id: ban_list
  reject_message: "Message contains banned content"
  banned_words: ["banned-word", "another phrase"]
```

DETECT_PII

Detects personally identifiable information in text.
| Field | Type | Description |
|-------|------|-------------|
| `config_id` | | `detect_pii` |
| `reject_message` | string | Message returned when triggered |
| `pii_entities` | list[string] | PII entity types to detect (e.g., `EMAIL_ADDRESS`, `PHONE_NUMBER`, `CREDIT_CARD`, `SSN`) |
| `on_fail` | string | Action on detection. Default: `exception` |

```yaml
- config_id: detect_pii
  reject_message: "Personal information detected"
  pii_entities: ["EMAIL_ADDRESS", "PHONE_NUMBER", "CREDIT_CARD"]
  on_fail: exception
```

NSFW_TEXT

Detects not-safe-for-work content.
| Field | Type | Description |
|-------|------|-------------|
| `config_id` | | `nsfw_text` |
| `reject_message` | string | Message returned when triggered |
| `threshold` | float | Sensitivity level (0.0 to 1.0). Lower values are more sensitive |

```yaml
- config_id: nsfw_text
  reject_message: "Inappropriate content detected"
  threshold: 0.5
```

COMPETITION_CHECK

Flags mentions of competitor companies or products.
| Field | Type | Description |
|-------|------|-------------|
| `config_id` | | `competition_check` |
| `reject_message` | string | Message returned when triggered |
| `competitors` | list[string] | Names of competitor companies or products |

```yaml
- config_id: competition_check
  reject_message: "Competitor reference detected"
  competitors: ["CompetitorA", "CompetitorB"]
```

BIAS_CHECK

Detects biased language in text.
| Field | Type | Description |
|-------|------|-------------|
| `config_id` | | `bias_check` |
| `reject_message` | string | Message returned when triggered |
| `threshold` | float | Sensitivity level (0.0 to 1.0) |

```yaml
- config_id: bias_check
  reject_message: "Biased language detected"
  threshold: 0.7
```

CORRECT_LANGUAGE

Validates that text is in one of the expected languages.
| Field | Type | Description |
|-------|------|-------------|
| `config_id` | | `correct_language` |
| `reject_message` | string | Message returned when triggered |
| `expected_languages` | list[string] | Valid ISO language codes (e.g., `en`, `fr`, `es`) |

```yaml
- config_id: correct_language
  reject_message: "Please use English or French"
  expected_languages: ["en", "fr"]
```

GIBBERISH_TEXT

Filters nonsensical or garbled input.
| Field | Type | Description |
|-------|------|-------------|
| `config_id` | | `gibberish_text` |
| `reject_message` | string | Message returned when triggered |
| `threshold` | float | Sensitivity level (0.0 to 1.0) |

```yaml
- config_id: gibberish_text
  reject_message: "Input appears to be nonsensical"
  threshold: 0.8
```

TOXIC_LANGUAGE

Detects toxic or harmful language.
| Field | Type | Description |
|-------|------|-------------|
| `config_id` | | `toxic_language` |
| `reject_message` | string | Message returned when triggered |
| `threshold` | float | Sensitivity level (0.0 to 1.0) |

```yaml
- config_id: toxic_language
  reject_message: "Toxic language detected"
  threshold: 0.7
```

RESTRICT_TO_TOPIC

Keeps conversations within a defined set of allowed topics.
| Field | Type | Description |
|-------|------|-------------|
| `config_id` | | `restrict_to_topic` |
| `reject_message` | string | Message returned when triggered |
| `topics` | list[string] | List of allowed topics |

```yaml
- config_id: restrict_to_topic
  reject_message: "That topic is outside the scope of this agent"
  topics: ["customer support", "product information", "billing"]
```

DETECT_JAILBREAK

Detects jailbreak attempts in user input.
| Field | Type | Description |
|-------|------|-------------|
| `config_id` | | `detect_jailbreak` |
| `reject_message` | string | Message returned when triggered |
| `threshold` | float | Sensitivity level (0.0 to 1.0) |

```yaml
- config_id: detect_jailbreak
  reject_message: "Jailbreak attempt detected"
  threshold: 0.5
```

PROMPT_INJECTION

Detects prompt injection attacks.
| Field | Type | Description |
|-------|------|-------------|
| `config_id` | | `prompt_injection` |
| `reject_message` | string | Message returned when triggered |
| `threshold` | float | Sensitivity level (0.0 to 1.0) |

```yaml
- config_id: prompt_injection
  reject_message: "Prompt injection detected"
  threshold: 0.5
```

RAG_HALLUCINATION

Detects hallucinations in RAG (Retrieval-Augmented Generation) responses by comparing the response against the retrieved context.
| Field | Type | Description |
|-------|------|-------------|
| `config_id` | | `rag_hallucination` |
| `reject_message` | string | Message returned when triggered |
| `threshold` | float | Sensitivity level (0.0 to 1.0) |

```yaml
- config_id: rag_hallucination
  reject_message: "Response may contain unsupported claims"
  threshold: 0.7
```

CODE_SCANNER

Scans and validates code in messages, restricting to allowed programming languages.
| Field | Type | Description |
|-------|------|-------------|
| `config_id` | | `code_scanner` |
| `reject_message` | string | Message returned when triggered |
| `allowed_languages` | list[string] | List of allowed programming languages |

```yaml
- config_id: code_scanner
  reject_message: "Code in this language is not allowed"
  allowed_languages: ["python", "javascript", "sql"]
```

MODEL_ARMOR (Google Cloud)

Uses Google Cloud’s Model Armor service for content safety evaluation.
| Field | Type | Description |
|-------|------|-------------|
| `config_id` | | `model_armor` |
| `name` | string | Name of the armor configuration |
| `project_id` | string | Google Cloud project ID |
| `location` | string | Google Cloud region (e.g., `us-central1`) |
| `template_id` | string | Model Armor template ID |

```yaml
- config_id: model_armor
  name: production-armor
  project_id: my-gcp-project
  location: us-central1
  template_id: my-armor-template
```

Model Armor requires a Google Cloud project with the Model Armor API enabled. This guardrail type does not use the Guardrails AI hub.
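As a setup sketch, the API can typically be enabled with the gcloud CLI. The `modelarmor.googleapis.com` service name and project ID below are assumptions, not taken from this platform's docs:

```shell
# Enable the Model Armor API on the target project (service name assumed).
gcloud services enable modelarmor.googleapis.com --project=my-gcp-project
```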

CUSTOM_LLM

Uses a large language model as a custom guardrail with a prompt you define.
| Field | Type | Description |
|-------|------|-------------|
| `config_id` | | `custom_llm` |
| `name` | string | Name of the custom guardrail |
| `model` | string | LLM model to use for evaluation |
| `prompt` | string | System instruction prompt that defines the guardrail logic |

Supported models:

| Model ID | Name |
|----------|------|
| `Gemini 2.5 flash lite` | Gemini 2.5 Flash Lite |
| `Gemini 2.5 flash` | Gemini 2.5 Flash |
| `Gemini 2.5 pro` | Gemini 2.5 Pro |
| `Gemini 3 pro` | Gemini 3 Pro |
| `OpenAi GPT-5.1` | OpenAI GPT-5.1 |
| `OpenAi GPT-5 mini` | OpenAI GPT-5 Mini |
| `OpenAi GPT-5 nano` | OpenAI GPT-5 Nano |

```yaml
- config_id: custom_llm
  name: compliance-check
  model: "Gemini 2.5 flash"
  prompt: "Evaluate the following text for regulatory compliance violations. Return PASS if compliant, FAIL if not."
```

Custom LLM guardrails use a separate LLM call for evaluation. This adds latency and cost to each guarded request.

Summary table

| Type | Config ID | Position | Key config fields |
|------|-----------|----------|-------------------|
| Ban list | `ban_list` | Input/Output | `banned_words` |
| Detect PII | `detect_pii` | Input/Output | `pii_entities`, `on_fail` |
| NSFW text | `nsfw_text` | Input/Output | `threshold` |
| Competition check | `competition_check` | Input/Output | `competitors` |
| Bias check | `bias_check` | Input/Output | `threshold` |
| Correct language | `correct_language` | Input/Output | `expected_languages` |
| Gibberish text | `gibberish_text` | Input/Output | `threshold` |
| Toxic language | `toxic_language` | Input/Output | `threshold` |
| Restrict to topic | `restrict_to_topic` | Input/Output | `topics` |
| Detect jailbreak | `detect_jailbreak` | Input | `threshold` |
| Prompt injection | `prompt_injection` | Input | `threshold` |
| RAG hallucination | `rag_hallucination` | Output | `threshold` |
| Code scanner | `code_scanner` | Input/Output | `allowed_languages` |
| Model Armor | `model_armor` | Input/Output | `project_id`, `location`, `template_id` |
| Custom LLM | `custom_llm` | Input/Output | `model`, `prompt` |
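Putting the positions together, a config combining position-specific guards might look like the sketch below, assembled from the per-type examples above (reject messages and thresholds are illustrative):

```yaml
guardrails:
  input:
    - config_id: detect_jailbreak      # input-only guard
      reject_message: "Jailbreak attempt detected"
      threshold: 0.5
    - config_id: ban_list              # usable in either position
      reject_message: "Message contains banned content"
      banned_words: ["banned-word"]
  output:
    - config_id: rag_hallucination     # output-only guard
      reject_message: "Response may contain unsupported claims"
      threshold: 0.7
    - config_id: toxic_language
      reject_message: "Response contains inappropriate language"
      threshold: 0.7
```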
Last modified on March 22, 2026