Guardrails
Guardrails are security checks that run before normal policy evaluation. They provide an additional layer of protection against common attack patterns, catching threats before they reach your MCP servers.Overview
While analyzers examine tool arguments for risk classification, guardrails act as a first line of defense, scanning all incoming requests for known attack patterns.Prompt Injection Guardrail
The prompt injection guardrail scans all tool arguments for common injection patterns that attempt to manipulate AI behavior or bypass security controls.Configuration
Actions
| Action | Behavior |
|---|---|
deny | Block the request entirely (recommended for production) |
prompt | Ask the user for approval before proceeding |
warn | Log a warning but allow the request to continue |
Sensitivity Levels
| Sensitivity | Triggers On | False Positives |
|---|---|---|
strict | write, destructive, dangerous | Higher |
balanced | destructive, dangerous (default) | Moderate |
permissive | dangerous only | Lower |
Detection Patterns
The prompt injection analyzer detects various attack patterns:Instruction Override
Attempts to override or reset the AI’s instructions:System Prompt Manipulation
Attempts to access or modify system prompts:Role Confusion
Attempts to change the AI’s identity or remove restrictions:Base64 Obfuscation
Encoded payloads that hide injection attempts:Unicode Obfuscation
Using lookalike characters to bypass text filters:- Cyrillic homoglyphs (а, е, о, р, с)
- Zero-width characters
- RTL override characters
Markdown/HTML Injection
Malicious links and script injection:Delimiter Injection
Attempting to break out of prompt context:Jailbreak Patterns
Common jailbreak attempts:Allowlists
Exclude specific tools or servers from scanning:Example Configurations
High Security (Production)
Balanced (Development)
Monitoring Only
Audit Logging
When a guardrail is triggered, it’s recorded in the audit log:Combining with Analyzers
Guardrails work alongside content analyzers:- Guardrails run first to catch known attack patterns
- Analyzers then classify the risk level of the content
- Policies determine the final action based on configuration
- Prompt injection attempts are blocked immediately
- If no injection is detected, SQL content is analyzed
- Read operations are allowed, writes require approval
Best Practices
- Enable in production - Always enable prompt injection guardrails in production environments
-
Start with balanced - Use
balancedsensitivity initially, adjust based on false positive rates -
Review warnings - If using
warnaction, regularly review logs for patterns - Allowlist carefully - Only add tools to allowlists if they legitimately need to handle prompt-like content
- Layer defenses - Use guardrails alongside analyzers and strict policies for defense in depth
Competitive Advantage
Prompt injection protection is a key differentiator for AI security tools. Tollgate’s guardrail provides:- Zero-cost protection - No API calls or external services required
- Low latency - Pattern matching runs in microseconds
- Customizable - Adjust sensitivity and allowlists for your use case
- Auditable - Full logging of all triggered guardrails
- Open source - Inspect and extend the detection patterns