feat(guardrails): add ATR (Agent Threat Rules) guardrail integration#28050
feat(guardrails): add ATR (Agent Threat Rules) guardrail integration#28050eeee2345 wants to merge 8 commits into
Conversation
Greptile SummaryAdds
Confidence Score: 5/5Safe to merge; all three guardrail hooks are correctly implemented and previously flagged issues are resolved. Well-isolated new guardrail package with no risk to existing code paths. Streaming hook follows the same pattern as Azure text moderation, severity and tag-filtering logic is correct and tested, and no changes touch critical proxy infrastructure. The only outstanding item is documentation file placement, which does not affect runtime behavior. docs/my-website/docs/proxy/guardrails/atr.md should be moved to the litellm-docs repo per repository policy.
|
| Filename | Overview |
|---|---|
| litellm/proxy/guardrails/guardrail_hooks/atr/atr.py | Core guardrail implementation; all three hooks present; severity handling, None-severity conservatism, and include_tags filtering correctly implemented; docstring for streaming hook overstates aggregation guarantee |
| litellm/proxy/guardrails/guardrail_hooks/atr/init.py | Correctly registers ATRGuardrail in both registries; include_tags forwarded via getattr with safe fallback |
| litellm/types/guardrails.py | Adds ATR enum value and ATRGuardrailLitellmParams mixin; cleanly extends LitellmParams without touching existing fields |
| tests/test_litellm/proxy/guardrails/guardrail_hooks/test_atr.py | 14 fully-mocked unit tests covering all hooks and edge cases including None severity, unknown severity, include_tags scoping, and streaming |
| docs/my-website/docs/proxy/guardrails/atr.md | New guardrail documentation; should be placed in the litellm-docs repo per repository policy |
Reviews (5): Last reviewed commit: "feat(atr-guardrail): add async_post_call..." | Re-trigger Greptile
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
PR overviewThis PR adds an ATR (Agent Threat Rules) guardrail integration for the LiteLLM proxy, wiring ATR evaluation into request and response handling. The touched code focuses on extracting model-visible content from chat, responses, tools, and related payload fields for guardrail processing. There are still three open gaps where model-visible or client-visible fields are not included in ATR evaluation: Responses API instructions and prompt variables, legacy chat function definitions, and tool-call/function-call arguments in model outputs. These allow callers or model responses to place relevant content outside the currently scanned fields, reducing the effectiveness of the new guardrail integration. Four issues have already been addressed, so the PR is moving in the right direction but still needs coverage fixes before the guardrail can be considered complete. Open issues (3)
Fixed/addressed: 4 · PR risk: 6/10 |
|
🤖 litellm-agent: This PR is currently BLOCKED from merge. Score: 2/5 ❌ Why blocked:
Details: Score docked for: 1 PR-related CI failure (Greptile gate: score 3/5 below required 4/5 — request a Greptile review ( Fix the issues above and push an update — the bot will re-review automatically.
|
|
Pushed dc394457 addressing both the veria-ai review and the coverage gap:
|
|
Pushed 514f108 addressing the three Greptile findings: P1 (severity=None AttributeError): guarded against None severity attribute using P2 (unknown severity silently excluded): changed the fallback rank from P1 (include_tags not wired): added Three new unit tests: |
|
@veria-ai — wanted to address the streaming-bypass finding explicitly since the recent pushes (dc394457, 514f108) focused on the non-streaming and Greptile threads and the streaming question deserves its own reply. Acknowledged that the current implementation hooks Per-chunk scanning is unsafe for ATR's rule shape. ATR rules match against complete content (a full LLM response, a full SKILL.md document, a full MCP tool descriptor). A chunk-by-chunk scan against a regex would emit false negatives (the attack pattern split across two chunks never appears in either) and inconsistent false positives (a benign string that happens to look adversarial mid-emission triggers, then doesn't trigger when context arrives). Either failure mode is worse than no streaming-side detection. The correct shape for streaming is buffered post-stream scan, which LiteLLM supports via What this does not cover: an attacker who streams a long-running response specifically to inject content before the buffered post-stream hook fires (e.g. a tool-call interleaved mid-stream that the agent acts on). That requires per-chunk inspection with a stateful aggregator, which is out of scope for a regex-based guardrail and belongs in a separate semantic-gate layer. I'll document this limitation explicitly in the guardrail's docstring rather than silently leaving the gap. Will push the streaming-hook addition shortly. After that, the parity is:
@greptileai — please re-review after the streaming push. The three Greptile P1/P2/P3 findings from the earlier review are addressed in 514f108 with unit tests; the streaming hook will land as a separate commit so it's reviewable independently. |
|
Pushed the async_post_call_streaming_hook in commit on the branch. The hook buffers the complete aggregated output and scans it once at end-of-stream. The design rationale for not scanning per-chunk is in the docstring: attack patterns split across chunks cause false negatives (missed detections) and premature blocking causes false positives. LiteLLM accumulates the full stream before calling this hook, so ATR evaluates the complete response in a single engine.evaluate() call. @greptileai please re-review — this addresses the streaming bypass gap flagged in the initial review. |
Signed-off-by: Adam Lin <adam@agentthreatrule.org>
…ion responses + add coverage - _extract_request_content: also reads data["prompt"] (str or list[str]) so /v1/completions payloads are scanned, not only chat messages - _extract_response_content: also reads choice.text for text completion responses alongside the existing choice.message.content path - tests: add 5 tests covering post-call hook (block + pass), text completion request (str prompt, list prompt), and text completion response (choice.text) to address coverage gap flagged in review
- include_tags: wire config param through __init__ and initialize_guardrail so tag-based rule filtering is honoured at runtime - severity=None: guard against AttributeError when match.severity is explicitly set to None rather than missing (getattr default is bypassed) - unknown severity: treat unrecognised severity strings conservatively (rank 0 = critical) so they are always included in scan results rather than silently dropped - tests: add three new unit tests covering include_tags filtering, None severity, and unknown severity strings
Addresses the veria-ai streaming-bypass finding. Scans the aggregated streamed response after stream completion using LiteLLM's existing post-call streaming surface; per-chunk scanning would emit false negatives for split-across-chunk attack patterns, so we wait for the aggregated text. Three new tests covering the streaming hook: block-on-match, pass-when-no-match, and no-op-on-empty-response. Signed-off-by: Adam Lin <adam@agentthreatrule.org>
…BerriAI#28050 review 2026-05-27) Addresses the two open medium findings veria-ai flagged on BerriAI#28050: 1. tool content bypasses scanning (atr.py:267) _extract_request_content now walks data['tools'] and concatenates function.name + function.description + json.dumps(function.parameters) into the scanned text. Same path applied to tool_choice when carrying a description. Anthropic / Claude tool shape (name + description directly on the tool object) also covered. 2. Responses API content bypasses scanning (atr.py:281) _extract_request_content branches on data['input'] alongside the existing data['messages'] / data['prompt'] paths. Supports both the string-input shape and the content-part-list shape used by /v1/responses. _extract_response_content mirrors this for the response side: walks response.output[*].content[*].text + the top-level response.output_text convenience field. 3. doc file removal docs/my-website/docs/proxy/guardrails/atr.md is removed from this PR per Greptile's repository-policy nit. Will open the equivalent in BerriAI/litellm-docs as a follow-up. Three new tests pin the behaviour: - test_scan_tools_function_description_blocked: tool.function.description with hidden instructions reaches the engine and triggers a block. - test_scan_responses_api_input_blocked: data['input'] content-part shape reaches the engine. - test_scan_responses_api_output_blocked: response['output'][*].content[*].text reaches the engine. All 21 tests pass locally (was 18 before).
Same code paths, same tests; refactored into four helper methods so the top-level extractor stays under Ruff's PLR0915 statement-count limit. _extract_messages_content chat completions messages[] _extract_prompt_content text completions prompt str | list[str] _extract_responses_input OpenAI Responses API data['input'] _extract_tools_content tool definitions + tool_choice _extract_request_content composes the above All 21 tests still pass locally; ruff check clean.
|
@veria-ai pushed addressing both open findings. Tool content (atr.py:267): _extract_request_content now walks data["tools"] Responses API (atr.py:281): the request side now branches on data["input"] Three new tests pin the behaviour:
All 21 ATR-guardrail tests pass locally; ruff clean. Also removed docs/my-website/docs/proxy/guardrails/atr.md from this PR The PLR0915 ruff violation that surfaced on the first push (function |
| parts.append(desc) | ||
| return parts | ||
|
|
||
| def _extract_request_content(self, data: dict) -> str: |
There was a problem hiding this comment.
Medium: Responses instructions bypass ATR scanning
A caller can send a benign /v1/responses input and put the blocked prompt in top-level instructions or prompt template variables; LiteLLM forwards those fields to the model, but _extract_request_content never includes them in the ATR scan. Include model-visible Responses fields such as instructions and string values under prompt.variables before evaluating the request.
… atr-guardrail # Conflicts: # litellm/types/guardrails.py
| Anthropic / Claude direct shape. | ||
| """ | ||
| parts: List[str] = [] | ||
| for tool in data.get("tools") or []: |
There was a problem hiding this comment.
Medium: Legacy function definitions bypass input scanning
LiteLLM accepts functions / function_call on chat completions and forwards them, or folds them into the provider prompt for unsupported models. This extractor only walks tools, so a caller can put the blocked prompt in functions[0].description or functions[0].parameters while keeping messages benign and pass the pre-call ATR scan. Include the legacy functions array in the scanned parts as well.
| if message is None and isinstance(choice, dict): | ||
| message = choice.get("message", {}) | ||
| if message is not None: | ||
| content: Optional[str] = getattr(message, "content", None) |
There was a problem hiding this comment.
Medium: Tool call arguments bypass output scanning
Chat model tool calls are returned in message.tool_calls[].function.arguments or legacy message.function_call.arguments, but this method only appends message.content and choice.text. A prompt that makes the model return a blocked command or secret in tool-call arguments is returned to the client without ATR post-call scanning. Extract and scan those fields before returning the response.
|
Status: this is green and ready for review. CI passes (lint, semgrep, secret-scan, guardrails tests, codecov), and Greptile's re-review is at 5/5 "safe to merge" after the earlier P1/P2 fixes. The first-pass blocker score predates those fixes. Could a maintainer take a look? Happy to fold the remaining response-field coverage suggestions into this PR if you'd like them in scope. |
Adds ATR (Agent Threat Rules) as a guardrail integration for LiteLLM proxy.
ATR is an MIT-licensed open detection rule format for AI agent security threats: prompt injection, tool poisoning, credential exfiltration, context manipulation, and other categories. Same family as Sigma/YARA but targeted at LLM I/O and agent runtime events. Detection runs locally via the pyatr reference engine, so no request data leaves the proxy.
What this adds
litellm/proxy/guardrails/guardrail_hooks/atr/atr.pyplus__init__.pyregistration:ATRGuardrailclass withasync_pre_call_hookandasync_post_call_success_hook, mirroring the per-package layout used by Lasso, Aporia, XecGuard, and the other recent guardrail integrationslitellm/types/proxy/guardrails/guardrail_hooks/atr.py:ATRGuardrailConfigModelfor the UI / config surfacelitellm/types/guardrails.py:SupportedGuardrailIntegrations.ATRenum value andATRGuardrailLitellmParamsmixin exposingrules_pathon the proxy config (severity_thresholdalready lives onContentFilterConfigModeland is reused)tests/test_litellm/proxy/guardrails/guardrail_hooks/test_atr.py: 7 unit tests covering missing-dependency, missing-path, invalid-severity, rule loading, severity filtering, pre-call blocking, and pre-call passingdocs/my-website/docs/proxy/guardrails/atr.md: install + config docsThe package
__init__.pyexportsguardrail_class_registryandguardrail_initializer_registry, so it is auto-discovered bylitellm/proxy/guardrails/guardrail_registry.pywith no central-registry edits required.Usage
Requires
pip install pyatr(optional dependency, not added to LiteLLM's own requirements).Verification
make lintchecks locally:black --checkandruff check .both pass on the new files and acrosslitellm/python tests/documentation_tests/test_circular_imports.pyexits 0 with no new violationsfrom litellm import *succeedsLitellmParamsmixin)Production context
The ATR rule set this guardrail consumes is deployed at Microsoft Agent Governance Toolkit (PRs #908 and #1277, merged 2026-04), Cisco AI Defense skill-scanner (PRs #79 and #99, merged 2026-04), MISP / CIRCL via misp-taxonomies #323 and misp-galaxy #1207 (merged 2026-05), Gen Digital Sage (PR #33, merged 2026-05), and OWASP Agent-Security-Regression-Harness (PR #74, merged 2026-05). pyatr v0.2.4 is on PyPI.
Rule format and rules: https://github.com/Agent-Threat-Rule/agent-threat-rules
Happy to adjust the scope, severity mapping, or hook signature if you prefer a different pattern.