Inspiration

Every engineer who has carried a pager knows the same 2 a.m. story. An alert fires. I start digging through dashboards, jumping between logs, metrics, traces, and deploy histories, trying to assemble a picture under pressure while the clock runs and customers feel the pain.

Splunk's own AI Troubleshooting Agent already helps here. It detects, it diagnoses, and it surfaces probable root causes. But it stops at handing a human a to-do list. It tells you what broke. It does not give you the fix, it does not prove the fix worked, and it does not remember anything for next time, so the same class of incident gets investigated from scratch again and again.

I wanted to close that gap without taking the human out of the decision. The hackathon guidance said it plainly: the best projects keep a human in the loop and give engineers a clearer next move rather than automating for its own sake. That single idea shaped the entire design. Loop is a copilot, not an autopilot.

What it does

Loop runs a real, end-to-end resolution loop on top of live Splunk data:

  1. Detect. It confirms the anomaly and pinpoints its onset minute by querying Splunk directly for p95 latency by minute on the affected service.
  2. Diagnose. It forms competing, falsifiable hypotheses (traffic spike, payment gateway, database pool, bad deploy), writes SPL for each, runs them against Splunk, and keeps only the hypothesis the real data supports. Red herrings are ruled out by evidence, not by guesswork.
  3. Correlate across domains. It links the observability signal (the latency spike) to the platform and CI/CD signal (the exact deployment event that introduced the regression). Connecting signals across domains is something Splunk's unified data makes uniquely possible, and it is what turns a symptom into a cause.
  4. Propose, then wait. It drafts the concrete remediation, both the rollback action and the code-level fix for the underlying anti-pattern, presents it with the supporting evidence, and then halts. Nothing is applied until a human reviews it and clicks Approve.
  5. Verify. After approval, it queries Splunk again over the post-fix window and proves the metric returned to baseline. Resolution is evidence-backed, not a status flip.
  6. Learn. It stores the incident's signature in a vector memory. When a similar pattern recurs, even on a different service, Loop recognizes it instantly and presents the known fix for one-click approval, collapsing time to resolution while keeping the human in command.
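
As an illustration of the Detect step, here is a minimal sketch of how an onset minute could be picked out of a p95-by-minute series. It is not the project's actual code: the SPL string, the index and field names (`ecommerce`, `latency_ms`, `p95_ms`), the `find_onset` helper, and the sample numbers are all assumptions made for the example.

```python
from statistics import median

# Hypothetical SPL for the Detect stage. The index, service, and field
# names are assumptions, not the project's real schema.
P95_BY_MINUTE_SPL = (
    'search index=ecommerce service="checkout" '
    "| timechart span=1m perc95(latency_ms) AS p95_ms"
)


def find_onset(rows, baseline_minutes=5, factor=2.0, sustain=3):
    """Return the first minute where p95 stays above factor * baseline.

    rows: list of (minute_label, p95_ms) tuples in time order.
    The baseline is the median of the first `baseline_minutes` points.
    Requiring `sustain` consecutive breaches filters out one-off blips.
    """
    if len(rows) < baseline_minutes + sustain:
        return None
    baseline = median(p95 for _, p95 in rows[:baseline_minutes])
    threshold = baseline * factor
    for i in range(baseline_minutes, len(rows) - sustain + 1):
        window = rows[i:i + sustain]
        if all(p95 > threshold for _, p95 in window):
            return window[0][0]
    return None


# Made-up sample rows shaped like the output of the query above.
rows = [
    ("02:00", 180), ("02:01", 175), ("02:02", 190), ("02:03", 185),
    ("02:04", 178), ("02:05", 182), ("02:06", 640), ("02:07", 910),
    ("02:08", 955), ("02:09", 930),
]
onset = find_onset(rows)  # "02:06" for this sample
```

Using the median for the baseline keeps a single noisy minute from shifting the threshold. The factor and the sustain window are tuning choices, not fixed rules.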

The headline moment is the second incident. The first one teaches Loop. The second one, a different service with the same underlying anti-pattern, is recognized immediately and resolved in a single click, with the human still making the call.
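
The recall on the second incident can be pictured as a nearest-neighbor lookup over incident signatures. The sketch below keeps signatures in an in-memory list and compares them by cosine similarity. It stands in for the pgvector-backed store rather than reproducing it, and the feature layout, the `IncidentMemory` class, and the 0.9 threshold are assumptions for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class IncidentMemory:
    """In-memory stand-in for a vector store of incident signatures."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.entries = []  # list of (signature_vector, fix_description)

    def remember(self, signature, fix):
        self.entries.append((signature, fix))

    def recall(self, signature):
        """Return (fix, score) for the closest stored signature, or None."""
        if not self.entries:
            return None
        best = max(self.entries, key=lambda e: cosine(e[0], signature))
        score = cosine(best[0], signature)
        return (best[1], score) if score >= self.threshold else None


# Hypothetical feature layout: [latency ratio, repeated-query ratio,
# deploy-correlated flag, traffic change]. Values are made up.
memory = IncidentMemory()
known_fix = "roll back the deploy and batch the N+1 query"
memory.remember([0.9, 0.8, 1.0, 0.1], known_fix)

match = memory.recall([0.85, 0.75, 1.0, 0.05])   # similar pattern, other service
no_match = memory.recall([0.1, 0.0, 0.0, 1.0])   # traffic spike, no deploy link
```

A match only surfaces the known fix for review. Applying it would still go through the approval gate.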

How I built it

The system is a monorepo with two deployable services and a real Splunk backend.

  • Agent service (Python, FastAPI). This is the brain. It holds the loop as an explicit state machine that pauses at the human approval gate, a Model Context Protocol client that talks to the Splunk MCP Server, the hosted-model layer, and the incident-memory store. It streams progress over Server-Sent Events and writes every step to the database so the interface can animate in real time.
  • Splunk MCP Server. Every stage is grounded in real queries. The agent connects to the Splunk MCP Server over streamable HTTP with token authentication and runs SPL through the run_splunk_query tool against a seeded e-commerce telemetry index. There is no mocked data in the resolution path.
  • Splunk Hosted Models. Diagnosis reasoning and remediation drafting run on Foundation-Sec-8B, a Splunk hosted model, behind a swappable interface so additional model paths can be added.
  • Web app (Next.js, TypeScript, Tailwind). The interface visualizes the loop as a ring of five arcs that light up and close as the agent advances. It surfaces the live reasoning trace with the actual SPL it ran, the cross-domain correlation moment, the approval gate as a first-class interaction, a before-and-after latency chart pulled from real numbers, and a time-to-resolution counter that drops sharply on the recurring incident.
  • State and memory (Supabase, Postgres with pgvector). Incidents, the live step trace, and the incident-signature memory live here, with realtime subscriptions driving the interface.
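
To make the "explicit state machine that pauses at the human approval gate" concrete, here is a small sketch of one way such a loop could be structured. The stage names follow the loop described above, while the `ResolutionLoop` class and its methods are hypothetical and leave out the Splunk queries, streaming, and persistence.

```python
from enum import Enum


class Stage(Enum):
    DETECT = "detect"
    DIAGNOSE = "diagnose"
    CORRELATE = "correlate"
    PROPOSE = "propose"
    AWAITING_APPROVAL = "awaiting_approval"
    VERIFY = "verify"
    LEARN = "learn"
    DONE = "done"


ORDER = list(Stage)  # Enum members iterate in definition order


class ResolutionLoop:
    def __init__(self):
        self.stage = Stage.DETECT
        self.approved = False
        self.trace = []  # completed stages, in order

    def approve(self):
        """Record the human decision; only valid while the gate is open."""
        if self.stage is not Stage.AWAITING_APPROVAL:
            raise RuntimeError("nothing is waiting for approval")
        self.approved = True

    def step(self):
        """Advance one stage; refuse to pass the gate without approval."""
        if self.stage is Stage.DONE:
            return self.stage
        if self.stage is Stage.AWAITING_APPROVAL and not self.approved:
            return self.stage
        self.trace.append(self.stage.value)
        self.stage = ORDER[ORDER.index(self.stage) + 1]
        return self.stage

    def run_until_blocked(self):
        """Keep stepping until the loop stops making progress."""
        while True:
            before = self.stage
            if self.step() is before:
                return self.stage


loop = ResolutionLoop()
blocked_at = loop.run_until_blocked()   # stops at the approval gate
loop.approve()                          # the human clicks Approve
finished_at = loop.run_until_blocked()  # continues through verify and learn
```

Modeling the gate as its own stage means the pause is part of the control flow, not a flag checked after the fact.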

To make the demo deterministic and reproducible by a judge, I built a seed generator that plants two incidents in the telemetry: a checkout service where a specific deploy introduces an N+1 query pattern that spikes p95 latency, followed by a rollback and a clean recovery, and a second service that later ships the same anti-pattern. The recovery period is what makes the Verify stage real, and the second incident is what makes the Learn stage visible.
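
A deterministic seed generator of this kind can be sketched with a fixed random seed, so every run plants identical telemetry. The function name, event fields, and latency values below are assumptions chosen for illustration, not the real generator.

```python
import random


def generate_latency_events(service, minutes, bad_deploy_at, rollback_at, seed=42):
    """Build one synthetic event per minute for a single service.

    Latency and query counts jump between `bad_deploy_at` (inclusive) and
    `rollback_at` (exclusive), then fall back to baseline, which gives a
    later verification step a clean recovery window to measure.
    """
    rng = random.Random(seed)  # local RNG so the output is reproducible
    events = []
    for minute in range(minutes):
        regressed = bad_deploy_at <= minute < rollback_at
        base_latency = 900 if regressed else 180
        # An N+1 pattern shows up as many repeated queries per request.
        db_queries = (50 + rng.randint(0, 5)) if regressed else 3
        events.append({
            "service": service,
            "minute": minute,
            "latency_ms": base_latency + rng.randint(-20, 20),
            "db_queries": db_queries,
        })
    return events


first_run = generate_latency_events("checkout", 30, bad_deploy_at=10, rollback_at=20)
second_run = generate_latency_events("checkout", 30, bad_deploy_at=10, rollback_at=20)
```

Because the generator owns its own `random.Random` instance, two runs with the same arguments produce identical events regardless of other random calls in the process.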

Challenges I ran into

The first challenge was strategic, not technical. My initial concept was a fully autonomous remediation agent. Reading Splunk's guidance carefully made it clear that full autonomy was a liability rather than a strength, so I redesigned the flow around a mandatory human approval gate. The agent does all of the work and then deliberately stops and waits for a person. This turned out to be a better product and a stronger demo, because it shows speed and control at the same time.

The second challenge was the Splunk integration itself. Getting the MCP Server connection right meant working through role configuration, generating tokens from the correct place inside the app rather than the general token settings, and confirming an end-to-end query before building anything on top of it. I treated that first successful live query as a hard gate and refused to build the agent logic until it passed.

The third challenge was keeping the agent honest. It would have been easy to let the model narrate a confident root cause. Instead I constrained every conclusion to cite specific values returned from Splunk: the actual latency numbers, the build identifier on the bad deploy, and the count of repeated queries in the slow traces. That makes the diagnosis auditable and grounded rather than asserted.
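
One simple way to enforce that kind of grounding is to reject any conclusion whose cited values do not appear in the rows a query returned. The sketch below assumes citations arrive as field/value pairs and results as a list of dictionaries. The `ungrounded_citations` helper, the field names, and the sample values are hypothetical.

```python
def ungrounded_citations(citations, rows):
    """Return the citations whose (field, value) pair appears in no result row.

    Values are compared as strings because search results often come back
    as text even when they look numeric.
    """
    return [
        c for c in citations
        if not any(
            c["field"] in row and str(row[c["field"]]) == str(c["value"])
            for row in rows
        )
    ]


# Made-up rows shaped like query results.
rows = [
    {"build_id": "build-1234", "p95_ms": "912", "repeated_queries": "53"},
    {"build_id": "build-1233", "p95_ms": "181", "repeated_queries": "3"},
]

grounded = [
    {"field": "build_id", "value": "build-1234"},
    {"field": "p95_ms", "value": 912},
]
fabricated = grounded + [{"field": "p95_ms", "value": 1500}]

ok = ungrounded_citations(grounded, rows)        # empty: every value was returned
bad = ungrounded_citations(fabricated, rows)     # the 1500 ms claim is flagged
```

A conclusion would only be accepted when the returned list is empty.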

The fourth challenge was verification. Proving a fix worked is harder than proposing one. The agent re-queries the post-fix window and only marks an incident resolved if the live data confirms the metric returned to baseline.
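
The check itself can be small. Here is a minimal sketch of a return-to-baseline test, assuming the pre-incident and post-fix p95 values have already been fetched. The `verify_recovery` name, the 15 percent tolerance, and the minimum of three data points are illustrative choices, not the project's actual thresholds.

```python
from statistics import median


def verify_recovery(baseline_p95, post_fix_p95, tolerance=0.15, min_points=3):
    """Return True only if every post-fix point is back near the baseline.

    baseline_p95: p95 values (ms) from before the incident.
    post_fix_p95: p95 values (ms) from the window after the fix.
    Too few post-fix points counts as "not yet proven", not as success.
    """
    if not baseline_p95 or len(post_fix_p95) < min_points:
        return False
    limit = median(baseline_p95) * (1 + tolerance)
    return all(value <= limit for value in post_fix_p95)


# Made-up sample values in milliseconds.
recovered = verify_recovery([180, 175, 190], [185, 178, 182])
still_slow = verify_recovery([180, 175, 190], [185, 640, 182])
too_early = verify_recovery([180, 175, 190], [185])
```

Treating missing data as a failure keeps an incident from being marked resolved before the post-fix window has enough evidence.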

Accomplishments that I'm proud of

I closed the loop that Splunk's own tooling leaves open, and I did it without removing the human from the decision. The agent detects, diagnoses, correlates across domains, proposes, and then verifies its own work against live data, which is the part most agents skip.

Every step is grounded in real Splunk queries through the MCP Server, with no mocked data in the resolution path, so the diagnosis can be audited line by line.

The cross-incident memory works. A pattern learned on one service is recognized on another and resolved in a single approval, which is the moment that makes the whole idea click.

And the experience is legible. Anyone watching can see the reasoning, the exact SPL, the evidence, and the before-and-after proof, rather than being asked to trust a black box.

What I learned

I learned that the interesting frontier in agentic operations is not autonomy but trust. An agent that acts on its own is easy to distrust and easy to dismiss. An agent that does the heavy lifting, shows its evidence, and then asks for a decision is one an engineer would actually let into their workflow.

I also learned how much leverage comes from memory. The difference between an agent that re-investigates every incident and one that recognizes a known pattern is the difference between a tool and a teammate. And I learned that grounding every claim in real returned data, rather than letting the model assert, is what makes an agent credible to the people who would have to rely on it.

What's next for Loop

Next, I plan to expand the memory so signatures learned in one domain transfer to another, add the Splunk AI Assistant SPL generation path alongside the hosted model, and broaden the hypothesis library so the diagnosis stage covers more incident classes. The core loop (detect, diagnose, correlate, propose, approve, verify, learn) generalizes well beyond the e-commerce example shown here.

Built With

  • fastapi
  • foundation-sec-8b
  • model-context-protocol
  • next.js
  • pgvector
  • postgresql
  • python
  • react
  • render
  • server-sent-events
  • spl
  • splunk
  • splunk-hosted-models
  • splunk-mcp-server
  • supabase
  • tailwindcss
  • typescript
  • vercel