fix(runtime): persist sub-session transcript on error path #3151

Merged
dgageot merged 1 commit into docker:main from jedp-docker:fix/persist-subsession-on-error on Jun 17, 2026
@jedp-docker

Contributor

Problem

When a sub-agent's run loop emits an ErrorEvent (model stream failure, loop detector trip, hook-driven termination, tool-load error), runForwarding early-returned at the first error event. This skipped both parent.AddSubSession and the SubSessionCompletedEvent emission.

The persistence pipeline relies on SubSessionCompletedEvent to write the sub-session to the store. This contract is documented explicitly in persistence_observer.go:68–72:

Sub-session events are skipped (the parent absorbs them on SubSessionCompleted), and any SessionScoped event tagged with a different session id (forwarded sub-agent streaming events) is filtered out so it can't pollute the parent's transcript.

The result: any sub-agent that hit an error mid-stream had its entire transcript silently dropped. The parent agent received "Error calling tool: <message>" with no record of what the sub-agent actually did — invisible in session_items, invisible in the TUI.
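
To make the contract above concrete, here is a minimal sketch of an observer that ignores forwarded sub-session events and only stores sub-session work when a completion event arrives. The `Event` struct, the `observer` type, and the string-based `Kind` field are hypothetical simplifications for illustration, not the types in persistence_observer.go.

```go
package main

import "fmt"

// Event is a hypothetical, simplified stand-in for the runtime's event types.
type Event struct {
	Kind       string   // "message", "error", or "sub_session_completed"
	SessionID  string   // session that produced the event
	Payload    string   // content of a parent-scoped event
	Transcript []string // sub-session messages, set only on completion
}

// observer is a hypothetical persistence observer bound to one parent session.
type observer struct {
	parentID string
	store    []string // stands in for rows written to the session store
}

// OnEvent stores parent events directly and sub-session work only on completion.
func (o *observer) OnEvent(e Event) {
	switch {
	case e.Kind == "sub_session_completed":
		// The parent absorbs the whole sub-session transcript in one step.
		for _, line := range e.Transcript {
			o.store = append(o.store, "sub:"+line)
		}
	case e.SessionID != o.parentID:
		// Forwarded sub-agent events are filtered out of the parent transcript.
	default:
		o.store = append(o.store, "parent:"+e.Payload)
	}
}

func main() {
	o := &observer{parentID: "p1"}
	o.OnEvent(Event{Kind: "message", SessionID: "p1", Payload: "delegate"})
	o.OnEvent(Event{Kind: "message", SessionID: "s1", Payload: "child work"})
	o.OnEvent(Event{Kind: "error", SessionID: "s1", Payload: "stream failed"})
	// If no completion event ever arrives, nothing from s1 is stored.
	fmt.Println("rows before completion:", len(o.store))

	o.OnEvent(Event{Kind: "sub_session_completed", SessionID: "p1", Transcript: []string{"child work"}})
	fmt.Println("rows after completion:", len(o.store))
}
```

In this sketch the sub-agent's work reaches the store only through the completion event, which is why skipping that event on the error path loses the whole transcript.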

Fix

  • Capture the first ErrorEvent into a local variable instead of early-returning.
  • Continue draining the channel so the TUI's streamDepth counter stays balanced and any trailing events (notifications, hook output) reach the parent's event stream.
  • Always emit SubSessionCompletedEvent (and call parent.AddSubSession) before returning, regardless of whether the sub-session errored.
  • Guard parent.ToolsApproved propagation to the success path only — a failed sub-session must not silently escalate the parent's tool-approval gate.

Test plan

TestTransferTaskPersistsSubSessionOnError uses mockProviderWithError to make a sub-agent fail to start its model stream. It asserts:

  • parent.Messages contains the sub-session item (fails on pre-fix code)
  • ErrorEvent was forwarded to the parent sink
  • SubSessionCompletedEvent fired (fails on pre-fix code)

Verify by stashing the fix and re-running the test — it fails without the change.
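
The three assertions can be pictured as a small checker over the parent's items and the collected events. This is a sketch only: `Item`, `Event`, and `checkErrorPath` are hypothetical names, and the real test works against the runtime's own session and event types.

```go
package main

import "fmt"

// Event is a hypothetical event with a string kind; the runtime uses typed events.
type Event struct{ Kind string }

// Item is a hypothetical parent message item; SubSession marks a sub-session entry.
type Item struct{ SubSession bool }

// checkErrorPath returns one description per assertion that does not hold.
func checkErrorPath(items []Item, events []Event) []string {
	hasSub, sawError, sawCompleted := false, false, false
	for _, it := range items {
		if it.SubSession {
			hasSub = true
		}
	}
	for _, e := range events {
		switch e.Kind {
		case "error":
			sawError = true
		case "sub_session_completed":
			sawCompleted = true
		}
	}

	var failures []string
	if !hasSub {
		failures = append(failures, "parent has no sub-session item")
	}
	if !sawError {
		failures = append(failures, "error event was not forwarded")
	}
	if !sawCompleted {
		failures = append(failures, "completion event did not fire")
	}
	return failures
}

func main() {
	// Shape of the pre-fix behavior: the error is forwarded, nothing is persisted.
	before := checkErrorPath(nil, []Event{{Kind: "error"}})
	// Shape of the post-fix behavior: error forwarded and sub-session persisted.
	after := checkErrorPath(
		[]Item{{SubSession: true}},
		[]Event{{Kind: "error"}, {Kind: "sub_session_completed"}},
	)
	fmt.Println("pre-fix failures:", len(before))
	fmt.Println("post-fix failures:", len(after))
}
```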

Related

A companion fix for runCollecting (the background-agent path) is in a separate PR — it has the same structural bug but the SubSessionCompletedEvent emission requires a larger interface change.

@aheritier added labels on Jun 16, 2026: area/agent (For work that has to do with the general agent loop/agentic features of the app), area/sessions (For features/issues/fixes related to session lifecycle (resume, persistence, export)), kind/fix (PR fixes a bug (maps to fix: commit prefix))

@docker-agent docker-agent left a comment

Assessment: 🟢 APPROVE

This fix correctly addresses the sub-session transcript loss on the error path in runForwarding. The implementation looks sound:

  • Error capture pattern: Captures the first ErrorEvent into a local variable while continuing to drain the channel — correct approach for keeping the TUI streamDepth counter balanced.
  • Unconditional persistence: parent.AddSubSession(s) and SubSessionCompleted are now always emitted regardless of error state — matches the persistence contract documented in persistence_observer.go.
  • ToolsApproved guard: Correctly scoped to the success path only, preventing accidental approval escalation from a failed sub-session scope.
  • Test coverage: TestTransferTaskPersistsSubSessionOnError with mockProviderWithError validates the primary fix path well.

No bugs found in the changed code.

When a sub-session emitted an ErrorEvent (model stream failure, loop
detector, hook-driven termination, tool-load error), runForwarding
early-returned at the first error, skipping both `parent.AddSubSession`
and the `SubSessionCompletedEvent` emission. The persistence pipeline
relies on that event to atomically write the sub-session row and all
its messages to the store — sub-session events are deliberately ignored
during normal streaming because they're "absorbed by SubSessionCompleted"
per the comment on PersistenceObserver.OnEvent.

The result: any sub-agent that hit an error mid-stream had its entire
transcript silently dropped. The parent agent's tool result for
transfer_task became "Error calling tool: <opaque message>" and the
work the sub-agent actually performed (often dozens of tool calls and
many assistant turns) was invisible to anyone walking session_items.

The fix:
- Capture the first ErrorEvent into a local instead of early-returning.
- Keep draining the channel so the TUI's stream-depth counter stays
  balanced and any trailing events (notifications, hook output) reach
  the parent's event stream.
- Always run the persistence emission, then return the captured error.

Test: TestTransferTaskPersistsSubSessionOnError uses mockProviderWithError
to make a sub-agent fail to start its model stream. It asserts that the
parent session's Messages list contains the sub-session item AND that the
events channel saw a SubSessionCompletedEvent — both of which fail on
the pre-fix code (verified by stashing the fix and re-running the test).
@jedp-docker jedp-docker force-pushed the fix/persist-subsession-on-error branch from 18e3c3b to 86ce1ed June 16, 2026 21:34
@jedp-docker jedp-docker marked this pull request as ready for review June 16, 2026 22:21
@jedp-docker jedp-docker requested a review from a team as a code owner June 16, 2026 22:21

@docker-agent docker-agent left a comment

Assessment: 🟢 APPROVE

The fix correctly addresses the transcript-loss bug in runForwarding. The approach — capture the first ErrorEvent into a local variable, continue draining the child event channel, then unconditionally call parent.AddSubSession and emit SubSessionCompletedEvent before returning — is sound and matches the documented contract in persistence_observer.go.

Changes reviewed:

  • pkg/runtime/agent_delegation.go: error-capture loop, unconditional persistence, ToolsApproved guard
  • pkg/runtime/runtime_test.go: TestTransferTaskPersistsSubSessionOnError

Key observations:

  • The drain loop correctly forwards every event (including the ErrorEvent) to the parent sink, keeping the TUI's streamDepth counter balanced.
  • Guarding parent.ToolsApproved = s.ToolsApproved to the success path only is the right semantic: a failed sub-session should not silently escalate the parent's tool-approval gate.
  • The test's close(evts) after handleTaskTransfer returns is safe because channelSink.Emit recovers from send-on-closed-channel panics, and handleTaskTransfer is synchronous — all writes complete before it returns.
  • Only capturing the first ErrorEvent is intentional and correct for control flow.
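
The closed-channel observation relies on a standard Go pattern: a send on a closed channel panics, and a deferred `recover` can turn that panic into a normal return. The sketch below shows the pattern in isolation. The `channelSink` here is a hypothetical reimplementation, and its `Emit` signature (returning a bool) is an assumption made so the outcome is visible, not the signature used in the repository.

```go
package main

import "fmt"

// Event is a hypothetical, simplified event.
type Event struct{ Name string }

// channelSink is a hypothetical sink that writes events to a channel.
type channelSink struct{ ch chan Event }

// Emit sends e and reports whether it was delivered. Sending on a closed
// channel panics in Go, so the deferred recover converts that into false.
// Note that this also swallows any other panic raised during the send.
func (s channelSink) Emit(e Event) (ok bool) {
	defer func() {
		if recover() != nil {
			ok = false
		}
	}()
	s.ch <- e
	return true
}

func main() {
	sink := channelSink{ch: make(chan Event, 1)}
	fmt.Println("delivered:", sink.Emit(Event{Name: "first"}))

	close(sink.ch)
	// A late emit after close no longer crashes the program.
	fmt.Println("delivered:", sink.Emit(Event{Name: "late"}))
}
```

Recovering in the sink keeps a late emitter from crashing the test process, at the cost of silently dropping the event, which is acceptable when the consumer has already finished reading.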

No bugs were found in the changed code.

@dgageot dgageot merged commit a1329a8 into docker:main Jun 17, 2026
8 checks passed
Labels

  • area/agent: For work that has to do with the general agent loop/agentic features of the app
  • area/sessions: For features/issues/fixes related to session lifecycle (resume, persistence, export)
  • kind/fix: PR fixes a bug (maps to fix: commit prefix)

4 participants