Imagine you are building an agentic AI tool that fetches the latest changes in your GitHub codebase and returns the results! Sounds simple, right? In the end, it is just us wiring up an agentic workflow: a single LLM, a handful of tools, and a durable runtime so it survives restarts. It should work beautifully on the happy path:
- Turn 1: the model calls list_pull_requests, sees the response, picks a candidate.
- Turn 2: it calls get_commit, reads the diff, decides on a change.
- Turn 3: it calls search_code to find the symbol it wants to touch.
- Turn 4: it calls list_commits to sanity-check the file's recent history.
But then:
- Turn 5: ResourceExhausted: trying to send message larger than max.
Game over...
The workflow is stuck retrying the same activity forever. The LLM never saw the failure. Your logs don't show a bad tool. The model "did everything right." And yet here you are.
I recently ran into exactly this, and it's more involved than it seems! Welcome to the wonderful world of IT architecture (and the reason senior-and-above profiles haven't lost their jobs yet).
Why Agentic Applications Are Different
A "normal" LLM application has roughly one shape: user → prompt → model → response. The payloads are bounded by the user input, the prompt template, and a single completion. You can capacity-plan it on the back of an envelope.
An agent is an entirely different beast. It flips three things at once:
- The conversation grows recursively. Every turn appends not just the assistant's reply but every tool call and every tool result. By turn 10, the model is reading a transcript that includes nine prior tool outputs.
- Tool outputs are unbounded in practice. A list_pull_requests call against GitHub returns ~5–10 KB per PR by default × 30 PRs per page = 150–300 KB per call. get_repository_tree (recursive) can hit multiple MB. None of that size is under your control: it's whatever the upstream API decides to return.
- You probably want durability. Production agents need to survive node restarts, retry failed tool calls, and let humans intervene mid-run. So you put them on a durable workflow runtime (Temporal, Dapr Workflows, Inngest, Restate). And those runtimes have their own opinions about payload sizes.
The combination is the trap. Each part is reasonable on its own. The intersection is where things explode.
The Specific Trap: Durable Workflows Meet Chatty Tools
Durable execution frameworks all work by serializing activity inputs and outputs and ferrying them between the orchestrator and worker. That serialization boundary has a cap. Sometimes documented, sometimes not. Always inconvenient when you cross it.
In Dapr 1.17, for example (the runtime we use at Scrydon), the cap is 2 MiB per activity input or output. It's enforced by the scheduler (dapr.runtime.actor.reminders.scheduler) and, critically, is not configurable via --max-body-size or any other CLI flag. The --max-body-size flag controls a different limit (the gRPC server's body size, which is 16 MiB). Until you read the scheduler source, you assume the flag controls everything. It doesn't.
Above the ceiling, Dapr returns:
ResourceExhausted: trying to send message larger than max
…and the workflow gets stuck in a retry loop, which from a UX standpoint looks like the agent "hanging" with no actionable error.
For a Scrydon customer running an agent in auto-discover mode against GitHub tools, this happened reliably after 4–5 turns. The model didn't suddenly start emitting more text — the tool results crossed the line. And not in the way you'd expect.
The Surprising Root Cause: There Are Two Copies
When we instrumented the workflow state, we found that every turn was round-tripping two representations of every tool result through the activity boundary:
- turnState.messages[].content: the tool result formatted for the LLM. This was already capped at 32 KiB by a serializeResult helper. The LLM didn't need the full body; a truncation marker told it to paginate or re-query if it wanted more.
- turnState.accumulatedToolCalls[].result: the raw, untruncated result, kept around so the UI could later render "what did this tool actually return?" for debugging. Not capped.
sequenceDiagram
autonumber
participant WF as Dapr Workflow<br/>(orchestrator)
participant ACT as Agent Activity<br/>(dispatcher)
participant LLM as LLM
participant Tool as Tool<br/>(e.g. GitHub API)
participant State as turnState<br/>(serialized payload)
participant UI as UI / Telemetry
Note over WF,State: TURN 1 — payload starts empty
WF->>ACT: invoke(turnState = {messages:[], accumulatedToolCalls:[]})
ACT->>LLM: chat(messages)
LLM-->>ACT: tool_call: list_pull_requests
ACT->>Tool: list_pull_requests()
Tool-->>ACT: raw result (200 KB)
rect rgb(40, 60, 40)
Note over ACT: split into TWO copies
ACT->>ACT: serializeResult(raw) → 32 KiB (LLM view)
ACT->>ACT: keep raw 200 KB (telemetry view)
end
ACT->>State: append messages[].content (32 KiB ✅)
ACT->>State: append accumulatedToolCalls[].result (200 KB ❌ uncapped)
ACT-->>WF: return turnState (≈ 232 KB)
Note over WF,State: TURN 2 — payload carries Turn 1's raw copy
WF->>ACT: invoke(turnState) ← 232 KB IN
ACT->>LLM: chat(messages) — only sees 32 KiB tool msg
LLM-->>ACT: tool_call: get_commit
ACT->>Tool: get_commit()
Tool-->>ACT: raw result (200 KB)
rect rgb(40, 60, 40)
ACT->>ACT: serializeResult → 32 KiB
ACT->>ACT: keep raw 200 KB
end
ACT->>State: append messages[].content
ACT->>State: append accumulatedToolCalls[].result
ACT-->>WF: return turnState (≈ 464 KB)
Note over WF,State: TURNS 3–4 — same shape, payload keeps growing
WF->>ACT: invoke(turnState) ← 464 KB → 696 KB → 928 KB → 1.16 MB → 1.39 MB
Note right of State: messages[] LLM-side total: still ~96 KB ✅<br/>accumulatedToolCalls raw total: linearly growing ❌
Note over WF,State: TURN ~5 — cap breached
WF--xACT: ResourceExhausted 💥<br/>"trying to send message larger than max"<br/>(turnState > 2 MiB scheduler cap)
Note over LLM: LLM never sees the failure —<br/>activity is stuck retrying.
Note over UI: UI was the only consumer of<br/>accumulatedToolCalls[].result —<br/>and it doesn't even need it inline<br/>it fetches on click.
While the first is "interesting", it's the second one that is the killer. It exists purely for telemetry and the debugging UX. The LLM never reads it. But because it lives inside turnState, it gets serialized into every subsequent activity's input. Three 200 KB tool calls per turn × 5 turns and you've blown past 2 MiB, even though, from the model's perspective, the conversation has barely started.
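To make the arithmetic concrete, here is a minimal sketch of how the payload grows turn by turn. It assumes three tool calls per turn, a 200 KB raw result per call, and a 32 KiB LLM-side copy per call; it ignores JSON overhead and the prompt and assistant text. The function name and constants are illustrative and not taken from our codebase.

```typescript
// Hypothetical model of turnState growth; names and sizes are illustrative.
const KIB = 1024;
const SCHEDULER_CAP_BYTES = 2 * KIB * KIB; // the 2 MiB cap discussed above

/** Returns the first turn whose accumulated payload exceeds the cap. */
function firstBreachingTurn(bytesAddedPerTurn: number, capBytes: number): number {
  if (bytesAddedPerTurn <= 0) {
    throw new Error("bytesAddedPerTurn must be positive");
  }
  let total = 0;
  let turn = 0;
  while (total <= capBytes) {
    turn += 1;
    total += bytesAddedPerTurn;
  }
  return turn;
}

const callsPerTurn = 3;
const rawBytesPerCall = 200 * 1000; // ~200 KB raw tool result (assumed)
const llmBytesPerCall = 32 * KIB;   // truncated copy stored in messages[]

// Both copies ride along in turnState.
const withRawCopy = firstBreachingTurn(
  callsPerTurn * (rawBytesPerCall + llmBytesPerCall),
  SCHEDULER_CAP_BYTES,
);

// Only the truncated copy rides along in turnState.
const llmCopyOnly = firstBreachingTurn(
  callsPerTurn * llmBytesPerCall,
  SCHEDULER_CAP_BYTES,
);

console.log({ withRawCopy, llmCopyOnly });
```

Under these assumptions the payload crosses the cap on turn 4 when the raw copy is carried, and only on turn 22 when it is not. Real runs vary with actual tool output sizes, but the shape is the same: the raw copy sets the ceiling, not the transcript.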
This is a class of bug that's easy to miss in design and brutal to debug in production:
- It doesn't trip on day one because small tool outputs don't exceed the cap.
- It scales with turn count, not concurrency — so load-testing one big turn won't catch it.
- The failure surfaces in the orchestrator, far from the code that wrote the bad payload.
- The duplicated copy was added for a good reason (full-fidelity debugging) by a different code path than the one that owns transcript size.
A core lesson: in agent systems, every piece of conversation state must answer two questions independently — "does the model need this?" and "where does the storage cost land?" When the answers diverge, you need a different storage mechanism.
What the Community Does (and Why It Mostly Doesn't Fit)
Before designing a fix, we read how other agent frameworks and durable-execution platforms handle this. Five patterns, ranked by prevalence:
| Pattern | Who does it | Mechanism |
| --- | --- | --- |
| Sliding window / tail-N | LangGraph (trim_messages), AutoGen (BufferedChatCompletionContext, TokenLimitedChatCompletionContext), Mastra (TokenLimiterProcessor) | Drop oldest messages before each LLM call. Zero infra. Loses old tool results from the LLM's view. |
| Hard truncation at write time | Smolagents (truncate_content, 20 KB head+tail) | Clip each tool result when stored. Predictable but no recovery path. |
| LLM-driven compaction | OpenAI Agents SDK (OpenAIResponsesCompactionSession) | Periodic LLM summarization. Most info-preserving, but adds an extra LLM call per N items. Premium opt-in, not a default anywhere else. |
| External session store + recency window | Mastra (MessageHistory) | Agent state is a cursor into an external DB. Naturally sidesteps payload limits. |
| Payload Codec / Claim Check | Temporal | A codec intercepts activity I/O, externalizes payloads above a threshold to a blob store, and replaces them with {ref: key}. Applied transparently outside application code. |
A few things stand out once you put them side by side.
No framework keeps raw tool result JSON in agent state long-term. They all do something — window it, truncate it, summarize it, or externalize it. If your design holds the raw blob in transit, you are an outlier and you will eventually pay for it.
LLM-driven compaction is rare and optional. Only OpenAI Agents SDK ships it as a default, and even there it triggers at ≥10 items, not per-call. You probably don't want to pay 30%-ish overhead per tool result just to make your transcript fit in a payload window.
The closest precedent for our exact constraint — durable execution with a hard per-payload cap — is Temporal's Claim Check codec. Temporal's docs explicitly recommend it for PayloadSize issues. That's the same problem we have, on a different orchestrator. So we lifted the pattern.
The other patterns are not wrong — many of them are complementary. But none of them alone solves durable execution + per-payload cap. You need an externalization story.
The Fix: Claim Check via Dapr State Store
What we ended up doing is known as the Claim Check pattern. It is older than agentic AI (Enterprise Integration Patterns, Hohpe & Woolf, 2003): when a message is too large for the bus, drop the payload into a side store and put a pointer on the bus instead. The receiver fetches the payload on demand. Temporal's payload codec automates this for activity I/O.
For our case, the externalization happens at the dispatcher, not at a transport codec. The shape:
turnState (round-trips through 2 MiB activity boundary):
  messages: LLMMessage[]
    // tool message content = truncated to 32 KiB for LLM consumption
  accumulatedToolCalls: {
    id, name, args, durationMs, success, error?,
    resultRef?: string,         // pointer into state store
    truncatedPreview?: string,  // ≤4 KiB head for the UI's at-a-glance view
  }[]                           // refs only, no raw `result`
  iteration, discoveredToolIds

Dapr State Store (scope = executionId, NOT in activity payload):
  tool-result/{executionId}/{nodeId}/{toolCallId} -> { result }
  TTL: 24h (default)
The dispatcher (dispatchOne in apps/agentic/lib/workflows/agent/tool-loop-internals.ts:213) splits the tool result into three projections at the moment it's produced:
- LLM content — 32 KiB-truncated JSON for the conversation transcript.
- Truncated preview — ≤4 KiB head for the UI's at-a-glance row.
- Raw result — written to the state store under a deterministic ref.
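As a rough illustration of that three-way split, here is a minimal sketch. It is not the actual dispatchOne code: the function names, the wording of the truncation marker, and the character-based limits (a simplification of the byte-based 32 KiB and 4 KiB caps) are all assumptions made for the example.

```typescript
// Hypothetical sketch of the three projections; not the real dispatcher.
const LLM_CONTENT_MAX_CHARS = 32 * 1024; // stands in for the 32 KiB cap
const PREVIEW_MAX_CHARS = 4 * 1024;      // stands in for the 4 KiB preview cap

interface ToolResultProjections {
  llmContent: string;       // goes into messages[].content
  truncatedPreview: string; // goes into accumulatedToolCalls[]
  resultRef: string;        // pointer to the raw result in the state store
  raw: unknown;             // written to the state store, never into turnState
}

// Keeps the head of the text and appends a marker that hints at pagination.
// The marker is added on top of maxChars, so the output is slightly longer.
function truncateWithMarker(text: string, maxChars: number): string {
  if (text.length <= maxChars) return text;
  const dropped = text.length - maxChars;
  return `${text.slice(0, maxChars)}\n[truncated ${dropped} chars; paginate or re-query for more]`;
}

function projectToolResult(raw: unknown, resultRef: string): ToolResultProjections {
  // JSON.stringify(undefined) yields undefined, so normalise it to null first.
  const json: string = JSON.stringify(raw === undefined ? null : raw);
  return {
    llmContent: truncateWithMarker(json, LLM_CONTENT_MAX_CHARS),
    truncatedPreview: json.slice(0, PREVIEW_MAX_CHARS),
    resultRef,
    raw,
  };
}
```

Only llmContent, truncatedPreview, and resultRef would travel in the workflow payload; raw is handed to the state-store write and then dropped. A byte-accurate version would measure the encoded length instead of string length.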
The key construction lives in one place to keep the writer and reader honest (apps/agentic/lib/workflows/agent/tool-result-store.ts:71):
export function buildResultRef(
  executionId: string, nodeId: string, toolCallId: string
): string {
  return `tool-result/${executionId}/${nodeId}/${toolCallId}`;
}
And the write itself is fire-and-forget-safe (tool-result-store.ts:156):
export async function putToolResult(ref, result, options) {
  // ttlSeconds comes from options (24h default); its derivation is elided in this excerpt.
  try {
    const client = getDaprClient();
    await client.state.save(WORKFLOW_STATE_STORE_NAME, [{
      key: ref,
      value: result,
      metadata: { ttlInSeconds: String(ttlSeconds) },
    }]);
  } catch (error) {
    logger.warn("Failed to persist tool result to state store", { ref, error });
    // intentionally not re-thrown
  }
}
The swallowed error is deliberate. A state-store hiccup must not fail the agent turn — the LLM transcript already carries the truncated content the model needs to keep going. The UI's full-result lookup is the only feature that degrades on failure. Choose your failure surface; don't leak infrastructure failure into model behavior.
The UI resolves a ref through a server route (apps/agentic/app/routes/api/executions/$executionId/tool-results/$nodeId/$toolCallId.ts), which authorizes the read against the owning workflow (resolved from executionId via the execution log — never trusting a client-supplied workflowId).
The Trade-offs We Made (and Why)
Every architectural choice closes some doors. The ones that matter here:
TTL-based cleanup over explicit deletion. The Dapr workflow body is deterministic and can't do direct I/O, so cleanup would need a separate activity. Dapr state stores have no native prefix-delete, so doing it right would require maintaining a per-execution index entry on every put — extra state churn on the hot path. Also, the UI affordance is the entire reason results live in the store. Deleting on workflow completion would break debugging exactly when it's most useful. The chosen state store (PostgreSQL v2 via Dapr) supports ttlInSeconds, so we set 24h on every write. A deleteToolResult export is still there for deployments that need faster eviction.
Preview alongside ref, not ref alone. It would be cheaper to carry only the ref and force the UI to fetch on every render. But the trace-span UI shows a list of tool calls at-a-glance with a preview line per row. Forcing a fetch per row turns a single list view into N+1 round trips. A 4 KiB preview kept in the workflow payload is cheap (20 tools × 5 turns × 4 KiB ≈ 400 KiB, still well under 2 MiB) and saves the rendering trip.
Lean record over backwards-compatible record. We dropped result, startTime, endTime from the persisted shape and renamed duration → durationMs. The telemetry layer (mergeToolCallTelemetry) re-surfaces the preview as result so legacy trace-span renderers keep working without a synchronized UI change. Schema evolution in a hot-path serialized type is painful; treating the persistence record and the render contract as separate types — with one mapping layer between them — let us evolve the former without churning the latter.
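A minimal sketch of that mapping layer might look like the following. The type names and the exact legacy field set are assumptions made for illustration; only the durationMs rename and the "preview re-surfaced as result" idea come from the description above, and this is not the real mergeToolCallTelemetry.

```typescript
// Hypothetical persisted record: lean, refs only.
interface PersistedToolCall {
  id: string;
  name: string;
  durationMs: number;
  success: boolean;
  resultRef?: string;
  truncatedPreview?: string;
}

// Hypothetical render contract that legacy trace-span renderers expect.
interface LegacyToolCallView {
  id: string;
  name: string;
  duration: number;   // assumed legacy field name
  success: boolean;
  result?: string;    // legacy renderers read this; now fed by the preview
  resultRef?: string; // lets newer UI code fetch the full result on click
}

// The single mapping point between the two shapes.
function toLegacyView(call: PersistedToolCall): LegacyToolCallView {
  return {
    id: call.id,
    name: call.name,
    duration: call.durationMs,
    success: call.success,
    result: call.truncatedPreview,
    resultRef: call.resultRef,
  };
}
```

With one function owning the translation, the persisted shape can keep shrinking without touching any renderer.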
What Else Makes Agentic Production Hard
Payload caps are one specific instance of a broader pattern: agent systems have invariants that simple chat apps don't. Once you start running them at production scale, you trip on:
Cost variance you can't predict. A single user prompt can fan out into 30 tool calls and three model retries. You can budget per-call but not per-session. Cost dashboards have to bucket by execution, not request.
Observability that can't rely on logs alone. A failed agent run might emit 200 spans across a workflow, a model call, a tool dispatch, a state-store write, and a telemetry merge. If your trace tooling doesn't speak workflow-step-aware spans, you'll be reading raw logs forever to figure out which turn went wrong.
Drift in tool contracts. Vendor tool schemas change (GitHub fields, OpenAI function calling shapes). Your prompts subtly start mis-formatting calls. Failures are silent — the model gets an error, retries, and burns budget. You need contract tests at the integration boundary, not just unit tests on your wrapper.
Eval that survives non-determinism. LLM outputs change between model versions. So do tool-call decisions. Eval suites that score golden outputs end up either too strict (every model bump regresses) or too loose (real regressions slip through). The frameworks that survive are the ones that test behavioral properties — "did the agent eventually paginate when truncation was hinted?" — not exact transcripts.
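As a sketch of what a behavioral property can look like in code, the check below asks one question of a recorded run: after the first truncated tool result, did the agent call the same tool again with different arguments? The transcript shape and function name are hypothetical, and comparing arguments via JSON.stringify is a deliberately crude stand-in for a real argument diff.

```typescript
// Hypothetical transcript shape for an eval harness.
interface ToolCallRecord {
  name: string;
  args: Record<string, unknown>;
  resultTruncated: boolean;
}

/**
 * Property: if some tool result was truncated, the agent later called the
 * same tool again with different arguments (for example, a new page).
 * Runs with no truncated result pass trivially.
 */
function reactedToTruncation(calls: ToolCallRecord[]): boolean {
  const firstTruncated = calls.findIndex((call) => call.resultTruncated);
  if (firstTruncated === -1) return true;
  const trigger = calls[firstTruncated];
  const triggerArgs = JSON.stringify(trigger.args);
  return calls
    .slice(firstTruncated + 1)
    .some((later) => later.name === trigger.name && JSON.stringify(later.args) !== triggerArgs);
}
```

A check like this stays green across model bumps that change wording or call order, and goes red only when the behavior itself regresses.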
Authorization at the right layer. A tool that a user can invoke through chat is — operationally — a server function. If your model can call delete_workspace, your auth layer is the model. That's not where you want it. The remedy is to constrain the tool surface per session (deny-by-default), evaluate authorization at dispatch time, and never trust the LLM's argument bag — inject sensitive params (credentials, user context) after the merge so the model can't override them.
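A minimal sketch of that dispatch-time guard, under the assumption that each session carries an allowlist of tool names and a bag of trusted parameters. The types and names here are hypothetical; the point is the order of the merge.

```typescript
// Hypothetical per-session context; field names are illustrative.
interface SessionContext {
  allowedTools: ReadonlySet<string>;      // deny-by-default allowlist
  trustedParams: Record<string, unknown>; // e.g. credentials, user context
}

function buildToolArgs(
  toolName: string,
  modelArgs: Record<string, unknown>,
  session: SessionContext,
): Record<string, unknown> {
  // Anything not explicitly allowed for this session is rejected.
  if (!session.allowedTools.has(toolName)) {
    throw new Error(`Tool not allowed in this session: ${toolName}`);
  }
  // Trusted params are spread last, so they win over anything the model supplied.
  return { ...modelArgs, ...session.trustedParams };
}
```

If the model emits a userId or token of its own, the spread order silently replaces it with the session's value before the tool ever runs.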
Backpressure that respects model latency. If your agent is rate-limited by the model, doubling worker concurrency doesn't help. You need backpressure that knows about token budgets, not just request budgets. Most workflow engines don't ship this; you build it.
Replayability without re-firing side effects. Durable execution promises "exactly-once activity execution," but only if your activities are idempotent and deterministic. LLM calls are neither. The standard remedy — record the response on first run, replay it on resume — has its own failure mode if your activity payload, including that recorded response, exceeds the platform's payload cap. Which is where we came in.
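The record-then-replay remedy can be sketched in a few lines. This version assumes an in-memory Map as the record store and a caller-supplied step key; a durable runtime would persist the record in workflow history instead, which is exactly where the payload cap bites. All names are hypothetical.

```typescript
// Hypothetical record store; a durable runtime would persist this in history.
type RecordedResponses = Map<string, string>;

async function recordOrReplay(
  recorded: RecordedResponses,
  stepKey: string,
  callModel: () => Promise<string>,
): Promise<string> {
  const previous = recorded.get(stepKey);
  if (previous !== undefined) {
    return previous; // replay: the model is not called again
  }
  const response = await callModel();
  recorded.set(stepKey, response); // record on first run
  return response;
}
```

The recorded response is part of what the runtime has to carry around, so a large completion or tool result stored this way counts against the same cap as everything else.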
The Underlying Lesson
The issue we hit was a 2 MiB payload cap colliding with chatty GitHub tools (and chatty tools are exactly what we want!). In our case, the fix was a Claim Check externalization to a Dapr state store.
The deeper lesson is harder to summarize and worth more:
Agentic systems are distributed systems that happen to have an LLM in the loop. The non-LLM parts will hurt you first.
Every assumption you've internalized about payload sizes, retry semantics, idempotency, state hydration, and observability gets re-litigated the moment you wrap a model in a durable runtime and call it an "agent." The model is usually the most predictable part of the stack — it costs what it costs, it returns what it returns, mostly within bounds. The orchestration around it is where the production failures actually live.
If you're building agentic AI in production, the work that matters isn't picking the right model. It's:
- Treating every piece of state as a question about who reads it and where the bytes land.
- Separating the model's view of the world from the operator's view, with explicit projections between them.
- Borrowing patterns from the durable-execution community (Temporal, Inngest, Restate) rather than re-deriving them from scratch with a model in the loop.
- Failing loudly at the infrastructure layer and never letting infrastructure failure look like model failure.
The fun is in the model. The scaling is in everything else.
References
- Scrydon issue #1072
- Temporal: Payload Codec / Claim Check pattern
- Hohpe & Woolf: Enterprise Integration Patterns, "Claim Check"
- AutoGen: autogen_core/model_context/_buffered_chat_completion_context.py
- Mastra: packages/core/src/processors/processors/token-limiter.ts
- LangGraph: libs/prebuilt/langgraph/prebuilt/chat_agent_executor.py
- Smolagents: truncate_content
- OpenAI Agents SDK: OpenAIResponsesCompactionSession