LLM Output Streaming: The Edge Cases That Bite
- Streaming and structured output (JSON) need an incremental parser, not a string accumulator.
- Mid-stream failures cannot be retried transparently. Design for partial-result handling.
- Token-level streaming is for UI; structured server-side processing should usually buffer to the end.
- Backpressure on the consumer side is real; producers can outrun consumers.
A team rolled out streaming on their AI feature and saw the support inbox fill with reports of “the assistant gave me a half-answer”. Investigation showed a mix of three failure modes: model timeouts at the provider, network resets at the load balancer, and JSON parse errors when the team tried to render structured output that had only partially arrived.
The product had been built assuming the LLM call returned a complete response. Streaming changed that assumption silently; the front-end and back-end both still treated outputs as complete strings, just delivered late. The fix took two weeks of refactoring across the streaming pipeline, with three new patterns added to handle the edge cases the original design did not consider.
This piece is about those patterns. Streaming is the default for any UI where a human is watching the output, and the engineering work to do it right is significantly more than “set stream=True”.
Why streaming is worth the complexity
For chat-style UIs, streaming improves perceived latency dramatically. First tokens appear in 200 to 500 ms; the full response takes 5 to 30 seconds. With streaming, the user sees text scrolling almost immediately. Without it, they stare at a loading indicator for the full duration.
Studies on chat UX consistently show users tolerate longer total responses if they begin to render quickly. The subjective experience of “the assistant is thinking” beats “the assistant is silent” by a wide margin.
For non-chat use cases (machine-to-machine pipelines, batch processing, structured-output APIs), streaming usually adds complexity without payback. The downstream consumer waits for the complete output anyway; the streaming buffer just adds engineering surface for failures.
Default rule: stream when a human is reading; buffer when a machine is consuming.
Edge case 1: partial JSON
If your prompt asks for JSON output and you stream the response token by token, the JSON arrives in pieces. Standard `JSON.parse` cannot handle `{"name": "Al` (the model has not finished yet). Trying to parse on every token produces a stream of parse errors until the response is complete.
Three approaches:
Buffer to completion before parsing. Stream the tokens to the UI as plain text (so the UI feels responsive), but only attempt to parse the structured output when the stream ends. The UI shows the JSON as a text blob; the structured handling waits.
```javascript
let buffer = "";
for await (const chunk of stream) {
  buffer += chunk.text;
  ui.appendStreamingText(chunk.text);
}
// Parse only when the stream ends
const parsed = JSON.parse(buffer);
```
Incremental JSON parser. Use a parser that can handle partial input and emit progressively completed objects. Several libraries exist (partial-json-parser on npm; Python has equivalents). The parser reports “I see a name field with value ‘Al’ so far”, so you can render the object as it grows.
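As a sketch of what such a parser does under the hood: a best-effort approach is to close any open strings, objects, and arrays in the prefix and then attempt a normal parse. This is a hand-rolled illustration, not any particular library's API:

```python
import json


def parse_partial_json(text: str):
    """Best-effort parse of a JSON prefix by closing any open string,
    object, or array, then attempting a normal parse.
    Returns None if the prefix is not yet parseable."""
    stack = []          # open '{' / '[' delimiters, innermost last
    in_string = False
    escape = False
    for ch in text:
        if in_string:
            if escape:
                escape = False
            elif ch == "\\":
                escape = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append(ch)
        elif ch in "}]":
            if stack:
                stack.pop()
    candidate = text
    if in_string:
        candidate += '"'
    for opener in reversed(stack):
        candidate += "}" if opener == "{" else "]"
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None
```

For example, `parse_partial_json('{"name": "Al')` yields `{"name": "Al"}`, while a prefix ending right after a key (`'{"a": '`) is not yet parseable and returns None; a real incremental parser adds recovery for those in-between states.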
Constrained streaming with structured-output APIs. Anthropic and OpenAI both offer “tool use” or “structured output” modes where the model is constrained to emit valid JSON-shaped tokens. Combined with an incremental parser, you can render fields as they complete.
For most chat-style UIs, the first option (buffer-then-parse) is enough. For richer interactions where the user sees structured fields update in real time, the second is worth the engineering.
Edge case 2: mid-stream failures
The stream starts cleanly, the model produces 600 tokens, then the connection drops. What do you do?
Four failure shapes and their responses:
| Shape | Response |
|---|---|
| Network disconnect (recoverable) | Retry from the beginning (LLMs are not resumable mid-stream) |
| Provider error (rate limit, server error) | Retry with backoff after a short wait |
| Hard timeout from the model itself | Capture partial output, log, decide per use case |
| User aborts (closed tab, hit cancel) | Drop the stream cleanly, do not consume more tokens |
Critical: you cannot resume a stream from where it stopped. The provider does not support “continue from token 600”. A retry is a fresh request that may produce a different output.
For chat UIs, the right pattern is usually:
- On disconnect, attempt one transparent retry.
- If that fails, surface “the response was interrupted, retry?” to the user.
- Never silently merge a partial old response with a fresh full response.
The third point matters. We have seen teams append a retry’s output to the partial original; the user sees “I think you should consid… oh wait no, the right answer is X” because the retry started from scratch.
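That pattern can be sketched as follows; the `client` and `ui` objects are hypothetical stand-ins for your provider SDK and front-end channel:

```python
async def stream_with_retry(client, prompt, ui, max_retries=1):
    """Stream a response; on a mid-stream failure, retry once from
    scratch, discarding the partial output entirely.
    Never merge a partial old response with a fresh full one."""
    for attempt in range(max_retries + 1):
        buffer = []
        try:
            async for chunk in client.stream(prompt):
                buffer.append(chunk)
                ui.append(chunk)
            return "".join(buffer)
        except ConnectionError:
            # Discard the partial render before retrying: a fresh
            # request may produce a completely different answer.
            ui.clear()
            if attempt == max_retries:
                ui.show_retry_prompt("The response was interrupted.")
                raise
```

The `ui.clear()` call is the load-bearing line: the retry replaces the partial output rather than appending to it.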
Edge case 3: backpressure
The model produces tokens at one rate. The consumer (a UI rendering character-by-character, or a downstream service processing tokens) consumes them at another rate. If the producer is faster than the consumer, tokens queue. Eventually the buffer fills.
For chat UIs this rarely matters; the user reads slower than the model produces. For pipelines that do work per token (token-level analytics, real-time translation), backpressure is a real concern.
Two patterns:
Drop intermediate states. If the consumer cannot keep up with token-level updates, drop intermediate tokens and only forward the latest accumulated state every N milliseconds. UI gets smoother updates at the cost of some animation granularity.
Bounded buffer with explicit slow-mode. When the buffer fills, the consumer signals upstream to slow down (or the producer pauses). This is full reactive-streams discipline; only worth it for high-throughput pipelines.
For most chat UIs, neither matters. For server-side streaming pipelines, design for backpressure on day one or accept that high-traffic windows will silently drop data.
Edge case 4: token leakage on cancel
User starts a long generation, then closes the tab. If your back-end keeps streaming until completion, you are paying for tokens nobody is reading.
The fix:
```python
# Server-side: detect client disconnect, abort the model call
async def stream_endpoint(request, prompt, llm_client):
    async for token in llm_client.stream(prompt):
        if await request.is_disconnected():
            llm_client.cancel()
            return
        yield token
```
The provider’s API needs to support cancellation; the Anthropic and OpenAI SDKs both do (via `AbortController`/abort signals in JavaScript, and by closing the response stream in Python). Without cancellation, the user closes the tab and you bill for another 3 to 8 seconds of generation that nobody sees.
At low volume this is invisible. At scale, abandoned streams add up. We have seen teams discover 10 to 20 percent of their LLM bill came from abandoned streams.
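A back-of-envelope check makes the scale concrete. Every number here is a placeholder assumption, not a measured figure:

```python
def abandoned_stream_cost(requests_per_day, abandon_rate,
                          avg_wasted_tokens, price_per_mtok):
    """Daily cost of tokens generated after the user has left.
    All parameters are illustrative assumptions."""
    wasted_tokens = requests_per_day * abandon_rate * avg_wasted_tokens
    return wasted_tokens / 1_000_000 * price_per_mtok
```

At 100k requests a day, a 10 percent abandon rate, and 500 wasted tokens per abandoned stream, a $15-per-million-token model burns about $75 a day on output nobody reads.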
Edge case 5: provider event ordering
Streaming protocols differ across providers. Anthropic uses `message_start`, `content_block_delta`, and `message_stop` events. OpenAI uses chunk objects with `delta` fields. Each has subtle ordering rules: `content_block_start` before `content_block_delta`, `ping` events that mean nothing, error events that may arrive after partial deltas.
Provider SDKs usually abstract this. If you implement raw SSE consumption yourself, read the protocol docs carefully and handle every event type the provider sends, even the ones that look unimportant. We have seen production bugs from teams ignoring “ping” events that turned out to carry useful state in some provider versions.
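If you do consume the raw events, the safe shape is an exhaustive dispatch that explicitly ignores what it does not need. The event names below follow Anthropic's documented protocol; the `state` dict is our own illustration:

```python
def handle_event(event, state):
    """Minimal dispatch over Anthropic-style stream events,
    accumulating text per content block into a plain dict."""
    t = event["type"]
    if t == "message_start":
        state["blocks"] = {}
    elif t == "content_block_start":
        state["blocks"][event["index"]] = ""
    elif t == "content_block_delta":
        state["blocks"][event["index"]] += event["delta"].get("text", "")
    elif t == "content_block_stop":
        pass  # this block is complete; safe to parse it now
    elif t in ("message_delta", "message_stop", "ping"):
        pass  # metadata and keep-alives; must not crash the loop
    elif t == "error":
        raise RuntimeError(event.get("error"))
    return state
```

The point is the explicit branches: every event type the provider can send has a home, so a new or unexpected event is a visible decision rather than a silent crash.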
What we install on engagements
For an LLM-streaming feature in production:
- Stream architecture review: where does streaming start, where does it end, what consumes it.
- Partial output handling: incremental parser if needed, buffer-then-parse if that suffices.
- Cancellation propagation: end-to-end abort signals from client to provider.
- Mid-stream failure UX: explicit handling, no silent retries that mix partial and fresh outputs.
- Observability: per-stream completion rate, abandoned-stream rate, error categorisation.
Total: typically one to two engineer-weeks for a first implementation, less for retrofits.
The teams that get streaming right have a chat UX that feels fast and reliable. The teams that copy the streaming snippet from the SDK docs and ship it directly have a feature that works in demos and breaks in subtle ways under real traffic. The work to make it robust is real and pays back in not-having-to-debug-it-during-launch-week.
Questions teams ask
Is streaming worth the complexity?
For chat-style UIs, yes; perceived latency drops sharply when first tokens appear quickly. For machine-to-machine pipelines, usually not; buffer to completion and parse once. Pick based on whether a human is reading the output as it arrives.
Can I stream structured JSON output safely?
Yes, with an incremental JSON parser (a state machine that can handle partial objects). Standard `JSON.parse` cannot; it requires a complete string. Several libraries provide partial JSON parsing for streaming use cases.
What if the model stops mid-stream?
It happens. Network disconnects, provider timeouts, model cap hits. Capture what was received, log the truncation, and either retry from scratch (most cases) or display the partial output with a clear 'incomplete' indicator. Silent retry is rarely correct.