Latency Is a Feature Decision, Not an Infrastructure Problem

Most teams treat latency as an optimization problem. Ship the feature, measure the slowness, throw infrastructure at it. Maybe add a spinner. Call it done.

That framing is wrong, and it's especially wrong for AI products. Latency shapes the entire user experience model. It determines what architecture is even possible. And by the time you're in production watching time-to-first-byte metrics, the decisions that actually matter have already been made.

The Two Failure Modes

There are two ways teams get this wrong, and they're roughly opposite mistakes.

The first is ignoring latency entirely during design. You build a clean server-side API, wire it up to an LLM, and discover in beta that your users are waiting 8-12 seconds for a response. Now you're scrambling. You add streaming after the fact. You prototype a skeleton screen. You argue about whether a spinner is good enough. Everything is expensive because the architecture wasn't designed for this.

The second failure mode is over-engineering upfront. Teams I've been on have spent months on low-latency infrastructure before validating that users even wanted the feature. You end up with a beautifully optimized system serving a product hypothesis that turned out to be wrong.

The right question isn't "how fast can we make this?" It's "how much latency can this user experience tolerate, and how does the answer shape what we build?"

Jakob Nielsen's decades-old research still holds: under 0.1 seconds feels instantaneous, under 1 second keeps the user's flow intact, and past 10 seconds you've lost them. For AI products, almost nothing lives in the 0.1s bucket. Most live somewhere between 1 and 30 seconds. That gap is where all the interesting design decisions happen.
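To make that concrete, here's a rough sketch of treating those buckets as a design-time input. The thresholds are Nielsen's; the treatment names are placeholders, not a prescribed taxonomy.

```typescript
// Map an expected latency budget to an interaction treatment at design time.
// Treatment names are illustrative, not a standard.
type Treatment =
  | "render-inline"           // fast enough to just show the result
  | "show-progress"           // an indicator is enough to preserve flow
  | "stream-partial-output"   // keep attention with visible work
  | "precompute-or-redesign"; // the interaction itself has to change

function treatmentFor(expectedLatencyMs: number): Treatment {
  if (expectedLatencyMs < 100) return "render-inline";
  if (expectedLatencyMs < 1_000) return "show-progress";
  if (expectedLatencyMs < 10_000) return "stream-partial-output";
  return "precompute-or-redesign";
}

// A 12-second inference call lands in the last bucket: the feature,
// not the spinner, is what has to change.
console.log(treatmentFor(12_000)); // "precompute-or-redesign"
```

The exact thresholds matter less than the direction of the dependency: the expected latency picks the interaction model, not the other way around.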

Perceived Latency Is a Product Decision

Here is something that took me a while to internalize: the user's experience of waiting has almost nothing to do with the actual latency number.

Loading bars are as old as computers. They've always served a psychological function more than an informational one. A system that shows a loading bar feels smarter and more deliberate than one that freezes and then returns an answer. Users forgive latency they can see happening.

For AI specifically, this gets more interesting. Chain-of-thought reasoning gives you something to show the user. When the model is working through a multi-step problem, streaming those intermediate steps doesn't just fill time. It earns trust. The user watches the system consider the problem, reject an approach, try another angle. By the time the final answer arrives, they believe it because they watched it get derived.

I've seen teams skip streaming at launch because it wasn't in the original tech stack. They were using HTTP/REST, not gRPC, so streaming wasn't free. It was a retrofit project. That's a real cost to absorb mid-product cycle. Starting with "what does the user experience while waiting" as a first-order requirement changes the stack choices you make on day one.
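Designed in from day one, streaming doesn't have to be exotic. Here's a minimal sketch using server-sent events over plain HTTP in Node; generateWithSteps is a stand-in for whatever streaming interface your model client actually exposes, and every name below is illustrative.

```typescript
import { createServer } from "node:http";

// Stand-in for a model call that yields intermediate reasoning steps
// and then answer tokens, in the order they're produced.
async function* generateWithSteps(prompt: string) {
  yield { type: "step", text: `Parsing the request: ${prompt}` };
  yield { type: "step", text: "Considering approach A... rejected." };
  yield { type: "step", text: "Trying approach B." };
  for (const token of ["Here", " is", " the", " answer."]) {
    await new Promise((r) => setTimeout(r, 200)); // simulate generation time
    yield { type: "token", text: token };
  }
}

// Server-sent events over a long-lived chunked response: no gRPC required.
// The user watches the system work instead of staring at a frozen screen.
createServer(async (_req, res) => {
  res.writeHead(200, {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    Connection: "keep-alive",
  });
  for await (const event of generateWithSteps("example prompt")) {
    res.write(`data: ${JSON.stringify(event)}\n\n`);
  }
  res.end();
}).listen(3000);
```

The transport is deliberately boring. The decision that matters is that partial output exists in the API contract from the start, so the client can render steps and tokens as they arrive instead of waiting on a single buffered payload.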

In the five months since ChatGPT launched, users have already started developing expectations around streamed token output. If your AI product buffers the full response and then displays it all at once, users don't just find it slower. It feels wrong -- like the system is hiding something. For AI features where several seconds of inference time is a best case, the only lever you have is perception management.

The Pre-Compute Pattern and Its Hidden Cost

Some AI features can't be made fast in real time. Model inference (the time it takes to generate a response) is too slow, the amount of context you need to feed the model is too large, or the feature requires coordination across multiple systems. When that's true, the latency conversation shifts entirely.

One pattern I've used: pre-compute the AI output asynchronously and persist it in a draft state. Surface it to the user by flipping a flag. From the user's perspective, the experience is instant. There's no wait. The draft is just there when they need it.

This can work well for things like suggested replies, document summaries, or weekly digest content, where a bit of staleness is acceptable. The latency problem goes away at the UX layer.
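A minimal sketch of the shape, with an in-memory map standing in for a real datastore and runInference standing in for the model call; all names here are hypothetical.

```typescript
// The pre-compute pattern: inference happens ahead of the user, off the hot path.
interface Draft {
  userId: string;
  content: string;
  status: "draft" | "published";
  computedAt: Date;
}

const drafts = new Map<string, Draft>(); // stand-in for a real datastore

// Background job: runs on a schedule or on upstream events, never on a user request.
async function precomputeDigest(
  userId: string,
  runInference: (userId: string) => Promise<string>
): Promise<void> {
  const content = await runInference(userId); // the slow part happens here
  drafts.set(userId, { userId, content, status: "draft", computedAt: new Date() });
}

// Surfacing the result is just a flag flip; from the user's perspective it's instant.
function publishDraft(userId: string): void {
  const draft = drafts.get(userId);
  if (draft) draft.status = "published";
}

// Read path: no inference, no waiting. Either the draft is there or it isn't.
function getDigest(userId: string): string | undefined {
  const draft = drafts.get(userId);
  if (!draft || draft.status !== "published") return undefined;
  return draft.content;
}
```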

But the cost is real and easy to underestimate. You're running inference on content that may never be viewed. Compute runs whether the user opens the app or not. You're also pre-computing against a state of the world that might be stale by the time the user sees it. And your infrastructure profile shifts from latency-sensitive and bursty (load spikes when users are active) to throughput-sensitive and steady (a constant burn around the clock).

The question every team needs to answer explicitly: how instantaneous does this experience have to be to keep user attention, and does pre-compute get us there? If yes, model the actual compute cost of pre-computing at scale before committing to the pattern.
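The modeling can be back-of-the-envelope. Every number in this sketch is a placeholder to swap for your own traffic, token counts, and pricing:

```typescript
// Rough cost model for pre-computing at scale. All inputs are placeholders.
const users = 1_000_000;           // users we pre-compute for
const itemsPerUserPerWeek = 1;     // e.g. one weekly digest each
const tokensPerItem = 2_000;       // prompt + completion, combined
const costPerMillionTokens = 5;    // USD, placeholder pricing
const viewRate = 0.3;              // fraction of drafts users ever open

const weeklyTokens = users * itemsPerUserPerWeek * tokensPerItem;
const weeklyCost = (weeklyTokens / 1_000_000) * costPerMillionTokens;
const costPerViewedItem = weeklyCost / (users * itemsPerUserPerWeek * viewRate);

console.log(`Weekly inference spend: $${weeklyCost.toFixed(0)}`);               // $10000
console.log(`Cost per item actually viewed: $${costPerViewedItem.toFixed(4)}`); // $0.0333
// At a 30% view rate, the effective cost per viewed item is roughly 3x the raw
// per-item cost. That multiplier is the hidden price of the instant experience.
```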

What Happens When You Move Work to the Client

A different tradeoff worth naming: moving work to the client can take the server round trip out of the hot path entirely, but you're trading one problem for several others.

I've seen teams build hybrid search infrastructure where the ranking and scoring happened client-side, backed by large local caches. The result was an auto-suggest experience that felt genuinely fast -- sub-100ms -- without the server being in the hot path.

The tradeoff is that you now own the cache invalidation, the ranking logic, the scoring weights, and all of the platform-specific edge cases that come with running that code on every client device. Server-side, you'd update a model and redeploy. Client-side, you're shipping an SDK update and waiting for adoption. For users who haven't updated, the relevance of your auto-suggest is stuck at whatever ranking weights you shipped three months ago.
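A sketch of what that ownership looks like in practice; the field names and weights are hypothetical, but the key property is real: the ranking weights travel with the shipped build, not with a server deploy.

```typescript
// Client-side auto-suggest scoring against a local cache. No server round trip
// in the hot path, so responses can come back in well under 100ms.
interface Suggestion {
  text: string;
  recencyScore: number;    // derived from the local cache
  popularityScore: number; // synced periodically from the server
}

// Frozen at build time. Whatever ships here is what un-updated clients
// will still be ranking with months from now.
const RANKING_WEIGHTS = { recency: 0.6, popularity: 0.4 };

function rankSuggestions(candidates: Suggestion[], limit = 5): Suggestion[] {
  return candidates
    .map((c) => ({
      ...c,
      score:
        RANKING_WEIGHTS.recency * c.recencyScore +
        RANKING_WEIGHTS.popularity * c.popularityScore,
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
}
```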

This is an underappreciated tradeoff in the latency conversation. Moving work to the client is a legitimate architecture choice. But it shifts complexity from infrastructure to product and release discipline. Make sure you're signing up for what that actually entails.

Make the Decision at Design Time

The architecture decisions that determine your latency profile are made early. Streaming vs. buffered. Client-side vs. server-side. Real-time vs. pre-computed. These aren't infrastructure choices you can defer. They shape your API contracts, your data flows, your release process, and your user experience model.

The failure mode in early AI product work is treating every hard problem as something to figure out later. Latency is not a later problem. If a 12-second wait will break the experience, you need to know that before you've built the feature around a 12-second inference call.

The teams I've seen get this right ask a few specific questions during design, not in production retrospectives:

  • What is the maximum acceptable latency for this interaction, and what happens to the user experience if we exceed it?
  • Can we stream partial output, and if not, what do we show while the user waits?
  • Is the answer to this question the same on mobile as on desktop?
  • Are there features that need to be instant enough that pre-compute is the only viable path, and what does that cost at scale?

These aren't engineering questions. They're product questions with engineering consequences. The right people to answer them are in the same room, and the right time to answer them is before the first line of code.

The teams that don't ask these questions early end up asking a different one in production: why does this feel slow? By then, the answers are expensive. The architecture is set, the contracts are written, and the only options left are spinners and apologies.