Reliability Is the New Usability
Reliability, not raw capability, is the UX challenge that will determine whether AI agents earn trust at work.

Ask any AI tool to do something simple, and it performs well. Ask it to do something complex - multiple steps, real-world messiness, ambiguous inputs - and the story gets more complicated than the benchmarks suggest.
There's a gap between what AI can do and what AI does reliably on real tasks, with real users, in real conditions. Arvind Narayanan and Sayash Kapoor have been documenting this carefully at AI as Normal Technology. Their CRUX framework draws the distinction cleanly: capability is what an agent does under ideal conditions; reliability is whether it does that consistently, across the full range of real tasks, in the open-world complexity of actual use.
Benchmarks measure capability; users experience reliability. The two are not the same, and the gap between them is not small. As tasks get longer, messier, and more interdependent - the kind of tasks that would actually save a knowledge worker significant time - AI agent performance drops off in ways that benchmark scores don't predict. An agent that performs at 90% on a benchmark might perform at 50% on a real production task. That difference is where the trust problem lives.
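One way to build intuition for why longer tasks fall off so quickly is a back-of-the-envelope calculation. The sketch below assumes that every step succeeds with the same probability and that steps fail independently of one another. That is a simplification introduced here for illustration, not a claim about how any particular agent or benchmark behaves, and the function name is hypothetical.

```python
def task_success_rate(step_success: float, steps: int) -> float:
    """Chance that every step succeeds, assuming steps fail independently."""
    if not 0.0 <= step_success <= 1.0:
        raise ValueError("step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    # Independent steps multiply, so reliability decays geometrically with length.
    return step_success ** steps


if __name__ == "__main__":
    for steps in (1, 5, 10, 20):
        rate = task_success_rate(0.95, steps)
        print(f"{steps:>2} steps at 95% per step: {rate:.0%} end to end")
```

Under these assumptions, an agent that gets each step right 95% of the time finishes a ten-step task cleanly only about 60% of the time. Real failures are rarely independent, so treat this as intuition rather than prediction.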
This isn't a reason to be pessimistic about AI. It's a precise description of where the design work is.
UX has solved exactly this class of problem before, at smaller scale. Every error state you've ever designed acknowledges that systems fail. Every recovery flow makes failure survivable. Progress indicators, graceful degradation, confirmation dialogs, undo functions - all design responses to the fundamental reality that systems don't always do what users expect.
AI agents operating in agentic workflows - doing multi-step tasks, acting with autonomy, making decisions on behalf of users - introduce a new version of that same challenge. The question isn't just "what does the interface look like when the AI is working?" It's also "what does the user see when the AI fails midway through a ten-step process?", "how does the user recover when the AI made a wrong assumption three steps back?", and "how does the user know whether to trust this particular output?" None of those are engineering questions.
The trust calibration problem is the most interesting one. Users need to understand, without becoming technical experts, how much to rely on any given AI output. Too much trust creates risk - the user accepts a wrong output without catching it, and the error propagates. Too little trust creates friction - the user double-checks everything, and the productivity gain evaporates. The right calibration sits somewhere between those failure modes, and it's different for different tasks, users, and contexts.
Designing for that calibration is genuinely difficult work. It requires understanding the failure modes of the system you're building with. It requires feedback mechanisms that give users useful signals about AI confidence - not just when the system is working, but when it's uncertain. It requires recovery flows that are forgiving enough to use under stress, and enough visibility that users can maintain appropriate oversight without needing to understand the underlying model.
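As a rough illustration of what a calibration rule could look like in practice, the sketch below maps an output to the amount of attention it asks from the user. Everything in it is hypothetical: it assumes the system can attach a usable 0-1 confidence score to an output and knows whether the resulting action can be undone, and the thresholds are placeholders rather than recommended values.

```python
from dataclasses import dataclass
from enum import Enum


class Review(Enum):
    AUTO_ACCEPT = "apply the result and offer undo"
    SPOT_CHECK = "show the result with uncertain parts highlighted"
    CONFIRM = "ask the user to confirm before acting"


@dataclass(frozen=True)
class AgentOutput:
    summary: str
    confidence: float  # hypothetical 0-1 score; assumes the system can produce one
    reversible: bool   # whether the action can be undone after the fact


def review_level(output: AgentOutput, low: float = 0.6, high: float = 0.9) -> Review:
    """Pick how much user attention an output should request.

    The thresholds are illustrative placeholders, not tuned values.
    """
    # Irreversible or low-confidence actions always ask for confirmation.
    if not output.reversible or output.confidence < low:
        return Review.CONFIRM
    # Middling confidence: surface the output, but flag what to check.
    if output.confidence < high:
        return Review.SPOT_CHECK
    # High confidence and reversible: proceed, keeping recovery one click away.
    return Review.AUTO_ACCEPT
```

The design choice worth noticing is that reversibility outranks confidence: a confident but irreversible action still asks for confirmation, while a reversible one can proceed with an undo path. Where the thresholds belong depends on the task, the user, and the context.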
This is not a new design problem - it's a design problem at a new scale, in a new context, with higher stakes than most previous interfaces carried. Agentic AI that makes wrong calls autonomously can fail in consequential ways that a UI bug typically can't. The design discipline required to make those systems trustworthy is correspondingly more serious.
The capability-reliability gap is not fixed. It responds to design. Well-designed agentic interfaces - with clear feedback, visible failure states, and good recovery flows - perform better in practice than poorly designed ones, not because the model is different, but because the interaction design gives users the context they need to supervise effectively and catch errors before they compound.
That's the opportunity: design the system so that when AI fails - and it will - users can see it, understand it, and address it. The system becomes trustworthy not because it's perfect, but because its imperfections are legible and recoverable.
People who use well-designed agentic systems get more done with less stress. They delegate the right tasks, catch the right errors, and build appropriate trust over time - instead of bouncing between over-reliance and abandonment. They experience AI as capable and dependable rather than capable and unpredictable.
That experience is designed; it doesn't happen by default. It happens when the right question is asked: not "can the AI do this?" but "what does the user need to know when it doesn't?" The gap between AI capability and AI reliability is the design space of the next several years, and it's where good design thinking has the most to contribute.