In the first part of this series, I argued that AI needs humans for two reasons: 1) we are the adopters who hook it in and provide the semantic context, and 2) paid workers make the economy function, and the money in their pockets is what pays for AI.
We are still very early in both AI adoption and maturity. I believe the future will be highly visual and experiential, not just linguistic.
This morning I woke up to yet another post claiming that user experience no longer matters for software. The argument was familiar: in the age of agents, the only things that matter are context and giving the model access to data. It's tempting to jump straight to "MCP and an LLM is all you need." The agent does the work and interfaces fade away.
I don't buy it.
To believe that, I would have to forget something basic about humans: we are overwhelmingly visual creatures. The space dedicated in the human brain to visual processing is roughly 6.5x larger than the space dedicated to language processing. The brain can extract the gist of an image in as little as 13 milliseconds, while visual word recognition and semantic language processing typically unfold over hundreds of milliseconds. In Lionel G. Standing's classic 1973 memory study, after studying 200 items, participants freely recalled 51.6 pictures, versus 24.5 visually presented words and 37.5 auditory words — about a 2.1x advantage over written words and 1.4x over spoken words.
“Our brains dedicate 6.5x more space to visual processing than language processing”
So why would AI interfaces stop at text?
You might be thinking that chat agents today have already begun to incorporate imagery. Agentic commerce surfaces product cards, and ChatGPT and Nano Banana show that prompt-driven image generation can be quite powerful. I agree, but these visuals are still secondary to text-based LLM responses.
One of the richest forms of human-computer interaction is the computer game. The earliest games were text-based. They were powerful because language is flexible and imagination fills in the gaps. But no one expected text adventures to remain the dominant interface once 3D graphics engines, GPUs, and real-time simulation became ubiquitous.
The same thing will happen with AI.
Language is powerful because it is fluid. It lets us express ambiguity, intent, judgment, and context in ways that rigid interfaces like SQL, dashboards, forms, and JSON never could. But visuals can be fluid too. A visual AI interface does not have to mean a static dashboard. It can mean a living model of the business. A simulated store. A product catalog that reorganizes around customer intent. A merchandising workspace that shows demand signals, margin pressure, visual similarity, inventory risk, and local context in one experiential surface.
AI is strongest when it fuses signals across sources and builds context. That context should not be trapped in text. It should be visible. Imagine a world where explanations aren't buried in a sea of characters but are synthesized into rich visual explainers, all generated on demand.
“The next generation of AI software will not just respond with text, it will let people see systems or even experience them.”
For retail, that matters enormously. Much of retail operations is rooted in spreadsheets, but it shouldn't be. Retail is physical, spatial, visual, seasonal, emotional, and operational. Products have color, texture, shape, placement, adjacency, geographic ties, cultural ties, seasonal ties. Store layouts are shaped by foot traffic, internal flow, feature walls, and displays. Stores are nestled in neighborhoods and affected by parking, traffic, and construction projects. Retail stores paired with experiences (such as restaurants) see increased demand.
“A text box can describe but not express the full richness of retail”
So I disagree with the premise of replacing user experiences with text-based agents. The opportunity is to make AI-driven user experiences as expressive as the real world, where the full range of our senses is engaged.
The best AI interfaces will combine language, visuals, simulation, and ambient signals. You will still ask questions in natural language, because language is the most flexible input mechanism we have. But the output will increasingly look like an environment you can explore and experience, not a paragraph you parse.
In other words, show me, don't tell me.
