
The missing seat at the frontier team table 

How user research and data science can together shape LLM evaluations

Estimated reading time: 8 minutes.

[Illustration: a conference table surrounded by colorful office chairs; one seat is an empty white silhouette.]

Frontier teams and AI products

Frontier teams are designed to collapse distance: between idea and production, between insight and implementation, between model and product. Designers, engineers, and data scientists work side by side with direct access to live models, shipping AI-native features at startup speed inside large organizations.

This makes sense in theory. But as AI products move from demos to real workflows, things can easily break down. Demos, for example, tend to be rooted in scenarios far tidier than real work, missing nuance and failing to reflect customer realities. With AI products, we judge the quality of an experience through model evaluations (aka “evals”), which are structured ways of testing how well an AI system performs in real or representative scenarios. You can think of them as reality checks for AI.

Experiences can pass an eval with flying colors through the lens of data science (did it hallucinate?), engineering (did it run properly?), or governance (is it compliant?), yet completely fail to meet user needs unless the response is also evaluated for whether it is useful, trustworthy, and aligned with user expectations.
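To make that concrete, here is a minimal sketch of what a multi-lens eval record could look like. It is illustrative only: the lens names, rubric scores, and pass threshold are assumptions for this article, not a description of any team’s actual harness.

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    """One scored response. Lens names and the 0-1 rubric are illustrative."""
    scenario_id: str
    grounded: bool       # data science lens: no hallucinated facts
    executed: bool       # engineering lens: ran without errors
    compliant: bool      # governance lens: policy checks passed
    useful: float        # UX lens: human-rated usefulness, 0-1
    trustworthy: float   # UX lens: human-rated trustworthiness, 0-1

    def passes(self, ux_threshold: float = 0.7) -> bool:
        # A response passes only if every lens clears its bar;
        # technical correctness alone is not enough.
        technical_ok = self.grounded and self.executed and self.compliant
        ux_ok = min(self.useful, self.trustworthy) >= ux_threshold
        return technical_ok and ux_ok

# Technically clean, but not useful enough to pass.
result = EvalResult("order-risk-01", True, True, True, useful=0.4, trustworthy=0.9)
print(result.passes())  # False
```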

You need the totality of these lenses to create great experiences, which is why UXR is expanding beyond helping AI product teams “build the right thing” by identifying where AI can meaningfully support people, to also helping “build the thing right” by shaping and evaluating AI model outputs. In the AI era, the model output forms a significant part of the user experience, and few disciplines understand better than user research what users want from that output.

Two lenses, one problem space

In the rush to use AI to build AI, user research can sometimes be treated as adjacent rather than foundational. Research isn’t intentionally excluded; it’s simply not assumed to be an essential, load-bearing element of the model evaluation process. Frontier team charters commonly list engineers, data scientists, and designers, while research is assumed to be covered by design or existing analytics.

Data science and engineering share a powerful epistemology. They operate through metrics, scale, statistical confidence, and reproducibility. This is a strength, and it’s what enables frontier teams to move fast with rigor.

User research brings a complementary lens. It’s optimized not just for signal, but for depth of signal: understanding intent, context, edge cases, and the conditions under which systems fail. Where data science asks, “Does this work at scale?”, research asks, “How does this impact individual people?” and “What did we fail to imagine?”

The issue isn’t that one lens is better than the other. It’s that only one is routinely treated as essential when defining evaluation frameworks. When evaluation is driven primarily by available data, teams risk optimizing what is measurable rather than what actually matters.

It’s worth noting that AI-assisted research tools — synthetic users, automated sentiment analysis, LLM-generated personas — are increasingly valuable here, especially for scale. They help explore large pattern spaces quickly and surface signals that would otherwise be missed. But they share a fundamental limitation with quantitative data: they operate on existing patterns. They don’t notice hesitation. They don’t question workarounds users no longer recognize as broken. They don’t tell you when the abstraction itself is wrong. So, while these tools extend research capability, they don’t replace the need for human judgment, particularly when defining what “good” should look like in outputs.

When qualitative insight changes the eval itself

Imagine an AI agent that highlights potential issues — and the impact of those issues — on sales orders. Early evaluation scenarios were built around a simplifying assumption: each issue affects a single customer order. This created a clean, tractable abstraction from a data perspective.


Qualitative research told a different story. In real enterprise workflows, sales orders are rarely handled in isolation. Users routinely work across multiple linked orders; approvals cascade, and actions ripple through interconnected systems. The “single-order” scenario wasn’t just incomplete – it represented a situation that rarely occurs in practice.

As a result, the original evaluation setup shielded the agent from the kinds of trade‑offs and downstream impacts that users must routinely navigate. By oversimplifying the workflow context, the eval would have failed to assess the agent’s ability to support real‑world decision‑making. 

This insight from the user research team doesn’t invalidate the work of data science – it enriches it. Together, researchers and data scientists can expand evaluation scenarios to reflect multi-order realities, changing what the model is tested against and, ultimately, how well it performs in production.
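As a hypothetical sketch of what that expansion might look like in an eval set – the field names, order IDs, and scenario details below are invented for illustration:

```python
# Hypothetical scenario records; all field names and values are invented.
# The original eval set assumed every issue touches exactly one order.
single_order_scenarios = [
    {"orders": ["SO-1001"], "issue": "credit_hold",
     "expected": "agent flags the blocked order"},
]

# Qualitative findings motivate linked-order scenarios: cascading
# approvals and impacts that ripple across interconnected orders.
multi_order_scenarios = [
    {"orders": ["SO-1001", "SO-1002", "SO-1003"], "issue": "credit_hold",
     "linked_by": "shared_customer",
     "expected": "agent flags downstream impact on all linked orders"},
]

# The expanded set changes what the model is tested against.
eval_set = single_order_scenarios + multi_order_scenarios
```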

This kind of contribution rarely shows up as a metric. Yet it materially improves model quality.


When the metrics mislead

We’ve seen similar dynamics elsewhere. A conversational AI team optimized for session length – a seemingly reasonable proxy for engagement. The numbers looked strong. Users were spending more time on the product.

But user interviews and observations revealed something else: longer sessions correlated not with satisfaction, but with confusion. Users were rephrasing questions, circling the same issues, unsure how to exit. The metric signaling success was actually masking user frustration – and thus failure.

By pairing behavioral data with qualitative insight, the team reframed success around task completion. Within two sprints, repeat queries dropped by nearly a third. 
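As a rough sketch of how a reframed signal like repeat queries might be computed – the session-log format and the rephrase heuristic here are assumptions for illustration, not the team’s actual method:

```python
from difflib import SequenceMatcher

def repeat_query_rate(session_queries: list[str], threshold: float = 0.8) -> float:
    """Fraction of queries that closely rephrase an earlier one in the session.

    A crude string-similarity heuristic stands in for whatever rephrase
    detection a real team would use; a high rate suggests confusion.
    """
    if len(session_queries) < 2:
        return 0.0
    repeats = sum(
        1
        for i, query in enumerate(session_queries[1:], start=1)
        if any(
            SequenceMatcher(None, query.lower(), prior.lower()).ratio() >= threshold
            for prior in session_queries[:i]
        )
    )
    return repeats / (len(session_queries) - 1)

# A session full of rephrasings scores high -- exactly the kind of
# "engagement" that raw session length would have counted as success.
print(repeat_query_rate([
    "how do I export the report?",
    "how can I export this report?",
    "export report to PDF",
]))
```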

This isn’t an edge case. In AI systems, behavioral data often plays a dual role: it is both the evaluation signal and the training signal. Without an external qualitative check, systems risk optimizing for their own feedback loops, reinforcing patterns without understanding experience.

In this context, user research acts as an epistemic safeguard as well as a mirror to our own assumptions. 

Speed without sequence

A common but misplaced objection is that research slows frontier teams down: in two-week sprints, the argument goes, there’s no room for long studies or delayed insights.

But the research that works at the frontier isn’t sequential. It’s parallel and lightweight by design: 

  • Principled judgment from deep familiarity with the users 
  • Rapid concept tests with a handful of users within an afternoon 
  • Assumption mapping at sprint kickoff to surface the riskiest beliefs about users 
  • Diary studies that run alongside development 
  • Joint synthesis sessions where data scientists and researchers review findings together 

On one team we observed, a single concept test early in a sprint surfaced a fundamental misunderstanding of an approval workflow. Catching it early prevented a post-launch redesign that would have cost weeks.

Speed wasn’t lost. Rework was avoided. 


Designing evaluations together

Frontier teams exist to dissolve silos. The promise is that better products emerge when critical perspectives come together early and often. 

As AI systems move from assisting tasks to reshaping workflows, we can no longer treat product decisions as a purely technical exercise. What we choose to optimize, what we count as success, what we include in our training datasets and test scenarios—these are product and ethical decisions, whether we name them or not.

This shift also places a new responsibility on user researchers: to make their contribution undeniable.

UX research is evolving alongside AI systems, and researchers need to show that they can do more than collect data and interpret results – they can help teams define what success and a good user experience should look like.

That starts with moving upstream. Researchers need to be in the room earlier – not just when it’s time to validate output, but when teams are deciding what counts as “good.” That includes decisions about datasets, scenarios, success criteria, and failure modes. The value of qualitative insight often comes not from proving a metric wrong, but from revealing when the abstraction itself is wrong: when a scenario is unrealistic, or when a metric hides real experience.

Doing this well requires fluency in how models and evals work – not to replace data science, but to collaborate with it. Frontier teams won’t always invite research by default; they need to take active steps to involve researchers early and visibly to improve outcomes. The goal isn’t to replace but to synchronize.

If you are part of a frontier team, try this in your next sprint planning session: 

Ask the team to name the three riskiest assumptions you’re making about your users or their workflows. Write them down. Then ask who is responsible for testing those assumptions – not with data you already have, but with evidence you haven’t yet gathered. 

If the answer requires both a data scientist and a researcher in the room, you’re not missing a seat. 

You’re re-building the table.  

 
