Niall Cook

Domain-specific AI products need assurance, not just good prompts

There's nothing wrong with delegating reasoning-like tasks to generative models, but once you embed those models into a product workflow — deciding how to prompt them, which models and parameters to use, how to post-process outputs, and how to chain multiple AI-assisted steps together — you cannot simply rely on the provider's own quality benchmarks.

The problem for many companies building AI products is that proper assurance can be bloody hard work, sometimes harder than developing the product itself.

It's not sufficient to just ask "does the output look good?" In regulated industries in particular, you need to be able to show, repeatedly, that each AI-supported task is doing what it is meant to do — and not doing anything that it shouldn't.

So while you may be testing prompts, reviewing outputs and collecting user feedback, that is very different from having a repeatable assurance layer underneath the product.

What assurance looks like in practice

Over the past few months I've been developing this kind of framework for a client building LegalTech applications on top of foundation models. For one workflow alone, that has meant breaking the process into over 15 individual, testable tasks, building an evaluation pipeline that uses the same calls as the production code, drafting and iterating golden test sets covering each task, and repeatedly reviewing, testing, and refining the results.
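To make the golden test set idea concrete, here is a minimal Python sketch. It is not the client's pipeline: `GoldenCase`, `classify_clause`, `evaluate` and the sample clauses are all hypothetical, and the keyword-based classifier is only a stand-in for a real foundation model call. The part that matters is that the evaluation calls the same function the product would call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenCase:
    """One hand-reviewed example with its agreed expected output."""
    case_id: str
    input_text: str
    expected: str


def classify_clause(text: str) -> str:
    """Hypothetical production task; a real one would call a foundation model."""
    lowered = text.lower()
    if "indemnif" in lowered:
        return "indemnity"
    if "terminat" in lowered:
        return "termination"
    return "other"


def evaluate(task: Callable[[str], str], cases: list[GoldenCase]) -> dict:
    """Run the production task over every golden case and record mismatches."""
    failures = []
    for case in cases:
        actual = task(case.input_text)
        if actual != case.expected:
            failures.append((case.case_id, case.expected, actual))
    passed = len(cases) - len(failures)
    return {
        "total": len(cases),
        "passed": passed,
        "accuracy": passed / len(cases) if cases else 0.0,
        "failures": failures,
    }


# Illustrative golden set (invented examples, not real client data).
GOLDEN_SET = [
    GoldenCase("c1", "The Supplier shall indemnify the Customer against all losses.", "indemnity"),
    GoldenCase("c2", "Either party may terminate this Agreement on 30 days' notice.", "termination"),
    GoldenCase("c3", "This Agreement is governed by the laws of England and Wales.", "governing_law"),
]

report = evaluate(classify_clause, GOLDEN_SET)
print(report["passed"], "of", report["total"], "passed")
for case_id, expected, actual in report["failures"]:
    print(f"{case_id}: expected {expected!r}, got {actual!r}")
```

In this toy run the third case fails, because the stand-in classifier has no notion of a governing-law clause. That is the kind of gap a golden set is meant to surface before a user does.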

In the process, we've had to answer and document a whole bunch of important questions:

  • What tasks in the workflow is AI assisting?
  • How do foundation model outputs flow through the application?
  • What do good outputs look like for each task?
  • What are the predictable failure modes?
  • What are the best examples to test with?
  • What metrics actually matter?
  • What changes need to trigger the assurance pipeline?
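The last question, about what should trigger the pipeline, can be approached mechanically. One possible approach, sketched below in Python, is to fingerprint each task's configuration (prompt, model, parameters) and re-run the evaluation for any task whose fingerprint has changed. The task names, config fields and model names are assumptions made for illustration, not a description of any particular system.

```python
import hashlib
import json


def fingerprint(config: dict) -> str:
    """Stable hash of a task's configuration (prompt, model, parameters)."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def tasks_needing_rerun(previous: dict[str, str], current: dict[str, dict]) -> list[str]:
    """Return tasks whose config is new or differs from the last assured run."""
    return sorted(
        name for name, config in current.items()
        if previous.get(name) != fingerprint(config)
    )


# Hypothetical task configurations at the time of the last assurance run.
assured_configs = {
    "extract": {"model": "model-a", "temperature": 0.0, "prompt": "Extract the parties."},
    "summarise": {"model": "model-a", "temperature": 0.2, "prompt": "Summarise the clause."},
}
assured_fingerprints = {name: fingerprint(cfg) for name, cfg in assured_configs.items()}

# Someone later changes the model used by the summarisation task.
current_configs = {
    "extract": dict(assured_configs["extract"]),
    "summarise": {**assured_configs["summarise"], "model": "model-b"},
}

stale = tasks_needing_rerun(assured_fingerprints, current_configs)
print(stale)
```

A config fingerprint only catches changes you control. A provider silently updating a model behind the same name would need a separate trigger, such as a scheduled run.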

Where AI workflows actually fail

In product workflows, foundation models are rarely just doing one generic thing. They can be retrieving, classifying, summarising, ranking, extracting, drafting, checking sources, and making judgements that later outputs depend on.

Each of those steps can fail in different ways. A final output that "looks good" could actually hide a weak intermediate step. A plausible summary can distort a source found earlier in the workflow. A relevance filter can drop an important piece of earlier evidence. A prompt change can improve one task while damaging others.
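The relevance filter case shows why intermediate steps need their own checks. The sketch below, in Python, measures how much of the evidence a reviewer marked as essential survives the filter, independently of how the final output reads. The document IDs and function names are hypothetical.

```python
def filter_recall(kept_ids: set[str], must_keep_ids: set[str]) -> float:
    """Share of reviewer-marked essential items that survived the filter."""
    if not must_keep_ids:
        return 1.0
    return len(kept_ids & must_keep_ids) / len(must_keep_ids)


def missing_evidence(kept_ids: set[str], must_keep_ids: set[str]) -> list[str]:
    """Essential items the filter dropped, for reviewer follow-up."""
    return sorted(must_keep_ids - kept_ids)


# Hypothetical output of a relevance filter for one golden case.
kept = {"doc-1", "doc-4", "doc-7"}
# What a reviewer decided the filter must never drop for that case.
must_keep = {"doc-1", "doc-2", "doc-4"}

recall = filter_recall(kept, must_keep)
missing = missing_evidence(kept, must_keep)
print(f"recall={recall:.2f}, missing={missing}")
```

A downstream summary built from `kept` could still look perfectly reasonable, which is why the check sits at the filter rather than at the end of the workflow.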

My sense is that many organisations aren't avoiding assurance because they don't care. More likely, they're not sure why they need it, and even when they are, they may not have a practical framework for doing it. That's perhaps why AI assurance is still treated as something abstract, legalistic or overkill. But at product level, it can be very concrete: task inventories, golden test sets, metrics and thresholds, evaluation runs, reviewer notes, and system cards.
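Metrics and thresholds in particular lend themselves to a simple automated gate. The Python sketch below compares per-task scores from a candidate change against a baseline and flags any task that falls below an absolute floor or drops by more than a tolerated amount. All task names, scores and limits here are invented for illustration, and real thresholds would need to be agreed per task.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    minimum: float         # absolute floor for the task's score
    max_regression: float  # largest tolerated drop versus the baseline


def gate(baseline: dict[str, float],
         candidate: dict[str, float],
         thresholds: dict[str, Threshold]) -> list[str]:
    """Return a human-readable problem for every threshold breach."""
    problems = []
    for task, limit in thresholds.items():
        score = candidate[task]
        if score < limit.minimum:
            problems.append(f"{task}: {score:.2f} is below minimum {limit.minimum:.2f}")
        drop = baseline[task] - score
        if drop > limit.max_regression:
            problems.append(f"{task}: dropped {drop:.2f} from baseline")
    return problems


# Hypothetical scores before and after a prompt change.
baseline = {"extract": 0.92, "summarise": 0.85}
candidate = {"extract": 0.95, "summarise": 0.78}
thresholds = {
    "extract": Threshold(minimum=0.90, max_regression=0.02),
    "summarise": Threshold(minimum=0.80, max_regression=0.03),
}

problems = gate(baseline, candidate, thresholds)
for problem in problems:
    print(problem)
```

Here the change improves extraction but pushes summarisation under both limits, so the gate reports two problems for a change that might otherwise have looked like a net improvement.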

The point is not perfection

The point is not to prove that domain-specific AI applications are perfect. It is to give users appropriate reassurance, and to enable product teams to test and improve their systems, explain them to clients, and know when a change has made them worse.

And as foundation models change at an ever faster pace, that is going to matter more and more for professional AI applications.