Why your test concept suddenly grows massively when you start testing AI…
… and how to keep control. 2/5
In the last article (1/5, “Semantic testing of non-deterministic AI systems”), we covered the basics: non-determinism, semantics, and why “expected result = <single value>” is about as useful for AI assistants as a ruler is for measuring the wind. Today we go a step further. What does this mean for quality assurance as a discipline? Not just for individual tests, but for strategy, test stages, evidence and efficiency.
When you test an AI assistant that understands speech, makes decisions and triggers connected systems, you aren’t testing one feature. You are testing a chain of capabilities. And you are testing it under conditions that are rarely perfect: noise, context, changing data states, variable latencies, model updates, multi-intents, and rules and (behaviour) policies that differ across countries. The result: a near-endless space of functional and non-functional conditions.
Your QA needs to act much more systematically and pragmatically. Systematically, because without structure you drown in the possibility space. Pragmatically, because you would never finish if you tried to exhaust every exploratory test variation.
And this is precisely where more than your test case catalogue grows: your test concept itself balloons. Suddenly you need additional test approaches with their own definitions of acceptance corridors, new evaluation logic and extended stability criteria, plus new rules for what counts as “demonstrated” in the first place.

The challenge of the unknown
To make matters tougher, the heart of this composite, the integrated language model, has to be treated as an external component regardless of vendor, because its behaviour can neither be fully specified through classic specs nor predicted reproducibly. For testers this matters because it adds a second dimension to an already exploding combinatorics: unknown behaviour inside a system that still has to work reliably.

Many integrated language models are adopted as a finished baseline solution, or adjusted only in places (updates, fine-tuning, safety layer), without the project team being able to control or influence every internal decision rule. The practical consequence: you treat the model as a black box and demonstrate quality through observable behaviour and its effects in the connected system.
And as if that weren’t challenging enough, this component has some very concrete side effects for quality assurance.
First, defect patterns become harder to grasp. When an AI-driven reaction is “off”, it isn’t immediately clear whether the cause sits in intent recognition, in context management, in tool selection or in the linguistic output, and even with an identical test setup the next run might suddenly look fine again.
Second, the character of requirements shifts. With deterministic components you can specify very precisely what has to happen. In a test that involves an LLM, by contrast, you have to accept that parts of the spec can only be expressed as guardrails. Which information is mandatory? Which phrasings are admissible? How must uncertainty be communicated? When does the system need to ask a clarifying question, and when does it need to refuse?
These softer requirements are harder to agree on, to operationalise and to evaluate consistently, especially when different stakeholders hold different expectations about “good” answers.
Third, an additional dynamic emerges through context and data dependency. An assistant reacts not only to the current sentence but to conversation history, user state, available data and system responses from other components.
Quality in the connected system: why end-to-end tests are decisive
In classic testing you can catch many quality issues low down in the V-model: unit tests, integration tests, then system tests. Clearly defined interface traffic gives you an early sense of how mature your system is. For all involved subsystems built on classic development, this still applies.
As soon as AI takes over the steering of your composite, however, the decisive issues often show up only in the end-to-end view. Why? Because the defects don’t sit in one component but in the variability with which an orchestrator “moderates” the overall system: there are no absolutely unambiguous inputs, and no unambiguous reactions to them.

Picture an AI assistant in a car. The driver says: “I’m cold.” How does the language model interpret that? Maybe by setting the temperature two degrees warmer? Or by closing the window and turning on the seat heating? Or all of it together?
Each individual reaction may be plausible on its own, but the overall result might still be wrong for the user. And the responsibility for the desired effect no longer sits with the user, who used to have to phrase the command correctly. It sits with the AI, which has to interpret the request correctly and respond sensibly when ambiguity arises. So instead of accepting an arbitrary behaviour as the “right” option, we expect our assistant to clarify the actual intent in dialogue with the driver when things are unclear.
End-to-end test strategy for AI assistants
A clean E2E approach for assistants therefore thinks in terms of user goals and system reactions. For a use case you don’t only define “what should happen” but also “how do I recognise that it actually happened?”.

Methodically, that means you build scenarios along a chain: input (speech/text), interpretation (intent/parameters), decision (am I allowed to?), execution (connected-system trigger), feedback (response to user).
That sequence is also the debug logic. When something goes wrong, you want to know whether the intent was misread, whether parameters were derived incorrectly, whether the connected system reacted differently than expected, or whether “only” the response is off.
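The debug logic described above can be sketched in a few lines. This is a minimal illustration, not a finished harness: the stage names follow the chain from the text, while `StageResult` and `first_failing_stage` are hypothetical names, and treating an unrecorded stage as a failure is an assumption made for this sketch.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional

# Stage order follows the chain in the text.
STAGES = ["input", "interpretation", "decision", "execution", "feedback"]


@dataclass
class StageResult:
    """Outcome of one check at one stage of a single test run (hypothetical shape)."""
    stage: str
    passed: bool
    detail: str = ""


def first_failing_stage(results: list[StageResult]) -> Optional[str]:
    """Return the earliest stage in chain order that failed, or None if all passed.

    Assumption: a stage without a recorded result counts as failed,
    because the run cannot be diagnosed without evidence for it.
    """
    by_stage = {r.stage: r for r in results}
    for stage in STAGES:
        result = by_stage.get(stage)
        if result is None or not result.passed:
            return stage
    return None


run = [
    StageResult("input", True),
    StageResult("interpretation", True),
    StageResult("decision", True),
    StageResult("execution", False, "seat heating was not triggered"),
    StageResult("feedback", False, "response claims heating is on"),
]
print(first_failing_stage(run))
```

Reporting the earliest failing stage keeps the analysis focused: a wrong response after a failed execution is usually a follow-on symptom, not a second defect.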
For that you need clear responsibility boundaries. Which decisions may the LLM make, which may it only prepare, and which have to be decided deterministically outside?
Equally important: what may the component not do? Those are your guardrails, which we will go into more deeply next week. You need clear no-go rules.
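One way to make such responsibility boundaries testable is a deterministic gate that sits between the model's proposal and the connected system. The sketch below is an assumption about how that could look; the action names, the three sets and the `gate` function are purely illustrative, and denying unknown actions by default is a design choice of this example.

```python
from enum import Enum


class Verdict(Enum):
    EXECUTE = "execute"    # the LLM may decide this on its own
    ASK_USER = "ask_user"  # the LLM may only prepare it; the user confirms
    REFUSE = "refuse"      # no-go: never executed


# Hypothetical rule sets for an in-car assistant.
ALLOWED = {"raise_temperature", "seat_heating_on"}
NEEDS_CONFIRMATION = {"open_window"}
NO_GO = {"open_door_while_driving"}


def gate(proposed_action: str) -> Verdict:
    """Decide deterministically, outside the model, what happens to a proposed action."""
    if proposed_action in NO_GO:
        return Verdict.REFUSE
    if proposed_action in NEEDS_CONFIRMATION:
        return Verdict.ASK_USER
    if proposed_action in ALLOWED:
        return Verdict.EXECUTE
    # Anything the rules do not know is denied by default.
    return Verdict.REFUSE


print(gate("seat_heating_on"), gate("open_window"), gate("open_door_while_driving"))
```

Because the gate is deterministic, it can be tested classically with exact expected results, independently of the model's behaviour.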
The result of all this? An infinite number of combinations.
Structured and repeatable testing despite infinite combinations
The most important point: E2E tests for AI assistants are no longer standardised “system tests as before”. They are the place where non-determinism becomes manageable, but only if you set the binding guardrails mentioned at the start, along with the acceptance corridors and measurement points that matter to you. Focus less on the mass of all conceivable use cases and more on identifying the truly relevant scenarios, each with a defined possibility space around it.
Then you define the expected “statistical” stability. How often does your system have to be “right”? This is where the success rates from the previous article come in. To gain trust in the behaviour, it is sensible to run particularly relevant scenarios repeatedly and define a minimum success rate, so you can demonstrate whether you are observing robust behaviour or just luck.
You don’t test every possibility, you cover a possibility space within this defined action radius.
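The idea of a minimum success rate over repeated runs can be sketched as follows. The function names and the threshold values are assumptions for illustration; `run_scenario` stands for whatever executes one end-to-end run and reports pass or fail.

```python
from typing import Callable


def success_rate(run_scenario: Callable[[], bool], repetitions: int) -> float:
    """Run the same scenario several times and return the share of passing runs."""
    if repetitions <= 0:
        raise ValueError("repetitions must be positive")
    passes = sum(1 for _ in range(repetitions) if run_scenario())
    return passes / repetitions


def meets_minimum(rate: float, minimum: float) -> bool:
    """Check an observed success rate against the agreed minimum."""
    return rate >= minimum


# Stand-in for a real E2E run: nine passes, one failure (made-up outcomes).
recorded_outcomes = iter([True] * 9 + [False])
rate = success_rate(lambda: next(recorded_outcomes), repetitions=10)

print(rate, meets_minimum(rate, 0.9), meets_minimum(rate, 0.95))
```

Keep in mind that a handful of repetitions only gives a rough estimate. The more critical the scenario, the more runs you need before the observed rate says much about robust behaviour versus luck.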
Finally, you secure the transparency you expect from the outside. Even though you can’t crack open your black box, you want diagnosability: logs, traces, request/response correlations, context. In tests, logged reasoning can show how the behaviour came about. A crucial precondition: the test environment has to provide full access to the entire flow from input to result, and that flow has to be recorded with every test run.
You test representatively, risk-based and systematically.
The first lever is clustering. You group inputs into classes. Many phrasings belong to the same intent. Many intents belong to the same domain. Many domains belong to the same risk category. The entry condition is a serious risk analysis. If it doesn’t exist, start it the moment you have identified your cluster categories. The result is a hierarchy that lets you reach reliable coverage. You don’t test “every sentence”, you test “every important class”. As the saying goes: I don’t have to look at every leaf to know whether the tree is healthy.
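As a rough sketch of that hierarchy, the following groups a small, made-up catalogue by risk category, domain and intent, and then picks a limited number of phrasings per class. The catalogue entries, the function names and the per-risk limits are all assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical catalogue entries: (phrasing, intent, domain, risk category).
CATALOGUE = [
    ("I'm cold", "raise_temperature", "climate", "low"),
    ("It's freezing in here", "raise_temperature", "climate", "low"),
    ("Close the window", "close_window", "climate", "medium"),
    ("Call emergency services", "emergency_call", "safety", "high"),
]


def cluster(catalogue):
    """Build the hierarchy risk -> domain -> intent -> list of phrasings."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for phrasing, intent, domain, risk in catalogue:
        tree[risk][domain][intent].append(phrasing)
    return tree


def representatives(tree, phrasings_per_intent):
    """Pick up to N phrasings per intent, where N depends on the risk category."""
    selected = []
    for risk, domains in tree.items():
        limit = phrasings_per_intent[risk]
        for intents in domains.values():
            for phrasings in intents.values():
                selected.extend(phrasings[:limit])
    return selected


tree = cluster(CATALOGUE)
picks = representatives(tree, {"low": 1, "medium": 1, "high": 5})
print(picks)
```

The per-risk limits are where the risk analysis enters: high-risk classes get more phrasings, low-risk classes get one representative.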


The second lever is controlled variation. You don’t test under the most “stable” conditions you can find, you vary deliberately along central influencing factors (language, context, tool results) and check acceptance corridors. An acceptance corridor doesn’t describe the exact wording, it describes the frame in which meaning, action and tonality count as “correct enough” while observing defined no-go boundaries. You prove the system’s robustness through variation.
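A minimal sketch of controlled variation and an acceptance corridor might look like this. The factor values, the action names and the `Corridor` class are hypothetical, and the corridor here only covers the "action" dimension; meaning and tonality would need their own checks.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical influencing factors and their values.
FACTORS = {
    "language": ["en", "de"],
    "context": ["windows_closed", "window_open"],
    "tool_result": ["ok", "timeout"],
}


def variations(factors):
    """Yield every combination of factor values as a dict."""
    names = list(factors)
    for values in product(*(factors[name] for name in names)):
        yield dict(zip(names, values))


@dataclass(frozen=True)
class Corridor:
    """Frame of acceptable actions plus explicit no-go actions."""
    allowed: frozenset
    forbidden: frozenset

    def accepts(self, actions: set) -> bool:
        # At least one action, all of them inside the corridor, none of them a no-go.
        return bool(actions) and actions <= self.allowed and not (actions & self.forbidden)


im_cold = Corridor(
    allowed=frozenset({"raise_temperature", "seat_heating_on", "close_window", "ask_user"}),
    forbidden=frozenset({"open_window", "lower_temperature"}),
)

print(len(list(variations(FACTORS))))
print(im_cold.accepts({"seat_heating_on", "close_window"}))
print(im_cold.accepts({"seat_heating_on", "open_window"}))
```

Each variation becomes one test run, and the corridor check stays the same across all of them. That is what turns "the wording differs every time" into a comparable result.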
The third lever is designing measurement points well in your expected result. If you use the connected system as a truth point, you can verify outcomes reproducibly even when wording varies. The triggered reaction of your target function is decisive. That also lets you reduce the share of “semantic evaluation” again and increase the share of objective checks against system state.
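Using the connected system as a truth point can be as simple as comparing its state before and after the request. The sketch below assumes the state can be read as a flat dictionary; the keys and values are invented for illustration.

```python
def state_delta(before: dict, after: dict) -> dict:
    """Return {key: (old, new)} for every key whose value changed."""
    keys = before.keys() | after.keys()
    return {
        key: (before.get(key), after.get(key))
        for key in keys
        if before.get(key) != after.get(key)
    }


# Hypothetical snapshots of the connected system around the request "I'm cold".
before = {"cabin_temp_c": 20, "seat_heating": "off", "window": "closed"}
after = {"cabin_temp_c": 22, "seat_heating": "off", "window": "closed"}

delta = state_delta(before, after)
print(delta)
```

The assertion on `delta` is exact and repeatable, no matter how the assistant phrased its reply. Semantic evaluation is then only needed for the response text itself.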


The fourth lever is regression as the entry ritual into the test cycle. Non-deterministic systems need a different regression philosophy. It isn’t enough to ask occasionally “is it still green today?”. You want to know continuously “where does the trend stay stable?”. That calls for repeatable core sets that run regularly, so you spot behavioural changes quickly and can identify focus areas for the larger test sets.
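A trend-oriented regression check could be sketched like this: compare the latest success rate of each core scenario with the average of the runs before it. The function name, the window size, the tolerance and the sample history are assumptions for illustration.

```python
from __future__ import annotations


def flag_regressions(
    history: dict[str, list[float]], tolerance: float, window: int = 3
) -> list[str]:
    """Flag scenarios whose latest success rate dropped below their recent baseline.

    The baseline is the mean of the `window` runs before the latest one.
    Scenarios with too little history are skipped, not flagged.
    """
    flagged = []
    for scenario, rates in history.items():
        if len(rates) < window + 1:
            continue
        baseline = sum(rates[-window - 1:-1]) / window
        if baseline - rates[-1] > tolerance:
            flagged.append(scenario)
    return flagged


# Hypothetical success rates per core-set run, oldest first.
history = {
    "raise_temperature": [0.95, 0.96, 0.94, 0.95],
    "close_window": [0.98, 0.97, 0.99, 0.80],
    "new_scenario": [0.90],
}

print(flag_regressions(history, tolerance=0.05))
```

The flagged scenarios are the focus areas: they tell you where to point the larger test sets first.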
When you combine all of this, something emerges that is essential for AI testing: manageability through structure. You accept the infinity in theory, but you build a practical map you can navigate.
Wrap-up and outlook
Today you saw why AI doesn’t “abolish” QA experts but makes their job more demanding, and why the E2E test becomes far more important. You can only test AI systems meaningfully with clearly defined expectations, stability criteria and a real test strategy. And you don’t beat the infinite combinatorics with mass, you beat it with structure.
Great that you stuck with it this far. What experiences have you had testing AI systems? And if while reading you thought “okay, this is exactly our problem, we have AI in the product but our test approach feels like yesterday’s news”, that’s a pretty common state in 2026. That is exactly where my consulting comes in. QCT helps teams turn an AI prototype into a manageable product. Get in touch and we’ll take a closer look at your scenario.
Next week things get serious and “educational”: guardrails, ethics, bias, abuse and toxic behaviour. We look at how to approach those topics, including provocation and adversarial scenarios designed to draw your system out of its comfort zone or pin it to the mat.
Missed an article?
You can find them all in CT View for a re-read.
