QA in the age of non-determinism, 1/5
As promised, this week’s article in the series “QA in the age of non-determinism” goes deeper into the topic:
“Semantic testing of non-deterministic AI systems”.
Sounds super sexy on paper, doesn’t it? Right. Still, there is a lot to it, and anyone jumping on the AI hype train who wants to integrate a language model into an application or system should know how to stabilise the resulting product.
Testing with AI is far more complex than many think, and the variety of scenarios grows fast. I’m currently developing a test strategy for exactly that on a client project, and the wild ride started in the very first conversation with the customer.
My client expects everything from their AI: functionality with and without online connectivity, reliability, robustness, guard rails, security and global market support. A dreamer? No, an emotional global brand with high standards. None of these aspects is nice-to-have. Every quality element has to be met, no exceptions. Everything is must-have. Security, factual accuracy, voice-triggered actions, permanent availability, none of it can be left to chance. Logical.
Test resources? Endless? Of course not. So we need efficient routes, a reliable test strategy and creative ideas, but first of all a deep understanding of the system. So let’s get into AI system testing from the basics, starting with a few definitions.
What does semantics mean?
Semantics, at its core, is “meaning”. When someone says “I’m cold”, the semantics is the user’s underlying intent: “Please make it warmer.” The same content can be wrapped in many phrasings: “Turn the heating up”, “Could you raise the temperature?”, “There’s a draft”, “I’m freezing”. Different statements, same wish for warmth.
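As a minimal sketch of how such a paraphrase set can become test data: the phrasings come from the paragraph above, while the intent label "increase_temperature", the function names and the fake resolver are assumptions made up for illustration. In a real project resolve_intent would call the assistant under test.

```python
from typing import Callable

# Phrasings taken from the text; the intent label is a hypothetical name.
PARAPHRASES = {
    "increase_temperature": [
        "I'm cold",
        "Turn the heating up",
        "Could you raise the temperature?",
        "There's a draft",
        "I'm freezing",
    ],
}


def find_mismatches(resolve_intent: Callable[[str], str]) -> list[tuple[str, str, str]]:
    """Return (utterance, expected, actual) for every phrasing resolved to the wrong intent."""
    mismatches = []
    for expected, utterances in PARAPHRASES.items():
        for utterance in utterances:
            actual = resolve_intent(utterance)
            if actual != expected:
                mismatches.append((utterance, expected, actual))
    return mismatches


def fake_resolver(utterance: str) -> str:
    """Stand-in for the assistant under test; it misreads exactly one phrasing."""
    return "close_window" if utterance == "There's a draft" else "increase_temperature"


mismatches = find_mismatches(fake_resolver)
```

The check compares meaning, not wording: five different sentences are expected to land on one and the same intent.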

In AI assistants this meaning is the pivot point, because it determines which action gets triggered in the connected systems. The assistant typically does three things in sequence: it interprets what you want, it decides whether it is allowed to do it, and it executes or refuses, giving the user a corresponding answer. Semantics doesn’t replace precision, it shifts it.
We become precise about meaning, intent and expected behaviour, and about boundaries.

Semantic testing therefore means we don’t primarily check whether the answer sounds exactly as expected, but whether it is adequate in content and whether the system behaviour is correct. That covers four levels of evaluation logic, which you can picture as a stepped system, where each level only matters once the previous one has been climbed successfully.
Evaluation logic for AI outputs: how to judge the unpredictable
Now comes the part where classic QA instincts tend to stumble. To design a test case, I create a so-called test oracle for each scenario. A test oracle is nothing mystical: it is the prediction of what a test case is supposed to deliver, against which you decide whether the result is acceptable. It is also known as the “expected result”.
In deterministic software testing this is often a clearly defined string that should appear pretty much exactly that way for the test case to be allowed to pass. In a non-deterministic context you adjust the oracle’s expectations. With AI systems this oracle is usually formulated multi-dimensionally. It doesn’t evaluate one exact state as right or wrong, it evaluates several criteria in parallel.

The first level is intent and goal achievement
Did the assistant understand what was meant, and was the goal achieved? When the user’s intent is “move my appointment by two days”, the goal isn’t a shift of exactly 48 hours to the second plus a fixed response sentence. An acceptable outcome can equally read: “the AI confirmed that the appointment is to be moved to a date two days later, offered to adjust, extend or shorten it because of a time conflict, and then moved it to the desired time, or correctly explained why that wasn’t possible”.
The second level is action correctness

When the assistant triggers connected systems, the action has to be correct in business terms. Parameters need to match, the right function has to be called, side effects need to fit. This is the area you can often check very objectively: logs, parameters of API calls, status changes, entries, transactions. AI testing becomes pleasantly “deterministic” again here, because the connected system is your most important measurement point.
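A minimal sketch of such an objective check, using the appointment example. The log format, the function name "move_appointment" and the argument names are hypothetical; a real connected system will expose its own schema.

```python
from datetime import date, timedelta


def check_move_call(call: dict, appointment_id: str, original: date, days: int) -> list[str]:
    """Return a list of deviations; an empty list means the logged action is correct."""
    problems = []
    if call.get("function") != "move_appointment":
        problems.append(f"wrong function: {call.get('function')!r}")
    args = call.get("arguments", {})
    if args.get("appointment_id") != appointment_id:
        problems.append("wrong appointment targeted")
    expected = (original + timedelta(days=days)).isoformat()
    if args.get("new_date") != expected:
        problems.append(f"new_date is {args.get('new_date')!r}, expected {expected!r}")
    return problems


# Hypothetical log entry captured from the calendar system during a test run.
logged_call = {
    "function": "move_appointment",
    "arguments": {"appointment_id": "A-17", "new_date": "2025-03-14"},
}
problems = check_move_call(logged_call, "A-17", date(2025, 3, 12), days=2)
```

The assertion runs against what the connected system recorded, not against the assistant's wording, which is why this level stays stable even when the answers vary.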
The third level is communicative validity
Answers may vary, but they have to stay within a quality frame. That means no contradictions, no fabricated facts, no unclear promises, no “I have completed this” when in truth it failed. Here you define an acceptance corridor. A corridor is a set of guardrails: you allow variation, but only inside clear boundaries.
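One way to picture an acceptance corridor is as a set of rule checks that every answer must satisfy, whatever its wording. The sketch below is deliberately naive: the phrase list, the length limit and the function name are assumptions, and a real project would more likely use a semantic classifier or a reviewed rubric than a regular expression.

```python
import re

# Hypothetical phrases that count as a completion claim (English only).
COMPLETION_CLAIM = re.compile(r"\b(i have|i've|done|completed|moved|booked)\b", re.IGNORECASE)


def corridor_violations(answer: str, action_succeeded: bool, max_chars: int = 400) -> list[str]:
    """Return every corridor rule the answer breaks; an empty list means it is inside the corridor."""
    violations = []
    if not answer.strip():
        violations.append("empty answer")
    if len(answer) > max_chars:
        violations.append("answer too long")
    # The key rule from the text: no "I have completed this" when the action failed.
    if not action_succeeded and COMPLETION_CLAIM.search(answer):
        violations.append("claims completion although the action failed")
    return violations


false_success = corridor_violations("I've moved your appointment to Friday.", action_succeeded=False)
honest_failure = corridor_violations(
    "Sorry, I could not move the appointment because of a conflict.", action_succeeded=False
)
```

Both answers are free to vary in wording; only the first one leaves the corridor, because it contradicts the state of the connected system.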

The fourth level is policy and safety conformity

This is what article 3 will cover concretely, under guard rails and ethics. Are the reactions safe, socially acceptable in tone and appropriate? Important: a failure at any one of these levels fails the test case overall. A test does not pass if the goal was reached but one of the other levels was violated.
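The rule that a single failed level fails the whole test case can be sketched as a small data structure. The class name, the level identifiers and the decision to treat a missing level as a failure are assumptions for illustration, not a fixed method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LevelResult:
    """Outcome of one evaluation level for a single test run."""
    level: str
    passed: bool
    note: str = ""


# Hypothetical identifiers for the four levels described in the text.
LEVELS = ("intent", "action", "communication", "policy")


def overall_verdict(results: list[LevelResult]) -> tuple[bool, list[str]]:
    """Pass only if every level was evaluated and none of them failed."""
    by_level = {r.level: r for r in results}
    failures = []
    for level in LEVELS:
        result = by_level.get(level)
        if result is None:
            failures.append(f"{level}: not evaluated")
        elif not result.passed:
            failures.append(f"{level}: {result.note or 'failed'}")
    return (not failures, failures)


run = [
    LevelResult("intent", True),
    LevelResult("action", True),
    LevelResult("communication", False, "claimed success although the call failed"),
    LevelResult("policy", True),
]
passed, reasons = overall_verdict(run)
```

Here the goal was reached and the action was correct, yet the run fails because of the communication level, and the reasons list says why.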
Now the trick that makes non-determinism testable. You no longer evaluate just a single run. For critical cases you work with repetitions, success rates and statistical values, and these need to be agreed by all project stakeholders in a binding way. A non-deterministic system can be correct in 10 out of 10 runs, or in only 9 out of 10.

The second scenario feels “almost as good” in a regression run, but at the management demo it can get embarrassing, and in production it is the start of a support avalanche. That is why the team defines a minimum quota for important use cases, above which you can say “this is stable”.
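A minimal sketch of repetitions plus a minimum quota. The function names and the 95 % threshold are assumptions; the actual quota is whatever the stakeholders agree on, and run_once stands for one full execution and evaluation of the test case.

```python
from typing import Callable


def pass_rate(run_once: Callable[[], bool], repetitions: int) -> float:
    """Execute the same test case several times and return the share of passing runs."""
    if repetitions <= 0:
        raise ValueError("repetitions must be positive")
    passes = sum(1 for _ in range(repetitions) if run_once())
    return passes / repetitions


def is_stable(rate: float, minimum_quota: float) -> bool:
    """A use case counts as stable when its pass rate reaches the agreed quota."""
    return rate >= minimum_quota


# Simulated outcomes standing in for real runs: nine passes and one miss.
outcomes = iter([True] * 9 + [False])
rate = pass_rate(lambda: next(outcomes), repetitions=10)
stable = is_stable(rate, minimum_quota=0.95)  # hypothetical quota
```

With a quota of 0.95 the "9 out of 10" case is reported as not stable, which makes the gap visible before the demo instead of during it.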
A common objection: “So I have to run everything a hundred times!” No. You don’t have to test everything a hundred times. But for the riskiest and most business-critical cases you have to understand how stable the system really is. The volume and depth depend on your time and resource constraints, and on how much residual risk you are willing to carry.
Non-determinism forces you to prioritise, and prioritisation is what professional quality assurance is about anyway.
So the most important recommendation for kicking off test planning and building your test case catalogues for AI testing is more relevant than ever: get your team together and walk through the connected systems, the use cases and the expectations as a group. Run a scenario-based risk analysis. Only when you know where a failure of the AI in the system will hurt the user or really annoy them can you focus your resources on minimising that risk. The recommendation comes deliberately at the end here, because while risk analysis ought to be part of the methodical process, it is mostly skipped or only carried out at very low intensity in real life.
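The outcome of such a risk workshop can feed directly into test depth. The following sketch assumes a simple likelihood times impact score on a 1 to 5 scale; the scenarios, the scores and the thresholds are invented for illustration and would come out of the team's own analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    likelihood: int  # 1 (rare) to 5 (frequent), agreed in the team workshop
    impact: int      # 1 (annoying) to 5 (harmful), agreed in the team workshop

    @property
    def risk(self) -> int:
        return self.likelihood * self.impact


def repetitions_for(scenario: Scenario) -> int:
    """Hypothetical mapping from risk score to the number of repeated runs."""
    if scenario.risk >= 15:
        return 50
    if scenario.risk >= 8:
        return 10
    return 1


catalogue = [
    Scenario("small talk about the weather", likelihood=4, impact=1),
    Scenario("voice-triggered action", likelihood=3, impact=5),
    Scenario("appointment moved to the wrong date", likelihood=2, impact=4),
]
# Highest risk first, each scenario with its planned number of repetitions.
plan = {s.name: repetitions_for(s) for s in sorted(catalogue, key=lambda s: s.risk, reverse=True)}
```

The point of the mapping is not the exact numbers but that repetitions are spent where a failure hurts most, while low-risk scenarios get a single run.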
Wrap-up and outlook
If you got this far, you have understood the three pillars that make AI testing grown-up. Non-determinism isn’t unmanageable, but it forces you to take a different approach, design your test cases differently and measure results differently. Semantics moves the focus from wording to meaning. And clean evaluation logic replaces rigid expected results with acceptance corridors, objective measurement points and, where needed, stability quotas. Your testing becomes risk-based first, function-based second.
Next week we continue with: Why your test concept suddenly grows massively, and how to keep it under control. There we take a step further and turn these basics into an actual QA strategy. What does it mean for end-to-end test methods? How do you build E2E tests when an assistant could trigger several connected systems? How do you test black-box components you only partially control? And how do you achieve repeatability when the test case space is theoretically unlimited?
By the way: if AI assistants or agents are being introduced in your company too and you notice that classic test approaches don’t quite cover it any more, that isn’t a QA failure, it is simply a different playing field. That is exactly what I support with my consulting work: building a practical test strategy, identifying risks, defining evaluation logic and metrics, designing E2E scenarios and creating the conviction that “AI is demonstrably manageable”.
Curious? Then let’s talk about your scenario.
