TL;DR
- Binary criteria answer yes/no questions and suit hard policy rules with no grey area.
- Multi-option criteria capture nuance where outcome quality falls into distinct, named buckets.
- Scored criteria use a numeric range and are best for behaviours that exist on a spectrum, like tone or thoroughness.
- Mixing all three types in one QA scorecard gives a richer picture than any single format alone.
- Every metric type still needs a clear, single-concept question to produce consistent, auditable AI scores [1].
About the Author: Revelir AI built RevelirQA, an AI customer service QA platform running in production at high-volume enterprises across Southeast Asia, including Xendit and Tiket.com. This article draws on direct experience configuring multilingual QA scorecards across fintech and travel support operations.
Why does metric format matter in an AI QA scorecard?
The format of a QA metric is not a cosmetic choice. It determines what the AI scoring engine is actually being asked to decide, and therefore how reliable and useful the output will be. A poorly matched format creates ambiguity that compounds with every ticket scored.
Consider a criterion like "Did the service representative verify the customer's identity before sharing account details?" This is a hard yes/no question. Asking the AI to score it on a 1-5 scale introduces false precision: what does a 3 mean here? By contrast, a criterion like "How clearly did the service representative explain the next steps?" genuinely exists on a spectrum, and collapsing it to pass/fail loses information.
The three formats serve fundamentally different evaluation jobs:
- Binary enforces policy compliance where the rule is absolute.
- Multi-option categorises quality when distinct, named outcomes exist.
- Scored measures the degree of quality on a continuous behaviour.
Getting the match right is the first step toward a QA scorecard that produces actionable coaching data rather than noise [1].
What is a binary QA metric and when should you use it?
A binary metric is a criterion with exactly two possible outcomes, most commonly pass or fail, yes or no. It is the right format whenever the underlying policy is absolute: the behaviour either happened or it did not, and partial credit is not meaningful.
Best-fit use cases for binary metrics:
- Compliance checks ("Did the service representative include the required regulatory disclosure?")
- Security verification ("Did the service representative confirm identity before accessing the account?")
- Prohibited behaviour flags ("Did the service representative make any promise outside the approved refund policy?")
Binary criteria are easy for an AI scoring engine to reason about because the decision boundary is clear. The prompt can ask a direct question and map the answer to 1 or 0. The risk is over-using binary format on questions that actually have nuance, which forces the AI to make a harsh call where a spectrum would be more honest [1].
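The yes/no mapping described above can be sketched in a few lines of Python. This is a minimal illustration, not RevelirQA's implementation: the `BinaryCriterion` class, its `score` method, and the accepted answer strings are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BinaryCriterion:
    """Hypothetical yes/no criterion; the class and field names are assumptions."""
    question: str

    def score(self, answer: str) -> int:
        # Normalise the scoring engine's answer, then map it to 1 (pass) or 0 (fail).
        normalised = answer.strip().lower()
        if normalised in {"yes", "pass"}:
            return 1
        if normalised in {"no", "fail"}:
            return 0
        # Reject anything else instead of silently guessing a score.
        raise ValueError(f"Expected a yes/no answer, got {answer!r}")


identity_check = BinaryCriterion(
    "Did the service representative confirm identity before accessing the account?"
)
```

Rejecting an answer such as "partially" rather than coercing it to 0 is a deliberate choice in this sketch: it surfaces criteria that may not be truly binary.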
What is a multi-option metric and where does it add value over binary?
A multi-option metric extends the binary outcome space to three or more named categories, each with a distinct label and typically a different point value. It captures the gap between "this happened" and "how well it happened."
A good example is escalation handling. Rather than pass/fail, you might define:
| Option Label | Description | Points |
|---|---|---|
| Fully resolved | Service representative resolved the issue without needing escalation | 3 |
| Escalated correctly | Service representative escalated to the right queue with context | 2 |
| Escalated incorrectly | Service representative escalated to the wrong team or without context | 1 |
| Did not escalate when required | Issue needed escalation; service representative closed without it | 0 |
Multi-option metrics require more upfront definition work. Each option needs a clear, mutually exclusive description that the AI can distinguish without ambiguity. Overlapping labels are the most common failure mode and produce inconsistent scores across tickets [2].
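As a rough sketch, the escalation options above could be held in a small lookup that rejects duplicate labels or descriptions at configuration time. The `Option` class and the `build_options` and `points_for` helpers are hypothetical names for this example. The check only catches exact duplicates; whether two differently worded descriptions overlap in meaning still needs human review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    """One named outcome of a multi-option criterion (hypothetical structure)."""
    label: str
    description: str
    points: int


def build_options(options: list[Option]) -> dict[str, Option]:
    """Index options by label, rejecting exact duplicate labels or descriptions."""
    by_label: dict[str, Option] = {}
    seen_descriptions: set[str] = set()
    for opt in options:
        if opt.label in by_label:
            raise ValueError(f"Duplicate label: {opt.label!r}")
        if opt.description in seen_descriptions:
            raise ValueError(f"Duplicate description: {opt.description!r}")
        by_label[opt.label] = opt
        seen_descriptions.add(opt.description)
    return by_label


# Labels, descriptions, and points taken from the escalation table above.
ESCALATION = build_options([
    Option("Fully resolved", "Resolved without escalation needed", 3),
    Option("Escalated correctly", "Escalated to the right queue with context", 2),
    Option("Escalated incorrectly", "Escalated to the wrong team or without context", 1),
    Option("Did not escalate when required", "Issue needed escalation; closed without it", 0),
])


def points_for(label: str) -> int:
    """Return the points for the label chosen by the scoring engine."""
    return ESCALATION[label].points
```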
What is a scored criterion and when is a numeric range the right choice?
A scored criterion assigns a numeric value within a defined range, for example 0 to 3 or 1 to 5, to a single behaviour. It is appropriate for qualities that genuinely exist on a continuum, where the difference between a 2 and a 3 reflects a real and meaningful difference in performance.
Tone, empathy, explanation clarity, and response completeness are the most common scored criteria in customer service QA. A 0-to-3 scale tends to outperform 1-to-5 in practice because a five-point scale introduces middle-score ambiguity that makes AI calibration harder [1].
Tips for configuring scored criteria effectively:
- Anchor every score point with a behavioural description, not just a number.
- Keep the criterion focused on one behaviour. Combining tone and completeness into one scored question makes the score uninterpretable.
- Assign weights that reflect the criterion's business importance relative to others on the scorecard [1].
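The tips above can be sketched as a small configuration object that refuses to load unless every score point has an anchor. Everything here is an assumption made for illustration: the `ScoredCriterion` class, the anchor wording, and the weight of 2.0 are hypothetical and do not come from any particular platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredCriterion:
    """Hypothetical scored criterion on an integer range with anchored points."""
    name: str
    anchors: dict[int, str]  # one behavioural description per score point
    weight: float
    min_score: int = 0
    max_score: int = 3

    def __post_init__(self) -> None:
        # Every point on the scale must have a behavioural anchor.
        expected = set(range(self.min_score, self.max_score + 1))
        if set(self.anchors) != expected:
            raise ValueError(f"{self.name}: anchors must cover {sorted(expected)}")

    def weighted(self, score: int) -> float:
        """Scale the raw score to 0..1, then apply the criterion's weight."""
        if score not in self.anchors:
            raise ValueError(f"{self.name}: score {score} is outside the scale")
        span = self.max_score - self.min_score
        return self.weight * (score - self.min_score) / span


# Illustrative anchors only; real anchors should come from the team's own SOPs.
tone = ScoredCriterion(
    name="Tone",
    anchors={
        0: "Dismissive or hostile language",
        1: "Neutral but curt; no acknowledgement of the customer's situation",
        2: "Polite and professional throughout",
        3: "Polite, warm, and adapted to the customer's stated concern",
    },
    weight=2.0,
)
```

Keeping one behaviour per object mirrors the single-concept rule: a combined "tone and completeness" criterion would need two separate instances.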
How should you structure a QA scorecard that mixes all three metric types?
A well-designed QA scorecard typically combines all three formats in deliberate layers. Compliance and policy rules sit at the binary level. Process quality and procedural outcomes use multi-option. Soft skills and communication quality use scored criteria.
| Layer | Metric Type | Example Criteria |
|---|---|---|
| Policy compliance | Binary | Regulatory disclosure given, identity verified |
| Process adherence | Multi-option | Escalation handling, resolution pathway |
| Communication quality | Scored (0-3) | Tone, explanation clarity, empathy |
One practical principle: any binary fail on a compliance criterion should be able to override the overall score, regardless of how well the service representative scored on communication. A service representative who was warm and clear but skipped an identity check has still failed the interaction at the policy level.
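One way to express the override principle is shown below. It assumes the strictest possible policy, where any failed compliance check forces the overall score to zero; capping the score or flagging the ticket for review would be equally valid readings. The `overall_score` function and its parameter names are hypothetical.

```python
def overall_score(
    compliance_results: dict[str, int],
    weighted_points: float,
    max_points: float,
) -> float:
    """Return a 0-100 score; any failed compliance check (0) forces 0.0.

    Assumption: zeroing the score is just one possible override policy.
    """
    if max_points <= 0:
        raise ValueError("max_points must be positive")
    # A single binary fail overrides everything earned on other criteria.
    if any(result == 0 for result in compliance_results.values()):
        return 0.0
    return round(100 * weighted_points / max_points, 1)
```

For example, `overall_score({"identity_verified": 0, "disclosure_given": 1}, 9, 10)` returns `0.0` even though the representative earned 9 of 10 points on the remaining criteria, while the same points with both checks passed return `90.0`.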
Platforms like RevelirQA support all three metric types within the same QA scorecard, letting teams configure criteria that reflect their actual SOPs rather than a generic template. Because every score includes a full reasoning trace showing which policy documents were retrieved and how the criterion was applied, QA managers can audit any individual result and refine criteria with confidence.
Frequently Asked Questions
Can one QA scorecard use all three metric types together?
Yes, and it should. Binary handles hard rules, multi-option handles categorical outcomes, and scored criteria handle behaviour on a spectrum. Mixing formats gives a more complete quality picture than any single format alone.
How many criteria should a QA scorecard include?
Fewer criteria scored precisely outperform more criteria scored loosely. Most high-performing scorecards use between eight and fifteen criteria. Beyond that, the marginal information value of each additional criterion tends to drop while scoring complexity rises [1].
What makes a poorly written QA metric?
The most common issues are double-barrelled questions (asking two things in one criterion), vague language without behavioural anchors, and binary format applied to a question that actually has meaningful gradations [1].
How does an AI scoring engine handle multilingual conversations?
A well-configured AI scoring engine evaluates the conversation in the language it was conducted in, applying the same scorecard criteria regardless of language. RevelirQA runs in production across English, Indonesian, Thai, and Tagalog support queues without requiring separate scorecard configurations per language.
What is the difference between a QA metric and a QA scorecard?
A QA metric is a single evaluation criterion with its format (binary, multi-option, or scored) and weight defined. A QA scorecard is the full set of metrics applied together to evaluate one conversation. The scorecard produces a composite score; individual metrics produce the component inputs.
Should AI systems and human representatives use the same QA scorecard?
In most cases, yes, particularly for compliance and resolution criteria. Consistency in the scorecard is what makes performance comparable across your entire support operation. RevelirQA applies the same criteria to AI-handled and human-handled conversations so CX leaders see a unified quality view.
Revelir AI builds RevelirQA, an AI customer service QA platform that scores 100% of support conversations against a company's own policies and SOPs, retrieved via RAG before each evaluation. Unlike manual sampling, which covers only 1-5% of tickets, RevelirQA provides every service representative, human or AI, with the same consistent scorecard and a full audit trail on every score. RevelirQA is in production at Xendit and Tiket.com, handling thousands of conversations per week across multilingual, high-volume support environments. The platform integrates with any helpdesk via API and is available as a SaaS or dedicated tenant deployment.
See how Revelir AI configures custom QA scorecards built around your policies, not generic benchmarks. Visit https://www.revelir.ai/ to learn more or get in touch with the team.
References
[1] AI scoring best practices. Genesys Cloud Resource Center (help.mypurecloud.com)
[2] Building custom LLM evaluation metrics. Rhesis AI Blog (rhesis.ai)
