The Mandatory vs. Advisory Criteria Split: How Enterprise QA Teams Decide Which Scorecard Items Are Pass/Fail Versus Weighted

Published on: June 15, 2026


In any customer service QA scorecard, not all criteria deserve equal treatment. Some failures should immediately flag a conversation as non-compliant regardless of how well everything else went. Others represent quality ideals that should influence a score but not automatically disqualify the interaction. Getting this split right is one of the most consequential design decisions a QA team makes. A scorecard that treats everything as weighted will quietly pass conversations that carry real compliance or safety risk, while one that makes everything pass/fail will produce scores so binary that they offer no coaching signal. This article explains how to make that call clearly and consistently.

TL;DR
  • Mandatory (pass/fail) criteria should cover compliance obligations, safety risks, and non-negotiable policies where any failure invalidates the interaction.
  • Advisory (weighted) criteria cover quality dimensions where partial performance still has value and improvement can be measured incrementally.
  • The split is not permanent: criteria should be reviewed as regulation, product, and SOP evolve.
  • A QA scorecard that blends both types gives teams a compliance floor and a quality ceiling to coach toward.
  • Scoring 100% of conversations, rather than a sample, is the only reliable way to enforce mandatory criteria at scale without gaps.

About the Author

Revelir AI builds AI quality assurance software for customer service teams at high-volume enterprises. Its scoring engine runs across 100% of support conversations at clients including Xendit and Tiket.com, giving it direct operational insight into how QA scorecards behave in production environments.

What Is the Mandatory vs. Advisory Split in a QA Scorecard?

The mandatory vs. advisory split is the design principle that divides scorecard criteria into two tiers: items that must be satisfied in every conversation (pass/fail), and items that contribute to an overall quality score but allow for partial credit (weighted). This is the structural backbone of any serious QA scorecard, and most enterprise teams either apply it inconsistently or collapse the distinction entirely, which produces scores that are neither trustworthy for compliance nor useful for coaching [1].

The core logic is straightforward:

  • Mandatory criteria represent non-negotiable standards. Failure on any single mandatory item should fail the entire conversation, regardless of how high the weighted score is. Examples: disclosing a call recording, following an escalation policy for a disputed transaction, never sharing account credentials.
  • Advisory (weighted) criteria represent quality dimensions where doing better earns a higher score, but doing less does not constitute a policy breach. Examples: using empathy language, offering a proactive resolution, correctly categorising the contact reason.

Without this split, a scorecard averaging across all criteria can produce a passing score on a conversation where a compliance-critical step was skipped entirely. That is not a minor scoring nuance. In regulated industries like fintech, it is an audit risk [4].
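The gap between a flat average and a two-tier score can be shown with a minimal sketch. The criterion names, weights, and results below are hypothetical, and the scoring functions are an illustration of the principle rather than any particular product's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    mandatory: bool
    weight: float = 0.0  # used only for advisory criteria


def naive_average(results):
    """Flat average across every criterion, with no mandatory tier."""
    return sum(results.values()) / len(results)


def score_conversation(criteria, results):
    """Gate on mandatory items, then compute a weighted advisory score.

    `results` maps criterion name -> score in [0.0, 1.0].
    Mandatory items are binary: anything below 1.0 counts as a fail.
    """
    failed = [c.name for c in criteria if c.mandatory and results[c.name] < 1.0]
    advisory = [c for c in criteria if not c.mandatory]
    total_weight = sum(c.weight for c in advisory)
    weighted = (
        sum(c.weight * results[c.name] for c in advisory) / total_weight
        if total_weight
        else 0.0
    )
    return {"passed": not failed, "failed_mandatory": failed, "weighted_score": weighted}


# Hypothetical scorecard and results, for illustration only.
criteria = [
    Criterion("recording_disclosed", mandatory=True),
    Criterion("credentials_not_shared", mandatory=True),
    Criterion("empathy_language", mandatory=False, weight=1.0),
    Criterion("proactive_resolution", mandatory=False, weight=2.0),
    Criterion("contact_reason_categorised", mandatory=False, weight=1.0),
]
results = {
    "recording_disclosed": 0.0,  # the compliance step was skipped
    "credentials_not_shared": 1.0,
    "empathy_language": 1.0,
    "proactive_resolution": 1.0,
    "contact_reason_categorised": 1.0,
}

flat = naive_average(results)
outcome = score_conversation(criteria, results)
print(f"flat average: {flat:.2f}")
print(outcome)
```

In this example the flat average comes out at 0.80, which many pass thresholds would accept, while the two-tier score marks the conversation as failed and names the skipped disclosure as the reason.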

How Do You Decide Which Criteria Belong in Each Tier?

Building on the distinction above, the harder practical question is where to draw the line. The decision is not about how important a criterion feels. It is about the consequence of failure and whether partial performance still has value.

Use these filters to classify each scorecard item:

| Filter question | Points to mandatory | Points to advisory (weighted) |
| Does a failure here create a compliance, legal, or regulatory exposure? | Yes: regulatory disclosure skipped | No: tone was impersonal but accurate |
| Does partial completion still serve the customer? | No: identity verification is all-or-nothing | Yes: a partial proactive offer is better than none |
| Would a QA manager ever pass this conversation despite the failure? | No: safety or policy breach | Sometimes: suboptimal phrasing is still acceptable |
| Is the behaviour explicitly required by a written SOP or policy? | Yes: verbatim SOP step | No: it is a best practice or style guide item |

A useful heuristic: if your compliance or legal team would mark this as an audit finding, it is mandatory. If your coaching team would note it as an improvement opportunity, it is advisory [2].
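The four filter questions can be encoded as a small classification helper. The sketch below is one possible reading: the article does not say how to combine the answers, so the rule used here is an assumption. It treats compliance exposure alone, or three or more matching signals, as mandatory, and sends mixed answers to manual review. All field names and example items are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FilterAnswers:
    creates_exposure: bool              # compliance, legal, or regulatory exposure
    partial_completion_has_value: bool  # does partial completion still serve the customer?
    manager_might_still_pass: bool      # would a QA manager ever pass it anyway?
    required_by_written_sop: bool       # explicitly required by a written SOP or policy


def classify(answers):
    """Return (tier, signals pointing to mandatory).

    Combination rule is an assumption for illustration, not a standard.
    """
    signals = {
        "compliance_exposure": answers.creates_exposure,
        "all_or_nothing": not answers.partial_completion_has_value,
        "never_passable": not answers.manager_might_still_pass,
        "written_sop_step": answers.required_by_written_sop,
    }
    hits = [name for name, hit in signals.items() if hit]
    if answers.creates_exposure or len(hits) >= 3:
        tier = "mandatory"
    elif not hits:
        tier = "advisory"
    else:
        tier = "review"  # mixed signals: let compliance and coaching leads decide
    return tier, hits


identity_check = classify(FilterAnswers(True, False, False, True))
empathy_language = classify(FilterAnswers(False, True, True, False))
contact_reason = classify(FilterAnswers(False, True, True, True))

print(identity_check)
print(empathy_language)
print(contact_reason)
```

Returning the list of matching signals alongside the tier keeps the classification explainable, which helps when the split is revisited later.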

What Makes a QA Scorecard Template Fail in Practice?

A related but distinct question is why well-designed scorecards still produce poor outcomes once they are deployed. The most common failure modes are not in the criteria themselves but in how the tiers are applied at volume [1].

  • Everything becomes weighted over time. Teams soften mandatory criteria to avoid pushback, eventually treating compliance items as merely influencing the score. This erodes the QA program's integrity entirely.
  • The scorecard is applied to a 1-5% sample. Manual QA review, even with a perfectly designed scorecard, only covers a small fraction of conversations. A mandatory criterion can be routinely violated in the remaining 95-99% with no signal surfacing. This is the core reason enterprises are moving toward automated QA scoring at full conversation volume [4].
  • Criteria definitions are ambiguous. A mandatory item like "confirmed customer identity" means different things to different reviewers if the SOP is not the reference document used during scoring. Scoring against your own SOPs, rather than a generic QA scorecard, is what makes mandatory criteria enforceable [2].
  • Weights are assigned by intuition, not impact. Weighted criteria should reflect what actually drives customer outcomes. Assigning equal weight to all advisory criteria ignores the fact that some quality dimensions matter more to retention and resolution than others [3].
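The sampling problem can be made concrete with a rough calculation. The sketch below assumes each reviewed conversation is an independent draw with the same chance of containing a mandatory violation. That is a simplification, and the volumes and violation rate are made-up figures for illustration.

```python
def chance_sample_looks_clean(violation_rate, reviewed):
    """Probability that none of the reviewed conversations contains a violation."""
    return (1.0 - violation_rate) ** reviewed


def expected_unreviewed_violations(violation_rate, total, reviewed):
    """Expected number of violations sitting in conversations nobody reviewed."""
    return violation_rate * (total - reviewed)


# Hypothetical month: 10,000 conversations, a 2% manual sample,
# and a mandatory step skipped in 0.5% of conversations.
total = 10_000
reviewed = 200
violation_rate = 0.005

p_clean = chance_sample_looks_clean(violation_rate, reviewed)
missed = expected_unreviewed_violations(violation_rate, total, reviewed)

print(f"chance the sample shows zero violations: {p_clean:.2f}")
print(f"expected violations outside the sample: {missed:.0f}")
```

Under these assumptions there is a better than one-in-three chance that the sample looks entirely clean, while roughly 49 violations are expected in the conversations that were never reviewed.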

How Should Weights Be Assigned to Advisory Criteria?

Stepping back from the structural classification, a separate concern is how to make the weighted tier meaningful. A QA scorecard where every advisory criterion carries the same weight is only marginally better than a checklist. Weight should reflect the relative impact of each criterion on the outcomes the business actually cares about [3].

A practical approach to weight assignment:

  1. Anchor to customer outcome data. If first-contact resolution and CSAT are your primary metrics, weight criteria that most directly drive those outcomes higher. Empathy language may matter less than resolution accuracy in a fintech context.
  2. Segment by contact type. A billing dispute conversation and a password reset conversation should not carry identical weights on every criterion. A QA scorecard that supports configurable criteria per queue or contact reason is more accurate than a one-size approach [1].
  3. Revisit weights quarterly. Products change, SOPs evolve, and new failure patterns emerge. A QA scorecard is a living document, not a one-time build [2].
  4. Use scoring data to validate weights. If an advisory criterion consistently scores high while CSAT on the same conversations stays poor, the criterion is not tracking what drives the outcome, and its weight is miscalibrated. The data should inform the scorecard design, not just confirm it.
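Steps 2 and 4 above can be sketched in a few lines: per-queue weights that are normalised before scoring, and a correlation check between each criterion's scores and CSAT. The queue names, weights, and scores are invented for illustration, and plain Pearson correlation on a handful of conversations is only a rough signal rather than a validated method.

```python
from math import sqrt

# Hypothetical raw weights per queue; normalised to sum to 1 before use.
QUEUE_WEIGHTS = {
    "billing_dispute": {"resolution_accuracy": 5, "proactive_offer": 2, "empathy_language": 3},
    "password_reset": {"resolution_accuracy": 6, "proactive_offer": 1, "empathy_language": 1},
}


def normalise(weights):
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def weighted_score(queue, scores):
    """Weighted advisory score for one conversation in the given queue."""
    weights = normalise(QUEUE_WEIGHTS[queue])
    return sum(weights[name] * scores[name] for name in weights)


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series has no variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


# The same conversation scores differently depending on the queue.
scores = {"resolution_accuracy": 1.0, "proactive_offer": 0.0, "empathy_language": 1.0}
billing = weighted_score("billing_dispute", scores)
reset = weighted_score("password_reset", scores)

# Made-up criterion scores and CSAT (1-5) for six conversations.
csat = [5, 3, 5, 1, 3, 4]
resolution_scores = [1.0, 0.5, 1.0, 0.0, 0.5, 1.0]
empathy_scores = [1.0, 1.0, 0.5, 1.0, 1.0, 1.0]

r_resolution = pearson(resolution_scores, csat)
r_empathy = pearson(empathy_scores, csat)
print(f"billing={billing:.3f} reset={reset:.3f}")
print(f"resolution vs CSAT: {r_resolution:.2f}, empathy vs CSAT: {r_empathy:.2f}")
```

In this toy data, resolution accuracy tracks CSAT closely while empathy language does not, which would be a prompt to revisit the empathy weight or its definition rather than proof that it is unimportant.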

How Does Automated AI Scoring Change This Design Decision?

With tiers and weights defined, the remaining question is whether AI scoring changes how teams should think about the mandatory vs. advisory split at all. It does, in two important ways.

First, full conversation coverage makes mandatory criteria actually enforceable. When scoring is manual, teams often avoid making too many criteria mandatory because each fail triggers a review workflow they cannot sustain at volume. When an AI scoring engine evaluates 100% of conversations, a mandatory fail can be flagged, triaged, and acted on systematically without creating a bottleneck. This removes one of the main practical pressures that causes teams to soften mandatory criteria over time.

Second, AI scoring is only as precise as the policy documents it references. A scoring engine that retrieves your actual SOPs before evaluating each conversation will apply mandatory criteria consistently because it is checking against the specific rule, not a reviewer's memory of it. This is the core architectural difference between policy-grounded AI QA and generic QA scorecard scoring.

RevelirQA is built around exactly this architecture. It ingests your SOPs and QA scorecard into a vector database, retrieves the relevant policy documents before each evaluation, and applies your mandatory and weighted criteria consistently across every conversation. Because every score carries a full reasoning trace, including the documents retrieved and the logic applied, QA managers can audit why a mandatory criterion was flagged, not just that it was. For teams at companies like Xendit, where compliance on financial conversations is not optional, that auditability is not a nice-to-have. It is a requirement.
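The retrieve-then-evaluate pattern with a reasoning trace can be illustrated with a toy sketch. This is not RevelirQA's implementation. The SOP IDs, texts, and required phrases are invented, word overlap stands in for vector retrieval, and a phrase match stands in for the model's judgement.

```python
import re
from dataclasses import dataclass

# Hypothetical SOP store; a real system would hold full policy documents.
SOPS = {
    "SOP-REC-02": {
        "text": "Agent must disclose that the call is recorded at the start of the conversation.",
        "required_phrase": "this call is recorded",
    },
    "SOP-IDV-01": {
        "text": "Agent must confirm customer identity with two identifiers before discussing account details.",
        "required_phrase": "confirm your identity",
    },
}


@dataclass(frozen=True)
class ScoreRecord:
    criterion: str
    passed: bool
    retrieved_doc: str  # which policy document the verdict was checked against
    reasoning: str      # human-readable trace of the logic applied


def _words(text):
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(criterion, sops):
    """Toy retrieval: pick the SOP sharing the most words with the criterion."""
    query = _words(criterion)
    return max(sops, key=lambda doc_id: len(query & _words(sops[doc_id]["text"])))


def evaluate(criterion, transcript, sops):
    """Retrieve the relevant SOP first, then judge the transcript against it."""
    doc_id = retrieve(criterion, sops)
    phrase = sops[doc_id]["required_phrase"]
    passed = phrase in transcript.lower()
    reasoning = (
        f"Retrieved {doc_id}; looked for required phrase '{phrase}'; "
        f"{'found' if passed else 'not found'} in transcript."
    )
    return ScoreRecord(criterion, passed, doc_id, reasoning)


criterion = "Agent disclosed that the call is recorded"
missing = evaluate(criterion, "Hi, thanks for calling. How can I help today?", SOPS)
present = evaluate(criterion, "Hello! Please note this call is recorded.", SOPS)

print(missing)
print(present)
```

Because each record stores the retrieved document and the reasoning alongside the verdict, a reviewer can see why a mandatory criterion was flagged, not only that it was.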

Frequently Asked Questions

How many mandatory criteria should a QA scorecard have?

Most enterprise QA scorecards carry between three and eight mandatory criteria. More than ten typically signals that the classification has drifted, with quality items being elevated to mandatory status incorrectly. If a criterion does not create compliance, safety, or policy exposure on failure, it should be advisory [1].

Can a conversation pass overall if it fails a mandatory criterion?

No. That is the defining feature of a mandatory criterion. A high weighted score should never offset a mandatory fail. If your scoring system allows this, the mandatory tier is not functioning correctly [2].

What is the difference between a binary criterion and a mandatory criterion?

Binary means the criterion is scored as yes or no, with no partial credit. Mandatory means a fail on that criterion fails the entire conversation. All mandatory criteria are binary, but not all binary criteria are mandatory. A binary item can also be advisory, contributing its full weight or zero to the score without triggering an automatic fail [3].

How often should the mandatory vs. advisory split be reviewed?

At minimum quarterly, and immediately when a regulatory change, product update, or new SOP is introduced. Criteria classification should follow policy changes, not lag them [2].

Does this split apply to conversations handled by both human and AI systems?

Yes, and applying it consistently across both is increasingly important as companies run AI chatbots alongside human staff. A mandatory criterion around compliance disclosure applies regardless of whether the conversation was handled by a human or an automated system. Unified scoring across both gives CX leaders an accurate picture of where compliance gaps actually exist.

How does manual QA sampling affect enforcement of mandatory criteria?

Significantly. If only 1-5% of conversations are reviewed, mandatory criteria violations in the remaining volume go undetected. A team may believe its compliance rate is high because its sample looks clean, while systemic failures exist in conversations never reviewed. Full conversation coverage is the only reliable solution [4].

Can weighted criteria be promoted to mandatory status?

Yes, and this is appropriate when a previously advisory behaviour becomes a regulatory or SOP requirement. The reverse is also valid: if a mandatory criterion is retired from policy, it should be reclassified. Treat the scorecard as a living document that mirrors your current policies, not your policies from the time the scorecard was first built [1].

About Revelir AI

Revelir AI builds AI quality assurance software for customer service operations at enterprise scale. Its scoring engine, RevelirQA, evaluates 100% of support conversations against each client's own policies and QA scorecard, retrieved via RAG before every evaluation. Every score carries a full reasoning trace, giving QA and compliance teams an auditable record of every decision. RevelirQA is in production at global enterprises including Xendit and Tiket.com, scoring thousands of conversations per week across English, Indonesian, Thai, and Tagalog. It evaluates both human and AI interactions on the same scorecard, giving CX leaders a unified view of quality across their entire support operation. Revelir AI is headquartered in Singapore and integrates with any helpdesk via API.

Ready to build a QA scorecard that enforces your compliance floor and coaches toward your quality ceiling?

RevelirQA scores 100% of your customer service conversations against your own policies, with full auditability on every decision. Learn more at https://www.revelir.ai/.

References

  1. QA Process: The Complete Guide for Modern Teams (qasphere.com)
  2. QA strategy framework: 6 phases from zero to full coverage (betterqa.co)
  3. The QA Manager's Essential Guide to Test Management (testmonitor.com)
  4. How to Build an Enterprise QA Strategy - A Comprehensive Guide (testdevlab.com)