The answer is a unified quality rubric — a single evaluation framework that applies consistent scoring criteria to every conversation, regardless of who (or what) handled it. When built correctly, this approach does not dilute standards for humans or lower the bar for AI. It raises the floor for both.
TL;DR
- Most QA frameworks were built for human agents and do not translate cleanly to AI-handled conversations.
- A unified rubric evaluates both human agents and AI bots against the same criteria, eliminating the double standard that creates blind spots.
- Customer service QA automation is the practical mechanism that makes unified evaluation possible at scale.
- The rubric must be grounded in your own policies and SOPs — not generic benchmarks — to produce scores that are actually actionable.
- Tracking sentiment arc (how a customer felt at the start vs. the end) reveals quality failures that a simple resolution outcome misses entirely.
Why Does Having Two Separate QA Frameworks Create Problems?
Running separate evaluation processes for human agents and AI bots is more than an administrative inconvenience. It creates structurally incomparable data.
When your human QA process samples 5% of tickets manually and your AI bot gets no evaluation at all (a common default), you cannot answer basic operational questions:
- Is our AI bot performing better or worse than our human agents on refund requests?
- Are customers ending conversations with the bot more frustrated than they started?
- Where does the bot fail in ways that require a human escalation?
Without a shared framework, these questions go unanswered. Worse, AI bots often get a free pass because the tooling to evaluate them simply does not exist in most QA stacks. According to Training Industry, effective competency evaluation requires identifying the specific knowledge and skill sets core to a job task — the same logic applies when defining what "good" looks like for an AI agent on a specific conversation type.
What Should a Unified QA Rubric Actually Contain?
A unified rubric is not a single score. It is a structured set of criteria, each weighted and defined precisely enough that a human evaluator and an AI scoring engine produce consistent results.
Core rubric dimensions for both human agents and AI bots:
| Dimension | What It Measures | Applies to Human | Applies to AI Bot |
|---|---|---|---|
| Policy Adherence | Did the response follow your SOPs? | Yes | Yes |
| Resolution Accuracy | Was the customer's issue actually resolved? | Yes | Yes |
| Tone Appropriateness | Was the language professional and on-brand? | Yes | Yes |
| Escalation Judgment | Was escalation triggered at the right moment? | Yes | Yes |
| Response Completeness | Was all required information provided? | Yes | Yes |
| Empathy Signals | Were emotional cues acknowledged? | Yes | Partially |
The "Partially" on empathy for AI bots is intentional. Bots can be scored on whether they used empathetic language patterns, but the depth of that score should be calibrated differently than for humans. This is not a lower standard — it is an honest one.
Training Industry's research on quantitative rubrics for employee competency reinforces this point: rubrics should be developed from the actual knowledge and skill sets required for the specific tasks being evaluated, not copied from a generic template. The same principle applies when adapting criteria for AI.
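To make the weighting and calibration idea concrete, here is a minimal sketch of how such a rubric could be represented in code. The dimension names mirror the table above; the weights and the reduced empathy depth for bots are illustrative assumptions, not prescribed values — a real rubric derives them from your own SOPs.

```python
from dataclasses import dataclass

@dataclass
class Dimension:
    name: str
    weight: float                  # relative importance in the overall score
    applies_to_bot: bool           # False would exclude the dimension for bots
    bot_calibration: float = 1.0   # scales the dimension when scoring a bot

# Illustrative weights only -- derive real ones from your own policies and SOPs.
RUBRIC = [
    Dimension("policy_adherence",      weight=0.25, applies_to_bot=True),
    Dimension("resolution_accuracy",   weight=0.25, applies_to_bot=True),
    Dimension("tone_appropriateness",  weight=0.15, applies_to_bot=True),
    Dimension("escalation_judgment",   weight=0.15, applies_to_bot=True),
    Dimension("response_completeness", weight=0.10, applies_to_bot=True),
    # "Partially" for bots: same dimension, calibrated scoring depth.
    Dimension("empathy_signals",       weight=0.10, applies_to_bot=True,
              bot_calibration=0.5),
]

def overall_score(dimension_scores: dict[str, float], is_bot: bool) -> float:
    """Weighted average of per-dimension scores (each in 0.0-1.0)."""
    total, weight_sum = 0.0, 0.0
    for dim in RUBRIC:
        if is_bot and not dim.applies_to_bot:
            continue
        w = dim.weight * (dim.bot_calibration if is_bot else 1.0)
        total += w * dimension_scores.get(dim.name, 0.0)
        weight_sum += w
    return total / weight_sum if weight_sum else 0.0
```

The point of the structure is not the specific numbers — it is that one definition drives both human and bot scoring, so the results stay comparable.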
How Does Customer Service QA Automation Make This Scalable?
Manual QA — a supervisor sampling 5-10% of tickets — cannot scale to evaluate 100% of human conversations, let alone the additional volume an AI bot generates. Customer service QA automation closes that gap.
With an automated scoring engine, every conversation (human or bot-handled) gets evaluated against the same rubric, every time. No sampling bias. No evaluator fatigue. No inconsistency between reviewers on different shifts.
The practical requirements for this to work:
- The scoring engine must ingest your own policies. Generic AI scoring against industry benchmarks produces scores that cannot be actioned. Your rubric needs to reflect your specific SOPs, your escalation thresholds, your brand tone guidelines.
- Every score needs a reasoning trace. For compliance-sensitive industries like fintech, a score without an audit trail is not enough. You need to see which policy document was retrieved, what the prompt was, and why the score was assigned.
- The same rubric must run on both conversation types. Separate scoring pipelines for humans and bots reintroduce the comparability problem you were trying to solve.
RevelirQA addresses this by ingesting each client's knowledge base and SOPs into a vector database. Before scoring any conversation, the engine retrieves the relevant policy documents — the same documents your human agents are expected to follow. The score is then generated against your actual standards, not a generic benchmark, and every score carries a full reasoning trace. This approach is already running in production at Xendit and Tiket.com.
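As a rough sketch of how a policy-grounded scoring call fits together — the function names, the vector-store interface, and the trace fields below are illustrative assumptions, not RevelirQA internals:

```python
from dataclasses import dataclass, field

@dataclass
class ScoreResult:
    dimension_scores: dict[str, float]
    # Reasoning trace: what was retrieved, what was asked, why it scored as it did.
    retrieved_policies: list[str] = field(default_factory=list)
    prompt: str = ""
    rationale: str = ""

def score_conversation(conversation: str, policy_store, llm) -> ScoreResult:
    """Score one conversation (human- or bot-handled) against the same rubric.

    `policy_store` and `llm` are stand-ins for a vector database of your own
    SOPs and an LLM client -- generic interfaces, not a specific product API.
    """
    # 1. Retrieve the policy documents relevant to this conversation.
    policies = policy_store.search(conversation, top_k=3)
    policy_text = "\n".join(policies)

    # 2. Build one prompt per rubric dimension set, identical for humans and bots.
    prompt = (
        "Score this conversation against the rubric, citing the policies below.\n"
        f"Policies:\n{policy_text}\n\nConversation:\n{conversation}"
    )

    # 3. Ask the scoring model and keep the full trace for audit.
    response = llm.complete(prompt)
    return ScoreResult(
        dimension_scores=response["scores"],
        retrieved_policies=policies,
        prompt=prompt,
        rationale=response["rationale"],
    )
```

The design choice that matters is that the retrieval step and the trace fields travel with every score, so an auditor can reconstruct exactly which policy the judgment was grounded in.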
What Does Sentiment Arc Reveal That Resolution Outcome Misses?
A ticket marked "resolved" tells you the issue was closed. It does not tell you whether the customer who started the conversation frustrated ended it even more frustrated — or whether a customer who started neutral left feeling dismissed.
Sentiment arc — tracking how a customer felt at the start of a conversation versus at the end — is one of the most underused dimensions in quality evaluation, for both human and AI-handled conversations.
Consider a scenario: an AI bot correctly processes a refund request (resolution outcome: success), but the customer had to repeat their information three times and the bot's language was cold throughout. The ticket resolves. The QA score, if it only measures accuracy, passes. The customer's sentiment moved from neutral to negative.
At scale, this pattern becomes a retention risk. If 15% of technically resolved tickets this week ended with worse customer sentiment than they started with, that is a signal no resolution rate metric will catch.
Revelir Insights tracks this sentiment arc across every ticket, surfacing patterns like this so CX leaders can identify coaching opportunities for human agents and behaviour adjustments for AI bots.
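A back-of-the-envelope version of that check might look like the sketch below: compute the start-to-end sentiment delta per ticket, then report what share of resolved tickets ended worse than they began. The ticket fields and the sentiment scorer are assumptions for illustration.

```python
def sentiment_arc(ticket, sentiment_of) -> float:
    """Start-to-end sentiment delta; negative means the customer left worse off.

    `sentiment_of` is any scorer returning a value in [-1.0, 1.0], e.g. a
    sentiment model; the ticket fields used here are illustrative.
    """
    start = sentiment_of(ticket["messages"][0]["customer_text"])
    end = sentiment_of(ticket["messages"][-1]["customer_text"])
    return end - start

def resolved_but_worse_rate(tickets, sentiment_of) -> float:
    """Share of resolved tickets whose sentiment arc went downhill."""
    resolved = [t for t in tickets if t["status"] == "resolved"]
    worse = [t for t in resolved if sentiment_arc(t, sentiment_of) < 0]
    return len(worse) / len(resolved) if resolved else 0.0
```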
Frequently Asked Questions
Can the same rubric really apply fairly to both humans and AI bots?
Yes, with calibration. The core dimensions (policy adherence, resolution accuracy, tone) apply equally. Some dimensions, such as empathy, require calibrated scoring criteria that account for the different nature of AI-generated language.
What happens when my AI bot handles a conversation type my rubric was not designed for?
Flag it. A well-built unified rubric should surface gaps where no policy exists for a given contact type — which is itself a valuable signal for your QA and product teams.
How often should the rubric be updated?
Treat your rubric like a living document. Policy changes, new product launches, and evolving customer expectations should all trigger a rubric review. Quarterly at minimum; monthly if you are operating in a fast-changing environment.
Does customer service QA automation replace human QA reviewers?
It replaces manual sampling, not human judgment. Automation handles the volume. Human reviewers focus on edge cases, calibration, and coaching conversations that require context an AI cannot fully interpret.
How do I handle multilingual conversations in a unified rubric?
Your scoring engine needs to operate natively in the languages your customers use. A rubric that only scores English conversations leaves significant volume unreviewed. Revelir's platform runs in production across Indonesian-language, high-volume environments, so multilingual coverage is built-in, not a workaround.
About Revelir AI
Revelir AI is an AI customer service platform that evaluates 100% of support conversations — human and AI-handled — under a single, policy-grounded quality framework. RevelirQA, its AI scoring engine, ingests each client's SOPs via RAG and scores every conversation with a full audit trail, making it well suited to compliance-sensitive industries like fintech. Revelir Insights enriches every ticket with sentiment arc, reason for contact, and custom metrics, connecting to Claude via MCP so CX leaders can query their entire support dataset in plain English. Enterprise clients including Xendit and Tiket.com run Revelir in production, processing thousands of tickets per week.
Ready to evaluate your entire customer service operation — human and AI — under one consistent framework? Explore Revelir AI at www.revelir.ai.
References
- Training Industry. Developing Quantitative Rubrics for Employee Competency Determination. https://trainingindustry.com/magazine/summer-2023/developing-quantitative-rubrics-for-employee-competency-determination/
- TechClass. Boost Employee Reviews with 10 Proven Performance Tips. https://www.techclass.com/resources/learning-and-development-articles/10-best-practices-for-effective-employee-performance-reviews
