Most AI quality assurance scoring fails not because the AI is unintelligent, but because it is uninformed. Generic rubrics score agents against benchmarks that have nothing to do with how a specific company defines a good interaction. RevelirQA solves this with a fundamentally different architecture: before evaluating any conversation, the scoring engine retrieves the company's own SOPs and policies from a vector database using Retrieval-Augmented Generation (RAG), then scores every ticket against those exact standards. The result is consistent, auditable, policy-grounded evaluation at 100% conversation coverage with no sampling bias and no generic guesswork [1].
- RevelirQA retrieves your actual policies via RAG before scoring each conversation, rather than evaluating against generic industry benchmarks.
- Every evaluation produces a full reasoning trace: model used, prompt, documents retrieved, and score rationale.
- 100% coverage replaces manual sampling, eliminating the blind spots that come with reviewing only a fraction of tickets.
- The same scoring rubric evaluates both human agents and AI agents, giving CX leaders a unified quality view.
- Enterprise clients Xendit and Tiket.com are already running this in production at high volume [1].
Why Do Most AI Scoring Engines Get QA Wrong?
The core failure mode of conventional AI scoring is context blindness. A generic model asked to score a customer service conversation has no knowledge of your refund policy, your escalation thresholds, or the specific tone your brand requires in a complaint scenario. It applies averaged, inferred standards derived from training data, which may or may not resemble how your business actually defines quality.
The downstream consequences are significant:
- Agents get penalised or rewarded for reasons unrelated to your actual standards.
- Coaching feedback becomes difficult to defend when it isn't grounded in documented policy.
- In regulated industries like fintech, a score without a retrievable evidence trail creates compliance exposure.
"The problem isn't that AI can't score conversations. It's that most AI scores conversations without reading the rulebook first."
Manual QA sampling has its own structural problem: even a thorough team reviewing a fixed percentage of tickets each week leaves the majority of conversations unexamined. Volume spikes, new agent onboarding periods, and product launches are exactly when quality risks increase, and they are also exactly when sampled review is least representative.
How Does RAG-Powered QA Actually Work?
Retrieval-Augmented Generation (RAG) is an architecture that combines a language model's reasoning ability with a retrieval step: before generating a response or evaluation, the system fetches relevant documents from a knowledge store and includes them as context [1].
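Stripped to its essentials, the pattern is retrieve, augment, generate. The sketch below is a minimal illustration under assumed interfaces: `vector_db` and `llm` are hypothetical stand-ins for a vector store and a model client, not any specific vendor's API.

```python
# The RAG pattern in miniature. `vector_db` and `llm` are hypothetical
# stand-ins, not a real library's API.
def rag_evaluate(conversation: str, vector_db, llm) -> str:
    # Retrieve: fetch the documents most relevant to this conversation.
    docs = vector_db.search(conversation, top_k=5)
    # Augment: place the retrieved documents in the model's context.
    prompt = (
        "Policies:\n" + "\n\n".join(docs)
        + "\n\nEvaluate this conversation against the policies above:\n"
        + conversation
    )
    # Generate: the model now reasons with the rulebook in view.
    return llm.generate(prompt)
```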
In RevelirQA, the process works in four steps (a code sketch follows the list):
- Ingestion: Your knowledge base, SOPs, and internal policies are ingested and embedded into a vector database.
- Retrieval: When a conversation is submitted for scoring, the engine runs a semantic search to retrieve the documents most relevant to that specific interaction (e.g. the refund policy if the ticket concerns a refund).
- Scoring: The language model receives both the conversation and the retrieved policy documents, then evaluates agent performance against those specific standards.
- Trace generation: Every score is accompanied by a full reasoning trace: which model was used, what prompt was applied, which documents were retrieved, and how the score was derived [1].
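To make the ingestion and retrieval steps concrete, here is a self-contained sketch of embedding-based semantic search using cosine similarity. The toy `embed` function and in-memory store are illustrative stand-ins; a production system would use a real embedding model and a dedicated vector database.

```python
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    # Toy embedding for illustration (hashed bag-of-words); a real
    # system would call an embedding model here instead.
    v = np.zeros(dim)
    for token in text.lower().split():
        v[hash(token) % dim] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm else v

class PolicyStore:
    """In-memory stand-in for a vector database."""
    def __init__(self) -> None:
        self.vectors: list[np.ndarray] = []
        self.docs: list[str] = []

    def ingest(self, doc: str) -> None:
        # Ingestion: embed each SOP/policy and store it for retrieval.
        self.vectors.append(embed(doc))
        self.docs.append(doc)

    def search(self, query: str, top_k: int = 3) -> list[str]:
        # Retrieval: rank policies by cosine similarity to the query.
        q = embed(query)
        sims = [float(q @ v) for v in self.vectors]
        ranked = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
        return [self.docs[i] for i in ranked[:top_k]]

store = PolicyStore()
store.ingest("Refund policy: full refunds within 14 days of purchase.")
store.ingest("Escalation policy: route chargebacks to the risk team.")
# A refund ticket deterministically pulls the refund policy first.
print(store.search("Customer requests a refund for last week's order"))
```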
This means two agents handling the same type of ticket in different weeks are scored against the same retrieved policy, not against whatever a model infers is appropriate.
What Does a Full Audit Trail on Every Score Actually Mean?
For most QA platforms, a score is a number with a rationale summary. For RevelirQA, every evaluation is fully observable:
| Audit Trail Component | What It Contains | Why It Matters |
|---|---|---|
| Model used | Which AI model produced the evaluation | Reproducibility and version accountability |
| Prompt | The exact instruction given to the model | Transparency into scoring criteria applied |
| Documents retrieved | The specific SOPs or policies fetched via RAG | Proves the score is policy-grounded, not generic |
| Reasoning trace | Step-by-step evaluation logic | Enables agent coaching with specific, defensible justification |
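As an illustration of what such a record could look like when serialized (a hypothetical schema inferred from the table above, not RevelirQA's actual format):

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class EvaluationTrace:
    # Field names mirror the table above; they are illustrative,
    # not RevelirQA's published schema.
    model: str                 # which AI model produced the evaluation
    prompt: str                # the exact instruction given to the model
    retrieved_docs: list[str]  # the SOPs/policies fetched via RAG
    reasoning: str             # step-by-step evaluation logic
    score: float

trace = EvaluationTrace(
    model="scoring-model-v3",
    prompt="Score this ticket against the retrieved refund policy...",
    retrieved_docs=["refund-policy-v12", "tone-guidelines-v4"],
    reasoning="Agent confirmed eligibility per the 14-day window...",
    score=4.5,
)
print(json.dumps(asdict(trace), indent=2))  # one auditable record per score
```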
This level of observability is not a nice-to-have for industries like fintech. It is a requirement. Xendit and Tiket.com operate in environments where every customer interaction may carry regulatory or reputational weight, and the ability to show exactly how a quality score was derived is central to operating responsibly at scale [1].
How Does RevelirQA Handle Human and AI Agents Differently?
It does not treat them differently, and that is intentional. As enterprises deploy AI agents alongside human representatives, quality assurance faces a fragmentation problem: separate rubrics for bots and humans create inconsistent standards and make it impossible to compare performance across the full support operation.
RevelirQA applies the same policy-grounded rubric to every conversation regardless of who or what handled it. A ticket resolved by the Revelir Support Agent is scored with the same retrieved SOPs and the same evaluation logic as a ticket handled by a senior human agent. This gives CX leaders a unified, comparable quality view across their entire operation [2].
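One way to picture that design, as a hypothetical sketch rather than the product's actual code: the handler type is carried as reporting metadata but never branches the evaluation logic, so human and AI tickets flow through an identical scoring path.

```python
# Hypothetical sketch: `handled_by` is metadata only, never a branch.
# `store` and `llm` are assumed stand-ins, as in the earlier sketches.
def score_conversation(conversation: str, handled_by: str, store, llm) -> dict:
    # Same retrieval and the same rubric, whoever handled the ticket.
    policies = store.search(conversation, top_k=5)
    evaluation = llm.generate(
        "Policies:\n" + "\n\n".join(policies)
        + "\n\nScore this conversation:\n" + conversation
    )
    return {
        "handled_by": handled_by,  # "human" or "ai", for reporting only
        "evaluation": evaluation,  # identical rubric for both
    }
```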
What Does 100% Coverage Change in Practice?
The shift from sampled to full-coverage QA is not just a volume improvement. It changes what questions you can answer:
- With sampling: "Based on the 8% of tickets we reviewed this week, quality appears stable."
- With 100% coverage: "Quality dropped on day three of the product outage. Here are the specific failure patterns across every conversation from that period."
Full coverage also removes selection bias. Manual QA teams, even well-intentioned ones, tend to review tickets that are easy to find: escalations, high-CSAT outliers, or tickets flagged by the helpdesk. The average interaction, which is where systemic issues accumulate, often goes unreviewed for weeks.
About Revelir AI
Revelir AI is an AI customer service platform built for global enterprises, headquartered in Singapore and founded by Rasmus Chow, a YC W22 alumnus. The platform spans three integrated layers: an autonomous Support Agent, RevelirQA (a RAG-powered scoring engine), and Revelir Insights (an insights engine with Claude MCP integration). It runs in production with enterprise clients Xendit and Tiket.com, processing thousands of tickets per week across multilingual, high-volume environments [1]. Revelir integrates with any helpdesk via API, including Zendesk and Salesforce, and is built to serve compliance-sensitive industries that require full auditability on every AI evaluation.
See how RevelirQA scores against your policies, not generic benchmarks.
Learn more or get in touch at www.revelir.ai
References
1. Revelir AI Launches Automated QA Engine, Secures Xendit and Tiket.com as Enterprise Clients, The Tennessean (www.tennessean.com)
2. Revelir Product Walkthrough (www.tella.tv)
