Most AI quality assurance scoring fails not because the AI is unintelligent, but because it is uninformed. Generic rubrics score agents against benchmarks that have nothing to do with how a specific company defines a good interaction. RevelirQA solves this with a fundamentally different architecture: before evaluating any conversation, the scoring engine retrieves the company's own SOPs and policies from a vector database using Retrieval-Augmented Generation (RAG), then scores every ticket against those exact standards. The result is consistent, auditable, policy-grounded evaluation at 100% conversation coverage with no sampling bias and no generic guesswork [1].
- RevelirQA retrieves your actual policies via RAG before scoring each conversation, rather than evaluating against generic industry benchmarks.
- Every evaluation produces a full reasoning trace: model used, prompt, documents retrieved, and score rationale.
- 100% coverage replaces manual sampling, eliminating the blind spots that come with reviewing only a fraction of tickets.
- The same scoring rubric evaluates both human agents and AI agents, giving CX leaders a unified quality view.
- Enterprise clients Xendit and Tiket.com are already running this in production at high volume [1].
Why Do Most AI Scoring Engines Get QA Wrong?
The core failure mode of conventional AI scoring is context blindness. A generic model asked to score a customer service conversation has no knowledge of your refund policy, your escalation thresholds, or the specific tone your brand requires in a complaint scenario. It applies averaged, inferred standards derived from training data, which may or may not resemble how your business actually defines quality.
The downstream consequences are significant:
- Agents get penalised or rewarded for reasons unrelated to your actual standards.
- Coaching feedback becomes difficult to defend when it isn't grounded in documented policy.
- In regulated industries like fintech, a score without a retrievable evidence trail creates compliance exposure.
"The problem isn't that AI can't score conversations. It's that most AI scores conversations without reading the rulebook first."
Manual QA sampling has its own structural problem: even a thorough team reviewing a fixed percentage of tickets each week leaves the majority of conversations unexamined. Volume spikes, new agent onboarding periods, and product launches are exactly when quality risks increase, and they are also exactly when sampled review is least representative.
How Does RAG-Powered QA Actually Work?
Retrieval-Augmented Generation (RAG) is an architecture that combines a language model's reasoning ability with a retrieval step: before generating a response or evaluation, the system fetches relevant documents from a knowledge store and includes them as context [1].
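Stripped to its essentials, the pattern is retrieve, augment, generate. The sketch below is a minimal illustration under assumed interfaces: `vector_db` and `llm` are hypothetical stand-ins for a vector store and a model client, not any specific vendor's API.

```python
# The RAG pattern in miniature. `vector_db` and `llm` are hypothetical
# stand-ins, not a real library's API.
def rag_evaluate(conversation: str, vector_db, llm) -> str:
    # Retrieve: fetch the documents most relevant to this conversation.
    docs = vector_db.search(conversation, top_k=5)
    # Augment: place the retrieved documents in the model's context.
    prompt = (
        "Policies:\n" + "\n\n".join(docs)
        + "\n\nEvaluate this conversation against the policies above:\n"
        + conversation
    )
    # Generate: the model now reasons with the rulebook in view.
    return llm.generate(prompt)
```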
In RevelirQA, the process works in four steps (a code sketch follows the list):
- Ingestion: Your knowledge base, SOPs, and internal policies are ingested and embedded into a vector database.
- Retrieval: When a conversation is submitted for scoring, the engine runs a semantic search to retrieve the documents most relevant to that specific interaction (e.g. the refund policy if the ticket concerns a refund).
- Scoring: The language model receives both the conversation and the retrieved policy documents, then evaluates agent performance against those specific standards.
- Trace generation: Every score is accompanied by a full reasoning trace: which model was used, what prompt was applied, which documents were retrieved, and how the score was derived [1].
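To make the ingestion and retrieval steps concrete, here is a self-contained sketch of embedding-based semantic search using cosine similarity. The toy `embed` function and in-memory store are illustrative stand-ins; a production system would use a real embedding model and a dedicated vector database.

```python
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    # Toy embedding for illustration (hashed bag-of-words); a real
    # system would call an embedding model here instead.
    v = np.zeros(dim)
    for token in text.lower().split():
        v[hash(token) % dim] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm else v

class PolicyStore:
    """In-memory stand-in for a vector database."""
    def __init__(self) -> None:
        self.vectors: list[np.ndarray] = []
        self.docs: list[str] = []

    def ingest(self, doc: str) -> None:
        # Ingestion: embed each SOP/policy and store it for retrieval.
        self.vectors.append(embed(doc))
        self.docs.append(doc)

    def search(self, query: str, top_k: int = 3) -> list[str]:
        # Retrieval: rank policies by cosine similarity to the query.
        q = embed(query)
        sims = [float(q @ v) for v in self.vectors]
        ranked = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
        return [self.docs[i] for i in ranked[:top_k]]

store = PolicyStore()
store.ingest("Refund policy: full refunds within 14 days of purchase.")
store.ingest("Escalation policy: route chargebacks to the risk team.")
# A refund ticket deterministically pulls the refund policy first.
print(store.search("Customer requests a refund for last week's order"))
```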
This means two agents handling the same type of ticket in different weeks are scored against the same retrieved policy, not against whatever a model infers is appropriate.
What Does a Full Audit Trail on Every Score Actually Mean?
For most QA platforms, a score is a number with a rationale summary. For RevelirQA, every evaluation is fully observable:
| Audit Trail Component | What It Contains | Why It Matters |
|---|---|---|
| Model used | Which AI model produced the evaluation | Reproducibility and version accountability |
| Prompt | The exact instruction given to the model | Transparency into scoring criteria applied |
| Documents retrieved | The specific SOPs or policies fetched via RAG | Proves the score is policy-grounded, not generic |
| Reasoning trace | Step-by-step evaluation logic | Enables agent coaching with specific, defensible justification |
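As an illustration of what such a record could look like when serialized (a hypothetical schema inferred from the table above, not RevelirQA's actual format):

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class EvaluationTrace:
    # Field names mirror the table above; they are illustrative,
    # not RevelirQA's published schema.
    model: str                 # which AI model produced the evaluation
    prompt: str                # the exact instruction given to the model
    retrieved_docs: list[str]  # the SOPs/policies fetched via RAG
    reasoning: str             # step-by-step evaluation logic
    score: float

trace = EvaluationTrace(
    model="scoring-model-v3",
    prompt="Score this ticket against the retrieved refund policy...",
    retrieved_docs=["refund-policy-v12", "tone-guidelines-v4"],
    reasoning="Agent confirmed eligibility per the 14-day window...",
    score=4.5,
)
print(json.dumps(asdict(trace), indent=2))  # one auditable record per score
```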
This level of observability is not a nice-to-have for industries like fintech. It is a requirement. Xendit and Tiket.com operate in environments where every customer interaction may carry regulatory or reputational weight, and the ability to show exactly how a quality score was derived is central to operating responsibly at scale [1].
How Does RevelirQA Handle Human and AI Agents Differently?
It does not treat them differently, and that is intentional. As enterprises deploy AI agents alongside human representatives, quality assurance faces a fragmentation problem: separate rubrics for bots and humans create inconsistent standards and make it impossible to compare performance across the full support operation.
RevelirQA applies the same policy-grounded rubric to every conversation regardless of who or what handled it. A ticket resolved by the Revelir Support Agent is scored with the same retrieved SOPs and the same evaluation logic as a ticket handled by a senior human agent. This gives CX leaders a unified, comparable quality view across their entire operation [2].
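One way to picture that design, as a hypothetical sketch rather than the product's actual code: the handler type is carried as reporting metadata but never branches the evaluation logic, so human and AI tickets flow through an identical scoring path.

```python
# Hypothetical sketch: `handled_by` is metadata only, never a branch.
# `store` and `llm` are assumed stand-ins, as in the earlier sketches.
def score_conversation(conversation: str, handled_by: str, store, llm) -> dict:
    # Same retrieval and the same rubric, whoever handled the ticket.
    policies = store.search(conversation, top_k=5)
    evaluation = llm.generate(
        "Policies:\n" + "\n\n".join(policies)
        + "\n\nScore this conversation:\n" + conversation
    )
    return {
        "handled_by": handled_by,  # "human" or "ai", for reporting only
        "evaluation": evaluation,  # identical rubric for both
    }
```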
What Does 100% Coverage Change in Practice?
The shift from sampled to full-coverage QA is not just a volume improvement. It changes what questions you can answer:
- With sampling: "Based on the 8% of tickets we reviewed this week, quality appears stable."
- With 100% coverage: "Quality dropped on day three of the product outage. Here are the specific failure patterns across every conversation from that period."
Full coverage also removes selection bias. Manual QA teams, even well-intentioned ones, tend to review tickets that are easy to find: escalations, high-CSAT outliers, or tickets flagged by the helpdesk. The average interaction, which is where systemic issues accumulate, often goes unreviewed for weeks.
About Revelir AI
Revelir AI is an AI customer service platform built for global enterprises, headquartered in Singapore and founded by Rasmus Chow, a YC W22 alumnus. The platform spans three integrated layers: an autonomous Support Agent, RevelirQA (a RAG-powered scoring engine), and Revelir Insights (an insights engine with Claude MCP integration). It runs in production with enterprise clients Xendit and Tiket.com, processing thousands of tickets per week across multilingual, high-volume environments [1]. Revelir integrates with any helpdesk via API, including Zendesk and Salesforce, and is built to serve compliance-sensitive industries that require full auditability on every AI evaluation.
See how RevelirQA scores against your policies, not generic benchmarks.
Learn more or get in touch at www.revelir.ai
References
1. Revelir AI Launches Automated QA Engine, Secures Xendit and Tiket.com as Enterprise Clients, The Tennessean (www.tennessean.com)
2. Revelir Product Walkthrough (www.tella.tv)
