The Script Compliance Problem: Why Policy Adherence Scoring Breaks Down When Agents Switch Languages Mid-Conversation

When conversations switch from English to Bahasa Indonesia mid-ticket, most AI quality assurance systems silently fail. They either skip the score entirely, return a false positive because the policy text is in English while the conversation is not, or worse, flag compliant conversations as non-compliant because the semantic match breaks across languages. The core issue is not translation. It is that policy adherence scoring was built for monolingual conversations, and most enterprise contact centres are not monolingual.

TL;DR

Language-switching mid-conversation (code-switching) is standard practice in Southeast Asian, MENA, and multilingual European contact centres, but most QA tools score only the dominant language.
Script adherence scoring fails when the policy document and the conversation are in different languages, causing both missed policy violations and false non-compliance flags ^[2].
The fix requires QA scoring that evaluates semantic intent across languages, not keyword matching against a fixed-language policy document.
Without full conversation coverage, language-switched tickets fall into a blind spot that manual sampling cannot reliably catch.
Multilingual-aware scoring is now a compliance requirement for regulated industries, not a nice-to-have feature.

About the Author: Revelir AI is an AI quality assurance platform for global enterprise, running production QA scoring across hundreds of thousands of customer service conversations in English, Indonesian, Thai, and Tagalog for enterprise clients including Xendit and Tiket.com. The company's core scoring engine was built from the ground up to handle multilingual and code-switched conversations at scale.

What Is Code-Switching and Why Does It Happen in Contact Centres?

Code-switching is when a speaker alternates between two or more languages within a single conversation. In customer service, it is not a mistake or a workaround. It is a natural communication strategy to match a customer's comfort level, clarify technical terms, or de-escalate tense interactions. A Xendit customer might open a ticket in English but describe a transaction error in Indonesian because that is the language where the detail is most precise for them.

Code-switching is especially prevalent across Southeast Asia, where conversations routinely combine English with Bahasa Indonesia, Tagalog, Thai, or local dialects. But it also appears in:

MENA contact centres mixing Modern Standard Arabic and English
European BPOs handling French-English or Spanish-English tickets
US-based teams serving Spanish-dominant communities

This is not a regional niche. Any enterprise operating in a multilingual market will encounter code-switched tickets. The question is whether their QA system can evaluate them fairly.

Why Do Standard Policy Adherence Scoring Systems Fail Here?

Code-switching is common, so the harder question is why QA tooling has not caught up. Standard script adherence scoring compares a conversation against a policy document using keyword or semantic matching ^[2]. When both the policy and the conversation are in English, this works adequately. When a conversation in Bahasa Indonesia is scored against an English-language SOP, the scorer runs into three specific failure modes:

Failure Mode	What Happens	Business Impact
Semantic mismatch	The AI cannot align "verifikasi identitas" with "identity verification" in the policy, so the compliance check fails even though the conversation was compliant ^[1]	False non-compliance flags; conversations penalised unfairly
Silent skip	The scorer detects a non-English segment and skips scoring that section entirely	Real policy violations go undetected; inflated compliance scores
Partial scoring	Only the English portions are evaluated, giving a score based on an incomplete conversation	QA data is unreliable; coaching decisions are based on partial evidence
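
The semantic-mismatch row can be reproduced in a few lines. The sketch below is a minimal illustration, not any vendor's implementation: `keyword_check` does a literal match against the English policy phrase, while `concept_check` looks the message up in a small hand-written lexicon. `CONCEPT_LEXICON` and both function names are hypothetical, and the lexicon is only a stand-in for the multilingual semantic model a production scorer would need.

```python
# Hypothetical lexicon mapping surface phrases in several languages to one
# policy concept. A real scorer would use a multilingual model instead.
CONCEPT_LEXICON = {
    "identity_verification": [
        "identity verification",
        "verify your identity",
        "verifikasi identitas",  # Bahasa Indonesia
    ],
}


def keyword_check(message: str, policy_phrase: str) -> bool:
    """Literal match against the English policy phrase."""
    return policy_phrase.lower() in message.lower()


def concept_check(message: str, concept: str) -> bool:
    """Match any known surface form of the concept, in any language."""
    text = message.lower()
    return any(phrase in text for phrase in CONCEPT_LEXICON[concept])


# A compliant agent message written in Bahasa Indonesia.
agent_message = "Baik, saya perlu melakukan verifikasi identitas terlebih dahulu."

keyword_result = keyword_check(agent_message, "identity verification")
concept_result = concept_check(agent_message, "identity_verification")
print(keyword_result, concept_result)
```

The keyword check reports a failure for a message that did perform the required step, which is the false non-compliance flag described in the table. The concept-level check passes it because it compares meaning, not surface words.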

The result is that code-switched tickets become a compliance blind spot. Manual QA sampling reviews only 1% to 5% of tickets ^[3], and code-switched tickets are themselves only a fraction of total volume, so a human reviewer is unlikely to see enough of them to spot a pattern.
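
A rough calculation shows how thin the sampled evidence gets. The sketch below assumes each ticket is sampled independently, and every figure in it (weekly volume, code-switched share, violation rate) is an illustrative assumption, not measured data. Only the sampling rate is taken from the 1% to 5% range cited above.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    below = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return 1.0 - below


# All figures below are illustrative assumptions, not measured data.
weekly_tickets = 10_000
code_switched_share = 0.05  # assumed share of tickets that switch language
violation_rate = 0.02       # assumed violation rate within that subset
sampling_rate = 0.02        # within the 1-5% manual sampling range

# Number of violating code-switched tickets in a week under these assumptions.
violating = round(weekly_tickets * code_switched_share * violation_rate)

p_see_one = prob_at_least(1, violating, sampling_rate)
p_see_pattern = prob_at_least(3, violating, sampling_rate)

print(f"violating code-switched tickets: {violating}")
print(f"P(reviewer sees at least 1): {p_see_one:.3f}")
print(f"P(reviewer sees at least 3): {p_see_pattern:.4f}")
```

Under these assumptions a reviewer has roughly an 18% chance of seeing even one violating code-switched ticket in a week, and well under a 1% chance of seeing three. If three sightings is taken as the threshold for recognising a pattern, sampling effectively never gets there.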

Is This a Translation Problem or a Scoring Architecture Problem?

A common instinct is to solve this with a translation pre-processing step: translate everything into English, then score. This sounds reasonable but introduces its own failure chain.

Machine translation flattens nuance. Regulatory disclosures, refund policy language, and escalation phrases carry meaning that shifts in translation.
Translated text loses the original intent signals that the scoring model needs to detect whether, for example, required disclaimers were provided in the correct tone ^[1].
Translation adds latency and cost at scale. For a contact centre processing tens of thousands of tickets per week, this compounds quickly.
It does not solve mixed-language tickets, where a single message contains two languages in a single sentence.
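
The last point is easy to demonstrate. A translate-then-score pipeline typically has to assign each message one source language, and an intra-sentence switch has no single right answer. The sketch below uses two tiny hand-picked word lists (`ENGLISH_HINTS` and `INDONESIAN_HINTS`, both hypothetical and nowhere near real language identification) only to show that one message can legitimately belong to both.

```python
import re

# Tiny illustrative word lists; hypothetical, not real language identification.
ENGLISH_HINTS = {"the", "my", "is", "but", "please", "refund", "still", "not"}
INDONESIAN_HINTS = {"saya", "sudah", "belum", "tapi", "masuk", "uang", "tolong"}


def languages_in(message: str) -> set[str]:
    """Return every language whose hint words appear in the message."""
    tokens = set(re.findall(r"[a-z]+", message.lower()))
    found = set()
    if tokens & ENGLISH_HINTS:
        found.add("en")
    if tokens & INDONESIAN_HINTS:
        found.add("id")
    return found


# One customer sentence that switches from English to Bahasa Indonesia.
message = "My refund is approved tapi dananya belum masuk ke rekening saya."
detected = languages_in(message)
print(sorted(detected))
```

Whichever single language a router picks for this message, half of the sentence is handed to the wrong translation path before scoring even begins.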

The more durable fix is a scoring architecture that understands multiple languages natively, retrieving the relevant SOP in context and matching the conversation's intent against it semantically, regardless of the surface language of the conversation.

What Does Correct Multilingual Policy Adherence Scoring Look Like?

A practical question follows from these failure modes: what would a well-designed multilingual QA scoring system do differently? Several properties are non-negotiable:

Language-agnostic intent detection. The scoring model should evaluate whether required concepts were communicated, not whether specific words matched ^[2].
Policy retrieval in context. The SOP should be retrieved and applied to each conversation segment based on what is being discussed, not just globally appended to a prompt.
Consistent QA scorecard across languages. The same QA scorecard should apply whether the conversation is in English or Tagalog. Varying standards by language creates compliance gaps and coaching inequity ^[4].
100% coverage. A small sample contains too few language-switched tickets to surface patterns. Full conversation scoring is the only reliable approach.
Auditable reasoning. For regulated industries, every score should carry a reasoning trace showing which policy was retrieved, what was said, and why the score was given ^[5].
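
The coverage and auditability properties can be made concrete with a small data model. This is a minimal sketch under stated assumptions: the `Segment` and `ScoreRecord` types, their field names, and the `SOP-GREETING` identifier are all hypothetical and do not describe any particular product's schema. The point is that every score carries the retrieved policy, the evidence, and the reasoning, and that any segment without a record is surfaced instead of silently skipped.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    index: int
    language: str  # e.g. "en", "id"
    text: str


@dataclass(frozen=True)
class ScoreRecord:
    segment_index: int
    criterion: str  # same scorecard criterion regardless of language
    policy_id: str  # which SOP passage was retrieved
    evidence: str   # what was said
    passed: bool
    reasoning: str  # why the score was given


def unscored_segments(segments, records):
    """Return segments with no score record: candidates for a silent skip."""
    scored = {r.segment_index for r in records}
    return [s for s in segments if s.index not in scored]


segments = [
    Segment(0, "en", "Thanks for contacting support, how can I help?"),
    Segment(1, "id", "Baik, saya perlu melakukan verifikasi identitas terlebih dahulu."),
]

# A scorer that only handled the English segment.
records = [
    ScoreRecord(
        segment_index=0,
        criterion="greeting",
        policy_id="SOP-GREETING",  # hypothetical policy identifier
        evidence=segments[0].text,
        passed=True,
        reasoning="Agent opened with the required greeting.",
    ),
]

missing = unscored_segments(segments, records)
print([(s.index, s.language) for s in missing])
```

Here the Indonesian segment comes back as unscored, so the gap is visible in the audit trail and cannot hide behind an inflated compliance score.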

RevelirQA was built with these constraints as first principles. The platform ingests customer SOPs into a vector database and retrieves the relevant policy before every evaluation, scoring in English, Indonesian, Thai, and Tagalog on the same QA scorecard. Xendit and Tiket.com run RevelirQA in production across thousands of tickets per week in these environments.
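
The general retrieve-then-score pattern can be sketched as follows. This is not RevelirQA's implementation. It is a toy illustration in which a dictionary stands in for the vector database, the three-dimensional vectors are made up by hand, and the policy identifiers are hypothetical. The assumption being illustrated is that a multilingual embedding model places the same intent close together whatever language it is expressed in, so both segments retrieve the same SOP.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical policy index: policy id -> hand-made embedding vector.
POLICY_INDEX = {
    "SOP-IDENTITY": [0.9, 0.1, 0.0],
    "SOP-REFUND": [0.1, 0.9, 0.1],
    "SOP-ESCALATION": [0.0, 0.2, 0.9],
}


def retrieve_policy(segment_vector, index=POLICY_INDEX):
    """Return the id of the policy most similar to the segment."""
    return max(index, key=lambda pid: cosine(segment_vector, index[pid]))


# Assumed embeddings for the same intent expressed in two languages.
english_segment = [0.85, 0.15, 0.05]    # "I need to verify your identity first."
indonesian_segment = [0.80, 0.20, 0.05]  # "Saya perlu melakukan verifikasi identitas terlebih dahulu."

english_policy = retrieve_policy(english_segment)
indonesian_policy = retrieve_policy(indonesian_segment)
print(english_policy, indonesian_policy)
```

Because retrieval runs per segment, a conversation that moves from identity checks to refunds would pull a different SOP for each part, in whichever language each part is written.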

What Are the Compliance Risks of Getting This Wrong?

What does the actual exposure look like when multilingual policy adherence scoring breaks down? The risk is not abstract.

In fintech, regulators increasingly expect documented evidence that required disclosures were communicated correctly. A QA system that silently skips non-English content cannot produce that evidence ^[6].
In travel and e-commerce, refund and cancellation policy adherence directly affects customer retention. Missed-policy conversations that only appear in the language-switched segment never surface for coaching.
Agents who switch languages to serve customers better get penalised or under-scored simply because the QA system cannot follow the conversation. This creates a perverse incentive to communicate in a single language even when that is worse for the customer.

"A compliance framework that only works in one language is not a compliance framework. It is a sampling strategy with a language bias."

Frequently Asked Questions

What is script adherence scoring in customer service QA?

Script adherence scoring measures how closely agents follow a defined script or policy during customer interactions. It checks whether required disclosures and procedures were delivered and whether prohibited responses were avoided ^[2].

Why does code-switching cause QA systems to fail?

Most QA tools perform keyword or semantic matching between a conversation and a policy document. When these are in different languages, the match breaks. The system either misses compliance violations or flags compliant conversations incorrectly ^[1].

Is a pre-translation approach sufficient for multilingual QA?

Not reliably. Translation introduces meaning loss, adds cost and latency at scale, and does not handle intra-sentence code-switching. A scoring architecture with native multilingual understanding is more accurate and more efficient.

How does full conversation coverage help with multilingual compliance?

Manual sampling reviews 1-5% of tickets, so the language-switched subset, already a fraction of total volume, contributes only a handful of reviewed tickets. Scoring 100% of conversations ensures that policy violations in non-dominant language segments are caught consistently.

Which industries are most exposed to this problem?

Fintech, travel, and e-commerce operating in multilingual markets carry the highest exposure because policy adherence is tied to regulatory requirements and customer retention. Any enterprise contact centre in Southeast Asia, MENA, or multilingual Europe faces this challenge.

What is RAG-powered QA scoring?

RAG (Retrieval-Augmented Generation) means the QA system retrieves the relevant policy documents from a vector database before scoring each conversation, rather than relying on a static prompt. This ensures the AI is scoring against your actual SOPs, not generic benchmarks.

Does scoring AI conversation systems raise different multilingual challenges than scoring humans?

AI conversation systems can code-switch just as human agents do, particularly when trained on multilingual customer data. A unified QA scoring approach that applies the same QA scorecard to both human agents and AI conversation systems is the only way to maintain consistent quality across a hybrid support team.

About Revelir AI

Revelir AI builds RevelirQA, an AI quality assurance platform that scores 100% of customer service conversations against a company's own policies and QA scorecard. The platform retrieves SOPs via RAG before every evaluation, supports English, Indonesian, Thai, and Tagalog natively, and provides a full reasoning trace on every score for complete auditability. Enterprise clients including Xendit and Tiket.com run RevelirQA in production across thousands of tickets per week. RevelirQA integrates with any helpdesk via API and evaluates both human and AI conversations on a single consistent QA scorecard, giving CX leaders one unified view of quality across their entire support operation.

If your QA scoring goes silent when conversations switch languages, you have a compliance gap you cannot see. Talk to Revelir AI to see how multilingual-aware scoring works in production.

References

[1] Regulatory Script Adherence for AI Voice Agents | Hamming AI Resources (hamming.ai)
[2] Ensuring Script Adherence in a Contact Centre - CX Today (www.cxtoday.com)
[3] Agent Scripting: A Complete Guide for... | Process Shepherd (www.processshepherd.com)
[4] How to Improve Adherence in a Call Center | MiaRec (blog.miarec.com)
[5] Towards Enforcing Company Policy Adherence in Agentic Workflows (arxiv.org)
[6] Compliance policy management: ensuring regulatory adherence (www.dataguard.com)
