Why Code-Switching Kills Your QA Scores

Code-switching, the practice of flipping between two or more languages mid-conversation, is the default communication style across much of Southeast Asia's customer service landscape. A customer service representative at an Indonesian fintech might open in formal Bahasa Indonesia, pivot to English for a technical term, then close with a casual "oke ya, kak" to soften the interaction. This is not sloppy communication; it is culturally fluent service. The problem is that most QA frameworks were not designed for it. When a scoring system cannot parse mixed-language conversations accurately, it penalises service teams for cultural competence, rewards stiff formality, and produces scores that bear no resemblance to actual service quality. The fix is a QA approach built from the ground up for multilingual, high-context conversations.

TL;DR

Code-switching is a feature of great service in Southeast Asia, not a quality problem. Your QA framework needs to reflect that.
Most QA tools fail on mixed-language conversations because they evaluate language structure, not intent and policy compliance.
Keyword-based and translation-layer scoring introduce systematic bias against multilingual customer service interactions.
Effective multilingual QA scores against your own SOPs and QA scorecard, not generic benchmarks.
Full conversation coverage matters more in multilingual environments, because sampling bias compounds when reviewers unconsciously favour tickets they can easily read.

About the Author: Revelir AI builds AI customer service QA software for high-volume customer service teams. Its scoring engine, RevelirQA, runs on thousands of conversations per week for enterprise clients including Xendit and Tiket.com, with proven multilingual scoring across Indonesian-language, English, Thai, and Tagalog environments.

What is code-switching, and why is it the norm in Southeast Asian customer service?

Code-switching is the alternation between two or more languages or dialects within a single conversation, often within a single sentence. In Southeast Asia, it is not an edge case. It reflects how people actually communicate in high-density multilingual societies. A Tiket.com customer asking about a refund might write: "Halo kak, mau nanya soal my booking yang kemarin, it says pending tapi uangnya udah kedebet." That sentence blends a friendly honorific ("kak"), colloquial Bahasa Indonesia ("mau nanya"), English words ("my booking," "it says pending"), and a colloquial variant of a formal term ("kedebet" for "terdebet"). A good service representative mirrors this register. A bad QA tool marks it as inconsistent or unresolved.

This matters because the volume is enormous. Indonesia alone processes hundreds of millions of customer service interactions annually across fintech, e-commerce, and travel. If your QA framework cannot read these conversations correctly, you are scoring in the dark.

Where do traditional QA approaches break down on mixed-language tickets?

The failure modes are systematic, not random. Understanding them is the first step to fixing them.

QA Method	How It Handles Code-Switching	The Resulting Problem
Manual sampling	Reviewers may unconsciously select tickets they can read fluently	Mixed-language tickets are underrepresented; pattern gaps go undetected
Keyword matching	Looks for terms in one language; misses synonyms or slang in another	Service teams who use contextually correct informal language get penalised
Translate-then-score	Converts to English before evaluating	Tone, register, and cultural nuance are lost in translation; empathy markers vanish
Generic AI scoring	Trained on majority-English data; treats Bahasa slang as noise	Consistent bias against service teams who communicate in culturally appropriate ways

Manual sampling is especially dangerous here. Traditional QA teams review somewhere between 1% and 5% of all tickets ^[1]. In a multilingual environment, that sample is almost certainly not representative, because reviewers naturally gravitate toward tickets they can evaluate quickly. The mixed-language conversations end up in the unreviewed pile, and the coaching insights buried in them never surface.
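
To make the sampling effect concrete, here is a minimal Python sketch. Every figure in it (ticket volume, share of mixed-language tickets, how strongly reviewers avoid them) is an illustrative assumption, not measured data, and the weighting is a simple approximation of biased selection rather than a model of any real review process.

```python
def expected_mixed_reviews(total_tickets, mixed_share, sample_rate, skip_bias):
    """Approximate how many mixed-language tickets land in a review sample.

    skip_bias is a hypothetical weight in [0, 1]: 0 means reviewers pick
    tickets at random, 1 means they never pick a mixed-language ticket.
    """
    sample_size = total_tickets * sample_rate
    # Relative chance of being picked: 1.0 for single-language tickets,
    # (1 - skip_bias) for mixed-language tickets.
    mixed_weight = mixed_share * (1 - skip_bias)
    other_weight = (1 - mixed_share) * 1.0
    share_in_sample = mixed_weight / (mixed_weight + other_weight)
    return sample_size * share_in_sample


# Assumed scenario: 10,000 tickets, 40% mixed-language, 2% reviewed.
unbiased = expected_mixed_reviews(10_000, 0.40, 0.02, skip_bias=0.0)
biased = expected_mixed_reviews(10_000, 0.40, 0.02, skip_bias=0.75)
print(round(unbiased), round(biased))
```

With these made-up numbers, random selection would put about 80 mixed-language tickets in a 200-ticket sample, while the biased selection puts about 29 there. The review rate is unchanged; only the representation shifts.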

Why does scoring tone and empathy get harder when languages mix?

Beyond the structural failures above, there is a subtler problem: empathy is language-specific. "Maaf ya kak, ini memang tidak seharusnya terjadi" carries a warmth and personal accountability that a literal English translation ("Sorry, this should not have happened") strips out entirely. A QA scorecard criterion like "service team acknowledged customer frustration" can only be fairly evaluated if the scoring system understands what acknowledgement looks like in Bahasa Indonesia, including informal registers.

The same applies to closing statements, escalation language, and solution confirmation. Indonesian customer service has culturally specific closing norms ("Ada lagi yang bisa kami bantu?") that differ from English conventions. An AI that flags the absence of an English-style closing as a policy miss is not scoring quality; it is scoring cultural conformity.

Empathy markers in Bahasa: "ya kak," "maaf banget," "paham kok" signal warmth in ways a translate-first system will flatten.
Escalation language: "Saya eskalasikan ke tim terkait" and "I'll escalate this" are functionally identical, but a keyword system may only recognise one.
Slang resolution confirmation: "Oke fix ya" is a valid resolution confirmation in casual Indonesian service contexts.
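
The keyword gap in the examples above can be sketched in a few lines of Python. This is an illustration only: the rule set and the intent lexicon are hypothetical, the phrases come from the examples in this article, and a phrase list is a deliberately simple stand-in for the model-based intent detection a production scoring engine would need.

```python
import re

# Hypothetical single-language keyword rule of the kind described above.
ENGLISH_ONLY_KEYWORDS = {"escalation": ["escalate"]}

# Hypothetical intent lexicon: each intent lists phrases in every language
# the team actually uses, so either phrasing counts as the same intent.
INTENT_PHRASES = {
    "escalation": ["escalate", "eskalasikan"],
    "empathy": ["sorry", "maaf", "paham kok"],
    "resolution_confirmation": ["oke fix ya"],
}


def detect(message, lexicon):
    """Return the set of intents whose phrases appear in the message."""
    text = message.lower()
    found = set()
    for intent, phrases in lexicon.items():
        for phrase in phrases:
            # Whole-word match so short phrases do not fire inside other words.
            if re.search(r"\b" + re.escape(phrase) + r"\b", text):
                found.add(intent)
    return found


reply = "Saya eskalasikan ke tim terkait"
print(detect(reply, ENGLISH_ONLY_KEYWORDS))  # the English-only rule finds nothing
print(detect(reply, INTENT_PHRASES))         # the intent lexicon finds the escalation
```

The English-only rule treats a correct Indonesian escalation as if no escalation happened, which is exactly the false policy miss described above.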

What does a QA framework that actually handles code-switching look like?

A QA framework built for multilingual environments rests on three principles: evaluate intent over language, score against your own policies rather than generic benchmarks, and cover 100% of conversations.

1. Score intent and policy compliance, not linguistic form

The question a QA scorecard should answer is not "did the service representative say the right words?" but "did the service representative communicate the right information, in a way that aligns with our SOPs, and in a tone appropriate for the customer?" This requires a scoring engine that can reason across languages simultaneously, not one that normalises everything to a single tongue first.

2. Ground every evaluation in your own SOPs

Generic AI models score against generic standards. If your refund policy says disputes must be acknowledged within the first two exchanges, that criterion needs to be evaluated against your actual policy document, retrieved at scoring time. RevelirQA ingests a company's knowledge base and SOPs into a vector database, then retrieves the relevant policy before scoring each conversation. This means a mixed-language conversation about a Xendit payment dispute is scored against Xendit's actual SOP, not a generalised fintech template.
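
The retrieve-then-score flow described here can be sketched as follows. This is not RevelirQA's implementation: the SOP snippets, function names, and request shape are all hypothetical, and plain word overlap stands in for the embedding similarity a vector database would provide.

```python
import re

# Hypothetical SOP snippets; a real deployment would load the company's own documents.
SOPS = {
    "refund_dispute": "refund dispute must be acknowledged within the first two exchanges",
    "password_reset": "verify identity before sending a password reset link",
}


def tokens(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve_sop(conversation, sops):
    """Return the id of the SOP sharing the most words with the conversation.

    Word overlap is a toy stand-in for similarity search in a vector database.
    """
    convo = tokens(" ".join(conversation))
    return max(sops, key=lambda sop_id: len(convo & tokens(sops[sop_id])))


def build_scoring_request(conversation, sops):
    """Attach the retrieved policy text so scoring is grounded in that SOP."""
    sop_id = retrieve_sop(conversation, sops)
    return {"sop_id": sop_id, "policy_text": sops[sop_id], "conversation": conversation}


conversation = [
    "Halo kak, mau nanya soal refund booking saya, statusnya masih dispute",
    "Maaf ya kak, refund-nya kami cek dulu",
]
request = build_scoring_request(conversation, SOPS)
print(request["sop_id"])
```

The toy retrieval only finds the right policy here because the customer used the English loanwords "refund" and "dispute". A conversation phrased entirely in Indonesian would share no words with an English SOP, which is why the retrieval step itself has to be multilingual.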

3. Cover every conversation, not a sample

Sampling bias compounds in multilingual environments. When you score 100% of conversations, the mixed-language tickets get the same scrutiny as the clean English ones. Patterns that would never surface in a 1-5% sample, such as service teams consistently mishandling Indonesian-language escalation requests, become visible ^[1].

How should QA scorecards be structured for multilingual service teams?

A QA scorecard for mixed-language environments should separate language-agnostic criteria from language-specific ones. Conflating them produces scores that punish cultural fluency.

Language-agnostic criteria (evaluate in any language): policy compliance, resolution accuracy, escalation handling, first-contact resolution, correct product information.
Language-sensitive criteria (require contextual interpretation): tone appropriateness, empathy expression, greeting and closing format, use of honorifics.
Context-dependent criteria: register matching (was the service representative's formality level appropriate to the customer's opening register?), whether slang use was customer-led or service-team-initiated.

Custom scoring metrics, whether binary, multi-option, or scored, allow QA teams to define these criteria in terms that reflect actual service expectations rather than imported templates from markets where code-switching does not occur.
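
One way to express that separation is as a small data structure. The sketch below is an assumption about how a team might model its scorecard, not a description of any product's schema; the class names, criterion names, and metric assignments are illustrative.

```python
from dataclasses import dataclass
from enum import Enum


class Sensitivity(Enum):
    LANGUAGE_AGNOSTIC = "language_agnostic"
    LANGUAGE_SENSITIVE = "language_sensitive"
    CONTEXT_DEPENDENT = "context_dependent"


class MetricType(Enum):
    BINARY = "binary"
    MULTI_OPTION = "multi_option"
    SCORED = "scored"


@dataclass(frozen=True)
class Criterion:
    name: str
    sensitivity: Sensitivity
    metric_type: MetricType


# Illustrative scorecard; the metric type chosen for each criterion is an assumption.
SCORECARD = [
    Criterion("policy_compliance", Sensitivity.LANGUAGE_AGNOSTIC, MetricType.BINARY),
    Criterion("resolution_accuracy", Sensitivity.LANGUAGE_AGNOSTIC, MetricType.BINARY),
    Criterion("tone_appropriateness", Sensitivity.LANGUAGE_SENSITIVE, MetricType.SCORED),
    Criterion("empathy_expression", Sensitivity.LANGUAGE_SENSITIVE, MetricType.SCORED),
    Criterion("register_matching", Sensitivity.CONTEXT_DEPENDENT, MetricType.MULTI_OPTION),
]


def group_by_sensitivity(scorecard):
    """Group criterion names so each group can be evaluated with the right lens."""
    groups = {sensitivity: [] for sensitivity in Sensitivity}
    for criterion in scorecard:
        groups[criterion.sensitivity].append(criterion.name)
    return groups


for sensitivity, names in group_by_sensitivity(SCORECARD).items():
    print(sensitivity.value, names)
```

Keeping the sensitivity label on each criterion makes it explicit which checks can be applied identically in any language and which ones need to be interpreted against the norms of the language the customer chose.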

Frequently Asked Questions

Can AI QA tools accurately score Bahasa Indonesia conversations without translating them first?

Yes, provided the underlying model has sufficient multilingual training data and the scoring logic is grounded in your own SOPs rather than English-language benchmarks. Translation-first approaches lose tone and cultural nuance; native multilingual evaluation preserves them.

Does code-switching indicate lower service quality?

No. In Southeast Asian markets, code-switching is a sign of cultural fluency. Penalising it in QA scores produces misleading performance data and can demotivate service teams who are genuinely serving customers well.

How do you set a consistent QA scorecard when service teams work in multiple languages?

Separate criteria that are language-agnostic (policy compliance, resolution accuracy) from those that are language-sensitive (tone, empathy, closing format). Apply the same standards across languages, but interpret language-sensitive criteria in the context of each language's norms.

Why does sampling bias matter more in multilingual environments?

Because manual reviewers unconsciously favour tickets they can read quickly and confidently. Mixed-language tickets tend to be skipped, meaning the most culturally complex conversations, often the highest-risk ones, are never reviewed.

Can the same QA platform evaluate both human service teams and AI chatbots in mixed-language environments?

It should. As companies deploy AI chatbots alongside human service teams, a unified scoring engine that applies the same QA scorecard to both gives CX leaders a consistent view of quality across the entire support operation, regardless of who or what handled the conversation.

What is the risk of using a generic AI model to score Indonesian-language tickets?

Generic models are trained predominantly on English data. Applied to Bahasa Indonesia or Betawi slang, they produce systematic scoring errors: missed empathy signals, false policy violations, and tone misreadings. The result is a QA score that reflects the model's language limitations, not the service team's actual performance.

How do you handle slang or informal contractions ("kedebet," "oke fix ya") in QA evaluation?

A contextually aware scoring engine should recognise these as valid functional equivalents of formal terms, provided the underlying intent and policy compliance are met. Flagging informal language as a quality miss without checking whether the substance of the response was correct is a false positive that erodes team trust in the QA process ^[1].

About Revelir AI

Revelir AI builds AI customer service QA software for customer service teams that operate at scale. Its scoring engine, RevelirQA, evaluates 100% of support conversations against a company's own policies and QA scorecard, with a full reasoning trace behind every score. RevelirQA is in production at Xendit and Tiket.com, scoring thousands of tickets per week across English, Indonesian-language, Thai, and Tagalog environments. Built for global enterprise teams, the platform delivers consistent, auditable, and multilingual QA at volumes that manual review cannot reach.

Ready to score every conversation, in every language your teams actually use?

Learn more or get in touch with Revelir AI at www.revelir.ai

References

[1] Why Scoring 100% of Calls Could Make Your Agents Worse. Bucher + Suter (www.bucher-suter.com)
