How to Detect Policy Violations When Customers Write in Romanised Bahasa Indonesia, Thai Transliteration, or Informal Tagalog

Detecting policy violations in multilingual customer service conversations is already hard. It becomes significantly harder when those conversations are written not in standard orthography but in informal romanised Bahasa Indonesia (common in WhatsApp and live chat), phonetically spelled Thai, or the blended Tagalog-English mix that Filipino customers use every day. Standard QA tools trained on formal English, or even on standard Indonesian, often fail to catch violations buried inside informal, code-switched language. This article explains why that gap exists, what makes these languages structurally difficult for automated compliance monitoring, and how QA teams can close it.

TL;DR

Romanised Bahasa Indonesia, Thai transliteration, and informal Tagalog create blind spots for QA tools built on formal-language assumptions.
Code-switching, slang, and non-standard spelling mean policy keywords appear in forms most engines do not recognise.
Automated compliance monitoring that uses policy-grounded reasoning (not keyword matching) is the only reliable fix at scale.
Multilingual sentiment analysis must account for emotional register shifts that happen mid-conversation and mid-language.
The best QA automation tools for Southeast Asian contact centres score 100% of conversations, not a sampled subset, so no violation pattern hides in the unreviewed majority.

About the Author: This article is written by the team at Revelir AI, whose QA scoring engine runs on thousands of live customer service conversations per week at companies including Xendit and Tiket.com, scoring interactions written in Indonesian, Thai, and Tagalog alongside English.

Why Does Informal Language Break Most QA Detection Systems?

Most automated QA tools are built on a monolingual, formal-language assumption: the agent writes in standard English (or one standard target language), and the system checks that output against a predefined list of required phrases or prohibited keywords. That architecture works reasonably well in a single-language, formal-register contact centre. It fails when customers and agents communicate in ways that are linguistically valid but orthographically unpredictable.

The core problem is not the language itself; it is the gap between how these languages are written in real customer service interactions versus how NLP systems expect to see them.

Romanised Bahasa Indonesia: Indonesian is already written in the Latin alphabet, so romanisation itself is not the problem. Casual digital writing, however, introduces heavy abbreviation and colloquial forms ("gak" for "tidak," "udah" for "sudah"), slang terms that carry strong sentiment, and Betawi or Javanese dialect insertions that a model trained on standard Indonesian will not recognise ^[3].
Thai transliteration: Thai is written in its own script. When customers or agents type Thai phonetically in roman characters (common on WhatsApp), the same word can be spelled four or five different ways with no standardisation. A policy keyword like a required disclosure phrase will never appear in its expected form ^[3].
Informal Tagalog (Taglish): Filipino customer service conversations routinely switch between Tagalog and English mid-sentence. "Nakapag-request na po ako pero wala pang update" wraps the English word "request" in a Tagalog verb form, adds a politeness marker ("po"), and ends on the English loanword "update." It contains no English phrase a filter could match, yet the emotional weight and the implied SLA complaint are entirely clear to a human reviewer and entirely opaque to a keyword-based QA filter ^[3].

The research term for this is code-switching: the practice of alternating between two or more languages or registers within a single conversation ^[3]. It is not an edge case in Southeast Asian customer service. It is the default mode.
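To make the spelling problem concrete, here is a minimal Python sketch of a token-normalisation step placed in front of a simple keyword check. The mapping contains only the two pairs mentioned above, the sample message and the function names `normalise` and `contains_keyword` are hypothetical, and a real lexicon would be far larger and would still miss new slang.

```python
import re

# Illustrative mapping only: these two pairs come from the examples above.
# Any production lexicon would need many more entries (assumption).
INFORMAL_TO_STANDARD = {
    "gak": "tidak",
    "udah": "sudah",
}

def normalise(text: str) -> str:
    """Lowercase the text and replace known informal tokens with standard forms."""
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    return " ".join(INFORMAL_TO_STANDARD.get(tok, tok) for tok in tokens)

def contains_keyword(text: str, keyword: str) -> bool:
    """Whole-word keyword check of the kind a simple QA filter might use."""
    return re.search(rf"\b{re.escape(keyword)}\b", text.lower()) is not None

# Hypothetical customer message in informal Indonesian.
message = "Refund saya udah diproses tapi uangnya gak masuk"

print(contains_keyword(message, "tidak"))             # raw text uses "gak"
print(contains_keyword(normalise(message), "tidak"))  # checked after normalisation
```

A lookup table like this helps only with spellings someone has already listed, which is why it tends to lag behind how customers actually write.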

What Specific Policy Violations Are Most Likely to Be Missed?

Building on that linguistic picture, the violations that slip through are predictable once you understand where the detection logic breaks down.

Violation Type	Why It Gets Missed in Informal Language	Example Context
Missing required disclosures	Agent delivers the disclosure in informal Bahasa ("jadi ini ya prosesnya...") instead of the scripted phrase; keyword match fails.	Fintech, regulated product explanations
Unauthorised promises or commitments	Commitment phrased in Taglish slang ("sure naman 'yan, by tomorrow") does not match any flagged English phrase.	E-commerce fulfilment, refund timelines
Escalation SOP not followed	Thai transliteration of "I will check with my supervisor" varies too widely for pattern matching.	Travel, fintech complaints
Sensitive data handling breaches	Agent confirms OTP or account detail in romanised Indonesian slang that a data-handling policy keyword list never covers.	Fintech, banking service
Tone or professionalism violations	Dismissive language in Tagalog ("bahala na") reads as neutral to an English-only sentiment model.	Any high-volume service queue

The data protection angle adds regulatory weight. Indonesia's Personal Data Protection Law places specific obligations on how organisations handle data shared in service interactions ^[1]. A tool that cannot read an Indonesian-language conversation accurately cannot tell you whether an agent handled that data correctly.

How Does Multilingual Sentiment Analysis Fit Into This Problem?

A related but distinct question is whether sentiment analysis can serve as a proxy for detecting violations, since an unhappy customer might signal that something went wrong even if the specific policy breach is hard to identify. The short answer: sentiment is useful context, but it is not a substitute for policy-grounded evaluation.

Multilingual sentiment analysis in Southeast Asian languages faces its own structural challenges:

Politeness markers in Tagalog ("po," "opo") and Javanese-inflected Indonesian can make a frustrated message read as neutral in tone even when the customer is expressing serious dissatisfaction.
Sentiment can shift direction across a single conversation, starting negative and ending resolved, or starting polite and ending hostile. A single sentiment score on a ticket hides this arc entirely.
Code-switched sentences often carry sentiment in one language and factual content in another, which confuses models trained to score a sentence holistically.

The more precise approach is to track sentiment at the start and end of a conversation separately, surfacing cases where a customer's tone worsened despite a technically "resolved" ticket. That sentiment arc is a stronger signal of retention risk than a single aggregated score.
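The start-versus-end comparison can be sketched in a few lines of Python. This assumes each message already carries a sentiment score between -1.0 and 1.0 from whatever model you use; the `Message` class, the two-message window, and the 0.3 drop threshold are illustrative choices, not recommendations.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Message:
    sender: str       # "customer" or "agent"
    sentiment: float  # assumed range: -1.0 (negative) to 1.0 (positive)

def sentiment_arc(messages: list, window: int = 2) -> tuple:
    """Return (start, end) sentiment averaged over the first and last
    `window` customer messages."""
    scores = [m.sentiment for m in messages if m.sender == "customer"]
    if not scores:
        raise ValueError("conversation has no customer messages")
    return mean(scores[:window]), mean(scores[-window:])

def worsened_despite_resolution(messages: list, resolved: bool, drop: float = 0.3) -> bool:
    """Flag resolved tickets where customer sentiment fell by at least `drop`."""
    start, end = sentiment_arc(messages)
    return resolved and (start - end) >= drop

# Hypothetical conversation: polite at the start, hostile at the end.
conversation = [
    Message("customer", 0.2),
    Message("agent", 0.5),
    Message("customer", 0.0),
    Message("agent", 0.4),
    Message("customer", -0.6),
    Message("customer", -0.8),
]

print(sentiment_arc(conversation))
print(worsened_despite_resolution(conversation, resolved=True))
```

In this made-up conversation the customer starts at roughly 0.1 and ends at roughly -0.7, so the ticket is flagged even though it was closed as resolved. A single averaged score would sit near -0.3 and hide that movement.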

What Makes a QA Tool Actually Capable of Handling These Languages?

Stepping back from the linguistic detail, the practical question for a QA or CX operations leader is: what separates a tool that actually works from one that merely claims multilingual support?

Three capabilities matter most:

Policy-grounded reasoning, not keyword matching. A scoring engine that retrieves your actual SOPs and evaluates whether the agent's response fulfils the policy's intent will catch informal-language compliance failures that a keyword filter never could. The AI understands what the policy requires and judges the conversation against that standard, regardless of which words the agent chose.
100% conversation coverage. Manual QA typically reviews only a small sample, somewhere between 1% and 5% of tickets, and reviewers tend to pull tickets that are already flagged or easy to assess. The violations buried in informal-language conversations, which are harder to review quickly, are systematically under-represented in that sample. Automated scoring of every conversation removes that sampling bias.
An auditable reasoning trace. When the system flags a violation in a Thai transliteration or Taglish conversation, a QA analyst needs to understand why. A trace showing which policy document was retrieved, what the model was asked, and how it reasoned to its score is what separates a defensible finding from a black-box alert.
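The coverage point can be put into rough numbers. The sketch below assumes tickets are picked for review independently and at random, which is more generous than real reviewer behaviour; the function name and the figure of 20 affected tickets are hypothetical.

```python
def chance_pattern_unreviewed(review_rate: float, affected_tickets: int) -> float:
    """Probability that none of the affected tickets is picked for review,
    assuming each ticket is sampled independently at `review_rate`."""
    if not 0.0 <= review_rate <= 1.0:
        raise ValueError("review_rate must be between 0 and 1")
    return (1.0 - review_rate) ** affected_tickets

# Hypothetical scenario: a violation pattern appears in 20 tickets.
for rate in (0.01, 0.05, 1.0):
    p = chance_pattern_unreviewed(rate, affected_tickets=20)
    print(f"review rate {rate:.0%}: {p:.1%} chance the pattern is never reviewed")
```

Under these assumptions, a pattern that shows up in 20 tickets has about an 82% chance of never being seen at a 1% review rate and about a 36% chance at 5%. Scoring every ticket takes that chance to zero.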

This is the architecture that Revelir AI's RevelirQA, an AI quality assurance platform, is built around. It ingests a client's own knowledge base and SOPs into a vector database, retrieves the relevant policy context before evaluating each conversation, and produces a score with a full reasoning trace behind it. Xendit and Tiket.com run this across thousands of tickets per week in Indonesian-language and multilingual queues, in production, not as a pilot.
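As a rough illustration of the retrieve-then-evaluate pattern with a trace, consider the Python sketch below. It is not RevelirQA's implementation: the class names, the policy texts, and the word-overlap retriever are all hypothetical stand-ins, and a production system would use embeddings in a vector database and a model call to fill in the reasoning and score.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PolicyDoc:
    doc_id: str
    text: str

@dataclass
class EvaluationTrace:
    conversation_id: str
    retrieved_policy_id: str         # which policy document was retrieved
    prompt: str                      # what the model was asked
    reasoning: str = ""              # to be filled in by the scoring model
    score: Optional[float] = None    # to be filled in by the scoring model

def retrieve_policy(query: str, policies: list) -> PolicyDoc:
    """Toy retriever: pick the policy sharing the most words with the query.
    Stands in for an embedding search (assumption for this sketch)."""
    query_words = set(query.lower().split())
    return max(policies, key=lambda p: len(query_words & set(p.text.lower().split())))

def build_trace(conversation_id: str, topic: str, transcript: str, policies: list) -> EvaluationTrace:
    """Retrieve the relevant policy and record the exact prompt for auditing."""
    policy = retrieve_policy(topic, policies)
    prompt = (
        f"Policy ({policy.doc_id}): {policy.text}\n"
        f"Conversation: {transcript}\n"
        "Did the agent fulfil the intent of this policy? Explain, then score 0-1."
    )
    return EvaluationTrace(conversation_id, policy.doc_id, prompt)

# Hypothetical policy documents.
policies = [
    PolicyDoc("refund-sop", "Agents must not promise a refund date before finance confirms the refund"),
    PolicyDoc("otp-sop", "Agents must never ask for or repeat an OTP or password"),
]

trace = build_trace("conv-001", "when will the refund arrive",
                    "Agent: sure naman 'yan, by tomorrow", policies)
print(trace.retrieved_policy_id)
print(trace.prompt)
```

The point of the structure is that the retrieved policy and the exact prompt are stored alongside the eventual score, so an analyst can later check what the model was shown before it judged the conversation.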

Step-by-Step: Building a Detection Process for Informal Language Queues

For teams that want to tighten their process today, here is a practical approach regardless of which QA automation tool you use.

Audit your current QA metrics for language assumptions. Go through each criterion and ask: "Would a human reviewer be able to score this if the conversation were entirely in romanised Indonesian or Taglish?" If not, rewrite the criterion to describe the policy intent rather than a specific phrase.
Tag your informal-language ticket volume. Many helpdesks let you tag by language or channel. Understand what share of your queue is non-standard and whether your current QA coverage even touches it.
Review your SOP documentation for implied language requirements. Some SOPs are written assuming formal Indonesian or standard English. If your agents and customers communicate informally, your SOPs need to acknowledge that and describe what compliance looks like in that register.
Separate sentiment arc from resolution status. Configure your QA process to capture how the customer's tone changed over the course of the conversation, not just whether the ticket was closed.
Pilot 100% scoring on one informal-language queue. Compare the policy violation rate surfaced by automated scoring against what manual sampling has been catching. The gap is usually significant.
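The tagging and coverage steps above lend themselves to a quick check. The following Python sketch assumes you can export tickets with a language tag and a flag for whether manual QA reviewed them; the field names `language_tag` and `manually_reviewed` and the sample rows are hypothetical.

```python
from collections import Counter

# Hypothetical helpdesk export: one dict per ticket.
tickets = [
    {"id": 1, "language_tag": "formal-en", "manually_reviewed": True},
    {"id": 2, "language_tag": "formal-en", "manually_reviewed": False},
    {"id": 3, "language_tag": "taglish", "manually_reviewed": False},
    {"id": 4, "language_tag": "taglish", "manually_reviewed": False},
    {"id": 5, "language_tag": "informal-id", "manually_reviewed": False},
]

def share_and_coverage(tickets: list) -> dict:
    """For each language tag, return its share of the queue and the
    fraction of its tickets that manual QA actually reviewed."""
    totals = Counter(t["language_tag"] for t in tickets)
    reviewed = Counter(t["language_tag"] for t in tickets if t["manually_reviewed"])
    n = len(tickets)
    return {
        tag: {"share": count / n, "qa_coverage": reviewed[tag] / count}
        for tag, count in totals.items()
    }

for tag, stats in share_and_coverage(tickets).items():
    print(f"{tag}: {stats['share']:.0%} of queue, {stats['qa_coverage']:.0%} reviewed")
```

A tag with a large share of the queue and near-zero coverage is a reasonable candidate for the 100% scoring pilot.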

Frequently Asked Questions

Can standard LLMs score conversations written in informal Bahasa Indonesia or Taglish?

Large language models have broad multilingual capability, but without access to your specific policies and SOPs, they can only evaluate conversations against generic quality benchmarks. Policy-specific scoring requires the AI to retrieve and reason against your actual documents before producing a score.

Is code-switching in customer service conversations a Southeast Asia-specific issue?

Code-switching is documented across many multilingual markets globally, but it is especially prevalent in Southeast Asian service queues, where customers frequently mix local languages with English. Handling these specific combinations (Tagalog-English, Indonesian-English, Thai transliteration) requires models familiar with those patterns, which is a differentiator in how RevelirQA was built.

Does Indonesia's data protection law affect how QA tools process customer conversations?

Yes. Indonesia's Personal Data Protection Law creates obligations around how personal data shared in service interactions is handled and processed ^[1]. QA tools that process conversation data involving Indonesian customers need to operate within that framework, which makes an auditable processing trail more than a convenience.

What is the difference between automated compliance monitoring and keyword-based flagging?

Keyword flagging looks for specific words or phrases and fails when informal language produces equivalent meaning through different words. Automated compliance monitoring using policy-grounded AI evaluates whether the intent of the policy was fulfilled, making it robust to informal spelling, slang, and code-switching.

How do the best QA automation tools handle Thai when it is written in roman characters instead of Thai script?

Thai transliteration has no single standard, so the same phrase can appear in multiple spellings. Tools that rely on script-matching or dictionary lookup break immediately. Scoring engines that reason over meaning rather than surface form are far more reliable for transliterated Thai queues ^[3].
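A small Python sketch shows how brittle surface matching is here. The spellings below are illustrative romanisations of a Thai thank-you phrase, assumed for this example; which one counts as the "expected" form is arbitrary, and the 0.8 threshold is likewise an assumption. Fuzzy matching with the standard library's `difflib` is a stopgap shown for contrast, not the meaning-based scoring described above.

```python
from difflib import SequenceMatcher

# Hypothetical required closing phrase, in one arbitrarily chosen spelling.
REQUIRED_PHRASE = "khob khun"

def exact_match(message: str) -> bool:
    """Dictionary-style check: the phrase must appear exactly as listed."""
    return REQUIRED_PHRASE in message.lower()

def fuzzy_match(message: str, threshold: float = 0.8) -> bool:
    """Surface-level similarity check; it tolerates small spelling changes only."""
    ratio = SequenceMatcher(None, REQUIRED_PHRASE, message.lower()).ratio()
    return ratio >= threshold

# Illustrative alternative spellings of the same phrase (assumed, not exhaustive).
variants = ["khop khun", "kob kun", "korp kun"]

for v in variants:
    print(f"{v!r}: exact={exact_match(v)}, fuzzy={fuzzy_match(v)}")
```

With the threshold assumed here, none of the variants passes the exact check, the first two pass the fuzzy check, and the third does not. Loosening the threshold to catch it would also start matching unrelated text, which is the trade-off that meaning-based evaluation avoids.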

Why does sampling bias matter more in multilingual queues?

Manual reviewers naturally gravitate toward tickets they can read and assess quickly. Informal-language conversations take longer to interpret and are more likely to be skipped or under-reviewed. This means the violation rate in your informal-language queue is almost certainly higher than your sampled data suggests.
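A worked example shows how under-reviewing one segment distorts the headline number. All figures below are hypothetical: a queue that is 60% formal and 40% informal, and a manual sample of 100 tickets in which only 10 are informal.

```python
def pooled_rate(sample: dict) -> float:
    """Violation rate across the whole manual sample, ignoring segments."""
    reviewed = sum(s["reviewed"] for s in sample.values())
    violations = sum(s["violations"] for s in sample.values())
    return violations / reviewed

def stratified_rate(sample: dict, queue_share: dict) -> float:
    """Violation rate reweighted by each segment's real share of the queue."""
    return sum(
        queue_share[name] * s["violations"] / s["reviewed"]
        for name, s in sample.items()
    )

# Hypothetical manual QA sample, skewed toward formal-language tickets.
sample = {
    "formal": {"reviewed": 90, "violations": 9},
    "informal": {"reviewed": 10, "violations": 3},
}
# Hypothetical real composition of the queue.
queue_share = {"formal": 0.6, "informal": 0.4}

print(f"pooled sample rate: {pooled_rate(sample):.0%}")
print(f"reweighted estimate: {stratified_rate(sample, queue_share):.0%}")
```

In this made-up case the sampled data reports a 12% violation rate, while reweighting by the real queue mix gives 18%. The informal estimate rests on only 10 tickets, so it is noisy, which is another argument for scoring the whole queue rather than extrapolating from a thin sample.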

Do contracts and communications in Bahasa Indonesia need to be in formal Indonesian?

Indonesian law requires that agreements involving Indonesian parties be written in Bahasa Indonesia ^[2]. For customer service communications, this does not mandate formal register, but it does reinforce why QA processes need to be capable of evaluating Indonesian-language interactions rather than defaulting to English-only review.

About Revelir AI

Revelir AI builds RevelirQA, an AI quality assurance platform that evaluates 100% of customer service conversations against a client's own policies and SOPs. It is deployed in production at Xendit and Tiket.com, scoring thousands of tickets per week across English, Indonesian, Thai, and Tagalog queues. RevelirQA provides a full reasoning trace on every evaluation, making it suitable for fintech and other regulated industries that need auditable quality assurance records. The platform integrates with any helpdesk via API and is available as a SaaS or dedicated tenant deployment.

If your QA process is only covering a fraction of your conversations, and none of your informal-language tickets are being reviewed reliably, there is a straightforward way to fix both problems at once.

See how RevelirQA scores 100% of conversations in multilingual queues, with full policy context and an auditable reasoning trace on every score.

Learn more at revelir.ai

References

[1] Data Protection Laws and Regulations Report 2025-2026: Indonesia (iclg.com)
[2] Indonesian language requirement for contracts. The Jakarta Post, 24 October 2019 (www.thejakartapost.com)
[3] Beyond Monolingual Assumptions: A Survey on Code-Switched NLP in the Era of Large Language Models across Modalities (arxiv.org)
