Beyond English-First Benchmarks: How to Define What "Good" Looks Like for Agents Working in Bahasa Indonesia, Thai, and Tagalog

Published on:
June 15, 2026

Most QA scorecards were written in English, for English-speaking customer service representatives, and tested against English conversations. When companies apply those same criteria to representatives handling tickets in Bahasa Indonesia, Thai, or Tagalog, they are not measuring quality. They are measuring linguistic proximity to a benchmark that was never designed for these languages. Defining what "good" looks like across Southeast Asian languages requires rethinking the criteria themselves, not just translating the words.

TL;DR

  • English QA scorecards break down in Bahasa Indonesia, Thai, and Tagalog because politeness, formality, and resolution signals work differently in each language.
  • Direct translation of criteria is insufficient. Each language needs culturally grounded definitions of tone, empathy, and completeness.
  • Consistent, policy-aligned scoring across languages is achievable, but only when the QA system ingests your actual SOPs rather than applying generic benchmarks.
  • Sampling 1-5% of multilingual tickets amplifies blind spots. Full-conversation coverage is the only reliable baseline.
  • AI scoring engines that support multilingual evaluation can apply the same rubric consistently, while still accommodating language-specific criteria.

About the Author: Revelir AI builds AI quality assurance platforms for high-volume, digitally-native businesses. Its scoring engine, RevelirQA, is in production at Xendit and Tiket.com, scoring thousands of multilingual tickets per week across Bahasa Indonesia and other Southeast Asian languages.

Why Do English-First QA Frameworks Fail in Southeast Asian Languages?

Quality standards for customer service have historically been defined in English, reflecting the dominance of English in early CX research and tooling [1]. Applying those standards elsewhere creates a category error: the criteria were not derived from what "good" looks like in a given language. They were derived from English conversational norms and then assumed to be universal.

Three structural reasons explain the failure:

  • Politeness operates differently. In Thai, formality is encoded grammatically through particles like "khrap" and "kha." A representative omitting these is not simply being casual. They are violating a social register that carries real emotional weight for the customer. An English rubric scoring for "polite tone" has no mechanism to detect this.
  • Empathy signals are language-specific. Phrases that demonstrate empathy in English ("I completely understand your frustration") often sound hollow or overly formal when literally translated into Tagalog or Bahasa Indonesia. Native speakers use different constructions, and scoring literal translation as "empathetic" rewards the wrong behaviour.
  • Resolution confirmation is not universal. In English, closing a ticket often follows a clear "Is there anything else I can help you with?" In Bahasa Indonesia and Tagalog, resolution is frequently confirmed through softer, relational phrases that an English-trained model or reviewer may not recognise as a proper close.

"Good quality in a customer service interaction is notoriously context-dependent, and that context includes language and culture, not just the content of the policy being applied." [1]

What Should a Language-Specific QA Scorecard Actually Measure?

Building on the structural failures above, the harder question is what criteria should replace or supplement the English defaults. A well-designed QA scorecard for multilingual teams should separate universal criteria from language-specific ones rather than collapsing them into a single translated rubric [3].

| Criteria category | Universal (applies across languages) | Language-specific (requires local definition) |
| --- | --- | --- |
| Policy compliance | Did the representative follow the correct refund/escalation SOP? | Did the representative use the locally appropriate phrasing to communicate the policy? |
| Tone and register | Was the representative respectful? | Did the representative use the correct formality markers? (Thai politeness particles; formal versus informal pronouns and forms of address in Bahasa Indonesia) |
| Empathy | Did the representative acknowledge the customer's situation? | Did the representative use culturally appropriate empathy expressions, not literal translations? |
| Resolution confirmation | Was the issue resolved? | Was the close phrased in a way that feels complete to a native speaker? |
| Accuracy | Was the information given factually correct? | Was technical terminology localised correctly (e.g. "dompet digital" vs "e-wallet" in Indonesian)? |
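
One way to keep the two columns of the table separate in practice is to store them separately. The Python sketch below is a minimal illustration, not a description of any particular product: the class names, the criterion keys, and the language codes ("id", "th", "tl") are all assumptions chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Criterion:
    key: str       # short identifier, e.g. "tone_register" (hypothetical)
    question: str  # what the reviewer, human or AI, is asked to judge


@dataclass
class Scorecard:
    # Criteria that apply to every conversation, whatever the language.
    universal: list[Criterion]
    # Extra criteria keyed by language code (hypothetical codes: "id", "th", "tl").
    by_language: dict[str, list[Criterion]] = field(default_factory=dict)

    def criteria_for(self, language: str) -> list[Criterion]:
        """Return the universal criteria followed by that language's local ones."""
        return self.universal + self.by_language.get(language, [])


scorecard = Scorecard(
    universal=[
        Criterion("policy_compliance", "Did the representative follow the correct SOP?"),
        Criterion("accuracy", "Was the information given factually correct?"),
    ],
    by_language={
        "th": [Criterion("tone_register", "Were the correct politeness particles used?")],
        "id": [Criterion("terminology", "Was technical terminology localised correctly?")],
    },
)

# A Thai conversation is judged on the shared criteria plus the Thai-only one.
thai_keys = [c.key for c in scorecard.criteria_for("th")]
```

Keeping the language-specific list empty by default means a language without local definitions still gets the universal checks, and the gap is visible rather than hidden inside a translated rubric.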

How Does Code-Switching Complicate QA Scoring?

A related but distinct challenge is code-switching: customer service representatives in Southeast Asia routinely blend English terms into Bahasa Indonesia or Tagalog sentences, and customers do the same. This is not a quality failure. It is a linguistic reality of these markets.

An example from a Tiket.com-style interaction: a representative might write "Silakan cek booking ID kamu di email ya, kak" (Please check your booking ID in your email), mixing English loanwords naturally into a Bahasa Indonesia sentence. A QA system scoring for "responds in the customer's language" would flag this incorrectly as a deviation unless the scorecard is configured to recognise localised English borrowings as standard in context.

QA criteria should therefore include a definition of acceptable code-switching rather than treating all language mixing as an error. The test is always: does this choice serve the customer's comprehension, or does it create confusion?
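
A definition of acceptable code-switching can be made explicit as an allowlist that native-speaking QA leads maintain. The sketch below is a deliberately simple illustration in Python. Both word sets are invented for the example, and the small ENGLISH_WORDS set is only a stand-in for a real language-identification step. An allowlist is also only a proxy for the comprehension test described above, so flagged words are candidates for human review, not automatic failures.

```python
import re

# Hypothetical allowlist: English borrowings that a native-speaking QA lead
# has approved as standard in Bahasa Indonesia support conversations.
ACCEPTED_BORROWINGS = {"booking", "id", "email", "refund", "voucher"}

# Hypothetical stand-in for language identification: English words the team
# wants surfaced when they appear inside an Indonesian reply.
ENGLISH_WORDS = ACCEPTED_BORROWINGS | {"please", "kindly", "reimbursement", "itinerary"}


def unapproved_english(reply: str) -> list[str]:
    """Return English tokens in the reply that are not approved borrowings."""
    tokens = re.findall(r"[a-zA-Z]+", reply.lower())
    return [t for t in tokens if t in ENGLISH_WORDS and t not in ACCEPTED_BORROWINGS]


# The reply from the example above uses only approved borrowings.
natural = "Silakan cek booking ID kamu di email ya, kak"
clean = unapproved_english(natural)

# A reply that mixes in unapproved English is surfaced for review.
flagged = unapproved_english("Kindly cek itinerary kamu di email ya, kak")
```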

What Role Does AI Play in Scoring Multilingual Conversations Consistently?

Stepping back from the criteria design, a separate operational concern is consistency. Human QA reviewers who sample 1-5% of tickets introduce two compounding biases in multilingual environments: they tend to review tickets they can read fluently, and their interpretation of tone varies by reviewer. This means a representative working in Thai is often evaluated less frequently and less consistently than an English-speaking counterpart.

AI scoring engines can address both problems, but only if they are built for it. Generic AI benchmarks struggle with nuanced language quality assessment even in English [4], and the gap widens for lower-resource languages. The key differentiator is whether the scoring system is evaluating against your own policies or against a generic model of "good customer service."

RevelirQA addresses this by ingesting a company's actual SOPs and knowledge base via RAG before evaluating each conversation. The scoring engine retrieves the relevant policy, applies the configured QA scorecard, and produces a reasoning trace showing exactly which policy document was retrieved and why the score was assigned. This approach means the same audit-grade standard applies whether the ticket was written in English, Bahasa Indonesia, Thai, or Tagalog, and it covers 100% of conversations, not a sampled subset.
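
The general shape of a retrieve-then-score flow can be sketched in a few lines of Python. This is not RevelirQA's implementation, which the article does not describe in detail. Every name here (ScoreTrace, score_conversation, the retrieve and judge callables, the "refund-sop" policy ID) is hypothetical, and the two stand-in functions exist only so the sketch runs. The point it illustrates is that the steps are identical for every language and that the retrieved policy and the reasoning are recorded with the score.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ScoreTrace:
    language: str
    criterion: str
    policy_id: str  # which policy document was retrieved
    passed: bool
    reasoning: str  # why the score was assigned


def score_conversation(
    text: str,
    language: str,
    criterion: str,
    retrieve: Callable[[str], tuple[str, str]],           # text -> (policy_id, policy_text)
    judge: Callable[[str, str, str], tuple[bool, str]],   # -> (passed, reasoning)
) -> ScoreTrace:
    """Same steps for every language: retrieve the policy, judge, record the trace."""
    policy_id, policy_text = retrieve(text)
    passed, reasoning = judge(text, policy_text, criterion)
    return ScoreTrace(language, criterion, policy_id, passed, reasoning)


# Stand-ins so the sketch runs; a real system would call a retriever and a model.
def fake_retrieve(text: str) -> tuple[str, str]:
    return "refund-sop", "Confirm the booking ID before issuing a refund."


def fake_judge(text: str, policy_text: str, criterion: str) -> tuple[bool, str]:
    ok = "booking ID" in text
    return ok, f"Checked the reply against policy text: {policy_text!r}"


trace = score_conversation(
    "Silakan cek booking ID kamu di email ya, kak",
    "id",
    "policy_compliance",
    fake_retrieve,
    fake_judge,
)
```

Passing the retriever and the judge in as parameters keeps the pipeline itself language-neutral: swapping in a different model or policy store does not change what gets recorded in the trace.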

How Should Teams Build and Validate a Multilingual QA Scorecard?

Building on the criteria framework above, the practical implementation follows a clear sequence:

  1. Start with your universal criteria. Define policy compliance, accuracy, and resolution confirmation in language-neutral terms tied directly to your SOPs.
  2. Layer language-specific criteria. For each target language, work with native-speaking QA leads to define tone, register, empathy expressions, and acceptable code-switching standards. Document these as explicit criteria, not informal guidance.
  3. Configure custom scoring metrics. Binary criteria (did the representative confirm the resolution: yes/no) work well for compliance items. Multi-level scoring works better for tone and empathy, where partial credit reflects real variation.
  4. Run a calibration set. Before deploying AI scoring, have native-speaking reviewers score 50-100 conversations per language. Use these to verify that the AI scoring output aligns with human judgment on language-specific criteria.
  5. Review coaching outputs by language. Aggregate policy misses per language cohort. Patterns that appear only in one language cohort usually indicate a criteria gap, not a representative performance gap.
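
For the calibration step, one common way to quantify how well AI scores align with human judgment is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The article does not prescribe a specific statistic, so treat this Python sketch as one reasonable option. The labels below are invented for illustration and far smaller than the 50-100 conversations per language suggested above.

```python
from collections import Counter


def cohen_kappa(human: list[str], ai: list[str]) -> float:
    """Agreement between two raters, corrected for chance agreement."""
    if len(human) != len(ai) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    # Share of conversations where both raters gave the same label.
    observed = sum(h == a for h, a in zip(human, ai)) / n
    # Agreement expected if each rater labelled at random with their own label rates.
    human_counts, ai_counts = Counter(human), Counter(ai)
    expected = sum(human_counts[lab] * ai_counts[lab] for lab in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical calibration labels for one criterion, grouped by language.
calibration = {
    "th": (["pass", "pass", "fail", "fail"], ["pass", "pass", "fail", "pass"]),
    "id": (["pass", "fail", "pass", "fail"], ["pass", "fail", "pass", "fail"]),
}

kappa_by_language = {lang: cohen_kappa(h, a) for lang, (h, a) in calibration.items()}
```

Computing the statistic per language, rather than pooled, matters here: a high pooled score can hide a language where the AI and native-speaking reviewers disagree on tone or register.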

Frequently Asked Questions

Can one QA scorecard cover Bahasa Indonesia, Thai, and Tagalog, or does each language need its own?

A hybrid approach works best. Universal criteria (policy compliance, accuracy, resolution) can share a common framework. Tone, register, and empathy criteria should be defined separately per language, because what sounds appropriately respectful in Thai reads very differently in Tagalog.

Is translating an English QA scorecard into Bahasa Indonesia sufficient?

No. Translation preserves the words but not the intent. Criteria like "uses empathetic language" need to be rewritten from a native-language perspective, not translated from an English original, to reflect how empathy is actually expressed in that language.

How do you handle code-switching in QA scoring?

Define it explicitly in your scorecard. Code-switching that aids customer comprehension (common English loanwords in Bahasa Indonesia, for example) should be marked as acceptable. Code-switching that creates confusion or mixes registers inappropriately should be flagged. The default should never be "all mixing is an error."

Why is sampling 1-5% of multilingual tickets a particular problem?

Because human reviewers tend to pull tickets they can read fluently, Thai and Tagalog tickets are reviewed less often than English ones. This means representatives in those languages receive fewer coaching signals and less consistent feedback, which compounds over time into a quality gap that management cannot see.

How does AI scoring maintain consistency across languages?

By applying the same configured rubric to every conversation regardless of language, and by retrieving the same policy documents before each evaluation. The consistency comes from the system architecture, not from the AI having perfect multilingual comprehension. A full reasoning trace on every score lets QA leads verify that the right policy was applied [2].

What is the biggest mistake companies make when QA-ing multilingual teams?

Treating language as a formatting variable rather than a substantive quality dimension. Companies that simply translate their English scorecard and call it "multilingual QA" are measuring how well representatives approximate English-style communication, not how well they serve customers in their own language.

Does AI quality assurance work for high-volume multilingual environments?

Yes, and it is arguably where it delivers the most value. High-fidelity manual review of every conversation, in several languages at once, is operationally impractical. AI scoring engines that support multilingual evaluation can cover 100% of conversations consistently, surfacing coaching opportunities that sampling would never catch.

About Revelir AI

Revelir AI builds RevelirQA, an AI quality assurance platform that scores 100% of support conversations against a company's own policies and QA scorecard. Founded in Singapore in 2025 by a Y Combinator alumnus, the platform is in production at Xendit and Tiket.com, scoring thousands of tickets per week in Bahasa Indonesia and other languages. RevelirQA integrates with any helpdesk via API, provides a full reasoning trace on every evaluation for compliance-critical industries, and evaluates both human representatives and AI chatbots through a single consistent scoring framework built for global enterprise.

Ready to move beyond English-first benchmarks and build a QA framework that actually works for your multilingual team?

Learn more about RevelirQA at revelir.ai

References

  1. Beyond the Checklist: What Does Good Teaching Look Like? | NEA (www.nea.org)
  2. Essay Criteria That Are Beyond the Reach of AI? | Educational Technology and Change Journal (etcjournal.com)
  3. The Complete Guide to Talent Evaluation (2026) (juicebox.ai)
  4. Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models (arxiv.org)