How to Handle Edge Cases in Your QA Scorecard | Revelir AI

When a customer service conversation doesn't map cleanly to your existing QA scorecard criteria, most teams either force a score that doesn't fit or skip the ticket entirely. Both choices quietly corrupt your quality data. The right approach is to treat edge cases not as exceptions to be discarded, but as structured signals that your scorecard needs to evolve. This article explains how to identify, categorise, and operationalise QA edge cases so they improve your programme rather than undermine it.

TL;DR

Edge cases in QA scorecards are conversations where existing criteria are ambiguous, missing, or inapplicable - they are not errors, they are gaps in your framework.
Forcing a score on a mismatched conversation creates measurement bias; skipping it creates coverage gaps. Neither is acceptable at scale.
A structured edge case protocol - flag, log, review, update - turns exceptions into scorecard improvements.
AI-powered QA scoring can surface edge cases systematically across 100% of tickets, rather than leaving them to surface randomly in a 1-5% manual sample ^[5].
The goal is a living scorecard: one that becomes more precise every time an edge case is resolved.

About the Author: Revelir AI builds an AI quality assurance platform for high-volume customer service teams at companies like Xendit and Tiket.com, scoring thousands of conversations per week across multilingual environments. The insights here are drawn directly from designing scorecards that hold up under real production conditions.

What Exactly Is a QA Scorecard Edge Case?

An edge case in a QA scorecard is any conversation where your existing criteria produce an ambiguous, inapplicable, or contradictory result ^[1]. This is the foundational definition that everything else in this article builds on. Unlike a straightforward fail or pass, an edge case forces a reviewer to make a judgement call your framework was never designed to handle.

Edge cases fall into a few recurring patterns:

Missing criteria: The conversation involves a scenario your scorecard never anticipated - for example, an agent handling a complaint that blends a billing dispute with a fraud suspicion simultaneously.
Contradictory criteria: Two scorecard items conflict for this specific ticket type. A policy on empathy may reward longer responses, while a handle-time criterion penalises them.
Inapplicable criteria: A criterion exists but genuinely cannot be scored - for instance, scoring "greeting format" on a ticket where the customer opened with a mid-conversation screenshot and no text.
Boundary ambiguity: The conversation sits right on the line between a pass and a fail, and the criterion lacks enough specificity to decide ^[2].

"Edge cases don't reveal broken agents - they reveal incomplete frameworks. The conversation is the test your scorecard just failed."

Why Do Edge Cases Matter More Than Teams Realise?

Building on the definition above, the harder question is: how much damage does ignoring edge cases actually cause? The answer is more than most QA programmes account for. Manual QA already reviews only 1-5% of all tickets ^[5], which means the edge cases that do surface are an even smaller subset of the total population. When those rare surfaced cases are then skipped or forced into a mismatched score, the downstream data becomes quietly unreliable.
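To make the sampling arithmetic concrete, here is a minimal Python sketch. It assumes reviews are drawn as a simple random sample, and it uses a hypothetical volume of 10,000 tickets and a hypothetical 3% edge case rate; only the 1-5% review range comes from the figures above. `expected_surfaced` is an illustrative helper, not part of any QA tool.

```python
def expected_surfaced(total_tickets: int, edge_case_rate: float, sample_rate: float) -> float:
    """Expected number of edge cases that land in a random manual sample."""
    return total_tickets * edge_case_rate * sample_rate


# Hypothetical inputs: 10,000 tickets, 3% of which are edge cases.
TOTAL_TICKETS = 10_000
EDGE_CASE_RATE = 0.03

for sample_rate in (0.01, 0.05):  # the 1-5% manual review range
    surfaced = expected_surfaced(TOTAL_TICKETS, EDGE_CASE_RATE, sample_rate)
    print(f"{sample_rate:.0%} sample: about {surfaced:.0f} edge cases reach a reviewer")
```

Under those assumptions, only a handful of the roughly 300 edge cases are ever seen by a reviewer, so each skipped or force-scored one removes a meaningful share of the evidence you have.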

The specific risks include:

Score inflation or deflation for specific agents: If one agent handles a disproportionate share of complex, edge-case-prone ticket types, and those get skipped, their true performance is masked.
Policy blindspots: Scenarios your scorecard never anticipated are also scenarios your team may be handling inconsistently - and you won't know until a compliance issue surfaces.
Coaching misdirection: Coaching based on forced scores on mismatched criteria trains agents on the wrong behaviour.

How Should You Classify Edge Cases Before Resolving Them?

Not every edge case deserves the same response, and resolving them well starts with categorisation ^[3]. A systematic classification prevents the common mistake of treating every ambiguous ticket as a one-off judgement call rather than a pattern worth fixing.

Edge Case Type	Example	Recommended Initial Action
Missing criterion	Agent handles a refund for a product that no longer exists in the catalogue	Flag for scorecard review; mark criterion as "not applicable" in the interim
Contradictory criteria	Empathy score rewards length; efficiency score penalises it	Escalate to QA lead for criteria priority ruling before scoring
Inapplicable criterion	Greeting format cannot be scored on a callback-initiated ticket	Exempt the criterion for this ticket type; document the exemption
Boundary ambiguity	Agent partially followed the escalation SOP but resolved the issue without escalating	Apply secondary evidence (customer sentiment, resolution outcome) to break the tie
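
If your QA tool lets you attach structured tags, the classification above could be encoded along these lines. This is a sketch rather than a prescribed schema: the `EdgeCaseType` enum and the `initial_action` helper are hypothetical names, and the action strings paraphrase the table.

```python
from enum import Enum


class EdgeCaseType(Enum):
    MISSING_CRITERION = "missing criterion"
    CONTRADICTORY_CRITERIA = "contradictory criteria"
    INAPPLICABLE_CRITERION = "inapplicable criterion"
    BOUNDARY_AMBIGUITY = "boundary ambiguity"


# Recommended initial action for each type, paraphrased from the table.
INITIAL_ACTION = {
    EdgeCaseType.MISSING_CRITERION: "Flag for scorecard review; mark criterion as not applicable in the interim",
    EdgeCaseType.CONTRADICTORY_CRITERIA: "Escalate to QA lead for a criteria priority ruling before scoring",
    EdgeCaseType.INAPPLICABLE_CRITERION: "Exempt the criterion for this ticket type; document the exemption",
    EdgeCaseType.BOUNDARY_AMBIGUITY: "Apply secondary evidence to break the tie",
}


def initial_action(edge_case_type: EdgeCaseType) -> str:
    """Look up the recommended first step for a classified edge case."""
    return INITIAL_ACTION[edge_case_type]


print(initial_action(EdgeCaseType.CONTRADICTORY_CRITERIA))
```

Keeping the mapping in one place means every reviewer applies the same first step to the same type of edge case, instead of improvising per ticket.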

What Is the Right Protocol for Handling an Edge Case in Practice?

Once you've classified an edge case, what do you actually do with it? The answer is a four-step protocol that keeps your scoring consistent in the short term while systematically improving the scorecard over time ^[4].

1. Flag, don't force. Mark the conversation with a dedicated edge case tag in your QA tool. Do not assign a score that the criteria don't support. A blank with a flag is more honest data than a forced score.
2. Log with context. Record the ticket ID, the specific criterion that failed to apply, and a one-line description of why. This log is your evidence base for the next scorecard iteration.
3. Review in batch. Edge cases should be reviewed as a group, not individually. Patterns only become visible when you look at ten flagged tickets together, not one at a time.
4. Update the scorecard. Each resolved batch should produce either a new criterion, a revised definition, or an explicit exemption rule. Edge cases that recur without a scorecard update are a process failure ^[3].
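
The log and batch review steps can be sketched in a few lines of Python. The record fields mirror the three items the protocol asks you to log; the `EdgeCaseLogEntry` class, the `recurring_criteria` helper, the ticket IDs, and the criterion names are all hypothetical, and a real QA tool would store this in its own format.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class EdgeCaseLogEntry:
    ticket_id: str
    criterion: str  # the criterion that failed to apply
    reason: str     # one-line description of why


def recurring_criteria(log, min_count=2):
    """Return (criterion, count) pairs flagged at least min_count times, most frequent first."""
    counts = Counter(entry.criterion for entry in log)
    return [(criterion, n) for criterion, n in counts.most_common() if n >= min_count]


# Hypothetical batch of flagged tickets.
log = [
    EdgeCaseLogEntry("T-101", "greeting_format", "customer opened with a screenshot and no text"),
    EdgeCaseLogEntry("T-102", "handle_time", "conflicts with the empathy criterion on complaint tickets"),
    EdgeCaseLogEntry("T-103", "greeting_format", "callback-initiated ticket"),
]

print(recurring_criteria(log))  # [('greeting_format', 2)]
```

Grouping by criterion is only one way to look for patterns. Grouping by contact reason or ticket type may be more useful, depending on how your scorecard is organised.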

How Does AI Change the Way Teams Discover Edge Cases?

The protocol assumes you can find edge cases in the first place, and at scale that is a discovery problem. In a manual QA programme reviewing 1-5% of tickets, most edge cases are invisible ^[5]. They exist, but the sample never reaches them.

AI quality assurance changes this fundamentally. When a scoring engine evaluates 100% of conversations, it encounters the full distribution of ticket types - including the rare, complex, and boundary-straddling conversations that manual sampling statistically misses. This is where platforms like RevelirQA add a structural advantage: every ticket is scored, and every score carries a full reasoning trace showing which policy documents were retrieved, what the model considered, and why it scored the way it did. When the scoring engine flags a low-confidence result or surfaces a conversation where criteria conflict, that is a systematic edge case signal rather than a random discovery.

The AI metrics that matter here are not just pass/fail rates, but the volume and pattern of flagged, uncertain, or exempted scores over time. A rising flag rate on a specific contact reason often means that contact reason has outgrown the criteria written for it ^[6].
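
As a rough illustration of tracking that pattern, the sketch below computes a flag rate per contact reason and period from a list of scored tickets. The tuple layout, the function names, and the definition of "rising" (strictly increasing across every period) are assumptions made for the example, not how any particular scoring engine reports its results.

```python
from collections import defaultdict


def flag_rate_by_reason(tickets):
    """tickets: iterable of (period, contact_reason, flagged) tuples.

    Returns {contact_reason: {period: share of tickets flagged}}.
    """
    totals = defaultdict(lambda: defaultdict(int))
    flagged = defaultdict(lambda: defaultdict(int))
    for period, reason, is_flagged in tickets:
        totals[reason][period] += 1
        if is_flagged:
            flagged[reason][period] += 1
    return {
        reason: {p: flagged[reason][p] / n for p, n in sorted(periods.items())}
        for reason, periods in totals.items()
    }


def is_rising(rates_by_period):
    """True if the flag rate strictly increases from each period to the next."""
    rates = [rates_by_period[p] for p in sorted(rates_by_period)]
    return len(rates) >= 2 and all(a < b for a, b in zip(rates, rates[1:]))


# Hypothetical data: refund tickets are flagged more often in the second month.
sample = (
    [("2024-01", "refund", True)] + [("2024-01", "refund", False)] * 3
    + [("2024-02", "refund", True)] * 2 + [("2024-02", "refund", False)] * 2
)
rates = flag_rate_by_reason(sample)
print(rates["refund"], is_rising(rates["refund"]))
```

A stricter or smoother trend test may suit noisy data better. The point is that the flag rate is tracked per contact reason, not as a single programme-wide number.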

How Do You Prevent Edge Cases From Accumulating Into a Scorecard Crisis?

The long-term discipline is straightforward but requires consistent practice ^[7]:

Schedule scorecard reviews quarterly at minimum. Edge case logs should be the primary input, not CSAT or NPS data.
Write criteria with boundary conditions explicit. Instead of "agent demonstrated empathy," write "agent acknowledged the customer's stated frustration before proposing a resolution." Specificity reduces the surface area for boundary ambiguity.
Test new criteria against historical tickets before deploying them. A criterion that creates five new edge cases for every one it resolves is making your scorecard less reliable, not more.
Assign ownership. The QA lead, not individual reviewers, should own the edge case log and the scorecard update cycle. Distributed ownership means nothing gets updated.
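
The historical test described above can be reduced to a simple comparison, sketched here in Python. It assumes you can re-score the same set of historical tickets under the old and the new scorecard and collect the IDs of tickets flagged as edge cases each time; the re-scoring itself is out of scope, and `edge_case_delta` is a hypothetical helper name.

```python
def edge_case_delta(flagged_before: set, flagged_after: set) -> dict:
    """Compare edge case flags on the same tickets before and after a scorecard change."""
    resolved = flagged_before - flagged_after  # flagged under the old scorecard only
    created = flagged_after - flagged_before   # newly flagged under the new scorecard
    return {
        "resolved": len(resolved),
        "created": len(created),
        "net_improvement": len(resolved) - len(created),
    }


# Hypothetical ticket IDs flagged under each scorecard version.
before = {"T-101", "T-103", "T-107"}
after = {"T-107", "T-110"}

print(edge_case_delta(before, after))
# {'resolved': 2, 'created': 1, 'net_improvement': 1}
```

A negative `net_improvement` is the signal to rework the criterion before deploying it.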

Frequently Asked Questions

Should I score or skip a conversation if a criterion clearly doesn't apply?

Mark the criterion as exempt and document why, rather than skipping the whole ticket or forcing a score. The rest of the scorecard still contains valid signal.
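
One way to keep that valid signal is to compute the ticket score over the applicable criteria only. The sketch below assumes equally weighted pass/fail criteria, with `None` marking an exemption; both the weighting and the `ticket_score` name are assumptions for illustration, and a weighted scorecard would need a weighted version.

```python
from typing import Dict, Optional


def ticket_score(results: Dict[str, Optional[bool]]) -> Optional[float]:
    """Share of scored criteria that passed; None marks an exempt criterion.

    Returns None when every criterion is exempt, so the ticket is
    reported as unscored rather than as a zero.
    """
    scored = [passed for passed in results.values() if passed is not None]
    if not scored:
        return None
    return sum(scored) / len(scored)


# Hypothetical ticket where the greeting criterion cannot be scored.
results = {
    "greeting_format": None,  # exempt: documented in the edge case log
    "empathy": True,
    "resolution": True,
    "policy_adherence": False,
}
print(ticket_score(results))  # 2 of the 3 scored criteria passed
```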

How many edge cases is too many?

There is no fixed threshold, but if more than roughly 5-10% of tickets of a given type are being flagged as edge cases, the relevant criterion or ticket category likely needs a dedicated scoring rule rather than a workaround.
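
A check like this is easy to automate. The sketch below uses 10% as the default cut-off, which is simply the upper end of the rough range above and should be tuned to your own programme; the function name and the input layout are hypothetical.

```python
def types_needing_dedicated_rule(counts, threshold=0.10):
    """counts: {ticket_type: (flagged, total)}.

    Returns the ticket types whose edge case flag rate exceeds the threshold,
    sorted alphabetically. Types with no tickets are skipped.
    """
    return sorted(
        ticket_type
        for ticket_type, (flagged, total) in counts.items()
        if total > 0 and flagged / total > threshold
    )


# Hypothetical monthly counts per ticket type.
counts = {"refund": (12, 100), "login": (3, 100), "fraud": (0, 0)}
print(types_needing_dedicated_rule(counts))  # ['refund']
```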

Can AI scoring engines handle edge cases better than human reviewers?

AI scoring engines surface edge cases more consistently and at full volume, because they encounter every ticket rather than a sample. However, resolving what the edge case means - updating the criteria - still requires human judgement from your QA team ^[6].

What's the difference between an edge case and a scorer error?

A scorer error is a case where the criteria were clear and the scorer applied them incorrectly. An edge case is where the criteria themselves are insufficient for the conversation. The distinction matters because they require different fixes.

How do I write a criterion that reduces edge cases from the start?

Write criteria that specify the observable behaviour, the condition under which it applies, and what counts as a pass at the boundary. Vague criteria create edge cases; specific criteria reduce them ^[2].

Should edge case decisions be documented even if they seem obvious?

Yes. What seems obvious to one reviewer is not obvious to the next, and undocumented decisions become inconsistent decisions over time. An auditable log is the minimum.

How often should a QA scorecard be updated?

Quarterly is a practical minimum. If your team is running high ticket volumes or launching new products and contact reasons, a six-week review cycle is more appropriate ^[7].

About Revelir AI

Revelir AI builds an AI quality assurance platform for customer service teams that need to move beyond manual sampling and generic benchmarks. Its scoring engine, RevelirQA, evaluates 100% of support conversations against each customer's own policies and QA scorecard, ingested via RAG so that every evaluation reflects the company's actual SOPs rather than a generic standard. Every score is backed by a full reasoning trace, giving compliance-critical teams in fintech and regulated industries a complete audit trail. RevelirQA is trusted by Xendit and Tiket.com for production operations, scoring thousands of conversations per week across multilingual environments including English, Indonesian, Thai, and Tagalog.

Is your QA scorecard ready for what your agents are actually handling?

See how RevelirQA scores 100% of your conversations against your own policies - and surfaces the edge cases your manual reviews never reach.

Learn more at revelir.ai

References

[1] Edge Cases in Programming: Definition and Scenario ... (testomat.io)
[2] Edge Case Testing Explained - What to Test & How to Do It (www.virtuosoqa.com)
[3] Software testing lessons we can learn from edge cases - Qase (qase.io)
[4] How do you test and plan for edge cases - Archive - The Club: Software Testing & Quality Engineering Community Forum, Ministry of Testing (club.ministryoftesting.com)
[5] QA Trends Report 2026: Market Growth, AI-Driven Testing, ... (thinksys.com)
[6] Top 5 QA Trends for 2026 - What CIOs & VPs Must Know (celticqa.com)
[7] 20 Software Quality Assurance Best Practices for 2026 - DeviQA (www.deviqa.com)