CamemBERT or GPT-4o mini for classifying support tickets?

The answer depends on volume. Under 5,000 tickets per year, GPT-4o mini in zero-shot or few-shot mode is sufficient: one-week deployment, no annotation required, typical F1 of 0.80 to 0.87. Above 10,000 tickets per year, a fine-tuned encoder model (e.g. CamemBERT/BERT) trained on your own annotated data is the right call: latency under 50 ms, near-zero marginal cost at scale, and a 5 to 15 F1-point gain over zero-shot LLMs. In between, a hybrid architecture (LLM for ambiguous cases, BERT for routine cases) is often the best option.

How many annotated tickets do you need to fine-tune a BERT model?

The minimum viable threshold is 1,000 annotated tickets per main category. In practice, 3,000 to 5,000 examples covering all categories produce a robust model. If some categories are underrepresented in your historical data, you can augment with LLM-generated paraphrases: generate syntactic variants of existing examples to balance the class distribution. Inter-annotator agreement (Cohen's kappa), measured on a pilot batch, must exceed 0.70 before full-scale annotation starts. If it does not, the taxonomy needs rework, not the data.

Can tickets be classified directly from Zendesk?

Yes. Zendesk exposes a REST API and a webhook system that allow classification to be triggered at ticket creation. The classifier receives the ticket title and body, returns one or more labels with confidence scores, and the Zendesk API applies the tags and team assignment. Total latency is under 500 ms with a cloud-hosted BERT model, imperceptible to the agent. Freshdesk and Intercom follow the same pattern.

What happens when the model is not confident about the category?

The standard mechanism is a confidence threshold. Below 0.80, the ticket is sent to a manual triage queue rather than routed automatically. This is not a failure: it is the safety net that maintains service quality. In practice, roughly 10 to 15 percent of tickets go through this queue at launch. That rate decreases as the model is retrained on the agents' corrections.

How do you handle taxonomy evolution over time?

This is the concept drift problem: tickets evolve with the product, and new categories emerge. The solution is a quarterly retraining cycle on recent tickets, combined with monitoring of the output category distribution. An unusual spike on a specific category is often the first signal that a new class is emerging. The agent feedback loop (agents' corrections on misrouted tickets) is the primary data source for this retraining.

How much does an end-to-end ticket classification project cost?

A 4 to 6-week POC covering annotation of 1,000 to 3,000 historical tickets, classifier training and evaluation, costs between 5,000 and 10,000 euros. The production MVP, including ticketing platform API integration, human triage workflow for low-confidence cases, and monitoring, runs 12,000 to 20,000 euros over 2 to 3 months. Annual TCO thereafter is 5,000 to 12,000 euros, covering model cost, quarterly retraining, and residual human review.

AI Ticket Classification: Fine-tuned Models vs GPT


In most support teams, the day starts with the same task: manually sorting incoming tickets, categorizing them, and assigning them to the right team. For a helpdesk receiving 400 tickets a day, that is two hours of work every morning before anyone has solved a single real problem.

AI-powered ticket classification solves this problem, but only when implemented in the right order. The choice between a fine-tuned encoder model (e.g. CamemBERT/BERT) and GPT-4o mini is not the first decision to make. The first decision is the taxonomy. Without a coherent, business-validated taxonomy, no model can be accurate.

This article covers the full picture: from taxonomy design to the continuous improvement pipeline, through model selection by volume, Zendesk/Freshdesk integration, and the metrics that actually matter.

At a glance

Prerequisite number one: a business-validated taxonomy with mutually exclusive categories, before any annotation
Under 5,000 tickets/year: GPT-4o mini zero-shot, one-week deployment, F1 0.80 to 0.87
Over 10,000 tickets/year: fine-tuned BERT/CamemBERT, F1 above 0.90, latency under 50 ms, near-zero marginal cost
In between: hybrid architecture, BERT for routine cases, LLM for ambiguous ones
POC: 5,000 to 10,000 euros, 4 to 6 weeks. Production MVP: 12,000 to 20,000 euros

Manual ticket routing: an invisible but real cost

Manual ticket triage is one of those costs that never appears on a balance sheet but weighs heavily in practice. One person spending two hours every morning sorting and distributing tickets amounts to roughly 55 working days per year (two hours across about 220 working days, counted in eight-hour days) dedicated to a task with zero added value. Multiply by fully-loaded salary cost and you have a concrete number.

But the most significant hidden cost is not the dispatcher's time. It is misrouting. When an urgent ticket lands in the wrong queue, it can lose 48 hours before being redirected. On a 4-hour SLA, that is a contractual penalty. On a VIP client ticket, it is a churn problem in disguise: silent dissatisfaction that surfaces in your retention metrics weeks later.

Why FCR degrades without reliable classification

First Contact Resolution (the rate at which issues are resolved on the first interaction) is the key KPI of support operations. It drops for two reasons directly tied to routing:

A ticket sent to the wrong team comes back with a generic response or a request for more information, forcing the customer to re-contact.
Overloaded teams (tier 2 flooded with mis-triaged tier-1 tickets) prioritize poorly and generate excessive wait times.

Automated classification solves both: precise routing in under 500 ms, correctly assigned priority, SLA triggered immediately. The gains we measure across projects are consistent with industry benchmarks: 40 to 60 percent reduction in average time to first assignment after deployment.

If you are currently scoping your project, our guide to AI auditing for SMBs and mid-market companies explains how to frame this type of initiative before writing a single line of code.

Taxonomy as a hard prerequisite

Here is the rule we repeat at the start of every project: classification is only as good as the taxonomy it targets. If categories overlap, if teams disagree on what a "billing bug" ticket actually means, no model can be accurate. This is not a model limitation: it is a mathematical impossibility.

A model learns to reproduce past human decisions. If those decisions are inconsistent, the model learns inconsistency.

Warning signal

If you ask three agents to categorize the same batch of 50 tickets and inter-annotator agreement (Cohen's kappa) is below 0.70, stop. This is not a data problem, it is a taxonomy problem. Rework the category definitions before doing anything else.
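To make the check concrete, here is a minimal sketch of the agreement measurement in plain Python. Cohen's kappa is defined for two annotators, so with three agents we assume a simple average over every pair; that averaging is our choice for illustration (Fleiss' kappa is a common alternative), and the labels below are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same tickets."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty batch")
    n = len(labels_a)
    # Observed agreement: share of tickets given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def mean_pairwise_kappa(annotations):
    """Average kappa over every pair of annotators (dict: name -> labels)."""
    pairs = list(combinations(annotations.values(), 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


# Hypothetical mini-batch: three agents, six tickets.
batch = {
    "agent_1": ["billing", "bug", "bug", "access", "billing", "bug"],
    "agent_2": ["billing", "bug", "access", "access", "billing", "bug"],
    "agent_3": ["billing", "bug", "bug", "access", "bug", "bug"],
}
kappa = mean_pairwise_kappa(batch)
print(f"mean pairwise kappa = {kappa:.2f}")
print("taxonomy needs rework" if kappa < 0.70 else "agreement is acceptable")
```

Looking at the per-pair values, not only the mean, also shows whether one agent applies the definitions differently from the others.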

What happens when you skip this step

We took over a project that had already been started by another team at a SaaS vendor. The model had been trained on 18 months of history. Macro precision in production was 0.68, with entire classes the model never predicted.

After diagnosis, the problem was straightforward: the "UI bug" and "functional bug" categories overlapped for 40 percent of tickets. Agents had different practices. The model had learned the chaos. After two taxonomy redefinition workshops, the model retrained on the same data but with corrected labels reached a macro F1 of 0.89.

How to build a taxonomy that holds

A good ticket classification taxonomy meets three criteria: categories are mutually exclusive (a ticket belongs to one main category), collectively exhaustive (every ticket can be categorized), and actionable (each category triggers a different routing decision).

The two-level hierarchical structure

We consistently recommend a two-level taxonomy:

Level 1 (category): 5 to 12 broad, stable categories. Examples: hardware failure, software issue, billing question, feature request, usage question.
Level 2 (subcategory): 3 to 6 subcategories per level-1 category, more granular. Examples under "hardware failure": printer, network, workstation, peripheral.

Level 1 drives team routing. Level 2 drives assignment to a specialist agent or triggers a specific procedure.

A concrete example for an ITSM helpdesk

L1 Category	L2 Subcategories	Target Team
Hardware failure	Printer, network, workstation, peripheral	Hardware tier-2 support
Access and permissions	Password reset, AD rights, application access, badge	Identity tier-1 support
Software bug	ERP, email, business tool, browser	Application tier-2 support
Usage question	Training, procedure, documentation	Functional tier-1 support
Service request	New hardware, software install, relocation	IT management
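One way to keep such a taxonomy honest is to store it as data and check its structure automatically. The sketch below mirrors the table above; the team identifiers and the `validate_taxonomy` helper are hypothetical names. It can only catch structural problems (missing team, duplicated subcategory name, too many or too few entries), not the semantic overlaps that the workshops are meant to resolve.

```python
# Two-level taxonomy mirroring the ITSM table above.
# Team identifiers are hypothetical; use your own queue names.
TAXONOMY = {
    "hardware_failure": {
        "team": "hardware_tier_2",
        "subcategories": ["printer", "network", "workstation", "peripheral"],
    },
    "access_and_permissions": {
        "team": "identity_tier_1",
        "subcategories": ["password_reset", "ad_rights", "application_access", "badge"],
    },
    "software_bug": {
        "team": "application_tier_2",
        "subcategories": ["erp", "email", "business_tool", "browser"],
    },
    "usage_question": {
        "team": "functional_tier_1",
        "subcategories": ["training", "procedure", "documentation"],
    },
    "service_request": {
        "team": "it_management",
        "subcategories": ["new_hardware", "software_install", "relocation"],
    },
}


def validate_taxonomy(taxonomy, min_l1=5, max_l1=12, min_l2=3, max_l2=6):
    """Return a list of structural problems; an empty list means none found."""
    problems = []
    if not min_l1 <= len(taxonomy) <= max_l1:
        problems.append(f"{len(taxonomy)} level-1 categories, expected {min_l1} to {max_l1}")
    owner = {}  # subcategory name -> level-1 category that declared it first
    for category, spec in taxonomy.items():
        if not spec.get("team"):
            problems.append(f"{category}: no target team, so no routing decision")
        subs = spec.get("subcategories", [])
        if not min_l2 <= len(subs) <= max_l2:
            problems.append(f"{category}: {len(subs)} subcategories, expected {min_l2} to {max_l2}")
        for sub in subs:
            if sub in owner:
                problems.append(f"'{sub}' appears under both {owner[sub]} and {category}")
            owner[sub] = category
    return problems


print(validate_taxonomy(TAXONOMY) or "no structural problems found")
```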

The workshop session with operational teams

The taxonomy must not be designed in isolation by the technical team. The first two weeks of a project must include at least two workshops with agents and support managers. These sessions serve to:

Identify edge cases and overlap areas between categories
Define a written decision rule for each ambiguous boundary
Confirm that each category corresponds to a different treatment (if not, the distinction is pointless)
Document canonical examples for each category to guide annotation

This work takes time. That is expected. It is the hard prerequisite that determines the quality of everything that follows.

Choosing the model based on ticket volume

Once the taxonomy is validated, the model question arises. The answer depends almost entirely on annual ticket volume. If you are new to this space, our article on machine learning vs generative AI explains the fundamental distinction between encoder-based classifiers and large language models before diving into the three cases below.

Under 5,000 tickets per year: zero-shot or few-shot LLM

GPT-4o mini or Mistral Large in zero-shot or few-shot mode is the fastest approach to deploy. No annotation required: provide the taxonomy in the prompt, a few examples per category, and the model classifies. Performance is solid on clean taxonomies (F1 of 0.80 to 0.87). Cost is 0.001 to 0.003 euros per ticket.

Advantages: one-week deployment, no annotated data needed, immediate taxonomy updates (just edit the prompt)
Limitations: variable cost, latency of 200 to 800 ms per ticket, weaker performance on rare or ambiguous categories
Use for: POC, concept validation, low volumes, unstable taxonomies
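As an illustration of "provide the taxonomy in the prompt", here is a minimal sketch of prompt assembly and answer parsing. It deliberately makes no API call: the prompt wording, the category definitions, and the helper names are assumptions, and the string returned by `build_prompt` would be sent through whichever LLM client you use.

```python
import json

# Hypothetical category definitions and few-shot examples.
CATEGORIES = {
    "hardware_failure": "A physical device is broken or unreachable.",
    "billing_question": "The customer asks about an invoice or a payment.",
    "feature_request": "The customer asks for functionality that does not exist.",
}
EXAMPLES = [
    ("The printer on floor 2 no longer powers on.", "hardware_failure"),
    ("Why was I charged twice this month?", "billing_question"),
]


def build_prompt(categories, examples, ticket_title, ticket_body):
    """Assemble a few-shot classification prompt (pure string building)."""
    lines = [
        "You are a support ticket classifier.",
        "Choose exactly one category from this list:",
    ]
    lines += [f"- {name}: {definition}" for name, definition in categories.items()]
    lines.append('Answer with JSON only, for example {"category": "<name>"}.')
    for text, label in examples:
        lines.append(f"Ticket: {text}")
        lines.append(json.dumps({"category": label}))
    lines.append(f"Ticket: {ticket_title}\n{ticket_body}")
    return "\n".join(lines)


def parse_answer(raw_answer, categories):
    """Return the predicted category, or None if the answer is unusable."""
    try:
        label = json.loads(raw_answer).get("category")
    except (json.JSONDecodeError, AttributeError, TypeError):
        return None
    # Reject labels outside the taxonomy instead of trusting the model.
    return label if label in categories else None


prompt = build_prompt(CATEGORIES, EXAMPLES, "Invoice issue", "My March invoice is wrong.")
print(prompt)
```

Rejecting any label that is not in the taxonomy matters in practice: an unusable answer can then be sent to manual triage like any other low-confidence case.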

Over 10,000 tickets per year: fine-tuned encoder model

A fine-tuned encoder model such as CamemBERT (for French-language tickets, pre-trained by a team from Inria, Facebook AI Research, and Sorbonne Université) or BERT variants (for English), trained on your annotated history, is the reference for high-volume deployments. Performance on well-defined categories exceeds F1 0.90. Latency is under 50 ms per ticket. Marginal cost at scale is near zero (model hosted internally or on a dedicated cloud GPU).

Advantages: 5 to 15 F1-point gain over zero-shot, excellent latency, predictable cost at scale, data stays within your infrastructure
Limitations: requires 1,000 to 5,000 annotated tickets per main category, retraining needed when the taxonomy evolves
Use for: high-volume helpdesks, sensitive data, strict SLA requirements

Alternative for sensitive data

Mistral 7B fine-tuned via LoRA is a relevant alternative when tickets contain sensitive internal data (clinical IT helpdesks, HR ticketing). LoRA fine-tuning on 2,000 annotated tickets delivers performance comparable to a BERT model on broad categories, with deployment on sovereign infrastructure (OVH, Scaleway, or on-premises).

Between 5,000 and 10,000 tickets per year: hybrid architecture

The most robust intermediate solution combines both approaches. A lightweight BERT classifier handles routine cases with high confidence (roughly 80 percent of tickets). An LLM handles the ambiguous cases where BERT hesitates (below the confidence threshold). The result: the best of both worlds in terms of cost and accuracy.
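The cascade can be sketched in a few lines. Both models are passed in as plain callables, and the two stubs below are stand-ins for illustration only; in a real deployment `encoder` would wrap the fine-tuned classifier and `llm_fallback` the LLM call. The 0.80 default follows the confidence threshold used elsewhere in this article.

```python
from typing import Callable, Dict, Tuple

Scores = Dict[str, float]


def classify_hybrid(
    text: str,
    encoder: Callable[[str], Scores],
    llm_fallback: Callable[[str], str],
    threshold: float = 0.80,
) -> Tuple[str, str]:
    """Return (label, source): encoder first, LLM only below the threshold."""
    scores = encoder(text)
    label, confidence = max(scores.items(), key=lambda item: item[1])
    if confidence >= threshold:
        return label, "encoder"
    return llm_fallback(text), "llm"


# Stand-in models, for illustration only.
def fake_encoder(text: str) -> Scores:
    if "printer" in text.lower():
        return {"hardware_failure": 0.95, "software_bug": 0.05}
    return {"hardware_failure": 0.55, "software_bug": 0.45}


def fake_llm(text: str) -> str:
    return "software_bug"


print(classify_hybrid("Printer jam on floor 2", fake_encoder, fake_llm))
print(classify_hybrid("Screen freezes when I click save", fake_encoder, fake_llm))
```

Returning the source alongside the label makes it easy to measure what share of tickets actually reaches the LLM, which drives the per-ticket cost of the hybrid setup.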

Criterion	GPT-4o mini zero-shot	Fine-tuned BERT/CamemBERT	Hybrid
Deployment time	1 week	6 to 8 weeks	8 to 10 weeks
Annotations required	0	1,000 to 5,000 per category	500 to 2,000 per category
Typical F1	0.80 to 0.87	0.88 to 0.94	0.87 to 0.92
Latency per ticket	200 to 800 ms	under 50 ms	50 to 300 ms
Cost per ticket	0.001 to 0.003 EUR	near zero at scale	0.0002 to 0.001 EUR
Recommended for	under 5,000 tickets/year	over 10,000 tickets/year	5,000 to 10,000 tickets/year

The continuous improvement pipeline

The classification model is not a fixed deliverable. It is a living system that degrades without maintenance and improves when fed the right data. The feedback loop is the long-term differentiator between a project that still performs 18 months after launch and one that regresses.

The agent feedback loop

Every time an agent corrects a misrouting, they produce a valuable data point: the ticket example with the correct category. If that correction does not feed the next training cycle, you are discarding your primary source of improvement data.

The continuous improvement pipeline runs in four steps:

Correction collection: every manual re-routing by an agent is recorded with the corrected category.
Quality review: once a month, a team lead validates a sample of corrections to ensure consistency with the taxonomy.
Quarterly retraining: validated new data enriches the training dataset. The model is retrained on the consolidated set.
Evaluation on a stratified test set: before deployment, the new model is evaluated on a class-stratified test set to confirm that both overall and per-category performance has improved or been maintained.
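The last step can be expressed as a simple promotion gate. This is a sketch under our own assumptions: per-class F1 scores for both models are already computed on the same stratified test set, the 0.01 tolerance is an arbitrary illustrative value, and the scores below are made up.

```python
def promotion_gate(current_f1, candidate_f1, tolerance=0.01):
    """Compare per-class F1 of a retrained model against the one in production.

    Returns (promote, regressions), where regressions maps each category that
    dropped by more than `tolerance` to its (old, new) scores. Categories that
    exist only in the candidate (new classes) are not compared.
    """
    regressions = {}
    for category, old in current_f1.items():
        new = candidate_f1.get(category, 0.0)
        if new < old - tolerance:
            regressions[category] = (old, new)
    macro_old = sum(current_f1.values()) / len(current_f1)
    macro_new = sum(candidate_f1.get(c, 0.0) for c in current_f1) / len(current_f1)
    # Promote only if overall performance holds and no class regressed.
    promote = not regressions and macro_new >= macro_old
    return promote, regressions


# Hypothetical scores on the same stratified test set.
current = {"hardware_failure": 0.91, "billing_question": 0.88, "security_incident": 0.80}
candidate = {"hardware_failure": 0.93, "billing_question": 0.90, "security_incident": 0.72}

promote, regressions = promotion_gate(current, candidate)
print("promote" if promote else f"hold back, regressions: {regressions}")
```

In this made-up case the candidate improves two classes but is held back because the rare, critical class regressed, which is exactly what an overall score alone would hide.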

Monitoring the output category distribution

Beyond model quality metrics, you need to monitor the distribution of predicted categories. An unusual spike on a specific category is rarely a model problem. It is usually the signal of an ongoing incident or a recurring issue emerging in the product.

This monitoring is also the first detector of concept drift: if a new subcategory gradually emerges in the tickets and the model starts routing it to "other" or a nearby category, that is the signal to enrich the taxonomy and retrain.
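A minimal version of this monitoring compares the share of each predicted category over a recent window with a baseline window. The sketch below is one possible heuristic, not a standard method: the doubling ratio and the 5 percent minimum share are assumptions to tune, and the data is made up.

```python
from collections import Counter


def category_shares(predictions):
    """Share of each predicted category in a list of labels."""
    counts = Counter(predictions)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


def detect_spikes(baseline, recent, ratio=2.0, min_share=0.05):
    """Flag categories whose recent share is at least `ratio` times the baseline.

    Categories below `min_share` in the recent window are ignored to limit
    noise; categories absent from the baseline are always flagged.
    """
    base, now = category_shares(baseline), category_shares(recent)
    alerts = {}
    for category, share in now.items():
        if share < min_share:
            continue
        reference = base.get(category, 0.0)
        if reference == 0.0 or share / reference >= ratio:
            alerts[category] = (reference, share)
    return alerts


# Hypothetical predicted labels: last quarter versus the last week.
baseline = ["access"] * 60 + ["hardware"] * 35 + ["other"] * 5
recent = ["access"] * 50 + ["hardware"] * 30 + ["other"] * 20

for category, (before, after) in detect_spikes(baseline, recent).items():
    print(f"{category}: {before:.0%} -> {after:.0%}, review recent tickets")
```

A growing "other" bucket, as in this example, is the pattern described above: tickets that no longer fit the existing categories.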

To understand how this type of system fits into a broader automation workflow, the article on intelligent automation with n8n provides concrete examples of production processing pipelines.

Zendesk, Freshdesk, and Intercom integration

All three major ticketing platforms expose REST APIs and webhook systems that allow real-time classification integration. The architecture is the same in all three cases, with a few platform-specific nuances.

Standard integration architecture

The standard flow works as follows:

A ticket is created on the platform.
A webhook sends the ticket title and body to the classification API.
The classifier returns labels (category, priority, team) with confidence scores.
If confidence exceeds the threshold (typically 0.80), the platform API applies tags and assignment automatically.
If confidence is below the threshold, the ticket is sent to a manual triage queue with the proposed labels visible to the agent.

The JSON output from the classifier looks like this:

{
  "ticket_id": "TKT-2026-04521",
  "category": "hardware_failure",
  "subcategory": "printer",
  "priority": "high",
  "target_team": "support_tier_2",
  "additional_labels": ["vip_client", "recurring"],
  "confidence": {
    "category": 0.94,
    "priority": 0.87
  },
  "requires_human_routing": false
}
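A payload of this shape can be produced from raw classifier scores with a small routing function. The sketch below is illustrative: the `route` helper, the team mapping, and the score values are hypothetical, and only the category confidence is compared to the threshold here.

```python
def route(ticket_id, category_scores, priority_scores, team_by_category, threshold=0.80):
    """Turn classifier scores into a routing payload like the one above."""
    category, cat_conf = max(category_scores.items(), key=lambda kv: kv[1])
    priority, prio_conf = max(priority_scores.items(), key=lambda kv: kv[1])
    return {
        "ticket_id": ticket_id,
        "category": category,
        "priority": priority,
        "target_team": team_by_category.get(category),
        "confidence": {"category": round(cat_conf, 2), "priority": round(prio_conf, 2)},
        # Manual triage when confidence is too low or no team is mapped.
        "requires_human_routing": cat_conf < threshold or category not in team_by_category,
    }


# Hypothetical mapping and scores.
teams = {"hardware_failure": "support_tier_2", "billing_question": "billing_tier_1"}

confident = route(
    "TKT-2026-04521",
    {"hardware_failure": 0.94, "billing_question": 0.06},
    {"high": 0.87, "normal": 0.13},
    teams,
)
uncertain = route(
    "TKT-2026-04522",
    {"hardware_failure": 0.62, "billing_question": 0.38},
    {"normal": 0.70, "high": 0.30},
    teams,
)
print(confident["requires_human_routing"], uncertain["requires_human_routing"])
```

The uncertain ticket still carries its proposed labels, so the agent in the triage queue sees a suggestion and does not start from scratch.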

Platform-specific notes

Zendesk: the Tickets API allows updating custom fields, tags, and assignment in a single request. Zendesk webhooks are stable and well-documented. Typical integration time: 3 to 5 development days.
Freshdesk: webhooks are configurable from the automation interface. The REST API is consistent but custom fields require upfront configuration. Typical integration time: 3 to 4 days.
Intercom: the API is conversation-oriented rather than ticket-oriented. Concepts need mapping (conversation to ticket, tags to categories). Better suited for intent classification than team routing. Integration time: 5 to 8 days.
On-premises ServiceNow (special case): REST APIs for on-premises ServiceNow are often poorly documented and require version-specific adaptations. Add 2 to 3 extra weeks to the plan.
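For Zendesk specifically, the single-request update can be sketched as follows. This only builds the HTTP request and does not send it. The subdomain, credentials, ticket ID, and group ID are placeholders, and the endpoint and field names reflect our understanding of the Zendesk Tickets API, so check them against the current official reference before relying on them.

```python
import base64
import json
import urllib.request


def build_zendesk_update(subdomain, email, api_token, ticket_id, tags, group_id, priority):
    """Build (but do not send) a ticket update request for Zendesk."""
    url = f"https://{subdomain}.zendesk.com/api/v2/tickets/{ticket_id}.json"
    body = {"ticket": {"tags": tags, "group_id": group_id, "priority": priority}}
    # API-token authentication: basic auth with "<email>/token" as user name.
    credentials = base64.b64encode(f"{email}/token:{api_token}".encode()).decode()
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {credentials}",
        },
        method="PUT",
    )


# Placeholder values for illustration.
request = build_zendesk_update(
    "example", "bot@example.com", "API_TOKEN", 4521,
    ["hardware_failure", "printer"], 42, "high",
)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```

As far as we know, setting `tags` in an update replaces the ticket's existing tag list, so merge with the current tags first if agents also tag tickets by hand.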

On GDPR: tickets can contain customer personal data. Model inference logs must be anonymized or not retained. If you use an external LLM API (OpenAI, Mistral API), verify the data processing terms. For sensitive data, a self-hosted model is preferable.
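As a first line of defense, obvious identifiers can be masked before a ticket is logged. The sketch below is a deliberately crude illustration: the two regular expressions are our own assumptions and only catch email addresses and long digit sequences. Masking of this kind is not full anonymization, and names or addresses in free text would need a dedicated tool.

```python
import re

# Crude patterns, for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# Any run of 9+ digits, optionally separated by spaces, dots or hyphens.
# This also masks order numbers and similar references.
PHONE = re.compile(r"\+?\d[\d .-]{7,}\d")


def redact(text):
    """Mask email addresses and phone-like digit sequences before logging."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


print(redact("Contact jane.doe@example.com or +33 6 12 34 56 78"))
```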

Our case study on enterprise RAG systems illustrates how these data sovereignty constraints shaped the architecture choice for a production deployment.

Metrics and realistic targets

Two measurement mistakes recur systematically on classification projects. The first is tracking only overall accuracy. The second is looking only at model metrics without connecting them to business indicators.

Model metrics to track

Metric	Realistic target	Why it matters
Macro F1 (all categories)	above 0.87	Measures performance across each class without bias from majority classes
F1 on priority categories	above 0.92	Critical tickets cannot be misrouted
Correct automatic routing rate	above 88%	Share of tickets correctly assigned without human intervention
Cost per classified ticket	below 0.003 EUR	Baseline for ROI calculation
Latency (real-time assignment)	under 500 ms	Imperceptible to agents, compatible with SLA workflows

Why overall accuracy is misleading

On a helpdesk where 70 percent of tickets fall into two main categories, a model that only ever predicts those two categories can still reach 70 percent overall accuracy. It is nonetheless useless for the remaining 30 percent, which are often the most urgent. Macro F1, read alongside per-class F1, is the relevant metric for imbalanced classification. An F1 average weighted by class frequency is not a substitute, because majority classes dominate it just as they dominate accuracy.
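The gap between the two metrics is easy to reproduce. The sketch below uses made-up data: 100 tickets across four categories, and a degenerate model that is perfect on the two large categories but sends every remaining ticket to the largest one.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """F1 score for each label, computed from TP, FP and FN counts."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1
            fn[truth] += 1
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


# Hypothetical test set: two categories hold 70 percent of tickets.
y_true = ["access"] * 40 + ["hardware"] * 30 + ["outage"] * 20 + ["security"] * 10
# Degenerate model: never predicts "outage" or "security".
y_pred = ["access"] * 40 + ["hardware"] * 30 + ["access"] * 30

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1_by_class = per_class_f1(y_true, y_pred)
macro_f1 = sum(f1_by_class.values()) / len(f1_by_class)

print(f"accuracy = {accuracy:.2f}, macro F1 = {macro_f1:.2f}")
print(f1_by_class)
```

Accuracy comes out at 0.70 while macro F1 is about 0.43, because the two classes the model never predicts each contribute an F1 of zero to the average.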

Business metrics to connect

Model metrics only make sense when connected to operational indicators:

Average time to first response: should drop significantly after deployment (target: a 40 percent reduction)
Manual re-routing rate: tickets moved from one team to another after initial assignment (target: below 8 percent)
FCR: first contact resolution rate, indicator of routing relevance
Time saved on triage: directly measurable in hours per week

For a deeper look at measuring AI project benefits, the article on AI project costs and TCO proposes a structured methodology applicable to classification projects.

Costs, timeline, and TCO

POC (4 to 6 weeks): 5,000 to 10,000 euros

The POC covers taxonomy review and stabilization with the teams, annotation of 1,000 to 3,000 historical tickets, classifier training and evaluation (F1 per class, confusion matrix), and a performance dashboard. It does not cover production integration.

Production MVP (2 to 3 months): 12,000 to 20,000 euros

The MVP adds API integration with the ticketing platform, a human triage workflow for cases below the confidence threshold, production monitoring, and feedback loop setup. This is the stage where business value becomes visible.

Typical timeline

Weeks 1 to 2: taxonomy workshop with teams, historical data extraction and annotation
Weeks 3 to 5: classifier training, class-stratified evaluation set, confidence threshold calibration
Weeks 6 to 8: ticketing API integration, UAT testing, shadow mode deployment (classifier runs but makes no automatic decisions)
Weeks 9 to 12: progressive activation, monitoring, first retraining on new data
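The confidence threshold calibration in weeks 3 to 5 can be done with a simple sweep over validation predictions. The sketch below is one possible approach: it picks the lowest threshold at which automatically routed tickets reach a target accuracy. The 0.95 target and the validation data are hypothetical.

```python
def calibrate_threshold(confidences, correct, target_accuracy=0.95):
    """Lowest threshold whose auto-routed tickets reach the target accuracy.

    `confidences` are top-class scores on a validation set and `correct`
    says whether each prediction was right. Returns (threshold, coverage),
    where coverage is the share of tickets routed automatically, or None
    if no threshold reaches the target.
    """
    for threshold in sorted(set(confidences)):
        kept = [ok for conf, ok in zip(confidences, correct) if conf >= threshold]
        if sum(kept) / len(kept) >= target_accuracy:
            return threshold, len(kept) / len(confidences)
    return None


# Hypothetical validation results.
confidences = [0.99, 0.95, 0.91, 0.85, 0.80, 0.72, 0.65, 0.55]
correct = [True, True, True, True, False, True, False, False]

print(calibrate_threshold(confidences, correct))
```

The coverage figure is the trade-off to discuss with the support manager: a stricter target means fewer routing errors but a larger manual triage queue.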

Annual TCO: 5,000 to 12,000 euros

The annual TCO after production launch covers the model cost (self-hosted or API), quarterly retraining (1 to 2 days of work per quarter), and residual human review integrated into existing workflows.

This figure should be compared to the annual cost of manual triage. On a helpdesk with 10 agents receiving 15,000 tickets per year, if triage represents 15 percent of each agent's time, automated classification frees the equivalent of 1.5 FTEs, a positive ROI in year one.
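The same arithmetic can be turned around to find the break-even point. The sketch below assumes, as the example above does, that all triage time is actually recovered, and it conservatively counts the POC, the MVP, and a full annual TCO in year one. The fully loaded cost per FTE is left for you to supply, since it varies by company.

```python
def break_even_cost_per_fte(agents, triage_share, first_year_cost):
    """Loaded annual cost per FTE above which year one pays for itself."""
    fte_freed = agents * triage_share
    return first_year_cost / fte_freed


# Figures from this article, in euros: POC + production MVP + annual TCO.
low_cost = 5_000 + 12_000 + 5_000
high_cost = 10_000 + 20_000 + 12_000

low = break_even_cost_per_fte(10, 0.15, low_cost)
high = break_even_cost_per_fte(10, 0.15, high_cost)
print(f"break-even loaded cost per FTE: {low:,.0f} to {high:,.0f} EUR")
```

With 1.5 FTEs freed, the project pays for itself in year one as soon as the fully loaded annual cost of an agent exceeds roughly 14,700 to 28,000 euros, depending on where the project lands in the cost ranges.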

For broader context on AI project cost structures, our article on RAG project costs and TCO details comparable cost frameworks across AI initiatives.

Common pitfalls

Pitfall 1: vague or unstable taxonomy

If humans themselves cannot agree on the category of a given ticket, the model cannot learn. A Cohen's kappa below 0.70 on inter-annotator agreement is the signal to rework definitions before annotating. This upfront work avoids having to discard and redo the entire annotation two months later.

Pitfall 2: ignoring class imbalance

In practice, 20 percent of categories represent 80 percent of tickets. Without oversampling or class weighting for minority classes, the model ignores rare categories. Those categories are often the most critical (security incidents, major outages, VIP clients). The technical solution is to weight classes during training and augment rare data with LLM paraphrasing when necessary.
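Class weighting can be as simple as inverse-frequency weights passed to the training loss. The sketch below computes them in plain Python with the formula n_samples / (n_classes * class_count), which is the same heuristic scikit-learn applies for `class_weight="balanced"`. The label counts are made up.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Hypothetical training set with one dominant and one rare category.
labels = (
    ["usage_question"] * 800
    + ["software_bug"] * 150
    + ["security_incident"] * 50
)

for label, weight in balanced_class_weights(labels).items():
    print(f"{label}: {weight:.2f}")
```

With these counts, an error on a security incident weighs about sixteen times more than an error on a usage question, which pushes the model to stop ignoring the rare class.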

Pitfall 3: forgetting concept drift

A model trained in January can be significantly degraded by September if the product has evolved, if new problem types have appeared, or if the customer base has shifted. Monitoring the output category distribution in production is the first detector of this phenomenon. The quarterly retraining cycle is the operational response.

Pitfall 4: treating classification as a black box

Operational teams must understand why a ticket was misrouted; otherwise they lose confidence in the system and route around it. Adding an explanation (the words or phrases that influenced the decision, rephrased in natural language) is essential for adoption. It also allows agents to correct intelligently, rather than just re-routing without understanding.

Pitfall 5: no structured feedback loop

Agent corrections are your most valuable source of improvement data. If they are not collected, tracked, and fed back into training cycles, the model stagnates or regresses while your tickets evolve. This is the differentiator between a project that performs at 12 months and one that needs to be rebuilt.