"We have anonymized the data." Said with full confidence by technically competent teams, that statement is wrong in roughly 60% of cases. What they have actually done is pseudonymization. The data remains personal data under GDPR. The processing is still regulated. And if that data is then sent to an analytics vendor or a cloud LLM for "processing," the problem is unresolved.
The confusion between anonymization and pseudonymization is not a legal technicality. It is the starting point of most GDPR non-compliance findings. It is also the first issue we address systematically in our AI audits at Tensoria.
This article is written for technical teams and Data Protection Officers who want to build a genuinely compliant anonymization pipeline. It covers the legal distinction (GDPR Article 4 and Recital 26, EDPB tests), the recommended on-premise architecture using Microsoft Presidio and spaCy, the configuration of jurisdiction-specific entities (NIR/SSN, SIRET, IBAN, vehicle plates), and the cloud LLM paradox for this use case. Everything here comes from the field, not from a compliance template.
Key takeaways
- ✓ Pseudonymization still means personal data under GDPR (Art. 4(5)): confusing it with anonymization is the number-one compliance trap
- ✓ Genuine anonymization must resist three EDPB tests: singling out, linkability, and inference
- ✓ Recommended on-premise stack: Microsoft Presidio + spaCy fr_core_news_lg + custom business recognizers (NIR, SIRET, IBAN, plate, RPPS)
- ✓ Cloud LLM paradox: sending personal data to a US LLM to anonymize it is itself a transfer subject to GDPR
- ✓ Recall above 97% overall and above 99% on critical entities: a false negative is a data leak, not an acceptable metric
- ✓ Proof of concept: EUR 5,000 to 9,000, 4 to 6 weeks. Production MVP: EUR 12,000 to 22,000.
1. Anonymization vs pseudonymization: the legal distinction that changes everything
This is the mandatory starting point, and it is frequently skipped. The distinction between anonymization and pseudonymization is not semantic. It determines whether your data remains subject to GDPR or exits its scope entirely. The threshold is much higher than most teams assume.
Pseudonymization: still within GDPR scope
Article 4(5) of the GDPR defines pseudonymization as the processing of personal data in such a manner that it can no longer be attributed to a specific data subject without the use of additional information. That additional information (the mapping key, the vault) must be kept separately and secured.
The legal consequence is unambiguous: pseudonymized data remains personal data. All GDPR principles continue to apply: lawful basis for processing, data subject rights, security obligations, processing register entries, and retention limits. Pseudonymization is a recognized and encouraged security measure under GDPR Article 25 (data protection by design), but it does not take data outside the regulation.
In practice, replacing "John Smith" with "PERSON_4821" or a SHA-256 hash is pseudonymization. Re-identification remains possible if the mapping table is accessible.
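To make the point about hashes concrete, here is a minimal sketch of a dictionary attack on an unsalted SHA-256 pseudonym. The record, the candidate list, and the function name `sha256_pseudonym` are hypothetical and only serve the illustration; the assumption is that the attacker can obtain a plausible list of names (a staff directory, for example).

```python
import hashlib

def sha256_pseudonym(name: str) -> str:
    """Unsalted SHA-256 of a normalized name (a common but weak pseudonym)."""
    return hashlib.sha256(name.strip().lower().encode("utf-8")).hexdigest()

# Hypothetical pseudonymized record and hypothetical candidate names.
pseudonymized_record = {"person": sha256_pseudonym("John Smith"), "diagnosis": "asthma"}
candidate_names = ["Alice Martin", "John Smith", "Paul Durand"]

# The attacker hashes every candidate and looks for a match.
# No access to the mapping table is needed.
lookup = {sha256_pseudonym(name): name for name in candidate_names}
reidentified = lookup.get(pseudonymized_record["person"])
print(reidentified)
```

The hash looks opaque, yet the person is recovered as soon as the name appears in a guessable list, which is why hashing alone stays on the pseudonymization side of the line.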
Anonymization: exiting GDPR scope
Anonymization under GDPR is an irreversible process. Recital 26 states that the regulation's principles do not apply to information that does not relate to an identified or identifiable natural person, or to data rendered anonymous in such a way that the data subject is no longer identifiable.
The key phrase is "no longer identifiable," assessed by taking into account all means reasonably likely to be used: cost, time, available technologies, and accessible sources of information. This test is not static: what is "reasonably impossible" to re-identify today may not be in five years as new techniques emerge.
Key point
Correctly anonymized data exits GDPR scope and can be used freely: model training, third-party sharing, open data publication, long-term archiving. That is the objective of any serious anonymization effort. But reaching it is harder than simply replacing names.
Why the confusion is systemic
The confusion stems from tooling terminology. Most anonymization libraries, including Presidio itself, use the terms "anonymization" and "pseudonymization" interchangeably in their documentation. Presidio offers masking, replacement, hashing, and encryption operators. Encryption is reversible by design, and deterministic hashing or consistent replacement keeps records linkable, so most configurations qualify as pseudonymization under GDPR.
This is not a criticism of Presidio: it is a technical tool whose legal classification depends entirely on how it is used. Replacing an entity with a synthetic, non-reversible label (PERSON_001 with no vault) tends toward anonymization. Hashing with a stored secret key is pure pseudonymization. The legal qualification belongs to the DPO, not the developer.
2. GDPR framework: Article 4, Recital 26, and the EDPB tests
Before writing the first line of code, the team must understand the regulatory framework. These three references define what "correctly anonymized" means under EU data protection law.
Article 4: the reference definitions
Article 4 of the GDPR contains the statutory definitions. Two are central to this topic. The full text is available on the GDPR.eu reference portal.
- Article 4(1): "personal data" means any information relating to an identified or identifiable natural person. The list of identifiers is non-exhaustive: name, identification number, location data, online identifier, or factors specific to the physical, physiological, genetic, mental, economic, cultural, or social identity of that person.
- Article 4(5): definition of pseudonymization. Keeping the additional information (the key) separately is a legal requirement, not a recommendation.
Recital 26: the identifiability test
Recital 26 is the reference text for evaluating whether data is genuinely anonymized. It states that to determine whether a natural person is identifiable, account should be taken of "all the means reasonably likely to be used, such as singling out, either by the controller or by another person to identify the natural person directly or indirectly."
This test is contextual and dynamic. It is not enough that you, as the controller, cannot re-identify. No one must be able to do so using reasonable means, including a malicious third party with access to public sources (LinkedIn, open data, cross-referencing).
The three EDPB anonymization tests
The Article 29 Working Party (WP29), the predecessor of the European Data Protection Board (EDPB), set out three criteria in its Opinion 05/2014 on Anonymisation Techniques. Data must withstand all three simultaneously to be considered anonymized:
| Test | Question | Example of a violation |
|---|---|---|
| Singling out | Can an individual be isolated in the dataset? | In a medical dataset, the only person with "occupation = cardiac surgeon" and "region = South West" is identifiable even without a name |
| Linkability | Can two records belonging to the same individual be linked? | The same individual appearing in two anonymized tables with rare common attributes allows an indirect join |
| Inference | Can an attribute value for an individual be deduced with significant probability? | If all members of a cohort share a sensitive attribute (illness, conviction), group membership reveals the attribute |
These three tests apply primarily to structured datasets. For unstructured text (contracts, correspondence, reports), the approach differs: identifying entities are removed or replaced, and contextual re-identification risk is evaluated separately.
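For structured datasets, the singling-out test can be approximated by counting how many records share each combination of quasi-identifiers. The sketch below is a simplified check under assumptions: the column names and sample records are hypothetical, and a real assessment would also cover linkability and inference.

```python
from collections import Counter

def group_sizes(records, quasi_identifiers):
    """Count records per combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)

def smallest_group_size(records, quasi_identifiers):
    """Return k, the size of the smallest group (0 for an empty dataset)."""
    return min(group_sizes(records, quasi_identifiers).values(), default=0)

def singled_out(records, quasi_identifiers):
    """Return the combinations that match exactly one record."""
    sizes = group_sizes(records, quasi_identifiers)
    return [combo for combo, count in sizes.items() if count == 1]

# Hypothetical records with names already removed.
records = [
    {"occupation": "nurse", "region": "South West"},
    {"occupation": "nurse", "region": "South West"},
    {"occupation": "cardiac surgeon", "region": "South West"},
]
qi = ["occupation", "region"]
k = smallest_group_size(records, qi)
unique = singled_out(records, qi)
print(k, unique)
```

A result of k = 1 means at least one person can be isolated without any name in the dataset, which is exactly the violation described in the first row of the table.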
3. Recommended on-premise architecture
A PII anonymization pipeline runs as a sequence of stages. Each stage has a specific role and metrics to monitor. For teams already thinking about how this connects to their broader data infrastructure, our article on enterprise data readiness for AI covers the foundational groundwork that makes anonymization tractable at scale.
Step 1: extraction and pre-processing
Documents rarely arrive as plain text. Common formats in enterprise environments are native PDFs, scanned PDFs, DOCX, emails, database fields, and audio transcripts. Each format requires a dedicated extractor:
- Native PDFs: pdfplumber or PyMuPDF, with table and header handling.
- Scanned PDFs: OCR is mandatory (Tesseract with a language model, or an on-premise OCR service). OCR quality directly determines entity detection quality.
- DOCX and emails: python-docx, eml-parser. Watch document metadata, which often contains personal data (document author, revision history).
- Databases: column-by-column processing, with prior identification of free-text columns versus structured fields.
For pipelines that need to extract structured data from complex document types before anonymization, see our deep-dive on PDF data extraction AI architecture, which covers layout-aware parsing, table detection, and multi-format ingestion.
Step 2: personal entity detection
This is the core of the pipeline. Three detection layers combine:
- NER via spaCy fr_core_news_lg: detection of named entities (persons, organizations, locations, dates). This is the baseline semantic layer.
- Presidio recognizers: 40+ types of structured entities (email, phone, IP address, credit card number, generic IBAN, GPS coordinates).
- Custom regex recognizers: jurisdiction-specific entities not covered natively (see next section).
Step 3: substitution strategy
Depending on the downstream use, you select a different strategy:
- Strict anonymization: replacement with a synthetic entity of the same type and no vault (PERSON_001, ADDRESS_001). Irreversible. Applicable for long-term archiving or external sharing.
- Pseudonymization with vault: replacement with an encrypted alias, mapping stored in an isolated vault (HashiCorp Vault or equivalent). Reversible for internal analytics use cases that require re-identification.
- Redaction: [REDACT] or ####. For archived documents that must remain human-readable without exposing personal data.
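As an illustration of the strict anonymization strategy, the following sketch replaces detected spans with numbered synthetic labels and keeps no mapping. It assumes the detector returns `(start, end, entity_type)` character spans; the function name `substitute` and the sample sentence are hypothetical.

```python
from collections import defaultdict

def substitute(text, spans):
    """Replace each (start, end, entity_type) span with a label such as [PERSON_001]."""
    ordered = sorted(spans)  # reading order, so numbering follows the text
    counters = defaultdict(int)
    labels = []
    for _, _, entity_type in ordered:
        counters[entity_type] += 1
        labels.append(f"[{entity_type}_{counters[entity_type]:03d}]")
    # Apply replacements from the end so earlier offsets stay valid.
    for (start, end, _), label in reversed(list(zip(ordered, labels))):
        text = text[:start] + label + text[end:]
    return text

text = "Jean Dupont wrote to Marie Curie."
# Hypothetical detector output, built here from known substrings.
spans = [(text.index(n), text.index(n) + len(n), "PERSON")
         for n in ("Jean Dupont", "Marie Curie")]
anonymized = substitute(text, spans)
print(anonymized)
```

Because nothing is stored, a second mention of the same person receives a new label in this sketch. Consistent labels across a document or a corpus require a register, which moves the design toward pseudonymization with a vault.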
Step 4: post-processing and audit trail
Two post-processing constraints are critical for compliance:
- Cross-document consistency: "John Smith" must receive the same alias across all documents in the corpus. Without a persistent alias register, cross-document analyses on pseudonymized data become unreliable.
- Audit trail: timestamped logging of every anonymization operation (document processed, entities detected by type, method applied, operator, hash of the original document). This is the compliance evidence required in the event of a supervisory authority audit.
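The audit trail entry described above can be sketched as one JSON line per processed document. Field names follow the sample report shown later in this article; the function name `build_audit_record` and the example values are assumptions for illustration.

```python
import hashlib
import json
from datetime import datetime, timezone

def build_audit_record(document_id, original_bytes, entity_counts, method, operator):
    """Build one audit entry; the original content is referenced only by its hash."""
    return {
        "document_id": document_id,
        "processing_timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "entities_detected": dict(entity_counts),
        "method": method,
        "operator": operator,
        "original_document_hash": "sha256:" + hashlib.sha256(original_bytes).hexdigest(),
    }

# Hypothetical document and detection counts.
record = build_audit_record(
    document_id="DOC-2026-00341",
    original_bytes=b"raw document content",
    entity_counts={"PERSON": 1, "FR_NIR": 1},
    method="anonymization_synthetic_substitution",
    operator="batch-service",
)
audit_line = json.dumps(record, sort_keys=True)  # append this line to a write-once log
print(audit_line)
```

Writing entries to an append-only store, separate from the documents themselves, keeps the trail usable as evidence even after the source files are deleted.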
4. Presidio + spaCy: configuration and EU-specific entities
Microsoft Presidio is the open-source framework (MIT license) that has become the standard for this type of project. Its two-component architecture, Analyzer (detection) and Anonymizer (transformation), is clear and auditable. It deploys fully on-premise with zero external network calls.
Configuring French NER with spaCy
Presidio uses spaCy by default with an English model. Processing French documents requires a few adjustments:
```python
from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NlpEngineProvider

# Configure NLP engine with spaCy French model
configuration = {
    "nlp_engine_name": "spacy",
    "models": [
        {"lang_code": "fr", "model_name": "fr_core_news_lg"}
    ],
}

provider = NlpEngineProvider(nlp_configuration=configuration)
nlp_engine = provider.create_engine()

analyzer = AnalyzerEngine(
    nlp_engine=nlp_engine,
    supported_languages=["fr"]
)
```
The fr_core_news_lg (large) model is recommended over fr_core_news_sm for better recall on French named entities, at the cost of a larger memory footprint (~570 MB vs ~16 MB). On a dedicated server, this overhead is negligible relative to the coverage gain.
Custom recognizers for EU-specific entities
France-specific and broader EU identifiers are not covered natively by Presidio. Custom recognizers based on regular expressions and checksums are required:
| Entity | Format | Recognizer complexity | Known pitfalls |
|---|---|---|---|
| NIR (French SSN) | 1 84 06 75 116 042 68 | Medium: regex + control key check (97 minus the first 13 digits modulo 97) | Variable spacing, provisional NIR, foreign NIR |
| SIRET | 552 178 639 00143 | Low: 14-digit regex + Luhn algorithm | Confusion with phone numbers or postal codes |
| French IBAN | FR76 3000 6000 0112 3456 7890 189 | Low: ISO 13616 regex + mod97 checksum | Foreign IBANs in files, split across multiple lines |
| Vehicle registration plate | AB-123-CD | Low: regex for SIV post-2009 and legacy formats | Old prefectural plates, diplomatic plates |
| RPPS physician number | 10002345678 (11 digits) | Low: regex + context ("Dr", "physician") | Confusion with other 11-digit numbers absent context |
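```python
from presidio_analyzer import Pattern, PatternRecognizer

# Example: NIR recognizer (French social security number).
# Optional single spaces between groups cover both "178033111604268"
# and "1 78 03 31 116 042 68".
# Not covered here: overseas departments (97x, 98x) and births abroad (99).
nir_pattern = Pattern(
    name="NIR_pattern",
    regex=(
        r"\b[12] ?[0-9]{2} ?(?:0[1-9]|1[0-2]|[2-9][0-9]) ?"
        r"(?:0[1-9]|[1-8][0-9]|9[0-5]|2[AB]) ?[0-9]{3} ?[0-9]{3} ?[0-9]{2}\b"
    ),
    score=0.85,
)

nir_recognizer = PatternRecognizer(
    supported_entity="FR_NIR",
    supported_language="fr",  # must match the analyzer language; the default is "en"
    patterns=[nir_pattern],
    context=["social security", "NIR", "securite sociale", "sécurité sociale",
             "carte vitale", "assure", "assuré", "numero SS", "numéro SS"],
)

analyzer.registry.add_recognizer(nir_recognizer)
```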
Adding context words (context) improves precision: Presidio raises the confidence score when the detected entity is preceded or followed by associated terms. This reduces false positives on incidental numeric sequences.
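The checksum column of the table above can be sketched with three small validators in plain Python. This is a minimal illustration under assumptions, not Presidio code: the function names are hypothetical, the SIRET and NIR values in the usage lines are test values chosen because their checksums hold, and special cases (provisional NIR, non-standard SIRETs) are ignored.

```python
import re

def luhn_valid(number: str) -> bool:
    """Luhn check, as used for SIREN/SIRET numbers."""
    cleaned = re.sub(r"\s", "", number)
    if not cleaned.isdigit():
        return False
    total = 0
    for i, ch in enumerate(reversed(cleaned)):
        d = int(ch)
        if i % 2 == 1:          # double every second digit from the right
            d = d * 2 - 9 if d * 2 > 9 else d * 2
        total += d
    return total % 10 == 0

def iban_valid(iban: str) -> bool:
    """ISO 13616 mod-97 check: move the first 4 chars to the end, letters become 10..35."""
    s = re.sub(r"\s", "", iban).upper()
    if not re.fullmatch(r"[A-Z]{2}[0-9]{2}[A-Z0-9]{11,30}", s):
        return False
    rearranged = s[4:] + s[:4]
    numeric = "".join(str(int(c, 36)) for c in rearranged)
    return int(numeric) % 97 == 1

def nir_valid(nir: str) -> bool:
    """NIR control key: 97 minus (first 13 digits mod 97); Corsica 2A -> 19, 2B -> 18."""
    s = re.sub(r"\s", "", nir).upper()
    if not re.fullmatch(r"[0-9]{5}(?:[0-9]{2}|2[AB])[0-9]{8}", s):
        return False
    body = s[:13].replace("2A", "19").replace("2B", "18")
    return 97 - int(body) % 97 == int(s[13:])

print(iban_valid("FR76 3000 6000 0112 3456 7890 189"))
print(luhn_valid("732 829 320 00074"))
print(nir_valid("1 78 03 31 116 042 62"))
```

Given the recall-first logic of this pipeline, a failed checksum is better used to lower the confidence score than to discard the match: a mistyped identifier in a real document is still personal data.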
Sample pipeline output
Input:

```text
Mr. Jean Dupont (born 12/03/1978, residing at 14 rue des Lilas,
31000 Toulouse, NIR: 1 78 03 31 116 042 68) reported a claim
on 15 March 2026. His IBAN: FR76 3000 6000 0112 3456 7890 189.
```

Anonymized output (synthetic entity replacement):

```text
Mr. [PERSON_001] (born [DATE_001], residing at [ADDRESS_001],
31000 [CITY_001], NIR: [FR_NIR_001]) reported a claim
on [DATE_002]. His IBAN: [IBAN_001].
```

Anonymization report (attached):

```json
{
  "document_id": "DOC-2026-00341",
  "processing_timestamp": "2026-05-18T10:14:00Z",
  "entities_detected": {
    "PERSON": 1, "DATE_TIME": 2, "LOCATION": 2,
    "FR_NIR": 1, "IBAN_CODE": 1
  },
  "method": "anonymization_synthetic_substitution",
  "estimated_coverage": 0.97,
  "original_document_hash": "sha256:4a8f2c..."
}
```
5. Presidio, LLM, or CamemBERT NER: which model to choose?
There is no single right answer. The choice depends on document volume, the nature of the entities to detect, and compliance constraints. This is the decision matrix we apply in the field.
| Approach | Recall on structured entities | Contextual entities | Throughput | Sovereignty |
|---|---|---|---|---|
| Presidio + spaCy fr | Very high on known entities | Low (no contextual understanding) | > 100 docs/min on CPU | 100% on-premise |
| CamemBERT NER fine-tuned | High (if fine-tuned on your domain) | Medium | 10 to 30 docs/min (GPU recommended) | 100% on-premise |
| Presidio + on-premise LLM (Mistral) | Very high | High | 5 to 15 docs/min depending on GPU | 100% on-premise if Mistral is local |
| Cloud LLM only (GPT-4o, Claude) | Very high | Very high | Rate-limited | No (see cloud paradox section) |
Default recommendation
For 80 to 90% of enterprise projects, Presidio + spaCy fr + custom recognizers is the right answer. The framework covers structured entities with high recall, runs on standard CPU hardware with no GPU cost, and remains entirely on-premise.
An on-premise LLM (Mistral 7B or 22B) is added only for contextual entities that escape classic NER: a person mentioned without an explicit surname ("the head of the Toulouse subsidiary"), a bank account described in prose rather than IBAN format, or complex indirectly identifying data. The LLM only processes passages with a low confidence score from the first pipeline, which limits required GPU resources. For teams deploying Mistral on-premise, see our guide on deploying LLMs to production, which covers containerization, inference optimization, and monitoring.
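The routing step can be sketched as a simple filter on first-pass confidence scores. The threshold value, the passage structure, and the function name `passages_for_llm` are assumptions made for this illustration; the actual rule should be tuned on the annotated test set.

```python
def passages_for_llm(passages, threshold=0.6):
    """Return the passages whose lowest first-pass detection score is below `threshold`."""
    selected = []
    for passage in passages:
        scores = passage["scores"]
        if scores and min(scores) < threshold:
            selected.append(passage)
    return selected

# Hypothetical first-pass results: one list of detection scores per passage.
passages = [
    {"id": "p1", "scores": [0.95, 0.85]},  # confident detections: stays on the CPU path
    {"id": "p2", "scores": [0.90, 0.40]},  # one weak detection: sent to the LLM
    {"id": "p3", "scores": []},            # nothing detected: not routed by this rule
]
routed_ids = [p["id"] for p in passages_for_llm(passages)]
print(routed_ids)
```

Passages with no detection at all are not routed by this rule. Whether to sample them for LLM review is a recall decision to make with the DPO, since contextual entities often sit in exactly those passages.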
Fine-tuning CamemBERT on an annotated domain corpus is relevant when you have a highly specific domain (medical, legal, HR) with entities that spaCy does not recognize. It requires an annotated dataset of 500 to 2,000 examples, which represents a significant human investment. This is an MVP-stage decision, not a POC one.
6. Pipeline evaluation: precision, recall, and false negatives
Evaluating an anonymization pipeline inverts the usual priorities of a classification system: recall takes precedence over precision. A false negative (a personal entity not detected) is a data leak. A false positive (a non-personal entity incorrectly removed) degrades document readability but does not constitute a GDPR violation.
Production targets
| Metric | Minimum target | Notes |
|---|---|---|
| Global recall | > 97% | Measured on representative annotated test set, not generic benchmark |
| Recall on NIR / IBAN / Full name | > 99% | Critical entities: zero tolerance |
| Global precision | > 90% | False positives acceptable if justified to the DPO |
| F1 on critical entities | > 0.96 | NIR, IBAN, full name, date of birth |
| Cross-document alias consistency | > 99% | Same entity, same alias across entire corpus |
| Throughput (Presidio CPU stack) | > 100 docs/min | Standard 8-core CPU server, 2 to 5-page documents |
| Cost per document | < EUR 0.001 | On-premise Presidio stack, excluding monthly infrastructure cost |
Building the test set
The test set must be representative of your real documents, not generic datasets. The recommended approach:
- Select a stratified sample of 200 to 500 documents covering all document types processed (contracts, correspondence, emails, forms, reports).
- Manually annotate personal entities in that sample (at least two annotators to measure inter-annotator agreement).
- Run the pipeline on the sample and compute metrics by entity type and document type.
- Identify recurring false negative patterns and adjust recognizers accordingly.
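The metric computation in the third step can be sketched as follows, assuming annotations and predictions are both expressed as `(doc_id, start, end, entity_type)` tuples and scored on exact span match. The sample data is hypothetical, and a production evaluation would likely also accept partial overlaps.

```python
from collections import defaultdict

def evaluate(gold, predicted):
    """Per-entity-type recall and precision on exact-match spans."""
    stats = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for item in gold:
        stats[item[3]]["tp" if item in predicted else "fn"] += 1
    for item in predicted - gold:
        stats[item[3]]["fp"] += 1
    report = {}
    for entity_type, s in stats.items():
        found, expected = s["tp"] + s["fp"], s["tp"] + s["fn"]
        report[entity_type] = {
            "recall": s["tp"] / expected if expected else None,
            "precision": s["tp"] / found if found else None,
            "false_negatives": s["fn"],  # each one is a potential data leak
        }
    return report

# Hypothetical annotations and pipeline output.
gold = {("d1", 0, 11, "PERSON"), ("d1", 40, 61, "FR_NIR"), ("d2", 5, 16, "PERSON")}
predicted = {("d1", 0, 11, "PERSON"), ("d1", 40, 61, "FR_NIR"), ("d2", 30, 35, "PERSON")}
report = evaluate(gold, predicted)
print(report)
```

Reporting false negatives as an absolute count per entity type, alongside the ratios, makes it easier to review each missed entity individually during the fourth step.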
This real-data evaluation phase is non-negotiable. Generic benchmarks (CoNLL-2003, WikiNER) do not reflect the specifics of your documents: internal language, business abbreviations, formats specific to your sector. The same principle applies to LLM evaluation at large: our article on LLM-as-judge and custom evaluators explains how to build evaluation grids anchored to real data.
7. Reversibility and the pseudonymization vault
When the downstream use requires the ability to re-identify individuals (internal analytics, responding to a GDPR data access request, correcting an error), reversible pseudonymization is chosen over strict anonymization. The choice depends on the processing purpose, to be defined with the DPO before any code is written.
Vault architecture
The vault is an encrypted register that maps original entities to their pseudonymized aliases. It must be:
- Stored separately from the pseudonymized documents (Article 4(5) GDPR requires the additional information to be kept separately and protected by technical and organisational measures).
- Encrypted at rest (AES-256 minimum) and in transit (TLS 1.3).
- Access-restricted and audited: logging of every vault access with operator identity, timestamp, and reason.
- Time-limited: the vault must be deleted or made inaccessible at the end of the data retention period.
HashiCorp Vault (self-hosted, on-premise deployable; source-available under the Business Source License since 2023) is the reference solution for this requirement. It provides a dedicated secrets engine, granular access policies, and a native audit log.
Cross-document consistency with a vault
The vault also solves the cross-document consistency problem: before assigning an alias to an entity, the pipeline checks whether the entity already exists in the vault. If it does, it reuses the existing alias. This guarantees that "John Smith" is always PERSON_001, regardless of which document is being processed.
Entity normalization before vault lookup is critical: "SMITH John", "John Smith", and "J. Smith" must be recognized as the same entity. This is an entity resolution sub-problem that requires a dedicated normalization layer.
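The lookup-then-assign logic and a first normalization layer can be sketched as below. An in-memory dictionary stands in for the vault, and the class name `AliasRegistry` and function `normalize_name` are hypothetical. The normalization only handles case, accents, punctuation, and token order; resolving initials such as "J. Smith" needs a real entity resolution step.

```python
import unicodedata

def normalize_name(raw: str) -> str:
    """Casefold, strip accents and punctuation, sort tokens: 'SMITH John' == 'John Smith'."""
    decomposed = unicodedata.normalize("NFKD", raw)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    spaced = "".join(c if c.isalnum() else " " for c in no_accents)
    return " ".join(sorted(spaced.casefold().split()))

class AliasRegistry:
    """In-memory stand-in for the vault: one stable alias per normalized entity."""

    def __init__(self):
        self._aliases = {}
        self._counters = {}

    def alias_for(self, entity_type: str, raw_value: str) -> str:
        key = (entity_type, normalize_name(raw_value))
        if key not in self._aliases:          # first occurrence: create the alias
            n = self._counters.get(entity_type, 0) + 1
            self._counters[entity_type] = n
            self._aliases[key] = f"{entity_type}_{n:03d}"
        return self._aliases[key]             # later occurrences: reuse it

registry = AliasRegistry()
a1 = registry.alias_for("PERSON", "John Smith")
a2 = registry.alias_for("PERSON", "SMITH John")
a3 = registry.alias_for("PERSON", "Jean Dupont")
a4 = registry.alias_for("PERSON", "J. Smith")  # not resolved by this simple normalization
print(a1, a2, a3, a4)
```

Sorting tokens is a deliberate simplification: it merges "SMITH John" with "John Smith" but would also merge two different people named "Paul Martin" and "Martin Paul", so the trade-off should be validated on real data.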
8. The cloud LLM paradox and sovereign alternatives
This is the most counter-intuitive point in the topic, and the one that generates the most debate within technical teams.
The paradox, stated clearly
To anonymize personal data using a cloud LLM (ChatGPT, Claude, Gemini), you must first send that personal data to the provider's servers. That transfer is itself a personal data processing operation subject to GDPR. You need:
- A lawful basis for that processing (GDPR Article 6).
- A Data Processing Agreement (DPA) with the LLM provider.
- A Data Transfer Impact Assessment (DTIA) if the servers are located outside the EU/EEA.
- Contractual guarantees that the provider will not use your data to train its models.
In practice, sending medical records, HR data, or client files to OpenAI for "anonymization" is rarely a compliant approach without solid legal groundwork in place first. The EU AI Act adds another compliance layer on top of GDPR for high-risk processing; our guide to EU AI Act compliance covers the intersection of both frameworks.
Acceptable solutions by sensitivity level
| Solution | Sensitivity level | Required conditions |
|---|---|---|
| Presidio + spaCy on-premise | All levels, including health and judicial data | Zero cloud calls: recommended by default |
| Mistral 7B / 22B on-premise | All levels | On-premise GPU or sovereign cloud provider (OVH, Scaleway) |
| Azure OpenAI Service (EU region) | Sensitive data excluding special-category data | Signed Microsoft DPA, EU Data Boundary enabled, no-training guarantee in contract |
| OpenAI API (Enterprise tier) | Non-critical data only | Signed DPA, no-training guarantee, DTIA completed: not recommended for health or judicial data |
Field note from Tensoria
On projects involving health data or HR files, we systematically recommend Presidio + spaCy on-premise as the primary layer, and on-premise Mistral for contextual entities. Cloud architecture only enters the picture for already-anonymized data, never for raw input. This position is validated by our clients' DPOs and documented in the DPIA. If you are evaluating a self-hosted RAG architecture alongside an anonymization pipeline, see our article on self-hosted RAG architecture for a compatible design.
9. Project compliance: DPIA, processing register, and processor contracts
The technical stack alone is not enough. A GDPR-compliant anonymization pipeline requires three documentary deliverables produced in parallel with development.
The Data Protection Impact Assessment (DPIA)
An AI anonymization pipeline typically processes personal data at scale through automated means, and that data may include special categories (health, ethnic origin) or data relating to criminal convictions. Together, these criteria trigger the DPIA obligation under GDPR Article 35 in most cases. EU supervisory authorities, including France's CNIL (Commission Nationale de l'Informatique et des Libertés), have published lists of processing operations that require a DPIA, and large-scale processing of special-category data is a recurring entry on those lists.
The DPIA must document:
- The specific purpose of the anonymization processing (why, for whom, with what retention period before anonymization).
- The categories of data processed and their sensitivity level.
- Technical and organizational measures in place (on-premise pipeline, vault access controls, audit trail).
- Residual risks identified and mitigation measures.
- The reasoned decision of the data controller.
The processing register
The anonymization operation itself is a processing activity to be recorded in the register of processing activities (GDPR Article 30). It must appear separately from the original processing that produced the data. Fields to complete: purpose, data categories, data subject categories, recipients, retention periods, security measures, and processors.
The processor contract
If an external service provider (such as Tensoria) is involved in developing or operating the pipeline, a data processing agreement compliant with GDPR Article 28 is mandatory. This agreement must specify: documented instructions from the controller, security obligations, prohibition on unauthorized sub-processing, audit arrangements, and data deletion or return at contract end.
10. Costs, timelines, and common pitfalls
Cost ranges
| Phase | Duration | Range | Deliverables |
|---|---|---|---|
| Proof of concept | 4 to 6 weeks | EUR 5,000 to 9,000 | Presidio + spaCy deployed, custom business recognizers, evaluation on 500 annotated documents, coverage report by entity type |
| Production MVP | 2 to 3 months | EUR 12,000 to 22,000 | Batch and/or real-time pipeline, document flow integration, vault, audit trail, DPO sign-off |
| Annual TCO (excluding development) | Recurring | EUR 5,000 to 12,000/year | CPU infrastructure (< EUR 100/month), recognizer maintenance (2 to 4 days/year), semi-annual coverage audit |
For broader AI project budget benchmarks, our article on RAG project costs and TCO provides comparable budget structures for document-intensive AI builds.
Typical timeline
- Weeks 1 to 2: GDPR audit with the DPO (data scope, processing purpose, register entries), inventory of entity types to cover.
- Weeks 3 to 5: Presidio + spaCy fr deployment, custom recognizer configuration, evaluation on annotated sample.
- Weeks 6 to 9: integration into document flows, vault, audit trail, non-re-identifiability testing.
- Weeks 10 to 12: phased rollout, DPO sign-off, system documentation.
What consistently extends the timeline: DPO or specialist GDPR counsel review (essential but frequently underestimated), document format variety (each new format requires a dedicated extractor), and formal non-re-identifiability requirements (testing with external data, k-anonymization for structured datasets).
The five most common pitfalls
1. Confusing anonymization with pseudonymization. Replacing a name with a reversible identifier is pseudonymization. The data remains within GDPR scope. This is the most frequently raised point in regulatory audits.
2. Not testing non-re-identifiability. Anonymizing names and emails without checking whether the combination of postal code, age, and occupation still enables re-identification. For structured datasets, k-anonymization techniques are essential.
3. Missing indirectly identifying entities. Presidio detects names, emails, phone numbers. It does not detect "the managing director of the Toulouse-based firm specializing in..." which can be identifying. An on-premise LLM is required for these rare contextual entities.
4. No audit trail. Without precise logging of who anonymized what, when, and with which method, GDPR compliance is difficult to demonstrate in the event of a supervisory audit. The audit trail must be generated automatically by the pipeline.
5. Cross-document inconsistency. If "John Smith" becomes PERSON_001 in one document and PERSON_017 in another, cross-document analyses on pseudonymized data produce incorrect results, and controlled re-identification through the vault becomes harder to guarantee.
Further reading
- EU AI Act Compliance Guide: how GDPR anonymization requirements intersect with the AI Act risk tiers for automated processing systems.
- Self-Hosted RAG Architecture: sovereign document retrieval design that pairs naturally with an on-premise anonymization pipeline.
- Deploying LLMs to Production: on-premise Mistral deployment patterns used as the contextual entity detection layer.
- Enterprise Data Readiness for AI: the data infrastructure foundations that make anonymization at scale tractable.
- PDF Data Extraction AI Architecture: document ingestion and pre-processing patterns for the first stage of the pipeline.
- RAG Project Costs and TCO: comparable budget frameworks for document-intensive AI builds.
- Our RAG systems service: sovereign internal knowledge assistants built on your documents, compliant from the start.