PII Anonymization for GDPR-Compliant AI: On-Premise Architecture with Presidio

"We have anonymized the data." Said with full confidence by technically competent teams, that statement is wrong in roughly 60% of cases. What they have actually done is pseudonymization. The data remains personal data under GDPR. The processing is still regulated. And if that data is then sent to an analytics vendor or a cloud LLM for "processing," the problem is unresolved.

The confusion between anonymization and pseudonymization is not a legal technicality. It is the starting point of most GDPR non-compliance findings. It is also the first issue we address systematically in our AI audits at Tensoria.

This article is written for technical teams and Data Protection Officers who want to build a genuinely compliant anonymization pipeline. It covers the legal distinction (GDPR Article 4 and Recital 26, EDPB tests), the recommended on-premise architecture using Microsoft Presidio and spaCy, the configuration of jurisdiction-specific entities (NIR/SSN, SIRET, IBAN, vehicle plates), and the cloud LLM paradox for this use case. Everything here comes from the field, not from a compliance template.

Key takeaways

✓ Pseudonymized data is still personal data under GDPR (Art. 4(5)): confusing pseudonymization with anonymization is the number-one compliance trap
✓ Genuine anonymization must resist three EDPB tests: singling out, linkability, and inference
✓ Recommended on-premise stack: Microsoft Presidio + spaCy fr_core_news_lg + custom business recognizers (NIR, SIRET, IBAN, plate, RPPS)
✓ Cloud LLM paradox: sending personal data to a US LLM to anonymize it is itself a transfer subject to GDPR
✓ Recall above 97% overall and above 99% on critical entities: a false negative is a data leak, not an acceptable error
✓ Proof of concept: EUR 5,000 to 9,000, 4 to 6 weeks. Production MVP: EUR 12,000 to 22,000.

1. Anonymization vs pseudonymization: the legal distinction that changes everything

This is the mandatory starting point, and it is frequently skipped. The distinction between anonymization and pseudonymization is not semantic. It determines whether your data remains subject to GDPR or exits its scope entirely. The threshold is much higher than most teams assume.

Pseudonymization: still within GDPR scope

Article 4(5) of the GDPR defines pseudonymization as the processing of personal data in such a manner that it can no longer be attributed to a specific data subject without the use of additional information. That additional information (the mapping key, the vault) must be kept separately and secured.

The legal consequence is unambiguous: pseudonymized data remains personal data. All GDPR principles continue to apply: lawful basis for processing, data subject rights, security obligations, processing register entries, and retention limits. Pseudonymization is a recognized and encouraged security measure under GDPR Article 25 (data protection by design), but it does not take data outside the regulation.

In practice, replacing "John Smith" with "PERSON_4821" or a SHA-256 hash is pseudonymization. Re-identification remains possible if the mapping table is accessible or, for an unsalted hash, by hashing candidate names and comparing the results.
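To make the point about hashes concrete, here is a minimal sketch of a dictionary attack on an unsalted SHA-256 hash. The record and the candidate list are hypothetical; the only assumption is that an attacker can obtain a plausible list of names (a staff directory, a public register).

```python
import hashlib


def sha256_hex(value: str) -> str:
    """Return the hex SHA-256 digest of a UTF-8 string."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# "Pseudonymized" record: the name was replaced by its unsalted hash.
record = {"patient": sha256_hex("John Smith"), "diagnosis": "asthma"}

# Hypothetical attacker-side candidate list (e.g. a staff directory).
candidates = ["Alice Martin", "John Smith", "Maria Lopez"]

# Dictionary attack: hash every candidate and look the stored hash up.
lookup = {sha256_hex(name): name for name in candidates}
recovered = lookup.get(record["patient"])
print(recovered)
```

No mapping table is needed for this attack, which is why a bare hash should be treated as pseudonymization at best.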

Anonymization: exiting GDPR scope

Anonymization under GDPR is an irreversible process. Recital 26 states that the regulation's principles do not apply to information that does not relate to an identified or identifiable natural person, or to data rendered anonymous in such a way that the data subject is no longer identifiable.

The key phrase is "no longer identifiable," assessed by taking into account all means reasonably likely to be used: cost, time, available technologies, and accessible sources of information. This test is not static: what is "reasonably impossible" to re-identify today may not be in five years as new techniques emerge.

Key point

Correctly anonymized data exits GDPR scope and can be used freely: model training, third-party sharing, open data publication, long-term archiving. That is the objective of any serious anonymization effort. But reaching it is harder than simply replacing names.

Why the confusion is systemic

The confusion stems from tooling terminology. Many anonymization libraries, Presidio included, label every transformation "anonymization" in their documentation, whatever its legal effect. Presidio's Anonymizer ships operators such as replace, redact, mask, hash, and encrypt. Encryption is reversible by design with the key, and hashing or consistent replacement keeps records linkable, so these outputs generally qualify as pseudonymization under GDPR.

This is not a criticism of Presidio: it is a technical tool whose legal classification depends entirely on how it is used. Replacing an entity with a synthetic, non-reversible label (PERSON_001 with no vault) tends toward anonymization. Hashing with a stored secret key is pure pseudonymization. The legal qualification belongs to the DPO, not the developer.

2. The GDPR framework: Article 4, Recital 26, and the three anonymization tests

Before writing the first line of code, the team must understand the regulatory framework. These three references define what "correctly anonymized" means under EU data protection law.

Article 4: the reference definitions

Article 4 of the GDPR contains the statutory definitions. Two are central to this topic. The full text is available on the GDPR.eu reference portal.

Article 4(1): "personal data" means any information relating to an identified or identifiable natural person. The list of identifiers is non-exhaustive: name, identification number, location data, online identifier, or factors specific to the physical, physiological, genetic, mental, economic, cultural, or social identity of that person.
Article 4(5): definition of pseudonymization. Keeping the additional information (the key) separately is a legal requirement, not a recommendation.

Recital 26: the identifiability test

Recital 26 is the reference text for evaluating whether data is genuinely anonymized. It states that to determine whether a natural person is identifiable, account should be taken of "all the means reasonably likely to be used, such as singling out, either by the controller or by another person to identify the natural person directly or indirectly."

This test is contextual and dynamic. It is not enough that you, as the controller, cannot re-identify. No one must be able to do so using reasonable means, including a malicious third party with access to public sources (LinkedIn, open data, cross-referencing).

The three EDPB anonymization tests

Opinion 05/2014 on Anonymisation Techniques, adopted by the Article 29 Working Party (the predecessor of the European Data Protection Board, EDPB), sets out three criteria that data must satisfy simultaneously to be considered anonymized:

Test	Question	Example of a violation
Singling out	Can an individual be isolated in the dataset?	In a medical dataset, the only person with "occupation = cardiac surgeon" and "region = South West" is identifiable even without a name
Linkability	Can two records belonging to the same individual be linked?	The same individual appearing in two anonymized tables with rare common attributes allows an indirect join
Inference	Can an attribute value for an individual be deduced with certainty?	If all members of a cohort share a sensitive attribute (illness, conviction), group membership reveals the attribute

These three tests apply primarily to structured datasets. For unstructured text (contracts, correspondence, reports), the approach differs: identifying entities are removed or replaced, and contextual re-identification risk is evaluated separately.
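For structured datasets, the singling-out test can be approximated with a k-anonymity style check. The sketch below is illustrative only: the rows, the column names, and the choice of quasi-identifiers are assumptions, and a real assessment would also cover linkability and inference.

```python
from collections import Counter


def smallest_group_size(rows, quasi_identifiers):
    """Size of the smallest group of rows sharing the same quasi-identifier values.

    A result of 1 means at least one record can be singled out.
    Assumes `rows` is a non-empty list of dicts.
    """
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())


# Hypothetical dataset: names are already removed, quasi-identifiers remain.
rows = [
    {"occupation": "nurse", "region": "South West", "diagnosis": "A"},
    {"occupation": "nurse", "region": "South West", "diagnosis": "B"},
    {"occupation": "cardiac surgeon", "region": "South West", "diagnosis": "C"},
]

k = smallest_group_size(rows, ["occupation", "region"])
print(k)  # 1: the cardiac surgeon is unique and can be singled out
```

Generalizing or suppressing the occupation column raises k, at the cost of analytical detail.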

3. Recommended on-premise architecture

A PII anonymization pipeline follows a sequential logic. Each stage has a specific role and metrics to monitor. For teams already thinking about how this connects to their broader data infrastructure, our article on enterprise data readiness for AI covers the foundational groundwork that makes anonymization tractable at scale.

Figure: architecture of an on-premise anonymization pipeline, from document ingestion to audit trail generation, using Presidio, spaCy, and custom business recognizers.

Step 1: extraction and pre-processing

Documents rarely arrive as plain text. Common formats in enterprise environments are native PDFs, scanned PDFs, DOCX, emails, database fields, and audio transcripts. Each format requires a dedicated extractor:

Native PDFs: pdfplumber or PyMuPDF, with table and header handling.
Scanned PDFs: OCR is mandatory (Tesseract with a language model, or an on-premise OCR service). OCR quality directly conditions entity detection quality.
DOCX and emails: python-docx, eml-parser. Watch document metadata, which often contains personal data (document author, revision history).
Databases: column-by-column processing, with prior identification of free-text columns versus structured fields.

For pipelines that need to extract structured data from complex document types before anonymization, see our deep-dive on PDF data extraction AI architecture, which covers layout-aware parsing, table detection, and multi-format ingestion.

Step 2: personal entity detection

This is the core of the pipeline. Three detection layers combine:

NER via spaCy fr_core_news_lg: detection of named entities (persons, organizations, locations, dates). This is the baseline semantic layer.
Presidio recognizers: 40+ types of structured entities (email, phone, IP address, credit card number, generic IBAN, GPS coordinates).
Custom regex recognizers: jurisdiction-specific entities not covered natively (see next section).
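When several layers flag overlapping spans, one recall-first way to combine them is to mask the union of the overlapping spans rather than pick a single winner. The sketch below is an assumption about how such a merge could look, not Presidio's own conflict-resolution logic; the Detection type and the layer names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    start: int         # inclusive character offset
    end: int           # exclusive character offset
    entity_type: str
    score: float
    layer: str         # hypothetical label: "spacy", "presidio", "custom"


def merge_detections(detections):
    """Merge overlapping detections into their union.

    The merged span keeps the entity type of the highest-scoring detection,
    so no character flagged by any layer is left unmasked.
    """
    merged = []
    for det in sorted(detections, key=lambda d: (d.start, d.end)):
        if merged and det.start < merged[-1].end:  # overlaps the previous span
            prev = merged[-1]
            best = det if det.score > prev.score else prev
            merged[-1] = Detection(prev.start, max(prev.end, det.end),
                                   best.entity_type, best.score, best.layer)
        else:
            merged.append(det)
    return merged


detections = [
    Detection(4, 15, "PERSON", 0.85, "spacy"),
    Detection(10, 20, "LOCATION", 0.60, "spacy"),
    Detection(30, 57, "IBAN_CODE", 1.00, "presidio"),
]
merged = merge_detections(detections)
print([(d.start, d.end, d.entity_type) for d in merged])
```

Taking the union trades a little precision for recall, which matches the evaluation priorities of an anonymization pipeline.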

Step 3: substitution strategy

Depending on the downstream use, you select a different strategy:

Strict anonymization: replacement with a synthetic entity of the same type and no vault (PERSON_001, ADDRESS_001). Irreversible. Applicable for long-term archiving or external sharing.
Pseudonymization with vault: replacement with an encrypted alias, mapping stored in an isolated vault (HashiCorp Vault or equivalent). Reversible for internal analytics use cases that require re-identification.
Redaction: [REDACT] or ####. For archived documents that must remain human-readable without exposing personal data.
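The strategies above can be sketched as a single substitution function working on detected spans. This is a simplified illustration, independent of Presidio's operators: the function name, the span format, and the label format are assumptions.

```python
import re
from collections import defaultdict


def substitute(text, spans, strategy="synthetic"):
    """Replace detected spans in `text`.

    spans: non-overlapping (start, end, entity_type) tuples.
    strategy: "synthetic" gives [PERSON_001]-style labels, "redact" gives [REDACT].
    Returns (new_text, mapping). Discard the mapping for strict anonymization;
    store it in an isolated vault for reversible pseudonymization.
    """
    counters = defaultdict(int)
    mapping = {}  # (entity_type, original value) -> label
    replacements = []
    for start, end, entity_type in sorted(spans):
        if strategy == "redact":
            label = "[REDACT]"
        else:
            key = (entity_type, text[start:end])
            if key not in mapping:
                counters[entity_type] += 1
                mapping[key] = f"[{entity_type}_{counters[entity_type]:03d}]"
            label = mapping[key]
        replacements.append((start, end, label))
    # Apply from the end of the text so earlier offsets stay valid.
    for start, end, label in reversed(replacements):
        text = text[:start] + label + text[end:]
    return text, mapping


sample = "Jean Dupont called Marie Curie. Jean Dupont left."
spans = [(m.start(), m.end(), "PERSON")
         for m in re.finditer(r"Jean Dupont|Marie Curie", sample)]

synthetic_text, mapping = substitute(sample, spans)
redacted_text, _ = substitute(sample, spans, strategy="redact")
print(synthetic_text)
print(redacted_text)
```

The only difference between strict anonymization and pseudonymization in this sketch is what happens to the returned mapping.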

Step 4: post-processing and audit trail

Two post-processing constraints are critical for compliance:

Cross-document consistency: "John Smith" must receive the same alias across all documents in the corpus. Without a persistent alias register, cross-document analyses on pseudonymized data become unreliable.
Audit trail: timestamped logging of every anonymization operation (document processed, entities detected by type, method applied, operator, hash of the original document). This is the compliance evidence required in the event of a supervisory authority audit.
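An audit record with the fields listed above can be assembled with the standard library alone. The field names below mirror the sample report shown later in this article; the function name and the operator value are assumptions.

```python
import hashlib
import json
from collections import Counter
from datetime import datetime, timezone


def build_audit_record(document_id, original_text, entity_types, method, operator):
    """Build one audit-trail entry for a processed document.

    Only a hash of the original is stored, never the original text itself.
    """
    digest = hashlib.sha256(original_text.encode("utf-8")).hexdigest()
    return {
        "document_id": document_id,
        "processing_timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "entities_detected": dict(Counter(entity_types)),
        "method": method,
        "operator": operator,
        "original_document_hash": f"sha256:{digest}",
    }


audit_record = build_audit_record(
    document_id="DOC-2026-00341",
    original_text="Mr. Jean Dupont (born 12/03/1978) reported a claim on 15 March 2026.",
    entity_types=["PERSON", "DATE_TIME", "DATE_TIME"],
    method="anonymization_synthetic_substitution",
    operator="pipeline-batch",  # hypothetical operator identifier
)
print(json.dumps(audit_record, indent=2))
```

In production these entries would be appended to write-once storage so that the trail itself cannot be altered after the fact.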

4. Presidio + spaCy: configuration and EU-specific entities

Microsoft Presidio is the open-source framework (MIT license) that has become the standard for this type of project. Its two-component architecture, Analyzer (detection) and Anonymizer (transformation), is clear and auditable. It deploys fully on-premise with zero external network calls.

Configuring French NER with spaCy

Presidio uses spaCy by default with an English model. Processing French documents requires a few adjustments:

from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NlpEngineProvider

# Configure NLP engine with spaCy French model
configuration = {
    "nlp_engine_name": "spacy",
    "models": [
        {"lang_code": "fr", "model_name": "fr_core_news_lg"}
    ],
}

provider = NlpEngineProvider(nlp_configuration=configuration)
nlp_engine = provider.create_engine()

analyzer = AnalyzerEngine(
    nlp_engine=nlp_engine,
    supported_languages=["fr"]
)

The fr_core_news_lg (large) model is recommended over fr_core_news_sm for better recall on French named entities, at the cost of a larger memory footprint (~570 MB vs ~16 MB). On a dedicated server, this overhead is negligible relative to the coverage gain.

Custom recognizers for EU-specific entities

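Presidio's predefined recognizers include a generic IBAN recognizer but, at the time of writing and to our knowledge, none for French identifiers such as the NIR, SIRET, or RPPS number. Custom recognizers based on regular expressions and checksums are required: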

Entity	Format	Recognizer complexity	Known pitfalls
NIR (French SSN)	`1 84 06 75 116 042 68`	Medium: regex + mod-97 control key (97 minus the first 13 digits mod 97)	Variable spacing, provisional NIR, foreign NIR, Corsican department codes 2A/2B
SIRET	`552 178 639 00143`	Low: 14-digit regex + Luhn algorithm	Confusion with phone numbers or postal codes
French IBAN	`FR76 3000 6000 0112 3456 7890 189`	Low: ISO 13616 regex + mod97 checksum	Foreign IBANs in files, split across multiple lines
Vehicle registration plate	`AB-123-CD`	Low: regex for SIV post-2009 and legacy formats	Old prefectural plates, diplomatic plates
RPPS physician number	`10002345678` (11 digits)	Low: regex + context ("Dr", "physician")	Confusion with other 11-digit numbers absent context

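from presidio_analyzer import Pattern, PatternRecognizer

# Example: NIR recognizer (French social security number).
# Groups: sex, year, month, department, commune, order number, control key.
# Optional single spaces between groups cover the printed form
# "1 84 06 75 116 042 68" as well as the compact 15-character form.
nir_pattern = Pattern(
    name="NIR_pattern",
    regex=r"\b[12] ?[0-9]{2} ?(?:0[1-9]|1[0-2]|[2-9][0-9]) ?"
          r"(?:0[1-9]|[1-8][0-9]|9[0-5]|2[AB]) ?[0-9]{3} ?[0-9]{3} ?[0-9]{2}\b",
    score=0.85,
)

nir_recognizer = PatternRecognizer(
    supported_entity="FR_NIR",
    # PatternRecognizer defaults to "en"; it must match the analyzer language.
    supported_language="fr",
    patterns=[nir_pattern],
    context=["social security", "NIR", "securite sociale",
             "carte vitale", "assure", "numero SS"],
)

analyzer.registry.add_recognizer(nir_recognizer)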

Adding context words (context) improves precision: Presidio raises the confidence score when the detected entity is preceded or followed by associated terms. This reduces false positives on incidental numeric sequences.
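The checksum verifications named in the table above can be sketched in plain Python. These are standalone helpers with hypothetical names, not Presidio APIs: the NIR key is 97 minus the first 13 digits modulo 97 (with 2A and 2B mapped to 19 and 18), the SIRET uses the Luhn algorithm, and the IBAN uses the ISO 13616 mod-97 check.

```python
import re


def nir_is_valid(nir: str) -> bool:
    """Check the NIR control key: 97 - (first 13 digits mod 97)."""
    s = re.sub(r"\s", "", nir).upper()
    if not re.fullmatch(r"[0-9]{5}(?:[0-9]{2}|2[AB])[0-9]{8}", s):
        return False
    # Corsican departments are converted before the modulo is computed.
    body = s[:13].replace("2A", "19").replace("2B", "18")
    return 97 - int(body) % 97 == int(s[13:])


def luhn_is_valid(number: str) -> bool:
    """Luhn check, used here for SIRET numbers."""
    s = re.sub(r"\s", "", number)
    if not s.isdigit():
        return False
    total = 0
    for i, ch in enumerate(reversed(s)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit from the right
            d = d * 2 - 9 if d * 2 > 9 else d * 2
        total += d
    return total % 10 == 0


def iban_is_valid(iban: str) -> bool:
    """ISO 13616 check: move the first 4 characters to the end, map letters
    to numbers (A=10 ... Z=35), and require the result mod 97 to equal 1."""
    s = re.sub(r"\s", "", iban).upper()
    if not re.fullmatch(r"[A-Z]{2}[0-9]{2}[A-Z0-9]{11,30}", s):
        return False
    rearranged = s[4:] + s[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


print(iban_is_valid("FR76 3000 6000 0112 3456 7890 189"))  # True
print(nir_is_valid("1 84 06 75 116 042 83"))               # True
```

A checksum that fails is a strong signal of a false positive, while a checksum that passes justifies raising the confidence score. In Presidio this kind of check is usually attached to a pattern recognizer through its result-validation hook; verify the exact method name against the version you deploy.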

Sample pipeline output

Input:

Mr. Jean Dupont (born 12/03/1978, residing at 14 rue des Lilas,
31000 Toulouse, NIR: 1 78 03 31 116 042 68) reported a claim
on 15 March 2026. His IBAN: FR76 3000 6000 0112 3456 7890 189.

Anonymized output (synthetic entity replacement):

Mr. [PERSON_001] (born [DATE_001], residing at [ADDRESS_001],
31000 [CITY_001], NIR: [FR_NIR_001]) reported a claim
on [DATE_002]. His IBAN: [IBAN_001].

Anonymization report (attached):

{
  "document_id": "DOC-2026-00341",
  "processing_timestamp": "2026-05-18T10:14:00Z",
  "entities_detected": {
    "PERSON": 1, "DATE_TIME": 2, "LOCATION": 2,
    "FR_NIR": 1, "IBAN_CODE": 1
  },
  "method": "anonymization_synthetic_substitution",
  "estimated_coverage": 0.97,
  "original_document_hash": "sha256:4a8f2c..."
}

5. Presidio, LLM, or CamemBERT NER: which model to choose?

There is no single right answer. The choice depends on document volume, the nature of the entities to detect, and compliance constraints. This is the decision matrix we apply in the field.

Approach	Recall on structured entities	Contextual entities	Throughput	Sovereignty
Presidio + spaCy fr	Very high on known entities	Low (no contextual understanding)	> 100 docs/min on CPU	100% on-premise
CamemBERT NER fine-tuned	High (if fine-tuned on your domain)	Medium	10 to 30 docs/min (GPU recommended)	100% on-premise
Presidio + on-premise LLM (Mistral)	Very high	High	5 to 15 docs/min depending on GPU	100% on-premise if Mistral is local
Cloud LLM only (GPT-4o, Claude)	Very high	Very high	Rate-limited	No (see cloud paradox section)

Default recommendation

For 80 to 90% of enterprise projects, Presidio + spaCy fr + custom recognizers is the right answer. The framework covers structured entities with high recall, runs on standard CPU hardware with no GPU cost, and remains entirely on-premise.

An on-premise LLM (Mistral 7B or 22B) is added only for contextual entities that escape classic NER: a person mentioned without an explicit surname ("the head of the Toulouse subsidiary"), a bank account described in prose rather than IBAN format, or complex indirectly identifying data. The LLM only processes passages with a low confidence score from the first pipeline, which limits required GPU resources. For teams deploying Mistral on-premise, see our guide on deploying LLMs to production, which covers containerization, inference optimization, and monitoring.
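The routing rule described above, where only low-confidence passages reach the LLM, can be sketched as follows. The threshold value, the passage structure, and the option to escalate passages with no detection at all are assumptions to be tuned on your own test set.

```python
def passages_for_llm(passages, threshold=0.6, escalate_empty=False):
    """Return the indices of passages to send to the on-premise LLM layer.

    passages: list of dicts with a "scores" list holding the confidence of
    each first-pass detection in that passage.
    threshold: hypothetical cut-off below which a detection counts as uncertain.
    escalate_empty: also escalate passages where the first pass found nothing.
    """
    selected = []
    for index, passage in enumerate(passages):
        scores = passage["scores"]
        if not scores:
            if escalate_empty:
                selected.append(index)
        elif min(scores) < threshold:
            selected.append(index)
    return selected


passages = [
    {"text": "Mr. Jean Dupont, IBAN FR76 ...", "scores": [0.95, 1.0]},
    {"text": "the head of the Toulouse subsidiary", "scores": [0.4]},
    {"text": "Payment terms are 30 days.", "scores": []},
]
print(passages_for_llm(passages))
print(passages_for_llm(passages, escalate_empty=True))
```

Whether to escalate passages with no detection is a cost decision: it catches contextual entities the first pass missed entirely, but it multiplies the GPU load.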

Fine-tuning CamemBERT on an annotated domain corpus is relevant when you have a highly specific domain (medical, legal, HR) with entities that spaCy does not recognize. It requires an annotated dataset of 500 to 2,000 examples, which represents a significant human investment. This is an MVP-stage decision, not a POC one.

6. Pipeline evaluation: precision, recall, and false negatives

Evaluating an anonymization pipeline inverts the usual logic of classification systems: recall takes priority over precision. A false negative (a personal entity not detected) is a data leak. A false positive (a non-personal entity incorrectly removed) degrades document readability but does not constitute a GDPR violation.

Production targets

Metric	Minimum target	Notes
Global recall	> 97%	Measured on representative annotated test set, not generic benchmark
Recall on NIR / IBAN / Full name	> 99%	Critical entities: zero tolerance
Global precision	> 90%	False positives acceptable if justified to the DPO
F1 on critical entities	> 0.96	NIR, IBAN, full name, date of birth
Cross-document alias consistency	> 99%	Same entity, same alias across entire corpus
Throughput (Presidio CPU stack)	> 100 docs/min	Standard 8-core CPU server, 2 to 5-page documents
Cost per document	< EUR 0.001	On-premise Presidio stack, excluding monthly infrastructure cost

Building the test set

The test set must be representative of your real documents, not generic datasets. The recommended approach:

Select a stratified sample of 200 to 500 documents covering all document types processed (contracts, correspondence, emails, forms, reports).
Manually annotate personal entities in that sample (at least two annotators to measure inter-annotator agreement).
Run the pipeline on the sample and compute metrics by entity type and document type.
Identify recurring false negative patterns and adjust recognizers accordingly.
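The metric computation in step three can be sketched as an exact-match comparison between annotated and predicted spans, broken down by entity type. The tuple layout and the function name are assumptions; many teams also score partial overlaps, which this sketch does not do.

```python
from collections import defaultdict


def evaluate(gold, predicted):
    """Per-entity-type precision, recall, and F1 with exact span matching.

    gold, predicted: iterables of (doc_id, start, end, entity_type) tuples.
    """
    gold, predicted = set(gold), set(predicted)
    stats = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for item in gold:
        stats[item[3]]["tp" if item in predicted else "fn"] += 1
    for item in predicted - gold:
        stats[item[3]]["fp"] += 1

    report = {}
    for entity_type, s in stats.items():
        found = s["tp"] + s["fp"]
        expected = s["tp"] + s["fn"]
        precision = s["tp"] / found if found else 0.0
        recall = s["tp"] / expected if expected else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[entity_type] = {"precision": precision, "recall": recall,
                               "f1": f1, "false_negatives": s["fn"]}
    return report


# Hypothetical annotations: one PERSON is missed and one is spurious.
gold = {("d1", 0, 11, "PERSON"), ("d1", 40, 61, "FR_NIR"), ("d2", 5, 16, "PERSON")}
predicted = {("d1", 0, 11, "PERSON"), ("d1", 40, 61, "FR_NIR"), ("d2", 20, 25, "PERSON")}

report = evaluate(gold, predicted)
for entity_type, metrics in sorted(report.items()):
    print(entity_type, metrics)
```

The false_negatives count per entity type is the number to review first, since each one corresponds to personal data left in an output document.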

This real-data evaluation phase is non-negotiable. Generic benchmarks (CoNLL-2003, WikiNER) do not reflect the specifics of your documents: internal language, business abbreviations, formats specific to your sector. The same principle applies to LLM evaluation at large: our article on LLM-as-judge and custom evaluators explains how to build evaluation grids anchored to real data.

7. Reversibility and the pseudonymization vault

When the downstream use requires the ability to re-identify individuals (internal analytics, responding to a GDPR data access request, correcting an error), reversible pseudonymization is chosen over strict anonymization. The choice depends on the processing purpose, to be defined with the DPO before any code is written.

Vault architecture

The vault is an encrypted register that maps original entities to their pseudonymized aliases. It must be:

Stored separately from the pseudonymized documents (Article 4(5) GDPR requires the additional information to be kept separately and protected by technical and organizational measures).
Encrypted at rest (AES-256 minimum) and in transit (TLS 1.3).
Access-restricted and audited: logging of every vault access with operator identity, timestamp, and reason.
Time-limited: the vault must be deleted or made inaccessible at the end of the data retention period.

HashiCorp Vault (on-premise deployable; distributed under the Business Source License since 2023, with OpenBao as an open-source fork) is a common reference solution for this requirement. It provides dedicated secrets engines, granular access policies, and a native audit log.

Cross-document consistency with a vault

The vault also solves the cross-document consistency problem: before assigning an alias to an entity, the pipeline checks whether the entity already exists in the vault. If it does, it reuses the existing alias. This guarantees that "John Smith" is always PERSON_001, regardless of which document is being processed.

Entity normalization before vault lookup is critical: "SMITH John", "John Smith", and "J. Smith" must be recognized as the same entity. This is an entity resolution sub-problem that requires a dedicated normalization layer.
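A minimal sketch of the lookup-before-assign logic, with a first normalization pass, could look like this. The in-memory dictionary stands in for the vault, and the class and function names are hypothetical. The normalization handles case, accents, and token order only; resolving initials such as "J. Smith" needs a real entity resolution layer.

```python
import unicodedata
from collections import Counter


def normalize_name(name: str) -> str:
    """Case-, accent-, and order-insensitive key for a person name."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    tokens = [t.strip(".,") for t in without_accents.casefold().split()]
    return " ".join(sorted(t for t in tokens if t))


class AliasRegister:
    """In-memory stand-in for the vault's entity-to-alias mapping."""

    def __init__(self):
        self._aliases = {}
        self._counters = Counter()

    def alias_for(self, entity_type: str, value: str) -> str:
        """Return the existing alias for an entity, or assign a new one."""
        key = (entity_type, normalize_name(value))
        if key not in self._aliases:
            self._counters[entity_type] += 1
            self._aliases[key] = f"{entity_type}_{self._counters[entity_type]:03d}"
        return self._aliases[key]


register = AliasRegister()
print(register.alias_for("PERSON", "John Smith"))   # PERSON_001
print(register.alias_for("PERSON", "SMITH John"))   # PERSON_001
print(register.alias_for("PERSON", "Jean Dupont"))  # PERSON_002
```

With this key, "J. Smith" would still receive a new alias, which is the gap the dedicated normalization layer has to close.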

8. The cloud LLM paradox and sovereign alternatives

This is the most counter-intuitive point in the topic, and the one that generates the most debate within technical teams.

The paradox, stated clearly

To anonymize personal data using a cloud LLM (ChatGPT, Claude, Gemini), you must first send that personal data to the provider's servers. That transfer is itself a personal data processing operation subject to GDPR. You need:

A lawful basis for that processing (GDPR Article 6).
A Data Processing Agreement (DPA) with the LLM provider.
A Data Transfer Impact Assessment (DTIA) if the servers are located outside the EU/EEA.
Contractual guarantees that the provider will not use your data to train its models.

In practice, sending medical records, HR data, or client files to OpenAI for "anonymization" is rarely a compliant approach without solid legal groundwork in place first. The EU AI Act adds another compliance layer on top of GDPR for high-risk processing; our guide to EU AI Act compliance covers the intersection of both frameworks.

Acceptable solutions by sensitivity level

Solution	Sensitivity level	Required conditions
Presidio + spaCy on-premise	All levels, including health and judicial data	Zero cloud calls: recommended by default
Mistral 7B / 22B on-premise	All levels	On-premise GPU or sovereign cloud provider (OVH, Scaleway)
Azure OpenAI Service (EU region)	Sensitive data excluding special-category data	Signed Microsoft DPA, EU Data Boundary enabled, no-training guarantee in contract
OpenAI API (Enterprise tier)	Non-critical data only	Signed DPA, no-training guarantee, DTIA completed: not recommended for health or judicial data

Field note from Tensoria

On projects involving health data or HR files, we systematically recommend Presidio + spaCy on-premise as the primary layer, and on-premise Mistral for contextual entities. Cloud architecture only enters the picture for already-anonymized data, never for raw input. This position is validated by our clients' DPOs and documented in the DPIA. If you are evaluating a self-hosted RAG architecture alongside an anonymization pipeline, see our article on self-hosted RAG architecture for a compatible design.

9. Project compliance: DPIA, processing register, and processor contracts

The technical stack alone is not enough. A GDPR-compliant anonymization pipeline requires three documentary deliverables produced in parallel with development.

The Data Protection Impact Assessment (DPIA)

An AI anonymization pipeline typically processes personal data at scale, potentially including special-category data (health, judicial, ethnic origin), through automated processing. Together, these criteria trigger the DPIA obligation under GDPR Article 35 in most cases. Under Article 35(4), EU supervisory authorities, including France's CNIL (Commission Nationale de l'Informatique et des Libertés), publish lists of processing operations that require a DPIA; large-scale processing of special-category data is a recurring entry on those lists.

The DPIA must document:

The specific purpose of the anonymization processing (why, for whom, with what retention period before anonymization).
The categories of data processed and their sensitivity level.
Technical and organizational measures in place (on-premise pipeline, vault access controls, audit trail).
Residual risks identified and mitigation measures.
The reasoned decision of the data controller.

The processing register

The anonymization operation itself is a processing activity to be recorded in the register of processing activities (GDPR Article 30). It must appear separately from the original processing that produced the data. Fields to complete: purpose, data categories, data subject categories, recipients, retention periods, security measures, and processors.

The processor contract

If an external service provider (such as Tensoria) is involved in developing or operating the pipeline, a data processing agreement compliant with GDPR Article 28 is mandatory. This agreement must specify: documented instructions from the controller, security obligations, prohibition on unauthorized sub-processing, audit arrangements, and data deletion or return at contract end.

10. Costs, timelines, and common pitfalls

Cost ranges

Phase	Duration	Range	Deliverables
Proof of concept	4 to 6 weeks	EUR 5,000 to 9,000	Presidio + spaCy deployed, custom business recognizers, evaluation on 500 annotated documents, coverage report by entity type
Production MVP	2 to 3 months	EUR 12,000 to 22,000	Batch and/or real-time pipeline, document flow integration, vault, audit trail, DPO sign-off
Annual TCO (excluding development)	Recurring	EUR 5,000 to 12,000/year	CPU infrastructure (< EUR 100/month), recognizer maintenance (2 to 4 days/year), semi-annual coverage audit

For broader AI project budget benchmarks, our article on RAG project costs and TCO provides comparable budget structures for document-intensive AI builds.

Typical timeline

Weeks 1 to 2: GDPR audit with the DPO (data scope, processing purpose, register entries), inventory of entity types to cover.
Weeks 3 to 5: Presidio + spaCy fr deployment, custom recognizer configuration, evaluation on annotated sample.
Weeks 6 to 9: integration into document flows, vault, audit trail, non-re-identifiability testing.
Weeks 10 to 12: phased rollout, DPO sign-off, system documentation.

What consistently extends the timeline: DPO or specialist GDPR counsel review (essential but frequently underestimated), document format variety (each new format requires a dedicated extractor), and formal non-re-identifiability requirements (testing with external data, k-anonymization for structured datasets).

The five most common pitfalls

1. Confusing anonymization with pseudonymization. Replacing a name with a reversible identifier is pseudonymization. The data remains within GDPR scope. This is the most frequently raised point in regulatory audits.

2. Not testing non-re-identifiability. Teams anonymize names and emails without checking whether the combination of postal code, age, and occupation still enables re-identification. For structured datasets, k-anonymity checks are essential.

3. Missing indirectly identifying entities. Presidio detects names, emails, phone numbers. It does not detect "the managing director of the Toulouse-based firm specializing in..." which can be identifying. An on-premise LLM is required for these rare contextual entities.

4. No audit trail. Without precise logging of who anonymized what, when, and with which method, GDPR compliance is difficult to demonstrate in the event of a supervisory audit. The audit trail must be generated automatically by the pipeline.

5. Cross-document inconsistency. If "John Smith" becomes PERSON_001 in one document and PERSON_017 in another, cross-document analyses on pseudonymized data produce incorrect results, and controlled re-identification through the vault becomes harder to guarantee.

11. Frequently asked questions

What is the difference between anonymization and pseudonymization under GDPR?

Anonymization is irreversible: re-identification must not be possible by any means reasonably likely to be used, even when available sources are cross-referenced. Anonymized data exits GDPR scope (Recital 26). Pseudonymization is reversible with a key: the data remains personal data under Article 4. Replacing a name with a hashed identifier is pseudonymization, not anonymization. Confusing the two is the most common finding in GDPR audits.

Why is Presidio recommended for on-premise PII anonymization projects?

Presidio (Microsoft, open source, MIT license) is fully deployable on-premise with zero external network calls. It integrates natively with spaCy (fr_core_news_lg for French NER), detects more than 40 entity types, and is extensible via custom recognizers for jurisdiction-specific identifiers: NIR (French SSN), SIRET, IBAN, vehicle registration plates, and RPPS physician numbers. Its architecture is modular and auditable.

Can you use ChatGPT or Claude to anonymize documents containing personal data?

This is the fundamental paradox: anonymizing via a cloud LLM requires sending the personal data to the provider's servers first, often in the United States. That transfer is itself a processing operation subject to GDPR. It requires a legal basis, a valid Data Processing Agreement, and a Data Transfer Impact Assessment for transfers outside the EU. Compliant alternatives include Presidio + spaCy on-premise, on-premise Mistral, or Azure OpenAI Service in the EU region with a signed DPA and EU Data Boundary enabled.

What are the three tests that anonymized data must pass?

Opinion 05/2014 of the Article 29 Working Party (the EDPB's predecessor) requires that data resist three attack types simultaneously: singling out (impossible to isolate an individual in the dataset), linkability (impossible to link two records belonging to the same individual), and inference (impossible to deduce with certainty the value of an attribute for an individual). If any one of these three conditions can be violated, the data is not considered anonymized under GDPR.

What is the minimum recall target for an anonymization pipeline in production?

Recall (the rate at which personal entities are detected) takes priority over precision. A false negative is a data leak. The minimum target is recall above 97% globally, and above 99% for critical entities (NIR/SSN, IBAN, full name). This must be measured on an annotated test set representative of your real documents, not on generic benchmarks.

Is a Data Protection Impact Assessment required for an AI anonymization project?

In most cases, yes. An anonymization pipeline processes personal data at scale, potentially including special-category data (health, judicial, ethnic origin), through automated processing. These criteria together typically trigger the DPIA obligation under GDPR Article 35 and EDPB guidelines. The DPIA must document the processing purpose, residual risks, technical and organizational measures, and the data retention period before anonymization.
