German PII Redactor: Covering the 5% Blind Spot in SAP Data Masking

Drop-in layer after deterministic masking that redacts PII from free-text SAP columns before data lands in dev, QA, or training systems. Runs on the client's own hardware, DSGVO-konform by design.
The problem
Every SAP shop copies production data into dev, QA, and training landscapes. It is how you reproduce customer bugs on real payloads, load-test a release, and train end-users on data that looks like what they will see on Monday.
Every copy is a compliance event. DSGVO, Art. 5 requires personal data to be processed only for legitimate purposes and pseudonymised where practical. Most DACH enterprises have bought a deterministic masking tool — SAP TDMS, Delphix, Informatica TDM, IBM InfoSphere Optim — and wired it into the copy job. The tool rewrites classified columns: KNA1-NAME1 becomes Mustermann, BSEG-IBAN becomes a fake IBAN that still passes checksum, USR02-BNAME becomes USER042. That covers the ~95% of PII that lives in schema-aware, row-level columns.
The remaining 5% is where every program I have seen leaks:
- Free-text NOTES columns on customer, vendor, and case tables. Agents type in names, phone numbers, IBANs, and internal IDs that the schema cannot anticipate.
- Unclassified Z-tables. Customer-built extensions the data steward never got around to categorising. The masking tool does not know they exist, so it leaves them alone.
- OCR’d scan attachments. Invoices, IDs, insurance cards. Full name, date of birth, sometimes bank details — all sitting in the attachment table as raw text.
- Long-tail entity types. Deterministic tools know how to mask a German IBAN. They do not know the difference between a street address and a product description unless you classify every column.
Running a general-purpose LLM over 100% of production data to find this is a non-starter — the VP of one mid-sized DACH manufacturer put it plainly: “That is a thousand times the compute of what we already run.” He was right. The 95% does not need an LLM. But the 5% does.
The solution
A 1.5B-parameter fine-tuned German PII redactor that plugs in after the deterministic masker. The copy pipeline looks like this:
- Extract production data (SAP table unload, database dump, file export).
- Mask structured columns with your existing tool (TDMS, Delphix, Informatica). Unchanged.
- Route free-text and unclassified columns into the redactor. A tiny classifier decides which rows need it — most do not.
- Redactor emits structured JSON: { redacted_text, entities, risk_level, needs_human_review }.
- Reinject the redacted free-text back into the masked export.
- Load into dev / QA / training — compliant end-to-end.
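The routing step above can be sketched in a few lines. This is a simplified illustration, not the shipped classifier: `needs_redaction` stands in for the tiny classifier from step 3, here approximated with regex hints for German IBANs, dates, and phone numbers, and `redact` stands in for the model call.

```python
import re

# Hypothetical pre-filter: a cheap gate deciding whether a row's free text
# needs the LLM redactor at all. Most rows fail the gate and skip the model.
PII_HINTS = re.compile(
    r"\bDE\d{2}[ ]?(\d{4}[ ]?){4}\d{2}\b"   # German IBAN
    r"|\b\d{2}\.\d{2}\.\d{4}\b"             # dd.mm.yyyy date of birth
    r"|\b(\+49|0)[\d /-]{7,}\b"             # German phone number
)

def needs_redaction(text: str) -> bool:
    """Route to the redactor only when PII-like patterns appear."""
    return bool(PII_HINTS.search(text))

def route_rows(rows, redact):
    """Apply `redact` (the model call) only to rows the gate flags."""
    out = []
    for row in rows:
        if needs_redaction(row["note"]):
            row = {**row, "note": redact(row["note"])}
        out.append(row)
    return out
```

The point of the gate is cost: the nightly copy job only pays LLM inference for the minority of rows that plausibly contain PII.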
The redactor is a LoRA adapter on Qwen2.5-1.5B-Instruct. Small enough to run on a single consumer GPU, fast enough to keep up with a nightly copy job, specific enough to handle the twelve German entity types that matter in enterprise data: Name, Adresse, IBAN, Steuer-ID, Sozialversicherungsnummer, Krankenkassennummer, Geburtsdatum, Telefon, E-Mail, IP, Kfz-Kennzeichen, Kontoinhaber.
Output is structured, not span-based. Downstream tooling gets a JSON document it can validate and consume, not a highlighted blob of text. This is the difference between “we built a demo” and “we shipped a service.”
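The output contract can be pinned down as a Pydantic schema along these lines. Field names come from the JSON shape above; the risk-level enum values and the entity sub-fields are illustrative assumptions, not the published schema.

```python
from enum import Enum
from typing import List
from pydantic import BaseModel

class RiskLevel(str, Enum):
    # Assumed values; the real schema may differ.
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"

class Entity(BaseModel):
    type: str       # e.g. "Name", "IBAN", "Steuer-ID"
    pseudonym: str  # replacement token, e.g. "[PERSON_1]"

class RedactionResult(BaseModel):
    redacted_text: str
    entities: List[Entity]
    risk_level: RiskLevel
    needs_human_review: bool

# Validate a raw model output before it enters the nightly batch:
raw = (
    '{"redacted_text": "Kunde [PERSON_1], IBAN [IBAN_1]",'
    ' "entities": [{"type": "Name", "pseudonym": "[PERSON_1]"},'
    ' {"type": "IBAN", "pseudonym": "[IBAN_1]"}],'
    ' "risk_level": "high", "needs_human_review": false}'
)
result = RedactionResult.model_validate_json(raw)
```

Because the schema rejects malformed output at the boundary, downstream steps consume typed objects rather than re-parsing strings.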
Pseudonyms are referentially consistent. The same person becomes [PERSON_1] in every free-text column of their record. Dev engineers can still join across tables and reproduce the bug. Referential integrity is the feature enterprise masking tools guard most jealously; the redactor honours it.
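Referential consistency amounts to a per-record pseudonym table: the first time a surface form appears it gets a numbered token, and every later occurrence reuses it. A simplified illustration of the idea, not the adapter's internals:

```python
from collections import defaultdict

class PseudonymTable:
    """Per-record mapping: same surface form -> same numbered token."""

    def __init__(self):
        self._maps = defaultdict(dict)  # entity type -> {surface: token}

    def token(self, entity_type: str, surface: str) -> str:
        table = self._maps[entity_type]
        if surface not in table:
            table[surface] = f"[{entity_type.upper()}_{len(table) + 1}]"
        return table[surface]

tab = PseudonymTable()
a = tab.token("person", "Max Mustermann")    # first mention
b = tab.token("person", "Max Mustermann")    # later column, same record
c = tab.token("person", "Erika Musterfrau")  # different person
```

Here `a` and `b` both resolve to `[PERSON_1]` while `c` gets `[PERSON_2]`, so joins across free-text columns of the same record still line up.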
Runs entirely on the client’s own hardware. No third-party API, no data egress, no per-token billing. The adapter is distributed as a 150 MB file on HuggingFace. Load it onto the base model inside your landscape and you are done.
The results
The v1 adapter, trained on 75 synthetic German business documents and evaluated on a held-out set of 16, lifts the three metrics that matter:
| Metric | Base Qwen-1.5B | Fine-tuned adapter |
|---|---|---|
| Entity F1 | 0.375 | 0.917 |
| JSON parse validity | 81% | 100% |
| Risk-level exact-match | 0.50 | 0.94 |
Entity F1 is the core quality bar — can the model find and correctly type the PII in a free-text string? A 2.4× lift says the adapter learned the German-specific patterns the base model missed (especially IBAN formatting, Steuer-ID layout, and German address conventions).
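Entity F1 here is micro-averaged over (entity type, surface) pairs. A minimal scorer in that spirit (my formulation, not necessarily the exact eval harness):

```python
def entity_f1(gold: list, pred: list) -> float:
    """Micro F1 over sets of (entity_type, surface) pairs per document."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # found and correctly typed
        fp += len(p - g)   # hallucinated or mistyped
        fn += len(g - p)   # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [{("IBAN", "DE89..."), ("Name", "Max Mustermann")}]
pred = [{("IBAN", "DE89...")}]
score = entity_f1(gold, pred)  # precision 1.0, recall 0.5 -> F1 0.667
```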
JSON parse validity is the “can you actually use this in production” gate. 100% means every inference emits a document the downstream pipeline can load without a try/except fallback. Critical for nightly batch jobs where a single malformed output stalls the whole stream.
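The validity gate is just the fraction of raw outputs that survive a JSON load; a minimal version of the metric, assuming plain `json.loads` as the pass/fail criterion:

```python
import json

def parse_validity(outputs: list) -> float:
    """Fraction of raw model outputs that load as JSON without a fallback."""
    def parses(s: str) -> bool:
        try:
            json.loads(s)
            return True
        except json.JSONDecodeError:
            return False
    return sum(parses(o) for o in outputs) / len(outputs)
```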
Risk-level exact-match is the classifier head that decides whether the document needs human review. Going from coin-flip (.50) to near-perfect (.94) means compliance teams only see records that genuinely need them — not every fourth row.
Review-EM stayed flat at .69, and that is my main focus for v2: double-label the human-review signal for consistency, scale the training corpus, and add an external hand-curated eval slice with a push gate.
Positioning
This model is not a replacement for your deterministic masking vendor. TDMS, Delphix, Informatica, and IBM Optim are better than any LLM at column-level masking on structured data — cheaper, faster, auditable, and they already own the referential-integrity graph for your landscape. Do not rip them out.
Position this model strictly as the long-tail layer for the text your schema-aware tools cannot see. That is the honest pitch, and it is the one that lands with data-privacy officers, SAP basis teams, and architects who have been burned by “AI does everything” sales pitches before.
Stack & links
- Base model: Qwen2.5-1.5B-Instruct
- Training: LoRA via PEFT / TRL, synthetic German business documents
- Hardware (training): RunPod A40, ~90 minutes end-to-end
- Hardware (inference): single consumer GPU, CPU possible for batch jobs
- License: Apache 2.0
- Output: Pydantic-validated structured JSON
- Adapter on HuggingFace: renezander030/qwen-2.5-1.5b-de-pii-redactor
- Full write-up on the blog: 95% of PII redaction doesn’t need an LLM. The other 5% is where your masker leaks.
If you are responsible for prod→non-prod data copies in a DACH landscape and you suspect your masker is leaking the 5% this model targets, I am happy to walk through your specific pipeline. No slide deck, thirty minutes, you bring an example column.
I reserve my audits for teams ready to take action on the results.
Book a 30-min call