German PII Redactor: Covering the 5% Blind Spot in SAP Data Masking

Drop-in layer after deterministic masking that redacts PII from free-text SAP columns before data lands in dev, QA, or training systems. Runs on the client's own hardware, DSGVO-konform by design.
The problem
Every SAP shop copies production data into dev, QA, and training landscapes. It is how you reproduce customer bugs on real payloads, load-test a release, and train end-users on data that looks like what they will see on Monday.
Every copy is a compliance event. DSGVO, Art. 5 requires personal data to be processed only for legitimate purposes and pseudonymised where practical. Most DACH enterprises have bought a deterministic masking tool — SAP TDMS, Delphix, Informatica TDM, IBM InfoSphere Optim — and wired it into the copy job. The tool rewrites classified columns: KNA1-NAME1 becomes Mustermann, BSEG-IBAN becomes a fake IBAN that still passes checksum, USR02-BNAME becomes USER042. That covers the ~95% of PII that lives in schema-aware, row-level columns.
The remaining 5% is where every program I have seen leaks:
- Free-text NOTES columns on customer, vendor, and case tables. Agents type in names, phone numbers, IBANs, and internal IDs that the schema cannot anticipate.
- Unclassified Z-tables. Customer-built extensions the data steward never got around to categorising. The masking tool does not know they exist, so it leaves them alone.
- OCR’d scan attachments. Invoices, IDs, insurance cards. Full name, date of birth, sometimes bank details — all sitting in the attachment table as raw text.
- Long-tail entity types. Deterministic tools know how to mask a German IBAN. They do not know the difference between a street address and a product description unless you classify every column.
Running a general-purpose LLM over 100% of production data to find this is a non-starter — the VP of one mid-sized DACH manufacturer put it plainly: “That is a thousand times the compute of what we already run.” He was right. The 95% does not need an LLM. But the 5% does.
The solution
A 1.5B-parameter fine-tuned German PII redactor that plugs in after the deterministic masker. The copy pipeline looks like this:
- Extract production data (SAP table unload, database dump, file export).
- Mask structured columns with your existing tool (TDMS, Delphix, Informatica). Unchanged.
- Route free-text and unclassified columns into the redactor. A tiny classifier decides which rows need it — most do not.
- Redactor emits structured JSON: { redacted_text, entities, risk_level, needs_human_review }.
- Reinject the redacted free-text back into the masked export.
- Load into dev / QA / training — compliant end-to-end.
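The routing step above can be sketched in a few lines. This is a simplified illustration, not the shipped classifier: `needs_redaction` stands in for the tiny classifier from step 3, here approximated with regex hints for German IBANs, dates, and phone numbers, and `redact` stands in for the model call.

```python
import re

# Hypothetical pre-filter: a cheap gate deciding whether a row's free text
# needs the LLM redactor at all. Most rows fail the gate and skip the model.
PII_HINTS = re.compile(
    r"\bDE\d{2}[ ]?(\d{4}[ ]?){4}\d{2}\b"   # German IBAN
    r"|\b\d{2}\.\d{2}\.\d{4}\b"             # dd.mm.yyyy date of birth
    r"|\b(\+49|0)[\d /-]{7,}\b"             # German phone number
)

def needs_redaction(text: str) -> bool:
    """Route to the redactor only when PII-like patterns appear."""
    return bool(PII_HINTS.search(text))

def route_rows(rows, redact):
    """Apply `redact` (the model call) only to rows the gate flags."""
    out = []
    for row in rows:
        if needs_redaction(row["note"]):
            row = {**row, "note": redact(row["note"])}
        out.append(row)
    return out
```

The point of the gate is cost: the nightly copy job only pays LLM inference for the minority of rows that plausibly contain PII.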
The redactor is a LoRA adapter on Qwen2.5-1.5B-Instruct. Small enough to run on a single consumer GPU, fast enough to keep up with a nightly copy job, specific enough to handle the twelve German entity types that matter in enterprise data: Name, Adresse, IBAN, Steuer-ID, Sozialversicherungsnummer, Krankenkassennummer, Geburtsdatum, Telefon, E-Mail, IP, Kfz-Kennzeichen, Kontoinhaber.
Output is structured, not span-based. Downstream tooling gets a JSON document it can validate and consume, not a highlighted blob of text. This is the difference between “we built a demo” and “we shipped a service.”
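The output contract can be pinned down as a Pydantic schema along these lines. Field names come from the JSON shape above; the risk-level enum values and the entity sub-fields are illustrative assumptions, not the published schema.

```python
from enum import Enum
from typing import List
from pydantic import BaseModel

class RiskLevel(str, Enum):
    # Assumed values; the real schema may differ.
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"

class Entity(BaseModel):
    type: str       # e.g. "Name", "IBAN", "Steuer-ID"
    pseudonym: str  # replacement token, e.g. "[PERSON_1]"

class RedactionResult(BaseModel):
    redacted_text: str
    entities: List[Entity]
    risk_level: RiskLevel
    needs_human_review: bool

# Validate a raw model output before it enters the nightly batch:
raw = (
    '{"redacted_text": "Kunde [PERSON_1], IBAN [IBAN_1]",'
    ' "entities": [{"type": "Name", "pseudonym": "[PERSON_1]"},'
    ' {"type": "IBAN", "pseudonym": "[IBAN_1]"}],'
    ' "risk_level": "high", "needs_human_review": false}'
)
result = RedactionResult.model_validate_json(raw)
```

Because the schema rejects malformed output at the boundary, downstream steps consume typed objects rather than re-parsing strings.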
Pseudonyms are referentially consistent. The same person becomes [PERSON_1] in every free-text column of their record. Dev engineers can still join across tables and reproduce the bug. Referential integrity is the feature enterprise masking tools guard most jealously; the redactor honours it.
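Referential consistency amounts to a per-record pseudonym table: the first time a surface form appears it gets a numbered token, and every later occurrence reuses it. A simplified illustration of the idea, not the adapter's internals:

```python
from collections import defaultdict

class PseudonymTable:
    """Per-record mapping: same surface form -> same numbered token."""

    def __init__(self):
        self._maps = defaultdict(dict)  # entity type -> {surface: token}

    def token(self, entity_type: str, surface: str) -> str:
        table = self._maps[entity_type]
        if surface not in table:
            table[surface] = f"[{entity_type.upper()}_{len(table) + 1}]"
        return table[surface]

tab = PseudonymTable()
a = tab.token("person", "Max Mustermann")    # first mention
b = tab.token("person", "Max Mustermann")    # later column, same record
c = tab.token("person", "Erika Musterfrau")  # different person
```

Here `a` and `b` both resolve to `[PERSON_1]` while `c` gets `[PERSON_2]`, so joins across free-text columns of the same record still line up.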
Runs entirely on the client’s own hardware. No third-party API, no data egress, no per-token billing. The adapter is distributed as a 150 MB file on HuggingFace. Load it onto the base model inside your landscape and you are done.
The results
The v1 adapter, trained on 75 synthetic German business documents and evaluated on a held-out set of 16, lifts the three metrics that matter:
| Metric | Base Qwen-1.5B | Fine-tuned adapter |
|---|---|---|
| Entity F1 | 0.375 | 0.917 |
| JSON parse validity | 81% | 100% |
| Risk-level exact-match | 0.50 | 0.94 |
Entity F1 is the core quality bar — can the model find and correctly type the PII in a free-text string? A 2.4× lift says the adapter learned the German-specific patterns the base model missed (especially IBAN formatting, Steuer-ID layout, and German address conventions).
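Entity F1 here is micro-averaged over (entity type, surface) pairs. A minimal scorer in that spirit (my formulation, not necessarily the exact eval harness):

```python
def entity_f1(gold: list, pred: list) -> float:
    """Micro F1 over sets of (entity_type, surface) pairs per document."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # found and correctly typed
        fp += len(p - g)   # hallucinated or mistyped
        fn += len(g - p)   # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [{("IBAN", "DE89..."), ("Name", "Max Mustermann")}]
pred = [{("IBAN", "DE89...")}]
score = entity_f1(gold, pred)  # precision 1.0, recall 0.5 -> F1 0.667
```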
JSON parse validity is the “can you actually use this in production” gate. 100% means every inference emits a document the downstream pipeline can load without a try/except fallback. Critical for nightly batch jobs where a single malformed output stalls the whole stream.
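The validity gate is just the fraction of raw outputs that survive a JSON load; a minimal version of the metric, assuming plain `json.loads` as the pass/fail criterion:

```python
import json

def parse_validity(outputs: list) -> float:
    """Fraction of raw model outputs that load as JSON without a fallback."""
    def parses(s: str) -> bool:
        try:
            json.loads(s)
            return True
        except json.JSONDecodeError:
            return False
    return sum(parses(o) for o in outputs) / len(outputs)
```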
Risk-level exact-match is the classifier head that decides whether the document needs human review. Going from coin-flip (.50) to near-perfect (.94) means compliance teams only see records that genuinely need them — not every fourth row.
Review-EM stayed flat at .69, and that is my main focus for v2: double-label the human-review signal for consistency, scale the training corpus, and add an external hand-curated eval slice with a push gate.
Positioning
This model is not a replacement for your deterministic masking vendor. TDMS, Delphix, Informatica, and IBM Optim are better than any LLM at column-level masking on structured data — cheaper, faster, auditable, and they already own the referential-integrity graph for your landscape. Do not rip them out.
Position this model strictly as the long-tail layer for the text your schema-aware tools cannot see. That is the honest pitch, and it is the one that lands with data-privacy officers, SAP basis teams, and architects who have been burned by “AI does everything” sales pitches before.
Stack & links
- Base model: Qwen2.5-1.5B-Instruct
- Training: LoRA via PEFT / TRL, synthetic German business documents
- Hardware (training): RunPod A40, ~90 minutes end-to-end
- Hardware (inference): single consumer GPU, CPU possible for batch jobs
- License: Apache 2.0
- Output: Pydantic-validated structured JSON
- Adapter on HuggingFace: renezander030/qwen-2.5-1.5b-de-pii-redactor
- Full write-up on the blog: 95% of PII redaction doesn’t need an LLM. The other 5% is where your masker leaks.
If you are responsible for prod→non-prod data copies in a DACH landscape and you suspect your masker is leaking the 5% this model targets, I am happy to walk through your specific pipeline. No slide deck, thirty minutes, you bring an example column.
I reserve my audits for teams ready to take action on the results.
Book a 30-min call