Field Note #1: 95% of SAP PII redaction does not need an LLM. The other 5% does.
Self-hosted German PII redactor for SAP prod-to-dev copies. Plugs in after TDMS/Delphix/Informatica to cover the free-text 5%. Entity F1 from .375 to .917.
- Subject: SAP PII redaction
- Industry: Enterprise / DACH manufacturing
- Stack: Qwen2.5-1.5B + LoRA, Pydantic structured output, on-prem GPU
The case
Every SAP shop copies production data into dev, QA, and training landscapes. It is how you reproduce bugs on real payloads and train end-users on data that looks like Monday. Every copy is a compliance event: the purpose-limitation and data-minimisation principles in DSGVO Art. 5 are why personal data gets pseudonymized before it lands in a non-production system.
Most DACH enterprises have bought a deterministic masking tool — SAP TDMS, Delphix, Informatica TDM, IBM InfoSphere Optim — and wired it into the copy job. The tool rewrites classified columns: KNA1-NAME1 becomes Mustermann, BSEG-IBAN becomes a fake IBAN that still passes checksum. That covers the ~95% of PII that lives in schema-aware, row-level columns.
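To make "a fake IBAN that still passes checksum" concrete, here is a minimal sketch of the ISO 13616 mod-97 check and a generator for fake German IBANs. The function names are mine, and this is not how TDMS, Delphix, or Informatica implement it; a real masker would also derive the fake value deterministically from the original so that repeated occurrences map to the same pseudonym.

```python
import random


def _iban_remainder(iban: str) -> int:
    """Move the first four characters to the end, map letters to
    numbers (A=10 ... Z=35), and return the result modulo 97."""
    rearranged = iban[4:] + iban[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97


def is_valid_iban(iban: str) -> bool:
    """Checksum test only; country-specific length rules are not enforced."""
    compact = iban.replace(" ", "").upper()
    if len(compact) < 5 or not (compact.isascii() and compact.isalnum()):
        return False
    return _iban_remainder(compact) == 1


def fake_german_iban(rng: random.Random) -> str:
    """Random 18-digit BBAN with correctly computed check digits."""
    bban = "".join(rng.choice("0123456789") for _ in range(18))
    check = 98 - _iban_remainder("DE00" + bban)
    return f"DE{check:02d}{bban}"
```

The generated value keeps the format and passes downstream validation, which is why column-level masking does not break dev and QA test cases that validate IBANs.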
The remaining 5% is where every program I have seen leaks: NOTES columns where agents type names and IBANs, unclassified Z-tables, OCR’d scan attachments, and long-tail entity types (Kfz, Steuer-ID, Krankenkassennummer) that deterministic rules do not know about.
A VP at a DACH manufacturer put it plainly: “Running an LLM over 100% of our SAP data is a thousand times the compute of what we already run.” He was right. The 95% does not need an LLM. The 5% does.
The numbers
| Metric | Baseline (1.5B base) | Fine-tuned redactor |
|---|---|---|
| Entity F1 | .375 | .917 |
| JSON parse validity | 81% | 100% |
| Risk-level exact-match | .50 | .94 |
| Inference latency (consumer GPU) | — | ~120 ms/row |
Twelve German entity types: Name, Adresse, IBAN, Steuer-ID, Sozialversicherungsnummer, Krankenkassennummer, Geburtsdatum, Telefon, E-Mail, IP, Kfz-Kennzeichen, Kontoinhaber. Model fits on a single RTX 4090 / A40.
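For readers who want to reproduce an entity F1 figure on their own gold standard, the sketch below shows one common definition: micro-averaged F1 over exact `(entity_type, value)` matches. This is an assumption on my part; the note does not spell out the matching rule behind the numbers in the table.

```python
def entity_f1(
    gold: list[set[tuple[str, str]]],
    pred: list[set[tuple[str, str]]],
) -> float:
    """Micro-averaged F1 over exact (entity_type, value) matches, row by row."""
    true_pos = sum(len(g & p) for g, p in zip(gold, pred))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    if true_pos == 0:
        # Also covers the degenerate case of no entities at all.
        return 0.0
    precision = true_pos / n_pred
    recall = true_pos / n_gold
    return 2 * precision * recall / (precision + recall)
```

Exact matching is strict: a predicted name of "Erika" against a gold "Erika Muster" counts as both a false positive and a false negative. A partial-overlap rule would give higher numbers on the same outputs.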
What worked
- Plug in after the deterministic masker, not instead of it. Existing TDMS/Delphix/Informatica rules keep working. The LLM only sees columns the deterministic stage skipped.
- Structured JSON output, not span highlighting. `{ redacted_text, entities, risk_level, needs_human_review }`. Downstream code validates with Pydantic, not a markup parser.
- Pseudonyms passed in as context. The deterministic tool’s mapping (`Person X → Mustermann`) goes into the LLM prompt so referential integrity stays consistent across both stages.
- Risk-level gates human review. `risk_level: high` or `needs_human_review: true` blocks the row from being written to dev. `low` flows through.
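The structured-output and gating points above can be sketched as follows. This is a standard-library stand-in so it runs without dependencies; the production pipeline validates with Pydantic. The `Redaction` class, the set of allowed risk levels, and the function names are assumptions for illustration.

```python
import json
from dataclasses import dataclass

RISK_LEVELS = {"low", "medium", "high"}  # assumed closed set
EXPECTED_KEYS = {"redacted_text", "entities", "risk_level", "needs_human_review"}


@dataclass(frozen=True)
class Redaction:
    redacted_text: str
    entities: tuple
    risk_level: str
    needs_human_review: bool


def parse_redaction(raw: str) -> Redaction:
    """Parse one model response; raise ValueError on any schema violation."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict) or set(data) != EXPECTED_KEYS:
        raise ValueError("response must be an object with exactly the four keys")
    if not isinstance(data["redacted_text"], str):
        raise ValueError("redacted_text must be a string")
    if not isinstance(data["entities"], list):
        raise ValueError("entities must be a list")
    if not isinstance(data["risk_level"], str) or data["risk_level"] not in RISK_LEVELS:
        raise ValueError("risk_level must be low, medium, or high")
    if not isinstance(data["needs_human_review"], bool):
        raise ValueError("needs_human_review must be a boolean")
    return Redaction(
        redacted_text=data["redacted_text"],
        entities=tuple(data["entities"]),
        risk_level=data["risk_level"],
        needs_human_review=data["needs_human_review"],
    )


def may_write_to_dev(raw: str) -> bool:
    """Block invalid output, high risk, or anything flagged for review."""
    try:
        result = parse_redaction(raw)
    except ValueError:
        return False
    return result.risk_level != "high" and not result.needs_human_review
```

In this sketch `medium` rows pass the gate; sampling them for review would be a separate step.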
What failed
- Running a general-purpose LLM over 100% of the data. Compute cost was roughly a thousand times that of the existing pipeline. The project stalled until we built the classifier-first pattern.
- Span-based outputs (e.g. token-level highlights). Downstream tooling expected JSON. Re-parsing spans into entity objects added bugs at the boundary.
- Naive prompts without anti-duplication. Early versions emitted the same IBAN twice across `redacted_text` and `entities`. A Pydantic schema with `Set` types fixed it, but it took a re-train to lock in.
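The set-typed fix for duplicate entities can be illustrated with a frozen dataclass, which is hashable and so collapses repeats when collected into a set. This is a standard-library sketch, not the Pydantic schema itself; the `type` and `value` field names and the IBAN whitespace normalisation are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    type: str
    value: str


def dedupe(raw_entities: list[dict]) -> frozenset[Entity]:
    """Collapse repeated entities; IBANs are compared ignoring spaces and case."""
    result = set()
    for item in raw_entities:
        value = item["value"]
        if item["type"] == "IBAN":
            value = value.replace(" ", "").upper()
        result.add(Entity(item["type"], value))
    return frozenset(result)
```

A set drops ordering, so anything downstream that relies on the order in which entities appear in the text needs to carry offsets separately.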
The architecture
SAP production
      │  extract (unload / dump / export)
      ▼
Deterministic masker ──── pseudonym map (context) ─────┐
(TDMS / Delphix / etc.)                                │
      │  ~95% of columns                               │
      ▼                                                │
Classifier (regex + pattern)                           │
      │  flags rows in NOTES / Z-tables / OCR for LLM  │
      ▼                                                │
LLM redactor ◄─────────────────────────────────────────┘
      │  JSON: { redacted_text, entities, risk_level, needs_human_review }
      ▼
Human review queue (if risk_level: high)
      │
      ▼
Re-inject into masked export ──► Dev / QA / Training
Full deployment stays inside the client’s boundary. No third-party API, no per-token billing. Diagram and configurable architecture brief at /sap-pii-redaction-architecture/.
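The classifier and prompt-building steps in the diagram can be sketched like this. The regex patterns, prompt wording, and function names are hypothetical; a production classifier would cover all twelve entity types and would be tuned against the gold standard.

```python
import re
from typing import Optional

# Hypothetical patterns for three of the entity types.
PATTERNS = {
    "IBAN": re.compile(r"\bDE\d{2}(?: ?\d{4}){4} ?\d{2}\b"),
    "E-Mail": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "Steuer-ID": re.compile(r"\b\d{11}\b"),
}


def flag_for_llm(text: str) -> set[str]:
    """Names of the patterns that match anywhere in the free text."""
    return {name for name, pattern in PATTERNS.items() if pattern.search(text)}


def build_prompt(text: str, pseudonyms: dict[str, str]) -> str:
    """Include only the mappings whose real value occurs in this row."""
    relevant = sorted((real, fake) for real, fake in pseudonyms.items() if real in text)
    mapping = "\n".join(f"- {real} -> {fake}" for real, fake in relevant)
    return (
        "Redact all personal data in the text below and answer as JSON with "
        "redacted_text, entities, risk_level, needs_human_review.\n"
        f"Use these existing pseudonyms:\n{mapping}\n"
        f"Text:\n{text}"
    )


def route(text: str, pseudonyms: dict[str, str]) -> Optional[str]:
    """Return an LLM prompt for rows that look risky, else None."""
    has_known_name = any(real in text for real in pseudonyms)
    if not flag_for_llm(text) and not has_known_name:
        return None  # row passes through without an LLM call
    return build_prompt(text, pseudonyms)
```

Rows where `route` returns `None` never reach the model, which is what keeps LLM compute proportional to the free-text share of the data rather than to the whole extract.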
Next-step checklist
- Inventory NOTES columns + Z-tables + OCR attachment tables in scope. Aim for ≤30 to start.
- Collect 200 rows of free-text from each. Hand-label as gold standard.
- Baseline your existing masker against the gold standard. Find which 5% it misses.
- Pick a 1.5B-3B base model. Train a LoRA adapter on 1k-3k labeled examples. Budget under $50 on a rented A40 (RunPod, for example).
- Eval against gold standard. Target entity F1 ≥ 0.90, JSON parse 100%, risk-level exact-match ≥ 0.90.
- Deploy as a step after your masker, not instead of it. Reuse its pseudonym map.
- Wire human-review queue. Start with `risk_level: high` blocking, `medium` sampling.
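The eval step in the checklist can be turned into a simple release gate. The sketch below computes JSON parse rate and risk-level exact-match and compares a metrics dict against the targets listed above; the metric key names and function names are assumptions.

```python
import json

# Targets from the checklist: minimum acceptable value per metric.
TARGETS = {"entity_f1": 0.90, "json_parse_rate": 1.00, "risk_exact_match": 0.90}


def json_parse_rate(raw_outputs: list[str]) -> float:
    """Share of model responses that parse as JSON."""
    parsed = 0
    for raw in raw_outputs:
        try:
            json.loads(raw)
            parsed += 1
        except json.JSONDecodeError:
            pass
    return parsed / len(raw_outputs)


def risk_exact_match(gold: list[str], pred: list[str]) -> float:
    """Share of rows where the predicted risk level equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def failed_targets(metrics: dict[str, float]) -> list[str]:
    """Names of metrics below target; an empty list means the adapter can ship."""
    return sorted(name for name, floor in TARGETS.items() if metrics[name] < floor)
```

Fed the two columns of the results table, the baseline fails all three targets and the fine-tuned redactor passes all three.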
Full case study: Enterprise AI PII Redaction System. Article: "95% of PII Redaction Doesn’t Need an LLM. The Other 5% Does." Model card: HuggingFace.