
95% of SAP PII redaction does not need an LLM. The other 5% does.

Self-hosted German PII redactor for SAP prod-to-dev copies. Plugs in after TDMS/Delphix/Informatica to cover the free-text 5%. Entity F1 from .375 to .917.

Subject: SAP PII redaction
Industry: Enterprise / DACH manufacturing
Stack: Qwen2.5-1.5B + LoRA, Pydantic structured output, on-prem GPU

The case

Every SAP shop copies production data into dev, QA, and training landscapes. It is how you reproduce bugs on real payloads and train end-users on data that looks like Monday. Every copy is a compliance event: DSGVO Art. 5 (purpose limitation, data minimisation) applies to the copy, and pseudonymization is the standard safeguard named in Art. 25 and Art. 32.

Most DACH enterprises have bought a deterministic masking tool — SAP TDMS, Delphix, Informatica TDM, IBM InfoSphere Optim — and wired it into the copy job. The tool rewrites classified columns: KNA1-NAME1 becomes Mustermann, BSEG-IBAN becomes a fake IBAN that still passes checksum. That covers the ~95% of PII that lives in schema-aware, row-level columns.
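To make "a fake IBAN that still passes checksum" concrete, here is a minimal sketch of a checksum-preserving mask for German IBANs. It is not how TDMS, Delphix, or Informatica implement masking. The function names, the HMAC-based derivation, and the `key` parameter are assumptions for illustration. The mod-97 check itself is the standard IBAN validation rule.

```python
import hashlib
import hmac


def _as_number(chars: str) -> int:
    # IBAN rule: digits stay as they are, letters map to A=10 ... Z=35.
    return int("".join(str(int(c, 36)) for c in chars))


def iban_is_valid(iban: str) -> bool:
    iban = iban.replace(" ", "").upper()
    # Move country code and check digits to the end, then test mod 97 == 1.
    return _as_number(iban[4:] + iban[:4]) % 97 == 1


def fake_german_iban(real_iban: str, key: bytes) -> str:
    """Hypothetical masker: same input and key always give the same fake IBAN."""
    normalized = real_iban.replace(" ", "").upper().encode()
    digest = hmac.new(key, normalized, hashlib.sha256).hexdigest()
    # German BBAN is 18 digits (8-digit bank code + 10-digit account number).
    bban = str(int(digest, 16) % 10**18).zfill(18)
    # Compute check digits so the full IBAN satisfies mod 97 == 1.
    check = 98 - _as_number(bban + "DE00") % 97
    return f"DE{check:02d}{bban}"
```

Because the fake value is derived from the real one with a keyed hash, the same customer gets the same fake IBAN in every table, which is what keeps joins working after masking. The generated bank code will usually not belong to a real bank, so systems that validate against a bank directory would need a different approach.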

The remaining 5% is where every program I have seen leaks: NOTES columns where agents type names and IBANs, unclassified Z-tables, OCR’d scan attachments, and long-tail entity types (Kfz, Steuer-ID, Krankenkassennummer) that deterministic rules do not know about.

A VP at a DACH manufacturer put it plainly: “Running an LLM over 100% of our SAP data is a thousand times the compute of what we already run.” He was right. The 95% does not need an LLM. The 5% does.

The numbers

Metric	Baseline (1.5B base)	Fine-tuned redactor
Entity F1	.375	.917
JSON parse validity	81%	100%
Risk-level exact-match	.50	.94
Inference latency (consumer GPU)	—	~120 ms/row

Twelve German entity types: Name, Adresse, IBAN, Steuer-ID, Sozialversicherungsnummer, Krankenkassennummer, Geburtsdatum, Telefon, E-Mail, IP, Kfz-Kennzeichen, Kontoinhaber. Model fits on a single RTX 4090 / A40.
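For readers who want to reproduce the three quality metrics in the table on their own gold standard, the sketch below shows one common way to compute them. It assumes micro-averaged entity F1 over exact (type, value) matches. The write-up does not state which F1 variant was used, so treat that choice and all function names as assumptions.

```python
import json


def entity_f1(gold: list[set], pred: list[set]) -> float:
    """Micro-averaged F1 over exact (type, value) pairs, one set per row."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # predicted and correct
        fp += len(p - g)   # predicted but not in gold
        fn += len(g - p)   # in gold but missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def json_parse_validity(raw_outputs: list[str]) -> float:
    """Share of raw model outputs that parse as JSON at all."""
    ok = 0
    for raw in raw_outputs:
        try:
            json.loads(raw)
            ok += 1
        except json.JSONDecodeError:
            pass
    return ok / len(raw_outputs)


def exact_match(gold_levels: list[str], pred_levels: list[str]) -> float:
    """Share of rows where the predicted risk level equals the gold label."""
    hits = sum(g == p for g, p in zip(gold_levels, pred_levels))
    return hits / len(gold_levels)
```

Exact matching is strict: a predicted name with a trailing title or a differently spaced IBAN counts as both a miss and a false positive, so normalise values before scoring if that is not the behaviour you want.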

What worked

Plug in after the deterministic masker, not instead of it. Existing TDMS/Delphix/Informatica rules keep working. The LLM only sees columns the deterministic stage skipped.
Structured JSON output, not span highlighting. { redacted_text, entities, risk_level, needs_human_review }. Downstream code validates with Pydantic, not a markup parser.
Pseudonyms passed in as context. The deterministic tool’s mapping (Person X → Mustermann) goes into the LLM prompt so referential integrity stays consistent across both stages.
Risk-level gates human review. Either risk_level: high or needs_human_review: true blocks the row from being written to dev. Rows marked low flow through.
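The output contract and the review gate from the list above can be sketched as follows. The production system validates with Pydantic. This sketch swaps in stdlib dataclasses so it runs without dependencies, and it is not the author's schema. The entity field names `type` and `value`, and the helper names, are assumptions. The four top-level keys and the three risk levels come from the text.

```python
import json
from dataclasses import dataclass

ENTITY_TYPES = frozenset({
    "Name", "Adresse", "IBAN", "Steuer-ID", "Sozialversicherungsnummer",
    "Krankenkassennummer", "Geburtsdatum", "Telefon", "E-Mail", "IP",
    "Kfz-Kennzeichen", "Kontoinhaber",
})
RISK_LEVELS = ("low", "medium", "high")


@dataclass(frozen=True)
class Entity:
    type: str
    value: str


@dataclass(frozen=True)
class Redaction:
    redacted_text: str
    entities: tuple
    risk_level: str
    needs_human_review: bool


def parse_redaction(raw: str) -> Redaction:
    """Parse one model output; raise ValueError on any contract violation."""
    data = json.loads(raw)  # JSONDecodeError is a subclass of ValueError
    try:
        entities = tuple(Entity(e["type"], e["value"]) for e in data["entities"])
        result = Redaction(
            redacted_text=data["redacted_text"],
            entities=entities,
            risk_level=data["risk_level"],
            needs_human_review=data["needs_human_review"],
        )
    except (KeyError, TypeError) as exc:
        raise ValueError(f"missing or malformed field: {exc}") from exc
    if not isinstance(result.redacted_text, str):
        raise ValueError("redacted_text must be a string")
    if result.risk_level not in RISK_LEVELS:
        raise ValueError(f"unknown risk_level: {result.risk_level!r}")
    if not isinstance(result.needs_human_review, bool):
        raise ValueError("needs_human_review must be a boolean")
    unknown = {e.type for e in entities} - ENTITY_TYPES
    if unknown:
        raise ValueError(f"unknown entity types: {sorted(unknown)}")
    return result


def blocks_write(result: Redaction) -> bool:
    # A row is held back if either signal fires.
    return result.risk_level == "high" or result.needs_human_review
```

Rejecting an output that fails validation is safer than repairing it: a row that cannot be parsed should be treated like a high-risk row and never written to dev as-is.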

What failed

Running a general-purpose LLM over 100% of the data. Cost was roughly a thousand times that of the existing pipeline. The project stalled until we built the classifier-first pattern.
Span-based outputs (e.g. token-level highlights). Downstream tooling expected JSON. Re-parsing spans into entity objects added bugs at the boundary.
Naive prompts without anti-duplication. Early versions emitted the same IBAN twice across redacted_text and entities. A Pydantic schema with Set types fixed it, though it took a re-train to lock in.
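The duplication failure can be guarded against after the fact with two small checks, sketched below. This reads the failure as two separate problems: the same entity listed more than once, and an entity value still present verbatim in redacted_text. That reading, the (type, value) pair representation, the `[NAME]`-style placeholders, and the function names are all assumptions, not the author's code.

```python
def dedupe_entities(entities) -> set:
    """Collapse repeated (type, value) pairs; surrounding whitespace is ignored."""
    return {(etype, value.strip()) for etype, value in entities}


def leaked_values(redacted_text: str, entities) -> list:
    """Return entity values that still appear verbatim in the redacted text."""
    values = {value for _, value in dedupe_entities(entities)}
    return sorted(v for v in values if v in redacted_text)
```

A non-empty result from `leaked_values` is a reasonable trigger for setting needs_human_review, since it means the text and the entity list disagree about what was redacted.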

The architecture

SAP production
   │ extract (unload / dump / export)
   ▼
Deterministic masker  ──── pseudonym map (context) ─────┐
(TDMS / Delphix / etc.)                                 │
   │ ~95% of columns                                    │
   ▼                                                    │
Classifier (regex + pattern)                            │
   │ flags rows in NOTES / Z-tables / OCR for LLM       │
   ▼                                                    │
LLM redactor  ◄─────────────────────────────────────────┘
   │ JSON: { redacted_text, entities, risk_level, needs_human_review }
   ▼
Human review queue (if risk_level: high or needs_human_review: true)
   │
   ▼
Re-inject into masked export ──► Dev / QA / Training

Full deployment stays inside the client’s boundary. No third-party API, no per-token billing. Diagram and configurable architecture brief at /sap-pii-redaction-architecture/.
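The two middle stages of the diagram, the regex classifier and the hand-off of the pseudonym map to the LLM, might look like the sketch below. The patterns are deliberately loose and cover only a few of the twelve entity types. The pattern set, the hint words, the prompt wording, and the function names are assumptions for illustration, not the deployed rules.

```python
import re

# Hypothetical high-recall patterns; over-flagging is acceptable at this stage.
PATTERNS = {
    "IBAN": re.compile(r"\bDE\d{2}(?: ?\d{4}){4} ?\d{2}\b"),
    "E-Mail": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "Telefon": re.compile(r"(?<!\w)(?:\+49|0)[\d /-]{6,}\d"),
    "Geburtsdatum": re.compile(r"\b\d{1,2}\.\d{1,2}\.(?:19|20)\d{2}\b"),
}
# Words that often precede a name or account holder in agent notes.
HINTS = re.compile(r"\b(?:Herr|Frau|Kontoinhaber|geboren)\b", re.IGNORECASE)


def triggers(text: str) -> list:
    """Return the names of all patterns that fire; an empty list skips the LLM."""
    hits = [name for name, pattern in PATTERNS.items() if pattern.search(text)]
    if HINTS.search(text):
        hits.append("hint")
    return hits


def build_prompt(text: str, pseudonyms: dict) -> str:
    """Build the redactor prompt, passing only pseudonyms relevant to this row."""
    relevant = {real: fake for real, fake in pseudonyms.items() if real in text}
    mapping = "\n".join(f"{real} -> {fake}" for real, fake in sorted(relevant.items()))
    return (
        "Redact all PII in the German text below. Reuse these pseudonyms:\n"
        f"{mapping}\n\nText:\n{text}\n\nAnswer with JSON only."
    )
```

Filtering the map to entries that occur in the row keeps the prompt short and avoids showing the model real names that have nothing to do with the text. The classifier is tuned for recall on purpose: a false positive costs one LLM call, a false negative is a leak.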

Next-step checklist

Inventory NOTES columns + Z-tables + OCR attachment tables in scope. Aim for ≤30 to start.
Collect 200 rows of free-text from each. Hand-label as gold standard.
Baseline your existing masker against the gold standard. Find which 5% it misses.
Pick a 1.5B-3B base model. Train a LoRA adapter on 1k-3k labeled examples. RunPod A40, < $50.
Eval against gold standard. Target entity F1 ≥ 0.90, JSON parse 100%, risk-level exact-match ≥ 0.90.
Deploy as a step after your masker, not instead of it. Reuse its pseudonym map.
Wire human-review queue. Start with risk_level: high blocking, medium sampling.
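For the last checklist item, one way to implement "high blocking, medium sampling" is a deterministic hash-based sample, sketched below. The 10% default rate, the string row ID, and the three return labels are assumptions. The text only specifies that high blocks and medium is sampled.

```python
import hashlib


def review_decision(row_id: str, risk_level: str, needs_human_review: bool,
                    medium_sample_rate: float = 0.1) -> str:
    """Return 'block', 'sample', or 'pass' for one redacted row."""
    if risk_level == "high" or needs_human_review:
        return "block"  # held until a reviewer releases it
    if risk_level == "medium":
        # Hash the row ID into one of 10,000 buckets so that the same row
        # gets the same decision on every re-run of the copy job.
        bucket = int(hashlib.sha256(row_id.encode()).hexdigest(), 16) % 10_000
        if bucket < medium_sample_rate * 10_000:
            return "sample"  # written to dev, but also queued for spot-check
    return "pass"
```

Hashing instead of calling a random number generator makes the sample reproducible, which helps when an auditor asks why a given row was or was not reviewed.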

Full case study: Enterprise AI PII Redaction System. Article: “95% of PII Redaction Doesn’t Need an LLM. The Other 5% Does.” Model card: HuggingFace.

Does this shape match what you're building?

If you want me to scope a similar system for you, I respond within 24 hours.
