Local LLMs for First-Pass Medical PII Redaction
By wGrow Project Team ·
Medical groups want GPT-4 to summarize unstructured clinical referral notes. Compliance officers block the API key. Stop trying to negotiate. The compliance team is right.
An unredacted patient referral can contain a full name, an NRIC number, a date of birth, a residential address, and a clinical history. Sending that payload to a managed cloud API — even one backed by a Data Processing Agreement — introduces data-transfer and sub-processor obligations that these clinic deployments could not clear under the client’s audit requirements. Patient information in Singapore is classified by sensitivity, and the controls applied to it must match that classification. For the clinic deployments described here, routing raw clinical notes through a US-hosted inference endpoint did not meet the compliance bar set by the client’s data classification and audit requirements.
The conflict here is real, not bureaucratic. Clinical reasoning capability sits in the cloud. Patient data sits on-premise. The conventional response — negotiate an enterprise agreement, sign the DPA, invoke the BAA equivalent — keeps failing local audits. The architecture described here sidesteps that negotiation entirely. A local, quantized language model runs inside the clinic network as a data scrubber; its only job is to locate and replace PII before any text leaves the building. The architecture is designed so identifiable data does not cross the network boundary; in production, the local pass still needs fallback handling for low-confidence spans and OCR-corrupted text. The scrubbed text then proceeds to the cloud model for the clinical work it is actually built for.
Delivery Scars from 2018

In 2018, we delivered a queue management system for a Singapore primary care clinic group. The brief was straightforward: digitize patient intake, reduce waiting time, generate structured records from triage nurses’ freeform notes.
That project changed how we think about healthcare data residency expectations in Singapore — and it has shaped every medical-tech engagement since.
The definition of PII in a medical context is not the same as in a retail loyalty program. An NRIC number is PII. A name paired with a clinic visit date is PII. A phone number sitting adjacent to a medication name is PII. Combine any two of those fields in a single document and the sensitivity classification climbs further.
The first naive approach — one we watched several other vendors take during that period — was manual redaction before uploading. Staff were trained to highlight and delete before export. Under production load, manual redaction is inconsistent; the cognitive burden of reviewing dozens of notes per shift introduces gaps that only surface during audits. The second naive approach was to trust the enterprise agreement. The argument ran: the cloud provider signs a DPA, the DPA covers medical data, therefore we are compliant. That argument does not survive a line-by-line reading of what those agreements actually guarantee about processing location, sub-processor access, and audit rights.
What we internalized in 2018: for these clinic audits, the network boundary was the operative control boundary. If the byte left the building, that transfer had to be defensible on its own merits. A DPA assigned accountability for the transfer; it did not prevent one.
Why Regex Fails the NRIC Problem
Our internal proof-of-concept for automated referral routing started, as these projects tend to, with regular expressions.
Singapore NRICs carry a single-letter prefix — S or T — followed by seven digits and a check letter. Foreign identification numbers (FIN) follow the same structure with different prefixes: F, G, or M. The regex writes itself in thirty seconds. It catches the clean cases.
It does not catch what clinicians actually type.
Doctors and nurses type fast, under cognitive load, often on small screens. In our PoC corpus — drawn from an anonymized clinical archive — we found NRICs embedded directly adjacent to vital-sign readings with no whitespace: S1234567D130/85mmHg. We found NRICs with the check digit transposed. Phone numbers formatted as eight digits with no spaces, as four-plus-four with a hyphen, and as a full +65 international prefix — all in the same dataset. Patient names split across a line break, or abbreviated to initials in one sentence and expanded three sentences later.
A regex keyed to the canonical NRIC pattern misses S1234567 D (space before the check digit) and s1234567d (lowercase). It misses a referral where the nurse typed the patient’s name in all-caps as part of a template header. The miss rate for the regex-only pipeline in our PoC, measured against human-reviewed ground truth, was 11 percent. For patient referral data, 11 percent is not a tuning problem. It is a compliance failure.
Deterministic rules fail on non-deterministic human input. You need semantic understanding just to locate the name.
Deploying Llama 3 8B on an M2 Mac Mini

The fix is a quantized local LLM running inside the clinic network boundary.
We tested two models: Mistral 7B and Llama 3 8B, both quantized to 4-bit precision using GGUF format and served via llama.cpp. Four-bit quantization introduces some precision loss relative to full-weight inference; for a constrained entity-extraction task, that degradation is minimal — the benchmark numbers below confirm it. The hardware is a standard Apple Mac Mini with an M2 chip and 16 GB of unified memory. Under S$1,200. Fits in a clinic rack alongside the existing server stack.
A 4-bit quantized Llama 3 8B model loads into approximately 5.5 GB on the M2. At 16 GB unified, that leaves comfortable headroom for the operating system and the surrounding application stack. The M2’s GPU cores handle inference via Metal without thermal throttling on the referral note sizes we tested, which ranged from 80 to 400 tokens.
The system prompt is deliberately narrow. The local model is not asked to understand the clinical content, diagnose the patient, or produce a summary. It receives one instruction: identify all names, NRIC numbers, phone numbers, dates of birth, and residential addresses in the text, and replace each with a typed placeholder. [PATIENT_NAME]. [PATIENT_ID]. [PATIENT_PHONE]. The model returns the redacted text. The application logs the substitution map locally, encrypted at rest, for audit purposes. Only text that passes the redaction gate proceeds to the external API.
This separation of concerns is the architecture’s load-bearing column. The local 8B model handles compliance. The cloud model handles reasoning. Neither is asked to do the other’s job.
The 500-Millisecond Trade-Off
- node
- — Apple M2 Mac Mini
- memory
- — 16GB Unified
- model
- — Llama 3 8B
- quantization
- — 4-bit (GGUF)
- system prompt
- — Strict NER replacement
We benchmarked three approaches against the same 200-note test set on the M2 Mac Mini, with human-reviewed PII annotations as ground truth. The set is small by production standards; treat these numbers as directional until validated at scale.
The regex pipeline processed each note in under 10 milliseconds. PII recall: 89 percent. At an 11 percent miss rate, it fails the compliance bar.
Mistral 7B at 4-bit added approximately 380 milliseconds per note. PII recall: 96 percent. Better — but a 4 percent miss rate on this data category is still a concern.
Llama 3 8B at 4-bit added approximately 500 milliseconds per note. PII recall: 99 percent. The remaining 1 percent were edge cases where NRICs were embedded inside OCR transcription artifacts — characters garbled beyond recognition. Those defeat any automated approach short of human review.
Whether 500 milliseconds is acceptable overhead depends on context. For a referral routing pipeline that processes notes asynchronously — the common case in clinic back-office workflows — it is. The note arrives, the local model scrubs it, the scrubbed version goes to the cloud API for summarization, and the summary returns. The clinician sees a result within two to three seconds total. That additional half-second of local inference is invisible at the workflow level. A synchronous, patient-facing context might demand different latency targets; this architecture is optimized for background processing.
The ROI framing is direct. Five hundred milliseconds of local compute per referral buys a data transfer the compliance team can defensibly sign off on. Under the Personal Data Protection (Amendment) Act 2020, financial penalties for a data breach can reach S10 million; for those exceeding that threshold, the cap rises to 10 percent of annual local turnover (Personal Data Protection (Amendment) Act 2020) — well past the cost of this hardware stack in either case. Below that threshold, the reputational cost of a notifiable data breach in a medical context is not recoverable quickly.
Save Your Cloud Budget for Clinical Reasoning

Local LLMs are not toys for privacy hobbyists. They are utility-grade compliance components for enterprise architectures that need both data sovereignty and AI capability.
The pattern generalizes beyond medical referrals. Any workflow combining sensitive on-premise data with cloud AI reasoning faces this boundary problem — and the structural answer is the same each time. A local, quantized model at the edge handles the compliance-sensitive pass on raw data. The cloud model handles the cognitively expensive pass on clean data.
For a medical group evaluating cloud AI for clinical documentation, the practical rule is this: do not pay cloud API rates to find a name. A 4-bit 8B model on commodity hardware finds the name in 500 milliseconds, inside your network; the defensible compliance argument comes from pairing that pass with confidence thresholds, local audit logs, and human fallback for the spans it cannot clear — not from the recall number alone. What actually requires a hosted large model is interpreting what the patient was referred for, cross-referencing against clinical guidelines, and drafting a structured summary a specialist can act on.
Build the pipeline in two stages. Stage one: local inference, PII removed, audit log written. Stage two: cloud inference, clinical reasoning applied, structured output returned. The network boundary is crossed only after the redaction gate passes — with fallback handling covering the spans the local model cannot clear.
That is not a trade-off between privacy and capability. It is a correct separation of concerns. The compliance officer and the CTO can both say yes to it.