Compliance 23 March 2026 · 7 min

The Right to be Forgotten in a Vector Database

By wGrow Project Team · 23 March 2026

The Auditor, the Float Array, and the PDPA

An auditor sits across the table and asks you to prove that a specific user’s personal data has been erased from your system. You open your vector database console and show them a dense index filled with records like [0.123, -0.456, 0.890, ...]. The auditor stares at you. You stare at the float arrays. Nobody is happy.

Singapore’s Personal Data Protection Act (PDPA) requires, under Section 25 (the Retention Limitation Obligation), that organisations cease retaining personal data, or anonymise it, once retention no longer serves the purpose for which the data was collected and is no longer necessary for legal or business purposes. The obligation is clear in statute. The mechanism is anything but.

In a traditional SQL database, DELETE FROM users WHERE id = X is a complete, auditable answer. In a Retrieval-Augmented Generation (RAG) system, it isn’t. Chunks of user-generated documents, HR profiles, or support tickets have been sliced, embedded via a model like OpenAI’s text-embedding-3-large, and pushed into a vector store as high-dimensional float arrays. The original text is gone from that layer. What remains is a mathematical representation of it — and compliance officers need deterministic proof of destruction while solution architects are staring at black-box geometry. That gap cannot be closed with semantic search.

The Cost of Brute-Force Compliance

On our first RAG deployment, every deletion request forced a bulk namespace wipe and a full re-embedding pass. Cost scaled with total corpus size, not with the deleted user’s footprint, a distinction that matters the moment your corpus grows past a few thousand documents.

We first ran into this on an enterprise knowledge base built for a local SME. The ingestion pipeline chunked uploaded documents, embedded them via OpenAI, and pushed them to a managed vector store. We did not implement strict chunk lineage metadata at the time. That was a mistake — one that only became visible once compliance pressure arrived.

When the client determined there was no remaining basis to retain an ex-employee’s uploaded documents — performance reviews, project briefs, a personal statement — we couldn’t isolate the specific vectors. We had document hashes in Postgres. We did not have the corresponding vector IDs mapped to those hashes.

Our fallback was brute-force bulk re-indexing: wipe the entire vector namespace, delete the source files, trigger a full rebuild from the remaining documents. It works at small scale. It absolutely does not scale operationally. Over roughly two weeks, a handful of deletion requests drove material embedding API charges and significant compute overhead. As the corpus grows, rebuild cost scales proportionally. Run that calculation against a client with 200,000 documents and you will revisit every architecture decision you made in haste.
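
To make the scaling argument concrete, here is a back-of-envelope sketch of the work a brute-force rebuild does. Every figure in it is a hypothetical placeholder (corpus size, average tokens per document, number of requests, the size of the user's footprint), not a number from our project. Multiply the token count by your provider's current embedding price to get a cost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Corpus:
    documents: int            # documents remaining after the deletion
    avg_tokens_per_doc: int   # assumed average; measure your own corpus


def tokens_reembedded_brute_force(corpus: Corpus, deletion_requests: int) -> int:
    """Each request wipes the namespace and re-embeds everything that remains."""
    return deletion_requests * corpus.documents * corpus.avg_tokens_per_doc


def amplification(corpus: Corpus, user_documents: int) -> float:
    """Documents reprocessed per document actually erased, for one request."""
    return corpus.documents / user_documents


# Hypothetical figures, for illustration only.
corpus = Corpus(documents=200_000, avg_tokens_per_doc=1_500)

print(tokens_reembedded_brute_force(corpus, deletion_requests=5))  # 1500000000
print(amplification(corpus, user_documents=40))                    # 5000.0
```

A delete-by-ID approach re-embeds nothing, so its embedding cost is zero regardless of corpus size. The amplification factor is the part that hurts: the work grows with everyone else's data, not with the data you were asked to erase.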

We needed something surgical. The SME use case forced one.

Why You Cannot Search for PII to Delete It

The engineering difficulty in vector deletion is not the deletion itself. Vector stores expose a delete-by-ID API; deletion is a single call. Finding the right nodes is where compliance fails.

A tempting but unreliable pattern is using semantic similarity search to locate chunks containing Personally Identifiable Information (PII). You cannot query Pinecone or Qdrant with “Find all chunks mentioning NRIC S1234567A” and expect a complete result set. It won’t work, not because the tools are bad, but because the retrieval model is structurally wrong for this job.

Semantic search surfaces nearest neighbours based on mathematical proximity in embedding space. Production vector stores are built for relevance and latency, not exhaustive recall — and top-k similarity search is the wrong primitive for proving that every matching record has been found. It will miss outliers — chunks where PII appears in a tabular context, heavily fragmented records, passages that are contextually dissimilar to your query vector but contain the exact target string. A single missed vector is a PDPA failure. There is no margin here.

Rely on semantic search to locate data for an erasure request and you will leave orphaned personal data in your index. This is not a theoretical risk. It is a structural property of approximate nearest-neighbour retrieval.
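
A toy illustration of that failure mode, using hand-written two-dimensional vectors in place of real embeddings. The index contents, IDs, and helper names are all made up for the example. The point is only that a top-k ranking and an exact metadata filter answer different questions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


# Hypothetical toy index. Real embeddings have hundreds or thousands of dimensions.
INDEX = [
    {"id": "v1", "vec": (0.9, 0.1), "user_id": "usr_401"},    # prose that reads like the query
    {"id": "v2", "vec": (0.8, 0.3), "user_id": "usr_401"},
    {"id": "v3", "vec": (0.1, 0.9), "user_id": "usr_401"},    # same user, table fragment, far from the query
    {"id": "v4", "vec": (0.95, 0.05), "user_id": "usr_777"},  # different user, similar wording
]


def top_k(query, k):
    """Similarity search: rank by proximity, keep the k nearest."""
    ranked = sorted(INDEX, key=lambda r: cosine(query, r["vec"]), reverse=True)
    return [r["id"] for r in ranked[:k]]


def by_metadata(user_id):
    """Exact-match filter: every record tagged with this user, nothing else."""
    return [r["id"] for r in INDEX if r["user_id"] == user_id]


query = (1.0, 0.0)  # stands in for the embedding of "everything about usr_401"

print(top_k(query, k=3))       # ['v4', 'v1', 'v2']
print(by_metadata("usr_401"))  # ['v1', 'v2', 'v3']
```

The similarity result is wrong in both directions: it pulls in another user's chunk (v4) and misses one that belongs to the target user (v3). Raising k does not fix this in general, because you never know how large k must be to be exhaustive.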

Strict Metadata Tagging and the Hard Drop

Vector Object Payload

{
  "id": "vec_8f72b1",
  "values": [0.12, -0.45, 0.89],
  "metadata": {
    "tenant_id": "tn_992",
    "user_id": "usr_401",        ← ①
    "document_id": "doc_77a",    ← ②
    "chunk_index": 4
  }
}

① Enables surgical deletion of a single user's footprint
② Required for dropping specific deleted files

The most auditable approach is deterministic metadata filtering. We implemented this on a subsequent project — a compliance chatbot for a public-sector client, where tenant and user data segregation was contractual, not just regulatory. The discipline that context forced was instructive.

During ingestion, every text chunk is tagged with strict metadata before the embedding is computed. A minimal required payload looks like this:

{
  "tenant_id": "t_abc123",
  "user_id": "u_xyz789",
  "document_id": "doc_20250501_001",
  "chunk_index": 3,
  "ingested_at": "2025-05-01T09:12:00Z"
}

This metadata travels with the vector into the index. When a PDPA deletion request arrives, the application queries Postgres for all document_id values associated with that user_id. It then issues a metadata-filtered query against the vector store — a hard filter, not a similarity search — to retrieve the exact vector IDs tagged to those document IDs. Then it issues a hard delete by ID via the provider’s API.
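
The flow described above can be sketched end to end. This is a minimal illustration under stated assumptions: `sqlite3` stands in for Postgres so the example is self-contained, `InMemoryVectorStore` and its method names are hypothetical rather than any provider's API, and the `documents` table layout is invented for the example.

```python
import sqlite3


class InMemoryVectorStore:
    """Stand-in for a managed vector store. Method names are hypothetical."""

    def __init__(self):
        self._records = {}

    def upsert(self, vector_id, values, metadata):
        self._records[vector_id] = {"values": values, "metadata": metadata}

    def ids_where(self, field, allowed):
        """Exact-match metadata filter. No similarity scoring is involved."""
        allowed = set(allowed)
        return sorted(
            vid for vid, rec in self._records.items()
            if rec["metadata"][field] in allowed
        )

    def delete(self, ids):
        for vid in ids:
            self._records.pop(vid, None)


def erase_user(db, store, user_id):
    """Resolve user -> documents -> vector IDs, then hard delete by ID."""
    rows = db.execute(
        "SELECT document_id FROM documents WHERE user_id = ?", (user_id,)
    ).fetchall()
    document_ids = [row[0] for row in rows]
    vector_ids = store.ids_where("document_id", document_ids)
    store.delete(vector_ids)
    return vector_ids


# Hypothetical fixture data.
docs = [("doc_77a", "usr_401", 2), ("doc_77b", "usr_401", 1), ("doc_90c", "usr_777", 1)]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE documents (document_id TEXT PRIMARY KEY, user_id TEXT)")
db.executemany("INSERT INTO documents VALUES (?, ?)", [(d, u) for d, u, _ in docs])

store = InMemoryVectorStore()
for doc_id, user_id, n_chunks in docs:
    for i in range(n_chunks):
        store.upsert(
            f"{doc_id}#{i}",
            [0.0],  # placeholder embedding
            {"tenant_id": "tn_992", "user_id": user_id,
             "document_id": doc_id, "chunk_index": i},
        )

deleted = erase_user(db, store, "usr_401")
print(deleted)                                   # ['doc_77a#0', 'doc_77a#1', 'doc_77b#0']
print(store.ids_where("tenant_id", ["tn_992"]))  # ['doc_90c#0']
```

The returned list of vector IDs is worth keeping: it is the concrete set of records you can later show were removed.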

The distinction between logical and hard deletion matters here. Flipping an active=false flag in the vector metadata is logical deletion: it hides the vector from RAG query results, but the bytes remain in the index. Under PDPA Section 25, the organisation must cease retaining personal data or remove the means by which it can be associated with a specific individual — leaving bytes in the index behind a suppression flag satisfies neither condition. Hard delete by ID, followed by provider-level verification that the IDs are no longer queryable, is the minimum evidence you can put in the audit trail. Some vector databases expose a list-and-verify endpoint you can call post-deletion to confirm the IDs no longer exist. Use it. Log the response. Attach it to the deletion audit trail in Postgres. Backups, replicas, and provider retention windows are a separate matter — closing those gaps requires the provider’s contractual deletion terms, not just the API response.
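
A small sketch of the difference between the two deletion modes, and of what post-deletion evidence might look like. The plain `dict` stands in for the index, and the audit record's field names are hypothetical. With a real provider, the presence check would be a fetch or list call against its API rather than a dictionary lookup.

```python
import json
from datetime import datetime, timezone


def hard_delete_with_evidence(index, vector_ids, request_id):
    """Delete by ID, re-check that each ID is gone, and return an audit record."""
    for vid in vector_ids:
        index.pop(vid, None)
    still_present = [vid for vid in vector_ids if vid in index]
    return {
        "request_id": request_id,
        "deleted_vector_ids": list(vector_ids),
        "still_present": still_present,
        "verified": not still_present,
        "verified_at": datetime.now(timezone.utc).isoformat(),
    }


# Hypothetical index contents.
index = {
    "vec_8f72b1": {"active": True},
    "vec_8f72b2": {"active": True},
    "vec_other": {"active": True},
}

# Logical deletion: hidden from queries, but the record is still stored.
index["vec_8f72b1"]["active"] = False
print("vec_8f72b1" in index)  # True

# Hard deletion with verification.
evidence = hard_delete_with_evidence(index, ["vec_8f72b1", "vec_8f72b2"], "req_001")
print(json.dumps(evidence, indent=2))
```

The serialised `evidence` record is the kind of thing you would store alongside the deletion request in Postgres.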

Geographic Isolation in Multi-Tenant Indices

Data residency is a parallel constraint that most teams underweight until a procurement officer raises it at contract review. The PDPA’s transfer limitation obligation does not ban offshore processing; it requires a comparable standard of protection for personal data transferred outside Singapore. On the public-sector deployment, however, the contract itself specified Singapore-region hosting. In our first provisioning pass, the managed serverless vector tier came up in a non-Singapore region by default, so it had to be explicitly provisioned in ap-southeast-1 on AWS (the GCP equivalent is asia-southeast1). This is not administrative overhead. It is a contractual line item.

For multi-tenant systems, namespace segregation complements metadata filtering. Mixing tenant data in a shared namespace increases the blast radius of any metadata tagging error. On the public-sector deployment, each tenant received a dedicated namespace. When a tenant leaves the platform, there is no deletion query to reconstruct: you drop the entire namespace. An auditor can verify that a namespace no longer exists, which is a considerably more legible answer than trying to explain float array geometry.
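
A minimal sketch of tenant offboarding under per-tenant namespaces. `NamespacedStore` is an in-memory stand-in with hypothetical method names; real providers expose their own namespace or collection operations, so treat this as the shape of the operation rather than an API reference.

```python
class NamespacedStore:
    """In-memory stand-in for a vector store with per-tenant namespaces."""

    def __init__(self):
        self._namespaces = {}

    def upsert(self, namespace, vector_id, values):
        self._namespaces.setdefault(namespace, {})[vector_id] = values

    def drop_namespace(self, namespace):
        """Remove the namespace and everything in it. Returns vectors removed."""
        return len(self._namespaces.pop(namespace, {}))

    def namespace_exists(self, namespace):
        return namespace in self._namespaces


store = NamespacedStore()
store.upsert("tn_992", "vec_1", [0.12, -0.45])
store.upsert("tn_992", "vec_2", [0.89, 0.01])
store.upsert("tn_115", "vec_3", [0.33, 0.44])

removed = store.drop_namespace("tn_992")
print(removed)                           # 2
print(store.namespace_exists("tn_992"))  # False
print(store.namespace_exists("tn_115"))  # True
```

The existence check is the auditable part: the answer is a yes or no about a named container, with no per-vector bookkeeping to explain.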

One practical caveat: namespace provisioning overhead is real. For small deployments with many tenants, the operational cost of managing hundreds of namespaces must be weighed against the compliance benefit. For that deployment — a small number of high-trust tenants — dedicated namespaces were the right call. For a B2C product with thousands of end-users, strict per-user metadata tagging within shared namespaces is more operationally sane, provided the tagging discipline holds from day one.

The Vector Index Is a Cache, Not a Source of Truth

Architecture Flow

Step 01: DELETE in Postgres
Step 02: Trigger Webhook / Sync
Step 03: API Hard Drop by Metadata

The architectural shift that makes all of this tractable is treating the vector database as what it actually is: a disposable cache built on top of your relational data. Not a system of record. Never a system of record.

The relational database holds the source documents, the metadata, and the deletion state. When a record is deleted in Postgres, a webhook or synchronous API call must trigger a hard delete of the corresponding vector IDs from the index without delay. Letting that sync run as a background job carries compliance risk. Do not let the scheduler define your retention window. Once the retention basis is gone, vector deletion should run on the same control path as the source-record delete; any operational delay must be grounded in a DPO-approved retention procedure, not scheduling convenience.
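
One way to keep both deletes on a single control path is to commit the relational delete only after the vector delete has succeeded. The sketch below is an assumption-laden illustration, not a description of our production code: `sqlite3` stands in for Postgres, the `documents` and `chunks` tables are invented for the example, and `delete_vectors` is a hypothetical callable wrapping the provider's delete-by-ID call.

```python
import sqlite3


def delete_document(db, delete_vectors, document_id):
    """Delete source rows and vectors together; roll back if the vector delete fails."""
    try:
        vector_ids = [
            row[0]
            for row in db.execute(
                "SELECT vector_id FROM chunks WHERE document_id = ? ORDER BY vector_id",
                (document_id,),
            )
        ]
        db.execute("DELETE FROM chunks WHERE document_id = ?", (document_id,))
        db.execute("DELETE FROM documents WHERE document_id = ?", (document_id,))
        delete_vectors(vector_ids)  # synchronous call, raises on failure
        db.commit()
    except Exception:
        db.rollback()  # source rows stay, so the request can be retried
        raise
    return vector_ids


# Hypothetical fixture data.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE documents (document_id TEXT PRIMARY KEY)")
db.execute("CREATE TABLE chunks (vector_id TEXT PRIMARY KEY, document_id TEXT)")
db.execute("INSERT INTO documents VALUES ('doc_77a')")
db.executemany("INSERT INTO chunks VALUES (?, ?)",
               [("vec_1", "doc_77a"), ("vec_2", "doc_77a")])
db.commit()

index = {"vec_1", "vec_2", "vec_9"}  # stand-in for the vector index


def delete_vectors(ids):
    index.difference_update(ids)


removed = delete_document(db, delete_vectors, "doc_77a")
print(removed)        # ['vec_1', 'vec_2']
print(sorted(index))  # ['vec_9']
```

The vector store is not part of the database transaction, so the ordering matters. If the vector delete succeeds and the commit then fails, the vectors are already gone and a retry is harmless. The reverse failure, where source rows are deleted but vectors survive, is the one this ordering avoids.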

Design your RAG architecture under the assumption that the entire vector index could be destroyed and rebuilt from your relational store at any moment. If your metadata tagging is correct — tenant_id, user_id, document_id, chunk_index on every ingested chunk — a full rebuild is a scheduled job, not an incident. More importantly, surgical per-user deletion becomes a deterministic, auditable, five-line operation.

That is the answer you hand the auditor.
