Protostar AI

Validation and Benchmarks

How Protostar AI proves its privacy guarantees

Last updated: June 29, 2026. This report is kept deliberately honest. We state what is measured (real corpus or synthetic, open or access-gated, and which categories), and we do not claim certifications or perfect detection. Every detection number below is reproduced by an automated test suite.


The guarantee: what is genuinely 100%

De-identification recall on real free text is never perfect for any system, and any vendor who claims otherwise is overstating. Protostar’s safety does not depend on perfect detection. It rests on fail-closed architecture, and these results are genuinely 100% or 0%:

Our claim is “100% of regulated data stays inside your boundary,” never “100% of identifiers detected.” Detection is defense in depth on top of the guarantee.
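As a generic illustration of what "fail-closed" means (not Protostar's actual implementation, which this report does not describe), a boundary gate can be sketched so that any error or unverified result blocks release instead of letting data through. The names release_outbound, BoundaryBlocked, redact, and verify are hypothetical.

```python
from typing import Callable


class BoundaryBlocked(Exception):
    """Raised when a payload is refused at the boundary (hypothetical name)."""


def release_outbound(
    payload: str,
    redact: Callable[[str], str],
    verify: Callable[[str], bool],
) -> str:
    """Return the redacted payload only if every step succeeds.

    Fail-closed: the default outcome is "blocked". A crash in redaction,
    a crash in verification, or any verdict other than True all refuse
    the release.
    """
    try:
        redacted = redact(payload)
        if verify(redacted) is not True:
            raise BoundaryBlocked("verification did not pass")
    except BoundaryBlocked:
        raise
    except Exception as exc:
        # Any unexpected failure is treated as a refusal, never a pass-through.
        raise BoundaryBlocked("redaction or verification failed") from exc
    return redacted
```

The design point is that there is no code path on which the original payload is returned: the only value that can leave the function is one that was redacted and then explicitly verified.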


Detection results (open, published corpora)

Detection has two parts that perform very differently, and we report both honestly. Entity-level recall is the safety metric: a missed identifier is a potential leak. These run on public corpora and are reproducible.
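To make the metric concrete, here is a minimal sketch of entity-level recall, overall and per category. The matching rule is an assumption made for illustration (a gold entity counts as found when some predicted span fully covers it, whatever label was predicted); the report does not state the exact rule its test suite uses, and the function name entity_recall is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (start offset, end offset, category); offsets are character positions.
Span = Tuple[int, int, str]


def entity_recall(
    gold: List[Span], predicted: List[Span]
) -> Tuple[float, Dict[str, float]]:
    """Return (overall recall, recall per gold category)."""
    found: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for g_start, g_end, category in gold:
        total[category] += 1
        # Assumed rule: covered by any predicted span, label ignored,
        # because a redacted identifier cannot leak whatever it was called.
        if any(p_start <= g_start and g_end <= p_end
               for p_start, p_end, _ in predicted):
            found[category] += 1
    per_category = {c: found[c] / total[c] for c in total}
    all_gold = sum(total.values())
    overall = sum(found.values()) / all_gold if all_gold else 1.0
    return overall, per_category


gold = [(0, 5, "NAME"), (10, 21, "SSN"), (30, 40, "ORG")]
predicted = [(0, 5, "NAME"), (10, 21, "SSN")]
overall, per_category = entity_recall(gold, predicted)
```

In this toy case two of three gold entities are covered, so overall recall is 2/3, while the per-category view shows the miss is confined to ORG. That is the same distinction the report draws between targeted-category recall and overall corpus recall.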

Deterministic identifiers: about 100%

Structured identifiers are matched by exact patterns, so recall is effectively complete. Across the clinical (ASQ-PHI) and legal (TAB) corpora these run at or near 1.00:

Identifier                               Recall
Social Security numbers                  ~1.00
Phone numbers                            ~1.00
Email addresses                          ~1.00
Dates                                    1.00
Account and record numbers (patterned)   ~1.00
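Pattern matching of this kind can be sketched with regular expressions. The patterns below are deliberately simplified, US-style examples chosen for illustration; they are assumptions, not the patterns Protostar ships, and the names PATTERNS, find_identifiers, and redact are hypothetical.

```python
import re
from typing import List, Tuple

# Simplified illustrative patterns (assumptions, not production rules).
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
    "EMAIL": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
}


def find_identifiers(text: str) -> List[Tuple[int, int, str]]:
    """Return sorted (start, end, label) for every pattern match."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), match.end(), label))
    return sorted(hits)


def redact(text: str) -> str:
    """Replace each match with a [LABEL] placeholder.

    Replacement runs right to left so earlier offsets stay valid.
    Overlapping matches are not handled in this sketch.
    """
    for start, end, label in sorted(find_identifiers(text), reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text


sample = "Call 555-123-4567 or mail jo@example.org; SSN 123-45-6789."
redacted = redact(sample)
```

Because each pattern is exact, recall depends only on whether real identifiers follow the expected format, which is why structured categories behave so differently from free-text names.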

Free-text entities: at the published state of the art

Names, organisations, and locations live in free text and need named-entity recognition. No de-identification system reaches 100% here: the published state of the art on these corpora is about 0.96 to 0.98, and even the human annotators who built them do not agree 100% with each other. With the transformer model, our targeted categories score 0.95 to 0.96 on real legal text (TAB), at or just below the lower end of that range:

Category                   Recall
Names                      0.96
Locations and facilities   0.96
Organisations              0.95

On the clinical corpus (ASQ-PHI), names run near 1.00. Overall corpus recall, which counts every category including ones outside PHI scope such as legal case codes and monetary amounts, is 0.83 on TAB, 0.93 on ASQ-PHI, and 0.83 on Gretel. On ai4privacy, the targeted classes score strongly (SOCIALNUMBER 0.97, URL 1.00).

We target every identifying category and prioritise recall over precision: over-redaction only costs utility and is recovered inside the boundary, while a missed identifier is a leak.

Why not 100%?

Detection recall on real free text cannot be 100% for any system. A tool that claims it is either masking everything, which destroys utility, or measuring against a gold standard it overfits. That is why the 100% we stand on is the structural guarantee at the top of this report: regulated data never leaves your boundary, regardless of detection. Detection is defense in depth on top of it.


Access-gated benchmarks (run once access is granted)

The canonical clinical de-identification corpora require a data-use agreement. Our loaders are built and tested, and we run them the moment access is granted. We do not claim these results until they are produced on the real corpus.

Benchmark                 Source               Published bar                  Status
i2b2 / n2c2 2014          Harvard / UTHealth   about 0.96 F1 (top systems)    Loader ready; pending data-use agreement
MIMIC de-identification   MIT / PhysioNet      about 0.97 recall              Loader ready; pending credentialed access

Methodology


Standards

Protostar AI is designed to HIPAA, SOC 2, ISO/IEC 42001 and 27001, IEC 62304, ISO 14971, and MiFID II RTS 6 controls, and operates under Business Associate Agreements for regulated data.