Validation and Benchmarks
How Protostar AI proves its privacy guarantees
Last updated: June 29, 2026. This report is kept deliberately honest. We state what is measured (real corpus or synthetic, open or access-gated, and which categories), and we do not claim certifications or perfect detection. Every detection number below is reproduced by an automated test suite.
The guarantee: what is genuinely 100%
De-identification recall on real free text is never perfect for any system, and any vendor who claims otherwise is overstating. Protostar’s safety does not depend on perfect detection. It rests on fail-closed architecture, and these results are genuinely 100% or 0%:
- Zero external egress for regulated (C1) data. Architecturally guaranteed, not dependent on detection. Proven by a test that disables detection entirely: regulated data still never reaches an external model. The no-egress posture is verified on live infrastructure (no outbound route).
- 159 of 159 automated tests passing.
- 2,000-request Monte Carlo simulation: 100% pass (regulated data never leaves, real values are restored inside the boundary, the tamper-evident audit chain verifies).
- Synthetic de-identification corpus: 100% recall, 100% fully masked.
Our claim is “100% of regulated data stays inside your boundary,” never “100% of identifiers detected.” Detection is defense in depth on top of the guarantee.
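A minimal sketch of what fail-closed routing can look like. It assumes the egress decision is made from a classification label attached to the request, and every name in it (`DataClass`, `Destination`, `Request`, `route`) is hypothetical rather than Protostar's API. Only the C1 label comes from this report. The point it illustrates is that the detector never takes part in the routing decision, so switching detection off cannot open an outbound path.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class DataClass(Enum):
    C1 = "regulated"        # label used in this report
    OTHER = "unregulated"   # hypothetical second tier, for illustration


class Destination(Enum):
    INTERNAL = "internal-model"
    EXTERNAL = "external-model"


@dataclass(frozen=True)
class Request:
    text: str
    data_class: Optional[DataClass]  # None models a missing or unknown label


def route(request: Request, detection_enabled: bool = True) -> Destination:
    """Choose a destination from the classification label alone.

    `detection_enabled` is accepted but deliberately unused: the decision
    must be identical whether or not the detector is running.
    """
    # Fail closed: only an explicit non-regulated label may leave the boundary.
    # C1, a missing label, or anything unexpected stays internal.
    if request.data_class is DataClass.OTHER:
        return Destination.EXTERNAL
    return Destination.INTERNAL
```

A test in the spirit of the one described above would call `route(..., detection_enabled=False)` on C1 input and assert that the result is `Destination.INTERNAL`. Treating an unlabelled request as regulated is the design choice that makes the default safe.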
Detection results (open, published corpora)
Detection has two parts that perform very differently, and we report both honestly. Entity-level recall is the safety metric: a missed identifier is a potential leak. These run on public corpora and are reproducible.
Deterministic identifiers: about 100%
Structured identifiers are matched by exact patterns, so recall is effectively complete. Across the clinical (ASQ-PHI) and legal (TAB) corpora these run at or near 1.00:
| Identifier | Recall |
|---|---|
| Social Security numbers | ~1.00 |
| Phone numbers | ~1.00 |
| Email addresses | ~1.00 |
| Dates | 1.00 |
| Account and record numbers (patterned) | ~1.00 |
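To illustrate why patterned identifiers are the easy case, here is a small sketch of pattern-based detection. The regular expressions are simplified, US-style examples chosen for illustration. They are assumptions, not the production rules, and they omit dates and account numbers, whose formats vary by source.

```python
import re

# Illustrative, simplified patterns (assumptions, not the production rules).
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(
        r"(?<!\d)(?:\(\d{3}\)\s?|\d{3}[-.\s])\d{3}[-.\s]\d{4}(?!\d)"
    ),
    "EMAIL": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
}


def find_identifiers(text: str) -> list[tuple[str, int, int, str]]:
    """Return (label, start, end, matched_text) for each hit, in text order."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.start(), match.end(), match.group()))
    return sorted(hits, key=lambda hit: hit[1])


sample = "Call 555-123-4567 or mail jane.doe@example.org; SSN 123-45-6789."
for label, start, end, value in find_identifiers(sample):
    print(label, start, end, value)
```

Because each identifier has a fixed shape, a miss here is usually a formatting variant the pattern did not anticipate, not an ambiguity in the text.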
Free-text entities: at the published state of the art
Names, organisations, and locations live in free text and need named-entity recognition. No de-identification system reaches 100% here: the published state of the art on these corpora is about 0.96 to 0.98, and even the human annotators who built them do not fully agree with each other. With the transformer model, our targeted categories reach 0.95 to 0.96 on real legal text (TAB), at or just below the lower end of that range:
| Category | Recall |
|---|---|
| Names | 0.96 |
| Locations and facilities | 0.96 |
| Organisations | 0.95 |
On the clinical corpus (ASQ-PHI), names run near 1.00. Overall corpus recall counts every category, including ones outside PHI scope such as legal case codes and monetary amounts: TAB 0.83, ASQ-PHI 0.93, Gretel 0.83. On ai4privacy, the targeted classes score well (SOCIALNUMBER 0.97, URL 1.00).
We target every identifying category and prioritise recall over precision: over-redaction only costs utility and is recovered inside the boundary, while a missed identifier is a leak.
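The reason over-redaction is cheap can be shown with a sketch of reversible masking. It assumes detected spans are non-overlapping `(start, end, label)` tuples and that the placeholder-to-value mapping never leaves the boundary. The `[LABEL_n]` placeholder format and the `mask` and `restore` helpers are hypothetical, not Protostar's implementation.

```python
def mask(text: str, spans: list[tuple[int, int, str]]) -> tuple[str, dict[str, str]]:
    """Replace each (start, end, label) span with a numbered placeholder.

    Returns the masked text and a placeholder -> original mapping that is
    meant to stay inside the boundary. Assumes spans do not overlap.
    """
    mapping: dict[str, str] = {}
    pieces = []
    cursor = 0
    for i, (start, end, label) in enumerate(sorted(spans), start=1):
        placeholder = f"[{label}_{i}]"
        mapping[placeholder] = text[start:end]
        pieces.append(text[cursor:start])
        pieces.append(placeholder)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), mapping


def restore(masked: str, mapping: dict[str, str]) -> str:
    """Put the original values back wherever a placeholder survives."""
    for placeholder, original in mapping.items():
        masked = masked.replace(placeholder, original)
    return masked
```

If the detector wrongly masks an ordinary word, the only cost is that the downstream model sees a placeholder instead of that word. `restore` recovers the original text either way, whereas a span that was never masked cannot be protected after the fact.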
Why not 100%?
Detection recall on real free text cannot be 100% for any system. A tool that claims 100% is either masking everything, which destroys utility, or measuring against a gold standard it overfits. That is why the 100% we stand on is the structural guarantee at the top of this report: regulated data never leaves your boundary, regardless of detection. Detection is defense in depth on top of it.
Access-gated benchmarks (run on credential)
The canonical clinical de-identification corpora require a data-use agreement. Our loaders are built and tested, and we will run them as soon as access is granted. We do not claim these results until they are produced on the real corpus.
| Benchmark | Source | Published bar | Status |
|---|---|---|---|
| i2b2 / n2c2 2014 | Harvard / UTHealth | about 0.96 F1 (top systems) | Loader ready; pending data-use agreement |
| MIMIC de-identification | MIT / PhysioNet | about 0.97 recall | Loader ready; pending credentialed access |
Methodology
- Entity-level precision, recall, F1 and F2 with overlap matching, following the i2b2 / UTHealth 2014 (n2c2) shared task, the Text Anonymization Benchmark, and Microsoft Presidio evaluation guidance.
- Recall is the headline safety metric. Precision is reported on the categories each benchmark actually annotates, so masking a category a corpus does not label (for example ages under 90, which HIPAA Safe Harbor does not treat as identifiers) is not scored against us.
- Detection combines deterministic patterns with named-entity recognition. The transformer model is the reference and the target for the production tier.
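The scoring described above can be sketched as follows. This is an illustration under stated assumptions, not the evaluation harness itself: spans use half-open character offsets, any character overlap counts as a match regardless of label, and predictions in categories the corpus does not annotate are left out of precision. The `Span`, `overlaps`, and `score` names are hypothetical. The F-beta formula is the standard one, with beta = 2 giving F2.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int   # inclusive character offset
    end: int     # exclusive character offset
    label: str


def overlaps(a: Span, b: Span) -> bool:
    """True when the two spans share at least one character."""
    return a.start < b.end and b.start < a.end


def score(gold: list[Span], pred: list[Span],
          annotated: set[str], beta: float = 2.0) -> dict[str, float]:
    """Entity-level precision, recall, and F-beta with overlap matching."""
    # Predictions in unannotated categories are neither rewarded nor penalised.
    scored_pred = [p for p in pred if p.label in annotated]

    # Recall: a gold entity counts as found if any prediction touches it.
    found = sum(any(overlaps(g, p) for p in pred) for g in gold)
    # Precision: a scored prediction is correct if it touches any gold entity.
    correct = sum(any(overlaps(p, g) for g in gold) for p in scored_pred)

    recall = found / len(gold) if gold else 0.0
    precision = correct / len(scored_pred) if scored_pred else 0.0
    b2 = beta * beta
    denom = b2 * precision + recall
    f_beta = (1 + b2) * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f_beta": f_beta}
```

With `beta=2.0` the score moves more with recall than with precision, which matches treating recall as the headline safety metric. For example, precision 0.5 with recall 1.0 gives F1 of about 0.67 but F2 of about 0.83.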
Standards
Protostar AI is designed to HIPAA, SOC 2, ISO/IEC 42001 and 27001, IEC 62304, ISO 14971, and MiFID II RTS 6 controls, and operates under Business Associate Agreements for regulated data.