Hello HN,
In high-conflict litigation, "black box" redactions are often a disaster waiting to happen. I realized that many people (and even law firms) use civilian-grade tools that leave "Ghost Layers"—original text layers or metadata underneath the digital ink.
I built the Shadow Report as a free forensic tool to prove this. You can upload a redacted page, and it scans for:
Ghost Text Layers: Checking if the searchable PDF layer still exists beneath the redaction blocks (using pypdf layer inspection).
Metadata Leaks: Extracting Author/Producer info that reveals who actually drafted the document.
Image Fingerprints: Scraping EXIF data that can geo-locate or time-stamp "anonymous" evidence.
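For anyone curious what these three checks could look like in practice, here is a minimal sketch with pypdf and Pillow. It is not the Shadow Report's actual code: `LeakReport`, `build_report`, `scan_pdf`, and `scan_image_exif` are hypothetical names made up for illustration. It only flags that extractable text or metadata exists at all. Proving that the text sits under a specific redaction box would need an extra geometry comparison that is not shown here.

```python
from dataclasses import dataclass, field


@dataclass
class LeakReport:
    pages_with_text: list = field(default_factory=list)  # 1-based page numbers
    metadata: dict = field(default_factory=dict)         # non-empty info fields

    @property
    def leaking(self) -> bool:
        return bool(self.pages_with_text or self.metadata)


def build_report(page_texts, metadata):
    """Summarise already-extracted page text and metadata (pure, no I/O)."""
    report = LeakReport()
    for number, text in enumerate(page_texts, start=1):
        if text and text.strip():
            report.pages_with_text.append(number)
    # Keep only the fields that actually carry a value.
    report.metadata = {key: str(value) for key, value in metadata.items() if value}
    return report


def scan_pdf(path):
    """Check a PDF for a surviving text layer and document-info metadata."""
    from pypdf import PdfReader  # third-party: pip install pypdf

    reader = PdfReader(path)
    page_texts = [page.extract_text() or "" for page in reader.pages]
    info = reader.metadata  # None when the file has no Info dictionary
    metadata = {}
    if info is not None:
        metadata = {
            "author": info.author,
            "producer": info.producer,
            "creator": info.creator,
        }
    return build_report(page_texts, metadata)


def scan_image_exif(path):
    """Report whether an image still carries a timestamp or GPS block."""
    from PIL import Image  # third-party: pip install Pillow

    with Image.open(path) as img:
        exif = img.getexif()
        return {
            "datetime": exif.get(0x0132),           # EXIF DateTime tag
            "has_gps": bool(exif.get_ifd(0x8825)),  # GPSInfo sub-directory
        }
```

Calling `scan_pdf("redacted.pdf").leaking` (the file name is a placeholder) returns `True` if any page still yields text or any info field is populated. A scanned, image-only PDF with stripped metadata would come back clean.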
Backstory: This is a component of a larger project called Exit Protocol. I started it after a friend was quoted $50k for a forensic accountant to trace "separate property" in a divorce. The math they use—the Lowest Intermediate Balance Rule (LIBR)—is deterministic, but accountants do it manually in Excel. I automated the LIBR math to handle 10k+ transactions via Celery/Postgres.
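To illustrate why the LIBR math is deterministic, here is a minimal sketch of the core rule as it is commonly stated: the traceable separate-property amount can never exceed the lowest balance the account has reached since the separate funds went in, and later deposits of other funds do not restore it. This is a simplified assumption, not Exit Protocol's implementation. `Txn` and `libr_trace` are hypothetical names, and real cases involve jurisdiction-specific presumptions that are ignored here.

```python
from decimal import Decimal
from typing import Iterable, NamedTuple


class Txn(NamedTuple):
    amount: Decimal         # positive = deposit, negative = withdrawal
    separate: bool = False  # True if the deposit comes from a separate-property source


def libr_trace(txns: Iterable[Txn],
               opening_balance: Decimal = Decimal("0"),
               opening_separate: Decimal = Decimal("0")) -> Decimal:
    """Return the separate-property amount still traceable after all txns."""
    balance = opening_balance
    separate = opening_separate
    for txn in txns:
        balance += txn.amount
        if txn.amount > 0 and txn.separate:
            separate += txn.amount
        # Withdrawals are presumed to consume other funds first, so the
        # separate claim only shrinks when the balance drops below it.
        separate = min(separate, max(balance, Decimal("0")))
    return separate


history = [
    Txn(Decimal("10000"), separate=True),  # inheritance deposited
    Txn(Decimal("5000")),                  # paycheck (not separate)
    Txn(Decimal("-12000")),                # balance falls to 3000
    Txn(Decimal("8000")),                  # later deposit does not restore the claim
]
print(libr_trace(history))  # 3000
```

Because the result depends only on the ordered transaction list, the same input always gives the same answer, which is what makes it a good fit for batch processing.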
Stack:
Django 5.0 (Monolith) / Postgres
pypdf & Pillow for the forensic scanning
Celery for async processing of massive bank discoveries
Air-gapped "BYOK" model for law firms (Docker)
I'd love feedback on:
Are there other "Ghost Layer" detection methods I should implement (e.g., color-space delta analysis)?
For those in LawTech: How do you handle "PDFs from hell" (scanned, rotated, handwritten notes)? I'm currently using a custom OCR implementation.
Try the Redaction Check: https://exitprotocols.com/redaction-check/
Main Site: https://exitprotocols.com/
I took a simpler approach and built a small browser-only audit tool that just answers one question: is this PDF still leaking extractable content at all?
It doesn’t try to unredact or guess text, just flags whether text layers, hidden characters, or metadata are still present so you know whether the redaction actually worked.
https://audit.reactpdf.app
Curious if you’ve run into cases where PDFs look clean at the layer/metadata level but still leak via other mechanisms.