Synthetic Corpus Generator
Generates forensically realistic DOCX and PDF document sets for testing eDiscovery tools and workflows.
The Problem
Testing eDiscovery tools — review platforms, processing engines, analytics features — requires realistic document sets. Real case documents can't be used for testing (privilege, confidentiality). Placeholder "Lorem ipsum" documents don't exercise the edge cases that matter: threaded email chains, metadata inconsistencies, cross-custodian document families, near-duplicates.
There was no easy way to generate a synthetic corpus that felt like a real case.
Approach
The Synthetic Corpus Generator uses a fictional corporate scenario — Veridian Analytics, an IP theft investigation — as the narrative backbone. Node.js handles document templating and corpus structure; Python handles metadata injection and PDF generation.
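One way the two stages could be wired together is a manifest that the Node.js templating stage emits and the Python metadata stage consumes. The sketch below is purely illustrative — the manifest schema, field names, and paths are assumptions, not the generator's actual interface:

```python
# Hypothetical manifest entry emitted by the Node.js templating stage.
# The schema shown here is an assumption for illustration only.
manifest = [
    {
        "path": "custodians/alice/memo_q3_licensing.docx",
        "custodian": "alice",
        "family_id": "FAM-0042",  # groups near-duplicates and attachments
        "anomalies": ["modified_before_created"],
    }
]

def metadata_jobs(manifest):
    """Yield (path, anomaly) pairs for the Python metadata-injection stage."""
    for entry in manifest:
        for anomaly in entry.get("anomalies", []):
            yield entry["path"], anomaly

jobs = list(metadata_jobs(manifest))
```

Keeping the handoff as flat data like this means either stage can be re-run independently, which matters when regenerating only the anomalous documents.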
The corpus includes:
- Multi-custodian email chains with realistic threading and reply-all patterns
- DOCX business documents (memos, contracts, presentations) with embedded metadata
- Near-duplicate document families for testing deduplication logic
- Intentional metadata anomalies (date inconsistencies, author/last-modified mismatches) to test processing edge cases
- A case narrative that makes document relationships coherent and reviewable
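The email-threading item above hinges on internally consistent Message-ID, In-Reply-To, and References headers, since review platforms reconstruct threads from them. A minimal sketch with Python's stdlib email package (custodian addresses and the veridian-analytics.example domain are illustrative, not from the actual generator):

```python
from email.message import EmailMessage
from email.utils import make_msgid, formatdate

def make_reply(parent: EmailMessage, sender: str, body: str) -> EmailMessage:
    """Build a reply whose threading headers chain back to the parent."""
    reply = EmailMessage()
    reply["From"] = sender
    reply["To"] = parent["From"]
    reply["Subject"] = "Re: " + parent["Subject"].removeprefix("Re: ")
    reply["Date"] = formatdate(localtime=True)
    reply["Message-ID"] = make_msgid(domain="veridian-analytics.example")
    # In-Reply-To points at the immediate parent; References accumulates
    # the whole ancestry so threading survives missing intermediate messages.
    reply["In-Reply-To"] = parent["Message-ID"]
    reply["References"] = (parent.get("References", "") + " " + parent["Message-ID"]).strip()
    reply.set_content(body)
    return reply

# Seed message for a two-custodian chain
root = EmailMessage()
root["From"] = "alice@veridian-analytics.example"
root["To"] = "bob@veridian-analytics.example"
root["Subject"] = "Q3 licensing terms"
root["Date"] = formatdate(localtime=True)
root["Message-ID"] = make_msgid(domain="veridian-analytics.example")
root.set_content("Bob -- draft terms attached for review.")

reply = make_reply(root, "bob@veridian-analytics.example", "Looks fine, one redline.")
```

Reply-all patterns fall out of the same mechanism: copy the parent's To/Cc lists into the reply before setting the threading headers.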
Outcome
The generated corpus serves as the primary test set for eDiscovery tool evaluation. The Veridian scenario produces documents realistic enough to use in demos and training without any confidentiality concerns.
What I Learned
Realistic document metadata is harder to fake than realistic content. Getting email headers, MIME structures, and DOCX core property timestamps to be internally consistent requires more engineering than the actual content generation.
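To illustrate the timestamp side of that work: a DOCX package stores its core properties in docProps/core.xml, and an internally consistent corpus needs dcterms:created to precede dcterms:modified — unless an anomaly is injected deliberately. A stdlib-only sketch (the helper names and the anomaly flag are hypothetical, not the generator's actual code):

```python
import datetime
import xml.etree.ElementTree as ET

CORE_XML = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<cp:coreProperties
    xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <dc:creator>{creator}</dc:creator>
  <cp:lastModifiedBy>{last_modified_by}</cp:lastModifiedBy>
  <dcterms:created xsi:type="dcterms:W3CDTF">{created}</dcterms:created>
  <dcterms:modified xsi:type="dcterms:W3CDTF">{modified}</dcterms:modified>
</cp:coreProperties>
"""

def core_properties(creator, last_modified_by, created, modified, anomaly=False):
    """Render docProps/core.xml. With anomaly=True, swap the timestamps so
    'modified' predates 'created' -- the kind of inconsistency a processing
    engine should flag."""
    if anomaly:
        created, modified = modified, created
    fmt = "%Y-%m-%dT%H:%M:%SZ"  # W3CDTF, as Word writes it
    return CORE_XML.format(
        creator=creator,
        last_modified_by=last_modified_by,
        created=created.strftime(fmt),
        modified=modified.strftime(fmt),
    )

NS = {"dcterms": "http://purl.org/dc/terms/"}

def timestamps_consistent(core_xml: str) -> bool:
    """Check created <= modified. Fixed-width W3CDTF strings in UTC
    sort lexicographically in chronological order."""
    root = ET.fromstring(core_xml.encode("utf-8"))
    created = root.find("dcterms:created", NS).text
    modified = root.find("dcterms:modified", NS).text
    return created <= modified
```

The same created/modified pair also has to agree with the email Date headers of any message the document is attached to, which is where the cross-format consistency work accumulates.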