Shipped · Data Engineering

Synthetic Corpus Generator

Generates forensically realistic DOCX and PDF document sets for testing eDiscovery tools and workflows.

  • Fictional Veridian Analytics narrative
  • Forensically realistic metadata
  • Multi-custodian, multi-thread document sets

The Problem

Testing eDiscovery tools — review platforms, processing engines, analytics features — requires realistic document sets. Real case documents can't be used for testing (privilege, confidentiality). Placeholder "Lorem ipsum" documents don't exercise the edge cases that matter: threaded email chains, metadata inconsistencies, cross-custodian document families, near-duplicates.

There was no easy way to generate a synthetic corpus that felt like a real case.

Approach

The Synthetic Corpus Generator uses a fictional corporate scenario — Veridian Analytics, an IP theft investigation — as the narrative backbone. Node.js handles document templating and corpus structure; Python handles metadata injection and PDF generation.
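As a sketch of the metadata-injection step (helper names here are illustrative, not the project's actual API): a DOCX file is an OPC zip package whose core properties live in `docProps/core.xml`, so that part can be generated directly from the stdlib, including deliberate anomalies like an author/last-modified mismatch:

```python
from datetime import datetime, timedelta, timezone

# Template for the OPC core-properties part (docProps/core.xml) inside a DOCX.
CORE_XML = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<cp:coreProperties
    xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <dc:creator>{creator}</dc:creator>
  <cp:lastModifiedBy>{last_modified_by}</cp:lastModifiedBy>
  <dcterms:created xsi:type="dcterms:W3CDTF">{created}</dcterms:created>
  <dcterms:modified xsi:type="dcterms:W3CDTF">{modified}</dcterms:modified>
</cp:coreProperties>"""

def w3cdtf(dt: datetime) -> str:
    """Format a datetime as the W3CDTF string OOXML expects."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

def core_properties_xml(creator: str, created: datetime,
                        edit_minutes: int = 45, anomaly: bool = False) -> str:
    """Build core.xml; anomaly=True injects an author/last-modified mismatch.
    The mismatched editor name is an invented placeholder."""
    modified = created + timedelta(minutes=edit_minutes)
    return CORE_XML.format(
        creator=creator,
        last_modified_by="m.chen" if anomaly else creator,
        created=w3cdtf(created),
        modified=w3cdtf(modified),
    )
```

Keeping `modified` derived from `created` (rather than two independent random dates) is what makes the non-anomalous documents read as internally consistent to a processing engine.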

The corpus includes:

  • Multi-custodian email chains with realistic threading and reply-all patterns
  • DOCX business documents (memos, contracts, presentations) with embedded metadata
  • Near-duplicate document families for testing deduplication logic
  • Intentional metadata anomalies (date inconsistencies, author/last-modified mismatches) to test processing edge cases
  • A case narrative that makes document relationships coherent and reviewable
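The near-duplicate families can be sketched as follows (a minimal illustration, not the generator's real implementation): perturb a base document with small word-level edits, then check the pair lands in a dedup-relevant similarity band using word-shingle Jaccard similarity:

```python
import random

def shingles(text: str, k: int = 3) -> set:
    """k-word shingles, a common basis for near-duplicate comparison."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of two documents' shingle sets."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

def near_duplicate(text: str, edit_rate: float = 0.05, seed: int = 7) -> str:
    """Drop a small random fraction of words to simulate a revised draft."""
    rng = random.Random(seed)
    kept = [w for w in text.split() if rng.random() > edit_rate]
    return " ".join(kept)
```

Because the edits are seeded, the same base document always yields the same family members, so a test corpus build is reproducible.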

Outcome

The generated corpus now serves as the primary test set for eDiscovery tool evaluation, and the Veridian scenario makes it realistic enough to use in demos and training without any confidentiality concerns.

What I Learned

Realistic document metadata is harder to fake than realistic content. Getting email headers, MIME structures, and DOCX core property timestamps to be internally consistent requires more engineering than the actual content generation.
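The email threading headers alone illustrate the point: `Message-ID`, `In-Reply-To`, `References`, and `Date` must all agree across a chain or the corpus looks synthetic. A sketch with Python's stdlib `email` module (addresses and subjects are invented, and this is an illustration rather than the project's code):

```python
from datetime import datetime, timedelta, timezone
from email.message import EmailMessage
from email.utils import format_datetime, make_msgid

def make_email(subject, sender, to, sent_at, parent=None):
    """Build a message whose In-Reply-To, References, and Date headers
    stay consistent with its parent, as a real mail client would."""
    msg = EmailMessage()
    msg["Message-ID"] = make_msgid(domain="veridian-analytics.example")
    msg["From"] = sender
    msg["To"] = to
    msg["Date"] = format_datetime(sent_at)
    if parent is None:
        msg["Subject"] = subject
    else:
        msg["Subject"] = "Re: " + parent["Subject"].removeprefix("Re: ")
        msg["In-Reply-To"] = parent["Message-ID"]
        # References = the parent's References chain plus the parent's own ID.
        refs = parent.get("References", "")
        msg["References"] = (refs + " " + parent["Message-ID"]).strip()
    return msg
```

Deriving every reply's headers from its parent, instead of generating them independently, is what keeps a whole thread internally consistent.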