Synthetic Corpus Generator
Generates forensically realistic DOCX and PDF document sets for testing eDiscovery tools and workflows.
The Problem
Testing eDiscovery tools — review platforms, processing engines, analytics features — requires realistic document sets. Real case documents can't be used for testing (privilege, confidentiality). Placeholder "Lorem ipsum" documents don't exercise the edge cases that matter: threaded email chains, metadata inconsistencies, cross-custodian document families, near-duplicates.
There was no easy way to generate a synthetic corpus that felt like a real case.
Approach
The Synthetic Corpus Generator uses a fictional corporate scenario — Veridian Analytics, an IP theft investigation — as the narrative backbone. Node.js handles document templating and corpus structure; Python handles metadata injection and PDF generation.
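One way the two stages could be wired together is a manifest that the Node.js templating stage emits and the Python metadata stage consumes. The sketch below is purely illustrative — the manifest schema, field names, and paths are assumptions, not the generator's actual interface:

```python
# Hypothetical manifest entry emitted by the Node.js templating stage.
# The schema shown here is an assumption for illustration only.
manifest = [
    {
        "path": "custodians/alice/memo_q3_licensing.docx",
        "custodian": "alice",
        "family_id": "FAM-0042",  # groups near-duplicates and attachments
        "anomalies": ["modified_before_created"],
    }
]

def metadata_jobs(manifest):
    """Yield (path, anomaly) pairs for the Python metadata-injection stage."""
    for entry in manifest:
        for anomaly in entry.get("anomalies", []):
            yield entry["path"], anomaly

jobs = list(metadata_jobs(manifest))
```

Keeping the handoff as flat data like this means either stage can be re-run independently, which matters when regenerating only the anomalous documents.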
The corpus includes:
- Multi-custodian email chains with realistic threading and reply-all patterns
- DOCX business documents (memos, contracts, presentations) with embedded metadata
- Near-duplicate document families for testing deduplication logic
- Intentional metadata anomalies (date inconsistencies, author/last-modified mismatches) to test processing edge cases
- A case narrative that makes document relationships coherent and reviewable
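The email-threading item above hinges on internally consistent Message-ID, In-Reply-To, and References headers, since review platforms reconstruct threads from them. A minimal sketch with Python's stdlib email package (custodian addresses and the veridian-analytics.example domain are illustrative, not from the actual generator):

```python
from email.message import EmailMessage
from email.utils import make_msgid, formatdate

def make_reply(parent: EmailMessage, sender: str, body: str) -> EmailMessage:
    """Build a reply whose threading headers chain back to the parent."""
    reply = EmailMessage()
    reply["From"] = sender
    reply["To"] = parent["From"]
    reply["Subject"] = "Re: " + parent["Subject"].removeprefix("Re: ")
    reply["Date"] = formatdate(localtime=True)
    reply["Message-ID"] = make_msgid(domain="veridian-analytics.example")
    # In-Reply-To points at the immediate parent; References accumulates
    # the whole ancestry so threading survives missing intermediate messages.
    reply["In-Reply-To"] = parent["Message-ID"]
    reply["References"] = (parent.get("References", "") + " " + parent["Message-ID"]).strip()
    reply.set_content(body)
    return reply

# Seed message for a two-custodian chain
root = EmailMessage()
root["From"] = "alice@veridian-analytics.example"
root["To"] = "bob@veridian-analytics.example"
root["Subject"] = "Q3 licensing terms"
root["Date"] = formatdate(localtime=True)
root["Message-ID"] = make_msgid(domain="veridian-analytics.example")
root.set_content("Bob -- draft terms attached for review.")

reply = make_reply(root, "bob@veridian-analytics.example", "Looks fine, one redline.")
```

Reply-all patterns fall out of the same mechanism: copy the parent's To/Cc lists into the reply before setting the threading headers.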
Outcome
The generated corpus serves as the primary test set for eDiscovery tool evaluation. The Veridian scenario produces documents realistic enough to use in demos and training without any confidentiality concerns.
What I Learned
Realistic document metadata is harder to fake than realistic content. Getting email headers, MIME structures, and DOCX core property timestamps to be internally consistent requires more engineering than the actual content generation.
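To illustrate the timestamp side of that work: a DOCX package stores its core properties in docProps/core.xml, and an internally consistent corpus needs dcterms:created to precede dcterms:modified — unless an anomaly is injected deliberately. A stdlib-only sketch (the helper names and the anomaly flag are hypothetical, not the generator's actual code):

```python
import datetime
import xml.etree.ElementTree as ET

CORE_XML = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<cp:coreProperties
    xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <dc:creator>{creator}</dc:creator>
  <cp:lastModifiedBy>{last_modified_by}</cp:lastModifiedBy>
  <dcterms:created xsi:type="dcterms:W3CDTF">{created}</dcterms:created>
  <dcterms:modified xsi:type="dcterms:W3CDTF">{modified}</dcterms:modified>
</cp:coreProperties>
"""

def core_properties(creator, last_modified_by, created, modified, anomaly=False):
    """Render docProps/core.xml. With anomaly=True, swap the timestamps so
    'modified' predates 'created' -- the kind of inconsistency a processing
    engine should flag."""
    if anomaly:
        created, modified = modified, created
    fmt = "%Y-%m-%dT%H:%M:%SZ"  # W3CDTF, as Word writes it
    return CORE_XML.format(
        creator=creator,
        last_modified_by=last_modified_by,
        created=created.strftime(fmt),
        modified=modified.strftime(fmt),
    )

NS = {"dcterms": "http://purl.org/dc/terms/"}

def timestamps_consistent(core_xml: str) -> bool:
    """Check created <= modified. Fixed-width W3CDTF strings in UTC
    sort lexicographically in chronological order."""
    root = ET.fromstring(core_xml.encode("utf-8"))
    created = root.find("dcterms:created", NS).text
    modified = root.find("dcterms:modified", NS).text
    return created <= modified
```

The same created/modified pair also has to agree with the email Date headers of any message the document is attached to, which is where the cross-format consistency work accumulates.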