Datasets — Evolving Programs

Datasets

Open data for ancient languages

We publish manuscripts and corpora through the Ancient Languages Project on Hugging Face: curated resources for training and evaluating models on historical and esoteric texts.

Ancient Languages

Latin-CC-170M

A Latin corpus distilled from Common Crawl, mirrored from Kaggle, for training and evaluating language models on classical Latin.

Common Crawl · Latin

Voynich

Structured transcriptions and metadata for the Voynich Manuscript, prepared for sequence modeling and cryptographic analysis.

31k rows

CIL

A machine-readable subset of the Corpus Inscriptionum Latinarum, the canonical collection of Latin epigraphic inscriptions.

Corpus Inscriptionum Latinarum

View all datasets on Hugging Face