Evolving Programs

Datasets

Open data for ancient languages

We publish manuscripts and corpora through the Ancient Languages Project on Hugging Face. These are curated resources for training and evaluating models on historical and esoteric texts.

Ancient Languages

Latin-CC-170M

A Latin corpus distilled from Common Crawl, mirrored from Kaggle, for training and evaluating language models on classical Latin.

Common Crawl · Latin

Ancient Languages

Voynich

Structured transcriptions and metadata for the Voynich Manuscript, prepared for sequence modeling and cryptographic analysis.

31k rows

Ancient Languages

CIL

A machine-readable subset of the Corpus Inscriptionum Latinarum, the canonical collection of Latin epigraphic inscriptions.

Corpus Inscriptionum Latinarum

View all datasets on Hugging Face

© 2026 Evolving Programs, Inc.