Datasets
Open data for ancient languages
We publish manuscripts and corpora through the Ancient Languages Project on Hugging Face: curated resources for training and evaluating models on historical and esoteric texts.
Ancient Languages
Latin-CC-170M
A Latin corpus distilled from Common Crawl and mirrored from Kaggle, for training and evaluating language models on classical Latin.
Common Crawl · Latin
Ancient Languages
Voynich
Structured transcriptions and metadata for the Voynich Manuscript, prepared for sequence modeling and cryptographic analysis.
31k rows
Ancient Languages
CIL
A machine-readable subset of the Corpus Inscriptionum Latinarum, the canonical collection of Latin epigraphic inscriptions.
Corpus Inscriptionum Latinarum