Ancient Greek OCR

Ancient Greek OCR is free software that accurately converts scans of printed Ancient Greek into Unicode text and PDF files, which can be easily searched, copied, archived, and transformed. It uses the excellent Tesseract OCR engine, tailored for Ancient Greek typography, syntax and vocabulary.

It runs on Windows, OS X, Linux and Android, on anything from personal computers and mobile devices to large server clusters.
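As a rough illustration of how the training data is used once installed, the sketch below builds a Tesseract command line from Python. It assumes the Ancient Greek data is installed under the language code `grc` and that the `tesseract` executable is on the PATH. The function names and file names are hypothetical, and PDF output depends on a Tesseract version that ships the `pdf` config. The per-platform instructions remain the authoritative guide.

```python
import shutil
import subprocess
from typing import List


def build_tesseract_command(image_path: str, output_base: str, pdf: bool = False) -> List[str]:
    """Build a Tesseract command line that uses the 'grc' language data.

    Assumption: the Ancient Greek training data is installed as 'grc'.
    """
    command = ["tesseract", image_path, output_base, "-l", "grc"]
    if pdf:
        # 'pdf' is a Tesseract config name that requests searchable PDF output.
        command.append("pdf")
    return command


def run_ocr(image_path: str, output_base: str, pdf: bool = False) -> None:
    """Run Tesseract on one page image; raises if the executable is missing or fails."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_tesseract_command(image_path, output_base, pdf), check=True)
```

Calling `run_ocr("page.png", "page")` would then write the recognised text to `page.txt`, assuming a working Tesseract installation.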

Download

Ancient Greek OCR v2.0 (2014-05-01)

Instructions: Windows | OS X | Linux | Android

How it works

See the article “Training Tesseract for Ancient Greek OCR”, published in The Eutypon 28-29.

Release history

2.0 (2014-05-01)
Use new Tesseract tools to generate training images. Sample characters at different exposure levels. Remove rare characters (†/ϙ/ʹ). Add speech marks (“/”). Replace accented characters in the modern Greek Unicode set (U+0370) with Ancient Greek (U+1F00) variants. Improve wordlists by properly registering upper/lower case complements. Improve wordlist generation from Perseus corpus. Improve punctuation rules. Add rules to convert some apostrophe detections into breathing marks. Build is completely deterministic.
1.4 (2013-04-04)
Significantly improve line segmentation. Improve diphthong breathing mark correction rules.
1.3 (2013-02-28)
Add noise to training texts to improve recognition for lower quality scans. Improve diphthong breathing mark correction rules. Improve accent ambiguity rules. Add several miscellaneous ambiguity rules.
1.2 (2012-10-24)
Add rules to correct rho breathing mark errors.
1.1 (2012-10-23)
Improve dictionary scoring.
1.0 (2012-09-13)
Initial release.
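
The 2.0 entry mentions replacing accented characters from the modern Greek block (U+0370) with their Ancient Greek (U+1F00) variants. The sketch below shows one way such a substitution could be expressed, mapping the tonos forms to the corresponding oxia forms in the Greek Extended block. It is an illustration only: the table and function name are assumptions, and the project's actual rules live in its repositories and may differ.

```python
# Tonos forms (Greek and Coptic block, U+0370) mapped to the corresponding
# oxia forms (Greek Extended block, U+1F00). Illustrative table, not the
# project's own rule set.
TONOS_TO_OXIA = {
    0x03AC: 0x1F71,  # small alpha
    0x03AD: 0x1F73,  # small epsilon
    0x03AE: 0x1F75,  # small eta
    0x03AF: 0x1F77,  # small iota
    0x03CC: 0x1F79,  # small omicron
    0x03CD: 0x1F7B,  # small upsilon
    0x03CE: 0x1F7D,  # small omega
    0x0386: 0x1FBB,  # capital alpha
    0x0388: 0x1FC9,  # capital epsilon
    0x0389: 0x1FCB,  # capital eta
    0x038A: 0x1FDB,  # capital iota
    0x038C: 0x1FF9,  # capital omicron
    0x038E: 0x1FEB,  # capital upsilon
    0x038F: 0x1FFB,  # capital omega
    0x0390: 0x1FD3,  # small iota with dialytika
    0x03B0: 0x1FE3,  # small upsilon with dialytika
}


def to_ancient_greek_accents(text: str) -> str:
    """Replace tonos-accented characters with their Greek Extended oxia forms."""
    return text.translate(TONOS_TO_OXIA)
```

Unicode NFC normalization folds the oxia forms back to the tonos forms, so text converted this way should not be NFC-normalized afterwards if the U+1F00 code points need to be preserved.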

The code

The code to generate and test the Ancient Greek OCR training data is in several small git repositories. It is all free software under the Apache License 2.0.

git clone http://ancientgreekocr.org/grc.git
The final training process, hopefully soon to be part of the main Tesseract codebase.
git clone http://ancientgreekocr.org/grctraining.git
Rules and tools to deterministically generate all prerequisites for the final training process.
git clone http://ancientgreekocr.org/grctestfodder.git
Ancient Greek page scans and ground truth text for testing OCR accuracy.
git clone http://ancientgreekocr.org/ocr-evaluation-tools.git
Tools to test OCR accuracy.
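
To give a feel for what testing OCR accuracy against ground truth text involves, here is a minimal sketch of a character accuracy measure based on edit distance. It is not the code from ocr-evaluation-tools; the function names and the exact accuracy definition are assumptions made for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(ocr_text: str, ground_truth: str) -> float:
    """Fraction of ground truth characters recovered, clamped to [0, 1]."""
    if not ground_truth:
        return 1.0 if not ocr_text else 0.0
    errors = edit_distance(ocr_text, ground_truth)
    return max(0.0, 1.0 - errors / len(ground_truth))
```

Comparing the OCR output for a page scan with its ground truth file in this way gives a single number per page, which makes it straightforward to check whether a change to the training data helps or hurts.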

Contact

For comments, bugs, criticisms, code, help, or anything else, contact Nick White at ancientgreekocr@njw.name.

Thanks

This project was made possible in part by the Institute of Museum and Library Services, LG0611032611; the National Endowment for the Humanities: Exploring the human endeavor; and the Perseus Digital Library Project, as well as the ERC-funded Living Poets Project. The Tesseract OCR engine makes this all possible, doing all of the hard work behind the scenes.

Related projects

Lace is a project publishing high quality OCR on scans of Ancient Greek from archive.org.