Ancient Greek OCR with Gamera and the Google/Perseus Greek and Latin Collection. Bruce Robertson, Mount Allison University. ἀλήθεια truth Ἀλήθεια. ‘Breathing’ marks on vowels at beginning of a word. Accents possible on all vowels. Diversity of Greek Fonts in 19th C. Other Examples.
Ancient Greek OCR with Gamera and the Google/Perseus Greek and Latin Collection • Bruce Robertson, Mount Allison University
ἀλήθεια truth Ἀλήθεια • ‘Breathing’ marks on vowels at beginning of a word • Accents possible on all vowels
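To make the two bullets concrete, here is a minimal sketch of how the marks on ἀλήθεια can be separated from their base letters using Unicode canonical decomposition. It illustrates the character structure only; the function name `decompose` is an assumption for this example and is not part of Gamera or of the classifiers described in these slides.

```python
import unicodedata


def decompose(word):
    """Return a list of (base letter, [combining mark names]) for a word.

    Illustrative helper (hypothetical name): uses Unicode NFD so that each
    precomposed Greek letter is split into a base letter plus its marks.
    """
    result = []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and result:
            # A combining mark attaches to the most recent base letter.
            result[-1][1].append(unicodedata.name(ch))
        else:
            result.append((ch, []))
    return result


# The word from the slide, written with escapes to avoid encoding ambiguity.
word = "\u1f00\u03bb\u03ae\u03b8\u03b5\u03b9\u03b1"  # ἀλήθεια
for base, marks in decompose(word):
    print(base, marks)
```

The initial alpha carries a smooth breathing (`COMBINING COMMA ABOVE`) and the eta carries an acute accent (`COMBINING ACUTE ACCENT`); the other five letters carry no marks.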
Greek OCR With Gamera • Dalitz and Brandt provide an experimental framework • I added splitting, grouping, SQL output, etc. • Teams of undergraduates making multiple classifiers • Based on families of fonts • Comparing strategies of composite characters, splitting, etc. • Must also train for Latin scripts used • Not yet working on post-processing
Systematic Approach to Automated Greek OCR • Remove the curator from the loop, especially important for journals, monographs, etc. • Assign classifier by computational means • Using: • Federico Boschetti’s ground-truth-less Greek text evaluator • Atlantic Computational Excellence Network, Atlantic Canada’s parallel computing network
Process • 160 Greek-heavy texts chosen • Of these, random samples of 10 pages were taken • Each was processed with each of the 20 classifiers made this summer • The results were evaluated and given a ‘Boschetti score’ from 0 to 1
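The selection loop in the steps above can be sketched as follows. This is a minimal outline under stated assumptions, not the project's actual code: `run_ocr` and `score` are hypothetical callables standing in for a Gamera classifier run and for Boschetti's evaluator, and taking the mean score over the sampled pages is an assumed way of combining per-page scores.

```python
import random
from statistics import mean


def sample_pages(pages, k=10, seed=None):
    """Draw up to k pages at random, without replacement."""
    rng = random.Random(seed)
    pages = list(pages)
    return rng.sample(pages, min(k, len(pages)))


def best_classifier(pages, classifiers, run_ocr, score, k=10, seed=None):
    """Pick the classifier with the highest mean score on a page sample.

    run_ocr(page, classifier_name) -> text   (hypothetical OCR hook)
    score(text) -> float in [0, 1]           (hypothetical evaluator hook)
    Returns (best classifier name, {name: mean score}).
    """
    sample = sample_pages(pages, k, seed)
    if not sample:
        raise ValueError("no pages to evaluate")
    means = {
        name: mean(score(run_ocr(page, name)) for page in sample)
        for name in classifiers
    }
    return max(means, key=means.get), means
```

With 160 texts, 10 pages each and 20 classifiers, this loop amounts to 32,000 independent OCR runs, which is why a parallel computing network suits the task.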
Future Work • Combining and re-optimizing classifiers? • Assign classifier based on Latin text • Is ‘Oxford’, ‘Clarendon’ or ‘Oxonii’ in the first pages of output? • Align with Google’s output, and provide Google with corrected Greek • Implement line-splitting from other OCR engines • Discover badly OCR’d Greek in others’ output • Implement OCR correction frameworks described here
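The proposed imprint check could look something like the sketch below. It is only an illustration of the idea: the function name `guess_font_family` and the family label `"oxford"` are hypothetical, and real OCR output of title pages may be noisy enough to need fuzzier matching than a whole-word search.

```python
import re

# Imprint words named on the slide; the family label is a hypothetical tag.
PUBLISHER_HINTS = ("Oxford", "Clarendon", "Oxonii")


def guess_font_family(first_pages_text, hints=PUBLISHER_HINTS, family="oxford"):
    """Return the family label if any hint occurs as a whole word, else None.

    Matching is case-insensitive because title pages are often set in capitals.
    """
    for hint in hints:
        pattern = r"\b" + re.escape(hint) + r"\b"
        if re.search(pattern, first_pages_text, re.IGNORECASE):
            return family
    return None
```

A text whose opening pages yield no match would fall back to the score-based assignment described under Process.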
Common Problems • Assessments of pre-processing strategies and tools • Schemas for page description
Thanks • Colleagues in Dynamic Variorum Editions: • Greg Crane at Perseus / Tufts • Brian Fuchs at Imperial College • Federico Boschetti • AceNet, especially tech. support of Sergiy Khan