Comparable Corpora BootCaT (CCBC)
Adam Kilgarriff, Avinesh PVS
Lexical Computing Ltd
BootCaT
• Bootstrapping Corpora and Terms
• Translators
  • Know the language
  • Not domain experts
  • Can interpret domain terms but can’t guess them
• Instant domain corpus from the web
• Marco Baroni and Silvia Bernardini (2004)
BootCaT method
• Piggyback on a search engine
  • Google, Yahoo, Bing
• Set of seed terms
• Repeat
  • Take 3 random seeds
  • Send to search engine
  • Gather ‘search hits’ pages
• Remove duplicates, find terms
• Can iterate
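The loop on this slide can be sketched roughly as follows. This is a minimal illustration, not the BootCaT implementation: `search` and `fetch` are hypothetical callables supplied by the caller (a search-engine client returning URLs, and a page downloader returning text), duplicate removal is reduced to exact matching on whitespace-normalised text, and the term-finding step is left out.

```python
import random


def format_term(term):
    """Quote multi-word seeds so the search engine treats them as phrases."""
    return f'"{term}"' if " " in term else term


def random_seed_queries(seeds, n_queries, tuple_size=3, rng=None):
    """Build n_queries queries, each from tuple_size randomly chosen seeds."""
    rng = rng or random.Random()
    queries = []
    for _ in range(n_queries):
        chosen = rng.sample(seeds, tuple_size)
        queries.append(" ".join(format_term(t) for t in chosen))
    return queries


def bootstrap_corpus(seeds, search, fetch, n_queries=10, rng=None):
    """Collect page texts for random seed tuples, skipping duplicates.

    search(query) -> iterable of URLs   (hypothetical, caller-supplied)
    fetch(url)    -> page text as str   (hypothetical, caller-supplied)
    """
    seen_urls = set()
    seen_texts = set()
    corpus = []
    for query in random_seed_queries(seeds, n_queries, rng=rng):
        for url in search(query):
            if url in seen_urls:
                continue  # same page returned by an earlier hit
            seen_urls.add(url)
            text = fetch(url)
            # Crude duplicate check: identical text after normalisation.
            normalised = " ".join(text.split()).lower()
            if not normalised or normalised in seen_texts:
                continue
            seen_texts.add(normalised)
            corpus.append(text)
    return corpus
```

To iterate, terms extracted from the returned corpus would be fed back in as the next round's `seeds`.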
WebBootCaT
• Web interface
• Improved cleaning, duplicate removal
• Integrated with corpus tool (Sketch Engine)
Going multilingual
• Google-translate
• English: volcanology volcanologist "volcanic eruption" seismographs Eyjafjallajokull geodic "deformation monitoring" tephra magma stratigraphic tephrochronology geochronological "volcanic ash" ablation rhyolitic
• French: vulcanologue volcanologie "éruption volcanique" sismographes Eyjafjallajokull "surveillance de la déformation" géodiques tephra magma téphrochronologie stratigraphique géochronologiques "de cendres volcaniques" ablation rhyolitiques
• And do the same thing for French
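The seed-translation step might look something like the sketch below. It assumes a caller-supplied `translate` function (hypothetical; the slide uses Google Translate, whose API is not shown here), and the toy glossary only pairs a few terms as they appear to correspond on the slide. The translated list is then used as the seed set for the French run.

```python
def translate_seeds(seeds, translate):
    """Translate each seed term, dropping empty results and repeats.

    translate(term) -> translated term as str (hypothetical, caller-supplied)
    """
    translated = []
    for term in seeds:
        target = translate(term).strip()
        if target and target not in translated:
            translated.append(target)
    return translated


# Toy stand-in for a machine-translation call; pairings are inferred
# from the slide's two seed lists and are an assumption.
toy_glossary = {
    "volcanic eruption": "éruption volcanique",
    "seismographs": "sismographes",
    "magma": "magma",
    "tephra": "tephra",
}

french_seeds = translate_seeds(
    ["volcanic eruption", "seismographs", "magma", "tephra"],
    lambda term: toy_glossary.get(term, ""),
)
```

Terms that are the same in both languages (here `magma` and `tephra`) simply carry over unchanged.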
By July 2011
• All steps integrated
• Propose bilingual terminology