Knowledge Center for Processing Hebrew Alon Itai – CS Technion
Tools for underrepresented languages Computer tools and especially the Internet are Anglophile. • Search engines are not tooled for morphologically rich languages.
Search “dog” “dogs” “and dogs” כלבים מאולפים מחפשים בית רוני אילוף כלבים אתרי קטגורית כלבים הב-הב אתר חיות המחמד של ישראל! קובי חזן אילוף כלבים היחידה המיוחדת לאילוף כלבים • כלב • כלב - ויקיפדיה • כלבים מאולפים מחפשים בית • כלב (יונק) • כלבים | כלב • אוגר זהבכלב הבית מכונה בלשון המדע – כלב זאב ביתי • עמותת SOS חיות - בחירת כלב מתאים • לוח חיות מחמד - כלבים חתולים דגים תוכים לאימוץ ומסירה - כלב • כלבים | כלב: אתר המציע שידוכים בין גזעים, בייביסיטרים, תזונה וטיפוח, וטרינרים, פנסיונים, מאלפים ולוח מודעות. • Dog: אתר הכלבים מכיל הרבה מידע, מאמרים, קורסים, תמונות וקטעי וידאו של כלבים וכל הקשור בהם • dog: גזעי כלבים · תמונת החודש · הכלב והחוק · רפואה וטיפול · קורסים · מאמרים · לוח מודעות · כלבי הצלה · קטעי וידאו · תמונת השנה · פינת האימוץ ... זולו משחקים פאזלים - משחק לגיל הרך - פאזל חתול עם כלב על אלמנה וכלב ניופאונדלנד, כלבי רועים וכלב רועים בלגי - PETNET.co.il ליווי, עזרת זולת רפואית וכלב נחייה
Tools for underrepresented languages. • Computer tools and especially the Internet are Anglophile. • Search engines are not tooled for morphologically rich languages. • Email and chat do not cope well with non-Latin alphabets, • so people use (pidgin) English for communication,… • The local language is used less and less. אבגדהוזחטיכלמנסעפצקרשת
The problem • Because of the small number of speakers, there is little economic incentive for commercial companies to develop tools. • Even when tools are available, they are not open source. • Tools developed at universities are not fit for general use: • not robust enough • no standard interface • lack of documentation
Duplication of Effort • Every researcher has to redevelop her own tools, before conducting original research • For example: In Hebrew, there are many morphological analyzers: • Choueka and Shapira 1964, • Ornan 1987, Lavie et al. 1988, • Bentur et al. 1992, • Segal 1999, • HSPELL • Yona and Wintner 2005
The Knowledge Center • In 2003, the Israeli Ministry of Science and Technology established a Knowledge Center for Processing Hebrew. • Its aim is to develop products (software and databases) for processing Hebrew and make them available to the public, both in academia and industry. • Researchers from four universities are involved in the Center's activities.
The researchers • Yoad Winter (Technion), • Shuly Wintner (Haifa University), • Michael Elhadad (Ben Gurion University), • Arnon Cohen (Ben Gurion University), • Yoram Singer (Hebrew University) • Eli Shamir (Hebrew University) • Alon Itai (Technion)
The model • The ministry provides initial funds. • The Center should be self-sustainable – it should finance itself by selling products. The problems: • The market is too small; had it been large, there would have been no need for the Center. • Selling products contradicts our philosophy of open research and open code.
Licensing Policy • Available under the GPL – GNU General Public License. You get it for free if all products derived from it are also under the GPL. • Payments only for special services. • Can get a non-exclusive license for commercial use.
XML • All products are represented in XML. • Readable both by machines and by humans. • Enables using off-the-shelf tools for on-screen presentation and validation. EXAMPLE (info for the morphological parser):
<item id="17580" script="formal" transliterated="bwqr" undotted="בוקר" dotted="בֹּקֶר">
  <noun gender="masculine" number="singular" plural="im">
    <replace gender="masculine" number="plural" script="formal" transliterated="bqarim" undotted="בקרים"/>
  </noun>
</item>
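Because the lexicon is plain XML, an entry like the one in the example can be read with a standard parser. The following is a minimal sketch using Python's standard library. The function name `read_item` and the keys of the returned dictionary are illustrative assumptions, not part of the Center's tools, and the entry is typed with straight quotes so that it is well-formed.

```python
import xml.etree.ElementTree as ET

# The lexicon entry from the slide, typed with straight quotes.
ITEM_XML = """<item id="17580" script="formal" transliterated="bwqr" undotted="בוקר" dotted="בֹּקֶר">
  <noun gender="masculine" number="singular" plural="im">
    <replace gender="masculine" number="plural" script="formal" transliterated="bqarim" undotted="בקרים"/>
  </noun>
</item>"""


def read_item(xml_text):
    """Return a small dict view of one lexicon <item> (hypothetical helper)."""
    item = ET.fromstring(xml_text)
    noun = item.find("noun")
    # <replace> carries an irregular inflected form, when one is listed.
    replace = noun.find("replace") if noun is not None else None
    return {
        "id": item.get("id"),
        "transliterated": item.get("transliterated"),
        "undotted": item.get("undotted"),
        "gender": noun.get("gender") if noun is not None else None,
        "plural_suffix": noun.get("plural") if noun is not None else None,
        "irregular_plural": replace.get("transliterated") if replace is not None else None,
    }


entry = read_item(ITEM_XML)
print(entry["transliterated"], entry["gender"], entry["irregular_plural"])
```

The same attribute names can be checked by an off-the-shelf validator if a schema is available; this sketch only reads them.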
XML (2) • Facilitates interface between tools: • For example, the output of the morphological analyzer is the input for the morphological disambiguator. • Thus one can match different morphological analyzers with different disambiguators and compare their results
Products • Morphological analyzers • Morphological disambiguators • Lexicon • Corpora • Speech database • Tools for editing lexicons and tagging corpora. • PR: forum,…
The lexicon by part of speech • Total: 21,417
Morphological disambiguators • Roy Bar-Haim constructed an HMM-based parser which partitions each word in a corpus into morphemes – success rate 96%. • Erel Segal combined a Brill-like method with a priori occurrence probabilities. • Meni Adler used an HMM on whole words. • All three disambiguators are available at the Center.
Corpora (2) • A manually tagged corpus of 6000 sentences (12,000 tokens).
Tree bank • 6000 syntactically parsed sentences. • Used for automatic parsing.
Conclusions • The Center is an example of cooperation between researchers in several universities. • Many users have downloaded the products. • 10 companies have purchased licenses.
Conclusions (2) • Money is running out, … • The model requires money, experts, and commitment. • Not suitable for languages with very few speakers, or for poor communities.
Modern Hebrew • Official language of the State of Israel • Spoken by 7 million people • Related to, but linguistically distinct from, Biblical Hebrew. • Morphologically rich
Semitic Word Formation • root + pattern = word • root ktb + pattern CaCaC = katab (he wrote) • root ktb + pattern yiCCoC = yiktob (he will write) • root šbr + pattern CaCaC = šabar (he broke) • root šbr + pattern yiCCoC = yišbor (he will break)
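The root-and-pattern combination can be sketched in a few lines: each `C` slot in the pattern is filled, in order, with the next consonant of the root. This is only an illustration of the idea on the slide. The function name `apply_pattern` is an assumption, and real Hebrew morphology involves many alternations that this toy ignores.

```python
def apply_pattern(root, pattern, slot="C"):
    """Interdigitate the consonants of `root` into the `slot` positions of `pattern`."""
    if pattern.count(slot) != len(root):
        raise ValueError("pattern must have exactly one slot per root consonant")
    consonants = iter(root)
    # Copy pattern characters as they are, replacing each slot by the next root consonant.
    return "".join(next(consonants) if ch == slot else ch for ch in pattern)


for root in ("ktb", "šbr"):
    for pattern in ("CaCaC", "yiCCoC"):
        print(root, "+", pattern, "=", apply_pattern(root, pattern))
```

Run on the two roots and two patterns from the slide, this prints katab, yiktob, šabar and yišbor.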
Writing System • Most vowels are omitted. • Particles are prepended to words. Example: h – definite article, b – preposition (in), w – conjunction (and). • wbbyt = w + b + ha + byt = "and in the house"
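A small sketch of how prepended particles produce the written form. It assumes, following the wbbyt example, that the definite article h is not written when it directly follows the preposition b. The function name `attach_particles` is hypothetical, and the article is written here as h rather than ha.

```python
def attach_particles(particles, word):
    """Prepend particles to a word, in transliteration.

    Assumption taken from the slide's example: the article "h" leaves no
    written trace when it directly follows the preposition "b".
    """
    written = []
    previous = None
    for particle in particles:
        if not (particle == "h" and previous == "b"):
            written.append(particle)
        previous = particle
    return "".join(written) + word


print(attach_particles(["w", "b", "h"], "byt"))  # and + in + the + house -> wbbyt
print(attach_particles(["w", "b"], "byt"))       # and + in + house       -> wbbyt
print(attach_particles(["h"], "byt"))            # the + house            -> hbyt
```

The first two calls give the same string, so a reader (or an analyzer) cannot tell from the spelling alone whether the article is present.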
Morphological Ambiguity • Most words are morphologically ambiguous • Example: šbth שבתה • šavta = šbt + CaCCa = stopped working • šavta = šbh + CaCCa = took prisoner • šabatah = her Saturday • še-b-te = that in tea • še-b-ha-te = that in the tea • še-bit-h = that her daughter…
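The ambiguity of šbth can be reproduced in part with a toy analyzer that tries every split into prefix particles, a stem, and an optional possessive suffix. The three tiny tables below are assumptions made up for this sketch, with glosses taken from the slide. They are not the Center's lexicon, and the verb readings and the hidden article are deliberately left out.

```python
# Toy tables (assumptions for illustration only).
PREFIXES = {"š": "that", "b": "in"}
STEMS = {"šbt": "Saturday", "bt": "daughter", "th": "tea"}
SUFFIXES = {"h": "her"}


def analyses(word):
    """Return every split of `word` into prefixes + stem (+ suffix) allowed by the tables."""
    results = []

    def peel(rest, prefixes):
        # Option 1: the remainder is a bare stem.
        if rest in STEMS:
            results.append(tuple(prefixes) + (rest,))
        # Option 2: the remainder is a stem followed by a possessive suffix.
        for suffix in SUFFIXES:
            stem = rest[: len(rest) - len(suffix)]
            if rest.endswith(suffix) and stem in STEMS:
                results.append(tuple(prefixes) + (stem, suffix))
        # Option 3: strip one more prefix particle and continue.
        for prefix in PREFIXES:
            if rest.startswith(prefix) and len(rest) > len(prefix):
                peel(rest[len(prefix):], prefixes + [prefix])

    peel(word, [])
    return results


for analysis in analyses("šbth"):
    print(" + ".join(analysis))
```

Even with these three-entry tables the single written word receives three analyses (šbt + h, š + bt + h, š + b + th), which is why a disambiguator is needed after the analyzer.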