Automatic Hebrew Vocalization. By: Eran Tomer Advisor: Prof. Michael Elhadad. Natural Language Processing. The computational linguistics field attempts to model and study languages using computational techniques.
Automatic Hebrew Vocalization By: Eran Tomer Advisor: Prof. Michael Elhadad
Natural Language Processing • The computational linguistics field attempts to model and study languages using computational techniques. • The diverse challenges confronted by computational linguistics researchers include: • Machine translation • Automatic text-summarization • Speech-to-text • Text-to-speech • Etc.
Hebrew Natural Language Processing • Accomplishing NLP tasks for Hebrew is made difficult by two factors: • Lack of large-scale annotated resources: supervised learning is generally hard to apply • High ambiguity rate: a given Hebrew word may have many different meanings and pronunciations, e.g. ספר, שלט, שערה, משנה
Related Work • The Hebrew TreeBank • 5,000 segmented and morphologically tagged sentences • Mila • Various corpora, lexicons and some NLP tools • Word Segmentation • Morphological tagging
Motivation • Development of a Hebrew Text-To-Speech (TTS) system • A vocalized and syllabified word may be used as a normalized form for a Hebrew TTS system • Generation of vocalized text for teaching • Vocalized inflected words are widely used for teaching but are difficult to obtain (they do not appear in dictionaries) • Improving automatic translation systems
Objectives • Generation • Automatically producing fully vocalized verb inflections with the corresponding morphological attributes • Syllable segmentation • Automatically segmenting vocalized words into syllables • Unknown verb classification • Classifying verbs into their corresponding patterns • Automatically selecting an inflection schema for an unknown verb
Research Questions • The Hebrew verb • How complex must the computational model be for full morphological and vocalization generation of verbs? • How much lexical knowledge, and how many exceptions, are required to cover the Hebrew verb lexicon? • Syllable segmentation • How complex is syllable segmentation? • What level of knowledge is required for successful segmentation?
Previous Work • Vocalization • Syllable segmentation • Generation
Background – Hebrew Vocalization • Vowels vs. Consonants • Consonant letters are either vocalized by Shva (ְ) or left non-vocalized at the end of a word. There are two types of Shva: Na and Nach • A letter pronounced with a vowel is vocalized by one of the following vowel and semi-vowel signs: • O: Holam (ֹ), Holam Male (וֹ), Kamats Katan (ָ), Hataf Kamats (ֳ) • I: Hirik (ִ) • A: Kamats (ָ), Patah (ַ), Hataf Patah (ֲ) • E: Segol (ֶ), Tsere (ֵ), Hataf Segol (ֱ) • U: Kubuts (ֻ), Shuruk (וּ)
Background – Hebrew Vocalization • Diacritic signs may change the pronunciation of letters • Dagesh • Dagesh (ּ) emphasizes letters, yet in modern Hebrew it affects only בּ/ב, כּ/כ and פּ/פ • Mapik • Mapik (ּ) denotes a consonantal (pronounced) Hey at the end of a word • Shin dots • Shin dots distinguish the pronunciation of ש as SH (שׁ) or S (שׂ) • Dagesh Kal marks ב, ג, ד, כ, פ, ת • At the beginning of a word • After a Shva Nach • Dagesh Hazak marks any letter other than א, ה, ח, ע, ר • Following certain linguistic phenomena • In some noun/verb patterns
Background – Hebrew Vocalization • Syllables • Hebrew words are composed of syllables; a syllable is a phonological unit that is pronounced in one effort • Stress • Hebrew words follow one of two stress schemes • Milel (מלעיל): the penultimate syllable is stressed • Milra (מלרע): the last syllable is stressed • Deficient spelling vs. Plene spelling • In many cases there is more than one valid way to spell a given Hebrew word
Background – Hebrew Vocalization • The syllables and vowels rule (כלל ההברות והתנועות) • Input: a syllable s, either stressed or non-stressed • If s is non-stressed: • If s is an open syllable, vocalize s with a long vowel • Otherwise, vocalize s with a short vowel • If s is stressed: • In most cases s should be vocalized with a long vowel, yet the number of exceptions is considerable
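The rule above can be read as a small decision function. The following is a minimal sketch, not part of the original work; the function name and the second return value (a flag saying whether the rule is firm or only a default) are assumptions introduced for illustration.

```python
from typing import Tuple


def expected_vowel_length(stressed: bool, is_open: bool) -> Tuple[str, bool]:
    """Return (vowel length, is_firm) predicted by the syllables and vowels rule.

    is_firm is False for stressed syllables, where the slide notes that
    "long" is only the usual case and exceptions are numerous.
    """
    if not stressed:
        # Non-stressed: open syllable -> long vowel, closed syllable -> short vowel.
        return ("long", True) if is_open else ("short", True)
    # Stressed: long vowel in most cases (hypothetical "default" marker).
    return ("long", False)
```

For a non-stressed closed syllable the sketch returns `("short", True)`; for any stressed syllable it returns `("long", False)`, leaving the exceptions to a lexicon or to the inflection tables.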
Background – Hebrew Vocalization • Examples
Background – Hebrew Vocalization • Verbs • Morphological attributes • Patterns • Gender: masculine, feminine, both • Tense: past, Beinoni (participle), present, future, imperative • Person: first, second, third • Number: singular, plural
Background – Hebrew Vocalization • The Hebrew paradigms • Hebrew verbs are clustered into several paradigms that are characterized by the manner in which they inflect verbs • Complete paradigms (גזרות השלמים) • Crippled paradigms (גזרות נחות) • Defective paradigms (גזרות חסרות) • Etc. • Inflection tables • Paradigms are further partitioned into about 300 specific inflection tables, which describe the inflections of specific verb families
Background – Hebrew Vocalization • Inflection tables - example
Datasets • Verb list • Over 4,000 manually gathered verbs • Form: deficient spelling, past, masculine, singular, 3rd person • Shin dots are indicated • The corresponding inflection table is indicated for each verb • Morphologically analyzed corpora • About 50 million fully morphologically disambiguated words • Material from the “Haaretz” newspaper, the “Tapuz” website, the “Knesset” discussions and other resources
Generation • Method • We implemented 264 inflection tables; generation: • Takes: • A verb (v) from our verb list dataset • Its corresponding inflection table • Returns: • Vocalized inflections of v with the appropriate morphological tags • Results • A list of more than 240,000 vocalized verb forms with their morphological attributes • Evaluation • A sample of over 15,000 inflected verbs was manually validated, with 99.4% accuracy
Generation – results sample • C-20, פצפץ: • פִּצְפַּצְתִּי,PAST+FIRST+MF+SINGULAR+COMPLETE • פִּצְפַּצְתָּ,PAST+SECOND+M+SINGULAR+COMPLETE • פִּצְפַּצְתְּ,PAST+SECOND+F+SINGULAR+COMPLETE • פִּצְפֵּץ,PAST+THIRD+M+SINGULAR+COMPLETE • פִּצְפְּצָה,PAST+THIRD+F+SINGULAR+COMPLETE • פִּצְפַּצְנוּ,PAST+FIRST+MF+PLURAL+COMPLETE • פִּצְפַּצְתֶּם,PAST+SECOND+M+PLURAL+COMPLETE • פִּצְפַּצְתֶּן,PAST+SECOND+F+PLURAL+COMPLETE • פִּצְפְּצוּ,PAST+THIRD+M+PLURAL+COMPLETE • פִּצְפְּצוּ,PAST+THIRD+F+PLURAL+COMPLETE • …
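To illustrate how an inflection table can turn a bare verb into tagged, vocalized forms like the sample above, here is a minimal sketch. It is not the authors' implementation: the table layout, the name `PAST_TABLE`, the four rows chosen, and the simplified Dagesh Kal handling (applied only to the first and third root letters) are all assumptions made for this example.

```python
import unicodedata

SHVA, HIRIK, TSERE, PATAH, KAMATS, DAGESH = (
    "\u05B0", "\u05B4", "\u05B5", "\u05B7", "\u05B8", "\u05BC")
TAV, YOD, HE = "\u05EA", "\u05D9", "\u05D4"
BEGADKEFAT = set("\u05D1\u05D2\u05D3\u05DB\u05E4\u05EA")
# Regular letter -> word-final letter form.
FINAL_FORMS = {"\u05DB": "\u05DA", "\u05DE": "\u05DD", "\u05E0": "\u05DF",
               "\u05E4": "\u05E3", "\u05E6": "\u05E5"}

# Hypothetical mini table for a four-letter root, past tense only.
# Each row: vowel signs for the four root letters, a vocalized suffix, tags.
PAST_TABLE = [
    ((HIRIK, SHVA, PATAH, SHVA), TAV + DAGESH + HIRIK + YOD,
     "PAST+FIRST+MF+SINGULAR+COMPLETE"),
    ((HIRIK, SHVA, PATAH, SHVA), TAV + DAGESH + KAMATS,
     "PAST+SECOND+M+SINGULAR+COMPLETE"),
    ((HIRIK, SHVA, TSERE, ""), "", "PAST+THIRD+M+SINGULAR+COMPLETE"),
    ((HIRIK, SHVA, SHVA, KAMATS), HE, "PAST+THIRD+F+SINGULAR+COMPLETE"),
]


def strip_marks(word):
    """Remove all combining marks (vowel signs, Dagesh) from a word."""
    return "".join(ch for ch in word if not unicodedata.combining(ch))


def inflect(root, table=PAST_TABLE):
    """Return (vocalized form, tags) pairs for a four-letter root."""
    if len(root) != 4:
        raise ValueError("this sketch only handles four-letter roots")
    forms = []
    for vowels, suffix, tags in table:
        parts = []
        for i, (letter, vowel) in enumerate(zip(root, vowels)):
            # Dagesh Kal (simplified): word-initial letter, or the third
            # letter, which follows the Shva Nach under the second letter.
            dagesh = DAGESH if i in (0, 2) and letter in BEGADKEFAT else ""
            # Use the final letter form when the root letter ends the word.
            if i == 3 and not vowel and not suffix:
                letter = FINAL_FORMS.get(letter, letter)
            parts.append(letter + dagesh + vowel)
        forms.append(("".join(parts) + suffix, tags))
    return forms
```

Calling `inflect("\u05E4\u05E6\u05E4\u05E6")` (the root letters of פצפץ, written with a regular final letter) yields four forms whose consonant skeletons and tags correspond to rows of the sample. A full table would add the remaining persons, tenses and the exceptions that the real inflection tables encode.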
Syllable segmentation • Method • Syllable segmentation requires Shva classification • Shva Na marks a syllable start* • Shva Nach denotes a syllable end* • Each syllable includes exactly one vowel* • * According to the Even-Shoshan dictionary • We implemented two Shva classification schemes • Heuristic approach - Rabbi-Eliyahu-Behor • Shva classification according to the base tense form
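Once every Shva has been classified, the three statements above are enough to drive a simple segmenter. The sketch below is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, Shva labels are passed in as a list (one per Shva, in order), a Vav carrying only a dot or only a Holam after an unvocalized letter is treated as Shuruk or Holam Male, and word-initial Shuruk is not handled.

```python
import unicodedata

SHVA, DAGESH, HOLAM, VAV = "\u05B0", "\u05BC", "\u05B9", "\u05D5"
# Full vowel signs: U+05B1 (Hataf Segol) through U+05BB (Kubuts).
VOWELS = {chr(code) for code in range(0x05B1, 0x05BC)}


def split_units(word):
    """Group each base letter with the combining marks that follow it."""
    units = []
    for ch in word:
        if unicodedata.combining(ch) and units:
            units[-1] += ch
        else:
            units.append(ch)
    return units


def is_vowel_letter(units, i):
    """Vav used as Shuruk or Holam Male after a letter with no sign of its own."""
    if i == 0 or units[i][0] != VAV:
        return False
    if set(units[i][1:]) not in ({DAGESH}, {HOLAM}):
        return False
    prev_marks = set(units[i - 1][1:])
    return not (prev_marks & VOWELS) and SHVA not in prev_marks


def segment(word, shva_labels):
    """Split a vocalized word into syllables, given 'na'/'nach' per Shva."""
    units = split_units(word)
    labels = iter(shva_labels)
    syllables, current, has_vowel = [], "", False
    for i, unit in enumerate(units):
        marks = set(unit[1:])
        label = next(labels) if SHVA in marks else None
        if is_vowel_letter(units, i):
            current += unit  # the vowel letter belongs to the previous letter
            continue
        own_vowel = bool(marks & VOWELS)
        borrowed = i + 1 < len(units) and is_vowel_letter(units, i + 1)
        # A new syllable starts at a vowel-bearing letter or a Shva Na,
        # once the current syllable already holds its single vowel.
        if (own_vowel or borrowed or label == "na") and has_vowel:
            syllables.append(current)
            current, has_vowel = "", False
        current += unit
        has_vowel = has_vowel or own_vowel or borrowed
    if current:
        syllables.append(current)
    return syllables
```

A letter with Shva Nach is never a syllable start here, so it closes the syllable it joins; a letter with Shva Na opens a syllable that runs up to the next vowel.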
Syllable segmentation • Heuristic approach • By Behor, a Shva is a Shva Na if: • It vocalizes the first letter of the word • It follows another Shva and it is not at the word end • It follows a long, stressed vowel (stress information is needed) • It vocalizes a letter with Dagesh Hazak (the Dagesh type is needed) • It vocalizes the first of two identical letters (many exceptions) • By our (adapted) heuristic: • A Shva is a Shva Na if: • It vocalizes the first letter of the word • It follows another Shva and it is not at the word end • It follows a long vowel • A Shva is a Shva Nach if: • It is followed by another Shva • In any other case, we use Shva Nach as the default
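The adapted heuristic can be sketched in a few lines. This is an illustrative reading of the rules above, not the authors' code. Assumptions: the order in which the rules are tested, the set of signs treated as long vowels (Tsere, Kamats, Holam, and Vav with a dot read as Shuruk), and all names. Kamats Katan shares its sign with Kamats, so this sketch treats it as long, and Hirik Male is not detected.

```python
import unicodedata

SHVA, DAGESH, VAV = "\u05B0", "\u05BC", "\u05D5"
# Assumed long-vowel signs: Tsere, Kamats, Holam.
LONG_VOWEL_MARKS = {"\u05B5", "\u05B8", "\u05B9"}


def split_units(word):
    """Return a list of (base letter, set of combining marks) pairs."""
    units = []
    for ch in word:
        if unicodedata.combining(ch) and units:
            units[-1][1].add(ch)
        else:
            units.append((ch, set()))
    return units


def has_long_vowel(unit):
    letter, marks = unit
    if marks & LONG_VOWEL_MARKS:
        return True
    # Shuruk: a Vav that carries only the dot and acts as a vowel letter.
    return letter == VAV and marks == {DAGESH}


def classify_shvas(word):
    """Return (unit index, 'na' or 'nach') for every Shva in the word."""
    units = split_units(word)
    labels = []
    for i, (_, marks) in enumerate(units):
        if SHVA not in marks:
            continue
        prev_shva = i > 0 and SHVA in units[i - 1][1]
        next_shva = i + 1 < len(units) and SHVA in units[i + 1][1]
        is_last = i == len(units) - 1
        if i == 0:
            label = "na"        # vocalizes the first letter of the word
        elif next_shva:
            label = "nach"      # followed by another Shva
        elif prev_shva and not is_last:
            label = "na"        # follows another Shva, not at the word end
        elif has_long_vowel(units[i - 1]):
            label = "na"        # follows a long vowel
        else:
            label = "nach"      # default
        labels.append((i, label))
    return labels
```

Because the adapted heuristic drops the Dagesh Hazak rule, this sketch labels the Shva in a deficient spelling such as גֻּלְּמוּ as Nach, while the plene spelling גּוּלְּמוּ gets Na through the long-vowel rule.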
Syllable segmentation • Shva classification according to the base tense form • Through our generation mechanism, we can match verb inflections with their corresponding base-tense forms • A Shva that is also present in the base-tense form is a Shva Nach • Otherwise the Shva is a Shva Na • Matching inflections to base-tense forms • We use a dynamic programming string matching algorithm • Operation costs were customized to be character dependent, respecting the Hebrew inflectional model • Example: aligning the inflection תִּזְדַּקְּקִי with the base-tense form יִזְדַּקֵּק, character by character from the first character to the last (I = insert, C = copy, R = replace; - marks a gap): • Operations: R I C C C C C C C C R C I I • Inflection: ת ּ ִ ז ְ ד ּ ַ ק ּ ְ ק ִ י • Base-tense form: י - ִ ז ְ ד ּ ַ ק ּ ֵ ק - -
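The matching step can be sketched as a weighted edit distance with a backtrace. This is a minimal illustration, not the authors' algorithm: the slides do not give the actual operation costs, so the weights below (cheap operations on vowel signs, costlier ones on letters) are hypothetical, as are the function names and the extra "D" operation for characters that appear only in the base form.

```python
import unicodedata

SHVA = "\u05B0"


def is_mark(ch):
    return unicodedata.combining(ch) != 0


def sub_cost(a, b):
    """Hypothetical character-dependent replacement cost."""
    if a == b:
        return 0.0
    if is_mark(a) and is_mark(b):
        return 0.5          # one vowel sign replaced by another
    if is_mark(a) != is_mark(b):
        return 5.0          # discourage aligning a letter with a sign
    return 1.0              # letter replaced by letter (e.g. a prefix)


def indel_cost(ch):
    return 0.5 if is_mark(ch) else 1.0


def align(inflected, base):
    """Return (cost, ops); ops has one of C, R, I, D per alignment column."""
    n, m = len(inflected), len(base)
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = dist[i - 1][0] + indel_cost(inflected[i - 1])
    for j in range(1, m + 1):
        dist[0][j] = dist[0][j - 1] + indel_cost(base[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist[i][j] = min(
                dist[i - 1][j - 1] + sub_cost(inflected[i - 1], base[j - 1]),
                dist[i - 1][j] + indel_cost(inflected[i - 1]),
                dist[i][j - 1] + indel_cost(base[j - 1]))
    ops, i, j = [], n, m
    while i > 0 or j > 0:  # walk back along one optimal path
        if i > 0 and j > 0 and dist[i][j] == (
                dist[i - 1][j - 1] + sub_cost(inflected[i - 1], base[j - 1])):
            ops.append("C" if inflected[i - 1] == base[j - 1] else "R")
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + indel_cost(inflected[i - 1]):
            ops.append("I")  # character present only in the inflection
            i -= 1
        else:
            ops.append("D")  # character present only in the base form
            j -= 1
    return dist[n][m], "".join(reversed(ops))


def classify_shvas_by_base(inflected, base):
    """A Shva copied from the base-tense form is Nach; any other Shva is Na."""
    _, ops = align(inflected, base)
    labels, i = [], 0
    for op in ops:
        if op == "D":
            continue
        if inflected[i] == SHVA:
            labels.append((i, "nach" if op == "C" else "na"))
        i += 1
    return labels
```

With these weights, aligning פִּצְפְּצָה against its base-tense form פִּצְפֵּץ copies the first Shva (Nach) and aligns the second Shva with the Tsere of the base form (Na). The indices in the result are character positions in the inflected string.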
Syllable segmentation • Results • Thanks to our generation model, we obtain about 240,000 highly accurate vocalized verb forms • We applied our two approaches to obtain two lists of verbs segmented into syllables: • By our heuristic approach (based on Behor’s heuristic) • By our customized string matching algorithm • Evaluation • A sample of 300 segmented verbs was validated, showing: • 81% word accuracy and 85.92% syllable accuracy for the heuristic • 99.33% word accuracy and 99.5% syllable accuracy for the string matching approach
Syllable segmentation – results sample גֻּלַּם • -גּוּ-לַּמְ-תִּי • -גֻּ-לַּמְ-תִּי • -גּוּ-לַּמְ-תָּ • -גֻּ-לַּמְ-תָּ • -גּוּ-לַּמְתְּ • -גֻּ-לַּמְתְּ • -גּוּ-לַּמְ-תֶּם • -גֻּ-לַּמְ-תֶּם • -גּוּ-לַּמְ-תֶּן • -גֻּ-לַּמְ-תֶּן • -גּוּ-לְּמוּ • -גֻּ-לְּמוּ • -גּוּ-לַּם • -גֻּ-לַּם • -גּוּ-לְּמָה • -גֻּ-לְּמָה • -גּוּ-לַּמְ-נוּ • -גֻּ-לַּמְ-נוּ
Verb classification to patterns • Method • We implemented a classifier (SVM) which: • Takes: • A non-vocalized verb (v) • Returns: • The pattern corresponding to v • The SVM uses: • Dataset: • Over 2,700 verbs from our verb list • 70% are used for training and 30% for testing • Features: • Word length • Letter positions • Guttural letter positions • Evaluation • 90.25% of the verbs were classified correctly into their corresponding Hebrew pattern
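The three feature groups listed above can be sketched as a fixed-length vector. The encoding below is an assumption for illustration only: the slides do not say how letters were encoded, which letters count as guttural (א, ה, ח, ע are assumed here), or how long a verb may be (`MAX_LEN` is a hypothetical cap).

```python
ALEF = 0x05D0
GUTTURALS = set("\u05D0\u05D4\u05D7\u05E2")  # assumed set: Alef, He, Het, Ayin
MAX_LEN = 6  # hypothetical maximum verb length; longer verbs are truncated


def verb_features(verb, max_len=MAX_LEN):
    """Encode a non-vocalized verb as [length, letter codes, guttural flags]."""
    features = [len(verb)]
    for pos in range(max_len):
        # Letter identity per position: 1..27 for Alef..Tav, 0 for padding.
        features.append(ord(verb[pos]) - ALEF + 1 if pos < len(verb) else 0)
    for pos in range(max_len):
        # 1 if the letter at this position is guttural, else 0.
        features.append(1 if pos < len(verb) and verb[pos] in GUTTURALS else 0)
    return features
```

Vectors of this shape could be passed to any SVM implementation, for example scikit-learn's `SVC`, after a 70/30 train/test split. A one-hot encoding of the letter at each position would usually suit an SVM better than the integer codes used here for brevity.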
Unknown verb classification to inflection tables • Method • We implemented a classifier (SVM) which: • Takes: • A non-vocalized verb (v) • Returns: • The inflection table corresponding to v • The SVM uses: • Dataset: • Over 2,700 verbs from our verb list • 70% are used for training and 30% for testing • Features: • Word length • Letter positions • Guttural letter positions • Corpus-level features (from the 50M-word morphologically disambiguated corpus) • Evaluation • Without corpus-level features: 68.63% accuracy • With corpus-level features: 70.08% accuracy
Discussion • The Hebrew verb inflectional model • Q: How complex must the computational model be for full morphological and vocalization generation of verbs? • A: By implementing 264 inflection tables we achieve 99.4% accuracy • Q: How much lexical knowledge, and how many exceptions, are required to cover the Hebrew verb lexicon? • A: The 264 implemented inflection tables include many exception tables, each describing the inflectional model of only a few verbs • Our more general model, which classifies unknown verbs, yields 70% accuracy (selecting 1 inflection table out of the total 264 tables) • For comparison, the baseline of always choosing the most frequent inflection table yields only 34% accuracy • A rough estimate shows that over 93% of the verbs in a large corpus exist in our dataset; moreover, most unknown verbs are either misspelled or falsely tagged as verbs
Discussion • Syllable segmentation • Q: How complex is syllable segmentation? • A: Contrary to what traditional grammars suggest, a few simple rules do not provide highly accurate segmentation • We achieved 99.3% word accuracy and 99.5% syllable accuracy through Shva classification • Q: What level of knowledge is required for successful syllable segmentation? • A: Using the vocalized word only, we achieve correct word segmentation with 81% accuracy • Using the base tense form as well improves word accuracy to 99.3% • This improvement suggests: • Hebrew phonology uses a constructive process, which derives inflections from base tense forms • Inflections are not generated in a pipeline process, in which morphology would first generate inflections that are later segmented into phonological units
Future work • Generation • Implementing rare inflection tables • Implementing inflection tables for nouns • Syllable segmentation • Searching for optimal Hebrew string matching weights • Machine learning of syllable segmentation
Future work • Unknown verb classification • Using vocalized corpora to extract corpus-level features • Performing feature selection • Classification of vocalized verbs into inflection tables • Classification of inflections into inflection tables • Exploring the SVM parameters • Automatic vocalization • We hope to obtain a substantial vocalized corpus (the Aviv encyclopedia), which will enable: • Setting a baseline for automatic vocalization using a modern vocalized corpus • Improving the baseline through supervised learning