
Automatic Hebrew Vocalization

Automatic Hebrew Vocalization. By: Eran Tomer Advisor: Prof. Michael Elhadad. Natural Language Processing. The computational linguistics field attempts to model and study languages using computational techniques.


Presentation Transcript


  1. Automatic Hebrew Vocalization By: Eran Tomer Advisor: Prof. Michael Elhadad

  2. Natural Language Processing • The computational linguistics field attempts to model and study languages using computational techniques. • The diverse challenges confronted by computational linguistics researchers include: • Machine translation • Automatic text-summarization • Speech-to-text • Text-to-speech • Etc.

  3. Hebrew Natural Language Processing • Accomplishing NLP tasks for Hebrew is made difficult by two factors: • Lack of large-scale annotated resources: supervised learning is generally hard to apply • High ambiguity rate: a given Hebrew word may have many different meanings and pronunciations, e.g. ספר, שלט, שערה, משנה

  4. Related Work • The Hebrew TreeBank • 5,000 segmented and morphologically tagged sentences • Mila • Various corpora, lexicons and some NLP tools • Word Segmentation • Morphological tagging

  5. Motivation • Development of a Hebrew text-to-speech (TTS) system • A vocalized and syllabified word may be used as a normalized form for a Hebrew TTS system • Generation of vocalized text for teaching • Vocalized inflected words are widely used for teaching but are difficult to obtain (they do not appear in dictionaries) • Improving automatic translation systems

  6. Objectives • Generation • Automatically producing fully vocalized verb inflections with the corresponding morphological attributes • Syllable segmentation • Automatically segmenting vocalized words into syllables • Unknown verb classification • Classifying verbs into their corresponding patterns • Automatically selecting an inflection schema for an unknown verb

  7. Research Questions • The Hebrew verb • How complex must the computational model be to generate full verb morphology and vocalization? • How much lexical knowledge, and how many exceptions, are required to cover the Hebrew verb lexicon? • Syllable segmentation • How complex is syllable segmentation? • What level of knowledge is required for successful segmentation?

  8. Previous Work • Vocalization • Syllable segmentation • Generation

  9. Background – Hebrew Vocalization • Vowels vs. consonants • A consonant letter is either vocalized by a Shva (ְ) or left unvocalized at the end of a word. There are two types of Shva: Na and Nach • A letter pronounced with a vowel is vocalized by one of the following vowel and semi-vowel signs • O: Holam (ֹ), Holam Male (וֹ), Kamats Katan (ָ), Hataf Kamats (ֳ) • I: Hirik (ִ) • A: Kamats (ָ), Patah (ַ), Hataf Patah (ֲ) • E: Segol (ֶ), Tsere (ֵ), Hataf Segol (ֱ) • U: Kubuts (ֻ), Shuruk (וּ)

  10. Background – Hebrew Vocalization • Diacritic signs may change the pronunciation of letters • Dagesh • Dagesh (ּ) emphasizes letters, yet in modern Hebrew it affects only בּ/ב, כּ/כ and פּ/פ • Mapik • Mapik (ּ) denotes a consonantal (pronounced) Hey at the end of a word • Shin dots • Shin dots distinguish the pronunciation of ש as SH (שׁ) or S (שׂ) • Dagesh Kal vocalizes ב, ג, ד, כ, פ, ת • At the beginning of a word • After a Shva Nach • Dagesh Hazak vocalizes any letter other than א, ה, ח, ע, ר • Following certain linguistic phenomena • In some noun/verb patterns

  11. Background – Hebrew Vocalization • Syllables • Hebrew words are composed of syllables; a syllable is a phonological unit that is pronounced in one effort • Stress • A Hebrew word follows one of two stress schemes • Milel (מלעיל) denotes that the penultimate syllable is stressed • Milra (מלרע) denotes that the last syllable is stressed • Deficient spelling vs. plene spelling • In many cases there is more than one valid way to spell a given Hebrew word

  12. Background – Hebrew Vocalization • The syllables and vowels rule (כלל ההברות והתנועות) • Input: a syllable s, either stressed or unstressed • If s is unstressed: • if s is an open syllable, vocalize s with a long vowel • otherwise (s is closed), vocalize s with a short vowel • If s is stressed: • in most cases s is vocalized with a long vowel, yet the number of exceptions is considerable
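The rule above can be written as a small decision function. This is only a sketch of the rule as stated on the slide: the names `VowelLength` and `expected_vowel_length` are made up for illustration, and the stressed branch returns the "most cases" default without modelling any of the exceptions.

```python
from enum import Enum


class VowelLength(Enum):
    LONG = "long"
    SHORT = "short"


def expected_vowel_length(is_stressed: bool, is_open: bool) -> VowelLength:
    """Default vowel length for a syllable under the syllables-and-vowels rule."""
    if not is_stressed:
        # Unstressed: open syllables take a long vowel, closed ones a short vowel.
        return VowelLength.LONG if is_open else VowelLength.SHORT
    # Stressed: long in most cases. The many exceptions are NOT handled here.
    return VowelLength.LONG


if __name__ == "__main__":
    for stressed in (False, True):
        for open_syllable in (True, False):
            length = expected_vowel_length(stressed, open_syllable)
            print(stressed, open_syllable, length.value)
```

Because the stressed case has many exceptions, a function like this can only serve as a default to be overridden by lexical knowledge.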

  13. Background – Hebrew Vocalization • Examples

  14. Background – Hebrew Vocalization • Verbs • Morphological attributes • Patterns • Gender: masculine, feminine, both • Tense: past, Beinoni (participle), present, future, imperative • Person: first, second, third • Number: singular, plural

  15. Background – Hebrew Vocalization • The Hebrew paradigms • Hebrew verbs are clustered into several paradigms, each characterized by the way its verbs inflect • Complete paradigms (גזרות השלמים) • Crippled paradigms (גזרות נחות) • Defective paradigms (גזרות חסרות) • Etc. • Inflection tables • Paradigms are further partitioned into about 300 specific inflection tables, each describing the inflections of a specific verb family

  16. Background – Hebrew Vocalization • Inflection tables - example

  17. Datasets • Verb list • Over 4,000 manually gathered verbs • Each verb is given in deficient spelling, in the past tense, masculine, singular, 3rd person form • Shin dots are indicated • The corresponding inflection table is indicated for each verb • Morphologically analyzed corpora • About 50 million fully morphologically disambiguated words • Material from the “Haaretz” newspaper, the “Tapuz” website, the “Knesset” discussions and other resources

  18. Generation • Method • We implemented 264 inflection tables, which: • Take: • A verb (v) from our verb list dataset • A corresponding inflection table • Return: • Vocalized inflections of v with appropriate morphological tags • Results • A list of more than 240,000 vocalized verb forms with appropriate morphological attributes • Evaluation • A sample of over 15,000 inflected verbs was manually validated, with 99.4% accuracy

  19. Generation – results sample • C-20, פצפץ: • פִּצְפַּצְתִּי,PAST+FIRST+MF+SINGULAR+COMPLETE • פִּצְפַּצְתָּ,PAST+SECOND+M+SINGULAR+COMPLETE • פִּצְפַּצְתְּ,PAST+SECOND+F+SINGULAR+COMPLETE • פִּצְפֵּץ,PAST+THIRD+M+SINGULAR+COMPLETE • פִּצְפְּצָה,PAST+THIRD+F+SINGULAR+COMPLETE • פִּצְפַּצְנוּ,PAST+FIRST+MF+PLURAL+COMPLETE • פִּצְפַּצְתֶּם,PAST+SECOND+M+PLURAL+COMPLETE • פִּצְפַּצְתֶּן,PAST+SECOND+F+PLURAL+COMPLETE • פִּצְפְּצוּ,PAST+THIRD+M+PLURAL+COMPLETE • פִּצְפְּצוּ,PAST+THIRD+F+PLURAL+COMPLETE • …
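Each entry in the sample above pairs a vocalized form with a `+`-separated tag string. The following sketch shows one way such a line could be read into a record. The class and field names are hypothetical, the meaning of the fifth tag (`COMPLETE`) is not stated in the slides so it is kept as an opaque `variant` field, and the parser assumes every line carries exactly five tags.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class Inflection:
    form: str      # fully vocalized surface form
    tense: str     # e.g. PAST
    person: str    # e.g. FIRST
    gender: str    # e.g. M, F or MF
    number: str    # e.g. SINGULAR
    variant: str   # fifth tag; its meaning is not given in the slides


def parse_inflection(line: str) -> Inflection:
    """Parse 'form,TENSE+PERSON+GENDER+NUMBER+VARIANT' (assumed layout)."""
    form, tags = line.split(",", 1)
    tense, person, gender, number, variant = tags.strip().split("+")
    return Inflection(form.strip(), tense, person, gender, number, variant)


def strip_vocalization(text: str) -> str:
    """Drop combining marks (vowel points, Dagesh, Shin dots)."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))


sample = parse_inflection("פִּצְפַּצְתִּי,PAST+FIRST+MF+SINGULAR+COMPLETE")

if __name__ == "__main__":
    print(sample.tense, sample.person, sample.gender, sample.number)
    print(strip_vocalization(sample.form))
```

Stripping the combining marks recovers the unvocalized spelling, which is convenient when matching generated forms against unvocalized corpus text.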

  20. Syllable segmentation • Method • Syllable segmentation requires Shva classification • Shva Na marks a syllable start* • Shva Nach denotes a syllable end* • Each syllable includes exactly one vowel* • * According to the Even-Shoshan dictionary • We implemented two Shva classification schemes • Heuristic approach (Rabbi Eliyahu Behor) • Shva classification according to the base tense form
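To illustrate how the three principles above turn Shva labels into syllable boundaries, here is a simplified sketch. It is not the presenters' implementation: the function names are hypothetical, Shva labels are passed in as a list of `"na"`/`"nach"` strings in word order, and the handling of Shuruk and Holam Male (a Vav acting as the vowel of the preceding letter) is an assumption added so that forms ending in וּ come out right.

```python
SHVA, DAGESH, HOLAM, VAV = "\u05b0", "\u05bc", "\u05b9", "\u05d5"
# Hataf and full vowel points, U+05B1 through U+05BB.
VOWELS = {chr(c) for c in range(0x05B1, 0x05BC)}


def split_units(word):
    """Group each Hebrew letter with the points that follow it."""
    units = []
    for ch in word:
        if "\u05d0" <= ch <= "\u05ea" or not units:
            units.append(ch)
        else:
            units[-1] += ch
    return units


def is_vowel_vav(unit):
    """Shuruk (Vav + Dagesh, no vowel point) or Holam Male (Vav + Holam)."""
    marks = set(unit[1:])
    shuruk = DAGESH in marks and not (marks & VOWELS) and SHVA not in marks
    return unit[:1] == VAV and (shuruk or HOLAM in marks)


def segment(word, shva_labels):
    """Split a vocalized word into syllables, given its Shva labels."""
    units, labels = split_units(word), iter(shva_labels)
    syllables, current = [], ""
    has_nucleus = vav_is_nucleus = False
    for i, unit in enumerate(units):
        if vav_is_nucleus:  # this Vav is the vowel of the previous letter
            current += unit
            has_nucleus, vav_is_nucleus = True, False
            continue
        marks = set(unit[1:])
        has_vowel = bool(marks & VOWELS)
        has_shva = SHVA in marks
        nxt = units[i + 1] if i + 1 < len(units) else ""
        onset_of_vav = not has_vowel and not has_shva and is_vowel_vav(nxt)
        is_na = has_shva and next(labels) == "na"
        # A vowel, a Shva Na, or the onset of a vowel Vav opens a syllable,
        # but only once the current syllable already has its one vowel.
        if (has_vowel or is_na or onset_of_vav) and has_nucleus:
            syllables.append(current)
            current, has_nucleus = "", False
        current += unit
        has_nucleus = has_nucleus or has_vowel
        vav_is_nucleus = onset_of_vav
    if current:
        syllables.append(current)
    return syllables


if __name__ == "__main__":
    print("-".join(segment("פִּצְפַּצְנוּ", ["nach", "nach"])))
```

A letter with a Shva Nach, or with no point at all, simply joins the current syllable, which is how the "Shva Nach denotes a syllable end" principle shows up in the code.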

  21. Syllable segmentation • Heuristic approach • By Behor, a Shva is a Shva Na if: • It vocalizes the first letter of the word • It follows another Shva and it is not at the word end • It follows a long, stressed vowel (requires stress information) • It vocalizes a letter with Dagesh Hazak (requires the Dagesh type) • It vocalizes the first of two identical letters (many exceptions) • By our adapted heuristic: • A Shva is a Shva Na if: • It vocalizes the first letter of the word • It follows another Shva and it is not at the word end • It follows a long vowel • A Shva is a Shva Nach if: • It is followed by another Shva • In any other case, we use Shva Nach as the default
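The adapted heuristic can be sketched directly from the rules listed above. Two points in this sketch are assumptions rather than statements from the slides: the rules are tried in the order listed, and "long vowel" is approximated as Kamats, Tsere, Holam or Shuruk (Kamats Katan usually shares a code point with Kamats and is therefore not distinguished). Function names are hypothetical.

```python
SHVA, DAGESH, VAV = "\u05b0", "\u05bc", "\u05d5"
VOWELS = {chr(c) for c in range(0x05B1, 0x05BC)}   # hataf and full vowel points
LONG_POINTS = {"\u05b8", "\u05b5", "\u05b9"}        # Kamats, Tsere, Holam (assumed set)


def split_units(word):
    """Group each Hebrew letter with the points that follow it."""
    units = []
    for ch in word:
        if "\u05d0" <= ch <= "\u05ea" or not units:
            units.append(ch)
        else:
            units[-1] += ch
    return units


def has_long_vowel(unit):
    marks = set(unit[1:])
    shuruk = (unit[:1] == VAV and DAGESH in marks
              and not (marks & VOWELS) and SHVA not in marks)
    return bool(marks & LONG_POINTS) or shuruk


def classify_shvas(word):
    """Return 'na' or 'nach' for every Shva in the word, in order."""
    units = split_units(word)
    labels = []
    for i, unit in enumerate(units):
        if SHVA not in unit:
            continue
        is_last = i == len(units) - 1
        if i == 0:
            labels.append("na")        # vocalizes the first letter
        elif SHVA in units[i - 1] and not is_last:
            labels.append("na")        # follows another Shva, not word-final
        elif has_long_vowel(units[i - 1]):
            labels.append("na")        # follows a long vowel
        else:
            # Followed by another Shva, or any other case: Shva Nach.
            labels.append("nach")
    return labels


if __name__ == "__main__":
    print(classify_shvas("פִּצְפְּצוּ"))
```

Unlike Behor's original rules, nothing here needs stress or the Dagesh type, which is what makes the adapted version applicable to a bare vocalized string.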

  22. Syllable segmentation • Shva classification according to the base tense form • Through our generation mechanism, we can correlate verb inflections with their corresponding base-tense form • A Shva present in the base-tense form is a Shva Nach • Otherwise the Shva is a Shva Na • Matching inflections to base-tense forms • We use a dynamic programming string matching algorithm • Operation costs were customized to be character dependent, respecting the Hebrew inflectional model • Example alignment, character by character (I = insert, C = copy, R = replace; - marks a gap) • Inflection: ת ּ ִ ז ְ ד ּ ַ ק ּ ְ ק ִ י • Base form: י - ִ ז ְ ד ּ ַ ק ּ ֵ ק - - • Operations: R I C C C C C C C C R C I I
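A weighted edit-distance alignment of this kind can be sketched as follows. The cost values are placeholders chosen for illustration (the slides do not give the actual weights): changing or inserting a point is cheap, changing or inserting a letter is expensive. A Shva in the inflection is labelled Nach when it is copied from a Shva in the base form, and Na otherwise.

```python
import unicodedata

SHVA = "\u05b0"


def is_point(ch):
    """True for combining marks such as vowel points and Dagesh."""
    return unicodedata.combining(ch) != 0


def sub_cost(a, b):
    # Hypothetical weights: point-for-point is cheap, letter changes are costly.
    if a == b:
        return 0
    if is_point(a) and is_point(b):
        return 1
    return 3 if not is_point(a) and not is_point(b) else 4


def indel_cost(ch):
    return 1 if is_point(ch) else 2


def align(base, inflection):
    """Return a list of (op, base_char, inflection_char); op is C, R, I or D."""
    n, m = len(base), len(inflection)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + indel_cost(base[i - 1])
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + indel_cost(inflection[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            a, b = base[i - 1], inflection[j - 1]
            dp[i][j] = min(dp[i - 1][j - 1] + sub_cost(a, b),
                           dp[i - 1][j] + indel_cost(a),
                           dp[i][j - 1] + indel_cost(b))
    ops, i, j = [], n, m
    while i > 0 or j > 0:  # trace one optimal path back to the origin
        a = base[i - 1] if i > 0 else None
        b = inflection[j - 1] if j > 0 else None
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + sub_cost(a, b):
            ops.append(("C" if a == b else "R", a, b))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + indel_cost(a):
            ops.append(("D", a, None))
            i -= 1
        else:
            ops.append(("I", None, b))
            j -= 1
    return ops[::-1]


def classify_shvas_by_base(base, inflection):
    """Shva copied from the base form -> 'nach'; any other Shva -> 'na'."""
    return ["nach" if op == "C" else "na"
            for op, _, ch in align(base, inflection) if ch == SHVA]


if __name__ == "__main__":
    print(classify_shvas_by_base("יִזְדַּקֵּק", "תִּזְדַּקְּקִי"))
```

The design choice worth noting is that the costs depend on character class, so the alignment prefers to explain a difference as a vowel change before it considers moving consonants.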

  23. Syllable segmentation • Results • Thanks to our generation model, we obtain about 240k highly accurate vocalized verb forms • We applied our two approaches to produce two lists of verbs segmented into syllables: • By our heuristic approach (based on Behor’s heuristic) • By our customized string matching algorithm • Evaluation • A sample of 300 segmented verbs was validated, yielding: • 81% word accuracy and 85.92% syllable accuracy for the heuristic • 99.33% word accuracy and 99.5% syllable accuracy for the string matching approach

  24. Syllable segmentation – results sample גֻּלַּם • -גּוּ-לַּמְ-תִּי • -גֻּ-לַּמְ-תִּי • -גּוּ-לַּמְ-תָּ • -גֻּ-לַּמְ-תָּ • -גּוּ-לַּמְתְּ • -גֻּ-לַּמְתְּ • -גּוּ-לַּמְ-תֶּם • -גֻּ-לַּמְ-תֶּם • -גּוּ-לַּמְ-תֶּן • -גֻּ-לַּמְ-תֶּן • -גּוּ-לְּמוּ • -גֻּ-לְּמוּ • -גּוּ-לַּם • -גֻּ-לַּם • -גּוּ-לְּמָה • -גֻּ-לְּמָה • -גּוּ-לַּמְ-נוּ • -גֻּ-לַּמְ-נוּ

  25. Verb classification to patterns • Method • We implemented a classifier (SVM) which: • Takes: • A non-vocalized verb (v) • Returns: • The pattern corresponding to v • The SVM uses: • Dataset: • Over 2,700 verbs from our verb list • 70% are used for training and 30% for testing • Features: • Word length • Letter positions • Guttural letter positions • Evaluation • 90.25% of the verbs were classified correctly into their corresponding Hebrew pattern
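The three feature families listed above could be extracted along these lines. This is a sketch only: the slides do not say how the features were encoded for the SVM, the set of guttural letters (א, ה, ח, ע) is an assumption, and `MAX_LEN` and the feature names are hypothetical.

```python
GUTTURALS = set("אהחע")   # assumed guttural set; the slides do not list it
MAX_LEN = 8               # hypothetical cap on the number of positions encoded


def verb_features(verb: str, max_len: int = MAX_LEN) -> dict:
    """Word length, letter at each position, guttural flag at each position."""
    features = {"length": len(verb)}
    for i in range(max_len):
        letter = verb[i] if i < len(verb) else ""   # pad short verbs with ""
        features[f"letter_{i}"] = letter
        features[f"guttural_{i}"] = letter in GUTTURALS
    return features


if __name__ == "__main__":
    for name, value in verb_features("פצפץ").items():
        print(name, repr(value))
```

A dictionary like this would still need to be turned into a numeric vector (for example by one-hot encoding the letters) before being passed to an SVM.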

  26. Unknown verb classification to inflection tables • Method • We implemented a classifier (SVM) which: • Takes: • A non-vocalized verb (v) • Returns: • The inflection table corresponding to v • The SVM uses: • Dataset: • Over 2,700 verbs from our verb list • 70% are used for training and 30% for testing • Features: • Word length • Letter positions • Guttural letter positions • Corpus level features (50M-word morphologically disambiguated corpus) • Evaluation • Without corpus level features: 68.63% accuracy • With corpus level features: 70.08% accuracy

  27. Discussion • The Hebrew verb inflectional model • Q: How complex must the computational model be to generate full verb morphology and vocalization? • A: By implementing 264 inflection tables we achieve 99.4% accuracy • Q: How much lexical knowledge, and how many exceptions, are required to cover the Hebrew verb lexicon? • A: The 264 implemented inflection tables include many exception tables, each describing the inflectional model of only a few verbs • Our more general model, which classifies unknown verbs, yields 70% accuracy (selecting 1 inflection table out of the total 264) • For comparison, the baseline of always choosing the most frequent inflection table yields only 34% accuracy • A rough estimate shows that over 93% of the verbs in a large corpus exist in our dataset; moreover, most unknown verbs are either misspelled or falsely tagged as verbs
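The most-frequent-table baseline mentioned above is simple to state in code. The sketch below uses made-up labels purely to show the computation; it does not reproduce the 34% figure, and the function name is hypothetical.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == most_common_label)
    return hits / len(test_labels)


if __name__ == "__main__":
    # Toy inflection-table labels, invented for illustration only.
    train = ["T1", "T1", "T2"]
    test = ["T1", "T2", "T1", "T3"]
    print(majority_baseline_accuracy(train, test))
```

Any classifier over the 264 tables has to beat this number to show that its features carry information beyond the label distribution.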

  28. Discussion • Syllable segmentation • Q: How complex is syllable segmentation? • A: Contrary to traditional grammars, a few simple rules do not provide highly accurate segmentation • We achieved 99.3% word accuracy and 99.5% syllable accuracy through Shva classification • Q: What level of knowledge is required for successful syllable segmentation? • A: Using the vocalized word only, we achieve correct word segmentation with 81% accuracy • Using the base tense form as well improves word accuracy to 99.3% • This improvement suggests: • Hebrew phonology uses a constructive process, which derives inflections from base tense forms • Inflections are not generated in a pipeline process, in which morphology would first generate inflections that are later segmented into phonological units

  29. Future work • Generation • Implementing rare inflection tables • Implementing inflection tables for nouns • Syllable segmentation • Searching for optimal Hebrew string matching weights • Machine learning of syllable segmentation

  30. Future work • Unknown verb classification • Using vocalized corpora to extract corpus level features • Performing feature selection • Classification of vocalized verbs into inflection tables • Classification of inflections into inflection tables • Exploring the SVM parameters • Automatic vocalization • We hope to obtain a substantial vocalized corpus (the Aviv encyclopedia), which will enable: • Setting a baseline for automatic vocalization using a modern vocalized corpus • Improving the baseline through supervised learning

  31. The End
