CRBLP’s (Center for Research on Bangla Language Processing) Activities and Achievements on Bangla Language Processing, January 2007 Naushad UzZaman CRBLP, BRAC U, Bangladesh http://www.naushadzaman.com CRBLP's activities and achievements, January 2007
CRBLP’s Activities
• The Center for Research on Bangla Language Processing (CRBLP) has been working on Bangla language processing since 2004
• 11 research staff (9 with a computer science background, 2 with a linguistics background)
• Students working part-time or doing internships
• 13 Summer 2006 interns and 7 former members
• Motivation of open source
• Academic
• Offered a course on language processing (CSE 431: Natural Language Processing, offered in Spring 2006 and Spring 2007 at BRAC U)
• Theses on NLP
• Summer internships
CRBLP Members (Full Time Staff Members)
• Dr. Mumit Khan, Head, CRBLP and Associate Professor, CSE Department
• Matin Saad Abdullah, Program Manager, CRBLP and Senior Lecturer, CSE Department
• Naira Khan, Linguist, CRBLP and Lecturer, English and Humanities (On Leave)
• Zahurul Islam, Research Programmer, CRBLP and Part-time Faculty Member, CSE
• Naushad UzZaman, Research Programmer, CRBLP and Part-time Faculty Member, CSE
• Md. Abul Hasnat, Research Programmer
• S. M. Murtoza Habib, Research Programmer
• Firoj Alam, Research Programmer
CRBLP Members (Part-time and Interns) • Part Time Staff Members • Kamrul Hayder, Language Consultant • M. Abdur Rahman, Research Assistant • Maruf Muqtadir, Research Assistant • Summer 2006 Research Interns • Fahim Muhammad Hasan • M. Hammad Ali • Ayesha Binte Mosaddeque • Nafid Haque • Yeasir Arafat • Nizam Uddin • M. Abdur Rahman • Fahim Tawfique Chowdhury • Munirul Mansur • Md. Jahangir Alam • Annajiat Alim Rasel • Munshi Asadullah • Salman Zaman CRBLP's activities and achievements, January 2007
Areas of Research
• Document Authoring
• Information Retrieval
• Optical Character Recognition
• Pronunciation Generator
• Speech Processing
• Morphology
• Parts of Speech Tagging
• Syntax
• A few other small research projects
Document Authoring, BanglaPad
The current version of BanglaPad includes the following features:
1. Platform independent (current version tested on Windows and Linux).
2. Edit Bangla and English text in the same document.
3. Rich text editing with pictures and tables.
4. Export documents as HTML (you can develop web content in Bangla using this feature).
5. Supports character encodings including UTF-8 and UTF-16.
6. Bangla and English spell checking (the Bangla spelling checker uses the Puspa speller).
7. Bangla and English search and replace.
8. Printing of formatted documents.
9. Three different skins for the editor.
10. Built-in keyboard driver for easy Bangla typing (no need to install a keyboard driver).
11. Customizable key maps for Bangla.
12. Easy-to-use installer for Windows.
Spelling Checker in BanglaPad
Rich Text Editing in BanglaPad and exporting to HTML
BanglaPad Download and Team Members • Download: http://sourceforge.net/project/showfiles.php?group_id=158301&package_id=180246 • Developers: Zahurul Islam [August 2005 - December 2005] Naushad UzZaman [August 2005 - December 2005] Abdur Rahman [January 2006 - present] Maruf Muqtadir [January 2006 - present] • Advisors: Matin Saad Abdullah [August 2005 - present] Mumit Khan [August 2005 - present] CRBLP's activities and achievements, January 2007
English to Bangla Transliteration
• Type phonetically in English and get a similar-sounding dictionary word. It can be used for Bangla text input with an English keyboard.
• Developed by Naushad UzZaman
• Relevant Publications:
1. Naushad UzZaman, Arnab Zaheen and Mumit Khan, A Comprehensive Roman (English) to Bangla Transliteration Scheme, Proc. International Conference on Computer Processing on Bangla (ICCPB-2006), Dhaka, Bangladesh, 17 February, 2006.
2. Naushad UzZaman, Phonetic Encoding for Bangla and its Application to Spelling Checker, Name Searching, Transliteration and Cross Language Information Retrieval, Undergraduate Thesis (Computer Science), BRAC University, May 2005.
Pata, English to Bangla Transliteration
Spelling Checker
• Bangla Speller Sandbox: Bangla Phonetic Speller (Puspa). Suggests corrections for misspelled words based on similarity in pronunciation. Implemented using a Double Metaphone phonetic encoding.
• Developed by Naushad UzZaman
• Download: http://sourceforge.net/project/showfiles.php?group_id=158301&package_id=180247
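The idea of suggesting words by pronunciation can be sketched as follows. This is only an illustration: the sound classes below are a hypothetical toy table for romanized words, not the Bangla Double Metaphone encoding used by Puspa, and the names `SOUND_CLASSES`, `phonetic_key`, `build_index` and `suggest` are made up for this example.

```python
from collections import defaultdict

# Hypothetical sound classes for romanized text (an assumption for illustration,
# not Puspa's Bangla encoding): letters that sound alike share one code.
SOUND_CLASSES = {
    "b": "B", "v": "B", "p": "P", "f": "P",
    "c": "K", "k": "K", "q": "K", "g": "K",
    "s": "S", "z": "S", "j": "J",
    "d": "T", "t": "T",
    "m": "M", "n": "N", "l": "L", "r": "R",
}

def phonetic_key(word):
    """Skip vowels and unmapped letters, then collapse consecutive identical codes."""
    key = []
    for ch in word.lower():
        code = SOUND_CLASSES.get(ch)
        if code and (not key or key[-1] != code):
            key.append(code)
    return "".join(key)

def build_index(dictionary):
    """Group dictionary words by their phonetic key."""
    index = defaultdict(list)
    for word in dictionary:
        index[phonetic_key(word)].append(word)
    return index

def suggest(word, index):
    """Return dictionary words that share the misspelled word's phonetic key."""
    return index.get(phonetic_key(word), [])

index = build_index(["bangla", "kotha", "desh"])
print(suggest("vangla", index))
print(suggest("cota", index))
```

A misspelling and its intended word map to the same key whenever they differ only in letters that sound alike, which is what makes the lookup tolerant of phonetic errors.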
Publications on Spelling Checker 1. Naushad UzZaman and Mumit Khan, A Bangla Phonetic Encoding for Better Spelling Suggestions, Proc. 7th International Conference on Computer and Information Technology (ICCIT 2004), Dhaka, Bangladesh, December 2004. 2. Naushad UzZaman and Mumit Khan, A Double Metaphone Encoding for Bangla and its Application in Spelling Checker, Proc. 2005 IEEE International Conference on Natural Language Processing and Knowledge Engineering, pp. 705-710, Wuhan, China, October 30 - November 1, 2005. 3. Naushad UzZaman and Mumit Khan, A Comprehensive Bangla Spelling Checker, Proc. International Conference on Computer Processing on Bangla (ICCPB-2006), Dhaka, Bangladesh, 17 February, 2006. 4. Naushad UzZaman, Phonetic Encoding for Bangla and its Application to Spelling Checker, Name Searching, Transliteration and Cross Language Information Retrieval, Undergraduate Thesis (Computer Science), BRAC University, May 2005. 5. Munshi Asadullah, Md. Zahurul Islam, and Mumit Khan, Error-tolerant Finite-state Recognizer and String Pattern Similarity Based Spell-Checker for Bengali, to appear in the Proc. of International Conference on Natural Language Processing, ICON 2007, January 2007. CRBLP's activities and achievements, January 2007
Puspa Spelling Checker
Search Engine • Bangla search engine based on open-source search engine Nutch. • Developed by M. Hammad Ali and Nafid Haque • Relevant Publications: 1. Nafid Haque, Hammad Ali, Mumit Khan, and Matin Saad Abdullah, Infrastructure for Bangla Information retrieval in the context of ICT for Development, to appear in the Proc. of International Conference on Systems, Computing Sciences and Software Engineering (SCSS 06) of International Joint Conferences on Computer, Information, and Systems Sciences, and Engineering (CISSE 06), December 4 - 14, 2006. 2. M. Hammad Ali, Nafid Haque, A Decentralised Approach to Information Retrieval for a developing country like Bangladesh, Education Without Borders 2007, Abu Dhabi, February 25 - 27, 2007. CRBLP's activities and achievements, January 2007
Search Engine example
Optical Character Recognition
• BanglaOCR is an optical character recognizer for Bangla script. It takes scanned images of a printed page or document as input and converts them into editable Unicode text. BanglaOCR lets users train on a data set from any document and observe the recognition performance.
• BanglaOCR was developed by Md. Abul Hasnat and S M Murtoza Habib.
• Download: http://sourceforge.net/project/showfiles.php?group_id=158301&package_id=215908
• Another OCR, implemented using a Kohonen network, was developed by Shoeb Shatil.
• Download: http://sourceforge.net/project/showfiles.php?group_id=158301&package_id=180249
OCR Status
• OCR application status: Version 0.1, Release Candidate 1
• Status of the different segments of the OCR:
– Document skew correction: Bangla document skew corrector based on the Radon transform. Status: complete.
– Segmentation: Bangla line segmentation. Status: complete. Bangla word segmentation. Status: complete. Bangla character segmentation. Status: work in progress. The large number of combinations (consonant clusters and non-spacing marks) complicates this task. The OCR is omnifont, so segmentation must work with any typeface.
– Character/symbol recognition: Neural-net-based recognizer. Status: fairly complete for the basic alphabet and a subset of the consonant clusters; the non-spacing marks pose a significant challenge. Hidden Markov Model (HMM) based recognizer. Status: first demo available.
– Post-processing for OCR: a post-processing spelling checker corrects spelling mistakes caused by unsuccessful recognition. Status: first demo available.
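As a rough sketch of how Radon-style skew estimation can work (not the BanglaOCR implementation): project the black pixels along each candidate angle and pick the angle whose projection is sharpest, since text lines collapse into a few tall peaks only when projected along their own direction. The function names, the candidate range of -10 to 10 degrees, and the sum-of-squares sharpness measure are assumptions made for this example.

```python
import math
from collections import Counter

def projection_sharpness(points, angle_deg):
    """Project (x, y) points along the given angle and measure how peaked the profile is."""
    t = math.radians(angle_deg)
    cos_t, sin_t = math.cos(t), math.sin(t)
    # Signed distance of each point from a line through the origin at this angle,
    # rounded into integer bins (one discrete Radon projection).
    bins = Counter(round(y * cos_t - x * sin_t) for x, y in points)
    # Sum of squared bin counts is largest when points pile into few bins.
    return sum(c * c for c in bins.values())

def estimate_skew(points, candidates=range(-10, 11)):
    """Return the candidate angle (degrees) with the sharpest projection profile."""
    return max(candidates, key=lambda a: projection_sharpness(points, a))

# Synthetic "page": five text baselines tilted by 5 degrees (hypothetical data).
slope = math.tan(math.radians(5))
page = [(x, 20 * k + x * slope) for k in range(5) for x in range(200)]
print(estimate_skew(page))
```

Once the angle is known, rotating the page by its negative straightens the lines before segmentation.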
BanglaOCR
OCR Related Publications 1. Md. Abul Hasnat, S M Murtoza Habib and Mumit Khan, Segmentation free Bangla OCR using HMM: Training and Recognition, to appear in the Proc. of 1st International Conference on Digital Communications and Computer Applications (DCCA2007), Irbid, Jordan, 2007. 2. A. M. Shoeb Shatil and Mumit Khan, Minimally Segmenting High Performance Bangla OCR using Kohonen Network, to appear in the Proc. of 9th International Conference on Computer and Information Technology (ICCIT 2006), Dhaka, Bangladesh, December 2006. 3. S. M. Murtoza Habib, Nawsher Ahmed Noor and Mumit Khan, Skew correction of Bangla script using Radon Transform, to appear in the Proc. of 9th International Conference on Computer and Information Technology (ICCIT 2006), Dhaka, Bangladesh, December 2006. CRBLP's activities and achievements, January 2007
Automated Pronunciation Generator
• Pronunciation Generator: given any Bangla word as input, this application outputs its pronunciation in IPA (International Phonetic Alphabet).
• Demo available online at: http://student.bu.ac.bd/%7Eu02201011/g2pweb/g2p1.htm
• Source code available online at: http://student.bu.ac.bd/%7Eu02201011/g2pweb/
• Developed by Ayesha Binte Mosaddeque
• Relevant Publication: Ayesha Binte Mosaddeque, Naushad UzZaman and Mumit Khan, Rule based Automated Pronunciation Generator, to appear in the Proc. of 9th International Conference on Computer and Information Technology (ICCIT 2006), Dhaka, Bangladesh, December 2006.
Bangla Pronunciation Generator
Speech Processing • Text-to-speech • Voice for Festival. • Status: First demo available, Developed by Firoj Alam • Automatic Speech Recognition • Isolated Speech Recognition, Developed by A K M Mahmudul Hoque • Continuous Speech Recognition. Status: First demo available. Developed by Md. Abul Hasnat CRBLP's activities and achievements, January 2007
Speech Related Publications 1. Firoj Alam and Promila Kanti Nath, Bangla Text to Speech using Festival, Undergraduate Thesis (Computer Science), BRAC University, May 2006. Supervisor: Mumit Khan. 2. A K M Mahmudul Hoque, Bangla Speech Recognition, Undergraduate Thesis (Computer Science), BRAC University, May 2006. Supervisor: Mumit Khan. 3. Firoj Alam, Promila Kanti Nath and Mumit Khan, Text To Speech for Bangla Language using Festival, to appear in the Proc. of 1st International Conference on Digital Communications and Computer Applications (DCCA2007), Irbid, Jordan, 2007. CRBLP's activities and achievements, January 2007
Morphology
• Morphology: the branch of grammar that studies the structure or forms of words.
• Work done on Bangla morphology:
• Generative verb morphology using two-level rules
• Basic concatenative noun morphology with features
• Software developed: Jkimmo, a multilingual computational morphology framework for PC-KIMMO. Developed by Md. Zahurul Islam.
• Download Jkimmo: http://sourceforge.net/project/showfiles.php?group_id=158301&package_id=180248
Morphological Analyzer Jkimmo
Morphology Related Publications 1. Sajib Dasgupta and Mumit Khan, Morphological Parsing of Bangla Words using PC-KIMMO, Proc. 7th International Conference on Computer and Information Technology, Dhaka, Bangladesh, December, 2004. 2. Sajib Dasgupta and Mumit Khan, Feature Unification for Morphological Parsing in Bangla, Proc. 7th International Conference on Computer and Information Technology, Dhaka, Bangladesh, December, 2004. 3. Sajib Dasgupta, Dewan Shahriar Hossain Pavel, Asif Iqbal Sarkar, Naira Khan and Mumit Khan, Morphological Analysis of Inflecting Compound Words in Bangla, Proc. 8th International Conference on Computer & Information Technology (ICCIT), Islamic University of Technology (IUT), Dhaka, Bangladesh, 2005. 4. Md. Zahurul Islam and Mumit Khan, JKimmo: A Multilingual Computational Morphology Framework for PC-KIMMO, to appear in the Proc. of 9th International Conference on Computer and Information Technology (ICCIT 2006), Dhaka, Bangladesh, December 2006. CRBLP's activities and achievements, January 2007
Bangla Parts of Speech (POS) Tagging
• This application tags each word in a sentence with its part of speech. HMM, n-gram, and Brill’s transformation-based POS taggers were implemented and compared for Bangla, Hindi and Telugu on corpora of different sizes. For Bangla, they were also compared on tagsets of different sizes.
• Developed by Fahim Muhammad Hasan.
• Relevant Publication: Fahim Muhammad Hasan, Naushad UzZaman and Mumit Khan, Comparison of different POS Tagging Techniques (n-gram, HMM and Brill's tagger) for Bangla, to appear in the Proc. of International Conference on Systems, Computing Sciences and Software Engineering (SCSS 06) of International Joint Conferences on Computer, Information, and Systems Sciences, and Engineering (CISSE 06), December 4 - 14, 2006.
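To make the n-gram approach concrete, here is a minimal sketch of a bigram tagger with unigram backoff. It is not the tagger evaluated in the publication; the English toy corpus, the tag names, and the default tag for unknown words are all assumptions for illustration.

```python
from collections import Counter, defaultdict

def train(tagged_sentences):
    """Count tags per word (unigram) and per (previous tag, word) pair (bigram)."""
    unigram = defaultdict(Counter)
    bigram = defaultdict(Counter)
    for sentence in tagged_sentences:
        prev = "<s>"  # sentence-start marker
        for word, tag in sentence:
            unigram[word][tag] += 1
            bigram[(prev, word)][tag] += 1
            prev = tag
    return unigram, bigram

def tag(words, unigram, bigram, default="NOUN"):
    """Tag left to right: bigram context first, then unigram, then a default tag."""
    tags = []
    prev = "<s>"
    for word in words:
        if (prev, word) in bigram:
            best = bigram[(prev, word)].most_common(1)[0][0]
        elif word in unigram:
            best = unigram[word].most_common(1)[0][0]
        else:
            best = default  # unknown word: fall back to an assumed default tag
        tags.append(best)
        prev = best
    return tags

# Hypothetical toy training data.
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("the", "DET"), ("run", "NOUN"), ("ends", "VERB")],
    [("dogs", "NOUN"), ("run", "VERB")],
    [("they", "PRON"), ("run", "VERB")],
]
unigram, bigram = train(corpus)
print(tag(["the", "run"], unigram, bigram))
print(tag(["dogs", "run"], unigram, bigram))
```

The ambiguous word "run" receives a different tag depending on the tag before it, which is the information a unigram tagger alone cannot use.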
POS Tagging example
Syntax • Syntax: the grammatical arrangement of words in sentences • Bangla syntactic analysis using • Lexical Functional Grammar (LFG) formalism • Head-driven Phrase Structure Grammar (HPSG) formalism • Work done by Naira Khan, Ayesha Binte Mosaddeque, M Hammad Ali and Nafid Haque. • Relevant Publications: 1. Md. Nasimul Haque and M. Khan, Parsing Bangla using LFG: An Introduction, BRAC University Journal, Vol 2, No. 2, 2005. 2. Naira Khan and Mumit Khan, Developing a Computational Grammar for Bengali using the HPSG Formalism, to appear in the Proc. of 9th International Conference on Computer and Information Technology (ICCIT 2006), Dhaka, Bangladesh, December 2006. 3. Ayesha Binte Mosaddeque, M. Hammad Ali and Nafid Haque, Design of Head-Driven Phrase Structure Grammer for Bangla, Undergraduate Thesis (Computer Science), BRAC University, December 2006. Supervisor: Mumit Khan. CRBLP's activities and achievements, January 2007
Small Research Projects
Bangla Grammar Checker • Implemented a statistical Bangla grammar checker based on n-gram analysis. • Developed by Md. Jahangir Alam. • Relevant Publications: Md. Jahangir Alam, Naushad UzZaman and Mumit Khan, N-gram based Statistical Grammar Checker for Bangla and English, to appear in the Proc. of 9th International Conference on Computer and Information Technology (ICCIT 2006), Dhaka, Bangladesh, December 2006. CRBLP's activities and achievements, January 2007
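A statistical check of this kind can be sketched with bigram counts: a sentence is flagged when it contains a bigram never seen in the training data, which is the same as its unsmoothed bigram probability being zero. The sketch below works on word bigrams over a tiny made-up English corpus; the unit of analysis, the absence of smoothing, and the function names are assumptions, not details taken from the publication.

```python
from collections import Counter

def train_bigrams(sentences):
    """Count adjacent token pairs, with sentence boundary markers."""
    counts = Counter()
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        counts.update(zip(tokens, tokens[1:]))
    return counts

def unseen_bigrams(sentence, counts):
    """List the bigrams of the sentence that never occurred in training."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return [bg for bg in zip(tokens, tokens[1:]) if counts[bg] == 0]

def is_probably_grammatical(sentence, counts):
    """Accept a sentence only if every one of its bigrams was seen in training."""
    return not unseen_bigrams(sentence, counts)

# Hypothetical toy training data.
counts = train_bigrams(["he reads a book", "she reads a letter"])
print(is_probably_grammatical("she reads a book", counts))
print(unseen_bigrams("reads she book a", counts))
```

With a realistic corpus some form of smoothing or a probability threshold would be needed, since correct sentences routinely contain unseen bigrams.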
Bangla Text Categorization
• Implemented Bangla text categorization based on n-gram analysis, trained on the Prothom Alo newspaper corpus with 6 categories.
• Developed by Munirul Mansur.
• Relevant Publications:
1. Munirul Mansur, Naushad UzZaman and Mumit Khan, Analysis of N-gram based text categorization for Bangla in a newspaper corpus, to appear in the Proc. of 9th International Conference on Computer and Information Technology (ICCIT 2006), Dhaka, Bangladesh, December 2006.
2. Munirul Mansur, Analysis of n-gram based text categorization for Bangla in a newspaper corpus, Undergraduate Thesis (Computer Science), BRAC University, August 2006. Supervisor: Mumit Khan.
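One common way to do n-gram based categorization is to compare ranked character n-gram profiles with an "out-of-place" distance. The sketch below illustrates that general technique and may differ from what was actually implemented; the toy category texts, the trigram size, and the profile length are assumptions.

```python
from collections import Counter

def profile(text, n=3, top=300):
    """Rank the most frequent character n-grams of a text (rank 0 = most frequent)."""
    text = " ".join(text.lower().split())  # normalize whitespace
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    ranked = [g for g, _ in grams.most_common(top)]
    return {g: rank for rank, g in enumerate(ranked)}

def out_of_place(doc_profile, cat_profile):
    """Sum of rank differences; n-grams missing from the category get a maximum penalty."""
    max_penalty = len(cat_profile)
    return sum(
        abs(rank - cat_profile[g]) if g in cat_profile else max_penalty
        for g, rank in doc_profile.items()
    )

def categorize(text, category_profiles, n=3):
    """Pick the category whose profile is closest to the document's profile."""
    doc = profile(text, n)
    return min(category_profiles, key=lambda c: out_of_place(doc, category_profiles[c]))

# Hypothetical toy training texts, one per category.
profiles = {
    "sports": profile("cricket wicket cricket wicket"),
    "business": profile("bank loan bank loan"),
}
print(categorize("cricket wicket", profiles))
print(categorize("bank loan", profiles))
```

Because the profiles are built from characters rather than words, the same code runs unchanged on Bangla text.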
Analysis of Prothom-Alo newspaper Corpus • Frequency analysis of 1 year Prothom-Alo newspaper corpus. • Relevant Publications: 1. Yeasir Arafat, Analysis and Observations From a Bangla news corpus, Undergraduate Thesis (Computer Science), BRAC University, August 2006. Supervisor: Mumit Khan. 2. Yeasir Arafat, Md. Zahurul Islam and Mumit Khan, Analysis and Observations From a Bangla news corpus, to appear in the Proc. of 9th International Conference on Computer and Information Technology (ICCIT 2006), Dhaka, Bangladesh, December 2006. CRBLP's activities and achievements, January 2007
Language Modeling, forward and backward n-gram • Investigating the prospect of backward n-gram compared to forward n-gram for Bangla. • Relevant Publication: Naira Khan, Md. Tarek Habib, Md. Jahangir Alam, Rajib Rahman, Naushad UzZaman and Mumit Khan, History (forward n-gram) or Future (backward n-gram)? Which model to consider for n-gram analysis in Bangla?, to appear in the Proc. of 9th International Conference on Computer and Information Technology (ICCIT 2006), Dhaka, Bangladesh, December 2006. CRBLP's activities and achievements, January 2007
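The difference between the two models can be shown with bigrams: a forward model conditions a word on the word before it (history), while a backward model conditions it on the word after it (future). The following is a minimal sketch with unsmoothed maximum-likelihood estimates on made-up data; the function names and the toy sentences are assumptions.

```python
from collections import Counter

def bigram_models(sentences):
    """Return forward P(word | previous) and backward P(word | next) estimators."""
    pair = Counter()   # counts of (left word, right word)
    left = Counter()   # how often a word appears as the left element (history)
    right = Counter()  # how often a word appears as the right element (future)
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        for a, b in zip(tokens, tokens[1:]):
            pair[(a, b)] += 1
            left[a] += 1
            right[b] += 1

    def forward(word, prev):
        """P(word | prev): condition on the preceding word."""
        return pair[(prev, word)] / left[prev] if left[prev] else 0.0

    def backward(word, nxt):
        """P(word | nxt): condition on the following word."""
        return pair[(word, nxt)] / right[nxt] if right[nxt] else 0.0

    return forward, backward

forward, backward = bigram_models(["a b", "a c", "d b"])
print(forward("c", "a"))   # "a" is followed by "b" or "c"
print(backward("a", "c"))  # "c" is only ever preceded by "a"
```

The two directions can give different probabilities for the same word pair, which is why it is worth asking which one predicts better for a given language.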
Font Converter • Converts different TTF fonts to Unicode encoding. Status: Completed for Ullash, Prothoma, Bangsi Alpona fonts. • Developed by Md. Zahurul Islam. • Download: http://sourceforge.net/project/showfiles.php?group_id=158301&package_id=180250 CRBLP's activities and achievements, January 2007
Stemming
• Stemming: an algorithm that reduces a search query to its stem or root form, so that variations of a word, such as past tense or plural and singular forms, are taken into account when performing a search. For example, applies, applying and applied all match apply.
• Relevant Publication: Md. Zahurul Islam, Md. Nizam Uddin and Mumit Khan, A Light Weight Stemmer for Bengali and Its Use in Spelling Checker, to appear in the Proc. of 1st International Conference on Digital Communications and Computer Applications (DCCA2007), Irbid, Jordan, 2007.
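The apply example can be reproduced with a few suffix-stripping rules. This is only a sketch of the light-stemming idea on English; the rule list and the minimum stem length are hypothetical and are not the Bengali rules from the publication.

```python
# Hypothetical English suffix rules as (suffix, replacement), checked in order.
SUFFIX_RULES = [("ies", "y"), ("ied", "y"), ("ing", ""), ("ed", ""), ("s", "")]

def light_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters of the word."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)] + replacement
    return word

for w in ["applies", "applying", "applied", "apply"]:
    print(w, "->", light_stem(w))
```

Stemming both the query and the indexed words with the same function lets all four forms match each other in a search.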
Text Summarization
• Text summarization is a technique that automatically creates an abstract or summary of a text. This study surveys prior work in the area and implements an extraction-based text summarizer for Bangla.
• Relevant work: Md. Nizam Uddin, "A Study on Text Summarization Techniques and an Approach for Bangla Text Summarization", Independent Study, Computer Science, BRAC University, December 2006, Supervisor: Md. Zahurul Islam, Mumit Khan
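Extraction-based summarization selects existing sentences instead of generating new ones. A minimal sketch, assuming a simple word-frequency score (the study's actual scoring method is not described here, and the function name and sample text are made up):

```python
import re
from collections import Counter

PUNCTUATION = ".,!?;:।"  # includes the Bangla danda

def tokenize(sentence):
    """Lowercase, split on whitespace, and trim surrounding punctuation."""
    words = (w.strip(PUNCTUATION) for w in sentence.lower().split())
    return [w for w in words if w]

def summarize(text, k=1):
    """Return the k sentences with the highest average word frequency, in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?।])\s+", text) if s.strip()]
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        words = tokenize(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]

text = "Cats sleep. Cats and dogs play. Birds fly."
print(summarize(text, k=1))
```

Splitting on whitespace rather than on word-character patterns keeps Bangla words intact, since dependent vowel signs are combining marks that many regex word classes would split on.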
Language Resources
• Lexicon
• Wordlist of 160 thousand words with first-step part-of-speech tags.
• Corpus
• 1-year Prothom Alo newspaper corpus
• Charjapad and Boru Chandi Dash er kabbo corpus (edited by Md. Abdul Hai and Anwar Pasha)
CRBLP Publications
• 2004:
ICCIT 2004 (Bangladesh): 3 (Morphology, Spelling Checker)
Total: 3
• 2005:
IASTED CI 2005 (Canada): 1 (Name Searching)
IEEE NLP KE 2005 (China): 1 (Spelling Checker)
IEE Mobility 2005 (China): 1 (Text Input System for Mobile)
ICCIT 2005: 2 (Morphology, Compiler)
BU Journal: 1 (Morphological Parsing)
Undergraduate Thesis: 1 (Phonetic Encoding)
Total: 7
CRBLP Publications (cont.)
• 2006:
ICCPB 2006 (Bangladesh): 4 (Corpus, Lexicon, Spelling Checker, Transliteration)
ICCIT 2006 (Bangladesh): 11 (HPSG, Corpus Analysis, Text Categorization, Pronunciation Generator, Backward n-gram, Grammar Checker, Skew Correction, Traveler Information System, OCR using Kohonen Network, Mobile Messaging, Morphology)
CISSE 2006 (Online): 2 (Comparison of POS Tagging, Bangla Information Retrieval)
Undergraduate Thesis: 9 (Skew Correction, Mobile Messaging, Speech Recognition, OCR using Kohonen Network, Text to Speech, Corpus Analysis, Text Categorization, POS Tagging, HPSG)
Total: 24
• 2007:
ICON 2007 (India): 1 (Spelling Checker)
DCCA 2007 (Jordan): 5 (Stemming, OCR, Text to Speech, Semantics, Wireless LAN)
EWB 2007 (Abu Dhabi): 2 (Information Retrieval, Localization)
Total: 8*
* Till January 2007
CRBLP website
• http://www.bracu.ac.bd/research/crblp/