INDRADHANUSH WORDNET DEVELOPMENT FOR PUNJABI LANGUAGE Dr. Suman Preet Department of Linguistics and Punjabi Lexicography, Punjabi University, Patiala
Nature of Task • Synset Creation for Nouns, Adjectives, Verbs and Adverbs • Creation of Language Specific Synsets • Sense Marking • Validation • New Synset Creation for Hindi WordNet
Goals Set in the Last PRSG • To complete the linking of 36,534 Synsets. • Validation of 36,534 Synsets. • To create 1000 LSS. • Creation and maintenance of Individual WordNet Group Websites • To complete sense marking on 1,00,000 words.
Presentation Outline • Financial Details • Sense Marking Details • Synset Creation Details • Validation Details • Problems and Suggestions
Financial Details • Total grant sanctioned: Rs 22,14,000/- • Total grant released: Rs 20,23,974/- • 1st year (released): Rs 11,44,000/- • 2nd year (released): Rs 08,79,974/- • Recently released: Rs 1,86,833/-
Sense Marking Details • Target: 1,00,000 words • Division of Target between Punjabi University and Thapar University The sense marking task was divided into two parts by mutual agreement, as shown above. The Punjabi University WordNet Group has achieved its target.
Record of Sense Marking Work by Punjabi University • Actions taken during Sense Marking • Words added to the Punjabi Synset File by actions one and two = 1132
Status of synsets completed as of 28 April 2013 The synset creation task was divided into two parts by mutual agreement, as shown above. The Punjabi University WordNet Group has completed its synset creation task.
INTERNAL VALIDATION DETAILS The validation task is being done by the Punjabi University WordNet Group.
PUNJABI LANGUAGE SPECIFIC SYNSETS Total: 1010 Noun: 961 Adjective: 16 Verb: 33
New Synset Creation for Hindi WordNet • New common synsets created by Punjabi University that were not present in the Hindi WordNet (Total 50) • भाजपाई, अकाली, बाबा बंदा सिंह बहादुर, समध्धर, सुलग्ग, दुआनी, कणकवंना, भागवान, मलटी ब्रांड, फिरकू, सामाजवादी पार्टी, तृणमूल कांग्रेस, पी .जी .आई. ऐम. ई. आर., नुक्कड़ नाटक, हफ़ीज़ाबाद, ब्यूटी पार्लर, लालपरी, समैक, मार्क्सवाद, मार्क्सवादी, बरनाला जिला, बरनाला शहर, असंवेदनशील, कॉल सेंटर, पीज़ा, राष्ट्रीय सुरक्षा परिषद, पीली नदी, बेसबॉल, डिस्पेंसरी, स्टॉकटन शहर, परमवीर चक्र, महावीर चक्र, कीर्ति चक्र, शौर्य चक्र, फासट फूड, स्ट्रीट फूड, जंक फूड, गुरू नानक देव युनिवर्सिटी, पद्म श्री, डिप्टी इनसपैकटर जनरल आफ़ पुलिस, डिप्टी जनरल आफ़ पुलिस, सब - डिवीजनल अफ़सर, लहिंदा पंजाब, पूर्बी पंजाब, किला लोहगड़्ह, अटारी, बादल, ग़दर पार्टी, ग़दर अख़बार, कर्तार सिंह सराभा • These words were taken from various Punjabi online newspapers such as DailyAjit, PunjabiTribune and Charhdikala
Problems occurring in Sense Marking • Problems related to English words • Problems related to compound words • Problems related to adjectives • Problems related to proverbs • Problems related to verbs
Problems Related to English Words • Borrowed or Accepted English Words • Comparative alternative present in the WordNet • Not found in WordNet • Proper sense not present • Abbreviations
PROBLEMS RELATED TO COMPOUND WORDS • Most of the common compound words do not exist in the WordNet. If we mark the parts of these compounds separately, the sense the compound actually conveys is lost. For example: • ਧੁੱਪ-ਛਾਂ,ਜੋੜ-ਤੋੜ, ਚੁਸਤ-ਦਰੁਸਤ, ਗੰਢ-ਤੁਪ, ਤੁੱਥ-ਮੁਥ, ਮੁੰਡੇ-ਕੁੜੀਆਂ, ਚੰਗੇ-ਭਲੇ, ਮੈਲੇ-ਕੁਚੈਲੇ, ਚੰਗੇ-ਭਲੇ, ਢੰਗ-ਤਰੀਕੇ, ਪੂਰਾ-ਪੂਰਾ, ਸ਼੍ਰੇਣੀ-ਵੰਡ, ਮਾਣ-ਸਤਿਕਾਰ, ਛੋਟੇ-ਛੋਟੇ, ਸੋਚੇ-ਸਮਝੇ, ਕੱਚ-ਸੱਚ, ਅੱਖੋਂ-ਪਰੋਖੇ, ਸੱਚੇ-ਸੁੱਚੇ, ਬਾਗੋ-ਬਾਗ, ਰੋਕ-ਟੋਕ, ਬੁਰਾ-ਭਲਾ, ਦੂਰ-ਨੇੜੇ, ਦਿਨ-ਰਾਤ, ਪੁੱਛਣ-ਦੱਸਣ, ਕਹਿਣ-ਸੁਣਨ, ਹਾਂ-ਨਾ • Hindi translation (with translation-tool assistance) • धूप-छाया, जोड़ - तोड़ , दिन - पदिन , चुस्त - दुरुस्त , गाँठ - तुप , तुथ्थ - मुथ , लड़के - लड़कियाँ , अच्छे - भले , मैले - कुचैले , अच्छा - भला , ढंग - तरीक़े , पूरा - पूरा , श्रेणी - विभाजन, गर्व - सत्कार , छोटे - छोटे , सोचे - समझे , काँच - सत्य , आँखों - परोखा , सत्य - सुच्चा , बागो - बाग , विघ्न – टोक, बुरा - भला , दूर – निकट(में), दिन - रात , पूछना - बताने , कहने - सुनना , हाँ - न
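One way to keep the sense of such compounds during sense marking is to look up the whole hyphenated token first and split it only when the compound has no entry of its own. The sketch below is a minimal illustration under assumptions: `LEXICON` is a hypothetical stand-in for the synset file, and `sense_units` is an illustrative helper, not part of any IndoWordNet tool.

```python
# Hypothetical stand-in for the Punjabi synset file (not the project's real data).
LEXICON = {"ਦਿਨ-ਰਾਤ", "ਦਿਨ", "ਰਾਤ", "ਬੁਰਾ", "ਭਲਾ"}


def sense_units(token, lexicon):
    """Return (units, missing_compound) for one whitespace-delimited token.

    units: the strings that would be sense-marked.
    missing_compound: True when a hyphenated compound had to be split
    because it has no entry of its own (a candidate for a new synset).
    """
    if token in lexicon:
        return [token], False      # the whole compound has its own entry
    parts = [p for p in token.split("-") if p]
    if len(parts) > 1:
        return parts, True         # split, but flag that the compound sense is lost
    return [token], False          # ordinary single word


for tok in ["ਦਿਨ-ਰਾਤ", "ਬੁਰਾ-ਭਲਾ"]:
    print(tok, sense_units(tok, LEXICON))
```

Flagging the split compounds, rather than silently marking the parts, would give a list of candidates for new compound synsets.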
PROBLEMS RELATED TO ADJECTIVES • Feminine Gender: Feminine forms of adjectives are not included in the WordNet, but they occur frequently in texts and reference materials. Some examples: • ਸੋਹਣੀ, ਲੰਬੀ, ਛੋਟੀ, ਫ਼ੁਰਤੀਲੀ, ਸਾਂਝੀ, ਸੁਨਹਿਰੀ, ਸੌਖੀ, ਮੋਟੀ, ਨਿੱਕੀ, ਉੱਚੀ • Transliteration: सोहनी, लंबी, छोटी, फुरतीली, सांझी, सुनहरी, सौखी, मोटी, निक्की, उच्ची
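A possible workaround while feminine forms are absent is to try a masculine base form before treating the word as missing. The following is only a sketch: it assumes the common pattern in which the feminine ending ੀ alternates with the masculine ending ਾ, which does not hold for every adjective, and both `lookup_base_form` and the sample lexicon are hypothetical.

```python
FEMININE_ENDING = "\u0a40"   # GURMUKHI VOWEL SIGN II (ੀ)
MASCULINE_ENDING = "\u0a3e"  # GURMUKHI VOWEL SIGN AA (ਾ)


def lookup_base_form(word, lexicon):
    """Return the lexicon entry to use for `word`, or None if nothing matches.

    Tries the word itself first, then a masculine candidate built by
    swapping a final ੀ for ਾ. This is an assumed heuristic, not a full
    morphological analyser.
    """
    if word in lexicon:
        return word
    if word.endswith(FEMININE_ENDING):
        candidate = word[:-1] + MASCULINE_ENDING
        if candidate in lexicon:
            return candidate
    return None


# Hypothetical mini-lexicon holding only masculine forms.
sample_lexicon = {"ਛੋਟਾ", "ਮੋਟਾ"}
print(lookup_base_form("ਛੋਟੀ", sample_lexicon))
```

Because the candidate is accepted only when it already exists in the lexicon, an adjective that does not follow the pattern simply returns `None` instead of a wrong match.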
Problems related to proverbs Proverbs are not included in the Hindi WordNet, so we are marking them word by word. How can we mark them? • 1. ਨੀਮ ਹਕੀਮ ਖਤਰਾ ਏ ਜਾਨ • 2. ਕੱਲਾ ਇਕ ਦੋ ਗਿਆਰਾਂ • 3. ਕਦਮ ਦਾ ਖੁੰਝਿਆ ਕੋਹਾਂ ਤੇ ਪੈਂਦੈ Transliteration • 1. नीम हकीम खतरा ए जान • 2. कल्ला इक दो गिआरां • 3. कदम दा खुंझिआ कोहां ते पैंदै English Translation • 1. A little knowledge is a dangerous thing • 2. Two heads are better than one • 3. A miss by an inch is a miss by a mile
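If proverbs were stored as multiword entries, a sense-marking tool could group them before marking individual words. The sketch below illustrates greedy longest-match grouping; the `PROVERBS` set and the `group_multiword` helper are assumptions made for illustration and do not describe the existing sense-marking tool.

```python
# Hypothetical multiword entries, each stored as a tuple of tokens.
PROVERBS = {
    ("ਨੀਮ", "ਹਕੀਮ", "ਖਤਰਾ", "ਏ", "ਜਾਨ"),
}


def group_multiword(tokens, expressions):
    """Greedy longest-match grouping of known multiword expressions.

    Tokens that belong to a known expression are joined into one unit;
    all other tokens are passed through unchanged.
    """
    max_len = max((len(e) for e in expressions), default=1)
    units = []
    i = 0
    while i < len(tokens):
        # Try the longest possible span first, down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in expressions:
                units.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:
            units.append(tokens[i])  # no expression starts at this token
            i += 1
    return units


print(group_multiword("ਨੀਮ ਹਕੀਮ ਖਤਰਾ ਏ ਜਾਨ".split(), PROVERBS))
```

Each grouped unit could then be linked to a single proverb-level sense instead of five unrelated word senses.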
WORDS ADDED TO THE SYNSET FILE DURING THE SENSE MARKING TASK, BY FIELD • Sports: ਚੈਂਪੀਅਨ, ਕਿਟ, ਫਾਈਨਲ, ਸੀਰੀਜ਼, ਵਿਸ਼ਵ ਕੱਪ, ਕੁਮੈਂਟੇਟਰ, ਬੇਸਬਾਲ, ਏਸ਼ੀਅਨ ਖੇਡ, ਏਸ਼ੀਆ ਕੱਪ, ਚੈਂਪੀਅਨ ਟਰਾਫੀ, ਜਾਫੀ, ਧਾਵੀ, ਰੇਡ, ਕਾਮਨਵੈਲਥ ਖੇਡ • Business: ਮਲਟੀ ਬਰਾਂਡ, ਸਰਵਿਸ ਟੈਕਸ, ਟੈਂਡਰ, ਪੈਕੇਜ, ਵੈਟ, ਕਰੰਸੀ, ਸਵੈ-ਰੁਜ਼ਗਾਰ, ਆਊਟ-ਸੋਰਸਿੰਗ • Politics: ਅਕਾਲੀ ਦਲ, ਕੁਰਸੀ, ਭਾਜਪਾਈ, ਕੈਬਨਿਟ, ਤ੍ਰਿਣਮੂਲ ਕਾਂਗਰਸ, ਜਨਤਾ ਦਲ, ਹਾਈਕਮਾਨ, ਸੰਵਿਧਾਨਕ, ਐਂਮ.ਐਂਲ.ਏ., ਅਸੈਂਬਲੀ
SUGGESTIONS-I • There should be a separate button on the IndoWordNet Website for the common vocabulary of all the languages (words that have the same sense in all languages). • There should be a separate button on the IndoWordNet Website for the word frequency list of each language. • There should be a separate button on the IndoWordNet Website for the borrowed word list of each language. • There should be a separate button on the IndoWordNet Website for the names of great personalities in all the languages.
SUGGESTIONS-II We should prepare some parameters for entries of: • Places • Institutions • Famous personalities • Famous creations: books, films, paintings, music etc. • Famous incidents and dates • Scientific vocabulary • Words from other special fields These parameters would help us in creating new synsets and Language Specific Synsets (LSS).
Team Composition • P.I. details • Dr. Suman Preet, Associate Professor & Head, Dept of Linguistics and Pbi. Lexicography, Punjabi University, Patiala. • Co-P.I. details • Dr. Harjeet Gill, Professor Eminence, Pbi. Uni., and Prof. Emeritus JNU.
Details of the Manpower associated with the Project Staff details Miss Balwinder Kaur, M.A. (Pbi.), PhD (in cont.) Designation: Senior Linguist Work Details: Linking synsets, Validating synsets, Creating & monitoring Language Specific Synsets Salary : 22,000/- p.m. Mr. Satpal Singh, M.A. (Eng, Linguistics), Diploma in Persian, B.Ed. Designation: Lexicographer Work Details : Linking synsets, Validating synsets, Sense Marking Salary : 16,500/- p.m.
Details of the Manpower associated with the Project (contd.) Mr. Vinay Hasija, B. Tech. (Computer Engg) Designation: Lexicographer Work Details: Validating synsets, Website creation, Sense Marking Salary: 16,500/- p.m.