VOCABULARY TERM MAPPING
By: Namrata Lele
Mentors: Dave Vieglais, Bruce Wilson
VDC/TWG Meeting, August 09
OUTLINE
• Problem Overview
• Implementation Details
• Deliverables
• Results
Problem Overview: vocabulary term mapping
• Inputs: Document A and Document B
• A parser parses each document format and extracts its terms
• Measure the semantic relatedness of terms from documents A and B
Implementation
Implemented parsers to parse the following metadata documents:
• DublinCore
• DarwinCore
• EML
Deliverables:
• DublinCore Parser
• DarwinCore Parser
• EML Parser
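To illustrate what "parse the document and extract terms" can look like, here is a minimal Python sketch. The XML layout (`vocabulary`, `term`, `description`), the sample text, and the function name `extract_terms` are all assumptions made for illustration; the real DublinCore, DarwinCore and EML schemas are more involved, and the talk does not show the actual parser code.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample input; not taken from the talk.
SAMPLE = """
<vocabulary>
  <term name="title"><description>A name given to the resource</description></term>
  <term name="creator"><description>An entity primarily responsible for making the resource</description></term>
</vocabulary>
"""

def extract_terms(xml_text):
    """Return a list of (term, description, xpath) tuples from the assumed layout."""
    root = ET.fromstring(xml_text)
    results = []
    for index, term in enumerate(root.findall("term"), start=1):
        description = term.findtext("description", default="").strip()
        # Simple positional XPath back to the element the term came from.
        xpath = f"/{root.tag}/term[{index}]"
        results.append((term.get("name"), description, xpath))
    return results

if __name__ == "__main__":
    for name, description, xpath in extract_terms(SAMPLE):
        print(name, "|", description, "|", xpath)
```

Keeping the XPath next to each term makes it possible to point back into the source file when a match is reported later.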
Implementation
Measure semantic relatedness of terms using the following libraries:
• Lucene: a full-featured text search engine written entirely in Java. It is an open source project available for free download.
• GTM (General Text Matcher): measures the similarity between texts. GTM is written in Java and is open source, released under the BSD license.
Lucene scoring
• Lucene scoring uses a combination of the Vector Space Model (VSM) of Information Retrieval and the Boolean model to determine how relevant a given Document is to a User's query.
• The idea behind the VSM is that the more times a query term appears in a document, relative to the number of times the term appears in all the documents in the collection, the more relevant that document is to the query.
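The VSM idea can be sketched with textbook TF-IDF weights and cosine similarity. This is not Lucene's actual scoring formula (which adds its own normalisation and boost factors); the function name `vsm_scores`, the smoothed IDF, and the toy documents are assumptions for illustration only.

```python
import math
from collections import Counter

def vsm_scores(query_tokens, docs):
    """Score each document (a list of tokens) against the query.

    Textbook TF-IDF with cosine similarity; an illustration of the
    vector space idea, not Lucene's own formula.
    """
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))

    def idf(term):
        # Smoothed so that unseen query terms do not divide by zero.
        return math.log((n + 1) / (df[term] + 1)) + 1.0

    def vector(tokens):
        return {t: count * idf(t) for t, count in Counter(tokens).items()}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    q = vector(query_tokens)
    return [cosine(q, vector(doc)) for doc in docs]

if __name__ == "__main__":
    docs = [["full", "name", "country"], ["date", "collect"], ["name", "author"]]
    print(vsm_scores(["country", "name"], docs))
```

A term that is rare in the collection (here "country") gets a higher weight than one shared by several documents ("name"), so the first document ranks above the third.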
Using Lucene and WordNet (vocabulary)
Lucene Index Builder:
• Input: parsed document B (XML file)
• Stemming and stop-word filtering
• Expand description for synonyms using the WordNet index
• Output: Lucene index
Vocabulary Term Mapper:
• Input: parsed document A (XML file)
1) Read term-description from document A
2) Stem, stop-word filter and expand description (query) for synonyms using the WordNet index
3) Search term and description in the Lucene index for document B
• Get matching terms (Lucene documents)
• Create XML output file
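The query preparation step (stop-word filtering, stemming, synonym expansion) can be sketched as follows. Everything here is a stand-in: the stop-word set, the hand-written `SYNONYMS` table (in place of the WordNet index) and the `crude_stem` function (in place of a real stemmer) are hypothetical, and the order of operations is an assumption rather than the project's actual code.

```python
# Hypothetical stop-word list; the project's actual list is not shown in the talk.
STOP_WORDS = {"the", "of", "a", "an", "and", "or", "in", "to"}

# Hypothetical hand-written table standing in for the WordNet index.
SYNONYMS = {"full": ["entire", "total"], "name": ["call", "identify"]}

def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ated", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    if word.endswith("y") and len(word) > 3:
        return word[:-1] + "i"
    return word

def expand_description(description):
    """Drop stop words, stem each remaining word, and append its synonyms."""
    tokens = []
    for word in description.lower().replace(",", " ").split():
        if word in STOP_WORDS:
            continue
        tokens.append(crude_stem(word))
        # Synonyms are looked up on the unstemmed word and added as-is.
        tokens.extend(SYNONYMS.get(word, []))
    return tokens

if __name__ == "__main__":
    print(expand_description("the full, unabbreviated name of the country"))
```

Expanding the query with synonyms lets a description in document A match a description in document B that uses different wording for the same concept, at the cost of more false matches.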
Deliverables
• Lucene Index Builder
• Lucene Vocabulary Term Mapper
• WordNet Index Builder
Example
Original description: the full, unabbreviated name of the country
Stop-word filtered, stemmed and synonym-expanded description: the full entire fully good total undivided wax wide unabbrevi name advert appoint call cite constitute describe diagnose discover distinguish epithet figure gens identify key list make mention nominate refer of countri
Using General Text Matcher (without thesauri)
Vocabulary Term Mapper:
• Inputs: parsed document A and parsed document B (XML files)
• Read term-description from document A (stem and stop-word filter)
• Read term-description from document B (stem and stop-word filter)
• Get similarity score using GTM
• Maintain top 5 scores for every term-description in document A
• Output: score, term, description, XPath to XML file
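The "maintain top 5 scores" step can be sketched in Python as below. The similarity function is a simple unigram-overlap F-measure used as a stand-in; it is not GTM's algorithm, and the names `overlap_f1` and `top_matches` as well as the tuple layout are assumptions for illustration.

```python
import heapq
from collections import Counter

def overlap_f1(tokens_a, tokens_b):
    """Unigram-overlap F-measure; a simple stand-in, not GTM's algorithm."""
    if not tokens_a or not tokens_b:
        return 0.0
    common = sum((Counter(tokens_a) & Counter(tokens_b)).values())
    if common == 0:
        return 0.0
    precision = common / len(tokens_b)
    recall = common / len(tokens_a)
    return 2 * precision * recall / (precision + recall)

def top_matches(doc_a, doc_b, k=5):
    """doc_a, doc_b: lists of (term, description_tokens, xpath).

    Returns {term_a: [(score, term_b, xpath_b), ...]}, best first,
    with at most k entries per term of document A.
    """
    result = {}
    for term_a, tokens_a, _ in doc_a:
        scored = [
            (overlap_f1(tokens_a, tokens_b), term_b, xpath_b)
            for term_b, tokens_b, xpath_b in doc_b
        ]
        result[term_a] = heapq.nlargest(k, scored)
    return result

if __name__ == "__main__":
    doc_a = [("title", ["name", "given", "resource"], "/a/title")]
    doc_b = [
        ("title", ["name", "resource"], "/b/title"),
        ("creator", ["person", "made", "resource"], "/b/creator"),
        ("date", ["point", "time"], "/b/date"),
    ]
    for score, term, xpath in top_matches(doc_a, doc_b)["title"]:
        print(round(score, 3), term, xpath)
```

`heapq.nlargest` keeps only the k best candidates per term, so the full score list for document B never has to be sorted.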
Deliverables
• Modified GTM Library
• GTM Vocabulary Term Mapper
Results from Lucene vocabulary term mapping
Results were obtained for the following mappings:
• DublinCore – DarwinCore
• DarwinCore – DublinCore
• EML – DarwinCore
• DarwinCore – EML
• EML – DublinCore
• DublinCore – EML
Dublin Core elements in EML
The following terms returned by the Lucene Vocabulary Term Mapper match the existing mapping between DublinCore and EML:
• Title – Title
• Creator – Creator
• Publisher – Publisher
• Format – Physical
• Coverage – Coverage
• Rights – Intellectual Rights
Results from GTM vocabulary term mapping
Results were obtained for the following mappings:
• DublinCore – DarwinCore
• DarwinCore – DublinCore
Sample output XML file
To-dos
• Fix a bug in the EML parser.
• Provide two versions of the EML parser: one that has descriptions for all terms in the hierarchy and one that has a description for only the current term.
Conclusion
• Got a chance to learn new libraries (Lucene and GTM)
• Learnt new concepts about semantic similarity
• Honed my XML skills
• Enjoyed working with this team
References
• General Text Matcher: http://nlp.cs.nyu.edu/GTM/
• http://lucene.apache.org/java/2_4_0/scoring.html
• http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/wordnet/package-summary.html