Project 3 CS652 Information Extraction and Information Integration
Project 3 Presented by: Reema Al-Kamha
Results • Name Matcher 1) Baseline 2) Improvements: adding many synonyms for each word.
Results • NB Model 1) Baseline: I treated the contents of each row as one token. 2) Improvements:
Comments • I could not figure out how to distinguish start_time and end_time. • I parsed each row in the XML into tokens. • I removed all stop words (and also removed punctuation such as . , ; # from the vocabulary vector). • I stripped suffixes, so that, e.g., Introduction reduces to Intro. • I did not insert files that are in the source but not in the target. • Sometimes I extracted the key words in a document and treated the document as if it contained only those words, as in the award attribute. • For some attributes, such as the code attribute, I separated the numeric part from the letter part so that code could match subject in the course application, and then dropped the numeric part. • I had a lot of difficulty using Java for this project because it was very slow.
Muhammed Al-Muhammed • Two schema matching techniques were implemented in Java: name matching and NB. • In general, the type of the data helps in achieving good matching results. • Two improvements were made; more in the conclusions.
Name Matching * After making some improvements
NB * One element was wrongly mapped to a different one
Conclusions • In general, NB is better than NM • Two small improvements: - numerical ratio for the name matching - building expected patterns for the data ("helps in improving NB matching") • Combining the two methods was helpful, but the results are still not significant enough to argue for the combination.
Tim – Project 3 Results Name Matcher Improvements • Word similarity function • Convert to lower case • Combine: • Levenshtein edit distance – normalized to give a percentage • similar_text() – percentage of characters the same • Soundex • Longest common subsequence • Checks for substrings • Normalized to give a percentage (a sketch follows)
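A minimal sketch of what such a combined word-similarity function might look like, in Java. The equal weighting of the two scores, and the restriction to Levenshtein plus longest common subsequence, are assumptions; the slide also mentions similar_text() and Soundex, which are omitted here for brevity:

```java
// Sketch of a combined word-similarity score in [0, 1], combining a
// normalized Levenshtein distance with a normalized longest-common-
// subsequence length, as described on the slide.
public class WordSimilarity {

    // Classic dynamic-programming Levenshtein edit distance.
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        return d[a.length()][b.length()];
    }

    // Length of the longest common subsequence.
    static int lcs(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                d[i][j] = a.charAt(i - 1) == b.charAt(j - 1)
                        ? d[i - 1][j - 1] + 1
                        : Math.max(d[i - 1][j], d[i][j - 1]);
        return d[a.length()][b.length()];
    }

    // Average of the two normalized scores; the equal weighting is an
    // assumption, not taken from the slide.
    static double similarity(String a, String b) {
        a = a.toLowerCase();
        b = b.toLowerCase();
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) return 1.0;
        double lev = 1.0 - (double) levenshtein(a, b) / maxLen;
        double sub = (double) lcs(a, b) / maxLen;
        return (lev + sub) / 2.0;
    }

    public static void main(String[] args) {
        System.out.println(similarity("credits", "credit_hours"));
    }
}
```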
Naïve Bayes Improvements • Classify data instances • Use regular expression classifiers • 24 general classes • Correspond to datatypes • No domain-specific classes • long_string, small_int, big_int, short_all_caps, med_all_caps, init_cap, init_caps, …, short_string • Used only Course data to create the REs (a sketch of such a classifier follows)
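A minimal sketch of a regular-expression instance classifier of this kind. The class names come from the slide, but the particular patterns, their order, and the first-match-wins rule are assumptions:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Sketch: map a data instance to a general datatype class using
// ordered regular expressions; the first matching class wins.
public class InstanceClassifier {
    private static final Map<String, Pattern> CLASSES = new LinkedHashMap<>();
    static {
        // A few of the 24 classes named on the slide; the exact
        // regexes are assumptions for illustration.
        CLASSES.put("small_int", Pattern.compile("\\d{1,2}"));
        CLASSES.put("big_int", Pattern.compile("\\d{3,}"));
        CLASSES.put("short_all_caps", Pattern.compile("[A-Z]{1,4}"));
        CLASSES.put("med_all_caps", Pattern.compile("[A-Z]{5,10}"));
        CLASSES.put("init_cap", Pattern.compile("[A-Z][a-z]+"));
        CLASSES.put("init_caps", Pattern.compile("([A-Z][a-z]+ )+[A-Z][a-z]+"));
        CLASSES.put("short_string", Pattern.compile(".{1,15}"));
        CLASSES.put("long_string", Pattern.compile(".{16,}"));
    }

    static String classify(String value) {
        for (Map.Entry<String, Pattern> e : CLASSES.entrySet())
            if (e.getValue().matcher(value.trim()).matches())
                return e.getKey();
        return "unknown";
    }

    public static void main(String[] args) {
        System.out.println(classify("42"));        // small_int
        System.out.println(classify("CS"));        // short_all_caps
        System.out.println(classify("Databases")); // init_cap
    }
}
```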
Schema Matching Helen Chen CS652 Project 3 06/14/2002
Results from the Name Matcher * The number in parentheses is the number of matches before improvement
Comments • The name matcher works fine in the two given domains with an appropriate dictionary • Add stemmed words, synonyms, etc. to the dictionary; make the words case insensitive • Naïve Bayes is not a good schema matching method in the given domains • Use words instead of tuples as tokens • Use a thesaurus (count stemmed words and synonyms as one token, ignore case) • Further improvements could be made • Use value characteristics (string length, numeric ratio, space ratio) – see the sketch below • Use an ontology
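A minimal sketch of the value-characteristics idea (string length, numeric ratio, space ratio); averaging these over a whole column of values is an assumption:

```java
// Sketch: summarize a column of values by average string length,
// numeric-character ratio, and space ratio, so that columns can be
// compared by their value characteristics rather than their names.
public class ValueCharacteristics {
    double avgLength, numericRatio, spaceRatio;

    static ValueCharacteristics of(String[] values) {
        ValueCharacteristics c = new ValueCharacteristics();
        long chars = 0, digits = 0, spaces = 0;
        for (String v : values) {
            chars += v.length();
            for (char ch : v.toCharArray()) {
                if (Character.isDigit(ch)) digits++;
                if (ch == ' ') spaces++;
            }
        }
        if (chars == 0) return c;  // empty column: all zeros
        c.avgLength = (double) chars / values.length;
        c.numericRatio = (double) digits / chars;
        c.spaceRatio = (double) spaces / chars;
        return c;
    }

    public static void main(String[] args) {
        ValueCharacteristics c =
            of(new String[] {"10:30", "11:45", "9:00"});
        System.out.printf("len=%.1f num=%.2f space=%.2f%n",
                          c.avgLength, c.numericRatio, c.spaceRatio);
    }
}
```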
Yihong’s Project 3 • Course domain: • Rice, 11 → Washington, 12 (11/11 directly mapped) • Rice, 11 → WSU, 16 (9/11 directly mapped, 1/11 indirectly mapped, 1/11 not mapped) • Faculty domain: • Cornell, 10 → Washington, 10 (10/10 directly mapped) • Cornell, 10 → Michigan, 10 (10/10 directly mapped)
Name Matcher • Baseline • Synonym list for each attribute name, built by training • Add most common synonyms and abbreviations • Compare case-insensitively • Improvements • Add more synonyms using WordNet • String similarity computation • Add a new category, "UNKNOWN" (a sketch follows)
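A minimal sketch of the baseline synonym-list matcher described above, including the "UNKNOWN" category added as an improvement; the data structure and the exact matching rule are assumptions:

```java
import java.util.*;

// Sketch of a baseline name matcher: each target attribute carries a
// trained synonym/abbreviation list, and comparison is case-insensitive.
public class NameMatcher {
    private final Map<String, Set<String>> synonyms = new HashMap<>();

    void addSynonyms(String attribute, String... syns) {
        Set<String> set = synonyms.computeIfAbsent(
            attribute.toLowerCase(), k -> new HashSet<>());
        set.add(attribute.toLowerCase());
        for (String s : syns) set.add(s.toLowerCase());
    }

    // Returns the matching target attribute, or "UNKNOWN" if no
    // synonym list contains the source name.
    String match(String sourceName) {
        String name = sourceName.toLowerCase();
        for (Map.Entry<String, Set<String>> e : synonyms.entrySet())
            if (e.getValue().contains(name)) return e.getKey();
        return "UNKNOWN";
    }

    public static void main(String[] args) {
        NameMatcher m = new NameMatcher();
        m.addSynonyms("course_code", "code", "crse", "crs", "nr", "number");
        System.out.println(m.match("Crse"));  // course_code
        System.out.println(m.match("days"));  // UNKNOWN
    }
}
```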
Naïve Bayes • Baseline • Each entry in Raw_text as a training unit • Improvements • Remove stopwords • Cluster special strings • String similarity computation • Add a new category, "UNKNOWN" • Training-size experiment
Results / Conclusion • Combination: random selection weighted by experimental accuracies (sketched below)
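One way to read "random selection weighted by experimental accuracies" is the following sketch; the accuracy figures and the proportional-probability rule are assumptions:

```java
import java.util.Random;

// Sketch: combine two matchers' proposals by picking one at random,
// weighted by each matcher's measured accuracy on held-out data.
public class WeightedCombiner {
    public static void main(String[] args) {
        double nameMatcherAccuracy = 0.9;  // hypothetical figures
        double naiveBayesAccuracy = 0.6;

        String nameMatcherChoice = "title";
        String naiveBayesChoice = "comments";

        double total = nameMatcherAccuracy + naiveBayesAccuracy;
        Random rng = new Random();
        // Take the name matcher's answer with probability proportional
        // to its accuracy; otherwise take Naive Bayes' answer.
        String combined = rng.nextDouble() < nameMatcherAccuracy / total
                ? nameMatcherChoice
                : naiveBayesChoice;
        System.out.println("Combined mapping: " + combined);
    }
}
```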
David Marble CS 652 Project 3
Improved Results • Name Matcher: word stemming. • NB: improved precision by tokenizing, separating text from numbers, and removing leading 0s in numbers.
Comments • WSU happened to be the "weird" one. • Building names were completely different. • Faculty with odd last names; only a few first names matched (not a lot of training names). • Telephone numbers only matched when changing digits to "digit" instead of the value. • The start_time/end_time dilemma – why can't schools run their schedule like BYU?
Baseline Results • Course 1 • Recall = .6 • Precision = 1 • Course 2 • Recall = .66 • Precision = 1 • Faculty • Recall = .8 • Precision = 1
Modified Results • Course 1 • Recall = .7 • Precision = 1 • Course 2 • Recall = .78 • Precision = 1 • Faculty • Recall = .8 • Precision = 1
Discussion • Modification of name matching involved a number of substring comparisons. • Modifications improved results for both course tests. • Modifications did not change results for the faculty tests. • The Naïve Bayesian classifier is not well suited to all types of data (buildings, sections, phone numbers).
Schema Matching Results Lars Olson
Baseline test data • Test 1 (Course: Washington → Reed) • R = 3/9 (33%), P = 3/3 (100%) • room, title, days • Test 2 (Course: Washington → Rice) • R = 4/9 (44%), P = 4/4 (100%) • room, credits, title, days • Test 3 (Faculty: Washington → Berkley) • R = 8/10 (80%), P = 8/8 (100%) • name, research, degrees, fac_title, award, year, building, title • Test 4 (Faculty: Washington → Cornell) (identical to Test 3)
After Improvements • Test 1 (Course: Washington → Reed) • Name matcher: R = 8/9 (89%), P = 8/8 (100%) (missed schedule_line → reg_num) • Bayes: R = 4/9 (44%), P = 4/12 (33%) (also missed schedule_line) • Test 2 (Course: Washington → Rice) • Name matcher: R = 9/9 (100%), P = 9/9 (100%) • Bayes: R = 4/9 (44%), P = 4/12 (33%) • Test 3 (Faculty: Washington → Berkley | Cornell) • Name matcher: R = 10/10 (100%), P = 10/10 (100%) • Bayes: R = 8/10 (80%), P = 8/10 (80%)
Comments • Improvements made: • Name matcher: • Remove all symbols (e.g., '_') from the string • Build a thesaurus based on the training set • Bayes learner: • Attempt 1: classify all numbers together • Attempt 2: replace all digits with '#' • Idea: an FSA tokenizer (to recognize phone numbers #######, times ##:##) – a sketch follows • Difficulties: • What are the correct matches? (e.g., restrictions → comments) • Aggregate matches were not included in the recall measures
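A minimal sketch of the digit-replacement attempt, with regular expressions standing in for the proposed FSA tokenizer; the exact patterns are assumptions:

```java
import java.util.regex.Pattern;

// Sketch: normalize data values before Naive Bayes training by
// replacing every digit with '#', then recognize a few token shapes
// (regexes stand in here for the FSA tokenizer idea on the slide).
public class DigitNormalizer {
    private static final Pattern PHONE = Pattern.compile("#{7}");
    private static final Pattern TIME = Pattern.compile("#{1,2}:##");

    static String normalize(String value) {
        return value.replaceAll("\\d", "#");
    }

    static String shape(String value) {
        String n = normalize(value);
        if (PHONE.matcher(n).matches()) return "phone";
        if (TIME.matcher(n).matches()) return "time";
        return n;  // fall back to the normalized token itself
    }

    public static void main(String[] args) {
        System.out.println(shape("5551234")); // phone
        System.out.println(shape("10:30"));   // time
        System.out.println(shape("MWF 230")); // MWF ###
    }
}
```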
Jeff Roth Project 3
Basic Results
• Course – Target = Reed; Training = Rice, uwm, Washington; Source = wsu • Naïve Bayes: 7/12 correct, 6/16 FP • Name Classifier: 12/15 correct, 0/19 FP
• Course – Target = Rice; Training = Reed, uwm, Washington; Source = wsu • Naïve Bayes: 7/10* correct, 5/16 FP • Name Classifier: 12/13 correct, 0/19 FP
• Faculty – Target = Berkley; Training = Cornell, Texas, Washington; Source = Michigan • Naïve Bayes: 6/10 correct, 3/10 FP • Name Classifier: 14/14 correct, 0/14 FP
• Faculty – Target = Cornell; Training = Berkley, Texas, Washington; Source = Michigan • Naïve Bayes: 5/10 correct, 3/10 FP • Name Classifier: 14/14 correct, 0/14 FP
“Improved” Naïve Bayes
• Course – Target = Reed; Training = Rice, uwm, Washington; Source = wsu • Naïve Bayes: 7/12 correct, 7/16 FP
• Course – Target = Rice; Training = Reed, uwm, Washington; Source = wsu • Naïve Bayes: 7/10* correct, 5/16 FP
• Faculty – Target = Berkley; Training = Cornell, Texas, Washington; Source = Michigan • Naïve Bayes: 6/10 correct, 3/10 FP
• Faculty – Target = Cornell; Training = Berkley, Texas, Washington; Source = Michigan • Naïve Bayes: 5/10 correct, 3/10 FP
Improvements (a sketch of the classification rule follows):
1. Classification = argmax_vj [ log P(vj) + Σi log P(ai | vj) ] – included in basic
2. If a word in the classification document has no match, use P = 1 / (2 · |vocabulary|) – no help
3. Divide by the number of words in the test document and find the global max – scratched
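A minimal sketch of that classification rule in log space, with the 1/(2·|vocabulary|) fallback for words unseen in training; the surrounding data structures are assumptions:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: log-space Naive Bayes classification with a fixed fallback
// probability of 1 / (2 * |vocabulary|) for words a class never saw.
public class NaiveBayesClassify {
    // classPriors.get(c) = P(c); condProbs.get(c).get(w) = P(w | c)
    static String classify(Map<String, Double> classPriors,
                           Map<String, Map<String, Double>> condProbs,
                           String[] words, int vocabularySize) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        double fallback = 1.0 / (2.0 * vocabularySize);
        for (String c : classPriors.keySet()) {
            double score = Math.log(classPriors.get(c));
            for (String w : words) {
                Double p = condProbs.get(c).get(w);
                score += Math.log(p != null ? p : fallback);
            }
            if (score > bestScore) { bestScore = score; best = c; }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Double> priors = new HashMap<>();
        priors.put("title", 0.5);
        priors.put("room", 0.5);
        Map<String, Map<String, Double>> cond = new HashMap<>();
        cond.put("title", new HashMap<>());
        cond.put("room", new HashMap<>());
        cond.get("title").put("intro", 0.3);  // hypothetical estimates
        cond.get("room").put("hall", 0.4);
        System.out.println(classify(priors, cond,
                new String[] {"intro", "databases"}, 1000)); // title
    }
}
```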
Combination
• Course – Target = Reed; Training = Rice, uwm, Washington; Source = wsu • Name Classifier: 13/15 correct, 0/19 FP
• Course – Target = Rice; Training = Reed, uwm, Washington; Source = wsu • Name Classifier: 12/13 correct, 0/19 FP
• Faculty – Target = Berkley; Training = Cornell, Texas, Washington; Source = Michigan • Name Classifier: 14/14 correct, 0/14 FP
• Faculty – Target = Cornell; Training = Berkley, Texas, Washington; Source = Michigan • Name Classifier: 14/14 correct, 0/14 FP
Combination algorithm (sketched below):
1. Match source to target if both Naïve Bayes and the name matcher agree
2. Match remaining unmatched target elements to the source by the name matcher
3. Match any remaining unmatched target elements to the source by Naïve Bayes
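A minimal sketch of the three-step combination algorithm; the detail of not reusing a source element once matched is an assumption:

```java
import java.util.*;

// Sketch of the three-step combination: agreement first, then the
// name matcher, then Naive Bayes for whatever is still unmatched.
public class Combiner {
    static Map<String, String> combine(Map<String, String> byName,
                                       Map<String, String> byBayes) {
        Map<String, String> result = new LinkedHashMap<>();
        Set<String> usedSources = new HashSet<>();
        // Step 1: keep mappings both matchers agree on.
        for (Map.Entry<String, String> e : byName.entrySet())
            if (e.getValue() != null
                    && e.getValue().equals(byBayes.get(e.getKey()))) {
                result.put(e.getKey(), e.getValue());
                usedSources.add(e.getValue());
            }
        // Step 2: fill remaining targets from the name matcher.
        fill(result, usedSources, byName);
        // Step 3: fill anything still unmatched from Naive Bayes.
        fill(result, usedSources, byBayes);
        return result;
    }

    static void fill(Map<String, String> result, Set<String> used,
                     Map<String, String> proposals) {
        for (Map.Entry<String, String> e : proposals.entrySet())
            if (!result.containsKey(e.getKey()) && e.getValue() != null
                    && !used.contains(e.getValue())) {
                result.put(e.getKey(), e.getValue());
                used.add(e.getValue());
            }
    }

    public static void main(String[] args) {
        Map<String, String> name = Map.of("title", "crse_title",
                                          "room", "bldg_rm");
        Map<String, String> bayes = Map.of("title", "crse_title",
                                           "room", "comments");
        System.out.println(combine(name, bayes));
        // {title=crse_title, room=bldg_rm}
    }
}
```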
Schema Matching by Using Name Matcher and Naïve Bayesian Classifier (NB) Cui Tao CS652 Project 3
Name Matcher: Heuristic name matching (Cupid) • Tokenization of names • SectionNr → Section, Nr; Start_time → Start, time • Expansion of short forms and acronyms • nr → number, bldg → building, rm → room, sect → section • crse or crs → course • Thesaurus of synonyms, hypernyms, acronyms • Nr → Code, restriction → limit, etc. • Ignore case (a tokenization sketch follows)
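A minimal sketch of the tokenization and short-form expansion steps; the camel-case splitting regex is an assumption, and the expansion table contains only the pairs listed on the slide:

```java
import java.util.*;

// Sketch of Cupid-style name preprocessing: split attribute names on
// underscores and case changes, then expand known short forms.
public class NameTokenizer {
    private static final Map<String, String> EXPANSIONS = Map.of(
        "nr", "number", "bldg", "building", "rm", "room",
        "sect", "section", "crse", "course", "crs", "course");

    static List<String> tokenize(String name) {
        // Insert a split point before each interior capital, then
        // split on underscores and whitespace.
        String spaced = name.replaceAll("(?<=[a-z0-9])(?=[A-Z])", "_");
        List<String> tokens = new ArrayList<>();
        for (String t : spaced.split("[_\\s]+")) {
            if (t.isEmpty()) continue;
            String lower = t.toLowerCase();
            tokens.add(EXPANSIONS.getOrDefault(lower, lower));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("SectionNr"));  // [section, number]
        System.out.println(tokenize("Start_time")); // [start, time]
    }
}
```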
Naïve Bayesian Classifier • Problems: • Names, buildings, etc. • Numbers: room, time, code • Keyword confusions: research, award, title • Different systems: room, section number, etc. • Phone numbers (cannot be matched by NB, but the match is easy to find using pattern recognition) • Improvements: • Use tokens instead of tuples • Names: “Richard Anderson”, “Thomas Anderson”, “Thomas F. Coleman” become “Thomas”, “Richard”, “Anderson”, “F.”, “Coleman” • Building, degree, research, etc. • Eliminate stopwords • Stem words: shared substring at least 80% of the whole word (a sketch follows) • Ignore case
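A minimal sketch of the "shared substring at least 80% of the whole word" stemming test; reading the 80% as a fraction of the longer word is an assumption:

```java
// Sketch: treat two words as the same token when their longest
// common substring covers at least 80% of the longer word.
public class SubstringStemmer {
    static int longestCommonSubstring(String a, String b) {
        int best = 0;
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    d[i][j] = d[i - 1][j - 1] + 1;
                    best = Math.max(best, d[i][j]);
                }
        return best;
    }

    static boolean sameStem(String a, String b) {
        a = a.toLowerCase();
        b = b.toLowerCase();
        int longer = Math.max(a.length(), b.length());
        return longestCommonSubstring(a, b) >= 0.8 * longer;
    }

    public static void main(String[] args) {
        System.out.println(sameStem("restriction", "restrictions")); // true
        System.out.println(sameStem("room", "research"));            // false
    }
}
```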
Conclusion • Combining the two: • How: on conflict, follow the name matcher • Result: all 100% • Name matcher: works better for this application • NB: may work better for indirect mappings
Project 3: Schema Matching Alan Wessman
Baseline Results • Course test set: UWM • Faculty test set: Texas
Improved Results • Name matcher improvements: • Lower-case, trim whitespace • Remove vowels • Match if exact, prefix, or edit distance = 1 • Naïve Bayes improvements: • Lower-case, trim whitespace • Consider only the first 80 chars • Consider only the first alphanumeric token in the string (a sketch of the name matcher follows)
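A minimal sketch of the improved name-matching rule (lower-case, trim, strip vowels, then accept exact, prefix, or edit-distance-1 matches); treating the prefix test as symmetric is an assumption:

```java
// Sketch of the improved name-matching rule: normalize both names
// (lower case, trim, strip vowels), then accept an exact match, a
// prefix match, or an edit distance of 1.
public class VowelStripMatcher {
    static String normalize(String s) {
        return s.trim().toLowerCase().replaceAll("[aeiou]", "");
    }

    static boolean matches(String a, String b) {
        String x = normalize(a), y = normalize(b);
        return x.equals(y) || x.startsWith(y) || y.startsWith(x)
                || editDistance(x, y) == 1;
    }

    // Standard dynamic-programming edit distance.
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                d[i][j] = Math.min(
                    Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                    d[i - 1][j - 1]
                        + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(matches("credits", "credit")); // true (prefix)
        System.out.println(matches("Bldg", "building"));  // true (distance 1)
    }
}
```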
Commentary • The improved name matcher is effective • But performance decreases if it is too general • Naïve Bayes is not very useful • Fails when different attributes have similar values (start_time, end_time, room, section_num) • Fails when the same attribute has different values or formats across data sources (room, comments) • A “sophisticated” string classifier for NB failed miserably; worse than the baseline, so I threw it out!
CS 652 Project #3 Schema Element Mapping -- By Yuanqiu (Joe) Zhou Baseline Experimental Results
CS 652 Project #3 Schema Element Mapping -- By Yuanqiu (Joe) Zhou Improvements (at least attempted) • Name Matcher • Simple text transformation functions, such as substring, prefix, and abbreviation • NB Classifier • Positive word density (did not work at all) • Regular expressions for common data types, such as times, small integers, and large integers • Combination • Favor the name matcher over the NB classifier • The NB classifier can break ties left by the name matcher (such as sect → section vs. sect → section_note) – see the sketch below
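A minimal sketch of the combination rule that favors the name matcher and lets the NB classifier break ties such as sect → section vs. sect → section_note; the score representation is an assumption:

```java
import java.util.*;

// Sketch: favor the name matcher; when it proposes more than one
// candidate for a source element (a tie), let the NB classifier's
// scores break the tie.
public class TieBreakCombiner {
    static String resolve(String source,
                          List<String> nameMatcherCandidates,
                          Map<String, Double> nbScores) {
        if (nameMatcherCandidates.size() == 1)
            return nameMatcherCandidates.get(0);  // no tie: trust names
        // Tie: pick the candidate the NB classifier scores highest.
        return nameMatcherCandidates.stream()
                .max(Comparator.comparingDouble(
                    c -> nbScores.getOrDefault(c, Double.NEGATIVE_INFINITY)))
                .orElseThrow();
    }

    public static void main(String[] args) {
        List<String> candidates = List.of("section", "section_note");
        Map<String, Double> nbScores = Map.of(
            "section", -12.3, "section_note", -20.1);  // hypothetical
        System.out.println(resolve("sect", candidates, nbScores));
        // prints: section
    }
}
```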
CS 652 Project #3 Schema Element Mapping -- By Yuanqiu (Joe) Zhou Experimental Results with Improvements
CS 652 Project #3 Schema Element Mapping -- By Yuanqiu (Joe) Zhou Comments • The high precision and recall result mostly from improvements to the name matcher • Improvements to the NB classifier did not contribute much (they only corrected one missed mapping for one course application) • The NB classifier is not suited to distinguishing elements with similar data types (such as times and numbers) or elements sharing many common values • Reducing the size of the training data can achieve the same precision and recall with less running time