Charis Ermopoulos Yong Yang Hanna Zhong Qian Yang
Problem Definition • Given the full name of a database researcher, find his/her homepage. • Homepage definition: (to be discussed in class)
Architecture [Diagram: Name -> (Personal Dictionary, Domain Dictionary, Heuristics) -> Weighting function -> Homepage. Annotations: to distinguish personal homepages from common sites; to distinguish database-related webpages from the rest.]
Domain Dictionary A set of words that are common in the database community. Our approach: DBWorld + DBConference - Contrast Area = Our Dictionary (Virtual)
Domain Dictionary Dictionary Building: parse documents from each source into 2-word phrases and calculate their frequency.

DBWorld
  Phrase                 Frequency
  data mine              4.47E-03
  dbworld messag         4.38E-03
  paper submiss          3.78E-03
  program committe       3.10E-03
  import date            2.98E-03
  state univers          2.74E-03
  intern confer          2.73E-03
  comput scienc          2.70E-03
  hong kong              2.65E-03
  camera readi           2.56E-03
  data manag             2.33E-03

DBConference
  Phrase                 Frequency
  queri process          1.63E-02
  mobil databas          1.36E-02
  languag featur         1.09E-02
  data manag             1.09E-02
  xqueri implement       0.008174387
  queri languag          8.17E-03
  queri optim            0.005449591
  process data           0.005449591
  data mine              0.005449591
  research prototyp      0.005449591
  databas architectur    0.005449591

Contrast Area
  Phrase                 Frequency
  program committe       0.019085487
  mathemat scienc        0.007952286
  mathemat physic        0.006361829
  intern confer          0.0055666
  date june              0.005168986
  intern institut        0.004373758
  schr dinger            0.003976143
  erwin schr             0.003976143
  dinger intern          0.003976143
  degli studi            0.003578529
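The dictionary-building step can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it assumes "frequency" means a phrase's share of all 2-word phrases in a source, it skips the stemming that the phrases on the slide appear to have gone through, and the function names and sample documents are made up.

```python
import re
from collections import Counter


def two_word_phrases(text):
    """Lowercase the text, keep alphabetic tokens, and pair adjacent tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def build_dictionary(documents):
    """Map each 2-word phrase to its relative frequency over all documents.

    Assumption: frequency = phrase count / total number of phrases.
    """
    counts = Counter()
    for doc in documents:
        counts.update(two_word_phrases(doc))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {phrase: n / total for phrase, n in counts.items()}


# Made-up sample documents standing in for one source.
docs = [
    "Call for papers: data mining and data management",
    "Data mining workshop, program committee",
]
dictionary = build_dictionary(docs)
```

Running the same function once per source would give one dictionary each for DBWorld, DBConference, and the contrast area.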
Domain Dictionary (cont.) Similarity Measuring: • Parse the webpage into 2-word phrases and calculate their frequencies. • Use the cosine similarity measure on phrase frequencies to get a score against each dictionary: S_dbworld, S_dbconf, S_contrast. • Combine S_dbworld, S_dbconf, and (1 - S_contrast) using the geometric mean.
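A small sketch of the similarity step described above, assuming each dictionary and each webpage is represented as a mapping from phrase to frequency. The function names are illustrative, and the clamping of the cosine value is an added safeguard against floating-point rounding, not something the slide mentions.

```python
import math


def cosine_similarity(freq_a, freq_b):
    """Cosine similarity between two sparse phrase-frequency vectors."""
    dot = sum(v * freq_b.get(k, 0.0) for k, v in freq_a.items())
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    # Clamp to [0, 1]; frequencies are non-negative, so only rounding
    # error could push the value outside this range.
    return min(1.0, max(0.0, dot / (norm_a * norm_b)))


def domain_score(page, dbworld, dbconf, contrast):
    """Geometric mean of S_dbworld, S_dbconf, and (1 - S_contrast)."""
    s_dbworld = cosine_similarity(page, dbworld)
    s_dbconf = cosine_similarity(page, dbconf)
    s_contrast = cosine_similarity(page, contrast)
    return (s_dbworld * s_dbconf * (1.0 - s_contrast)) ** (1.0 / 3.0)
```

Because the three factors are multiplied, a page that scores zero against either database dictionary, or that matches the contrast dictionary exactly, ends up with a domain score of zero.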
Personal Dictionary A set of words related to the specific person that we are looking for. Our approach: use DBLP to find information about co-authors, keywords of research, and conferences
Personal Dictionary • Given a researcher’s name, find his/her DBLP page. • Build the personal dictionary using Term Frequency and Entry Frequency (the number of publication entries in which a term appears). • Use the cosine measure to evaluate the similarity between a webpage and this personal dictionary.
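As a rough sketch of the dictionary-building bullet, the snippet below counts Term Frequency and Entry Frequency over a list of publication entries. The slide does not say how the two counts are combined, so multiplying them is purely an assumption here; the function name and the sample entries are also made up, and fetching the DBLP page itself is left out.

```python
import re
from collections import Counter


def build_personal_dictionary(entries):
    """Weight each term by term frequency times entry frequency.

    entries: one string per publication entry (title, co-authors, venue).
    Assumption: weight = TF * EF; the slide only names the two measures.
    """
    term_freq = Counter()   # total occurrences of each term
    entry_freq = Counter()  # number of entries containing each term
    for entry in entries:
        terms = re.findall(r"[a-z]+", entry.lower())
        term_freq.update(terms)
        entry_freq.update(set(terms))
    return {term: term_freq[term] * entry_freq[term] for term in term_freq}


# Made-up publication entries for illustration.
entries = [
    "Query optimization for XML query languages. VLDB",
    "Query processing. SIGMOD",
]
personal = build_personal_dictionary(entries)
```

The resulting mapping can be compared with a webpage's term counts using the same cosine measure as for the domain dictionary.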
Heuristics Rules to distinguish a homepage from other websites. Our Heuristics: • In title: Name, “Homepage”, “DBLP”, “eventseer” • In URL: A version of the person’s name, “citeseer” • In body: Visual cues, specific keywords {University, Department, Professor, Research, Homepage} • Co-occurrence of “publication” and the person’s name.
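One possible way to turn some of these rules into a score is sketched below. Several choices are assumptions rather than anything stated on the slide: every rule that fires counts as one point, "a version of the person's name" is approximated by the lowercase last name, and the "DBLP", "eventseer", "citeseer", and visual-cue rules are left out because the slide does not say how they are applied.

```python
# Body keywords listed on the slide, lowercased for matching.
KEYWORDS = ("university", "department", "professor", "research", "homepage")


def heuristic_score(name, title, url, body):
    """Count how many of the simple homepage rules fire for one page."""
    name_l = name.lower()
    title_l, url_l, body_l = title.lower(), url.lower(), body.lower()
    last_name = name_l.split()[-1]  # assumed "version" of the name for URLs

    score = 0
    if name_l in title_l:
        score += 1
    if "homepage" in title_l:
        score += 1
    if last_name in url_l:
        score += 1
    # One point per body keyword that appears at least once.
    score += sum(1 for keyword in KEYWORDS if keyword in body_l)
    # Co-occurrence of "publication" and the person's name in the body.
    if "publication" in body_l and name_l in body_l:
        score += 1
    return score


# Made-up page for illustration.
example = heuristic_score(
    "Jane Doe",
    "Jane Doe - Homepage",
    "http://cs.example.edu/~doe/",
    "Jane Doe is a Professor in the Department of Computer Science. Publications: ...",
)
```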
Recall… [Diagram: Name -> (Personal Dictionary, Domain Dictionary, Heuristics) -> Weighting function -> Homepage]
Combining Scores Experimentally assign weights for the previous scoring functions. Return the URL with the highest score.
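A minimal sketch of the combination step, assuming the weighting function is a weighted sum of the three scores. The slide only says the weights are assigned experimentally, so the weight values, the dictionary keys, and the example URLs below are all hypothetical.

```python
# Hypothetical weights; the slide says these are tuned experimentally.
WEIGHTS = {"personal": 0.4, "domain": 0.3, "heuristics": 0.3}


def combined_score(scores, weights=WEIGHTS):
    """Weighted sum of the per-function scores (assumed form)."""
    return sum(w * scores.get(key, 0.0) for key, w in weights.items())


def best_homepage(candidates, weights=WEIGHTS):
    """Return the candidate URL with the highest combined score."""
    return max(candidates, key=lambda url: combined_score(candidates[url], weights))


# Made-up candidate pages with made-up scores.
candidates = {
    "http://a.example/": {"personal": 0.9, "domain": 0.8, "heuristics": 0.7},
    "http://b.example/": {"personal": 0.2, "domain": 0.9, "heuristics": 0.1},
}
winner = best_homepage(candidates)
```

Keeping the weights in a single mapping makes it straightforward to add another scoring function later: add its key and weight, and supply its score for each candidate.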
Strengths • Disambiguates between people with the same name, provided that only one of them works in the database field. • Fits well in the DBLife architecture, since our algorithm runs offline over the whole list of researchers that we get from DBLP.
Strengths (cont.) • Incremental architecture: finds new researchers through DBLP and new domain-related words through DBWorld. • Modular architecture: we can add more scoring functions.
Limitations • Can’t distinguish between pages that closely resemble the homepage we are looking for. • Can’t distinguish between people with the same name who work in the same area (databases). • Depends on Google, DBLP, and DBWorld.
Demo …