Harvesting useful information on researchers' home pages Ta Nha Linh Supervisor: Asst. Prof. Min-Yen Kan TIM
Motivation
• Databases dedicated to scientific publications: CiteSeer, Google Scholar, ACM Portal, SpringerLink
• How about the authors of those publications?
• Publication-centric.
Motivation
• Researcher-centric database?
• Singapore Researchers Database: researchers must sign up and enter their own data, under restricted conditions; Singapore only
• Resilience Alliance Researchers Database: manual submission by researchers; ecological and social sciences only
• Some other similar databases: manually updated, specific to a certain organization
Goal: an automated system to build a researchers database, for multiple disciplines
• Where to get the information? Their home pages.
  • Basic information
  • Contact information
  • Educational history
  • Publications
Challenges
• Different layouts
  • Templates
  • Personal pages
• Different content
  • Pages introducing researchers
  • CV-like
  • Personal pages
• Different content structures
  • Tables / lists
  • Natural language text
Challenges
• Different data presentations
  • hangli at microsoft dot com
  • cs.duke.edu, junyang
  • ASJMZheng@ntu.edu.sg
  • erafalin(at)cs.tufts.edu
  • <Image src=’email.jpg’/>
  • Natalio.Krasnogor -replace all this by at symbol- nottingham.ac.uk
  • wmt then the at-sign then uci dot edu
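To make the challenge concrete, here is a minimal Python sketch of rule-based normalization for some of the address variants listed above. The slides do not say how TIM handles these forms; the rewrite rules and the name `normalize_email` are assumptions made for illustration only.

```python
import re

# Hypothetical rewrite rules, ordered from most to least specific.
_AT_PATTERNS = [
    r"\s*-replace all this by at symbol-\s*",
    r"\s+then the at-sign then\s+",
    r"\s*\(at\)\s*",
    r"\s+at\s+",
]
_DOT_PATTERN = r"\s+dot\s+"
_PLAIN = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")


def normalize_email(text):
    """Return a plain address if one can be recovered, else None."""
    s = text.strip()
    if "@" not in s:
        # Apply only the first rule that matches.
        for pattern in _AT_PATTERNS:
            s, replaced = re.subn(pattern, "@", s, count=1, flags=re.IGNORECASE)
            if replaced:
                break
    s = re.sub(_DOT_PATTERN, ".", s, flags=re.IGNORECASE)
    return s if _PLAIN.match(s) else None
```

Under these rules the reversed form (`cs.duke.edu, junyang`) and the image form are not recovered and return `None`, which illustrates why fixed patterns alone do not cover every presentation.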
System Architecture
• Fields Identification (Tagging Core)
• Home page Identification
• Post Processing
Fields Identification - Purpose
• To map data in the page contents to the corresponding fields in a pre-defined set of desired information.
• Current set includes:
  • Name – Position – Affiliation
  • Address – Phone – Fax – Email
  • BS year – BS major – BS university
  • MS year – MS major – MS university
  • PhD year – PhD major – PhD university
  • Research Interest – Publications
Fields Identification - Related works
• Tang et al. (2007, 2008) – ArnetMiner
  • Preprocessing: tokenize text into 5 categories
  • Tagging of tokens using a Conditional Random Field (CRF)
  • F1 = 83.37% (~1,000 researchers)
  • Set of features used:
    + Content features (word, morphological, image features)
    + Pattern features (positive word, special token, researcher name features)
    + Term features (term, dictionary features)
Fields Identification - Related works
• Tang et al. (2007, 2008) – ArnetMiner
  • Takes the researcher’s name as input. This is important information that can be exploited when parsing other fields. TIM does not have it.
  • Based only on the text of the page, although stylistic information can also be of use.
Fields Identification - Related works
• Cai et al. (2003)
  • VIsion-based Page Segmentation (VIPS) algorithm, which produces a vision-based content structure of a web page
  • Makes use of the DOM tree and visual cues on web pages
  • May help in narrowing down relevant sections
  • Drawback: needs a browser to get the visual information
Fields Identification - Related works
• Lee (2004): PARCELS Stylistic Engine
  • Makes use of some heuristics proposed by Cai et al. (2003)
  • Parses the DOM tree for text-only and stylistic properties
  • Text-only data is passed to another engine for further processing
  • Stylistic data is stored in a vector for machine learning, to classify sections with a set of domain-specific tags
  • The domain used was the news domain
Fields Identification - Method
• Input: a researcher home page
• CRF is employed as the automated learning model
• Features used
  • Global features
  • Lexicon features
  • Context features
  • Dictionary features
  • Stylistic features
Fields Identification - Method
• Global features: apply to the current token
  • Morphological features
  • Initials
  • Number
  • Punctuation
• Lexicon features: apply to the current token
  • Positive words for certain annotation fields: Position, Affiliation, Address, Phone, Fax, Email
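As an illustration of token-level global features, the following Python sketch computes a few plausible ones. The slides name only the feature groups (morphological, initials, number, punctuation); the exact definitions and the name `global_features` are assumptions, not TIM's actual feature set.

```python
import string


def global_features(token):
    """Return simple per-token features (hypothetical definitions)."""
    return {
        # Morphological cues
        "is_capitalized": token[:1].isupper(),
        "is_all_caps": token.isupper(),
        # Initials such as "J."
        "is_initial": len(token) == 2 and token[0].isupper() and token[1] == ".",
        # Numeric cues
        "is_number": token.isdigit(),
        "has_digit": any(c.isdigit() for c in token),
        # Token made up only of punctuation characters
        "is_punctuation": bool(token) and all(c in string.punctuation for c in token),
    }
```

Each key would become one feature column for the token when training the CRF.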
Fields Identification - Method
• Context features: apply to the whole line
  • Name context
  • Address context
  • Phone context: 'phone', 'tel', 'mobile'
  • Fax context: 'fax', 'facsimile'
  • Email context: 'email', 'e-mail'
  • Bachelor (BS) context: appearance of 'B.S' or 'BS' or 'Bachelor'
  • Master (MS) context: appearance of 'M.S' or 'MS' or 'Master'
  • Ph.D (PhD) context: appearance of 'Ph.D' or 'Doctorate' or 'Doctor(ate) of Philosophy'
  • Research-interest context: multi-line property
  • Publication context: multi-line property
  • Degree: helps to correctly differentiate BS/MS/PhD information when it is presented in prose style or on the same line.
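The keyword-based context features above can be sketched in Python as follows. The keywords are the ones listed on the slide; the matching strategy (word-boundary regular expressions, case-insensitive for contact keywords and case-sensitive for degree abbreviations) and the name `context_features` are assumptions.

```python
import re

# Keywords come from the slide; the regex form is an assumption.
CONTEXT_PATTERNS = {
    "phone": re.compile(r"\b(phone|tel|mobile)\b", re.IGNORECASE),
    "fax": re.compile(r"\b(fax|facsimile)\b", re.IGNORECASE),
    "email": re.compile(r"\be-?mail\b", re.IGNORECASE),
    # Degree abbreviations are matched case-sensitively to avoid
    # firing on ordinary lowercase words.
    "bs": re.compile(r"\b(B\.S|BS|Bachelor)\b"),
    "ms": re.compile(r"\b(M\.S|MS|Master)\b"),
    "phd": re.compile(r"\b(Ph\.D|Doctorate|Doctor(ate)? of Philosophy)\b"),
}


def context_features(line):
    """Return one boolean per context, shared by every token on the line."""
    return {name: bool(p.search(line)) for name, p in CONTEXT_PATTERNS.items()}
```

For a line such as `Tel: +65 1234 5678, Fax: +65 8765 4321`, both the phone and fax contexts fire, so the token-level features still have to decide which number belongs to which field.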
Fields Identification - Method
• Dictionaries
  • ParsCit dictionary: detects male names, female names, popular last names, month names, place names and publisher names; each is a single feature
  • Major dictionary: helps to identify researchers' majors in their educational history; may also help with Research Interests
  • Research dictionary: entries classified into high/mid/low confidence
  • Universities dictionary: names of most universities, according to the Open Directory
Fields Identification - Method
• Stylistic features
  • List feature
  • Table features
  • Section feature: based on HTML tags like <div>, <p>, <title>, header tags, list elements, table
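One way to derive list, table and header flags for each piece of text is to track the enclosing tags while parsing, as in this Python sketch built on the standard library's `html.parser`. It is a simplification for illustration: the class name `StyleTagger` and the tag groupings are assumptions, and the slides do not describe how TIM computes these features.

```python
from html.parser import HTMLParser

LIST_TAGS = {"ul", "ol", "dl"}
HEADER_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "title"}


class StyleTagger(HTMLParser):
    """Collect (text, flags) pairs, where flags record enclosing structures."""

    def __init__(self):
        super().__init__()
        self.depth = {"list": 0, "table": 0, "header": 0}
        self.chunks = []

    @staticmethod
    def _kind(tag):
        if tag in LIST_TAGS:
            return "list"
        if tag == "table":
            return "table"
        if tag in HEADER_TAGS:
            return "header"
        return None

    def handle_starttag(self, tag, attrs):
        kind = self._kind(tag)
        if kind:
            self.depth[kind] += 1

    def handle_endtag(self, tag):
        kind = self._kind(tag)
        if kind and self.depth[kind] > 0:
            self.depth[kind] -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:
            flags = {k: v > 0 for k, v in self.depth.items()}
            self.chunks.append((text, flags))
```

After `tagger.feed(html)`, each entry in `tagger.chunks` carries the flags that would be copied onto the tokens of that text chunk.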
Fields Identification - Performance
• Data set of 40 home pages, cross-validation
• Overall: processed 29271 tokens with 29271 phrases; found: 29271 phrases; correct: 23444; accuracy: 80.09%; precision: 80.09%; recall: 80.09%; FB1: 80.09
• Per field (last column: number of phrases found):

Field               Precision   Recall    FB1     Found
address             78.90%      74.57%    76.67   327
affiliation         30.27%      59.47%    40.12   1110
bs-major            88.89%      78.05%    83.12   36
bs-uni              68.67%      57.00%    62.30   83
bs-year             90.00%      72.00%    80.00   20
email               79.31%      70.77%    74.80   58
fax                 47.73%      72.41%    57.53   88
misc                85.23%      92.35%    88.65   22888
ms-major            71.43%      32.26%    44.44   14
ms-uni              52.94%      52.94%    52.94   85
ms-year             77.78%      56.00%    65.12   18
name                75.66%      51.34%    61.17   152
phd-major           83.33%      73.17%    77.92   36
phd-uni             74.56%      72.03%    73.28   114
phd-year            100.00%     74.07%    85.11   20
phone               53.38%      89.25%    66.80   311
position            79.46%      64.49%    71.20   112
publications        71.05%      43.27%    53.79   3240
research-interest   48.48%      36.04%    41.34   559
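The FB1 column is the harmonic mean of precision and recall, and the overall accuracy is the share of correct tokens. A small Python check, using only figures reported above (the function name `f1` is illustrative):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values taken from the per-field results above.
address_f1 = f1(78.90, 74.57)       # reported FB1: 76.67
affiliation_f1 = f1(30.27, 59.47)   # reported FB1: 40.12

# Overall accuracy: correct tokens / processed tokens.
accuracy = 100 * 23444 / 29271      # reported: 80.09%
```

Because the reported precision and recall are themselves rounded, recomputing FB1 from them can differ from the reported value in the last digit for some fields.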
Fields Identification - Discussion
• The data fields to be annotated are similar to those of ArnetMiner.
  • Extra: Name, Research Areas, Publications
  • Missing: Image
• The stylistic features used are minimal
Fields Identification - Discussion
• The F1 value is slightly lower than ArnetMiner’s
  • ArnetMiner has the researcher name as input, and uses features referring to the researcher name to identify other fields. TIM has absolutely no prior knowledge about the page to be parsed.
• Identifying ‘Research Interest’ and ‘Publications’ is the most challenging. They are not always present and, when present, appear in various styles
Home page Identification - Purpose
• Add-on component
• Completes the automation of the system: finds home pages to feed to the Fields Identification component.
Home page Identification – Related works
• Ahoy!
  • Input: researcher name and institution name (optional)
  • Uses MetaCrawler as a 'reference source', cross-filtered by an email database
  • Heuristic-based filter: based entirely on the reference's title, URL, and short textual extract (if supplied by the search engine)
  • Ranking: based on 1/ person name match, 2/ institution URL match, 3/ whether the page appears to be a home page
  • URL Pattern Extraction and Generation: extract and learn the pattern on success, else generate a URL from a database of URL patterns
Home page Identification – Related works
• Ahoy!
  • Dynamic search, high performance reported; the use of URL patterns is a good feature
  • Does not serve the same purpose as TIM's Home page Identification, which should not take a researcher name as input.
  • The definition of ‘home page’ is not the same: Ahoy! classifies based on URL patterns, TIM classifies based on page contents.
Home page Identification – Method
• Collect a list of university domains
• Use Yahoo! BOSS to search for professors in the institutions
• For each valid web page, fetch the page and scan for words indicating ‘phone’, ‘mail’ and ‘professor’.
• Count the number of appearances of each.
• #phone < 3 && #mail < 2 && #professor < 5 → Home page
• Home pages are passed to the Fields Identification component.
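The counting heuristic above can be sketched in Python as follows. The thresholds are the ones on the slide; counting case-insensitive substring occurrences and the function names are assumptions, since the slide does not list the exact indicator words.

```python
import re

# Indicator -> exclusive upper bound, taken from the slide's rule.
LIMITS = {"phone": 3, "mail": 2, "professor": 5}


def count_indicators(text):
    """Count case-insensitive occurrences of each indicator (assumed matching)."""
    lowered = text.lower()
    return {word: len(re.findall(re.escape(word), lowered)) for word in LIMITS}


def looks_like_home_page(text):
    """Apply the rule: #phone < 3 and #mail < 2 and #professor < 5."""
    counts = count_indicators(text)
    return all(counts[word] < limit for word, limit in LIMITS.items())
```

The intuition is that a staff directory repeats these words many times, while a single researcher's page mentions them only once or twice. As the rule is written, a page containing none of the indicators also passes.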
Home page Identification – Discussion
• The query to Yahoo! BOSS is not optimal, but it covers the majority
• Drawback: the result set from Yahoo! BOSS may contain duplicate pages, or sub-pages of a researcher’s home page → treated as 2 different records.
  • Need high confidence in overall system performance, but researcher names are not unique.
  • It would be best to eliminate duplication by analyzing URLs, but domain hierarchies differ within a department, between departments, and between institutions.
Post-processing - Purpose
• Input: CRF++ output file from Fields Identification.
• Group neighboring tokens identified with the same annotation tag
• Deduplication
• Store into database
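The grouping and deduplication steps can be sketched in Python as below. The sketch assumes each non-empty line of the tagger output holds whitespace-separated columns with the token first and the predicted tag last, and that `misc` marks tokens outside every field; the function names are illustrative, not TIM's.

```python
from itertools import groupby


def parse_tagged_lines(lines):
    """Return (token, predicted_tag) pairs; blank lines are skipped."""
    pairs = []
    for line in lines:
        cols = line.split()
        if cols:
            pairs.append((cols[0], cols[-1]))
    return pairs


def group_fields(pairs, skip_tag="misc"):
    """Merge runs of neighboring tokens that share the same tag."""
    fields = []
    for tag, run in groupby(pairs, key=lambda pair: pair[1]):
        if tag != skip_tag:
            fields.append((tag, " ".join(token for token, _ in run)))
    return fields


def deduplicate(fields):
    """Drop exact duplicate (tag, value) pairs, keeping first-seen order."""
    return list(dict.fromkeys(fields))
```

Exact-match deduplication is the simplest possible choice; near-duplicates such as differently formatted phone numbers would need field-specific normalization first.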
Contribution
• Produced an automated system for fetching researchers’ information from the World Wide Web.
• Introduced a number of features for machine learning in Fields Identification.
Future improvements
• Fields Identification
  • Introduce more features, especially stylistic features
  • Strengthen features targeting the Name, Research Interest and Publications tags
  • Cater for the <image> tag
  • Handle pages using HTML frames
  • Follow links on the page if necessary
• Home page Identification
  • Improve heuristics
• Post-processing
  • Refine output from Fields Identification
• A new component providing a front end for users to query the database
THANK YOU! Questions?