ML: Classical methods from AI • Decision-Tree induction • Exemplar-based Learning • Rule Induction • TBEDL

Rule Induction • We will follow (again) the ACL’99 tutorial Symbolic Machine Learning for NLP (Mooney & Cardie 99) • Sequential Covering • Greedy Covering • Strategies for Learning a Single Rule: • Top-Down vs. Bottom-Up
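
Sequential covering with a greedy, top-down single-rule learner can be sketched as follows. This is a minimal illustration, not the tutorial's own code: the attribute-value representation, the precision-based scoring, and the names `covers`, `learn_one_rule`, and `sequential_covering` are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

Example = Dict[str, str]          # attribute -> value (assumed representation)
Rule = List[Tuple[str, str]]      # conjunction of "attribute == value" tests


def covers(rule: Rule, x: Example) -> bool:
    """A rule covers an example if every test in the conjunction holds."""
    return all(x.get(attr) == val for attr, val in rule)


def learn_one_rule(pos: List[Example], neg: List[Example]) -> Rule:
    """Top-down search: start from the empty (most general) rule and
    greedily add the test with the best precision on the covered examples."""
    rule: Rule = []
    pos_c, neg_c = list(pos), list(neg)
    while neg_c:
        # Candidate tests come from the positives that are still covered.
        candidates = {(a, v) for x in pos_c for a, v in x.items()} - set(rule)
        if not candidates:
            break  # cannot specialize further; the rule may stay imperfect

        def score(test: Tuple[str, str]) -> Tuple[float, int]:
            p = sum(covers([test], x) for x in pos_c)
            n = sum(covers([test], x) for x in neg_c)
            return (p / (p + n), p)  # precision first, then coverage

        best = max(sorted(candidates), key=score)
        rule.append(best)
        pos_c = [x for x in pos_c if covers([best], x)]
        neg_c = [x for x in neg_c if covers([best], x)]
    return rule


def sequential_covering(pos: List[Example], neg: List[Example]) -> List[Rule]:
    """Learn one rule, remove the positives it covers, and repeat."""
    rules: List[Rule] = []
    remaining = list(pos)
    while remaining:
        rule = learn_one_rule(remaining, neg)
        rules.append(rule)
        remaining = [x for x in remaining if not covers(rule, x)]
    return rules


# Toy data, invented for illustration only.
toy_pos = [{"sky": "sunny", "wind": "weak"}, {"sky": "cloudy", "wind": "weak"}]
toy_neg = [{"sky": "sunny", "wind": "strong"}, {"sky": "rainy", "wind": "strong"}]
toy_rules = sequential_covering(toy_pos, toy_neg)
```

A bottom-up learner would instead start from a single positive example (the most specific rule) and drop tests to generalize it; only the top-down direction is sketched here.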

Rule Induction • Propositional FOIL • Relational Learning and Inductive Logic Programming (ILP) • FOIL • Applications: • Text Categorization • Information Extraction
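
FOIL picks each new literal with an information-gain heuristic. The sketch below computes the usual FOIL gain, t · (log2(p1/(p1+n1)) − log2(p0/(p0+n0))), where p0, n0 are the positive and negative bindings covered before adding the literal and p1, n1 those covered after. The function name `foil_gain` and the default t = p1 (which holds in the propositional case) are assumptions made for this example.

```python
import math
from typing import Optional


def foil_gain(p0: int, n0: int, p1: int, n1: int, t: Optional[int] = None) -> float:
    """FOIL information gain for adding one literal to a rule.

    p0, n0: positives/negatives covered before adding the literal (p0 > 0).
    p1, n1: positives/negatives covered after adding the literal.
    t: positives covered both before and after; defaults to p1, which is
       correct for propositional rules (an assumption of this sketch).
    """
    if p1 == 0:
        return 0.0  # the literal keeps no positives, so it is useless
    if t is None:
        t = p1
    before = math.log2(p0 / (p0 + n0))
    after = math.log2(p1 / (p1 + n1))
    return t * (after - before)


# A rule covering 4 positives and 4 negatives is specialized so that
# it covers 2 positives and no negatives.
example_gain = foil_gain(4, 4, 2, 0)
```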

Rule Induction and NLP • Text Categorization (Cohen 95, 96; Craven et al. 98; Slattery & Craven 98) • Semantic Parsing (Zelle & Mooney 93, 94, 96) • Information Extraction (Soderland 95, 96, 99; Freitag 98a, 98b, 98c; Califf & Mooney 97, 99; Turmo & Rodríguez 01) • Generation (Radev 98)

Information Extraction (Turmo & Rodríguez, 01)

Information Extraction (Turmo & Rodríguez, 01) • Example sentence: “Vira a marrón oscuro al corte” ("It turns dark brown when cut")

Information Extraction (Turmo & Rodríguez, 01) • Basic concepts • Colour: <n3, n4> • Derived concepts • Color_state: <n5, n9>

Information Extraction (Turmo & Rodríguez, 01) • Global results? • Using FOIL (First Order Inductive Learner, Quinlan, 1990) as the basic learner • 38 rules were learned by FOIL for color; only 1 was ill-formed • Example rules:
isa_color(A, A) :- pos_s_adj(A), has_hypernym_03464977n(A), ancestor(A, C), pos_s_adj(C).
isa_color(A, A) :- has_hypernym_03460270n(A), brother(C,A), pos_nc(C), has_hypernym_00009919n(C).
…

Information Extraction (Turmo & Rodríguez, 01) • Drawbacks of the learning process • Insufficient amount of positive examples • Active Learning • Artificial examples • Relevance of negative examples • Use of empirical observations • Freitag’s baseline • Use of a distance measure between examples • Use of clustering techniques

Internet IE: Information Extraction • The Web→KB Project • CMU Text Learning Group (Tom Mitchell, Andrew McCallum, Mark Craven, etc.) • Situation: more than 350 million Web pages are available from a personal workstation; however, none of them is understandable to your computer • Goal: To automatically create a computer-understandable knowledge base whose content mirrors that of the WWW • Utility: Allowing much more effective information retrieval and supporting knowledge-based inference and problem solving on the World Wide Web • How: Using machine learning to create information extraction methods for each of the desired types of knowledge

Internet IE: WebKB architecture (ontology) • Entities • Person: department_of, projects_of, name_of, ... • Student: advisors_of, courses_TAed_by • Faculty: projects_led_by, students_of

Internet IE: WebKB architecture (Web pages) • Fundamentals of CS Home Page: "Instructors: Jim, Tom" • Jim’s Home Page: "I teach several courses: Fundamentals of CS, Intro to AI. My research includes: Intelligent web agents, Human computer interaction"

Internet IE: WebKB architecture (KB instances) • Fundamentals-of-CS: instructors_of: jim, tom; home_page: • Jim: courses_taught_by: fundamentals-of-CS, intro-to-AI; home_page:

Internet IE: WebKB architecture [Figure: INPUT (Web pages, ontology) → TRAINING (learning algorithms, ...) → RESULT (classification rules, relation extraction rules, extraction rules, ...) → TEST (WebKB, WWW)]

Internet IE: Learning Tasks • Recognizing class instances by classifying bodies of text • Recognizing relation instances by classifying chains of hyperlinks • Recognizing class and relation instances by extracting small fields of text from Web pages

Internet IE: Learning Tasks • Recognizing class instances by classifying bodies of text • Bayesian text categorization • Several text representations • Exploiting hyperlink relations • Relational text categorization • Clustering of documents • Exploiting combination of several classifiers
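
Bayesian text categorization of page text can be illustrated with a multinomial naive Bayes classifier over a bag-of-words representation. This is a generic sketch with Laplace smoothing, not the WebKB implementation; the function names `train_nb` and `classify_nb` and the toy documents are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple

Model = Tuple[Counter, Dict[str, Counter], Set[str]]


def train_nb(docs: List[Tuple[List[str], str]]) -> Model:
    """Collect class frequencies and per-class word counts.

    docs: list of (tokens, label) pairs.
    """
    class_counts: Counter = Counter(label for _, label in docs)
    word_counts: Dict[str, Counter] = defaultdict(Counter)
    vocab: Set[str] = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def classify_nb(model: Model, tokens: List[str]) -> str:
    """Return the class with the highest log posterior (Laplace smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = "", -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokens:
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training pages, invented for illustration only.
toy_docs = [
    ("course syllabus homework".split(), "course"),
    ("lecture homework exam".split(), "course"),
    ("research publications advisor".split(), "faculty"),
    ("professor research teaching".split(), "faculty"),
]
toy_model = train_nb(toy_docs)
predicted = classify_nb(toy_model, "homework exam".split())
```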

Internet IE: Learning Tasks • Recognizing relation instances by classifying chains of hyperlinks • Discovering hyperlink paths of unknown and variable size • First-order representation • Induction of relational rules (FOIL) • Example rules:
course(A) ∧ person(B) ∧ link_to(B,A) ⇒ instructor_of(A,B)
research_project(A) ∧ person(C) ∧ link_to(L1,A,B) ∧ link_to(L2,B,C) ∧ neighbour_word_people(L1) ⇒ member_proj(A,C)
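
To make the two relational rules on this slide concrete, the sketch below applies them to a handful of ground facts. It only shows rule application, not FOIL induction. The fact encoding is an assumption: every hyperlink is stored as a triple (link id, source page, target page), so the two-argument link_to of the first rule simply ignores the link id. All page and link names are hypothetical.

```python
from typing import Set, Tuple

Link = Tuple[str, str, str]  # (link_id, source_page, target_page)

# Hypothetical ground facts, invented for illustration only.
links: Set[Link] = {
    ("l1", "jim_home", "cs101"),
    ("l2", "proj_home", "people_page"),
    ("l3", "people_page", "ann_home"),
}
course: Set[str] = {"cs101"}
person: Set[str] = {"jim_home", "ann_home"}
research_project: Set[str] = {"proj_home"}
neighbour_word_people: Set[str] = {"l2"}  # links whose anchor text is near "people"


def instructor_of() -> Set[Tuple[str, str]]:
    """course(A) and person(B) and link_to(B, A) => instructor_of(A, B)."""
    return {(a, b) for (_, b, a) in links if a in course and b in person}


def member_proj() -> Set[Tuple[str, str]]:
    """research_project(A) and person(C) and link_to(L1, A, B)
    and link_to(L2, B, C) and neighbour_word_people(L1) => member_proj(A, C)."""
    return {
        (a, c)
        for (l1, a, b) in links
        for (_, b2, c) in links
        if b == b2
        and a in research_project
        and c in person
        and l1 in neighbour_word_people
    }


instructors = instructor_of()
members = member_proj()
```

The second rule follows a chain of two hyperlinks through an intermediate page B, which is what lets the learner capture relations that no single page states directly.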

Internet IE: Learning Tasks • Recognizing class and relation instances by extracting small fields of text from Web pages • Sequence Rules with Validation (Freitag, 98; 99): • FOIL-based general-purpose relational learner for IE • 77.4% accuracy! • Rule for extracting names of home page owners:
length(F,<,3) ∧ in_title(A) ∧ prev_word(A,"GMT") ∧ unknown(A) ∧ not(length(A,=,4)) ∧ follow_word(A,B) ∧ length(B,>,4) ⇒ ownername(F)

Internet IE: Evaluation • Training corpora (hand-labelled according to the prescribed ontology): • 8,000 Web pages • 1,400 Web-page pairs • From the computer science department Web sites at four universities: Cornell, University of Texas at Austin, University of Washington, and University of Wisconsin • Experimental test on the Web site of the computer science department at Carnegie Mellon University

Internet IE: Evaluation • Class instances • Relation instances

Rule Induction: Summary • Connection to Dan Roth’s work at the Cognitive Computation Group (Univ. of Illinois at Urbana-Champaign)
