Concepts, Ontologies, and Project TANGO Deryle Lonsdale BYU Linguistics and English Language lonz@byu.edu
Outline • NSF projects • Semantic Web • Concepts • Project TIDIE • Ontologies • Project TANGO • Tables • Ontology generation
Acknowledgements • NSF • David Embley (BYU CS), Steve Liddle (BYU Marriott School) and Yuri Tijerino • BYU Data Extraction Group members
The National Science Foundation • Federal agency, $5.5 billion budget, funds 20% of all federally supported basic research conducted by America’s colleges and universities • 7 directorates • Biological Sciences, Computer and Information Science and Engineering, Engineering, Geosciences, Mathematical and Physical Sciences, Social, Behavioral and Economic Sciences, and Education and Human Resources • 200,000 scientists, engineers, educators and students at universities, laboratories and field sites • 10,000 awards/year, 3 years duration (avg.)
The NSF Nifty 50 (general) • ACCELERATING, EXPANDING UNIVERSE • ANTARCTIC OZONE HOLE RESEARCH • ARABIDOPSIS—A PLANT GENOME PROJECT • BAR CODES • BLACK HOLES CONFIRMED • BUCKY BALLS • COMPUTER VISUALIZATION TECHNIQUES • DATA COMPRESSION TECHNOLOGY • DISCOVERY OF PLANETS • DOPPLER RADAR • EFFECTS OF ACID RAIN • EL NIÑO AND LA NIÑA PREDICTIONS • FIBER OPTICS • GEMINI TELESCOPES • HANTAVIRUS IDENTIFICATION • DNA FINGERPRINTING • MRI—MAGNETIC RESONANCE IMAGING • NANOTECHNOLOGY • THE NATIONAL OBSERVATORIES • OVERCOMING HEAVY METALS • OVERCOMING SALT TOXICITY • TISSUE ENGINEERING • TUMOR DETECTION • VOLCANIC ERUPTION DETECTION • YELLOW BARRELS
Language-related Nifty 50 • AMERICAN SIGN LANGUAGE DICTIONARY DEVELOPMENT • COMPUTER VISUALIZATION TECHNIQUES • THE DARCI CARD • DATA COMPRESSION TECHNOLOGY • THE "EYE CHIP" OR RETINA CHIP • THE INTERNET • PERSONS WITH DISABILITIES ACCESS TO THE WEB • PROJECT LISTEN • SPEECH RECOGNITION TECHNOLOGY • vBNS—VERY HIGH SPEED BACKBONE NETWORK SYSTEM • WEB BROWSERS
Browsing the Semantic Web • [Figure labels: the search query; annotation; synonym; hypernym]
Browsing the Semantic Web • Ranking based on content data and structure • Using lexical semantics for similarity search • Grouping results by their conceptual relationships • [Figure labels: movie; astronomy; sports]
Desirable, not (yet) possible • Word sense disambiguation • Other types of queries (e.g. services) • What is the cheapest available round-trip flight to Cancun the day after finals this semester? • Set up an appointment with my optometrist for next week. • List available 4-person BYU-approved apartments in Orem for under $150/month. • Find me a linguistics job in Tahiti.
Project TIDIE Apr 10, 2001 – May 12, 2005
Overview of TIDIE • 3-year NSF project at BYU • Total amount about $430,000 • PI David Embley (BYU CS), 4 co-PIs from BYU • 18 grad students, 45 publications • Demos, tools, papers, presentations at website (www.deg.byu.edu/)
Goal of TIDIE • Target-Based Independent-of-Document Information Extraction • Target-based: user specifies what to find • Not just keyword search, but concept-based search using an ontology • Document independent • Should work even if pages change over time, on new documents • IE: match, merge, retrieve, format information • Present in way that user can search, query results
Recognition and extraction
Car | Year | Make | Model | Mileage | Price | PhoneNr
0001 | 1989 | Subaru | SW | | $1900 | (336)835-8597
0002 | 1998 | | Elantra | | | (336)526-5444
0003 | 1994 | HONDA | ACCORD EX | 100K | | (336)526-1081
Car | Feature
0001 | Auto
0001 | AC
0002 | Black
0002 | 4 door
0002 | tinted windows
0002 | Auto
0002 | pb
0002 | ps
0002 | cruise
0002 | am/fm
0002 | cassette stereo
0002 | a/c
0003 | Auto
0003 | jade green
0003 | gold
Concepts • What drives the matching process for IE • Inherent in words, numbers, phrases, text • Linguistics: lexical semantics • Denotations: entities, attributes • Location: relationships • Occurrences: constraints
Concept matching • We use exhaustive concept matching techniques to find concepts in documents including: • Lexical information (lexicons) • Natural language processing (NLP) techniques • Similarity of values • Features of value • Data frames • Constraints
Lexicons • Repositories of enumerable classes of lexical information • FirstNames, LastNames, USStates, ProvoOremApts, CarMakes, Drugs, CampGroundFeats, etc. • WordNet (synonyms, word senses, hypernyms/hyponyms)
The data-frame library • Snippets of real-world knowledge about data (type, length, nearby keywords, patterns [as in regexps], functional relations, etc.) • Low-level patterns implemented as regular expressions • Match items such as email addresses, phone numbers, names, etc.
Mileage matches [8] constant
  { extract "\b[1-9]\d{0,2}k"; substitute "[kK]" -> "000"; },
  { extract "[1-9]\d{0,2}?,\d{3}"; context "[^\$\d][1-9]\d{0,2}?,\d{3}[^\d]"; substitute "," -> ""; },
  { extract "[1-9]\d{0,2}?,\d{3}"; context "(mileage\:\s*)[^\$\d][1-9]\d{0,2}?,\d{3}[^\d]"; substitute "," -> ""; },
  { extract "[1-9]\d{3,6}"; context "[^\$\d][1-9]\d{3,6}\s*mi(\.|\b\les\b)"; },
  { extract "[1-9]\d{3,6}"; context "(mileage\:\s*)[^\$\d][1-9]\d{3,6}\b"; };
  keyword "\bmiles\b", "\bmi\.", "\bmi\b", "\bmileage\b";
end;
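To make the extract / context / substitute idea concrete, here is a minimal Python sketch of two of the Mileage rules. The rule fields follow the slide, but the dictionary encoding, the function name find_mileage_candidates, and the slightly simplified regular expressions are assumptions made for illustration, not the project's actual implementation.

```python
import re

# Simplified transcription of two Mileage rules. The Python encoding of
# a rule (a dict with extract, context, substitute) is an assumption.
RULES = [
    # "100K" -> "100000"
    {"extract": r"\b[1-9]\d{0,2}[kK]\b",
     "context": None,
     "substitute": (r"[kK]", "000")},
    # "7,000" -> "7000", but not when preceded by "$" or a digit
    {"extract": r"[1-9]\d{0,2},\d{3}",
     "context": r"[^$\d][1-9]\d{0,2},\d{3}[^\d]",
     "substitute": (r",", "")},
]

# Keywords that signal a mileage value is nearby.
KEYWORDS = re.compile(r"\bmiles\b|\bmi\.|\bmi\b|\bmileage\b", re.IGNORECASE)


def find_mileage_candidates(text):
    """Return (normalized_value, start, end) for every rule match (0-based, end exclusive)."""
    found = []
    for rule in RULES:
        # The context pattern selects regions; extract is applied inside each one.
        if rule["context"]:
            regions = [m.span() for m in re.finditer(rule["context"], text)]
        else:
            regions = [(0, len(text))]
        for lo, hi in regions:
            for m in re.finditer(rule["extract"], text[lo:hi]):
                pattern, replacement = rule["substitute"]
                value = re.sub(pattern, replacement, m.group())
                found.append((value, lo + m.start(), lo + m.end()))
    return found
```

In this sketch the context pattern is what keeps a price such as $11,995 from being read as a mileage, while the keyword list is kept separate so a later step can use keyword proximity to choose among candidates.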
Isolated concepts are OK, but... • We’re also interested in the relations between concepts • This is often best done graphically • Ontology: arrangement of concepts that makes their relations and constraints explicit • Conceptual modeling: field of CS / linguistics that deals with formalizing concepts, using such information • BYU has its own well-known conceptual modeling framework (OSM)
Conceptual modeling (OSM) • [Figure: OSM diagram of the car ads ontology. Car is linked by "has" relationships to Year, Make, Model, Mileage, Price and Feature; PhoneNr "is for" Car and "has" Extension; edges carry participation constraints such as 0..1, 0..* and 1..*]
Ontologies and IE • [Figure labels: Source; Target]
Constant/keyword recognition
'97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, 566-3800 or 566-3888
Descriptor/String/Position(start/end)
Year|97|2|3
Make|CHEV|5|8
Make|CHEVY|5|9
Model|Cavalier|11|18
Feature|Red|21|23
Feature|5 spd|26|30
Mileage|7,000|38|42
KEYWORD(Mileage)|miles|44|48
Price|11,995|100|105
Mileage|11,995|100|105
PhoneNr|566-3800|136|143
PhoneNr|566-3888|148|155
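The recognition step can be sketched in a few lines of Python. The tiny lexicons and patterns below are hypothetical stand-ins for the real lexicons and data frames, sized to cover only the first sentence of the ad; positions are reported as 1-based and inclusive, which is how the records on the slide appear to be numbered.

```python
import re

AD = "'97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles."

# Hypothetical miniature lexicons and patterns, just enough for this ad.
LEXICONS = {
    "Make": ["CHEV", "CHEVY"],
    "Model": ["Cavalier"],
    "Feature": ["Red", "5 spd"],
}
PATTERNS = {
    "Year": r"(?<=')\d{2}",
    "Mileage": r"[1-9]\d{0,2},\d{3}",
    "KEYWORD(Mileage)": r"\bmiles\b",
}


def recognize(text):
    """Return 'Descriptor|String|start|end' records, 1-based inclusive positions."""
    hits = []
    # Lexicon entries are matched literally, so overlapping entries
    # (CHEV inside CHEVY) are all reported at this stage.
    for descriptor, entries in LEXICONS.items():
        for entry in entries:
            for m in re.finditer(re.escape(entry), text):
                hits.append((m.start() + 1, m.end(), descriptor, m.group()))
    for descriptor, pattern in PATTERNS.items():
        for m in re.finditer(pattern, text):
            hits.append((m.start() + 1, m.end(), descriptor, m.group()))
    hits.sort()  # order by position in the document
    return [f"{d}|{s}|{start}|{end}" for start, end, d, s in hits]


for record in recognize(AD):
    print(record)
```

The recognizer deliberately over-generates: both CHEV and CHEVY are reported, and resolving such conflicts is left to the next stage.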
Database instance generator
Year|97|2|3
Make|CHEV|5|8
Make|CHEVY|5|9
Model|Cavalier|11|18
Feature|Red|21|23
Feature|5 spd|26|30
Mileage|7,000|38|42
KEYWORD(Mileage)|miles|44|48
Price|11,995|100|105
Mileage|11,995|100|105
PhoneNr|566-3800|136|143
PhoneNr|566-3888|148|155
insert into Car values(1001, '97', 'CHEVY', 'Cavalier', '7,000', '11,995', '566-3800')
insert into CarFeature values(1001, 'Red')
insert into CarFeature values(1001, '5 spd')
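How the recognized records turn into the three insert statements is not spelled out on the slide. The Python sketch below shows one plausible way, under two assumed heuristics: a hit contained in a longer hit of the same descriptor is dropped (CHEV vs. CHEVY), and for single-valued columns the candidate nearest a matching keyword wins, with ties going to the earliest hit. The function names and the SQL string building are illustrative only.

```python
RECORDS = [
    "Year|97|2|3", "Make|CHEV|5|8", "Make|CHEVY|5|9", "Model|Cavalier|11|18",
    "Feature|Red|21|23", "Feature|5 spd|26|30", "Mileage|7,000|38|42",
    "KEYWORD(Mileage)|miles|44|48", "Price|11,995|100|105",
    "Mileage|11,995|100|105", "PhoneNr|566-3800|136|143",
    "PhoneNr|566-3888|148|155",
]
CAR_COLUMNS = ["Year", "Make", "Model", "Mileage", "Price", "PhoneNr"]  # one value each
FEATURE = "Feature"                                                     # many values


def parse(records):
    """Split 'Descriptor|String|start|end' into (descriptor, string, start, end)."""
    hits = []
    for record in records:
        descriptor, string, start, end = record.split("|")
        hits.append((descriptor, string, int(start), int(end)))
    return hits


def is_subsumed(hit, hits):
    """Assumed heuristic: drop a hit lying inside a longer hit of the same descriptor."""
    d, _, start, end = hit
    return any(o[0] == d and o[2] <= start and end <= o[3]
               and o[3] - o[2] > end - start for o in hits)


def keyword_gap(hit, keywords):
    """Characters between a hit and the nearest keyword for its descriptor (0 if none)."""
    gaps = [min(abs(k_start - hit[3]), abs(hit[2] - k_end))
            for name, k_start, k_end in keywords if name == hit[0]]
    return min(gaps) if gaps else 0


def literal(value):
    """Render a value as an SQL literal (for display only, not for execution)."""
    if value is None:
        return "NULL"
    if isinstance(value, int):
        return str(value)
    return "'" + value.replace("'", "''") + "'"


def generate(records, car_id):
    hits = parse(records)
    keywords = [(d[len("KEYWORD("):-1], s, e)
                for d, _, s, e in hits if d.startswith("KEYWORD(")]
    values = [h for h in hits if not h[0].startswith("KEYWORD(")]
    values = [h for h in values if not is_subsumed(h, values)]
    row = [car_id]
    for column in CAR_COLUMNS:
        candidates = [h for h in values if h[0] == column]
        # Assumed heuristic: nearest keyword first, then earliest position.
        best = min(candidates, key=lambda h: (keyword_gap(h, keywords), h[2]),
                   default=None)
        row.append(best[1] if best else None)
    statements = [f"insert into Car values({', '.join(literal(v) for v in row)})"]
    for h in values:
        if h[0] == FEATURE:
            statements.append(
                f"insert into CarFeature values({car_id}, {literal(h[1])})")
    return statements


for statement in generate(RECORDS, 1001):
    print(statement)
```

With these heuristics, 7,000 is chosen as the mileage because it sits next to the keyword "miles", which leaves 11,995 as the price.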
Car ads ontology (textual)
Car [->object];
Car [0..1] has Year [1..*];
Car [0..1] has Make [1..*];
Car [0..1] has Model [1..*];
Car [0..1] has Mileage [1..*];
Car [0..*] has Feature [1..*];
Car [0..1] has Price [1..*];
PhoneNr [1..*] is for Car [0..*];
PhoneNr [0..1] has Extension [1..*];
Year matches [4] constant
  { extract "\d{2}"; context "([^\$\d]|^)[4-9]\d[^\d]"; substitute "^" -> "19"; },
  … …
End;
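The relationship declarations are regular enough to read mechanically. The Python sketch below parses them into records and lists which attributes are single-valued for a given object set. It assumes that the bracket after an object set bounds how many times each of its members may take part in the relationship, so that Car [0..1] has Year means a car has at most one year; the function names are hypothetical.

```python
import re

ONTOLOGY = """
Car [0..1] has Year [1..*];
Car [0..1] has Make [1..*];
Car [0..*] has Feature [1..*];
PhoneNr [1..*] is for Car [0..*];
PhoneNr [0..1] has Extension [1..*];
"""

# ObjectSet [min..max] relationship-name ObjectSet [min..max];
DECLARATION = re.compile(
    r"(\w+) \[(\d+)\.\.(\d+|\*)\] (.+?) (\w+) \[(\d+)\.\.(\d+|\*)\];")


def bound(text):
    """Convert an upper bound to int, with None standing for '*' (unbounded)."""
    return None if text == "*" else int(text)


def parse_relationships(text):
    """Return one dict per relationship-set declaration found in text."""
    relationships = []
    for m in DECLARATION.finditer(text):
        left, lmin, lmax, name, right, rmin, rmax = m.groups()
        relationships.append({
            "from": left, "from_card": (int(lmin), bound(lmax)),
            "name": name,
            "to": right, "to_card": (int(rmin), bound(rmax)),
        })
    return relationships


def single_valued(relationships, object_set):
    """Attributes that object_set can have at most once (upper bound 1)."""
    return [r["to"] for r in relationships
            if r["from"] == object_set and r["from_card"][1] == 1]


rels = parse_relationships(ONTOLOGY)
print(single_valued(rels, "Car"))
```

Under that reading, single-valued attributes such as Year and Make fit naturally as columns of a Car table, while Feature, with its 0..* bound, needs a separate table.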
Finding jobs in linguistics • Built ontology for linguistics jobs: what defines a linguistics job • Data frames and lexicons: language names (www.ethnologue.com), subfields of linguistics (www.linguistlist.org), tools linguists use, programming languages, activities, responsibilities, country names • Documents: 3500 web pages + emails to me • Complete results reported in DLLS 2003
Sample observations • 270 don’t have linguist* (!) • Computer/computational background required for almost 1/3 (1116) • Noticeable amount of headhunting, particularly in Seattle, DC areas • Often a job title is not even listed (!) • Great need for ontologies related to linguistics • job titles • theoretical frameworks, subfields • typical linguist job activities • linguistic research/development venues
An engineering discipline? • 160 linguistics jobs ending in “engineer” • Software development cycle • research e., software design e. • development e., software e. • software quality e., linguistic test e., linguistic quality e. • linguistic support e., user experience e. • presales e., technical sales e. • Specific subfields • web site e. • speech e., voice recognition e., speech recognition application e., speech e., ASR tuning e., audio e. • dialog e. • tools e. • AI e., NLP e. • knowledge e., ontology e. • linguist e., natural language e. • staff e. • human factors e., user interface e.
A recent ontologist job ad • Date: Thu, 28 Jul 2005 11:44:40 • Subject: General Linguistics: Ontologist, Denver, USA • Job Rank: Ontologist • Specialty Areas: General Linguistics • Position Summary: Ontologist • This person will be responsible for modifying & editing Ontology structures. • Skills: • Basic computer skills such as Internet, email, and spreadsheet programs • In-depth knowledge of any major industry, such as Health Care, Automotive, Legal, Construction, and so forth helpful • Superior communication skills, both oral and written. Ability to communicate effectively with reports, peers, superiors, and customers essential • Travel &/or foreign language experience desired • Personal Characteristics: • A healthy sense of logic, and a love for details • A deep and abiding love of language, and of rule-governed classification systems. This person should be excited by the challenge of figuring out the precise place where a word belongs, and be delighted with the prospect of performing such tasks as the major part of their job • Position Qualifications: • -Bachelor's degree, preferably in Linguistics, Library Science, English, or related field
Another recent ontologist ad • Position Summary: Lead Ontologist • The Lead Ontologist will be responsible for creating & designing Ontology and Ontology structures. This person will be responsible for innovation and general Ontology development as Ontology requirements change. They will serve as Team Lead on various Ontology projects, and they will assist the Director with certain aspects of management, including the development of department culture and standards. They will also serve as a liaison between the Director and the rest of the team. • Skills: • Ability to edit & manipulate text highly desired, using tools such as Emacs and Perl. High level programming language experience and SQL also desired • Knowledge of Ontology structures, and experience with developing and maintaining such structures • Ability to assist with Ontology development and use problem-solving skills to overcome obstacles • Ability to QA own Ontology work, and work of others • Ability to lead projects from set-up through to QA • Leadership or management experience a plus • Position Qualifications: • -Bachelor's degree in Linguistics, Library Science, or related field • -2-3 years experience in Ontology or related field • Application Deadline: Open until filled.
Matching request with ontology “Tell me about cruises on San Francisco Bay. I’d like to know scheduled times, cost, and the duration of cruises on Friday of next week.”
Building a query • Projection: scheduled times, cost, duration • Selection constants: San Francisco Bay; Friday, Oct. 29th • Join path • Result = ( ) [query expression not recoverable from the slide]
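One way to picture the query-building step is as filling in a select-project-join template once the request has been matched against the ontology. The Python sketch below does this for the cruise request; the table names, column names and the build_query function are all hypothetical, since the slide does not give the actual schema or query language.

```python
def build_query(projection, selection, join_path):
    """Assemble an SQL-like string from the three parts identified in the request.

    projection: attributes the user wants to see
    selection:  attribute -> constant recognized in the request
    join_path:  tables (object sets) connecting those attributes
    """
    select = ", ".join(projection)
    tables = " NATURAL JOIN ".join(join_path)
    where = " AND ".join(f"{attr} = '{value}'" for attr, value in selection.items())
    return f"SELECT {select} FROM {tables} WHERE {where}"


# Hypothetical schema for the cruise request.
query = build_query(
    projection=["ScheduledTime", "Cost", "Duration"],
    selection={"Location": "San Francisco Bay", "Date": "Friday, Oct. 29th"},
    join_path=["Cruise", "Schedule"],
)
print(query)
```

The string is built for display only; a real system would pass the constants as bound parameters and would first resolve "Friday of next week" to a concrete date, as the slide does with Oct. 29th.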
Another example • Service Request: “I want to see a dermatologist next week; any day would be ok for me, at 4:00 p.m. The dermatologist must be within 20 miles from my home and must accept my insurance.” • Match with Task Ontology • Domain Ontology • Process Ontology • Complete, Negotiate, Finalize
Ontologies: issues • Most successful in data-rich, narrow-domain applications • Ambiguities are problematic; context only partially eliminates them • Incompleteness: implicit information • Commonsense world pragmatics are elusive • Knowledge prerequisites are steep • Major efforts in creation, maintenance • Must be created by experts • Experts are biased in knowledge, agreement needed • Ontologies continually change; upkeep a massive task
Ontologies: possible solutions • Some automation is needed • Current automatic generation of ontologies is not successful, because the ontologies are extracted from free-form, unstructured text • A more effective alternative is to extract ontologies from structured data on the web (tables, charts, etc.) • TANGO project • Part 1: Extract tables from the web • Part 2: Define mini-ontologies from tables • Part 3: Merge into growing domain ontology
Overview • Table ANalysis for Generating Ontologies • 3-year NSF-funded project • Joint BYU/RPI project • Uses and extends TIDIE concepts, ontologies • Goal is to process tables, generate ontologies, use results for IE
Motivation • Keyword or link analysis search not enough to search for information in tables • Structure in tables can lead to domain knowledge which includes concepts, relationships and constraints (ontologies) • Tables on web created for human use can lead to robust domain ontologies
Table understanding • What is a table? • Why table normalization? • What is table understanding? • What is mini-ontology generation?
What is a table? • “…a two-dimensional assembly of cells used to present information…” (Lopresti and Nagy) • Normalized tables (row-column format) • Small paper (using OCR) and/or electronic tables (marked up) intended for human use