Concepts, Ontologies, and Project TANGO

Concepts, Ontologies, and Project TANGO. Deryle Lonsdale BYU Linguistics and English Language lonz@byu.edu. Outline. NSF projects Semantic Web Concepts Project TIDIE Ontologies Project TANGO Tables Ontology generation. Acknowledgements. NSF

Presentation Transcript


  1. Concepts, Ontologies, and Project TANGO Deryle Lonsdale BYU Linguistics and English Language lonz@byu.edu

  2. Outline • NSF projects • Semantic Web • Concepts • Project TIDIE • Ontologies • Project TANGO • Tables • Ontology generation

  3. Acknowledgements • NSF • David Embley (BYU CS), Steve Liddle (BYU Marriott School) and Yuri Tijerino • BYU Data Extraction Group members

  4. The National Science Foundation • Federal agency, $5.5 billion budget, funds 20% of all federally supported basic research conducted by America’s colleges and universities • 7 directorates • Biological Sciences, Computer and Information Science and Engineering, Engineering, Geosciences, Mathematics and Physical Sciences, Social, Behavioral and Economic Sciences, and Education and Human Resources • 200,000 scientists, engineers, educators and students at universities, laboratories and field sites • 10,000 awards/year, 3 years duration (avg.)

  5. The NSF Nifty 50 (general) • ACCELERATING, EXPANDING UNIVERSE • ANTARCTIC OZONE HOLE RESEARCH • ARABIDOPSIS—A PLANT GENOME PROJECT • BAR CODES • BLACK HOLES CONFIRMED • BUCKY BALLS • COMPUTER VISUALIZATION TECHNIQUES • DATA COMPRESSION TECHNOLOGY • DISCOVERY OF PLANETS • DOPPLER RADAR • EFFECTS OF ACID RAIN • EL NIÑO AND LA NIÑA PREDICTIONS • FIBER OPTICS • GEMINI TELESCOPES • HANTAVIRUS IDENTIFICATION • DNA FINGERPRINTING • MRI—MAGNETIC RESONANCE IMAGING • NANOTECHNOLOGY • THE NATIONAL OBSERVATORIES • OVERCOMING HEAVY METALS • OVERCOMING SALT TOXICITY • TISSUE ENGINEERING • TUMOR DETECTION • VOLCANIC ERUPTION DETECTION • YELLOW BARRELS

  6. Language-related Nifty 50 • AMERICAN SIGN LANGUAGE DICTIONARY DEVELOPMENT • COMPUTER VISUALIZATION TECHNIQUES • THE DARCI CARD • DATA COMPRESSION TECHNOLOGY • THE "EYE CHIP" OR RETINA CHIP • THE INTERNET • PERSONS WITH DISABILITIES ACCESS TO THE WEB • PROJECT LISTEN • SPEECH RECOGNITION TECHNOLOGY • vBNS—VERY HIGH SPEED BACKBONE NETWORK SYSTEM • WEB BROWSERS

  7. Browsing the Semantic Web [Figure labels: The search query, Annotation, Synonym, Hypernym]

  8. Browsing the Semantic Web • Ranking based on content data and structure • Using lexical semantics for similarity search • Grouping results by their conceptual relationships [Figure labels: movie, astronomy, sports]

  9. Desirable, not (yet) possible • Word sense disambiguation • Other types of queries (e.g. services) • What is the cheapest available round-trip flight to Cancun the day after finals this semester? • Set up an appointment with my optometrist for next week. • List available 4-person BYU-approved apartments in Orem for under $150/month. • Find me a linguistics job in Tahiti.

  10. Project TIDIE Apr 10, 2001 – May 12, 2005

  11. Overview of TIDIE • 3-year NSF project at BYU • Total amount about $430,000 • PI David Embley (BYU CS), 4 co-PI’s from BYU • 18 grad students, 45 publications • Demos, tools, papers, presentations at website (www.deg.byu.edu/)

  12. Goal of TIDIE • Target-Based Independent-of-Document Information Extraction • Target-based: user specifies what to find • Not just keyword search, but concept-based search using an ontology • Document independent • Should work even if pages change over time, and on new documents • IE: match, merge, retrieve, format information • Present results in a way that lets the user search and query them

  13. Document-based IE

  14. Recognition and extraction
      Car|Year|Make|Model|Mileage|Price|PhoneNr
      0001|1989|Subaru|SW||$1900|(336)835-8597
      0002|1998||Elantra|||(336)526-5444
      0003|1994|HONDA|ACCORD EX|100K||(336)526-1081

      Car|Feature
      0001|Auto
      0001|AC
      0002|Black
      0002|4 door
      0002|tinted windows
      0002|Auto
      0002|pb
      0002|ps
      0002|cruise
      0002|am/fm
      0002|cassette stereo
      0002|a/c
      0003|Auto
      0003|jade green
      0003|gold

  15. Concepts • What drives the matching process for IE • Inherent in words, numbers, phrases, text • Linguistics: lexical semantics • Denotations: entities, attributes • Location: relationships • Occurrences: constraints

  16. Concept matching • We use exhaustive concept matching techniques to find concepts in documents including: • Lexical information (lexicons) • Natural language processing (NLP) techniques • Similarity of values • Features of value • Data frames • Constraints

  17. Lexicons • Repositories of enumerable classes of lexical information • FirstNames, LastNames, USStates, ProvoOremApts, CarMakes, Drugs, CampGroundFeats, etc. • WordNet (synonyms, word senses, hypernyms/hyponyms)

  18. The data-frame library • Snippets of real-world knowledge about data (type, length, nearby keywords, patterns [as in regexps], functional relations, etc.) • Low-level patterns implemented as regular expressions • Match items such as email addresses, phone numbers, names, etc.
      Mileage matches [8]
        constant
          { extract "\b[1-9]\d{0,2}k"; substitute "[kK]" -> "000"; },
          { extract "[1-9]\d{0,2}?,\d{3}"; context "[^\$\d][1-9]\d{0,2}?,\d{3}[^\d]"; substitute "," -> ""; },
          { extract "[1-9]\d{0,2}?,\d{3}"; context "(mileage\:\s*)[^\$\d][1-9]\d{0,2}?,\d{3}[^\d]"; substitute "," -> ""; },
          { extract "[1-9]\d{3,6}"; context "[^\$\d][1-9]\d{3,6}\s*mi(\.|\b|les\b)"; },
          { extract "[1-9]\d{3,6}"; context "(mileage\:\s*)[^\$\d][1-9]\d{3,6}\b"; };
        keyword "\bmiles\b", "\bmi\.", "\bmi\b", "\bmileage\b";
      end;
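The extract / context / substitute idea in the Mileage data frame can be imitated in a few lines of Python. This is only a sketch under stated assumptions, not the project's implementation: the `Rule` class and `apply_rules` function are hypothetical names, only two of the slide's rules are transcribed, matching is assumed to be case-insensitive, and the extract pattern is assumed to be applied inside whatever the context pattern matched.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Rule:
    """One constant rule of a data frame (hypothetical Python stand-in)."""
    extract: str                                 # pattern for the value itself
    context: Optional[str] = None                # pattern that must surround the value
    substitute: Optional[Tuple[str, str]] = None  # (pattern, replacement) normalization


# Two of the Mileage rules from the slide, transcribed by hand.
MILEAGE_RULES = [
    Rule(extract=r"\b[1-9]\d{0,2}k", substitute=(r"[kK]", "000")),
    Rule(extract=r"[1-9]\d{0,2}?,\d{3}",
         context=r"[^\$\d][1-9]\d{0,2}?,\d{3}[^\d]",
         substitute=(",", "")),
]


def apply_rules(text, rules):
    """Return normalized values found by any rule (assumed semantics)."""
    values = []
    for rule in rules:
        # Without a context, the whole text is the search region.
        if rule.context is None:
            regions = [text]
        else:
            regions = [m.group(0)
                       for m in re.finditer(rule.context, text, re.IGNORECASE)]
        for region in regions:
            for m in re.finditer(rule.extract, region, re.IGNORECASE):
                value = m.group(0)
                if rule.substitute is not None:
                    pattern, replacement = rule.substitute
                    value = re.sub(pattern, replacement, value)
                values.append(value)
    return values


print(apply_rules("only 7,000 miles. Asking only $11,995.", MILEAGE_RULES))
print(apply_rules("Honda Accord, 100K, auto", MILEAGE_RULES))
```

The context pattern is what keeps the dollar amount out of the Mileage results: a value preceded by `$` fails the `[^\$\d]` check even though the extract pattern alone would accept it.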

  19. Isolated concepts are OK, but... • We’re also interested in the relations between concepts • This is often best done graphically • Ontology: arrangement of concepts that makes their relations and constraints explicit • Conceptual modeling: field of CS / linguistics that deals with formalizing concepts and using such information • BYU has its own well-known conceptual modeling framework (OSM)

  20. Conceptual modeling (OSM) [Diagram: object sets Car, Year, Make, Model, Mileage, Price, Feature, PhoneNr, and Extension, connected by "has" and "is for" relationship sets annotated with participation constraints such as 0..1, 0..*, and 1..*]

  21. Ontologies and IE [Figure labels: Source, Target]

  22. Constant/keyword recognition
      '97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, 566-3800 or 566-3888

      Descriptor/String/Position(start/end)
      Year|97|2|3
      Make|CHEV|5|8
      Make|CHEVY|5|9
      Model|Cavalier|11|18
      Feature|Red|21|23
      Feature|5 spd|26|30
      Mileage|7,000|38|42
      KEYWORD(Mileage)|miles|44|48
      Price|11,995|100|105
      Mileage|11,995|100|105
      PhoneNr|566-3800|136|143
      PhoneNr|566-3888|148|155
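A recognizer that produces Descriptor|String|start|end records like those on this slide might look roughly like the following. It is a minimal sketch, not the project's code: the `LEXICONS` and `PATTERNS` tables and the `recognize` function are hypothetical and cover only the opening of the ad. Positions are 1-based and inclusive, which is what the slide's numbers imply.

```python
import re

# Hypothetical mini-lexicons and patterns, just enough for one ad.
LEXICONS = {
    "Make": ["CHEV", "CHEVY"],
    "Model": ["Cavalier"],
    "Feature": ["Red", "5 spd"],
}
PATTERNS = {
    "Year": r"(?<=')\d{2}",
    "Mileage": r"[1-9]\d{0,2},\d{3}",
    "Price": r"[1-9]\d{0,2},\d{3}",
    "KEYWORD(Mileage)": r"\bmiles\b",
}


def recognize(text):
    """Exhaustively match every lexicon entry and pattern against the text."""
    hits = []
    for descriptor, entries in LEXICONS.items():
        for entry in entries:
            for m in re.finditer(re.escape(entry), text):
                # m.start() is 0-based; m.end() is exclusive, so it already
                # equals the 1-based inclusive end position.
                hits.append((descriptor, m.group(0), m.start() + 1, m.end()))
    for descriptor, pattern in PATTERNS.items():
        for m in re.finditer(pattern, text):
            hits.append((descriptor, m.group(0), m.start() + 1, m.end()))
    hits.sort(key=lambda h: (h[2], h[3]))
    return ["|".join(map(str, h)) for h in hits]


AD = "'97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles."
LINES = recognize(AD)
for line in LINES:
    print(line)
```

The sketch over-generates on purpose: the same number is reported as both Mileage and Price, and CHEV is reported alongside CHEVY, much as the slide lists 11,995 under two descriptors. Sorting out such conflicts is left to a later step.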

  23. Database instance generator
      Year|97|2|3
      Make|CHEV|5|8
      Make|CHEVY|5|9
      Model|Cavalier|11|18
      Feature|Red|21|23
      Feature|5 spd|26|30
      Mileage|7,000|38|42
      KEYWORD(Mileage)|miles|44|48
      Price|11,995|100|105
      Mileage|11,995|100|105
      PhoneNr|566-3800|136|143
      PhoneNr|566-3888|148|155

      insert into Car values(1001, '97', 'CHEVY', 'Cavalier', '7,000', '11,995', '566-3800')
      insert into CarFeature values(1001, 'Red')
      insert into CarFeature values(1001, '5 spd')
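The slide shows the input and output of the instance generator but not how conflicts are resolved. The sketch below uses three assumed heuristics, none of them confirmed by the slides: a candidate contained in a longer candidate of the same descriptor is dropped, a single-valued descriptor keeps the candidate closest to one of its keywords, and remaining ties go to the earliest candidate. All function names are hypothetical.

```python
# Candidate records copied from the slide: (descriptor, string, start, end).
CANDIDATES = [
    ("Year", "97", 2, 3), ("Make", "CHEV", 5, 8), ("Make", "CHEVY", 5, 9),
    ("Model", "Cavalier", 11, 18), ("Feature", "Red", 21, 23),
    ("Feature", "5 spd", 26, 30), ("Mileage", "7,000", 38, 42),
    ("KEYWORD(Mileage)", "miles", 44, 48), ("Price", "11,995", 100, 105),
    ("Mileage", "11,995", 100, 105), ("PhoneNr", "566-3800", 136, 143),
    ("PhoneNr", "566-3888", 148, 155),
]

# Descriptors stored as one column of the flat Car table on the slide.
# Treating PhoneNr as single-valued is an assumption that mirrors that table.
SINGLE_VALUED = {"Year", "Make", "Model", "Mileage", "Price", "PhoneNr"}


def drop_subsumed(cands):
    """Drop a candidate lying inside a longer one of the same descriptor."""
    kept = []
    for d, text, s, e in cands:
        subsumed = any(d == d2 and s2 <= s and e <= e2 and (e2 - s2) > (e - s)
                       for d2, _, s2, e2 in cands)
        if not subsumed:
            kept.append((d, text, s, e))
    return kept


def keyword_distance(cand, keyword_spans):
    """Character gap between a candidate and its nearest keyword (0 if none)."""
    _, _, s, e = cand
    return min((min(abs(ks - e), abs(s - ke)) for ks, ke in keyword_spans),
               default=0)


def build_record(cands):
    keywords, values = {}, {}
    for d, text, s, e in drop_subsumed(cands):
        if d.startswith("KEYWORD("):
            keywords.setdefault(d[len("KEYWORD("):-1], []).append((s, e))
        else:
            values.setdefault(d, []).append((d, text, s, e))
    record = {}
    for d, group in values.items():
        if d in SINGLE_VALUED:
            best = min(group, key=lambda c: (keyword_distance(c, keywords.get(d, [])), c[2]))
            record[d] = best[1]
        else:
            record[d] = [c[1] for c in group]
    return record


RECORD = build_record(CANDIDATES)
print(RECORD)
```

Under these assumptions the record agrees with the slide's insert statements: "miles" sits right after 7,000, so Mileage takes that value and 11,995 is left to Price.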

  24. Car ads extraction ontology

  25. Car ads ontology (textual)
      Car [->object];
      Car [0..1] has Year [1..*];
      Car [0..1] has Make [1..*];
      Car [0..1] has Model [1..*];
      Car [0..1] has Mileage [1..*];
      Car [0..*] has Feature [1..*];
      Car [0..1] has Price [1..*];
      PhoneNr [1..*] is for Car [0..*];
      PhoneNr [0..1] has Extension [1..*];
      Year matches [4]
        constant {extract "\d{2}"; context "([^\$\d]|^)[4-9]\d[^\d]"; substitute "^" -> "19"; },
        … …
      end;
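The relationship-set declarations in the textual ontology are regular enough to be read into a data structure. The following is a minimal sketch assuming only the declaration shape visible on the slide (object set, participation constraint, relationship name, object set, participation constraint, semicolon); the `Participation` and `RelationshipSet` types and the `parse` function are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class Participation(NamedTuple):
    minimum: int
    maximum: Optional[int]  # None stands for "*"


class RelationshipSet(NamedTuple):
    source: str
    source_part: Participation
    name: str
    target: str
    target_part: Participation


# ObjectSet [min..max] relationship name ObjectSet [min..max];
DECL = re.compile(
    r"(\w+)\s*\[(\d+)\.\.(\d+|\*)\]\s+(.+?)\s+(\w+)\s*\[(\d+)\.\.(\d+|\*)\]\s*;"
)


def _part(low, high):
    return Participation(int(low), None if high == "*" else int(high))


def parse(text):
    """Return every relationship-set declaration found in the text."""
    result = []
    for m in DECL.finditer(text):
        src, lo1, hi1, name, tgt, lo2, hi2 = m.groups()
        result.append(RelationshipSet(src, _part(lo1, hi1), name,
                                      tgt, _part(lo2, hi2)))
    return result


ONTOLOGY = """
Car [0..1] has Year [1..*];
Car [0..*] has Feature [1..*];
PhoneNr [1..*] is for Car [0..*];
"""
RELATIONS = parse(ONTOLOGY)
for rel in RELATIONS:
    print(rel)
```

Keeping the upper bound as `None` for `*` lets later code test "at most one" with a simple comparison, which is the distinction an instance generator needs when deciding whether a second Mileage candidate is allowed.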

  26. A gene ontology

  27. A genealogy data model

  28. Finding jobs in linguistics • Built ontology for linguistics jobs: what defines a linguistics job • Data frames and lexicons: language names (www.ethnologue.com), subfields of linguistics (www.linguistlist.org), tools linguists use, programming languages, activities, responsibilities, country names • Documents: 3500 web pages + emails to me • Complete results reported in DLLS 2003

  29. Sample query

  30. Sample output

  31. Subfield expertise sought

  32. Technical skills sought

  33. Sample observations • 270 don’t have linguist* (!) • Computer/computational background required for almost 1/3 (1116) • Noticeable amount of headhunting, particularly in Seattle, DC areas • Often a job title is not even listed (!) • Great need for ontologies related to linguistics • job titles • theoretical frameworks, subfields • typical linguist job activities • linguistic research/development venues

  34. An engineering discipline? • 160 linguistics jobs ending in “engineer” • Software development cycle • research e., software design e. • development e., software e. • software quality e., linguistic test e., linguistic quality e. • linguistic support e., user experience e. • presales e., technical sales e. • Specific subfields • web site e. • speech e., voice recognition e., speech recognition application e., speech e., ASR tuning e., audio e. • dialog e. • tools e. • AI e., NLP e. • knowledge e., ontology e. • linguist e., natural language e. • staff e. • human factors e., user interface e.

  35. A recent ontologist job ad • Date: Thu, 28 Jul 2005 11:44:40 • Subject: General Linguistics: Ontologist, Denver, USA • Job Rank: Ontologist • Specialty Areas: General Linguistics • Position Summary: Ontologist • This person will be responsible for modifying & editing Ontology structures. • Skills: • Basic computer skills such as Internet, email, and spreadsheet programs • In-depth knowledge of any major industry, such as Health Care, Automotive, Legal, Construction, and so forth helpful • Superior communication skills, both oral and written. Ability to communicate effectively with reports, peers, superiors, and customers essential • Travel &/or foreign language experience desired • Personal Characteristics: • A healthy sense of logic, and a love for details • A deep and abiding love of language, and of rule-governed classification systems. This person should be excited by the challenge of figuring out the precise place where a word belongs, and be delighted with the prospect of performing such tasks as the major part of their job • Position Qualifications: • -Bachelor's degree, preferably in Linguistics, Library Science, English, or related field

  36. Another recent ontologist ad • Position Summary: Lead Ontologist • The Lead Ontologist will be responsible for creating & designing Ontology and Ontology structures. This person will be responsible for innovation and general Ontology development as Ontology requirements change. They will serve as Team Lead on various Ontology projects, and they will assist the Director with certain aspects of management, including the development of department culture and standards. They will also serve as a liaison between the Director and the rest of the team. • Skills: • Ability to edit & manipulate text highly desired, using tools such as Emacs and Perl. High level programming language experience and SQL also desired • Knowledge of Ontology structures, and experience with developing and maintaining such structures • Ability to assist with Ontology development and use problem-solving skills to overcome obstacles • Ability to QA own Ontology work, and work of others • Ability to lead projects from set-up through to QA • Leadership or management experience a plus • Position Qualifications: • -Bachelor's degree in Linguistics, Library Science, or related field • -2-3 years experience in Ontology or related field • Application Deadline: Open until filled.

  37. Matching request with ontology “Tell me about cruises on San Francisco Bay. I’d like to know scheduled times, cost, and the duration of cruises on Friday of next week.”

  38. Building a query [Figure: the request mapped to a relational-algebra query that yields Result; labels: Projection (scheduled times, cost, duration), Selection Constants (San Francisco Bay; Friday, Oct. 29th), Join Path]
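The three ingredients named on this slide (projection, selection constants, join path) are enough to assemble a query mechanically. The sketch below is illustrative only: the table names `Cruise` and `Schedule`, the column names, and the `build_query` function are all hypothetical, and SQL with natural joins is a stand-in for whatever query form the project actually generates.

```python
def build_query(projection, selection, join_path):
    """Assemble a parameterized SELECT from the three ingredients.

    projection: column names the user asked to see
    selection:  mapping of column name -> constant from the request
    join_path:  table names connecting the requested concepts
    """
    select = ", ".join(projection)
    source = " NATURAL JOIN ".join(join_path)
    where = " AND ".join(f"{column} = ?" for column in selection)
    sql = f"SELECT {select} FROM {source}"
    if where:
        sql += f" WHERE {where}"
    # Constants are returned separately so they can be bound as parameters.
    return sql, list(selection.values())


SQL, PARAMS = build_query(
    projection=["ScheduledTime", "Cost", "Duration"],
    selection={"Location": "San Francisco Bay", "Date": "Friday, Oct. 29th"},
    join_path=["Cruise", "Schedule"],
)
print(SQL)
print(PARAMS)
```

Returning the constants apart from the SQL text keeps user-supplied phrases such as place names out of the query string itself.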

  39. Another example • Service Request: "I want to see a dermatologist next week; any day would be ok for me, at 4:00 p.m. The dermatologist must be within 20 miles from my home and must accept my insurance." • Match with Task Ontology • Domain Ontology • Process Ontology • Complete, Negotiate, Finalize

  40. Service domain ontology

  42. Relevant mini-ontology

  43. Ontologies: issues • Most successful in data-rich, narrow-domain applications • Ambiguities are problematic; context only partially eliminates them • Incompleteness: implicit information • Commonsense world pragmatics elusive • Knowledge prerequisites are steep • Major efforts in creation, maintenance • Must be created by experts • Experts are biased in knowledge, agreement needed • Ontologies continually change; upkeep a massive task

  44. Ontologies: possible solutions • Some automation is needed • Current automatic ontology generation is not successful because it extracts from free-form, unstructured text • A more effective alternative is to extract ontologies from structured data on the web (tables, charts, etc.) • TANGO project • Part 1: Extract tables from the web • Part 2: Define mini-ontologies from tables • Part 3: Merge into growing domain ontology

  45. Project TANGO

  46. Overview • Table ANalysis for Generating Ontologies • 3-year NSF-funded project • Joint BYU/RPI project • Uses and extends TIDIE concepts, ontologies • Goal is to process tables, generate ontologies, use results for IE

  47. Motivation • Keyword or link analysis search not enough to search for information in tables • Structure in tables can lead to domain knowledge which includes concepts, relationships and constraints (ontologies) • Tables on web created for human use can lead to robust domain ontologies
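To make concrete how table structure could yield concepts, relationships, and constraints, here is a toy sketch that derives participation constraints from a normalized table. It is not the TANGO algorithm: the two heuristics (an empty cell makes participation optional, repeated values let a value belong to many rows) are assumptions chosen for illustration, and the table contents and the `mini_ontology` function are made up.

```python
def mini_ontology(key, rows):
    """Emit one relationship-set declaration per non-key column.

    key:  name of the column that identifies each row's object
    rows: list of dicts, one per table row ("" or None marks an empty cell)
    """
    declarations = []
    columns = [c for c in rows[0] if c != key]
    for col in columns:
        values = [r[col] for r in rows if r.get(col) not in (None, "")]
        # Assumed heuristic 1: any empty cell means the attribute is optional.
        minimum = 1 if len(values) == len(rows) else 0
        # Assumed heuristic 2: a repeated value may belong to many objects.
        value_max = "1" if len(set(values)) == len(values) else "*"
        declarations.append(f"{key} [{minimum}..1] has {col} [1..{value_max}];")
    return declarations


# Made-up table in row-column form.
TABLE = [
    {"Car": "0001", "Year": "1989", "Price": "$1900"},
    {"Car": "0002", "Year": "1998", "Price": ""},
    {"Car": "0003", "Year": "1998", "Price": "$4500"},
]
DECLARATIONS = mini_ontology("Car", TABLE)
for line in DECLARATIONS:
    print(line)
```

Constraints inferred from a single small table are only as reliable as the sample, which is one reason merging many mini-ontologies into a growing domain ontology is a separate step.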

  48. Table understanding • What is a table? • Why table normalization? • What is table understanding? • What is mini-ontology generation?

  49. What is a table? • “…a two-dimensional assembly of cells used to present information…” • Lopresti and Nagy • Normalized tables (row-column format) • Small paper (using OCR) and/or electronic tables (marked up) intended for human use
