Design and Creation of Ontologies for Environmental (Multimedia) Information Retrieval*
Vipul Kashyap
National Library of Medicine
kashyap@nlm.nih.gov
Workshop on Science and the Semantic Web
October 24, 2002
* Work done by the author when at MCC and LSDIS Lab, UGA.
Outline
• Ontologies for Information Retrieval: The InfoSleuth System
• The Ontology Design Process:
  • “Reverse Engineering” from a database schema
  • Ontology refinement based on user queries
  • Using a data dictionary and Thesaurus
• Ontology-based Multimedia Information Retrieval
  • Information Extraction from Textual Data
  • Information Extraction from Image Data
• Conclusions and Future Work
Ontologies for Information Retrieval: The InfoSleuth System
[Figure labels: Ontology-based retrieval query; KQML/OKBC agents; Structured Database (e.g., Oracle); Document Database (e.g., Verity); Image Database (features, patterns, semantic objects)]
A Multimedia GIS Query using an ontological model
[Ontology fragment: Region (county, block, area, population, spatial_location, land_cover); Fire (containment, name); relationship isLocatedNear between Region and Fire]
Get me all regions (blocks, counties) having a population greater than 500 and area greater than 50 acres, having an urban land cover, and such that all the nearby fires have excellent containment
select county, block, spatial_location
from region
where area > 50
and population > 500
and land_cover = “urban”
and region.isLocatedNear.containment = “excellent”
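The path expression region.isLocatedNear.containment can be read as a traversal from each Region to its nearby Fire objects. The following is a minimal Python sketch of that reading, not the InfoSleuth implementation: the class layout, the field name is_located_near, and all sample values are assumptions made for illustration. It follows the English wording ("all the nearby fires"), so a region with no nearby fires qualifies vacuously.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Fire:
    name: str
    containment: str


@dataclass
class Region:
    county: str
    block: str
    area: float                       # acres
    population: int
    land_cover: str
    spatial_location: Tuple[float, float]
    # Hypothetical in-memory stand-in for the isLocatedNear relationship
    is_located_near: List[Fire] = field(default_factory=list)


def select_regions(regions):
    """Return (county, block, spatial_location) for regions that satisfy the query."""
    return [
        (r.county, r.block, r.spatial_location)
        for r in regions
        if r.area > 50
        and r.population > 500
        and r.land_cover == "urban"
        # Universal reading of the path expression; vacuously true with no fires
        and all(f.containment == "excellent" for f in r.is_located_near)
    ]


# Made-up sample data
regions = [
    Region("County A", "B1", 80.0, 1200, "urban", (30.3, -97.7),
           [Fire("F1", "excellent")]),
    Region("County B", "B2", 90.0, 900, "urban", (31.0, -98.0),
           [Fire("F2", "excellent"), Fire("F3", "poor")]),
    Region("County C", "B3", 20.0, 700, "urban", (32.0, -99.0)),
]
result = select_regions(regions)
print(result)
```

County B is rejected because one of its nearby fires is not well contained; County C fails the area condition.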
Ontologies for Information Retrieval
• Provide a concise, uniform, declarative description of semantic information
• Independent of the syntactic representations and conceptual models of the underlying information bases
• Domain models provide wider access by supporting multiple world views on the same underlying data
• EDEN ontology defined in the context of the InfoSleuth system:
  • it is crucial to capture the elements of environmental information
Sources for Ontology construction
• Pre-existing Database Schemas
  • data-directed component
• Collection of a representative set of queries, possibly parameterized, based on the application user interface
  • application-directed component
• Thesauri and Vocabularies (e.g., EEA Thesaurus)
  • knowledge-directed component
• Ontology = knowledge-based middle ground between applications and data!
The Ontology Design Process
[Flowchart with two phases: Ontology from Database Schema; Ontology from Queries. Box labels: Choose new Database Schema; Abstract details from Database Schema; Determine entities and attributes; Group information, Analyze foreign keys and dependencies; Determine Relationships; Evaluate Ontology; Choose new query; Add new entities and attributes; Add new subclasses and superclasses; Drop entities and attributes; No more queries; Implement and Test]
Environmental Databases
• CERCLIS 3
  • http://www.epa.gov/enviro/html/cerclis/
• ITT
• HAZDAT
  • http://www.atsdr.cdc.gov/hazdat.html
• ERPIMS
  • http://ns1.ktc.com/personal/larnold/erpims.htm
• Basel Convention Database
  • http://www.unep.ch/basel
Grouping Information in Multiple Tables
Database Schema:
• Site: site_id (PK), site_name, site_ifms_ssid_code, site_rcra_id, site_epa_id
• Site_Characteristic: site_id (PK, FK to Site), rsic_code (PK, FK to Ref_Sic), sc_date
• Ref_Sic: rsic_code (PK), rsic_code_desc
• Site_Alias: site_id (PK, FK to Site), site_alias_id (PK), sa_name
Ontology:
• Site: code, name, date, alias_name, description
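One way to picture the grouping step is as a join that follows the foreign keys and presents the result as a single Site entity. The sketch below uses an in-memory SQLite database; the correspondence between ontology attributes and columns (code from rsic_code, date from sc_date, alias_name from sa_name, description from rsic_code_desc) is an assumption, since the slide does not spell it out, and the sample rows are made up. Several Site columns are omitted for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets rows be converted to dicts by column name
conn.executescript("""
CREATE TABLE Site (site_id INTEGER PRIMARY KEY, site_name TEXT);
CREATE TABLE Ref_Sic (rsic_code TEXT PRIMARY KEY, rsic_code_desc TEXT);
CREATE TABLE Site_Characteristic (
    site_id   INTEGER REFERENCES Site(site_id),
    rsic_code TEXT REFERENCES Ref_Sic(rsic_code),
    sc_date   TEXT,
    PRIMARY KEY (site_id, rsic_code));
CREATE TABLE Site_Alias (
    site_id       INTEGER REFERENCES Site(site_id),
    site_alias_id INTEGER,
    sa_name       TEXT,
    PRIMARY KEY (site_id, site_alias_id));
-- Made-up sample rows
INSERT INTO Site VALUES (1, 'Example Landfill');
INSERT INTO Ref_Sic VALUES ('4953', 'Refuse systems');
INSERT INTO Site_Characteristic VALUES (1, '4953', '1997-08-01');
INSERT INTO Site_Alias VALUES (1, 1, 'Old Town Dump');
""")

# Follow the foreign keys and rename columns to the (assumed) ontology attributes
rows = conn.execute("""
SELECT sc.rsic_code     AS code,
       s.site_name      AS name,
       sc.sc_date       AS "date",
       a.sa_name        AS alias_name,
       r.rsic_code_desc AS description
FROM Site s
LEFT JOIN Site_Characteristic sc ON sc.site_id = s.site_id
LEFT JOIN Ref_Sic r ON r.rsic_code = sc.rsic_code
LEFT JOIN Site_Alias a ON a.site_id = s.site_id
""").fetchall()

site_entities = [dict(row) for row in rows]
print(site_entities)
```

The LEFT JOINs keep sites that have no alias or characteristic rows, so the ontology view does not silently drop them.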
Identifying Relationships
Database Schema:
• Site: site_id (PK), site_name, site_ifms_ssid_code, site_rcra_id, site_epa_id
• Action: site_id (PK, FK to Site), rat_code (PK, FK to Ref_action_type), act_code_id (PK)
• Ref_action_type: rat_code (PK), rat_name, rat_def
• Remedial_Response: site_id, act_code_id, rat_code
• Waste_Src_Media_Contaminated: wsmrc_nmbr (PK), site_id (PK, FK to Action), rat_code (FK to Action), act_code_id (FK to Action)
Ontology:
• Entities: Site, RemedialResponse (actionName), Contaminant
• Relationship: PerformedAt (between RemedialResponse and Site)
Ontology refinement based on user queries
• Addition of New Attributes
  • At NPL sites with a land use category of INDUSTRIAL, what is the cleanup level range for LEAD ….
  • Add an attribute landUseCategory to the entity Site in the ontology
• Addition of new Relationships
  • What is the range of concentrations for ARSENIC as a contaminant of concern in the SURFACE SOIL at NPL sites?
  • Add a relationship HasContaminant between the entities Site and Contaminant in the ontology
• Addition of class-subclass relationships and new entities
  • How many Superfund sites are in Edison County, New Jersey?
  • Add an entity SuperFundSite as a subclass of Site in the ontology
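The three refinement operations above can be sketched as methods on a small ontology structure. This is only an illustration: the Ontology class and its method names are hypothetical, and the step of deciding which operation a query calls for is left to the ontology designer, as on the slide.

```python
class Ontology:
    """Toy ontology: entities with attributes, subclass links, and relationships."""

    def __init__(self):
        self.attributes = {}        # entity name -> set of attribute names
        self.parents = {}           # entity name -> superclass name
        self.relationships = set()  # (relationship name, source, target)

    def add_entity(self, name, parent=None):
        self.attributes.setdefault(name, set())
        if parent is not None:
            self.add_entity(parent)
            self.parents[name] = parent

    def add_attribute(self, entity, attribute):
        self.add_entity(entity)
        self.attributes[entity].add(attribute)

    def add_relationship(self, name, source, target):
        self.add_entity(source)
        self.add_entity(target)
        self.relationships.add((name, source, target))


onto = Ontology()
onto.add_entity("Site")

# Query about land use category -> new attribute
onto.add_attribute("Site", "landUseCategory")
# Query about contaminants of concern at sites -> new relationship
onto.add_relationship("HasContaminant", "Site", "Contaminant")
# Query about Superfund sites -> new entity as a subclass
onto.add_entity("SuperFundSite", parent="Site")

print(onto.attributes, onto.parents, onto.relationships)
```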
Using a data dictionary (EDR) to enhance the ontology
[Diagram labels: Site; Map (coding_scheme1, coding_scheme2, coding_scheme3); state; StateName, StateCode, StateAbbr; value sets { “Texas”, “California” } and { “TX”, “CA” }]
select * from Site where state = ‘TX’ or state = ‘California’
select coding_scheme1 from Map where coding_scheme3 = ‘TX’
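The role of the Map table can be sketched as a lookup that rewrites mixed state values into one coding scheme before a query is run. In this Python sketch the assignment of full names to coding_scheme1 and abbreviations to coding_scheme3 is an assumption suggested by the second query; coding_scheme2 is left out because the slide gives no values for it, and the function names are hypothetical.

```python
# Hypothetical contents of the Map table (only two of the three schemes shown)
MAP = [
    {"coding_scheme1": "Texas", "coding_scheme3": "TX"},
    {"coding_scheme1": "California", "coding_scheme3": "CA"},
]


def translate(value, target_scheme, mapping=MAP):
    """Find the row that contains `value` under any scheme and return its
    equivalent in `target_scheme`."""
    for row in mapping:
        if value in row.values():
            return row[target_scheme]
    raise KeyError(f"no mapping for {value!r}")


def normalize_predicate(values, column_scheme):
    """Rewrite the constants of a predicate into the scheme the column uses."""
    return [translate(v, column_scheme) for v in values]


# state = 'TX' or state = 'California', against a column stored as abbreviations
normalized = normalize_predicate(["TX", "California"], "coding_scheme3")
print(normalized)
```

With the constants normalized, a single query against Site can cover values that users express in different coding schemes.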
Enhancing the Ontology by using a Thesaurus
Thesaurus entry:
abandoned site
  THEME POLLUTION
  BT land setup
  NT disused military site
[Ontology classes shown: LandSetup, Site, AbandonedSite, DisusedMilitarySite, SuperfundSite]
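The broader-term (BT) and narrower-term (NT) lines of a thesaurus entry suggest candidate class-subclass links. A minimal Python sketch, assuming the entry layout shown on the slide and a simple CamelCase naming rule for classes; both the parser and the naming rule are illustrative assumptions, not the procedure used for EDEN.

```python
ENTRY_TEXT = """abandoned site
THEME POLLUTION
BT land setup
NT disused military site"""


def parse_entry(text):
    """First line is the term; each following line is '<TAG> <value>'."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    entry = {"term": lines[0], "BT": [], "NT": [], "THEME": []}
    for ln in lines[1:]:
        tag, value = ln.split(None, 1)
        entry.setdefault(tag, []).append(value)
    return entry


def class_name(term):
    """'disused military site' -> 'DisusedMilitarySite'."""
    return "".join(word.capitalize() for word in term.split())


def subclass_links(entry):
    """Return (subclass, superclass) pairs implied by BT and NT."""
    term = class_name(entry["term"])
    links = [(term, class_name(bt)) for bt in entry["BT"]]
    links += [(class_name(nt), term) for nt in entry["NT"]]
    return links


links = subclass_links(parse_entry(ENTRY_TEXT))
print(links)
```

The resulting pairs are only suggestions; where they attach to existing classes such as Site is still a design decision.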
Information Extraction from Text and Multimedia Data
[Ontology fragment: Region (county, block, area, population, spatial_location, land_cover); Fire (containment, name); relationship isLocatedNear between Region and Fire]
Get me all regions (blocks, counties) having a population greater than 500 and area greater than 50 acres, having an urban land cover, and such that all the nearby fires have excellent containment
select county, block, spatial_location
from region
where area > 50
and population > 500
and land_cover = “urban”
and region.isLocatedNear.containment = “excellent”
Information Extraction from Textual Data
[Ontology fragment: Region (county, block, state); Fire (containment); relationship isLocatedNear]
containment = “excellent”:
<ACCRUE>(
  <SENTENCE>(<AND>(<NUMBER>(X), X < 25), <WORD>(%), <WORD>(active)),
  <PHRASE>(full, containment,, <STEM>(was), expected)
  <PHRASE>(the, fire, <STEM>(is), contained))
county, block, state:
<ACCRUE>(<SENTENCE>(
  <PHRASE>(<OR>(New, Las, San), [region.county]),
  <OR>(county, block, state)))
isLocatedNear:
<PARAGRAPH>(FIRE, REGION)
Mapping “domain specific” model elements to media specific metadata
• county(x,y) gets mapped to:
  • word(x), phrase(x), accrue(<list-of-subtrees>)
• containment(x, “excellent”) gets mapped to:
  • sentence(<set-of-words>), stem(x), accrue(<list-of-subtrees>)
• isLocatedNear(x, y) gets mapped to:
  • paragraph(x,y)
Mapping SQL queries to Topic Expressions
select county from region where isLocatedNear.containment = “excellent”
<PARAGRAPH>(
  <ACCRUE>(
    <SENTENCE>(<AND>(<NUMBER>(X), X < 25), <WORD>(%), <WORD>(active)),
    <PHRASE>(full, containment,, <STEM>(was), expected)
    <PHRASE>(the, fire, <STEM>(is), contained)),
  <ACCRUE>(<SENTENCE>(
    <PHRASE>(<OR>(New, Las, San), [region.county]),
    county))
)
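The translation above can be sketched as string composition: each ontology element contributes a sub-expression, and the relationship isLocatedNear wraps them in a PARAGRAPH operator. The Python below only mimics the notation on the slide; it does not call any search-engine API, the helper names are hypothetical, and the containment expression is shortened to one of its phrases for brevity.

```python
def op(name, *args):
    """Render an operator in the slide's notation, e.g. <OR>(New, Las, San)."""
    return f"<{name}>({', '.join(args)})"


def county_expr():
    # county: a phrase starting with New/Las/San followed by the value slot
    phrase = op("PHRASE", op("OR", "New", "Las", "San"), "[region.county]")
    return op("ACCRUE", op("SENTENCE", phrase, "county"))


def containment_excellent_expr():
    # Shortened: only the "the fire is contained" phrase from the slide
    return op("ACCRUE", op("PHRASE", "the", "fire", op("STEM", "is"), "contained"))


def translate_query():
    """select county from region where isLocatedNear.containment = "excellent"."""
    # isLocatedNear(x, y) maps to paragraph(x, y): both patterns in one paragraph
    return op("PARAGRAPH", containment_excellent_expr(), county_expr())


county = county_expr()
topic_expression = translate_query()
print(topic_expression)
```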
Limitations of Current Indexing Technologies: “selection operation”
select county from region
<ACCRUE>(<SENTENCE>(<PHRASE>(<OR>(New, Las, San), WILDCARD), <OR>(county, block, state)))
=> post-processing of patterns returned (WILDCARD as place-holder)
Problem:
• WILDCARD may match a lot of words in the same sentence
• WILDCARD may match different words in different sentences
Using NLP and statistical techniques
• WILDCARD matches a number of words in the same sentence
  • Yeltsin was appointed WILDCARD
  • Candidate matches: the (article), Prime Minister (noun), when (conjunction), sleeping (verb)
  • => Use part-of-speech tagging to reduce the number of possibilities
• WILDCARD matches different words in different sentences
  • Yeltsin was appointed Prime Minister
  • Yeltsin was appointed President
  • => Use frequency statistics to give a level of confidence
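Both ideas can be sketched together: filter the WILDCARD candidates by part of speech, then turn the frequencies of the surviving values into a confidence score. In this Python sketch a hard-coded lookup table stands in for a real part-of-speech tagger, and relative frequency is used as the confidence measure; both are assumptions, since the slide names the techniques but not a specific method.

```python
from collections import Counter

# Toy stand-in for a part-of-speech tagger (tags taken from the slide's example)
TOY_POS = {
    "the": "article",
    "Prime Minister": "noun",
    "President": "noun",
    "when": "conjunction",
    "sleeping": "verb",
}


def filter_by_pos(candidates, wanted="noun"):
    """Keep only the candidates whose tag matches the expected slot type."""
    return [c for c in candidates if TOY_POS.get(c) == wanted]


def confidence(values):
    """Relative frequency of each extracted value across sentences."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


# WILDCARD candidates returned for three matching sentences (made-up data)
matches_per_sentence = [
    ["the", "Prime Minister", "when", "sleeping"],
    ["President"],
    ["Prime Minister"],
]

extracted = [v for cands in matches_per_sentence for v in filter_by_pos(cands)]
scores = confidence(extracted)
print(scores)
```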
[Report excerpt as shown on the slide; line ends are cut off]
INCIDENT MANAGEMENT SITUATION REPORT
Friday August 1, 1997 - 0530 MDT
NATIONAL PREPAREDNESS LEVEL II
CURRENT SITUATION: Alaska continues to experience large fire activity. Additional fires ha staffed for structure protection.
SIMELS, Galena District, BLM. This fire is on the east side of the Innoko Flats, between Galena The fire is active on the southern perimeter, which is burning into a continuous stand of black s fire has increased in size, but was not mapped due to thick smoke. The slopover on the eastern 35% contained, while protection of the historic cabin continues.
CHINIKLIK MOUNTAIN, Galena District, BLM. A Type II Incident Management Team (Weh assigned to the Chiniklik fire. The fire is contained. Major areas of heat have been mopped up. contained. Major areas of heat have been mopped-up. All crews and overhead will mop-up wher burned beyond the meadows. No flare-ups occurred today. Demobilization is planned for this we depending on the results of infrared scanning.
Extraction example:
• Phrase: SIMELS, Galena District, BLM.
• Slot: fire.name
• Value: SIMELS
• Structure: <name> , <place> , <unit> .
[Panel labels: Definition, Support]
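The structure <name> , <place> , <unit> . can be approximated with a regular expression that fills the fire.name slot. The Python sketch below is illustrative only: the pattern (upper-case name, a place ending in "District", an upper-case unit) is an assumption tuned to the two headers in the report excerpt, the slot names other than fire.name are hypothetical, and the sample text is abridged from the excerpt.

```python
import re

# <name> , <place> , <unit> .
STRUCTURE = re.compile(
    r"\b(?P<name>[A-Z][A-Z ]*[A-Z]), "      # upper-case fire name
    r"(?P<place>[A-Z][a-z]+ District), "    # place, assumed to end in "District"
    r"(?P<unit>[A-Z]{2,})\."                # managing unit abbreviation
)

# Abridged from the report excerpt
REPORT = (
    "SIMELS, Galena District, BLM. This fire is on the east side of the "
    "Innoko Flats. CHINIKLIK MOUNTAIN, Galena District, BLM. The fire is contained."
)


def extract_fires(text):
    """Return one slot dictionary per header that matches the structure."""
    return [
        {
            "fire.name": m.group("name"),
            "fire.place": m.group("place"),  # hypothetical slot name
            "fire.unit": m.group("unit"),    # hypothetical slot name
        }
        for m in STRUCTURE.finditer(text)
    ]


fires = extract_fires(REPORT)
print(fires)
```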
MIDAS*: Information Extraction from Multimedia Data
Query: Get me all regions (blocks, counties) having a population greater than 500 and area greater than 50 acres having an urban land cover
select county, block, area, population, spatial_location, land_cover
from region
where area > 50
and population > 500
and land_cover = ‘urban’
and relief = ‘moderate’
* Media Independent DomAin Specific correlation
Get me all regions (counties, blocks) having
• 50 < population < 100
• 25 < area < 50
• and low density urban area land cover ...
Notes:
• media independent correlation across domain specific metadata
• correlation across image and structured data at an intensional domain level
[Figure labels. Attributes: Population, Area, Boundaries, Land cover, Relief. Access paths: SQL queries to structured data (Census DB); SQL Gateway to textual data (TIGER/Line DB); Image Processing routines for Image Data]
Mapping “domain specific” model elements to media specific metadata
• contained(<concept>, <image>) gets mapped to:
  • latitude/longitude, image-coordinates
  • bounding box of region
  • image type: LULC, DEM
• land_cover(x, “low density urban”) gets mapped to:
  • percentage(<pixel-color>, <bounding-box>)
• relief(x, “moderate”) gets mapped to:
  • standard-deviation(<pixel-value>, <bounding-box>)
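The two image-level measures named above can be sketched on small grids. In this Python sketch an image is a list of rows, a bounding box is (row0, col0, row1, col1) with exclusive upper bounds, the land-use grid holds class codes and the elevation grid holds heights; these conventions and the sample values are assumptions. The thresholds that would turn the numbers into "low density urban" or "moderate" are not given on the slide and are left out.

```python
from statistics import pstdev


def crop(image, box):
    """Return the pixel values inside box = (row0, col0, row1, col1)."""
    row0, col0, row1, col1 = box
    return [value for row in image[row0:row1] for value in row[col0:col1]]


def percentage(image, box, pixel_color):
    """Share (in percent) of pixels in the box that have the given class code."""
    pixels = crop(image, box)
    return 100.0 * sum(1 for p in pixels if p == pixel_color) / len(pixels)


def standard_deviation(image, box):
    """Population standard deviation of pixel values in the box."""
    return pstdev(crop(image, box))


# Made-up LULC grid (1 = urban class code) and DEM grid (elevations)
lulc = [
    [1, 1, 2],
    [1, 2, 2],
    [3, 3, 3],
]
dem = [
    [10, 12, 30],
    [14, 16, 40],
    [50, 60, 70],
]
region_box = (0, 0, 2, 2)  # bounding box of the region in image coordinates

urban_share = percentage(lulc, region_box, pixel_color=1)
relief_spread = standard_deviation(dem, region_box)
print(urban_share, relief_spread)
```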
Need for characterization of Domain Vocabularies
Another source of domain ontology construction:
• Classification Standards
[Taxonomy 1, rooted at Geological Region:
  • Water: Reservoirs, Lakes, Streams and Canals
  • Urban: Industrial, Residential, Commercial
  • Forest Land: Evergreen, Deciduous, Mixed]
[Taxonomy 2 node labels: Geological Region, State, County, Rural Area, City, Tract, Block Group, Block]
Conclusions and Future Work
• Role of semantic content in handling data/information overload
• Domain-specific ontologies: an approach for capturing semantic content
• Design and construction of domain ontologies
  • labor-intensive, time-consuming, difficult endeavor
  • Re-use readily available information: schemas, queries, data dictionaries, thesauri
  • minimize the involvement of the domain expert
• Metadata is the key for Multimedia Information Retrieval
  • Use an expanded notion of metadata as schema and a declarative SQL-like query language
  • Pragmatic incorporation of NLP/Image+Speech+Video Processing/Computer Vision techniques
  • Exploit synergy across multiple media for better precision and performance
• Extrapolate this technique into other domains:
  • Medical and Bio-Informatics
  • telecommunication
  • IP networks (use of CIM information model by DMTF)
• Ontology Extraction from Textual Data:
  • Clustering techniques to identify central concepts and taxonomic relationships
  • NLP techniques to identify concept associations
  • Consensus analysis techniques to establish ontologies