Geoinformatics Introduction
The Web • The goal of the current Web is to make knowledge widely accessible and to increase the utility of this knowledge by enabling advanced applications for searching, browsing, and evaluation. • The Web presents a vast amount of distributed data and information • for human consumption using • the Internet infrastructure and • a set of WWW communication standards (e.g., FTP, URI, HTTP)
Semantic Web • The Semantic Web is related to the World Wide Web. • It is based on data formats that encode knowledge for processing by computer systems (software) • Endeavors leading to it: • Building abstract models that simplify the complex reality • Ontology: Description of knowledge about a specific domain in a machine-processable specification with a formally defined meaning. • Models are designed with a structure and relationships among their components • Computing with knowledge: • Representing knowledge such that computers can automatically come to reasonable conclusions (i.e., infer) from encoded knowledge • Exchanging information (communication): distribute, interlink, and reconcile heterogeneous knowledge at a global scale (i.e., on the Web); this requires the Web, HTTP, HTML, etc.
Problems • The knowledge sources are heterogeneous and distributed globally • The exchange of this heterogeneous information requires standard data formats, languages (e.g., HTML), and protocols (e.g., HTTP, FTP, Web services)
Problem with the present Web • The problem is that the Web cannot consume the information which it carries • The users of the Web are human beings, who try to make sense of the information (depending on their background) by reading the ‘best’ hits that search engines return for their probabilistic, keyword-based searches • The number of nodes (servers) on the Web, presenting the large volume of data and information, is increasing at a fast pace, making it hard to effectively index the pages and present useful statistics to the users
What can be done? • It would be nice if the information on the Web could automatically be handled by machines, which are capable of processing vast volumes of information at high speed, in a fraction of the time it takes humans to read a document with information • However, this requires the computers to ‘know’ the meaning, i.e., the semantics of the information, like human beings
A Scenario • Suppose we are interested in knowing the location of the normal faults that are currently seismically active in southwest Montana, and which formed through the Tertiary Basin-and-Range tectonics • We, as geologists, type our keywords into the search engines, and get many probabilistic hits. We start reading the documents available on the Web, and depending on the level of our knowledge, which is very variable, extract and learn different information about these faults • The returned information may not all be what we asked for: some of the hits listed as seismic faults may be the reactivated Tertiary faults which we are looking for
Others may be younger than the Basin-and-Range tectonics, and formed through thermal expansion and subsequent subsidence when the North American plate moved southwest relative to the fixed Yellowstone hot spot • Geologists can decipher the difference between these two events using their knowledge of the two extensional events (Basin-and-Range and the hot spot) • They can make this distinction based on their experience in relation to the characteristics of each of these events, such as fault orientation, spatial distribution, cross-cutting relationships, and unconformities based on the sedimentary cover
Problem with keyword search • The computer has no clue as to the meaning of the fault, normal fault, and seismically-active keywords • How can we make computers learn the geological knowledge, so that our queries return more useful information? • How can we tell the computer that the hanging wall in a normal fault moves down during extension; that horsts represent the footwall; that the tick (or lollipop) symbol on the fault trace map is on the hanging wall; that the trend of the fault trace is read from the North or South reference in azimuth or quadrant format; that the length of a fault trace is read based on the scale of the map; or that rocks are not liquid?
Computer doesn’t know meaning • Not only does the computer not know what a normal fault or stress is, • it also does not know the fact that a normal fault forms by extension, not contraction, when the maximum principal compressive stress is vertical, or • that although a fault is a planar feature, it is represented on a map as a horizontal, linear feature (fault trace) • Currently, only geologists know these facts. We need to make ‘geologistoids’ by formally structuring and specifying our knowledge and feeding it to software that can read it!
Data, Information, and Knowledge • Knowledge management deals with accessing, manipulating, and sharing of knowledge • Knowledge engineering: Developing knowledge-based systems (software) in any field that can help the community to process data and information based on the consensual knowledge in that field • This requires understanding the notions of data, information, and knowledge
Terminology • Data refers to values assigned to the attributes (properties) of a particular object or process entity that occupies space and time • An object is a bona fide or fiat portion of reality, such as a class of individuals, an individual or its parts, or a spatial region • Bona fide objects exist independent of our perceptions and classification, and are demarcated from their surroundings • (e.g., fold, formation, oil) • Fiat objects, on the other hand, exist only because of our partitioning (classifying) activities, e.g., Montana, west of the Mississippi
What do Earth Scientists do? • We collect data about particular: • continuant, geological objects • e.g., the San Andreas Fault • occurrent processes, such as the Mount St. Helens volcanic eruption (e.g., time interval, type, and nature of eruption) • Data constitute the raw values collected during an activity such as field work, experiment, simulation, or calculation • For example, the age of a rock, the salinity of sea water, and the depth to the water table are data.
Information • We commonly need a series of data about something to make sense of their meaning, i.e., extract information • Data may become meaningful and useful to the scientists, i.e., become information, when they are put together, for example, in a plot, map, or pattern • Information is a collection of data that may mean something to the person examining it, provided he/she has the background knowledge about the subject from which the data were extracted (i.e., is a domain or knowledge expert) • Information is, therefore, the meaning of the data based on background knowledge (e.g., a map)
Map is information • A map that presents the orientation (e.g., strike and dip or trend), spatial data (location, distribution), and temporal data (age) about many thrust faults and related folds (axial trace, limb attitude) is information • As such, a geological map is more meaningful to a structural geologist than it may be to, say, a geographer who may not have the required ‘knowledge’ about these geologic structures
Information makes sense with knowledge • Information may be an emergent property of data after they are processed in a context • For example, a population of faults oriented parallel to each other may represent a set, which based on the domain knowledge, may be assumed by the domain experts, i.e., structural geologists, to have formed together during a single tectonic event • The same information may be interpreted differently by applying different knowledge based on different truths, beliefs, perspectives, judgments, and know-how!
Truth depends on knowledge • That’s how science expands, by interpreting the same data differently, until the ‘truth’ is found and verified with the existing knowledge • The ‘truth’, which is what is believed to be true at a given time, may change with new knowledge and discoveries • Knowledge in current scientific books generally presents the latest sets of true statements • Although the data, which are presented on the current Web, may be created and formed into meaningful information by both humans and machines, they may only be understood by humans
Difference between knowledge and information • Knowledge is a collection or total sum of true beliefs (true statements) about real objects in a field (domain or universe of discourse), which can be used to make a decision • The true beliefs are mainly about universals (i.e., types of things such as fault, mineral), but also include facts about particulars or individuals (i.e., instances of the general types, such as the San Andreas Fault, a sample of quartz)
General, universal truths • Notice that here we are talking about general true statements (facts) which have been discovered by geoscientists throughout the history of geoscience through the scientific method • Although scientists learn about the general (i.e., universal) types of objects and features which they study, they study particular objects in their research • A hydrogeology book presents the hydrogeology knowledge by dealing more with general facts about universal types of aquifers (confined, unconfined, and leaky) and to some extent about particular aquifers (e.g., Floridan Aquifer)
Knowledge = set of known true statements • Knowledge is a set of true statements (i.e., knowledge fragments) • Examples of knowledge fragments: • ‘rock is made of one or more minerals’ • ‘thrusting moves older rocks on top of younger ones’ • ‘mylonite forms in a ductile shear zone’ • ‘pressure of an unconfined aquifer is atmospheric’ • The goal of a knowledge-based system is to translate these knowledge fragments into machine-understandable and processable code, e.g.: Rock hasPart Mineral; Mineral partOf Rock
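The slide's two statements can be sketched in plain Python as subject-predicate-object triples. This is only an illustration: the triple layout, the `INVERSE_OF` table, the `with_inverses` helper, and the `formsIn` predicate are assumptions made for this example, not part of RDF/OWL or of the original slides.

```python
# Hypothetical, minimal triple store: each fact is (subject, predicate, object).
# The inverse-property table is an assumption made for this illustration.
INVERSE_OF = {"hasPart": "partOf", "partOf": "hasPart"}


def with_inverses(triples):
    """Return the triples plus the inverse statement of each one, where known."""
    result = set(triples)
    for subject, predicate, obj in triples:
        inverse = INVERSE_OF.get(predicate)
        if inverse is not None:
            result.add((obj, inverse, subject))
    return result


# Two knowledge fragments from the slide, written as triples.
facts = {
    ("Rock", "hasPart", "Mineral"),
    ("Mylonite", "formsIn", "DuctileShearZone"),
}

knowledge = with_inverses(facts)
for triple in sorted(knowledge):
    print(triple)
```

Stating only `Rock hasPart Mineral` is enough here for the program to add `Mineral partOf Rock`; in a real ontology language, the inverse relationship would be declared in the ontology rather than in a Python dictionary.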
Example • Suppose a good number of temperature measurements in the past winter month ranged between -10°C and -20°C, with an average of -11°C, and a cooling trend • The -10°C to -20°C temperatures are data; the cooling trend and the comparison of the average temperature over many years are information • The statement: ‘average temperature drops in winter’ is a piece of general knowledge • Given that even colder temperatures are coming (information), we may make a decision not to go out with a T-shirt unless we want to freeze (we have the knowledge that we may freeze at extremely low temperatures) or to show off our ‘Love Earth’s Diploes’ T-shirt
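The data, information, knowledge, and decision chain in this example can be sketched as follows. The readings, the freezing threshold, and all names are hypothetical values chosen for illustration; they are not measurements from the slides.

```python
from statistics import mean

# DATA: hypothetical daily readings in degrees Celsius (illustrative values only).
temperatures = [-10.0, -10.5, -11.0, -11.5, -12.0]


def trend_slope(values):
    """Least-squares slope of the values against their index (degrees per day)."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = mean(values)
    numerator = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    denominator = sum((i - x_mean) ** 2 for i in range(n))
    return numerator / denominator


# INFORMATION: summaries derived from the data.
average = mean(temperatures)
slope = trend_slope(temperatures)  # a negative slope means a cooling trend

# KNOWLEDGE: 'we may freeze at low temperatures', encoded as an assumed threshold.
FREEZE_RISK_BELOW_C = 0.0

# DECISION: combine information with knowledge.
wear_tshirt = not (average < FREEZE_RISK_BELOW_C and slope < 0)
print(average, slope, wear_tshirt)
```

The individual readings carry no decision by themselves; the decision only appears once the derived trend is combined with the encoded rule.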
Geoscience example • We are planning to build a large structure (a nuclear reactor or dam) in an area. We are not sure if a fault runs through the area • The epicenters of microseismic events over the past two decades show a linear spatial distribution, which coincides with a straight drainage cut in Quaternary alluvium • In this case the epicenters are data; the linear spatial distribution of the epicenters is information • We know that Quaternary faults may cut through recent alluvium, and that seismic faults are active, i.e., they can slip at any time • This knowledge, and the knowledge that ‘building a nuclear reactor on a fault may be dangerous’, lead us to make a decision not to build the reactor or the dam in this location. The reasoning to make these decisions is based on background knowledge which resides in geoscientists’ heads
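A rough sketch of the same reasoning in code is given below. The epicenter coordinates, the 0.95 threshold, and the field observation flag are all hypothetical. The Pearson correlation coefficient is used here only as a simple stand-in test for linear alignment; it would not work for an exactly north-south or east-west lineament, where one coordinate has no variance.

```python
from math import sqrt

# DATA: hypothetical epicenter coordinates (km east, km north); illustrative only.
epicenters = [(0.0, 0.1), (1.0, 1.0), (2.0, 2.1), (3.0, 2.9), (4.0, 4.0)]


def pearson_r(points):
    """Pearson correlation coefficient between the x and y coordinates."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    return sxy / sqrt(sxx * syy)


# INFORMATION: the epicenters line up (threshold is an assumption).
r = pearson_r(epicenters)
is_linear = abs(r) > 0.95

# Assumed field observation: the lineament coincides with a cut in Quaternary alluvium.
cuts_quaternary_alluvium = True

# KNOWLEDGE as a rule: an aligned, seismically active Quaternary fault is a hazard.
build_here = not (is_linear and cuts_quaternary_alluvium)
print(round(r, 3), build_here)
```

The rule in the last step is exactly the kind of background knowledge that, according to the slide, currently resides only in geoscientists' heads.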
Rules of Inference • Computers can also make use of information through inference rules if we explicitly formalize our knowledge with specific rule-based machine language and logic • Automatic processing of information and performing inference about it requires specific languages (e.g., RDF, RDFS, and OWL) with built-in inference rules • We need ways to represent the semantics of our knowledge fragments by identifying real domain objects, and modeling the relationships among these objects and processes that involve them • This knowledge-based model of reality (ontology), with embedded metadata and inference rules, can be used for reasoning (i.e., drawing implicit entailments from the explicitly asserted facts), e.g.: • NormalFault isA Fault • Fault isA PlanarStructure • Entailment: NormalFault isA PlanarStructure • [Figure: class hierarchy PlanarStructure > Fault > NormalFault]
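The entailment on this slide follows from treating isA as transitive. A minimal Python sketch of that single rule is shown below; the dictionary layout and the function names are assumptions for illustration, each class has at most one parent here, and a real RDFS/OWL reasoner applies many more rules than this one.

```python
# Asserted facts: child class -> parent class (class names follow the slide).
IS_A = {"NormalFault": "Fault", "Fault": "PlanarStructure"}


def ancestors(cls, is_a=IS_A):
    """All classes entailed for cls by following isA links transitively."""
    found = []
    while cls in is_a:
        cls = is_a[cls]
        if cls in found:  # guard against accidental cycles in the assertions
            break
        found.append(cls)
    return found


def entails_is_a(child, parent):
    """True if 'child isA parent' is asserted or can be inferred."""
    return parent in ancestors(child)


print(ancestors("NormalFault"))
print(entails_is_a("NormalFault", "PlanarStructure"))
```

Nobody asserted `NormalFault isA PlanarStructure`; the program derives it from the two asserted facts, which is what the slide calls an implicit entailment.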
Realist View of the World • Each group of scientists studies a part of the world (domain) by abstracting and simplifying it based on the community’s interests • These so-called domain or knowledge experts (e.g., paleontologist, petroleum geologist): • look at the reality from specific perspectives, and • understand the relationships among the domain objects and processes differently
Different perspectives • An oil or gas ‘reservoir’ for a petroleum geologist is a ‘formation’ for a stratigrapher, a ‘rock type’ for a sedimentologist, and may be an ‘anticline’ to a structural geologist • It is clear that these related domains have a lot in common, and integrated information, collected about the same objects (e.g., a reservoir), viewed from different perspectives, and applying variable knowledge (sedimentology, structural geology), can improve existing geological knowledge, and lead to knowledge discovery and better decision making. This requires integration!
Scientists work autonomously • Individuals or groups of scientists in the same domain (e.g., isotope geology, planetary geology) often work independently of each other, applying autonomous data acquisition and processing methodologies, despite sharing the same general knowledge about real domain objects • These scientists may store their data in either worksheets (e.g., MS Excel) or relational databases with ad hoc design or schema • The names in their database tables are as variable as the number of their databases and worksheets • Each geologist wants to say something about a geological feature or process that he/she studies • This is a mess; isn’t it?
These geologists do this by publishing their peer-reviewed work in scientific journals • To see their work, one has to study the article in paper or digital format • Their database may be a node on the Web, and available for human consumption, but commonly cannot be processed by the different computers distributed over the Web • Is this frustrating?
The AAA slogan and the OWA • The good thing is that any geoscientist can study and present his/her findings about any geological problem as long as the statements go through the scientific peer review process • This happens to nicely fit the Semantic Web’s AAA slogan: Anyone can say Anything about Any topic • The good news is that scientific endeavors are based on the open world assumption (OWA): • Scientists may find new information at any time, and what they presently know is but a part of a never-ending universe of knowledge which they will accumulate • So, be cool!
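The practical effect of the open world assumption shows up when a query asks about something that was never recorded. The sketch below contrasts the two readings; the fact set, the `isSeismicallyActive` predicate, and the function names are hypothetical, and `None` is used here simply to stand for 'unknown'.

```python
# Hypothetical knowledge base of asserted triples.
KNOWN_FACTS = {("SanAndreasFault", "isA", "Fault")}


def ask_closed_world(triple, facts=KNOWN_FACTS):
    """Closed world: anything that is not recorded is taken to be false."""
    return triple in facts


def ask_open_world(triple, facts=KNOWN_FACTS):
    """Open world: a missing fact means 'unknown' (None), not 'false'."""
    return True if triple in facts else None


query = ("SanAndreasFault", "isSeismicallyActive", "true")
print(ask_closed_world(query))  # the database-style answer
print(ask_open_world(query))    # the open-world answer
```

Under the open-world reading, the missing statement simply has not been said yet by anyone, which matches the way scientific knowledge accumulates.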
Semantic Web based on AAA, OWA, and NUA • Just like the AAA slogan, the open world assumption is exactly what the current Web and the Semantic Web are based on • Another parallel between the Semantic Web and scientific research is the No Unique Naming Assumption (NUA): • Different scientists may refer to the same object or process by different names (synonymy), and draw different meanings from the same process or object (polysemy) • Although this seems to be a problem, it reflects the reality, and is hard to change • Let’s just live with it! • The good news is that the Semantic Web is also based on the ‘No Unique Naming Assumption (NUA)’, and can elegantly handle the disparate naming and meaning in scientific research • No problem?
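One way to picture how synonymy can be handled is to let anyone publish 'same as' statements and then group the names they link. The sketch below does this with a small union-find structure; the name pairs and the `canonical_names` helper are hypothetical, and this is only an analogy for equivalence statements in an ontology language, not the mechanism of any particular tool.

```python
# Hypothetical 'same as' statements published by different authors.
SAME_AS = [("ThrustFault", "Thrust"), ("Thrust", "LowAngleReverseFault")]


def canonical_names(pairs):
    """Group names linked by same-as statements; map each to one representative."""
    parent = {}

    def find(name):
        # Follow parent links until reaching a name that represents itself.
        parent.setdefault(name, name)
        while parent[name] != name:
            name = parent[name]
        return name

    for first, second in pairs:
        root_first, root_second = find(first), find(second)
        if root_first != root_second:
            parent[root_second] = root_first  # merge the two groups
    return {name: find(name) for name in parent}


canon = canonical_names(SAME_AS)
print(canon)
```

No author had to rename anything: the three names stay in use, and a query for any one of them can be resolved to the same representative.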
Scientific data are stored in diverse databases, scattered in unstructured publication tables, or kept in ad hoc Excel spreadsheets • The problem with these data stores is the lack of a capability to efficiently link their content (i.e., integration) • Moreover, the quality of the data, when they are collected, entered, or processed, is not controlled well enough (lack of data integrity) to turn the often voluminous data into information that can lead to knowledge discovery and useful decision making • Software cannot interoperate and process these disparate data • The question in Earth Sciences is how we can automatically (i.e., with autonomous, distributed computers) use a large body of data, collected about components and processes of the Earth system, turn it into information, and then discover new knowledge and improve our current understanding of the Earth • Is there hope?