
Peter Fox Data Science – CSCI/ERTH/ITWS-6961 Week 9, October 30, 2012



Presentation Transcript


  1. Academic Basis for Data and Information Science, Data Models, Schema, Data Tools and Data as Service Paradigms Peter Fox Data Science – CSCI/ERTH/ITWS-6961 Week 9, October 30, 2012

  2. Contents • Informatics • Data models • Schema • Tools • Markup languages • Data as service • How are the projects going?

  3. Definitions (revisited) • Data - are pieces of <x> that represent the qualitative or quantitative attributes of a variable or set of variables. • Data (plural of "datum", which is seldom used) - are typically the results of measurements and can be the basis of graphs, images, or observations of a set of variables. • Data - are often viewed as the lowest level of abstraction from which information and knowledge are derived

  4. Definitions ctd. • Information • Representations (of facts? data?) in a form that lends itself to human use • Knowledge • …. meaning

  5. Data-Information-Knowledge Ecosystem Producers Consumers Experience Data Information Knowledge Creation Gathering Presentation Organization Integration Conversation Context

  6. Mind the gap • As we aim to use modern technology to advance data science: • There is often a gap between science and the underlying infrastructure and technology that is available • Informatics - information science includes the science of (data and) information, the practice of information processing, and the engineering of information systems. Informatics studies the structure, behavior, and interactions of natural and artificial systems that store, process and communicate (data and) information. It also develops its own conceptual and theoretical foundations. Since computers, individuals and organizations all process information, informatics has computational, cognitive and social aspects, including study of the social impact of information technologies. Wikipedia. • Cyberinfrastructure is the new research environment(s) that support advanced data acquisition, data storage, data management, data integration, data mining, data visualization and other computing and information processing services over the Internet.

  7. A moment of history • In the late 1950’s (actually around 1957-1958) the modern informatics term was coined • Existed for a while but then split into library science and computer science and developed their own fields, became disconnected • Now coming back to be relevant to science • Informatics IS NOT just having a scientist work with an “IT/ICT” person (NOT, NOT, NOT)

  8. Advertisement • Spring 2013 – Xinformatics • http://tw.rpi.edu/web/course/Xinformatics/2013

  9. Library science • Curates the artifacts of knowledge • Organizes and manages them for consumers • Cataloging and classification • Preservation • ‘maintaining or restoring access to artifacts, documents and records through the study, diagnosis, treatment and prevention of decay and damage’ (wikipedia) • Digital age • Curation and preservation

  10. Cognitive Science • Cognitive science is an interdisciplinary study of the mind and intelligence • It operates at the intersection of psychology, philosophy, computer science, linguistics, anthropology, and neuroscience. • Of relevance for data and information science are three significant theoretical underpinnings • mental representation, • the nature of expertise, • and intuition • Very relevant to model, data/metadata choice

  11. Social Science • Branch of humanities • Especially as it relates to networks of scientists • Exploits sociology of groups, teams • Cultural norms as well as discipline norms • Modes of what and how rewards are given • Between those who produce and those who consume data (and information) • More

  12. Information theory • Semiotics, also called semiotic studies or semiology, is the study of sign processes (semiosis), or signification and communication, signs and symbols. It is usually divided into three branches: • Syntactics: Relation of signs to each other in formal structures • Semantics: Relation between signs and the things to which they refer; their denotata • Pragmatics: Relation of signs to their impacts on those who use them

  13. Note: we have theories for… • Knowledge -> various forms of logic(s) • Information (Shannon, Weaver, Peirce…) • But not ‘Data’ (except for Mealy and some other person I keep forgetting ;-( …)

  14. Premise Context Experience Data Information Knowledge Creation Gathering Presentation Organization Integration Conversation

  15. 1. Assume context free • Content and Structure • D=f(x;p) • D=data, f=transduction function, x=thing, p=parametric dependence (e.g. time of transduction) • HAVE – Syntax • DO NOT HAVE - Semantics – no meaning without context • OR - Pragmatics – no use without meaning?? • What about - Uncertainty, quality, bias (error) – none without context?

  16. 2. Assume minimal context • Minimal = incomplete? • E.g. know instrument but not when, or of what • E.g. know what but not how • Partial uncertainty? Conditional entropy? • Constructive induction?
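The "conditional entropy" question on this slide can be made concrete with a small calculation. The sketch below is only an illustration under stated assumptions: the instrument names and reading classes are invented, and `entropy` and `conditional_entropy` are hypothetical helper names, not part of the lecture. It compares the uncertainty in a set of readings with and without one piece of context (which instrument produced each reading).

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def conditional_entropy(pairs):
    """H(datum | context) in bits for a list of (context, datum) pairs."""
    total = len(pairs)
    by_context = {}
    for context, datum in pairs:
        by_context.setdefault(context, []).append(datum)
    # Weighted average of the entropy within each context group.
    return sum(len(ds) / total * entropy(ds) for ds in by_context.values())


# Invented observations: (instrument, reading class).
observations = [
    ("thermometer", "low"), ("thermometer", "low"),
    ("thermometer", "high"), ("thermometer", "high"),
    ("barometer", "low"), ("barometer", "low"),
    ("barometer", "low"), ("barometer", "low"),
]

h_data = entropy([datum for _, datum in observations])
h_given_instrument = conditional_entropy(observations)
print(f"H(data) = {h_data:.3f} bits")
print(f"H(data | instrument) = {h_given_instrument:.3f} bits")
```

In this toy data set, knowing the instrument lowers the remaining uncertainty but does not remove it, which is one way to read "partial uncertainty" under minimal context.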

  17. (Information) Architecture • Definition: • “is the art of expressing a model or concept of information used in activities that require explicit details of complex systems” (wikipedia) • “… I mean architect as in the creating of systemic, structural, and orderly principles to make something work - the thoughtful making of either artifact, or idea, or policy that informs because it is clear.” Wurman

  18. Information Models • Conceptual models, sometimes called domain models, are typically used to explore domain concepts • High-level conceptual models are often created as part of initial requirements envisioning efforts as they are used to explore the high-level static business or science or medicine structures and concepts. • Conceptual models are often created as the precursor to logical models or as alternatives to them • Followed by logical and physical models • http://en.wikipedia.org/wiki/Data_modelling

  19. Data Models • Conceptual data models, sometimes called domain models, are typically used to explore domain concepts • High-level conceptual models are often created as part of initial requirements envisioning efforts as they are used to explore the high-level static business structures and concepts. • Conceptual data models are often created as the precursor to logical data models or as alternatives to LDMs.

  20. Conceptual model

  21. Data Models • Logical data models (LDMs). • LDMs are used to explore the domain concepts, and their relationships, of your problem domain. • This could be done for the scope of a single project or for your entire enterprise. • LDMs depict the logical entity types, typically referred to simply as entity types, the data attributes describing those entities, and the relationships between the entities.
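A logical data model names entity types, their attributes, and the relationships between them, without committing to a storage layout. The following is a minimal sketch of that idea, assuming a made-up photo-survey domain; the `Site`, `Photo`, and `Collection` entity types and all their attributes are hypothetical and are not taken from the figures in the lecture.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Site:
    """Entity type: a place where photos are taken (hypothetical)."""
    site_id: str
    name: str


@dataclass
class Photo:
    """Entity type: one photograph (hypothetical)."""
    photo_id: str
    taken_on: str   # ISO 8601 date, e.g. "2012-10-30"
    site: Site      # relationship: each photo belongs to exactly one site


@dataclass
class Collection:
    """Entity type: a named set of photos (hypothetical)."""
    title: str
    photos: List[Photo] = field(default_factory=list)  # one-to-many

    def site_ids(self) -> Set[str]:
        """Identifiers of every site represented in this collection."""
        return {photo.site.site_id for photo in self.photos}


beach = Site("S1", "North beach")
survey = Collection("Autumn survey")
survey.photos.append(Photo("P1", "2012-10-30", beach))
survey.photos.append(Photo("P2", "2012-10-31", beach))
print(survey.site_ids())
```

Nothing here says how the entities are stored; that decision belongs to the physical model.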

  22. Logical model

  23. Data Models • Physical data models (PDMs). • PDMs are used to design the internal schema of a database, depicting the data tables, the data columns of those tables, and the relationships between the tables. • PDMs often prove to be useful on a range of applications
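A physical data model fixes the tables, columns, and inter-table relationships of an actual database. As a rough sketch only, the example below uses Python's built-in `sqlite3` module with an in-memory database; the `site` and `photo` tables and their columns are invented for illustration and do not come from the lecture.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # ask SQLite to enforce REFERENCES

# The physical model: tables, columns, keys, and the link between tables.
conn.executescript("""
CREATE TABLE site (
    site_id TEXT PRIMARY KEY,
    name    TEXT NOT NULL
);
CREATE TABLE photo (
    photo_id TEXT PRIMARY KEY,
    taken_on TEXT NOT NULL,
    site_id  TEXT NOT NULL REFERENCES site(site_id)
);
""")

conn.execute("INSERT INTO site VALUES (?, ?)", ("S1", "North beach"))
conn.execute("INSERT INTO photo VALUES (?, ?, ?)", ("P1", "2012-10-30", "S1"))

# The relationship between the tables is exercised with a join.
rows = conn.execute(
    "SELECT photo.photo_id, site.name FROM photo JOIN site USING (site_id)"
).fetchall()
print(rows)
```

Choices such as text-valued keys or storing dates as text are physical-level decisions; a different database could realize the same logical model differently.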

  24. Physical model

  25. Conceptual model – shoreline photos

  26. Logical model – shoreline photos

  27. However as a consumer • Do you ever really see these data models? • What’s the most common form of making data available to others? • What’s the most common means? Second most common?

  28. Example XML
      <?xml version="1.0" encoding="ISO-8859-1"?>
      <shiporder orderid="889923"
                 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                 xsi:noNamespaceSchemaLocation="shiporder.xsd">
        <orderperson>John Smith</orderperson>
        <shipto>
          <name>Ola Nordmann</name>
          <address>Langgt 23</address>
          <city>4000 Stavanger</city>
          <country>Norway</country>
        </shipto>
        <item>
          <title>Empire </title>
          <note>Special Edition</note>
          <quantity>1</quantity>
          <price>10.90</price>
        </item>
        <item>
          <title>Hide your heart</title>
          <quantity>1</quantity>
          <price>9.90</price>
        </item>
      </shiporder>

  29. Very simple schema
      <?xml version="1.0" encoding="ISO-8859-1" ?>
      <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
        <xs:element name="shiporder">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="orderperson" type="xs:string"/>
              <xs:element name="shipto">
                <xs:complexType>
                  <xs:sequence>
                    <xs:element name="name" type="xs:string"/>
                    <xs:element name="address" type="xs:string"/>
                    <xs:element name="city" type="xs:string"/>
                    <xs:element name="country" type="xs:string"/>
                  </xs:sequence>
                </xs:complexType>
              </xs:element>
              <xs:element name="item" maxOccurs="unbounded">
                <xs:complexType>
                  <xs:sequence>
                    <xs:element name="title" type="xs:string"/>
                    <xs:element name="note" type="xs:string" minOccurs="0"/>
                    <xs:element name="quantity" type="xs:positiveInteger"/>
                    <xs:element name="price" type="xs:decimal"/>
                  </xs:sequence>
                </xs:complexType>
              </xs:element>
            </xs:sequence>
            <xs:attribute name="orderid" type="xs:string" use="required"/>
          </xs:complexType>
        </xs:element>
      </xs:schema>

  30. Markup Languages • Reminder: • Mixes data and metadata, and yes, information • Tag structure does not always model the underlying data structure • Modeling the XML itself, i.e. the schema is another task • Does have the potential benefit that it is more for use than storage • Parsing the file: • Incomplete versus complete tags • Empty or optional fields
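The two parsing concerns listed on this slide (incomplete tags, and empty or optional fields) can be sketched with Python's standard `xml.etree.ElementTree` module. The document below is a trimmed version of the ship-order example; `read_items` is a hypothetical helper name, and the choice to represent a missing `note` as `None` is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

ORDER_XML = """<shiporder orderid="889923">
  <orderperson>John Smith</orderperson>
  <item>
    <title>Empire </title>
    <note>Special Edition</note>
    <quantity>1</quantity>
    <price>10.90</price>
  </item>
  <item>
    <title>Hide your heart</title>
    <quantity>1</quantity>
    <price>9.90</price>
  </item>
</shiporder>"""


def read_items(xml_text):
    """Return one dict per <item>; 'note' is None when the tag is absent."""
    root = ET.fromstring(xml_text)  # raises ET.ParseError if tags are incomplete
    items = []
    for item in root.findall("item"):
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "note": item.findtext("note"),  # optional field
            "quantity": int(item.findtext("quantity")),
            "price": Decimal(item.findtext("price")),
        })
    return items


items = read_items(ORDER_XML)
total = sum(i["quantity"] * i["price"] for i in items)
print(total)

# An unclosed <title> makes the document ill-formed, so parsing fails outright.
try:
    read_items("<shiporder><item><title>Broken</item></shiporder>")
    well_formed = True
except ET.ParseError:
    well_formed = False
print(well_formed)
```

An ill-formed document stops the parser before any data is read, whereas an absent optional element is something the consuming code must decide how to represent.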

  31. Data tools (just a few) • Models • http://www.datamodel.org/ • MSDN: http://msdn.microsoft.com/en-us/library/bb399249.aspx • Schema • The Schematron differs in basic concept from other schema languages in that it is not based on grammars but on finding tree patterns in the parsed document. This approach allows many kinds of structures to be represented which are inconvenient and difficult in grammar-based schema languages. If you know XPath or the XSLT expression language, you can start to use The Schematron immediately. • http://www.schematron.com/
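To give a feel for the pattern-based approach described above, here is a loose imitation in Python. It is not Schematron itself: real Schematron rules are written in XML with XPath expressions, whereas this sketch uses the limited path syntax of the standard `xml.etree.ElementTree` module plus plain Python predicates. The `RULES` table, the `check` function, and the rule messages are all hypothetical.

```python
import xml.etree.ElementTree as ET


def has_positive_quantity(node):
    """True when the node's <quantity> child holds an integer greater than 0."""
    text = (node.findtext("quantity") or "").strip()
    return text.isdigit() and int(text) > 0


# Each rule: (context path, message reported on failure, test on a context node).
RULES = [
    (".//item", "an item must have a title",
     lambda node: node.find("title") is not None),
    (".//item", "quantity must be a positive integer", has_positive_quantity),
    (".", "an order must contain at least one item",
     lambda node: node.find("item") is not None),
]


def check(xml_text):
    """Return the messages of every rule that fails somewhere in the document."""
    root = ET.fromstring(xml_text)
    failures = []
    for context, message, test in RULES:
        nodes = [root] if context == "." else root.findall(context)
        for node in nodes:
            if not test(node):
                failures.append(message)
    return failures


good = "<shiporder><item><title>A</title><quantity>1</quantity></item></shiporder>"
bad = "<shiporder><item><quantity>0</quantity></item></shiporder>"
print(check(good))
print(check(bad))
```

The rules assert facts about patterns found in the parsed tree rather than describing a grammar that the whole document must follow, which is the distinction the slide draws.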

  32. Markup Language tools • Any context-sensitive editor • XMLSpy, XML Notepad, XML Editor, oXygen

  33. Data as Service • Modern internet architectures allow for • Service oriented architectures • Resource oriented architectures • Why is this important for data models, schema, etc. • Hides/ obscures underlying model, schemas • Service interfaces are often a poor/ hybrid match for underlying models • UML and ISO 19xxx family of standards, e.g. 19135 are changing the landscape • Mature in certain settings.

  34. Open Geospatial Consortium • Web Feature Service (WFS) • http://www.opengeospatial.org/standards/wfs • Supports INSERT, UPDATE, DELETE, LOCK, QUERY and DISCOVERY operations on geographic features using HTTP as the distributed computing platform • Built on Geography Markup Language (GML) • Tutorial • http://docs.codehaus.org/display/MAP/WFS+Tutorial
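A QUERY operation against a WFS is commonly issued as an HTTP GET request with key-value parameters. The sketch below only builds such a URL and makes no network call. The endpoint `http://example.org/wfs` and the feature type `topp:shorelines` are placeholders, `wfs_getfeature_url` is a hypothetical helper, and the parameter names assume the WFS 1.1.0 key-value encoding; other versions use different names.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def wfs_getfeature_url(endpoint, type_name, max_features=10):
    """Build a WFS 1.1.0 GetFeature request URL (key-value encoding)."""
    params = {
        "service": "WFS",
        "version": "1.1.0",
        "request": "GetFeature",
        "typeName": type_name,            # which feature type to query
        "maxFeatures": str(max_features), # cap on the number of features
    }
    return endpoint + "?" + urlencode(params)


# Placeholder endpoint and feature type, for illustration only.
url = wfs_getfeature_url("http://example.org/wfs", "topp:shorelines")
query = parse_qs(urlsplit(url).query)
print(url)
```

The response to such a request would be feature data encoded in GML, so the consumer sees the service interface and the GML encoding rather than the provider's underlying data model.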

  35. WFS examples

  36. Open Geospatial Consortium • Web Map Service (WMS) • http://www.opengeospatial.org/standards/wms • Produces maps of spatially referenced data dynamically from geographic information (a "map" is a portrayal of geographic information as a digital image file suitable for display on a computer screen). A map is not the data itself. WMS-produced maps are generally rendered in a pictorial format such as PNG, GIF or JPEG, or occasionally as vector-based graphical elements in Scalable Vector Graphics formats. • http://www.intl-interfaces.com/cookbook/WMS/ • http://oceanesip.jpl.nasa.gov/esipde/guide.html
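Because a WMS returns a rendered picture rather than data, a map request has to state the image geometry as well as the geographic extent. The sketch below builds a GetMap URL without contacting any server. The endpoint and layer name are placeholders, `wms_getmap_url` is a hypothetical helper, and the parameter names assume the WMS 1.1.1 key-value encoding.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def wms_getmap_url(endpoint, layer, bbox, size=(512, 256),
                   image_format="image/png", srs="EPSG:4326"):
    """Build a WMS 1.1.1 GetMap URL; bbox is (min_x, min_y, max_x, max_y)."""
    min_x, min_y, max_x, max_y = bbox
    if not (min_x < max_x and min_y < max_y):
        raise ValueError("bbox must be (min_x, min_y, max_x, max_y)")
    params = {
        "service": "WMS",
        "version": "1.1.1",
        "request": "GetMap",
        "layers": layer,
        "styles": "",                 # empty value requests the default style
        "srs": srs,                   # coordinate reference system of bbox
        "bbox": ",".join(str(v) for v in bbox),
        "width": str(size[0]),        # image width in pixels
        "height": str(size[1]),       # image height in pixels
        "format": image_format,       # a picture format, not a data format
    }
    return endpoint + "?" + urlencode(params)


# Placeholder endpoint and layer name, for illustration only.
url = wms_getmap_url("http://example.org/wms", "shoreline",
                     (-75.0, 40.0, -70.0, 45.0))
query = parse_qs(urlsplit(url).query)
print(url)
```

The `width`, `height`, and `format` parameters describe an image, which is the practical meaning of the statement that a map is not the data itself.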

  37. Open Geospatial Consortium • Web Coverage Service (WCS) • http://www.opengeospatial.org/standards/wcs • supports electronic interchange of geospatial data as "coverages" – that is, digital geospatial information representing space-varying phenomena

  38. Open Geospatial Consortium • Sensor Observation Service (SOS) • http://www.opengeospatial.org/standards/sos • SWE Common • http://www.opengeospatial.org/projects/groups/swecommonswg • GetCapabilities

  39. IVOA (www.ivoa.net) • Simple Image Access Protocol • http://ivoa.net/Documents/SIA/20091008/PR-SIA-1.0-20091008.pdf • This specification defines a protocol for retrieving image data from a variety of astronomical image repositories through a uniform interface. The interface is meant to be reasonably simple to implement by service providers. A query defining a rectangular region on the sky is used to query for candidate images. • The service returns a list of candidate images formatted as a VOTable. For each candidate image an access reference URL may be used to retrieve the image. Images may be returned in a variety of formats including FITS and various graphics formats. Referenced images are often computed on the fly, e.g., as cutouts from larger images.
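The "rectangular region on the sky" query can be sketched as a parameterised HTTP request. The example below assumes the `POS`, `SIZE`, and `FORMAT` query parameters of Simple Image Access version 1.0, with positions and sizes in decimal degrees; the endpoint is a placeholder, `sia_query_url` is a hypothetical helper, and no request is actually sent.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def sia_query_url(endpoint, ra_deg, dec_deg, width_deg, height_deg,
                  image_format="image/fits"):
    """Build an SIA 1.0 image query URL for a region centred on (ra, dec)."""
    if not 0.0 <= ra_deg <= 360.0:
        raise ValueError("right ascension must be within 0..360 degrees")
    if not -90.0 <= dec_deg <= 90.0:
        raise ValueError("declination must be within -90..90 degrees")
    params = {
        "POS": f"{ra_deg},{dec_deg}",          # centre of the region
        "SIZE": f"{width_deg},{height_deg}",   # angular extent of the region
        "FORMAT": image_format,                # requested image format
    }
    return endpoint + "?" + urlencode(params)


# Placeholder endpoint, for illustration only.
url = sia_query_url("http://example.org/sia", 180.0, -30.0, 0.5, 0.25)
query = parse_qs(urlsplit(url).query)
print(url)
```

A service answering this query would return a VOTable listing candidate images, each with an access reference URL for retrieving the image.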

  40. IVOA (www.ivoa.net) • E.g. Simple Spectrum Access Protocol • http://ivoa.net/Documents/REC/DAL/SSA-20080201.pdf • The Simple Spectrum Access (SSA) Protocol (SSAP) defines a uniform interface to remotely discover and access one dimensional spectra. SSA is a member of an integrated family of data access interfaces altogether comprising the Data Access Layer (DAL) of the IVOA. • SSA is based on a more general data model capable of describing most tabular spectrophotometric data, including time series and spectral energy distributions (SEDs) as well as 1-D spectra; however the scope of the SSA interface as specified in this document is limited to simple 1-D spectra, including simple aggregations of 1-D spectra.

  41. Discussion • Theoretical concepts? • Data models? • Schema? • Tools? • Service paradigms? • Relation to data management? • Provenance considerations • Is encapsulation good?

  42. Summary • Informatics as a new field • Data models and schema and the tools that go with them are plentiful • Modern use of XML and specific markup languages obscures the underlying data structure (physical and logical) but has other advantages • Data as service carries this to another level

  43. What is next • Next week • Data Workflow Management, Preservation and Data Stewardship • Reading: • See web site

  44. How about those projects?
