Webs of Data and Data on the Web, the Deep Web, Data Discovery, Data Citation. Peter Fox Data Science – CSCI/ERTH/ITWS-4350/6350 Week 12, November 26, 2013. Contents. Review of reading assignment Webs of data and semantic web Data on the web, linked data Deep web Data discovery
Webs of Data and Data on the Web, the Deep Web, Data Discovery, Data Citation Peter Fox Data Science – CSCI/ERTH/ITWS-4350/6350 Week 12, November 26, 2013
Contents • Review of reading assignment • Webs of data and semantic web • Data on the web, linked data • Deep web • Data discovery • Data citation • Summary • Next week
Reading • Mealy • Wickett et al. • Data Quality European Union Presentation • ISO Technical Standards - General Reference
Webs of data (science) • Early Web - Web of pages • http://www.ted.com/index.php/talks/tim_berners_lee_on_the_next_web.html • Semantic web started as a way to facilitate “machine accessible content” • Initially it was usable only by those familiar with the languages and tools, e.g. your parents could not use it • Webs of data grew out of this • One specific example is W3C’s Linked Open Data
Semantic Web • http://www.w3.org/2001/sw/ • “The Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries. It is a collaborative effort led by W3C with participation from a large number of researchers and industrial partners. It is based on the Resource Description Framework (RDF)...”
Terminology • Semantic Web • An extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation, www.semanticweb.org • Primer: http://www.ics.forth.gr/isl/swprimer/ • Ontology (n.d.). The Free On-line Dictionary of Computing. http://dictionary.reference.com/browse/ontology • An explicit formal specification of how to represent the objects, concepts and other entities that are assumed to exist in some area of interest and the relationships that hold among them.
Semantic Web Layers http://www.w3.org/2003/Talks/1023-iswc-tbl/slide26-0.html, http://flickr.com/photos/pshab/291147522/
Application Areas for SW • Smart search • Annotation (even simple forms), smart tagging • Geospatial • Implementing logic (rules), e.g. in workflows • Data integration • Verification • Web services • Web content mining with natural language parsing • User interface development (portals) • Semantic desktop • Wikis - OntoWiki, SemanticMediaWiki • Sensor Web • Software engineering • Explanation • … and the list goes on
Semantic Web Basics • The triple: {subject-predicate-object}, e.g. Interferometer is-a optical instrument; Optical instrument has focal length • W3C is the primary (but not sole) governing org. • Languages: RDF; OWL 1.0 and 2.0 - Web Ontology Language • RDF programming environments for 14+ languages, including C, C++, Python, Java, Javascript, Ruby, PHP,... (no Cobol or Ada yet ;-( ) • OWL programming for Java • Closed World - where complete knowledge is known (encoded), AI relied on this • Open World - where knowledge is incomplete/ evolving, SW promotes this
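The triple and the closed/open world distinction can be sketched in a few lines. This is a minimal illustration using plain Python sets rather than an RDF library; the function names and the "beam splitter" query are assumptions made for the example, not part of the course material.

```python
from typing import NamedTuple, Optional, Set


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# The two example statements from the slide.
FACTS: Set[Triple] = {
    Triple("Interferometer", "is-a", "optical instrument"),
    Triple("Optical instrument", "has", "focal length"),
}


def ask_closed_world(query: Triple, facts: Set[Triple]) -> bool:
    """Closed world: anything that is not encoded is taken to be false."""
    return query in facts


def ask_open_world(query: Triple, facts: Set[Triple]) -> Optional[bool]:
    """Open world: a missing statement is unknown (None), not false."""
    return True if query in facts else None
```

Asking about a statement that was never encoded, for instance `Triple("Interferometer", "has", "beam splitter")`, returns `False` under the closed-world reading and `None` (unknown) under the open-world reading.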
Ontology Spectrum (from least to most formal): Catalog/ ID → Terms/ glossary → Thesauri “narrower term” relation → Informal is-a → Formal is-a → Formal instance → Frames (properties) → Value Restrs. → Selected Logical Constraints (disjointness, inverse, …) → General Logical constraints. Originally from AAAI 1999 - Ontologies Panel by Gruninger, Lehmann, McGuinness, Uschold, Welty; updated by McGuinness. Description in: www.ksl.stanford.edu/people/dlm/papers/ontologies-come-of-age-abstract.html
Semantic Web Myths • ‘the Semantic Web is a reincarnation of Artificial Intelligence on the Web’ (closed world versus open world) • ‘it relies on giant, centrally controlled ontologies for "meaning" (as opposed to a democratic, bottom-up control of terms)’ • ‘one has to add metadata to all Web pages, convert all relational databases, and XML data to use the Semantic Web’ • ‘one has to learn formal logic, knowledge representation techniques, description logic, etc, to use it’ • ‘it is, essentially, an academic project, of no interest for industry’
Integrating Multiple Data Sources • The Semantic Web lets us merge statements from different sources • The RDF Graph Model allows programs to use data uniformly regardless of the source • Figuring out where to find such data is a motivator for Semantic Web Services [Figure: RDF graph about #Ionosphere with the statements: hasCoordinates #magnetic; name “Terrestrial Ionosphere”; hasLowerBoundaryValue “100”; hasLowerBoundaryUnit “km”. Different line & text colors represent different data sources]
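Merging statements from different sources can be sketched with ordinary set union over triples. The slide's color coding did not survive, so the split of statements between `source_a` and `source_b` below is an assumption for illustration only, as are the helper names.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]

# Assumption: which statement comes from which source is illustrative only.
source_a: Set[Triple] = {
    ("#Ionosphere", "name", "Terrestrial Ionosphere"),
    ("#Ionosphere", "hasCoordinates", "#magnetic"),
}
source_b: Set[Triple] = {
    ("#Ionosphere", "hasLowerBoundaryValue", "100"),
    ("#Ionosphere", "hasLowerBoundaryUnit", "km"),
}


def merge(*graphs: Set[Triple]) -> Set[Triple]:
    """Merging graphs is a set union: identical statements collapse."""
    merged: Set[Triple] = set()
    for graph in graphs:
        merged |= graph
    return merged


def describe(subject: str, graph: Set[Triple]) -> Dict[str, Set[str]]:
    """Collect every property of a subject, regardless of the source."""
    properties: Dict[str, Set[str]] = defaultdict(set)
    for s, p, o in graph:
        if s == subject:
            properties[p].add(o)
    return dict(properties)
```

After `merge(source_a, source_b)`, a program calls `describe("#Ionosphere", ...)` without needing to know which source contributed which statement.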
Drill Down /Focused Perusal • The Semantic Web uses Uniform Resource Identifiers (URIs) to name things • These can typically be resolved to get more information about the resource • This essentially creates a web of data analogous to the web of text created by the World Wide Web • Ontologies are represented using the same structure as content • We can resolve class and property URIs to learn about the ontology [Figure: graph of resources resolved over the Internet; nodes: …#NeutralTemperature, …#Norway, ...#ISR, ...#FPI, …#EISCAT, ...#MilllstoneHill; edges: locatedIn, measuredby, type, operatedby]
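Resolving a URI to "get more information about the resource" is commonly done with HTTP content negotiation. The sketch below only prepares the request, using the standard library; the chosen `Accept` value is an assumption, and whether a given server honors it depends on that server.

```python
from urllib.request import Request, urlopen

# Assumption: prefer Turtle, fall back to RDF/XML.
RDF_ACCEPT = "text/turtle, application/rdf+xml;q=0.9"


def build_rdf_request(uri: str) -> Request:
    """Prepare a GET request that asks for RDF rather than an HTML page."""
    return Request(uri, headers={"Accept": RDF_ACCEPT})


def resolve(uri: str, timeout: float = 10.0) -> bytes:
    """Dereference the URI (needs network access; not called in this sketch)."""
    with urlopen(build_rdf_request(uri), timeout=timeout) as response:
        return response.read()
```

The same call works for an instance, a class, or a property URI, which is what makes it possible to drill down from data into the ontology that describes it.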
Statements about Statements • The Semantic Web allows us to make statements about statements • Timestamps • Provenance / Lineage • Authoritativeness / Probability / Uncertainty • Security classification • … • This is an unsung virtue of the Semantic Web [Figure: a statement “#Aurora hascolor Red” with the annotations hasSource #Danny’s and hasDateTime 20031031] Ontologies Workshop, APL May 26, 2006
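One way to picture statements about statements is to use the statement itself as a key for further metadata. RDF offers its own mechanisms for this (reification, named graphs); the sketch below is only a plain-Python analogy, and the reading of the slide's aurora example is an assumption.

```python
from typing import Any, Dict, NamedTuple


class Statement(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Metadata keyed by the statement itself: statement -> {property: value}.
annotations: Dict[Statement, Dict[str, Any]] = {}


def annotate(statement: Statement, prop: str, value: Any) -> None:
    """Attach a property (timestamp, provenance, ...) to a statement."""
    annotations.setdefault(statement, {})[prop] = value


# Assumed reading of the slide's figure.
aurora_is_red = Statement("#Aurora", "hascolor", "Red")
annotate(aurora_is_red, "hasSource", "#Danny's")
annotate(aurora_is_red, "hasDateTime", "20031031")
```

Because the key is the whole triple, two sources that assert the same statement at different times can carry different provenance without the statement itself changing.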
‘Collecting’ the ‘data’ • Part of the (meta)data information is present in tools ... but thrown away at output, e.g. a business chart can be generated by a tool: it ‘knows’ the structure, the classification, etc. of the chart, but, usually, this information is lost; storing it in web data would be easy! • SW-aware tools are around (even if you do not know it...), though more would be good: • Photoshop CS stores metadata in RDF in, say, jpg files (using XMP) • RSS 1.0 feeds are generated by (almost) all blogging systems (a huge amount of RDF data!)
‘Collecting’ the ‘data’ • Scraping - different tools, services, etc, come around every day: • get RDF data associated with images, for example: service to get RDF from flickr images • service to get RDF from XMP • XSLT scripts to retrieve microformat data from XHTML files • scripts to convert spreadsheets to RDF – e.g. see csv2rdf4lod and the tools, tutorials, demos at http://logd.tw.rpi.edu • schema.org and the datasets extension
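The spreadsheet-to-RDF idea can be sketched as "one row becomes one subject, one column becomes one predicate". This does not use csv2rdf4lod or reproduce its conventions; the `BASE` namespace, the URI layout, and the sample rows are all hypothetical.

```python
import csv
import io
from typing import List, Tuple
from urllib.parse import quote

BASE = "http://example.org/"  # hypothetical namespace for this sketch

SAMPLE = "id,name,unit\nr1,temperature,K\n"  # made-up spreadsheet content


def rows_to_triples(csv_text: str, key_column: str) -> List[Tuple[str, str, str]]:
    """Turn each cell into (row URI, column URI, literal value)."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = BASE + "id/" + quote(row[key_column])
        for column, value in row.items():
            if column == key_column or not value:
                continue  # the key names the subject; empty cells say nothing
            triples.append((subject, BASE + "vocab/" + quote(column), value))
    return triples
```

A real converter also has to decide on datatypes, shared vocabularies, and links to other datasets, which is where most of the effort goes.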
‘Collecting’ the ‘data’ • SQL - A huge amount of data in Relational Databases • Although tools exist, it is not feasible to convert all of that data into RDF • Instead: SQL ⇋ RDF ‘bridges’ are being developed: a query to RDF data is transformed into SQL on-the-fly • Reading for this week, article by Berners-Lee and Sahoo et al. • W3C RDB2RDF Working Group - http://www.w3.org/2001/sw/rdb2rdf/ • D2RQ/ D2R Server • Commercial solutions appearing • NoSQL • Other ‘graph’ forms…
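The "bridge" idea, where a query against RDF is rewritten into SQL on the fly, can be sketched with an in-memory SQLite database. The `MAPPING` table, the `instrument` schema, and its rows are hypothetical; this is not D2RQ's mapping language or any W3C syntax, only the general shape of the translation.

```python
import sqlite3
from typing import List, Optional, Tuple

# Hypothetical mapping: predicate -> (table, key column, value column).
MAPPING = {
    "name": ("instrument", "id", "name"),
    "locatedIn": ("instrument", "id", "site"),
}


def demo_connection() -> sqlite3.Connection:
    """Build a small relational table with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE instrument (id INTEGER PRIMARY KEY, name TEXT, site TEXT)")
    conn.executemany(
        "INSERT INTO instrument (id, name, site) VALUES (?, ?, ?)",
        [(1, "radar-1", "site-A"), (2, "imager-1", "site-B")],
    )
    return conn


def match(conn: sqlite3.Connection, predicate: str,
          obj: Optional[str] = None) -> List[Tuple[str, str, str]]:
    """Answer the triple pattern (?s, predicate, obj) by generating SQL."""
    # Identifiers come only from MAPPING, never from user input.
    table, key, column = MAPPING[predicate]
    sql = f'SELECT "{key}", "{column}" FROM "{table}"'
    params: Tuple[str, ...] = ()
    if obj is not None:
        sql += f' WHERE "{column}" = ?'
        params = (obj,)
    sql += f' ORDER BY "{key}"'
    return [(f"#{table}/{row_id}", predicate, value)
            for row_id, value in conn.execute(sql, params)]
```

The relational data never leaves the database; only the rows that answer the pattern are turned into triples, which is why a bridge scales where bulk conversion does not.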
More Collecting • RDFa extends XHTML by: • extending the link and meta elements to include child elements • adding metadata to any element (a bit like the class in microformats, but via dedicated properties) • Used in schema.org/ datasets • It is very similar to microformats, but with more rigor: • it is a general framework (instead of an ‘agreement’ on the meaning of, say, a class attribute value) • terminologies can be mixed more easily • GRDDL - Gleaning Resource Descriptions from Dialects of Languages
Linked open data • http://linkeddata.org/guides-and-tutorials • http://tomheath.com/slides/2009-02-austin-linkeddata-tutorial.pdf (we will look at some of these slides now, #1-25 and 30-37) • And of course: • http://logd.tw.rpi.edu/
September 2011 - http://lod-cloud.net/ “Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/”
(Class 2) Management • Creation of logical collections • Physical data handling • Interoperability support • Security support • Data ownership • Metadata collection, management and access. • Persistence • Knowledge and information discovery • Data dissemination and publication
Data Management and WOD • How is the data managed? • Found? • Curated? • What about the metadata? • What problems are introduced/ solved? • See discussion in: Parsons and Fox (2012): http://mp-datamatters.blogspot.com/
Data on the Web, Internet • Data behind web services • Data files on web sites • We have covered data as service approaches (week 11) • Thinking you have found data when you have really only found information and metadata • The real difference between this topic and the next one is: • Access and dissemination • Level of curation (and often description)
Data on the internet • http://www.dataspaceweb.org/ • Data files on other protocols • FTP • RFTP • GridFTP • SABUL • XMPP/AMQP • Others…
Deep web • Data behind web services • Data behind query interfaces (databases or files) • Introduces a different curation problem
The loose definition • Something that a crawler cannot find and/or index • Creates the other definition of shallow web • Has many implications for discovery, access and use • Curation is more complex to satisfy this definition, i.e. not a matter of just putting files ‘on the web’ • 50, 100, 1000 times the ‘shallow web’?
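The loose definition can be made concrete with a toy crawler: it can only index what it can reach by following links. The site map and the query URL below are invented for illustration.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical site: page -> pages it links to.
LINKS: Dict[str, List[str]] = {
    "/": ["/about.html", "/data/"],
    "/about.html": ["/"],
    "/data/": ["/data/2013.csv", "/search.html"],
    "/data/2013.csv": [],
    "/search.html": [],  # a query form: its results have no inbound links
}

# Content that exists only as the answer to a query.
QUERY_RESULTS: Set[str] = {"/search?station=A&year=2013"}


def crawl(start: str, links: Dict[str, List[str]]) -> Set[str]:
    """Breadth-first link following, the way a simple crawler discovers pages."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen
```

Everything in `LINKS` is found (the shallow web), while nothing in `QUERY_RESULTS` ever is, even though the form page that leads to it is indexed.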
Managing (in) the deep web • Sometimes, the deep web aspect of a data source can be due to extreme obscurity, language peculiarities, NO metadata, NO documentation • There are no known studies of how effective data management (what you are learning) could change the percentage of deep/ shallow • Semantics are often put forward as a solution http://www.mkbergman.com/458/new-currents-in-the-deep-web/
Internet impacts on management • Management of data that is… on the Internet! • Web –> ‘stateless’ • Curation, Preservation –> highly stateful (by definition) • You will hear terms such as digital curation and digital preservation but what about internet curation and internet preservation (Internet Archive)?
Thus data frameworks are appearing • Many – meaning they go beyond web sites, they incorporate many of the data management functions • Initially syntactic – e.g. OPeNDAP, ADDE, ODATA, OODT • Application oriented – e.g. virtual observatories • Semantic – e.g. Virtual Solar-Terrestrial Observatory • ALL of these are changing the nature of data management and role of data ‘providers’
Some Definitions • DAP = Data Access Protocol: • Model used to describe the data; • Request syntax and semantics; and • Response syntax and semantics. • OPeNDAP: • The software; • Numerous reference implementations; • Core/libraries and services (servers and clients). • OPeNDAP Inc.: • OPeNDAP is a 501(c)(3) non-profit corporation; • Formed to maintain, evolve and promote the discipline-neutral DAP that was the DODS core infrastructure. BOM, Melbourne, VIC 20071015 (Fox)
Considerations with regard to the development of DAP and OPeNDAP • Many data formats • Many different client types • Many data providers • Many different semantic representations of the data • Many different security requirements
Broad Vision A world in which a single data access protocol is used for the exchange of data between network based applications regardless of discipline. A layer above TCP/IP providing for syntactic and semantic consistency not available in existing protocols such as FTP.
Practical Considerations The broad vision: • Is syntactically achievable, but • Was not semantically achievable, at least not fully, but perhaps in the near term.
The Data Access Protocol (DAP) • The DAP has been designed to be as general as possible without being constrained to a particular discipline or world view. • The DAP is a discipline neutral data access protocol; it is being used in astronomy, medicine, earth science,… • Provides data format and location, and data organization transparency • Is metadata neutral
OPeNDAP V4 (Hyrax) Architecture (request path: Client → OLFS → BES → Data) • OPeNDAP Lightweight Front end Server (OLFS) • Receives requests and asks the BES to fill them • Uses Java Servlets • Does not directly ‘touch’ data • Multi-protocol • Back End Server (BES) • Reads data files, databases, etc., returns info • May return DAP2 objects or other data • Does not require web server
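The division of labor between the front end and the back end can be sketched as two small classes. These are stand-ins invented for illustration, not Hyrax's actual interfaces: the class names, the `handle` and `read` methods, and the in-memory "dataset" are all assumptions.

```python
from typing import Dict, List


class BackEnd:
    """Stands in for the BES: the only component that reads data."""

    def __init__(self, datasets: Dict[str, List[float]]) -> None:
        self._datasets = datasets

    def read(self, name: str) -> List[float]:
        return list(self._datasets[name])


class FrontEnd:
    """Stands in for the OLFS: receives requests and asks the back end."""

    def __init__(self, backend: BackEnd) -> None:
        self._backend = backend

    def handle(self, path: str) -> str:
        # Split "dataset.suffix" into the dataset name and the requested form.
        name, _, suffix = path.rpartition(".")
        values = self._backend.read(name)  # the front end never opens files
        if suffix == "ascii":
            return ", ".join(str(v) for v in values)
        raise ValueError(f"unsupported response type: {suffix}")
```

Keeping all data access behind `BackEnd.read` mirrors the slide's point that the front end does not directly touch data, so the back end can run without a web server.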
OPeNDAP Clients [Figure: client applications and libraries that reach OPeNDAP servers over the Internet: IDL (via IDL Client), NCL (via NCL Client), Matlab (via Matlab Client), pyDAP, netCDF Java, netCDF C, ArcGIS, Ferret, GrADS, IDV, VisAD, ncBrowse, Access, Excel, Web Browser, OPeNDAP Data Connector]
OPeNDAP Servers [Figure: servers that expose stored data over the Internet; formats and handlers shown: netCDF, DSP, Tables, JGOFS, CDM, Flat Binary, ESML, HDF4, HDF5, SQL, FITS, CDF, CEDAR, FreeForm, JDBC, General]
OPeNDAP Servers (specialized processing) [Figure: servers shown: ESG, FDS, GDS, DAPPER, CODAR, pyDAP, TDS; data formats shown: netCDF, GRIB, BUFR, CODAR, General; each serves OPeNDAP over the Internet]
Servers • Servers may also provide other services • Directory traversal. • Browser-based form to build URL. • ASCII or other representations of data. • Metadata associated with the data. • Server side functions.
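The URL that such a browser form builds can be sketched directly. In DAP2, a request is conventionally the dataset URL plus a response suffix (such as `.dds`, `.das`, `.dods`, or `.ascii`) and an optional constraint expression with `[start:stride:stop]` index ranges. The server address, dataset name, and variable name below are hypothetical, and the helper functions are not part of any OPeNDAP library.

```python
from typing import Optional, Sequence

# Response suffixes handled by this sketch.
RESPONSES = {"dds", "das", "dods", "ascii"}


def hyperslab(start: int, stop: int, stride: int = 1) -> str:
    """Format one dimension's index range for a constraint expression."""
    if stride == 1:
        return f"[{start}:{stop}]"
    return f"[{start}:{stride}:{stop}]"


def dap2_url(dataset_url: str, response: str,
             variable: Optional[str] = None,
             slabs: Sequence[str] = ()) -> str:
    """Append the response suffix and, if given, a variable subset."""
    if response not in RESPONSES:
        raise ValueError(f"unknown response type: {response}")
    url = f"{dataset_url}.{response}"
    if variable:
        url += "?" + variable + "".join(slabs)
    return url
```

For example, `dap2_url("http://example.org/opendap/sst.nc", "ascii", "sst", [hyperslab(0, 9), hyperslab(0, 4, 2)])` asks for an ASCII rendering of a subset of a hypothetical `sst` variable.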
Summary Tetherless World Constellation
Data discovery • Free text search on the internet/ web • Data portals • What makes discovery work? • For Deep Web? • For Linked Data?
Data discovery • What makes discovery work? • Metadata • Logical organization • Attention to the fact that someone would want to discover it • It turns out that file types are a key enabler or inhibitor to discovery • What does not work? • Result ranking using *any* conventional algorithms
Smart search • Semantically aware search, e.g. http://noesis.itsc.uah.edu • Faceted search, e.g. • mspace (http://mspace.fm ) • jSpace • Exhibit (MIT) • S2S – e.g. International Open Government Dataset Catalog (IOGDC; http://logd.tw.rpi.edu )
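Faceted search boils down to two operations: filter records by the facet values chosen so far, then count the remaining values of each facet so the user can see where to narrow next. The catalog records below are made up, and the function names are illustrative rather than taken from mspace, Exhibit, or S2S.

```python
from collections import Counter
from typing import Dict, List

Record = Dict[str, str]

# Made-up catalog entries.
CATALOG: List[Record] = [
    {"title": "dataset-1", "country": "A", "format": "CSV"},
    {"title": "dataset-2", "country": "A", "format": "RDF"},
    {"title": "dataset-3", "country": "B", "format": "CSV"},
]


def select(records: List[Record], **facets: str) -> List[Record]:
    """Keep the records that match every selected facet value."""
    return [r for r in records
            if all(r.get(name) == value for name, value in facets.items())]


def facet_counts(records: List[Record], facet: str) -> Counter:
    """Count how many of the remaining records carry each facet value."""
    return Counter(r[facet] for r in records if facet in r)
```

Selecting `country="A"` leaves two datasets, and `facet_counts` on `format` then shows one CSV and one RDF entry, which is the feedback a faceted interface displays next to each filter.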
http://logd.tw.rpi.edu Intl. Open Govt. Data Cat.
Federated search • “is the simultaneous search of multiple online databases or web resources and is an emerging feature of automated, web-based library and information retrieval systems. It is also often referred to as a portal or a federated search engine.” (Wikipedia) • Libraries have been doing this for a long time (Z39.50, ISO 23950) • Key is consistent search metadata fields (keywords) • E.g. Geospatial One Stop http://www.geodata.gov
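The "simultaneous search" and the role of a consistent keyword field can be sketched with in-memory sources queried in parallel. This does not implement Z39.50; the source names, record identifiers, and the shared `keywords` field are assumptions for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Record = Dict[str, str]
Source = Callable[[str], List[Record]]


def make_source(name: str, records: List[Record]) -> Source:
    """Wrap a list of records as a searchable source with a shared 'keywords' field."""
    def search(keyword: str) -> List[Record]:
        return [dict(r, source=name) for r in records
                if keyword.lower() in r["keywords"].lower()]
    return search


def federated_search(keyword: str, sources: List[Source]) -> List[Record]:
    """Query every source at once, then merge and drop duplicate ids."""
    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        result_lists = list(pool.map(lambda source: source(keyword), sources))
    merged: List[Record] = []
    seen = set()
    for results in result_lists:  # pool.map keeps the order of `sources`
        for record in results:
            if record["id"] not in seen:
                seen.add(record["id"])
                merged.append(record)
    return merged
```

The merge step only works because every source exposes the same `id` and `keywords` fields, which is the slide's point about consistent search metadata.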
Data Citation • “Sound, reproducible scholarship rests upon a foundation of robust, accessible data. For this to be so in practice as well as theory, data must be accorded due importance in the practice of scholarship and in the enduring scholarly record. In other words, data should be considered legitimate, citable products of research. Data citation, like the citation of other evidence and sources, is good research practice.” (http://www.force11.org/datacitation)
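Treating data as a citable product implies that a dataset's metadata carries enough to assemble a citation. The sketch below assumes one plausible set of elements (creators, year, title, version, publisher, identifier) and one layout; neither is prescribed by the quoted statement, and the example values are invented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DatasetMetadata:
    creators: List[str]
    year: int
    title: str
    version: str
    publisher: str
    identifier: str  # a persistent identifier, e.g. a DOI


def format_citation(meta: DatasetMetadata) -> str:
    """Assemble a citation string; the layout is an assumption of this sketch."""
    creators = "; ".join(meta.creators)
    return (f"{creators} ({meta.year}): {meta.title}. "
            f"Version {meta.version}. {meta.publisher}. {meta.identifier}")


# Hypothetical dataset, for illustration only.
example = DatasetMetadata(
    creators=["Doe, J.", "Roe, R."],
    year=2013,
    title="Example Dataset",
    version="1.0",
    publisher="Example Data Center",
    identifier="doi:10.0000/example",
)
```

Including the version and a persistent identifier is what lets a reader retrieve the exact data that was used, not just the dataset's landing page.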