DiGIR: Distributed Generic Information Retrieval

Stan Blum, Dave Vieglais, P.J. Schwartz

Project Goals
• To define a protocol for retrieving structured data from multiple, heterogeneous databases
• To build a reference implementation of that protocol

Design Goals
• To use open protocols and standards, such as HTTP, XML, and UDDI, to leverage existing and emerging technologies
• To decouple the protocol, software, and semantics
• To automate the establishment of a new data provider as much as possible

High-level Architecture
• Protocol
• Provider
• Portal
• Registry

Protocol
• Defines request and response message formats for communication between Provider and Portal
• Assumes Providers conform to a known federation schema
• Remains flexible so that federation schemas can be plugged in

Provider
• Makes structured data available to portals
• Communicates via protocol-compliant messaging only
• Complies with a known federation schema
• Supplies meta-data to describe data classification and availability

Portal
• The entry point for a “user”
• Can make requests of any number of providers
• Communicates via protocol-compliant messaging only
• Queries the registry for available providers
• Can determine, based on provider meta-data, whether a provider should be queried

Project Information
• The DiGIR project is a collaborative effort
• DiGIR is currently established as an open source project on SourceForge (http://sourceforge.net)
• Further documentation is available on the SourceForge site
• Please join us in collaborating!

Protocol Details

Protocol Details
• Specified in an XML Schema (.xsd)
• Intended to work in conjunction with federation schemas, also expressed as XML Schemas
• Actual request and response documents are instance documents conforming to both the protocol schema and a federation schema

<request xmlns="http://www.namespaceTBD.org/digir"
         xmlns:darwin="http://www.namespaceTBD.org/darwin"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.namespaceTBD.org/digir digir.xsd http://www.namespaceTBD.org/darwin darwin.xsd">
  <header>
    <requestType>search</requestType>
  </header>
  <search>
    <dbName>myDiggableBipesDB</dbName>
    <filter>
      <and>
        <in>
          <list xsi:type="darwin:list">
            <darwin:Month>11</darwin:Month>
            <darwin:Month>12</darwin:Month>
          </list>
        </in>
        <equals>
          <darwin:Genus>Bipes</darwin:Genus>
        </equals>
      </and>
    </filter>
    <records start="0" count="50" />
  </search>
</request>

Request Explanation
• Composed of elements from the protocol namespace (the default) and the federation schema namespace
• <header> contains information about the payload
• <search> contains the dbName, the filter, and the record specification (it will also specify the result format)
• <filter> is effectively an XML representation of a SQL WHERE clause
• This search request asks for the first 50 specimen records of genus Bipes that were found in November or December

Filter Building
• LOPs (logical operators), which can be nested: <and>, <or>, <andNot>, <orNot>
• COPs (comparison operators): <equals>, <notEquals>, <lessThan>, <lessThanOrEquals>, <greaterThan>, <greaterThanOrEquals>, <like>, <in> (multi-value)
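Since a filter is effectively a SQL WHERE clause expressed in XML, the translation can be sketched as a short recursive walk over the operator elements. This is only an illustrative sketch, not the reference implementation: the operator-to-SQL mapping, the reading of <andNot> as "AND NOT", the use of concept names as column names, and the "?" placeholder style are all assumptions.

```python
import xml.etree.ElementTree as ET

# Assumed mapping from protocol operators to SQL; not taken from the DiGIR schema.
LOGICAL = {"and": "AND", "or": "OR", "andNot": "AND NOT", "orNot": "OR NOT"}
COMPARISON = {
    "equals": "=", "notEquals": "<>", "lessThan": "<", "lessThanOrEquals": "<=",
    "greaterThan": ">", "greaterThanOrEquals": ">=", "like": "LIKE",
}

def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]

def column_name(concept):
    """Assumption: the concept's local name doubles as the column name."""
    name = local(concept.tag)
    if not name.isidentifier():
        raise ValueError(f"unexpected concept name: {name!r}")
    return name

def to_where(node, params):
    """Return a SQL fragment for one operator element; bound values go into params."""
    op = local(node.tag)
    if op in LOGICAL:  # nested logical operator: recurse into each child
        parts = [f"({to_where(child, params)})" for child in node]
        return f" {LOGICAL[op]} ".join(parts)
    if op == "in":  # multi-value: a <list> holding repeated concept elements
        items = list(node[0])
        column = column_name(items[0])
        params.extend(item.text for item in items)
        return f"{column} IN ({', '.join('?' for _ in items)})"
    if op in COMPARISON:  # single concept element with a single value
        concept = node[0]
        params.append(concept.text)
        return f"{column_name(concept)} {COMPARISON[op]} ?"
    raise ValueError(f"unknown operator: {op}")

FILTER_XML = """
<filter xmlns="http://www.namespaceTBD.org/digir"
        xmlns:darwin="http://www.namespaceTBD.org/darwin">
  <and>
    <in>
      <list>
        <darwin:Month>11</darwin:Month>
        <darwin:Month>12</darwin:Month>
      </list>
    </in>
    <equals>
      <darwin:Genus>Bipes</darwin:Genus>
    </equals>
  </and>
</filter>
""".strip()

params = []
where = to_where(ET.fromstring(FILTER_XML)[0], params)
print(where, params)
```

Values are collected as bound parameters rather than pasted into the string, so filter content supplied by a portal never becomes part of the SQL text itself.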

What “binds” the schemas?
• The protocol schema defines various abstract types and elements:
    <xsd:element name="searchCondition" abstract="true" />
    <xsd:element name="alphaSearchCondition" abstract="true" substitutionGroup="searchCondition" />
    <xsd:complexType name="listType" abstract="true" />
    <xsd:complexType name="numericListType" abstract="true" />
• A federation schema must define searchable concepts, or groups of them, as substitutable for these abstract elements or as extensions of the abstract types:
    <xsd:element name="Species" type="xsd:string" substitutionGroup="digir:alphaSearchCondition" />

<xsd:complexType name="list">
  <xsd:complexContent>
    <xsd:extension base="digir:listType">
      <xsd:sequence>
        <xsd:choice>
          <xsd:element ref="ScientificName" maxOccurs="unbounded" />
          <xsd:element ref="Kingdom" maxOccurs="unbounded" />
          <xsd:element ref="Phylum" maxOccurs="unbounded" />
          <xsd:element ref="Class" maxOccurs="unbounded" />
          <xsd:element ref="Order" maxOccurs="unbounded" />
          <xsd:element ref="Family" maxOccurs="unbounded" />
          <xsd:element ref="Genus" maxOccurs="unbounded" />
          <xsd:element ref="Species" maxOccurs="unbounded" />
          <…>
        </xsd:choice>
      </xsd:sequence>
    </xsd:extension>
  </xsd:complexContent>
</xsd:complexType>

Why “bind” like this?
• To provide data-typing (string, numeric, etc.) for various concepts within operators at an abstract level (e.g. LIKE is only valid for string data; IN allows for multiples, but in a controlled fashion)
• To allow federation schemas to simply classify data as types without having to redefine or extend operators
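In DiGIR these typing rules are meant to be enforced by XML Schema validation itself. Purely to illustrate what the binding buys, the same rules can be mimicked in a few lines of plain Python. The concept table, the "alpha"/"numeric" labels, and the exact rule set below are hypothetical stand-ins for what a federation schema would declare.

```python
# Hypothetical concept-to-type table; in DiGIR this would come from the federation schema.
CONCEPT_TYPES = {"Genus": "alpha", "Species": "alpha", "Month": "numeric"}

# Assumed rules mirroring the slide: LIKE is string-only, IN is the only multi-value operator.
TYPE_RESTRICTED = {"like": {"alpha"}}
MULTI_VALUE = {"in"}

def check(op, concept, values):
    """Return a problem description, or None if the operator use looks valid."""
    ctype = CONCEPT_TYPES.get(concept)
    if ctype is None:
        return f"{concept} is not a searchable concept"
    allowed = TYPE_RESTRICTED.get(op)
    if allowed is not None and ctype not in allowed:
        return f"<{op}> is not valid for {ctype} concept {concept}"
    if len(values) > 1 and op not in MULTI_VALUE:
        return f"<{op}> takes a single value"
    return None

print(check("like", "Genus", ["Bip%"]))      # valid: LIKE on string data
print(check("like", "Month", ["1%"]))        # rejected: LIKE on numeric data
print(check("in", "Month", ["11", "12"]))    # valid: IN with several values
```

The point of the design is that a federation schema only has to classify each concept once; the operators never need to be redefined per concept.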

Request Issues
• Do we need another abstract element such as dateSearchCondition?
• What information will be useful in the header?
• How should we specify the format of the results? What standard formats should be offered (e.g. brief, full)?
• Will tblName be part of the meta-data required of providers?
• What concepts of Darwin Core 2 are searchable?

Response Prototype

<response xmlns="http://www.namespaceTBD.org/digir"
          xmlns:darwin="http://www.namespaceTBD.org/darwin"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://www.namespaceTBD.org/digir digir.xsd http://www.namespaceTBD.org/darwin darwin.xsd">
  <header>
  </header>
  <content>
    <record>
    </record>
  </content>
  <diagnostics>
  </diagnostics>
</response>

Response Issues
• How do we format and validate the response content?
• What elements are needed for the <header>, if any?
• Do we always have diagnostics, or only if there is an error?
• Should a finite set of diagnostics be created and maintained in its own XML Schema? Will there ever be a diagnostic that is specific to a federation schema?

Provider Details

Provider Details
• Implemented as a web application that answers questions
• Interface is not specific to a particular information domain
• No state information is recorded
• Each request is treated as unique and uninfluenced by previous requests
• Must always generate a valid response
• Consists of four key components: request handler, filter handler, result set cache, and response generator

Request Handler
• Receives the XML document
• Validates the document
• Generates internal structures for further processing
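The three steps above might look roughly like the following sketch. It is not the reference implementation: the SearchRequest structure and handle_request function are hypothetical names, and the hand-written structural checks merely stand in for real validation against the protocol and federation schemas, which the Python standard library cannot do.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

NS = {"d": "http://www.namespaceTBD.org/digir"}  # placeholder namespace from the slides

@dataclass
class SearchRequest:
    """Hypothetical internal structure handed on to the filter handler."""
    request_type: str
    db_name: str
    filter_root: ET.Element
    start: int
    count: int

def handle_request(xml_text):
    """Receive the document, check its shape, and build the internal structure."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        raise ValueError(f"not well-formed XML: {exc}") from exc
    request_type = root.findtext("d:header/d:requestType", namespaces=NS)
    search = root.find("d:search", NS)
    if request_type != "search" or search is None:
        raise ValueError("only search requests are handled in this sketch")
    db_name = search.findtext("d:dbName", namespaces=NS)
    filter_root = search.find("d:filter", NS)
    records = search.find("d:records", NS)
    if db_name is None or filter_root is None or records is None:
        raise ValueError("search needs dbName, filter, and records")
    try:
        start, count = int(records.attrib["start"]), int(records.attrib["count"])
    except (KeyError, ValueError) as exc:
        raise ValueError("records needs integer start and count") from exc
    return SearchRequest(request_type, db_name, filter_root, start, count)

SAMPLE = """
<request xmlns="http://www.namespaceTBD.org/digir"
         xmlns:darwin="http://www.namespaceTBD.org/darwin">
  <header><requestType>search</requestType></header>
  <search>
    <dbName>myDiggableBipesDB</dbName>
    <filter><equals><darwin:Genus>Bipes</darwin:Genus></equals></filter>
    <records start="0" count="50" />
  </search>
</request>
""".strip()

request = handle_request(SAMPLE)
print(request.db_name, request.start, request.count)
```

Because a provider must always generate a valid response, a caller would typically catch the ValueError here and turn it into a diagnostic rather than let it propagate.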

Filter Handler
• Holds the internal structural representation of the filter (query)
• Responsible for generating a native query string for querying the database
• Communicates with UDDI to obtain the standard database definition
• Custom-configured to work with a specific database implementation

Result Set Cache
• Contains the results of applying a query
• Responsible for generating the response records in the requested format
• Somewhat directly integrated with the response generator

Response Generator
• Generates the response XML document
• Serializes the response header information
• Serializes diagnostic information
• Serializes the requested subset of records
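A minimal sketch of such a generator follows, using the element names from the response prototype. Several details are assumptions because the slides leave them open: the header is left empty, each record is filled with elements from the federation (darwin) namespace, and the <diagnostic> child element is a hypothetical name. The build_response function and the row dictionaries are likewise illustrative.

```python
import xml.etree.ElementTree as ET

DIGIR = "http://www.namespaceTBD.org/digir"    # placeholder namespaces from the slides
DARWIN = "http://www.namespaceTBD.org/darwin"

def build_response(rows, start, count, diagnostics=()):
    """Serialize the requested slice of rows plus any diagnostics as a response document."""
    ET.register_namespace("", DIGIR)           # emit the protocol namespace as the default
    ET.register_namespace("darwin", DARWIN)
    response = ET.Element(f"{{{DIGIR}}}response")
    ET.SubElement(response, f"{{{DIGIR}}}header")  # contents are still an open issue
    content = ET.SubElement(response, f"{{{DIGIR}}}content")
    for row in rows[start:start + count]:      # only the requested subset of records
        record = ET.SubElement(content, f"{{{DIGIR}}}record")
        for concept, value in row.items():
            ET.SubElement(record, f"{{{DARWIN}}}{concept}").text = str(value)
    diag = ET.SubElement(response, f"{{{DIGIR}}}diagnostics")
    for message in diagnostics:
        # <diagnostic> is a hypothetical element name used for illustration.
        ET.SubElement(diag, f"{{{DIGIR}}}diagnostic").text = message
    return ET.tostring(response, encoding="unicode")

rows = [
    {"Genus": "Bipes", "Month": 11},
    {"Genus": "Bipes", "Month": 12},
    {"Genus": "Bipes", "Month": 11},
]
print(build_response(rows, start=0, count=2))
```

Always emitting a <diagnostics> element, even when empty, is one possible answer to the open question of whether diagnostics appear only on error; it keeps the document shape constant for portals.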

Provider Configuration

Portal Details

Portal Details
• Divided into two distinct components: a presentation layer and PortalServices
• The presentation layer supports the UI and translates requests (HTTP requests from forms or links) into protocol-compliant XML requests
• The presentation layer also handles all display issues involving the responses, such as format, sorting, and collating
• The presentation layer is envisioned to be an application server/web server implementation

Portal Details
• PortalServices handles all external network activity (UDDI calls, provider calls, etc.)
• PortalServices limits provider calls to those necessary, based on provider meta-data
• PortalServices runs provider calls in separate threads to improve performance (i.e. response time)
• PortalServices is envisioned to be a webapp and supporting classes running within an application server, such as Tomcat
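The filter-then-fan-out behaviour described for PortalServices can be sketched as below. The slides envision a webapp running in an application server; Python is used here only to keep the illustration short. The meta-data layout (a "concepts" list per provider), the rule in should_query, and the caller-supplied send function are all assumptions, since the rules for provider filtering are still listed as an open issue.

```python
from concurrent.futures import ThreadPoolExecutor

def should_query(provider, wanted_concepts):
    """Assumed rule: query a provider only if its meta-data lists every requested concept."""
    return set(wanted_concepts) <= set(provider["concepts"])

def query_providers(providers, wanted_concepts, request_xml, send, max_workers=8):
    """Send the request to each relevant provider in its own thread; collect results by name."""
    selected = [p for p in providers if should_query(p, wanted_concepts)]
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(send, p["url"], request_xml): p["name"] for p in selected}
        for future, name in futures.items():
            try:
                results[name] = future.result(timeout=30)
            except Exception as exc:  # one failing provider should not sink the others
                results[name] = exc
    return results

# Hypothetical providers and a stand-in for the real HTTP call.
providers = [
    {"name": "a", "url": "http://a.example/digir", "concepts": ["Genus", "Month"]},
    {"name": "b", "url": "http://b.example/digir", "concepts": ["Genus"]},
]

def fake_send(url, request_xml):
    return f"response from {url}"

print(query_providers(providers, ["Genus", "Month"], "<request/>", fake_send))
```

With threaded calls the overall response time is bounded by the slowest selected provider rather than by the sum of all provider round trips.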

PortalServices
• RegistryAccess
• ProviderCache
• PortalConfig
• PortalServlet
• PortalRequestHandler
• ProviderFilterer
• Marshallers

Portal Issues
• What information will be stored in UDDI about a provider?
• What information will be needed for communicating with a Provider (e.g. IP address, port)?
• What meta-data will be provided, and what are the rules for using such data for provider filtering?
• What requirements are there for logging and monitoring?
