DiGIR is a protocol for retrieving structured data from heterogeneous databases. It aims to leverage existing technologies, automate new data provider establishment, and use open protocols and standards.
DiGIR: Distributed Generic Information Retrieval
Stan Blum, Dave Vieglais, P.J. Schwartz
Project Goals
• To define a protocol for retrieving structured data from multiple, heterogeneous databases
• To build a reference implementation of that protocol
Design Goals
• To use open protocols and standards, such as HTTP, XML, and UDDI, to leverage existing and emerging technologies
• To decouple the protocol, software, and semantics
• To automate the establishment of a new data provider as much as possible
High-level Architecture
• Protocol
• Provider
• Portal
• Registry
Protocol
• Defines request and response message formats for communication between Provider and Portal
• Assumes Providers conform to a known federation schema
• Remains flexible to allow for federation schema pluggability
Provider
• Makes structured data available to portals
• Communicates via protocol-compliant messaging only
• Complies with a known federation schema
• Supplies meta-data to describe data classification and availability
Portal
• The entry point for a “user”
• Can make requests of any number (N) of providers
• Communicates via protocol-compliant messaging only
• Queries the registry for available providers
• Can determine, based on provider meta-data, whether a provider should be queried
Project Information
• The DiGIR project is a collaborative effort
• DiGIR is currently established as an open source project on SourceForge (http://sourceforge.net).
• Further documentation is available on the SourceForge site.
• Please join us in collaborating!
Protocol Details
• Specified in an XML Schema (.xsd)
• Intended to work in conjunction with federation schemas, also expressed as XML Schemas
• Actual request and response documents are instance documents conforming to both the protocol schema and a federation schema
<request xmlns="http://www.namespaceTBD.org/digir"
         xmlns:darwin="http://www.namespaceTBD.org/darwin"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.namespaceTBD.org/digir digir.xsd
                             http://www.namespaceTBD.org/darwin darwin.xsd">
  <header>
    <requestType>search</requestType>
  </header>
  <search>
    <dbName>myDiggableBipesDB</dbName>
    <filter>
      <and>
        <in>
          <list xsi:type="darwin:list">
            <darwin:Month>11</darwin:Month>
            <darwin:Month>12</darwin:Month>
          </list>
        </in>
        <equals>
          <darwin:Genus>Bipes</darwin:Genus>
        </equals>
      </and>
    </filter>
    <records start="0" count="50" />
  </search>
</request>
Request Explanation
• Composed of elements from the protocol namespace (the default) and the federation schema namespace
• <header> contains information about the payload
• <search> contains the dbName, the filter, and the record specification (it will also specify the result format)
• <filter> is effectively an XML representation of a SQL WHERE clause
• This search request asks for the first 50 specimen records of genus Bipes that were collected in November or December.
Filter Building
• LOPs (logical operators): <and>, <or>, <andNot>, <orNot>; can be nested
• COPs (comparison operators): <equals>, <notEquals>, <lessThan>, <lessThanOrEquals>, <greaterThan>, <greaterThanOrEquals>, <like>, <in> (multi-value)
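Because a <filter> is described as an XML form of a SQL WHERE clause, the operator lists above can be illustrated with a small recursive translator. This is a minimal sketch, not the project's implementation: the SQL spelling chosen for each operator, the assumption that every logical operator takes exactly two operands, and the use of a concept's local element name as the column name are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed SQL spellings for the operator elements; the slides do not define them.
LOGICAL = {"and": "AND", "or": "OR", "andNot": "AND NOT", "orNot": "OR NOT"}
COMPARISON = {
    "equals": "=", "notEquals": "<>", "like": "LIKE",
    "lessThan": "<", "lessThanOrEquals": "<=",
    "greaterThan": ">", "greaterThanOrEquals": ">=",
}


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def to_where(node, params):
    """Translate one operator element into a parameterized SQL fragment."""
    name = local(node.tag)
    children = list(node)
    if name in LOGICAL:
        # Assumption: logical operators are binary; operands may be nested.
        left, right = (to_where(child, params) for child in children)
        return f"({left} {LOGICAL[name]} {right})"
    if name in COMPARISON:
        concept = children[0]
        params.append(concept.text)
        return f"{local(concept.tag)} {COMPARISON[name]} ?"
    if name == "in":
        items = list(children[0])  # the concepts inside the <list> wrapper
        params.extend(item.text for item in items)
        marks = ", ".join("?" for _ in items)
        return f"{local(items[0].tag)} IN ({marks})"
    raise ValueError(f"unsupported operator: {name}")


def filter_to_sql(filter_xml):
    """Return (where_clause, parameters) for a serialized <filter> element."""
    filter_el = ET.fromstring(filter_xml)
    params = []
    return to_where(filter_el[0], params), params


EXAMPLE_FILTER = """<filter xmlns="http://www.namespaceTBD.org/digir"
        xmlns:darwin="http://www.namespaceTBD.org/darwin"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <and>
    <in>
      <list xsi:type="darwin:list">
        <darwin:Month>11</darwin:Month>
        <darwin:Month>12</darwin:Month>
      </list>
    </in>
    <equals><darwin:Genus>Bipes</darwin:Genus></equals>
  </and>
</filter>"""

if __name__ == "__main__":
    print(filter_to_sql(EXAMPLE_FILTER))
```

Values are passed as bound parameters rather than spliced into the string. A real provider would also need to map concept names to its own column names, which this sketch does not attempt.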
What “binds” the schemas?
• The protocol schema defines various abstract types and elements:
    <xsd:element name="searchCondition" abstract="true" />
    <xsd:element name="alphaSearchCondition" abstract="true" substitutionGroup="searchCondition" />
    <xsd:complexType name="listType" abstract="true" />
    <xsd:complexType name="numericListType" abstract="true" />
• A federation schema must define searchable concepts, or groups of them, as substitutable for these abstract elements or as extensions of the abstract types:
    <xsd:element name="Species" type="xsd:string" substitutionGroup="digir:alphaSearchCondition" />
<xsd:complexType name="list">
  <xsd:complexContent>
    <xsd:extension base="digir:listType">
      <xsd:sequence>
        <xsd:choice>
          <xsd:element ref="ScientificName" maxOccurs="unbounded" />
          <xsd:element ref="Kingdom" maxOccurs="unbounded" />
          <xsd:element ref="Phylum" maxOccurs="unbounded" />
          <xsd:element ref="Class" maxOccurs="unbounded" />
          <xsd:element ref="Order" maxOccurs="unbounded" />
          <xsd:element ref="Family" maxOccurs="unbounded" />
          <xsd:element ref="Genus" maxOccurs="unbounded" />
          <xsd:element ref="Species" maxOccurs="unbounded" />
          <…>
        </xsd:choice>
      </xsd:sequence>
    </xsd:extension>
  </xsd:complexContent>
</xsd:complexType>
Why “bind” like this?
• To provide data typing (string, numeric, etc.) for the various concepts used within operators at an abstract level (e.g., LIKE is only valid for string data; IN allows multiple values, but in a controlled fashion)
• To allow federation schemas to simply classify data by type without having to redefine or extend operators
Request Issues
• Do we need another abstract element such as dateSearchCondition?
• What information will be useful in the header?
• How should we specify the format of the results? What standard formats should be offered (e.g., brief, full)?
• Will tblName be part of the meta-data required of providers?
• What concepts of Darwin Core 2 are searchable?
Response Prototype
<response xmlns="http://www.namespaceTBD.org/digir"
          xmlns:darwin="http://www.namespaceTBD.org/darwin"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://www.namespaceTBD.org/digir digir.xsd
                              http://www.namespaceTBD.org/darwin darwin.xsd">
  <header>
    <!-- contents TBD -->
  </header>
  <content>
    <record>
    </record>
  </content>
  <diagnostics>
  </diagnostics>
</response>
Response Issues
• How do we format and validate the response content?
• What elements are needed for the <header>, if any?
• Do we always have diagnostics, or only if there is an error?
• Should a finite set of diagnostics be created and maintained in its own XML Schema? Will there ever be a diagnostic that is specific to a federation schema?
Provider Details
• Implemented as a web application that answers questions
• Interface is not specific to a particular information domain
• No state information is recorded
• Each request is treated as unique and uninfluenced by previous requests
• Must always generate a valid response
• Consists of four key components:
  • Request handler
  • Filter handler
  • Result set cache
  • Response generator
Request Handler
• Receives the XML document
• Validates the document
• Generates internal structures for further processing
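The three request-handler steps can be sketched as follows. This is a hedged illustration only: the `SearchRequest` structure and `handle_request` function are hypothetical names, and the "validation" shown is a handful of structural checks, since Python's standard library has no XML Schema validator and a real provider would validate against the protocol and federation schemas.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Placeholder protocol namespace, as it appears in the slides.
NS = {"d": "http://www.namespaceTBD.org/digir"}


@dataclass
class SearchRequest:
    """Hypothetical internal structure handed on to the filter handler."""
    request_type: str
    db_name: str
    filter_root: ET.Element  # first operator element under <filter>
    start: int
    count: int


def handle_request(xml_text):
    """Receive a request document, check its structure, build SearchRequest."""
    root = ET.fromstring(xml_text)  # raises ET.ParseError on malformed XML
    if root.tag != f"{{{NS['d']}}}request":
        raise ValueError("root element must be <request> in the protocol namespace")

    request_type = root.findtext("d:header/d:requestType", namespaces=NS)
    search = root.find("d:search", NS)
    if request_type != "search" or search is None:
        raise ValueError("only search requests are handled in this sketch")

    filter_el = search.find("d:filter", NS)
    records = search.find("d:records", NS)
    if filter_el is None or len(filter_el) == 0 or records is None:
        raise ValueError("<search> needs a non-empty <filter> and a <records> element")

    return SearchRequest(
        request_type=request_type,
        db_name=search.findtext("d:dbName", default="", namespaces=NS),
        filter_root=filter_el[0],
        start=int(records.get("start", "0")),
        count=int(records.get("count", "0")),
    )
```

Raising an exception on a bad document keeps the sketch short; a provider that must always return a valid response would catch these errors and report them as diagnostics instead.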
Filter Handler
• Holds an internal representation of the filter (query) structure
• Responsible for generating a native query string for querying the database
• Communicates with UDDI to obtain the standard database definition
• Custom-configured to work with a specific database implementation
Result Set Cache
• Contains the results of applying a query
• Responsible for generating the response records in the requested format
• Closely integrated with the response generator
Response Generator
• Generates the response XML document
• Serializes the response header information
• Serializes diagnostic information
• Serializes the requested subset of records
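As a rough sketch of what the response generator does, the function below serializes a header, the requested slice of records, and any diagnostics into a document shaped like the response prototype. Several details are assumptions, because the slides leave them open: the children of <record> (here, one federation-namespace element per concept), the <diagnostic> element name, and the representation of records as Python dictionaries.

```python
import xml.etree.ElementTree as ET

# Placeholder namespaces, as they appear in the slides.
DIGIR_NS = "http://www.namespaceTBD.org/digir"
DARWIN_NS = "http://www.namespaceTBD.org/darwin"


def build_response(records, start=0, count=50, diagnostics=()):
    """Serialize the requested subset of records into a <response> document.

    records     -- list of dicts mapping concept name to value (assumed shape)
    start/count -- the window requested via <records start=".." count="..">
    diagnostics -- iterable of message strings (assumed representation)
    """
    # Emit the protocol namespace as the default, the federation one as darwin:
    ET.register_namespace("", DIGIR_NS)
    ET.register_namespace("darwin", DARWIN_NS)

    response = ET.Element(f"{{{DIGIR_NS}}}response")
    ET.SubElement(response, f"{{{DIGIR_NS}}}header")  # contents TBD in the slides

    content = ET.SubElement(response, f"{{{DIGIR_NS}}}content")
    for record in records[start:start + count]:
        record_el = ET.SubElement(content, f"{{{DIGIR_NS}}}record")
        for concept, value in record.items():
            ET.SubElement(record_el, f"{{{DARWIN_NS}}}{concept}").text = str(value)

    diagnostics_el = ET.SubElement(response, f"{{{DIGIR_NS}}}diagnostics")
    for message in diagnostics:
        # <diagnostic> is a hypothetical element name.
        ET.SubElement(diagnostics_el, f"{{{DIGIR_NS}}}diagnostic").text = message

    return ET.tostring(response, encoding="unicode")


if __name__ == "__main__":
    rows = [{"Genus": "Bipes", "Month": 11}, {"Genus": "Bipes", "Month": 12}]
    print(build_response(rows, start=0, count=1, diagnostics=["2 records matched"]))
```

Always emitting <diagnostics>, even when empty, is one possible answer to the open question of whether diagnostics appear only on error; the sketch does not settle it.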
Provider Configuration
Portal Details
• Divided into two distinct components: a presentation layer and PortalServices
• The presentation layer supports the UI and translates requests (HTTP requests from forms or links) into protocol-compliant XML requests
• The presentation layer also handles all display issues involving the responses, such as formatting, sorting, and collating
• The presentation layer is envisioned as an application server/web server implementation
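To make the presentation layer's translation step concrete, here is a minimal sketch that turns a decoded HTTP query string into a search request document. The form field names (`db`, `genus`, `start`, `count`) and the `form_to_request` function are hypothetical, and the sketch only builds a single <equals> condition on Genus; namespace prefixes in the output are generated by ElementTree and may differ from the hand-written example in the slides.

```python
import xml.etree.ElementTree as ET
from urllib.parse import parse_qs

# Placeholder namespaces, as they appear in the slides.
DIGIR_NS = "http://www.namespaceTBD.org/digir"
DARWIN_NS = "http://www.namespaceTBD.org/darwin"


def q(namespace, name):
    """Build an ElementTree qualified name such as '{ns}request'."""
    return f"{{{namespace}}}{name}"


def form_to_request(query_string):
    """Translate a form submission (URL query string) into a request document."""
    # parse_qs returns lists of values; keep the first value of each field.
    form = {key: values[0] for key, values in parse_qs(query_string).items()}

    request = ET.Element(q(DIGIR_NS, "request"))
    header = ET.SubElement(request, q(DIGIR_NS, "header"))
    ET.SubElement(header, q(DIGIR_NS, "requestType")).text = "search"

    search = ET.SubElement(request, q(DIGIR_NS, "search"))
    ET.SubElement(search, q(DIGIR_NS, "dbName")).text = form["db"]

    # One <equals> condition on the Genus concept (illustrative only).
    filter_el = ET.SubElement(search, q(DIGIR_NS, "filter"))
    equals = ET.SubElement(filter_el, q(DIGIR_NS, "equals"))
    ET.SubElement(equals, q(DARWIN_NS, "Genus")).text = form["genus"]

    ET.SubElement(
        search,
        q(DIGIR_NS, "records"),
        start=form.get("start", "0"),
        count=form.get("count", "50"),
    )
    return ET.tostring(request, encoding="unicode")


if __name__ == "__main__":
    print(form_to_request("db=myDiggableBipesDB&genus=Bipes&count=50"))
```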
Portal Details
• PortalServices handles all external network activity (UDDI calls, provider calls, etc.)
• PortalServices limits provider calls to those necessary, based on provider meta-data
• PortalServices threads provider calls for increased performance (i.e., response time)
• PortalServices is envisioned as a webapp and supporting classes running within an application server, such as Tomcat
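The two PortalServices behaviours described above, skipping providers whose meta-data rules them out and threading the remaining calls, might look roughly like this. The meta-data shape (a `kingdoms` set per provider), the `should_query` rule, and the caller-supplied `send` function are all hypothetical, since the slides leave the meta-data and transport details as open issues.

```python
from concurrent.futures import ThreadPoolExecutor


def should_query(provider, wanted_kingdom):
    """Hypothetical meta-data rule: query only providers holding that kingdom."""
    return wanted_kingdom in provider["kingdoms"]


def query_providers(providers, request_xml, send, wanted_kingdom, max_workers=8):
    """Send request_xml to every relevant provider concurrently.

    providers -- list of dicts with 'name' and 'kingdoms' keys (assumed shape)
    send      -- callable(provider, request_xml) -> response text; supplied by
                 the caller so the sketch stays independent of any transport
    Returns a dict mapping provider name to ('ok', response) or ('error', message).
    """
    selected = [p for p in providers if should_query(p, wanted_kingdom)]
    results = {}
    if not selected:
        return results

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {p["name"]: pool.submit(send, p, request_xml) for p in selected}
        for name, future in futures.items():
            try:
                results[name] = ("ok", future.result())
            except Exception as exc:  # one failing provider must not sink the rest
                results[name] = ("error", str(exc))
    return results


if __name__ == "__main__":
    demo_providers = [
        {"name": "herps", "kingdoms": {"Animalia"}},
        {"name": "herbarium", "kingdoms": {"Plantae"}},
    ]

    def fake_send(provider, request_xml):
        return f"<response from='{provider['name']}'/>"

    print(query_providers(demo_providers, "<request/>", fake_send, "Animalia"))
```

Because the calls run in parallel, total response time is bounded by the slowest provider rather than the sum of all of them, which is the performance gain the slide refers to.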
PortalServices
• RegistryAccess
• ProviderCache
• PortalConfig
• PortalServlet
• PortalRequestHandler
• ProviderFilterer
• Marshallers
Portal Issues
• What information will be stored in UDDI about a provider?
• What information will be needed to communicate with a Provider (e.g., IP address, port)?
• What meta-data will be provided, and what are the rules for using such data for provider filtering?
• What requirements are there for logging and monitoring?