
DiGIR: Distributed Generic Information Retrieval

DiGIR is a protocol for retrieving structured data from multiple, heterogeneous databases. It aims to build on existing technologies, use open protocols and standards, and automate the establishment of new data providers as far as possible.

mfernandes


Presentation Transcript


  1. DiGIR: Distributed Generic Information Retrieval
     Stan Blum, Dave Vieglais, P.J. Schwartz

  2. Project Goals
     • To define a protocol for retrieving structured data from multiple, heterogeneous databases
     • To build a reference implementation of said protocol

  3. Design Goals
     • To use open protocols and standards, such as HTTP, XML, and UDDI, to leverage existing and emerging technologies
     • To decouple the protocol, software, and semantics
     • To automate the establishment of a new data provider as much as possible

  4. High-level Architecture
     • Protocol
     • Provider
     • Portal
     • Registry

  5. Protocol
     • Defines request and response message formats for communication between Provider and Portal
     • Assumes Providers conform to a known federation schema
     • Remains flexible to allow for federation schema pluggability

  6. Provider
     • Makes structured data available to portals
     • Communicates via protocol-compliant messaging only
     • Complies with a known federation schema
     • Supplies meta-data to describe data classification and availability

  7. Portal
     • The entry point for a “user”
     • Can make requests of any number (N) of providers
     • Communicates via protocol-compliant messaging only
     • Queries the registry for available providers
     • Can determine, based on provider meta-data, whether a provider should be queried

  8. Project Information
     • The DiGIR project is a collaborative effort
     • DiGIR is currently established as an open source project on SourceForge (http://sourceforge.net)
     • Further documentation is available on the SourceForge site
     • Please join us in collaborating!

  9. Protocol Details

  10. Protocol Details
      • Specified in an XML Schema (.xsd)
      • Intended to work in conjunction with federation schemas, also expressed as XML Schemas
      • Actual request and response documents are instance documents conforming to both the protocol schema and a federation schema

  11. <request xmlns="http://www.namespaceTBD.org/digir"
               xmlns:darwin="http://www.namespaceTBD.org/darwin"
               xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
               xsi:schemaLocation="http://www.namespaceTBD.org/digir digir.xsd
                                   http://www.namespaceTBD.org/darwin darwin.xsd">
        <header>
          <requestType>search</requestType>
        </header>
        <search>
          <dbName>myDiggableBipesDB</dbName>
          <filter>
            <and>
              <in>
                <list xsi:type="darwin:list">
                  <darwin:Month>11</darwin:Month>
                  <darwin:Month>12</darwin:Month>
                </list>
              </in>
              <equals>
                <darwin:Genus>Bipes</darwin:Genus>
              </equals>
            </and>
          </filter>
          <records start="0" count="50" />
        </search>
      </request>

  12. Request Explanation
      • Composed of elements from the protocol namespace (default) and the schema namespace
      • <header> contains information about the payload
      • <search> contains dbName, filter, and record specification (will also specify result format)
      • <filter> is effectively an XML representation of a SQL WHERE clause
      • This search request is for the first 50 specimen records of genus Bipes that were found in November or December
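To make the "XML representation of a SQL WHERE clause" idea concrete, here is a minimal Python sketch that walks a filter like the one on slide 11 and produces a parameterized WHERE fragment. It is an illustration only, not the DiGIR reference implementation: the function names, the operator-to-SQL table, and the assumption that each concept maps to a column of the same name are all hypothetical, and `<andNot>`/`<orNot>` are left out because the slides do not define their semantics.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from protocol operators to SQL; not taken from the DiGIR schema.
LOGICAL = {"and": "AND", "or": "OR"}
COMPARISON = {"equals": "=", "notEquals": "<>", "lessThan": "<",
              "lessThanOrEquals": "<=", "greaterThan": ">",
              "greaterThanOrEquals": ">=", "like": "LIKE"}


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def to_where(node, params):
    """Recursively turn one filter operator element into a SQL fragment.

    Values are appended to `params` and replaced by '?' placeholders.
    Assumption: the column name equals the concept's local name; a real
    provider would look the column up in its own configuration.
    """
    op = local(node.tag)
    if op in LOGICAL:
        parts = [to_where(child, params) for child in node]
        return "(" + f" {LOGICAL[op]} ".join(parts) + ")"
    if op in COMPARISON:
        concept = node[0]
        params.append(concept.text)
        return f"{local(concept.tag)} {COMPARISON[op]} ?"
    if op == "in":
        values = list(node[0])  # the children of <list>
        params.extend(v.text for v in values)
        marks = ", ".join("?" for _ in values)
        return f"{local(values[0].tag)} IN ({marks})"
    raise ValueError(f"unsupported operator: {op}")


FILTER = """
<filter xmlns="http://www.namespaceTBD.org/digir"
        xmlns:darwin="http://www.namespaceTBD.org/darwin">
  <and>
    <in><list>
      <darwin:Month>11</darwin:Month>
      <darwin:Month>12</darwin:Month>
    </list></in>
    <equals><darwin:Genus>Bipes</darwin:Genus></equals>
  </and>
</filter>
"""

params = []
sql = to_where(ET.fromstring(FILTER)[0], params)
print(sql)     # (Month IN (?, ?) AND Genus = ?)
print(params)  # ['11', '12', 'Bipes']
```

Keeping the values in a separate parameter list means only the operator and column names end up in the SQL text, which is the part a provider would control through its own mapping.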

  13. Filter Building
      LOPs (logical operators), which can be nested:
      • <and>
      • <or>
      • <andNot>
      • <orNot>
      COPs (comparison operators):
      • <equals>
      • <lessThan>
      • <lessThanOrEquals>
      • <notEquals>
      • <greaterThan>
      • <greaterThanOrEquals>
      • <like>
      • <in> (multi-value)
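As a rough illustration of how logical operators nest around comparison operators, the following Python sketch builds a filter tree with the standard library. The helper names `lop`, `cop`, and `in_list` are hypothetical, the namespaces are the placeholder URIs from the slides, and the `xsi:type` attribute on `<list>` is omitted for brevity.

```python
import xml.etree.ElementTree as ET

DIGIR = "http://www.namespaceTBD.org/digir"    # placeholder namespace from the slides
DARWIN = "http://www.namespaceTBD.org/darwin"  # placeholder namespace from the slides


def lop(name, *conditions):
    """Build a logical operator element (<and>, <or>, ...) around nested conditions."""
    node = ET.Element(f"{{{DIGIR}}}{name}")
    node.extend(conditions)
    return node


def cop(name, concept, value):
    """Build a single-value comparison such as <equals><darwin:Genus>...</darwin:Genus></equals>."""
    node = ET.Element(f"{{{DIGIR}}}{name}")
    ET.SubElement(node, f"{{{DARWIN}}}{concept}").text = str(value)
    return node


def in_list(concept, values):
    """Build the multi-value <in> comparison with a nested <list>."""
    node = ET.Element(f"{{{DIGIR}}}in")
    lst = ET.SubElement(node, f"{{{DIGIR}}}list")
    for value in values:
        ET.SubElement(lst, f"{{{DARWIN}}}{concept}").text = str(value)
    return node


# Genus is Bipes, OR (Species is like 'b%' AND Month is 11 or 12)
tree = lop("or",
           cop("equals", "Genus", "Bipes"),
           lop("and",
               cop("like", "Species", "b%"),
               in_list("Month", [11, 12])))

print(ET.tostring(tree, encoding="unicode"))
```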

  14. What “binds” the schemas?
      • The protocol schema defines various abstract types and elements:
        <xsd:element name="searchCondition" abstract="true" />
        <xsd:element name="alphaSearchCondition" abstract="true"
                     substitutionGroup="searchCondition" />
        <xsd:complexType name="listType" abstract="true" />
        <xsd:complexType name="numericListType" abstract="true" />
      • A federation schema must define searchable concepts, or groups of them, as substitutable for these abstract elements or as extensions of the abstract types:
        <xsd:element name="Species" type="xsd:string"
                     substitutionGroup="digir:alphaSearchCondition" />

  15. <xsd:complexType name="list">
        <xsd:complexContent>
          <xsd:extension base="digir:listType">
            <xsd:sequence>
              <xsd:choice>
                <xsd:element ref="ScientificName" maxOccurs="unbounded" />
                <xsd:element ref="Kingdom" maxOccurs="unbounded" />
                <xsd:element ref="Phylum" maxOccurs="unbounded" />
                <xsd:element ref="Class" maxOccurs="unbounded" />
                <xsd:element ref="Order" maxOccurs="unbounded" />
                <xsd:element ref="Family" maxOccurs="unbounded" />
                <xsd:element ref="Genus" maxOccurs="unbounded" />
                <xsd:element ref="Species" maxOccurs="unbounded" />
                <…>
              </xsd:choice>
            </xsd:sequence>
          </xsd:extension>
        </xsd:complexContent>
      </xsd:complexType>

  16. Why “bind” like this?
      • To provide data-typing (string, numeric, etc.) for various concepts within operators at an abstract level (e.g. LIKE is only valid for string data; IN allows for multiples, but in a controlled fashion)
      • To allow federation schemas to simply classify data as types without having to redefine or extend operators
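In DiGIR this typing is enforced declaratively, by XML Schema validation against the abstract elements and types. Purely to illustrate the rule being enforced, the Python sketch below performs the same kind of check in code. The concept classifications are assumptions (only `Species` is shown as an `alphaSearchCondition` in the slides), and the function and table names are hypothetical.

```python
# Assumed classification of a few concepts. In DiGIR this information would live
# in the federation schema's substitution groups, not in a Python dict.
CONCEPT_KIND = {"Genus": "alpha", "Species": "alpha", "Month": "numeric"}

# Operators restricted to one kind of data. Any operator not listed here is
# assumed, for this sketch, to accept every kind.
ALLOWED_KINDS = {"like": {"alpha"}}
ALL_KINDS = {"alpha", "numeric"}


def operator_accepts(operator, concept):
    """Return True if `operator` may be applied to `concept` under the table above."""
    kind = CONCEPT_KIND[concept]
    return kind in ALLOWED_KINDS.get(operator, ALL_KINDS)


print(operator_accepts("like", "Genus"))    # True
print(operator_accepts("like", "Month"))    # False
print(operator_accepts("equals", "Month"))  # True
```

The point of the binding is that a federation schema only has to supply the equivalent of `CONCEPT_KIND`; the operator rules stay in the protocol schema.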

  17. Request Issues
      • Do we need another abstract element such as dateSearchCondition?
      • What information will be useful in the header?
      • How should we specify the format of the results? What standard formats should be offered (e.g. brief, full)?
      • Will tblName be part of the meta-data required of providers?
      • What concepts of Darwin Core 2 are searchable?

  18. Response Prototype
      <response xmlns="http://www.namespaceTBD.org/digir"
                xmlns:darwin="http://www.namespaceTBD.org/darwin"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                xsi:schemaLocation="http://www.namespaceTBD.org/digir digir.xsd
                                    http://www.namespaceTBD.org/darwin darwin.xsd">
        <header>
          <!-- contents TBD -->
        </header>
        <content>
          <record>
          </record>
        </content>
        <diagnostics>
        </diagnostics>
      </response>

  19. Response Issues
      • How do we format and validate the response content?
      • What elements are needed for the <header>, if any?
      • Do we always have diagnostics, or only if there is an error?
      • Should a finite set of diagnostics be created and maintained in its own XML Schema? Will there ever be a diagnostic that is specific to a federation schema?

  20. Provider Details

  21. Provider Details
      • Implemented as a web application that answers questions
      • Interface is not specific to a particular information domain
      • No state information is recorded
      • Each request is treated as unique and uninfluenced by previous requests
      • Must always generate a valid response
      • Consists of four key components:
        • Request handler
        • Filter handler
        • Result set cache
        • Response generator

  22. Request Handler
      • Receives XML document
      • Validates document
      • Generates internal structures for further processing
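The three steps above can be sketched in Python roughly as follows. This is an assumption-laden illustration, not the provider's actual code: `SearchRequest` and `parse_request` are hypothetical names, the request is a trimmed version of the slide 11 example, and "validation" here only covers well-formedness and a few required elements, because the standard library cannot validate against an XML Schema.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET

NS = {"d": "http://www.namespaceTBD.org/digir"}  # placeholder namespace from the slides


@dataclass
class SearchRequest:
    """Hypothetical internal structure handed on to the filter handler."""
    request_type: str
    db_name: str
    start: int
    count: int
    filter: ET.Element


def parse_request(xml_text):
    """Receive the document, run minimal checks, and build the internal structure."""
    root = ET.fromstring(xml_text)  # raises ET.ParseError if not well-formed
    request_type = root.findtext("d:header/d:requestType", namespaces=NS)
    search = root.find("d:search", NS)
    if request_type != "search" or search is None:
        raise ValueError("only <search> requests are handled in this sketch")
    records = search.find("d:records", NS)
    if records is None:
        raise ValueError("missing <records> element")
    return SearchRequest(
        request_type=request_type,
        db_name=search.findtext("d:dbName", namespaces=NS),
        start=int(records.get("start")),
        count=int(records.get("count")),
        filter=search.find("d:filter", NS),
    )


REQUEST = """
<request xmlns="http://www.namespaceTBD.org/digir"
         xmlns:darwin="http://www.namespaceTBD.org/darwin">
  <header><requestType>search</requestType></header>
  <search>
    <dbName>myDiggableBipesDB</dbName>
    <filter>
      <equals><darwin:Genus>Bipes</darwin:Genus></equals>
    </filter>
    <records start="0" count="50" />
  </search>
</request>
"""

req = parse_request(REQUEST)
print(req.db_name, req.start, req.count)  # myDiggableBipesDB 0 50
```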

  23. Filter Handler
      • Internal structural representation of filter (query) structure
      • Responsible for generating a native query string for querying the database
      • Communicates with UDDI to obtain standard database definition
      • Custom configured to work with specific database implementation

  24. Result Set Cache
      • Contains the results of applying a query
      • Responsible for generating the response records in the requested format
      • Fairly closely integrated with the response generator

  25. Response Generator
      • Generates the response XML document
      • Serializes the response header information
      • Serializes diagnostic information
      • Serializes the requested subset of records
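A response generator along these lines might look like the Python sketch below, which follows the response prototype on slide 18. Several details are assumptions, since the slides leave them open: the header is left empty, each record is serialized as one child element per concept in the `darwin` namespace, and the `<diagnostic>` child element and the `build_response` function are hypothetical.

```python
import xml.etree.ElementTree as ET

DIGIR = "http://www.namespaceTBD.org/digir"    # placeholder namespace from the slides
DARWIN = "http://www.namespaceTBD.org/darwin"  # placeholder namespace from the slides


def build_response(rows, start, count, messages=()):
    """Serialize header, the requested subset of records, and diagnostics."""
    response = ET.Element(f"{{{DIGIR}}}response")
    ET.SubElement(response, f"{{{DIGIR}}}header")  # contents TBD in the slides

    content = ET.SubElement(response, f"{{{DIGIR}}}content")
    for row in rows[start:start + count]:  # honour <records start=... count=...>
        record = ET.SubElement(content, f"{{{DIGIR}}}record")
        for concept, value in row.items():
            # Assumed record layout: one element per concept.
            ET.SubElement(record, f"{{{DARWIN}}}{concept}").text = str(value)

    diagnostics = ET.SubElement(response, f"{{{DIGIR}}}diagnostics")
    for message in messages:
        # <diagnostic> is a hypothetical element name.
        ET.SubElement(diagnostics, f"{{{DIGIR}}}diagnostic").text = message
    return response


rows = [
    {"Genus": "Bipes", "Month": 11},
    {"Genus": "Bipes", "Month": 12},
    {"Genus": "Bipes", "Month": 12},
]
response = build_response(rows, start=1, count=1, messages=["3 records matched"])
print(ET.tostring(response, encoding="unicode"))
```

Because the function always emits `<header>`, `<content>`, and `<diagnostics>`, it returns a structurally complete response even when no rows match, in line with the "must always generate a valid response" requirement.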

  26. Provider Configuration

  27. Portal Details

  28. Portal Details
      • Divided into two distinct components: a presentation layer and PortalServices
      • The presentation layer supports the UI and translates requests (HTTP requests from forms or links) into protocol-compliant XML requests
      • The presentation layer also handles all display issues involving the responses, such as format, sorting, collating, etc.
      • The presentation layer is envisioned to be an application server/web server implementation

  29. Portal Details
      • PortalServices handles all external network activity (UDDI calls, provider calls, etc.)
      • PortalServices limits provider calls to those necessary based on provider meta-data
      • PortalServices threads provider calls for increased performance (i.e. response time)
      • PortalServices is envisioned to be a webapp and supporting classes running within an application server, such as Tomcat
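The two PortalServices behaviours described here, filtering providers by meta-data and threading the remaining calls, can be sketched in Python as below. The slides describe PortalServices as a Java webapp, so this is only an illustration of the idea. The meta-data rule (a provider must list every concept used in the filter), the dictionary keys, the example URLs, and the injected `send` function are all assumptions; a real portal would POST the request over HTTP.

```python
from concurrent.futures import ThreadPoolExecutor


def should_query(provider, wanted_concepts):
    """Assumed rule: query a provider only if it lists every concept in the filter."""
    return set(wanted_concepts) <= set(provider["concepts"])


def query_providers(providers, request_xml, wanted_concepts, send):
    """Send the request to all relevant providers in parallel.

    `send(url, request_xml)` is supplied by the caller and returns the
    provider's response; results are keyed by provider name.
    """
    selected = [p for p in providers if should_query(p, wanted_concepts)]
    if not selected:
        return {}
    with ThreadPoolExecutor(max_workers=len(selected)) as pool:
        # pool.map preserves the order of `selected`.
        responses = list(pool.map(lambda p: send(p["url"], request_xml), selected))
    return {p["name"]: r for p, r in zip(selected, responses)}


# Hypothetical providers and a stand-in for the network call.
providers = [
    {"name": "A", "url": "http://provider-a.example/digir", "concepts": {"Genus", "Month"}},
    {"name": "B", "url": "http://provider-b.example/digir", "concepts": {"Genus"}},
    {"name": "C", "url": "http://provider-c.example/digir", "concepts": {"Genus", "Month", "Species"}},
]


def fake_send(url, request_xml):
    return f"<response from='{url}'/>"


results = query_providers(providers, "<request/>", {"Genus", "Month"}, fake_send)
print(sorted(results))  # ['A', 'C']
```

With threads, the total wait is roughly that of the slowest provider rather than the sum of all provider response times.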

  30. PortalServices
      • RegistryAccess
      • ProviderCache
      • PortalConfig
      • PortalServlet
      • PortalRequestHandler
      • ProviderFilterer
      • Marshallers

  31. Portal Issues
      • What information will be stored in UDDI about a provider?
      • What information will be known for communicating with a Provider (e.g. IP address, port, etc.)?
      • What meta-data will be provided and what are the rules for using such data for provider filtering?
      • What requirements are there for logging and monitoring?
