GIR-WG @ OGF19. Grid Information Retrieval Working Group January 30, 2007 Chapel Hill, NC. Agenda. IP Policy reminder Introduce participants GIR-WG charter & overview GIR document status review Reference implementations Mention of related work elsewhere Paul Kim presentation
GIR-WG @ OGF19
Grid Information Retrieval Working Group
January 30, 2007
Chapel Hill, NC
Agenda
• IP Policy reminder
• Introduce participants
• GIR-WG charter & overview
• GIR document status review
• Reference implementations
• Mention of related work elsewhere
• Paul Kim presentation
• Chris Fallen presentation
• Discussion
Session Particulars
• OGF IP policies apply
• GIR-WG chairs:
  • Dr. Greg Newby, Arctic Region Supercomputing Center
  • Dr. Paul Yangwoo Kim, Dongguk U.
  • Nassib Nassar, RENCI
What is GIR-WG?
• GIR-WG was chartered by OGF to develop standards and reference implementations for information retrieval (IR) on computational grids.
• GIR-WG has published a Requirements document under GGF (GFD-I.027)
• Our first Experimental document was published recently (GFD-E.082)
• Progress on the Architecture document is dormant, awaiting practical experience
• Practical experience is being gained, and will result in at least further experimental documents.
What is Information Retrieval?
• IR is the science and method of delivering documents that are relevant to human information needs.
• Rather than delivering unordered sets of matching documents (as a DBMS does), IR systems rank matching documents.
• IR systems usually focus on textual input data (i.e., natural language), either unformatted or formatted (plain text, HTML, XML, etc.)
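To make the contrast between set matching and ranking concrete, here is a minimal sketch. It assumes a toy term-count score and a three-document collection invented for illustration; the names `DOCS`, `boolean_match` and `ranked_search` are hypothetical and not part of any GIR document.

```python
from collections import Counter

# Toy collection, invented for this illustration only.
DOCS = {
    "d1": "grid computing for information retrieval",
    "d2": "retrieval of grid data and grid metadata on a grid",
    "d3": "database systems",
}

def boolean_match(query, docs):
    """DBMS-style: return the unordered set of documents containing every query term."""
    terms = query.lower().split()
    return {doc_id for doc_id, text in docs.items()
            if all(t in text.lower().split() for t in terms)}

def ranked_search(query, docs):
    """IR-style: score each matching document and return ids from best to worst."""
    terms = query.lower().split()
    scores = {}
    for doc_id, text in docs.items():
        counts = Counter(text.lower().split())
        score = sum(counts[t] for t in terms)  # assumption: raw term count as relevance
        if score > 0:
            scores[doc_id] = score
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

For the query "grid retrieval", `boolean_match` returns the set of d1 and d2 with no order, while `ranked_search` places d2 ahead of d1 because it mentions "grid" more often. Real IR systems use far richer scoring than raw term counts.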
GIR-WG Charter
• The GIR WG will establish a specific set of requirements, an architecture, and detailed specifications for Information Retrieval (IR) on computational grids. GIR will provide document collection management, indexing/searching, and query processing services to grid users and applications.
• GIR Milestones:
  • GIR Requirements Document - Stakeholder-driven list of service-level requirements for building a grid-based IR system. Published in 2005 as GFD-I.027.
  • GIR Architecture Document - Describes the overall system, composed of integrated grid services, scenarios, etc. Draft under consideration since 2004; based on Experimental document outcomes, final version is expected in 2007.
  • Experimental Documents - Experiences with GIR implementations or partial implementations (query processors, indexers, collection managers...). GFD-E.082 in 2006; others under consideration.
  • GIR Recommendation Draft Document - Describes each service in detail, with sections for different implementation platforms (such as Web Services, Grid Services, standalone...). Draft is expected after the Architecture document, in 2008.
  • GIR Recommendation Final Document - After the Draft Recommendation, based on independent interoperable implementations and further practical experiences. Within 2 years of the Draft Recommendation.
Why IR is a good candidate for Grid computing
• Excellent for “divide and conquer” coarse-grained parallelism
• Input items are discrete
• Coordination across subsets of a document collection can be minimal
• Results from multiple sources can be coordinated and relevance ranked together
• Queries may be handled independently
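The "divide and conquer" pattern above can be sketched as follows. This is only an illustration under stated assumptions: the collection is split into in-memory shards, every shard uses the same toy term-count score (so scores are directly comparable), and threads stand in for grid nodes. `search_shard` and `distributed_search` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq

def search_shard(shard, query_terms):
    """Score the documents of one shard independently; no cross-shard coordination."""
    hits = []
    for doc_id, text in shard.items():
        words = text.lower().split()
        score = sum(words.count(t) for t in query_terms)
        if score:
            hits.append((score, doc_id))
    return hits

def distributed_search(shards, query, k=3):
    """Fan the query out to every shard, then merge the partial results into one ranking."""
    terms = query.lower().split()
    # Threads stand in for grid nodes here; each shard is searched in isolation.
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        partials = list(pool.map(lambda s: search_shard(s, terms), shards))
    all_hits = [hit for part in partials for hit in part]
    # Keep the k best hits: highest score first, ties broken by document id.
    best = heapq.nsmallest(k, all_hits, key=lambda h: (-h[0], h[1]))
    return [doc_id for _, doc_id in best]
```

The merge step is trivial only because every shard scores documents the same way. When the shards are served by different IR engines, the scores need to be normalized before merging.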
Significant Progress
• Documents:
  • “GIR Requirements” published
  • “GIR Architecture” in mid-draft (dormant)
  • Experimental document: published
• Implementation:
  • MCNC released a technology preview
  • Kim’s work: an experimental document
  • Newby’s work: heading to an experimental document
  • Nassar’s work: Sarcomere & Amberfish, open source toolkit based on GT4
  • Fallen & Newby distributed IR research
Requirements overview (per GFD-I.027)
• Desirability of Grid infrastructure for IR, notably enterprise IR:
  • VO (for security, segmentation)
  • Conceptual separation of functions (for indexing, collection management & query processing)
  • Flexible but coarse-grained flow of control among elements
  • Persistence of queries, collections and indexes
• Three primary components:
  • Collection manager: handles input gathering, transformation, transport, staging and delivery
  • Indexer: builds the core information retrieval representation of the collection
  • Query processor: responds to user needs, including standing information needs (i.e., information filtering)
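The separation into three components can be sketched as three small classes. This is a hedged illustration, not the interfaces defined in GFD-I.027: the class and method names are hypothetical, the collection manager only normalizes text, the index is a plain inverted index, and standing queries are matched by simple term containment.

```python
from collections import defaultdict

class CollectionManager:
    """Gathers and stages documents (here: normalizes whitespace and case)."""
    def __init__(self):
        self.staged = {}

    def add(self, doc_id, raw_text):
        self.staged[doc_id] = " ".join(raw_text.lower().split())
        return self.staged[doc_id]

class Indexer:
    """Builds the collection representation (here: an inverted index)."""
    def __init__(self):
        self.postings = defaultdict(set)

    def index(self, doc_id, text):
        for term in text.split():
            self.postings[term].add(doc_id)

class QueryProcessor:
    """Answers ad hoc queries and keeps standing queries for filtering."""
    def __init__(self, indexer):
        self.indexer = indexer
        self.standing = {}

    def search(self, query):
        terms = query.lower().split()
        if not terms:
            return set()
        sets = [self.indexer.postings.get(t, set()) for t in terms]
        return set.intersection(*sets)

    def register_standing(self, name, query):
        self.standing[name] = query

    def filter_new(self, text):
        """Return names of standing queries whose terms all occur in a new document."""
        words = set(text.split())
        return sorted(n for n, q in self.standing.items()
                      if set(q.lower().split()) <= words)
```

Because the three classes share only document ids and staged text, each could in principle run on a different node, which is the coarse-grained flow of control the requirements describe.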
Implementation Approaches
• Do not rely on particular implementations or middleware (e.g., Globus)
• Pursue different types of Grid implementations:
  • Minimalist, home grown
  • Globus-based
  • Pure Web services
• These approaches can each be separate Experimental docs; will be appendices in the Architecture doc
GFD-E.082
• Kim: Grid Information Retrieval System for Dynamically Reconfigurable Virtual Organization
• Practical experience on re-allocation of GIR nodes based on system load
• A node can act as indexer, collection manager or query processor, based on system load
• Dynamic reallocation of nodes within a computational grid
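Load-based role reallocation can be illustrated with a small sketch. The policy below is an assumption made up for illustration and is not the algorithm described in GFD-E.082: load is taken as queued tasks per node, and at most one node moves per call from the least-loaded role to the most-loaded one. `ROLES`, `rebalance` and `threshold` are hypothetical names.

```python
ROLES = ("collection_manager", "indexer", "query_processor")

def rebalance(assignment, pending_work, threshold=2.0):
    """Move one node from the least-loaded role to the most-loaded one.

    assignment: dict of node name -> role.
    pending_work: dict of role -> number of queued tasks.
    Returns a new assignment; the input is left unchanged.
    """
    nodes_by_role = {r: sorted(n for n, role in assignment.items() if role == r)
                     for r in ROLES}

    def load(role):
        count = len(nodes_by_role[role])
        pending = pending_work.get(role, 0)
        if count == 0:
            return float("inf") if pending else 0.0
        return pending / count

    busiest = max(ROLES, key=load)
    # Only roles that keep at least one node after donating may give one up.
    donors = [r for r in ROLES if r != busiest and len(nodes_by_role[r]) > 1]
    if not donors:
        return dict(assignment)
    idlest = min(donors, key=load)
    # Assumed policy: act only when the imbalance exceeds the threshold.
    if load(busiest) < threshold * max(load(idlest), 1.0):
        return dict(assignment)
    new_assignment = dict(assignment)
    new_assignment[nodes_by_role[idlest][0]] = busiest
    return new_assignment
```

With two indexer nodes idle and a long query queue, one indexer node is reassigned to query processing; with balanced queues the assignment is returned unchanged.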
Nassar: Sarcomere
See http://sourceforge.net/projects/sarcomere/
• Sarcomere calls a collection of documents a "database". One or more "indexes" can be created per database. Each index represents an access point for searching the document collection. In theory, indexes can differ in how they constrain the queries (e.g. by fields), what kind of data structures are used, etc. At the moment only Amberfish full text indexes are supported (index type = "Amberfish").
• Current port types (very rudimentary and highly subject to change):
  • createDatabase
  • deleteDatabase
  • createIndex
  • deleteIndex
  • addDocument
  • Search
• Stay tuned for more developments!
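The database/index model described above can be mimicked with a toy in-memory class. This is not Sarcomere's service interface or client API: only the operation names come from the slide (which lists the last one as "Search"), while the class name `ToySearchService`, every parameter, and the inverted-index behaviour are assumptions for illustration.

```python
class ToySearchService:
    """In-memory stand-in that mirrors the operation names listed on the slide.

    All signatures are hypothetical; the real port types may differ.
    """

    def __init__(self):
        # database name -> {"docs": {doc_id: text}, "indexes": {name: index}}
        self.databases = {}

    def createDatabase(self, name):
        self.databases[name] = {"docs": {}, "indexes": {}}

    def deleteDatabase(self, name):
        del self.databases[name]

    def createIndex(self, database, index, index_type="Amberfish"):
        db = self.databases[database]
        db["indexes"][index] = {"type": index_type, "postings": {}}
        # Index documents that were added before the index existed.
        for doc_id, text in db["docs"].items():
            self._post(db["indexes"][index], doc_id, text)

    def deleteIndex(self, database, index):
        del self.databases[database]["indexes"][index]

    def addDocument(self, database, doc_id, text):
        db = self.databases[database]
        db["docs"][doc_id] = text
        for idx in db["indexes"].values():
            self._post(idx, doc_id, text)

    def search(self, database, index, query):
        """Return sorted ids of documents containing every query term."""
        postings = self.databases[database]["indexes"][index]["postings"]
        terms = query.lower().split()
        if not terms:
            return []
        hits = set.intersection(*(postings.get(t, set()) for t in terms))
        return sorted(hits)

    @staticmethod
    def _post(idx, doc_id, text):
        for term in text.lower().split():
            idx["postings"].setdefault(term, set()).add(doc_id)
```

A typical call sequence would be `createDatabase`, `createIndex`, several `addDocument` calls, then `search` against a named index. Keeping indexes as named children of a database reflects the slide's point that one collection can have several access points.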
Newby: Multisearch
• How can we merge result sets from different IR engines?
• Desire to merge based on global relevance
• Challenging because different IR engines have different scoring/ranking algorithms
• Challenging because different collections have different characteristics, influencing ranking
• Used for TREC by Fallen & Newby 2005, 2006
Simple interface to an Axis/Tomcat backend
• Results are merged based on statistical normalization
• No accounting for different IR engines or different collections
• Simplifying assumption: all IR rankings come from the same basic distribution
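One common form of statistical normalization is the z-score, sketched below. The slides do not say which normalization Multisearch uses, so this is an assumed stand-in: each engine's scores are standardized to zero mean and unit variance and then merged, which matches the simplifying assumption that all rankings come from the same basic distribution. `zscore_merge` is a hypothetical name.

```python
from statistics import mean, pstdev

def zscore_merge(result_lists, k=10):
    """Merge ranked lists from several engines by standardizing each list's scores.

    result_lists: dict of engine name -> list of (doc_id, raw_score).
    Returns up to k tuples of (engine, doc_id, z_score), best first.
    """
    merged = []
    for engine, results in result_lists.items():
        scores = [score for _, score in results]
        if not scores:
            continue
        mu = mean(scores)
        sigma = pstdev(scores)
        for doc_id, score in results:
            # An engine whose scores are all equal carries no ranking signal.
            z = (score - mu) / sigma if sigma > 0 else 0.0
            merged.append((z, engine, doc_id))
    # Highest z-score first; ties broken by engine and document id.
    merged.sort(key=lambda t: (-t[0], t[1], t[2]))
    return [(engine, doc_id, round(z, 3)) for z, engine, doc_id in merged[:k]]
```

If engine A reports scores of 10, 6 and 2 and engine B reports 0.9, 0.5 and 0.1, merging on raw scores would rank all of A's results first. After standardization, the top result from each engine receives the same z-score, so both reach the head of the merged list.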
Opportunities for Interaction
• OGSA-DAI has middleware that provides basic query and result set transport
• Search from multiple databases; add a higher-level merger
• Seems promising for GIR!
• http://www.ogsadai.org.uk
Discussion of GIR-WG
• Your questions, thoughts and suggestions
Get Involved!
• Visit http://www.gir-wg.org
• Subscribe to gir-wg@ogf.org
• Talk with chairs about data and reference implementations and documents