This project focuses on integrating content sources into GRACE, a grid search and categorization engine, at CERN. It includes steps for submitting a search query, parsing the results, and testing the integration. The project also explores parallelization techniques and simulation to optimize performance.
Grid Search and Categorization Engine Image by Hector Garcia Puigcerver
GRACE Workflow • CERN’s Tasks • Content Source Integration • Grid Integration • Grid Testing
Content Sources Integration • Content Source • Input: Search Query • Output: Search Results • HTML output • OAI (Open Archives Initiative) compliant output • Personalized configuration file for each Content Source (SPEC file) • Integration Steps • Submit the search • Parse the result • Retrieve associated documents
Step 1: Submit the Search • Goal: Submit Search Query • Input: Query in GRACE format • Go to the content source and find its search fields • Add each field to the SPEC file: <get-param name='p'> <paramval name='/query/Quick-Search'/> </get-param>
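As an illustration of what the SPEC entry above expresses, here is a minimal Python sketch that turns such a `get-param` element into an HTTP GET query string. It is not GRACE code: the function `build_search_url`, the base URL `http://example.org/search`, and the reading of `paramval` as a path into the GRACE query are assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# SPEC fragment as shown on the slide.
SPEC_FRAGMENT = """
<get-param name='p'>
  <paramval name='/query/Quick-Search'/>
</get-param>
"""

def build_search_url(base_url, spec_fragment, query_values):
    """Map GRACE query fields to one HTTP GET parameter (hypothetical helper).

    query_values maps GRACE query paths (assumed semantics of 'paramval')
    to the text the user searched for.
    """
    get_param = ET.fromstring(spec_fragment)
    http_name = get_param.get("name")  # name of the content source's search field
    parts = [
        query_values[pv.get("name")]
        for pv in get_param.findall("paramval")
        if pv.get("name") in query_values
    ]
    return base_url + "?" + urlencode({http_name: " ".join(parts)})

url = build_search_url(
    "http://example.org/search",  # placeholder content source
    SPEC_FRAGMENT,
    {"/query/Quick-Search": "higgs boson"},
)
print(url)
```

The point of the sketch is the mapping itself: the content source's own field name (`p`) is kept in the SPEC file, so the GRACE query format does not need to change per source.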
Step 2: Parse the Result • Goal: Produce result sets interpretable by GRACE • Input: Search Result in HTML format • Identify Fields: Title, Author, Abstract, … • Produce XPath Expressions, e.g. /root/html/body/div/a • Produce XSL (Extensible Stylesheet Language) transformation code • Produce code for retrieval of associated documents • Output: XML result sets
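The parsing step can be illustrated with a small Python sketch. GRACE itself uses XSL transformations driven by the SPEC file; Python's `xml.etree.ElementTree` stands in here only to show how path expressions pick fields out of a result page. The sample page, the `resultset`/`record` element names, and the `parse_results` function are hypothetical, and the input is assumed to be well-formed XHTML wrapped in a `root` element.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed result page wrapped in <root>, mirroring the
# /root/html/body/div/a path on the slide.
SAMPLE = """<root><html><body>
<div class="hit"><a href="http://example.org/rec/1">Grid testing</a><span class="author">A. Author</span></div>
<div class="hit"><a href="http://example.org/rec/2">Text normalization</a><span class="author">B. Writer</span></div>
</body></html></root>"""

def parse_results(xhtml):
    """Extract title, author and document URL from each hit into an XML result set."""
    root = ET.fromstring(xhtml)
    resultset = ET.Element("resultset")  # element names are illustrative only
    # ElementTree paths are relative to the element searched, so the leading
    # /root of the slide's expression is implied by starting at the root.
    for hit in root.findall("html/body/div"):
        link = hit.find("a")
        record = ET.SubElement(resultset, "record")
        ET.SubElement(record, "title").text = link.text
        ET.SubElement(record, "author").text = hit.findtext("span[@class='author']")
        # The link target is what a later step would fetch as the associated document.
        ET.SubElement(record, "url").text = link.get("href")
    return resultset

print(ET.tostring(parse_results(SAMPLE), encoding="unicode"))
```

Real result pages are rarely well-formed XML, so an actual integration needs an HTML clean-up step before path expressions like these can be applied.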
Step 3: Test your file • Test application (seaLion): part of GRACE application • Submits search using a given SPEC file • Returns GRACE result set • Provides debug output • CSTest script • Uses seaLion • Validates results • Batch testing
Results • 16 Content Sources integrated • Input for Deliverable 6.1 • Workflow of Integration • Status, common problems and risks • HowTo: Configuration of Content Sources for Integration with GRACE • Usable by content providers who want to integrate their content source into GRACE • TestKit • Test application & scripts • HowTo & TestKit available on GRACE website
Grid Integration • Two Grid components: • Text Normalizing • Categorizing • Components provided by partners, CERN responsible for integration
First approach • “One for all” (model M1) • Parallel execution of simultaneous searches • O(hours) for complete process
Parallelized Model • Split text normalization
Parallelized Model • Split outside the Grid • Launch N jobs • Perform text normalization • Store results in the Grid (using Replica Manager) • Monitor Status • Launch Categorization job • Pick up documents from the Grid and merge them • Perform Categorization • Get result from Categorization job
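The control flow of the parallelized model can be sketched locally in Python. This is only an analogy: a thread pool stands in for Grid job submission, an in-memory list stands in for storage via the Replica Manager, and `normalize_job` and `categorize_job` are trivial placeholders for the partners' components. All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def split(documents, n_jobs):
    """Split the corpus outside the 'Grid' into at most n_jobs chunks."""
    n_jobs = max(1, min(n_jobs, len(documents)))
    return [documents[i::n_jobs] for i in range(n_jobs)]

def normalize_job(chunk):
    """Placeholder for one text normalization job: lowercase, collapse whitespace."""
    return [" ".join(doc.lower().split()) for doc in chunk]

def categorize_job(normalized_docs):
    """Placeholder for the categorization job: group documents by first word."""
    categories = {}
    for doc in normalized_docs:
        key = doc.split()[0] if doc else ""
        categories.setdefault(key, []).append(doc)
    return categories

def run_parallelized(documents, n_jobs):
    chunks = split(documents, n_jobs)
    # Launch N normalization jobs; leaving the 'with' block waits for all of them,
    # which plays the role of status monitoring.
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        stored_results = list(pool.map(normalize_job, chunks))
    # The single categorization job picks up all stored results and merges them.
    merged = [doc for part in stored_results for doc in part]
    return categorize_job(merged)

result = run_parallelized(["Grid  Jobs", "grid tests", "Text  Norm"], n_jobs=2)
print(result)
```

Only normalization is split; categorization runs once over the merged output, which is why it remains a single job at the end of the pipeline.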
Simulation • Simulate parallelized model including • Submission time • Grid overhead • Application overhead • Application performance • Interesting values • User (UI) Waiting Time • Spent Computing Time
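A toy version of such a simulation, in Python, might look as follows. The cost model is an assumption made for this sketch, not the one used in the project: jobs are submitted one after another from the UI, every job pays a fixed Grid overhead and a fixed application start-up overhead, and the total work is divided evenly among the jobs. All parameter names and values are hypothetical.

```python
def simulate(k, total_work, submit_time, grid_overhead, app_overhead):
    """Return (user waiting time, spent computing time) for a split into k jobs.

    Assumed model (illustrative only):
      - the UI submits the k jobs sequentially, submit_time seconds each;
      - the last job submitted finishes last, after the Grid overhead,
        the application overhead and its share of the work;
      - computing time counts every job's start-up plus the work itself.
    """
    waiting = k * submit_time + grid_overhead + app_overhead + total_work / k
    computing = k * app_overhead + total_work
    return waiting, computing

# Hypothetical parameters, in seconds.
for k in (1, 2, 4, 8, 16, 32):
    waiting, computing = simulate(k, total_work=3600, submit_time=30,
                                  grid_overhead=120, app_overhead=60)
    print(f"k={k:2d}  waiting={waiting:7.1f}s  computing={computing:6.0f}s")
```

Even in this simplified form the two interesting values pull in opposite directions: waiting time first falls and then rises as the split grows, while spent computing time only increases.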
Conclusions from Simulation • Derived rules for splitting parameters • Minimize user waiting time (optimal split K_opt) • Save “unnecessary” resources by splitting less than the optimal value and letting the user wait 20% longer, which is unnoticeable (effective split K_eff) • Calculated formulas for splitting parameters • Implemented in Java class for GRACE application
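The slides do not give the formulas, so the following Python sketch only shows how the two splitting parameters could be computed under an assumed cost model: waiting time T(k) = k·s + c + W/k, with s the per-job submission time, c a fixed overhead, and W the total work. Under that assumption T is minimized near sqrt(W/s), and the effective split is the smallest k whose waiting time stays within 20% of the minimum. The function names and the sample numbers are hypothetical, and Python is used here although the project's implementation was a Java class.

```python
import math

def waiting_time(k, total_work, submit_time, fixed_overhead):
    """Assumed model: sequential submission + fixed overhead + evenly shared work."""
    return k * submit_time + fixed_overhead + total_work / k

def k_opt(total_work, submit_time):
    """Integer split that minimizes waiting time under the assumed model."""
    k = math.sqrt(total_work / submit_time)  # continuous minimizer of k*s + W/k
    candidates = {max(1, math.floor(k)), max(1, math.ceil(k))}
    return min(candidates, key=lambda c: (c * submit_time + total_work / c, c))

def k_eff(total_work, submit_time, fixed_overhead, tolerance=0.20):
    """Smallest split whose waiting time is at most (1 + tolerance) times the optimum."""
    best = k_opt(total_work, submit_time)
    limit = (1 + tolerance) * waiting_time(best, total_work, submit_time, fixed_overhead)
    for k in range(1, best + 1):
        if waiting_time(k, total_work, submit_time, fixed_overhead) <= limit:
            return k
    return best

# Hypothetical values in seconds.
print(k_opt(3600, 30), k_eff(3600, 30, 180))
```

With these sample numbers the effective split uses roughly half as many jobs as the optimal one, which is the resource saving the rule is after.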
Results • JDLs for Grid Jobs created for both models • GRACE can run on the Grid • Description of Grid Jobs • Input for Deliverable 6.1 • Parallelized Job Model • Used in Grid Tests
Grid Tests • Test plan for both models and comparison • Creation of input corpus • Creation of test scripts for semi-automatic testing • Creation of scripts for validation of output and parsing of logging • General tests started 20.10.04 • Main test period from 05.11 to 25.11.04 • Tests performed in GILDA testbed • Submitted more than 1000 jobs • Made about 1 million Java API calls
Results • Input for Deliverable 7.2 • Validation of the suitability of GRACE for the Grid • Performance testing of the Application • Validation of the parallelized model • Validation of simulated results • Intensive use of GILDA • Feedback to GILDA • Feedback to EGEE • New requirements list
gContainer • SSL Web service container following WSRF standard • Based upon WSRF::Lite • Service discovery • Load management • Factory service • Can start and manage arbitrary services • Hosted services • Grid Access Service • API Service for Communication with ROOT
Grid Access Service (GAS) • The Grid Access Service represents the user entry point to a set of core services • Composed of different modules (diagram labels: File Catalogue, Metadata, client, GAS, WMS)
Trips • CERN School of Computing, Vico Equense • Grid Computing • Physics Computing • Software Techniques • GRACE General Meeting, Brussels • Project Meeting • Workshop at Global Grid Forum • EGEE JRA1 Design Team Meeting, Padova • Presenting the Grid Access Service
Thanks… … for your attention … this very nice time at CERN!