590 likes | 606 Views
This article discusses how Oracle Text saves time by providing an efficient search tool for managing engineering data at CERN, the European Organization for Nuclear Research. It explores how CERN benefits from this technology and highlights the importance of managing the complete life-cycle of documents and components in the Large Hadron Collider project. The article also emphasizes the need for indexing metadata and files, supporting multiple languages, and ensuring high performance and simplicity in the search process.
E N D
Oracle Text Search saves your time Anna Suwalska European Organization for Nuclear Research - Geneva OracleWorld Paris 2003
Content CERN Engineering Data Management System at CERN Oracle Text How we profit from this technology Conclusion
Content CERN
CERN - European Organization for Nuclear Research • The world’s largest particle physics research laboratory • Founded in 1954, CERN has today 20 member states • 2400 staff • Over 6500 scientists come here to use research facilities • 500 universities, over 80 nationalities • CERN explores what matter is made of, and what forces hold it together • WWW was born here
Content EDMS Engineering Data Management System
EDMS - Engineering Data Management System EDMS Portal EDMS Common layer Axalant MP5 Other DB’s Design Data Documents and Drawings Asset tracking Work management
Structures EDMS - Engineering Data Management System Managing
EDMS - Engineering Data Management System Managing Structures Complete life-cycle for a single/compound documents.
EDMS - Engineering Data Management System Managing Structures Complete life-cycle for a single/compound documents. Document versioning
EDMS - Engineering Data Management System Managing Structures Complete life-cycle for a single/compound documents. Document versioning Document approval processes (comments collector)
EDMS - Engineering Data Management System Managing Structures Complete life-cycle for a single/compound documents. Document versioning Document approval processes (comments collector) Assemblies
EDMS - Engineering Data Management System Managing Structures Complete life-cycle for a single/compound documents. Document versioning Document approval processes (comments collector) Assemblies Equipment workflow, data
EDMS - Engineering Data Management System Managing Structures Complete life-cycle for a single/compound documents. Document versioning Document approval processes (comments collector) Assemblies Equipment workflow, data Installation (jobs, locations, etc..)
Provide an efficient search tool to support with requirements above - our choice Oracle Text EDMS mandate Manage a full description of the LHC project’s engineering data over it’s lifetime (>25 years) Design Installation Operation Dismantling A full description of the machine and its components through their lifecycle must be constantly available for all concerned parties Support and coordinate engineering work / information / data workflow Help tracing solutions to all problems occurring in the machine Establish a knowledge transfer: evolving staff, many short time visitors Operation
Oracle Text – our choice Our needs
Oracle Text – our choice Our needs Index metadata & files : First line search is done on meta data, however the possibility to index files is essential
Oracle Text – our choice Our needs Index metadata & files : First line search is done on meta data, however the possibility to index files is essential Bi-lingual : Official CERN languages are English and French. We have to support both
Oracle Text – our choice Our needs Index metadata & files : First line search is done on meta data, however the possibility to index files is essential Bi-lingual : Official CERN languages are English and French. We have to support both Performance: Response time is very important
Oracle Text – our choice Our needs Index metadata & files : First line search is done on meta data, however the possibility to index files is essential Bi-lingual : Official CERN languages are English and French. We have to support both Performance: Response time is very important Simplicity: Simple for users Simple to develop Simple to maintain
Oracle Text – our choice Our needs Index metadata & files : First line search is done on meta data, however the possibility to index files is essential Oracle Text supports most of the document formats Bi-lingual : Official CERN languages are English and French. We have to support both Oracle text supports 39 languages Performance: Response time is very important Very efficient for searches within big collection of data Simplicity: Simple for users Simple to develop Simple to maintain Results with scoring methodology to help navigate through a result Standard SQL statements Easy to maintain with ALTER INDEX or CTX_DDL packages
Oracle Text – our choice Our needs Index metadata & files : First line search is done on meta data, however the possibility to index files is essential Oracle Text supports most of the document formats Bi-lingual : Official CERN languages are English and French. We have to support both Oracle text supports 39 languages Performance: Response time is very important Very efficient for searches within big collection of data Simplicity: Simple for users Simple to develop Simple to maintain Results with scoring methodology to help navigate through a result Standard SQL statements Easy to maintain with ALTER INDEX or CTX_DDL packages Oracle text comes as an option in RDBMS - no additional costs
Content Oracle Text
Oracle Text Oracle text Uses standard SQL Enables the building of a Text Query Application and a Document Classification Application Takes care of: indexing searching: word and theme viewing text
CONTEXT Index Creation CREATE INDEX index_name ON table_name(column_name) INDEXTYPE IS CTXSYS.CONTEXT PARAMETERS(‘parameters string’); [datastore datastore_pref] [filter filter_pref] [charset column charset_column_name] [format column format_column_name] [lexer lexer_pref] [language column language_column_name] [wordlist wordlist_pref] [storage storage_pref] [stoplist stoplist] [section group section_group] [memory memsize] [populate | nopopulate]
Types of indexes Index Query Operator Characteristics Used for CONTEXT CONTAINS CLOB, BLOB, BFILE, CHAR, VARCHAR2, XML On text column Most complete of all 3 types. Large coherent documents CTXCAT CATSEARCH CHAR, VARCHAR2 Combined index on a text column and one or more other columns. Transactional – no need for synchronizing when DML. Creating can be longer because of the sub-indexes. Supports: INDEX SET, LEXER*, STOPLIST, STORAGE, WORDLIST* Has it’s own query language. For indexing small text fragments and related information. To improve mixed query performance CTXRULE MATCHES VARCHAR2, CLOB On column containing a set of queries. Supports: LEXER (only BASIC) Does not support number of operators. Building a document classification application
Index Maintenance & Optimization ALTER INDEX index_name REBUILD [ONLINE][PARAMETERS(parameters string)]; ALTER INDEX cdi_text_ctx REBUILD ONLINE PARAMETERS(‘optimize fast’); ALTER INDEX cdi_text_ctx REBUILD ONLINE PARAMETERS(‘optimize full maxtime10’); ALTER INDEX cdi_text_ctx REBUILD ONLINE PARAMETERS(‘optimize full’);
Index Maintenance & Optimization ALTER INDEX index_name REBUILD [ONLINE][PARAMETERS(parameters string)]; CTX_DDL package CTX_DDL.OPTIMIZE_INDEX CTX_DDL.SYNC_INDEX
DML processing INSERT A new row inserted in DR$PENDING queue, not available for query before synchronization UPDATE Existing ROWID is placed in DR$PENDING, neither new nor old content is available for query before synchronization DELETE The row is immediately unavailable for query(marked as invalid), but only removed when optimization complete CTX_USER_PENDING (CTX_PENDING) view To check records waiting for synchronization
Scoring “To calculate a relevance score for a returned document in a word query, Oracle uses an inverse frequency algorithm based on Salton's formula. Inverse frequency scoring assumes that frequently occurring terms in a document set are noise terms, and so these terms are scored lower. For a document to score high, the query term must occur frequently in the document but infrequently in the document set as a whole.” Oracle Text Reference, Release 9.0.1 Example In data set: M number of occurrences of TERM1, N number of occurrences of TERM2 M >> N Document having equal (n-occurrences) of TERM1 and TERM2 Result SCORE for querying TERM1 < SCORE for querying TERM2
Query Operators Boolean OR NOT MINUS AND lhc AND magnet AND NOT cryogenic Linguistics SYNonym ABOUT STEM Translation Term Broader, Narrower, Preferred, Related Term SYN (science) ABOUT (particle) Others FUZZY NEAR SOUNDEX WITHIN SQE begin ctx_query.store_sqe ( ‘particle‘ , ’atom, molecule proton’ ); end; ‘SQE (particle)’
CTX packages CTX_ADMIN Administer servers and the data dictionary (only ctxsys user) Create and manage the preferences, section groups, stoplists, manage indexes CTX_DDL CTX_DOC Document presentation features (only for CONTAINS indexes) CTX_OUTPUT Manage logs for the indexes Generating query feedback, counting hits, and creating SQE (stored query expressions) CTX_QUERY CTX_THES Manage and browse thesaurus
Content How we profit from this technology
EDMS search for both languages accelerateur lhc méthode Version 1.5
Escaping characters to query them To be able to query on reserved words or symbols such as “minus”, “-” , “near” they must be escaped. There are 2 methods to escape the character, using “{}” or “\” . When using: We had to hardcode it for each symbol and word. A standard “dictionary table” with the reserved characters would be useful.
It is important to know how users will search the data and what kind of data you are going to index before you actually do it. EDMS metadata index preferences
Test documents 74 000 Drawings 266 500 New documents (monthly) 4 000 Document updates (monthly) 4 000 Files (CSV, DOC, DOT, HTM, HTML, MPP, PDF, PPT, PS, RES, TXT) / total 88 800 / 148 000 Files (GB) / total 45 / 78 EDMS Index Maintenance & Optimization Environment Meta data Files Hardware & system: Two node cluster based on two Sun SPARC 450, running Solaris 2.6 + Sun Cluster 2.1 RDBMS: 8.1.7.4 ~500 MB SGA size 60-80 concurrent users (during working hours)
Test documents 74 000 Drawings 266 500 New documents (monthly) 4 000 Document updates (monthly) 4 000 Files (CSV, DOC, DOT, HTM, HTML, MPP, PDF, PPT, PS, RES, TXT) / total 88 800 / 148 000 Files (GB) / total 45 / 78 EDMS Index Maintenance & Optimization Environment Meta data Files Index synchronization: every 10 min, takes a few seconds PROCEDURE rebuild_metedata_ctx IS BEGIN EXECUTE IMMEDIATE ('alter index CDI_TEXT_CTX rebuild online parameters(' ' sync ' ')'); END; Index optimization: every weekend, takes ~30 min PROCEDURE optimize_metedata_ctx IS BEGIN EXECUTE IMMEDIATE ('alter index CDI_TEXT_CTX rebuild online parameters(' ' optimize full' ')'); END;
Test documents 74 000 Drawings 266 500 New documents (monthly) 4 000 Document updates (monthly) 4 000 Files (CSV, DOC, DOT, HTM, HTML, MPP, PDF, PPT, PS, RES, TXT) / total 88 800 / 148 000 Files (GB) / total 45 / 78 EDMS Index Maintenance & Optimization Environment Meta data Files Synchronize every 24h ? Optimize (fast, full) every month?
Scoring SQL> SELECT c_id,score(10) FROM compound_doc_info WHERE CONTAINS(c_text,’lhc’,10)>0 AND c_id = 1738594907; C_ID SCORE(10) ------------------ ---------------- 1738594907 9 SQL> SELECT c_id,score(10) FROM compound_doc_info WHERE CONTAINS(c_text,’evolution’,10)>0 AND c_id = 1738594907; C_ID SCORE(10) ------------------ ---------------- 1738594907 15
Using the thesaurus DECLARE xtab ctx_thes.exp_tab; …. BEGIN ctxsys.ctx_thes.rt(xtab,p_term,’edms_thes’); FOR i IN 1..xtab.COUNT LOOP IF xtab(i).xrel = C_RELETED_TERM THEN htp.anchor ( L_DOC_SEARCH ||'?cookie=' ||cookie ||'&p_search_type=' ||p_search_type ||'&p_free_text=' ||LOWER(xtab(i).xphrase) ,LOWER(xtab(i).xphrase) ); END IF; END LOOP; END; Propose the RT (Related Term) if nothing found with the original term(s). Would be nice to have a spell checker corrector, using existing tokens.
Querying with Oracle Text versus standard SQL …WHERE CONTAINS (c_text, p_free_text) > 0; Total 83 ms
Querying with Oracle Text versus standard SQL … WHERE UPPER(c_text) LIKE '%’||UPER(p_free_text)||’%’ Total 03.98s
Querying with Oracle Text versus standard SQL p_free_text is a single word or an exact sentence Tool Oracle Text Standard SQL Characteristics Fast. Underperforming. Statement WHERE CONTAINS (c_text,p_free_text) > 0 WHERE UPPER(c_text) LIKE '%’||UPER(p_free_text)||’%’ Time * 83 ms (821ms) 03.98s (39.14s) * Tests done with TOra 1.3.8 (in parentheses repeated 10x)
Querying with Oracle Text versus standard SQL p_free_text is an expression with OR operator Tool Oracle Text Standard SQL Characteristics Fast. Underperforming. Statement WHERE CONTAINS (c_text,p_free_text) > 0 WHERE ( UPPER(c_text) LIKE '%’||UPPER(p_text_1)||’%’ OR UPPER(c_text) LIKE '%’||UPER(p_text_2)||’%’ ) Time * 103ms (01:03 ) 09:09 (1:22.09) * Tests done with TOra 1.3.8 (in parentheses repeated 10x)