640 likes | 850 Views
Universal, Composable Indexing Queries, Text, Spatial Data, XML Structure, and XML Semantics. Chris Biow, Federal CTO Wayne Feick, Lead Engineer Mark Logic. Universal Indexing: Agenda. MarkLogic Server Text XML Structure XML Semantics Scalars and aggregation Spatial location
E N D
Universal, Composable IndexingQueries, Text, Spatial Data, XML Structure, and XML Semantics Chris Biow, Federal CTO Wayne Feick, Lead Engineer Mark Logic
Universal Indexing: Agenda • MarkLogic Server • Text • XML Structure • XML Semantics • Scalars and aggregation • Spatial location • Alerting: query indexing • Search/query composed with Alerting
MarkLogic XML Server DBMS Search Application Server
What is MarkLogic Server? • An XML Server • A hybrid (integrated parts) • Special purpose DBMS for XML, with enterprise expectations: • ACID transactions • DBA, backup, replication • Search engine kernel, with enterprise expectations: • Full text • Faceted navigation, at massive scale • Boolean, proximity, stemming, tokenization, decompounding, case, diacritics • Application Server • HTTP • XCC Java/.NET • WebDav
MarkLogic as Special DBMS • Not relational (RDBMS) • Schema agnostic • Text a first-class citizen among data types • XQuery (SQL) • Search engine algorithms for many DB queries • Order(1) in number of docs • O(log(n)) for range indexing • O(n) in results • Very low DBA overhead (0.5 FTE / 100 hosts) • 5-minute install • 5-minute scale-out • Database and search engine are the same
MarkLogic as Special Search Engine timeline • Understands document structure • Transactional: high CRUD load • Unicode • Diacritics • Collations • Languages • Holds the documents • Reindexing • Delivery • Geospatial: Box, Point/radius, polygon • Alerting: Profile, alert, filter, tipping, selector, “trigger,” … • Analytics: Facets, Co-occurrence, word lexicons, … • Everything composes (e.g. geo-alerting, geo-text-data, search-alerting) • Processing near the data • Relational joins and inferencing • Database and search engine are the same message message @id 3 @id 5 status status Testing XProc Oh boy… Element node Attribute Node Text Node
MarkLogic as Special App Server • Native HTTP(S) server • RESTfulXML by default • Transform to HTML • Transform to PDF, MS Office, etc. • PKI with no dependencies • Optionally with external Auth • HTTP(S) client • XCC Java / .NET server • Similar to JDBC / ADO.NET • WebDAV • Folder on the user’s desktop RESTful Architecture user/ Representations + get .json.xqy .xml.xqy + … user.xqy + Get + Put + Post + Delete Resources notes/ URL Rewriter note.xqy Routes.xqy
MarkLogic at Scale • Scale up: typically 1TB or more per server • Scale out: low hundreds ++ • Commodity hardware • Typically ~$5K HW budget per server • 2-CPU x 4-core • 32+ GB RAM • 3x disk: local mount with failover • OS • Linux RHEL 5 • Solaris 10 • Windows 2003/8 (XP/Vista/7 for dev)
Universal Indexing: Agenda • MarkLogic Server • Text • XML Structure • XML Semantics • Scalars and aggregation • Spatial location • Alerting: query indexing • Search/query composed with Alerting
Full-text Indexing (Search) Find all documents that contain the phrase “uniquely identify” 129 122 127 123 130 126 124 125 0 0 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 0 0 0 0 1 0 0 0 1 0 0 1 1 1 1 can uniquely identify however
Full-text Indexing (Search) Find all documents that contain the phrase “uniquely identify” UNIVERSAL INDEX “which” 123, 127, 129, 152, 344, 791 . . . “uniquely” 122, 125, 126, 129, 130, 167 . . . “identify” 123, 126, 130, 142, 143, 167 . . . Document References “each” 123, 130, 131, 135, 162, 177 . . . “uniquely identify” 126, 130, 167, 212, 219, 377 . . . “which uniquely” . . . 126, 130, 167, 212, 219, 377 . . . “identify each” . . .
Word Lexicons • Revertible index of words • Word-level search (wildcarding) • Corpus analysis • Most common words • Document-frequency of words • Result-set intersection • Most common words in result set • Document-frequency of chosen words in result
Universal Indexing: Agenda • MarkLogic Server • Text • XML Structure • XML Semantics • Scalars and aggregation • Spatial location • Alerting: query indexing • Search/query composed with Alerting
XML Structure Find all articles that have an abstract <article> <title>A Relational Model of Data for Large Shared Data Banks</title> <author><first-name>Edgar</first-name><last-name>Codd</last-name></author> <abstract> Future users of data banks must be protected from having to know how the data is organized in the machine (the internal representation). . . . Changes in data representation will often be needed . . . </abstract> <body> <section> <section> … has values which uniquely identify each element ... </section> </section> <section>… version of <product>IMS</product> provides the user . . . </section> </body> <metadata><vol>13</vol><number>6</number><year>1970</year></metadata> </article>
XML Structure Find all articles that have an abstract UNIVERSAL INDEX “which” 123, 127, 129, 152, 344, 791 . . . “uniquely” 122, 125, 126, 129, 130, 167 . . . “identify” 123, 126, 130, 142, 143, 167 . . . Document References “each” 123, 130, 131, 135, 162, 177 . . . “uniquely identify” 126, 130, 167, 212, 219, 377 . . . <article> . . . 126, 130, 167, 212, 219, 377 . . . <article>/<abstract> . . .
Universal Indexing: Agenda • MarkLogic Server • Text • XML Structure • XML Semantics • Scalars and aggregation • Spatial location • Alerting: query indexing • Search/query composed with Alerting
XML Semantics Find all documents that mention the product “IMS” <article> <title>A Relational Model of Data for Large Shared Data Banks</title> <author><first-name>Edgar</first-name><last-name>Codd</last-name></author> <abstract> Future users of data banks must be protected from having to know how the data is organized in the machine (the internal representation). . . . Changes in data representation will often be needed . . . </abstract> <body> <section> <section> … has values which uniquely identify each element ... </section> </section> <section>… version of <product>IMS</product> provides the user . . . </section> </body> <metadata><vol>13</vol><number>6</number><year>1970</year></metadata> </article>
XML Semantics Find all documents that mention the product “IMS” UNIVERSAL INDEX “which” 123, 127, 129, 152, 344, 791 . . . “uniquely” 122, 125, 126, 129, 130, 167 . . . “identify” 123, 126, 130, 142, 143, 167 . . . Document References “each” 123, 130, 131, 135, 162, 177 . . . “uniquely identify” 126, 130, 167, 212, 219, 377 . . . <article> . . . 126, 130, 167, 212, 219, 377 . . . <article>/<abstract> . . . <product>IMS</product>
Universal Indexing: Agenda • MarkLogic Server • Text • XML Structure • XML Semantics • Scalars and aggregation • Spatial location • Alerting: query indexing • Search/query composed with Alerting
Range Indexing Bi-directionally index ordered values (YEAR:int, Volume:text) to document (fragment) IDs UNIVERSAL INDEX “which” 123, 127, 129, 152, 344, 791 . . . “uniquely” 122, 125, 126, 129, 130, 167 . . . YEAR “identify” 123, 126, 130, 142, 143, 167 . . . Document References “each” 123, 130, 131, 135, 162, 177 . . . “data base” 126, 130, 167, 212, 219, 377 . . . <article> . . . 126, 130, 167, … <article>/<abstract> . . . <product>IMS</product> Volume
Range indexes / lexicons • Scalar or ordered discrete • XS data types • Inequalities • Range queries • Aggregations • Corpus or Result-specific • Discrete • Bucketed (query-time bounds) • Faceted Navigation • Fundamentally O(log(n)) • Memory-mapped
Aggregate Queries How many of the articles that contain “data base” were written in each of the last 5 decades? <article> <title>A Relational Model of Data for Large Shared Data Banks</title> <author><first-name>Edgar</first-name><last-name>Codd</last-name></author> <abstract> Future users of data banks must be protected from having to know how the data is organized in the machine (the internal representation). . . . Changes in data representation will often be needed . . . </abstract> <body> <section> <section> … has values which uniquely identify each element ... </section> </section> <section>… version of <product>IMS</product> provides the user . . . </section> </body> <metadata><vol>13</vol><number>6</number><year>1970</year></metadata> </article>
Aggregates How many of the articles that contain “data base” were written in each of the last 5 decades? UNIVERSAL INDEX “which” 123, 127, 129, 152, 344, 791 . . . “uniquely” 122, 125, 126, 129, 130, 167 . . . YEAR “identify” 123, 126, 130, 142, 143, 167 . . . Document References “each” 123, 130, 131, 135, 162, 177 . . . “data base” 126, 130, 167, 212, 219, 377 . . . <article> . . . 126, 130, 167, … <article>/<abstract> . . . <product>IMS</product> Volume
Composition Count all articles that contain “data” in the title and mention the product IMS in a section, grouping by year. <article> <title>A Relational Model of Data for Large Shared Data Banks</title> <author><first-name>Edgar</first-name><last-name>Codd</last-name></author> <abstract> Future users of data banks must be protected from having to know how the data is organized in the machine (the internal representation). . . . Changes in data representation will often be needed . . . </abstract> <body> <section> <section> … has values which uniquely identify each element ... </section> </section> <section>… version of <product>IMS</product> provides the user . . . </section> </body> <metadata><vol>13</vol><number>6</number><year>1970</year></metadata> </article>
The Universal Index Range Indexes UNIVERSAL INDEX Term Term List “which” 123, 127, 129, 152, 344, 791 . . . “uniquely” 122, 125, 126, 129, 130, 167 . . . “identify” 123, 126, 130, 142, 143, 167 . . . “each” 123, 130, 131, 135, 162, 177 . . . Document References “data base” 126, 130, 167, 212, 219, 377 . . . <article> . . . <article>/<abstract> . . . 126, 130, 167, … <product>IMS</product>
Back to Discrete Indexing • Directories • Exclusive, hierarchical, analogous to file system, map to URI • Collections • Set-based, N:N relationship • Security • Invisible to your app
The Universal Index Range Indexes UNIVERSAL INDEX Term Term List “which” 123, 127, 129, 152, 344, 791 . . . “uniquely” 122, 125, 126, 129, 130, 167 . . . “identify” 123, 126, 130, 142, 143, 167 . . . “each” 123, 130, 131, 135, 162, 177 . . . Document References “data base” 126, 130, 167, 212, 219, 377 . . . <article> . . . <article>/<abstract> . . . 126, 130, 167, … <product>IMS</product> Directory(“/articles/”) Collection(Red) Role:Editor + Action:Read
Universal Indexing: Agenda • MarkLogic Server • Text • XML Structure • XML Semantics • Scalars and aggregation • Spatial location • Alerting: query indexing • Search/query composed with Alerting
Scalar Pairs • Data examples • Latitude / Longitude • Any other pair (e.g. volume / closing price) • Query types • Point (exact value) • Point-Radius (circle) • Lat/Lon bound (Mercator “rectangle”) • Polygon (10K+ vertices) • Aggregations: co-occurrence • Discrete facets (symptom : treatment) • Bucketed continuous • Frequency-ordered • Heat Map
Geospatial Indexing Points ordered in latitude major order; special scan operators apply geospatial query constraints GEOSPATIAL INDEX 130 0, -124 123 -10,10.5 126 127 0,0 -10,10.5 167 0,0 126 0,-167 ... 130 0, -29 113 0,-12 126 0,0 167 0,0 113 10.1, 35.553 …
Query Registration • Canonicalize a cts:query() • Hash for an ID • Resolve the query • Persist the term list in memory • Reuse as materialized sub-query • (AKA Topics, Concepts, Macros, etc.)
Universal Indexing: Agenda • MarkLogic Server • Text • XML Structure • XML Semantics • Scalars and aggregation • Spatial location • Alerting: query indexing • Search/query composed with Alerting
Query Indexing • “Alerting” • Selectors, tippers, standing query, filters, “triggers*”, real-time search, content-based routing, stream DBMS, etc. • Search(query, Index) -> docs • Alert(doc, Index) -> queries • Common sub-expression factoring • O(1) in number of alerts • O(n) in results returned * MarkLogic has pre- and post-commit DBMS triggers, which are unrelated Alert New patient matching your study profile
The Reverse Index REVERSE INDEX Document References Query Unified Expression Trees year >= 1970 and (“data” and “size”) “data” and and 437 year > 2003 and (“data” and “web”) “size” year < 2000 and (“data” and “web”) “web” and and 562 (2000 <= year <= 2010) and “web” and 597 . . . year >= 1970 and 623 year < 2000 year >= 2000 and year > 2003 year <= 2010
Alerting in Composition • Scalar • Search is on scalar data with range queries • Alert on range data with scalar reverse-queries • Geospatial • Search on point data with box, circle, polygon query constraint • Compose with text, structure, semantic query • Alert on box, circle, polygon data with point reverse-query • Compose with text, structure, semantic data • Search • Queries compose with Boolean operations (AND, OR, NOT, ((())) • Reverse- and forward-query compose (AND, OR, NOT, ((())) • Why would you ever want to do that?
Universal Indexing: Agenda • MarkLogic Server • Text • XML Structure • XML Semantics • Scalars and aggregation • Spatial location • Alerting: query indexing • Search/query composed with Alerting
Search Composed with Alerting • In Soviet Russia, the document queries YOU! • If you are also a document • Documents are XML • Elements • Typed data • Text • Serialized XQuery against documents • Attributes • Typed data • Composition • XQuery incorporating document and Booleans
Matchmaking • Constraints upon each other. • Matching pairs or one-to-many • Suitable date • man / woman • mixed pool • Carpool ride: driver / rider • Employment: job / resume • Medication: patient / drug • Search security: document / user • Battle: target / shooter
Carpool • Driver • Non-smoking woman driving from San Ramon to San Carlos, leaving at 8AM, listens to rock, pop, hip-hop, wants $10 for gas • Requires female passenger within five miles of start and end • Passenger • Woman will pay up to $20 • From: 3001 Summit View Dr, San Ramon, CA 94582 • To: 300 Oracle Pkwy, Redwood City, CA 94065 • Requires non-smoking car • Won’t listen to country music
Driver let $from := cts:point(37.751658,-121.898387) (: San Ramon :) let $to := cts:point(37.507363, -122.247119) (: San Carlos :) return xdmp:document-insert( "/driver.xml", <driver> <from>{$from}</from> <to>{$to}</to> <when>2010-01-20T08:00:00-08:00</when> <gender>female</gender> <smoke>no</smoke> <music>rock, pop, hip-hop</music> <cost>10</cost> <preferences> {cts:and-query(( cts:element-value-query(xs:QName("gender"), "female"), cts:element-geospatial-query(xs:QName("from"), cts:circle(5, $from)), cts:element-geospatial-query(xs:QName("to"), cts:circle(5, $to)) )) } </preferences> </driver>)
Driver let $from := cts:point(37.751658,-121.898387) (: San Ramon :) let $to := cts:point(37.507363, -122.247119) (: San Carlos :) return xdmp:document-insert( "/driver.xml", <driver> <from>{$from}</from> <to>{$to}</to> <when>2010-01-20T08:00:00-08:00</when> <gender>female</gender> <smoke>no</smoke> <music>rock, pop, hip-hop</music> <cost>10</cost> <preferences> {cts:and-query(( cts:element-value-query(xs:QName("gender"), "female"), cts:element-geospatial-query(xs:QName("from"), cts:circle(5, $from)), cts:element-geospatial-query(xs:QName("to"), cts:circle(5, $to)) )) } ...
Driver let $from := cts:point(37.751658,-121.898387) (: San Ramon :) let $to := cts:point(37.507363, -122.247119) (: San Carlos :) return xdmp:document-insert( "/driver.xml", <driver> <from>{$from}</from> <to>{$to}</to> <when>2010-01-20T08:00:00-08:00</when> <gender>female</gender> <smoke>no</smoke> <music>rock, pop, hip-hop</music> <cost>10</cost> <preferences> {cts:and-query(( cts:element-value-query(xs:QName("gender"), "female"), cts:element-geospatial-query(xs:QName("from"), cts:circle(5, $from)), cts:element-geospatial-query(xs:QName("to"), cts:circle(5, $to)) )) } ...
Driver let $from := cts:point(37.751658,-121.898387) (: San Ramon :) let $to := cts:point(37.507363, -122.247119) (: San Carlos :) return xdmp:document-insert( "/driver.xml", <driver> <from>{$from}</from> <to>{$to}</to> <when>2010-01-20T08:00:00-08:00</when> <gender>female</gender> <smoke>no</smoke> <music>rock, pop, hip-hop</music> <cost>10</cost> <preferences> {cts:and-query(( cts:element-value-query(xs:QName("gender"), "female"), cts:element-geospatial-query(xs:QName("from"), cts:circle(5, $from)), cts:element-geospatial-query(xs:QName("to"), cts:circle(5, $to)) )) } </preferences> </driver>)
Driver let $from := cts:point(37.751658,-121.898387) (: San Ramon :) let $to := cts:point(37.507363, -122.247119) (: San Carlos :) return xdmp:document-insert( "/driver.xml", <driver> <from>{$from}</from> <to>{$to}</to> <when>2010-01-20T08:00:00-08:00</when> <gender>female</gender> <smoke>no</smoke> <music>rock, pop, hip-hop</music> <cost>10</cost> <preferences> {cts:and-query(( cts:element-value-query(xs:QName("gender"), "female"), cts:element-geospatial-query(xs:QName("from"), cts:circle(5, $from)), cts:element-geospatial-query(xs:QName("to"), cts:circle(5, $to)) ...
Driver let $from := cts:point(37.751658,-121.898387) (: San Ramon :) let $to := cts:point(37.507363, -122.247119) (: San Carlos :) return xdmp:document-insert( "/driver.xml", <driver> <from>{$from}</from> <to>{$to}</to> <when>2010-01-20T08:00:00-08:00</when> <gender>female</gender> <smoke>no</smoke> <music>rock, pop, hip-hop</music> <cost>10</cost> <preferences> {cts:and-query(( cts:element-value-query(xs:QName("gender"), "female"), cts:element-geospatial-query(xs:QName("from"), cts:circle(5, $from)), cts:element-geospatial-query(xs:QName("to"), cts:circle(5, $to)) )) } </preferences> </driver>)
Driver xdmp:document-insert( "/passenger.xml", <passenger> <from>37.739976,-121.915821</from> <to>37.53107,-122.264122</to> <gender>female</gender> <preferences> { cts:and-query(( cts:not-query(cts:element-word-query( xs:QName("music"), "country")), cts:element-range-query(xs:QName("cost"), "<=", 20), cts:element-value-query(xs:QName("smoke"), "no"), cts:element-value-query(xs:QName("gender"), "female") )) } </preferences> </passenger>)
Driver xdmp:document-insert( "/passenger.xml", <passenger> <from>37.739976,-121.915821</from> <to>37.53107,-122.264122</to> <gender>female</gender> <preferences> { cts:and-query(( cts:not-query(cts:element-word-query( xs:QName("music"), "country")), cts:element-range-query(xs:QName("cost"), "<=", 20), cts:element-value-query(xs:QName("smoke"), "no"), cts:element-value-query(xs:QName("gender"), "female") )) } </preferences> </passenger>)
Driver xdmp:document-insert( "/passenger.xml", <passenger> <from>37.739976,-121.915821</from> <to>37.53107,-122.264122</to> <gender>female</gender> <preferences> { cts:and-query(( cts:not-query(cts:element-word-query( xs:QName("music"), "country")), cts:element-range-query(xs:QName("cost"), "<=", 20), cts:element-value-query(xs:QName("smoke"), "no"), cts:element-value-query(xs:QName("gender"), "female") )) } </preferences> </passenger>)
Driver xdmp:document-insert( "/passenger.xml", <passenger> <from>37.739976,-121.915821</from> <to>37.53107,-122.264122</to> <gender>female</gender> <preferences> { cts:and-query(( cts:not-query(cts:element-word-query( xs:QName("music"), "country")), cts:element-range-query(xs:QName("cost"), "<=", 20), cts:element-value-query(xs:QName("smoke"), "no"), cts:element-value-query(xs:QName("gender"), "female") )) } </preferences> </passenger>)
Driver xdmp:document-insert( "/passenger.xml", <passenger> <from>37.739976,-121.915821</from> <to>37.53107,-122.264122</to> <gender>female</gender> <preferences> { cts:and-query(( cts:not-query(cts:element-word-query( xs:QName("music"), "country")), cts:element-range-query(xs:QName("cost"), "<=", 20), cts:element-value-query(xs:QName("smoke"), "no"), cts:element-value-query(xs:QName("gender"), "female") )) } </preferences> </passenger>)