Predicate-based Indexing of Annotated Data Donald Kossmann ETH Zurich http://www.dbis.ethz.ch
Observations
• Data is annotated by apps and humans
  • Word: versions, comments, references, layout, ...
  • Humans: tags on del.icio.us
• Applications provide views on the data
  • Word: Version 1.3, text without comments, …
  • del.icio.us: provides search on tags + data
• Search engines see and index the raw data
  • E.g., treat all versions as one document
• Search engine's view != user's view
  • Search engine returns the wrong results
  • or the application has to hard-code the right search logic (del.icio.us)
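Desktop Search
[Diagram: the user reads and updates data through the views of an application (e.g., Word, Wiki, Outlook, …) and sends queries to a desktop search engine (e.g., Spotlight, Google Desktop, …). The application stores its data in the file system (e.g., x.doc, y.xls, …), which the search engine crawls and indexes.]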
Example 1: Bulk Letters
Raw data (x.doc, y.xls):
<address/> Dear <recipient/>, The meeting is at 12. CU, Donald
View:
… Dear Paul, The meeting is at 12. CU, Donald
… Dear Peter, The meeting is at 12. CU, Donald
…
Example 1: Traditional Search Engines (Inverted File)
Query: Paul, meeting. Answer: -. Correct answer: x.doc
Query: Paul, Mary. Answer: y.xls. Correct answer: x.doc
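The failure on this slide can be reproduced with a minimal sketch of a traditional inverted file. The file contents below are assumptions: x.doc holds the letter template from Example 1, and y.xls is assumed to hold only the recipient names (Paul, Peter, Mary). The functions `build_inverted_file` and `search` are illustrative names, not part of any real search engine.

```python
import re
from collections import defaultdict

# Assumed raw file contents (the slides only show the template and the names).
RAW_FILES = {
    "x.doc": "<address/> Dear <recipient/>, The meeting is at 12. CU, Donald",
    "y.xls": "Paul Peter Mary",
}

def build_inverted_file(files):
    """Map each lower-cased keyword to the set of files that contain it."""
    index = defaultdict(set)
    for name, text in files.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(name)
    return index

def search(index, keywords):
    """Return the files that contain every keyword (conjunctive query)."""
    postings = [index.get(k.lower(), set()) for k in keywords]
    return set.intersection(*postings) if postings else set()

index = build_inverted_file(RAW_FILES)
print(search(index, ["Paul", "meeting"]))  # set(): the words live in different files
print(search(index, ["Paul", "Mary"]))     # {'y.xls'}: the raw recipient list matches
```

Because the index is built over raw files rather than over the letters the user actually sees, "Paul" and "meeting" never meet in one posting list intersection.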
Example 2: Versioning (OpenOffice)
Raw Data:
<deleted id="1"><info date="5/15/2006"/></deleted>
<inserted id="2"><info date="5/15/2006"/></inserted>
<delete id="1">Mickey likes Minnie</delete>
<insert id="2">Donald likes Daisy</insert>
Instance 1 (Version 1): Mickey likes Minnie
Instance 2 (Version 2): Donald likes Daisy
Example 2: Versioning (OpenOffice) (Inverted File)
Query: Mickey likes Daisy. Answer: z.swx. Correct answer: -
Query: Mickey likes Minnie. Answer: z.swx. Correct answer: z.swx (V1)
Example 3: Personalization, Localization, Authorization
Raw data:
<header>
  <data row="duck" id="man">Donald</data>
  <data row="duck" id="woman">Daisy</data>
  <data row="mouse" id="man">Mickey</data>
  <data row="mouse" id="woman">Minnie</data>
</header>
<body> <field id="man"/> likes <field id="woman"/>. </body>
Views: Donald likes Daisy. / Mickey likes Minnie.
Raw text seen by the search engine: Donald Daisy Mickey Minnie likes .
Example 4: del.icio.us
http://A.com: Yahoo builds software.
http://B.com: Joe is a programmer at Yahoo.
Tag Table (contents not shown)
• Query: "Joe, software, Yahoo"
• Both A and B are relevant, but in different worlds
• If context info is available, a choice is possible
Example 5: Enterprise Search
• Web Applications
  • Application defined using "templates" (e.g., JSP)
  • Data both in JSP pages and database
  • Content = JSP + Java + Database
  • Content depends on context (roles, workflow)
  • Links = URL + function + parameters + context
• Enterprise Search
  • Search: map content to link
  • Enterprise search: content and link are complex
  • Example: search engine for J2EE PetStore
  • (see demo at CIDR 2007)
Possible Solutions
• Extend Applications with Search Capabilities
  • Re-invents the wheel for each application
  • Not worth the effort for small apps
  • No support for cross-app search
• Extend Search Engines
  • Application-specific rules for "encoded" data
  • "Possible Worlds" semantics of data
  • Materialize view, normalize view
  • Index normalized view
  • Extended query processing
  • Challenge: views become huge!
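[Diagram: the desktop search architecture, extended. The user still reads and updates data through the views of an application (e.g., Word, Wiki, Outlook, …) and queries the desktop search engine (e.g., Spotlight, Google Desktop, …), which crawls and indexes the file system (e.g., x.doc, y.xls, …). In addition, the search engine is given rules so that it can work on views as well.]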
Size of Views
• One rule: size of view grows linearly with size of document
  • E.g., for each version, one instance in view
  • Constant can be high! (e.g., many versions)
• Several rules: size of view grows exponentially with number of rules
  • E.g., #versions x #alternatives
• Prototype, experiments: Wiki, Office, E-Mail, …
  • About 30 rules; 5-6 applicable per document
  • View ≈ 1000 x raw data
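The growth argument can be made concrete with a little arithmetic. The sketch below assumes that every applicable rule contributes an independent number of alternatives and that a materialized view holds one instance per combination. The counts used (10 versions, 4 alternatives, 5 rules) are made-up illustration values, not measurements from the prototype.

```python
from math import prod

def instance_count(alternatives_per_rule):
    """Instances in the materialized view of one document:
    one instance per combination of alternatives across the applicable rules."""
    return prod(alternatives_per_rule)

# One rule (versioning) with 10 versions: linear in the number of versions.
print(instance_count([10]))       # 10
# Two rules: #versions x #alternatives.
print(instance_count([10, 4]))    # 40
# Five applicable rules with 4 alternatives each: exponential in the number of rules.
print(instance_count([4] * 5))    # 1024
```

Each instance is roughly as large as the raw document, which is why the materialized view quickly becomes orders of magnitude larger than the raw data.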
Rules and Patterns
• Analogy: operators of relational algebra
• Patterns sufficient for LaTeX, MS Office, OpenOffice, TWiki, E-Mail (Outlook)
Normalized View
Raw Data:
<header>
  <data row="duck" id="man">Donald</data>
  <data row="duck" id="woman">Daisy</data>
  <data row="mouse" id="man">Mickey</data>
  <data row="mouse" id="woman">Minnie</data>
</header>
<body> <field id="man"/> likes <field id="woman"/>. </body>
Rule:
<field match="//field" ref="//data[@id=$m/@id]/text()" key="$r/../@row" />
Normalized View:
<body>
  <select pred="R1=duck">Donald</select> <select pred="R1=mouse">Mickey</select> likes <select pred="R1=duck">Daisy</select> <select pred="R1=mouse">Minnie</select>.
</body>
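A minimal sketch of how the field rule on this slide could be applied, under several assumptions: the header and body are wrapped in a hypothetical <doc> root so that the input is a single XML document, the rule's match, ref, and key expressions are hard-coded in Python rather than evaluated as XQuery, and `normalize_fields` is an illustrative name.

```python
import xml.etree.ElementTree as ET

RAW = """<doc>
<header>
  <data row="duck" id="man">Donald</data>
  <data row="duck" id="woman">Daisy</data>
  <data row="mouse" id="man">Mickey</data>
  <data row="mouse" id="woman">Minnie</data>
</header>
<body><field id="man"/> likes <field id="woman"/>.</body>
</doc>"""

def normalize_fields(raw_xml, rule_id="R1"):
    """Replace each <field> by one <select> per matching <data> element."""
    root = ET.fromstring(raw_xml)
    data = root.findall("./header/data")        # candidates for ref="//data[...]"
    body = root.find("body")
    out = ET.Element("body")
    out.text = body.text
    for field in body.findall("field"):         # match="//field"
        selects = []
        for d in data:
            if d.get("id") == field.get("id"):  # @id = $m/@id
                # key="$r/../@row" becomes the predicate value
                sel = ET.SubElement(out, "select", pred=f"{rule_id}={d.get('row')}")
                sel.text = d.text
                selects.append(sel)
        if selects:
            selects[-1].tail = field.tail       # keep the text that followed the field
        elif len(out):
            out[-1].tail = (out[-1].tail or "") + (field.tail or "")
        else:
            out.text = (out.text or "") + (field.tail or "")
    return ET.tostring(out, encoding="unicode")

normalized = normalize_fields(RAW)
print(normalized)
```

The common text (" likes ", ".") appears once, and only the variable parts are wrapped in <select> elements, so the result stays linear in the size of the raw data.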
Normalized View
Raw Data:
<inserted id="1"><info date="5/1/2006"/></inserted>
<inserted id="2"><info date="5/16/2006"/></inserted>
Mickey <insert id="1">Mouse</insert> likes Minnie <insert id="2">Mouse</insert>.
Rule:
<version match="//insert" key="//inserted[@id eq $m/@id]/info/@date" />
Normalized View:
Mickey <select pred="R2>=5/1/2006">Mouse</select> likes Minnie <select pred="R2>=5/16/2006">Mouse</select>.
• General Idea:
  • Factor out common parts: "Mickey likes Minnie."
  • Mark up variable parts: <select …/>, <select …/>
Normalization Algorithm
• Step 1: Construct Tagging Table
  • Evaluate "match" expression
  • Evaluate "key" expression
  • Compute operator from pattern (e.g., > for version)
• Step 2: Tagging Table -> Normalized View
  • Enclose each match in <select> tags
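The two steps can be sketched for the version pattern as follows. This is an illustration under stated assumptions, not the prototype's implementation: the raw data is wrapped in hypothetical <doc> and <text> elements, the match and key expressions are hard-coded instead of being evaluated as XQuery, and the operator table uses ">=" for the version pattern to agree with the normalized view shown for the versioning rule.

```python
import xml.etree.ElementTree as ET

RAW = """<doc>
<inserted id="1"><info date="5/1/2006"/></inserted>
<inserted id="2"><info date="5/16/2006"/></inserted>
<text>Mickey <insert id="1">Mouse</insert> likes Minnie <insert id="2">Mouse</insert>.</text>
</doc>"""

# Assumed mapping from pattern name to comparison operator.
OPERATORS = {"version": ">="}

def tagging_table(root, rule_id, pattern):
    """Step 1: one row per match, with its key and the pattern's operator."""
    dates = {e.get("id"): e.find("info").get("date")
             for e in root.findall("inserted")}            # key expression
    return [
        {"match": m, "rule": rule_id,
         "op": OPERATORS[pattern], "key": dates[m.get("id")]}
        for m in root.iter("insert")                        # match expression
    ]

def apply_table(table):
    """Step 2: turn every matched element into a <select> with a predicate."""
    for row in table:
        m = row["match"]
        m.tag = "select"
        m.attrib.clear()
        m.set("pred", f'{row["rule"]}{row["op"]}{row["key"]}')

root = ET.fromstring(RAW)
table = tagging_table(root, "R2", "version")
apply_table(table)
preds = [s.get("pred") for s in root.iter("select")]
print(preds)  # ['R2>=5/1/2006', 'R2>=5/16/2006']
```

Keeping the tagging table as a separate intermediate result means several rules can add rows for the same document before the view is rewritten once.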
Predicate-based Indexing
Normalized View:
<body>
  <select pred="R1=duck">Donald</select> <select pred="R1=mouse">Mickey</select> likes <select pred="R1=duck">Daisy</select> <select pred="R1=mouse">Minnie</select>.
</body>
Inverted File:
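A minimal sketch of what a predicate-based inverted file for this normalized view might contain. It assumes that each posting carries the predicate of the enclosing <select>, that text outside any <select> gets the predicate "true", and that <select> elements are not nested. The document name "doc1" and the function `build_index` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

NORMALIZED = (
    '<body><select pred="R1=duck">Donald</select>'
    '<select pred="R1=mouse">Mickey</select> likes '
    '<select pred="R1=duck">Daisy</select>'
    '<select pred="R1=mouse">Minnie</select>.</body>'
)

def build_index(doc_id, normalized_xml):
    """Map each keyword to a set of (document, predicate) postings."""
    index = defaultdict(set)
    root = ET.fromstring(normalized_xml)

    def add(text, pred):
        for word in re.findall(r"\w+", text or ""):
            index[word.lower()].add((doc_id, pred))

    add(root.text, "true")                 # text before the first <select>
    for sel in root:
        add(sel.text, sel.get("pred"))     # variable part: guarded by its predicate
        add(sel.tail, "true")              # common part: valid in every instance
    return index

index = build_index("doc1", NORMALIZED)
for word in sorted(index):
    print(word, sorted(index[word]))
```

Each keyword occurrence is stored once, together with the condition under which it is visible, which is why the index stays the same order of size as a traditional inverted file.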
Query Processing
Query: Donald likes Minnie -> R1=duck ^ true ^ R1=mouse = false
Query: Donald likes Daisy -> R1=duck ^ true ^ R1=duck = R1=duck
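The two evaluations on this slide can be sketched as a conjunction of posting predicates. The sketch assumes a single document and equality predicates only, represented as dictionaries (the empty dictionary stands for "true"); range predicates such as the ">=" of the version pattern would need interval intersection instead. `INDEX`, `conjoin`, and `query` are illustrative names.

```python
from itertools import product

# Assumed predicate-based postings for one document: keyword -> list of predicates.
INDEX = {
    "donald": [{"R1": "duck"}],
    "mickey": [{"R1": "mouse"}],
    "daisy":  [{"R1": "duck"}],
    "minnie": [{"R1": "mouse"}],
    "likes":  [{}],               # {} means "true": present in every instance
}

def conjoin(p, q):
    """Conjunction of two equality predicates; None means 'false'."""
    merged = dict(p)
    for var, val in q.items():
        if merged.get(var, val) != val:
            return None           # e.g. R1=duck ^ R1=mouse
        merged[var] = val
    return merged

def query(index, keywords):
    """Return the satisfiable predicates under which all keywords co-occur."""
    postings = [index.get(k.lower(), []) for k in keywords]
    worlds = []
    for combo in product(*postings):
        pred = {}
        for p in combo:
            pred = conjoin(pred, p)
            if pred is None:
                break
        if pred is not None and pred not in worlds:
            worlds.append(pred)
    return worlds

print(query(INDEX, ["Donald", "likes", "Minnie"]))  # []: no instance contains all three
print(query(INDEX, ["Donald", "likes", "Daisy"]))   # [{'R1': 'duck'}]
```

An empty result means the document matches in no instance. A non-empty result also tells the engine which instance (here R1=duck) to show the user.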
Qualitative Assessment
• Expressiveness of rules / patterns
  • Good enough for "desktop data"
  • Extensible for other data
  • Unclear how good for general applications (e.g., SAP)
• Normalized View
  • Size: O(n), where n is the size of the raw data
  • Generation time: depends on complexity of XQuery expressions in rules; typically O(n)
• Predicate-based Inverted File
  • Size: O(n), same as traditional inverted files
  • Generation time: O(n)
  • But constants can be large
• Query Processing
  • Polynomial in #keywords in query (~ traditional)
  • High constants!
Experiments
• Data sets from my personal desktop
  • E-Mail, TWiki, LaTeX, OpenOffice, MS Office, …
• Data-set dependent rules
  • E-Mail: different rule sets (here conversations)
  • LaTeX: include, footnote, exclude, …
  • TWiki: versioning, exclude, …
• Hand-crafted queries
  • Vary selectivity and the degree to which instances are involved
• Measure size of data sets, indexes, precision & recall, query running times
Precision (TWiki)
Recall is 1 in all cases. TWiki: example of "false positives".
Recall (E-Mail)
Precision is 1 in all cases. E-Mail: example of "false negatives".
Response Time in ms (TWiki)
The enhanced search engine is one order of magnitude slower, but still within milliseconds.
Response Time in ms (E-Mail)
The enhanced search engine is orders of magnitude slower, but still within milliseconds.
Conclusion & Future Work
• See data with the eyes of users!
  • Give search engines the right glasses
  • Flexibility in search: reveal hidden data
  • Compressed indexes using predicates
• Future Work
  • Other apps: e.g., JSP, Tagging, Semantic Web
  • Consolidate different view definitions (security)
  • Search on streaming data