
Predicate-based Indexing of Annotated Data


Presentation Transcript


  1. Predicate-based Indexing of Annotated Data. Donald Kossmann, ETH Zurich, http://www.dbis.ethz.ch

  2. Observations
  • Data is annotated by apps and humans
    • Word: versions, comments, references, layout, ...
    • humans: tags on del.icio.us
  • Applications provide views on the data
    • Word: Version 1.3, text without comments, ...
    • del.icio.us: provides search on tags + data
  • Search engines see and index the raw data
    • E.g., treat all versions as one document
  • Search engine's view != user's view
    • Search engine returns the wrong results
    • or hard-code the right search logic (del.icio.us)

  3. Desktop Search (architecture diagram)
  • The User reads & updates through the Application (e.g., Word, Wiki, Outlook, ...), which presents Views over the File System (e.g., x.doc, y.xls, ...).
  • The User queries the Desktop Search Engine (e.g., Spotlight, Google Desktop, ...), which crawls & indexes the File System directly.

  4. Example 1: Bulk Letters
  Raw data (x.doc, y.xls, ...):
    <address/> Dear <recipient/>, The meeting is at 12. CU, Donald
  View:
    Dear Paul, The meeting is at 12. CU, Donald
    Dear Peter, The meeting is at 12. CU, Donald
    ...

  5. Example 1: Traditional Search Engines
  Inverted file over the raw data:
  • Query: Paul, meeting. Answer: -. Correct answer: x.doc
  • Query: Paul, Mary. Answer: y.xls. Correct answer: x.doc
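  To make the failure mode concrete, here is a minimal Python sketch (a toy reconstruction, not the talk's system): an inverted file built over the raw files misses the letter for Paul, while the same index built over the materialized views finds it. The corpus contents are assumptions based on slides 4 and 5.

```python
# Toy inverted file; hypothetical corpus contents based on slides 4/5.
def inverted_file(docs):
    index = {}
    for doc, text in docs.items():
        for token in text.lower().split():
            index.setdefault(token.strip(".,<>/"), set()).add(doc)
    return index

def query(index, *terms):
    # conjunctive keyword query: documents containing all terms
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets)

# Raw data: the template and the address list are separate files,
# so "Paul" and "meeting" never co-occur in one raw document.
raw = {"x.doc": "<address/> Dear <recipient/> the meeting is at 12",
       "y.xls": "Paul Peter Mary"}
print(query(inverted_file(raw), "Paul", "meeting"))  # set() -> misses x.doc
print(query(inverted_file(raw), "Paul", "Mary"))     # {'y.xls'} -> spurious hit

# User's view: one instantiated letter per recipient.
views = {"x.doc": "Dear Paul, the meeting is at 12 Dear Peter, the meeting is at 12"}
print(query(inverted_file(views), "Paul", "meeting"))  # {'x.doc'} -> correct
```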

  6. Example 2: Versioning (OpenOffice)
  Raw data:
    <deleted id="1"><info date="5/15/2006"/></deleted>
    <inserted id="2"><info date="5/15/2006"/></inserted>
    <delete id="1">Mickey likes Minnie</delete>
    <insert id="2">Donald likes Daisy</insert>
  Instance 1 (Version 1): Mickey likes Minnie
  Instance 2 (Version 2): Donald likes Daisy
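  The two instances can be reconstructed mechanically from the change-tracking markup. Below is a minimal Python sketch under simplifying assumptions (integer version numbers instead of dates, regex parsing instead of real XML handling); it is not OpenOffice's or the talk's code.

```python
import re

# change id -> version in which the change takes effect (an assumption:
# both tracked changes of slide 6 produce version 2)
applied = {1: 2, 2: 2}

def instance(text, version):
    """Materialize the document text as it looks in a given version."""
    def deleted(m):   # deleted text is visible only before its change applies
        return m.group(2) if applied[int(m.group(1))] > version else ""
    def inserted(m):  # inserted text is visible once its change applies
        return m.group(2) if applied[int(m.group(1))] <= version else ""
    text = re.sub(r'<delete id="(\d+)">(.*?)</delete>', deleted, text)
    text = re.sub(r'<insert id="(\d+)">(.*?)</insert>', inserted, text)
    return text.strip()

raw = '<delete id="1">Mickey likes Minnie</delete><insert id="2">Donald likes Daisy</insert>'
print(instance(raw, 1))  # Version 1: Mickey likes Minnie
print(instance(raw, 2))  # Version 2: Donald likes Daisy
```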

  7. Example 2: Versioning (OpenOffice)
  Inverted file over the raw data:
  • Query: Mickey likes Daisy. Answer: z.swx. Correct answer: -
  • Query: Mickey likes Minnie. Answer: z.swx. Correct answer: z.swx (V1)

  8. Example 3: Personalization, Localization, Authorization
  Raw data:
    <header>
      <data row="duck" id="man">Donald</data>
      <data row="duck" id="woman">Daisy</data>
      <data row="mouse" id="man">Mickey</data>
      <data row="mouse" id="woman">Minnie</data>
    </header>
    <body>
      <field id="man"/> likes <field id="woman"/>.
    </body>
  View:
    Donald likes Daisy.
    Mickey likes Minnie.
  (Index terms in the figure: Donald, Daisy, Mickey, Minnie, likes, .)
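  Read as "possible worlds", the template denotes one instance per data row. A tiny Python sketch of that reading (a toy re-implementation; the row and field names mirror the slide):

```python
# Each header row is one "possible world"; the body template is
# instantiated once per row.
header = {
    "duck":  {"man": "Donald", "woman": "Daisy"},
    "mouse": {"man": "Mickey", "woman": "Minnie"},
}
body = "{man} likes {woman}."

for row, fields in header.items():
    print(f"world {row}: " + body.format(**fields))
# world duck: Donald likes Daisy.
# world mouse: Mickey likes Minnie.
```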

  9. Example 4: del.icio.us
  Tag table (figure) over two tagged pages:
    http://A.com: "Yahoo builds software."
    http://B.com: "Joe is a programmer at Yahoo."
  • Query: "Joe, software, Yahoo"
  • Both A and B are relevant, but in different worlds
  • If context info is available, a choice is possible

  10. Example 5: Enterprise Search
  • Web Applications
    • Application defined using "templates" (e.g., JSP)
    • Data both in JSP pages and database
    • Content = JSP + Java + Database
    • Content depends on Context (roles, workflow)
    • Links = URL + function + parameters + context
  • Enterprise Search
    • Search: map Content to Link
    • Enterprise Search: Content and Link are complex
    • Example: Search Engine for J2EE PetStore (see demo at CIDR 2007)

  11. Possible Solutions
  • Extend applications with search capabilities
    • Re-invents the wheel for each application
    • Not worth the effort for small apps
    • No support for cross-app search
  • Extend search engines
    • Application-specific rules for "encoded" data
    • "Possible worlds" semantics of data
    • Materialize the view, normalize the view
    • Index the normalized view
    • Extended query processing
    • Challenge: views become huge!

  12. User read & update query Views Desktop Search Engine (e.g., Spotlight, Google Desktop, …) Application(e.g., Word, Wiki, Outlook, …) File System(e.g., x.doc, y.xls, …) crawl & index Views rules

  13. Size of Views
  • One rule: size of view grows linearly with size of document
    • E.g., for each version, one instance in the view
    • Constant can be high! (e.g., many versions)
  • Several rules: size of view grows exponentially with number of rules
    • E.g., #versions x #alternatives
  • Prototype, experiments: Wiki, Office, E-Mail, ...
    • About 30 rules; 5-6 applicable per document
    • View ≈ 1000 x raw data
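  To see why several rules blow up the view, multiply the alternatives each applicable rule introduces. A back-of-the-envelope sketch (the counts below are made-up illustrations, not measurements from the talk):

```python
from math import prod

# hypothetical alternatives per applicable rule for one document
alternatives = {"versions": 25, "personalization rows": 10,
                "comments shown/hidden": 2, "layouts": 2}
print(prod(alternatives.values()), "materialized instances")  # 1000
```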

  14. Solution Architecture

  15. Rules and Patterns
  • Analogy: operators of relational algebra
  • Patterns sufficient for Latex, MS Office, OpenOffice, TWiki, E-Mail (Outlook)
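  What such a rule could look like as data, by analogy with relational-algebra operators. The class and field names below are illustrative assumptions, not the talk's actual rule syntax (which appears on slides 16 and 17):

```python
from dataclasses import dataclass

@dataclass
class Rule:
    pattern: str  # the operator-like pattern, e.g. "field" or "version"
    match: str    # XPath/XQuery expression selecting the variable parts
    key: str      # expression computing the predicate key per match

rules = [
    Rule(pattern="field",   match="//field",  key="$r/../@row"),
    Rule(pattern="version", match="//insert", key="//inserted[@id eq $m/@id]/info/@date"),
]
```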

  16. Normalized View
  Raw data:
    <header>
      <data row="duck" id="man">Donald</data>
      <data row="duck" id="woman">Daisy</data>
      <data row="mouse" id="man">Mickey</data>
      <data row="mouse" id="woman">Minnie</data>
    </header>
    <body>
      <field id="man"/> likes <field id="woman"/>.
    </body>
  Rule:
    <field match="//field" ref="//data[@id=$m/@id]/text()" key="$r/../@row"/>
  Normalized view:
    <body>
      <select pred="R1=duck">Donald</select>
      <select pred="R1=mouse">Mickey</select>
      likes
      <select pred="R1=duck">Daisy</select>
      <select pred="R1=mouse">Minnie</select>.
    </body>
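  A toy re-implementation of this normalization with Python's xml.etree (the real system evaluates the rule's XQuery expressions; here the field rule is hard-wired for illustration):

```python
import xml.etree.ElementTree as ET

raw = ET.fromstring(
    '<doc><header>'
    '<data row="duck" id="man">Donald</data>'
    '<data row="duck" id="woman">Daisy</data>'
    '<data row="mouse" id="man">Mickey</data>'
    '<data row="mouse" id="woman">Minnie</data>'
    '</header>'
    '<body><field id="man"/> likes <field id="woman"/>.</body></doc>')

body = raw.find("body")
for field in list(body.findall("field")):
    pos = list(body).index(field)
    tail = field.tail            # keep the text that followed the field
    body.remove(field)
    fid = field.get("id")
    matches = raw.findall(f".//data[@id='{fid}']")
    for i, data in enumerate(matches):
        # one <select> alternative per data row matching the field's id
        sel = ET.Element("select", pred=f"R1={data.get('row')}")
        sel.text = data.text
        body.insert(pos + i, sel)
    body[pos + len(matches) - 1].tail = tail

print(ET.tostring(body, encoding="unicode"))
# <body><select pred="R1=duck">Donald</select><select pred="R1=mouse">Mickey
# </select> likes <select pred="R1=duck">Daisy</select>
# <select pred="R1=mouse">Minnie</select>.</body>
```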

  17. Normalized View
  Raw data:
    <inserted id="1"><info date="5/1/2006"/></inserted>
    <inserted id="2"><info date="5/16/2006"/></inserted>
    Mickey <insert id="1">Mouse</insert> likes Minnie <insert id="2">Mouse</insert>.
  Rule:
    <version match="//insert" key="//inserted[@id eq $m/@id]/info/@date"/>
  Normalized view:
    Mickey <select pred="R2>=5/1/2006">Mouse</select> likes Minnie <select pred="R2>=5/16/2006">Mouse</select>.
  • General idea:
    • Factor out common parts: "Mickey likes Minnie."
    • Mark up variable parts: <select .../>, <select .../>

  18. Normalization Algorithm
  • Step 1: Construct the tagging table
    • Evaluate the "match" expression
    • Evaluate the "key" expression
    • Compute the operator from the pattern (e.g., >= for version)
  • Step 2: Tagging table -> normalized view
    • Enclose each match in <select> tags
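  A toy sketch of the two steps for the version rule of slide 17. The tagging table is written out by hand here; the real system obtains it by evaluating the rule's match and key expressions:

```python
# Step 1: tagging table, one row per match of the rule's "match"
# expression: (matched fragment, operator from the pattern, key).
tagging_table = [
    ("Mouse", ">=", "5/1/2006"),   # first <insert>, change dated 5/1/2006
    ("Mouse", ">=", "5/16/2006"),  # second <insert>, change dated 5/16/2006
]

# Step 2: enclose each matched fragment in <select> tags carrying
# the predicate (R2 is the rule's predicate variable).
for fragment, op, key in tagging_table:
    print(f'<select pred="R2{op}{key}">{fragment}</select>')
```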

  19. Predicate-based Indexing
  Normalized view:
    <body>
      <select pred="R1=duck">Donald</select>
      <select pred="R1=mouse">Mickey</select>
      likes
      <select pred="R1=duck">Daisy</select>
      <select pred="R1=mouse">Minnie</select>.
    </body>
  Inverted file: each posting carries the predicate under which its term is visible.
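  The inverted-file figure itself did not survive the transcript; below is a minimal sketch of the assumed data layout, where a posting pairs a document with the visibility predicate of the term, and unconditional terms get the predicate "true":

```python
# (term, predicate) pairs extracted from the normalized view above;
# the document name is made up for illustration.
tokens = [("Donald", "R1=duck"), ("Mickey", "R1=mouse"), ("likes", "true"),
          ("Daisy", "R1=duck"), ("Minnie", "R1=mouse")]

index = {}
for term, pred in tokens:
    # posting = (document, visibility predicate)
    index.setdefault(term.lower(), []).append(("x.xml", pred))

print(index["donald"])  # [('x.xml', 'R1=duck')]
```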

  20. Query Processing
  • Query: Donald likes Minnie. Predicates: R1=duck ^ true ^ R1=mouse = false (no match)
  • Query: Donald likes Daisy. Predicates: R1=duck ^ true ^ R1=duck = R1=duck (match)
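  A sketch of this evaluation in Python. For equality predicates like R1=duck, the conjunction is satisfiable iff no variable is bound to two different values; the tiny solver below is an assumption covering exactly this case, not the talk's general predicate logic:

```python
def conjoin(predicates):
    """AND simple equality predicates; None means unsatisfiable (false)."""
    bindings = {}
    for p in predicates:
        if p == "true":
            continue
        var, val = p.split("=")
        if bindings.setdefault(var, val) != val:
            return None  # two different values for the same variable
    return bindings or "true"

# "Donald likes Minnie": R1=duck ^ true ^ R1=mouse -> false
print(conjoin(["R1=duck", "true", "R1=mouse"]))  # None
# "Donald likes Daisy":  R1=duck ^ true ^ R1=duck -> R1=duck
print(conjoin(["R1=duck", "true", "R1=duck"]))   # {'R1': 'duck'}
```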

  21. Qualitative Assessment
  • Expressiveness of rules / patterns
    • Good enough for "desktop data"
    • Extensible for other data
    • Unclear how good for general applications (e.g., SAP)
  • Normalized view
    • Size: O(n), where n is the size of the raw data
    • Generation time: depends on the complexity of the XQuery expressions in the rules; typically O(n)
  • Predicate-based inverted file
    • Size: O(n), same as traditional inverted files
    • Generation time: O(n)
    • But the constants can be large
  • Query processing
    • Polynomial in the number of keywords in the query (~ traditional)
    • High constants!

  22. Experiments
  • Data sets from my personal desktop
    • E-Mail, TWiki, Latex, OpenOffice, MS Office, ...
  • Data-set-dependent rules
    • E-Mail: different rule sets (here: conversations)
    • Latex: include, footnote, exclude, ...
    • TWiki: versioning, exclude, ...
  • Hand-crafted queries
    • Vary selectivity and the degree to which instances are involved
  • Measure size of data sets, indexes, precision & recall, query running times

  23. Data Size (Twiki)

  24. Data Size (E-Mail)

  25. Precision (Twiki). Recall is 1 in all cases; TWiki is the example for "false positives".

  26. Recall (E-Mail). Precision is 1 in all cases; E-Mail is the example for "false negatives".

  27. Response Time in ms (Twiki). The enhanced index is one order of magnitude slower, but still answers within milliseconds.

  28. Response Time in ms (E-Mail). The enhanced index is orders of magnitude slower, but still answers within milliseconds.

  29. Conclusion & Future Work • See data with the eyes of users! • Give search engines the right glasses • Flexibility in search: reveal hidden data • Compressed indexes using predicates • Future Work • Other apps: e.g., JSP, Tagging, Semantic Web • Consolidate different view definitions (security) • Search on streaming data
