1 / 48

Searching for Statistical Diagrams

Searching for Statistical Diagrams. Michael Cafarella University of Michigan Joint work with Shirley Zhe Chen and Eytan Adar Brigham Young University November 17, 2011. Statistical Diagrams. Everywhere in serious academic, governmental, scientific documents

mircea
Download Presentation

Searching for Statistical Diagrams

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Searching for Statistical Diagrams Michael Cafarella University of Michigan Joint work with Shirley Zhe Chen and Eytan Adar Brigham Young University November 17, 2011

  2. Statistical Diagrams • Everywhere in serious academic, governmental, scientific documents • Our only peek into data behind docs • Previously precious and rare, Web gives us a precious flood • In small Web crawl, found 319K diagrams in 153K academic papers • Google makes it easy to find docs, images; very hard to find diagrams

  3. Outline • Intro • Related Work • Our Approach • Metadata Extraction • Ranking • Search Snippets • Experiments • Work In Progress: Spreadsheets

  4. Outline • Intro • Related Work • Our Approach • Metadata Extraction • Ranking • Search Snippets • Experiments • Work In Progress: Spreadsheets

  5. Telephone System Manhole Drawing, fromArias, et al, Pattern Recognition Letters 16, 1995

  6. Sample Architectural Drawing, from Ah-Soon and Tombre, Proc’ds of Fourth Int’l Conf on Document Analysis and Recognition, 1997.

  7. Previous Work • Understanding diagrams isn’t new • Understanding a Web’s worth of diagrams is new • Need to search statistical diagrams in medicine, economics, biology, physics, etc • The (old) phone company can afford a system tailored for manhole diagrams; we can’t • Effective scaling with # of topics is central goal of domain-independent information extraction

  8. Domain-Independent IE • General IE topic since early 1990s • Goal is to obtain structured information from unstructured raw documents • [Title, Price] from online bookstores • [Director, Film] from discussion boards • [Scientist, Birthday] from biographies • Traditional IE requires topic-specific code, features, data • Supervision costs grow with # of domains • Domain-independent IE does not

  9. Related Work • Domain-independent extraction: • Text (Banko et al, IJCAI 2007; Shinyama+Sekine, HLT-NAACL 2006) • Tables (Cafarella et al., VLDB 2008) • Infoboxes (Wu and Weld, CIKM 2007) • Specific to diagrams, some DI: • Huang et al, “Associating text and graphics…”, ICDAR 2005 • Huang et al, “Model-based chart image recognition”, GREC 2003 • Kaiser et al, “Automatic extraction…”, AAAI 2008 • Liu et al, “Automated analysis…”, IJDAR 2009

  10. Outline • Intro • Related Work • OurApproach • Metadata Extraction • Ranking • Search Snippets • Experiments • Work In Progress: Spreadsheets

  11. Our Approach • Typical Web search pipeline • Crawl Web for documents • Obtain and index text • Make index queryable • Our novel components • Diagram metadata extraction • Custom search ranker • Snippet generator

  12. Metadata Extraction • Recover good (text, x, y) from PDFs • Apply simple role label: title, legend, etc • Does text start with capitalized word? • Ratio of text region height to width • % of words in region that are nouns • Group texts into “model diagram” candidates, throw away unlikely ones • E.g., must include something on x scale • Relabel text using geometric relationships • Distance, angle to diagram’s origin? • Leftmost in diagram? Under a caption?

  13. Search Ranker • Tested four versions • Naïve – standard document relevance • Reference – caption and context only • Field – all fields, equal weighting • Weighted – all fields, trained weights

  14. Snippet Generation • Tested five versions Caption and Context text only No graphics at all Caption and Context text accompany graphics Apply metadata over graphic Apply metadata over graphic 1. Original-snippet 2. Small-snippet 3. Text-snippet 4. Integrated-snippet 5. Enhanced-snippet

  15. Outline • Intro • Related Work • Our Approach • Metadata Extraction • Ranking • Search Snippets • Experiments • Work In Progress: Spreadsheets

  16. Experiments • Crawled Web for scientific papers • From ClueWeb09 • Any URL ending in .pdf from .eduURL • 319K diagrams • Fed data to prototypesearch engine • Evaluated • Metadata extraction • Rank quality • Snippet effectiveness • All results compared against human judgments

  17. 1. Experiments - Extraction

  18. 1. Experiments - Extraction

  19. 2. Experiments - Ranking

  20. 2. Experiments - Ranking Improves ranking 52% over naïve ranker

  21. 3. Experiments - Snippets Improves snippet accuracy 33% over naïve soln

  22. Aggregated Diagrams

  23. Aggregated Diagrams

  24. Aggregated Diagrams

  25. Clustering

  26. Clustering

  27. Clustering

  28. Clustering 1978-2004 2000-2004 2002-2006

  29. Other Applications • Working now • Keyword, axis search • Similar diagrams / similar papers • In future: • Improved academic paper search • “Show plots that support my hypothesis”

  30. Outline • Intro • Related Work • Our Approach • Metadata Extraction • Ranking • Search Snippets • Experiments • Work In Progress: Spreadsheets

  31. Spreadsheets • SAUS has >1,300; we’ve dl’ed 350K • Many tasks: search, facets, integration, etc.

  32. Metadata Recovery

  33. Future Work • Has experiment X ever been run? • WY GDP vs coal production in 2002 • Preemptively compute good diagrams • Lots (most?) structured data lives outside DBMS • Spreadsheets • HTML tables • Log files, sensor readings, experiments, … • Structured search

  34. Conclusions • Metadata extraction enables 52% better search ranking • Extraction-enhanced snippets allow users to choose 33%more accurately • We rely on open information extraction, but extracted data not the main product • Can be successful even with imperfect extractors

  35. Thanks

  36. WebTables [VLDB08, “WebTables…”, Cafarella, Halevy, Wang, Wu, Zhang] • In corpus of 14B raw HTML tables, ~154M are “good” databases • Largest corpus of databases & schemas we know of • The WebTables system: • Recovers good relations from crawl and enables search • Builds novel apps on the recovered data

  37. WebTables Pipeline Inverted Index Raw crawled pages Raw HTML Tables Recovered Relations Relation Search • 2.6M distinct schemas • 5.4M attributes Attribute Correlation Statistics Db The Unreasonable Effectiveness of Data [Halevy, Norvig, Pereira]

  38. Synonym Discovery • Use schema statistics to automatically compute attribute synonyms • More complete than thesaurus • Given input “context” attribute set C: • A = all attrs that appear with C • P = all (a,b) where aA, bA, ab • rm all (a,b) from P where p(a,b)>0 • For each remaining pair (a,b) compute:

  39. Synonym Discovery Examples

More Related