in collaboration with Georgiana Ifrim, Gjergji Kasneci, Josiane Parreira, Maya Ramanath, Ralf Schenkel, Fabian Suchanek, Martin Theobald
DB and IR: Two Parallel Universes
Parallel universes forever?

Aspect | Database Systems | Information Retrieval
canonical application | accounting | libraries
data type | numbers, short strings | text
foundation | algebraic / logic based | probabilistic / statistics based
search paradigm | Boolean retrieval (exact queries, result sets/bags) | ranked retrieval (vague queries, result lists)
market leaders | Oracle, IBM DB2, MS SQL Server, etc. | Google, Yahoo!, MSN, Verity, Fast, etc.
Why DB&IR Now? – Application Needs
Simplify life for application areas like:
• Global health-care management for monitoring epidemics
• News archives for journalists, press agencies, etc.
• Product catalogs for houses, cars, vacation places, etc.
• Customer support & CRM in insurance, telecom, retail, software, etc.
• Bulletin boards for social communities
• Enterprise search for projects, skills, know-how, etc.
• Personalized & collaborative search in digital libraries, Web, etc.
• Comprehensive archive of blogs with time-travel search
Typical data:
  Disease (DId, Name, Category, Pathogen …)
  UMLS-Categories ( … )
  Patient (… Age, HId, Date, Report, TreatedDId)
  Hospital (HId, Address …)
Typical query: symptoms of tropical virus diseases and reported anomalies with young patients in central Europe in the last two weeks
Why DB&IR Now? – Platform Desiderata
Desiderata for an integrated DB&IR platform (from the app developer's viewpoint):
• Flexible ranking on text, categorical, and numerical attributes
  • cope with "too many answers" and "no answers"
• Ontologies (dimensions, facets) for products, locations, organizations, etc.
  • for query rewriting (relaxation, strengthening)
• Complex queries combining text & structured attributes
  • XPath/XQuery Full-Text with ranking
• High update rate concurrently with high query load
[Figure: search paradigm vs. data type]
• Structured search (SQL, XQuery) on structured data (records): DB systems
• Unstructured search (keywords) on unstructured data (documents): IR systems, search engines
• Unstructured search (keywords) on structured data: keyword search on relational graphs (IIT Bombay, UCSD, MSR, Hebrew U, CU Hong Kong, Duke U, ...)
• Structured search on unstructured data: querying entities & relations from IE (MSR Beijing, UW Seattle, IBM Almaden, UIUC, MPI, …)
Why DB&IR Forever?

Collection | 2000 | 2007
indexed Web | 2 billion | 20 billion
Flickr photos | --- | 100 million
digital photos | ? | 150 billion
Wikipedia | 8 000 | 1.8 million
OECD researchers | 7.4 million | 8.4 million
patents world-wide | ? | 60 million
US Library of Congress | 115 million | 134 million
Google Scholar | --- | 500 million

Turn the Web, Web 2.0, and Web 3.0 into the world's most comprehensive knowledge base ("semantic DB")!
• Data enrichment at very large scale
• Text and speech are key sources of knowledge production (publications, patents, conferences, meetings, ...)
Outline
• Past: Matter, Antimatter, and Wormholes
• Present: XML and Graph IR
• Future: From Data to Knowledge
Quiz Time
Gerard Salton: in which country was he born (and did he grow up)?
A: USA  B: England  C: Netherlands  D: Germany  E: Singapore  F: Indonesia
Answer: D: Germany
Gerard Salton (1927 – 1995), Prof. at Cornell Univ. 1965 – 1995
Parallel Universes: A Closer Look

Aspect | Matter | Antimatter
user | programmer | your kids
query | precise spec. of info request | approximation of user's real info needs
interaction | via API | process via GUI
strength | indexing, QP | ranking model
weakness | user model | interoperability
eval. measure | efficiency (throughput, response time, TPC-H, XMark, …) | effectiveness (precision, recall, F1, MAP, NDCG, TREC & INEX benchmarks, …)
[Figure: timeline 1990, 1995, 2000, 2005 of work between the DB and IR axes]
• Early milestones: Prob. DB (Cavallo & Pittarelli), Prob. Tuples (Barbara et al.), VAGUE (Motro), Prob. Datalog (Fuhr et al.), Proximal Nodes (Baeza-Yates et al.), WHIRL (Cohen), XPath, XPath Full-Text, INEX
• Areas: Digital Libraries, Struct. Docs, Multimedia IR, Deep Web Search, Graph IR
• Web Query Languages: W3QS, WebOQL, Araneus …
• Semistructured Data: Lore, Xyleme …
• Uncertain & Prob. Relations: Mystiq, Trio …
• 1st Gen. XML IR: XXL, XIRQL, Elixir, JuruXML
• 2nd Gen. XML IR: XRank, Timber, TIJAH, XSearch, FleXPath, CoXML, TopX, MarkLogic, Fast …
• Web Entity Search: Libra, Avatar, ExDB …
• Faceted Search: Flamenco …
WHIRL: IR over Relations [W.W. Cohen: SIGMOD'98]
Add text-similarity selection and join to relational algebra.
Scoring and ranking:
• x_j ~ tf (word j in x) · idf (word j)
• s (<x,y>, q: A~B) = cosine (x.A, y.B), with dampening & normalization
• s (<x,y>, q1 … qm) = [formula not preserved]
Example:
  Select * From Movies M, Reviews R
  Where M.Plot ~ "fight" And M.Year > 1990 And R.Rating > 3
  And M.Title ~ R.Title And M.Plot ~ R.Comment

Movies:
Title | Plot | … | Year
Matrix | In the near future … computer hacker Neo … fight training … | … | 1999
Hero | In ancient China … fights … sword fight … fights Broken Sword … | … | 2002
Shrek 2 | In Far Far Away … our lovely hero fights with cat killer … | … | 2004

Reviews:
Title | Comment | … | Rating
Matrix 1 | … cool fights … new techniques … | … | 4
Matrix Reloaded | … fights … and more fights … fairly boring … | … | 1
Matrix Eigenvalues | … matrix spectrum … orthonormal … | … | 5
Ying xiong aka. Hero | … fight for peace … sword fight … dramatic colors … | … | 5

• DB&IR for query-time data integration
• More recent work: MinorThird, Spider, DBLife, etc.
• But scoring models fairly ad hoc
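The similarity join M.Title ~ R.Title from the WHIRL slide can be illustrated with a small sketch. This is not Cohen's implementation: the function names, the whitespace tokenizer, and the particular dampening (1 + log tf) and idf (log(1 + N/df)) formulas are assumptions chosen for illustration. It only shows the idea of ranking tuple pairs by tf*idf cosine similarity instead of exact equality.

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """Return one L2-normalized tf*idf vector (as a dict) per text."""
    docs = [Counter(t.lower().split()) for t in texts]
    n = len(docs)
    df = Counter(word for d in docs for word in d)  # document frequency
    vectors = []
    for d in docs:
        # Dampened tf times idf; the exact formulas are an assumption.
        v = {w: (1 + math.log(tf)) * math.log(1 + n / df[w]) for w, tf in d.items()}
        norm = math.sqrt(sum(x * x for x in v.values())) or 1.0
        vectors.append({w: x / norm for w, x in v.items()})
    return vectors

def cosine(u, v):
    """Cosine of two already normalized sparse vectors."""
    return sum(x * v.get(w, 0.0) for w, x in u.items())

def similarity_join(left, right, k=3):
    """Rank all (left, right) pairs by cosine similarity; keep the top k non-zero ones."""
    vecs = tfidf_vectors(left + right)
    lv, rv = vecs[:len(left)], vecs[len(left):]
    pairs = [(cosine(a, b), x, y) for x, a in zip(left, lv) for y, b in zip(right, rv)]
    pairs = [p for p in pairs if p[0] > 0]
    return sorted(pairs, key=lambda p: p[0], reverse=True)[:k]

# Titles taken from the slide's Movies and Reviews tables.
movie_titles = ["Matrix", "Hero", "Shrek 2"]
review_titles = ["Matrix 1", "Matrix Reloaded", "Matrix Eigenvalues", "Ying xiong aka. Hero"]
for score, m, r in similarity_join(movie_titles, review_titles, k=10):
    print(f"{score:.2f}  {m} ~ {r}")
```

Note that "Matrix Eigenvalues" joins with "Matrix" just as well as the real reviews do, which is one way to see why the slide calls such scoring models fairly ad hoc.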
XXL: Early XML IR [Anja Theobald, GW: Adding Relevance to XML, WebDB'00]
Union of heterogeneous sources without global schema.
[Figure: two XML document trees, one rooted at Professor (Name: Gerhard Weikum; Address with City: SB and Country: Germany; Teaching; Research) and one rooted at Lecturer (Name: Ralf Schenkel; Address: Max-Planck Institute for Informatics, Germany; Activities). Further elements include Course (Title: IR; Description: Information retrieval ...; Syllabus; Literature with Book, Article), Project (Title: Intelligent Search of Heterogeneous XML Data; Funding: EU), Seminar (Contents: Ranked retrieval …), Scientific (Name: INEX task coordinator (Initiative for the Evaluation of XML …); Sponsor: EU), and Other]
Similarity-aware XPath:
  //~Professor[//* = "~SB"] [//~Course [//* = "~IR"] ] [//~Research [//* = "~XML"]]
Which professors from Saarbruecken (SB) are teaching IR and have research projects on XML?
XXL: Early XML IR [Anja Theobald, GW: Adding Relevance to XML, WebDB'00]
Motivation: union of heterogeneous sources has no schema.
Similarity-aware XPath:
  //~Professor[//* = "~Saarbruecken"] [//~Course [//* = "~IR"] ] [//~Research [//* = "~XML"]]
Which professors from Saarbruecken (SB) are teaching IR and have research projects on XML?
[Figure: the Professor and Lecturer XML trees next to an ontology graph around "professor", with neighbors such as researcher, scientist, scholar, lecturer, mentor, teacher, intellectual, investigator, wizard, magician, alchemist, artist, director, primadonna, and "academic, academician, faculty member"; edges carry weighted relations, e.g. RELATED (0.48), HYPONYM (0.749)]
Query expansion model: disjunction of tags.
Similarity measures for ontology edges:
• Wu & Palmer: |path| through lca(x,y)
• Dice coeff.: 2 #(x,y) / (#x + #y) on Web
Scoring and ranking:
• tf*idf for content condition
• ontological similarity for relaxed tag condition
• score aggregation with probabilistic independence
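A minimal sketch of two ingredients named on this slide: the Dice coefficient for tag similarity and score aggregation under probabilistic independence. The function names and the co-occurrence counts are hypothetical, and reading "probabilistic independence" as a product for conjunctions and a noisy-or for disjunctions is an assumption about the usual interpretation, not a statement of XXL's exact formulas.

```python
def dice(count_x, count_y, count_xy):
    """Dice coefficient 2 * #(x,y) / (#x + #y) from (co-)occurrence counts."""
    total = count_x + count_y
    return 2.0 * count_xy / total if total else 0.0

def and_score(scores):
    """Conjunction of independent conditions: product of their scores."""
    result = 1.0
    for s in scores:
        result *= s
    return result

def or_score(scores):
    """Disjunction of independent conditions: 1 - prod(1 - s)."""
    miss = 1.0
    for s in scores:
        miss *= 1.0 - s
    return 1.0 - miss

# Hypothetical Web counts for the tags "professor" and "lecturer".
tag_sim = dice(count_x=1000, count_y=500, count_xy=300)
# A relaxed tag condition combined with a content condition scored 0.9.
print(tag_sim, and_score([tag_sim, 0.9]))
# A tag expanded into a disjunction of two alternatives, each scoring 0.5.
print(or_score([0.5, 0.5]))
```

Treating scores as probabilities keeps every aggregate in [0, 1], so a relaxed tag match can only lower a result's score relative to an exact match.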
The Past: Lessons Learned
• DB&IR: added flexible ranking to (semi-)structured querying to cope with schema and instance diversity
  • but ranking seems "ad hoc" and is not consistently good in benchmarks
  • to win a benchmark, tuning is needed, but tuning is easier if ranking is principled!
• Ontologies are a mixed blessing:
  • quality is diverse, concept similarity is subtle, danger of topic drift
• Ontology-based query expansion (into large disjunctions) poses an efficiency challenge
  • e.g. //~Professor [...] becomes //{Professor, Researcher, Lecturer, Scientist, Scholar, Academic, ...}[...]
[Figure: precision vs. recall plot, and a concept hierarchy with the nodes entity, substance, solid, food, produce, edible fruit, element, pome, apple, Golden Delicious, gold]
Outline
• Past: Matter, Antimatter, and Wormholes
• Present: XML and Graph IR
• Future: From Data to Knowledge
Quiz Time
Which is the largest XML data collection in the universe?
A: Yahoo! Answers  B: INEX benchmark  C: Derwent WPI  D: Elsevier Scopus  E: 51.com  F: Traffic violations in EU
Answer: C: Derwent WPI
TopX: 2nd Generation XML IR [Martin Theobald, Ralf Schenkel, GW: VLDB'05, VLDB Journal]
• Exploit tags & structure for better precision
• Can relax tag names & structure for better recall
• Principled ranking by probabilistic IR (Okapi BM25 for XML)
• Efficient top-k query processing (using improved TA)
• Robust ontology integration (self-throttling to avoid topic drift)
• Efficient query expansion (on demand, by extended TA)
• Relevance feedback for automatic query rewriting
"Semantic" XPath Full-Text query:
  /Article [ftcontains(//Person, "Max Planck")] [ftcontains(//Work, "quantum physics")] //Children[@Gender = "female"]//Birthdates
supported by TopX engine:
  http://infao5501.ag5.mpi-sb.mpg.de:8080/topx/
  http://topx.sourceforge.net
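For readers unfamiliar with TA (the threshold algorithm for top-k queries), the following is a minimal sketch of the basic textbook version, not of the improved or extended variants that TopX uses. The function name, the sum aggregation, and the toy index lists are assumptions made for illustration.

```python
import heapq

def threshold_algorithm(index_lists, k):
    """Basic threshold algorithm.

    index_lists: one non-empty list per query term of (doc_id, score) pairs,
    sorted by score in descending order. The aggregated score of a document
    is the sum of its per-term scores (a missing entry counts as 0).
    Returns the top-k (doc_id, score) pairs and the scan depth reached.
    """
    lookup = [dict(lst) for lst in index_lists]  # simulates random access
    seen = {}                                    # doc_id -> aggregated score
    depth = 0
    max_depth = max(len(lst) for lst in index_lists)
    while depth < max_depth:
        threshold = 0.0
        for lst in index_lists:
            if depth < len(lst):
                doc, score = lst[depth]          # sorted access
                threshold += score
                if doc not in seen:
                    seen[doc] = sum(table.get(doc, 0.0) for table in lookup)
        depth += 1
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        # Stop once no unseen document can beat the current k-th best score.
        if len(top) == k and top[-1][1] >= threshold:
            return top, depth
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1]), depth

lists = [
    [("d1", 0.9), ("d2", 0.8), ("d3", 0.3), ("d4", 0.1)],
    [("d2", 0.7), ("d1", 0.6), ("d4", 0.2), ("d3", 0.1)],
]
top, depth = threshold_algorithm(lists, k=2)
print(top, "after scanning", depth, "of", len(lists[0]), "positions per list")
```

The point of the early stop is that the lower parts of the index lists are never read, which matters when query expansion multiplies the number of lists.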
Commercial Break [Martin Theobald, Ralf Schenkel, GW: VLDB'05] TopX demo today 3:30 – 5:30
Principled Ranking by Probabilistic IR
"God does not play dice." (Einstein) IR does.
• Binary features, conditional independence of features [Robertson & Sparck-Jones 1976]
• Odds for item d with terms d_i being relevant for query q = {q_1, …, q_m}, expressed via per-term probabilities p_i and q_i
• Now estimate p_i and q_i values from relevance feedback, pseudo-relevance feedback, or corpus statistics by MLE (with statistical smoothing), and store precomputed p_i, q_i in the index
• Relationship to tf*idf
• Led to Okapi BM25 (wins TREC tasks); adapted and extended to XML in TopX, ...
• Related to but different from statistical language models
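The formula itself did not survive in this transcript. As a hedged reconstruction, the standard textbook form of the binary independence model, which the slide's p_i and q_i most likely refer to (an assumption), reads:

```latex
\begin{align*}
O(R \mid d, q) &= \frac{P[R \mid d, q]}{P[\bar{R} \mid d, q]}
  \;\propto\; \prod_{i \in q} \frac{P[d_i \mid R]}{P[d_i \mid \bar{R}]} \\
\text{with}\quad p_i &= P[d_i = 1 \mid R], \qquad q_i = P[d_i = 1 \mid \bar{R}] \\
\mathrm{score}(d, q) &= \sum_{i \in q,\; d_i = 1} \log \frac{p_i \,(1 - q_i)}{q_i \,(1 - p_i)}
\end{align*}
```

Here the index i ranges over query terms, R denotes relevance, and the score is rank-equivalent to the odds. If p_i is held constant and q_i is estimated as df_i / N from corpus statistics, each summand behaves like an idf weight, which is one way to read the "relationship to tf*idf" bullet.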
Probabilistic Ranking for SQL [S. Chaudhuri, G. Das, V. Hristidis, GW: TODS'06]
SQL queries that return many answers need ranking.
Examples:
• Houses (Id, City, Price, #Rooms, View, Pool, SchoolDistrict, …)
  Select * From Houses Where View = "Lake" And City In ("Redmond", "Bellevue")
• Movies (Id, Title, Genre, Country, Era, Format, Director, Actor1, Actor2, …)
  Select * From Movies Where Genre = "Romance" And Era = "90s"
Odds for tuple d with attributes X ∪ Y being relevant for query q: X1 = x1 ∧ … ∧ Xm = xm
Estimate probabilities, exploiting workload W.
Example: the frequent queries
• … Where Genre = "Romance" And Actor1 = "Hugh Grant"
• … Where Actor1 = "Hugh Grant" And Actor2 = "Julia Roberts"
boost HG and JR movies in the ranking for Genre = "Romance" And Era = "90s"
From Tables and Trees to Graphs [BANKS, Discover, DBExplorer, KUPS, SphereSearch, BLINKS]
Schema-agnostic keyword search over multiple tables: graph of tuples with foreign-key relationships as edges.
Example:
  Conferences (CId, Title, Location, Year)
  Journals (JId, Title)
  CPublications (PId, Title, CId)
  JPublications (PId, Title, Vol, No, Year)
  Authors (PId, Person)
  Editors (CId, Person)
  Select * From * Where * Contains "Gray, DeWitt, XML, Performance" And Year > 95
Result is a connected tree with nodes that contain as many query keywords as possible.
Ranking: combines nodeScore, based on tf*idf or prob. IR, and edgeScore, reflecting importance of relationships (or confidence, authority, etc.)
Top-k querying: compute best trees, e.g. Steiner trees (NP-hard)
Related use cases:
• XML beyond trees
• RDF graphs
• ER graphs (e.g. from IE)
• social networks
The Present: Observations & Opportunities
• Probabilistic IR and statistical language models yield principled ranking and high effectiveness (related to prob. relational models (Suciu, Getoor, …) but different)
• Structural similarity and ranking based on tree edit distance (FleXPath, Timber, …)
• Aim for comprehensive XML ranking model capturing content, structure, ontologies
• Aim to generate structure skeleton in XPath query from user feedback
  • e.g. "life physicist Max Planck" becomes //article[//person "Max Planck"] [//category "physicist"] //biography
• Good progress on performance, but still many open efficiency issues
[Figure: structurally different trees over the tags movie, actor, director, plot]
Outline
• Past: Matter, Antimatter, and Wormholes
• Present: XML and Graph IR
• Future: From Data to Knowledge
Quiz Time
Who said:
  Information is not knowledge. Knowledge is not wisdom.
  Wisdom is not truth. Truth is not beauty.
  Beauty is not love. Love is not music. Music is the best.
A: Richard Feynman  B: Sigmund Freud  C: Larry Page  D: Frank Zappa  E: Marie Curie  F: Lao-tse
Answer: D: Frank Zappa
Knowledge Queries
Turn the Web, Web 2.0, and Web 3.0 into the world's most comprehensive knowledge base ("semantic DB")!
Answer "knowledge queries" such as:
• proteins that inhibit both protease and some other enzyme
• neutron stars with X-ray bursts > 10^40 erg s^-1 & black holes in 10''
• differences in Rembetiko music from Greece and from Turkey
• connection between Thomas Mann and Goethe
• market impact of Web 2.0 technology in December 2006
• sympathy or antipathy for Germany from May to August 2006
• Nobel laureate who survived both world wars and his children
• drama with three women making a prophecy to a British nobleman that he will become king
Three Roads to Knowledge
• Handcrafted High-Quality Knowledge Bases (Semantic-Web-style ontologies, encyclopedias, etc.)
• Large-scale Information Extraction & Harvesting (using pattern matching, NLP, statistical learning, etc. for product search, Web entity/object search, ...)
• Social Wisdom from Web 2.0 Communities (social tagging, folksonomies, human computing, e.g.: del.icio.us, flickr, answers.yahoo, iknow.baidu, ...)
High-Quality Knowledge Sources
• universal "common-sense" ontologies:
  • SUMO (Suggested Upper Merged Ontology): 60 000 OWL axioms
  • Cyc: 5 million facts (OpenCyc: 2 million facts)
• domain-specific ontologies:
  • UMLS (Unified Medical Language System): 1 million biomedical concepts, 135 categories, 54 relations (e.g. virus causes disease | symptom)
  • GeneOntology, etc.
• thesauri and concept networks:
  • WordNet: 200 000 concepts (word senses) and hypernym/hyponym relations
  • can be cast into OWL-lite (or typed graph with statistical weights)
• lexical sources:
  • Wikipedia (1.8 million articles, 40 million links, 100 languages) etc.
• hand-tagged natural-language corpora:
  • TEI (Text Encoding Initiative) markup of historic encyclopedia
  • FrameNet: sentences classified into frames with semantic roles
growing with strong momentum
High-Quality Knowledge Sources
General-purpose thesauri and concept networks: WordNet family
• can be cast into OWL-lite or into a graph, with weights for relation strengths (derived from co-occurrence statistics)
Example entry:
  enzyme -- (any of several complex proteins that are produced by cells and act as catalysts in specific biochemical reactions)
    => protein -- (any of a large group of nitrogenous organic compounds that are essential constituents of living cells; ...)
      => macromolecule, supermolecule
      ...
        => organic compound -- (any compound of carbon and another element or a radical)
        ...
    => catalyst, accelerator -- ((chemistry) a substance that initiates or accelerates a chemical reaction without itself being affected)
      => activator -- ((biology) any agency bringing about activation; ...)
High-Quality Knowledge Sources Wikipedia and other lexical sources
Exploit Hand-Crafted Knowledge
Wikipedia, WordNet, and other lexical sources
{{Infobox_Scientist
| name = Max Planck
| birth_date = [[April 23]], [[1858]]
| birth_place = [[Kiel]], [[Germany]]
| death_date = [[October 4]], [[1947]]
| death_place = [[Göttingen]], [[Germany]]
| residence = [[Germany]]
| nationality = [[Germany|German]]
| field = [[Physicist]]
| work_institution = [[University of Kiel]]</br> [[Humboldt-Universität zu Berlin]]</br> [[Georg-August-Universität Göttingen]]
| alma_mater = [[Ludwig-Maximilians-Universität München]]
| doctoral_advisor = [[Philipp von Jolly]]
| doctoral_students = [[Gustav Ludwig Hertz]]</br> …
| known_for = [[Planck's constant]], [[Quantum mechanics|quantum theory]]
| prizes = [[Nobel Prize in Physics]] (1918) …
YAGO: Yet Another Great Ontology [F. Suchanek, G. Kasneci, GW: WWW 2007]
• Turn Wikipedia into explicit knowledge base (semantic DB)
• Exploit hand-crafted categories and templates
• Represent facts as explicit knowledge triples: relation (entity1, entity2) (in 1st-order logic, compatible with RDF, OWL-lite, XML, etc.)
• Map (and disambiguate) relations into WordNet concept DAG
Examples:
relation | entity1 | entity2
bornIn | Max_Planck | Kiel
isInstanceOf | Kiel | City
YAGO Knowledge Representation

Knowledge Base | # Facts
KnowItAll | 30 000
SUMO | 60 000
WordNet | 200 000
OpenCyc | 300 000
Cyc | 5 000 000
YAGO | 6 000 000

Accuracy: 97%
[Figure: YAGO knowledge graph in three layers]
• concepts: Entity, Person, Location, Scientist, Biologist, Physicist, City, Country, connected by subclass edges
• individuals: Max_Planck, Erwin_Planck, Kiel, Nobel Prize, with instanceOf edges to concepts and the fact edges bornIn, bornOn (April 23, 1858), diedOn (October 4, 1947), hasWon, FatherOf
• words: "Max Karl Ernst Ludwig Planck", "Dr. Planck", "Max Planck", linked to individuals by means edges
Online access and download at http://www.mpi-inf.mpg.de/~suchanek/yago/
NAGA: Graph IR on YAGO [G. Kasneci et al.: WWW'07]
Graph-based search on YAGO-style knowledge bases with built-in ranking based on confidence and informativeness (statistical language model for result graphs)
• conjunctive queries, e.g.: $x bornIn Kiel; $x isa scientist
• queries with regular expressions, e.g.: $x (hasFirstName | hasLastName) Ling; $x isa scientist; $x (coAuthor | advisor)* Beng Chin Ooi; $x worksFor $y; $y locatedIn* Zhejiang
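To make the notion of a conjunctive query over knowledge triples concrete, here is a minimal sketch. It is not NAGA's engine or API: the `match` and `query` helpers and the toy fact list are hypothetical, facts are written as (subject, relation, object), and neither ranking nor regular-expression paths are modeled.

```python
def match(pattern, fact, binding):
    """Unify one (s, p, o) pattern with one fact under an existing binding.

    Terms starting with "$" are variables. Returns the extended binding,
    or None if the pattern does not match.
    """
    new = dict(binding)
    for term, value in zip(pattern, fact):
        if term.startswith("$"):
            if new.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new

def query(facts, patterns):
    """Evaluate a conjunctive query: every pattern must match under one binding."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [b2 for b in bindings for fact in facts
                    if (b2 := match(pattern, fact, b)) is not None]
    return bindings

# Toy facts in the spirit of the YAGO examples (hypothetical data).
facts = [
    ("Max_Planck", "bornIn", "Kiel"),
    ("Max_Planck", "isa", "scientist"),
    ("Kiel", "isa", "city"),
    ("Albert_Einstein", "bornIn", "Ulm"),
    ("Albert_Einstein", "isa", "scientist"),
]
# The conjunctive query from the slide: $x bornIn Kiel; $x isa scientist
print(query(facts, [("$x", "bornIn", "Kiel"), ("$x", "isa", "scientist")]))
```

A real engine would return many bindings for less selective queries, which is where ranking of the result graphs comes in.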
Ranking Factors
• Confidence: prefer results that are likely to be correct
  • certainty of IE
  • authenticity and authority of sources
  • e.g. bornIn (Max Planck, Kiel) from "Max Planck was born in Kiel" (Wikipedia) vs. livesIn (Elvis Presley, Mars) from "They believe Elvis hides on Mars" (Martian Bloggeria)
• Informativeness: prefer results that are likely important; may prefer results that are likely new to user
  • frequency in answer
  • frequency in corpus (e.g. Web)
  • frequency in query log
  • e.g. q: isa (Einstein, $y) with answers isa (Einstein, scientist) and isa (Einstein, vegetarian)
  • e.g. q: isa ($x, vegetarian) with answers isa (Einstein, vegetarian) and isa (Al Nobody, vegetarian)
• Compactness: prefer results that are tightly connected
  • size of answer graph
[Figure: answer graphs connecting Einstein, Bohr, Tom Cruise, vegetarian, Nobel Prize, and 1962 via isa, won, bornIn, and diedIn edges]
Information Extraction (IE): Text to Records
Combine NLP, pattern matching, lexicons, statistical learning.

Person | BirthDate | BirthPlace | ...
Max Planck | 4/23, 1858 | Kiel
Albert Einstein | 3/14, 1879 | Ulm
Mahatma Gandhi | 10/2, 1869 | Porbandar

Person | ScientificResult
Max Planck | Quantum Theory

Person | Collaborator
Max Planck | Albert Einstein
Max Planck | Niels Bohr

Constant | Value | Dimension
Planck's constant | 6.626 × 10^-34 | Js
Knowledge Acquisition from the Web
Learn semantic relations from entire corpora at large scale (as exhaustively as possible but with high accuracy).
Examples:
• all cities, all basketball players, all composers
• headquarters of companies, CEOs of companies, synonyms of proteins
• birthdates of people, capitals of countries, rivers in cities
• which musician plays which instruments
• who discovered or invented what
• which enzyme catalyzes which biochemical reaction
Existing approaches and tools (Snowball [Gravano et al. 2000], KnowItAll [Etzioni et al. 2004], …): almost-unsupervised pattern matching and learning:
  seeds (known facts) → patterns (in text) → (extraction) rule → (new) facts
Methods for Web-Scale Fact Extraction
Bootstrapping loop: seeds → text → rules → new facts
Example:
seed | text | rule
city (Seattle) | in downtown Seattle | in downtown X
city (Seattle) | Seattle and other towns | X and other towns
city (Las Vegas) | Las Vegas and other towns | X and other towns
plays (Zappa, guitar) | playing guitar: … Zappa | playing Y: … X
plays (Davis, trumpet) | Davis … blows trumpet | X … blows Y
New facts found by applying the rules:
text | new fact
in downtown Beijing | city (Beijing)
Coltrane blows sax | plays (Coltrane, sax)
New rules learned from the new facts:
fact | text | rule
city (Beijing) | old center of Beijing | old center of X
plays (Coltrane, sax) | sax player Coltrane | Y player X
Assessment of facts & generation of rules based on statistics.
Rules can be more sophisticated:
  playing NN: (ADJ|ADV)* NP & class(NN)=instrument & class(head(NP))=person → plays(head(NP), NN)
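The seeds, rules, new facts loop can be sketched in a few lines for the unary city(X) relation. This is a toy illustration, not Snowball or KnowItAll: the helper names, the `{X}` slot marker, and the capitalized-word regex standing in for entity recognition are all assumptions, and the statistical assessment of facts and rules mentioned on the slide is left out entirely.

```python
import re

SLOT = "{X}"
# Capitalized word sequence; a crude stand-in for real entity recognition.
ENTITY = r"([A-Z]\w*(?: [A-Z]\w*)*)"

def learn_patterns(known_entities, snippets):
    """Turn every snippet that mentions a known entity into a pattern with a slot."""
    patterns = set()
    for snippet in snippets:
        for entity in known_entities:
            if entity in snippet:
                patterns.add(snippet.replace(entity, SLOT, 1))
    return patterns

def apply_patterns(patterns, snippets):
    """Match patterns against snippets and collect the slot fillers."""
    found = set()
    for pattern in patterns:
        prefix, suffix = pattern.split(SLOT, 1)
        regex = re.compile(re.escape(prefix) + ENTITY + re.escape(suffix))
        for snippet in snippets:
            m = regex.fullmatch(snippet)
            if m:
                found.add(m.group(1))
    return found

def bootstrap(seeds, snippets, rounds=2):
    """Alternate between learning patterns from facts and facts from patterns."""
    facts, patterns = set(seeds), set()
    for _ in range(rounds):
        patterns |= learn_patterns(facts, snippets)
        facts |= apply_patterns(patterns, snippets)
    return facts, patterns

# Text snippets taken from the slide's city(X) example.
snippets = [
    "in downtown Seattle",
    "Seattle and other towns",
    "Las Vegas and other towns",
    "in downtown Beijing",
    "old center of Beijing",
]
cities, rules = bootstrap({"Seattle", "Las Vegas"}, snippets)
print(sorted(cities))
print(sorted(rules))
```

Without any assessment step, one bad pattern would pull in wrong facts that in turn generate more bad patterns, which is why the slide stresses statistics-based assessment.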
Performance of Web-IE
State-of-the-art precision/recall results:

relation | precision | recall | corpus | systems
countries | 80% | 90% | Web | KnowItAll
cities | 80% | ??? | Web | KnowItAll
scientists | 60% | ??? | Web | KnowItAll
headquarters | 90% | 50% | News | Snowball, LEILA
birthdates | 80% | 70% | Wikipedia | LEILA
instanceOf | 40% | 20% | Web | Text2Onto, LEILA
Open IE | 80% | ??? | Web | TextRunner

Precision value-chain: entities 80%, attributes 70%, facts 60%, events 50%
Anecdotal evidence:
• plausible extractions: invented (A.G. Bell, telephone); married (Hillary Clinton, Bill Clinton); isa (yoga, relaxation technique); isa (zearalenone, mycotoxin); contains (chocolate, theobromine); contains (Singapore sling, gin)
• dubious extractions: invented (Johannes Kepler, logarithm tables); married (Segolene Royal, Francois Hollande); isa (yoga, excellent way); isa (your day, good one); contains (chocolate, raisins); plays (the liver, central role); makes (everybody, mistakes)
Beyond Surface Learning with LEILA
Learning to Extract Information by Linguistic Analysis [F. Suchanek, G. Ifrim, GW: KDD'06]
Limitation of surface patterns, e.g. for "who discovered or invented what":
• "Tesla's work formed the basis of AC electric power"
• "Al Gore funded more work for a better basis of the Internet"
Almost-unsupervised statistical learning with dependency parsing:
• pairs shown on the slide: (Cologne, Rhine), (Cairo, Nile), … and (Cairo, Rhine), (Rome, 0911), (, [0..9]*), …
• "Cologne lies on the banks of the Rhine"
• "People in Cairo like wine from the Rhine valley"
• "Paris was founded on an island in the Seine" yields (Paris, Seine)
[Figure: phrase-structure labels (NP, VP, PP) and dependency link labels (Ss, MVp, Ds, Js, Mp, Os, …) for the three sentences]
• LEILA outperforms other Web-IE methods in terms of precision, recall, F1, but:
  • dependency parser is slow
  • one relation at a time
IE Efficiency and Accuracy Tradeoffs [see also tutorials by Cohen, Doan/Ramakrishnan/Vaithyanathan, Agichtein/Sarawagi]
IE is cool, but what's in it for DB folks?
• precision vs. recall: two-stage processing (filter pipeline)
  • recall-oriented harvesting
  • precision-oriented scrutinizing
• preprocessing
  • indexing: NLP trees & graphs, N-grams, PoS-tag patterns?
  • exploit ontologies? exploit usage logs?
• turn crawl & extract into set-oriented query processing
  • candidate finding
  • efficient phrase, pattern, and proximity queries
• optimizing entire text-mining workflows [Ipeirotis et al.: SIGMOD'06]
The Future: Challenges
• Generalize YAGO approach (Wikipedia + WordNet)
• Methods for comprehensive, highly accurate mappings across many knowledge sources
  • cross-lingual, cross-temporal
  • scalable in size, diversity, number of sources
• Pursue DB support towards efficient IE (and NLP)
• Achieve Web-scale IE throughput that can sustain rate of new content production (e.g. blogs) with > 90% accuracy and Wikipedia-like coverage
• Integrate handcrafted knowledge with NLP/ML-based IE
• Incorporate social tagging and human computing
Outline
• Past: Matter, Antimatter, and Wormholes
• Present: XML and Graph IR
• Future: From Data to Knowledge
Major Trends in DB and IR
[Figure: converging trends, Database Systems on one side and Information Retrieval on the other]
Database Systems | Information Retrieval
malleable schema (later) | deep NLP, adding structure
record linkage | info extraction
graph mining | entity-relationship graph IR
dataspaces | Web objects
ontologies | statistical language models
data uncertainty | ranking
programmability | search as Web Service
Web 2.0 | Web 2.0
Conclusion
• DB&IR integration agenda:
  • models: ranking, ontologies, prob. SQL?, graph IR?
  • languages and APIs: XQuery Full-Text++?
  • systems: drop SQL, go light-weight? combine with P2P, Deep Web, ...?
• Rethink progress measures and experimental methodology
• Address killer app(s) and grand challenge(s):
  • from data to knowledge (Web, products, enterprises)
  • integrate knowledge bases, info extraction, social wisdom
  • cope with uncertainty; ranking as first-class principle
• Bridge cultural differences between DB and IR:
  • co-locate SIGIR and SIGMOD
DB&IR: Both Sides Now
Joni Mitchell (1969): Both Sides Now
  …
  I've looked at life from both sides now,
  From up and down, and still somehow
  It's life's illusions I recall.
  I really don't know life at all.
Thank You!