CS276A: Text Information Retrieval, Mining, and Exploitation. Lecture 15, 26 Nov 2002
Recap: Web Anatomy
(Slide diagram of URL/site structure, with example labels: www.ibm.com, …/~newbie/, /…/…/leaf.htm.)
Recap: Size of the Web
(Slide diagram: overlapping sets E1 and E2 inside the Web.)
• Capture–recapture technique: assumes engines index independent random subsets of the Web
• If E2 contains x% of E1's pages, assume E2 contains x% of the Web as well
• Knowing the size of E2, compute the size of the Web: Size of the Web = 100 · |E2| / x
• Bharat & Broder: 200M (Nov 97), 275M (Mar 98)
• Lawrence & Giles: 320M (Dec 97)
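The estimate above can be sketched in a few lines; the engine size and overlap fraction used here are hypothetical, purely to illustrate the arithmetic.

```python
# Capture-recapture estimate of Web size (sketch; numbers are hypothetical).
def estimate_web_size(size_e2, overlap_fraction):
    """If a fraction `overlap_fraction` of E1's sampled pages are also in E2,
    assume E2 covers that same fraction of the Web:
    size(Web) ~= size(E2) / overlap_fraction (i.e., 100 * E2 / x for x%)."""
    return size_e2 / overlap_fraction

# E.g., if E2 indexes 110M pages and holds 50% of a sample of E1's pages:
estimate_web_size(110_000_000, 0.5)  # 220 million
```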
Recent Measurements
Source: http://www.searchengineshowdown.com/stats/change.shtml
Today’s Topics • Web IR infrastructure • Search deployment • XML intro • XML indexing and search
Web IR Infrastructure • Connectivity Server • Fast access to links to support link analysis • Term Vector Database • Fast access to document vectors to augment link analysis
Connectivity Server [CS1: Bhar98b; CS2 & 3: Rand01] • Fast web graph access to support connectivity analysis • Stores mappings in memory from • URL to outlinks, URL to inlinks • Applications • HITS, PageRank computations • Crawl simulation • Graph algorithms: web connectivity, diameter, etc. • Visualizations
Connectivity Server Usage
(Slide diagram of the pipeline:)
• Input: graph algorithm + URLs + values; Output: URLs + values
• Translation: URLs → FPs (fingerprints) → IDs, and IDs back to URLs, via translation tables on disk
• Execution: the graph algorithm runs in memory
• Table sizes: URL text 9 bytes/URL (compressed from ~80 bytes); FP(64b) → ID(32b): 5 bytes; ID(32b) → FP(64b): 8 bytes; ID(32b) → URLs: 0.5 bytes
ID Assignment
• Partition URLs into 3 sets, sorted lexicographically
• High: max(indegree, outdegree) > 254
• Medium: 254 ≥ max degree > 24
• Low: remaining (75%)
• IDs assigned in sequence (densely)
• Example HIGH IDs: 9891 www.amazon.com/, 9912 www.amazon.com/jobs/, 9821878 www.geocities.com/, 40930030 www.google.com/, 85903590 www.yahoo.com/
Adjacency Lists
• In-memory tables for outlinks and inlinks
• A list index maps from an ID to the start of its adjacency list
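A minimal sketch of the degree-based partitioning and dense ID assignment, using the thresholds from the slide; the URLs and degree values below are hypothetical.

```python
# Degree-based bucketing with the slide's thresholds (> 254 high, > 24 medium).
def bucket(indegree, outdegree):
    d = max(indegree, outdegree)
    if d > 254:
        return "high"
    if d > 24:
        return "medium"
    return "low"

BUCKET_ORDER = {"high": 0, "medium": 1, "low": 2}

def assign_ids(pages):
    """pages: list of (url, indegree, outdegree). IDs assigned densely:
    high bucket first, then medium, then low, lexicographic within each,
    so URLs on the same host tend to receive nearby IDs."""
    order = sorted(pages, key=lambda p: (BUCKET_ORDER[bucket(p[1], p[2])], p[0]))
    return {url: i for i, (url, _, _) in enumerate(order)}

ids = assign_ids([("www.yahoo.com/", 9000, 120),
                  ("www.example.org/tiny", 3, 2),
                  ("www.example.org/mid", 100, 10)])
```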
Adjacency List Compression - I
(Slide diagram: a list index pointing into a sequence of adjacency lists, shown both raw and delta-encoded.)
• Adjacency lists: smaller delta values are exponentially more frequent (80% of links are to the same host)
• Compress deltas with a variable-length encoding (e.g., Huffman)
• List index pointers: 32b for high, base+16b for medium, base+8b for low; average 12b per pointer
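The delta step can be sketched as follows; the plain integer deltas stand in for the variable-length (e.g., Huffman) coding the slide describes, and the IDs are hypothetical.

```python
# Delta-encode an adjacency list relative to the list owner's ID.
def delta_encode(neighbors, own_id):
    """First entry is stored relative to the owner's ID (may be negative);
    subsequent entries are gaps between consecutive sorted neighbor IDs."""
    out, prev = [], own_id
    for n in sorted(neighbors):
        out.append(n - prev)
        prev = n
    return out

def delta_decode(deltas, own_id):
    """Invert delta_encode by accumulating the gaps."""
    out, prev = [], own_id
    for d in deltas:
        prev += d
        out.append(prev)
    return out

links = [104, 105, 106, 153]          # hypothetical outlinks of ID 98
encoded = delta_encode(links, 98)     # small gaps dominate in practice
```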
(Slide diagram: ID-to-adjacency-list lookup, using a base pointer (4 bytes) plus per-ID offsets into the list index and the adjacency lists.)
Adjacency List Compression - II • Inter-list compression • Basis: similar URLs may share links; close in ID space ⇒ adjacency lists may overlap • Approach • Define a representative adjacency list for a block of IDs • The adjacency list of a reference ID, or • the union of the adjacency lists in the block • Represent an adjacency list in terms of deletions and additions when it is cheaper to do so • Measurements • Intra-list + starts: 8–11 bits per link (580M pages / 16GB RAM) • Inter-list: 5.4–5.7 bits per link (870M pages / 16GB RAM)
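The "deletions and additions" idea can be sketched directly; the reference and target lists below are hypothetical.

```python
# Inter-list compression sketch: store a list as edits against a reference
# list when that representation is cheaper than storing the list outright.
def as_edits(reference, lst):
    """Return (additions, deletions) turning `reference` into `lst`."""
    ref_s, lst_s = set(reference), set(lst)
    return sorted(lst_s - ref_s), sorted(ref_s - lst_s)

def cheaper_as_edits(reference, lst):
    """True when the edit representation has fewer entries than the list."""
    adds, dels = as_edits(reference, lst)
    return len(adds) + len(dels) < len(lst)

ref  = [104, 105, 106, 153]   # representative list for the ID block
mine = [104, 106, 153, 200]   # a similar list: 2 edits vs. 4 raw entries
```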
Term Vector Database [Stat00] • Fast access to 50-word term vectors for web pages • Term selection: • restricted to the middle third of the lexicon by document frequency • top 50 words in the document by TF.IDF • Term weighting: deferred until run-time (can be based on term frequency, document frequency, document length) • Applications • Content + connectivity analysis (e.g., topic distillation) • Topic-specific crawls • Document classification • Performance • Storage: 33GB for 272M term vectors • Speed: 17 ms/vector on AlphaServer 4100 (latency to read a disk block)
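The term-selection policy can be sketched as below. This is an illustrative reading of the slide, not the paper's implementation: the document-frequency cutoffs, the tf.idf formula, and all example terms and counts are assumptions.

```python
import math
from collections import Counter

def term_vector(doc_tokens, df, n_docs, k=50):
    """Keep terms in the middle third of the lexicon by document frequency,
    then take the top-k by a simple tf * log(N/df) score."""
    dfs = sorted(df.values())
    lo, hi = dfs[len(dfs) // 3], dfs[2 * len(dfs) // 3]  # middle-third band
    tf = Counter(t for t in doc_tokens if lo <= df.get(t, 0) <= hi)
    scored = {t: f * math.log(n_docs / df[t]) for t, f in tf.items()}
    return sorted(scored, key=scored.get, reverse=True)[:k]

# Hypothetical lexicon: very common words ("the") and very rare ones are
# outside the middle third and thus excluded from the vector.
df = {"the": 100, "a": 90, "web": 50, "index": 40, "rare1": 2, "rare2": 1}
vec = term_vector(["the", "web", "web", "index", "rare1", "a"], df, n_docs=1000, k=2)
```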
(Slide diagram: URLid-to-term-vector lookup architecture. URLid × 64/480 gives an offset into a base pointer (4 bytes) and a bit vector covering 480 URLids, leading to 128-byte term-vector records holding LC:TID and FRQ:RL entries.)
Search Deployment • Web IR is just one (very specific) type of IR • Commercially most important IR application: • Enterprise search (large corporations) • Problem different from Web IR • Peer-2-Peer (P2P) search • Another search deployment strategy
Enterprise Search Deployment
(Slide diagram. Markets: search boxes, web portals, e-commerce, enterprises. Sources: proprietary content (databases, content management, groupware, company web site) and public content (corporate network, World Wide Web), with content location in between.)
Evolution of Enterprise Search
• 1st generation (1985–1993): classic information retrieval. User: trained specialist. Scope: small, closed collections. Technology: pattern/string matching.
• 2nd generation (1994–1999): driven by the WWW. User: everyone. Scope: intranet/extranet. Technology: pattern/string matching plus external factors for relevance ranking, plus categorization.
• 3rd generation (2000+): discovery (text mining). User: everyone and software agents. Scope: structured, semi-structured and unstructured information. Technology: introduction of linguistic and semantic processing.
Enterprise IR is a lot more than search …
• Security: cannot search what you should not read
• Content organization & creation: automatic classification, taxonomy generation, support for multiple languages and multiple formats, conduits into databases and other content management ("homes" for valuable content)
• Information processing tools: annotation, range searches, custom ranking criteria, cross-lingual tools, …
• Individual preferences: personalization, notification, …
Peer-To-Peer (P2P) Search • No central index • Each node in a network builds and maintains own index • Each node has “servent” software • On booting, servent pings ~4 other hosts • Connects to those that respond • Initiates, propagates and serves requests
Which hosts to connect to? • The ones you connected to last time • Random hosts you know of • Request suggestions from central (or hierarchical) nameservers • All govern system’s shape and efficiency
Serving P2P search requests • Send your request to your neighbors • They send it to their neighbors • decrement “time to live” for query • query dies when ttl = 0 • Send search matches back along requesting path
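The request-propagation steps above can be sketched as TTL-limited flooding; the overlay graph is hypothetical, and routing matches back along the request path is omitted for brevity.

```python
# TTL-limited query flooding over an unstructured P2P overlay (sketch).
def flood(graph, start, ttl):
    """graph: node -> list of neighbor nodes. Returns the set of nodes the
    query reaches before its time-to-live runs out."""
    reached, frontier = {start}, [start]
    while ttl > 0 and frontier:
        ttl -= 1                      # decrement "time to live" at each hop
        nxt = []
        for node in frontier:
            for nb in graph[node]:
                if nb not in reached: # the query dies once ttl reaches 0
                    reached.add(nb)
                    nxt.append(nb)
        frontier = nxt
    return reached

# Hypothetical overlay of five servents:
overlay = {"a": ["b", "c"], "b": ["d"], "c": [], "d": ["e"], "e": []}
```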
Some P2P Networks • Gnutella • Kazaa • Bearshare • Aimster • Grokster • Morpheus
P2P: Information Retrieval Issues • Why is this more difficult than centralized IR?
P2P: Information Retrieval Issues • Selection of nodes to query • Merging of results • Spam
What is XML? • eXtensible Markup Language • A framework for defining markup languages • No fixed collection of markup tags • Each XML language targeted for application • All XML languages share features • Enables building of generic tools
Basic Structure • An XML document is an ordered, labeled tree • Character data leaf nodes contain the actual data (text strings) • Data nodes must be non-empty and not adjacent to other character data nodes • Element nodes are each labeled with • a name (often called the element type), and • a set of attributes, each consisting of a name and a value • Element nodes can have child nodes
XML Example
<chapter id="cmds">
  <chaptitle>FileCab</chaptitle>
  <para>This chapter describes the commands that manage the <tm>FileCab</tm>inet application.</para>
</chapter>
Elements • Elements are denoted by markup tags • <foo attr1="value" …> thetext </foo> • Element start tag: foo • Attribute: attr1 • The character data: thetext • Matching element end tag: </foo>
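The ordered, labeled tree described above can be seen directly by parsing the chapter example with Python's standard library:

```python
import xml.etree.ElementTree as ET

# The XML example from the slides, parsed into an ordered, labeled tree.
doc = """<chapter id="cmds">
  <chaptitle>FileCab</chaptitle>
  <para>This chapter describes the commands that manage the
  <tm>FileCab</tm>inet application.</para>
</chapter>"""

root = ET.fromstring(doc)
child_tags = [child.tag for child in root]  # children keep document order
```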
XML vs HTML • Relationship?
XML vs HTML • HTML is a markup language for a specific purpose (display in browsers) • XML is a framework for defining markup languages • HTML can be formalized as an XML language (XHTML) • XML defines logical structure only • HTML: same intention, but has evolved into a presentation language
XML: Design Goals • Separate syntax from semantics to provide a common framework for structuring information • Allow tailor-made markup for any imaginable application domain • Support internationalization (Unicode) and platform independence • Be the future of (semi)structured information (do some of the work now done by databases)
Why Use XML? • Represent semi-structured data (data that are structured, but don’t fit relational model) • XML is more flexible than DBs • XML is more structured than simple IR • You get a massive infrastructure for free
Applications of XML • XHTML • CML – chemical markup language • WML – wireless markup language • ThML – theological markup language
Example (ThML):
<h3 class="s05" id="One.2.p0.2">Having a Humble Opinion of Self</h3>
<p class="First" id="One.2.p0.3">EVERY man naturally desires knowledge
<note place="foot" id="One.2.p0.4">
<p class="Footnote" id="One.2.p0.5"><added id="One.2.p0.6">
<name id="One.2.p0.7">Aristotle</name>, Metaphysics, i. 1.
</added></p>
</note>; but what good is knowledge without fear of God? Indeed a humble rustic who serves God is better than a proud intellectual who neglects his soul to study the course of the stars.
<added id="One.2.p0.8"><note place="foot" id="One.2.p0.9">
<p class="Footnote" id="One.2.p0.10">
Augustine, Confessions V. 4.
</p>
</note></added>
</p>
XML Schemas • Schema = syntax definition of XML language • Schema language = formal language for expressing XML schemas • Examples • DTD • XML Schema (W3C) • Relevance for XML IR • Our job is much easier if we have a (one) schema
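As an illustration, a DTD for the earlier chapter example might look like this; the element names come from that example, but the content models are assumptions made for the sketch.

```dtd
<!ELEMENT chapter   (chaptitle, para*)>
<!ATTLIST chapter   id ID #REQUIRED>
<!ELEMENT chaptitle (#PCDATA)>
<!ELEMENT para      (#PCDATA | tm)*>
<!ELEMENT tm        (#PCDATA)>
```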
XML Tutorial • http://www.brics.dk/~amoeller/XML/index.html • (Anders Møller and Michael Schwartzbach) • Previous (and some following) slides are based on their tutorial
Native XML Database • Uses XML document as logical unit • Should support • Elements • Attributes • PCDATA (parsed character data) • Document order • Contrast with • DB modified for XML • Generic IR system modified for XML
XML Indexing and Search • Most native XML databases have taken a DB approach • Exact match • Evaluate path expressions • No IR-style relevance ranking • Only a few focus on relevance ranking
Timber: XML as DB extension • DB: search tuples • Timber: search trees • Main focus • Complex and variable structure of trees (vs. tuples) • Ordering • XML query optimization vs relational optimization
ToXin • Native XML database • Exploits overall path structure • Supports any general path query • Query evaluation in three stages • Preselection stage • Selection stage • Postselection stage
ToXin: Motivation • Strawman: • Index all paths occurring in database • Does not allow backward navigation • Example query: • find all the titles of articles from 1990
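The strawman's weakness can be made concrete with a toy path index; the document paths and values below are hypothetical.

```python
# Strawman "index all paths" sketch: each root-to-node label path maps to
# the values it reaches. Forward (downward) navigation is a single lookup.
path_index = {
    "/articles/article/year":  ["1990", "1995"],
    "/articles/article/title": ["Web IR", "XML Search"],
}

def lookup(path):
    """Evaluate a simple downward path query against the index."""
    return path_index.get(path, [])

# "Find titles of articles from 1990": selecting year == "1990" matches a
# <year> node, but this flat index keeps no parent links, so there is no
# way to climb back to that node's <article> and descend to its <title>;
# the two value lists cannot be correlated. ToXin's richer index keeps the
# instance-level structure precisely to support such backward navigation.
```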
Query Evaluation Stages • Pre-selection • First navigation down the tree • Selection • Value selection according to filter • Post-selection • Navigation up and down again
Factors Impacting Performance • Data source specific • Document size • Number of XML nodes and values • Path complexity (degree of nesting) • Average value size • Query specific • Selectiveness of path constraint • Size of query answer • Number of elements selected by filter
ToXin: Summary • Efficient native XML database • All paths are indexed (not just from root) • Path index linear in corpus size • Shortcomings • Order of nodes ignored • Semantics of IDRefs ignored What is missing?
IR/Relevance Ranking for XML • Why is this difficult?