Web Archaeology • Raymie Stata • Compaq Systems Research Center • raymie.stata@compaq.com • www.research.compaq.com/SRC
What is Web Archaeology? • The study of the content of the Web • exploring the Web • sifting through data • making valuable discoveries • Difficult! Because the Web is: • Boundless • Dynamic • Radically decentralized
Some recent results • Empirical studies • Quality of almost-breadth-first crawling • Structure of the Web • Random walks (size of search engines) • Improving the Web experience • Better and more precise search results • Surfing assistants and tools • Data mining • Technologies for “page scraping”
Tools for “Web scale” research (a layered stack) • Apps (use the data): search quality, crawl quality, duplicate elimination, Web characterization • Feature databases (access a subset of the data fast): full-text index, shingleprints, connectivity, term vectors • Data storage (store and access web pages): Myriad • Data collection (download web pages): Mercator, Web Language
Web-scale crawling: Mercator and Atrax
The Mercator web crawler • A high-performance web crawler • Downloads and processes web pages • Written entirely in Java • Runs on any Java-capable platform • Extensible • Wide range of possible configurations • Users can plug in new modules at run-time
Mercator design points • Extensible • Well-chosen extensibility points • Framework for configuration • Multiple threads, synchronous I/O • vs. single thread, asynchronous I/O • Checkpointing • Allows crawls to be restarted • Modules export prepare, commit
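The slides don’t show Mercator’s actual module API, but a minimal sketch of what a pluggable, checkpointable processing module might look like (hypothetical names throughout) is:

// Minimal sketch (hypothetical names, not Mercator's real API) of a pluggable,
// checkpointable processing module.
interface CrawlModule {
    // Called for every downloaded page; new modules can be plugged in at run time.
    void process(String url, byte[] content);

    // Two-phase checkpointing: prepare() flushes state to stable storage,
    // commit() makes the new checkpoint the current one.
    void prepareCheckpoint() throws java.io.IOException;
    void commitCheckpoint() throws java.io.IOException;
}

// Example module: counts pages per URL "extension", checkpointing its counts.
class ExtensionCounter implements CrawlModule {
    private final java.util.Map<String, Integer> counts = new java.util.HashMap<>();

    public void process(String url, byte[] content) {
        int dot = url.lastIndexOf('.');
        String ext = (dot >= 0) ? url.substring(dot + 1) : "(none)";
        counts.merge(ext, 1, Integer::sum);
    }

    public void prepareCheckpoint() throws java.io.IOException {
        // Write counts to a temporary file (omitted); nothing visible until commit.
    }

    public void commitCheckpoint() throws java.io.IOException {
        // Atomically rename the temporary file over the previous checkpoint (omitted).
    }
}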
Atrax, a distributed version of Mercator • Distributes load across cluster of crawlers • Partitions data structures across crawlers • No central bottleneck • Network bandwidth is limiting factor
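The slides don’t give Atrax’s partitioning function; a common scheme, and a plausible sketch here, is to assign each host to a crawler by hashing, so every crawler owns the frontier and data structures for its own hosts and no central queue is needed:

// Sketch of host-hash partitioning (an assumption; the slides do not specify
// Atrax's actual partitioning function).
import java.net.URI;

class HostPartitioner {
    private final int numCrawlers;

    HostPartitioner(int numCrawlers) { this.numCrawlers = numCrawlers; }

    // Returns the index of the crawler responsible for this URL's host.
    int crawlerFor(String url) {
        String host = URI.create(url).getHost();
        // Math.floorMod keeps the result non-negative even if hashCode() is negative.
        return Math.floorMod(host.hashCode(), numCrawlers);
    }
}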
Myriad -- new project • A very large, archival storage system • Scalable to a petabyte • With function shipping • Supports data mining
Myriad Requirements • Large (up to 10K disks) • Commodity hardware (low cost) • Easy to manage • Easy to use (queries vs. code) • Fault tolerance & containment • No backups, tape or otherwise
Two phases of Myriad project • Define service-level interface • Implemented to run on collections of files • Testing and tuning • Build scalable implementation • Cluster storage and processing • Designing now, prototype in summer • Won’t describe today
New service-level interface • Applications normally reach block storage through file systems and databases; Myriad adds a new service-level interface between applications and the storage service • Better suited to this problem and scale • Supports “function shipping”
Myriad interface • Single table database • Stored vs. virtual columns • Virtual columns computed by injected code • Bulk input of new records • Management of code defining virtual columns • Output via select/project queries • User-defined code run implicitly • Support for repeatable random sampling
Example Myriad query [samplingprob=0.1, samplingseed=321223421332] select name, length where insertionDate < Date(00/01/01) && mimeType == “text/html”;
Model for large-scale data mining • Step 1: make an extract • Do data-parallel select and project • Don’t do any sorts, joins, groupings • Step 2: put extract into high-power analysis tool • Sorts, joins, groupings (see the sketch below)
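As a toy illustration of the two-step model (ordinary Java collections stand in for Myriad and the analysis tool, neither of which is shown in the slides), step 1 runs only a data-parallel select/project, and step 2 does the grouping on the much smaller extract:

// Toy illustration of the two-step mining model (not a real Myriad client API).
import java.util.*;
import java.util.stream.*;

class TwoStepMining {
    record PageRecord(String name, long length, String mimeType, String insertionDate) {}
    record Extract(String name, long length) {}

    public static void main(String[] args) {
        List<PageRecord> repository = List.of(
            new PageRecord("http://a/", 1200, "text/html", "1999-12-01"),
            new PageRecord("http://b/x.gif", 900, "image/gif", "1999-11-15"),
            new PageRecord("http://c/", 4100, "text/html", "2000-03-02"));

        // Step 1: data-parallel select/project only (what the storage system would run).
        List<Extract> extract = repository.parallelStream()
            .filter(r -> r.mimeType().equals("text/html")
                      && r.insertionDate().compareTo("2000-01-01") < 0)
            .map(r -> new Extract(r.name(), r.length()))
            .collect(Collectors.toList());

        // Step 2: sorts/joins/groupings happen in the analysis tool, on the small extract.
        Map<Long, Long> pagesByLengthBucket = extract.stream()
            .collect(Collectors.groupingBy(e -> e.length() / 1000, Collectors.counting()));
        System.out.println(pagesByLengthBucket);
    }
}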
Feature Databases • URL DB: URL → pgid • Host DB: pgid → hostid • Link DB: out: pgid → pgid*, in: pgid → pgid* • Term vector DB: pgid → term vector
URL database: prefix compression • Sorted URLs:
http://kiva.net/~markh/surnames.html
http://kiwi-us.com/~amigo/links/index.htm
http://kiwi.emse.fr/
http://kiwi.etri.re.kr/~khshim/internet/bookmark.html
http://kiwi.etri.re.kr/~ksw/bookmark
http://kiwi.futuris.net/linen
http://kiwi.futuris.net/linen/special/backiss.html
• Prefix compressed (length of prefix shared with the previous URL + differing suffix):
0 http://kiva.net/~markh/surnames.html
9 wi-us.com/~amigo/links/index.htm
11 .emse.fr/
13 tri.re.kr/~khshim/internet/bookmark.html
25 sw/bookmark
12 futuris.net/linen
29 /special/backiss.html
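A minimal sketch of the encoding shown above: for each URL in sorted order, store the number of leading characters shared with the previous URL plus the differing suffix.

import java.util.*;

class UrlPrefixCompressor {
    // Encodes sorted URLs as (shared-prefix length, suffix) pairs, as in the example above.
    static List<Map.Entry<Integer, String>> compress(List<String> sortedUrls) {
        List<Map.Entry<Integer, String>> out = new ArrayList<>();
        String prev = "";
        for (String url : sortedUrls) {
            int shared = 0;
            int max = Math.min(prev.length(), url.length());
            while (shared < max && prev.charAt(shared) == url.charAt(shared)) shared++;
            out.add(Map.entry(shared, url.substring(shared)));
            prev = url;
        }
        return out;
    }

    // Decompression just re-extends the previous URL's prefix.
    static List<String> decompress(List<Map.Entry<Integer, String>> entries) {
        List<String> urls = new ArrayList<>();
        String prev = "";
        for (Map.Entry<Integer, String> e : entries) {
            String url = prev.substring(0, e.getKey()) + e.getValue();
            urls.add(url);
            prev = url;
        }
        return urls;
    }
}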
URL compression • Prefix compression: 44 → 14.5 bytes/URL, fast to decompress (~10 µs) • ZIP compression: 14.5 → 9.2 bytes/URL, slow to decompress (~80 µs)
Term vector basics • Basic abstraction for information retrieval • Useful for measuring “semantic” similarity of text • Each page is represented by a “term vector”: a row in a page-by-term matrix whose columns are word stems and phrases and whose entries are counts • Trying to capture “meaning”
Compressing term vectors • Sparse representation • Only store columns with non-zero counts • Lossy representation • Only store “important” columns • “Importance” determined by: • Count of term on page (high ==> important) • Number of pages with term (low ==> important)
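A sketch of building such a vector, assuming a tf/df-style importance weight (the slide only says that a high count on the page and a low number of pages containing the term mean important; the exact formula here is an assumption):

import java.util.*;
import java.util.stream.*;

class TermVectorBuilder {
    // Builds a sparse, lossy term vector: keeps only the maxTerms most "important" terms.
    // Importance here is count-on-page divided by document frequency -- an assumed
    // weighting consistent with the slide, not a stated formula.
    static Map<String, Integer> build(Map<String, Integer> termCounts,   // counts on this page
                                      Map<String, Integer> docFrequency, // pages containing term
                                      int maxTerms) {
        return termCounts.entrySet().stream()
            .filter(e -> e.getValue() > 0)                        // sparse: drop zero counts
            .sorted(Comparator.comparingDouble((Map.Entry<String, Integer> e) ->
                -(double) e.getValue() / docFrequency.getOrDefault(e.getKey(), 1)))
            .limit(maxTerms)                                      // lossy: top terms only
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }
}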
Applications • Categorizing pages • Topic distillation • Filtering pages • Identifying languages • Identifying running text • Relevance feedback (“more like this”) • Abstracting pages
Categorization • Example: a page titled “Bulls take over”: a sports story or a stock-market story? The rest of the page’s term vector disambiguates it
How to categorize a page • Off line: • Collect training set of pages per category (~30K) • Combine training pages into category vectors • ~10K terms per category vector • On line: • Use term vector DB to look up vector of page • Find category vector that best matches this page vector • Use a Bayesian classifier to match vectors • Give no category if match not “definitive”
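The slide says a Bayesian classifier matches the vectors; as a simpler stand-in, the sketch below picks the category vector with the highest cosine similarity to the page vector and reports no category unless the best score clears an illustrative “definitive” threshold:

import java.util.*;

class Categorizer {
    // Stand-in for the Bayesian matcher on the slide: returns the best-matching
    // category, or empty if no match is "definitive". The threshold is illustrative.
    static Optional<String> categorize(Map<String, Double> pageVector,
                                       Map<String, Map<String, Double>> categoryVectors,
                                       double threshold) {
        String best = null;
        double bestScore = 0.0;
        for (Map.Entry<String, Map<String, Double>> cat : categoryVectors.entrySet()) {
            double score = cosine(pageVector, cat.getValue());
            if (score > bestScore) { bestScore = score; best = cat.getKey(); }
        }
        return (best != null && bestScore >= threshold) ? Optional.of(best) : Optional.empty();
    }

    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            na += e.getValue() * e.getValue();
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
        }
        for (double v : b.values()) nb += v * v;
        return (na == 0 || nb == 0) ? 0 : dot / Math.sqrt(na * nb);
    }
}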
Topic drift in topic distillation • Some Web IR algorithms have this structure: • Compute a “seed set” on a query • Find neighborhood by following links • Rank this neighborhood • Topic drift (a problem): • The neighborhood graph includes off-topic nodes • e.g., ubiquitous “Download Microsoft Explorer” links pull the MS home page into the neighborhood
Avoid topic drift with term vectors • Combine term vectors of seed set into topic vector • Detecting topic drift in neighboring nodes: • Compare topic vector with node’s term vector • Inner product works fine • Expunge or down-weight drifting nodes • Integration of feature databases helps!!
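A sketch of that check, assuming the topic vector is a simple term-by-term average of the seed vectors and that drifting nodes are expunged when their inner product with the topic vector falls below a threshold (both choices are assumptions, not stated in the slides):

import java.util.*;
import java.util.stream.*;

class TopicDriftFilter {
    // Topic vector = term-by-term average of the seed pages' term vectors (one simple choice).
    static Map<String, Double> topicVector(List<Map<String, Double>> seedVectors) {
        Map<String, Double> sum = new HashMap<>();
        for (Map<String, Double> v : seedVectors)
            v.forEach((term, w) -> sum.merge(term, w, Double::sum));
        sum.replaceAll((term, w) -> w / seedVectors.size());
        return sum;
    }

    // Keeps only neighborhood nodes whose inner product with the topic vector clears a threshold.
    static List<Map<String, Double>> filter(List<Map<String, Double>> neighborhood,
                                            Map<String, Double> topic, double threshold) {
        return neighborhood.stream()
            .filter(node -> innerProduct(node, topic) >= threshold)
            .collect(Collectors.toList());
    }

    static double innerProduct(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0;
        for (Map.Entry<String, Double> e : a.entrySet())
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
        return dot;
    }
}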
Link database • Goals • Fit links into RAM (fast lookup) • Build in 24 hours • Applications • Page ranking • Web structure • Mirror site detection • Related page detection
Link storage baseline design • Two arrays: a Starts array indexed by page id (…, 104, 105, 106, 107, 108, …) whose entries point into a Links array of destination page ids (…, 106, 115, 101, 72, 208, 111, …)
Link storage: deltas • Store each entry of the Links array as a small delta (difference from a nearby id) rather than an absolute page id; the figure shows the ids 106, 115, 101, 72, 208, 111 replaced by deltas 2, 9, -4, -31, 136, 4
Link storage: compression • Encode the deltas with a variable-length code (small deltas take 4 bits, larger ones 8 or 12 bits in the figure) • About 1.7 bytes/link
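A sketch of the two steps from the last two slides: take differences between successive link ids, then write each delta with a variable-length code. The real Link DB packs deltas into 4/8/12-bit codes; a byte-granular zig-zag varint is used here as a simpler stand-in for the same idea.

import java.io.ByteArrayOutputStream;

class LinkDeltaCodec {
    // Delta-encodes a sorted adjacency list, then writes each delta as a zig-zag varint.
    static byte[] encode(int[] sortedLinkIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int id : sortedLinkIds) {
            int delta = id - prev;
            prev = id;
            int zigzag = (delta << 1) ^ (delta >> 31);    // map signed deltas to unsigned codes
            do {
                int b = zigzag & 0x7f;
                zigzag >>>= 7;
                out.write(zigzag != 0 ? (b | 0x80) : b);  // high bit means "more bytes follow"
            } while (zigzag != 0);
        }
        return out.toByteArray();
    }
}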
The future of Web Archaeology • Driving applications • Web search -- “finding things on the web” • Page classification (topic, community, type) • Purpose-specific search • Web “asset management” (what’s on my site?) • Automated information extraction (price robots) • Multi-billion page web • Dynamics