Archive overview and projects too.
Important links
• Need to sign up for a “library card”:
  • http://www.archive.org/account/login.createaccount.php
• Then you can access the following pages:
  • www.archive.org/web/researcher/researcher.php
  • www.archive.org/web/researcher/data_available.php
  • www.archive.org/web/researcher/parallel.php
  • www.archive.org/web/researcher/example_research_create_arc.php
Machine overview
• Data stored on ~200 desktop computers
• Host names: ia00xxx (e.g., ia00660)
• Initially, you’ll use ia0010[0-7]
• Four 160 GB drives on each machine
  • /0, /1, /2, and /3
  • /1-/3 filled to capacity
  • /0 filled to 1/2 capacity
  • /0/tmp is “temp” space for computations
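The host-name pattern and the drive layout above can be spelled out in a few lines. This is only a sketch: the function names are made up for illustration, the 200-machine figure is the slide's approximation, and the capacity numbers are nominal drive sizes rather than measured usage.

```python
DRIVE_GB = 160        # nominal size of each of the four drives
MACHINES = 200        # "~200" on the slide, so treat totals as rough


def initial_hosts():
    """Expand the pattern ia0010[0-7] into the eight starting host names."""
    return [f"ia0010{digit}" for digit in range(8)]


def used_gb_per_machine():
    """/1-/3 are full and /0 is half full: 3 * 160 + 80 GB."""
    return 3 * DRIVE_GB + DRIVE_GB // 2


def rough_archive_gb():
    """Back-of-the-envelope total across all machines."""
    return used_gb_per_machine() * MACHINES


print(initial_hosts())          # ia00100 through ia00107
print(used_gb_per_machine())    # 560
print(rough_archive_gb())       # 112000, i.e. on the order of 100 TB
```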
Your account
• Fill out the form at: http://www.soe.ucsc.edu/~raymie/290g-userinfo.html
• I’ll take it from there
• Expect an e-mail
Files
• ARC files: contain raw data
  • Multiple documents per file, ~100 MB per file
• DAT files: contain commonly used fields
• CDX files: index of ARC and DAT
  • /0/tmp/complete.cdx: per machine
  • Archive-wide CDXs on 6 machines (wayback)
• All compressed (ARC on page boundaries)
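One way to read “compressed on page boundaries” is that each document is gzipped on its own and the results are concatenated, so an index entry holding a byte offset is enough to pull out one page without decompressing the whole file. The sketch below illustrates that idea with Python's standard library only. It does not parse the real ARC or CDX layouts, and `build_archive` and `read_page` are hypothetical names.

```python
import gzip
import zlib


def build_archive(pages):
    """Gzip each page separately and concatenate; return (blob, offsets)."""
    blob = bytearray()
    offsets = []
    for page in pages:
        offsets.append(len(blob))      # where this page's gzip member starts
        blob += gzip.compress(page)
    return bytes(blob), offsets


def read_page(blob, offset):
    """Decompress exactly one gzip member starting at the given offset."""
    # wbits = MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
    decoder = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
    # The decoder stops at the end of the first member; later members
    # are left untouched in decoder.unused_data.
    return decoder.decompress(blob[offset:])


blob, offsets = build_archive([b"<html>one</html>", b"<html>two</html>"])
print(read_page(blob, offsets[1]))     # b'<html>two</html>'
```

The offsets list plays the role an index would play here: look up the page, seek, and decompress one member.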
Programs
• Unix tools
  • grep, join, cut, awk, perl, screen(!), ...
• Alexa tools
• P2
Alexa tools
• av_arcfilter, av_cat, av_getpage, av_grep, av_prepend_random, av_randomize, av_search, av_sort
P2
• Based on the data-parallel programming model
  • SIMD: single instruction, multiple data
  • Thinking Machines
• Idea: run the same command line on all machines
P2
• p2 program [-c combiner] -p machines
  • program: command line to be run
  • combiner: program to combine results
  • machines: machines to use
    • “-p /net/ia00100 /net/ia00101”
    • “-p $rack1”
    • $rack[1-5], $arcs
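The three parts of the command line (program, combiner, machines) map onto a simple fan-out and combine pattern. The sketch below is not how P2 is implemented. It runs the command locally once per machine name as a stand-in for remote execution, and `run_on` and `p2_like` are hypothetical names.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_on(machine, argv):
    """Stand-in for remote execution: run argv locally, tag output lines."""
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return [(machine, line) for line in result.stdout.splitlines()]


def p2_like(argv, machines, combiner=None):
    """Run the same command once per machine, then combine all output lines."""
    with ThreadPoolExecutor(max_workers=len(machines)) as pool:
        per_machine = list(pool.map(lambda m: run_on(m, argv), machines))
    lines = [pair for chunk in per_machine for pair in chunk]
    # With no combiner, just hand back the (machine, line) pairs.
    return combiner(lines) if combiner else lines


if __name__ == "__main__":
    machines = ["ia00100", "ia00101"]          # labels only in this sketch
    command = [sys.executable, "-c", "print(3)"]
    total = p2_like(command, machines,
                    combiner=lambda ls: sum(int(line) for _, line in ls))
    print(total)                                # 6
```

Passing a combiner here corresponds loosely to the -c option: the per-machine output is reduced to one result instead of being printed line by line.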
P2 - example
• p2 uptime -p $ARCS
  • Returns result of uptime on all machines
• p2 'zcat /0/tmp/complete.cdx.gz | wc -l' -p ..
  • Returns length (in lines) of indexes
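For anyone who would rather script the second example than pipe through zcat and wc, the per-machine step can be sketched in Python as below. `count_lines` is a hypothetical helper, and the path in the usage comment is simply the one from the slide.

```python
import gzip


def count_lines(path):
    """Count lines in a gzip-compressed text file without loading it whole."""
    with gzip.open(path, "rb") as stream:
        return sum(1 for _ in stream)


# On an archive machine this would be called as:
#   count_lines("/0/tmp/complete.cdx.gz")
```

Iterating over the stream keeps memory use flat, which matters for an index covering a whole machine.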
P2
• Output of “subprograms” is sent to the initiating “p2” program
• This program “combines” these lines
• By default, av_cat is used to get them to standard output
• The -c option allows the user to set a combiner
• But lines from different subprograms can be interleaved
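Because lines from different machines can arrive interleaved, a combiner cannot assume that consecutive lines belong together. One common workaround, sketched here as an assumption rather than as something P2 or av_cat does, is to have each subprogram prefix its lines with a host tag so the combiner can regroup them. The tab-separated format and the names `tag_lines` and `regroup` are hypothetical.

```python
from collections import defaultdict


def tag_lines(host, lines):
    """What each subprogram would emit: 'host<TAB>line'."""
    return [f"{host}\t{line}" for line in lines]


def regroup(tagged_lines):
    """Combiner: undo interleaving by grouping lines under their host tag."""
    grouped = defaultdict(list)
    for tagged in tagged_lines:
        host, _, line = tagged.partition("\t")
        grouped[host].append(line)
    return dict(grouped)


a = tag_lines("ia00100", ["x", "y"])
b = tag_lines("ia00101", ["p", "q"])
interleaved = [a[0], b[0], a[1], b[1]]      # simulated arrival order
print(regroup(interleaved))
# {'ia00100': ['x', 'y'], 'ia00101': ['p', 'q']}
```

Order within one host is kept only as long as that host's own lines arrive in order.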
Possible projects
• Crawl catalog
• Counts & histograms
• Page-change study
• Word-change study
• Language id
• Table detection
• RSS download/studies
• Id “soft” 404/30x’s
• Mirror detection
• Javascript link extraction
• Storage redundancy
• URL database
• Validating host counts
• IP sampling vs. crawls
• Correcting for virtual hosts