A Characterization of the Portuguese Web Daniel Gomes and Mário J. Silva University of Lisbon http://xldb.fc.ul.pt
Presentation • Introduction • Setup • Statistics • Conclusions • Future Work
Terminology • Document: file resulting from a successful HTTP download • Publisher: entity responsible for publishing the document on the Web • Web site: collection of documents referenced by URLs that share the same host name
Why Characterize? • Extraction of cultural, commercial and social aspects: • Presence of natural languages • Most popular web servers • Adequate design and tuning of web applications: • The web is described through its characterization. • Parameters of the Web graph • How many nodes compose the graph • Types of these nodes
Characterizing the WWW vs. Community Webs • WWW: • Huge • Sampling is a “must” • The WWW is not uniform • Small partitions are ignored • Community Webs: • + Relevant to a certain community • + Fewer resources • + A complete scan is possible, no sampling! • - Difficult to establish boundaries
WWW.TUMBA.PT Publicly available: • Characterize • Search Almost: • Archive • The Portuguese Web
Main objectives: • Estimate the resources needed to create a web archive of the Portuguese Web; • Validate crawls; • Gather guidelines to improve the systems (crawling, repository, index).
Characterization Setup • Viúva Negra Crawlers: gather information from the Web and insert it into Versus. • Versus: keeps documents in files and meta-data in relations. • Web statistics are produced by issuing SQL queries to the Versus Repository.
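The idea of deriving statistics from SQL queries over crawl meta-data can be sketched as follows. The slides do not describe the Versus relations, so the `document` table, its columns, and the sample rows are hypothetical, and an in-memory SQLite database stands in for the repository.

```python
import sqlite3

# Hypothetical meta-data relation; the real Versus schema is not given in the slides.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE document (
        url         TEXT PRIMARY KEY,
        site        TEXT,
        http_status INTEGER,
        mime_type   TEXT,
        size_bytes  INTEGER
    )
    """
)

# Made-up sample rows, for illustration only.
rows = [
    ("http://a.pt/", "a.pt", 200, "text/html", 1200),
    ("http://a.pt/x", "a.pt", 404, None, 0),
    ("http://b.pt/", "b.pt", 200, "text/html", 800),
    ("http://b.pt/doc.pdf", "b.pt", 200, "application/pdf", 5000),
]
conn.executemany("INSERT INTO document VALUES (?, ?, ?, ?, ?)", rows)


def status_percentages(conn):
    """Return the percentage of crawled URLs per HTTP status code."""
    total = conn.execute("SELECT COUNT(*) FROM document").fetchone()[0]
    query = "SELECT http_status, COUNT(*) FROM document GROUP BY http_status"
    return {status: 100.0 * n / total for status, n in conn.execute(query)}


def documents_per_site(conn):
    """Return (site, number of successfully downloaded documents) pairs."""
    query = (
        "SELECT site, COUNT(*) FROM document "
        "WHERE http_status = 200 GROUP BY site ORDER BY site"
    )
    return conn.execute(query).fetchall()
```

Keeping the meta-data in relations means each new statistic is a new query rather than a new pass over the stored files.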
What is the Portuguese Web? • Set of documents of cultural and sociological interest to the Portuguese people. • Language • Brazilian/Portuguese community web sites • Both written in Portuguese • TLDs • Many sites hosted in gTLDs.
Crawler configuration • Influences statistics • The depth of the crawl influences the number of documents gathered • Replication • Mirrors • URL Aliases • Crawl as many documents as possible • Maintain robustness against pathological situations
VN Configuration Parameters • Text documents (list selected MIME types) • Hosted under “.PT” • Hosted under “.COM”, “.NET”, “.ORG”, “.TV”. • Written in Portuguese • Host site had at least one incoming link originated under “.PT” • Download timeout=60s • Max Size=2MB • Avoid traps: • max docs per site=8000 • crawl at most 50 times the same document
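A minimal sketch of how these parameters could combine into an acceptance test for a fetched document. The slides list the parameters but not how they are combined, so the reading used here is an assumption: documents under ".PT" are accepted directly, while documents under the listed gTLDs must be both written in Portuguese and hosted on a site with an incoming link from ".PT". The `Candidate` type, the function names, and the MIME type set are hypothetical, and 2 MB is read as 2 × 1024 × 1024 bytes.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit

GENERIC_TLDS = {"com", "net", "org", "tv"}
MAX_DOCS_PER_SITE = 8000             # trap avoidance limit from the slide
MAX_SIZE_BYTES = 2 * 1024 * 1024     # "Max Size=2MB", assumed binary megabytes
# Assumed set: the slide does not list the selected MIME types.
TEXT_MIME_TYPES = {"text/html", "text/plain"}


@dataclass
class Candidate:
    """Hypothetical record describing a fetched document."""
    url: str
    mime_type: str
    size_bytes: int
    is_portuguese: bool   # result of some language identification step
    has_pt_inlink: bool   # host site has an incoming link from a .PT site


def top_level_domain(url):
    """Return the lowercase last label of the URL's host name."""
    host = urlsplit(url).hostname or ""
    return host.rsplit(".", 1)[-1].lower()


def in_portuguese_web(doc, docs_seen_on_site):
    """Decide whether a document is kept under the assumed policy."""
    if doc.mime_type not in TEXT_MIME_TYPES:
        return False
    if doc.size_bytes > MAX_SIZE_BYTES:
        return False
    if docs_seen_on_site >= MAX_DOCS_PER_SITE:
        return False
    tld = top_level_domain(doc.url)
    if tld == "pt":
        return True
    if tld in GENERIC_TLDS:
        return doc.is_portuguese and doc.has_pt_inlink
    return False
```

The download timeout and the limit on repeated crawls of the same document would be enforced by the fetcher itself, so they are left out of this predicate.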
Collected Statistics • 4 million URLs and 78 GB. • 83% successfully downloaded (200) • 3.4% not found (404) • 1.2% took more than 1 minute to download • 0.5% bigger than 2 MB
Site statistics • Sites per TLD • Documents per Site
Other Statistics • Average length of a URL is 62 chars • Unknown Last-Modified date: 53% • HTML: 95% • 78 GB of data produced 8.7 GB of text • Meta-tags are scarce (description 17%, keywords 18%) • 15.5% Replication
http://wealth.com.sapo.pt/gui/flat.swf?exbackground=993333&makenavfield0=HitHarvester&makenavfield10=ClickSilo&makenavfield11=BraStart&makenavfield12=AskMiky&makenavfield13=TrafficG&makenavfield14=Click4u&makenavfield1=YesMoreHits&makenavfield2=ClickityCash&makenavfield3=StartFrenzy&makenavfield4=NoMoreHits&makenavfield5=ILoveClicks&makenavfield6=ClixSwap&makenavfield7=EZHits4U&makenavfield8=HitSense&makenavfield9=Clickthru&makenavurl0=http://www.hitharvester.com/referral.asp?ref=kurtz53&makenavurl10=http://www.clicksilo.com/referrals/info.asp?Agent=kurtz53&makenavurl11=http://www.brastart.com/cgi-bin/join.cgi?r=kurtz53&makenavurl12=http://www.askmiky.com/home/signup.php?ref=kurtz53&makenavurl13=http://www.trafficg.com/home.php?member=kurtz53&makenavurl14=http://www.clicks4u.com/X92433/&makenavurl1=http://www.yesmorehits.com/cgi-bin/join.cgi?r=kurtz53&makenavurl2=http://www.clickitycash.com/cgi-bin/join.cgi?refer=52786&makenavurl3=http://www.startfrenzy.com/default.asp?userid=kurtz53&makenavurl4=http://www.nomorehits.com/cgi-bin/start.cgi?referrer=kurtz53&makenavurl5=http://www.iloveclicks.com/signup.asp?referrer=22014&makenavurl6=http://www.clixswap.com/?ref=csa12481&makenavurl7=http://www.ezhits4u.com/index.asp?ref=kurtz53&makenavurl8=http://www.hitsense.com/refer.php?ref=kurtz53&makenavurl9=http://www.clickthru.com/referral?ref=280693&tarframe=_blank
Conclusions • Defined the Portuguese Web as a crawling policy. • Characterization cannot be dissociated from crawling technology. • A search engine repository is a source of interesting statistics. • Statistics are an important tool for validating and designing web applications.
Future Work • Study the linkage structure • Crawl other document types, such as PostScript • Improve the algorithm used to find Portuguese web sites outside the .PT domain • Study the evolution of the Portuguese Web
Thank you for your attention. daniel@tumba.pt http://xldb.fc.ul.pt http://www.tumba.pt