1. A Characterization of the Portuguese Web Daniel Gomes and Mário J. Silva
University of Lisbon
http://xldb.fc.ul.pt
2. Presentation Introduction
Setup
Statistics
Conclusions
Future Work
3. Terminology Document: file resulting from a successful HTTP download
Publisher: entity responsible for publishing the document on the Web
Web site: collection of documents referenced by URLs that share the same host name
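The "web site" definition above can be illustrated with a minimal sketch that groups document URLs by host name. The function name and the example URLs are hypothetical and not taken from the talk.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_site(urls):
    """Group document URLs into web sites keyed by (lower-cased) host name."""
    sites = defaultdict(list)
    for url in urls:
        host = urlsplit(url).hostname  # None when the URL has no host part
        if host:
            sites[host].append(url)
    return dict(sites)


# Hypothetical URLs, for illustration only.
example_sites = group_by_site([
    "http://www.example.pt/index.html",
    "http://www.example.pt/about.html",
    "http://other.example.pt/",
])
```

Under this definition, two host names that serve the same content still count as two sites.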
4. Why Characterize? Extraction of cultural, commercial and social aspects:
Presence of natural languages
Most popular web servers
Adequate design and tuning of web applications:
The web is described through its characterization.
Parameters of the Web graph
How many nodes compose the graph
Types of these nodes
Cultural: what is the presence of the Portuguese language on the Web?
Commercial: which are the most popular web servers?
Web applications: systems that use the web as a source of information
Practical example: programming languages interpreted by browsers
5. Characterizing the WWW vs. Community Webs Huge
Sampling is a “must”
WWW is not uniform
Small partitions are ignored
6. WWW.TUMBA.PT Publicly available:
Characterize
Search
Almost:
Archive
The Portuguese Web
Tumba is a combination of a search engine and an archive for the Portuguese Web. The archival functions are not yet publicly available.
7. Main objectives: Estimate the resources needed to create a web archive of the Portuguese Web;
Validate crawls;
Gather guidelines to improve the systems (crawling, repository, index).
We cannot inspect every page we gather, so a characterization helps us identify strange patterns that may suggest a bug. Example: comparing Content-Length with realSize helped us identify a bug in the conversion module.
Knowing our source of information (the Web) will help us improve our system.
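The Content-Length versus realSize check mentioned above might look like the following sketch. The record layout and function name are assumptions made for illustration; the talk does not describe how the comparison was implemented.

```python
def size_mismatches(records):
    """Return the URLs whose declared Content-Length differs from the stored size.

    Each record is a (url, content_length, real_size) tuple; content_length is
    None when the server sent no Content-Length header, so it cannot be checked.
    """
    return [
        url
        for url, content_length, real_size in records
        if content_length is not None and content_length != real_size
    ]


# Hypothetical records, for illustration only.
suspicious = size_mismatches([
    ("http://www.example.pt/a.html", 1024, 1024),
    ("http://www.example.pt/b.html", 2048, 1990),
    ("http://www.example.pt/c.html", None, 512),
])
```

A large share of mismatches concentrated in one document type would point to a processing bug rather than to misbehaving servers.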
8. Characterization Setup Viúva Negra Crawlers: gather information from the Web and insert it into Versus.
Versus: keeps documents in files and meta-data in relations.
Web statistics are produced by issuing SQL queries to the Versus repository.
This is just to give an idea of the system we used.
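As a rough illustration of producing statistics with SQL, the sketch below uses an in-memory SQLite database. The `documents` table and its columns are hypothetical; the actual Versus schema is not shown in the talk.

```python
import sqlite3

# Hypothetical meta-data table; the real Versus relations are not described here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (url TEXT, http_status INTEGER, size_bytes INTEGER)"
)
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?)",
    [
        ("http://www.example.pt/", 200, 5120),
        ("http://www.example.pt/a", 200, 2048),
        ("http://www.example.pt/missing", 404, 0),
    ],
)

# One aggregate query yields the distribution of HTTP status codes.
status_counts = dict(
    conn.execute(
        "SELECT http_status, COUNT(*) FROM documents GROUP BY http_status"
    ).fetchall()
)
conn.close()
```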
9. What is the Portuguese Web? Set of documents of cultural and sociological interest to the Portuguese people.
Language
Brazilian/Portuguese community web sites
Both written in Portuguese
TLDs
Many sites hosted in gTLDs.
WHOIS
Types that we could convert to text
10. Crawler configuration Influences statistics
The depth of the crawl influences the number of documents gathered
Replication
Mirrors
URL Aliases
Crawl as many documents as possible
Maintain robustness against pathological situations
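One possible way to quantify replication is to hash document contents and count exact duplicates, as sketched below. This is an assumption for illustration only: the talk does not say how replication was measured, and byte-level hashing misses near-duplicates.

```python
import hashlib


def replication_rate(documents):
    """Fraction of documents whose bytes exactly duplicate an earlier document.

    `documents` is an iterable of (url, body_bytes) pairs.
    """
    seen = set()
    duplicates = 0
    total = 0
    for _url, body in documents:
        total += 1
        digest = hashlib.sha256(body).hexdigest()
        if digest in seen:
            duplicates += 1
        else:
            seen.add(digest)
    return duplicates / total if total else 0.0


# Hypothetical documents: the mirror and the original share identical content.
rate = replication_rate([
    ("http://www.example.pt/", b"<html>home</html>"),
    ("http://mirror.example.pt/", b"<html>home</html>"),
    ("http://www.example.pt/a", b"<html>a</html>"),
    ("http://www.example.pt/b", b"<html>b</html>"),
])
```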
11. VN Configuration Parameters Text documents (list selected MIME types)
Hosted under “.PT”
Hosted under “.COM”, “.NET”, “.ORG”, “.TV”.
Written in Portuguese
Host site had at least one incoming link originated under “.PT”
Download timeout=60s
Max Size=2MB
Avoid traps:
max docs per site=8000
crawl at most 50 times the same document
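The parameters above can be read as a membership test for the Portuguese Web. The sketch below assumes one particular reading of the flattened slide: a text document qualifies if it is hosted under .PT, or if it is hosted under one of the listed generic TLDs, is written in Portuguese, and its site has an incoming link from .PT. The function name, its boolean inputs, and the binary interpretation of "2MB" are assumptions.

```python
from urllib.parse import urlsplit

GENERIC_TLDS = {"com", "net", "org", "tv"}
DOWNLOAD_TIMEOUT_S = 60             # "Download timeout=60s"
MAX_SIZE_BYTES = 2 * 1024 * 1024    # "Max Size=2MB", assumed to mean binary megabytes
MAX_DOCS_PER_SITE = 8000            # "max docs per site=8000"


def in_portuguese_web(url, is_text, is_portuguese, has_pt_inlink):
    """Apply the assumed selection policy to one document."""
    if not is_text:
        return False
    host = urlsplit(url).hostname or ""
    tld = host.rsplit(".", 1)[-1]
    if tld == "pt":
        return True
    return tld in GENERIC_TLDS and is_portuguese and has_pt_inlink
```

Language detection and link analysis are left as precomputed booleans because the talk does not describe how either was done.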
12. Collected Statistics 4 million URLs and 78 GB.
83% successfully downloaded (200)
3.4% not found (404)
1.2% took more than 1 minute to download
0.5% bigger than 2 MB
13. Site statistics Virtual hosts
Host aliases
14. Language Distribution (.pt only)
15. Size Distribution
16. Other Statistics Average length of a URL is 62 chars
Unknown Last-Modified date: 53%
HTML: 95%
78 GB of data produced 8.7 GB of text
Meta-tags are scarce (description 17%, keywords 18%)
15.5% replication
We found valid URLs with lengths from 5 to 1368 chars
82% of the unknown dates had embedded parameters
The MIME types are not always correct: JAR and PPS files have been identified as text/plain
Building a search engine requires less storage per crawl than an archive!
Language analysis only under .PT
Documents written in multiple languages
Meta-tags and titles are often repeated across an entire site.
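The storage claim follows from the two figures on the slide: the extracted text is roughly 11% of the raw crawl. A short calculation makes this explicit; the variable names are illustrative.

```python
crawl_gb = 78.0  # raw data gathered by the crawl, from the slide
text_gb = 8.7    # text extracted from that data, from the slide

# Share of the raw crawl that remains once documents are converted to text.
text_share = text_gb / crawl_gb
print(f"Extracted text is {text_share:.1%} of the raw crawl")
```

This ratio covers only the extracted text; index structures would add to the search engine's storage needs.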
17. http://wealth.com.sapo.pt/gui/flat.swf?exbackground=993333&makenavfield0=HitHarvester&makenavfield10=ClickSilo&makenavfield11=BraStart&makenavfield12=AskMiky&makenavfield13=TrafficG&makenavfield14=Click4u&makenavfield1=YesMoreHits&makenavfield2=ClickityCash&makenavfield3=StartFrenzy&makenavfield4=NoMoreHits&makenavfield5=ILoveClicks&makenavfield6=ClixSwap&makenavfield7=EZHits4U&makenavfield8=HitSense&makenavfield9=Clickthru&makenavurl0=http://www.hitharvester.com/referral.asp?ref=kurtz53&makenavurl10=http://www.clicksilo.com/referrals/info.asp?Agent=kurtz53&makenavurl11=http://www.brastart.com/cgi-bin/join.cgi?r=kurtz53&makenavurl12=http://www.askmiky.com/home/signup.php?ref=kurtz53&makenavurl13=http://www.trafficg.com/home.php?member=kurtz53&makenavurl14=http://www.clicks4u.com/X92433/&makenavurl1=http://www.yesmorehits.com/cgi-bin/join.cgi?r=kurtz53&makenavurl2=http://www.clickitycash.com/cgi-bin/join.cgi?refer=52786&makenavurl3=http://www.startfrenzy.com/default.asp?userid=kurtz53&makenavurl4=http://www.nomorehits.com/cgi-bin/start.cgi?referrer=kurtz53&makenavurl5=http://www.iloveclicks.com/signup.asp?referrer=22014&makenavurl6=http://www.clixswap.com/?ref=csa12481&makenavurl7=http://www.ezhits4u.com/index.asp?ref=kurtz53&makenavurl8=http://www.hitsense.com/refer.php?ref=kurtz53&makenavurl9=http://www.clickthru.com/referral?ref=280693&tarframe=_blank
1368 chars
Is the system prepared to deal with this?
It depends on the system requirements and on how often these situations occur.
The distribution must be studied too!
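Studying the distribution of URL lengths, rather than only the average and the extremes, could be done with a simple histogram such as the sketch below. The bucket width and the sample URLs are arbitrary choices for illustration.

```python
from collections import Counter


def url_length_histogram(urls, bucket=100):
    """Count URLs per length bucket: key 0 covers 0-99 chars, 100 covers 100-199, etc."""
    return Counter((len(url) // bucket) * bucket for url in urls)


# Hypothetical URLs: one short, one padded to 172 characters.
sample_urls = [
    "http://a.pt/",
    "http://www.example.pt/" + "x" * 150,
]
histogram = url_length_histogram(sample_urls)
longest = max(sample_urls, key=len)
```

A histogram like this shows whether very long URLs are isolated outliers or frequent enough to drive the sizing of URL fields and buffers.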
18. Other Statistics Average length of a URL is 62 chars
Unknown Last-Modified date: 53%
HTML: 95%
78 GB of data produced 8.7 GB of text
Meta-tags are scarce (description 17%, keywords 18%)
15.5% replication
We found valid URLs with lengths from 5 to 1368 chars
82% of the unknown dates had embedded parameters
The MIME types are not always correct: JAR and PPS files have been identified as text/plain
Building a search engine requires less storage per crawl than an archive!
Language analysis only under .PT
Documents written in multiple languages
Meta-tags and titles are often repeated across an entire site.
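Mislabelled MIME types such as the JAR and PPS cases can be caught by checking the first bytes of a document against well-known file signatures. The sketch below is one possible check, not the method used in the talk; the function name is hypothetical. JAR files are ZIP archives, and legacy PowerPoint files use the OLE2 compound file container.

```python
ZIP_MAGIC = b"PK\x03\x04"                          # ZIP local file header (JAR files)
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"   # OLE2 compound file (legacy PPS/PPT)


def looks_binary_despite_text_mime(declared_mime, body):
    """Return True when a document declared as text/* starts with a binary signature."""
    if not declared_mime.startswith("text/"):
        return False
    return body.startswith(ZIP_MAGIC) or body.startswith(OLE2_MAGIC)
```

For example, `looks_binary_despite_text_mime("text/plain", b"PK\x03\x04...")` flags a JAR served as plain text, while ordinary HTML passes through.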
19. Conclusions Defined the Portuguese Web as a crawling policy.
Characterization cannot be dissociated from crawling technology.
A search engine repository is a source of interesting statistics.
Statistics are an important tool for validating and designing web applications
spider traps can bias characterizations
20. Future Work Study the linkage structure
Crawl other types such as PostScript
Improve the algorithm used to find Portuguese web sites outside the .PT domain
Study the evolution of the Portuguese Web
Crawls must be finished in a short period of time (dynamic web pages and replication)
21. Thank you for your attention. daniel@tumba.pt
http://xldb.fc.ul.pt
http://www.tumba.pt