
A Characterization of the Portuguese Web


Presentation Transcript


    1. A Characterization of the Portuguese Web Daniel Gomes and Mário J. Silva University of Lisbon http://xldb.fc.ul.pt

    2. Presentation: Introduction, Setup, Statistics, Conclusions, Future Work

    3. Terminology Document: file resulting from a successful HTTP download. Publisher: entity responsible for publishing the document on the Web. Web site: collection of documents referenced by URLs that share the same host name.
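The definition of a web site as the set of documents whose URLs share a host name can be illustrated with a minimal sketch. The function name `group_by_site` and the example URL paths are hypothetical and are not taken from the presentation.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_site(urls):
    """Group URLs into web sites: one site per distinct host name."""
    sites = defaultdict(list)
    for url in urls:
        host = urlsplit(url).hostname  # lower-cased host name, port excluded
        if host:
            sites[host].append(url)
    return dict(sites)


# Hypothetical URLs, for illustration only.
example_urls = [
    "http://www.tumba.pt/",
    "http://www.tumba.pt/about.html",
    "http://xldb.fc.ul.pt/",
]
sites = group_by_site(example_urls)
```

Under this definition, two host names that serve the same content (host aliases) count as two distinct sites, which is one reason aliases matter for site statistics.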

    4. Why Characterize? Extraction of cultural, commercial and social aspects: presence of natural languages; most popular web servers. Adequate design and tuning of web applications: the Web is described through its characterization. Parameters of the Web graph: how many nodes compose the graph; types of these nodes. Cultural: what is the presence of the Portuguese language on the Web? Commercial: which are the most popular web servers? Web applications: systems that use the Web as a source of information. Practical example: programming languages interpreted by browsers.

    5. Characterizing the WWW vs. Community Webs Huge. Sampling is a “must”. The WWW is not uniform. Small partitions are ignored.

    6. WWW.TUMBA.PT Publicly available: characterize and search the Portuguese Web. Almost: archive the Portuguese Web. Tumba is a combination of a search engine and an archive for the Portuguese Web. The archival functions are not yet publicly available.

    7. Main objectives: estimate the resources needed to create a web archive of the Portuguese Web; validate crawls; gather guidelines to improve the systems (crawling, repository, index). We cannot inspect all the pages we gather, so a characterization helps us identify strange patterns that may suggest a bug. Example: comparing Content-Length and realSize helped us identify a bug in the conversion module. Knowing our source of information (the Web) helps us improve our system.
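The Content-Length versus realSize check mentioned on this slide might look roughly like the sketch below. The record layout (`url`, `content_length`, `real_size`) and the sample values are assumptions made for illustration; the presentation does not describe how the comparison was actually implemented.

```python
def size_mismatches(records):
    """Return URLs whose declared Content-Length differs from the stored size."""
    mismatches = []
    for record in records:
        declared = record["content_length"]
        # Servers may omit Content-Length; those records cannot be compared.
        if declared is not None and declared != record["real_size"]:
            mismatches.append(record["url"])
    return mismatches


# Hypothetical crawl records, for illustration only.
sample_records = [
    {"url": "http://a.example.pt/", "content_length": 1200, "real_size": 1200},
    {"url": "http://a.example.pt/doc", "content_length": 5000, "real_size": 3100},
    {"url": "http://b.example.pt/", "content_length": None, "real_size": 800},
]
suspicious = size_mismatches(sample_records)
```

An unexpectedly large share of mismatches would be the kind of strange pattern that points to a bug in the pipeline rather than to a property of the Web.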

    8. Characterization Setup Viúva Negra crawlers: gather information from the Web and insert it into Versus. Versus: keeps documents in files and meta-data in relations. Web statistics are produced by issuing SQL queries to the Versus repository. Just to give an idea of the system we used.
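As a rough illustration of producing statistics with SQL over crawl meta-data, the sketch below uses an in-memory SQLite database. The table `documents` and its columns are hypothetical; the real Versus schema is not shown in the presentation, and Versus is not claimed to use SQLite.

```python
import sqlite3

# Hypothetical meta-data table; the actual Versus schema is not given in the slides.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (url TEXT PRIMARY KEY, http_status INTEGER, size_bytes INTEGER)"
)
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?)",
    [
        ("http://a.example.pt/", 200, 1200),
        ("http://a.example.pt/x", 404, 0),
        ("http://b.example.pt/", 200, 800),
        ("http://b.example.pt/y", 200, 2000),
    ],
)


def status_distribution(connection):
    """Return the percentage of URLs per HTTP status code."""
    total = connection.execute("SELECT COUNT(*) FROM documents").fetchone()[0]
    query = (
        "SELECT http_status, COUNT(*) FROM documents "
        "GROUP BY http_status ORDER BY http_status"
    )
    return {status: 100.0 * count / total for status, count in connection.execute(query)}


distribution = status_distribution(conn)
```

Keeping meta-data in relations means a new statistic usually costs one more query instead of another pass over the stored documents.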

    9. What is the Portuguese Web? Set of documents of cultural and sociological interest to the Portuguese people. Language: Brazilian/Portuguese community web sites, both written in Portuguese. TLDs: many sites hosted in gTLDs. WHOIS. Types that we could convert to text.

    10. Crawler configuration Influences statistics: the depth of the crawl influences the number of documents gathered. Replication: mirrors, URL aliases. Crawl as many documents as possible. Maintain robustness against pathological situations.

    11. VN Configuration Parameters Text documents (list selected MIME types). Hosted under “.PT”, or hosted under “.COM”, “.NET”, “.ORG”, “.TV”, written in Portuguese, and the host site had at least one incoming link originating under “.PT”. Download timeout = 60 s. Max size = 2 MB. Avoid traps: max docs per site = 8000; crawl the same document at most 50 times.
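The selection rules on this slide can be sketched as a small policy object plus a predicate. This is an illustration under assumptions, not the Viúva Negra implementation: the names `CrawlPolicy` and `in_portuguese_web` are hypothetical, “2 MB” is read as 2 MiB, and the slide is read as accepting a document either when it is hosted under .PT, or when it is hosted under one of the listed gTLDs, is written in Portuguese, and its site has an incoming link from .PT.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class CrawlPolicy:
    """Limits taken from the slide; the field names are hypothetical."""
    timeout_seconds: int = 60
    max_size_bytes: int = 2 * 1024 * 1024  # "Max Size=2MB", read here as 2 MiB
    max_docs_per_site: int = 8000
    other_tlds: tuple = ("com", "net", "org", "tv")


def in_portuguese_web(url, is_portuguese, has_pt_inlink, policy=CrawlPolicy()):
    """Apply the assumed reading of the slide's selection rules to one URL."""
    host = urlsplit(url).hostname or ""
    tld = host.rsplit(".", 1)[-1]
    if tld == "pt":
        return True
    if tld in policy.other_tlds:
        # Outside .PT, both the language and the incoming-link conditions must hold.
        return is_portuguese and has_pt_inlink
    return False


accepted = in_portuguese_web("http://example.com/", is_portuguese=True, has_pt_inlink=True)
```

Language detection and link tracking are passed in as booleans here because the slides do not say how either was computed.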

    12. Collected Statistics 4 million URLs and 78 GB. 83% successfully downloaded (200); 3.4% not found (404); 1.2% took more than 1 minute to download; 0.5% bigger than 2 MB.
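To get a feel for the magnitudes, the percentages on this slide can be converted to approximate URL counts. The sketch treats “4 million” as exactly 4,000,000, which is an assumption, so the resulting counts are only indicative.

```python
TOTAL_URLS = 4_000_000  # the slide says "4 million"; the exact figure is not given

# Percentages as reported on the slide.
outcome_percentages = {
    "downloaded (200)": 83.0,
    "not found (404)": 3.4,
    "timed out (> 1 minute)": 1.2,
    "too big (> 2 MB)": 0.5,
}

# Approximate number of URLs behind each percentage.
approx_counts = {
    outcome: round(TOTAL_URLS * pct / 100) for outcome, pct in outcome_percentages.items()
}

# Share of URLs whose outcome is not itemized on the slide.
other_pct = round(100 - sum(outcome_percentages.values()), 1)
```

The four listed outcomes do not add up to 100%, so a remaining share of URLs ended in outcomes that the slide does not break down.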

    13. Site statistics Virtual hosts. Host aliases.

    14. Language Distribution (.pt only)

    15. Size Distribution

    16. Other Statistics Average length of a URL is 62 chars. Unknown Last-Modified date: 53%. HTML: 95%. 78 GB of data produced 8.7 GB of text. Meta-tags are scarce (description 17%, keywords 18%). 15.5% replication. We found valid URLs with lengths from 5 to 1368 chars. 82% of the URLs with unknown dates had embedded parameters. The MIME types are not always correct: JAR and PPS files were identified as text/plain. Building a search engine requires less storage than an archive per crawl! Language analysis only under .PT. Documents written in multiple languages. Meta-tags and titles are often repeated throughout a site.

    17. http://wealth.com.sapo.pt/gui/flat.swf?exbackground=993333&makenavfield0=HitHarvester&makenavfield10=ClickSilo&makenavfield11=BraStart&makenavfield12=AskMiky&makenavfield13=TrafficG&makenavfield14=Click4u&makenavfield1=YesMoreHits&makenavfield2=ClickityCash&makenavfield3=StartFrenzy&makenavfield4=NoMoreHits&makenavfield5=ILoveClicks&makenavfield6=ClixSwap&makenavfield7=EZHits4U&makenavfield8=HitSense&makenavfield9=Clickthru&makenavurl0=http://www.hitharvester.com/referral.asp?ref=kurtz53&makenavurl10=http://www.clicksilo.com/referrals/info.asp?Agent=kurtz53&makenavurl11=http://www.brastart.com/cgi-bin/join.cgi?r=kurtz53&makenavurl12=http://www.askmiky.com/home/signup.php?ref=kurtz53&makenavurl13=http://www.trafficg.com/home.php?member=kurtz53&makenavurl14=http://www.clicks4u.com/X92433/&makenavurl1=http://www.yesmorehits.com/cgi-bin/join.cgi?r=kurtz53&makenavurl2=http://www.clickitycash.com/cgi-bin/join.cgi?refer=52786&makenavurl3=http://www.startfrenzy.com/default.asp?userid=kurtz53&makenavurl4=http://www.nomorehits.com/cgi-bin/start.cgi?referrer=kurtz53&makenavurl5=http://www.iloveclicks.com/signup.asp?referrer=22014&makenavurl6=http://www.clixswap.com/?ref=csa12481&makenavurl7=http://www.ezhits4u.com/index.asp?ref=kurtz53&makenavurl8=http://www.hitsense.com/refer.php?ref=kurtz53&makenavurl9=http://www.clickthru.com/referral?ref=280693&tarframe=_blank Is the system prepared to deal with this? It depends on the system's requirements and on how often these situations occur. The distribution must be studied too! 1368 chars
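Since an average hides outliers such as the URL above, a summary of URL lengths is more useful when it also includes the extremes and a histogram. The sketch below is one way to compute that; the function name `url_length_summary`, the bucket size, and the sample URLs are assumptions made for illustration.

```python
from collections import Counter
from statistics import mean


def url_length_summary(urls, bucket_size=50):
    """Summarize URL lengths: extremes, mean, and a histogram keyed by bucket start."""
    lengths = [len(url) for url in urls]
    histogram = Counter((n // bucket_size) * bucket_size for n in lengths)
    return {
        "min": min(lengths),
        "mean": mean(lengths),
        "max": max(lengths),
        "histogram": dict(sorted(histogram.items())),
    }


# Hypothetical URLs: two short ones and one with a long query string.
sample_urls = [
    "http://a.pt",
    "http://www.tumba.pt/",
    "http://x.pt/?" + "k=v&" * 100,
]
summary = url_length_summary(sample_urls)
```

A long, thin tail in such a histogram suggests that a fixed-width URL field sized near the average would truncate a small but real share of valid URLs.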

    18. Other Statistics Average length of a URL is 62 chars. Unknown Last-Modified date: 53%. HTML: 95%. 78 GB of data produced 8.7 GB of text. Meta-tags are scarce (description 17%, keywords 18%). 15.5% replication. We found valid URLs with lengths from 5 to 1368 chars. 82% of the URLs with unknown dates had embedded parameters. The MIME types are not always correct: JAR and PPS files were identified as text/plain. Building a search engine requires less storage than an archive per crawl! Language analysis only under .PT. Documents written in multiple languages. Meta-tags and titles are often repeated throughout a site.

    19. Conclusions Defined the Portuguese Web as a crawling policy. Characterization cannot be dissociated from crawling technology. A search engine repository is a source of interesting statistics. Statistics are an important tool for validating and designing web applications. Spider traps can bias characterizations.

    20. Future Work Study the linkage structure. Crawl other types such as PostScript. Improve the algorithm used to find Portuguese web sites outside the .PT domain. Study the evolution of the Portuguese Web: crawls must be finished in a short period of time (dynamic web pages and replication).

    21. Thank you for your attention. daniel@tumba.pt http://xldb.fc.ul.pt http://www.tumba.pt
