Webometrics – methods and perspectives

This guide explores Webometrics methods including link analyses, search engine performance, and cybermetrics. Discover dataset usage indicators and trends in web mining.

chancem

Presentation Transcript


  1. Webometrics – methods and perspectives Professor Peter Ingwersen, Ph.D. Information Interaction & Information Architecture Royal School of LIS, Denmark pi@db.dk - http://www.db.dk/pi

  2. Table of Contents • Webometrics – Cybermetrics – a framework • Link topology & structural conceptions in webometrics • Overview of potentials • Search engine analysis • Link analyses – Web Impact Factor (Web-IF) • Dataset Usage Indicators • Web mining – Trend analyses (blog contents) • Concluding remarks Ingwersen

  3. Webometrics (Lennart Björneborn 2001) • The study of quantitative aspects of the construction and use of information resources, structures and technologies on the Web, drawing on bibliometric and informetric methods • search engine performance • link structures, e.g., WIFs, cohesiveness of link topologies, etc. • users’ information behaviour (searching, browsing, etc.) • web page contents – knowledge mining – blog trends • dataset analyses & impact • Cybermetrics: quantitative studies of the whole Internet • i.e. chat, mailing lists, news groups, MUDs, etc. – and the WWW

  4. infor-/biblio-/sciento-/cyber-/webo-/metrics (L. Björneborn & P. Ingwersen 2003) • Diagram labels: informetrics, bibliometrics, scientometrics, cybermetrics, webometrics

  5. Corona model (Björneborn 2004) • SCC: Strongest Connected Component • OUT: reachable from SCC • IN: traversable to SCC • Disconnected • IN-Tendrils: connected from IN • Tube: connecting IN to OUT • OUT-Tendrils: connected to OUT
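
The IN / SCC / OUT partition of the corona model can be illustrated on a toy link graph. The following is a minimal sketch, not Björneborn's procedure: it assumes that one page known to lie in the SCC (`core_node`) is given, and it lumps tendrils, tubes and disconnected pages together as "OTHER". All node names and the function names `reachable` and `corona_components` are hypothetical.

```python
from collections import deque

def reachable(start, adjacency):
    """Return every node reachable from `start` by following directed links."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen

def corona_components(links, core_node):
    """Split nodes into SCC, IN, OUT and OTHER relative to `core_node`.

    Assumption: `core_node` is known to sit inside the SCC.
    """
    forward, backward, nodes = {}, {}, set()
    for source, target in links:
        forward.setdefault(source, set()).add(target)
        backward.setdefault(target, set()).add(source)
        nodes.update((source, target))
    downstream = reachable(core_node, forward)   # pages the core can reach
    upstream = reachable(core_node, backward)    # pages that can reach the core
    scc = downstream & upstream
    return {
        "SCC": scc,
        "IN": upstream - scc,
        "OUT": downstream - scc,
        "OTHER": nodes - upstream - downstream,  # tendrils, tubes, disconnected
    }

# Invented toy graph: p, q, r form a cycle; a links in; z is linked from the cycle.
toy_links = [("p", "q"), ("q", "r"), ("r", "p"), ("a", "p"), ("r", "z"), ("x", "y")]
parts = corona_components(toy_links, "p")
print(parts)
```

Separating tendrils from tubes would need a further pass over the "OTHER" set; that step is left out here to keep the sketch short.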

  6. Source: www.cybergeography.org

  7. Link terminology: basic concepts (L. Björneborn & P. Ingwersen 2003) • Diagram nodes: A, B, C, D, E, F, G, H • B has an outlink to C; outlinking: ~ reference • B has an inlink from A; inlinked: ~ citation • B has a selflink; selflinking: ~ self-citation • A has no inlinks; non-linked: ~ non-cited • E and F are reciprocally linked • A is transitively linked with H via B – D • H is reachable from A by a directed link path • A has a transversal link to G: short cut • C and D are co-linked from B, i.e. have co-inlinks or shared inlinks: co-citation • B and E are co-linking to D, i.e. have co-outlinks or shared outlinks: bibliographic coupling
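
The terminology on this slide maps directly onto set operations over a list of directed links. The sketch below is illustrative only: the diagram itself is not part of the transcript, so the edge set is an assumption reconstructed from the bullet points (in particular the D → H link is inferred from the transitive path from A to H), and the helper names are hypothetical.

```python
# Directed links (source, target), reconstructed from the slide's bullets.
# Assumption: D -> H is inferred from "H is reachable from A".
LINKS = {
    ("A", "B"), ("B", "B"), ("B", "C"), ("B", "D"),
    ("E", "D"), ("E", "F"), ("F", "E"), ("D", "H"), ("A", "G"),
}

def inlinks(node, links=LINKS):
    """Nodes linking to `node`, selflinks excluded (~ citations)."""
    return {s for s, t in links if t == node and s != node}

def outlinks(node, links=LINKS):
    """Nodes that `node` links to, selflinks excluded (~ references)."""
    return {t for s, t in links if s == node and t != node}

def has_selflink(node, links=LINKS):
    """True if `node` links to itself (~ self-citation)."""
    return (node, node) in links

def co_inlinks(x, y, links=LINKS):
    """Shared inlinking nodes of x and y (~ co-citation)."""
    return inlinks(x, links) & inlinks(y, links)

def co_outlinks(x, y, links=LINKS):
    """Shared link targets of x and y (~ bibliographic coupling)."""
    return outlinks(x, links) & outlinks(y, links)

print(inlinks("B"))           # who links to B
print(co_inlinks("C", "D"))   # C and D are co-linked from B
print(co_outlinks("B", "E"))  # B and E are co-linking to D
```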

  8. Levels of web nodes (Lennart Björneborn 2002) • 3 basic levels of web nodes: pages, sites, TLDs • different levels of selflinks and outlinks (labelled a–d in the diagram) • a = page selflink • b = page outlink and site selflink • c = site outlink and TLD selflink • d = TLD outlink • more levels: frames (page sections), sub-sites, sub-TLDs ...
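
The four link levels a–d can be told apart by comparing the source and target URLs. This is a rough sketch under two simplifying assumptions that the slide does not make: a "site" is taken to be the full host name, and the TLD is taken to be the last dot-separated label of the host. The function name `classify_link` and the `example.*` hosts are hypothetical.

```python
from urllib.parse import urlparse

def classify_link(source_url, target_url):
    """Classify a link as level a, b, c or d in the sense of the slide."""
    src, tgt = urlparse(source_url), urlparse(target_url)
    src_host = src.hostname or ""
    tgt_host = tgt.hostname or ""
    if src_host == tgt_host and src.path == tgt.path:
        return "a: page selflink"
    if src_host == tgt_host:
        return "b: page outlink and site selflink"
    # Assumption: the TLD is the last label of the host name.
    if src_host.rsplit(".", 1)[-1] == tgt_host.rsplit(".", 1)[-1]:
        return "c: site outlink and TLD selflink"
    return "d: TLD outlink"

print(classify_link("http://www.db.dk/pi", "http://www.db.dk/pi#top"))
print(classify_link("http://www.db.dk/pi", "http://www.db.dk/other"))
print(classify_link("http://www.db.dk/pi", "http://www.example.dk/"))
print(classify_link("http://www.db.dk/pi", "http://www.example.org/"))
```

Sub-sites and sub-TLDs, mentioned as further levels, would need a finer rule than "host name equals site".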

  9. Search engine analyses • See e.g. Judith Bar-Ilan’s excellent longitudinal analyses • Mike Thelwall et al. in several case studies • Scientific material on the Web: • Lawrence & Giles (1999): approx. 6 % of Web sites contain scientific or educational contents • Increasingly: the Web is a web of uncertainty • Allen et al. (1999) – biology topics from 500 Web sites assessed for quality: • 46 % of sites were “informative” – but: • 10-35 % inaccurate; 20-35 % misleading • 48 % unreferenced

  10. http://searchenginewatch.com/3634992

  12. Jepsen et al. – JASIST 2004 • Accessibility (LIMITS) of publications for download: • AllTheWeb: 4,100 • Google: 1,000 • AltaVista: 200 – this is a problem! • Only the highest ranked Web publications • Sample of 3 x 200 = 600 publications: • to be classified as to quality of contents • sub-sample (n=88): correlation between categories and links, references, metadata

  13. Quality categorisation • Scientific: preprints, conference reports, articles, abstracts • Scientifically related: (potentially relevant) CVs, directories, institutional reports • Teaching: textbooks, tutorials, student papers, course programmes • Low grade: still on topic, but commercial, inaccurate, misleading, opinionated • Noise: not pertinent to topic • Unavailable: inaccessible web pages

  14. Quality vs. links and references

  15. The only valid webometric tool: Site Explorer Yahoo Search … • If one enters (formerly valid) commands like Link:URL, Domain:topdomain (edu, dk) or Site:URL, one is transferred to: http://siteexplorer.search.yahoo.com/new/ • Or find it via this URL • The same facilities are available in click-mode, when one starts with a given URL: • Finding ‘all’ web pages in a site • Finding ‘all’ inlinks to that site/those pages • Also without selflinks! – this implies …

  16. … to calculate Web Impact Factors • But one should be prudent in interpretations. • Note that external inlinks are the best indicator of recognition (see sample) • Take care how many sub-domains (and pages) are included in the click analysis. • Results can be downloaded

  17. The Web-Impact Factor (Ingwersen, 1998) • Intuitively (naively?) believed to be similar to the Journal Impact Factor • Demonstrates recognition by other web sites – or simply impact – not necessarily quality • Central issue: are web sites similar to journals and web pages similar to articles? • Are in-links similar to citations – or simply road signs? • What is really calculated? • DEFINE WHAT YOU ARE CALCULATING: site or page IF
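
To make "define what you are calculating" concrete, here is a minimal sketch of the arithmetic for a site-level Web-IF. It assumes the ratio form usually attributed to Ingwersen (1998), inlinks to a site divided by the number of pages in that site, and it contrasts a count that includes selflinks with one that uses external inlinks only. The function name and all counts are invented for illustration; they are not results from the presentation.

```python
def web_impact_factor(inlink_count, page_count):
    """Inlinks pointing to a site divided by the number of pages in that site."""
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    return inlink_count / page_count

# Invented counts for one hypothetical site.
pages_in_site = 400
all_inlinks = 1000   # includes links from the site to itself
selflinks = 600      # links originating inside the site

overall_wif = web_impact_factor(all_inlinks, pages_in_site)
external_wif = web_impact_factor(all_inlinks - selflinks, pages_in_site)

print(f"Web-IF with selflinks:    {overall_wif:.2f}")
print(f"Web-IF, external inlinks: {external_wif:.2f}")
```

The gap between the two figures shows why the choice of numerator has to be stated alongside any reported Web-IF.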

  18. Search sample: www.db.dk/pi

  19. Without selflinks …

  20. Possible types of Web-IF: • E-journal Web-IF • Calculated as traditional JIF (citations) • Calculated by in-links • Scientific web site – IF (by link analyses) • National – regional (some size & URL-problems) • Institutions – single sites – Other entities • Top-Level Domains – not always applied • .com - .org - .edu - .ac • Sites in a country distributed over TLDs!

  21. Consequences for Yahoo Site Explorer • Take care which domain level you are at: • www.yahoo.com does not contain sub-domains like maps.yahoo.com – only those directly below its name. • Yahoo.com will thus contain maps… • Also beware of the path structure • Minor tests suggest that the inlink number really counts inlinks, not inlinking web pages.
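
The domain-level caveat can be expressed as a small scope check. This sketch only models the behaviour described on the slide (a query for yahoo.com covers maps.yahoo.com, a query for www.yahoo.com does not); it is not a description of how Site Explorer was implemented, and `host_in_scope` is a hypothetical helper.

```python
def host_in_scope(host, scope):
    """True if `host` equals `scope` or lies somewhere below it in the DNS tree."""
    host = host.lower().rstrip(".")
    scope = scope.lower().rstrip(".")
    return host == scope or host.endswith("." + scope)

for scope in ("www.yahoo.com", "yahoo.com"):
    for host in ("www.yahoo.com", "maps.yahoo.com"):
        print(f"{host} counted under {scope}: {host_in_scope(host, scope)}")
```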

  22. Web-links like citations? • Kleinberg (1998), between citation weights and Google’s PageRank: • Hubs ~ review articles: have many outlinks (refs) to: • Authority pages ~ influential (highly cited) documents: have many inlinks from Hubs! • Typical: Web index pages = homepage with self-inlinks = Table of contents
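
The mutual reinforcement between hubs and authorities can be sketched as a short iteration in the spirit of Kleinberg's hubs-and-authorities idea. This is a simplified illustration on an invented four-link graph, with a fixed number of iterations and Euclidean normalisation chosen here for convenience; the page names are hypothetical.

```python
def normalise(scores):
    """Scale scores so that their squared values sum to 1."""
    length = sum(value * value for value in scores.values()) ** 0.5
    return {node: (value / length if length else 0.0) for node, value in scores.items()}

def hits(links, iterations=50):
    """Return (hub, authority) score dictionaries for (source, target) links."""
    nodes = sorted({node for link in links for node in link})
    hub = {node: 1.0 for node in nodes}
    authority = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # A page is a good authority if good hubs link to it.
        authority = normalise(
            {n: sum(hub[s] for s, t in links if t == n) for n in nodes}
        )
        # A page is a good hub if it links to good authorities.
        hub = normalise(
            {n: sum(authority[t] for s, t in links if s == n) for n in nodes}
        )
    return hub, authority

# Invented example: an index page pointing to three papers, a blog to one.
toy_links = [
    ("index", "paper1"), ("index", "paper2"), ("index", "paper3"),
    ("blog", "paper1"),
]
hub, authority = hits(toy_links)
print("best hub:", max(hub, key=hub.get))
print("best authority:", max(authority, key=authority.get))
```

In this toy graph the index page plays the role of the review article and the most linked paper that of the highly cited document.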

  23. Reasons for outlinking … • Out-links mainly for functional purposes • Navigation – interest spaces… • Pointing to authority in certain domains? (Latour: rhetoric reasons for references-links) • Normative reasons for linking? (Merton) • Do we have negative links? • We do have non-linking (commercial sites)

  24. Some additional reasons for providing links • In part analogous to providing references (recognition) • And, among others: • emphasising one's own position and relationship (professional, collaboration, self-presentation etc.) • sharing knowledge, experience, associations … • acknowledging support, sponsorship, assistance • providing information for various purposes (commercial, scientific, education, entertainment) • drawing attention to questions of individual or common interest and to information provided by others (the navigational purpose)

  25. Other differences between references, citations & links • The time issue: • Aging of sources is different on the Web: • Birth, Maturity & Obsolescence happen faster • Decline & Death of sources occur too – but • Marriages – Divorce – Re-marriage – Death & Resurrection … & similar liberal phenomena are found on the Web! (Wolfgang Glänzel)

  26. Dataset usage indicators • Biodiversity datasets are: • Searchable • Downloadable … in • Open access • See e.g. GBIF website and 2009 publication: Vishwas S Chavan and Peter Ingwersen, BMC Bioinformatics, 2009, 10(Suppl 14):S2

  27. Denmark – GBIF dataset providers • DanBioInfoFacility – many datasets • HerbariumUA: only two datasets • Comparable US dataset provider: • OBIS – Ocean Bio Info System

  28. Data Capture Procedure • Go to: http://data.gbif.org/ • Click on ‘Datasets’ (top right on page) • ‘Data Providers’ are listed in alphabetical order – select, e.g., OBIS • You will get to http://data.gbif.org/datasets/provider/82 • Below the map you see OBIS’ 181 datasets. Below that is a link: ‘View Event log for Ocean …’ – click on the link • On the log console select: • Datasets: ALL • Events: Usage – ALL – 3rd level: ALL • Start date: 1st Jan 2010 – End date: 31st Jan 2010 • Click on ‘REFRESH’

  29. Data capture, cont. • You obtain the logs of usage of all OBIS datasets • One may ‘Download these logs’ • From the logs one may obtain all the necessary data to create the Dataset Usage Indicators through Excel by data import.
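
The slides do this aggregation in a spreadsheet; the same counting step could be scripted. The sketch below is only an illustration of the idea: the real layout of the downloaded GBIF logs is not shown in the presentation, so the two-column format (`dataset`, `event`), the event labels and the dataset names are all hypothetical.

```python
import csv
import io
from collections import Counter, defaultdict

# Hypothetical log extract; the real log layout is not given in the slides.
SAMPLE_LOG = """dataset,event
Herbarium A,search
Herbarium A,search
Herbarium A,download
Bird atlas B,search
"""

def usage_counts(log_file):
    """Count events per dataset and event type from a CSV-style log."""
    counts = defaultdict(Counter)
    for row in csv.DictReader(log_file):
        counts[row["dataset"]][row["event"]] += 1
    return counts

counts = usage_counts(io.StringIO(SAMPLE_LOG))

# Rank datasets by number of search events, highest first.
ranking = sorted(counts, key=lambda name: counts[name]["search"], reverse=True)
for name in ranking:
    print(name, dict(counts[name]))
```

Replacing `io.StringIO(SAMPLE_LOG)` with an opened log file would apply the same counting to a downloaded log, provided its columns are mapped to the names assumed here.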

  30. DanBIF distribution of datasets - sorted by Search Events

  31. Sample of Dataset Usage Indicators (DUI)

  32. Issue tracking – Web mining • Adequate sampling requires knowledge of the structure and properties of the population – the Web space to be sampled • Issue tracking of known properties / issues may help • Web mining the unknown is more difficult, due to • the dynamic, distributed & diverse nature • the variety of actors and minimum of standards • the lack of quality control of contents • Web archaeology – study of the past Web

  33. Nielsen Blog Pulse • Observes blogs worldwide by providing: • Trend search – development over time of terms/concepts – user selection! • Featured trends – predefined categories • Conversation tracker – blog conversations • BlogPulse profiles – blog profiles • Look into: http://www.blogpulse.com/tools.html

  34. Home > Tools • Trend Search

  35. Informetric methods useful • Co-occurrence analyses (terms; names…) • Co-link and co-linking analyses • Bradford-like (skewed) distributions of links probably found in sectors of web space • In order to define the strong ties … between top frequency web objects in two sectors of big topical difference • Weak ties – Small-Worlds – Serendipity between low frequency objects in the two sectors: UNEXPECTED relations may occur
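
A co-occurrence analysis of terms reduces to counting how often pairs of terms appear in the same unit (a page, a blog post). The following is a minimal sketch on invented data; the term lists and the function name `cooccurrence` are hypothetical, and real analyses would add term extraction and normalisation first.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(documents):
    """Count, for each unordered term pair, the documents containing both terms."""
    pairs = Counter()
    for terms in documents:
        # Sorting makes ("a", "b") and ("b", "a") the same key.
        for first, second in combinations(sorted(set(terms)), 2):
            pairs[(first, second)] += 1
    return pairs

# Invented example: each inner list holds the terms found in one blog post.
posts = [
    ["webometrics", "links", "blogs"],
    ["links", "blogs"],
    ["webometrics", "citations"],
]
pairs = cooccurrence(posts)
for pair, count in pairs.most_common():
    print(pair, count)
```

The same counting applies to co-links if each "document" is replaced by the set of targets linked from one page; frequent pairs correspond to strong ties, rare pairs to candidate weak ties.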

  36. Source: www.cybergeography.org • Tie!

  37. Concluding remarks • One should be somewhat cautious about Web-IF applications without careful sampling via robots, because of their incomplete coverage and the uncertainty about what the Web-IF actually signifies • One might also investigate further the behavioural aspects of providing and receiving links, to understand what the impact might mean and how/why links are made • Understand the Web space structure better • Design workable robots, downloading & local analyses

  38. Concluding remarks - 2 • Issue tracking and web mining are applications of Web IR • Combined IR and informetric methods seem promising: • co-occurrence analyses – mapping – clustering • co-links and co-linking – transversal links • Knowledge discovery in diversified web spaces

  39. Additional References • Adamic, L. (1999). The small world Web. Lecture Notes in Computer Science, 1696: 443-452. • Almind, T.C. & Ingwersen, P. (1997). Informetric analyses on the World Wide Web: Methodological approaches to “Webometrics”. Journal of Documentation, 53: 404-426. • Björneborn, L. (2001). Small-world linkage and co-linkage. Proceedings of the 12th ACM Conference on Hypertext, pp. 133-134. • Björneborn, L. & Ingwersen, P. (2001). Perspectives of webometrics. Scientometrics, 50(1): 65-82. • Björneborn, L. & Ingwersen, P. (2004). Towards a basic framework of webometrics. (submitted) • Broder, A. et al. (2000). Graph structure in the Web. Computer Networks, 33(1-6): 309-320. • Chakrabarti, S. et al. (1999). Mining the Web’s link structure. IEEE Computer, 32(8): 60-67. • Chavan, V.S. & Ingwersen, P. (2009). Towards a data publishing framework for primary biodiversity data. BMC Bioinformatics, 10(Suppl. 14): S2. • Granovetter, M.S. (1973). The strength of weak ties. American Journal of Sociology, 78(6): 1360-1380.

  40. References - 2 • Ingwersen, P. (1998). The calculation of Web Impact Factors. Journal of Documentation, 54: 236-243. • Kousha, K. & Thelwall, M. (2007). How is Science cited on the Web? A classification of Google unique Web citations. Journal of the American Society for Information Science and Technology, 58(11): 1631-1644. • Matthews, R. (1998). Six degrees of separation. New Scientist, June 6. • Newman, M.E.J. (2001). The structure of scientific collaboration networks. PNAS, 98(2): 404-409. • Rousseau, R. Daily time series of common single word searches in AltaVista and Northern Light. Cybermetrics, 2/3, paper 2. ISSN: 1137-5019. (http://www.cindoc.csic.es/cybermetrics/articles/v2ilp2.html) • Small, H. (1999). A passage through science: crossing disciplinary boundaries. Library Trends, 48(1): 72-108. • Swanson, D.R. (1986). Undiscovered public knowledge. Library Quarterly, 56(2): 103-118. • Thelwall, M. (2000). Web impact factors and search engine coverage. Journal of Documentation, 56: 185-189. • Watts, D.J. & Strogatz, S.H. (1998). Collective dynamics of ‘small-world’ networks. Nature, 393 (June 4): 440-442.
