This guide explores Webometrics methods including link analyses, search engine performance, and cybermetrics. Discover dataset usage indicators and trends in web mining.
Webometrics – methods and perspectives Professor Peter Ingwersen, Ph.D. Information Interaction & Information Architecture Royal School of LIS, Denmark pi@db.dk - http://www.db.dk/pi
Table of Contents • Webometrics – Cybermetrics – a framework • Link topology & structural conceptions in webometrics • Overview of potentials • Search engine analysis • Link analyses – Web Impact Factor (Web-IF) • Dataset Usage Indicators • Web mining – Trend analyses (blog contents) • Concluding remarks Ingwersen
Webometrics • The study of quantitative aspects of the construction and use of information resources, structures and technologies on the Web, drawing on bibliometric and informetric methods (Lennart Björneborn 2001) • search engine performance • link structures, e.g., WIFs, cohesiveness of link topologies, etc. • users’ information behaviour (searching, browsing, etc.) • web page contents – knowledge mining – blog trends • dataset analyses & impact • Cybermetrics: quantitative studies of the whole Internet, i.e. chat, mailing lists, news groups, MUDs, etc., and the WWW
infor-/biblio-/sciento-/cyber-/webo-/metrics (L. Björneborn & P. Ingwersen 2003) [Figure: diagram relating informetrics, bibliometrics, scientometrics, cybermetrics and webometrics]
Corona model (Björneborn 2004) • SCC: Strongly Connected Component • OUT: reachable from SCC • IN: traversable to SCC • Disconnected • IN-Tendrils: connected from IN • Tube: connecting IN to OUT • OUT-Tendrils: connected to OUT
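The components named in the corona model can be illustrated on a small directed link graph. The following is a minimal sketch, not Björneborn's procedure: it assumes the graph is given as (source, target) pairs, finds the largest strongly connected component by intersecting forward and backward reachability, and lumps tendrils, tubes and disconnected nodes together under a hypothetical label "OTHER". The function names are illustrative only.

```python
from collections import defaultdict


def _reach(adj, start):
    """Return all nodes reachable from start (including start) in adj."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def corona_components(edges):
    """Split nodes into SCC, IN, OUT and OTHER (tendrils, tubes, disconnected)."""
    fwd, bwd, nodes = defaultdict(set), defaultdict(set), set()
    for source, target in edges:
        fwd[source].add(target)
        bwd[target].add(source)
        nodes.update((source, target))
    if not nodes:
        return {"SCC": set(), "IN": set(), "OUT": set(), "OTHER": set()}

    # Largest strongly connected component: nodes that can both reach
    # and be reached from a given node. Quadratic, fine for small samples.
    core, assigned = set(), set()
    for node in sorted(nodes):
        if node in assigned:
            continue
        scc = _reach(fwd, node) & _reach(bwd, node)
        assigned |= scc
        if len(scc) > len(core):
            core = scc

    seed = next(iter(core))
    out_part = _reach(fwd, seed) - core   # reachable from SCC
    in_part = _reach(bwd, seed) - core    # can traverse to SCC
    other = nodes - core - in_part - out_part
    return {"SCC": core, "IN": in_part, "OUT": out_part, "OTHER": other}
```

For real web samples a linear-time SCC algorithm (e.g. Tarjan's) would be the more sensible choice; the reachability intersection is used here only because it is short and easy to check by hand.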
Source: www.cybergeography.org
Link terminology: basic concepts (L. Björneborn & P. Ingwersen 2003) [Figure: link graph with nodes A to H] • B has an outlink to C; outlinking: ~ reference • B has an inlink from A; inlinked: ~ citation • B has a selflink; selflinking: ~ self-citation • A has no inlinks; non-linked: ~ non-cited • E and F are reciprocally linked • A is transitively linked with H via B – D • H is reachable from A by a directed link path • A has a transversal link to G: short cut • C and D are co-linked from B, i.e. have co-inlinks or shared inlinks: co-citation • B and E are co-linking to D, i.e. have co-outlinks or shared outlinks: bibliographic coupling
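The link terminology can be made concrete on a toy graph. The sketch below is an illustration only: the edge list is inferred from the bullet points (the original figure may contain further links), and all function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Edges inferred from the bullet points only; the figure may show more.
EDGES = [
    ("A", "B"), ("B", "B"), ("B", "C"), ("B", "D"),
    ("E", "F"), ("F", "E"), ("E", "D"), ("D", "H"), ("A", "G"),
]


def build(edges):
    """Return (outlinks, inlinks) as node -> set of nodes."""
    out, inn = defaultdict(set), defaultdict(set)
    for source, target in edges:
        out[source].add(target)
        inn[target].add(source)
    return out, inn


def selflinking(edges):
    """Nodes with a selflink (~ self-citation)."""
    return {s for s, t in edges if s == t}


def reciprocal_pairs(edges):
    """Unordered pairs of distinct nodes linking to each other."""
    es = set(edges)
    return {frozenset((s, t)) for s, t in es if s != t and (t, s) in es}


def co_linked(out):
    """Target pairs sharing an inlinking source (~ co-citation)."""
    pairs = defaultdict(set)
    for source, targets in out.items():
        for a, b in combinations(sorted(targets - {source}), 2):
            pairs[(a, b)].add(source)
    return dict(pairs)


def co_linking(inn):
    """Source pairs sharing an outlink target (~ bibliographic coupling)."""
    pairs = defaultdict(set)
    for target, sources in inn.items():
        for a, b in combinations(sorted(sources - {target}), 2):
            pairs[(a, b)].add(target)
    return dict(pairs)


def reachable(out, start):
    """Nodes reachable from start by a directed link path."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in out.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen
```

With these edges, `co_linked` reports C and D as co-linked from B, `co_linking` reports B and E as co-linking to D, and H is among the nodes reachable from A.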
Levels of web nodes (Lennart Björneborn 2002) • 3 basic levels of web nodes: pages, sites, TLDs • different levels of selflinks and outlinks (labels a to d in the figure): • a = page selflink • b = page outlink and site selflink • c = site outlink and TLD selflink • d = TLD outlink • more levels: frames (page sections), sub-sites, sub-TLDs ...
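The four link levels a to d can be assigned mechanically once "page", "site" and "TLD" are operationalised. The sketch below makes simplifying assumptions that are not from the slides: a page is identified by host plus path, a site by its full host name, and the TLD by the last label of the host name. The function name `link_level` is hypothetical.

```python
from urllib.parse import urlsplit


def link_level(source_url, target_url):
    """Classify a link as 'a', 'b', 'c' or 'd' (see the levels above)."""
    s, t = urlsplit(source_url), urlsplit(target_url)
    s_host = (s.hostname or "").lower()
    t_host = (t.hostname or "").lower()

    # Assumption: a page is host + path; query and fragment are ignored.
    if (s_host, s.path or "/") == (t_host, t.path or "/"):
        return "a"  # page selflink
    # Assumption: a site is the full host name (sub-sites count as sites).
    if s_host == t_host:
        return "b"  # page outlink and site selflink
    # Assumption: the TLD is the last dot-separated label of the host.
    if s_host.rsplit(".", 1)[-1] == t_host.rsplit(".", 1)[-1]:
        return "c"  # site outlink and TLD selflink
    return "d"      # TLD outlink
```

Treating `www.db.dk` and `db.dk` as different sites is a consequence of the host-name assumption; an actual study would need to decide how sub-sites and sub-TLDs are grouped before counting.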
Search engine analyses • See e.g. Judith Bar-Ilan’s excellent longitudinal analyses • Mike Thelwall et al. in several case studies • Scientific material on the Web: • Lawrence & Giles (1999): approx. 6 % of Web sites contain scientific or educational contents • Increasingly: the Web is a web of uncertainty • Allen et al. (1999) – biology topics from 500 Web sites assessed for quality: • 46 % of sites were “informative” – but: • 10-35 % inaccurate; 20-35 % misleading • 48 % unreferenced
http://searchenginewatch.com/3634992
Jepsen et al. – JASIST 2004 • Accessibility (LIMITS) of publications for download: • AllTheWeb: 4,100 • Google: 1,000 • AltaVista: 200 – this is a problem! • Only the highest ranked Web publications • Sample of 3 x 200 = 600 publications: • to be classified as to quality of contents • sub-sample (n=88): correlation between categories and links, references, metadata
Quality categorisation • Scientific: preprints, conference reports, articles, abstracts • Scientifically related: (potentially relevant) CVs, directories, institutional reports • Teaching: textbooks, tutorials, student papers, course programmes • Low grade: still on topic, but commercial, inaccurate, misleading, opinionated • Noise: not pertinent to topic • Unavailable: inaccessible web pages
Quality vs. links and references
The only valid webometric tool: Site Explorer Yahoo Search … • If one enters (old valid) commands like: • Link:URL or Domain: topdomain (edu, dk) or Site:URL you are transferred to: http://siteexplorer.search.yahoo.com/new/ • Or find it via this URL • The same facilities are available in click-mode, when one starts with a given URL: • Finding ‘all’ web pages in a site • Finding ‘all’ inlinks to that site/those pages • Also without selflinks! – this implies …
… to calculate Web Impact Factors • But one should be prudent in interpretations. • Note that external inlinks are the best indicator of recognition (see sample) • Check how many sub-domains (and pages) are included in the click analysis. • Results can be downloaded
The Web-Impact Factor (Ingwersen, 1998) • Intuitively (naively?) believed to be similar to the Journal Impact Factor • Demonstrates recognition by other web sites - or simply impact – not necessarily quality • Central issue: are web sites similar to journals and web pages similar to articles? • Are in-links similar to citations – or simply road signs? • What is really calculated? • DEFINE WHAT YOU ARE CALCULATING: site or page IF
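Since the slide insists on defining what is being calculated, a small worked sketch may help. It assumes the commonly used ratio form of the Web-IF, i.e. the number of inlinks (or inlinking pages) to a site divided by the number of pages in that site, reported separately for all inlinks, external inlinks and selflinks. The function names and the figures in the usage note are illustrative assumptions, not results from the presentation.

```python
def web_impact_factor(inlinks, pages):
    """Ratio of inlink count to page count for one site or domain."""
    if pages <= 0:
        raise ValueError("pages must be a positive count")
    if inlinks < 0:
        raise ValueError("inlinks must not be negative")
    return inlinks / pages


def wif_report(external_inlinks, self_inlinks, pages):
    """Return overall, external and self Web-IF for the same page count."""
    return {
        "overall": web_impact_factor(external_inlinks + self_inlinks, pages),
        "external": web_impact_factor(external_inlinks, pages),
        "self": web_impact_factor(self_inlinks, pages),
    }
```

For a hypothetical site with 100 pages, 150 external inlinks and 50 selflinks, `wif_report(150, 50, 100)` gives an overall Web-IF of 2.0, an external Web-IF of 1.5 and a self Web-IF of 0.5. Whether the counts refer to links or to linking pages, and to a site or a single page, has to be stated alongside any such figure.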
Search sample: www.db.dk/pi
Without selflinks …
Possible types of Web-IF: • E-journal Web-IF • Calculated as traditional JIF (citations) • Calculated by in-links • Scientific web site – IF (by link analyses) • National – regional (some size & URL-problems) • Institutions – single sites - Other entities • Top-Level Domains – not always applied • .com - .org - .edu - .ac • Sites in a country distributed over TLDs!
Consequences for Yahoo Site Explorer • Be aware of which domain level you are on: • www.yahoo.com does not contain sub-domains like maps.yahoo.com – only those directly below its name. • yahoo.com will thus contain maps… • Also beware of the path structure • Minor tests suggest that the inlink number probably counts inlinks – not inlinking web pages.
Web-links like citations? • Kleinberg (1998) between citation weights and Google’s PageRank: • Hubs ~ review articles: have many outlinks (refs) to: • Authority pages ~ influential (highly cited) documents: have many inlinks from Hubs! • Typical: Web index pages = homepage with self-inlinks = Table of contents
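The hub and authority notions can be illustrated with a short iterative computation in the spirit of Kleinberg's approach. This is a minimal sketch under stated assumptions: the link graph is a list of (source, target) pairs, scores are normalised by Euclidean length after each step, and the iteration count is fixed rather than tested for convergence. The function name `hits` is illustrative.

```python
def hits(edges, iterations=50):
    """Return (hub, authority) score dicts for a directed link graph."""
    nodes = sorted({node for edge in edges for node in edge})
    hub = dict.fromkeys(nodes, 1.0)
    auth = dict.fromkeys(nodes, 1.0)

    for _ in range(iterations):
        # Authority score: sum of hub scores of the inlinking nodes.
        auth = dict.fromkeys(nodes, 0.0)
        for source, target in edges:
            auth[target] += hub[source]
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {n: v / norm for n, v in auth.items()}

        # Hub score: sum of authority scores of the outlinked nodes.
        new_hub = dict.fromkeys(nodes, 0.0)
        for source, target in edges:
            new_hub[source] += auth[target]
        norm = sum(v * v for v in new_hub.values()) ** 0.5 or 1.0
        hub = {n: v / norm for n, v in new_hub.items()}

    return hub, auth
```

A node that links to many strong authorities ends up with a high hub score (the "review article" role), while a node inlinked from many strong hubs ends up with a high authority score (the "highly cited document" role). Selflinks are treated like any other edge here; an analysis of site homepages would probably want to filter them first.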
Reasons for outlinking … • Out-links mainly for functional purposes • Navigation – interest spaces… • Pointing to authority in certain domains? (Latour: rhetoric reasons for references-links) • Normative reasons for linking? (Merton) • Do we have negative links? • We do have non-linking (commercial sites)
Some additional reasons for providing links • In part analogous to providing references (recognition) • And, among others, • emphasising one’s own position and relationship (professional, collaboration, self-presentation etc.) • sharing knowledge, experience, associations … • acknowledging support, sponsorship, assistance • providing information for various purposes (commercial, scientific, education, entertainment) • drawing attention to questions of individual or common interest and to information provided by others (the navigational purpose)
Other differences between references, citations & links • The time issue: • Ageing of sources is different on the Web: • Birth, Maturity & Obsolescence happen faster • Decline & Death of sources occur too – but • Marriages – Divorce – Re-marriage – Death & Resurrection … & similar liberal phenomena are found on the Web! (Wolfgang Glänzel)
Dataset usage indicators • Biodiversity datasets are: • Searchable • Downloadable … in • Open access • See e.g. the GBIF website and 2009 publication: Vishwas S Chavan and Peter Ingwersen, BMC Bioinformatics, 2009, 10(Suppl 14):S2
Denmark – GBIF dataset providers • DanBioInfoFacility – many datasets • HerbariumUA: only two datasets • Comparable US dataset provider: • OBIS – Ocean Bio Info System
Data Capture Procedure • Go to: http://data.gbif.org/ • Click on ’Datasets’ (top right on page) • ’Data Providers’ in alphabetical order. – Select, e.g. OBIS • You will get to http://data.gbif.org/datasets/provider/82 • Below the map you see OBIS’ 181 datasets. Below that is a link: ’View Event log for Ocean …’ – Click on link • On log console select: • Datasets: ALL • Events: Usage - ALL – 3. Level: ALL • Start date: 1st Jan 2010 – End date: 31st Jan 2010 • Click on ’REFRESH’
Data capture, cont. • You obtain the logs of usage of all OBIS datasets • One may ’Download these logs’ • From the logs one may obtain all the necessary data to create the Dataset Usage Indicators by importing the data into Excel.
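As an alternative to a spreadsheet import, the tallying step could be scripted. The sketch below is hypothetical throughout: the actual layout of the downloaded GBIF event logs is not described in the slides, so the column names (`dataset`, `event`, `count`) and the sample rows are invented purely for illustration, and the functions only produce per-dataset event totals, not the indicators themselves.

```python
import csv
import io
from collections import Counter, defaultdict

# Invented sample rows; the real log columns may differ entirely.
SAMPLE_LOG = """dataset,event,count
Dataset 1,search,10
Dataset 2,search,25
Dataset 1,download,3
Dataset 1,search,5
"""


def tally_events(log_file):
    """Sum event counts per dataset and event type from a CSV file object."""
    totals = defaultdict(Counter)
    for row in csv.DictReader(log_file):
        totals[row["dataset"]][row["event"]] += int(row["count"])
    return totals


def rank_by(totals, event):
    """Dataset names sorted by descending total of the given event type."""
    return sorted(totals, key=lambda name: totals[name][event], reverse=True)


totals = tally_events(io.StringIO(SAMPLE_LOG))
ranking = rank_by(totals, "search")
```

Sorting by the search-event total mirrors the kind of "sorted by Search Events" distribution shown for a provider's datasets; any real use would first need the column names adapted to the downloaded file.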
DanBIF distribution of datasets - sorted by Search Events
Sample of Dataset Usage Indicators (DUI)
Issue tracking – Web mining • Adequate sampling requires knowledge of the structure and properties of the population - the Web space to be sampled • Issue tracking of known properties / issues may help • Web mining the unknown is more difficult, due to • the dynamic, distributed & diverse nature • the variety of actors and minimum of standards • the lack of quality control of contents • Web archeology – study of the past Web
Nielsen Blog Pulse • Observes blogs worldwide by providing: • Trend search – development over time of terms/concepts – user selection! • Featured trends – predefined categories • Conversation tracker – blog conversations • BlogPulse profiles – blog profiles • Look into: http://www.blogpulse.com/tools.html
[Screenshot: BlogPulse Tools, Trend Search]
Informetric methods useful • Co-occurrence analyses (terms; names…) • Co-link and co-linking analyses • Bradford-like (skewed) distributions of links probably found in sectors of web space • In order to define the strong ties … between top frequency web objects in two sectors of big topical difference • Weak ties – Small-Worlds – Serendipity between low frequency objects in the two sectors: UNEXPECTED relations may occur
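A term co-occurrence analysis of the kind listed first can be sketched in a few lines. The example assumes documents are plain strings (e.g. blog posts), uses naive whitespace tokenisation, and counts a term pair at most once per document; the function name and the sample texts in the usage note are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(documents, vocabulary):
    """Count in how many documents each pair of vocabulary terms co-occurs."""
    vocab = {term.lower() for term in vocabulary}
    counts = Counter()
    for text in documents:
        # Naive tokenisation: lower-case and split on whitespace.
        present = sorted(vocab & set(text.lower().split()))
        counts.update(combinations(present, 2))
    return counts
```

For instance, given the three texts "web impact factor", "web links and citations" and "impact of links on the web" with the vocabulary web, impact and links, the pair (impact, web) co-occurs in two texts and (impact, links) in one. High-count pairs correspond to strong ties; the rarely co-occurring pairs are where unexpected relations would have to be looked for.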
[Figure, annotated “Tie!”] Source: www.cybergeography.org
Concluding remarks • One should be somewhat cautious about Web-IF applications without careful sampling via robots, owing to the Web-IF’s lack of comprehensiveness and the uncertainty about what it actually signifies • One might also investigate further the behavioural aspects of providing and receiving links, to understand what the impact might mean and how/why links are made • Understand the Web space structure better • Design workable robots, downloading & local analyses
Concluding remarks - 2 • Issue tracking and web mining are: applications of Web IR • Combined IR and informetric methods seem promising: • co-occurrence analyses – mapping - clustering • co-links and co-linking - transversal links • Knowledge discovery in diversified web spaces
Additional References • Adamic, L. (1999). The small world Web. Lecture Notes in Computer Science, 1696: 443-452. • Almind, T.C. & Ingwersen, P. (1997). Informetric analyses on the World Wide Web: Methodological approaches to “Webometrics”. Journal of Documentation, 53: 404-426. • Björneborn, L. (2001). Small-world linkage and co-linkage. Proceedings of the 12th ACM Conference on Hypertext, pp. 133-134. • Björneborn, L. & Ingwersen, P. (2001). Perspectives of webometrics. Scientometrics, 50(1): 65-82. • Björneborn, L. & Ingwersen, P. (2004). Towards a basic framework of webometrics. (submitted) • Broder, A. et al. (2000). Graph structure in the Web. Computer Networks, 33(1-6): 309-320. • Chakrabarti, S. et al. (1999). Mining the Web’s link structure. IEEE Computer, 32(8): 60-67. • Chavan, V.S. & Ingwersen, P. (2009). Towards a data publishing framework for primary biodiversity data. BMC Bioinformatics, 10(Suppl. 14): S2. • Granovetter, M.S. (1973). The strength of weak ties. American Journal of Sociology, 78(6): 1360-1380.
References - 2 • Ingwersen, P. (1998). The calculation of Web Impact Factors. Journal of Documentation, 54: 236-243. • Kousha, K. & Thelwall, M. (2007). How is Science cited on the Web? A classification of Google unique Web citations. Journal of the American Society for Information Science and Technology, 58(11): 1631-1644. • Matthews, R. (1998). Six degrees of separation. New Scientist, June 6. • Newman, M.E.J. (2001). The structure of scientific collaboration networks. PNAS, 98(2): 404-409. • Rousseau, R. Daily time series of common single word searches in AltaVista and Northern Light. Cybermetrics, 2/3, paper 2. ISSN: 1137-5019. (http://www.cindoc.csic.es/cybermetrics/articles/v2ilp2.html) • Small, H. (1999). A passage through science: crossing disciplinary boundaries. Library Trends, 48(1): 72-108. • Swanson, D.R. (1986). Undiscovered public knowledge. Library Quarterly, 56(2): 103-118. • Thelwall, M. (2000). Web impact factors and search engine coverage. Journal of Documentation, 56: 185-189. • Watts, D.J. & Strogatz, S.H. (1998). Collective dynamics of ‘small-world’ networks. Nature, 393 (June 4): 440-442.