Explore the various forms of webometrics and their applications in understanding information resources, structures, and technologies on the web. Learn about link topology, search engine analysis, web mining, and more.
The Range of Webometrics: Forms of Digital Social Utility as Tools Professor Peter Ingwersen, Ph.D. Information Interaction & Information Architecture Royal School of LIS, Denmark pi@iva.dk - http://www.iva.dk/pi
Table of Contents • Webometrics – Cybermetrics – a framework • Link topology & structural conceptions in webometrics • Overview of potentials • Search engine analysis • Link analyses – Web Impact Factor (Web-IF) • Dataset Usage Indicators • Web mining – Trend analyses (blog contents) • Concluding remarks Ingwersen
Webometrics • The study of quantitative aspects of the construction and use of information resources, structures and technologies on the Web, drawing on bibliometric and informetric methods • search engine performance • link structures, e.g., WIFs, cohesiveness of link topologies, etc. • users’ information behaviour (searching, browsing, etc.) • web page contents: knowledge mining, blog trends • dataset analyses & impact • Cybermetrics: quantitative studies of the whole Internet • i.e. chat, mailing lists, news groups, MUDs, etc., and the WWW • Lennart Björneborn 2001
infor-/biblio-/sciento-/cyber-/webo-/metrics (L. Björneborn & P. Ingwersen 2003) [Figure: diagram relating informetrics, bibliometrics, scientometrics, cybermetrics and webometrics]
Corona model (Björneborn 2004) • SCC: Strongly Connected Component • OUT: reachable from SCC • IN: traversable to SCC • Disconnected • IN-Tendrils: connected from IN • Tube: connecting IN to OUT • OUT-Tendrils: connected to OUT
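The regions named on this slide can be illustrated on a toy link graph. The following is a minimal sketch, not Björneborn's own procedure: it assumes a known node inside the SCC (`core_node`), uses a hypothetical set of links, and lumps tendrils, tubes and disconnected nodes into one `OTHER` group for brevity.

```python
from collections import deque

def reachable(start, adj):
    """Return all nodes reachable from `start` along directed links (start included)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def corona_components(links, core_node):
    """Split nodes into SCC, IN, OUT and OTHER relative to the component of `core_node`."""
    adj, rev, nodes = {}, {}, set()
    for src, dst in links:
        adj.setdefault(src, set()).add(dst)
        rev.setdefault(dst, set()).add(src)
        nodes.update((src, dst))
    forward = reachable(core_node, adj)    # nodes the core can reach
    backward = reachable(core_node, rev)   # nodes that can reach the core
    scc = forward & backward
    out = forward - scc
    in_ = backward - scc
    other = nodes - scc - out - in_        # tendrils, tubes, disconnected
    return {"SCC": scc, "IN": in_, "OUT": out, "OTHER": other}

# Hypothetical toy web: names are illustrative only.
TOY_LINKS = [
    ("s1", "s2"), ("s2", "s3"), ("s3", "s1"),  # core cycle (SCC)
    ("i1", "s1"),                              # IN page linking into the core
    ("s3", "o1"),                              # OUT page linked from the core
    ("i1", "t1"),                              # IN-tendril
    ("t2", "o1"),                              # OUT-tendril
    ("i1", "u1"), ("u1", "o1"),                # tube from IN to OUT
    ("d1", "d2"),                              # disconnected pair
]

regions = corona_components(TOY_LINKS, "s1")
```

Separating tendrils from tubes would need one more reachability pass from IN and towards OUT; that step is left out here.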
Source: www.cybergeography.org
Link terminology: basic concepts (L. Björneborn & P. Ingwersen 2003) [Figure: link diagram with nodes A to H] • B has an outlink to C; outlinking: ~ reference • B has an inlink from A; inlinked: ~ citation • B has a selflink; selflinking: ~ self-citation • A has no inlinks; non-linked: ~ non-cited • E and F are reciprocally linked • A is transitively linked with H via B and D; H is reachable from A by a directed link path • A has a transversal link to G: short cut • C and D are co-linked from B, i.e. have co-inlinks or shared inlinks: co-citation • B and E are co-linking to D, i.e. have co-outlinks or shared outlinks: bibliographic coupling
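The terms above can be checked mechanically on a small edge list. The sketch below is an illustration only: the `LINKS` list is reconstructed from the bullet statements (the original diagram is not available here), and the function names are hypothetical.

```python
# Directed links (source, target) reconstructed from the bullets above.
LINKS = [
    ("A", "B"), ("B", "B"), ("B", "C"), ("B", "D"), ("D", "H"),
    ("E", "F"), ("F", "E"), ("E", "D"), ("A", "G"),
]

def outlinks(node, links=LINKS):
    """Targets that `node` links to, selflinks excluded (~ references)."""
    return {t for s, t in links if s == node and t != node}

def inlinks(node, links=LINKS):
    """Sources that link to `node`, selflinks excluded (~ citations)."""
    return {s for s, t in links if t == node and s != node}

def has_selflink(node, links=LINKS):
    """True if `node` links to itself (~ self-citation)."""
    return (node, node) in links

def shared_inlinks(a, b, links=LINKS):
    """Nodes linking to both a and b: a and b are co-linked (~ co-citation)."""
    return inlinks(a, links) & inlinks(b, links)

def shared_outlinks(a, b, links=LINKS):
    """Nodes linked from both a and b: a and b are co-linking (~ bibliographic coupling)."""
    return outlinks(a, links) & outlinks(b, links)

def reciprocal(a, b, links=LINKS):
    """True if a links to b and b links to a."""
    return (a, b) in links and (b, a) in links
```

For example, `shared_inlinks("C", "D")` gives the nodes from which C and D are co-linked, and `shared_outlinks("B", "E")` gives the targets that B and E have in common.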
Levels of web nodes (Lennart Björneborn 2002) • 3 basic levels of web nodes: pages, sites, TLDs • different levels of selflinks and outlinks • a = page selflink • b = page outlink and site selflink • c = site outlink and TLD selflink • d = TLD outlink • more levels: frames (page sections), sub-sites, sub-TLDs ...
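The four link levels a to d can be sketched as a simple URL comparison. This is a rough illustration under simplifying assumptions that are not in the slides: a "site" is taken to be the full host name, a "TLD" is the last dot-separated label of the host, and a "page" is host plus path. The function name `link_level` is hypothetical.

```python
from urllib.parse import urlsplit

def _host(url):
    """Lower-cased host name of a URL ('' if missing)."""
    return (urlsplit(url).hostname or "").lower()

def _page(url):
    """(host, path) pair identifying a page; an empty path counts as '/'."""
    return (_host(url), urlsplit(url).path or "/")

def _tld(url):
    """Last label of the host name, e.g. 'dk' or 'com'."""
    return _host(url).rsplit(".", 1)[-1]

def link_level(source_url, target_url):
    """Classify a link by the lowest node level it stays within."""
    if _page(source_url) == _page(target_url):
        return "a: page selflink"
    if _host(source_url) == _host(target_url):
        return "b: page outlink / site selflink"
    if _tld(source_url) == _tld(target_url):
        return "c: site outlink / TLD selflink"
    return "d: TLD outlink"
```

A real analysis would also have to decide how sub-sites and sub-TLDs are treated, which this sketch ignores.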
Search engine analyses • See e.g. Judith Bar-Ilan’s excellent longitudinal analyses • Mike Thelwall et al. in several case studies • Scientific material on the Web: • Lawrence & Giles (1999): approx. 6 % of Web sites contain scientific or educational content • Increasingly: the Web is a web of uncertainty • Allen et al. (1999): biology topics from 500 Web sites assessed for quality: • 46 % of sites were “informative”, but: • 10-35 % inaccurate; 20-35 % misleading • 48 % unreferenced
http://searchenginewatch.com/3634992
www.internetworldstats.com Ingwersen Knoxville 2010
Possible types of Web-IF: • E-journal Web-IF: calculated by in-links, or calculated as traditional JIF (citations) • Scientific web site IF (by link analyses) • National / regional (some URL problems in TLDs) • Institutions, single sites • Other entities, e.g. domains • Best denominator: no. of staff, beds; or simply use external inlinks (Thelwall et al., 2002) • Blog IF: no. of external inlinks / blog entries • Twitter IF: no. of external inlinks / Twitter entries (Holmberg, 2009)
The only valid webometric tool: Site Explorer Yahoo Search … • If one enters (formerly valid) commands like: • Link:URL or Domain:topdomain (edu, dk) or Site:URL, one is transferred to: http://siteexplorer.search.yahoo.com/new/ • Or find it via this URL • The same facilities are available in click mode, starting from a given URL: • Finding ‘all’ web pages in a site • Finding ‘all’ inlinks to that site/those pages • Also without selflinks! This implies …
… to calculate Web Impact Factors • But one should be prudent in interpretations. • Note that the number of external inlinks is the best indicator of recognition (see sample) • Take care with how many sub-domains (and pages) are included in the click analysis. • Results can be downloaded
Consequences for Yahoo Site Explorer • Take care which domain level you are on: www.yahoo.com does not contain sub-domains like maps.yahoo.com, only those directly below its name. yahoo.com will thus contain maps… • Also beware of the path structure • Minor tests suggest that the inlink number probably counts inlinks, not inlinking web pages. Ingwersen 2010 Åbo
Search sample: www.db.dk/pi
Without selflinks …
The Web Impact Factor (Ingwersen, 1998) • Intuitively (naively?) believed to be similar to the Journal Impact Factor • Demonstrates recognition by other web sites, or simply impact, not necessarily quality • Central issue: are web sites similar to journals and web pages similar to articles? • Are in-links similar to citations, or simply road signs? • What is really calculated? • DEFINE WHAT YOU ARE CALCULATING: site or page IF
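To make "define what you are calculating" concrete, here is a minimal sketch of a site-level Web-IF as a ratio of inlinks to site size. It is an illustration under stated assumptions, not the exact procedure of Ingwersen (1998): links are given as hypothetical (source site, target site) pairs, the denominator is a page count supplied by the caller, and the `external_only` switch decides whether selflinks are counted.

```python
def web_impact_factor(links, site, n_pages, external_only=True):
    """Inlinks to `site` divided by the number of pages in `site`.

    links         -- iterable of (source_site, target_site) pairs
    n_pages       -- size of the target site (the chosen denominator)
    external_only -- if True, selflinks (source == target) are ignored
    """
    if n_pages <= 0:
        raise ValueError("n_pages must be positive")
    inlinks = sum(
        1
        for source, target in links
        if target == site and not (external_only and source == site)
    )
    return inlinks / n_pages

# Hypothetical link data; the site names are placeholders.
SAMPLE_LINKS = [
    ("a.example", "x.example"),
    ("b.example", "x.example"),
    ("x.example", "x.example"),  # selflink
    ("a.example", "y.example"),
]

external_wif = web_impact_factor(SAMPLE_LINKS, "x.example", n_pages=4)
overall_wif = web_impact_factor(SAMPLE_LINKS, "x.example", n_pages=4, external_only=False)
```

Swapping `n_pages` for another denominator (staff, blog entries) gives the other Web-IF variants; the choice should be reported alongside the figure.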
Web-links like citations? • Kleinberg (1998), between citation weights and Google’s PageRank: • Hubs ~ review articles: have many outlinks (refs) to: • Authority pages ~ influential (highly cited) documents: have many inlinks from hubs! • Typical: Web index pages = homepage with self-inlinks = table of contents
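The mutual reinforcement between hubs and authorities can be sketched with a few rounds of score updates in the style of Kleinberg's hubs-and-authorities iteration. This is a simplified illustration on hypothetical data: it uses plain power iteration with L2 normalisation and does not filter out intra-site links.

```python
def hits(links, iterations=50):
    """Return (hub, authority) score dicts for a list of (source, target) links."""
    nodes = sorted({node for link in links for node in link})
    hub = dict.fromkeys(nodes, 1.0)
    auth = dict.fromkeys(nodes, 1.0)
    for _ in range(iterations):
        # A page is a good authority if good hubs link to it.
        auth = {n: sum(hub[s] for s, t in links if t == n) for n in nodes}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # A page is a good hub if it links to good authorities.
        hub = {n: sum(auth[t] for s, t in links if s == n) for n in nodes}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth

# Hypothetical pages: two link lists ("hubs") pointing at three content pages.
TOY_WEB = [
    ("hub1", "p1"), ("hub1", "p2"), ("hub1", "p3"),
    ("hub2", "p1"), ("hub2", "p2"),
    ("other", "p1"),
]

hub_scores, auth_scores = hits(TOY_WEB)
top_hub = max(hub_scores, key=hub_scores.get)
top_authority = max(auth_scores, key=auth_scores.get)
```

In this toy web the page with the most outlinks to well-linked pages comes out as the top hub, and the most inlinked page as the top authority.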
Reasons for outlinking … • Outlinks mainly for functional purposes • Navigation, interest spaces… • Pointing to authority in certain domains? (Latour: rhetorical reasons for references and links) • Normative reasons for linking? (Merton) • Do we have negative links? • We do have non-linking (commercial sites)
Some additional reasons for providing links • In part analogous to providing references (recognition) • And, among others: • emphasising one's own position and relationships (professional, collaboration, self-presentation etc.) • sharing knowledge, experience, associations … • acknowledging support, sponsorship, assistance • providing information for various purposes (commercial, scientific, education, entertainment) • drawing attention to questions of individual or common interest and to information provided by others (the navigational purpose)
Other differences between references, citations & links • The time issue: • Aging of sources is different on the Web: • Birth, Maturity & Obsolescence happen faster • Decline & Death of sources occur too, but • Marriage, Divorce, Re-marriage, Death & Resurrection … and similar liberal phenomena are found on the Web! (Wolfgang Glänzel)
Dataset usage indicators: a novel webometric approach • Biodiversity datasets are: • Searchable • Downloadable … in • Open access • See e.g. GBIF website and 2009 publication: Vishwas S Chavan and Peter Ingwersen, BMC Bioinformatics, 2009, 10(Suppl 14):S2
Example: Denmark, GBIF dataset providers • DanBioInfoFacility: many datasets • HerbariumUA: only two datasets • Comparable US dataset provider: • OBIS: Ocean Bio Info System
DanBIF distribution of datasets: sample selection sorted by Search Events • Ingwersen, P. & Vishwas, C. (under review): Indicators for a Data Usage Index: An incentive for publishing primary biodiversity data through a global information infrastructure. BMC Bioinformatics.
Sample of Dataset Usage Indicators (DUI)
Issue tracking: Web mining • Adequate sampling requires knowledge of the structure and properties of the population, i.e. the Web space to be sampled • Issue tracking of known properties / issues may help • Web mining the unknown is more difficult, due to • the dynamic, distributed & diverse nature of the Web • the variety of actors and minimum of standards • the lack of quality control of contents • Web archaeology: study of the past Web
Nielsen BlogPulse: social utility indicator • Observes blogs worldwide by providing: • Trend search: development over time of terms/concepts (user selection!) • Featured trends: predefined categories • Conversation tracker: blog conversations • BlogPulse profiles: blog profiles • Look into: http://www.blogpulse.com/tools.html
BlogPulse Tools: Trend Search
Informetric methods useful • Co-occurrence analyses (terms; names…) • Co-link and co-linking analyses • Bradford-like (skewed) distributions of links are probably found in sectors of web space • In order to define the strong ties between top-frequency web objects in two sectors of topical difference • Weak (low-frequency) ties, Small Worlds, Serendipity between objects in the two sectors: UNEXPECTED relations may occur
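A co-occurrence analysis of terms can be sketched as counting how often two terms appear in the same document (page, blog entry). The snippet below is an illustrative sketch with hypothetical data and names; high counts would correspond to strong ties and low counts to weak ties.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(documents):
    """Count, for each unordered term pair, the documents containing both terms."""
    counts = Counter()
    for terms in documents:
        # Sort so each pair has one canonical key; set() ignores repeated terms.
        for pair in combinations(sorted(set(terms)), 2):
            counts[pair] += 1
    return counts

# Hypothetical documents, each reduced to a list of index terms.
DOCS = [
    ["web", "link"],
    ["web", "link", "blog"],
    ["blog", "trend"],
]

pair_counts = cooccurrence_counts(DOCS)
strongest_pair, strongest_count = pair_counts.most_common(1)[0]
```

The same counting applies to co-links if each "document" is the set of targets linked from one page.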
Weak Tie! Source: www.cybergeography.org
Concluding remarks • One should be somewhat cautious about Web-IF applications without careful sampling via robots, due to its lack of comprehensiveness and to uncertainty about what it actually signifies • One might also investigate further the behavioural aspects of providing and receiving links, to understand what the impact might mean and how/why links are made • Understand the Web space information structure better • Design workable robots, downloading & local analyses • Move into the social media and open access genres with social utility indicators
Concluding remarks - 2 • Issue tracking and web mining are applications of Web IR / Webometrics • Combined IR and informetric methods seem promising: • co-occurrence analyses, mapping, clustering • co-links and co-linking, transversal links • Knowledge discovery and use in diversified web spaces
Additional References • Adamic, L. (1999). The small world Web. Lecture Notes in Computer Science, 1696: 443-452. • Almind, T.C. & Ingwersen, P. (1997). Informetric analyses on the World Wide Web: Methodological approaches to “Webometrics”. Journal of Documentation, 53: 404-426. • Björneborn, L. (2001). Small-world linkage and co-linkage. Proceedings of the 12th ACM Conference on Hypertext, pp. 133-134. • Björneborn, L. & Ingwersen, P. (2001). Perspectives of webometrics. Scientometrics, 50(1): 65-82. • Björneborn, L. & Ingwersen, P. (2004). Towards a basic framework of webometrics. (submitted) • Broder, A. et al. (2000). Graph structure in the Web. Computer Networks, 33(1-6): 309-320. • Chakrabarti, S. et al. (1999). Mining the Web’s link structure. IEEE Computer, 32(8): 60-67. • Chavan, V.S. & Ingwersen, P. (2009). Towards a data publishing framework for primary biodiversity data. BMC Bioinformatics, 10(Suppl. 14): S2. • Granovetter, M.S. (1973). The strength of weak ties. American Journal of Sociology, 78(6): 1360-1380.
References - 2 • Ingwersen, P. (1998). The calculation of Web Impact Factors. Journal of Documentation, 54: 236-243. • Kousha, K. & Thelwall, M. (2007). How is science cited on the Web? A classification of Google unique Web citations. Journal of the American Society for Information Science and Technology, 58(11): 1631-1644. • Matthews, R. (1998). Six degrees of separation. New Scientist, June 6. • Newman, M.E.J. (2001). The structure of scientific collaboration networks. PNAS, 98(2): 404-409. • Rousseau, R. Daily time series of common single word searches in AltaVista and Northern Light. Cybermetrics, 2/3, paper 2. ISSN: 1137-5019. (http://www.cindoc.csic.es/cybermetrics/articles/v2ilp2.html) • Small, H. (1999). A passage through science: crossing disciplinary boundaries. Library Trends, 48(1): 72-108. • Swanson, D.R. (1986). Undiscovered public knowledge. Library Quarterly, 56(2): 103-118. • Thelwall, M. (2000). Web impact factors and search engine coverage. Journal of Documentation, 56: 185-189. • Watts, D.J. & Strogatz, S.H. (1998). Collective dynamics of ‘small-world’ networks. Nature, 393 (June 4): 440-442.