Evaluating (Scientific) Knowledge for people, documents, organizations/activities/communities
ICiS Workshop: Integrating, Representing and Reasoning over Human Knowledge, Snowbird, August 9, 2010
Geoffrey Fox gcf@indiana.edu
http://www.infomall.org http://www.futuregrid.org http://pti.iu.edu/
Director, Digital Science Center, Pervasive Technology Institute
Associate Dean for Research and Graduate Studies, School of Informatics and Computing
Indiana University Bloomington
My Role
• My research contribution would be building Cyberinfrastructure (MapReduce/Bigtable/Clouds/Information Visualization) for "Integrating, Representing and Reasoning over Human Knowledge"
• Use FutureGrid to prototype Cloud/Grid environments
• Here I talk in my role as a frustrated journal editor and School bureaucrat responsible for advising faculty on how to get NSF grants and tenure
Knowledge Evaluation is Important?
• Review of journal or conference papers
• Several conference management systems exist, but they don't offer reviewing tools
• Supporting the choice of panels reviewing proposals
• And proposal review itself
• Supporting the choice of a Program Committee for a conference
• Supporting the promotion and tenure process
• The h-index appears in several referee reports
• Supporting the ranking of organizations such as journals, universities and (Computer Science) departments
• Deciding if some activity is useful, such as TeraGrid; a particular agency or agency program; a particular evaluation process (panel vs. individual reviews)
• Deciding if some concept is useful, such as multidisciplinary research, theory, computing .......
• Evaluation of knowledge evaluation methodologies themselves
"Policy Informatics" aka "Command & Control" (military knowledge)
• In the Data-Information-Knowledge-Wisdom-Decision (Evaluation) pipeline, some steps are "dynamic" (they can be redone if you save the raw data), but decisions are often "final" or "irreversible"
• We could (and, as preprints, do) publish everything since "disks are free" and revise our evaluations
• But there is a finite amount of research funding and a finite number of tenure positions
Citation Analysis
• Use of Google Scholar (Publish or Perish) to analyze the contribution of individuals is well established
• #papers, #citations, h-index, hc-index (contemporary), g-index (square), ..... (the h- and g-index calculations are sketched below)
• There is ambiguity as to the "best metric", and whether such metrics are sound at all, but in some cases perhaps the most serious problem is calculating them in an unbiased fashion
• One can probably find metrics for "Geoffrey Fox", but it is hard for more common names; most Asian names, for example, are hard
• Google Scholar has a crude approach to refining results by including and excluding names, e.g. include "Indiana University" or exclude "GQ Fox" (it is not clear where the words must appear)
• "Automating" this is hard unless the analysis for each name is done by hand
• Even the name itself is nontrivial: one needs both "GC Fox" and "Geoffrey Fox"
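A minimal sketch, assuming one already has a list of per-paper citation counts (e.g. exported from Publish or Perish), of how the h-index and g-index mentioned above can be computed. The citation counts shown are purely illustrative, not real data for any author.

```python
def h_index(citations):
    """h-index: the largest h such that h papers each have at least h citations."""
    counts = sorted(citations, reverse=True)
    h = 0
    for i, c in enumerate(counts, start=1):
        if c >= i:
            h = i
        else:
            break
    return h


def g_index(citations):
    """g-index (restricted to the papers available): the largest g such that
    the top g papers together have at least g^2 citations."""
    counts = sorted(citations, reverse=True)
    total, g = 0, 0
    for i, c in enumerate(counts, start=1):
        total += c
        if total >= i * i:
            g = i
    return g


# Illustrative citation counts, not real data
paper_citations = [312, 190, 88, 40, 25, 12, 7, 3, 1, 0]
print(h_index(paper_citations))  # 7
print(g_index(paper_citations))  # 10
```

The hard part flagged on the slide is not this arithmetic but getting an unbiased, correctly disambiguated list of papers per author in the first place.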
Evaluating Documents
• As a journal editor, I find choosing referees (and persuading them to write a report) to be the hardest problem
• Especially with the increasing number of non-traditional authors
• Need to identify related work and find the authors or previous referees of these related papers
• Currently ScholarOne uses a largely useless keyword system
• Can also look at the originality of an article from the overlap in text between it and some corpus (typically a conference paper resubmitted unchanged); a sketch of such an overlap check is given below
• If unfamiliar with the authors, need to identify which author of a multi-author paper is the appropriate contact, where they are now, and their contact information
• Current services (DBLP, ACM Portal, LinkedIn, Facebook) don't tell you the necessary information
• Need tools to quantify the reliability of referees
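A rough sketch of the text-overlap check mentioned above, assuming plain-text versions of the submission and of earlier papers are available; the function names, the 5-word shingle size, and the file names in the usage comment are my own illustrative choices, not any existing service's API.

```python
import re


def shingles(text, n=5):
    """Split text into overlapping n-word shingles (lowercased, punctuation stripped)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap(submission_text, prior_text, n=5):
    """Jaccard similarity of shingle sets; values near 1 suggest near-verbatim reuse."""
    a, b = shingles(submission_text, n), shingles(prior_text, n)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical usage: flag earlier papers whose text heavily overlaps the submission
# score = overlap(open("submission.txt").read(), open("prior_paper.txt").read())
```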
Is High Performance Computing Useful for Improving Knowledge?
• Are papers that use TeraGrid "better" than those that don't?
• Does TeraGrid help enhance knowledge?
• Correlate the quality and type of papers with "use of TeraGrid"
• This can possibly be done by text analysis (does the paper acknowledge TeraGrid?); a sketch of such a scan follows
• Here, use the indirect mapping TeraGrid Projects/People -> Papers
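A simple illustrative scan for the direct text-analysis approach mentioned above (does the paper acknowledge TeraGrid?). The patterns are guesses at what an acknowledgment might contain; in particular the NSF award-number form is a hypothetical example, not an official format.

```python
import re

# Illustrative patterns only; real TeraGrid acknowledgments vary in wording
ACK_PATTERNS = [
    re.compile(r"\bTeraGrid\b", re.IGNORECASE),
    re.compile(r"\bNSF\b.{0,60}\bOCI[-\s]?\d{7}\b", re.DOTALL),  # hypothetical award-number form
]


def acknowledges_teragrid(paper_text):
    """Return True if the paper's full text appears to acknowledge TeraGrid."""
    return any(p.search(paper_text) for p in ACK_PATTERNS)


# Hypothetical usage over a dict of extracted full texts keyed by filename:
# flagged = [name for name, text in paper_texts.items() if acknowledges_teragrid(text)]
```

The indirect mapping on the slide (TeraGrid Projects/People -> Papers) would instead join project and user records against author lists, which avoids depending on acknowledgment wording.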
Need a Freely Available Toolkit
• Firstly, current tools such as Google Scholar and CiteSeer have insufficient scope
• Google Scholar is stuck in an early stage of "perpetual beta (?alpha)" after killing Windows Academic Live
• Secondly, need to enable customization so that one can explore evaluation choices
• Current CS department rankings put Indiana in the dungeon, partly because Fox/Gannon papers are not counted as they are not in approved journals
• Don't want to let Thomson control Impact Factors (relevant for tenure, especially in Asia?) without scientific scrutiny
• As discussed, ScholarOne (also Thomson) is dreadful but seems to have growing adoption
• Want to explore new ideas such as evaluating TeraGrid
Tools Needed
• More accurate scientific profiles; ACM Portal says I have 3 publications, DBLP 250, Google Scholar 1000
• None of them tells you my contact and professional information
• Unbundled CiteSeer/Google Scholar allowing more accurate document analysis
• e.g. analyze the document in hand (as in a conference submission)
• Open decomposition into authors, title, institution, emails, abstract, paper body, and citations (a data-structure sketch follows this slide)
• Analyzers of citations and/or text to suggest referees
• Analysis of the novelty of a document
• Tool to produce an accurate h-index (etc.)
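A sketch of the open decomposition asked for above, as a plain data structure, together with a trivial citation-based referee suggester. The field names, class name, and ranking heuristic are my own illustrative choices under the assumption that decomposed metadata for a corpus is already available; this is not the interface of any existing service.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Document:
    """Open decomposition of a submission into the pieces these tools would work on."""
    title: str
    authors: List[str]
    institutions: List[str]
    emails: List[str]
    abstract: str
    body: str
    citations: List[str] = field(default_factory=list)  # e.g. cited titles or DOIs


def suggest_referees(submission: Document, corpus: List[Document], top_k: int = 5):
    """Rank corpus authors by how many of the submission's citations hit their papers."""
    cited = set(submission.citations)
    scores = {}
    for doc in corpus:
        if doc.title in cited:
            for author in doc.authors:
                scores[author] = scores.get(author, 0) + 1
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

Each piece (profile lookup, decomposition, citation analysis, novelty scoring, h-index) could then be composed or replaced independently, which is the point of unbundling.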
Some Research Needed
• Open analysis of concepts like Impact Factor, h-index, and indexing services
• Look at the definitions and the possibility of making valid deductions from them
• How do we evaluate "groups" (research groups, departments) as opposed to individuals?
• Can one automate the currently time-consuming manual steps?
• Identity confusion in Google Scholar
• Research profiles
• Compare the traditional ethnographic approach to evaluation (do a bunch of interviews) with a data-deluge-enabled version
• Why are Web 2.0 tools like Delicious, Facebook etc. so little used in science?