Web 2.0 and Grids for Scholarly Research

Web 2.0 and Grids for Scholarly Research
Peking University, July 27 2006
Geoffrey Fox
Computer Science, Informatics, Physics
Pervasive Technology Laboratories, Indiana University, Bloomington IN 47401
gcf@indiana.edu
http://www.infomall.org

Application Drivers
• Science Informatics for document analysis, as in the case of chemistry, which has very precise naming rules for compounds that allow accurate searches in documents
• Suggesting how to tag scientific documents, either while writing them or after the fact
• Journal web site of the future, as illustrated by Nature building the social bookmarking tool Connotea
• Conference support tools, which can benefit from features needed by journals
• This gives document-enhanced Cyberinfrastructure (CI)

Community Tools
• E-mail and list-serves are the oldest and best used
• Kazaa, Instant Messengers, Skype, Napster, BitTorrent for P2P Collaboration: text, audio-video conferencing, files
• del.icio.us, Connotea, CiteULike, Bibsonomy, Biolicious manage shared bookmarks
• MySpace, Bebo, Hotornot, Facebook, or similar sites allow you to create (upload) community resources and share them; Friendster, LinkedIn create networks
• http://en.wikipedia.org/wiki/List_of_social_networking_websites
• Writely, Wikis and Blogs are powerful specialized shared document systems
• ConferenceXP and WebEx share general applications
• Google Scholar tells you who has cited your papers while publisher sites tell you about co-authors
• Windows Live Academic Search has similar goals
• Note that sharing resources creates (implicit) communities
• Social network tools study graphs to both define communities and extract their properties

How to use Web 2.0 Community tools in CI
• Nearly all of them have “profiles”, “users”, “groups”, “friends” etc.
• Need to integrate these
• P2P File Sharing: maybe this is useful for sharing files in research groups (virtual organizations)
• Will modify Maze (http://maze.pku.edu.cn), a popular Chinese social P2P system with 2.5 million users
• BitTorrent: more popular than FTP, so why not use it for higher-performance, fault-tolerant cached file sharing?
• MySpace etc.: could consider MyGridSpace or MyScienceSpace that supports a similar document sharing model with users uploading pictures, papers and even data/services of interest
• Could include uploaded material in workflows
• Social Bookmarking and linking: discussed later
• http://gf6.ucs.indiana.edu:48990/SemanticResearchGrid/

[Figure: Document-enhanced Cyberinfrastructure architecture. Diagram labels: MyResearchDatabase; Bibliographic Database; Web service Wrappers; Document-enhanced Cyberinfrastructure; Traditional Cyberinfrastructure; Export: RSS, BibTeX, EndNote etc.; Community Tools; Generic Document Tools; del.icio.us; CiteULike; Connotea; Bibsonomy; Biolicious; Windows Live Academic Search; Google Scholar; Citeseer; Science.gov; PubChem; PubMed; CMT Conference Management; Manuscript Central; Integration/Enhancement User Interface etc.; Existing User Interface; New Document-enhanced Research Tools; Existing Document-based Research Tools]

Strategy
• It doesn’t seem useful to build the 251st community tool
• In fact a major barrier to use of existing tools is the question: what happens when a better tool comes along and/or the chosen tool disappears (unsupported/removed from the Web)?
• So assume we use existing tools but wrap them all as web services, so we can transfer information to new tools and integrate information between tools
• Need some “glue” logic, a “unification” database and a minimal user interface
• Bookmarking tools: del.icio.us, Connotea, CiteULike (includes plug-ins to major publisher sites)
• Document: Google Scholar, Windows Live, Citeseer tools, OSCAR3 for Chemistry, Science.gov (later)
• Journals: Manuscript Central
• Conferences: CMT from Microsoft or ?
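The wrap-and-unify strategy on this slide can be sketched roughly as follows. This is only an illustration under stated assumptions, not the implementation used in the talk: the names `Record`, `BookmarkService`, `InMemoryService` and `UnificationDB` are hypothetical, and a real wrapper would call each tool's web interface instead of holding records in memory.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Record:
    """Tool-neutral bookmark record (hypothetical schema)."""
    url: str
    title: str
    tags: frozenset = field(default_factory=frozenset)


class BookmarkService(ABC):
    """Common interface that every wrapped tool exposes (an assumption)."""

    @abstractmethod
    def export(self) -> list:
        """Return all records held by the tool."""

    @abstractmethod
    def add(self, record: Record) -> None:
        """Store one record in the tool."""


class InMemoryService(BookmarkService):
    """Stand-in for a wrapped tool; a real wrapper would call the tool's API."""

    def __init__(self, name):
        self.name = name
        self._records = {}

    def export(self):
        return list(self._records.values())

    def add(self, record):
        self._records[record.url] = record


class UnificationDB:
    """'Glue' store that merges records from several tools, keyed by URL."""

    def __init__(self):
        self.records = {}

    def harvest(self, service):
        # Pull every record from a tool and merge tags for URLs already seen.
        for rec in service.export():
            old = self.records.get(rec.url)
            tags = rec.tags | (old.tags if old else frozenset())
            self.records[rec.url] = Record(rec.url, rec.title, tags)

    def push(self, service):
        # Transfer the unified records to another (for example, newer) tool.
        for rec in self.records.values():
            service.add(rec)
```

With this shape, moving to a better tool amounts to harvesting from the old wrappers and pushing to the new one; the tools themselves never need to know about each other.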

Delicious Semantic Web/Grid
• http://del.icio.us purchased by Yahoo for ~$30M
• http://www.CiteULike.org
• http://www.connotea.org (Nature)
• Associate metadata with bookmarks specified by URLs or DOIs (Digital Object Identifiers)
• Users add comments and keywords (called tags)
• Users are linked together into groups (communities)
• Information such as title and authors is extracted automatically from some sites (PubMed, ACM, IEEE, Wiley etc.)
• BibTeX-like additional information in CiteULike
• This is perhaps the de facto Semantic Web, remarkable for its simplicity
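To make the bookmark metadata on this slide concrete, here is a small sketch of one possible record layout and a BibTeX export. The `Bookmark` fields, the `is_doi` heuristic and the `to_bibtex` helper are assumptions made for illustration; they do not reproduce the data model of del.icio.us, Connotea or CiteULike, and the DOI in the usage lines is a made-up example.

```python
from dataclasses import dataclass, field


@dataclass
class Bookmark:
    """One shared bookmark (hypothetical field names)."""
    identifier: str                      # a URL or a DOI
    title: str
    authors: list = field(default_factory=list)
    tags: list = field(default_factory=list)
    comment: str = ""
    user: str = ""
    group: str = ""


def is_doi(identifier):
    """Rough check only: DOI names begin with the '10.' prefix."""
    return identifier.startswith("10.")


def to_bibtex(key, bm):
    """Render a bookmark as a BibTeX @misc entry."""
    fields = [("title", bm.title)]
    if bm.authors:
        fields.append(("author", " and ".join(bm.authors)))
    fields.append(("doi" if is_doi(bm.identifier) else "url", bm.identifier))
    if bm.tags:
        fields.append(("keywords", ", ".join(bm.tags)))
    if bm.comment:
        fields.append(("note", bm.comment))
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields)
    return f"@misc{{{key},\n{body}\n}}"


if __name__ == "__main__":
    bm = Bookmark("10.1000/example", "An Example Paper",
                  authors=["A. Author", "B. Author"], tags=["grid", "chemistry"])
    print(to_bibtex("example2006", bm))
```

Tags map onto the BibTeX `keywords` field and free-text comments onto `note`, which keeps the flat tagging model intact while still allowing export to reference managers.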

Connotea

Connotea queried by SERVOGrid

Document-enhanced Cyberinfrastructure aka Semantic Scholar Grid I
• Citeseer and Google Scholar scour the Internet and analyze documents for incidental metadata:
• Title, author and institution of documents
• Citations with their own metadata, allowing one to match them to other documents
• Science.gov extracts metadata from lots of US Government databases
• These capabilities are sure to become more powerful and to be extended:
• Give “Citation Index” in real time
• Tell you all authors of all papers that cite a paper that cites you etc. (note it’s a small world, so don’t go too far in link analysis)
• Tell you all citations of all papers in a workshop
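The "authors of all papers that cite a paper that cites you" query is a bounded walk over the citation graph. The sketch below assumes the graph is already available as plain dictionaries; `citing_authors`, `cited_by` and `authors` are hypothetical names, and the depth limit reflects the small-world caution on the slide.

```python
from collections import deque


def citing_authors(paper, cited_by, authors, max_depth=2):
    """Collect authors of papers reachable from `paper` by following
    'cited by' links at most `max_depth` times (breadth-first search)."""
    seen = {paper}
    frontier = deque([(paper, 0)])
    found = set()
    while frontier:
        current, depth = frontier.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for citer in cited_by.get(current, ()):
            if citer not in seen:
                seen.add(citer)
                found.update(authors.get(citer, ()))
                frontier.append((citer, depth + 1))
    return found


if __name__ == "__main__":
    # Toy data: p1 cites "mine", p2 cites p1, p3 cites p2.
    cited_by = {"mine": ["p1"], "p1": ["p2"], "p2": ["p3"]}
    authors = {"p1": ["Ann"], "p2": ["Bob"], "p3": ["Cy"]}
    print(sorted(citing_authors("mine", cited_by, authors, max_depth=2)))
```

Keeping `max_depth` small matters in practice: in a small-world graph the reachable set grows quickly and soon covers most of the field.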

Document-enhanced Cyberinfrastructure aka Semantic Scholar Grid II
• It is natural to develop core document Services such as those used in Citeseer/Google Scholar but applied to “your” documents of interest that may not have been processed yet (as just submitted to a conference, perhaps)
• These tools can help form useful lists such as authors of all cited or submitted papers to a journal
• OSCAR2/3 (from Peter Murray-Rust’s group at Cambridge) augment the application-independent “core” metadata (title, authors, institutions, citations) with a list of all chemical terms
• This tool is a Service that can be applied to “your” document or to a set of documents harvested in some fashion
• Other fields have natural application-specific metadata, and OSCAR-like tools can be developed for them
• Such high-value tools could appear on “publisher” sites of the future (or else publishers will disappear)

OSCAR3 Service from Cambridge UK
• Oscar3 is a tool for shallow, chemistry-specific natural language parsing of chemical documents (i.e. journal articles).
• It identifies (or attempts to identify):
• Chemical names: singular nouns, plurals, verbs etc., also formulae and acronyms.
• Chemical data: Spectra, melting/boiling point, yield etc. in experimental sections.
• Other entities: Things like N(5)-C(3) and so on.
• Uses SMILES, InChI and CML
• There is a larger effort, SciBorg, in this area
• http://www.cl.cam.ac.uk/~aac10/escience/sciborg.html
• http://wwmm.ch.cam.ac.uk/wikis/wwmm/index.php/Oscar3

OSCAR2 Chemistry Document analysis
• It detects “magic” chemical strings in text and then:
• Stores them as metadata associated with the document
• Queries ChemInformatics repositories to tell you lots of information about identified compounds
• Tells you which other documents have this compound
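The detect, store and cross-reference steps on this slide can be illustrated with a toy index. This is not OSCAR and does not use its API: the fixed `KNOWN_COMPOUNDS` lexicon and simple word matching stand in for real chemical name recognition, all names here are hypothetical, and the repository lookup step is left out.

```python
import re
from collections import defaultdict

# Toy lexicon standing in for real chemical name recognition (an assumption).
KNOWN_COMPOUNDS = {"benzene", "ethanol", "acetone"}


def extract_compounds(text):
    """Return the known compound names that occur as words in the text."""
    words = re.findall(r"[a-z][a-z0-9\-]*", text.lower())
    return {w for w in words if w in KNOWN_COMPOUNDS}


class CompoundIndex:
    """Stores compound terms as document metadata plus an inverted index."""

    def __init__(self):
        self.doc_terms = {}                  # document id -> set of compounds
        self.term_docs = defaultdict(set)    # compound -> set of document ids

    def add_document(self, doc_id, text):
        terms = extract_compounds(text)
        self.doc_terms[doc_id] = terms
        for term in terms:
            self.term_docs[term].add(doc_id)

    def related_documents(self, doc_id):
        """Other documents that mention at least one of the same compounds."""
        related = set()
        for term in self.doc_terms.get(doc_id, ()):
            related |= self.term_docs[term]
        related.discard(doc_id)
        return related


if __name__ == "__main__":
    index = CompoundIndex()
    index.add_document("d1", "The sample was dissolved in ethanol and benzene.")
    index.add_document("d2", "Benzene derivatives were prepared.")
    index.add_document("d3", "Only water was used.")
    print(index.related_documents("d1"))
```

The inverted index (`term_docs`) is what answers "which other documents have this compound" without rescanning every document.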

Clustering Documents from Chemical Properties

Provenance and Delicious CI
• We can use a del.icio.us-style interface to annotate Application Data with (extra) provenance and user comments of any type (describing quality of data, or a keyword relating different data, etc.)
• All data should be labeled by a URI to enable this
• One has in addition Citeseer/OSCAR metadata
• Current major tagging systems support a flat list of tags without name=value (RDF triple) or schema organization
• Tradeoff between features and pervasive deployment
• Some extra features are easy to add as a custom service
• Features not supported by del.icio.us can be uploaded as comments
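One way to carry name=value provenance through a flat tagging system is to serialize it into the comment field, as the last bullet suggests. The sketch below is an assumed encoding, not a del.icio.us feature: the `meta:` marker, the function names and the annotation dictionary layout are all hypothetical.

```python
import json

PREFIX = "meta:"  # hypothetical marker for the structured line in a comment


def encode_annotation(uri, tags, properties, note=""):
    """Build a flat-tag annotation for data labeled by `uri`, with
    name=value properties serialized as one JSON line in the comment."""
    comment = note
    if properties:
        packed = PREFIX + json.dumps(properties, sort_keys=True)
        comment = f"{note}\n{packed}".lstrip("\n")
    return {"uri": uri, "tags": sorted(set(tags)), "comment": comment}


def decode_properties(annotation):
    """Recover the name=value properties from an annotation's comment."""
    for line in annotation["comment"].splitlines():
        if line.startswith(PREFIX):
            return json.loads(line[len(PREFIX):])
    return {}


if __name__ == "__main__":
    ann = encode_annotation(
        "http://example.org/data/run42",          # made-up data URI
        ["gps", "raw"],
        {"quality": "good", "instrument": "A"},
        note="checked by hand",
    )
    print(ann["comment"])
    print(decode_properties(ann))
```

The flat tags remain searchable by any ordinary tagging tool, while a custom service that knows the marker can recover the structured part; that is the features-versus-deployment tradeoff in miniature.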

Current Status
• Google Scholar, Windows Live Academic Search, del.icio.us, Connotea, CiteULike, OSCAR3 are Web Services
• Debugging on 500 presentations and papers from my CGL research group
• Experiment with GGF Presentations, a broad collection of Chemical Informatics resources (explore science document CI link) and the Concurrency and Computation: Practice and Experience Web site (business model for journals?)
