This conference focuses on the integration of Web 2.0 and the Semantic Grid to enhance cyberinfrastructure for collaborative scientific research. Topics include semantic analysis of scientific documents, tagging tools, digital library integration, workflow support, community tools, and mashups.
SKG2006 Introduction http://www.culturegrid.net/SKG2006/ Guilin China November 2 2006 Geoffrey Fox Computer Science, Informatics, Physics Pervasive Technology Laboratories Indiana University Bloomington IN 47401 gcf@indiana.edu http://www.infomall.org
SKG2006 • Last year saw the first conference of this series in Beijing covering • Knowledge sharing • Semantic networking • Grid computing • These areas underlie • Electronic Science (eScience) • Scholarship and • Communities (the real world) • This year we are pleased to present the second conference which had an 18% acceptance rate for regular papers • We look forward to the meeting next year in Xi’an • Listen and ask lots of questions! • Let’s thank Hai Zhuge and CAS for their wonderful vision and implementation
Web 2.0, Knowledge and the Semantic Grid • SKG 2006, Guilin China, November 2 2006 • Geoffrey Fox, Indiana University
Motivation • Build Cyberinfrastructure (Grids) that • Supports science from beginning (planning, instruments) through middle (analysis) and end (refereed publications, follow-on work) • Integrates with the popular Web 2.0 (community) tools whose successes point to interesting ways of working together • Integrates with Digital Library technology • Does not redo previous work but rather augments it • Assumes a heterogeneous fragmented world with multiple platforms • Allows one to specify and manage all the services and data that a project needs with a mix of synchronous, asynchronous, close (classic workflow) and loose (including zero) coupling
Application Drivers • Semantic analysis of scientific documents, as in the case of chemistry, which has very precise naming rules for compounds that allow accurate searches in documents • Suggesting how to tag scientific documents, either while writing them or after the fact • Journal web site of the future, as illustrated by Nature building the social bookmarking tool Connotea • Conference support tools, which can benefit from features needed by journals • This gives Digital Library (document) enhanced Cyberinfrastructure (CI)
The Science Drivers • From Workshop on Challenges of Scientific Workflows http://vtcpc.isi.edu/wiki/index.php/Main_Page • Workflow is the underlying support for the current science model • Distributed, interdisciplinary, data-deluged scientific methodology as an end (instrument, conjecture) to end (paper, Nobel prize) process is a transformative approach • Reproducibility is core to the scientific method and requires rich provenance and interoperable persistent repositories, with linkage of open data and publication as well as distributed simulations, data analysis and new algorithms • Distributed Science Methodology publishes all steps in a new electronic logbook capturing the scientific process (data analysis) as a rich cloud of resources including emails, PPT, Wikis as well as databases, compiler options, build time/runtime configuration…
Community Tools • E-mail and listservs are the oldest and best used • Kazaa, Instant Messengers, Skype, Napster, BitTorrent for P2P Collaboration – text, audio-video conferencing, files • del.icio.us, Connotea, Citeulike, Bibsonomy, Biolicious manage shared bookmarks • MySpace, Bebo, Hotornot, Facebook, or similar sites allow you to create (upload) community resources and share them; Friendster, LinkedIn create networks • http://en.wikipedia.org/wiki/List_of_social_networking_websites • Writely, Wikis and Blogs are powerful specialized shared document systems • ConferenceXP and WebEx share general applications • Google Scholar tells you who has cited your papers while publisher sites tell you about co-authors • Windows Live Academic Search has similar goals • Note that sharing resources creates (implicit) communities • Social network tools study graphs to both define communities and extract their properties • Mashups link resources together (federation/workflow)
How to use Web 2.0 Community tools in CI • Nearly all of them have “profiles”, “users”, “groups”, “friends” etc. • Need to integrate these • P2P File Sharing: Maybe this is useful for sharing files in research groups (virtual organizations) • Will modify Maze http://maze.pku.edu.cn – popular Chinese social P2P system with 2.5 million users • BitTorrent: more popular than FTP – why not use for higher performance fault tolerant cached file sharing? • MySpace etc.: Could consider MyGridSpace or MyScienceSpace that supports a similar document sharing model with users uploading pictures, papers and even data/services of interest • Could include uploaded material in workflows • Social Bookmarking and linking: discuss later • http://gf6.ucs.indiana.edu:48990/SemanticResearchGrid/
Mashups and Grids • http://www.programmableweb.com • There were 303 “commodity” Web 2.0 service APIs on October 30 2006 • Mashups are composed from JavaScript, AJAX and REST and not usually BPEL, WSDL and SOAP • Architecture of Mashups and Grids is “identical” • See Amazon S3 storage and EC2 (Elastic Compute Cloud) services • Mashups enable everybody to contribute
Figure: Mashup APIs with use indicated by size • Note most Mashups are implemented client side inside the Browser • Most Grid workflows are executed server side
Figure: Document-enhanced Cyberinfrastructure. Labels in the diagram: MyResearchDatabase (bibliographic database) • Web service wrappers • Export: RSS, BibTeX, EndNote etc. • Integration/Enhancement user interface • Traditional Cyberinfrastructure • Generic Document Tools • Community Tools • Existing document-based research tools (with existing user interface): Del.icio.us, CiteULike, Connotea, Bibsonomy, Biolicious, Windows Live Academic Search, Google Scholar, Citeseer, Science.gov, PubChem, PubMed, CMT Conference Management, Read Journals, Submit Journals • New document-enhanced research tools
Digital Library-enhanced Cyberinfrastructure aka Semantic Scholar Grid I • Citeseer and Google Scholar scour the Internet and analyze documents for incidental metadata • Title, author and institution of documents • Citations with their own metadata allowing one to match to other documents • Science.gov extracts traditional library metadata from lots of US Government databases • These capabilities are sure to become more powerful and to be extended • Give “Citation Index” in real time • Tell you all authors of all papers that cite a paper that cites you etc. (Note it’s a small world so don’t go too far in link analysis) • Tell you all citations of all papers in a workshop
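The "papers that cite a paper that cites you" query is a bounded walk over a reversed citation graph. The sketch below is only an illustration on made-up data: the paper ids, author names and the helper names (`build_cited_by`, `citing_papers`, `authors_of`) are all assumptions, not part of Citeseer or Google Scholar. The `max_hops` limit reflects the small-world caveat: each extra hop quickly pulls in most of the literature.

```python
from collections import defaultdict

# Hypothetical toy data: paper id -> (authors, ids of papers it cites).
PAPERS = {
    "you": (["Fox"], []),
    "p1": (["Li", "Chen"], ["you"]),
    "p2": (["Smith"], ["p1"]),
    "p3": (["Chen", "Wang"], ["p1", "you"]),
    "p4": (["Garcia"], ["unrelated"]),
}


def build_cited_by(papers):
    """Reverse the reference lists: cited paper id -> set of citing paper ids."""
    cited_by = defaultdict(set)
    for paper_id, (_authors, references) in papers.items():
        for ref in references:
            cited_by[ref].add(paper_id)
    return cited_by


def citing_papers(target, cited_by, max_hops):
    """Papers that reach `target` through at most `max_hops` citation links."""
    seen = set()
    frontier = {target}
    for _ in range(max_hops):
        next_frontier = set()
        for paper_id in frontier:
            next_frontier |= cited_by.get(paper_id, set())
        # Keep only papers not reported at an earlier hop.
        frontier = next_frontier - seen - {target}
        seen |= frontier
    return seen


def authors_of(paper_ids, papers):
    """Sorted, de-duplicated authors of the given papers."""
    return sorted({a for pid in paper_ids for a in papers[pid][0]})


cited_by = build_cited_by(PAPERS)
one_hop = authors_of(citing_papers("you", cited_by, 1), PAPERS)
two_hops = authors_of(citing_papers("you", cited_by, 2), PAPERS)
```

With this toy data, one hop reaches the authors of p1 and p3, and two hops add the author of p2.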
Digital Library-enhanced Cyberinfrastructure aka Semantic Scholar Grid II • It is natural to develop knowledge extraction document Services such as those used in Citeseer/Google Scholar but applied to “your” documents of interest that may not have been processed yet • For example, a paper just submitted to a conference • These tools can help form useful lists such as the authors of all papers cited by or submitted to a journal • OSCAR3 (from Peter Murray-Rust’s group at Cambridge) augments the application independent “core” metadata (Title, authors, institutions, Citations) with a list of all chemical terms • This tool is a Service that can be applied to “your” document or to a set of documents harvested in some fashion • Other fields have natural application specific metadata and OSCAR like tools can be developed for them • Such high value tools could appear on “publisher” sites of the future
OSCAR Chemistry Document analysis • It detects “magic” chemical strings in text and then • Stores them as metadata associated with document • Queries ChemInformatics repositories to tell you lots of information about identified compounds • Tells you which other documents have this compound
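The detect, store and "which other documents have this compound" steps can be sketched with a toy dictionary lookup. This is not how OSCAR3 works internally (it does real chemistry-specific language parsing); the lexicon, document ids and function names below are assumptions for illustration, and the repository-query step is left out.

```python
import re
from collections import defaultdict

# Hypothetical mini-lexicon mapping compound names to SMILES strings.
LEXICON = {"propane": "CCC", "ethane": "CC", "methanol": "CO"}


def extract_compounds(text):
    """Return the lexicon names that appear as whole words in `text`."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(words & set(LEXICON))


def index_documents(docs):
    """Build per-document metadata and a compound -> documents index."""
    metadata = {}
    by_compound = defaultdict(set)
    for doc_id, text in docs.items():
        names = extract_compounds(text)
        # Store the detected terms as metadata attached to the document.
        metadata[doc_id] = [{"name": n, "smiles": LEXICON[n]} for n in names]
        for name in names:
            by_compound[name].add(doc_id)
    return metadata, by_compound


# Hypothetical documents.
DOCS = {
    "doc-a": "Propane and ethane were separated by distillation.",
    "doc-b": "The yield of ethane increased at higher temperature.",
}

metadata, by_compound = index_documents(DOCS)
```

Looking up `by_compound["ethane"]` then answers "which other documents have this compound" for the toy collection.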
Scholar Grid III • Search and annotation provide unstructured and structured Semantic Web/Grid for documents • Other Web 2.0 tools address linkage of people together and people to information • Information is metadata as in profiles or personal publication as in Blogs, Wikis, YouTube, MySpace • All of these involve some sort of collaboration • Ranging from comments on Blogs and uploads to collaborative editing in a Wiki • Our projects usually use Wikis as central control (group logbook) and each researcher (including students) can use Blogs to define progress (an experimental Web 2.0 electronic notebook) • I can comment on student progress with Blog comment • Other students can keep abreast of group progress • Security model not clear • There is also P2P file transfer with BitTorrent
Delicious Semantic Web/Grid • http://del.icio.us purchased by Yahoo for ~$30M • http://www.CiteULike.org • http://www.connotea.org (Nature) • Associate metadata with Bookmarks specified by URLs or DOIs (Digital Object Identifiers) • Users add comments and keywords (called tags) • Users are linked together into groups (communities) • Information such as title and authors extracted automatically from some sites (PubMed, ACM, IEEE, Wiley etc.) • BibTeX-like additional information in CiteULike • This is perhaps de facto Semantic Web – remarkable for its simplicity
Biolicious automatically produces (interesting) scientific lists (screenshot callout: Advertising!)
Chemical Informatics as a Grid Application • Chemical Informatics is the application of information technology to problems in chemistry. • Example problems: managing data in large scale drug discovery and molecular modeling • Building Blocks: Chemical Informatics Resources: • Chemical databases maintained by various groups • NIH PubChem, NIH DTP, http://nihroadmap.nih.gov/ • Application codes (both commercial and open source) • Data mining such as clustering • Quantum chemistry and molecular modeling • Screening centers (with HTS High Throughput Screening devices) measuring interaction of chemicals with biological samples • Visualization tools • Web resources: journal articles, etc. • Chemical Informatics Grid http://www.chembiogrid.org needs to integrate these into a common, loosely coupled, distributed computing environment.
OSCAR3 Service from Cambridge UK • Oscar3 is a tool for shallow, chemistry-specific natural language parsing of chemical documents (i.e. journal articles). • It identifies (or attempts to identify): • Chemical names: singular nouns, plurals, verbs etc., also formulae and acronyms. • Chemical data: Spectra, melting/boiling point, yield etc. in experimental sections. • Other entities: Things like N(5)-C(3) and so on. • Uses SMILES, InChI and CML • There is a larger effort, SciBorg, in this area • http://www.cl.cam.ac.uk/~aac10/escience/sciborg.html http://wwmm.ch.cam.ac.uk/wikis/wwmm/index.php/Oscar3
Workflows Using Chemical Literature (diagram) • Bulk download of PubMed abstracts • OSCAR3 Service / OSCAR3 program: extract chemical structures • All of PubMed “just” takes about a day to run through OSCAR3 on 2048 node Big Red • Searchable (structure/similarity) Grid database, shown with PubChem, PDBBind and the local DTP database • Find similar documents • Find similar molecules • Clustering of documents linked to clustering of chemicals
Example rows of the extracted table:
SMILES   NAME      PubMed ID
CCC      propane   1425356
CC       ethane    3546453
.....    .....     .....
Initial Results • We have a small sample (100) of full text Chemistry papers selected at random from 15 years of PubMed with over 5 million abstracts • OSCAR3 generates 4.17 compound names per abstract • and 36.7 compound names per full text paper (roughly nine times as many) • Illustrates how much knowledge journal publishers are hiding from us
Provenance and Delicious CI • We can use del.icio.us style interface to annotate Application Data with (extra) provenance and user comments of any type (describing quality of data or a keyword relating different data etc.) • All data should be labeled by a URI to enable this • One has in addition Citeseer/OSCAR metadata • Current major tagging systems support flat list of tags without name=value (RDF triple) or schema organization • Tradeoff between features and pervasive deployment • Some extra features are easy to add as a custom service • Features not supported by del.icio.us can be uploaded as comments
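One way to read "features not supported by del.icio.us can be uploaded as comments" is to serialize name=value provenance into the free-text comment of an otherwise flat bookmark. The sketch below assumes a made-up convention (lines beginning with `@`) and a generic bookmark dictionary; it does not call or mirror any real tagging service API, and the URI and property names are hypothetical.

```python
def to_flat_bookmark(uri, tags, properties, note=""):
    """Pack name=value properties into the comment field of a flat bookmark.

    The "@name=value" line convention is an assumption made for this sketch.
    """
    lines = [note] if note else []
    lines += [f"@{name}={value}" for name, value in sorted(properties.items())]
    return {"url": uri, "tags": sorted(set(tags)), "comment": "\n".join(lines)}


def properties_from_comment(comment):
    """Recover the name=value pairs, ignoring ordinary free-text lines."""
    props = {}
    for line in comment.splitlines():
        if not line.startswith("@"):
            continue
        name, sep, value = line[1:].partition("=")
        if sep and name.strip():
            props[name.strip()] = value.strip()
    return props


# Hypothetical data item labeled by a URI, as the slide requires.
bookmark = to_flat_bookmark(
    "http://example.org/data/run42",
    tags=["provenance", "quality"],
    properties={"instrument": "spectrometer-2", "quality": "good"},
    note="Checked by hand",
)
recovered = properties_from_comment(bookmark["comment"])
```

The flat tags stay usable in any existing tool, while a custom service that knows the convention can recover the structured pairs, which is the features-versus-deployment tradeoff noted above.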
Implementation Strategy • Doesn’t seem useful to build the 251st community tool • In fact a major barrier to use of existing tools is the question • What happens when a better tool comes along and/or the chosen tool disappears (unsupported/removed from Web)? • So assume we use existing tools but wrap them all as web services so we can transfer information to new tools and integrate information between tools • Need some “glue” logic, a “unification” database and minimal user interface • Bookmarking tools: del.icio.us, Connotea, CiteULike (includes plug-ins to major publisher sites) • Document: Google Scholar, Windows Live, Citeseer tools, OSCAR3 for Chemistry, Science.gov (later) • Journals: Manuscript Central • Conferences: CMT from Microsoft or ?
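The wrap-and-unify idea can be sketched as one small adapter per tool feeding a common record, with "glue" logic that merges records by URL. Everything here is hypothetical: `tool_a` and `tool_b` stand in for any two bookmarking tools, and their export field names are invented rather than taken from a real service.

```python
from dataclasses import dataclass, field


@dataclass
class Bookmark:
    """Common record held in the "unification" database."""
    url: str
    title: str
    tags: set = field(default_factory=set)
    sources: set = field(default_factory=set)


def from_tool_a(record):
    # Hypothetical export shape: {"href": ..., "description": ..., "tag": "a b c"}
    return Bookmark(record["href"], record["description"],
                    set(record["tag"].split()), {"tool_a"})


def from_tool_b(record):
    # Hypothetical export shape: {"uri": ..., "title": ..., "keywords": [...]}
    return Bookmark(record["uri"], record["title"],
                    set(record["keywords"]), {"tool_b"})


def unify(bookmarks):
    """Merge records that point at the same URL (crude trailing-slash match)."""
    merged = {}
    for bm in bookmarks:
        key = bm.url.rstrip("/")
        if key in merged:
            merged[key].tags |= bm.tags
            merged[key].sources |= bm.sources
        else:
            # Copy the sets so the unified record does not alias the input.
            merged[key] = Bookmark(bm.url, bm.title, set(bm.tags), set(bm.sources))
    return merged


a = from_tool_a({"href": "http://example.org/paper1/",
                 "description": "Paper One", "tag": "grid semantic"})
b = from_tool_b({"uri": "http://example.org/paper1",
                 "title": "Paper 1", "keywords": ["grid", "web2.0"]})
database = unify([a, b])
```

Because each tool is reached only through its adapter, replacing a tool that disappears means writing one new adapter while the unified records survive.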
Current Status • Google Scholar, Windows Live Academic Search, del.icio.us, Connotea, CiteULike, OSCAR3 are wrapped as Web Services • Debugging on 500 presentations and papers from my CGL research group • Experiment with GGF Presentations, a broad collection of Chemical Informatics resources (explore science document CI link) and the Concurrency and Computation: Practice and Experience Web site (business model for journals?) http://gf6.ucs.indiana.edu:48990/SemanticResearchGrid/
Knowledge Model for Scientific Journals • There are classes of scientific journals • Large circulation society journals effectively subsidized by fees of professional society membership; circulations can be more than 10,000 • “Popular” magazine style journals • A few prestigious journals • Many specialized journals publishing archival refereed papers with circulations from one hundred to a few thousand • The specialized journals largely sell a mix of paper and (a growing number of) electronic subscriptions to libraries and very few individuals subscribe • Access is limited and expensive • Even if one subscribes, one is often restricted on the number of full text papers one can access • Collections like PubMed only include abstracts • Systems like Google Scholar, Windows Live Academic Search and Citeseer cannot fully analyze knowledge in papers unless they get access to full text • Current publishing model is hindering and not helping science • Similar discussion for journal papers and research data
Internet Business Models • How to make money on the Internet has been debated for many years • One can offer content (data on web) and/or services (user customizable transformations of web data) • Advertising is the dominant model in large sites. • Content and Services can be free or paid by Transactions or Subscriptions. • Often there is a mixed model with basic content/services free and one pays for premium features • One can charge reader or publisher. • Advertising charges the publisher of the Advert • In the past, journals were funded by page charges i.e. one charged the authors (institution) that produced the paper
Examples of Internet Information and Knowledge Content and Business Model • iTunes and other music sources; at the right price, people will pay for convenience • News web sites supported by a mix of advertising and premium content. Not clear the latter is successful except in specialized areas • Sites like http://www.chessbase.com/ with collections of Chess Games with occasional annotation • Several Financial Service sites: Yahoo, Google etc. Financial Services with premium for real-time stock quotes • Other sites feature commentary that is either free (supported by advertising) or premium content (such as Wall Street Journal and many stock picker sites) which you subscribe to
Examples of Internet Information and Knowledge Services and Business Model • Google etc. online Office versus more sophisticated paid Microsoft Office, which also has a "history" advantage as it owned the field before the Internet • WebEx collaboration services paid by transaction or subscription; not obviously a viable long term model • ICC Chess Site http://www.chessclub.com/ supports the community of chess players with free basic access but valuable premium features including better game playing, rating and real-time commentary. Other gaming sites similar • Amazon S3 and Computing Cloud paid services could be successful as the alternative (buy your own computers) costs real money and is perhaps less reliable
Publishing Business Model in the Internet Age • Journal publishing currently has a business model where the price reflects neither the cost nor the value-added • Publishers currently do not have significant internal expertise in new approaches/technologies to drive new business models • However, much is outsourced already and so one can outsource to organizations with new expertise e.g. to those that know Web 2.0 rather than putting ink on paper • There is no clear new business model but it is plausible that the current model will not survive for long • So need to change even if less lucrative or success unclear • Note libraries provide funds to publishers and libraries will continue • Not clear how fast libraries will change as they also don’t obviously have expertise to support new models • Some think that one role of university libraries will be curation of data produced by university faculty
Strengths of Current Publishing Model • Permanent “guaranteed” archival storage, but there are other approaches to this such as Amazon S3 • Uniform look and feel and copyediting to remove language errors. • Useful but not so valuable that we can trade access for this. • In particular can only correct some language errors as only a subject expert can really rewrite in good grammar and expression • Refereeing of a quality implied by the journal and the editorial board • Most important strength but business model does not directly reflect this as only a small part of subscription price goes to editorial function • For most papers cost of refereeing is much less than other costs of producing the paper • Not clear why viewer should pay for refereeing • Large number of pre-existing papers from old issues of journals
Pressures on Current Publishing Model • Mandated open access to scholarly work funded by government • Cornyn-Lieberman bill in the US • NIH PubMed Central requires deposit of full text of articles after a length of time • Electronic access to publisher sites is not especially good • Division of articles into journals and publishers is not very helpful today where technology does not care about location of information • Location is just a rather simple annotation (metadata) specifying aspects of provenance of article • Note a special issue of SKG2006 is just an annotation roughly characterizing nature and quality of work • Publishing on the Internet is not a valuable service and has been addressed by Web servers in general and by Web 2.0 in attractive ways • Essentially nobody reads or even has access to paper copies of journals • Not clear it is useful to print specialized journals on paper
Scholarly Research Community Site • Best product should allow one to make best use of knowledge in scholarly publications and data • Should integrate journal and conference publications and services • Should contain integrated or support outside services for curation, annotation, analysis and search • Looking at Web 2.0 successes, one needs to conveniently share data and set up communities • Content is scholarly journals and data • Services include • Annotation as in Connotea, CiteULike, Del.icio.us • Semantic analysis for citations, authors, chemical compounds etc. • Biolicious style custom classifications including added value contacts • Search as in Google Scholar, Windows Live Academic Search • MySpace/Facebook/LinkedIn style services for existing or new contacts • Support of conference and journal refereeing • Other conference/journal services such as registration, advertising • Integration with research such as electronic log books • Internal integration e.g. Authors in citations are linked to community • Links to more general document services such as: • Online Office style Tools • WebEx type collaboration
Business Model for Scholarly Journal/Research Community Site • One can charge for advertising, better content, better services or better implementation • It is natural to start with basic free content and services supported by advertising. • Content must be free eventually “by law” • Services will have open source versions anyway so counter this with free basic services • One could use page charge model for charging for refereeing. • One charges user for features that add value. These include: • Better or better implemented community/digital library services • Premium Content possibly contracted by site owner • Problem with Advertising Business model: Audience specialized (i.e. small) but upscale • Problem with charging for Community Tools: Competing with free software but likely can offer much better service than free software just as WebEx does fine in spite of free VNC