Information management, workflow and discovery / check-in for project definitions. Peter Fox, Xinformatics Week 9, March 27, 2012.
Review of reading • Information Integration • Social issues in information discovery and sharing: http://ctovision.com/2008/04/information-discovery-and-sharing/, http://odni.gov/reports/IC_Information_Sharing_Strategy.pdf • Information integration in geo-informatics http://www.isi.edu/integration/TerraWorld/ • http://cseweb.ucsd.edu/~goguen/projs/data.html • http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1839387/ • Information Life Cycle • MSDN Information Life Cycle • Information Life Cycle definition and context • http://www.computerworld.com/s/article/79885/The_new_buzzwords_Information_lifecycle_management • http://www.databasejournal.com/sqletc/article.php/3340301/Database-Archiving-A-Critical-Component-of-Information-Lifecycle-Management.htm • http://en.wikipedia.org/wiki/Information_Lifecycle_Management • http://msdn.microsoft.com/en-us/library/bb288451.aspx • Information Visualization • http://mastersofmedia.hum.uva.nl/2011/04/18/the-simple-ways-of-information-visualization/comment-page-1/ • http://www.siggraph.org/education/materials/HyperVis/domik/folien.html • http://www.visual-literacy.org/periodic_table/periodic_table.html • Information model development and visualization • http://www.acm.org/crossroads/xrds7-3/smeva.html • Outside the current box • Peter Fox and James Hendler, 2011, Changing the Equation on Scientific Data Visualization, Science, Vol. 331 no. 6018 pp. 705-708, DOI: 10.1126/science.1197654 online at http://www.sciencemag.org/content/331/6018/705.full or see: http://escience.rpi.edu/publications/visualization/fox_hendler_science2011.html
Logical Collections • The primary goal of a Management system is to abstract the physical collection into logical collections. The resulting view is a uniform homogeneous collection. • Note the analogy with logical models • Identifying naming conventions and organization • Aligning cataloguing and naming to facilitate search, access, use • Provision of **contextual** information
Physical Handling • This layer maps between the physical and the logical views. Here you find items like replication, backup, caching, etc. • Where and from whom does it come? • How is it transferred into a physical form? • Backup, archiving, and caching… • Formats • More --- naming conventions • Note analogy to physical models
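The mapping between logical and physical views can be sketched as a small catalog. This is a minimal illustration only: the `Catalog` class, the logical name `week9/slides` and the file locations are all hypothetical, chosen to show one logical item resolving to several physical replicas.

```python
from dataclasses import dataclass, field


@dataclass
class LogicalItem:
    """One entry in a logical collection, independent of where the bytes live."""
    logical_name: str
    replicas: list = field(default_factory=list)  # physical locations


class Catalog:
    """Hypothetical catalog that maps logical names to physical replicas."""

    def __init__(self):
        self._items = {}

    def register(self, logical_name, physical_location):
        # Create the logical entry on first use, then record the replica once.
        item = self._items.setdefault(logical_name, LogicalItem(logical_name))
        if physical_location not in item.replicas:
            item.replicas.append(physical_location)

    def resolve(self, logical_name):
        # Return a copy so callers cannot alter the catalog by accident.
        return list(self._items[logical_name].replicas)


# Hypothetical locations: a primary copy and a backup copy of the same item.
catalog = Catalog()
catalog.register("week9/slides", "file:///archive/a/week9.pdf")
catalog.register("week9/slides", "file:///backup/b/week9.pdf")
```

A user asks for `week9/slides`; whether the bytes come from the archive or the backup is a physical-handling decision hidden behind `resolve`.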
Interoperability Support • Normally the information does not reside in the same place, or various collections (like catalogues) should be put together in the same logical collection. • Programming or application interface access • Structure and vocabulary (metadata) conventions and standards
Security • Access authorization and change verification. This is the basis of trusting your information. • What mechanisms exist for securing? Who performs this task? • Change and versioning (yes, the information may change), who does this, how? • Who has access? • How are access methods controlled, audited? • Who and what – authentication and authorization? • Encryption and integrity
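Change verification and integrity can be illustrated with a content fingerprint. The sketch below assumes SHA-256 from Python's standard `hashlib` as the fingerprint; the function names are hypothetical, and a real system would also need authentication and authorization, which are not shown.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Return a SHA-256 hex digest that changes if the content changes."""
    return hashlib.sha256(content).hexdigest()


def is_unchanged(content: bytes, recorded: str) -> bool:
    """Compare the current content against a previously recorded fingerprint."""
    return fingerprint(content) == recorded


# Record a fingerprint when the information is stored (sample content).
recorded = fingerprint(b"version 1 of the information")
```

Storing the recorded fingerprint alongside version information gives one simple basis for deciding whether what you retrieve is what was deposited.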
Ownership • Define who is responsible for quality and meaning • Rights and policies – definition and enforcement • Limitations on access and use • Requirements for acknowledgement and use • Who defines and ensures quality, and how? • To whom may ownership migrate? • How to address replication? • How to address revised/derivative products?
Metadata • Recall metadata are data about data. • Metainformation is information about information. • How to know what conventions, standards, best practices exist? • How to use them, what tools? • Understanding costs of incomplete and inconsistent metadata • Understanding the line between metadata and data and when it is blurred • Knowing where and how to manage metadata and where to store it (and where not to)
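One way to make the cost of incomplete metadata visible is to check records against a list of required fields. The field list below is an assumption for illustration, not a standard; a real project would take it from whatever convention it adopts.

```python
# Assumed required fields for illustration; not taken from any standard.
REQUIRED_FIELDS = ("title", "creator", "date", "format")


def missing_fields(record: dict) -> list:
    """List required fields that are absent or blank in a metadata record."""
    return [
        name for name in REQUIRED_FIELDS
        if not str(record.get(name, "")).strip()
    ]


# Sample record with a blank creator and no format.
sample_record = {"title": "Week 9 slides", "creator": "", "date": "2012-03-27"}
```

Here `missing_fields(sample_record)` reports the blank `creator` and the absent `format`, the kind of gap that later blocks search and reuse.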
Persistence • Definition of lifetime. Deployment of mechanisms to counteract technology obsolescence. • Where will you put your information so that someone else (e.g. one of your class members) can access it? • What happens after the class, the semester, after you graduate? • What other factors are there to consider?
Discovery • Ability to identify useful relations and information inside the collection • If you choose (see ownership and security), how does someone find your information? • How would you provide discovery of collections, versus files, versus ‘bits’? • How to enable the narrowest/ broadest discovery? • More on this later in this class
Dissemination • Mechanism to make interested parties aware of changes and additions to the collections. • Who should do this? • How and what needs to be put in place? • How to advertise? • How to inform about updates? • How to track use, significance?
Summary of Information Management • Creation of logical collections • Physical handling • Interoperability support • Security support • Ownership • Metadata collection, management and access. • Persistence • Knowledge and information discovery • Dissemination and publication
Note for your project writeup! • Information management! Cover the 9 areas.
Information Workflow • What is a workflow? • Why would you use it? • Key considerations for information, cf. data • Some pointers to workflow systems
What is a workflow? • General definition: series of tasks performed to produce a final outcome • Information workflow – involves people, but potentially we want to: • Automate tedious jobs that a person traditionally performed manually for each dataset • Process large volumes of information faster than one could do by hand • NB difference from data workflows – it reaches out to encompass the user (e.g. ‘unrecorded actions’)
Background: Business Workflows • Example: planning a trip • Need to perform a series of tasks: book a flight, reserve a hotel room, arrange for a rental car, etc. • Each task may depend on outcome of previous task • Days you reserve the hotel depend on days of the flight • If hotel has shuttle service, may not need to rent a car • Prior information, experience, preferences…
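The task dependencies in the trip example can be sketched as a dependency graph. This sketch uses `graphlib.TopologicalSorter` from the Python standard library (3.9 or later); the task names and the assumption that the rental car decision waits on the hotel are illustrative.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it can start.
# The rental car waits on the hotel because a hotel shuttle may make it unnecessary.
trip_tasks = {
    "book_flight": set(),
    "reserve_hotel": {"book_flight"},
    "arrange_rental_car": {"reserve_hotel"},
}


def execution_order(dependencies: dict) -> list:
    """Return one valid order in which the tasks can be performed."""
    return list(TopologicalSorter(dependencies).static_order())
```

For this chain, `execution_order(trip_tasks)` yields the flight, then the hotel, then the rental car. Prior information and preferences, the last bullet above, are exactly what such a graph does not capture.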
What about information workflows? • Perform a set of transformations/operations on information source(s) – could also be on ‘data’ • Examples • Generating images from raw data • Identifying areas of interest from a large information source (e.g. word cloud) • Classifying set of objects • Querying a web service for more information on a set of objects • Many others…
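As a minimal sketch of a set of transformations on an information source, the steps behind a word cloud can be chained as small functions. The function names and the stopword list are hypothetical; a real workflow system would add scheduling, provenance and error handling.

```python
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Step 1: split the source text into lowercase words."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stopwords(tokens: list, stopwords: set) -> list:
    """Step 2: drop words that carry little meaning on their own."""
    return [t for t in tokens if t not in stopwords]


def top_terms(tokens: list, n: int) -> list:
    """Step 3: count terms and keep the n most frequent."""
    return Counter(tokens).most_common(n)


def run_workflow(text: str, stopwords: set, n: int = 3) -> list:
    """Chain the three steps; each output feeds the next input."""
    return top_terms(remove_stopwords(tokenize(text), stopwords), n)
```

Each step is reusable on its own, which is the property the later slides on workflow benefits rely on.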
More on Workflows • Can process many information types: • Archives • Web pages • Streaming/real time • Images • Semiotic systems • Robust workflows depend on formal (conceptual and logical) models of the flow of information among components • May be simple and linear or very complex
Challenges • Questions: • What are some challenges for users in implementing workflows? • What are some challenges to executing these workflows? • What are limitations of writing a program? • Mastering a programming language • Visualizing workflow • Sharing/exchanging workflow • Formatting issues • Locating datasets, services, or functions
Workflow Management Systems • Graphical interfaces for developing and executing scientific workflows • A (single, typically) user can create workflows by dragging and dropping • Can automate low-level processing tasks • Can provide access to repositories, compute resources, workflow libraries • Again: can work well for some tasks
Benefits of Workflows • Documentation of aspects of analysis • Visual communication of analytical steps • Ease of testing/debugging • Reproducibility • Reuse of part or all of workflow in a different project
Additional Benefits • Integration of and between multiple computing environments • ‘Automated’ access to distributed resources via other architectural components, e.g. web services and Grid technologies • System functionality to assist with information integration of heterogeneous components and source
Why not just use a script? • Script does not specify low-level task scheduling and communication • May be platform-dependent • Can’t be easily reused • May not have sufficient documentation to be adapted for another purpose
Why can a GUI be useful? • No need to learn a programming language • Visual representation of what workflow does • Allows you to monitor workflow execution • Enables user interaction (though not necessarily collaboration) • Facilitates sharing of workflows
Some workflow systems • Kepler • SCIRun • Sciflo • Triana • Taverna • Pegasus • Some commercial tools: • Windows Workflow Foundation • Mac OS X Automator • http://www.isi.edu/~gil/AAAI08TutorialSlides/5-Survey.pdf • http://www.isi.edu/~gil/AAAI08TutorialSlides/ • See reading for this week
Discovery • Recall forms of information • Structured/ un-structured • Presentation and organization • Syntax-semantics-pragmatics • Managed, designed and architected. • Goal of this part of the class is to understand how discovery is enabled or disabled based on these factors
Discovery • How does someone find your information? • How would you provide discovery of • collections • files • ‘bits’ • How would you find ->
Discovery • Search (Federated Search) • Helped by • Folksonomies (user contributed) • Intelligent Agents • Search Engines • Taxonomies • Find photos of Kim • Boy or girl?
Use cases • Find a sound recording of a swallow. • Excuse me?
Use cases • Find a sound recording of an African Swallow • Find a sound recording of a bird that sounds like an African Swallow • Media types – how can you discover them?
Use cases • Find the movie that Jeanne Tripplehorn first starred in/that was her most successful/in which she was lead actress? • Has anyone gene sequenced a mouse? • Find images of primary productivity in the North Atlantic • Discovery can often involve information integration (or is it *almost always*?)
Three level ‘metadata’ solution for DATA • Data Discovery • Data Integration • Level 1: Data Registration at the Discovery Level, e.g. Volcano location and activity • Level 2: Data Registration at the Inventory Level, e.g. list of datasets, times, products • Level 3: Data Registration at the Item Detail Level, e.g. access to individual quantities • Earth Sciences Virtual Database: a Data Warehouse where the schema heterogeneity problem is solved; schema-based integration • Ontology-based Data Integration • Using scientific workflows • (A.K. Sinha, Virginia Tech, 2006)
Three level ‘metadata’ solution? • Information Integration • Information Discovery • Level 1: Registration at the Discovery Level, e.g. Find the upper level entry point to a source • Level 2: Registration at the Inventory Level, e.g. list of datasets, using the logical organization • Level 3: Registration at the Item Detail Level, i.e. annotation, e.g. tagging • Catalog/Index • Schema-based integration • Integration using mapping management • (A.K. Sinha, Virginia Tech, 2006)
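The three registration levels can be sketched as a nested structure. Everything named here (the source `example-archive`, the dataset, the item identifiers and tags) is hypothetical and only illustrates how a tag at the item level leads back up to the entry point.

```python
# Hypothetical registry illustrating the three levels.
registry = {
    "source": "example-archive",            # Level 1: entry point to a source
    "inventory": {                          # Level 2: list of datasets
        "bird-recordings": {
            "items": {                      # Level 3: item detail (tags)
                "rec-001": {"tags": {"swallow", "audio"}},
                "rec-002": {"tags": {"sparrow", "audio"}},
            }
        }
    },
}


def find_items(registry: dict, tag: str) -> list:
    """Return (source, dataset, item) paths for every item carrying the tag."""
    hits = []
    for dataset, entry in registry["inventory"].items():
        for item_id, detail in entry["items"].items():
            if tag in detail["tags"]:
                hits.append((registry["source"], dataset, item_id))
    return hits
```

A search that only reaches Level 1 would find the archive but not the recording; tagging at Level 3 is what makes the individual item discoverable.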
Information discovery • What makes discovery work? • Metadata • Logical organization • Attention to the fact that someone would want to discover it • It turns out that file types are a key enabler or inhibitor to discovery • Result ranking using *tuned* algorithm • What does not work? • Result ranking algorithms that depend on unconventional information types (icon, index, symbol)
Federated search • Federated search “is the simultaneous search of multiple online databases or web resources and is an emerging feature of automated, web-based library and information retrieval systems. It is also often referred to as a portal or a federated search engine.” (Wikipedia) • Libraries have been doing this for a long time (Z39.50, ISO 23950) • Key is consistent search metadata fields (keywords) • E.g. Geospatial One Stop http://www.geodata.gov
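The role of consistent metadata fields can be seen in a small sketch. It assumes every source exposes records with the same `title` and `keywords` fields; the source names and records are hypothetical, and the sources are searched one after another rather than simultaneously to keep the example short.

```python
def search_source(records: list, keyword: str) -> list:
    """Search one source; works only because all sources share a keywords field."""
    wanted = keyword.lower()
    return [r for r in records if wanted in (k.lower() for k in r["keywords"])]


def federated_search(sources: dict, keyword: str) -> list:
    """Query every source and merge hits that share a title."""
    merged = {}
    for name, records in sources.items():
        for record in search_source(records, keyword):
            entry = merged.setdefault(
                record["title"], {"title": record["title"], "sources": []}
            )
            entry["sources"].append(name)
    return list(merged.values())


# Hypothetical sources with a shared record layout.
sources = {
    "library_a": [{"title": "Swallow calls", "keywords": ["swallow", "audio"]}],
    "library_b": [
        {"title": "Swallow calls", "keywords": ["Swallow"]},
        {"title": "Ocean color", "keywords": ["productivity"]},
    ],
}
```

If one source named the field `subject` instead of `keywords`, the same query would silently miss it, which is why agreed search fields matter.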
Search engines (1) • Contains an automated spider or crawler • No theoretical limits in the amount of indexing (limited by hardware) • Support remote indexing • Continual background indexing of content • Custom metatag support (some low-end products do not support this feature) • Support for indexing PDF, .doc, etc (some low-end products do not support this feature) • Supports URL and word exclusions & inclusions
Search engines (2) • Server-Side Includes (SSI) supported • Search by custom metatags • Case sensitive or insensitive searching • Simple Customizable search/results pages • Boolean Searching capabilities • Provide users meta description and page title in search results • Inexpensive – ~$200 (2010) • Easily customizable search/results interface
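Boolean and case-insensitive searching can be sketched with an inverted index. This is a minimal illustration, not how any particular product works; the documents are hypothetical and words are split on whitespace only.

```python
from collections import defaultdict


def build_index(docs: dict) -> dict:
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search_and(index: dict, *terms: str) -> set:
    """Documents containing every term."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()


def search_or(index: dict, *terms: str) -> set:
    """Documents containing at least one term."""
    return set().union(*(index.get(t.lower(), set()) for t in terms))


def search_not(index: dict, all_ids, term: str) -> set:
    """Documents that do not contain the term."""
    return set(all_ids) - index.get(term.lower(), set())


# Hypothetical documents.
docs = {1: "African swallow recording", 2: "European swallow", 3: "mouse genome"}
index = build_index(docs)
```

Lowercasing at both index and query time gives case-insensitive search; skipping that step on both sides would give the case-sensitive variant.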
Search engines (3) • Result weighting feature • URL Inclusion list • Require significant memory (RAM) and disk space as the collection grows • Low-end alternatives often do not possess the capabilities to do phrase or natural language searching.
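Result weighting can be sketched as a score that counts matches per field and multiplies by a field weight. The weights and documents below are assumptions for illustration; real engines use more elaborate and usually tunable formulas.

```python
# Assumed weights: a match in the title counts more than one in the body.
FIELD_WEIGHTS = {"title": 3.0, "keywords": 2.0, "body": 1.0}


def score(doc: dict, term: str) -> float:
    """Sum weighted occurrence counts of the term across the known fields."""
    wanted = term.lower()
    return sum(
        weight * doc.get(name, "").lower().split().count(wanted)
        for name, weight in FIELD_WEIGHTS.items()
    )


def rank(docs: list, term: str) -> list:
    """Return ids of matching documents, highest score first."""
    scored = [(score(d, term), d["id"]) for d in docs]
    ordered = sorted(scored, key=lambda pair: -pair[0])
    return [doc_id for value, doc_id in ordered if value > 0]


# Hypothetical documents.
weighted_docs = [
    {"id": "a", "title": "Notes", "body": "swallow swallow"},
    {"id": "b", "title": "Swallow song", "body": "audio"},
    {"id": "c", "title": "Mouse", "body": "genome"},
]
```

With these weights one title match (3.0) outranks two body matches (2.0), so document `b` is listed before `a`.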
Improve www discovery? • Implement metatags on your and your partners' web sites • Update content frequently • Utilize annotation, e.g. RDFa, or support schema.org • Register your site with the major search engines (tools exist to aid in this process) • Perform a basic study of where your site ranks in the results of the major search engine providers • Do not spam the search engine providers • Re-evaluate your web site directory structure to ensure information is appropriately categorized/described within your URL strings
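To check which metatags a page actually exposes, the standard library `html.parser` module is enough for a rough audit. The sample page below is hypothetical; the sketch reads only `<meta name=... content=...>` tags and ignores RDFa and schema.org annotation.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect name/content pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attributes = dict(attrs)
            name = attributes.get("name")
            content = attributes.get("content")
            if name and content is not None:
                self.meta[name.lower()] = content


def extract_meta(html: str) -> dict:
    """Return the metatags found in an HTML document."""
    parser = MetaTagParser()
    parser.feed(html)
    return parser.meta


# Hypothetical page head.
sample_page = (
    '<html><head><meta name="description" content="Bird sound archive">'
    '<meta name="keywords" content="swallow, audio"></head></html>'
)
```

Running `extract_meta` over your own pages shows quickly which ones carry no description or keywords at all.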
Improve www discovery • Look through your server log files to determine what users are trying to find on your site and/or the path they are using to find information • Perform basic usability testing of your site to determine what users expect and can easily gather from your site. This also may determine why users go to an Internet search engine provider versus accessing your site directly. • Realize that Internet search engines do not all act the same or index at the same time period, and each often values a particular metatag, document date, etc. more than another vendor's product does.
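Looking through server logs for what users try to find can be sketched as follows. The sketch assumes log lines with the request in double quotes (as in the common Apache-style layout) and a site search that uses a `q` query parameter; both are assumptions, and the sample lines are invented.

```python
from collections import Counter
from urllib.parse import parse_qs, urlsplit


def search_terms(log_lines, param: str = "q") -> Counter:
    """Count the search terms users submitted, based on request URLs."""
    counts = Counter()
    for line in log_lines:
        parts = line.split('"')
        if len(parts) < 2:
            continue  # no quoted request on this line
        request = parts[1].split()  # e.g. ['GET', '/search?q=swallow', 'HTTP/1.1']
        if len(request) < 2:
            continue
        query = parse_qs(urlsplit(request[1]).query)
        for term in query.get(param, []):
            counts[term.lower()] += 1
    return counts


# Invented sample lines in an Apache-style layout.
sample_log = [
    '127.0.0.1 - - [27/Mar/2012:10:00:00 -0400] "GET /search?q=swallow HTTP/1.1" 200 512',
    '127.0.0.1 - - [27/Mar/2012:10:01:00 -0400] "GET /search?q=Swallow HTTP/1.1" 200 512',
    '127.0.0.1 - - [27/Mar/2012:10:02:00 -0400] "GET /index.html HTTP/1.1" 200 1024',
]
```

Frequent terms that return few or no results are good candidates for new metadata or clearer page titles.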
Smart search • Semantically aware search, e.g. http://noesis.itsc.uah.edu , http://eie.cos.gmu.edu (Water -> Semantic Search) • Faceted search, e.g. mspace (http://mspace.fm ), exhibit (MIT), S2S (RPI; http://aquarius.tw.rpi.edu/s2s )
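Faceted search can be sketched in a few lines: count the values available for each facet, then filter on the values a user selects. The records and facet names are hypothetical and the sketch is not based on how mspace, Exhibit or S2S are implemented.

```python
from collections import Counter


def facet_counts(items: list, facet: str) -> Counter:
    """How many items fall under each value of one facet."""
    return Counter(item[facet] for item in items if facet in item)


def apply_facets(items: list, **selected) -> list:
    """Keep only the items matching every selected facet value."""
    return [
        item for item in items
        if all(item.get(name) == value for name, value in selected.items())
    ]


# Hypothetical records described by three facets.
recordings = [
    {"id": "r1", "species": "swallow", "region": "Africa", "format": "audio"},
    {"id": "r2", "species": "swallow", "region": "Europe", "format": "image"},
    {"id": "r3", "species": "sparrow", "region": "Europe", "format": "audio"},
]
```

For example, `apply_facets(recordings, species="swallow", format="audio")` narrows three records to the single African swallow sound recording, with no free-text query at all.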
Faceted search logd.tw.rpi.edu
Summary - discovery • Useful to write a few discovery use cases to drive how your design is developed • Evolution of your role in facilitating discovery and what/ how others implement access to your information
Reading for this week • Is retrospective
Check in for Project Assignment • Analysis of existing information system content and architecture, critique, redesign and prototype redeployment • Or a new use case, development, etc.
What is next • April 3 - Week 10 – A worked example… • April 10 - Week 11 – surprise! • April 17 – No class GM week • April 24 – guest lectures, chem-, astro-, geo-? • May 1 – final project presentations