Visual Analytics and Biological Information. Chris Shaw, School of Interactive Arts & Technology, Simon Fraser University. SCHOOL OF INTERACTIVE ARTS + TECHNOLOGY [SIAT] | WWW.SIAT.SFU.CA
Visual Analytics: Integrated Interdisciplinary R&D • Cognitive Science • Information Systems • Graphic & Interaction Design • Mathematical & Statistical Methods
Interdisciplinary Know-how • SFU School of Interactive Arts & Tech • Design focus • Technology and Science • Cross-disciplinary Ph.D., M.Sc., B.Sc. • UBC Media & Graphics Interdisciplinary Centre (MAGIC) • 15 years of use-inspired basic research, co-development with industry & government
People IMAS: Interactive Multigenomic Analysis System
Visual Analytics • Two broad task domains: • Analysis of large datasets • Overall interaction time: hours to weeks • E.g. the VAST Contest: Find the threat represented in this large collection of text, images, emails… • Monitoring and emergency response • Overall interaction time: seconds to minutes • E.g. airport security screening: Find the smuggled weapon
NSERC Strategic Grant: “Visual Analytics for Safety & Security” • Application-aware basic research on human-computer cognitive systems • Perceptual & spatial cognition stream • Sensemaking, levels/types of user expertise • New cognitive collaboration & coordination stream • Application development stream • Goal is to co-develop human & tech aspects • Understanding users, system customization and training are as important as new technology
New Research Methods • Conventional qualitative and quantitative methods (e.g. grounded theory, stats) • Add advanced statistical, computational and math models • Integrated mixed-methods analysis • Calls for visual analytics tools for visual analytics research: we are our own testbed
Goal is integrated Visual Analytics R&D [Diagram labels: existing understanding, existing technology; pure basic research, use-inspired basic research, applied research and development; improved understanding, improved technology]
Interaction Science • Design for key human information processing systems • Walkthrough or experiment • Assess specific aspects of interaction • Implement prototype • Science in the development process
Steps From Research to Practice • VA science gives us design know-how • VA-aware designers work with user community to apply principles in design • VA users trained to get maximum advantage from VA • VA-sophisticated organizations work with designers to co-evolve new technology and work practices
Biological Sequence Analysis • Visual Analytics in the domain of Biology • Large data • Non-spatial • Many different layers of abstraction
Biological Sequence Analysis • DNA Sequencing projects • Visualization Systems • IMAS • Zoomable Sequence visualization • Gene Finding • BLAST Pairwise alignments • Multialignments • Results
DNA Sequencing Projects • Similar sequence yields similar structure • Similar structure yields similar function
IMAS supports • Initial stages of analyzing DNA sequence: • Find genes • Find and Analyze similar genes • Multialign like genes to find active sites • Pipeline structure
Existing Tools • Typically web-based • Copy and paste sequence into text entry box • Await search or analysis on remote database • Get an isolated report that the user must organize • Visualization often done as a reporting function • UCSC Genome browser, LLNL ECR browser, NCBI annotation viewer
Desktop Workbenches • Local sequence data • Mix of local and remote analyses • Web queries to remote data • Bluejay, Apollo, Vector NTI, CLC Workbench • User must work to integrate analyses • Workbench is point of collection
IMAS • Integrates analysis and display • Horizontally zoomable along the sequence • Selectable level of detail vertically • Maintains a sequence analysis data collection • Visual display aligned to sequence
IMAS Contents • DNA Sequence (Nucleotide, or NT Seq) • GC % plot • 3 forward & 3 reverse-complement amino acid (AA) sequences
Genes • Built-in access to Glimmer 3.02 gene finder • The labelled boxes are anchors for sequence analysis • Segments of DNA can also be marked as a Feature for further analysis
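The two derived tracks listed above (GC % plot and the six reading-frame translations) can be sketched as follows. This is a minimal illustration, not IMAS code: the function names, the window size, and the use of "X" for untranslatable codons are all assumptions made for the example; only the standard genetic code table is taken as given.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def gc_percent(seq, window, step=1):
    """GC percentage of each sliding window along seq (hypothetical helper)."""
    seq = seq.upper()
    return [
        100.0 * sum(base in "GC" for base in seq[i:i + window]) / window
        for i in range(0, len(seq) - window + 1, step)
    ]


def reverse_complement(seq):
    """Reverse complement of a DNA string containing only A, C, G, T."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq, frame):
    """Translate seq starting at offset frame (0, 1 or 2); unknown codons become X."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(frame, len(seq) - 2, 3)
    )


def six_frames(seq):
    """Three forward and three reverse-complement amino acid translations."""
    rc = reverse_complement(seq)
    frames = {f"+{f + 1}": translate(seq, f) for f in range(3)}
    frames.update({f"-{f + 1}": translate(rc, f) for f in range(3)})
    return frames
```

For example, `six_frames("ATGGCC")["+1"]` gives the forward frame-1 translation, and `gc_percent(seq, 100, 10)` would give one GC value per 10 NT step for a 100 NT window.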
Analyses • Rricke104 gene has • 2 NT BLAST pairs • 1 AA BLAST pair • 1 NT multialignment • 1 AA multialignment
Pairwise Sequence Analysis • Activated by selecting a Gene/Feature and selecting NT or AA similarity search • NCBI’s BLAST is called to search local databases of NT or AA sequences • Can also search NCBI central database
BLAST Alignments • High Scoring Pairs are stacked from most to least significant score • Detail shown when zoomed in • Pair similarity is shown using background color • Darker blue indicates higher similarity • When zoomed out, text is hidden and only similarity is shown
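One way to read the stacking and coloring rules above is sketched below. The slides do not give the actual layout algorithm, so everything here is an assumption: the `Hsp` fields, the greedy "first row without overlap in X" packing, and the particular light-to-dark blue ramp are hypothetical choices for illustration.

```python
from dataclasses import dataclass

LIGHT_BLUE = (200, 220, 255)  # assumed color for lowest similarity
DARK_BLUE = (0, 0, 139)       # assumed color for highest similarity


@dataclass
class Hsp:
    """A High Scoring Pair, reduced to what the layout needs (hypothetical fields)."""
    start: int        # first NT/AA coordinate on the originating sequence
    end: int          # last coordinate, inclusive
    bit_score: float  # significance used for ordering
    identity: float   # fraction of identical positions, 0.0 to 1.0


def stack_hsps(hsps):
    """Place HSPs from most to least significant; reuse a row only if no overlap in X."""
    rows = []    # rows[r] holds the (start, end) intervals already placed in row r
    placed = []
    for hsp in sorted(hsps, key=lambda h: h.bit_score, reverse=True):
        for row, intervals in enumerate(rows):
            if all(hsp.end < s or hsp.start > e for s, e in intervals):
                break
        else:
            rows.append([])
            row = len(rows) - 1
        rows[row].append((hsp.start, hsp.end))
        placed.append((hsp, row))
    return placed


def similarity_color(identity):
    """Background RGB color: darker blue for higher similarity."""
    t = min(max(identity, 0.0), 1.0)
    return tuple(round(lo + (hi - lo) * t) for lo, hi in zip(LIGHT_BLUE, DARK_BLUE))
```

A greedy packing like this keeps the most significant pairs nearest the sequence, but it also shows why heavy overlap in X produces tall stacks.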
Multialignments [Figure labels: Originating Gene, BLAST Results] • Select BLAST alignments to be multialigned • Clustal-W performs multialignment • Aligns • The originating IMAS gene sequence • The “Full” sequence found by BLAST • Not just the high-quality section • Useful to align entire genes, or entire corresponding segments of DNA
IMAS: Interactive Multigenomic Analysis System, Oct 30, 2007
Results • Analyzed Orientia tsutsugamushi (Scrub Typhus) • Found little similarity in NT sequence • Found a large number of SMART domains not found in the related Rickettsia organisms • The main IMAS benefit was data organization
Discussion • Visualization Problems • Pair alignments need better organization • Local visibility and organization needed • Overlap in X causes stacking layout problems • Need selective relaxation of vertical alignment rule
Discussion • Analysis Problems • More flexible access to tools: Restriction enzyme sites, methylation sites, Motifs, Primers, Transcription regulation, Intergenic signals...... • Database mediation problem: Please use XML! • More flexible manipulation of sequence parts • Right now IMAS is somewhat rigid in its worldview
Multiple Genomes • Lots of organisms now sequenced: • Learn from individual similarities • Learn from similar gene organization • Co-location “Synteny” of genes helps infer similar function: • Located together -> expressed together
Synteny Visualization • Line up the similar organisms below primary organism • Draw links to connect them • Take care to manage visual salience
IMAS Synteny • Does not handle reversals well
Alternative: Spring Synteny • Orthologs as a node-link diagram • 2 Link types • Neighbors on same organism • Sequence alignment (orthologous) links
Alignment Links • Percent Identity Plot along sequence • Framed to show PIP range • RRickettsia linked to RConorii, RProwazekii, RTyphi, RAkari
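A Percent Identity Plot along an alignment, together with the range used to frame it, might be computed roughly as below. This is a sketch under assumptions: the inputs are two equal-length gapped alignment rows with "-" as the gap character, the windows are non-overlapping, and the function names are invented for the example.

```python
def percent_identity_plot(aligned_a, aligned_b, window):
    """Percent identity per non-overlapping window of two gapped alignment rows."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("alignment rows must have equal length")
    values = []
    for i in range(0, len(aligned_a) - window + 1, window):
        columns = zip(aligned_a[i:i + window], aligned_b[i:i + window])
        # A column counts as identical only if both residues match and neither is a gap.
        matches = sum(a == b and a != "-" for a, b in columns)
        values.append(100.0 * matches / window)
    return values


def pip_range(values):
    """Minimum and maximum percent identity, used to frame the plot."""
    return (min(values), max(values))
```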
Springs • Primary organism is the central spine • Secondary sequences have parallel tracks connected by similarity links • Each secondary sequence has its own resting length for similarity links • Length of neighbor links is a blend of • NT coordinate difference • ln(length) * ln(length)
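The blended resting length for neighbor links could look something like the sketch below. The slide does not say what "length" refers to or how the two terms are weighted, so this example assumes that "length" is the NT coordinate difference between neighboring genes and that the blend is a simple linear mix; `blend` and `scale` are hypothetical parameters.

```python
import math


def neighbor_rest_length(nt_a, nt_b, blend=0.5, scale=1.0):
    """Resting length of a neighbor spring between two genes on one organism.

    blend=1.0 uses the raw NT coordinate difference only;
    blend=0.0 uses ln(distance) * ln(distance) only.
    """
    nt_distance = abs(nt_b - nt_a)
    if nt_distance < 1:
        return 0.0
    log_term = math.log(nt_distance) ** 2
    return scale * (blend * nt_distance + (1.0 - blend) * log_term)
```

The logarithmic term compresses long intergenic gaps, which is one plausible reason for blending it with the raw distance.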
Neighbor links • Using NT distance gives network shapes with many acute angles • Directly displays relative lengths of genomes
[Figure: Genomic Spring-Synteny Visualization with IMAS, Rrickettsii genomes]
Results • Advantages: • Shows reversals clearly • Shows gene “splits” with respect to primary genome • Shows insertions/deletions • Disadvantages • Obscures length relationships • Force-directed layout requires fiddling • Rotating the similarity edge makes comparing similarity difficult
Results • Trade-off: • Free 2 dimensions for gene placement • Get to locate similar items close to each other • Get ability to see gross rearrangements • Lose ability to see detailed similarity along DNA sequence • Lose geometric location information • Lose regulatory info (not represented)
IMAS • Supports annotation pipeline • Tree or DAG visualization, where • Branches are individual BLAST runs • Branches converge on multialignments • Biologists want more! • Analyze arbitrary collections of sequence
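The tree-or-DAG view of an annotation pipeline, with BLAST runs as branches that converge on multialignments, can be modeled with a plain dependency map. The node names below are hypothetical, and using the standard-library `graphlib` (Python 3.9+) is an assumption for illustration, not a description of how IMAS stores its analyses.

```python
from graphlib import TopologicalSorter

# Each analysis maps to the set of analyses it depends on (names are hypothetical).
analysis_dag = {
    "gene": set(),
    "blast_nt_1": {"gene"},
    "blast_nt_2": {"gene"},
    "blast_aa_1": {"gene"},
    "multialign_nt": {"blast_nt_1", "blast_nt_2"},  # branches converge here
    "multialign_aa": {"blast_aa_1"},
}


def evaluation_order(dag):
    """Return an order in which every analysis follows all of its inputs."""
    return list(TopologicalSorter(dag).static_order())
```

`evaluation_order(analysis_dag)` always lists the gene before its BLAST runs and each multialignment after the runs that feed it; the relative order of independent branches is not fixed.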
More • Want ability to interactively cut, edit, and analyze sequence • “Genomic Spreadsheet” where • Manage Sequences • Compare & Align sequences • Search for similar sequences • Manage sequences at levels of abstraction higher than sequence + annotation text
CzSaw • A Visual Analytics System for Text Data • Built by the SIAT CzSaw group • Victor Chen, Dustin Dunsmuir, Nazanin Kadivar, Eric Lee, Cheryl Qian • John Dill, Chris Shaw, Rob Woodbury
[Diagram labels: Exploring Data, Data Analysis Process, Data, Visualizing Analysis Model, Capturing Analysis Process, Analysis Model, Analysis History]
CzSaw [Diagram labels: Data Views, Exploring Data, Script, Visualizing Analysis Model, Capturing Analysis Process, Analysis History, Dependency Graph, History View]
Conclusions • Building IMAS helped us discover that IMAS is not yet what you want • Supports pipeline • Need to analyze with respect to many data types • Genome & other ontologies • Phylogeny • Metabolic networks • Regulatory networks