VISTA family of computational tools for comparative genomics
• How can we leverage genome sequences from many species to learn about genome function?
• Microbial applications
Inna Dubchak, Genomics Division LBNL, JGI
ildubchak@lbl.gov vista@lbl.gov
Human Genome Annotation
• Only 1–2% coding
• Efficient identification of regulatory sequences?
[Figure label: Gene A]
Sequence conservation implies function
• Conservation = potential functional region
• Divergence = non-functional
• 80 million years since the Last Common Ancestor
[Figure: two aligned sequences sharing the conserved block CTATAAATGC, flanked by diverged sequence]
Comparative Genomics Introduction
[Figure: Human, Chimp, Mouse, Urchin, and Drosophila genomes compared by sequence alignment, similar genes, and synteny]
VISTA is an integrated system for global sequence alignment and visualization for comparative genomic analysis
http://genome.lbl.gov/vista
How does VISTA work: Global Genomic Alignments
1. Anchoring: identify regions of strong similarity
2. Chaining: join regions of weak or no similarity
[Figure: sequence 1 aligned against sequence 2]
Algorithm: Feature
• AVID*: can handle draft sequence
• LAGAN**: produces true multiple alignments
• Shuffle-LAGAN**: handles rearrangements (inversions, translocations)
* Lior Pachter, UC Berkeley
** Michael Brudno, U. Toronto
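The two steps above (anchoring, then chaining) can be illustrated with a minimal Python sketch. This is not the AVID or LAGAN implementation: it assumes exact k-mer matches as anchors and a simple quadratic dynamic program for chaining, and the function names `find_anchors` and `chain_anchors` are hypothetical.

```python
from collections import defaultdict


def find_anchors(seq1, seq2, k=8):
    """Anchoring (simplified): every exact k-mer shared by both sequences.

    Returns a sorted list of (pos_in_seq1, pos_in_seq2) pairs.
    """
    index = defaultdict(list)
    for i in range(len(seq1) - k + 1):
        index[seq1[i:i + k]].append(i)
    anchors = []
    for j in range(len(seq2) - k + 1):
        for i in index.get(seq2[j:j + k], ()):
            anchors.append((i, j))
    return sorted(anchors)


def chain_anchors(anchors, k=8):
    """Chaining (simplified): longest chain of non-overlapping anchors
    whose positions increase in both sequences (O(n^2) dynamic program)."""
    anchors = sorted(anchors)
    n = len(anchors)
    if n == 0:
        return []
    best = [1] * n    # best[b] = length of the best chain ending at anchor b
    prev = [-1] * n   # back-pointer to the previous anchor in that chain
    for b in range(n):
        ib, jb = anchors[b]
        for a in range(b):
            ia, ja = anchors[a]
            if ia + k <= ib and ja + k <= jb and best[a] + 1 > best[b]:
                best[b] = best[a] + 1
                prev[b] = a
    end = max(range(n), key=best.__getitem__)
    chain = []
    while end != -1:
        chain.append(anchors[end])
        end = prev[end]
    return chain[::-1]
```

In a real global aligner the stretches between consecutive chained anchors would then be aligned base by base; that step is omitted here.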
Global Genomic Aligner Output
[Figure: base-level pairwise alignment printed in 60-column blocks, covering coordinates 104670599 to 104671304 of the first sequence and 052328645 to 052329299 of the second. Each block shows the two sequences with their start and end coordinates, "|" marking identical bases and "-" marking gaps.]
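The block layout of this output (coordinates, the two gapped sequences, and a row of "|" match marks between them) can be reproduced with a short Python sketch. The helper names `match_line` and `format_block` are hypothetical, and the coordinate convention (end = start + number of non-gap bases - 1) is an assumption; it is consistent with the first block shown, which holds 55 bases and 5 gaps in the top row and runs from 104670599 to 104670653.

```python
def match_line(top, bottom):
    """Return '|' where two aligned rows carry the same base, ' ' elsewhere."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    return "".join(
        "|" if a == b and a != "-" else " " for a, b in zip(top, bottom)
    )


def format_block(top, bottom, start_top, start_bottom, width=9):
    """Format one alignment block as three text lines.

    End coordinates count only non-gap characters of each row.
    """
    end_top = start_top + len(top.replace("-", "")) - 1
    end_bottom = start_bottom + len(bottom.replace("-", "")) - 1
    return "\n".join([
        f"{start_top:0{width}d} {top} {end_top:0{width}d}",
        f"{'>' * width} {match_line(top, bottom)} {'<' * width}",
        f"{start_bottom:0{width}d} {bottom} {end_bottom:0{width}d}",
    ])
```

Calling `print(format_block(top_row, bottom_row, 104670599, 52328645))` with two gapped rows of equal length prints one block in this layout.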
VISTA visualization
• Graphical presentation of sequence conservation as a “peaks-and-valleys” curve
• Axes: % identity against base sequence coordinates
• A “sliding window” measures sequence conservation (default window size 100 bp)
• Threshold shown: >70% identity
[Figure: alignment block for coordinates 104637349 to 104637408 against 052290302 to 052290360, with the resulting conservation curve]
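The sliding-window measurement behind the curve can be sketched in Python as follows. This is a simplified illustration, not VISTA's code: it slides the window over alignment columns rather than base-sequence coordinates, and the names `sliding_identity` and `conserved_windows` are hypothetical. The defaults (100 bp window, 70% threshold) are the values quoted on the slide.

```python
def sliding_identity(top, bottom, window=100):
    """Percent identity for every full window of alignment columns.

    Element i of the result covers columns i .. i + window - 1.
    A column counts as a match only if both rows hold the same base.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    matches = [1 if a == b and a != "-" else 0 for a, b in zip(top, bottom)]
    if len(matches) < window:
        return []
    total = sum(matches[:window])
    curve = [100 * total / window]
    for i in range(window, len(matches)):
        # Slide by one column: add the new column, drop the oldest one.
        total += matches[i] - matches[i - window]
        curve.append(100 * total / window)
    return curve


def conserved_windows(curve, threshold=70.0):
    """Start columns of windows whose identity exceeds the threshold."""
    return [i for i, value in enumerate(curve) if value > threshold]
```

Plotting `curve` against the window start position gives the peaks-and-valleys shape; the peaks above the threshold are the candidate conserved regions.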
VISTA homepage: http://genome.lbl.gov/vista
• Access servers, browsers, other information
• VISTA Servers (submit your own data)
• VISTA Browsers (precomputed alignments)
• Other VISTA-related Projects
VISTA Servers
• wgVISTA: align and compare sequences, including microbial assemblies
• mVISTA: align and compare sequences
• rVISTA: search for TFBS combined with a comparative sequence analysis
• GenomeVISTA: align DNA sequence to a genome
Precomputed Alignments
• VISTA Browser and VISTA-Point: browse through pre-computed whole-genome alignments; browse and obtain sequence and alignment data
• Whole Genome rVISTA: whole-genome analysis for conserved TFBS over-represented in upstream regions of genes
VISTA Browser: Input Menu
• Choose “base” genome
• Select location
• Determine visualization preference
[Screenshot labels: genome; position; visualization; VISTA Browser; VISTA-Point; VISTA tracks on UCSC Browser; Java 2, if needed]
VISTA Browser: Alignment Details
[Screenshot labels: gene; direction; exon; repeats; SNPs; alignment]
VISTA Browser: Result Menu & Icons
[Screenshot labels: Control Panel; Position on chromosome; Graphical display of genome alignments; 1 row; Cursor Info; Color Legend; Curve annotation (species)]
VISTA Browser: Zooming
[Figure: zoomed conservation curves vs. rhesus and vs. dog]
VISTA-Point: Alignments Table
[Screenshot label: sequence]
Principal components
• RegTransBase – experimental data
  • manually curated database of regulatory interactions captured from literature
  • 6000 papers
  • NAR database issue, 2007
• RegPrecise – computational predictions
  • manually curated database of regulons inferred by a comparative genomics approach
  • NAR database issue, 2010; Featured Article
• RegPredict – web tool for regulon inference
  • integrated system for fast and accurate inference of regulons by comparative genomics
  • NAR Web Server issue, 2010; Featured Article
mVISTA: Interface
• Our example will show 3 sequences
• Align up to 100 sequences
mVISTA: Input of Sequences
• Provide your email address
• Upload your sequences
• Or enter GenBank ID
mVISTA: Input Parameters
• Shuffle-LAGAN
  • multiple pairwise alignments
  • detects sequence rearrangements and inversions
• AVID
  • multiple pairwise alignments
  • accepts finished or draft sequences
• LAGAN
  • true multiple alignments
mVISTA: Results
[Screenshot labels: PDF; VISTA Browser; VISTA-Point]
wgVISTA: Microbial Assemblies Comparison
• wgVISTA: whole genome VISTA
• Compares 2 sequences (up to 10 Mb)
• Draft or finished microbial assembly sequences can be used
Regulatory VISTA (rVISTA): prediction of transcription factor binding sites
• Combines searches of the major transcription factor binding site database (TRANSFAC) with global sequence alignment to sieve through the data
• An rVISTA search is run automatically when submitting to:
  • mVISTA
  • GenomeVISTA
Regulatory VISTA (rVISTA):
1. Identify potential transcription factor binding sites for each sequence using a library of matrices (TRANSFAC)
2. Identify aligned sites using VISTA
3. Identify conserved sites using a dynamic shifting window (20 bp, >80% ID)
[Figure: multi-species alignment annotated with predicted sites Ikaros-2, Ikaros-2, NFAT, Ikaros-2]
Human  TGATTTCTCGGCAGCAAGGGAGGGCCCCATGACAAAGCCATTTGAAATCCCAGAAGCAATTTTCTACTTACGACCTCACTTTCTGTTGCTGTCTCTCCCTTCCCCTCTG
Mouse  TGATTTCTCGGCAGCCAGGGAGGGCCCCATGACGAAGCCACTCGAAATCCCAGAAGCAATTTTCTACTTACGACCTCACTTTCTGTTGCTCTCTCTTCCTCCCCCTCCA
Dog    TGATTTCTCGGCAGCAAGGGAGGGCCCCATGACGAAGCCATTTGAAATCCCAGAAGCGATTTTCTACCTACGACCTCACTTTCTGTTGCGCTCACTCCCTTCCCCTGCA
Rat    TGATTTCTCGGCAGCCAGGGAGGGCCCCATGACGAAGCCACTCGAAATCCCAGAAGCAATTTTCTACTTACGACCTCACTTTCTGTTGTTCTCTCTTCCTCCCCCTCCA
Cow    TGATTTCTCGGCAGCCAGGGAGGGCCCCATGACGAAGCCATTTGAAATCCCAGAAGCAATTTTCTACTTACGACCTCACTTTCTGTTGCGTTCTCTCCCTTCCCCTCCT
Rabbit TGATTTCTCGGCAGCCAGGGAGGGCCCCACGAC-AAGCCATTCAAAATCCCAGAAGTGATTTTCTACTTACGACCTCACTTTCTGTTG----CTCTCTCCTTCCCTCCA
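The three rVISTA steps can be illustrated with a small Python sketch for a pairwise alignment. Several simplifications are assumptions, not rVISTA's actual method: an exact string `motif` stands in for a TRANSFAC weight matrix, a site counts as "aligned" only if it occupies the same alignment columns in both sequences, and the "dynamic shifting window" is interpreted as taking the best-scoring 20-column window that contains the site. All function names are hypothetical.

```python
def column_map(aligned):
    """Alignment column of each non-gap base, in sequence order."""
    return [col for col, ch in enumerate(aligned) if ch != "-"]


def find_sites(aligned, motif):
    """Step 1 (simplified): exact motif hits in the ungapped sequence,
    reported as (first_column, last_column) in alignment coordinates."""
    cols = column_map(aligned)
    seq = aligned.replace("-", "")
    hits = []
    start = seq.find(motif)
    while start != -1:
        hits.append((cols[start], cols[start + len(motif) - 1]))
        start = seq.find(motif, start + 1)
    return hits


def best_window_identity(a, b, first, last, window=20):
    """Step 3 helper: highest % identity among all windows of `window`
    columns that fully contain the site spanning columns first..last."""
    best = 0.0
    lowest_start = max(0, last - window + 1)
    highest_start = min(first, len(a) - window)
    for s in range(lowest_start, highest_start + 1):
        same = sum(
            1 for x, y in zip(a[s:s + window], b[s:s + window])
            if x == y and x != "-"
        )
        best = max(best, 100 * same / window)
    return best


def classify_sites(a, b, motif, window=20, threshold=80.0):
    """Return all sites in `a`, those aligned with a site in `b` (step 2),
    and those aligned sites lying in a conserved window (step 3)."""
    sites_a = find_sites(a, motif)
    sites_b = set(find_sites(b, motif))
    result = {"all": sites_a, "aligned": [], "conserved": []}
    for site in sites_a:
        if site in sites_b:
            result["aligned"].append(site)
            if best_window_identity(a, b, site[0], site[1], window) > threshold:
                result["conserved"].append(site)
    return result
```

The three result lists form nested subsets: every conserved site is also aligned, and every aligned site is also in the full list of predicted sites.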
rVISTA: Interface
• rVISTA sequence submission: set number
• Submit email address, sequences, and set parameters
• Key step: click the box for “Find potential transcription factors”
rVISTA: Mailed Results
• Emailed results will provide a link
• Choose which binding site matrices to display
• You can then choose visualization options
rVISTA: Results Graphic
• Blue: all transcription factor (TF) binding sites
• Red: TF sites that are aligned in both sequences
• Green: TF sites that are aligned and lie in conserved regions
Whole Genome rVISTA: Select Alignment
[Screenshot labels: upstream range; IDs or symbols]
Whole Genome rVISTA: Results
[Screenshot labels: sites found; view genes]
Examples of VISTA usage
• Non-coding regulatory regions, for example enhancers
• Genes from the same gene families
• Alternative splicing
• Transcriptional regulation
• Genetic studies
Collected references are available through the Publications link at the VISTA home page: http://genome.lbl.gov/vista
VISTA thanks
Genomics Division, LBNL, led by Dr. Edward Rubin
• Biology: Dario Boffelli, Kelly Frazer, Gaby Loots, Len Pennacchio, Marcelo Nobrega, Axel Visel
• Bioinformatics: Michael Brudno, Olivier Couronne, Simon Minovitsky, Igor Ratner, Alexander Poliakov, Lior Pachter (UCB), Shyam Prabhakar, Dmitriy Ryaboy, Nameeta Shah, Inna Dubchak