
Evolution and the Santa Cruz Genome Browser

Jim Kent and the Genome Bioinformatics Group, University of California Santa Cruz, Pennsylvania State University.


Presentation Transcript


  1. Evolution and the Santa Cruz Genome Browser. Jim Kent and the Genome Bioinformatics Group, University of California Santa Cruz, Pennsylvania State University

  2. Typical Gene Level View: Sialic Acid Binding/Ig-like Lectin 7

  3. Typical Gene Level View: Sialic Acid Binding/Ig-like Lectin 7

  4. Known Gene Details Page

  5. Known Gene Details Page

  6. PDB Ribbon Diagram: 4 clicks away by the wonder of the World Wide Web

  7. Hox A Cluster, Many Tracks

  8. Track Controls are Now Grouped

  9. Packed mode saves space, makes labels easier to find.

  10. Squished mode is ideal for ESTs and mouse/human homology

  11. Squished mode is ideal for ESTs and mouse/human homology. ESTs hint at a smaller version of exon 2.

  12. Publication Quality Output

  13. Comparative Genomics

  14. Chaining Alignments
      • Chaining bridges the gulf between syntenic blocks and base-by-base alignments.
      • Local alignments tend to break at transposon insertions, inversions, duplications, etc.
      • Global alignments tend to force non-homologous bases to align.
      • Chaining is a rigorous way of joining together local alignments into larger structures.

  15. Chains join together related local alignments: Protease Regulatory Subunit 3

  16. Affine penalties are too harsh for long gaps. Log count of gaps vs. size of gaps in mouse/human alignment, correlated with sizes of transposon relics. Affine gap scores model the red/blue plots as straight lines.
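For reference, a standard affine gap cost for a gap of length $n$ has the form below. The symbols $o$ (gap open), $e$ (gap extend), $a$ and $b$ are generic and are not taken from the slides. Because the cost is linear in $n$, reading the score as a log-odds implies that the log of the expected gap count falls off as a straight line in gap size, which is the straight-line model the slide mentions:

```latex
g_{\text{affine}}(n) = o + e\,n,
\qquad
\log \operatorname{count}(n) \approx a - b\,n
```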

  17. Gaps are needed in Both Sequences in the General Case of Pair-Wise Alignment; otherwise non-homologous bases can be forced to pair.

  18. 2-D histogram of observed gaps. The horizontal axis is gaps in human, the vertical axis is gaps in mouse. The logarithms of the gap counts in bins of 10 (left) and bins of 500 (right) are plotted as levels of gray, with black representing the highest counts. Note the concentration of gaps along the axes, particularly for shorter gaps.

  19. Before and After Chaining

  20. Chaining Algorithm
      • Input: blocks of gapless alignments from blastz.
      • Dynamic program based on the recurrence relationship: $\mathrm{score}(B_i) = \max_{j<i}\bigl(\mathrm{score}(B_j) + \mathrm{match}(B_i) - \mathrm{gap}(B_i, B_j)\bigr)$
      • Uses Miller’s KD-tree algorithm to limit which parts of the dynamic programming graph are traversed. Timing is O(N log N), where N is the number of blocks (which is in the hundreds of thousands).
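The recurrence on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the plain quadratic form of the dynamic program and leaves out the KD-tree speedup entirely, and the `Block` fields, the `gap_cost` function and its 0.5 weight are hypothetical placeholders, not the scoring used by the authors.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Block:
    """A gapless alignment block with half-open ranges on target (t) and query (q)."""
    t_start: int
    t_end: int
    q_start: int
    q_end: int
    match: float  # hypothetical score of the block on its own


def gap_cost(prev: Block, cur: Block) -> float:
    """Placeholder gap cost, linear in the total gap size (illustrative only)."""
    dt = cur.t_start - prev.t_end
    dq = cur.q_start - prev.q_end
    return 0.5 * (dt + dq)


def best_chain(blocks: List[Block]) -> List[Block]:
    """Return the highest-scoring chain using the O(N^2) form of the recurrence."""
    if not blocks:
        return []
    # Sorting by target start guarantees every valid predecessor has a smaller index.
    blocks = sorted(blocks, key=lambda b: (b.t_start, b.q_start))
    score = [b.match for b in blocks]  # base case: a chain of one block
    back: List[Optional[int]] = [None] * len(blocks)
    for i, cur in enumerate(blocks):
        for j in range(i):
            prev = blocks[j]
            # A predecessor must end before cur starts in both sequences.
            if prev.t_end <= cur.t_start and prev.q_end <= cur.q_start:
                cand = score[j] + cur.match - gap_cost(prev, cur)
                if cand > score[i]:
                    score[i] = cand
                    back[i] = j
    # Trace back from the best-scoring end block.
    i: Optional[int] = max(range(len(blocks)), key=score.__getitem__)
    chain = []
    while i is not None:
        chain.append(blocks[i])
        i = back[i]
    return chain[::-1]
```

With three collinear blocks and one off-diagonal block, the sketch keeps the collinear ones and skips the outlier. The slide's O(N log N) timing depends on the KD-tree pruning, which this quadratic version does not attempt.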

  21. Netting Alignments
      • Commonly, multiple mouse alignments can be found for a particular human region, particularly for coding regions.
      • Net finds the best mouse match for each human region.
      • Highest scoring chains are used first.
      • Lower scoring chains fill in gaps within chains, inducing a natural hierarchy.
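The "highest scoring chains first, lower scoring chains fill the gaps" idea can be illustrated with a small greedy sketch. It is a deliberate simplification: chains are reduced to a hypothetical `(name, score, blocks)` tuple with blocks as half-open human-coordinate intervals, and the result is a flat assignment rather than the nested hierarchy the slide describes.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]             # half-open [start, end) on the human sequence
Chain = Tuple[str, float, List[Interval]]  # (name, score, blocks): hypothetical layout


def subtract(interval: Interval, claimed: List[Interval]) -> List[Interval]:
    """Return the parts of interval not covered by claimed (sorted and disjoint)."""
    start, end = interval
    out = []
    for c_start, c_end in claimed:
        if c_end <= start or c_start >= end:
            continue  # no overlap with this claimed piece
        if c_start > start:
            out.append((start, c_start))
        start = max(start, c_end)
        if start >= end:
            break
    if start < end:
        out.append((start, end))
    return out


def net(chains: List[Chain]) -> Dict[str, List[Interval]]:
    """Assign each human base to the highest-scoring chain that covers it."""
    claimed: List[Interval] = []
    result: Dict[str, List[Interval]] = {}
    # Highest-scoring chains claim their blocks first.
    for name, _score, blocks in sorted(chains, key=lambda c: -c[1]):
        kept: List[Interval] = []
        for block in blocks:  # blocks within one chain are assumed disjoint
            kept.extend(subtract(block, claimed))
        if kept:
            result[name] = kept
            claimed = sorted(claimed + kept)
    return result
```

In this sketch a lower-scoring chain survives only where it covers bases that no better chain has already claimed, which is the behavior the bullets describe for gap filling.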

  22. Net Focuses on Ortholog

  23. Net highlights rearrangements. A large gap in the top level of the net is filled by an inversion containing two genes. Numerous smaller gaps are filled in by local duplications and processed pseudogenes.

  24. Useful in finding pseudogenes. Ensembl and Fgenesh++ automatic gene predictions are confounded by numerous processed pseudogenes. The domain structure of the resulting predicted protein must be interesting!

  25. Mouse/Human Rearrangement Statistics: number of rearrangements of a given type per megabase.

  26. A Rearrangement Hot Spot. Rearrangements are not evenly distributed. Roughly 5% of the genome is in hot spots of rearrangements such as this one. This 350,000 base region is between two very long chains on chromosome 7.

  27. Rat Genome year of the rat - 2008

  28. Rat/Mouse/Human Genome-Wide Multiz Alignments Available. Eye lens protein gamma crystallin a. Upstream region (on right) is highly conserved but not a CpG island. Alignments are interrupted by numerous recent transposon insertions.

  29. Details page offers quick access to browsers on corresponding regions of other genomes. It also highlights exons in base-by-base alignments.

  30. Zoom to Base Level: Detail near translation start of tubulin 8

  31. Zoom to Base Level: Intron consensus sequence visible.

  32. Zoom to Base Level: Possible alt-splice, not consensus and not conserved.

  33. Tiling the genome in Microarrays: New genes on 21 and 22?

  34. Cross-hybridization at Work. Zoomed in on right side:

  35. 200 Bases Upstream of Known Genes 5’ Extended by RNA/EST clusters

      >hg15_rnaCluster_chr22.246 range=chr22:25204375-25204574 5'pad=0 3'pad=0 revComp=TRUE strand=- repeatMasking=none
      aactccgcctcggggccccggggcgccgcctctctcccccggggcgccgc
      ctctctcccccggggcgccgcctccctccgccgcggccgtcgagccgcgg
      agcgcctcttccgcggagccgccgcctgccaggattccagcgccgcagct
      gcggccgcagccattggtctctgacgtcagcggcgtgcggcgcactcggc
      >hg15_rnaCluster_chr22.234 range=chr22:24125896-24126095 5'pad=0 3'pad=0 revComp=TRUE strand=- repeatMasking=none
      ccagggcagggcgaggagcgcggggaggggccgcggggacccgggccgct
      ggggccgtggggcccgcccggccgccggccggctccctggggcgcgggcg
      gctgcgtcagcggggggcggagacgcggcgctgcttccgctcacgcgcgc
      cctgctccctcctcccagtcgtcctggtccgcggcgcccaacggggaaga
      >hg15_rnaCluster_chr22.313 range=chr22:29356156-29356355 5'pad=0 3'pad=0 revComp=FALSE strand=+ repeatMasking=none
      gccctcccggtccgggggcggggcttggcctggggcggggcttggctggg
      gtgctcagcccaattttccgtgtagggagcgggcggcggcgggggaggca
      gaggcggaggcggagtcaagagcgcaccgccgcgcccgccgtgccgggcc
      tgagctggagccgggcgtgagtcgcagcaggagccgcagccggagtcaca
      >hg15_rnaCluster_chr22.337 range=chr22:30433286-30433485 5'pad=0 3'pad=0 revComp=TRUE strand=- repeatMasking=none
      actcagaagctaagataccgacggtgttcctctgaacttcttccaatggc
      taaaagctacaagcgcctcagatataaaagactcctggacggattttcat
      ccagcacagagcagctgaatccatatttggcagctagtggatgggataag
      aggcctaacagtaagcccatggcactttattctctcgaatccatcaagat
      >hg15_rnaCluster_chr22.356 range=chr22:32640965-32641164 5'pad=0 3'pad=0 revComp=TRUE strand=- repeatMasking=none
      ggccccgcgccccaggccggggcgaggccttttccggcgcttctttcccg
      cggagccgcgggcgggcggcgcaggccctgggggagagcgcgccgcggcc
      ggttgcagccccccccgcgccgccgcgttcggcgcccggcccggccagtc
      tgctcctgccccgccgccgcgccggagcccgggcgcccgaagctgggggc
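The header lines on this slide follow a regular `key=value` layout, so they can be pulled apart with a short parser. The sketch below is an assumption-laden reading of that layout, not a documented format: the function name `parse_header` is hypothetical, and the range is treated as 1-based and inclusive because that is what makes each record come out to the 200 bases named in the slide title.

```python
import re

# Pattern inferred from the headers on the slide; not an official specification.
HEADER_RE = re.compile(
    r"^>(?P<name>\S+)\s+range=(?P<chrom>[^:\s]+):(?P<start>\d+)-(?P<end>\d+)(?P<rest>.*)$"
)


def parse_header(line: str) -> dict:
    """Split one FASTA header of the form shown on the slide into its fields."""
    m = HEADER_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized header: {line!r}")
    # Remaining tokens look like key=value (for example strand=- or revComp=TRUE).
    extras = dict(tok.split("=", 1) for tok in m.group("rest").split() if "=" in tok)
    start, end = int(m.group("start")), int(m.group("end"))
    return {
        "name": m.group("name"),
        "chrom": m.group("chrom"),
        "start": start,
        "end": end,
        "length": end - start + 1,  # assumes 1-based inclusive coordinates
        "strand": extras.get("strand"),
        "rev_comp": extras.get("revComp") == "TRUE",
    }
```

Applied to the first header on the slide, this yields chromosome `chr22`, strand `-`, and a length of 200.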

  36. Acknowledgements
      • Individuals: Webb Miller, Chuck Sugnet, Robert Baertsch, Scott Schwartz, Fan Hsu, Terry Furey, Ross Hardison, David Haussler, Richard Gibbs, Bob Waterston, Eric Lander, Francis Collins, LaDeana Hillier, Roderic Guigo, Michael Brent, Olivier Jaillon, David Kulp, Victor Solovyev, Ewan Birney, James Gilbert, Greg Schuler, Deanna Church, the Gene Cats. Everyone else!
      • Institutions: NHGRI, The Wellcome Trust, HHMI, NCI, Taxpayers in the US and worldwide. Baylor, Sanger, Wash U, Whitehead, Stanford, JGI/DOE, Oklahoma U and the international sequencing centers. UCSC, NCBI, EBI, Ensembl, Genoscope, MGC, Intel, TIGR, Jackson Labs, Affymetrix, SwissProt.

  37. THE END

  38. A Cautionary Note
      • Infant digestive systems are very permeable and take up antibodies.
      • ~10% of infants are allergic to cow’s milk based formula.
      • These infants get soy/corn based formula.
      • As we engineer plants, let’s be careful what we put in infant formula.

  39. New Algorithms and Data
      • ‘Chaining’ and ‘netting’ of mouse/human alignments precisely define orthology and quantify rearrangements.
      • Rat genome is browsable and used in rat/mouse/human multiple alignments.
      • Cross-hybridization potential of Affymetrix-style microarrays calculated and displayed.

  40. Ideal Gap Penalties
      • Would allow gaps in both sequences at once.
      • Would penalize long gaps less than affine gap scores.
      • Still would be quick to compute.
      • We use a piecewise linear function of the sum of gap sizes plus a substantial penalty for gaps that are in both sequences at once.
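A gap cost of the kind described in the last bullet might look like the sketch below. The breakpoints in `GAP_POINTS` and the value of `BOTH_PENALTY` are made-up numbers chosen only to show the shape of the function; the slides do not give the actual values.

```python
from bisect import bisect_right

# Hypothetical (total gap size, cost) breakpoints; not the authors' values.
GAP_POINTS = [(1, 350.0), (10, 500.0), (100, 800.0), (1000, 1500.0), (10000, 3000.0)]
# Hypothetical extra cost when both sequences have a gap at once.
BOTH_PENALTY = 750.0


def piecewise_linear(total: int) -> float:
    """Interpolate the cost for a total gap size, extrapolating past the last point."""
    xs = [x for x, _ in GAP_POINTS]
    k = bisect_right(xs, total)
    k = min(max(k, 1), len(GAP_POINTS) - 1)  # pick the segment [k-1, k]
    (x0, y0), (x1, y1) = GAP_POINTS[k - 1], GAP_POINTS[k]
    return y0 + (y1 - y0) * (total - x0) / (x1 - x0)


def gap_cost(dt: int, dq: int) -> float:
    """Cost of a gap of dt bases in one sequence and dq bases in the other."""
    if dt < 0 or dq < 0:
        raise ValueError("gap sizes must be non-negative")
    if dt == 0 and dq == 0:
        return 0.0
    cost = piecewise_linear(dt + dq)
    if dt > 0 and dq > 0:
        cost += BOTH_PENALTY  # gaps in both sequences at once
    return cost
```

Because the slope of each successive segment is smaller, a 1000-base gap costs far less than ten times a 100-base gap, while the lookup stays cheap to compute.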
