Comparative Genomics
Overview • Orthologues and paralogues • Protein families • Genome-wide DNA alignments • Syntenic blocks
Comparative Genomics • Allows us to achieve a greater understanding of vertebrate evolution • Tells us what is common and what is unique between different species at the genome level • The function of human genes and other regions may be revealed by studying their counterparts in lower organisms • Helps identify both coding and non-coding genes and regulatory elements
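Species in Ensembl [Figure: phylogenetic tree of the groups covered, drawn against a geological time scale in MYBP (axis ticks at 570, 505, 438, 408, 360, 286, 245, 208, 144 and 65; periods Cambrian, Ordovician, Silurian, Devonian, Carboniferous, Permian, Triassic, Jurassic, Cretaceous, Tertiary). Groups shown: mammals (placentals, marsupials, monotremes), birds (passerines, other birds, paleognaths), reptiles, crocodiles, turtles, lizards, amphibians, fishes, teleosts, sharks, rays, Latimeria, bichir/Polypterus, lungfishes, agnathans, non-vertebrates.]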
Orthologue / Paralogue Prediction Algorithm
(1) Load the longest translation of each gene from all species used in Ensembl.
(2) Run WUBLASTp+SmithWaterman of every gene against every other gene (in both the same and other species) in a genome-wide manner.
(3) Build a graph of gene relations based on Best Reciprocal Hits (BRH) and Blast Score Ratio (BSR) values.
(4) Extract the connected components (= single-linkage clusters), each cluster representing a gene family.
(5) For each cluster, build a multiple alignment of the protein sequences using MUSCLE.
(6) For each aligned cluster, build a phylogenetic tree using PHYML. The tree obtained at this stage is unrooted.
(7) Reconcile each gene tree with the species tree using RAP, to call duplication events on internal nodes and to root the tree.
(8) From each gene tree, infer pairwise gene relations of the orthology and paralogy types.
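Steps (3) and (4) can be illustrated with a minimal Python sketch. It is a simplification, not the Ensembl pipeline: it links genes only by best reciprocal hits between different species and leaves out the BSR edges and the within-species hits. The gene names, scores and function names are made up for the example.

```python
from collections import defaultdict


def best_reciprocal_hits(hits, species):
    """Return the gene pairs that are each other's best hit across species."""
    best = {}  # (query gene, target species) -> (score, target gene)
    for query, target, score in hits:
        if species[query] == species[target]:
            continue  # simplification: only cross-species hits are used
        key = (query, species[target])
        if key not in best or score > best[key][0]:
            best[key] = (score, target)
    pairs = set()
    for (query, _), (_, target) in best.items():
        back = best.get((target, species[query]))
        if back is not None and back[1] == query:
            pairs.add(frozenset((query, target)))
    return pairs


def connected_components(genes, pairs):
    """Single-linkage clusters: genes joined by any chain of edges."""
    adjacency = defaultdict(set)
    for pair in pairs:
        a, b = tuple(pair)
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, clusters = set(), []
    for gene in genes:
        if gene in seen:
            continue
        stack, component = [gene], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(adjacency[node] - seen)
        clusters.append(component)
    return clusters


# Hypothetical genes and BLAST-like scores (query, target, score).
species = {"h1": "human", "h2": "human", "m1": "mouse",
           "m2": "mouse", "z1": "zebrafish"}
hits = [("h1", "m1", 900), ("m1", "h1", 880), ("h1", "m2", 300),
        ("m2", "h1", 310), ("h2", "m2", 700), ("m2", "h2", 690),
        ("h2", "m1", 200), ("m1", "z1", 500), ("z1", "m1", 480),
        ("h1", "z1", 460), ("z1", "h1", 450)]

pairs = best_reciprocal_hits(hits, species)
clusters = connected_components(sorted(species), pairs)
print(sorted(sorted(c) for c in clusters))
```

In this toy data set h1, m1 and z1 end up in one family and h2, m2 in another, because the weaker h1/m2 and h2/m1 hits are not reciprocal best hits.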
Homologue Relationships • Orthologues : any gene pairwise relation where the ancestor node is a speciation event • Paralogues : any gene pairwise relation where the ancestor node is a duplication event
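The two definitions translate directly into a rule on the reconciled gene tree: the relation between two genes is decided by the event at their most recent common ancestor node. The following Python sketch shows this, assuming a hypothetical `Node` class whose internal nodes already carry a "speciation" or "duplication" label (as step 7 of the prediction algorithm would provide); the gene names are invented.

```python
from itertools import combinations, product


class Node:
    """Internal node of a reconciled gene tree (hypothetical structure)."""

    def __init__(self, event, children):
        self.event = event        # "speciation" or "duplication"
        self.children = children  # Node objects or gene names (str leaves)


def leaves(tree):
    """List all gene names below a node."""
    if isinstance(tree, str):
        return [tree]
    return [gene for child in tree.children for gene in leaves(child)]


def classify_pairs(tree, relations=None):
    """Label each gene pair by the event at its common ancestor node."""
    if relations is None:
        relations = {}
    if isinstance(tree, str):
        return relations
    label = "orthologue" if tree.event == "speciation" else "paralogue"
    # Genes in different child subtrees have this node as their ancestor node.
    for left, right in combinations(tree.children, 2):
        for a, b in product(leaves(left), leaves(right)):
            relations[frozenset((a, b))] = label
    for child in tree.children:
        classify_pairs(child, relations)
    return relations


# A duplication before the human/mouse split gives two gene copies, A and B.
tree = Node("duplication", [
    Node("speciation", ["human_A", "mouse_A"]),
    Node("speciation", ["human_B", "mouse_B"]),
])
relations = classify_pairs(tree)
for pair, label in sorted((sorted(p), l) for p, l in relations.items()):
    print(pair, label)
```

Here human_A and mouse_A are orthologues, while human_A and mouse_B are paralogues even though they come from different species, because their ancestor node is the duplication.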
GeneTreeView MUSCLE protein alignment GeneTree
GeneTreeView Speciation node (blue) Duplication node (red)
Protein Dataset More than 1,500,000 proteins clustered: • All Ensembl protein predictions from all species supported ~ 670,000 protein predictions • All metazoan (animal) proteins in UniProt: ~ 80,000 UniProt/Swiss-Prot ~ 830,000 UniProt/TrEMBL
Clustering Strategy • BLASTP all-versus-all comparison • Markov clustering • For each cluster: • Calculation of multiple sequence alignments with ClustalW • Assignment of a consensus description
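To give a feel for the Markov clustering step, here is a toy, pure-Python sketch of the basic idea (alternating expansion and inflation of a column-stochastic matrix). It is not the MCL program used for the families; the similarity values, the parameter defaults and the function names are assumptions chosen for illustration.

```python
def normalise_columns(matrix):
    """Scale every column so that it sums to 1."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] for j in range(n)] for i in range(n)]


def markov_cluster(similarity, inflation=2.0, iterations=30, eps=1e-6):
    """Cluster a symmetric similarity matrix; returns a set of frozensets."""
    n = len(similarity)
    m = [[float(similarity[i][j]) for j in range(n)] for i in range(n)]
    for i in range(n):
        m[i][i] = 1.0  # self-loops, a common convention in this method
    m = normalise_columns(m)
    for _ in range(iterations):
        # Expansion: square the matrix (random walks of two steps).
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: raise entries to a power, which favours strong links.
        m = normalise_columns([[v ** inflation for v in row] for row in m])
    clusters = set()
    for i in range(n):
        if m[i][i] > eps:  # row i belongs to an attractor
            clusters.add(frozenset(j for j in range(n) if m[i][j] > eps))
    return clusters


# Hypothetical normalised BLASTP similarities for four proteins:
# 0-1 and 2-3 are strongly similar, 1-2 share only a weak hit.
similarity = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.1, 0.0],
    [0.0, 0.1, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
print(markov_cluster(similarity))
```

Unlike single-linkage clustering, this approach is meant to cut weak links between otherwise dense groups, which is why it suits large all-versus-all BLASTP graphs.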
GeneView / TransView / ProtView Link to FamilyView
FamilyView Consensus annotation JalView multiple alignments Ensembl family members within human UniProt family members Ensembl family members in other species
Whole Genome Alignments • Functional sequences evolve more slowly than non-functional sequences, therefore sequences that remain conserved may perform a biological function. • Comparing genomic sequences from species at different evolutionary distances allows us to identify: • Coding genes • Non-coding genes • Non-coding regulatory sequences
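Selection of Species for DNA Comparisons (human vs. other species) • Chimpanzee: genome size 3.0 Gbp; time since divergence ~5 MYA; sequence conservation in coding regions >99%; aids identification of recently changed sequences and genomic rearrangements • Mouse: genome size 2.5 Gbp; time since divergence ~65 MYA; sequence conservation in coding regions ~80%; aids identification of both coding and non-coding sequences • Opossum: genome size 4.2 Gbp; time since divergence ~150 MYA; sequence conservation in coding regions ~70-75%; aids identification of both coding and non-coding sequences • Pufferfish: genome size 0.4 Gbp; time since divergence ~450 MYA; sequence conservation in coding regions ~65%; aids identification of primarily coding sequences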
Alignment Algorithm • Should find all highly similar regions between two sequences • Should allow for segments without similarity, rearrangements etc. • Issues • Heavy process • Scalability, as more and more genomes are sequenced • Time constraint
BLASTZ-net, tBLAT and PECAN • BLASTZ-net (comparison at the nucleotide level) is used for species that are evolutionarily close, e.g. human - mouse • Translated BLAT (comparison at the amino acid level) is used for evolutionarily more distant species, e.g. human - zebrafish • PECAN is used for multispecies alignments • 7 eutherian mammals • 10 amniote vertebrates
BLASTZ-net, tBLAT and PECAN The species combinations for which whole genome alignments are available are listed on the Comparative Genomics page (Help & Documentation > Genomic Data > Comparative Genomics).
ContigView Constrained elements Conservation score PECAN alignments Blastz mouse tBLAT zebrafish
MultiContigView Conserved sequences human Conserved sequences dog
AlignSliceView Human Mouse Dog Rat
Syntenic Blocks • Genome alignments are refined into larger syntenic regions • Alignments are clustered together when the relative distance between them is less than 100 kb and their order and orientation are consistent • Clusters shorter than 100 kb are discarded
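The clustering rule can be sketched in a few lines of Python. This is one possible reading of the rule, not the Ensembl implementation: the `Alignment` fields, the way the gap is measured on both genomes, and the use of the reference span as the block length are all assumptions, and the coordinates are invented.

```python
from dataclasses import dataclass

MAX_GAP = 100_000    # maximum distance between neighbouring alignments (bp)
MIN_BLOCK = 100_000  # blocks shorter than this are discarded (bp)


@dataclass
class Alignment:
    ref_start: int
    ref_end: int
    target_chrom: str
    target_start: int
    target_end: int
    strand: int  # +1 same orientation, -1 inverted


def can_extend(prev, nxt):
    """True if nxt continues prev: close by, same order and orientation."""
    if prev.target_chrom != nxt.target_chrom or prev.strand != nxt.strand:
        return False
    ref_gap = nxt.ref_start - prev.ref_end
    if prev.strand == 1:
        target_gap = nxt.target_start - prev.target_end
    else:  # on the reverse strand the target coordinates run backwards
        target_gap = prev.target_start - nxt.target_end
    # A negative gap means overlap or wrong order; this sketch rejects both.
    return 0 <= ref_gap < MAX_GAP and 0 <= target_gap < MAX_GAP


def build_blocks(alignments):
    """Chain alignments along the reference and keep blocks >= MIN_BLOCK."""
    blocks, current = [], []
    for aln in sorted(alignments, key=lambda a: a.ref_start):
        if current and not can_extend(current[-1], aln):
            blocks.append(current)
            current = []
        current.append(aln)
    if current:
        blocks.append(current)
    return [(b[0].ref_start, b[-1].ref_end) for b in blocks
            if b[-1].ref_end - b[0].ref_start >= MIN_BLOCK]


alignments = [
    Alignment(0, 60_000, "chr4", 1_000_000, 1_060_000, 1),
    Alignment(90_000, 150_000, "chr4", 1_080_000, 1_140_000, 1),
    Alignment(400_000, 430_000, "chr4", 1_300_000, 1_330_000, 1),
    Alignment(500_000, 580_000, "chr7", 900_000, 980_000, -1),
    Alignment(600_000, 680_000, "chr7", 800_000, 880_000, -1),
]
blocks = build_blocks(alignments)
print(blocks)  # → [(0, 150000), (500000, 680000)]
```

The isolated 30 kb alignment at 400,000 is more than 100 kb away from its neighbour and is itself shorter than 100 kb, so it is dropped; the two inverted chr7 alignments form one block because their order on the target is consistent with the reverse orientation.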
SyntenyView Human chromosome Orthologues Mouse chromosomes Mouse chromosomes
CytoView Syntenic blocks Orientation Chromosome
Q & A