finding genes by comparing genomes

Explore gene prediction techniques comparing genomic sequences using various algorithms, improving accuracy and identification of novel genes.

Presentation Transcript


  1. finding genes by comparing genomes • Roderic Guigó i Serra, IMIM/UPF/CRG, Barcelona

  2. number of genes on chromosome 22
     • initial annotation: 545 (Dunham et al., 1999)
     • genscan + RT-PCR: 590 (Das et al., 2001)
     • genscan + microarrays: 730 (Shoemaker et al., 2001)
     • reviewed annotation: 726 (chr22 team, Sanger, 2001)
     • mouse shotgun data: +20 (our data)
     • geneid predictions: 794
     • genscan predictions: 1128

  3. number of genes in the human genome
     • Consortium: 30,000-40,000 (2001)
     • Celera: 27,000-38,000 (2001)
     • Consortium + Celera: 50,000 (Hogenesch et al., 2001)
     • DB searches: 65,000-75,000 (Wright et al., 2001)
     • Human Genome Sciences: 90,000-120,000 (Haseltine, 2001)

  4. sequence conservation and coding function

  5. sequence conservation and coding function

  6. comparative gene prediction • rosetta (Batzoglou et al., 2000) • cem (Bafna and Huson, 2000) • sgp1 (Wiehe et al., 2001) • twinscan (Korf et al., 2001) • slam (Pachter et al., 2001) • doublescan (Meyer and Durbin, 2002) • sgp2 (Parra et al., 2003)

  7. comparative gene prediction • 1. THE GENE PREDICTION IS THE RESULT OF THE SEQUENCE ALIGNMENT • given two homologous genomic sequences, infer the exonic structure in each sequence that maximizes the score of the alignment of the resulting amino acid sequences. • This problem is usually solved through a complex extension of the classical dynamic programming algorithm for sequence alignment. • blayo et al., 2002 • pedersen and scharl, 2002
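For reference, the "classical dynamic programming algorithm" that this slide builds on can be sketched as global (Needleman-Wunsch) alignment scoring. This is only the baseline, not the gene-structure-aware extension used by the cited methods; the function name and the match/mismatch/gap values are illustrative assumptions.

```python
def global_alignment_score(a: str, b: str,
                           match: int = 1,
                           mismatch: int = -1,
                           gap: int = -2) -> int:
    """Needleman-Wunsch global alignment score with a linear gap penalty.

    Only the score is computed (two rows of the DP table are kept);
    scoring values are illustrative, not taken from the slides.
    """
    # Row 0: aligning the empty prefix of `a` against prefixes of `b`.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr[j] = max(diag, up, left)
        prev = curr
    return prev[len(b)]


if __name__ == "__main__":
    print(global_alignment_score("GATTACA", "GATTACA"))
    print(global_alignment_score("ACGT", "AGT"))
```

The comparative extension replaces the two plain strings with candidate exon chains in each genomic sequence, which is why the slide calls it complex.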

  8. comparative gene prediction • 2. GENE PREDICTION AND SEQUENCE ALIGNMENT ARE PRODUCED SIMULTANEOUSLY • given two homologous genomic sequences, Pair hidden Markov Models for sequence alignment and Generalized HMMs (GHMMs) for gene prediction are combined into the so-called Generalized Pair HMMs • progen – novichkov et al., 2001 • slam – pachter et al., 2001 • doublescan – meyer and durbin, 2002

  9. comparative gene prediction • 3. GENE PREDICTION IS SEPARATED FROM SEQUENCE ALIGNMENT • first, the alignment between two homologous genomic sequences is obtained with a generic sequence alignment program, such as tblastx, sim4 or glass • then, gene structures are predicted that are compatible with this alignment, meaning that predicted exons fall in the aligned regions. • rosetta – batzoglou et al., 2000 • cem – bafna and huson, 2000 • sgp-1 – wiehe et al., 2001

  10. comparative gene prediction • 4. GENE PREDICTION IS (EVEN MORE) SEPARATED FROM SEQUENCE ALIGNMENT • This approach does not require the comparison of two homologous genomic sequences. Rather, a query sequence from a target genome is compared against a collection of sequences from a second (informant, reference) genome, and the results of the comparison are used to modify the scores of the exons produced by underlying "ab initio" gene prediction algorithms. • twinscan – korf et al., 2001 • sgp-2 – parra et al., 2003
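The idea of modifying ab initio exon scores with comparison results can be sketched roughly as follows. This is a simplified illustration, not the scoring scheme of twinscan or sgp2: the per-base "score density" projection, the `weight` parameter, and all function names are assumptions made for the example.

```python
from typing import List, Tuple

# (start, end, score): 0-based coordinates on the query, end exclusive.
Interval = Tuple[int, int, float]


def project_hsps(hsps: List[Interval], length: int) -> List[float]:
    """Project HSPs onto the query: each base keeps the highest
    per-base score density among the HSPs that cover it."""
    per_base = [0.0] * length
    for start, end, score in hsps:
        if end <= start:
            continue  # skip empty or malformed intervals
        density = score / (end - start)
        for pos in range(max(start, 0), min(end, length)):
            per_base[pos] = max(per_base[pos], density)
    return per_base


def rescore_exons(exons: List[Interval], hsps: List[Interval],
                  length: int, weight: float = 1.0) -> List[Interval]:
    """Add a similarity bonus to each ab initio exon score, proportional
    to the projected HSP signal that the exon overlaps."""
    per_base = project_hsps(hsps, length)
    return [(start, end, score + weight * sum(per_base[start:end]))
            for start, end, score in exons]


if __name__ == "__main__":
    hsps = [(10, 20, 50.0)]
    exons = [(12, 18, 1.0), (30, 40, 2.0)]
    for exon in rescore_exons(exons, hsps, length=50):
        print(exon)
```

An exon supported by conservation gains score, while an unsupported exon keeps its ab initio score, so the downstream gene assembly step prefers conserved exons.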

  11. syntenic gene prediction (sgp2) • figure tracks: query sequence, tblastx HSPs, HSP projections, geneid exons, SGP exons

  12. programs based on mouse-human genome sequence comparisons improve gene predictions • accuracy on human chromosome 22

  13. how accurate are the sgp predictions: nucleotide level

  14. how accurate are the sgp predictions: exon level
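The two evaluation levels named on these slides can be sketched as follows, assuming the convention common in gene-finding evaluation where sensitivity is TP/(TP+FN) and "specificity" is TP/(TP+FP). The slides do not give their formulas, so this convention, the function names, and the toy coordinates are assumptions.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end), 0-based, end exclusive


def nucleotide_accuracy(annotated: List[Exon],
                        predicted: List[Exon]) -> Tuple[float, float]:
    """Sensitivity and specificity over coding nucleotides."""
    ann = {p for start, end in annotated for p in range(start, end)}
    pred = {p for start, end in predicted for p in range(start, end)}
    tp = len(ann & pred)
    sn = tp / len(ann) if ann else 0.0
    sp = tp / len(pred) if pred else 0.0
    return sn, sp


def exon_accuracy(annotated: List[Exon],
                  predicted: List[Exon]) -> Tuple[float, float]:
    """Sensitivity and specificity over exons; an exon counts as
    correct only when both of its boundaries match exactly."""
    ann, pred = set(annotated), set(predicted)
    exact = len(ann & pred)
    sn = exact / len(ann) if ann else 0.0
    sp = exact / len(pred) if pred else 0.0
    return sn, sp


if __name__ == "__main__":
    annotated = [(0, 100), (200, 300)]
    predicted = [(0, 100), (220, 300), (400, 450)]
    print(nucleotide_accuracy(annotated, predicted))
    print(exon_accuracy(annotated, predicted))
```

The toy case shows why the two levels differ: a prediction with one slightly wrong boundary still scores well at the nucleotide level but counts as a miss at the exon level.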

  15. gene prediction programs predict a large number of genes • predictions in the mouse genome

  16. and a large number of novel genes ... • predictions in the mouse genome

  17. ...with exons... • predictions in the mouse genome

  18. that look like fine proteins • predictions in the mouse genome

  19. almost every mouse gene has a human orthologue • predictions in the mouse genome

  20. orthologous human mouse genes have conserved exonic structure
      (protein alignment of chr1_2213 and chr1_1808; the |1b, |2b, |3a lines mark intron positions, whose column offsets are not preserved)
                |1b
      chr1_2213 MSTNICSFKDRCVSILCCKFCKQVLSSRGMKAVLLADTEIDLFSTDIPPTNAVDFTGRCY
                **** *:*******************************:************:*** ****
      chr1_1808 MSTNNCTFKDRCVSILCCKFCKQVLSSRGMKAVLLADTDIDLFSTDIPPTNTVDFIGRCY
                |1b
                |2b |3a
      chr1_2213 FTKICKCKLKDIACLKCGNIVGYHVIVPCSSCLLSCNNRHFWMFHSQAVYDINRLDSTGV
                ** *********************************** ***********.*****:***
      chr1_1808 FTGICKCKLKDIACLKCGNIVGYHVIVPCSSCLLSCNNGHFWMFHSQAVYGINRLDATGV
                |2b |3a
      chr1_2213 NVLLRGNLPEIEESTDEDVLNISAEECIR
                *:** ***** **.***:.*:***** **
      chr1_1808 NLLLWGNLPETEECTDEETLEISAEEYIR

  21. orthologous human mouse genes have conserved exonic structure. data on 1506 human/mouse refseq orthologues • 85% of the orthologous pairs have an identical number of exons • 91% of the orthologous exons have identical length • 99.5% of the orthologous exons have identical phase • there are a few cases of intron insertion/deletion (22) • U12 introns appear to be strongly conserved between human and mouse • non-canonical GC-AG introns are less conserved.
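The three comparisons behind these percentages (exon count, exon length, exon phase) can be sketched for a single orthologous pair as below. This is a hypothetical illustration: the `Exon` record, the function name, and the rule of comparing exons in order only when the counts agree are assumptions, not the procedure used to obtain the figures on the slide.

```python
from typing import Dict, List, NamedTuple


class Exon(NamedTuple):
    start: int  # 0-based start on the genomic sequence
    end: int    # end, exclusive
    phase: int  # 0, 1 or 2


def compare_exonic_structure(gene_a: List[Exon],
                             gene_b: List[Exon]) -> Dict[str, object]:
    """Compare the exonic structure of two orthologous genes.

    Exons are paired in order, and only when both genes have the same
    number of exons (a simplifying assumption).
    """
    same_count = len(gene_a) == len(gene_b)
    pairs = list(zip(gene_a, gene_b)) if same_count else []
    same_length = sum(1 for x, y in pairs
                      if x.end - x.start == y.end - y.start)
    same_phase = sum(1 for x, y in pairs if x.phase == y.phase)
    return {
        "same_exon_count": same_count,
        "compared_exons": len(pairs),
        "identical_length": same_length,
        "identical_phase": same_phase,
    }


if __name__ == "__main__":
    human = [Exon(100, 250, 0), Exon(400, 520, 0), Exon(900, 1000, 0)]
    mouse = [Exon(50, 200, 0), Exon(330, 450, 0), Exon(800, 903, 0)]
    print(compare_exonic_structure(human, mouse))
```

Summing these per-pair counts over a set of orthologous pairs would give percentages of the kind reported on the slide.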

  22. we will target genes with conserved intron positions
      (protein alignment of chr10_1592 and chr19_1200; the |2a, |1a, ... lines mark intron positions, whose column offsets are not preserved)
                 |2a
      chr10_1592 LGSETCCNSHTSLQTSGVPDGSNNNSALIFITALQKMFTGFLLVNKSSCKLNPCWEKVQV
                                                     *  .   ****:** ** ******
      chr19_1200 ------------------------------------MRCSQEPVNKSACKSNPRWEKVQV
                 |1a
      chr10_1592 SSLYKLTDNCVNLQPLKRKEKKATLITLLSFTLHLLSSLAALRWDVNLPVNAVRKWMVQE
                 *************************** ***:***************************
      chr19_1200 SSLYKLTDNCVNLQPLKRKEKKATLITPLSFALHLLSSLAALRWDVNLPVNAVRKWMVQG
                 |3b
      chr10_1592 GQELEISISGGCLTFMGKSSSNSVITALLMAEELHHYDNFFYSCEPKSSLLFLLSRAVIE
                 ************************************************************
      chr19_1200 GQELEISISGGCLTFMGKSSSNSVITALLMAEELHHYDNFFYSCEPKSSLLFLLSRAVIE
                 |2b |4b
      chr10_1592 VCLYGV-LNSKVCQLQKVYILINTPVAWRSEGLADRWLPRKAQQASHLQHLVVGAREQAQ
                 .****    .   : :********************** ************ .** . .*
      chr19_1200 ACLYGENTAGPGLHSRKVYILINTPVAWRSEGLADRWLLRKAQQASHLQHLSAGATRAVQ
                 |3c

  23. sequence conservation and coding function

  24. orthologous splice sites are more conserved than expected solely from their splicing function

  25. orthologous splice sites are more conserved than expected solely from their splicing function

  26. prediction of splice sites

  27. we will target genes with conserved intron positions

  28. the final pools • predictions in the mouse genome

  29. rtpcr: targeting conserved intron positions
      (protein alignment of chr10_1592 and chr19_1200; the |2a, |1a, ... lines mark intron positions, whose column offsets are not preserved)
                 |2a
      chr10_1592 LGSETCCNSHTSLQTSGVPDGSNNNSALIFITALQKMFTGFLLVNKSSCKLNPCWEKVQV
                                                     *  .   ****:** ** ******
      chr19_1200 ------------------------------------MRCSQEPVNKSACKSNPRWEKVQV
                 |1a
      chr10_1592 SSLYKLTDNCVNLQPLKRKEKKATLITLLSFTLHLLSSLAALRWDVNLPVNAVRKWMVQE
                 *************************** ***:***************************
      chr19_1200 SSLYKLTDNCVNLQPLKRKEKKATLITPLSFALHLLSSLAALRWDVNLPVNAVRKWMVQG
                 |3b
      chr10_1592 GQELEISISGGCLTFMGKSSSNSVITALLMAEELHHYDNFFYSCEPKSSLLFLLSRAVIE
                 ************************************************************
      chr19_1200 GQELEISISGGCLTFMGKSSSNSVITALLMAEELHHYDNFFYSCEPKSSLLFLLSRAVIE
                 |2b |4b
      chr10_1592 VCLYGV-LNSKVCQLQKVYILINTPVAWRSEGLADRWLPRKAQQASHLQHLVVGAREQAQ
                 .****    .   : :********************** ************ .** . .*
      chr19_1200 ACLYGENTAGPGLHSRKVYILINTPVAWRSEGLADRWLLRKAQQASHLQHLSAGATRAVQ
                 |3c

  30. rt-pcr on 12 normal mouse adult tissues, and direct sequencing of the amplimers

  31. rt-pcr on 12 normal mouse adult tissues, and direct sequencing of the amplimers

  32. about 1000 human genes not in ensembl • low support by ESTs: 34% match EST sequences • low representation in other vertebrate genomes: 33% have sequence matches in fish genomes • restricted expression patterns

  33. restricted expression patterns

  34. Code | B H K Y V S M L T K E O | %Id | Homology
      (each ● marks one of the 12 tissue columns; which column is not preserved)
      3B1 | ● ● | 38% | Dystrophin-like; with ZZ domain
      3B3 | ● ● ● ● ● | 25% | Novel aquaporin; similar to Drosophila CG12251
      3C3 | ● ● ● ● ● | 25% | TEP1 (telomerase associated); probable ATPase
      3C5 | ● ● | 47% | Voltage-dependent calcium channel gamma subunit
      4B3 | ● ● ● | 34% | Interferon-induced / fragilis transmembrane family
      4C6 | ● ● ● ● ● | 30% | Interleukin 22-binding protein CRF2-10
      4G4 | ● ● ● ● | 64% | Nna1p, nuclear ATP/GTP-binding protein
      5B5 | ● ● ● | 43% | Likely aminophospholipid flippase (transporting ATPase)
      1E3 | ● ● ● ● ● | 40% | N-acetylated-α-linked-acidic dipeptidase (NAALADase)
      6C4 | ● ● | 42% | Not-type homeobox; poss. involved in notochord development
      6F5 | ● ● ● | 66% | Drosophila brain-specific homeobox protein (bsh)
      11F2 | ● ● ● ● ● | 29% | Human GABA-B receptor 2, neurotransmitter release regulator
      5A2 | ● ● ● ● | 41% | Skate liver organic solute transporter beta
      11B6 | ● ● ● | 55% | Interferon-activatable protein 203; nuclear protein
      12B3 | ● ● ● ● ● ● ● ● | 25% | Fatty acid desaturase; maintains membrane integrity
      11F6 | ● ● ● ● ● ● ● | 44% | Rat vanilloid receptor type 1 like protein 1
      12E3 | ● ● | 52% | Fizzy/CDC20; modulates degradation of cell-cycle proteins
      12F1 | ● ● ● ● ● | 43% | Otoferlin (mutated in DFNB9, nonsyndromic deafness)
      12H1 | ● ● ● | 45% | Fruitfly additional sex combs; a Polycomb group protein
      12C4 | ● ● ● | 43% | C. elegans C15C8.2; single-minded-like; HLH and PAS domains
      12D2 | ● | 41% | Cytosolic phospholipase A2, group IVB
      12A5 | ● | 38% | Fruitfly GH15686p; Ent2-like nucleoside transporter
      12E5 | ● ● ● ● | 32% | Relaxin 3 preproprotein; prohormone of the insulin family
      11A1 | ● ● ● ● ● | 89% | Mouse BET3, involved in ER to Golgi transport
      11A2 | ● ● ● ● ● ● | 70% | Vacuolar ATP synthase subunit S1
      11B2 | ● ● ● ● ● ● | 54% | Myosin light chain kinase, skeletal muscle
      11G2 | ● ● ● ● ● ● ● ● ● ● | 36% | Dapper / frodo (transduces Wnt signals by interacting with Dsh)

  35. limitations: sensitivity of the procedure

  36. specificity of the prediction can be improved: Ka/Ks ratio

  37. further work • scale the procedure. Try to find rtpcr evidence for (almost) every human gene not yet confirmed • intronless genes • human specific gene families (if any) • genes with non-canonical splicing

  38. selenoproteins • Selenoproteins are proteins that incorporate selenocysteine (Sec), the 21st amino acid. • Function: mostly redox enzymes • Distribution: 3 domains of life • Number: 22 families in mammals

  39. selenoproteins • UGA (STOP) is the codon for Sec • There is a tRNASec whose anticodon (UCA) pairs with the UGA codon • Recoding requires: an RNA structure (the SECIS element); SECIS binding proteins

  40. selenoproteins

  41. the SECIS element • computational search for selenoproteins • figure labels: SECIS pattern, dSelG

  42. using geneid to search for selenoproteins • Predict SECIS (PatScan) • Gene prediction with: in-frame TGA; SECIS
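The combination of an in-frame TGA with a downstream SECIS can be illustrated with a toy open-reading-frame scan. This is not geneid's gene model or PatScan's pattern search: SECIS positions are taken as precomputed input, and the function name, the `max_utr` window, and the single-strand, intronless ORF scan are all simplifying assumptions.

```python
from typing import List, Tuple

STOPS = {"TAA", "TAG", "TGA"}


def candidate_sec_orfs(seq: str, secis_starts: List[int],
                       max_utr: int = 2000) -> List[Tuple[int, int, int]]:
    """Return (orf_start, tga_pos, orf_end) for ORFs that read through
    one in-frame TGA and have a SECIS start within `max_utr` bases
    downstream of the terminating stop codon.

    `secis_starts` are hypothetical precomputed SECIS positions
    (for example, from a separate pattern search).
    """
    seq = seq.upper()
    hits = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        tga_pos = None
        for pos in range(start + 3, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if codon not in STOPS:
                continue
            if codon == "TGA" and tga_pos is None:
                tga_pos = pos  # treat the first in-frame TGA as Sec
                continue
            end = pos + 3  # first base after the terminating stop
            if tga_pos is not None and any(
                    end <= s <= end + max_utr for s in secis_starts):
                hits.append((start, tga_pos, end))
            break
    return hits


if __name__ == "__main__":
    toy = "ATGGCTTGAGCTTAACCCC"
    print(candidate_sec_orfs(toy, secis_starts=[17]))
```

Requiring the SECIS is what keeps the search specific: without it, every ORF that happens to contain a TGA would be reported.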

  43. genome wide search in drosophila

  44. dSelG

  45. dSelM

  46. dSelG and dSelM: experimental verification

  47. dSelM has selenoprotein homologues in vertebrates

  48. COMPARATIVE GENE PREDICTION • SELENOPROTEINS