Explore gene prediction techniques comparing genomic sequences using various algorithms, improving accuracy and identification of novel genes.
finding genes by comparing genomes roderic guigó i serra imim/upf/crg, barcelona
number of genes on chromosome 22
• initial annotation: 545 (Dunham et al., 1999)
• genscan + RT-PCR: 590 (Das et al., 2001)
• genscan + microarrays: 730 (Shoemaker et al., 2001)
• reviewed annotation: 726 (chr22 team, sanger, 2001)
• mouse shotgun data: +20 (our data)
• geneid predictions: 794
• genscan predictions: 1128
number of genes in the human genome
• Consortium: 30,000-40,000 (2001)
• Celera: 27,000-38,000 (2001)
• Consortium + Celera: 50,000 (Hogenesch et al., 2001)
• DB searches: 65,000-75,000 (Wright et al., 2001)
• Human Genome Sciences: 90,000-120,000 (Haseltine, 2001)
comparative gene prediction
• rosetta (Batzoglou et al., 2000)
• cem (Bafna and Huson, 2000)
• sgp1 (Wiehe et al., 2001)
• twinscan (Korf et al., 2001)
• slam (Pachter et al., 2001)
• doublescan (Meyer and Durbin, 2002)
• sgp2 (Parra et al., 2003)
comparative gene prediction
1. THE GENE PREDICTION IS THE RESULT OF THE SEQUENCE ALIGNMENT
• given two homologous genomic sequences, infer the exonic structure in each sequence that maximizes the score of the alignment of the resulting amino acid sequences.
• This problem is usually solved through a complex extension of the classical dynamic programming algorithm for sequence alignment.
• blayo et al., 2002
• pedersen and scharling, 2002
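For orientation, the classical dynamic programming algorithm that these methods extend can be sketched as follows. This is only the plain global alignment score (Needleman-Wunsch style) with an assumed toy scoring scheme (match +1, mismatch -1, gap -1). It is not the exon-aware extension used by the cited programs, and the function name `global_alignment_score` is hypothetical.

```python
def global_alignment_score(a: str, b: str,
                           match: int = 1,
                           mismatch: int = -1,
                           gap: int = -1) -> int:
    """Score of the best global alignment of a and b (toy scoring scheme)."""
    # previous[j] holds the best score for aligning a[:i-1] with b[:j]
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            current.append(max(
                previous[j - 1] + substitution,  # align a[i-1] with b[j-1]
                previous[j] + gap,               # gap in b
                current[j - 1] + gap,            # gap in a
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "AGT"))
```

The comparative methods on this slide add states for exons, introns and splice sites on top of this recurrence, which is what makes the extension complex.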
comparative gene prediction
2. GENE PREDICTION AND SEQUENCE ALIGNMENT ARE PRODUCED SIMULTANEOUSLY
• given two homologous genomic sequences, pair hidden Markov models (pair HMMs) for sequence alignment and generalized HMMs (GHMMs) for gene prediction are combined into the so-called generalized pair HMMs
• progen – novichkov et al., 2001
• slam – pachter et al., 2001
• doublescan – meyer and durbin, 2002
comparative gene prediction
3. GENE PREDICTION IS SEPARATED FROM SEQUENCE ALIGNMENT
• first, an alignment between two homologous genomic sequences is obtained using some generic sequence alignment program, such as tblastx, sim4 or glass
• then, gene structures are predicted that are compatible with this alignment, meaning that predicted exons fall in the aligned regions.
• rosetta – batzoglou et al., 2000
• cem – bafna and huson, 2000
• sgp-1 – wiehe et al., 2001
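The compatibility criterion ("predicted exons fall in the aligned regions") can be illustrated with a small interval filter. This is a minimal sketch under assumptions: exons and aligned regions are half-open `(start, end)` coordinates on the same sequence, and the names `merge_intervals`, `covered_fraction`, `compatible_exons` and the `min_fraction` threshold are hypothetical, not taken from rosetta, cem or sgp-1.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end)


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_fraction(exon: Interval, aligned: List[Interval]) -> float:
    """Fraction of the exon's bases lying inside aligned regions."""
    start, end = exon
    covered = sum(max(0, min(end, a_end) - max(start, a_start))
                  for a_start, a_end in merge_intervals(aligned))
    return covered / (end - start)


def compatible_exons(exons: List[Interval], aligned: List[Interval],
                     min_fraction: float = 1.0) -> List[Interval]:
    """Keep exons covered by the alignment at least min_fraction (assumed cutoff)."""
    return [e for e in exons if covered_fraction(e, aligned) >= min_fraction]


if __name__ == "__main__":
    exons = [(100, 200), (300, 400), (500, 600)]
    aligned = [(90, 150), (150, 210), (350, 450)]
    print(compatible_exons(exons, aligned))
```

With `min_fraction=1.0` only exons lying entirely inside aligned regions survive; lowering it relaxes the requirement.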
comparative gene prediction
4. GENE PREDICTION IS (EVEN MORE) SEPARATED FROM SEQUENCE ALIGNMENT
• This approach does not require the comparison of two homologous genomic sequences. Rather, a query sequence from a target genome is compared against a collection of sequences from a second (informant, reference) genome, and the results of the comparison are used to modify the scores of the exons produced by underlying "ab initio" gene prediction algorithms.
• twinscan – korf et al., 2001
• sgp-2 – parra et al., 2003
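The idea of modifying ab initio exon scores with similarity evidence can be sketched as below. This is an illustration under stated assumptions, not the scoring used by twinscan or sgp-2: HSPs are given as `(start, end, score)` on the query, each HSP is spread evenly over its bases, overlapping HSPs are projected by keeping the highest per-base value, and the `weight` parameter and function names are hypothetical.

```python
from typing import List, Tuple

Hsp = Tuple[int, int, float]    # (start, end, score), half-open on the query
Exon = Tuple[int, int, float]   # (start, end, ab initio score)


def project_hsps(hsps: List[Hsp], length: int) -> List[float]:
    """Project HSPs onto the query: per position, the best per-base HSP score."""
    projection = [0.0] * length
    for start, end, score in hsps:
        per_base = score / (end - start)
        for pos in range(max(0, start), min(length, end)):
            projection[pos] = max(projection[pos], per_base)
    return projection


def rescore_exons(exons: List[Exon], hsps: List[Hsp], length: int,
                  weight: float = 1.0) -> List[Exon]:
    """Add a similarity bonus (assumed linear in the projection) to each exon score."""
    projection = project_hsps(hsps, length)
    return [(start, end, score + weight * sum(projection[start:end]))
            for start, end, score in exons]


if __name__ == "__main__":
    hsps = [(10, 20, 50.0), (15, 30, 30.0)]
    exons = [(12, 18, 1.0), (18, 25, -0.5), (32, 38, 2.0)]
    for exon in rescore_exons(exons, hsps, length=40):
        print(exon)
```

Exons overlapping strong HSPs gain score, while exons with no similarity support keep their ab initio score, so the downstream gene assembly still sees every candidate exon.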
syntenic gene prediction (sgp2)
[figure: tracks labelled tblastx HSPs, HSP projections, query sequence, geneid exons, SGP exons]
programs based on mouse-human genome sequence comparisons improve gene predictions: accuracy on human chromosome 22
predictions in the mouse genome
• gene prediction programs predict a large number of genes
• and a large number of novel genes ...
• ...with exons...
• that look like fine proteins
• almost every mouse gene has a human orthologous counterpart
orthologous human mouse genes have conserved exonic structure

|1b
chr1_2213  MSTNICSFKDRCVSILCCKFCKQVLSSRGMKAVLLADTEIDLFSTDIPPTNAVDFTGRCY
           **** *:*******************************:************:*** ****
chr1_1808  MSTNNCTFKDRCVSILCCKFCKQVLSSRGMKAVLLADTDIDLFSTDIPPTNTVDFIGRCY
|1b |2b |3a
chr1_2213  FTKICKCKLKDIACLKCGNIVGYHVIVPCSSCLLSCNNRHFWMFHSQAVYDINRLDSTGV
           ** *********************************** ***********.*****:***
chr1_1808  FTGICKCKLKDIACLKCGNIVGYHVIVPCSSCLLSCNNGHFWMFHSQAVYGINRLDATGV
|2b |3a
chr1_2213  NVLLRGNLPEIEESTDEDVLNISAEECIR
           *:** ***** **.***:.*:***** **
chr1_1808  NLLLWGNLPETEECTDEETLEISAEEYIR
orthologous human mouse genes have conserved exonic structure. data on 1506 human/mouse refseq orthologues
• 85% of the orthologous pairs have an identical number of exons
• 91% of the orthologous exons have identical length
• 99.5% of the orthologous exons have identical phase
• there are a few cases of intron insertion/deletion (22)
• U12 introns appear to be strongly conserved between human and mouse
• non-canonical GC-AG introns are less conserved.
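The three quantities in these statistics (exon number, exon length, exon phase) can be compared for one orthologous pair as sketched below. The sketch assumes each gene is given as a list of coding exon lengths in transcript order, and it defines the phase of an exon as the number of coding bases preceding it modulo 3. That definition and the names `exon_phases` and `compare_structures` are assumptions for illustration, not the procedure used to obtain the figures above.

```python
from typing import Dict, List


def exon_phases(exon_lengths: List[int]) -> List[int]:
    """Phase of each exon: coding bases before it, modulo 3 (assumed definition)."""
    phases = []
    total = 0
    for length in exon_lengths:
        phases.append(total % 3)
        total += length
    return phases


def compare_structures(lengths_a: List[int], lengths_b: List[int]) -> Dict[str, object]:
    """Compare exon count and, when counts agree, per-exon length and phase."""
    same_count = len(lengths_a) == len(lengths_b)
    result: Dict[str, object] = {
        "same_exon_count": same_count,
        "exons_compared": 0,
        "same_length": 0,
        "same_phase": 0,
    }
    if same_count:
        phases_a, phases_b = exon_phases(lengths_a), exon_phases(lengths_b)
        result["exons_compared"] = len(lengths_a)
        result["same_length"] = sum(x == y for x, y in zip(lengths_a, lengths_b))
        result["same_phase"] = sum(x == y for x, y in zip(phases_a, phases_b))
    return result


if __name__ == "__main__":
    human = [120, 90, 61, 200]   # made-up exon lengths
    mouse = [120, 93, 61, 200]   # second exon differs by one codon
    print(compare_structures(human, mouse))
```

In the made-up example the second exon differs in length by a whole codon, so length identity drops while every phase is preserved, which is one way phase can be more conserved than length.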
we will target genes with conserved intron positions

|2a
chr10_1592  LGSETCCNSHTSLQTSGVPDGSNNNSALIFITALQKMFTGFLLVNKSSCKLNPCWEKVQV
                                                *  .   ****:** ** ******
chr19_1200  ------------------------------------MRCSQEPVNKSACKSNPRWEKVQV
|1a
chr10_1592  SSLYKLTDNCVNLQPLKRKEKKATLITLLSFTLHLLSSLAALRWDVNLPVNAVRKWMVQE
            *************************** ***:***************************
chr19_1200  SSLYKLTDNCVNLQPLKRKEKKATLITPLSFALHLLSSLAALRWDVNLPVNAVRKWMVQG
|3b
chr10_1592  GQELEISISGGCLTFMGKSSSNSVITALLMAEELHHYDNFFYSCEPKSSLLFLLSRAVIE
            ************************************************************
chr19_1200  GQELEISISGGCLTFMGKSSSNSVITALLMAEELHHYDNFFYSCEPKSSLLFLLSRAVIE
|2b |4b
chr10_1592  VCLYGV-LNSKVCQLQKVYILINTPVAWRSEGLADRWLPRKAQQASHLQHLVVGAREQAQ
            .****    .   : :********************** ************ .** . .*
chr19_1200  ACLYGENTAGPGLHSRKVYILINTPVAWRSEGLADRWLLRKAQQASHLQHLSAGATRAVQ
|3c
orthologous splice sites are more conserved than expected solely from their splicing function
predictions in the mouse genome: the final pools
rtpcr: targeting conserved intron positions
rt-pcr on 12 normal adult mouse tissues, and direct sequencing of the amplimers
about 1000 human genes not in ensembl
• low support by ESTs: 34% match EST sequences
• low representation in other vertebrate genomes: 33% have sequence matches in fish genomes
• restricted expression patterns
Code B H K Y V S M L T K E O %Id Homology
3B1 ● ● 38% Dystrophin-like; with ZZ domain
3B3 ● ● ● ● ● 25% Novel aquaporin; similar to Drosophila CG12251
3C3 ● ● ● ● ● 25% TEP1 (telomerase associated); probable ATPase
3C5 ● ● 47% Voltage-dependent calcium channel gamma subunit
4B3 ● ● ● 34% Interferon-induced / fragilis transmembrane family
4C6 ● ● ● ● ● 30% Interleukin 22-binding protein CRF2-10
4G4 ● ● ● ● 64% Nna1p, nuclear ATP/GTP-binding protein
5B5 ● ● ● 43% Likely aminophospholipid flippase (transporting ATPase)
1E3 ● ● ● ● ● 40% N-acetylated-α-linked-acidic dipeptidase (NAALADase)
6C4 ● ● 42% Not-type homeobox; poss. involved in notochord development
6F5 ● ● ● 66% Drosophila brain-specific homeobox protein (bsh)
11F2 ● ● ● ● ● 29% Human GABA-B receptor 2, neurotransmitter release regulator
5A2 ● ● ● ● 41% Skate liver organic solute transporter beta
11B6 ● ● ● 55% Interferon-activatable protein 203; nuclear protein
12B3 ● ● ● ● ● ● ● ● 25% Fatty acid desaturase; maintains membrane integrity
11F6 ● ● ● ● ● ● ● 44% Rat vanilloid receptor type 1 like protein 1
12E3 ● ● 52% Fizzy/CDC20; modulates degradation of cell-cycle proteins
12F1 ● ● ● ● ● 43% Otoferlin (mutated in DFNB9, nonsyndromic deafness)
12H1 ● ● ● 45% Fruitfly additional sex combs; a Polycomb group protein
12C4 ● ● ● 43% C. elegans C15C8.2; single-minded-like; HLH and PAS domains
12D2 ● 41% Cytosolic phospholipase A2, group IVB
12A5 ● 38% Fruitfly GH15686p; Ent2-like nucleoside transporter
12E5 ● ● ● ● 32% Relaxin 3 preproprotein; prohormone of the insulin family
11A1 ● ● ● ● ● 89% Mouse BET3, involved in ER to Golgi transport
11A2 ● ● ● ● ● ● 70% Vacuolar ATP synthase subunit S1
11B2 ● ● ● ● ● ● 54% Myosin light chain kinase, skeletal muscle
11G2 ● ● ● ● ● ● ● ● ● ● 36% Dapper / frodo (transduces Wnt signals by interacting with Dsh)
further work
• scale the procedure: try to find rtpcr evidence for (almost) every human gene not yet confirmed
• intronless genes
• human specific gene families (if any)
• genes with non-canonical splicing
selenoproteins
Selenoproteins are proteins that incorporate selenocysteine (Sec), the 21st amino acid.
• Function: mostly redox enzymes
• Distribution: 3 domains of life
• Number: 22 families in mammals
selenoproteins
• UGA (normally a STOP codon) is the codon for Sec
• There is a tRNASec that reads UGA (anticodon UCA)
• Recoding:
  • RNA structure: the SECIS element
  • SECIS binding proteins
the SECIS element. computational search for selenoproteins
[figure: SECIS pattern; dSelG]
using geneid to search for selenoproteins
• Predict SECIS elements (PatScan)
• Gene prediction with
  • in-frame TGA
  • SECIS
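The combination of an in-frame TGA and a SECIS can be illustrated with a toy scan over a single DNA strand. This is a sketch under assumptions and does not reproduce geneid or PatScan: it only looks at uninterrupted ATG-to-stop open reading frames on the forward strand, treats TAA and TAG as terminators and every in-frame TGA as a candidate Sec codon, and takes SECIS start positions as an already computed list. The name `candidate_sec_orfs` and the `max_secis_distance` cutoff are hypothetical.

```python
from typing import List, Tuple

TERMINATORS = {"TAA", "TAG"}  # TGA is treated as a possible Sec codon here


def candidate_sec_orfs(seq: str, secis_starts: List[int],
                       max_secis_distance: int = 2000
                       ) -> List[Tuple[int, int, List[int]]]:
    """Return (orf_start, orf_end, in_frame_tga_positions) for ORFs that
    contain an in-frame TGA and have a SECIS candidate downstream."""
    seq = seq.upper()
    results = []
    for frame in range(3):
        pos = frame
        while pos + 3 <= len(seq):
            if seq[pos:pos + 3] != "ATG":
                pos += 3
                continue
            tga_positions = []
            end = None
            cursor = pos + 3
            while cursor + 3 <= len(seq):
                codon = seq[cursor:cursor + 3]
                if codon in TERMINATORS:
                    end = cursor + 3
                    break
                if codon == "TGA":
                    tga_positions.append(cursor)
                cursor += 3
            if end is None:
                break  # no terminator left in this frame
            has_secis = any(end <= s <= end + max_secis_distance
                            for s in secis_starts)
            if tga_positions and has_secis:
                results.append((pos, end, tga_positions))
            pos = end  # continue after this ORF, staying in frame
    return results


if __name__ == "__main__":
    toy = "CC" + "ATG" + "GCT" + "TGA" + "GGT" + "TAA" + "ACGT" * 5
    print(candidate_sec_orfs(toy, secis_starts=[30]))
```

A real search would also have to handle introns, both strands, and the case where a TGA is the genuine stop codon, which this sketch deliberately ignores.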
dSelM has selenoprotein homologues in vertebrates
COMPARATIVE GENE PREDICTION SELENOPROTEINS