Bioinformatics Workshop 1: Sequences and Similarity Searches
• Open a web browser and type in the URL: informatics.gurdon.cam.ac.uk/online/workshops
• Bookmark this page
• Click on the link to the file: useful-websites.html
• Bookmark this page too
• It also contains links to the example sequence files used in the workshop, and the presentations themselves
The Universe of Biological Data
[Diagram labels: GENES; ONTOLOGIES; linear genes; expressed sequences; mapping markers; 3D structures; polymorphisms; assembled genomes; similarity searches; hidden Markov models; regulatory elements; expression data from EST libraries; expression data from (e.g.) in situ hybridisation; expression data from microarrays; examples; guilt by association; models; interaction data; actual interaction]
Sequence Biology
[Diagram labels: gene ~ gene model; locus; genome; primary transcript; mRNA; protein]
Genes and Loci
Even on a conceptual level we’re not quite clear about what a gene is…
Locus: a region on the genome that is transcribed
Gene? Defined by the function of its protein?
So what if more than one locus produces identically functioning proteins?
And what if a single locus (transcript) produces two quite different proteins?
Derivative Sequences
[Diagram: mRNA, cloned into a cDNA library]
• 5’ EST and 3’ EST: single-pass sequence from each end of the clone
• cDNA sequence: multiple-pass sequencing over the whole length of the clone
Initial Growth of Databases
• Lots of ESTs were generated
• Some clones were selected for full-insert sequencing -> cDNAs
• cDNAs were translated to yield presumed protein sequences
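The translation step in the last bullet can be sketched in a few lines. This is a minimal illustration, assuming the standard genetic code and that translation starts at the first ATG found; the function name `translate` and the example sequence are made up for the sketch and are not taken from the workshop files.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(cdna):
    """Translate from the first ATG up to the first in-frame stop codon."""
    start = cdna.find("ATG")
    if start == -1:
        return ""  # no start codon, so no presumed protein
    protein = []
    for i in range(start, len(cdna) - 2, 3):
        # Unknown codons (e.g. containing N) are reported as X.
        aa = CODON_TABLE.get(cdna[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

# Made-up cDNA fragment: ATG GCT TCG TAA
print(translate("GGATGGCTTCGTAAGG"))
```

The result is only a "presumed" protein in the slide's sense: nothing here checks that the chosen ATG is the real start site.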
Then Came Genomes
• With increasingly large fragments of genomic sequence came the ability to align cDNAs to the genome to create gene models
• And then to apply our understanding of exon/intron structure to predict theoretical genes…
Introns and Exons
[Diagram: genome drawn as exon, intron, exon, intron, exon, with the splice sites marked]
mRNA:
CTACCATCCATGCTAACCATTCTACCATTTTATACTCATGCAACGGACCGTAGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA
gene model (exons):
CTACCATCCATGCTAACCATTCTAC | CATTTTATACTCATGCAACGGACCGT | AGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA
genome:
CTACCATCCATGCTAACCATTCTACGTAAGTCATCTATATCAATATTATTTCAGCATTTTATACTCATGCAACGGACCGTGTCAGTATTACAGAGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA
splice sites: donor GTAAG…, acceptor …TTTCAG
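The relationship between the three sequences on this slide can be checked with a short script. This is a rough sketch that uses exact string matching in place of a real cDNA-to-genome alignment (real aligners tolerate mismatches and gaps); the function names are illustrative only.

```python
# Sequences copied from the slide.
GENOME = ("CTACCATCCATGCTAACCATTCTACGTAAGTCATCTATATCAATATTATTTCAG"
          "CATTTTATACTCATGCAACGGACCGTGTCAGTATTACAG"
          "AGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA")
EXONS = ["CTACCATCCATGCTAACCATTCTAC",
         "CATTTTATACTCATGCAACGGACCGT",
         "AGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA"]
MRNA = ("CTACCATCCATGCTAACCATTCTACCATTTTATACTCATGCAACGGACCGT"
        "AGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA")

def locate_exons(genome, exons):
    """Find each exon in order; return its (start, end) span on the genome."""
    spans, cursor = [], 0
    for exon in exons:
        start = genome.find(exon, cursor)
        if start == -1:
            raise ValueError(f"exon not found: {exon}")
        spans.append((start, start + len(exon)))
        cursor = start + len(exon)
    return spans

def introns_between(genome, spans):
    """Return the genomic sequence lying between consecutive exon spans."""
    return [genome[a_end:b_start]
            for (_, a_end), (b_start, _) in zip(spans, spans[1:])]

spans = locate_exons(GENOME, EXONS)
for intron in introns_between(GENOME, spans):
    # Each intron should start with the donor GT and end with the acceptor AG.
    print(intron[:2], "...", intron[-2:], len(intron), "bp")
```

Joining the exons reproduces the mRNA, and both introns recovered this way begin with GT and end with AG, as the slide's donor and acceptor labels suggest.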
Gene Predictions
• Given:
  • coding sequence must run from ATG to a STOP codon, in-frame
  • introns (GT……AG) can be spliced out
• Also take a statistical approach:
  • coding and non-coding sequence are slightly different in composition
  • some ‘possible’ splice sites are more likely than others
[Figure: scan of the genomic sequence, showing candidate gene models and the most likely gene model]
genomic sequence:
…CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA…
candidates with one GT…AG span spliced out:
…CGTCGTATGGCTTCGATTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA…
…CGTCGTATGGCTTCGATGTAGTACATCGGATCGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA…
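The two "given" rules can be turned into a toy scan. The sketch below enumerates every GT…AG span as a candidate intron, splices it out, and looks for ATG-to-STOP reading frames in the result. Ranking candidates by the longest open reading frame is a stand-in chosen for illustration, not the statistical scoring the slide refers to, and all function names, the `min_len` threshold, and the test sequence are assumptions.

```python
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq):
    """Return every ATG...STOP open reading frame (stop codon included)."""
    orfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOPS:
                orfs.append(seq[start:pos + 3])
                break
    return orfs

def candidate_introns(seq, min_len=4):
    """Return (start, end) for every GT...AG span of at least min_len bases."""
    spans = []
    for i in range(len(seq) - 1):
        if seq[i:i + 2] != "GT":
            continue
        for j in range(i + 2, len(seq) - 1):
            if seq[j:j + 2] == "AG" and j + 2 - i >= min_len:
                spans.append((i, j + 2))
    return spans

def best_model(seq):
    """Pick the unspliced or single-intron model with the longest ORF."""
    models = [(None, seq)]
    models += [((s, e), seq[:s] + seq[e:]) for s, e in candidate_introns(seq)]
    scored = [(intron, max(find_orfs(spliced), key=len, default=""))
              for intron, spliced in models]
    return max(scored, key=lambda model: len(model[1]))

# Made-up sequence: an in-frame TAG sits inside a removable GT...AG span.
print(best_model("ATGAAAGTATAGCAGCCCGGGTTTTAA"))
```

A real gene predictor would weigh splice-site strength and coding composition rather than ORF length alone, and would consider models with several introns.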
Supporting Evidence!
[Diagram: gene model with exons 1, 2, 3 and 4 on the genome, and the EST evidence]
We note that even though there is good evidence for the existence of all four exons, there is no evidence that all the exons would appear together on a real transcript. An alternative transcript, skipping exon 3, would be plausible, if a little unlikely. This becomes less ambiguous as more ESTs become available, as clones are sequenced at both ends (which helps put distant exons into the same transcript), and eventually as full-length transcript sequences become available.
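The distinction between "every exon has evidence" and "one transcript contains every exon" can be made concrete. The sketch below uses invented clone names and exon hits purely for illustration; none of this data comes from the workshop.

```python
# Hypothetical evidence (not from the workshop data): clone name -> set of
# exons hit by that clone's EST reads.
clones = {
    "cloneA": {1, 2},
    "cloneB": {2, 3},
    "cloneC": {3, 4},
}

def summarise_support(clones, exons):
    """Return (every exon seen somewhere, some single clone hits all exons)."""
    wanted = set(exons)
    seen = set().union(*clones.values())
    every_exon_seen = wanted <= seen
    one_clone_has_all = any(wanted <= hit for hit in clones.values())
    return every_exon_seen, one_clone_has_all

print(summarise_support(clones, [1, 2, 3, 4]))
```

With this toy data each exon is supported, but no single clone ties all four together, which is the situation the slide describes.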
Theoretical/Predicted Sequences
[Diagram: gene model with exons 1, 2, 3 and 4 on the genome, leading to a predicted transcript and a predicted protein]
We’ve now reversed the process of working out exon structure from aligning cDNA sequences against the genome sequence, but we shouldn’t lose sight of the fact that we don’t really know if these predicted proteins exist, especially where supporting EST evidence is weak or non-existent. Predicted transcripts may lack UTR sequence, depending on how EST data was used…
So What’s in the Databases Now?
• At NCBI
  • 15,000,000 EST sequences
  • 3,329,110 non-redundant DNA sequences (excluding ESTs, etc.)
  • 2,693,904 non-redundant translated coding sequences
  • 954,378 Protein Reference Sequences (RefSeq)
• But the majority of RefSeq may be translations of theoretical transcripts…
Main Data Axes
• Europe: EBI/EMBL
  • Swiss-Prot/TrEMBL/Ensembl/UniProt
• US: NIH/NCBI
  • GenBank/UniGene/RefSeq/Entrez
• Japan: DNA Data Bank of Japan
  • National Institute of Genetics
Synchronisation…
[Diagram: you submit a sequence (ATCGATCGATCATAGTATGCTAGCTGCTA); the same sequence then appears under the same accession, BC009638.1, in GenBank, EMBL and DDBJ]
Sequences and Accession Numbers
NM_001015922.1  gi=62860271  GATCGTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAAAA
NM_001015922.2  gi=62860589  GACCGTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAAAA
BC009638.1      gi=16307106  GTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAAA
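The first two records share an accession but differ in version, gi number and sequence. A small sketch, assuming the usual accession.version layout, splits the identifier and lists where the two versions differ; the helper names are illustrative.

```python
def split_accession(identifier):
    """Split 'NM_001015922.2' into ('NM_001015922', 2)."""
    accession, _, version = identifier.partition(".")
    return accession, int(version)

def differences(a, b):
    """List (position, base_in_a, base_in_b) where two sequences differ."""
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]

# Sequences copied from the slide.
v1 = "GATCGTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAAAA"
v2 = "GACCGTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAAAA"

print(split_accession("NM_001015922.1"), split_accession("NM_001015922.2"))
print(differences(v1, v2))
```

On the slide's example the two versions differ at a single base (T in version 1, C in version 2, at 0-based position 2), which illustrates why the version suffix matters when citing a sequence.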
Sequences and Genes
Currently a gene can be represented by many sequences, and some are more representative than others. A gene also has a position on the genome. Names are fluid, as sometimes are structures. It is difficult to point to any one thing and say ‘that is the gene’. Genes need some sort of conceptual representation, and then we hang all the other bits and pieces off that. But for the moment it’s a bit untidy…
Main Data Portals
• NCBI Entrez Databases
• ExPASy Proteomics Server
• DNA Data Bank of Japan DDBJ
• EBI Ensembl Genome Browser
• Santa Cruz Genome Browser
• Model Organism Databases, e.g. FlyBase