A Quest for Genes
gene (jēn) n. A hereditary unit consisting of a sequence of DNA that occupies a specific location on a chromosome and determines a particular characteristic in an organism. Genes undergo mutation when their DNA sequence changes. [German Gen, from gen-, begetting, in Greek words (such as genos, race, offspring).] From: http://www.answers.com/topic/gene
Genes are the units of heredity in living organisms. They are encoded in the organism's genetic material (usually DNA or RNA), and control the development and behavior of the organism. The word "gene" ... is shared by many disciplines, including classical genetics, molecular genetics, evolutionary biology and population genetics. Because each discipline models the biology of life differently, the usage of the word gene varies between disciplines. It may refer to either material or conceptual entities. Following the discovery that DNA is the genetic material, and with the growth of biotechnology and the project to sequence the human genome, the common usage of the word "gene" has increasingly reflected its meaning in molecular biology, namely the segments of DNA which cells transcribe into RNA and translate, at least in part, into proteins. The Sequence Ontology project defines a gene as: "A locatable region of genomic sequence, corresponding to a unit of inheritance, which is associated with regulatory regions, transcribed regions and/or other functional sequence regions". From: http://en.wikipedia.org/wiki/Gene
What’s a gene?
And so the gene is • A segment of DNA • It has certain coordinates and sequence • It is inherited • It is transcribed • It may be translated • It has regulatory and other functional regions • Genes in different organisms may be homologous • Its sequence may be optimized for transcription • Its sequence may be optimized for translation • The structure and sequence of these regions may have something in common
Some or all of these features can be used to find genes.
RNA- and protein-coding genes • All genes are transcribed. Transcripts are messenger RNAs (mRNAs), ribosomal RNAs (rRNAs), transfer RNAs (tRNAs), etc. • Only mRNAs are translated into proteins
How to find rRNA- and tRNA-coding genes? • Homology • Secondary structure
Protein-coding genes • are transcribed and translated • their sequence is optimized for transcription and translation
Transcription • Is performed by a special enzyme – RNA polymerase • RNA polymerase binds strongly to a specific promoter sequence, where DNA is unwound • RNA is polymerized in 5’>3’ direction (DNA is read from 3’>5’) • RNA polymerization is stopped upon reaching a terminator Signals in DNA sequence that can be recognized: promoters and terminators. Promoters are A+T rich for easy strand separation, terminators are G+C rich. Most important contribution: transcription-coupled DNA repair
Transcription-coupled DNA repair and mutational drift • Mutations: transitions (purine to purine or pyrimidine to pyrimidine) and transversions (purine to pyrimidine or pyrimidine to purine) • Transitions are more common than transversions • Spontaneous transitions most often change G:C to A:T (equivalently, C:G to T:A), for example through cytosine deamination • If unrepaired, spontaneous mutations will enrich the DNA sequence with A+T, while G+C will gradually disappear • Sequences repaired more efficiently will have higher G+C content than the rest of the genome • The rate-limiting step in DNA repair is recognition of damaged base pairs • Damaged base pairs are efficiently recognized by RNA polymerase, which stalls and attracts DNA repair machinery • Actively transcribed sequences are repaired more efficiently than the rest of the genome
Conclusion: protein-coding regions will have higher G+C content than non-coding regions.
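The two ideas on this slide, classifying a substitution as a transition or transversion and comparing G+C content between regions, can be sketched in a few lines of Python. This is only an illustrative sketch: the function names (`classify_substitution`, `gc_content`, `gc_windows`) and the window parameters are assumptions made for this example, not part of any gene-finding program named in these slides.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref, alt):
    """Label a single-base change as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid:
        raise ValueError("bases must be one of A, C, G, T")
    if ref == alt:
        raise ValueError("ref and alt are identical, so nothing changed")
    # Same chemical class (purine<->purine or pyrimidine<->pyrimidine)
    if {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES:
        return "transition"
    return "transversion"


def gc_content(seq):
    """Fraction of G and C bases in a DNA sequence (0.0 for an empty one)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_windows(seq, window, step):
    """G+C fraction in successive windows (hypothetical helper for scanning)."""
    return [
        gc_content(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, step)
    ]
```

Scanning a genome with `gc_windows` and looking for stretches whose G+C fraction is above the genome-wide `gc_content` is one crude way to flag candidate coding regions; real gene finders combine this signal with others.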
Translation • Ribosome binds to a region upstream of the start codon (Shine-Dalgarno sequence) • The start codon defines the reading frame for subsequent codons • mRNA is read in 5’>3’ direction • Translation is terminated upon reaching a stop codon Signals in DNA sequence that can be recognized: • Shine-Dalgarno • start codons (AUG, GUG, UUG) • stop codons (UAG, UGA, UAA) • open reading frames (ORFs) • codon, di-codon and tri-codon usage • amino acid composition
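The start codons, stop codons and reading frames listed above are enough to sketch a naive ORF scan. The example below is a minimal sketch under stated assumptions: it works on DNA (so U is written as T), reports coordinates relative to the strand being scanned, always takes the most upstream start codon in a frame, and ignores the Shine-Dalgarno signal entirely. The names `find_orfs`, `reverse_complement` and `min_codons` are hypothetical and chosen for this example.

```python
START_CODONS = {"ATG", "GTG", "TTG"}   # AUG, GUG, UUG written as DNA
STOP_CODONS = {"TAG", "TGA", "TAA"}    # UAG, UGA, UAA written as DNA
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """Scan all six reading frames for start...stop open reading frames.

    Returns a list of (strand, frame, start, end, orf_sequence) tuples.
    start/end are 0-based, end-exclusive positions on the scanned strand
    (for "-" they refer to the reverse complement, not the input).
    The ORF length counted against min_codons includes the stop codon.
    """
    orfs = []
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon in START_CODONS:
                    start = i                      # first start in this frame
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None                   # look for the next ORF
    return orfs
```

For instance, `find_orfs("CCATGAAATTTTAGCC")` finds a single ORF, `ATGAAATTTTAG`, in frame 2 of the forward strand. In practice `min_codons` would be set much higher, since short ORFs occur by chance in any sequence.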
Computational protein-coding gene prediction algorithms • GeneMark • Glimmer • GeneScan • GeneLook • GeneHacker • EasyGene • GS-Finder • ZCurve • Orpheus • Fgenes • Critica • ...
How they work: • Find a set of ORFs • Consider some ORFs as “typical protein-coding sequences” and some other regions as “typical non-coding sequences” for use as a training set • Find some features that would distinguish “typical coding sequences” from “typical non-coding sequences” (GC content, codon usage, di-codon usage, patterns, motifs, etc.) • Find other ORFs with similar features
How typical is your “typical coding” sequence?
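The train-then-score idea can be illustrated with the simplest of the listed features, codon usage. The sketch below is a toy under stated assumptions, not the method of GeneMark, Glimmer or any other program named above: it estimates codon frequencies from a "typical coding" set and a "typical non-coding" set, then scores a candidate ORF by its average log-odds per codon. The function names and the `pseudocount` parameter are hypothetical.

```python
import math
from collections import Counter
from itertools import product

CODONS = ["".join(p) for p in product("ACGT", repeat=3)]  # all 64 codons
_CODON_SET = set(CODONS)


def _codons(seq):
    """Yield in-frame codons of seq, skipping any with non-ACGT characters."""
    seq = seq.upper()
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in _CODON_SET:
            yield codon


def codon_frequencies(sequences, pseudocount=1.0):
    """Estimate codon frequencies from a training set of in-frame sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(_codons(seq))
    total = sum(counts[c] + pseudocount for c in CODONS)
    # The pseudocount keeps unseen codons from getting zero probability
    return {c: (counts[c] + pseudocount) / total for c in CODONS}


def log_odds_score(orf, coding_freq, noncoding_freq):
    """Average log-odds per codon; positive means 'looks more like coding'."""
    codons = list(_codons(orf))
    if not codons:
        return 0.0
    total = sum(math.log(coding_freq[c] / noncoding_freq[c]) for c in codons)
    return total / len(codons)
```

A candidate ORF would be kept when its score exceeds some threshold. The result depends heavily on how representative the training sets are, which is exactly the concern raised by the question on this slide.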
Guinness Book of protein-coding genes • The longest human gene is 2,220,223 nucleotides long. It has 79 exons with a total of only 11,058 nucleotides, which specify the sequence of 3,685 amino acids, and it encodes the protein dystrophin. Dystrophin is part of a protein complex located in the cell membrane, which transfers the force generated by the actin-myosin structure inside the muscle fiber to the entire fiber. • The smallest human gene is 252 nucleotides long; it specifies a polypeptide of 67 amino acids and encodes insulin-like growth factor II. • The longest bacterial gene is 110,418 nucleotides long and specifies a sequence of 36,805 amino acids. Its function is unknown; it most likely encodes a surface protein. • The smallest bacterial gene is 54 nucleotides long; it specifies a polypeptide of 17 amino acids and encodes a regulatory protein in cyanobacteria
Computational protein-coding gene prediction algorithms correctly identify 90-95% of genes. Manual analysis is required to find the rest.