Bio-Medical Informatics
• Instructor: Hanif Yaghoobi
• Website: site444703.44.webydo.com
• E-mail: Hyiautcourse@gmail.com
• Personal e-mail: hanifeyaghoobi@gmail.com
About this Course
• Activities during the semester (5 points): 1) homework 2) MATLAB exercises
• Final project (3 points)
• Final exam (12 points)
Shortliffe
“Medical informatics is the rapidly developing scientific field that deals with resources, devices and formalized methods for optimizing the storage, retrieval and management of biomedical information for problem solving and decision making.”
Edward Shortliffe, MD, PhD, 1995
Organisms
• Classified into two types:
  • Eukaryotes: contain a membrane-bound nucleus and organelles (plants, animals, fungi, …)
  • Prokaryotes: lack a true membrane-bound nucleus and organelles (single-celled; includes bacteria)
• Not all single-celled organisms are prokaryotes!
Cells
• Complex system enclosed in a membrane
• Organisms are unicellular (bacteria, baker’s yeast) or multicellular
• Humans:
  • 60 trillion cells
  • 320 cell types
• Figure: example animal cell (source: www.ebi.ac.uk/microarray/biology_intro.htm)
DNA Basics – cont.
• DNA in eukaryotes is organized in chromosomes.
Chromosomes
• In eukaryotes, the nucleus contains one or several double-stranded DNA molecules organized as chromosomes
• Humans:
  • 22 pairs of autosomes
  • 1 pair of sex chromosomes
• Figure: human karyotype (source: http://avery.rutgers.edu/WSSP/StudentScholars/Session8/Session8.html)
What is DNA?
• DNA: deoxyribonucleic acid
• Each strand is a polynucleotide (oligomer): a chain of nucleotides
• 4 different nucleotides, named after their bases:
  • Adenine (A)
  • Cytosine (C)
  • Guanine (G)
  • Thymine (T)
Nucleotide Bases
• Purines (A and G)
• Pyrimidines (C and T)
• The difference is in base structure: purines have two rings, pyrimidines have one
• Image source: www.ebi.ac.uk/microarray/biology_intro.htm
The Central Dogma: Protein Synthesis
• Genome --(transcription)--> Transcriptome --(translation)--> Proteome --> Cell function
• Figure label: gene expression level
Genome
• The chromosomal DNA of an organism
• The number of chromosomes and the genome size vary significantly from one organism to another
• Genome size and number of genes do not necessarily determine organism complexity
Genome Comparison

ORGANISM                               CHROMOSOMES (haploid)   GENOME SIZE (bp)   GENES
Homo sapiens (Humans)                  23                      3,200,000,000      ~30,000
Mus musculus (Mouse)                   20                      2,600,000,000      ~30,000
Drosophila melanogaster (Fruit Fly)    4                       180,000,000        ~18,000
Saccharomyces cerevisiae (Yeast)       16                      14,000,000         ~6,000
Zea mays (Corn)                        10                      2,400,000,000      ???
DNA Basics – cont.
• The DNA in each chromosome can be read as a discrete signal over the alphabet {a,t,c,g} (for example: atgatcccaaatggaca…).
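One way to make the "discrete signal" view concrete is to map each letter to an integer. The sketch below is only illustrative: the function name `encode_dna` and the particular mapping (a→0, t→1, c→2, g→3) are assumptions, not something the slides specify.

```python
# Hypothetical numeric code for each nucleotide; any one-to-one mapping would do.
NUCLEOTIDE_CODE = {"a": 0, "t": 1, "c": 2, "g": 3}


def encode_dna(sequence):
    """Turn a DNA string into a list of integers (a discrete signal)."""
    return [NUCLEOTIDE_CODE[base] for base in sequence.lower()]


# The example sequence from the slide, without the trailing ellipsis.
signal = encode_dna("atgatcccaaatggaca")
print(signal)
```

The integer values carry no biological meaning; they only make the sequence usable by signal-processing tools that expect numbers.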
DNA Basics – cont.
• In genes (protein-coding regions), the nucleotides (letters) are read as triplets (codons) while the protein is built from amino acids. Each codon specifies one amino acid (there are 20 amino acids) or a stop signal.
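Reading codons can be sketched in a few lines. The example below assumes the standard genetic code and uses illustrative names (`CODON_TABLE`, `translate`) that are not taken from the slides; "*" stands for a stop codon.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code written in TCAG order; "*" marks a stop codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
# Map every triplet (TTT, TTC, TTA, ...) to its one-letter amino acid.
CODON_TABLE = {
    "".join(triplet): amino_acid
    for triplet, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string codon by codon, ignoring a trailing partial codon."""
    dna = dna.upper()
    codons = [dna[i:i + 3] for i in range(0, len(dna) - 2, 3)]
    return "".join(CODON_TABLE[codon] for codon in codons)


print(translate("ATGCTCAGCGTGACCTCA"))
```

There are 64 codons but only 20 amino acids, so most amino acids are specified by more than one codon.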
DNA Basics – cont.
• There are 6 ways of translating a DNA signal into a codon signal, called the reading frames (3 offsets × 2 strand directions).
• Example sequence: …CATTGCCAGT…
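The six reading frames can be enumerated directly: three offsets on the given strand and three on its reverse complement. This is a minimal sketch; the helper names (`reverse_complement`, `split_codons`, `six_frames`) are illustrative choices, and the fragment is the one shown on the slide.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def split_codons(dna, offset):
    """Cut a sequence into complete triplets, starting at the given offset."""
    return [dna[i:i + 3] for i in range(offset, len(dna) - 2, 3)]


def six_frames(dna):
    """Map (strand, offset) to the list of codons read in that frame."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[(strand, offset)] = split_codons(seq, offset)
    return frames


for frame, codons in six_frames("CATTGCCAGT").items():
    print(frame, codons)
```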
DNA Basics – cont.
• Start codon: ATG
• Stop codons: TAA, TGA, TAG
• Figure: a gene on the sequence …CATTGCCAGT…, drawn as segments labelled Exon, Intron, Exon, Intron, Exon, Exon
Understanding Genome Sequences ~3,289,000,000 characters: aattgtgctctgcaaattatgatagtgatctgtatttactacgtgcatat attttgggccagtgaatttttttctaagctaatatagttatttggacttt tgacatgactttgtgtttaattaaaacaaaaaaagaaattgcagaagtgt tgtaagcttgtaaaaaaattcaaacaatgcagacaaatgtgtctcgcagt cttccactcagtatcatttttgtttgtaccttatcagaaatgtttctatg tacaagtctttaaaatcatttcgaacttgctttgtccactgagtatatta tggacatcttttcatggcaggacatatagatgtgttaatggcattaaaaa taaaacaaaaaactgattcggccgggtacggtggctcacgcctgtaatcc cagcactttgggagatcgaggagggaggatcacctgaggtcaggagttac agacatggagaaaccccgtctctactaaaaatacaaaattagcctggcgt ggtggcgcatgcctgtaatcccagctactcgggaggctgaggcaggagaa tcgcttgaacccgggagcggaggttgcggtgagccgagatcgcaccgttg cactccagcctgggcgacagagcgaaactgtctcaaacaaacaaacaaaa aaacctgatacatggtatgggaagtacattgtttaaacaatgcatggaga tttaggttgtttccagtttttactggcacagatacggcaatgaatataat tttatgtatacattcatacaaatatatcggtggaaaattcctagaagtgg aatggctgggtcagtgggcattcatattgagaaattggaaggatgttgtc aaactctgcaaatcagagtattttagtcttaacctctcttcttcacaccc ttttccttggaagaaagctaaatttagacttttaaacacaaaactccatt ttgagacccctgaaaatctgggttcaaagtgtttgaaaattaaagcagag gctttaatttgtacttatttaggtataatttgtactttaaagttgttcca. . . Goal: Identify components encoded in the DNA sequence
Open Reading Frame
ATG CTC AGC GTG ACC TCA . . . CAG CGT TAA
 M   L   S   V   T   S  . . .  Q   R  STP
• A protein-encoding DNA sequence consists of a sequence of 3-letter codons
• Starts with the START codon (ATG)
• Ends with a STOP codon (TAA, TAG, or TGA)
Finding Open Reading Frames
ATG CTC AGC GTG ACC TCA . . . CAG CGT TAA
 M   L   S   V   T   S  . . .  Q   R  STP
Try all possible starting points:
• 3 possible offsets
• 2 possible strands
A simple algorithm finds all ORFs in a genome:
• Many of these are spurious (not real genes)
• How do we focus on the real ones?
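The "simple algorithm" is not spelled out on the slide; one plausible reading is a scan of every offset on both strands that opens an ORF at ATG and closes it at the first in-frame stop codon. The sketch below follows that reading. The function name `find_orfs` and the coordinate convention are assumptions.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna):
    """Return (strand, start, end) for each ATG...stop run in all six frames.

    Coordinates index into the strand being scanned; for "-" that is the
    reverse complement of the input. `end` is exclusive and includes the stop.
    """
    dna = dna.upper()
    strands = {"+": dna, "-": dna.translate(COMPLEMENT)[::-1]}
    orfs = []
    for strand, seq in strands.items():
        for offset in range(3):
            start = None  # position of the currently open ATG, if any
            for i in range(offset, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == START:
                    start = i
                elif start is not None and codon in STOPS:
                    orfs.append((strand, start, i + 3))
                    start = None
    return orfs


print(find_orfs("ATGCTCAGCGTGACCTCACAGCGTTAA"))
```

A scan like this reports every ATG-to-stop stretch, however short, which is why many hits are spurious; a minimum-length filter is a common first refinement.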
Using Additional Genomes
Basic premise: “What is important is conserved”
Evolution = Variation + Selection
• Variation is random
• Selection reflects function
Idea:
• Instead of studying a single genome, compare related genomes
• A real open reading frame will be conserved
Phylogenetic Tree of Yeasts [Kellis et al, Nature 2003]
• Species on the tree: S. cerevisiae, S. paradoxus, S. mikatae, S. bayanus, C. glabrata, S. castellii, K. lactis, A. gossypii, K. waltii, D. hansenii, C. albicans, Y. lipolytica, N. crassa, M. graminearum, M. grisea, A. nidulans, S. pombe
• Scale label on the figure: ~10M years
Evolution of Open Reading Frame
S. cerevisiae   ATGCTCAGCGTGACCTCA . . .
S. paradoxus    ATGCTCAGCGTGACATCA . . .
S. mikatae      ATGCTCAGGGTGACA--A . . .
S. bayanus      ATGCTCAGG---ACA--A . . .
• Figure labels: conserved positions, variable positions, a deletion
• A frame shift changes the interpretation of the downstream sequence
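Whether a deletion disturbs the reading frame depends on its length: a gap whose length is a multiple of 3 removes whole codons, while any other length shifts every codon after it. The sketch below checks each gap run in an aligned sequence on its own. The function name `classify_gaps` is illustrative, and the assignment of the slide's sequences to S. mikatae and S. bayanus follows the order in which they are listed.

```python
import re


def classify_gaps(aligned):
    """List each run of '-' as (position, length, effect on the reading frame)."""
    gaps = []
    for match in re.finditer(r"-+", aligned):
        length = match.end() - match.start()
        effect = "in-frame" if length % 3 == 0 else "frame shift"
        gaps.append((match.start(), length, effect))
    return gaps


print(classify_gaps("ATGCTCAGGGTGACA--A"))  # assumed to be S. mikatae
print(classify_gaps("ATGCTCAGG---ACA--A"))  # assumed to be S. bayanus
```

Treating gaps independently is a simplification: two nearby frame-shifting gaps can restore the frame between them.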
Examples
• Legend: conserved, variable, frame shift
• Figure labels: confirmed ORF; spurious ORF; ATG not conserved; sequencing error
• A greedy algorithm for finding conserved ORFs is surprisingly effective (> 99% accuracy) on verified yeast data [Kellis et al, Nature 2003]
Defining Conservation
• Legend: conserved, variable
• Example alignment columns (one letter per species, nine species) and their % conservation:
  AAAAAAAAA   100
  ACAGTCGGT    33
  AAAACCCCC    55
  CCCACAAAC    55
Naïve approach:
• Consensus between all species
Problem:
• Rough grained
• Ignores distances between species
• Ignores the tree topology
Goal:
• More sensitive and robust methods
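The naïve consensus score can be written as the share of species that carry the most common letter in a column. In the sketch below the slide's letter grid is assumed to be stored column by column, nine species per column; the columns are listed in an order chosen to match the slide's percentages, and the name `percent_conserved` is illustrative.

```python
from collections import Counter


def percent_conserved(column):
    """Whole-percent share of species carrying the column's most common letter."""
    most_common_count = Counter(column).most_common(1)[0][1]
    return 100 * most_common_count // len(column)


# Columns read from the slide's letter grid (assumed column-major, 9 species each).
columns = ["AAAAAAAAA", "ACAGTCGGT", "AAAACCCCC", "CCCACAAAC"]
print([percent_conserved(column) for column in columns])
```

Every species counts equally here, which is exactly the weakness the slide names: the score ignores how far apart the species are and how the tree is shaped.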
Bioinformatics – an area of emerging knowledge
• Each cell of the body contains the whole DNA of the individual (about 40,000 genes in the human genome, each comprising from 50 to a million base pairs: A, T, C or G)
• The central dogma: DNA -> RNA -> proteins
• Transcription: DNA (about 5%) -> mRNA
  • DNA -> pre-RNA -> splicing -> mRNA (only the exons)
• Translation: mRNA -> proteins
• Proteins make cells alive and specialised (e.g. blue eyes)
• Genome -> proteome
N. Kasabov, 2003
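The DNA -> pre-RNA -> splicing -> mRNA chain can be illustrated with a toy example. Everything specific here is made up for illustration: the 12-letter gene, the exon coordinates, and the function names `transcribe` and `splice`. The input is taken to be the coding strand, so transcription reduces to replacing T with U.

```python
def transcribe(coding_strand):
    """Pre-RNA with the same sequence as the DNA coding strand, U in place of T."""
    return coding_strand.upper().replace("T", "U")


def splice(pre_rna, exons):
    """Join the exon intervals (start, end), end exclusive; introns are dropped."""
    return "".join(pre_rna[start:end] for start, end in exons)


gene = "ATGGTAAGTCCC"        # made-up toy sequence
exons = [(0, 3), (9, 12)]    # hypothetical exon coordinates
pre_rna = transcribe(gene)
mrna = splice(pre_rna, exons)
print(pre_rna, "->", mrna)
```

In real genes the exon boundaries are not given in advance; finding them is the splice-junction discovery problem mentioned elsewhere in these slides.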
Bioinformatics
• The area of science concerned with the development and application of methods, tools and systems for storing and processing biological information to facilitate knowledge discovery.
• Interdisciplinary: Information and Computer Science, Molecular Biology, Biochemistry, Genetics, Physics, Chemistry, Health and Medicine, Mathematics and Statistics, Engineering, Social Sciences.
• Biology, Medicine -- Information Science --> IT, Clinics, Pharmacy
• Links to health informatics, clinical DSS, pharmaceutical industry
N. Kasabov, 2003
Bioinformatics: challenging problems for computer and information sciences
• Discovering patterns (features) in DNA and RNA sequences (e.g. genes, promoters, ribosome binding sites (RBS), splice junctions)
• Analysis of gene expression data and predicting protein abundance
• Discovering gene networks: genes that are co-regulated over time
• Protein discovery and protein function analysis
• Predicting the development of an organism from its DNA code (?)
• Modeling the full development (metabolic processes) of a cell (?)
• Implications: health, social, …
N. Kasabov, 2003
Problems in Computational Modeling for Bioinformatics
• An abundance of genome data, RNA data, protein data and metabolic pathway data is now available (see http://www.ncbi.nlm.nih.gov), and this is just the beginning of computational modeling in bioinformatics
• Complex interactions:
  • between proteins, genes, DNA code
  • between the genome and the environment
  • much yet to be discovered
• Stability and repetitiveness: genes are relatively stable carriers of information
• Many sources of uncertainty:
  • Alternative splicing
  • Mutation in genes caused by ionising radiation (e.g. X-rays), chemical contamination, replication errors, viruses that insert genes into host cells, aging processes, etc.
  • Mutated genes express differently and cause the production of different proteins
• It is extremely difficult to model dynamic, evolving processes
N. Kasabov, 2003
Bioinformatics: Important Challenges
• Gene prediction
• Gene function
• Protein function
• Protein 3D structure
• Figure labels: Transcription, Translation
Public Databases
• DNA sequence: {A,T,C,G}
• Microarray: gene expression level
• Protein sequence: KMLSLLMARTYW
• Figure labels: Transcription, Translation
Microarray
• What can it be used for?
• How does it work?
• What are the advantages?
• An example application