Bioinformatics and Machine Learning
Contents • What is Bioinformatics? • Why Bioinformatics? • Problems in Bioinformatics • Machine Learning Methods • ML Methods for Bio Data Mining • Terminologies in Bioinformatics • Bioinformatics in Korea (2003-2) Bioinformatics
What is Bioinformatics? (1) • Bioinformatics (Korean: 생물정보학 or 생명정보학) • Biology + Informatics • The field that studies how to use computers to systematically organize, analyze, and use data about living organisms • Narrow sense • The field concerned with storing, managing, and using information on the sequences and structures of DNA and proteins • Molecular bioinformatics • Broad sense • Every field that uses computers to study the life sciences
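What is Bioinformatics? (2) • Research topics • Comprehensive study of the genome • Study of the sequence, i.e. the genome written as a string of characters • Proteomics: study of the proteome • Functional genomics: study of the 2D and 3D structures of the proteome • Phylogenetic studies • Comparative study of normal and diseased cells (typical example: cancer cells)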
What is Bioinformatics? (3) • Computational biology • Uses computers for research in areas of the life sciences that require complex computation • Computational structural biology • Uses computers to study protein structure • Biocomputing • Biological computing • Molecular computing • Biological computation • A discipline that runs in the opposite direction to bioinformatics • Its aim is to advance computing using knowledge gained from the life sciences • Example: DNA computers
What is Bioinformatics? (4) • Steps of bioinformatics • Data collection and organization • GDB, SWISS-PROT, GenBank, PDB, … • Analysis of the collected data using computational tools • Prediction of the biological functions of genes and proteins based on structural data
Why Bioinformatics? (1) • Background • Rapid advances in molecular genetics, molecular biology, and genetic engineering • Genome projects • Accumulation of vast amounts of biological information • Growing efforts to build systematic databases with computers and to analyze and use them efficiently • Flood of data (SWISS-PROT): sequence data • What can we do by analyzing these data? • Ancestors of organisms • Phylogenetic trees • Protein structures • Protein function
[Figure: Number of sequences (x 1000, axis 0 to 80) by year of release, 1988 to 1996]
Why Bioinformatics? (2) • Bioinformatics is about • Elicitation of DNA sequences from genetic material • Sequence annotation (e.g. with information from experiments) • Understanding the control of gene expression (i.e. under what circumstances genes are transcribed from DNA and translated into proteins) • The relationship between the amino acid sequence of proteins and their structure
Why Bioinformatics? (3) • Aims of research in bioinformatics • Understand the functioning of living things, to “improve the quality of life” • Drug design • Identification of genetic risk factors • Gene therapy • Genetic modification of food crops and animals • Biological warfare, crime, etc. • Personalized medicine? • E-Doctor?
Why Bioinformatics? (4) • Bioinformatics market size • Sources • Cognia (www.cognia.com) • Biovista (www.biovista.com)
Why Bioinformatics? (5) • Bioinformatics as business • Software offering • Data visualization • Data management • Gene and protein analysis • Data filtering and transformation • Clustering and classification • Tools supporting laboratory experiment • Data offering • DNA sequence data • Gene expression data • Protein data • Medical genetics data • Biological text data • Business structure offering • Networking and service solution • Supercomputer • High performance storage system
Why Bioinformatics? (6) • Current trend • Integration of multiple data sources • Description of causal relationship • Simulation of biological processes • Prediction of anomaly • Generation of hypotheses • Literature summary for automatic data collection • Collaboration for research and development
Problems and Issues in Bioinformatics (1) • Central dogma of information flow in biology • The sequence of amino acids making up a protein, and hence its structure (folded state) and thus its function, is determined by the DNA sequence through transcription to RNA and translation to protein • [Figure: DNA → RNA → Protein → Function, with example DNA bases (ACTGG) and amino acids (Leu, Ala, Ser, Arg, Phe, Cys, Lys, Asp)]
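The DNA → RNA → protein flow above can be sketched in a few lines. This is a minimal illustration, assuming the standard genetic code and a coding strand read from its first base; the function names `transcribe` and `translate` and the example sequence are hypothetical and do not come from the slides.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna: str) -> str:
    """Return the mRNA for a coding-strand DNA sequence (T becomes U)."""
    return dna.upper().replace("T", "U")


def translate(dna: str) -> str:
    """Translate coding-strand DNA into one-letter amino acids, stopping at a stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    gene = "ATGGCCTGCTAA"  # hypothetical example sequence
    print(transcribe(gene))
    print(translate(gene))
```

Real genes add complications that this sketch ignores, such as introns, alternative reading frames, and non-standard genetic codes.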
Problems and Issues in Bioinformatics (2) • Three main classes of problem areas • Central dogma related: sequence, structure or function • Data related: storage, retrieval & analysis (exponential growth of knowledge in molecular biology) • Simulation of biological processes: protein folding (molecular dynamics) or metabolic pathways
Problems and Issues in Bioinformatics (3) • Topics in Bioinformatics • Sequence analysis • Sequence alignment • Structure and function prediction • Gene finding • Structure analysis • Protein structure comparison • Protein structure prediction • RNA structure modeling • Expression analysis • Gene expression analysis • Gene clustering • Pathway analysis • Metabolic pathway • Regulatory networks
Problems and Issues in Bioinformatics (4) • Sequence analysis • Finding evolutionary relationships • Finding coding regions of genomic sequences • Translating DNA to protein • Finding regulatory regions • Assembling genome sequences • Finding information and patterns in DNA and protein data
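As a rough illustration of "finding coding regions", the sketch below scans the three forward reading frames for open reading frames (an ATG followed in frame by a stop codon). It is a deliberately naive assumption-laden example: the name `find_orfs`, the `min_codons` threshold, and the test sequence are hypothetical, and practical gene finders (such as the learning-based tools listed later in these slides) use far richer signals.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2) -> list[tuple[int, int]]:
    """Return (start, end) index pairs of forward-strand ORFs; end is exclusive.

    An ORF here runs from the first ATG in a frame to the next in-frame stop
    codon, and must contain at least `min_codons` codons before the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # close it and keep scanning this frame
    return orfs


if __name__ == "__main__":
    sequence = "CCATGGCCTGCTAAGG"  # hypothetical example
    for begin, end in find_orfs(sequence):
        print(begin, end, sequence[begin:end])
```

The reverse strand is not scanned here; a fuller version would repeat the search on the reverse complement.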
Problems and Issues in Bioinformatics (4) • Structure analysis • The amino acid sequence of a protein determines its 3D conformation • Example sequence: MNIHRSTPITIARYGRSRNKTQDFEELSSIRSAEPSQSFSPNLGSPSPPETPNLSHCVSCIGKYLLLEPLEGDHVFRAVHLHSGEELVCKVFDISCYQESLAPCF • Sequence → Structure → Function
Problems and Issues in Bioinformatics (5) • Gene expression analysis
Problems and Issues in Bioinformatics (6) • Pathway analysis • One of the declarative ways of representing biological knowledge
[Figure: Metabolic pathway]
Problems and Issues in Bioinformatics (7) • Bioinformatics as information technology • [Figure labels: Bioinformatics; Database; GenBank; SWISS-PROT; Information Retrieval; Biomedical text analysis; Hardware; Supercomputing; Algorithm; Sequence alignment; Agent; Information filtering; Monitoring agent; Machine Learning; Pattern recognition; Clustering; Rule discovery]
Problems and Issues in Bioinformatics (8) • Bioinformatics on the Web • [Figure labels: The experimental process; sample; hybridization; array scanner; Data management; relational database; web interface; results and summaries; download data to other applications; image analysis; links to other information resources; Data analysis and interpretation]
Machine Learning (1) • Types of ML • Supervised learning • Estimate an unknown mapping from known input-output pairs • Learn f_w from training set D = {(x, d)} s.t. f_w(x) = y = d = f(x) • Classification: y is discrete • Regression: y is continuous • Unsupervised learning • Only input values are provided • Learn f_w from D = {(x)} s.t. f_w(x) = x • Compression • Clustering • Reinforcement learning • Input + reward r are provided sequentially with possible delay • Learn f_w from D = {(x, r(x, y))} s.t. the total reward is maximized
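To make the supervised case concrete, here is a minimal regression sketch: it learns a linear f_w from pairs (x, d) by ordinary least squares. The linear model, the names `fit_line` and `predict`, and the toy data are assumptions chosen for illustration; the slides do not prescribe any particular model.

```python
def fit_line(pairs: list[tuple[float, float]]) -> tuple[float, float]:
    """Fit f_w(x) = w0 + w1 * x to (x, d) pairs by ordinary least squares."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_d = sum(d for _, d in pairs) / n
    # Closed-form solution: slope = covariance(x, d) / variance(x).
    s_xx = sum((x - mean_x) ** 2 for x, _ in pairs)
    s_xd = sum((x - mean_x) * (d - mean_d) for x, d in pairs)
    w1 = s_xd / s_xx
    w0 = mean_d - w1 * mean_x
    return w0, w1


def predict(w: tuple[float, float], x: float) -> float:
    """Evaluate the learned mapping f_w at x."""
    return w[0] + w[1] * x


if __name__ == "__main__":
    training_set = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # toy D = {(x, d)}
    w = fit_line(training_set)
    print(w, predict(w, 4.0))
```

Because y is continuous this is regression; replacing the target with a discrete label and the model with a classifier gives the classification case.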
Machine Learning (2) • Why machine learning? • Recent progress in algorithms and theory • Growing flood of online data • Computational power is available • Budding industry • Three niches for machine learning • Data mining: using historical data to improve decisions • Medical records → medical knowledge • Software applications we can’t program by hand • Autonomous driving • Speech recognition • Self-customizing programs • Newsreader that learns user interests
Machine Learning (3) • Methods in machine learning • Symbolic learning • Version space learning • Case-based learning • Neural learning • Multi-layer perceptrons (MLPs) • Self-organizing maps (SOMs) • Support vector machines (SVMs) • Evolutionary learning • Evolution strategies • Evolutionary programming • Genetic algorithms • Genetic programming
Machine Learning (4) • Probabilistic learning • Bayesian networks (BNs) • Helmholtz machines (HMs) • Latent variable models (LVMs) • Generative topographic mapping (GTM) • Other machine learning methods • Decision trees (DTs) • Reinforcement learning (RL) • Boosting algorithms • Mixture of experts (ME) • Independent component analysis (ICA)
ML Methods for Bio Data Mining (1) • Sequence Alignment • Simulated Annealing • Genetic Algorithms • Structure and Function Prediction • Hidden Markov Models • Multi-layer Perceptrons • Decision Trees • Molecular Clustering and Classification • Support Vector Machines • Nearest Neighbor Algorithms • Expression (DNA Chip Data) Analysis • Self-Organizing Maps • Bayesian Networks
Problems in Biological Science and Machine Learning Methods • Problem: Sequence alignment (homology search) • Pairwise sequence alignment • Database search for similar sequences • Multiple sequence alignment • Phylogenetic tree reconstruction • Protein 3D structure alignment • Methods: Optimization algorithms • Dynamic programming • Simulated annealing • Genetic algorithms • Neural networks • Hidden Markov models • Problem: Structure/function prediction • RNA secondary structure prediction • RNA 3D structure prediction • Protein 3D structure prediction • Motif extraction • Functional site prediction • Cellular localization prediction • Coding region prediction • Transmembrane segment prediction • Protein secondary structure prediction • Methods: Pattern recognition and learning algorithms • Discriminant analysis • Hierarchical neural networks • Hidden Markov models • Formal grammar
Problems in Biological Science and Machine Learning Methods • Problem: Molecular clustering/classification • Superfamily classification • Ortholog/paralog grouping of genes • 3D fold classification • Methods: Clustering algorithms • Hierarchical cluster analysis • Kohonen neural networks • Methods: Classification algorithms • Bayesian Networks • Neural Networks • Support Vector Machines • Decision Trees • Problem: Expression (DNA Chip Data) Analysis • Methods • Support Vector Machines • Bayesian Networks • Latent Variable Models: Generative Topographic Mapping
ML Methods for Bio Data Mining (2) • Sequence alignment • Compare a new sequence with all known sequences • Find similar sequences • Reasonably infer that the new sequence has a similar function to the previously known genes • Method of alignment: dynamic programming • Example: alignments of ATTGGCCA with AGGA • ATTGGCCA / A--GG--A: 4 + 2*10 = 24 • ATTGGCCA / AGG----A: 6 + 1*10 = 16 • ATTGGCCA / AG----GA: 6 + 1*10 = 16 • ATTGGCCA / A----GGA: 6 + 1*10 = 16
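The dynamic-programming method named above can be sketched as a global (Needleman-Wunsch style) alignment. This is an illustrative sketch only: the scoring values (match +1, mismatch -1, gap -2 per position) and the function name `global_align` are assumptions made here and are not the scoring used in the slide's example, which does not state its scheme.

```python
def global_align(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Return (score, aligned_a, aligned_b) for an optimal global alignment."""
    n, m = len(a), len(b)

    def sub(i: int, j: int) -> int:
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two characters
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    total, row_a, row_b = global_align("ATTGGCCA", "AGGA")
    print(total)
    print(row_a)
    print(row_b)
```

With a per-position gap penalty like this one, scattered gaps cost the same as one long gap; schemes that charge extra for opening a gap favor fewer, longer gaps instead.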
ML Methods for Bio Data Mining (3) • Simulated annealing • For multiple sequence alignment • [Figure: energy E over the search space x (e.g. all possible alignments)]
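The idea of simulated annealing over an energy landscape E(x) can be sketched generically. The example below is a toy under stated assumptions: the geometric cooling schedule, the parameter defaults, and the one-dimensional integer search space stand in for the space of alignments, and the name `simulated_annealing` is hypothetical. For multiple sequence alignment, x would be a candidate alignment, the neighbor move would shift or insert gaps, and E would be an alignment cost.

```python
import math
import random


def simulated_annealing(energy, neighbor, start, t_start=10.0, t_end=1e-3,
                        cooling=0.99, seed=0):
    """Minimize `energy` by accepting uphill moves with probability exp(-delta / T)."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    current = best = start
    temperature = t_start
    while temperature > t_end:
        candidate = neighbor(current, rng)
        delta = energy(candidate) - energy(current)
        # Always accept improvements; accept worse states with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current = candidate
            if energy(current) < energy(best):
                best = current
        temperature *= cooling  # geometric cooling schedule
    return best


def toy_energy(x: int) -> int:
    """Toy landscape with its minimum at x = 3."""
    return (x - 3) ** 2


def toy_neighbor(x: int, rng: random.Random) -> int:
    """Propose a state one step to the left or right."""
    return x + rng.choice((-1, 1))


if __name__ == "__main__":
    result = simulated_annealing(toy_energy, toy_neighbor, start=20)
    print(result, toy_energy(result))
```

Accepting some uphill moves at high temperature is what lets the search escape local minima before the schedule cools.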
ML Methods for Bio Data Mining (4) • Structure and function prediction • Hidden Markov Models for Protein Modeling • Multi-layer Perceptrons for Internal Exon Prediction: GRAIL • Decision Trees for Gene Finding • [Figure labels: DNA; Regulatory region; Non-coding region; Protein coding region (AUG to TAA); Non-coding region]
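As a small illustration of how a hidden Markov model labels a sequence, the sketch below runs Viterbi decoding on a two-state toy model. Everything about the model is an assumption made for the example: the states ("AT" for AT-rich and "GC" for GC-rich regions), all probabilities, and the name `viterbi` are hypothetical and are not taken from the slides or from any published gene or protein model.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path for the observations (log space)."""
    # Log probability of the best path ending in each state after the first symbol.
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
               for s in states}]
    backpointers = []
    for symbol in observations[1:]:
        column, pointers = {}, {}
        for s in states:
            # Best predecessor state for arriving in s at this position.
            prev = max(states, key=lambda p: scores[-1][p] + math.log(trans_p[p][s]))
            column[s] = (scores[-1][prev] + math.log(trans_p[prev][s])
                         + math.log(emit_p[s][symbol]))
            pointers[s] = prev
        scores.append(column)
        backpointers.append(pointers)
    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical toy model: two region types with different base compositions.
STATES = ("AT", "GC")
START_P = {"AT": 0.5, "GC": 0.5}
TRANS_P = {"AT": {"AT": 0.9, "GC": 0.1}, "GC": {"AT": 0.1, "GC": 0.9}}
EMIT_P = {
    "AT": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
    "GC": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
}

if __name__ == "__main__":
    print(viterbi("AAAGGG", STATES, START_P, TRANS_P, EMIT_P))
```

Profile HMMs for protein families use the same decoding idea with match, insert, and delete states and with parameters estimated from aligned sequences.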
ML Methods for Bio Data Mining (5) • Molecular clustering and classification • Clustering (unsupervised learning) • Hierarchical cluster analysis • Kohonen neural networks = SOM • Classification (supervised learning) • Hidden Markov Model • Neural networks • Bayesian networks • Support vector machines: functional classification of genes • Nearest Neighbor Algorithm: 3D protein classification • Decision trees
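Of the classifiers listed above, nearest neighbor is the simplest to sketch: a query is given the majority label of its k closest training examples. The feature vectors, labels, and the name `knn_classify` below are hypothetical; how a protein or gene is turned into a numeric feature vector is outside the scope of this sketch.

```python
import math
from collections import Counter


def knn_classify(training, query, k=1):
    """Label `query` by majority vote among its k nearest training examples.

    `training` is a list of (feature_vector, label) pairs; distance is Euclidean.
    """
    nearest = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical two-dimensional feature vectors with made-up class labels.
    examples = [
        ((0.0, 0.0), "class_a"),
        ((1.0, 0.5), "class_a"),
        ((5.0, 5.0), "class_b"),
        ((6.0, 5.5), "class_b"),
    ]
    print(knn_classify(examples, (0.5, 0.2)))
    print(knn_classify(examples, (5.5, 5.0), k=3))
```

No model is fitted in advance; all the work happens at query time, which is why the quality of the distance measure matters so much for this method.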
ML Methods for Bio Data Mining (6) • Expression (DNA Chip Data) analysis • Gene discovery: gene/mutated gene • Growth, behavior, homeostasis … • Disease diagnosis • Drug discovery: Pharmacogenomics • Toxicological research: Toxicogenomics
Bioinformatics Tools • Problems: Tools or Database • Sequence alignment: BLAST, FASTA • Multiple sequence alignment: Clustal W, Macaw • Pattern finding: GRAIL, FGENEH, tRNAscan-SE, NNPP, eMOTIF, PROSITE, ChloroP • Structure prediction: Bend.it, RNA Draw, NNPREDICT, SWISS-MODEL • DNA microarray: GeneX, GOE, MAT, GeNet • Hardware for proteomics: 2D Gel, MALDI-TOF
Terminologies in Bioinformatics (1) • DNA (deoxyribonucleic acid) • Double-helical macromolecule • Hereditary material • Passes from one generation to the next • Dictates the inherent properties of a species • RNA • Single-stranded nucleotide chain, not a double helix • Genomics • Studying the “complete gene content” of living organisms • Genome • The genetic material of an organism • Contained in one haploid set of chromosomes
Terminologies in Bioinformatics (2) • Human Genome Project • A large, federally funded collaborative project • Sequencing of the entire human genome • DNA Chip = biochip = DNA microarray = gene array • Samples of DNA
The intensity and color of each spot encode information on a specific gene from the tested sample
Terminologies in Bioinformatics (3) • Gene structure • [Figure: DNA → (search) → “gene” → (compute) → RNA → (compute) → Protein sequence → (?how?) → Folded Protein]
Terminologies in Bioinformatics (4) • DNA (gene) → RNA → Protein • [Figure labels: gene with TATA, start, stop, and Termination signals, flanked by control statements; Transcription (RNA polymerase); mRNA with 5’ UTR, Ribosome binding, and 3’ UTR; Translation (Ribosome); Protein]
Bioinformatics in Korea (1) • Seoul National University • Biointelligence laboratory: bi.snu.ac.kr • Center for Bioinformation Technology (CBIT): cbit.snu.ac.kr • Interdisciplinary Program in Bioinformatics: ipbi.snu.ac.kr • POSTECH (Pohang University of Science and Technology) • Biological Research Information Center (BRIC): bric.postech.ac.kr • Korea Institute of Science and Technology Information • CCBB Bioinformatics Center: www.ccbb.re.kr • Korea Research Institute of Bioscience and Biotechnology (KRIBB) • http://www.kribb.re.kr
Bioinformatics in Korea (2) • Macrogen: www.macrogen.com • IDR (아이디알): www.idrtech.com