Machine Learning for Bioinformatics


Topics in Bioinformatics
• Sequence analysis
  • Sequence alignment
  • Structure and function prediction
  • Gene finding
• Structure analysis
  • Protein structure comparison
  • Protein structure prediction
  • RNA structure modeling
• Expression analysis
  • Gene expression analysis
  • Gene clustering
• Pathway analysis
  • Metabolic pathway
  • Regulatory networks

Sequence Alignment
• Measures the similarity between biological sequences
• A concept used widely across many areas of bioinformatics

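Applications of Sequence Alignment
• Database search
  • Genes with similar sequences are likely to have similar functions
  • To work out the function of a newly identified gene, search a database of genes with known functions for genes with a similar sequence
• Genome sequencing
  • Sequence alignment is used to reassemble a long sequence from overlapping sequence fragments
• Comparative genomics
  • Information from lower organisms, which are easier to experiment on, is used to identify genes in higher organisms
  • Example: human and mouse genes are very similar → the functions of mouse genes are determined experimentally → a human sequence similar to a mouse gene of known function is likely to be a human gene with the same function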

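Applications of Sequence Alignment
• Gene finding
  • Human and mouse exons are about 85% similar on average, whereas introns are only about 35% similar on average
  • A region of high similarity between the human and mouse genome sequences is therefore likely to be an exon
• Protein function and structure prediction
  • Proteins with similar amino acid sequences are likely to have similar functions and 3D structures
  • When a new amino acid sequence is determined, its function and 3D structure can be predicted by finding proteins with similar amino acid sequences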

Global Alignment
• Global distance alignment problem: find the minimum distance between two sequences
• Global similarity alignment problem: find the maximum similarity between two sequences
• Note: edit distance
  • The minimum number of insertion, deletion, and substitution operations needed to turn one string into another
• Example (edit distance = 3):
  str1: G C T G A T A T A G C T
  str2: G G G T G A T T A G C T
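The edit distance in the example can be checked with the standard dynamic-programming recurrence. This is a minimal sketch that assumes unit cost for every insertion, deletion, and substitution; the function name `edit_distance` is chosen here for illustration and does not come from the slides.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit cost per insertion, deletion, substitution."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            substitution_cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,                      # delete ca
                curr[j - 1] + 1,                  # insert cb
                prev[j - 1] + substitution_cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


str1 = "GCTGATATAGCT"
str2 = "GGGTGATTAGCT"
print(edit_distance(str1, str2))
```

One way to reach the distance of 3 for these strings is to substitute the C with G, insert one G, and delete one A.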

Local Alignment
• Local alignment
  • Find the regions of highest similarity between two sequences
• Multiple alignment
• Sequence alignment algorithms
  • Smith-Waterman algorithm
  • FASTA
  • BLAST
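As an illustration of local alignment, here is a minimal sketch of the Smith-Waterman recurrence with a linear gap penalty. The scoring values (match = 2, mismatch = -1, gap = -1) are assumptions picked for the example; the slides do not specify a scoring scheme.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # H[i][j]: best score ending at a[i-1], b[j-1]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 lets an alignment restart anywhere, which makes it local.
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero-score cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


score, part_a, part_b = smith_waterman("ACGT", "ACGT")
print(score, part_a, part_b)
```

FASTA and BLAST are heuristic tools that trade some sensitivity for speed on database-scale searches; they are not sketched here.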

Genetic Algorithms: Representation
• For sequence assembly
  • The sorted order representation
• Operators
  • A simple swap operation as the mutation operator
  • Permutation crossover
  • Transposition operator
  • Inversion operator
[Figure label: starting position]
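The representation and three of the operators can be sketched as follows. This assumes that a sorted order individual is a list of numeric keys, one per fragment, and that sorting the fragments by key yields the assembly order; that reading, and all function names below, are assumptions made for illustration. Permutation crossover is left out because the slides do not say which variant is meant.

```python
import random


def decode_sorted_order(keys):
    """Fragment indices ordered by key (ties broken by index)."""
    return sorted(range(len(keys)), key=lambda i: (keys[i], i))


def swap_mutation(order, rng):
    """Exchange two randomly chosen positions."""
    child = list(order)
    i, j = rng.sample(range(len(child)), 2)
    child[i], child[j] = child[j], child[i]
    return child


def inversion(order, rng):
    """Reverse a randomly chosen contiguous segment."""
    child = list(order)
    i, j = sorted(rng.sample(range(len(child) + 1), 2))
    child[i:j] = child[i:j][::-1]
    return child


def transposition(order, rng):
    """Cut out a random contiguous segment and reinsert it elsewhere."""
    child = list(order)
    i, j = sorted(rng.sample(range(len(child) + 1), 2))
    segment, rest = child[i:j], child[:i] + child[j:]
    k = rng.randrange(len(rest) + 1)
    return rest[:k] + segment + rest[k:]


rng = random.Random(0)  # fixed seed so runs are repeatable
order = decode_sorted_order([0.7, 0.1, 0.4, 0.9])
print(order, swap_mutation(order, rng), inversion(order, rng), transposition(order, rng))
```

Each operator returns a new list and keeps the result a valid permutation of the fragment indices, so every fragment still appears exactly once in the layout.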

Structure and Function Prediction
• Protein structure prediction
• Protein modeling

Hidden Markov Models for Protein Modeling

Gene Finding
• Prokaryotes
  • One continuous stretch
• Eukaryotes
  • Exon, intron

Coding and Non-coding Regions
• DNA → RNA → Protein
• Gene signals: promoter, start (stop) codon, splice site (donor site, acceptor site)
[Figure: a gene on DNA, with a regulatory region followed by a protein coding region that runs from the start codon (AUG) to the stop codon (TAA), flanked by non-coding regions]
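For the simplest case, a coding region that is one continuous stretch between a start and a stop codon, a candidate region can be found by scanning the three forward reading frames. This is a rough sketch under stated assumptions: it reads DNA (so the start codon is written ATG), uses the standard stop codons TAA, TAG, and TGA, ignores the reverse strand, and ignores splicing, so it does not apply to eukaryotic genes with introns. The name `find_orfs` and the `min_codons` parameter are made up for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=1):
    """Return (start, end) index pairs of open reading frames on the forward strand.

    Each pair spans from the A of ATG up to and including the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG seen since the last stop
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                # Codons before the stop, counting the start codon itself.
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                start = None
    return orfs


dna = "CCATGAAATTTTAAGG"
for begin, end in find_orfs(dna):
    print(begin, end, dna[begin:end])
```

Real gene finders combine such signals with content statistics, which is where the learned models on the following slides come in.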

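Multilayer Perceptrons for Internal Exon Prediction: GRAIL
[Figure: a multilayer perceptron with input labels coding potential value, GC composition, length, donor, acceptor, and intron vocabulary; the output is a discrete exon score, with the score axis marked 0 and 1, plotted along the sequence bases]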

Decision Trees for Gene Finding
• MORGAN: a decision tree system for gene finding
• Finding coding and non-coding regions / exon finding
• Features tested at the nodes
  • donor: donor site score
  • d+a: donor and acceptor site score
  • hex: in-frame hexamer frequency
  • asym: Fickett's position asymmetry statistic
[Figure: decision tree with yes/no branches. Root test: d+a<3.4?; other node tests: d+a<1.3?, d+a<5.3?, hex<16.3?, hex<0.1?, hex<-5.6?, asym<4.6?, donor<0.0?; leaf counts: (6,560), (9,49), (18,160), (737,50), (142,73), (24,13), (1,5), (5,21), (23,16); annotation: "by Markov Chains"]

Gene Expression Analysis (Section 1-2)
• Gene expression
  • The process by which a gene is expressed as a protein through transcription and translation
  • Gene expression levels provide clues about the function of a gene
  • A DNA chip makes it possible to measure the gene expression levels of a cell efficiently
• Gene expression analysis procedure
  • Build a DNA chip from known gene sequences
  • Extract mRNA from the target cells, make cDNA from it, and apply the cDNA to the DNA chip, where hybridization takes place
  • Analyzing the degree of hybridization reveals the level of gene expression

Gene Expression Analysis
• cDNA Microarray

Disease Diagnosis: Bayesian Networks Based on Gene Expression Levels
• Learning
  • Data → Preprocessing → Processed data → Learning algorithm → Bayesian network over Gene A, Gene B, Gene C, Gene D, and Target
• Inference
  • Belief propagation: the values of Gene C and Gene B are given, and the probability for the target is computed
[Figure: the learned network over Gene A, Gene B, Gene C, Gene D, and Target, shown repeatedly as belief propagation proceeds]
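The inference step can be illustrated on a toy network. The sketch below makes several assumptions that are not taken from the slides: the target is the only parent of Gene B and Gene C, expression levels are discretized to "high" and "low", and every probability is an invented number. It computes the posterior by direct enumeration rather than by belief propagation; on a network this small both give the same answer.

```python
def posterior_target(prior, likelihoods, evidence):
    """P(target | observed genes), assuming target is the sole parent of each gene."""
    joint = {}
    for state, p_state in prior.items():
        p = p_state
        for gene, value in evidence.items():
            p *= likelihoods[gene][state][value]  # P(gene = value | target = state)
        joint[state] = p
    total = sum(joint.values())  # P(evidence), the normalizing constant
    return {state: p / total for state, p in joint.items()}


# Hypothetical parameters, for illustration only.
prior = {"disease": 0.5, "healthy": 0.5}
likelihoods = {
    "gene_b": {"disease": {"high": 0.8, "low": 0.2},
               "healthy": {"high": 0.2, "low": 0.8}},
    "gene_c": {"disease": {"high": 0.6, "low": 0.4},
               "healthy": {"high": 0.4, "low": 0.6}},
}

result = posterior_target(prior, likelihoods, {"gene_b": "high", "gene_c": "high"})
print(result)
```

In the learning step, the conditional probability tables would be estimated from the preprocessed expression data instead of being written by hand.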

Disease Diagnosis: Cancer Classification with DNA Microarray
