Genomic Sequence alignments and its application

Genomic Sequence alignments and its application 조환규 교수 부산대학교 공과대학 정보 컴퓨터 공학부 Hgcho@pusan.ac.kr

Biology and Informatics • Mathematics : Physics = X : Biology • X = ? • Bioinformatics • Understanding Biological System with Informatics • Molecular Biology • Computational Biology • Genomics • Proteomics • And several –omics

Main features of This Talk • 이미 잘 정리된 Computing methodology를 어떻게 Bioinformatics에서 활용하는가 ? • Bioinformatics에서 잘 정리된 방법론은 CS쪽에서 어떻게 활용하는가 ? • Case Study • Genomic sequencing alignment와 program-copy detection과의 연관

Computing Space Transform • Normal Space Jewelry Space PLUS MINUS

Computing Space Transform(2) • Normal Space log Space log X , Y log X , log y multiply exponent log x + log y X * Y

Computing Space Transform(3) • Program Space Genome Seq Space Program b Program a Protein a Protein b Basic keyword CLUSTAL-W( a, b ) Pairwise Alignment Similarity( , ) Similarity( , )

Namely “우물론” • 한 우물을 팔 것인가 ? • 그러나 만일 끝끝내 물이 안 나올 경우라면 • 여러 우물을 돌아가며 조금씩 팔 것인가 ? • 이것도 저것도 아니라면 ? • 그렇다면 어떻게 팔 것인가 ? • Avoid “reinventing wheel”

Genomic Sequence Alignments • Genomic Sequences, linear • DNA, RNA, Protein(amino acids) • Why linear ? • Goal of Molecular Biology or Life Science • Characterizing functions of genes • Understanding internal gene interactions • Understanding internal & external interactions • Drug targeting(protein interaction) • Why Alignment ?

Human Binome BANK in 3280 • Year 3280, dooms day • So many binary data files • Chips( cell-phone, cooker, TV… et al) • Computer disks • Figure out the contents of followings • 010111010101000100101000000001010101010….? • 100000001010111111111001010101010010101…? • 101011111111111110001010010100101010100…?

Human-Binome-Project • They decided to establish Binary BANK. • Some are partially annotated • Object code, text data , garbage • Protein , Gene , Junk-DNA • HUMAN-Binome PROJECT….. Starts… • 목적: 각종 binary sequence의 기능을 탐색 • Mini-binary project ~~ E.Coli, C.Elegans • Cell-phone, calculator, PDA… • Full sequencing of 300 Giga bytes DISK • HUMAN BINOME BANK(HBB)!

For an Unknown Seq. X • Sequence X from a hardware • Several error bits included • Fragment sequencing • Find a similar pattern in HBB • Function ? • Region ? • Size ? • Write a BIG paper…. • Here it comes………

반도지역 SM-6-5에서 흔히 발견되는Seq-45-6-X 종의 기능에 대한 고찰 팔봉이, 2급 원숭이 우주대학, 분자고생물학과 이번에 SM-6-5지역에서 빈번하게 발견된 유전종들의 기능은 …. 아마도 이 부족들이 집중적으로 사용한 언어를 표기하기 위한 도구의 일부로 추정된다. 또한 이들은 매우 다양한 표기형식을 가지고 있는 것으로 추정되는데, 그 이유는 각 단어들이 나타나는 빈도에서 추정되는 엔트로피가 PR-99지역에 나타나는 약 26개보다 월등히 많은 곳으로 추정되므로 ……………………. Journal of Science & Sciences No.187 (Vol. 213), pp.232-265, 3288

Dynamic Programming • A Basic Methodology • For all kinds of alignment • Solution from all sub-partial solution • 준비물 • Objective function • Dynamic programming formula(recursion) • F(n) = F(n-1) + F(n-2), • Base condition, F(0)=F(1)=1 • Table, multi-dim. array structure

Global Alignment(1) • Basic scoring: • Match: 1, Mismatch: -1, Space: -2 • How? • To find the alignment of two sequences of maximal score • Sequence alignment problem corresponds to the longest path problem form the source to the sink in this directed acyclic graph.

1 -1 -3 -5 -7 -9 -11 -13 -1 2 0 -2 -4 -6 -8 -10 -3 0 1 -1 -1 -3 -5 -7 -5 -2 -1 0 0 -2 -2 -4 -7 -4 -3 -2 -1 1 -1 -1 C A C A G T G T C A - - G - G T Global Alignment(2) • CACAGTGT 와 CAGGT C A C A G T G T 0 -2 -4 -6 -8 -10 -12 -14 -16 C -2 A -4 G -6 G -8 T -10

Local Alignment(1) • An alignment between a substring of s and a substring of t • Each entry of (I,j) will hold the highest score of an alignment between a suffix of s[1…i] and a suffix of t[1…j] 예) AGGTATTGA -CCTATGGC

0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 1 0 0 0 2 0 0 2 0 0 0 1 0 3 2 0 0 1 1 0 0 1 2 1 0 0 0 0 0 0 0 1 A G G T A T T A - - C T A T G C Local Alignment(2) • AGGTATTG 와 CTATGC A G G T A T T A 0 0 0 0 0 0 0 0 0 C 0 T 0 A 0 T 0 G 0 C 0

Semi-global Alignment(1) • Given two sequences, check if one of them has a substring similar to the other entire sequence. • How? • Find alignments ignoring the beginning and end spaces of the sequences • Global alignment 와 비교 • CAGCA -CTTGGATTCTCGG <-semi-global • - - -CAGCGTGG- - - - - - - (score: -19) • CAGCACTTGGATTCTCGG <-global • CAGC- - - - -G -T- - - -GG (score: -12)

1 -1 -3 -5 -7 -9 -11 -13 -1 2 0 -2 -4 -6 -8 -10 -3 0 1 -1 -1 -3 -5 -7 -5 -2 -1 0 0 -2 -2 -4 -7 -4 -3 -2 -1 1 -1 -1 C A C A G T G T - - - C A G G T - - Semi-global Alignment(2) • CACAGTGT 와 CAGGT C A C A G T G T 0 -2 -4 -6 -8 -10 -12 -14 -16 C -2 A -4 G -6 G -8 T -10

General Gap Penalty • Definition • Gap: consecutive number k > 1 of spaces • When mutations are involved, the occurrence of a gap with k spaces is more probable than the occurrence of k isolated spaces • w(k) : penalty associated with a gap with k spaces

Affine Gap Penalty Function • Penalty for consecutive spaces <= isolated spaces • Sub-additive function • w(k1 +k2+…_kn) <= w(k1) + w(k2) +…+w(kn) • Three arrays for dynamic programming • a[i,j] = maximum score of an alignment between s[1…i] and t[1…i] that ends in s[i] matched with t[j] • b[i,j] = maximum score of an alignment between s[1…i] and t[1…i] that ends in a space matched with t[j] • c[i,j] = maximum score of an alignment between s[1…i] and t[1…i] that ends in s[i] matched with a space

Heuristic Alignment • Main difficulties • Search space, O(n^2) space or O(n^2 log n)time • Optimality or Biologically-good Distance metric • Multiple alignment • Local search • Diagonal region searching • Visualization., e.g., Dotlet • BLAST approach for long sequence • Small word matching • And Extending from a highly matched region

Multiple Alignment • ------AG---T----CGCTGC---- • ------AGCGAT--CGCGCTGC--- • ---TCGAGGCAA--GCTGCTGC----- • ------GGCGAT----CGCTGC----- • Problem hardness: • Optimal alignment : NP-hard • Almost kinds of object functions • What if more than 1000 sequence ? • SPACE COMPELXITY IN PRACTICE • Pairwise alignment • Star Alignment • Tree alignment

Why multiple alignment ? • Finding Conserved regions • Computer virus phylogeny constructing • 300 sp./year • More than 10000 sp. : N • Number of files in a system : M > 100000 • Detecting a CV takes O( N*M) checks! • tuberculosis • 8 종, a conserved region, and polymorphic sites • 김철민 교수님(부산의대) – 진단용 칩 제작 • Phylogeny construct

Phylogeny 1 : hard Version

Phylogeny 2 : Probable version

Constructing Phylogenetic Tree • Distance matrix • Optimal Tree ? • Degree constraint, Steiner points, Quartet method B D C E A

PART 2: Application Detecting Source Code Plagiarism

Plagiarism, Plagiarism, Plagiarism • Linear Structure • Genomic sequences • Plain articles • Programs • Human behaviors on the time-line • Time-series data sets • Student Reports Plagiarism • Assignment Program copying • Where is the original version of this one ? • Web searching redundancy elimination

지문법 기반 표절 검사 시스템

구조 기반 표절 검사 시스템

Fingerprinting Method • Keyword frequency similarity • 특정한 단어의 사용횟수 • Fingerprinting object • Fixed size fingerprint • Easy to making Database • Quick searching • High false positive rate • Example, fingerprint vector A c x t u x g r N …..

Attacking • Inserting redundant words • Shuffling • Cons and Pros • Easy to use in document application • Hard to use in program file • Recent trends • Structure-oriented similarity measure • Greedy-Block-Removing methods… • Is this a basic concept of local alignment ? • Sample-Report-Server Building

Undergraduate Assignment • Programming Assignment cheating: • 이론적으로는 그 구별방법이 없다. • Assignment cheating은 비용이 크다 • Password breaking by Mafia • Assignment의 출력은 동일하다. • Correct program들끼리만 비교 • 과제에 주어진 시간은 비교적 짧다(3-4일 정도). • 수강생의 수는 적절하다(300 명 이하). • 프로그래밍 언어는 모두 동일하다.

Program Cheating Techniques • Complete Copying • Variable exchange • Garbage code insertion • Function transpose • Code rewriting(partially) • Library code replacing • Merging different codes • Function resolving • Function rewriting

Computing Space Transform • Program Space Genome Seq Space Program b Program a Protein a Protein b Basic keyword CLUSTAL-W( a, b ) Pairwise Alignment Similarity( , ) Similarity( , )

PROGRAM to PROTEIN • Program Language • Keyword = { int, float, class….. } • Block Structure = “}”, “{“ • Program Chromosome • Location independent code, JAVA class, C files • Non-Coding region • /* this is a sample non-coding region */ • Promoter • Variable declaration, class definition • DNA = keyword sequence

Extracting Program DNA • Syntactic level • Semantic level • Program Flow-graph Syntactic running syntax Real running

Flow-Graph Linearization $ A B S R S B W Q % $ A B S R R R R S B W Q % $ A B W W W W Q % $ A W W Q % A S B R W Q

Example main( ) { int i, j , k ; ………… for( I = 1 . I <= 100 , i++) { ……… if ( ) x = y ; else ……. while( ccccc ) { } x = 23984 ; } // end of for ……….. } int for if = else while = AGTCGCTTCGAAGCAA

Why Protein mapping ? • DNA sequence overlap • if = AA, then = AG, * = GA, return = GG • AAGGA = AG + GA or AA + GG + A • Ambiguity resolving • 20 Amino acid bases • About 20 keywords • 2-3 groups • polar, non-polar • hydrophobic, hydrophilic • Charged, uncharged • Small, large

Amino Acids classification

Keyword Mapping Strategy • Convertibility = { for , while } Easy • Convertibility = { for, then } Hard • Convertibility = { if, ‘=‘ } Impossible? • Procedure • Preprocessing • Chromosome arrangement • Keyword selection • Protein mapping

Copy Detecting System • CDS components = [K, M, P, T, A, G ] • Keyword table 20 keyword • Matching Score matrix borrow from Protein(PAM) • Affine Gap Penalty • Threshold length • Alignment Set Scoring • maXimum Gap allowing

Experiment Overview • Sample programs “data structure” • Students , 60 • Programming assignments, 12 set • 1 semester • On-line evaluation system = ESPA • Java-based on-line evaluation system • Due, 1 week • We do not monitor all programs

Clustal-W (1)(www2.ebi.ac.uk/clusterw) • Input Fasta file

Clustal-W(2) (www2.ebi.ac.uk/clusterw) • Output

PhyloDraw (cho et al, Bioinformatics 2001.)

Experiment Result 1-0 • 유사한 그룹이 있는11개의 프로그램

Experiment Result 1-1 • Unit distance topological Representation

Genomic Sequence alignments and its application