Work @ Fudan University Chen, Yaoliang
Engineering work • TTS System: a Chinese Text-To-Speech system • SafeDB: bug backlog • SMemoHelper: a small tool that helps with learning English words • Fraud Detection: time-series techniques
Research Work • CGAP-align: a high-performance DNA short-read alignment tool • Co-authored with BCM; Bioinformatics (in progress) • NDBC demo • On Encoding Shortest Paths in Large Graphs • Co-authored with Jian Pei; VLDB (in progress) • Co-authored with Haixun Wang; SIGMOD (in progress) • NDBC • Other projects
CGAP-align: Background • Baylor College of Medicine • Sequence alignment and its significance • Reference & reads • ACTAGCGATATAACCCTTTCCCTTTCCCTTT • CACGAT • Given a number z, a reference X and a read W, we want to find a substring W' = X[i, i+1, …, j] such that EditDistance(W, W') ≤ z • ACTAGCGATATAACCCTTTCCCTTTCCCTTT • CACGAT
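Note: a concrete reading of the definition above is the following semi-global edit-distance DP. This is only a minimal sketch of the specification (not the CGAP-align/BWA algorithm): a read W "aligns" onto X iff the minimum edit distance to any substring of X is at most z. Function and variable names are illustrative.

```python
def min_edit_distance_to_substring(W, X):
    """Smallest EditDistance(W, W') over all substrings W' = X[i..j].

    Semi-global DP: the whole read W must be consumed, but it may start
    and end anywhere inside the reference X (hence row 0 is all zeros).
    """
    m, n = len(W), len(X)
    prev = [0] * (n + 1)              # aligning an empty read costs nothing
    for i in range(1, m + 1):
        cur = [i] + [0] * n           # W[0..i-1] against an empty reference prefix
        for j in range(1, n + 1):
            diag = prev[j - 1] + (W[i - 1] != X[j - 1])   # match / mismatch
            up = prev[j] + 1                              # W[i-1] aligned to a gap
            left = cur[j - 1] + 1                         # X[j-1] aligned to a gap
            cur[j] = min(diag, up, left)
        prev = cur
    return min(prev)                  # the read may end at any reference position

# The slide's toy instance:
X = "ACTAGCGATATAACCCTTTCCCTTTCCCTTT"
W = "CACGAT"
z = 2
print(min_edit_distance_to_substring(W, X) <= z)   # True; W aligns within z = 2 edits
```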
Challenges • A human genome sequence • 2000: €1,000,000,000 in ~10 years • 2008: €50-100,000 in ~4 months • 2010: €5-10,000 in ~2 weeks • …2015: €1,000 in ~1 day • …2020: €10 in ~1 hour to minutes • (Figure: growth of DNA sequences in GenBank)
Performance of BWA • Burrows-Wheeler Alignment tool • A popular tool for aligning reads against a large reference sequence • Optimization of BWA • Code level • Algorithm level • BWA performance: T = N × T_aln • N: the number of candidate reads obtained by enumerating all mismatches and gaps of the read • T_aln: the time to locate the modified reads in the reference during the alignment stage
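Note: to give a feel for the N term in T = N × T_aln, here is a small illustrative sketch. This is not how BWA enumerates candidates (BWA performs a bounded search on the FM-index); it only counts the 1-edit variants of a read over the DNA alphabet.

```python
def one_edit_variants(read, alphabet="ACGT"):
    """All strings within one substitution, insertion or deletion of `read`.

    Illustrates the N factor in T = N x T_aln: even allowing a single edit,
    the number of candidate reads is already O(|read| * |alphabet|), and it
    grows roughly exponentially with the number of allowed edits.
    """
    variants = set()
    for i in range(len(read)):
        variants.add(read[:i] + read[i + 1:])              # deletion at i
        for c in alphabet:
            variants.add(read[:i] + c + read[i + 1:])      # substitution at i
    for i in range(len(read) + 1):
        for c in alphabet:
            variants.add(read[:i] + c + read[i:])          # insertion before i
    variants.discard(read)
    return variants

print(len(one_edit_variants("CACGAT")))   # number of distinct 1-edit candidates for a 6 bp read
```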
Optimization • Optimizing T_aln: efficiency for matching • Suffix T-Array • Optimizing N: pruning ability to avoid enumerating unnecessary mismatches and gaps • Data-Conscious D-Array Calculation
Suffix T-Array • Suffix Tree • Suffix Array • Based on BWT (FM-index) • Comparison
Burrows-Wheeler Transform (from Yuval Rikover)
• T = mississippi#
• Form all cyclic rotations of T and sort the rows; F is the first column, L the last:
  F             L
  # mississipp  i
  i #mississip  p
  i ppi#missis  s
  i ssippi#mis  s
  i ssissippi#  m
  m ississippi  #
  p i#mississi  p
  p pi#mississ  i
  s ippi#missi  s
  s issippi#mi  s
  s sippi#miss  i
  s sissippi#m  i
• L = ipssm#pissii is the BWT of T
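Note: a minimal sketch of the transform in the table above, built by sorting all cyclic rotations. This is fine for toy inputs like mississippi# but not for a genome-scale reference, where the BWT is built via a suffix array.

```python
def bwt(text):
    """Burrows-Wheeler Transform via sorted cyclic rotations.

    `text` must already end with a unique, lexicographically smallest
    sentinel (here '#').  L is the last column of the sorted rotation matrix.
    """
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(row[-1] for row in rotations)

print(bwt("mississippi#"))   # -> ipssm#pissii
```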
Reminder: Recovering T from L (Burrows-Wheeler Transform)
• F = #iiiimppssss, L = ipssm#pissii
• Find F by sorting L
• First char of T? m
• Find m in L
• L[i] precedes F[i] in T, therefore we get mi
• How do we choose the correct i in L? The i's are in the same order in L and F, as are the rest of the chars
• i is followed by s: mis
• And so on…
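Note: the recovery procedure above, written out as a sketch using the LF-mapping (the i-th occurrence of a character in L corresponds to the i-th occurrence of that character in F). Unlike the walkthrough, this sketch rebuilds T right to left, starting from the row whose last character is the sentinel.

```python
def inverse_bwt(L, sentinel="#"):
    """Recover T from L by repeatedly following the LF-mapping."""
    n = len(L)
    order = sorted(range(n), key=lambda i: (L[i], i))   # F row -> L row (stable)
    lf = [0] * n                                        # L row -> F row with the same text char
    for f_row, l_row in enumerate(order):
        lf[l_row] = f_row
    row = L.index(sentinel)       # this rotation ends with '#', i.e. it is T itself
    chars = []
    for _ in range(n):
        chars.append(L[row])      # L[row] precedes F[row] in T
        row = lf[row]
    return "".join(reversed(chars))

print(inverse_bwt("ipssm#pissii"))   # -> mississippi#
```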
Next: Count P in T
• Backward-search algorithm • Uses only L (the output of the BWT) • Relies on 2 structures:
• C[1,…,|Σ|]: C[c] contains the total number of text chars in T which are alphabetically smaller than c (including repetitions of chars)
• Occ(c,q): number of occurrences of char c in the prefix L[1,q]
• Example for T = mississippi#, L = ipssm#pissii:
• C[#] = 0, C[i] = 1, C[m] = 5, C[p] = 6, C[s] = 8
• occ(s,5) = 2, occ(s,12) = 4
• Occ table over L:
  q       1  2  3  4  5  6  7  8  9 10 11 12
  L       i  p  s  s  m  #  p  i  s  s  i  i
  Occ(i)  1  1  1  1  1  1  1  2  2  2  3  4
  Occ(m)  0  0  0  0  1  1  1  1  1  1  1  1
  Occ(p)  0  1  1  1  1  1  2  2  2  2  2  2
  Occ(s)  0  0  1  2  2  2  2  2  3  4  4  4
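Note: a sketch of the backward-search counting algorithm using exactly these two structures. The names and the naive computation of C and occ are for illustration only; a real FM-index stores sampled Occ checkpoints instead of rescanning L.

```python
def count_occurrences(P, L):
    """Count the occurrences of P in T, given only L = BWT(T) (backward search).

    C[c]      = number of characters in T alphabetically smaller than c
    occ(c, q) = number of occurrences of c in the prefix L[1..q]
    """
    C = {c: sum(1 for x in L if x < c) for c in set(L)}
    def occ(c, q):                    # q is 1-based; occ(c, 0) = 0
        return L[:q].count(c)
    fr, lr = 1, len(L)                # 1-based range of candidate rows (initially all rows)
    for c in reversed(P):             # process the pattern right to left
        if c not in C:
            return 0
        fr = C[c] + occ(c, fr - 1) + 1
        lr = C[c] + occ(c, lr)
        if fr > lr:
            return 0                  # the current suffix of P does not occur
    return lr - fr + 1

print(count_occurrences("si", "ipssm#pissii"))   # -> 2 ('si' occurs twice in mississippi#)
```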
Substring search in T (counting the pattern occurrences)
• Available info: only L (the rows of the sorted matrix themselves are unknown)
• Example: P = si, T = mississippi#
• First step: [fr, lr] = the range of rows prefixed by the last char of P (here 'i')
• Inductive step: given [fr, lr] for P[j+1, p], take c = P[j]
• Find the first c in L[fr, lr] and the last c in L[fr, lr]; the Occ() oracle is enough for this
• When P is exhausted, the number of occurrences is lr − fr + 1 (here occ = 2)
Suffix T-Array • Backward search • Store “First” and “Last” (k and l) values
Backward-search example (on the FM-index)
• P = CAA, processed right to left: step 1 c = 'A', step 2 c = 'A', step 3 c = 'C'
• Extending the range for AA with c = 'C':
• First(CAA) = C['C'] + Occ('C', First(AA) − 1) + 1
• Last(CAA) = C['C'] + Occ('C', Last(AA))
• (Figure: search path Root → A → A on the FM-index)
Optimization • Optimizing T_aln: efficiency for matching • Suffix T-Array • Optimizing N: pruning ability to avoid enumerating unnecessary mismatches and gaps • Data-Conscious D-Array Calculation
D-Array: Motivation • e(W): the minimal number of edit operations needed to make W exactly align onto the reference X • D-array: D[i] is a lower bound on e(W[0…i])
D-Array: Motivation • Given a string W and an arbitrary segmentation W = w1 w2 … wk, we have e(W) ≥ e(w1) + e(w2) + … + e(wk) • D-array in BWA: split W into several short strings W = w1 w2 … wk with e(wi) = 1 for all i; the correctness of the algorithm depends on the inequality e(W) ≥ e(w1) + e(w2) + … + e(wk)
D-Array: Motivation • Example: reference X = "AACGTATCGACG", with a read W and its D array • A better segmentation: consider segments with e(·) = 2 • Calculating e(·) exactly costs exponential time • Pre-computation is needed
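Note: a sketch in the spirit of BWA's D-calculation (the greedy split with e(wi) = 1 described above). The read W below is an illustrative choice, and the substring test is done naively with `in`, whereas BWA answers it with an FM-index of the reversed reference.

```python
def calculate_d(W, X):
    """D[i] = lower bound on the edits needed to align W[0..i] onto X.

    Greedily split W into maximal pieces that still occur in X; every piece
    that falls off the reference must cost at least one edit.
    """
    D = [0] * len(W)
    z, j = 0, 0                       # z = edits so far, j = start of the current piece
    for i in range(len(W)):
        if W[j:i + 1] not in X:       # the current piece no longer occurs in X
            z += 1
            j = i + 1                 # start a new piece after position i
        D[i] = z
    return D

# Reference from the slide, read chosen for illustration:
X = "AACGTATCGACG"
print(calculate_d("ACGTTTACG", X))   # -> [0, 0, 0, 0, 1, 1, 1, 2, 2]
```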
Solution: Frequent Patterns • Pipeline: training reads → frequent patterns → trie DFA • Mining Frequent Patterns (FPs) • State-of-the-art methods • Our solution: a simple DFS on the FM-index, with Count = Last − First + 1 • A FASTA file F containing training reads • The training reads should be similar to the reads seen in practice (data-conscious) • Generate a prefix trie T for the FPs with e(w) = 2 • Refine T into a DFA GT
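Note: a toy sketch of the FP-mining step, a depth-first enumeration that extends a pattern only while it stays frequent. The reads, thresholds and the naive counting are illustrative; in the actual pipeline the support would come from the FM-index range, Count = Last − First + 1.

```python
def frequent_patterns(reads, alphabet="ACGT", max_len=4, min_support=3):
    """DFS enumeration of frequent patterns over a set of training reads.

    A pattern is extended only while it is still frequent, since extending a
    string can never increase its number of occurrences.
    """
    text = "#".join(reads)               # '#' never matches an ACGT pattern
    frequent = {}

    def dfs(pattern):
        count = text.count(pattern)      # stand-in for the FM-index count
        if count < min_support:
            return                       # no extension can be frequent either
        frequent[pattern] = count
        if len(pattern) < max_len:
            for c in alphabet:
                dfs(pattern + c)

    for c in alphabet:
        dfs(c)
    return frequent

reads = ["ACGTAC", "ACGTTT", "TACGTA", "GACGTA"]
print(frequent_patterns(reads, max_len=3))
```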
Trie Deterministic Finite Automaton • Why a trie DFA? • When doing alignment online, we need to find all the FPs contained in a read • This operation should be no more expensive than O(|W|)
Trie Deterministic Finite Automaton: Offline Index Construction • String set (FP set): AA, C, G, T, AC, AG • The prefix trie is done; we now construct the DFA • (Figure: prefix trie T with root R and numbered nodes 1-7)
Re-Ordering • Number the states in DFS order to minimize the average hop between jumps (7% speed-up)
Trie Deterministic Finite Automaton: Online Query • String set (FP set): AA, AC, AG, C, G, T • Example read W = "CACAT" • (Figure: automaton states visited while scanning W, reaching leaves L_C, L_AC and L_T)
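Note: one standard way to realise the trie-to-DFA refinement and the O(|W|) online scan described above is an Aho-Corasick-style construction. The sketch below is that generic construction (not necessarily the exact refinement or state numbering used in CGAP-align), run on the slide's FP set and query read.

```python
from collections import deque

def build_dfa(patterns, alphabet="ACGT"):
    """Prefix trie of `patterns` refined into a DFA (Aho-Corasick construction).

    goto[state][c] becomes a total transition function, so scanning a read
    takes exactly one transition per character, i.e. O(|W|) time overall.
    """
    goto, output = [dict()], [set()]          # state 0 is the trie root
    for p in patterns:                        # 1) build the prefix trie
        s = 0
        for c in p:
            if c not in goto[s]:
                goto.append(dict())
                output.append(set())
                goto[s][c] = len(goto) - 1
            s = goto[s][c]
        output[s].add(p)
    fail = [0] * len(goto)
    queue = deque()
    for c in alphabet:                        # 2) depth-1 states fail to the root
        s = goto[0].setdefault(c, 0)
        if s:
            queue.append(s)
    while queue:                              # 3) BFS: set fail links, make goto total
        s = queue.popleft()
        for c in alphabet:
            if c in goto[s]:
                t = goto[s][c]
                fail[t] = goto[fail[s]][c]
                output[t] |= output[fail[t]]
                queue.append(t)
            else:
                goto[s][c] = goto[fail[s]][c]
    return goto, output

def find_patterns(read, goto, output):
    """Report every (end_position, pattern) occurrence in one left-to-right pass."""
    s, hits = 0, []
    for i, c in enumerate(read):
        s = goto[s][c]
        hits.extend((i, p) for p in output[s])
    return hits

goto, output = build_dfa(["AA", "AC", "AG", "C", "G", "T"])
print(find_patterns("CACAT", goto, output))   # 'C' at 0, 'AC' and 'C' at 2, 'T' at 4
```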
Experiment • Optimizing T_aln (efficiency for matching): Suffix T-Array (20% speed-up) • Optimizing N (pruning ability to avoid enumerating unnecessary mismatches and gaps): Data-Conscious D-Array Calculation (0-200% speed-up)
On Encoding Shortest Paths in Large Graphs • Background • Consider a graph G = (V, E), where V is a set of vertices and E ⊆ V × V is a set of edges • FH-Partition
Examples • Shortest-path query 7 → 10 • FH(7,10) = 9; FH(9,10) = 2; FH(2,10) = 10 • Following the first hops gives the path 7 → 9 → 2 → 10
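Note: reading FH(u, t) as the first hop (next vertex) on a shortest path from u to t, the whole path can be reconstructed by repeated lookups, as in the 7 → 10 example above. A minimal sketch, with a hypothetical `fh` lookup table standing in for the encoded FH-partitions:

```python
def shortest_path(u, t, fh):
    """Reconstruct a shortest path from u to t by following first hops."""
    path = [u]
    while u != t:
        u = fh[(u, t)]        # next vertex on a shortest path from u to t
        path.append(u)
    return path

# The slide's example: FH(7,10) = 9, FH(9,10) = 2, FH(2,10) = 10
fh = {(7, 10): 9, (9, 10): 2, (2, 10): 10}
print(shortest_path(7, 10, fh))   # -> [7, 9, 2, 10]
```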
Problem Statement • Numbering Function
Workflow • Pipeline: Compute FH-Partitions → Get Numbering Function(s) → Encode FH-Partitions • Reduce to TSP • Region tree • Multiple numbering functions • Compute a naïve numbering function • Store the FH-partitions • Further compression • Answering queries efficiently