Work @ Fudan University

Work @ Fudan University. Chen, Yaoliang. Engineering work: TTS System (a Chinese Text-To-Speech system), SafeDB (bug backlog), SMemoHelper (a small tool that helps learn English words), Fraud Detecting (time series tech). Research work.

Presentation Transcript


  1. Work @ Fudan University Chen, Yaoliang

  2. Engineering work • TTS System • A Chinese Text-To-Speech system • SafeDB • Bug backlog • SMemoHelper • A small tool that helps learn English words. • Fraud Detecting • Time series tech

  3. Research Work • CGAP-align: A high-performance DNA short read alignment tool • Coauthored with BCM. Bioinformatics in progress • NDBC Demo • On Encoding Shortest Paths in Large Graphs • Coauthored with Jian Pei. VLDB in progress • Coauthored with Haixun Wang. SIGMOD in progress • NDBC • Other Projects

  4. CGAP-align: Background • Baylor College of Medicine • Sequence alignment and its significance • Reference & Reads • Reference: ACTAGCGATATAACCCTTTCCCTTTCCCTTT • Read: CACGAT • Given a number z, a reference X and a read W, we want to find a substring W' = X[i, i+1, …, j] such that EditDistance(W, W') ≤ z.
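
The matching condition on this slide can be illustrated with a short Python sketch. It uses the textbook semi-global dynamic program (the match may start and end anywhere in the reference), not the BWT-based search that BWA or CGAP-align use, and the function name `best_approx_match` is hypothetical.

```python
def best_approx_match(X, W):
    """Return (distance, end) for the substring of X closest to W.

    distance is the minimum of EditDistance(W, X[i:end]) over all i,
    and end is the (exclusive) end position where it is first reached.
    """
    m = len(W)
    # col[i] = best edit distance between W[:i] and some substring of X
    # ending at the current position; matching the empty substring costs i.
    col = list(range(m + 1))
    best, best_end = col[m], 0
    for j, x in enumerate(X, start=1):
        new = [0] * (m + 1)  # new[0] = 0: a match may start anywhere in X
        for i in range(1, m + 1):
            cost = 0 if W[i - 1] == x else 1
            new[i] = min(col[i - 1] + cost,  # match or substitution
                         col[i] + 1,         # extra character in X
                         new[i - 1] + 1)     # extra character in W
        col = new
        if col[m] < best:
            best, best_end = col[m], j
    return best, best_end


def aligns_within(X, W, z):
    """True if some substring W' of X has EditDistance(W, W') <= z."""
    return best_approx_match(X, W)[0] <= z
```

This runs in O(|X| · |W|) time, which is far too slow for a genome-sized reference; that cost is the reason the following slides turn to the FM-index.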

  5. Challenges • A human genome sequence • 2000 € 1,000,000,000 in ~10 years • 2008 € 50 - 100,000 in ~4 months • 2010 € 5 - 10,000 in ~2 weeks • ...2015 € 1,000 in ~1 day • ...2020 € 10 in ~1 hour to minutes DNA sequences in GenBank

  6. Performance of BWA • Burrows-Wheeler Alignment Tool • A popular tool for aligning short gene fragments against a large reference sequence • Optimization of BWA • Code level • Algorithm level • BWA performance: T = N × Taln • N: the number of modified reads produced by enumerating all mismatches and gaps of the read • Taln: time to locate a modified read in the reference during the alignment stage

  7. Optimization • Optimizing Taln: efficiency for matching • Suffix T-Array • Optimizing N: pruning ability to avoid enumerating unnecessary mismatches and gaps • Data-conscious D-array calculation

  8. Suffix T-Array • Comparison: Suffix Tree, Suffix Array, Based on BWT (FM-index)

  9. Burrows-Wheeler Transform (from Yuval Rikover) • T = mississippi# • List all rotations of T and sort the rows • F is the first column of the sorted matrix and L is the last column • [Figure: the 12 rotations of mississippi# before and after sorting, with columns F and L marked]
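
The construction on this slide can be reproduced with a few lines of Python. This is a naive sketch that materializes every rotation, which is fine for a 12-character example but not for a genome; the function name `bwt` is hypothetical, and `#` is assumed to sort before every letter (true for ASCII).

```python
def bwt(text):
    """Return the columns (F, L) of the sorted rotation matrix of text.

    text is assumed to end with a unique sentinel such as '#'.
    """
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    first = "".join(row[0] for row in rotations)   # column F
    last = "".join(row[-1] for row in rotations)   # column L, the BWT
    return first, last


F, L = bwt("mississippi#")
print(F)  # #iiiimppssss
print(L)  # ipssm#pissii
```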

  10. Reminder: Recovering T from L (Burrows-Wheeler Transform) • F = #iiiimppssss, L = ipssm#pissii • Find F by sorting L • First char of T? m • Find m in L • L[i] precedes F[i] in T. Therefore we get mi • How do we choose the correct i in L? • The i's are in the same order in L and F • As are the rest of the chars • i is followed by s: mis • And so on…
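
The key fact used here, that equal characters appear in the same relative order in L and F, is enough to invert the transform. The Python sketch below walks the text from its end to its start rather than front to back as on the slide; `inverse_bwt` is a hypothetical name, and the sentinel is assumed to be the smallest character.

```python
def inverse_bwt(L, sentinel="#"):
    """Recover T from its BWT column L (T ends with the sentinel)."""
    n = len(L)
    # A stable sort of the positions of L by character yields column F and
    # keeps equal characters in their L order, which is the slide's
    # "same order in L and F" property.
    order = sorted(range(n), key=lambda i: L[i])
    lf = [0] * n                 # lf[l] = row of F holding the same char as L[l]
    for f_row, l_row in enumerate(order):
        lf[l_row] = f_row
    row = 0                      # row 0 is the rotation starting with the sentinel
    chars = []
    for _ in range(n - 1):
        chars.append(L[row])     # L[row] precedes F[row] in T
        row = lf[row]
    return "".join(reversed(chars)) + sentinel


print(inverse_bwt("ipssm#pissii"))  # mississippi#
```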

  11. Next: Count P in T • Backward-search algorithm • Uses only L (output of BWT) • Relies on 2 structures: • C[1,…,|Σ|]: C[c] contains the total number of text chars in T which are alphabetically smaller than c (including repetitions of chars) • Occ(c,q): number of occurrences of char c in prefix L[1,q] • Example • C[ ] for T = mississippi# • Occ(s,5) = 2 • Occ(s,12) = 4 • [Table: Occ for the chars i, m, p, s at ranks 1 to 12]
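
As a small illustration of the two structures, the Python sketch below computes C and Occ directly from their definitions. The names `build_c` and `occ` are hypothetical, and `occ` rescans L on every call, whereas a real FM-index stores sampled counts to answer in constant time.

```python
def build_c(text):
    """C[c] = number of characters in text that are smaller than c."""
    C, total = {}, 0
    for ch in sorted(set(text)):
        C[ch] = total
        total += text.count(ch)
    return C


def occ(L, c, q):
    """Number of occurrences of c in the prefix L[1..q] (1-based, inclusive)."""
    return L.count(c, 0, q)


L = "ipssm#pissii"               # BWT of mississippi#
print(build_c("mississippi#"))   # {'#': 0, 'i': 1, 'm': 5, 'p': 6, 's': 8}
print(occ(L, "s", 5), occ(L, "s", 12))  # 2 4
```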

  12. Substring search in T (count the pattern occurrences) • Example: P = si, T = mississippi# • First step: fr and lr delimit the rows prefixed by the char "i" • Inductive step: given fr, lr for P[j+1,p] • Take c = P[j] • Find the first c in L[fr, lr] • Find the last c in L[fr, lr] • The Occ() oracle is enough • Result: occ = lr - fr + 1 = 2 • [Figure: sorted rotation matrix of mississippi# with column L = ipssm#pissii, the fr and lr pointers, and the C table (# 1, i 2, m 7, p 8, s 10)]
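
The inductive step can be written out as a short Python sketch. It keeps a half-open row range [fr, lr) with 0-based indices, so the count is lr - fr instead of the slide's lr - fr + 1; `backward_count` is a hypothetical name, and Occ is computed by rescanning L, which a real FM-index avoids.

```python
def backward_count(L, P):
    """Count occurrences of P in T, given only the BWT column L of T."""
    # C[c] = number of characters smaller than c; L and T share the same
    # multiset of characters, so C can be built from L.
    C, total = {}, 0
    for ch in sorted(set(L)):
        C[ch] = total
        total += L.count(ch)

    fr, lr = 0, len(L)               # all rows match the empty pattern
    for c in reversed(P):            # process P from its last char to its first
        if c not in C:
            return 0
        fr = C[c] + L.count(c, 0, fr)   # Occ(c, fr)
        lr = C[c] + L.count(c, 0, lr)   # Occ(c, lr)
        if fr >= lr:
            return 0
    return lr - fr


print(backward_count("ipssm#pissii", "si"))  # 2
```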

  13. Suffix T-Array • Backward search • Store “First” and “Last” (k and l) values

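  14. Backward-search example • P = CAA, processed from right to left: i = 3 (c = 'A'), i = 2 (c = 'A'), i = 1 (c = 'C') • At i = 1: First = C['T'] + Occ('C', First(AA)) + 1 • Last = C['T'] + Occ('C', Last(AA)) • [Figure: FM-index path Root → A → A with the First(AA) and Last(AA) values]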

  15. Optimization • Optimizing Taln: efficiency for matching • Suffix T-Array • Optimizing N: pruning ability to avoid enumerating unnecessary mismatches and gaps • Data-conscious D-array calculation

  16. D-Array: Motivation • e(W) • The minimal number of edit operations needed to make W align exactly onto the reference X • D-array • D[i]: lower bound of e(W[0…i]) • [Figure: D-array illustration]

  17. D-Array: Motivation • Given a string W and an arbitrary segmentation of W into substrings, W = w1w2…wk, we have e(W) ≥ e(w1) + e(w2) + … + e(wk) • D-array in BWA • Split W into several small strings, W = w1w2…wk, with e(wi) = 1 for all i. The correctness of the algorithm depends on the inequality e(W) ≥ e(w1) + e(w2) + … + e(wk).
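
A minimal Python sketch of this greedy segmentation follows. It closes a segment as soon as the current piece no longer occurs in the reference, so every closed segment needs at least one edit and the running segment count is a lower bound. The name `calculate_d` and the example read are assumptions, and Python's `in` operator stands in for the FM-index lookup that BWA performs.

```python
def calculate_d(X, W):
    """D[i] = lower bound on the edits needed to align W[0..i] onto X."""
    D = []
    z = 0        # number of segments closed so far
    start = 0    # start of the current segment
    for i in range(len(W)):
        if W[start:i + 1] not in X:
            # The current segment does not occur exactly in X, so it costs
            # at least one edit; close it and start a new one.
            z += 1
            start = i + 1
        D.append(z)
    return D


# Hypothetical read against a short reference
print(calculate_d("AACGTATCGACG", "ACGTTTCG"))  # [0, 0, 0, 0, 1, 1, 1, 1]
```

During the search, a partial alignment can be pruned once the remaining difference budget drops below the bound for the part of the read not yet aligned.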

  18. D-Array: Motivation • Example: reference X = "AACGTATCGACG" • W • D • A better segmentation: consider e(·) = 2 • W • D • Calculating e(·) costs exponential time • Needs precomputation

  19. Solution - Frequent Pattern • Pipeline: Train Reads → Frequent Patterns → Trie DFA • Mining Frequent Patterns (FPs) • State-of-the-art methods • Our solution: A simple DFS on FM-index • Count = Last - First + 1 • FASTA file F containing training reads • Should be similar to the reads in practice • Data-conscious • Generate prefix trie T for the FPs with e(w) = 2. • Refine T to a DFA GT
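
The "simple DFS on FM-index" idea can be sketched in Python as below. This is only an illustration under stated assumptions: it mines substrings of a single text that occur at least `min_count` times, using the interval size as the count, and it does not apply the e(w) = 2 filter from the slide. The names `build_fm` and `frequent_patterns`, the `$` sentinel, and the thresholds are all hypothetical.

```python
def build_fm(text):
    """Return (L, C) for text, which must end with a unique smallest sentinel."""
    suffix_order = sorted(range(len(text)), key=lambda i: text[i:])
    L = "".join(text[i - 1] for i in suffix_order)  # char preceding each suffix
    C, total = {}, 0
    for ch in sorted(set(text)):
        C[ch] = total
        total += text.count(ch)
    return L, C


def frequent_patterns(text, min_count, max_len, alphabet="ACGT"):
    """DFS over backward-search intervals; returns {pattern: count}."""
    L, C = build_fm(text)
    found = {}

    def dfs(pattern, first, last):       # half-open row range [first, last)
        for c in alphabet:
            if c not in C:
                continue
            new_first = C[c] + L.count(c, 0, first)
            new_last = C[c] + L.count(c, 0, last)
            count = new_last - new_first  # the slide's Last - First + 1
            if count >= min_count:
                extended = c + pattern    # backward search prepends characters
                found[extended] = count
                if len(extended) < max_len:
                    dfs(extended, new_first, new_last)

    dfs("", 0, len(L))
    return found


print(frequent_patterns("ACACAC$", min_count=3, max_len=3))
```

Because a pattern can only be frequent if its suffix is frequent, the DFS stops extending a branch as soon as its count falls below the threshold.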

  20. Trie Deterministic Finite Automaton • Why Trie DFA? • During online alignment, we need to find all the FPs contained in a read • This operation should cost no more than O(|W|)

  21. Trie Deterministic Finite Automaton: Offline Index Construction • String set (FP set): AA, C, G, T, AC, AG • The prefix trie is done; we then start to construct the DFA. • [Figure: prefix trie T with root R and nodes numbered 1 to 7]

  22. Re-Ordering • DFS order: minimize the average hop of each jump (7% up) • [Figure: trie nodes renumbered in DFS order]

  23. Trie Deterministic Finite Automaton: Online Query • String set (FP set): AA, AC, AG, C, G, T • W = "CACAT" • [Figure: DFA with root R and states labeled LT, LC and LAC]
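
One way to picture the online query is a trie with failure links in the style of Aho-Corasick, sketched below in Python. It is an assumption that the slides' Trie DFA behaves like this; a true DFA would precompute every transition instead of following failure links, and the function names are hypothetical. The scan is linear in |W| plus the number of reported matches.

```python
from collections import deque


def build_automaton(patterns):
    """Build (goto, fail, out) tables for the pattern set."""
    goto, fail, out = [{}], [0], [[]]
    for p in patterns:                      # 1. build the prefix trie
        state = 0
        for ch in p:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(p)
    queue = deque(goto[0].values())         # 2. add failure links, breadth-first
    while queue:
        state = queue.popleft()
        for ch, child in goto[state].items():
            queue.append(child)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            out[child] = out[child] + out[fail[child]]
    return goto, fail, out


def find_all(W, automaton):
    """Return (start, pattern) for every pattern occurrence in W."""
    goto, fail, out = automaton
    state, hits = 0, []
    for i, ch in enumerate(W):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for p in out[state]:
            hits.append((i - len(p) + 1, p))
    return hits


fps = ["AA", "AC", "AG", "C", "G", "T"]
print(find_all("CACAT", build_automaton(fps)))
# [(0, 'C'), (1, 'AC'), (2, 'C'), (4, 'T')]
```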

  24. Experiment • Optimizing Taln: efficiency for matching • Suffix T-Array (20% up) • Optimizing N: pruning ability to avoid enumerating unnecessary mismatches and gaps • Data-conscious D-array calculation (0-200% up)

  25. On Encoding Shortest Paths in Large Graphs • Background • Consider a graph G = (V, E), where V is a set of vertices and E ⊆ V × V is a set of edges. • FH-Partition

  26. Examples • Query 7 -> 10 • FH(7,10) = 9; FH(9,10) = 2; FH(2,10) = 10 • [Figure: example graph]
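
Reading FH(u, v) as the first hop on the shortest path from u to v (an assumption, since the transcript does not define it), the example values can be chained into a full path. The Python sketch below stores FH as a plain dictionary, which is hypothetical and ignores the compact encoding that the talk is about.

```python
def shortest_path(fh, source, target):
    """Follow first hops from source until target is reached."""
    path = [source]
    while path[-1] != target:
        if len(path) > len(fh) + 1:
            raise ValueError("first-hop table contains a cycle")
        path.append(fh[(path[-1], target)])
    return path


# First-hop values from the slide's example
fh = {(7, 10): 9, (9, 10): 2, (2, 10): 10}
print(shortest_path(fh, 7, 10))  # [7, 9, 2, 10]
```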

  27. Problem Statement • Numbering Function

  28. MCN is NP-Hard!!

  29. WorkFlow • Stages: Compute FH-Partitions, Get Numbering Function(s), Encoding FH-Partitions • Reduce to TSP • Region tree • Multi numbering functions • Compute a naïve numbering function • Store the FH-partitions • Further compression • Answering query efficiently

  30. Experiments

  31. Thank you!
