
Advances in DNA Sequence Alignment and Functional Analysis

Explores genetic sequence alignment with applications to disease prediction, crop yield genes, and machine learning. Covers k-band dynamic programming and the center star strategy for multiple sequence alignment (MSA), alongside tools such as ClustalΩ and HAlign, as well as microRNA relationships and novel findings.


Presentation Transcript


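  1. Alignment, Mining, and Functional Analysis of Gene Sequences. Quan Zou (Ph.D., Professor), School of Computer Science and Technology, Tianjin University. October 2017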

  2. Outline • Sequence alignment (algorithm, parallel) • Identification and mining (microRNA, machine learning related works) • Function prediction (miRNA-disease relationship, crop yield related genes)

  3. Multiple Sequence Alignment (MSA) vs. BLAST [Figure labels: MSA: input, output; BLAST: query, database, output]

  4. Multiple Sequence Alignment (MSA): What & Where [Figure labels: Multiple Sequence Alignment; Multiple DNA Sequence Alignment; Multiple Similar DNA Sequence Alignment; Our Focus; Application: phylogenetic tree, virus sequences, population SNV calling, …]

  5. Techniques for similar DNA MSA: (1) k-band dynamic programming [Figure: dynamic programming matrix with the k-band marked]

  6. How to set k for k-band?
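The slides pose the question of choosing k without recording the answer in this transcript. A minimal sketch, assuming unit-cost edit distance and the common doubling strategy (both assumptions, not necessarily the method used in HAlign): fill only the cells with |i - j| <= k, and accept the result once the banded distance d satisfies d <= k, since any alignment costing at most k cannot leave the band. The function names are hypothetical.

```python
def banded_edit_distance(s, t, k):
    """Unit-cost edit distance using only DP cells with |i - j| <= k.

    Returns None when the band cannot reach the bottom-right corner.
    """
    n, m = len(s), len(t)
    if abs(n - m) > k:
        return None
    inf = float("inf")
    prev = {j: j for j in range(0, min(m, k) + 1)}  # row 0 inside the band
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - k), min(m, i + k) + 1):
            if j == 0:
                cur[j] = i  # first column: i deletions
                continue
            best = prev.get(j - 1, inf) + (s[i - 1] != t[j - 1])  # match/mismatch
            best = min(best, prev.get(j, inf) + 1)      # gap in t
            best = min(best, cur.get(j - 1, inf) + 1)   # gap in s
            cur[j] = best
        prev = cur
    return prev[m]


def edit_distance_adaptive(s, t, k=1):
    """Double k until the banded result is provably optimal (d <= k)."""
    k = max(k, abs(len(s) - len(t)), 1)
    while True:
        d = banded_edit_distance(s, t, k)
        if d is not None and d <= k:
            return d, k
        k *= 2


if __name__ == "__main__":
    # Sequences taken from the suffix-tree slide of this deck.
    print(edit_distance_adaptive("GTCCGAAGCTCCGG", "GTCCTGAAGCTCCGT"))
```

For very similar sequences the loop stops at a small k, so only a narrow diagonal strip of the matrix is ever computed.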

  7. Greedy search with suffix tree. S = GTCCGAAGCTCCGG, T = GTCCTGAAGCTCCGT (positions numbered from 1). Matched blocks, each given as (start in S, start in T, length): (1,1,4), (5,6,9)
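The greedy matching on this slide can be sketched as follows. This is only an illustration: Python's `str.find` stands in for the suffix tree (an assumption made for brevity; a suffix tree answers the same longest-match query much faster), and the function name and `min_len` parameter are hypothetical.

```python
def greedy_common_blocks(s, t, min_len=1):
    """Greedily collect common blocks as (start in s, start in t, length), 1-based."""
    blocks = []
    i = j = 0
    while i < len(s) and j < len(t):
        length, pos = 0, -1
        # Extend the prefix of s[i:] for as long as it still occurs in t[j:].
        while i + length < len(s):
            found = t.find(s[i:i + length + 1], j)
            if found == -1:
                break
            length, pos = length + 1, found
        if length >= min_len:
            blocks.append((i + 1, pos + 1, length))
            i += length
            j = pos + length
        else:
            i += 1  # no usable match starts here; skip one character of s
    return blocks


if __name__ == "__main__":
    S = "GTCCGAAGCTCCGG"
    T = "GTCCTGAAGCTCCGT"
    print(greedy_common_blocks(S, T))  # [(1, 1, 4), (5, 6, 9)]
```

Only the short stretches between matched blocks then need dynamic programming, which is why the approach suits very similar sequences.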

  8. Techniques for similar DNA MSA: (2) center star strategy [Figure: sequences S1 to S5 arranged for tree alignment versus the center star strategy]
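A minimal sketch of the center star strategy, assuming unit-cost Needleman-Wunsch for the pairwise step (HAlign itself uses faster pairwise methods, so this is not the authors' implementation). The center is the sequence with the smallest total distance to the others; every other sequence is aligned to it, and the gap runs inserted before each center position are merged by taking the maximum. The function names are hypothetical.

```python
def align_pair(a, b):
    """Global alignment with unit costs; returns (a_aligned, b_aligned, distance)."""
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i
    for j in range(1, m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j - 1] + (a[i - 1] != b[j - 1]),
                          d[i - 1][j] + 1, d[i][j - 1] + 1)
    out_a, out_b, i, j = [], [], n, m
    while i > 0 or j > 0:  # traceback from the bottom-right corner
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), d[n][m]


def center_star(seqs):
    """Return (index of center sequence, list of aligned rows)."""
    dist = [[align_pair(a, b)[2] for b in seqs] for a in seqs]
    c = min(range(len(seqs)), key=lambda i: sum(dist[i]))
    center, size = seqs[c], len(seqs[c])
    pieces, max_ins = {}, [0] * (size + 1)
    for k, s in enumerate(seqs):
        if k == c:
            continue
        a_center, a_seq, _ = align_pair(center, s)
        ins, match, p = [""] * (size + 1), ["-"] * size, 0
        for x, y in zip(a_center, a_seq):
            if x == "-":
                ins[p] += y       # character inserted before center position p
            else:
                match[p] = y      # character (or gap) under center position p
                p += 1
        pieces[k] = (ins, match)
        max_ins = [max(g, len(run)) for g, run in zip(max_ins, ins)]
    rows = []
    for k in range(len(seqs)):
        ins, match = pieces.get(k, ([""] * (size + 1), list(center)))
        row = "".join(ins[p].ljust(max_ins[p], "-") + match[p] for p in range(size))
        rows.append(row + ins[size].ljust(max_ins[size], "-"))
    return c, rows


if __name__ == "__main__":
    for row in center_star(["GTCCGAAGCTCCGG", "GTCCTGAAGCTCCGT", "GTCGAAGCTCCGG"])[1]:
        print(row)
```

Because each sequence is aligned only against the center, the pairwise alignments are independent of one another, which is what makes the strategy easy to parallelize.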

  9. Extreme MSA for Very Similar DNA Sequences [Figure labels: sum up, update, final result]

  10. Experiments • Input: 100 human mitochondrial genome sequences, about 16k in length each (1555 KB in total) • Our output: 1558 KB • ClustalΩ output: 1627 KB

  11. Time cost of each step

  12. Outline • Sequence alignment (algorithm, parallel) • Identification and mining (microRNA, machine learning related works) • Function prediction (miRNA-disease relationship, crop yield related genes)

  13. Multiple sequence alignment in Hadoop

  14. Multiple sequence alignment in Spark

  15. Running time of different software tools on mtDNA datasets

  16. Running time with HPTree on 16S rRNA datasets

  17. Comparison of CPU-based, GPU-based, and Spark-based MSA [Chart: running time (sec); some runs are marked "Memory Limit Exceeded"] • CPU-based MSA can handle only small datasets (about 10% of memory size), and does so slowly. • GPU-based MSA handles small datasets in less time than CPU-based MSA. • Spark-based MSA can handle ultra-large datasets in acceptable time.

  18. Software http://lab.malab.cn/soft/halign/

  19. 2. Web Server, Step 1: Click the link (http://cluster.malab.cn/Halign/) shown above to open the HAlign web server.

  20. 2. Web Server, Step 2: After you submit your task successfully, wait a moment and the results will appear.

  21. 2. Web Server, Step 3: Click the "View" link to see a visualization of your multiple sequence alignment results.

  22. 2. Web Server, Step 4: Click the "Generate" link to see a visualization of your phylogenetic tree.

  23. References on MSA • Quan Zou, Qinghua Hu, Maozu Guo, Guohua Wang. HAlign: Fast Multiple Similar DNA/RNA Sequence Alignment Based on the Centre Star Strategy. Bioinformatics. 2015, 31(15): 2475-2481 • Xi Chen, Chen Wang, Shanjiang Tang, Ce Yu, Quan Zou. CMSA: A heterogeneous CPU/GPU computing system for multiple similar RNA/DNA sequence alignment. BMC Bioinformatics. 2017, 18: 315 • Shixiang Wan, Quan Zou*. HAlign-II: efficient ultra-large multiple sequence alignment and phylogenetic tree reconstruction with distributed and parallel computing. Algorithms for Molecular Biology. 2017, 12: 25 • Wenhe Su, Quan Zou, et al. MASC: A Linear Method for Multiple Nucleotide Sequence Alignment on Spark Parallel Framework. Journal of Computational Biology. Accepted

  24. Outline • Sequence alignment (algorithm, parallel) • Identification and mining (microRNA, machine learning related works) • Function prediction (miRNA-disease relationship, crop yield related genes)

  25. Identification of microRNA AUCGUGCAGAGACUAGACUGACAUCGUGCAGAGACUAGACUGACAUCGUGCAGAGACUAGACUGACAUCGUGCAGAGA CUAGACUGACAUCGUGCAGAGACUAG ACUGAC >1 tgcgcgaauucacccauggauccauucaucuuccaagggcaccagc >2 agcgcgaauuccaagucacccauggauccauucaucuggcagcgu >3 agucgcgaauucaucaucuuccaagggcacccauggauccaucca

  26. microRNA prediction based on machine learning [Figure annotations: obvious differences; weak generalization]

  27. [Flowchart labels: Human CDS; Extend 100 nt on each side; BLAST; Human mature microRNAs; Mature-like reads; Compute secondary structures; Extract parameters; Filter; Prediction model (rebuilt); Original negative set; Mined sequences; Replace; the innovation point is marked in the figure]

  28. microRNA family identification

  29. http://lab.malab.cn/~wly/mirnaDetect.html

  30. Novel miRNA found by our method 1

  31. Dinoflagellate genome. Lin, et al. The Symbiodinium kawagutii genome illuminates dinoflagellate gene expression and coral symbiosis. Science. 2015, 350(6261): 691-694.

  32. Outline • Sequence alignment (algorithm, parallel) • Identification and mining (microRNA, machine learning related works) • Function prediction (miRNA-disease relationship, crop yield related genes)

  33. Machine learning framework in gene identification [Figure: matrices of numeric feature values, e.g. -0.12972021, -0.10267122, 0.05165671, …]

  34. Ensemble learning: combining weak classifiers into a strong one [Figure: the classification results of weak classifiers h1() to h7() are combined to form the final strong classifier]
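The combination step can be sketched with plain majority voting. The transcript does not say how the votes are combined, so unweighted voting is an assumption, and the seven threshold rules below are hypothetical stand-ins for h1 to h7.

```python
from collections import Counter


def majority_vote(classifiers, x):
    """Return the label predicted by the largest number of classifiers."""
    votes = Counter(h(x) for h in classifiers)
    return votes.most_common(1)[0][0]


# Seven hypothetical weak classifiers: each thresholds a single score.
thresholds = [0.2, 0.35, 0.4, 0.5, 0.6, 0.65, 0.8]
weak_classifiers = [lambda x, t=t: int(x > t) for t in thresholds]

if __name__ == "__main__":
    print(majority_vote(weak_classifiers, 0.55))  # 4 of 7 vote for class 1
    print(majority_vote(weak_classifiers, 0.45))  # 4 of 7 vote for class 0
```

Using an odd number of binary voters avoids ties; weighted voting or stacking are drop-in alternatives for the same combination step.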

  35. Ensemble learning for Class Imbalance Problem
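Only the slide title survives in this transcript. One common ensemble approach to class imbalance, assumed here rather than taken from the slides, is to split the majority class into minority-sized chunks, train one model per balanced chunk, and let the models vote. The one-dimensional nearest-centroid model and all names below are hypothetical and chosen to keep the sketch short.

```python
import random
from statistics import mean


def balanced_subsets(majority, minority, seed=0):
    """Split the shuffled majority class into chunks the size of the minority class."""
    rng = random.Random(seed)
    shuffled = majority[:]
    rng.shuffle(shuffled)
    size = len(minority)
    return [(shuffled[i:i + size], minority)
            for i in range(0, len(shuffled) - size + 1, size)]


def train_centroid(negatives, positives):
    """Toy 1-D classifier: predict 1 when x is closer to the positive mean."""
    mu_neg, mu_pos = mean(negatives), mean(positives)
    return lambda x: int(abs(x - mu_pos) < abs(x - mu_neg))


def ensemble_predict(models, x):
    """Majority vote over the models trained on balanced subsets."""
    return int(2 * sum(m(x) for m in models) > len(models))


if __name__ == "__main__":
    negatives = [i / 100 for i in range(40)]   # 40 majority-class samples
    positives = [0.8, 0.85, 0.9, 0.95]         # 4 minority-class samples
    models = [train_centroid(neg, pos)
              for neg, pos in balanced_subsets(negatives, positives)]
    print(len(models), ensemble_predict(models, 0.9), ensemble_predict(models, 0.1))
```

Each base model sees a balanced training set, so no single model is dominated by the majority class, while the ensemble as a whole still uses every majority sample.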

  36. http://lab.malab.cn/soft/LibD3C/

  37. http://lab.malab.cn/soft/MRMD/

  38. Application in Bioinformatics • DNA-binding proteins: Li Song, Dapeng Li, Xiangxiang Zeng, Yunfeng Wu, Li Guo*, Quan Zou*. nDNA-prot: Identification of DNA-binding Proteins Based on Unbalanced Classification. BMC Bioinformatics. 2014, 15: 298 • tRNA: Quan Zou, et al. Improving tRNAscan-SE annotation results via ensemble classifiers. Molecular Informatics. 2015, 34(11-12): 761-770 • miRNA: Leyi Wei, Minghong Liao, Yue Gao, Rongrong Ji, Zengyou He*, Quan Zou*. Improved and Promising Identification of Human MicroRNAs by Incorporating a High-quality Negative Set. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2014, 11(1): 192-201 • circRNA: Xiangxiang Zeng, Wei Lin, Maozu Guo, Quan Zou*. A comprehensive overview and evaluation of circular RNA detection tools. PLoS Computational Biology. 2017, 13(6): e1005420

  39. Using the ensemble learning method proposed by Associate Professor Quan Zou

  40. zouquan@tju.edu.cn

  41. References • Leyi Wei, Minghong Liao, Yue Gao, Rongrong Ji, Zengyou He*, Quan Zou*. Improved and Promising Identification of Human MicroRNAs by Incorporating a High-quality Negative Set. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2014, 11(1): 192-201 • Quan Zou*, Yaozong Mao, Lingling Hu, Yunfeng Wu, Zhiliang Ji*. miRClassify: An advanced web server for miRNA family classification and annotation. Computers in Biology and Medicine. 2014, 45: 157-160 • Chen Lin, Wenqiang Chen, Cheng Qiu, Yunfeng Wu, Sridhar Krishnan, Quan Zou*. LibD3C: Ensemble Classifiers with a Clustering and Dynamic Selection Strategy. Neurocomputing. 2014, 123: 424-435 • Quan Zou, Jiancang Zeng, Liujuan Cao, Rongrong Ji. A Novel Features Ranking Metric with Application to Scalable Visual and Bioinformatics Data Classification. Neurocomputing. 2016, 173: 346-354

  42. Outline • Sequence alignment (algorithm, parallel) • Identification and mining (microRNA, machine learning related works) • Function prediction (miRNA-disease relationship, crop yield related genes)
