This talk covers the alignment, mining, and functional analysis of gene sequences, with applications to disease prediction and crop-yield genes. It presents k-band dynamic programming and the center star strategy for multiple sequence alignment (MSA), alongside tools such as ClustalΩ and HAlign, and discusses machine-learning identification of microRNAs and miRNA-disease relationships.
Alignment, Mining, and Functional Analysis of Gene Sequences Quan Zou (Ph.D. & Professor) School of Computer Science and Technology, Tianjin University 2017.10
Outline • Sequence alignment • Algorithm • Parallel • Identification and mining • microRNA • machine learning related works • Function prediction • miRNA disease relationship • crops yield related genes
Multiple Sequence Alignment (MSA) vs. BLAST [Figure labels: MSA: input, output; BLAST: query, database, output]
Multiple Sequence Alignment (MSA): What & Where • Applications: phylogenetic tree, virus sequences, population SNV calling, … • Scope: multiple sequence alignment, multiple DNA sequence alignment, multiple similar DNA sequence alignment • Our focus: multiple similar DNA sequence alignment
Techniques for similar DNA MSA 1. k-band Dynamic Programming [Figure: dynamic programming matrix in which only the cells inside the k-band around the main diagonal are scored]
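The k-band idea can be illustrated with a minimal sketch. It computes a global alignment score while filling only the cells with |i - j| <= k. The function name and the scoring values (match +1, mismatch -1, gap -1) are assumptions for illustration and are not taken from the slides. For readability the sketch allocates the full matrix; a practical implementation would store only the band.

```python
def k_band_align_score(a, b, k, match=1, mismatch=-1, gap=-1):
    """Global alignment score using only DP cells with |i - j| <= k.

    Scoring values are illustrative assumptions, not the talk's parameters.
    """
    n, m = len(a), len(b)
    if abs(n - m) > k:
        raise ValueError("band too narrow to reach the final cell")
    neg = float("-inf")
    # Cells outside the band stay at -inf and never win a max().
    dp = [[neg] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0
    for i in range(n + 1):
        for j in range(max(0, i - k), min(m, i + k) + 1):
            if i == 0 and j == 0:
                continue
            best = neg
            if i > 0 and j > 0:  # match or mismatch
                s = match if a[i - 1] == b[j - 1] else mismatch
                best = max(best, dp[i - 1][j - 1] + s)
            if i > 0:            # gap in b
                best = max(best, dp[i - 1][j] + gap)
            if j > 0:            # gap in a
                best = max(best, dp[i][j - 1] + gap)
            dp[i][j] = best
    return dp[n][m]


score = k_band_align_score("GTCCGAAGCTCCGG", "GTCCTGAAGCTCCGT", k=2)
```

For very similar sequences the optimal path stays close to the diagonal, so a small k usually gives the same score as the full matrix while visiting roughly (2k + 1) · n cells instead of n · m.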
Greedy search with suffix tree S = GTCCGAAGCTCCGG T = GTCCTGAAGCTCCGT Matched segments, written as (start in S, start in T, length) with 1-based positions: (1,1,4), (5,6,9)
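The triples on this slide can be reproduced with a small greedy search. The sketch below is a simplification: it uses plain substring search (`str.find`) in place of a suffix tree, so it shows the greedy logic but not the efficiency of the real data structure. The function name and the `min_len` parameter are hypothetical.

```python
def greedy_common_segments(s, t, min_len=1):
    """Greedily collect common segments of s and t, left to right.

    Returns (start_in_s, start_in_t, length) triples with 1-based starts.
    Uses str.find instead of a suffix tree (illustrative assumption).
    """
    segments = []
    i = j = 0
    while i < len(s) and j < len(t):
        best_len, best_pos = 0, -1
        length = 1
        # Extend the prefix of s[i:] while it still occurs in t at or after j.
        while i + length <= len(s):
            pos = t.find(s[i:i + length], j)
            if pos == -1:
                break
            best_len, best_pos = length, pos
            length += 1
        if best_len >= min_len:
            segments.append((i + 1, best_pos + 1, best_len))
            i += best_len
            j = best_pos + best_len
        else:
            i += 1  # no usable match starting here; skip one character
    return segments


S = "GTCCGAAGCTCCGG"
T = "GTCCTGAAGCTCCGT"
segments = greedy_common_segments(S, T)
```

On the slide's example this yields (1,1,4) for GTCC and (5,6,9) for GAAGCTCCG; only the short regions between matched segments would then need dynamic programming.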
Techniques for similar DNA MSA 2. Center star strategy [Figure: tree alignment compared with the center star strategy for sequences S1–S5]
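One step of the center star strategy is choosing the center sequence. The sketch below follows the textbook heuristic of picking the sequence with the smallest total distance to all others, using edit distance as the pairwise measure. Both the distance measure and the function names are assumptions for illustration; the slides do not state how the center is selected in the author's tools.

```python
def edit_distance(a, b):
    """Levenshtein distance with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def choose_center(seqs):
    """Index of the sequence with the smallest summed distance to the rest."""
    totals = [sum(edit_distance(s, t) for t in seqs) for s in seqs]
    return min(range(len(seqs)), key=totals.__getitem__)


seqs = ["ACGA", "ACGT", "TCGT"]
center = choose_center(seqs)
```

Every other sequence is then aligned pairwise to the center, and the pairwise alignments are merged into one multiple alignment. Choosing the center this way costs a quadratic number of pairwise comparisons, which is one reason implementations for very similar sequences may use a cheaper rule.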
Extreme MSA for Very Similar DNA Sequences [Figure labels: sum up, update, final result]
Experiments • 100 human mitochondria genome sequences • 16k length (1555KB) • Our output 1558KB • ClustalΩ 1627KB
Comparison with CPU-based and Spark-based MSA [Chart: running time (sec); some runs are marked "Memory Limit Exceeded"] • CPU-based MSA can handle only small datasets (about 10% of memory size), and does so slowly. • GPU-based MSA can handle small datasets in less time than CPU-based MSA. • Spark-based MSA can handle ultra-large datasets in acceptable time.
Software http://lab.malab.cn/soft/halign/
2. Web Server Step 1: Click the link (http://cluster.malab.cn/Halign/) shown above to open the HAlign web server.
Step 2: After you submit your task successfully, wait a moment and the results will appear.
Step 3: Click the "View" link to see a visualization of your multiple sequence alignment results.
Step 4: Click the "Generate" link to see a visualization of your phylogenetic tree.
References on MSA
• Quan Zou, Qinghua Hu, Maozu Guo, Guohua Wang. HAlign: Fast Multiple Similar DNA/RNA Sequence Alignment Based on the Centre Star Strategy. Bioinformatics. 2015, 31(15): 2475-2481
• Xi Chen, Chen Wang, Shanjiang Tang, Ce Yu, Quan Zou. CMSA: A heterogeneous CPU/GPU computing system for multiple similar RNA/DNA sequence alignment. BMC Bioinformatics. 2017, 18: 315
• Shixiang Wan, Quan Zou*. HAlign-II: efficient ultra-large multiple sequence alignment and phylogenetic tree reconstruction with distributed and parallel computing. Algorithms for Molecular Biology. 2017, 12: 25
• Wenhe Su, Quan Zou, et al. MASC: A Linear Method for Multiple Nucleotide Sequence Alignment on Spark Parallel Framework. Journal of Computational Biology. Accepted
Identification of microRNA
AUCGUGCAGAGACUAGACUGACAUCGUGCAGAGACUAGACUGACAUCGUGCAGAGACUAGACUGACAUCGUGCAGAGA
CUAGACUGACAUCGUGCAGAGACUAG
ACUGAC
>1
tgcgcgaauucacccauggauccauucaucuuccaagggcaccagc
>2
agcgcgaauuccaagucacccauggauccauucaucuggcagcgu
>3
agucgcgaauucaucaucuuccaagggcacccauggauccaucca
microRNA prediction based on machine learning • Obvious differences • Weak generalization
[Flowchart labels: Human CDs; Human Mature microRNAs; Blast; Mature-like Reads; Extend (100nt, 100nt); Compute Secondary Structures; Extract Parameter; Filter; Mined Sequences; Replace Original Negative Set (innovation point); Prediction Model Rebuilt]
Dinoflagellate genome Lin, et al. The Symbiodinium kawagutii genome illuminates dinoflagellate gene expression and coral symbiosis. Science. 2015, 350(6261): 691-694.
Machine learning framework in gene identification [Figure: each sequence is represented as a row of real-valued features, forming a numeric feature matrix]
Ensemble learning: combining weak classifiers into a strong one. The weak classifiers h1(), h2(), h3(), h4(), h5(), h6(), and h7() are combined to form the final strong classifier, which produces the classification result.
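The combination step can be sketched with simple majority voting over seven weak classifiers. Majority voting is used here only as the simplest combiner; it is not the clustering and dynamic selection strategy of LibD3C, and the threshold classifiers and the single numeric feature are hypothetical stand-ins.

```python
from collections import Counter


def majority_vote(classifiers, x):
    """Return the label predicted by most of the weak classifiers."""
    votes = [h(x) for h in classifiers]
    return Counter(votes).most_common(1)[0][0]


# Seven hypothetical weak classifiers h1..h7: each thresholds one numeric
# feature at a different cut-off and outputs label 1 or 0.
thresholds = (0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8)
weak_classifiers = [lambda x, t=t: int(x > t) for t in thresholds]

label_high = majority_vote(weak_classifiers, 0.55)  # four of seven vote 1
label_low = majority_vote(weak_classifiers, 0.45)   # four of seven vote 0
```

Using an odd number of binary voters avoids ties. Weighted voting or a learned meta-classifier are common alternatives when the weak classifiers differ in reliability.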
Application in Bioinformatics
• DNA-binding proteins: Li Song, Dapeng Li, Xiangxiang Zeng, Yunfeng Wu, Li Guo*, Quan Zou*. nDNA-prot: Identification of DNA-binding Proteins Based on Unbalanced Classification. BMC Bioinformatics. 2014, 15: 298
• tRNA: Quan Zou, et al. Improving tRNAscan-SE annotation results via ensemble classifiers. Molecular Informatics. 2015, 34(11-12): 761-770
• miRNA: Leyi Wei, Minghong Liao, Yue Gao, Rongrong Ji, Zengyou He*, Quan Zou*. Improved and Promising Identification of Human MicroRNAs by Incorporating a High-quality Negative Set. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2014, 11(1): 192-201
• Circular RNA: Xiangxiang Zeng, Wei Lin, Maozu Guo, Quan Zou*. A comprehensive overview and evaluation of circular RNA detection tools. PLoS Computational Biology. 2017, 13(6): e1005420
References
• Leyi Wei, Minghong Liao, Yue Gao, Rongrong Ji, Zengyou He*, Quan Zou*. Improved and Promising Identification of Human MicroRNAs by Incorporating a High-quality Negative Set. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2014, 11(1): 192-201
• Quan Zou*, Yaozong Mao, Lingling Hu, Yunfeng Wu, Zhiliang Ji*. miRClassify: An advanced web server for miRNA family classification and annotation. Computers in Biology and Medicine. 2014, 45: 157-160
• Chen Lin, Wenqiang Chen, Cheng Qiu, Yunfeng Wu, Sridhar Krishnan, Quan Zou*. LibD3C: Ensemble Classifiers with a Clustering and Dynamic Selection Strategy. Neurocomputing. 2014, 123: 424-435
• Quan Zou, Jiancang Zeng, Liujuan Cao, Rongrong Ji. A Novel Features Ranking Metric with Application to Scalable Visual and Bioinformatics Data Classification. Neurocomputing. 2016, 173: 346-354