
Hierarchical learning and high dimensionality problems in bioinformatics

Hierarchical learning and high dimensionality problems in bioinformatics. Quan Zou (邹权), Ph.D., Professor, School of Computer Science and Technology, Tianjin University. 2017.9.8.


Presentation Transcript


  1. Hierarchical learning and high dimensionality problems in bioinformatics. Quan Zou (邹权), Ph.D., Professor, School of Computer Science and Technology, Tianjin University. 2017.9.8

  2. Outline • Hierarchical learning in bioinformatics • Protein fold pattern • Enzyme identification • microRNA family • High dimensionality problems • Gene expression • Methylation profile • GWAS

  3. Proteins

  4. Proteins
     >A1JRR3|1.1.1.79|1.1.1.81
     MNIIFYHPFFEAKQWLSGLQSRLPTANIRQWRRGDTQPADYALVWQPPQEMLASRVELKGVFALGAGVDA
     ILDQERRHPGTLPAGVPLVRLEDTGMSLQMQEYVVATVLRYFRRMDEYQLQQQQKLWQPLEPHQHDKFTI
     GILGAGVLGKSVAHKLAEFGFTVRCWSRTPKQIDGVTSFAGQEKLPAFIQGTQLLINLLPHTPQTAGILN
     QSLFSQLNANAYIINIARGAHLLERDLLAAMNAGQVAAATLDVFAEEPLPSMHPFWSHPRVTITPHIAAV
     TLPEVAMDQVVANIQAMEAGREPVGLVDVVRGY
     >Q9NZB8|4.1.99.22|4.6.1.17
     MAARPLSRMLRRLLRSSARSCSSGAPVTQPCPGESARAASEEVSRRRQFLREHAAPFSAFLTDSFGRQHS
     YLRISLTEKCNLRCQYCMPEEGVPLTPKANLLTTEEILTLARLFVKEGIDKIRLTGGEPLIRPDVVDIVA
     QLQRLEGLRTIGVTTNGINLARLLPQLQKAGLSAINISLDTLVPAKFEFIVRRKGFHKVMEGIHKAIELG
     YNPVKVNCVVMRGLN
     >Q8JU62|2.1.1.56|2.7.7.50
     MAAVFGIQLVPKLNTSTTRRTFLPLRFDLLLDRLQSTNLHGVLYRALDFNPVDRSATVIQTYPPLNAWSP
     HPAFIENPLDYRDWTEFIHDRALAFVGVLTQRYPLTQNAQRYTNPLVLGAAFGDFLNARSIDIFLDRLFY
     GPTQESPITSITKFPYQWTIDFNVTADSVRTPAGCKYITLYGYDPSRPSTPATYGKHRPTYATVFYYSTL
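The header lines on this slide appear to follow the pattern `>accession|EC number|EC number`, so one protein can carry more than one enzyme label. A minimal sketch of reading such records into (accession, labels, sequence) tuples, assuming that header layout; the name `parse_fasta` is illustrative and does not come from the talk:

```python
def parse_fasta(text):
    """Parse FASTA text whose headers look like '>ACCESSION|LABEL|LABEL...'.

    Returns a list of (accession, labels, sequence) tuples.
    The '|'-separated header layout is an assumption based on the slide.
    """
    records = []
    accession, labels, chunks = None, [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if accession is not None:
                records.append((accession, labels, "".join(chunks)))
            fields = line[1:].split("|")
            accession, labels, chunks = fields[0], fields[1:], []
        else:
            # Sequence lines are concatenated in order.
            chunks.append(line)
    if accession is not None:
        records.append((accession, labels, "".join(chunks)))
    return records


if __name__ == "__main__":
    demo = ">A1JRR3|1.1.1.79|1.1.1.81\nMNIIFYHPFF\nEAKQWLSGLQ\n"
    for acc, ecs, seq in parse_fasta(demo):
        print(acc, ecs, len(seq))
```

Keeping the labels as a list, rather than a single string, leaves room for the multi-label case discussed later in the talk.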

  5. Protein Classification
     -0.12972021 -0.10267122 0.05165671 -0.02537533 -0.02327581 0.01257873 -0.04431615 -0.03793824 0.00783558 -0.09035013 -0.04484774 -0.02480496 -0.01150325 -0.02400325 0.03616526 -0.13563429 -0.15971042 -0.00528393
     >P04635|3.1.1.3|3.1.1.32
     MKETKHQHTFSIRKSAYGAASVMVASCIFVIGGGVAEANDSTTQTTTPLEVAQTSQQETHTHQTPVTSLH
     TATPEHVDDSKEATPLPEKAESPKTEVTVQPSSHTQEVPALHKKTQQQPAYKDKTVPESTIASKSVESNK
     ATENEMSPVEHHASNVEKREDRLETNETTPPSVDREFSHKIINNTHVNPKTDGQTNVNVDTKTIDTVSPK
     DDRIDTAQPKQVDVPKENTTAQNKFTSQASDKKPTVKAAPEAVQNPENPKNKDPFVFVHGFTGFVGEVAA
     KGENHWGGTKANLRNHLRKAGYETYEASVSALASNHERAVELYYYLKGGRVDYGAAHSEKYGHERYGKTY
     >Q9NWT6|1.14.11.30|1.14.11.n4
     MAATAAEAVASGSGEPREEAGALGPAWDESQLRSYSFPTRPIPRLSQSDPRAEELIENEEPVVLTDTNLV
     YPALKWDLEYLQENIGNGDFSVYSASTHKFLYYDEKKMANFQNFKPRSNREEMKFHEFVEKLQDIQQRGG
     EERLYLQQTLNDTVGRKIVMDFLGFNWNWINKQQGKRGWGQLTSNLLLIGMEGNVTPAHYDEQQNFFAQI
     KGYKRCILFPPDQFECLYPYPVHHPCDRQSQVDFDNPDYERFPNFQNVVGYETVVGPGDVLYIPMYWWHH
     IESLLNGGITITVNFWYKGAPTPKRIEYPLKAHQKVAIMRNIEKMLGEALGNPQEVGPLLNTMIKGRYN
     >P04418|3.2.2.17|4.2.99.18
     MTRINLTLVSELADQHLMAEYRELPRVFGAVRKHVANGKRVRDFKISPTFILGAGHVTFFYDKLEFLRKR
     QIELIAECLKRGFNIKDTTVQDISDIPQEFRGDYIPHEASIAISQARLDEKIAQRPTWYKYYGKAIYA
     >A1JRR3|1.1.1.79|1.1.1.81
     MNIIFYHPFFEAKQWLSGLQSRLPTANIRQWRRGDTQPADYALVWQPPQEMLASRVELKGVFALGAGVDA
     ILDQERRHPGTLPAGVPLVRLEDTGMSLQMQEYVVATVLRYFRRMDEYQLQQQQKLWQPLEPHQHDKFTI
     GILGAGVLGKSVAHKLAEFGFTVRCWSRTPKQIDGVTSFAGQEKLPAFIQGTQLLINLLPHTPQTAGILN
     QSLFSQLNANAYIINIARGAHLLERDLLAAMNAGQVAAATLDVFAEEPLPSMHPFWSHPRVTITPHIAAV
     TLPEVAMDQVVANIQAMEAGREPVGLVDVVRGY
     >Q9NZB8|4.1.99.22|4.6.1.17
     MAARPLSRMLRRLLRSSARSCSSGAPVTQPCPGESARAASEEVSRRRQFLREHAAPFSAFLTDSFGRQHS
     YLRISLTEKCNLRCQYCMPEEGVPLTPKANLLTTEEILTLARLFVKEGIDKIRLTGGEPLIRPDVVDIVA
     QLQRLEGLRTIGVTTNGINLARLLPQLQKAGLSAINISLDTLVPAKFEFIVRRKGFHKVMEGIHKAIELG
     YNPVKVNCVVMRGLN
     >Q8JU62|2.1.1.56|2.7.7.50
     MAAVFGIQLVPKLNTSTTRRTFLPLRFDLLLDRLQSTNLHGVLYRALDFNPVDRSATVIQTYPPLNAWSP
     HPAFIENPLDYRDWTEFIHDRALAFVGVLTQRYPLTQNAQRYTNPLVLGAAFGDFLNARSIDIFLDRLFY
     GPTQESPITSITKFPYQWTIDFNVTADSVRTPAGCKYITLYGYDPSRPSTPATYGKHRPTYATVFYYSTL
     -0.02425524 -0.05029627 0.0067438 -0.04724623 -0.08116538 0.03915287 0.05580992 -0.02495753 -0.05490753 0.0361518 0.04706983 -0.09807123 0.10447804 0.09917403 0.07816287 0.11267566 0.06060866 -0.01122177

  6. Protein fold pattern problem http://scop.mrc-lmb.cam.ac.uk/scop/

  7. Protein fold pattern problem http://scop.mrc-lmb.cam.ac.uk/scop/

  8. Flat classification vs. hierarchical classification: mistakes from the hierarchical classification

  9. Flat classification vs. hierarchical classification. Ref: "Hierarchical Feature Selection with Recursive Regularization." IJCAI, 2017
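One way to see how the two settings differ: a flat classifier scores every wrong label as equally wrong, while a hierarchical view can count how far apart the predicted and true labels sit in the class tree. The sketch below assumes labels written as dot-separated paths (as in EC numbers); the helper names are hypothetical, and this tree distance is a generic measure, not necessarily the one used in the cited IJCAI paper.

```python
def label_path(label):
    """Return all ancestors of a dotted label, from the root level down to the label."""
    parts = label.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]


def zero_one_loss(true_label, predicted_label):
    """Flat view: any mismatch costs 1."""
    return 0 if true_label == predicted_label else 1


def tree_distance(true_label, predicted_label):
    """Hierarchical view: number of edges between the two labels in the class tree."""
    a, b = label_path(true_label), label_path(predicted_label)
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)


if __name__ == "__main__":
    # Both predictions are wrong under the flat loss, but one is a near miss.
    print(zero_one_loss("1.1.1.79", "1.1.1.81"), tree_distance("1.1.1.79", "1.1.1.81"))
    print(zero_one_loss("1.1.1.79", "4.6.1.17"), tree_distance("1.1.1.79", "4.6.1.17"))
```

A prediction in the wrong top-level class is penalised more than one that only misses the last level, which the flat loss cannot express.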

  10. Enzyme identification • Hierarchical • Uncertainty • Multi-label http://enzyme.expasy.org/

  11. Enzyme identification • Hierarchical • Uncertainty • Multi-label http://enzyme.expasy.org/

  12. Enzyme identification • Hierarchical • Uncertainty • Multi-label http://enzyme.expasy.org/ ENZYME now includes entries with preliminary EC numbers. Preliminary EC numbers include an 'n' as part of the fourth (serial) digit (e.g. EC 3.5.1.n3).
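The four-level EC number and the preliminary 'n' marker described on this slide can be handled with a few lines of string processing. A minimal sketch, assuming fully specified four-field EC numbers (partial entries are not handled); `parse_ec` is an illustrative name:

```python
def parse_ec(ec_number):
    """Split an EC number into its hierarchy levels.

    Returns a dict with:
      levels      - the four fields as strings
      path        - the ancestor labels from class down to the full number
      preliminary - True if the fourth (serial) field starts with 'n'
    """
    levels = ec_number.strip().split(".")
    if len(levels) != 4:
        raise ValueError("expected four dot-separated fields: %r" % ec_number)
    path = [".".join(levels[:i]) for i in range(1, 5)]
    preliminary = levels[3].startswith("n")
    return {"levels": levels, "path": path, "preliminary": preliminary}


if __name__ == "__main__":
    print(parse_ec("3.5.1.n3"))
    print(parse_ec("1.14.11.30")["preliminary"])
```

The `path` list gives the labels a hierarchical classifier would predict level by level, and the `preliminary` flag marks entries whose final assignment is still uncertain.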

  13. Enzyme identification • Hierarchical • Uncertainty • Multi-label http://enzyme.expasy.org/

  14. The Flow of Protein Classification [Figure: the same FASTA sequences and numeric feature vectors as on slide 5]

  15. The Flow of Protein Classification -0.12972021 -0.10267122 0.05165671 -0.02537533 -0.02327581 0.01257873 -0.04431615 -0.03793824 0.00783558 -0.09035013 -0.04484774 -0.02480496 -0.01150325 -0.02400325 0.03616526 -0.13563429 -0.15971042 -0.00528393 -0.02425524 -0.05029627 0.0067438 -0.04724623 -0.08116538 0.03915287 0.05580992 -0.02495753 -0.05490753 0.0361518 0.04706983 -0.09807123 0.10447804 0.09917403 0.07816287 0.11267566 0.06060866 -0.01122177
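The slides show protein sequences being turned into numeric vectors but do not say which features are used. As a stand-in illustration only, amino acid composition (the fraction of each of the 20 standard residues) is one simple way to map a variable-length sequence to a fixed-length vector. It is not claimed to be the representation behind the numbers on the slide, and `aa_composition` is a hypothetical helper name.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def aa_composition(sequence):
    """Return a 20-dimensional vector of residue frequencies.

    Characters outside the 20 standard residues are ignored.
    An empty (or all-unknown) sequence yields a vector of zeros.
    """
    counts = {aa: 0 for aa in AMINO_ACIDS}
    total = 0
    for ch in sequence.upper():
        if ch in counts:
            counts[ch] += 1
            total += 1
    return [counts[aa] / total if total else 0.0 for aa in AMINO_ACIDS]


if __name__ == "__main__":
    vec = aa_composition("MNIIFYHPFFEAKQWLSGLQSRLPTANIRQ")
    print(len(vec), round(sum(vec), 6))
```

Whatever feature extractor is used, the point of this step is the same: every protein ends up as a vector of the same length, which a standard classifier can consume.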

  16. Enzyme identification • Hierarchical • Uncertainty • Multi-label http://enzyme.expasy.org/

  17. microRNA family identification

  18. Identification of microRNA
     AUCGUGCAGAGACUAGACUGACAUCGUGCAGAGACUAGACUGACAUCGUGCAGAGACUAGACUGACAUCGUGCAGAGACUAGACUGACAUCGUGCAGAGACUAGACUGAC
     >1
     tgcgcgaauucacccauggauccauucaucuuccaagggcaccagc
     >2
     agcgcgaauuccaagucacccauggauccauucaucuggcagcgu
     >3
     agucgcgaauucaucaucuuccaagggcacccauggauccaucca

  19. Ref: Xue C, et al. Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine. BMC Bioinformatics, 2005, 6(1): 310.

  20. Ref: Xue C, et al. Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine. BMC Bioinformatics, 2005, 6(1): 310.
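The cited paper describes local structure-sequence ("triplet") features: each position contributes the paired or unpaired status of three adjacent nucleotides together with the middle nucleotide, giving 4 × 8 = 32 possible elements. A simplified sketch, assuming the secondary structure is already available in dot-bracket notation; it does not compute the fold, the function name is illustrative, and details may differ from the published method:

```python
from itertools import product


def triplet_features(sequence, structure):
    """Count 32 structure-sequence triplet elements and normalise to frequencies.

    sequence  - RNA string over A, C, G, U (T is treated as U)
    structure - dot-bracket string of the same length
    Both '(' and ')' are treated as 'paired' and written as '('.
    """
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    seq = sequence.upper().replace("T", "U")
    struct = structure.replace(")", "(")
    keys = [nt + "".join(t) for nt in "ACGU" for t in product("(.", repeat=3)]
    counts = dict.fromkeys(keys, 0)
    # Slide a window of three structure symbols; the key uses the middle nucleotide.
    for i in range(1, len(seq) - 1):
        key = seq[i] + struct[i - 1:i + 2]
        if key in counts:
            counts[key] += 1
    total = sum(counts.values())
    return {k: (v / total if total else 0.0) for k, v in counts.items()}


if __name__ == "__main__":
    feats = triplet_features("ACGU", "(..)")
    print({k: v for k, v in feats.items() if v > 0})
```

The resulting 32-dimensional vector can then be passed to a classifier such as the support vector machine named in the reference.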

  21. microRNA prediction based on machine learning: obvious differences, weak generalization

  22. Importance of negative samples [Figure labels: Negative Testing Set; Positive Training Set; Negative Training Set; Decision Boundary]

  23. Importance of negative samples [Figure labels: Negative Testing Set; Positive Training Set; New Negative Training Set; New Decision Boundary]

  24. [Flowchart labels: Human CDs; Extend; Blast; 100nt; 100nt; Human Mature microRNAs; Mature-like Reads; Compute Secondary Structures; Extract Parameter; Filter; Prediction Model; Rebuilt; Original Negative Set; Mined Sequences; innovation point; Replace]
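The flowchart labels suggest a loop in which a prediction model is trained, sequences that fool it are mined, and those mined sequences replace the original negative set before the model is rebuilt. The sketch below is one possible reading of that loop, not the authors' pipeline: it assumes feature vectors are already extracted, uses a toy nearest-centroid classifier in place of the real model, and all function names are hypothetical.

```python
def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predicted_positive(x, pos_center, neg_center):
    """Toy model: a sample is called positive if it is closer to the positive centroid."""
    return squared_distance(x, pos_center) < squared_distance(x, neg_center)


def rebuild_negative_set(positives, negatives, candidates, rounds=3):
    """Replace the negative set with known negatives the current model gets wrong.

    candidates - vectors known to be negative (for example, mined sequences)
    Stops early when no candidate is misclassified as positive.
    """
    for _ in range(rounds):
        pos_center, neg_center = centroid(positives), centroid(negatives)
        hard = [c for c in candidates if predicted_positive(c, pos_center, neg_center)]
        if not hard:
            break
        negatives = hard  # the mined sequences replace the previous negative set
    return negatives


if __name__ == "__main__":
    pos = [(0.0, 0.0), (1.0, 1.0)]
    neg = [(10.0, 10.0)]          # easy negatives, far from the positives
    mined = [(2.0, 2.0), (9.0, 9.0)]
    print(rebuild_negative_set(pos, neg, mined))
```

In the toy run the easy negative at (10, 10) is replaced by the harder one at (2, 2), which moves the decision boundary closer to the positive samples, in the spirit of the "New Decision Boundary" shown on the previous slide.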

  25. http://lab.malab.cn/~wly/mirnaDetect.html

  26. Novel miRNA found by our method 1

  27. Dinoflagellate genome. Ref: Lin, et al. The Symbiodinium kawagutii genome illuminates dinoflagellate gene expression and coral symbiosis. Science. 2015, 350(6261): 691-694.

  28. Outline • Hierarchical learning in bioinformatics • Protein fold pattern • Enzyme identification • microRNA family • High dimensionality problems • Gene expression • Methylation profile • GWAS

  29. High dimensionality problems • Sparse • Noisy

  30. Gene expression data

  31. Methylation

  32. https://cancergenome.nih.gov/

  33. GWAS(Genome-wide association study)

  34. Machine learning in GWAS

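  35. Genome, GWAS and Watson • Ages 15-19: University of Chicago • Ages 19-22: Indiana University, Ph.D. • Advisor: Salvador Luria (1969 Nobel Prize) • Idol: Muller (1946 Nobel Prize, a student of Morgan) • Ages 22-25: Cavendish Laboratory, University of Cambridge • Supervisor: Lawrence Bragg (the youngest Nobel laureate in physics) • Published the DNA double helix structure in Nature • Ages 25-40: professor at Harvard University • Age 34: 1962 Nobel Prize • Ages 40-79: director of Cold Spring Harbor Laboratory • Led the Human Genome Project. Two-time Nobel laureates • Marie Curie (1903 Physics, 1911 Chemistry) • John Bardeen (1956, 1972 Physics) • Pauling (1954 Chemistry, 1962 Peace) • Sanger (1958, 1980 Chemistry)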

  36. References • Zhao et al. Hierarchical Feature Selection with Recursive Regularization. IJCAI, 2017: 3483-3489 • Xian-Ying Cheng, Wei-Juan Huang, Shi-Chang Hu, Hai-Lei Zhang, Hao Wang, Jing-Xian Zhang, Hong-Huang Lin, Yu-Zong Chen, Quan Zou*, Zhi-Liang Ji*. A global characterization and identification of multifunctional enzymes. PLoS One. 2012,7(6):e38979 • Quan Zou*, Yaozong Mao, Lingling Hu, Yunfeng Wu, Zhiliang Ji*. miRClassify: An advanced web server for miRNA family classification and annotation. Computers in Biology and Medicine. 2014, 45:157-160 • Leyi Wei, Minghong Liao, Yue Gao, Rongrong Ji, Zengyou He*, Quan Zou*. Improved and Promising Identification of Human MicroRNAs by Incorporating a High-quality Negative Set. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2014, 11(1):192-201 • Pei Li, Maozu Guo, Chunyu Wang, Xiaoyan Liu, Quan Zou*. An overview of SNP interactions in genome-wide association studies. Briefings in Functional Genomics. 2015, 14(2):143-155
