Explore hierarchical learning in bioinformatics, focusing on protein fold patterns, enzyme identification, microRNA families, and high-dimensionality problems such as gene expression and methylation profiles.
Hierarchical learning and high dimensionality problems in bioinformatics. Quan Zou (Ph.D., Professor), School of Computer Science and Technology, Tianjin University. 2017.9.8
Outline • Hierarchical learning in bioinformatics • Protein fold pattern • Enzyme identification • microRNA family • High dimensionality problems • Gene expression • Methylation profile • GWAS
Proteins >A1JRR3|1.1.1.79|1.1.1.81 MNIIFYHPFFEAKQWLSGLQSRLPTANIRQWRRGDTQPADYALVWQPPQEMLASRVELKGVFALGAGVDA ILDQERRHPGTLPAGVPLVRLEDTGMSLQMQEYVVATVLRYFRRMDEYQLQQQQKLWQPLEPHQHDKFTI GILGAGVLGKSVAHKLAEFGFTVRCWSRTPKQIDGVTSFAGQEKLPAFIQGTQLLINLLPHTPQTAGILN QSLFSQLNANAYIINIARGAHLLERDLLAAMNAGQVAAATLDVFAEEPLPSMHPFWSHPRVTITPHIAAV TLPEVAMDQVVANIQAMEAGREPVGLVDVVRGY >Q9NZB8|4.1.99.22|4.6.1.17 MAARPLSRMLRRLLRSSARSCSSGAPVTQPCPGESARAASEEVSRRRQFLREHAAPFSAFLTDSFGRQHS YLRISLTEKCNLRCQYCMPEEGVPLTPKANLLTTEEILTLARLFVKEGIDKIRLTGGEPLIRPDVVDIVA QLQRLEGLRTIGVTTNGINLARLLPQLQKAGLSAINISLDTLVPAKFEFIVRRKGFHKVMEGIHKAIELG YNPVKVNCVVMRGLN >Q8JU62|2.1.1.56|2.7.7.50 MAAVFGIQLVPKLNTSTTRRTFLPLRFDLLLDRLQSTNLHGVLYRALDFNPVDRSATVIQTYPPLNAWSP HPAFIENPLDYRDWTEFIHDRALAFVGVLTQRYPLTQNAQRYTNPLVLGAAFGDFLNARSIDIFLDRLFY GPTQESPITSITKFPYQWTIDFNVTADSVRTPAGCKYITLYGYDPSRPSTPATYGKHRPTYATVFYYSTL
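The records on this slide use FASTA headers of the form `>ID|EC1|EC2`, so each protein carries more than one EC label. A minimal sketch of how such records could be read into a multi-label structure follows; the function name `parse_multilabel_fasta` and the return layout are illustrative assumptions, not part of the talk.

```python
from typing import Dict, List, Tuple


def parse_multilabel_fasta(text: str) -> Dict[str, Tuple[List[str], str]]:
    """Parse FASTA text whose headers look like '>ID|LABEL1|LABEL2|...'.

    Returns {protein_id: (list_of_labels, sequence)}.
    """
    records: Dict[str, Tuple[List[str], str]] = {}
    current_id = None
    labels: List[str] = []
    chunks: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if current_id is not None:
                records[current_id] = (labels, "".join(chunks))
            fields = line[1:].split("|")
            current_id, labels, chunks = fields[0], fields[1:], []
        else:
            chunks.append(line)
    if current_id is not None:
        records[current_id] = (labels, "".join(chunks))
    return records


# One record copied from the slide.
example = """>Q9NZB8|4.1.99.22|4.6.1.17
MAARPLSRMLRRLLRSSARSCSSGAPVTQPCPGESARAASEEVSRRRQFLREHAAPFSAFLTDSFGRQHS
YLRISLTEKCNLRCQYCMPEEGVPLTPKANLLTTEEILTLARLFVKEGIDKIRLTGGEPLIRPDVVDIVA
QLQRLEGLRTIGVTTNGINLARLLPQLQKAGLSAINISLDTLVPAKFEFIVRRKGFHKVMEGIHKAIELG
YNPVKVNCVVMRGLN
"""
records = parse_multilabel_fasta(example)
ec_labels, sequence = records["Q9NZB8"]
print(ec_labels, len(sequence))
```

Keeping the labels as a list rather than a single string makes the multi-label nature of the enzyme problem explicit from the first step.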
Protein Classification -0.12972021 -0.10267122 0.05165671 -0.02537533 -0.02327581 0.01257873 -0.04431615 -0.03793824 0.00783558 -0.09035013 -0.04484774 -0.02480496 -0.01150325 -0.02400325 0.03616526 -0.13563429 -0.15971042 -0.00528393 >P04635|3.1.1.3|3.1.1.32 MKETKHQHTFSIRKSAYGAASVMVASCIFVIGGGVAEANDSTTQTTTPLEVAQTSQQETHTHQTPVTSLH TATPEHVDDSKEATPLPEKAESPKTEVTVQPSSHTQEVPALHKKTQQQPAYKDKTVPESTIASKSVESNK ATENEMSPVEHHASNVEKREDRLETNETTPPSVDREFSHKIINNTHVNPKTDGQTNVNVDTKTIDTVSPK DDRIDTAQPKQVDVPKENTTAQNKFTSQASDKKPTVKAAPEAVQNPENPKNKDPFVFVHGFTGFVGEVAA KGENHWGGTKANLRNHLRKAGYETYEASVSALASNHERAVELYYYLKGGRVDYGAAHSEKYGHERYGKTY >Q9NWT6|1.14.11.30|1.14.11.n4 MAATAAEAVASGSGEPREEAGALGPAWDESQLRSYSFPTRPIPRLSQSDPRAEELIENEEPVVLTDTNLV YPALKWDLEYLQENIGNGDFSVYSASTHKFLYYDEKKMANFQNFKPRSNREEMKFHEFVEKLQDIQQRGG EERLYLQQTLNDTVGRKIVMDFLGFNWNWINKQQGKRGWGQLTSNLLLIGMEGNVTPAHYDEQQNFFAQI KGYKRCILFPPDQFECLYPYPVHHPCDRQSQVDFDNPDYERFPNFQNVVGYETVVGPGDVLYIPMYWWHH IESLLNGGITITVNFWYKGAPTPKRIEYPLKAHQKVAIMRNIEKMLGEALGNPQEVGPLLNTMIKGRYN >P04418|3.2.2.17|4.2.99.18 MTRINLTLVSELADQHLMAEYRELPRVFGAVRKHVANGKRVRDFKISPTFILGAGHVTFFYDKLEFLRKR QIELIAECLKRGFNIKDTTVQDISDIPQEFRGDYIPHEASIAISQARLDEKIAQRPTWYKYYGKAIYA >A1JRR3|1.1.1.79|1.1.1.81 MNIIFYHPFFEAKQWLSGLQSRLPTANIRQWRRGDTQPADYALVWQPPQEMLASRVELKGVFALGAGVDA ILDQERRHPGTLPAGVPLVRLEDTGMSLQMQEYVVATVLRYFRRMDEYQLQQQQKLWQPLEPHQHDKFTI GILGAGVLGKSVAHKLAEFGFTVRCWSRTPKQIDGVTSFAGQEKLPAFIQGTQLLINLLPHTPQTAGILN QSLFSQLNANAYIINIARGAHLLERDLLAAMNAGQVAAATLDVFAEEPLPSMHPFWSHPRVTITPHIAAV TLPEVAMDQVVANIQAMEAGREPVGLVDVVRGY >Q9NZB8|4.1.99.22|4.6.1.17 MAARPLSRMLRRLLRSSARSCSSGAPVTQPCPGESARAASEEVSRRRQFLREHAAPFSAFLTDSFGRQHS YLRISLTEKCNLRCQYCMPEEGVPLTPKANLLTTEEILTLARLFVKEGIDKIRLTGGEPLIRPDVVDIVA QLQRLEGLRTIGVTTNGINLARLLPQLQKAGLSAINISLDTLVPAKFEFIVRRKGFHKVMEGIHKAIELG YNPVKVNCVVMRGLN >Q8JU62|2.1.1.56|2.7.7.50 MAAVFGIQLVPKLNTSTTRRTFLPLRFDLLLDRLQSTNLHGVLYRALDFNPVDRSATVIQTYPPLNAWSP HPAFIENPLDYRDWTEFIHDRALAFVGVLTQRYPLTQNAQRYTNPLVLGAAFGDFLNARSIDIFLDRLFY 
GPTQESPITSITKFPYQWTIDFNVTADSVRTPAGCKYITLYGYDPSRPSTPATYGKHRPTYATVFYYSTL -0.02425524 -0.05029627 0.0067438 -0.04724623 -0.08116538 0.03915287 0.05580992 -0.02495753 -0.05490753 0.0361518 0.04706983 -0.09807123 0.10447804 0.09917403 0.07816287 0.11267566 0.06060866 -0.01122177
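The slide shows sequences being turned into fixed-length numeric vectors before classification. The talk does not say which features produced the values shown, so the sketch below uses amino acid composition purely as a simple stand-in; the function name `amino_acid_composition` is an assumption made for illustration.

```python
from collections import Counter
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence: str) -> List[float]:
    """Return the fraction of each standard amino acid in `sequence`.

    Non-standard characters are ignored. This is a stand-in feature
    extractor, not the one used to produce the numbers on the slide.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


# First residues of P04418 from the slide, used only as sample input.
features = amino_acid_composition("MTRINLTLVSELADQHLMAEYRELPRVFGAVRKHVANGKRV")
print(len(features), round(sum(features), 6))
```

Whatever extractor is used, the point is the same: every protein, regardless of length, maps to a vector of the same dimension that a classifier can consume.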
Protein fold pattern problem http://scop.mrc-lmb.cam.ac.uk/scop/
Flat classification vs. hierarchical classification: mistakes from the hierarchical classification
Flat classification vs. hierarchical classification. Ref: Zhao et al. "Hierarchical Feature Selection with Recursive Regularization." IJCAI, 2017
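One way to see the difference between the two views: a flat classifier counts every wrong label as one error, while a hierarchy lets an error be weighted by how far apart the true and predicted labels sit in the tree. The sketch below is a generic illustration using SCOP-style dotted labels (class.fold.superfamily.family); it is not the method of the cited IJCAI paper, and the function names are assumptions.

```python
from typing import List


def common_prefix_depth(a: List[str], b: List[str]) -> int:
    """Number of leading hierarchy levels shared by two label paths."""
    depth = 0
    for x, y in zip(a, b):
        if x != y:
            break
        depth += 1
    return depth


def flat_error(true_label: str, predicted: str) -> int:
    """0/1 loss: every mistake costs the same."""
    return int(true_label != predicted)


def tree_distance(true_label: str, predicted: str) -> int:
    """Edges on the path between two nodes of the label tree."""
    t, p = true_label.split("."), predicted.split(".")
    k = common_prefix_depth(t, p)
    return (len(t) - k) + (len(p) - k)


# Same superfamily, wrong family: a mild mistake in the hierarchy.
print(flat_error("a.1.1.2", "a.1.1.3"), tree_distance("a.1.1.2", "a.1.1.3"))
# Wrong class at the top level: the worst possible mistake.
print(flat_error("a.1.1.2", "b.1.1.2"), tree_distance("a.1.1.2", "b.1.1.2"))
```

Both mistakes look identical under the flat loss, but the tree distance separates a sibling-family confusion from a top-level class confusion.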
Enzyme identification • Hierarchical • Uncertainty • Multi-label http://enzyme.expasy.org/
Enzyme identification • Hierarchical • Uncertainty • Multi-label http://enzyme.expasy.org/ ENZYME now includes entries with preliminary EC numbers. Preliminary EC numbers include an 'n' as part of the fourth (serial) digit (e.g. EC 3.5.1.n3).
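The three properties listed here (hierarchical, uncertain, multi-label) are all visible in the EC number itself. As a rough sketch, an EC string can be split into its four levels, with the preliminary 'n' serial flagged as the uncertain case; the class name `ECNumber` and the helper `parse_ec` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ECNumber:
    levels: Tuple[str, ...]  # four fields, e.g. ('3', '5', '1', 'n3')
    preliminary: bool        # True when the serial field starts with 'n'

    def path(self) -> List[str]:
        """Labels from the top-level class down to the full EC number."""
        return [".".join(self.levels[:i]) for i in range(1, len(self.levels) + 1)]


def parse_ec(ec: str) -> ECNumber:
    """Parse strings such as 'EC 3.5.1.n3' or '1.14.11.30'."""
    text = ec.strip()
    if text.upper().startswith("EC"):
        text = text[2:].strip()
    parts = tuple(text.split("."))
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated fields, got {ec!r}")
    return ECNumber(levels=parts, preliminary=parts[3].startswith("n"))


prelim = parse_ec("EC 3.5.1.n3")
print(prelim.path(), prelim.preliminary)
```

The `path()` list gives one label per level of the hierarchy, which is the form a top-down or level-wise classifier would typically need.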
The Flow of Protein Classification -0.12972021 -0.10267122 0.05165671 -0.02537533 -0.02327581 0.01257873 -0.04431615 -0.03793824 0.00783558 -0.09035013 -0.04484774 -0.02480496 -0.01150325 -0.02400325 0.03616526 -0.13563429 -0.15971042 -0.00528393 -0.02425524 -0.05029627 0.0067438 -0.04724623 -0.08116538 0.03915287 0.05580992 -0.02495753 -0.05490753 0.0361518 0.04706983 -0.09807123 0.10447804 0.09917403 0.07816287 0.11267566 0.06060866 -0.01122177
Identification of microRNA AUCGUGCAGAGACUAGACUGACAUCGUGCAGAGACUAGACUGACAUCGUGCAGAGACUAGACUGACAUCGUGCAGAGA CUAGACUGACAUCGUGCAGAGACUAG ACUGAC >1 tgcgcgaauucacccauggauccauucaucuuccaagggcaccagc >2 agcgcgaauuccaagucacccauggauccauucaucuggcagcgu >3 agucgcgaauucaucaucuuccaagggcacccauggauccaucca
Ref: Xue C, et al. Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine. BMC Bioinformatics, 2005, 6(1): 310.
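The local structure-sequence features in the cited paper combine the pairing state of three adjacent positions with the nucleotide in the middle, giving 4 x 8 = 32 triplet types. The sketch below is a simplified version that scans the whole sequence and assumes a dot-bracket secondary structure has already been computed by an external folding tool; the function name `triplet_features` is an assumption, and details of the original method (such as how the hairpin loop is handled) are not reproduced.

```python
from itertools import product
from typing import Dict


def triplet_features(sequence: str, structure: str) -> Dict[str, float]:
    """Normalized counts of the 32 structure-sequence triplets.

    `structure` is dot-bracket notation of the same length as `sequence`.
    Both '(' and ')' are treated as "paired" and written as '('.
    """
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have equal length")
    seq = sequence.upper().replace("T", "U")
    paired = "".join("(" if c in "()" else "." for c in structure)
    keys = [n + "".join(s) for n in "ACGU" for s in product("(.", repeat=3)]
    counts = dict.fromkeys(keys, 0)
    for i in range(1, len(seq) - 1):
        key = seq[i] + paired[i - 1:i + 2]
        if key in counts:  # skips ambiguous nucleotides such as 'N'
            counts[key] += 1
    total = sum(counts.values())
    return {k: (v / total if total else 0.0) for k, v in counts.items()}


# Toy hairpin, invented for illustration only.
feats = triplet_features("GGGAAACCC", "(((...)))")
print(len(feats), feats["G((("])
```

The resulting 32-dimensional vector can be passed to any standard classifier; the cited paper used a support vector machine.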
microRNA prediction based on machine learning • Obvious differences • Weak generalization
Importance of negative samples [Figure: positive training set, negative training set, negative testing set, decision boundary]
Importance of negative samples [Figure: positive training set, new negative training set, negative testing set, new decision boundary]
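The two figures can be mimicked with a one-dimensional toy example: when the negative training set is far from the positives, the decision boundary sits too low and hard negatives in the test set are accepted; replacing it with negatives closer to the positives moves the boundary. All numbers below are invented for illustration, and the midpoint-of-centroids rule is a deliberately simple stand-in for a real classifier.

```python
from statistics import mean
from typing import List


def centroid_boundary(positives: List[float], negatives: List[float]) -> float:
    """Decision threshold halfway between the two class means."""
    return (mean(positives) + mean(negatives)) / 2


def predict_positive(x: float, boundary: float) -> bool:
    """Assumes positives have the higher scores."""
    return x > boundary


# Invented toy scores, not data from the talk.
positives = [8.0, 9.0, 10.0]
easy_negatives = [0.0, 1.0, 2.0]   # obviously different from positives
hard_negatives = [5.0, 6.0, 7.0]   # closer to the positives
negative_test = [6.5, 7.0]         # unseen negatives near the positives

old_boundary = centroid_boundary(positives, easy_negatives)
new_boundary = centroid_boundary(positives, hard_negatives)

old_false_pos = sum(predict_positive(x, old_boundary) for x in negative_test)
new_false_pos = sum(predict_positive(x, new_boundary) for x in negative_test)
print(old_boundary, old_false_pos, new_boundary, new_false_pos)
```

With the easy negatives, both test negatives fall on the positive side of the boundary; after the negative set is replaced, neither does.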
[Flowchart labels: Human CDs; Human Mature microRNAs; Blast; Mature-like Reads; Extend (100nt, 100nt); Compute Secondary Structures; Extract Parameter; Filter; Mined Sequences; Replace; Original Negative Set; Prediction Model Rebuilt; innovation point]
Dinoflagellate genome. Lin, et al. The Symbiodinium kawagutii genome illuminates dinoflagellate gene expression and coral symbiosis. Science. 2015, 350(6261): 691-694.
Outline • Hierarchical learning in bioinformatics • Protein fold pattern • Enzyme identification • microRNA family • High dimensionality problems • Gene expression • Methylation profile • GWAS
High dimensionality problems • Sparse • Noisy
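Genome, GWAS and Watson • Age 15-19: University of Chicago • Age 19-22: Indiana University, Ph.D. • Advisor: Salvador Luria (Nobel Prize, 1969) • Idol: Muller (Nobel Prize, 1946; a student of Morgan) • Age 22-25: Cavendish Laboratory, University of Cambridge • Head of the laboratory: Lawrence Bragg (the youngest Nobel laureate in Physics) • Published the DNA double-helix structure in Nature • Age 25-40: professor at Harvard University • Age 34: Nobel Prize, 1962 • Age 40-79: director of Cold Spring Harbor Laboratory • Led the Human Genome Project. Two-time Nobel laureates: • Marie Curie (Physics 1903, Chemistry 1911) • John Bardeen (Physics 1956, 1972) • Linus Pauling (Chemistry 1954, Peace 1962) • Frederick Sanger (Chemistry 1958, 1980)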
References • Zhao et al. Hierarchical Feature Selection with Recursive Regularization. IJCAI, 2017: 3483-3489 • Xian-Ying Cheng, Wei-Juan Huang, Shi-Chang Hu, Hai-Lei Zhang, Hao Wang, Jing-Xian Zhang, Hong-Huang Lin, Yu-Zong Chen, Quan Zou*, Zhi-Liang Ji*. A global characterization and identification of multifunctional enzymes. PLoS One. 2012,7(6):e38979 • Quan Zou*, Yaozong Mao, Lingling Hu, Yunfeng Wu, Zhiliang Ji*. miRClassify: An advanced web server for miRNA family classification and annotation. Computers in Biology and Medicine. 2014, 45:157-160 • Leyi Wei, Minghong Liao, Yue Gao, Rongrong Ji, Zengyou He*, Quan Zou*. Improved and Promising Identification of Human MicroRNAs by Incorporating a High-quality Negative Set. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2014, 11(1):192-201 • Pei Li, Maozu Guo, Chunyu Wang, Xiaoyan Liu, Quan Zou*. An overview of SNP interactions in genome-wide association studies. Briefings in Functional Genomics. 2015, 14(2):143-155