
Bioinformatics Programming



  1. Bioinformatics Programming EE, NCKU Tien-Hao Chang (Darby Chang)

  2. Final Project

  3. Topic • Sequence alignment • Protein clustering • Classification • Other analysis techniques • association rule • frequent pattern • network

  4. Must be A web server

  5. Sequence alignment • First class • a novel sequence alignment algorithm • ClustalW • http://www.ebi.ac.uk/Tools/clustalw2/index.html • Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4680. • Second class • an application using ClustalW • E1DS • http://e1ds.ee.ncku.edu.tw/ • Chien,T.Y., Chang,D.T.H., Chen,C.Y., Weng,Y.Z. and Hsu,C.M. (2008) E1DS: catalytic site prediction based on 1D signatures of concurrent conservation. Nucleic Acids Res., 36, W291–W296.

  6. Protein clustering • First class • CD-HIT • http://weizhong-lab.ucsd.edu/cdhit_suite/cgi-bin/index.cgi • Li,W., Jaroszewski,L. and Godzik,A. (2001) Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics, 17, 282–283. • Second class • Protemot • http://protemot.ee.ncku.edu.tw/ • Chang,D.T.H., Weng,Y.Z., Lin,J.H., Hwang,M.J. and Oyang,Y.J. (2006) Protemot: prediction of protein binding sites with automatically extracted geometrical templates. Nucleic Acids Res., 34, W303–W309.

  7. First class • There might be some state-of-the-art packages • sequence alignment • BLAST (1990), ClustalW, FASTA, HMMER (1998), HHpred/HHsearch (2005), PSI-BLAST (1997), T-Coffee, SSEARCH and so on • overtaking them is very difficult, but there is still some room, especially for special-purpose alignment • Abascal,F., Zardoya,R. and Telford,M.J. (2010) TranslatorX: multiple alignment of nucleotide sequences guided by amino acid translations. Nucleic Acids Res. Advance Access published on April 30, 2010. • Some possible directions • add some constraints (special purpose), speed up the algorithm • combine multiple tools, e.g. domain-conserved alignment • instead of implementing from scratch • manipulate the input to existing packages (preprocessing) • start from the output of existing packages (postprocessing)

  8. Second class • The programming part is less challenging, but is still heavy and probably more fiddly • You need a good/interesting theme • predicting DNA-binding proteins • predicting protein-protein interactions • mapping any ID to a specific database • connecting predicted TFBS to DNA/RNA sequences • … • Implementing a specific algorithm and turning it into a web server might be okay • http://nar.oxfordjournals.org/papbyrecent.dtl has many up-to-date web servers

  9. In either class, you need to discuss it with me (image: http://www.flickr.com/photos/meteorry/3452536272/)

  10. Final project schedule (image: http://www.sxc.hu/photo/544232)

  11. Discuss with me Before 5/12 (a soft deadline)

  12. What is machine learning?

  13. A very trivial machine learning tool: K-Nearest-Neighbors (KNN) • The predicted class of the query sample is decided by a vote among its k nearest neighbors • Figure: O X X O O X O ? X X O O X X O (? marks the query sample)

  14. k = 3 • Figure: O X X O O X O O X X O O X X O (the query sample is now labeled O)

  15. k = 5 • Figure: O X X O O X O X X X O O X X O (the query sample is now labeled X)
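The voting rule on slides 13 to 15 can be sketched in a few lines of Python. This is only a minimal illustration: the 2-D coordinates below are hypothetical toy data (the slides do not give the positions of the O and X samples), and Euclidean distance is an assumed choice of metric. The data are picked so that the prediction flips between k = 3 and k = 5, as on the slides.

```python
from collections import Counter
import math


def knn_predict(train, query, k):
    """Return the majority label among the k training points nearest to query.

    train is a list of (point, label) pairs; points are tuples of floats.
    """
    # Sort training samples by Euclidean distance to the query (assumed metric).
    by_distance = sorted(train, key=lambda sample: math.dist(sample[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    # On a tie, most_common keeps first-seen order, so the nearer neighbor wins.
    return votes.most_common(1)[0][0]


# Hypothetical toy data, not the layout drawn on the slides.
train = [
    ((1.0, 0.0), "O"), ((0.0, 1.0), "O"),
    ((1.5, 1.5), "X"), ((-2.0, 0.5), "X"), ((0.5, -2.5), "X"),
]
query = (0.0, 0.0)

print(knn_predict(train, query, 3))  # two O and one X vote
print(knn_predict(train, query, 5))  # two O and three X vote
```

With k = 3 the two closest O samples outvote the single X; with k = 5 all three X samples join the vote and win.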

  16. Although KNN is very trivial, it can • Example: in vitro fertilization • Given: embryos described by 60 features • Problem: selection of embryos that will survive • Data: historical records of embryos and outcome • Given a set of known instances, predict the outcome for new instances • So, KNN learnt something related to “the definition of a good embryo”

  17. So, has KNN learnt something related to “the definition of a good embryo”?

  18. Can machines really learn? • Notice that here we call KNN a machine • Definitions of “learning” from dictionary: • To get knowledge of by study, experience, or being taught • To become aware by information or from observation • To commit to memory • To be informed of, ascertain; to receive instruction • Operational definition: • Things learn when they change their behavior in a way that makes them perform better in the future • Slide annotations: Difficult to measure • Trivial for computers • Does a slipper learn?

  19. In short, machine learning is • Knowledge/Information • Training data: a set of known instances • Testing data: a query instance • Machine: e.g. KNN • Outcome: class of the query instance

  20. Furthermore, learning is • When training data increases, it delivers a better (e.g. higher accuracy) outcome • Knowledge/Information • Training data: a set of known instances • Testing data: a query instance • Machine: e.g. KNN • Outcome: class of the query instance

  21. Classifier • In: two sets of samples, tr and te • Out: accuracy of using tr to predict te • Requirement • implement KNN with a parameter k • invoke RVKDE • complexity/teamwork report • using Perl would be the best • Bonus • invoke LIBSVM • a script to decide the best k in a range
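As a rough sketch of what the exercise asks for (accuracy of using tr to predict te, plus the bonus search for the best k), the following Python outline may help. The slide recommends Perl; Python is used here only for illustration. The function names `accuracy` and `best_k`, the Euclidean metric, and the toy samples are all assumptions, not part of the assignment.

```python
from collections import Counter
import math


def knn_predict(train, query, k):
    """Majority label among the k training points nearest to query."""
    by_distance = sorted(train, key=lambda sample: math.dist(sample[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


def accuracy(tr, te, k):
    """Fraction of samples in te whose label is predicted correctly from tr."""
    correct = sum(1 for point, label in te if knn_predict(tr, point, k) == label)
    return correct / len(te)


def best_k(tr, te, k_values):
    """Return the k with the highest accuracy (the first one wins on ties)."""
    return max(k_values, key=lambda k: accuracy(tr, te, k))


# Hypothetical toy samples: two clusters plus one noisy point labeled 2.
tr = [
    ((0.0, 0.0), 1), ((0.0, 1.0), 1), ((1.0, 0.0), 1),
    ((5.0, 5.0), 2), ((5.0, 6.0), 2), ((6.0, 5.0), 2),
    ((1.1, 1.1), 2),
]
te = [((0.2, 0.2), 1), ((5.5, 5.5), 2), ((1.0, 1.0), 1)]

for k in range(1, 6):
    print(k, accuracy(tr, te, k))
print("best k:", best_k(tr, te, range(1, 6)))
```

Each prediction scans and sorts all of tr, so evaluating te costs roughly O(|te| · |tr| log |tr|) distance-and-sort work, which is one way to start the complexity report. Picking k by its accuracy on te is optimistic; a held-out validation set or cross-validation is the more common practice.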

  22. Deadline 2010/5/11 23:59 • Zip your code, a step-by-step README, complexity analyses and anything worthy of extra credit. Email it to darby@ee.ncku.edu.tw.

  23. Materials for exercise 9 • Input sample (Iris) • http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/iris.scale
      1 1:-0.555556 2:0.25 3:-0.864407 4:-0.916667
      1 1:-0.666667 2:-0.166667 3:-0.864407 4:-0.916667
      1 1:-0.777778 3:-0.898305 4:-0.916667
    • Test your program on satimage • http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/satimage.scale.tr • http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/satimage.scale.t • RVKDE • http://mbi.ee.ncku.edu.tw/wiki/doku.php?id=rvkde
      $ wget http://mbi.ee.ncku.edu.tw/rvkde/res/rvkde-current-linux32.tgz
      $ tar zxvf rvkde-current-linux32.tgz
      $ rvkde-0.2.3-final/rvkde --classify --predict -v tr -V te -a 1 -b 1 --ks 10 --kt 10
    • rvkde has a built-in parameter tuning function (see --cv) • LIBSVM • http://www.csie.ntu.edu.tw/~cjlin/libsvm/ • see the manual • LIBSVM provides a parameter tuning script (see grid.py)
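The sample lines above use the sparse `label index:value` text format of the LIBSVM datasets, in which a missing index (feature 2 in the third Iris line) stands for zero. A minimal parsing sketch in Python follows; the function names and the choice to keep labels as strings are illustrative assumptions, and the number of features must be supplied by the caller.

```python
def parse_libsvm_line(line, n_features):
    """Parse one 'label index:value ...' line into (dense_vector, label).

    Indices are 1-based in the file; features that are not listed are zero.
    """
    parts = line.split()
    label = parts[0]  # kept as a string so any class naming works
    vector = [0.0] * n_features
    for item in parts[1:]:
        index, value = item.split(":")
        vector[int(index) - 1] = float(value)
    return vector, label


def load_libsvm(path, n_features):
    """Read a whole file in this format, skipping blank lines."""
    with open(path) as handle:
        return [parse_libsvm_line(line, n_features)
                for line in handle if line.strip()]


# The third Iris sample line from the slide: feature 2 is omitted.
vector, label = parse_libsvm_line("1 1:-0.777778 3:-0.898305 4:-0.916667", 4)
print(label, vector)
```

The resulting `(vector, label)` pairs can be fed directly to a distance-based classifier such as KNN.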

  24. Machine Learning and Bioinformatics

  25. Why these two fields? • From biologists’ view • There are abundant data to analyze • From computer guys’ view • The data are suitable (large and well-studied) • Biomedical problems are important • There are various computer science techniques for various Bioinformatics applications

  26. Circuit simulation (http://www.sophion.dk/sophion/Open-close2.jpg & http://alford.bios.uic.edu/Images/586%20images/circuit%20model) • Computer graphics (http://healthbolt.net/wp-content/uploads/2006/09/cell-animation.jpg & http://www.osmosis.com.au/animate/images/bloodcells.jpg) • Information retrieval (http://www.dashboardinsight.com/CMS/e01d472c-862e-4e13-8b23-591f8938889a/text_mining340x220.png) • Network analysis (http://upload.wikimedia.org/wikipedia/commons/thumb/6/68/Social-network.svg/430px-Social-network.svg.png)

  27. Applications, concepts and our approaches

  28. Our online services • Secondary structure prediction • Catalytic site prediction • Protein-ligand docking

  29. Secondary structure prediction • In biochemistry and structural biology, secondary structure (SSE) is the general three-dimensional form of local segments of biopolymers such as proteins and nucleic acids (DNA/RNA) http://upload.wikimedia.org/wikipedia/commons/thumb/6/60/Myoglobin.png/542px-Myoglobin.png

  30. Prote2S http://prote2s.csie.ntu.edu.tw/

  31. Prote2S http://prote2s.csie.ntu.edu.tw/ >SEQUENCE_1 MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAELVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHIPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTMGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL

  32. Concept of Prote2S • MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLG… • Knowledge/Information • <f1, f2, … fd> • Training data: residues with known SSE • Testing data: a residue as a vector • Machine: e.g. KNN • Outcome: SSE of the query residue

  33. Feature encoding • Most classifiers deal with a vector space • Feature encoding means generating the vector representation of a real-world instance • An instance is represented as several important attributes, or say, features • Figure: >1A0OB RQLALEAKGETPSAVTRLSVVAKSEPQDEQSRSQSPRRIILS… → PSI-BLAST → PSSM → Feature vector
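To make "a residue as a vector" concrete, here is a minimal Python sketch of window-based feature encoding. It is only a stand-in: the slide derives features from a PSI-BLAST PSSM, whereas this sketch uses a plain one-hot code per residue so that it runs without external tools. The window size, the function name `encode_residue`, and the zero padding at sequence ends are all assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def encode_residue(seq, pos, half_window=2):
    """Encode the residue at seq[pos] together with its sequence neighbors.

    Each window position contributes a 20-dimensional one-hot block;
    positions outside the sequence (or unknown letters) stay all zero.
    A PSSM-based encoder would put 20 profile scores in each block instead.
    """
    vector = []
    for i in range(pos - half_window, pos + half_window + 1):
        block = [0.0] * len(AMINO_ACIDS)
        if 0 <= i < len(seq):
            j = AMINO_ACIDS.find(seq[i])
            if j >= 0:
                block[j] = 1.0
        vector.extend(block)
    return vector


seq = "MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLG"
features = encode_residue(seq, 0)
print(len(features))  # (2 * half_window + 1) * 20 dimensions
```

Every residue of a protein with known secondary structure then becomes one training instance, and every residue of a query protein becomes one testing instance.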

  34. Similar applications to Prote2S • Disorder region • Conformational switch • Solvent accessibility • Protein-protein interaction • Images: http://www.oup.com/uk/orc/bin/9780199265114/resources/anim/figs/f2-9.gif http://stke.sciencemag.org/content/jbc/vol280/issue7/images/large/zbc0060586670006.jpeg

  35. A family tree Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques (Second Edition)

  36. Family tree represented as a table The “sister-of” relation Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques (Second Edition)

  37. Catalytic site prediction • The catalytic site is usually a small pocket at the surface of the enzyme that contains residues responsible for the substrate specificity and catalytic residues which often act as proton donors or acceptors or are responsible for binding a cofactor http://fig.cox.miami.edu/~cmallery/255/255enz/ES_complex.jpg

  38. E1DS http://e1ds.ee.ncku.edu.tw/

  39. E1DS http://e1ds.ee.ncku.edu.tw/ >Paste your sequence in FASTA format to replace the sample FASTA here MIFSVDAVRADFPVLSREVNGLPLAYLDSAASAQKPSQVIDAEAEFYRHGYAAVHAGAHTLSAQATEKMENVRKRASLFI NARSAEELVFVRGTTEGINLVANSWGNSNVRAGDNIIISQMEHHANIVPWQMLCARVGAELRVIPLNPDGTLQLETLPTL FDAATRLLAITHVSNVLGTENPLAEMITLAHQHGAKVLVDGAQAVMHHPVDVQALDCDFYVFSGHKLYGPTGIGILYVKE ALLQEMPPWEGGGSMIATVSLSEGTTWTKAPWRFEAGTPNTGGIIGLGAALEYVSALGLNNIAEYEQNLMHYALSQLESV PDLTLYGPQARLGVIAFNLGAHHAYDVGSFLDNYGIAVRTGHHCAMPLMAYYNVPAMCRASLAMYNTHEEVDRLVTGLQR IHRLLG

  40. Concept of E1DS http://jkweb.berkeley.edu/external/pdb/2004/heme/fig1.jpg

  41. Concept of E1DS http://jkweb.berkeley.edu/external/pdb/2004/heme/fig1.jpg http://www.biomedcentral.com/content/figures/1471-2105-8-S5-S8-6-l.jpg

  42. Concept of E1DS http://jkweb.berkeley.edu/external/pdb/2004/heme/fig1.jpg http://www.biomedcentral.com/content/figures/1471-2105-8-S5-S8-6-l.jpg http://www.biomedcentral.com/content/figures/1471-2105-8-S5-S8-3-l.jpg

  43. Allowing large flexible gaps • >1RPX:ASRVDKFSKSDIIVSPSILSANFSKLGEQVKAIEQAGCDWIHVDVMDGRFVPNITIGPLVVDSLRPITDLPLDVHLMIVEPDQRVPDFIKAGADIVSVHCEQSSTIHLHRTINQIKSLGAKAGVVLNPGTPLTAIEYVLDAVDLVLIMSVNPGFGGQSFIESQVKKISDLRKICAERGLNPWIEVDGGVGPKNAYKVIEAGANALVAGSAVFGAPDYAEAIKGIKTSKRPE • PROSITE pattern • [LIVMA]-x-[LIVM]-M-[ST]-[VS]-x-P-x(3)-[GN]-Q-x(0,1)-[FMK]-x(6)-[NKR]-[LIVMC] • Our pattern • H-x-D-x-M-D-x(94,144)-M-x-V-x-P-G-x(3)-Q-x(22,32)-D-G-G
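Both patterns on this slide use PROSITE syntax, which translates almost directly into a regular expression: `x` is any residue, `[...]` lists allowed residues, and `(n)` or `(n,m)` is a repeat count. The Python sketch below shows one possible translation. The function name is hypothetical, and the `<` and `>` terminus anchors of full PROSITE syntax are deliberately not handled.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression.

    Supported elements: single residues, x (any residue), [ABC] (allowed set),
    {ABC} (forbidden set), and repeat suffixes (n) or (n,m).
    """
    pieces = []
    for element in pattern.rstrip(".").split("-"):
        # Separate the residue specification from an optional repeat suffix.
        match = re.fullmatch(r"([^()]+)(?:\((\d+(?:,\d+)?)\))?", element)
        if match is None:
            raise ValueError("unsupported pattern element: " + element)
        core, repeat = match.groups()
        if core in ("x", "X"):
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"
        # [ABC] and single letters are already valid regex fragments.
        pieces.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(pieces)


ours = "H-x-D-x-M-D-x(94,144)-M-x-V-x-P-G-x(3)-Q-x(22,32)-D-G-G"
regex = prosite_to_regex(ours)
print(regex)

# Synthetic sequence (not 1RPX) built to satisfy the gap ranges.
synthetic = "HADAMD" + "G" * 100 + "MAVAPG" + "AAA" + "Q" + "A" * 25 + "DGG"
print(bool(re.search(regex, synthetic)))
```

The translation makes the difference between the two patterns visible: the PROSITE pattern only allows short gaps such as `.{0,1}` and `.{6}`, while the slide's own pattern spans gaps of up to 144 residues between conserved blocks.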

  44. Applications using pattern mining • Transcription factor binding site • Protein-protein interaction • Images: http://stke.sciencemag.org/content/jbc/vol280/issue7/images/large/zbc0060586670006.jpeg http://www.cs.uiuc.edu/homes/sinhas/img/DAILYILLINI.jpg & http://upload.wikimedia.org/wikipedia/en/thumb/8/8d/ChIP-on-chip_wet-lab.png/400px-ChIP-on-chip_wet-lab.png

  45. Protein-ligand docking • The goal of protein-ligand docking is to predict the position and orientation of a ligand (a small molecule) when it is bound to a protein receptor http://www-ucc.ch.cam.ac.uk/research/images/docking-small.jpg

  46. MEDock http://medock.csie.ntu.edu.tw/

  47. MEDock http://medock.csie.ntu.edu.tw/ http://gemdock.life.nctu.edu.tw/dock/images/1jff-sum.gif

  48. Concept of MEDock http://www.mathworks.com/cmsimages/op_main_wl_3250.jpg

  49. Genetic algorithm • Mutation (http://wellington.pm.org/archive/200505/oo-perl/images/mutation.jpg) • Crossover (http://www.mannosidosis.org/images/inheritance.jpg) • Natural selection (http://myconstructionphotos.smugmug.com/photos/58510454-M-1.jpg)
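The three operators named on this slide can be sketched as a toy genetic algorithm in Python. This is a generic illustration, not the optimizer used by MEDock or any docking tool: the real-valued encoding, truncation selection, uniform crossover, Gaussian mutation, all parameter values, and the toy fitness function are assumptions.

```python
import random


def genetic_algorithm(fitness, n_genes, pop_size=30, generations=50,
                      mutation_rate=0.1, seed=0):
    """Maximize fitness over real-valued gene vectors; return (best, history)."""
    rng = random.Random(seed)
    population = [[rng.uniform(-5, 5) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        # Natural selection: only the fitter half survives, unchanged.
        survivors = population[:pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            mother, father = rng.sample(survivors, 2)
            # Crossover: each gene is inherited from one of the two parents.
            child = [rng.choice(pair) for pair in zip(mother, father)]
            # Mutation: occasionally perturb a gene with Gaussian noise.
            child = [gene + rng.gauss(0, 0.5) if rng.random() < mutation_rate
                     else gene for gene in child]
            children.append(child)
        population = survivors + children
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history


def toy_fitness(genes):
    """Hypothetical objective with its maximum (0) at x = 1, y = -2."""
    x, y = genes
    return -((x - 1) ** 2 + (y + 2) ** 2)


best, history = genetic_algorithm(toy_fitness, n_genes=2)
print(best, history[-1])
```

Because survivors are carried over unchanged, the best fitness never decreases from one generation to the next. In a docking setting the genes would instead describe the ligand's position and orientation, and the fitness would be a binding energy score.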

  50. Applications using optimization • Protein folding • Gene network • Microarray analysis • Images: http://cnx.org/content/m11461/latest/protein_folding.jpg http://research.microsoft.com/users/manuelrg/microarray.gif http://www.ehponline.org/members/2007/10358/fig2.jpg
