Bioinformatics Programming

Bioinformatics Programming EE, NCKU Tien-Hao Chang (Darby Chang)

Final Project

Topic • Sequence alignment • Protein clustering • Classification • Other analysis techniques • association rule • frequent pattern • network

Must be a web server

Sequence alignment • First class • a novel sequence alignment algorithm • ClustalW • http://www.ebi.ac.uk/Tools/clustalw2/index.html • Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) Clustal-W—Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4680. • Second class • an application using ClustalW • E1DS • http://e1ds.ee.ncku.edu.tw/ • Chien,T.Y., Chang,D.T.H., Chen,C.Y., Weng,Y.Z. and Hsu,C.M. (2008) E1DS: catalytic site prediction based on 1D signatures of concurrent conservation. Nucleic Acids Res., 36, W291–W296.

Protein clustering • First class • CD-HIT • http://weizhong-lab.ucsd.edu/cdhit_suite/cgi-bin/index.cgi • Li,W., Jaroszewski,L. and Godzik,A. (2001) Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics, 17, 282–283. • Second class • Protemot • http://protemot.ee.ncku.edu.tw/ • Chang,D.T.H., Weng,Y.Z., Lin,J.H., Hwang,M.J. and Oyang,Y.J. (2006) Protemot: prediction of protein binding sites with automatically extracted geometrical templates. Nucleic Acids Res., 34, W303–W309.

First class • There might be some state-of-the-art packages • sequence alignment • BLAST (1990), ClustalW, FASTA, HMMER (1998), HHpred/HHsearch (2005), PSI-BLAST (1997), T-coffee, SSEARCH and so on • overtaking them is very difficult, but there is still some room, especially for special-purpose alignment • Abascal,F., Zardoya,R., and Telford,M.J. (2010) TranslatorX: multiple alignment of nucleotide sequences guided by amino acid translations. Nucleic Acids Res. Advance Access published on April 30, 2010. • Some possible directions • add some constraints (special purpose), speed up the algorithm • combine multiple tools, e.g. domain-conserved alignment • instead of implementing from scratch • manipulate the input to existing packages (preprocessing) • start from the output of existing packages (postprocessing)

Second class • The programming part is less challenging, but is still heavy and probably more niggling • You need a good/interesting theme • predicting DNA-binding proteins • predicting protein-protein interactions • mapping any ID to a specific database • connecting predicted TFBS to DNA/RNA sequences • … • Implementing a specific algorithm and turning it into a web server might be okay • http://nar.oxfordjournals.org/papbyrecent.dtl has many up-to-date web servers

In either class, you need to discuss your project with me • Image: http://www.flickr.com/photos/meteorry/3452536272/

Final project schedule • Image: http://www.sxc.hu/photo/544232

Discuss with me before 5/12 (a soft deadline)

What is machine learning?

A very trivial machine learning tool: K-Nearest-Neighbors (KNN) • The predicted class of the query sample depends on the voting among its k nearest neighbors • O X X O O X O ? X X O O X X O (? marks the query sample)

k = 3 O X X O O X O O X X O O X X O

k = 5 O X X O O X O X X X O O X X O
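The voting rule on these slides can be sketched in Python. This is a minimal sketch, assuming Euclidean distance and two-dimensional points; the function name `knn_predict` and the coordinates in `train` are hypothetical, chosen only so that the prediction flips between k = 3 and k = 5 as on the slides.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Predict the class of `query` by majority vote among its k nearest samples.

    train: list of (features, label) pairs; features are equal-length tuples.
    """
    # Sort the known samples by Euclidean distance to the query, keep the k closest.
    neighbors = sorted(train, key=lambda sample: math.dist(sample[0], query))[:k]
    # Count the labels among those neighbors and return the most frequent one.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two O samples close to the query, three X samples farther away.
train = [
    ((1, 0), "O"),
    ((0, 2), "O"),
    ((3, 0), "X"),
    ((0, 4), "X"),
    ((5, 0), "X"),
]
query = (0, 0)

print(knn_predict(train, query, 3))  # two O votes against one X vote
print(knn_predict(train, query, 5))  # two O votes against three X votes
```

With k = 3 the two nearby O samples win the vote; with k = 5 the three X samples outvote them, so the choice of k changes the prediction.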

Although KNN is very trivial, it can • Example: in vitro fertilization • Given: embryos described by 60 features • Problem: selection of embryos that will survive • Data: historical records of embryos and outcome • Given a set of known instances, predict outcome for newly coming instances • So, KNN learnt something related to “the definition of a good embryo”

Although KNN is very trivial, it can • Example: in vitro fertilization • Given: embryos described by 60 features • Problem: selection of embryos that will survive • Data: historical records of embryos and outcome • Given a set of known instances, predict outcome for newly coming instances • So, KNN learnt something related to “the definition of a good embryo”?

Can machines really learn? • Notice that here we call KNN a machine • Definitions of “learning” from dictionary: • To get knowledge of by study, experience, or being taught • To become aware by information or from observation • To commit to memory • To be informed of, ascertain; to receive instruction • Operational definition: • Things learn when they change their behavior in a way that makes them perform better in the future • Difficult to measure • Trivial for computers • Does a slipper learn?

In short, machine learning is • Knowledge/Information • Training data: a set of known instances • Testing data: a query instance • Machine: e.g. KNN • Outcome: class of the query instance

Furthermore, learning is • Knowledge/Information • Training data: a set of known instances • Testing data: a query instance • Machine: e.g. KNN • Outcome: class of the query instance • When the training data increases, the machine delivers a better (e.g. higher-accuracy) outcome

Classifier • In: two sets of samples, tr and te • Out: accuracy of using tr to predict te • Requirement - implement KNN with a parameter k - invoke RVKDE - complexity/teamwork report - using Perl would be the best • Bonus - invoke LIBSVM - a script to decide the best k in a range

Deadline 2010/5/11 23:59 Zip your code, a step-by-step README, complexity analyses and anything worthy of extra credit. Email to darby@ee.ncku.edu.tw.

Materials for exercise 9 • Input sample (Iris) • http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/iris.scale
1 1:-0.555556 2:0.25 3:-0.864407 4:-0.916667
1 1:-0.666667 2:-0.166667 3:-0.864407 4:-0.916667
1 1:-0.777778 3:-0.898305 4:-0.916667
• Test your program on satimage • http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/satimage.scale.tr • http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/satimage.scale.t • RVKDE • http://mbi.ee.ncku.edu.tw/wiki/doku.php?id=rvkde
$ wget http://mbi.ee.ncku.edu.tw/rvkde/res/rvkde-current-linux32.tgz
$ tar zxvf rvkde-current-linux32.tgz
$ rvkde-0.2.3-final/rvkde --classify --predict -v tr -V te -a 1 -b 1 --ks 10 --kt 10
• rvkde has a built-in parameter-tuning function (see --cv) • LIBSVM • http://www.csie.ntu.edu.tw/~cjlin/libsvm/ • see the manual • LIBSVM provides a parameter-tuning script (see grid.py)
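The input sample uses a sparse "label index:value" layout in which features with value 0 are omitted (the third Iris record has no feature 2). The following is a minimal Python sketch of reading one such record and of computing the accuracy asked for in the exercise. The function names `parse_sparse_line` and `accuracy` are hypothetical, and the number of features (`dim`) is assumed to be known in advance.

```python
def parse_sparse_line(line, dim):
    """Parse 'label idx:val idx:val ...' into (label, dense feature list).

    Indices are 1-based; any index not listed keeps the value 0.0.
    """
    fields = line.split()
    label = fields[0]
    features = [0.0] * dim
    for item in fields[1:]:
        index, value = item.split(":")
        features[int(index) - 1] = float(value)
    return label, features


def accuracy(predicted, actual):
    """Fraction of test samples whose predicted label equals the true label."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Third record of the Iris sample: feature 2 is missing, so it becomes 0.0.
label, features = parse_sparse_line("1 1:-0.777778 3:-0.898305 4:-0.916667", dim=4)
print(label, features)

# Hypothetical predictions against true labels: two of three are correct.
print(accuracy(["1", "2", "2"], ["1", "2", "3"]))
```

Labels are kept as strings here so that any class naming works; converting them to integers is an equally reasonable choice.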

Machine Learning and Bioinformatics

Why these two fields? • From biologists’ view • There are abundant data to analyze • From computer guys’ view • The data are suitable (large and well-studied) • Biomedical problems are important • There are various computer science techniques for various Bioinformatics applications

Circuit simulation • http://www.sophion.dk/sophion/Open-close2.jpg & http://alford.bios.uic.edu/Images/586%20images/circuit%20model • Computer graphics • http://healthbolt.net/wp-content/uploads/2006/09/cell-animation.jpg & http://www.osmosis.com.au/animate/images/bloodcells.jpg • Information retrieval • http://www.dashboardinsight.com/CMS/e01d472c-862e-4e13-8b23-591f8938889a/text_mining340x220.png • Network analysis • http://upload.wikimedia.org/wikipedia/commons/thumb/6/68/Social-network.svg/430px-Social-network.svg.png

Applications, concepts and our approaches

Our online services • Secondary structure prediction • Catalytic site prediction • Protein-ligand docking

Secondary structure prediction • In biochemistry and structural biology, secondary structure (SSE) is the general three-dimensional form of local segments of biopolymers such as proteins and nucleic acids (DNA/RNA) • http://upload.wikimedia.org/wikipedia/commons/thumb/6/60/Myoglobin.png/542px-Myoglobin.png

Prote2S • http://prote2s.csie.ntu.edu.tw/ • Sample input:
>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAELVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHIPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTMGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL

Concept of Prote2S • Query sequence: MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLG… • Knowledge/Information • Training data: residues with known SSE • Testing data: a residue as a vector <f1, f2, … fd> • Machine: e.g. KNN • Outcome: SSE of the query residue

Feature encoding • Most classifiers deal with a vector space • Feature encoding means generating the vector representation of a real-world instance • An instance is represented as several important attributes, or say, features • Pipeline: >1A0OB RQLALEAKGETPSAVTRLSVVAKSEPQDEQSRSQSPRRIILS… → PSI-BLAST → PSSM → Feature vector
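One common way to turn a PSSM into a per-residue feature vector is to concatenate the PSSM rows in a window centered on the residue. The sketch below illustrates that idea only; it is not necessarily the encoding Prote2S uses. The function name `window_features`, the zero padding at the sequence ends, and the toy 3-column matrix (a real PSSM has 20 columns) are all assumptions made for illustration.

```python
def window_features(pssm, center, half_width):
    """Concatenate the PSSM rows from center - half_width to center + half_width.

    pssm: list of rows, one per residue, all with the same number of columns.
    Positions that fall outside the sequence are padded with zeros.
    """
    n_cols = len(pssm[0])
    vector = []
    for pos in range(center - half_width, center + half_width + 1):
        if 0 <= pos < len(pssm):
            vector.extend(pssm[pos])
        else:
            vector.extend([0] * n_cols)
    return vector


# Hypothetical toy PSSM for a 3-residue sequence, 3 columns per residue.
toy_pssm = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]

print(window_features(toy_pssm, center=0, half_width=1))  # left neighbor is padding
print(window_features(toy_pssm, center=1, half_width=1))  # full window
```

Each residue then becomes one fixed-length vector, which is the form a classifier such as KNN expects.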

Similar applications to Prote2S • Disorder region • Conformational switch • Solvent accessibility • Protein-protein interaction • Images: http://www.oup.com/uk/orc/bin/9780199265114/resources/anim/figs/f2-9.gif & http://stke.sciencemag.org/content/jbc/vol280/issue7/images/large/zbc0060586670006.jpeg

A family tree Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques (Second Edition)

Family tree represented as a table The “sister-of” relation Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques (Second Edition)

Catalytic site prediction • The catalytic site is usually a small pocket at the surface of the enzyme that contains residues responsible for the substrate specificity and catalytic residues which often act as proton donors or acceptors or are responsible for binding a cofactor • http://fig.cox.miami.edu/~cmallery/255/255enz/ES_complex.jpg

E1DS • http://e1ds.ee.ncku.edu.tw/ • Sample input:
>Paste your sequence in FASTA format to replace the sample FASTA here
MIFSVDAVRADFPVLSREVNGLPLAYLDSAASAQKPSQVIDAEAEFYRHGYAAVHAGAHTLSAQATEKMENVRKRASLFI
NARSAEELVFVRGTTEGINLVANSWGNSNVRAGDNIIISQMEHHANIVPWQMLCARVGAELRVIPLNPDGTLQLETLPTL
FDAATRLLAITHVSNVLGTENPLAEMITLAHQHGAKVLVDGAQAVMHHPVDVQALDCDFYVFSGHKLYGPTGIGILYVKE
ALLQEMPPWEGGGSMIATVSLSEGTTWTKAPWRFEAGTPNTGGIIGLGAALEYVSALGLNNIAEYEQNLMHYALSQLESV
PDLTLYGPQARLGVIAFNLGAHHAYDVGSFLDNYGIAVRTGHHCAMPLMAYYNVPAMCRASLAMYNTHEEVDRLVTGLQR
IHRLLG

Concept of E1DS • http://jkweb.berkeley.edu/external/pdb/2004/heme/fig1.jpg • http://www.biomedcentral.com/content/figures/1471-2105-8-S5-S8-6-l.jpg • http://www.biomedcentral.com/content/figures/1471-2105-8-S5-S8-3-l.jpg

Allowing large flexible gaps • >1RPX:ASRVDKFSKSDIIVSPSILSANFSKLGEQVKAIEQAGCDWIHVDVMDGRFVPNITIGPLVVDSLRPITDLPLDVHLMIVEPDQRVPDFIKAGADIVSVHCEQSSTIHLHRTINQIKSLGAKAGVVLNPGTPLTAIEYVLDAVDLVLIMSVNPGFGGQSFIESQVKKISDLRKICAERGLNPWIEVDGGVGPKNAYKVIEAGANALVAGSAVFGAPDYAEAIKGIKTSKRPE • PROSITE pattern • [LIVMA]-x-[LIVM]-M-[ST]-[VS]-x-P-x(3)-[GN]-Q-x(0,1)-[FMK]-x(6)-[NKR]-[LIVMC] • Our pattern • H-x-D-x-M-D-x(94,144)-M-x-V-x-P-G-x(3)-Q-x(22,32)-D-G-G
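Both patterns above are written in PROSITE syntax, which maps closely onto regular expressions: `x` is any residue, `[..]` is a set of allowed residues, and `x(a,b)` is a gap of a to b residues. The Python sketch below converts such a pattern into a regular expression. The helper name `prosite_to_regex` is hypothetical, and it handles only the subset of the syntax used here plus `{..}` exclusions and `<`/`>` anchors.

```python
import re

# One PROSITE element: optional '<', a core (residue, x, [set] or {excluded set}),
# an optional repetition such as (3) or (0,1), and an optional '>'.
ELEMENT = re.compile(
    r"(<?)([A-Zx]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)"
)


def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern into a Python regular expression string."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError("unsupported pattern element: %r" % element)
        start, core, low, high, end = match.groups()
        if core == "x":
            core = "."                        # x matches any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # {..} lists forbidden residues
        repeat = ""
        if low is not None:
            repeat = "{%s,%s}" % (low, high) if high is not None else "{%s}" % low
        parts.append(("^" if start else "") + core + repeat + ("$" if end else ""))
    return "".join(parts)


prosite = "[LIVMA]-x-[LIVM]-M-[ST]-[VS]-x-P-x(3)-[GN]-Q-x(0,1)-[FMK]-x(6)-[NKR]-[LIVMC]"
ours = "H-x-D-x-M-D-x(94,144)-M-x-V-x-P-G-x(3)-Q-x(22,32)-D-G-G"

print(prosite_to_regex(prosite))
print(prosite_to_regex(ours))

# The compact PROSITE pattern matches a short local segment.
segment = "AAVLIMSVNPGFGGQSFIESQVKKIAA"
print(re.search(prosite_to_regex(prosite), segment) is not None)
```

The second pattern differs mainly in its gap elements: `x(94,144)` and `x(22,32)` become `.{94,144}` and `.{22,32}`, which lets conserved residues far apart in sequence belong to one signature.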

Applications using pattern mining • Transcription factor binding site • Protein-protein interaction • Images: http://stke.sciencemag.org/content/jbc/vol280/issue7/images/large/zbc0060586670006.jpeg • http://www.cs.uiuc.edu/homes/sinhas/img/DAILYILLINI.jpg & http://upload.wikimedia.org/wikipedia/en/thumb/8/8d/ChIP-on-chip_wet-lab.png/400px-ChIP-on-chip_wet-lab.png

Protein-ligand docking • The goal of protein-ligand docking is to predict the position and orientation of a ligand (a small molecule) when it is bound to a protein receptor • http://www-ucc.ch.cam.ac.uk/research/images/docking-small.jpg

MEDock • http://medock.csie.ntu.edu.tw/ • http://gemdock.life.nctu.edu.tw/dock/images/1jff-sum.gif

Concept of MEDock http://www.mathworks.com/cmsimages/op_main_wl_3250.jpg

Genetic algorithm • Mutation • http://wellington.pm.org/archive/200505/oo-perl/images/mutation.jpg • Crossover • http://www.mannosidosis.org/images/inheritance.jpg • Natural selection • http://myconstructionphotos.smugmug.com/photos/58510454-M-1.jpg
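The three operators named on this slide can be sketched as a small generic genetic algorithm in Python. This is an illustration under stated assumptions, not the optimizer used in MEDock: candidates are bit strings, the fitness is simply the number of 1 bits, and all function names and parameter values (`evolve`, `rate`, `pop_size`, and so on) are hypothetical.

```python
import random


def fitness(bits):
    """Toy objective: the number of 1 bits (higher is better)."""
    return sum(bits)


def mutate(bits, rate, rng):
    """Mutation: flip each bit independently with probability `rate`."""
    return [1 - b if rng.random() < rate else b for b in bits]


def crossover(a, b, rng):
    """Crossover: take the head of one parent and the tail of the other."""
    point = rng.randint(1, len(a) - 1)
    return a[:point] + b[point:]


def evolve(n_bits=20, pop_size=30, generations=40, rate=0.05, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        # Natural selection: only the fitter half survives and reproduces.
        parents = population[: pop_size // 2]
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rate, rng)
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    return max(population, key=fitness), history


best, history = evolve()
print(fitness(best), history[0])
```

Because the surviving parents are carried over unchanged, the best fitness never decreases from one generation to the next. For a docking problem, the bit string would presumably be replaced by a ligand position and orientation, and the fitness by an energy or scoring function.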

Applications using optimization • Protein folding • Gene network • Microarray analysis • Images: http://cnx.org/content/m11461/latest/protein_folding.jpg • http://research.microsoft.com/users/manuelrg/microarray.gif • http://www.ehponline.org/members/2007/10358/fig2.jpg
