Limsoon Wong Institute for Infocomm Research Singapore

From Informatics to Bioinformatics Limsoon Wong Institute for Infocomm Research Singapore

What is Bioinformatics?

Themes of Bioinformatics
• Bioinformatics = Data Mgmt + Knowledge Discovery
• Data Mgmt = Integration + Transformation + Cleansing
• Knowledge Discovery = Statistics + Algorithms + Databases

Benefits of Bioinformatics
• To the patient: better drug, better treatment
• To the pharma: save time, save cost, make more $
• To the scientist: better science

From Informatics to Bioinformatics: 8 years of bioinformatics R&D in Singapore
[Timeline figure. Year axis: 1994, 1996, 1998, 2000, 2002. Organisation labels: ISS, KRDL, LIT/I2R. Projects shown:]
• Integration Technology (Kleisli)
• Cleansing & Warehousing (FIMM)
• MHC-Peptide Binding (PREDICT)
• Gene Feature Recognition (Dragon)
• Gene Expression & Medical Record Datamining (PCL)
• Protein Interactions Extraction (PIES)
• Venom Informatics

Data Integration A DOE “impossible query”: For each gene on a given cytogenetic band, find its non-human homologs.

Data Integration Results

    sybase-add (#name: "GDB", ...);
    create view L from locus_cyto_location using GDB;
    create view E from object_genbank_eref using GDB;
    select #accn: g.#genbank_ref, #nonhuman-homologs: H
    from L as c, E as g,
      {select u
       from g.#genbank_ref.na-get-homolog-summary as u
       where not(u.#title string-islike "%Human%")
       andalso not(u.#title string-islike "%H.sapien%")} as H
    where c.#chrom_num = "22"
    andalso g.#object_id = c.#locus_id
    andalso not (H = { });

• Using Kleisli, the query is clear, succinct, and efficient
• Kleisli handles heterogeneity and complexity
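To make the shape of this query concrete, here is a minimal Python sketch of the same join-and-filter logic over in-memory data. Everything in it is hypothetical: the toy rows, the accession strings, and the `get_homolog_summary` helper merely stand in for the two GDB views and the remote homolog lookup. It is not Kleisli code and says nothing about how Kleisli evaluates the query.

```python
# Toy stand-ins for the two GDB views (all values are made up for illustration).
locus_cyto_location = [
    {"locus_id": 1, "chrom_num": "22"},
    {"locus_id": 2, "chrom_num": "21"},
    {"locus_id": 3, "chrom_num": "22"},
]
object_genbank_eref = [
    {"object_id": 1, "genbank_ref": "ACC_A"},
    {"object_id": 2, "genbank_ref": "ACC_B"},
    {"object_id": 3, "genbank_ref": "ACC_C"},
]

# Hypothetical replacement for the remote homolog-summary lookup.
HOMOLOGS = {
    "ACC_A": [{"title": "Human gene X"}, {"title": "Mouse gene X"}],
    "ACC_B": [{"title": "Rat gene Y"}],
    "ACC_C": [{"title": "H.sapiens gene Z"}],
}


def get_homolog_summary(accession):
    return HOMOLOGS.get(accession, [])


def nonhuman_homologs(chrom_num):
    """For each gene on the given chromosome, collect its non-human homologs."""
    results = []
    for c in locus_cyto_location:
        if c["chrom_num"] != chrom_num:
            continue
        for g in object_genbank_eref:
            if g["object_id"] != c["locus_id"]:
                continue
            # Nested subquery: keep only titles that do not look human.
            h = [u for u in get_homolog_summary(g["genbank_ref"])
                 if "Human" not in u["title"] and "H.sapien" not in u["title"]]
            if h:  # mirrors "not (H = { })"
                results.append({"accn": g["genbank_ref"], "nonhuman_homologs": h})
    return results


print(nonhuman_homologs("22"))
```

The nested list comprehension plays the role of the inner `select`, and the final emptiness check drops genes that have no non-human homolog.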

Data Warehousing
• Motivation: efficiency; availability; “denial of service”; data cleansing
• Requirements: efficient to query; easy to update; model data naturally

Example record:

    {(#uid: 6138971, #title: "Homo sapiens adrenergic ...", #accession: "NM_001619",
      #organism: "Homo sapiens", #taxon: 9606, #lineage: ["Eukaryota", "Metazoa", …],
      #seq: "CTCGGCCTCGGGCGCGGC...",
      #feature: { (#name: "source", #continuous: true,
                   #position: [ (#accn: "NM_001619", #start: 0, #end: 3602, #negative: false)],
                   #anno: [ (#anno_name: "organism", #descr: "Homo sapiens"), …] ), …})}

Data Warehousing Results

    ! Log in
    oracle-cplobj-add (#name: "db", ...);
    ! Define table
    create table GP (#uid: "NUMBER", #detail: "LONG") using db;
    ! Populate table with GenPept reports
    select #uid: x.#uid, #detail: x into GP
    from aa-get-seqfeat-general "PTP" as x using db;
    ! Map GP to that table
    create view GP from GP using db;
    ! Run a query to get title of 131470
    select x.#detail.#title from GP as x where x.#uid = 131470;

A relational DBMS on its own is insufficient because it forces us to fragment data into 3NF. Kleisli turns a flat relational DBMS into a nested relational DBMS. It can use a flat relational DBMS such as Sybase, Oracle, MySQL, etc. as its updatable complex object store.
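The idea of keeping a whole nested record in one column of a flat table can be sketched in Python as follows. This is only an illustration under stated assumptions: it uses the standard-library `sqlite3` and `json` modules rather than Oracle and Kleisli, and the record contents (including the title string) are placeholders, not real GenPept data.

```python
import json
import sqlite3

# A nested record, abbreviated from the shape shown on the warehousing slide;
# the title string here is a placeholder, not real GenPept data.
report = {
    "uid": 131470,
    "title": "placeholder title",
    "feature": [{"name": "source", "position": [{"start": 0, "end": 10}]}],
}

conn = sqlite3.connect(":memory:")
# Flat table: one key column plus one column holding the serialized nested object.
conn.execute("CREATE TABLE GP (uid INTEGER PRIMARY KEY, detail TEXT)")
conn.execute("INSERT INTO GP VALUES (?, ?)", (report["uid"], json.dumps(report)))


def get_title(uid):
    """Fetch the stored object by key and read a field inside it."""
    row = conn.execute("SELECT detail FROM GP WHERE uid = ?", (uid,)).fetchone()
    return None if row is None else json.loads(row[0])["title"]


print(get_title(131470))
```

The record is never split into normalized tables; the relational engine only indexes the key, and the nested structure is recovered on read.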

Epitope Prediction TRAP-559AA MNHLGNVKYLVIVFLIFFDLFLVNGRDVQNNIVDEIKYSE EVCNDQVDLYLLMDCSGSIRRHNWVNHAVPLAMKLIQQLN LNDNAIHLYVNVFSNNAKEIIRLHSDASKNKEKALIIIRS LLSTNLPYGRTNLTDALLQVRKHLNDRINRENANQLVVIL TDGIPDSIQDSLKESRKLSDRGVKIAVFGIGQGINVAFNR FLVGCHPSDGKCNLYADSAWENVKNVIGPFMKAVCVEVEK TASCGVWDEWSPCSVTCGKGTRSRKREILHEGCTSEIQEQ CEEERCPPKWEPLDVPDEPEDDQPRPRGDNSSVQKPEENI IDNNPQEPSPNPEEGKDENPNGFDLDENPENPPNPDIPEQ KPNIPEDSEKEVPSDVPKNPEDDREENFDIPKKPENKHDN QNNLPNDKSDRNIPYSPLPPKVLDNERKQSDPQSQDNNGN RHVPNSEDRETRPHGRNNENRSYNRKYNDTPKHPEREEHE KPDNNKKKGESDNKYKIAGGIAGGLALLACAGLAYKFVVP GAATPYAGEPAPFDETLGEEDKDLDEPEQFRLPEENEWN

Epitope Prediction Results
• Prediction by our ANN model for HLA-A11: 29 predictions; 22 epitopes; 76% specificity
• Prediction by BIMAS matrix for HLA-A*1101
[Figure: number of experimental binders by BIMAS rank (axis ticks at 1, 66, 100): 19 (52.8%), 5 (13.9%), 12 (33.3%)]

Transcription Start Prediction

Transcription Start Prediction Results

Medical Record Analysis
• Looking for patterns that are valid, novel, useful, and understandable

Gene Expression Analysis
• Classifying gene expression profiles
• Find stable differentially expressed genes
• Find significant gene groups
• Derive coordinated gene expression

Medical Record & Gene Expression Analysis Results
• PCL, a novel “emerging pattern” method
• Beats C4.5, CBA, LB, NB, TAN in 21 out of 32 UCI benchmarks
• Works well for gene expressions (Cancer Cell, March 2002, 1(2))

Protein Interaction Extraction
“What are the protein-protein interaction pathways from the latest reported discoveries?”

Protein Interaction Extraction Results
• Rule-based system for processing free texts in scientific abstracts
• Specialized in extracting protein names and extracting protein-protein interactions
[Figure: extracted interaction example; only the label “Jak1” survives]

Behind the Scene
Vladimir Bajic, Vladimir Brusic, Jinyan Li, See-Kiong Ng, Limsoon Wong, Louxin Zhang, Allen Chong, Judice Koh, SPT Krishnan, Huiqing Liu, Seng Hong Seah, Soon Heng Tan, Guanglan Zhang, Zhuo Zhang, and many more: students, folks from geneticXchange, MolecularConnections, and other collaborators.

Using Feature Generation & Feature Selection for Accurate Prediction of Translation Initiation Sites
A more detailed example of post-genome knowledge discovery

Translation Initiation Recognition

A Sample cDNA

    299 HSU27655.1 CAT U27655 Homo sapiens
    CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80
    CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160
    GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240
    CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
    ............................................................ 80
    ................................iEEEEEEEEEEEEEEEEEEEEEEEEEEE 160
    EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE 240
    EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE

What makes the second ATG the translation initiation site?

Approach
• Training data gathering
• Signal generation: k-grams, distance, domain know-how, ...
• Signal selection: entropy, χ², CFS, t-test, domain know-how, ...
• Signal integration: SVM, ANN, PCL, CART, C4.5, kNN, ...

Training & Testing Data
• Vertebrate dataset of Pedersen & Nielsen [ISMB’97]
• 3312 sequences
• 13503 ATG sites, of which 3312 (24.5%) are TIS and 10191 (75.5%) are non-TIS
• Used for 3-fold cross-validation experiments
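A 3-fold cross-validation split over the ATG sites can be sketched as below. The slide does not say how the folds were drawn (by sequence or by site, stratified or not), so the plain shuffled split, the `seed`, and the function name are assumptions made for illustration; only the two counts come from the slide.

```python
import random

N_ATG, N_TIS = 13503, 3312  # counts quoted on the slide


def three_fold_indices(n, seed=0):
    """Shuffle indices 0..n-1 and deal them into 3 folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::3] for i in range(3)]


folds = three_fold_indices(N_ATG)
for i, test_fold in enumerate(folds):
    # Train on the two folds that are not held out.
    train = [j for k, f in enumerate(folds) if k != i for j in f]
    print(f"fold {i}: train={len(train)} test={len(test_fold)}")

print(f"TIS fraction: {N_TIS / N_ATG:.1%}")
```

Each ATG site is held out exactly once, and the reported accuracy would be the average over the three held-out folds.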

Signal Generation
• K-grams (i.e., k consecutive letters)
• K = 1, 2, 3, 4, 5, …
• Window size vs. fixed position
• Upstream, downstream vs. anywhere in window
• In-frame vs. any frame
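One way to turn these choices into features is to count k-grams upstream and downstream of a candidate ATG, optionally restricted to the ATG's reading frame. The sketch below is an illustration only: the function name, the default window of 30 letters, and the exact framing convention are assumptions, not the settings used in the original experiments.

```python
from collections import Counter


def kgram_counts(seq, atg_pos, k, window=30, in_frame=True):
    """Count k-grams upstream and downstream of the ATG at atg_pos.

    With in_frame=True only k-grams whose start is a multiple of 3 away
    from the ATG are counted; otherwise every offset is used.
    """
    step = 3 if in_frame else 1
    up, down = Counter(), Counter()
    # Upstream: walk backwards from the ATG, staying inside the window.
    for start in range(atg_pos - step, max(atg_pos - window, 0) - 1, -step):
        if start + k <= atg_pos:  # do not overlap the ATG itself
            up[seq[start:start + k]] += 1
    # Downstream: start right after the ATG codon.
    end = min(atg_pos + 3 + window, len(seq))
    for start in range(atg_pos + 3, end - k + 1, step):
        down[seq[start:start + k]] += 1
    return up, down


up, down = kgram_counts("CCGCCATGGCTTAAGG", atg_pos=5, k=3, window=9)
print(up, down)
```

Each distinct (k-gram, region, frame) combination then becomes one numeric feature of the candidate ATG.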

Too Many Signals
• For each value of k, there are 4^k * 3 * 2 k-grams
• If we use k = 1, 2, 3, 4, 5, we have 4 + 24 + 96 + 384 + 1536 + 6144 = 8188 features!
• This is too many for most machine learning algorithms
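The arithmetic can be checked with a few lines of Python. Reading the factor 3 as the region choice (upstream, downstream, anywhere) and the factor 2 as the frame choice (in-frame, any frame) is an interpretation of the signal-generation slide, not something the slide states. The formula accounts for 8184 of the features; the slide's total of 8188 includes a further leading term of 4 whose origin the transcript does not explain.

```python
def n_kgram_features(k):
    # 4**k possible k-grams over {A, C, G, T}, times 3 region choices,
    # times 2 frame choices (interpretation assumed, see the text above).
    return 4 ** k * 3 * 2


per_k = [n_kgram_features(k) for k in range(1, 6)]
print(per_k, sum(per_k))
```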

Signal Selection (Basic Idea) • Choose a signal w/ low intra-class distance • Choose a signal w/ high inter-class distance • Which of the following 3 signals is good?

Signal Selection (eg., t-statistics)
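The formula on this slide did not survive in the transcript. As a sketch, assuming the usual two-sample form t = |μ1 − μ2| / sqrt(σ1²/n1 + σ2²/n2), a signal can be scored like this; the function name and the toy values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def t_statistic(pos, neg):
    """Absolute mean difference scaled by the unpooled (Welch) standard error."""
    se = sqrt(variance(pos) / len(pos) + variance(neg) / len(neg))
    return abs(mean(pos) - mean(neg)) / se


# Toy values of one signal in the TIS class and the non-TIS class.
well_separated = t_statistic([5, 6, 7], [1, 2, 3])
overlapping = t_statistic([1, 5, 9], [2, 4, 6])
print(well_separated, overlapping)
```

A signal whose class means are far apart relative to the spread within each class gets a high score, which matches the low intra-class, high inter-class distance criterion.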

Signal Selection (e.g., CFS)
• Instead of scoring individual signals, how about scoring a group of signals as a whole?
• CFS: a good group contains signals that are highly correlated with the class, and yet uncorrelated with each other
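The group score can be sketched with the merit heuristic commonly associated with CFS, merit = k·r_cf / sqrt(k + k(k−1)·r_ff), where r_cf is the mean signal-class correlation and r_ff the mean signal-signal correlation. The slide itself gives no formula, so this form is an assumption, and the sketch swaps in plain Pearson correlation as the correlation measure for simplicity. The toy signals are made up.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant inputs."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def cfs_merit(features, labels):
    """Merit of a group: high class correlation, low redundancy within the group."""
    k = len(features)
    r_cf = sum(abs(pearson(f, labels)) for f in features) / k
    pairs = list(combinations(features, 2))
    r_ff = sum(abs(pearson(a, b)) for a, b in pairs) / len(pairs) if pairs else 0.0
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


labels = [0, 0, 0, 0, 1, 1, 1, 1]
f1 = [0, 0, 0, 1, 1, 1, 1, 1]  # informative, one error
f2 = list(f1)                  # exact duplicate of f1 (fully redundant)
f3 = [0, 0, 0, 0, 0, 1, 1, 1]  # informative, a different error

print(cfs_merit([f1, f2], labels), cfs_merit([f1, f3], labels))
```

The pair `f1, f3` scores higher than `f1, f2` because the second signal adds information instead of repeating the first.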

Sample k-grams Selected by CFS
• Position –3 (Kozak consensus)
• In-frame upstream ATG (leaky scanning)
• In-frame downstream TAA, TAG, TGA (stop codon)
• In-frame downstream CTG, GAC, GAG, and GCC (codon bias?)

Signal Integration
• kNN: given a test sample, find the k training samples that are most similar to it. Let the majority class win.
• SVM: given a group of training samples from two classes, determine a separating plane that maximises the margin between the two classes.
• Naïve Bayes, ANN, C4.5, ...
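The kNN rule described above is short enough to write out directly. This is a minimal sketch: squared Euclidean distance as the similarity measure, the default `k=3`, and the toy feature vectors are all assumptions chosen for illustration.

```python
from collections import Counter


def knn_predict(train, test_x, k=3):
    """train: list of (feature_vector, label) pairs. Majority vote of the k nearest."""
    nearest = sorted(
        train,
        key=lambda s: sum((a - b) ** 2 for a, b in zip(s[0], test_x)),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


train = [
    ([0, 0], "non-TIS"), ([0, 1], "non-TIS"), ([1, 0], "non-TIS"),
    ([5, 5], "TIS"), ([5, 6], "TIS"), ([6, 5], "TIS"),
]
print(knn_predict(train, [5, 4]), knn_predict(train, [1, 1]))
```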

Results (3-fold x-validation)

Improvement by Voting • Apply any 3 of Naïve Bayes, SVM, Neural Network, & Decision Tree. Decide by majority.
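Majority voting over three classifiers can be sketched as follows. The prediction lists are made-up placeholders for the outputs of any three of the models named above; the function names are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """predictions: labels from an odd number of classifiers for one sample."""
    return Counter(predictions).most_common(1)[0][0]


def vote_all(per_classifier):
    """per_classifier: one list of predicted labels per classifier, same sample order."""
    return [majority_vote(col) for col in zip(*per_classifier)]


# Placeholder outputs of three classifiers on three candidate ATGs.
clf_a = ["TIS", "non-TIS", "TIS"]
clf_b = ["TIS", "TIS", "non-TIS"]
clf_c = ["non-TIS", "non-TIS", "non-TIS"]
print(vote_all([clf_a, clf_b, clf_c]))
```

With two classes and three voters there is never a tie, which is why an odd number of classifiers is used.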

Improvement by Scanning • Apply Naïve Bayes or SVM left-to-right until first ATG predicted as positive. That’s the TIS. • Naïve Bayes & SVM models were trained using TIS vs. Up-stream ATG
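The scanning rule can be sketched as a left-to-right loop over the ATGs of a sequence. The `toy_model` below is a hypothetical stand-in for the trained Naïve Bayes or SVM classifier: it simply accepts an ATG with a purine at position –3, which is not the actual trained model. The test sequence is made up.

```python
def scan_for_tis(seq, is_tis):
    """Return the index of the first ATG that is_tis accepts, or None."""
    pos = seq.find("ATG")
    while pos != -1:
        if is_tis(seq, pos):
            return pos
        pos = seq.find("ATG", pos + 1)
    return None


def toy_model(seq, pos):
    # Hypothetical classifier: accept when position -3 holds A or G.
    return pos >= 3 and seq[pos - 3] in "AG"


print(scan_for_tis("CCTATGCCACCATGGC", toy_model))
```

Scanning stops at the first accepted ATG, so later ATGs are never scored; this is why the models are trained on TIS versus upstream ATGs only.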

Performance Comparisons * result not directly comparable

Technique Comparisons
• Pedersen & Nielsen [ISMB’97]: neural network; no explicit features
• Zien [Bioinformatics’00]: SVM + kernel engineering; no explicit features
• Hatzigeorgiou [Bioinformatics’02]: multiple neural networks; scanning rule; no explicit features
• Our approach: explicit feature generation; explicit feature selection; use any machine learning method w/o any form of complicated tuning; scanning rule is optional

Acknowledgements • A.G. Pedersen • H. Nielsen • Roland Yap • Fanfan Zeng
