Discover the evolution of bioinformatics, from managing data to extracting knowledge. Learn about protein interaction extraction, MHC-peptide binding prediction, gene expression analysis, and more in Singapore's bioinformatics research. See the benefits to patients, pharmaceutical companies, and scientists. Explore the integration technology, data warehousing, cleansing, and gene feature recognition involved in bioinformatics research. Gain insight into the innovative technologies and methodologies, such as Kleisli, used to handle the complexity and heterogeneity of biological data efficiently.
From Informatics to Bioinformatics Limsoon Wong Institute for Infocomm Research Singapore
Themes of Bioinformatics Bioinformatics = Data Mgmt + Knowledge Discovery Data Mgmt = Integration + Transformation + Cleansing Knowledge Discovery = Statistics + Algorithms + Databases
Benefits of Bioinformatics • To the patient: • Better drug, better treatment • To the pharma: • Save time, save cost, make more $ • To the scientist: • Better science
From Informatics to Bioinformatics: 8 years of bioinformatics R&D in Singapore • Protein Interactions Extraction (PIES) • MHC-Peptide Binding (PREDICT) • Gene Expression & Medical Record Datamining (PCL) • Cleansing & Warehousing (FIMM) • Gene Feature Recognition (Dragon) • Integration Technology (Kleisli) • Venom Informatics [Timeline axis: 1994, 1996, 1998, 2000, 2002; institutions: ISS, KRDL, LIT/I2R]
Data Integration A DOE “impossible query”: For each gene on a given cytogenetic band, find its non-human homologs.
Data Integration Results
sybase-add (#name: "GDB", ...);
create view L from locus_cyto_location using GDB;
create view E from object_genbank_eref using GDB;
select #accn: g.#genbank_ref, #nonhuman-homologs: H
from L as c, E as g,
  {select u
   from g.#genbank_ref.na-get-homolog-summary as u
   where not(u.#title string-islike "%Human%") andalso
         not(u.#title string-islike "%H.sapien%")} as H
where c.#chrom_num = "22" andalso
      g.#object_id = c.#locus_id andalso
      not (H = { });
• Using Kleisli: • Clear • Succinct • Efficient • Handles • heterogeneity • complexity
Data Warehousing {(#uid: 6138971, #title: "Homo sapiens adrenergic ...", #accession: "NM_001619", #organism: "Homo sapiens", #taxon: 9606, #lineage: ["Eukaryota", "Metazoa", …], #seq: "CTCGGCCTCGGGCGCGGC...", #feature: { (#name: "source", #continuous: true, #position: [ (#accn: "NM_001619", #start: 0, #end: 3602, #negative: false)], #anno: [ (#anno_name: "organism", #descr: "Homo sapiens"), …] ), …)} • Motivation • efficiency • availability • “denial of service” • data cleansing • Requirements • efficient to query • easy to update • model data naturally
Data Warehousing Results
! Log in
oracle-cplobj-add (#name: "db", ...);
! Define table
create table GP (#uid: "NUMBER", #detail: "LONG") using db;
! Populate table with GenPept reports
select #uid: x.#uid, #detail: x into GP from aa-get-seqfeat-general "PTP" as x using db;
! Map GP to that table
create view GP from GP using db;
! Run a query to get title of 131470
select x.#detail.#title from GP as x where x.#uid = 131470;
Relational DBMS is insufficient because it forces us to fragment data into 3NF. Kleisli turns flat relational DBMS into nested relational DBMS. It can use flat relational DBMS such as Sybase, Oracle, MySQL, etc. to be its update-able complex object store.
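The idea of keeping a whole nested record in one column of a flat table, keyed by an identifier, can be imitated in a few lines of Python. This is only an analogy under stated assumptions, not Kleisli's actual mechanism: SQLite stands in for Oracle, JSON stands in for Kleisli's complex-object encoding, and the record contents (including the title string) are made up for illustration.

```python
import json
import sqlite3

# In-memory flat relational store (stand-in for Oracle/Sybase/MySQL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE GP (uid INTEGER PRIMARY KEY, detail TEXT)")

# A nested record; all field values here are hypothetical.
report = {
    "uid": 131470,
    "title": "hypothetical GenPept report title",
    "feature": [
        {"name": "source", "position": [{"start": 0, "end": 3602}]},
    ],
}

# Store the entire nested object in a single column instead of
# fragmenting it across several 3NF tables.
conn.execute(
    "INSERT INTO GP (uid, detail) VALUES (?, ?)",
    (report["uid"], json.dumps(report)),
)

# Query by uid, then navigate the nested structure after decoding.
row = conn.execute("SELECT detail FROM GP WHERE uid = ?", (131470,)).fetchone()
title = json.loads(row[0])["title"]
first_end = json.loads(row[0])["feature"][0]["position"][0]["end"]
```

The trade-off in this sketch is that the database can index only `uid`; anything inside `detail` has to be decoded before it can be filtered.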
Epitope Prediction TRAP-559AA MNHLGNVKYLVIVFLIFFDLFLVNGRDVQNNIVDEIKYSE EVCNDQVDLYLLMDCSGSIRRHNWVNHAVPLAMKLIQQLN LNDNAIHLYVNVFSNNAKEIIRLHSDASKNKEKALIIIRS LLSTNLPYGRTNLTDALLQVRKHLNDRINRENANQLVVIL TDGIPDSIQDSLKESRKLSDRGVKIAVFGIGQGINVAFNR FLVGCHPSDGKCNLYADSAWENVKNVIGPFMKAVCVEVEK TASCGVWDEWSPCSVTCGKGTRSRKREILHEGCTSEIQEQ CEEERCPPKWEPLDVPDEPEDDQPRPRGDNSSVQKPEENI IDNNPQEPSPNPEEGKDENPNGFDLDENPENPPNPDIPEQ KPNIPEDSEKEVPSDVPKNPEDDREENFDIPKKPENKHDN QNNLPNDKSDRNIPYSPLPPKVLDNERKQSDPQSQDNNGN RHVPNSEDRETRPHGRNNENRSYNRKYNDTPKHPEREEHE KPDNNKKKGESDNKYKIAGGIAGGLALLACAGLAYKFVVP GAATPYAGEPAPFDETLGEEDKDLDEPEQFRLPEENEWN
Epitope Prediction Results • Prediction by our ANN model for HLA-A11 • 29 predictions • 22 epitopes • 76% specificity • Prediction by BIMAS matrix for HLA-A*1101 [Chart: number of experimental binders by BIMAS rank; axis ticks 1, 66, 100; bars 19 (52.8%), 5 (13.9%), 12 (33.3%)]
Medical Record Analysis • Looking for patterns that are • valid • novel • useful • understandable
Gene Expression Analysis • Classifying gene expression profiles • find stable differentially expressed genes • find significant gene groups • derive coordinated gene expression
Medical Record & Gene Expression Analysis Results • PCL, a novel “emerging pattern” method • Beats C4.5, CBA, LB, NB, TAN in 21 out of 32 UCI benchmarks • Works well for gene expressions • Cancer Cell, March 2002, 1(2)
WEB Protein Interaction Extraction “What are the protein-protein interaction pathways from the latest reported discoveries?”
Protein Interaction Extraction Results • Rule-based system for processing free texts in scientific abstracts • Specialized in • extracting protein names • extracting protein-protein interactions Jak1
Behind the Scene • Vladimir Bajic • Vladimir Brusic • Jinyan Li • See-Kiong Ng • Limsoon Wong • Louxin Zhang • Allen Chong • Judice Koh • SPT Krishnan • Huiqing Liu • Seng Hong Seah • Soon Heng Tan • Guanglan Zhang • Zhuo Zhang • and many more: students, folks from geneticXchange, MolecularConnections, and other collaborators…
Using Feature Generation & Feature Selection for Accurate Prediction of Translation Initiation Sites A more detailed example of post-genome knowledge discovery
A Sample cDNA
299 HSU27655.1 CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
............................................................ 80
................................iEEEEEEEEEEEEEEEEEEEEEEEEEEE 160
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE 240
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
What makes the second ATG the translation initiation site?
Approach • Training data gathering • Signal generation • k-grams, distance, domain know-how, ... • Signal selection • Entropy, χ², CFS, t-test, domain know-how, ... • Signal integration • SVM, ANN, PCL, CART, C4.5, kNN, ...
Training & Testing Data • Vertebrate dataset of Pedersen & Nielsen [ISMB’97] • 3312 sequences • 13503 ATG sites • 3312 (24.5%) are TIS • 10191 (75.5%) are non-TIS • Use for 3-fold x-validation expts
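A 3-fold cross-validation split over the 3312 sequences might be set up as in the sketch below. It assumes the split is made at the sequence level with a random shuffle; the slides do not say how the folds were actually formed, and the function name and seed are illustrative.

```python
import random

def k_fold_splits(n_items, k=3, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)          # reproducible shuffle
    folds = [idx[i::k] for i in range(k)]     # k near-equal, disjoint folds
    for i in range(k):
        test = folds[i]
        # Training set is every fold except the held-out one.
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

splits = list(k_fold_splits(3312, k=3))
```

Each sequence lands in exactly one test fold, so every ATG site is scored once by a model that never saw its sequence during training.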
Signal Generation • K-grams (i.e., k consecutive letters) • K = 1, 2, 3, 4, 5, … • Window size vs. fixed position • Upstream, downstream vs. anywhere in window • In-frame vs. any frame
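The k-gram options above can be made concrete with a small counting routine. This is a minimal sketch, not the authors' code: the function name, the 0-based `atg_pos` convention, and the default window of 99 bases on each side are assumptions chosen for illustration.

```python
from collections import Counter

def kgram_counts(seq, atg_pos, k, window=99, in_frame=True):
    """Count k-grams upstream and downstream of a candidate ATG.

    atg_pos is the 0-based index of the 'A' in the candidate ATG.
    With in_frame=True only k-grams whose start is a multiple of 3
    away from the ATG are counted; otherwise every offset is used.
    """
    step = 3 if in_frame else 1
    features = Counter()

    # Upstream: walk left from the ATG, skipping k-grams that would
    # overlap the ATG itself.
    start = atg_pos - step
    while start >= max(0, atg_pos - window):
        if start + k <= atg_pos:
            features[("up", seq[start:start + k])] += 1
        start -= step

    # Downstream: walk right, starting just after the ATG.
    start = atg_pos + 3
    limit = min(len(seq), atg_pos + 3 + window)
    while start + k <= limit:
        features[("down", seq[start:start + k])] += 1
        start += step

    return features

in_frame_3grams = kgram_counts("GCCATGGCTTAA", atg_pos=3, k=3)
any_frame_1grams = kgram_counts("GCCATGGCTTAA", atg_pos=3, k=1, in_frame=False)
```

Each `(direction, k-gram)` key becomes one numeric feature of the candidate ATG; in the toy call, the in-frame downstream `TAA` shows up as a feature, which is the kind of stop-codon signal a later selection step could pick out.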
Too Many Signals • For each value of k, there are 4^k × 3 × 2 k-grams • If we use k = 1, 2, 3, 4, 5, we have 4 + 24 + 96 + 384 + 1536 + 6144 = 8188 features! • This is too many for most machine learning algorithms
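The growth in feature count is easy to check numerically. The sketch below reads the slide's count as 4^k × 3 × 2 per value of k, presumably 3 window regions (upstream, downstream, anywhere) times 2 frame options (in-frame, any frame); that reading is an assumption. The slide's total of 8188 also includes a leading term of 4 whose origin is not stated, so only the five k-gram terms are summed here.

```python
def n_kgram_features(k):
    """Number of k-gram features for one k: 4^k sequences x 3 x 2 variants."""
    return 4 ** k * 3 * 2

# One term per k = 1..5, matching the slide's 24, 96, 384, 1536, 6144.
per_k = [n_kgram_features(k) for k in range(1, 6)]
total = sum(per_k)
```

Each extra unit of k multiplies the count by 4, which is why the feature space outgrows most learners after only a few values of k.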
Signal Selection (Basic Idea) • Choose a signal w/ low intra-class distance • Choose a signal w/ high inter-class distance • Which of the following 3 signals is good?
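One common way to turn "low intra-class distance, high inter-class distance" into a number is a t-statistic: the gap between class means divided by the pooled spread. The sketch below uses Welch's form as an assumed example; the slides list the t-test among the selection methods but do not give a formula, and the sample values are invented.

```python
from math import sqrt
from statistics import mean, variance

def t_score(pos, neg):
    """Welch-style t-statistic of one signal across two classes.

    Large when class means are far apart (inter-class distance)
    and values within each class are tightly clustered (intra-class).
    """
    gap = abs(mean(pos) - mean(neg))
    spread = sqrt(variance(pos) / len(pos) + variance(neg) / len(neg))
    if spread == 0:
        return float("inf") if gap > 0 else 0.0
    return gap / spread

# Hypothetical signal values for TIS (pos) and non-TIS (neg) samples.
good = t_score(pos=[5, 6, 5, 6], neg=[1, 2, 1, 2])   # well separated
poor = t_score(pos=[1, 5, 2, 6], neg=[2, 6, 1, 5])   # overlapping classes
```

Ranking all signals by such a score and keeping the top few is the simplest form of per-signal selection.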
Signal Selection (e.g., CFS) • Instead of scoring individual signals, how about scoring a group of signals as a whole? • CFS (correlation-based feature selection) • A good group contains signals that are highly correlated with the class, and yet uncorrelated with each other
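The group-scoring idea can be sketched with the merit heuristic usually associated with correlation-based feature selection: merit = k·r_cf / sqrt(k + k(k−1)·r_ff), where r_cf is the mean signal-class correlation and r_ff the mean signal-signal correlation. The sketch assumes this is the scoring rule meant on the slide and swaps in Pearson correlation as the correlation measure for simplicity; the data are invented.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0

def cfs_merit(features, labels):
    """Merit of a group of signals: rewards class correlation,
    penalises redundancy among the signals."""
    k = len(features)
    r_cf = sum(abs(pearson(f, labels)) for f in features) / k
    pairs = list(combinations(features, 2))
    r_ff = sum(abs(pearson(a, b)) for a, b in pairs) / len(pairs) if pairs else 0.0
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)

labels = [1, 1, 1, 1, 0, 0, 0, 0]
g1 = [1, 1, 1, 0, 0, 0, 0, 1]   # correlation 0.5 with the class
g2 = [1, 1, 0, 1, 0, 0, 1, 0]   # correlation 0.5 with the class, 0 with g1

complementary = cfs_merit([g1, g2], labels)
redundant = cfs_merit([g1, g1], labels)
```

Two signals that are each moderately predictive but uncorrelated with each other score higher than the same signal counted twice, which is the behaviour the slide describes.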
Sample k-grams Selected by CFS • Position –3 (Kozak consensus) • In-frame upstream ATG (leaky scanning) • In-frame downstream • TAA, TAG, TGA (stop codons) • CTG, GAC, GAG, and GCC (codon bias?)
Signal Integration • kNN: Given a test sample, find the k training samples that are most similar to it. Let the majority class win. • SVM: Given a group of training samples from two classes, determine a separating hyperplane that maximises the margin between the two classes. • Naïve Bayes, ANN, C4.5, ...
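The kNN rule described above fits in a few lines. This is a minimal sketch that assumes Euclidean distance over numeric feature vectors; the slides do not specify the similarity measure, and the training points are invented.

```python
from collections import Counter
from math import dist

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training samples.

    `train` is a list of (feature_vector, label) pairs.
    """
    neighbours = sorted(train, key=lambda s: dist(s[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical 2-feature training samples.
train = [
    ((0, 0), "non-TIS"), ((0, 1), "non-TIS"),
    ((5, 5), "TIS"), ((5, 6), "TIS"), ((6, 5), "TIS"),
]
near_tis = knn_predict(train, (5, 4), k=3)
near_non = knn_predict(train, (0, 0.5), k=3)
```

An odd k avoids tied votes in a two-class problem such as TIS vs. non-TIS.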
Improvement by Voting • Apply any 3 of Naïve Bayes, SVM, Neural Network, & Decision Tree. Decide by majority.
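Majority voting over three classifiers can be sketched as follows. The three classifiers here are toy stand-ins (plain functions returning a label), not trained Naïve Bayes, SVM, or neural network models.

```python
from collections import Counter

def majority_vote(classifiers, sample):
    """Return the label predicted by most of the given classifiers."""
    votes = Counter(clf(sample) for clf in classifiers)
    return votes.most_common(1)[0][0]

# Toy stand-in classifiers over a single numeric score (hypothetical).
clf_a = lambda x: "TIS" if x > 0.3 else "non-TIS"
clf_b = lambda x: "TIS" if x > 0.5 else "non-TIS"
clf_c = lambda x: "TIS" if x > 0.7 else "non-TIS"

decision_high = majority_vote([clf_a, clf_b, clf_c], 0.6)   # two of three say TIS
decision_low = majority_vote([clf_a, clf_b, clf_c], 0.4)    # two of three say non-TIS
```

With three voters and two classes a tie is impossible, which is one practical reason to combine exactly three methods.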
Improvement by Scanning • Apply Naïve Bayes or SVM left-to-right until first ATG predicted as positive. That’s the TIS. • Naïve Bayes & SVM models were trained using TIS vs. Up-stream ATG
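The scanning rule can be sketched as a left-to-right search that stops at the first ATG a classifier accepts. The classifier below is a toy stand-in (it accepts an ATG preceded by A or G at position –3, loosely echoing the Kozak consensus), not a trained Naïve Bayes or SVM model.

```python
def scan_for_tis(seq, is_tis):
    """Return the 0-based index of the first ATG that `is_tis` accepts,
    scanning left to right, or None if no ATG is accepted."""
    pos = seq.find("ATG")
    while pos != -1:
        if is_tis(seq, pos):
            return pos
        pos = seq.find("ATG", pos + 1)
    return None

def toy_classifier(seq, pos):
    """Hypothetical rule: accept if the base at position -3 is A or G."""
    return pos >= 3 and seq[pos - 3] in "AG"

predicted = scan_for_tis("TTTATGCCGCCATGG", toy_classifier)
```

In the toy sequence the first ATG (index 3) is rejected and the second (index 11) is accepted. ATGs downstream of the accepted one are never scored, which is why a model trained on TIS vs. upstream ATGs is sufficient.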
Performance Comparisons * result not directly comparable
Technique Comparisons • Pedersen & Nielsen [ISMB’97] • Neural network • No explicit features • Zien [Bioinformatics’00] • SVM + kernel engineering • No explicit features • Hatzigeorgiou [Bioinformatics’02] • Multiple neural networks • Scanning rule • No explicit features • Our approach • Explicit feature generation • Explicit feature selection • Use any machine learning method w/o any form of complicated tuning • Scanning rule is optional
Acknowledgements • A.G. Pedersen • H. Nielsen • Roland Yap • Fanfan Zeng