From Informatics to Bioinformatics
Limsoon Wong
Laboratories for Information Technology, Singapore
Themes of Bioinformatics
Bioinformatics = Data Mgmt + Knowledge Discovery
Data Mgmt = Integration + Transformation + Cleansing
Knowledge Discovery = Statistics + Algorithms + Databases
Benefits of Bioinformatics
• To the patient: better drugs, better treatment
• To the pharma: save time, save cost, make more $
• To the scientist: better science
From Informatics to Bioinformatics
8 years of bioinformatics R&D in Singapore
[Timeline figure: years 1994, 1996, 1998, 2000, 2002; institution labels ISS, LIT, KRDL. Projects shown:]
• MHC-Peptide Binding (PREDICT)
• Protein Interactions Extraction (PIES)
• Gene Expression & Medical Record Datamining (PCL)
• Cleansing & Warehousing (FIMM)
• Gene Feature Recognition (Dragon)
• Integration Technology (Kleisli)
• Venom Informatics
Data Integration
A DOE “impossible query”: for each gene on a given cytogenetic band, find its non-human homologs.
Data Integration Results
    sybase-add (#name: "GDB", ...);
    create view L from locus_cyto_location using GDB;
    create view E from object_genbank_eref using GDB;
    select #accn: g.#genbank_ref, #nonhuman-homologs: H
    from L as c, E as g,
         {select u
          from g.#genbank_ref.na-get-homolog-summary as u
          where not(u.#title string-islike "%Human%") andalso
                not(u.#title string-islike "%H.sapien%")} as H
    where c.#chrom_num = "22" andalso
          g.#object_id = c.#locus_id andalso
          not (H = { });
Using Kleisli:
• Clear
• Succinct
• Efficient
• Handles heterogeneity and complexity
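To make the shape of the query above easier to follow, here is a minimal Python sketch of the same join-and-filter logic. It is not Kleisli and does not talk to GDB or GenBank: the three in-memory tables, the accession strings, the titles, and the helper `get_homolog_summary` are all hypothetical stand-ins invented for illustration.

```python
# Hypothetical in-memory stand-ins for the two GDB views (L and E) and for
# the remote homolog lookup. None of these records come from the talk.
LOCUS_CYTO_LOCATION = [
    {"locus_id": 1, "chrom_num": "22"},
    {"locus_id": 2, "chrom_num": "22"},
    {"locus_id": 3, "chrom_num": "7"},
]
OBJECT_GENBANK_EREF = [
    {"object_id": 1, "genbank_ref": "ACC0001"},
    {"object_id": 2, "genbank_ref": "ACC0002"},
    {"object_id": 3, "genbank_ref": "ACC0003"},
]
HOMOLOG_SUMMARIES = {
    "ACC0001": [{"title": "Human example gene"},
                {"title": "Mus musculus example gene"}],
    "ACC0002": [{"title": "H.sapiens example gene"}],
    "ACC0003": [{"title": "Rattus norvegicus example gene"}],
}


def get_homolog_summary(accession):
    """Hypothetical stand-in for a remote homolog lookup by accession."""
    return HOMOLOG_SUMMARIES.get(accession, [])


def nonhuman_homologs(chrom_num):
    """For each gene on the chromosome, collect its non-human homologs."""
    results = []
    for c in LOCUS_CYTO_LOCATION:
        if c["chrom_num"] != chrom_num:
            continue
        for g in OBJECT_GENBANK_EREF:
            if g["object_id"] != c["locus_id"]:
                continue
            # Nested sub-select: drop summaries whose title looks human.
            h = [u for u in get_homolog_summary(g["genbank_ref"])
                 if "Human" not in u["title"] and "H.sapien" not in u["title"]]
            if h:  # mirrors "not (H = { })"
                results.append({"accn": g["genbank_ref"],
                                "nonhuman_homologs": h})
    return results


print(nonhuman_homologs("22"))
```

The nested list inside each result row is the point of the example: the answer is naturally a set of records that each contain a set, which a flat relational result cannot hold directly.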
Data Warehousing
    {(#uid: 6138971,
      #title: "Homo sapiens adrenergic ...",
      #accession: "NM_001619",
      #organism: "Homo sapiens",
      #taxon: 9606,
      #lineage: ["Eukaryota", "Metazoa", …],
      #seq: "CTCGGCCTCGGGCGCGGC...",
      #feature: {
        (#name: "source",
         #continuous: true,
         #position: [
           (#accn: "NM_001619", #start: 0, #end: 3602, #negative: false)],
         #anno: [
           (#anno_name: "organism", #descr: "Homo sapiens"), …] ),
        …)}
• Motivation
  • efficiency
  • availability
  • “denial of service”
  • data cleansing
• Requirements
  • efficient to query
  • easy to update
  • model data naturally
Data Warehousing Results
    ! Log in
    oracle-cplobj-add (#name: "db", ...);
    ! Define table
    create table GP (#uid: "NUMBER", #detail: "LONG") using db;
    ! Populate table with GenPept reports
    select #uid: x.#uid, #detail: x into GP
    from aa-get-seqfeat-general "PTP" as x using db;
    ! Map GP to that table
    create view GP from GP using db;
    ! Run a query to get title of 131470
    select x.#detail.#title from GP as x where x.#uid = 131470;
A relational DBMS alone is insufficient because it forces us to fragment data into 3NF. Kleisli turns a flat relational DBMS into a nested relational DBMS. It can use a flat relational DBMS such as Sybase, Oracle, or MySQL as its updatable complex object store.
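The idea of keeping a whole nested record in one column of a flat table, keyed by its uid, can be sketched in Python with the standard `sqlite3` and `json` modules. This is only an analogy under stated assumptions: SQLite and JSON stand in for Oracle and Kleisli's own object encoding, and the record contents and the helper names `make_store`, `put_record`, and `get_title` are hypothetical.

```python
import json
import sqlite3


def make_store():
    """Create an in-memory flat table with a key column and a blob-like column."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE GP (uid INTEGER PRIMARY KEY, detail TEXT)")
    return conn


def put_record(conn, record):
    """Store the whole nested record, unfragmented, under its uid."""
    conn.execute(
        "INSERT OR REPLACE INTO GP (uid, detail) VALUES (?, ?)",
        (record["uid"], json.dumps(record)),
    )


def get_title(conn, uid):
    """Fetch one record by uid and project its title field."""
    row = conn.execute(
        "SELECT detail FROM GP WHERE uid = ?", (uid,)
    ).fetchone()
    return None if row is None else json.loads(row[0])["title"]


conn = make_store()
# Hypothetical nested record; the title and feature values are made up.
put_record(conn, {
    "uid": 131470,
    "title": "hypothetical report title",
    "feature": [{"name": "source", "position": [{"start": 0, "end": 10}]}],
})
print(get_title(conn, 131470))
```

Updating a record is a single row replacement here, whereas a 3NF design would spread the same record across several tables for features, positions, and annotations.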
Epitope Prediction TRAP-559AA MNHLGNVKYLVIVFLIFFDLFLVNGRDVQNNIVDEIKYSE EVCNDQVDLYLLMDCSGSIRRHNWVNHAVPLAMKLIQQLN LNDNAIHLYVNVFSNNAKEIIRLHSDASKNKEKALIIIRS LLSTNLPYGRTNLTDALLQVRKHLNDRINRENANQLVVIL TDGIPDSIQDSLKESRKLSDRGVKIAVFGIGQGINVAFNR FLVGCHPSDGKCNLYADSAWENVKNVIGPFMKAVCVEVEK TASCGVWDEWSPCSVTCGKGTRSRKREILHEGCTSEIQEQ CEEERCPPKWEPLDVPDEPEDDQPRPRGDNSSVQKPEENI IDNNPQEPSPNPEEGKDENPNGFDLDENPENPPNPDIPEQ KPNIPEDSEKEVPSDVPKNPEDDREENFDIPKKPENKHDN QNNLPNDKSDRNIPYSPLPPKVLDNERKQSDPQSQDNNGN RHVPNSEDRETRPHGRNNENRSYNRKYNDTPKHPEREEHE KPDNNKKKGESDNKYKIAGGIAGGLALLACAGLAYKFVVP GAATPYAGEPAPFDETLGEEDKDLDEPEQFRLPEENEWN
Epitope Prediction Results
• Prediction by our ANN model for HLA-A11
  • 29 predictions
  • 22 epitopes
  • 76% specificity
• Prediction by BIMAS matrix for HLA-A*1101
[Figure: number of experimental binders by BIMAS rank (axis ticks 1, 66, 100): 19 (52.8%), 5 (13.9%), 12 (33.3%)]
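The percentages on this slide can be checked with a few lines of Python. The sketch assumes that the 76% figure is the share of the 29 predictions that turned out to be epitopes, and that the three BIMAS counts partition one set of experimental binders; the slide does not state either point explicitly, and the variable names are illustrative.

```python
def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Counts as they appear on the slide.
predictions = 29
confirmed_epitopes = 22
bimas_bins = [19, 5, 12]  # experimental binders per BIMAS rank bin

# Assumption: the three bins together cover every experimental binder.
total_binders = sum(bimas_bins)

print(percent(confirmed_epitopes, predictions))           # 75.9
print(total_binders)                                      # 36
print([percent(n, total_binders) for n in bimas_bins])    # [52.8, 13.9, 33.3]
```

Under these assumptions, 22 of 29 rounds to the 76% reported, and the three bin percentages match a total of 36 binders.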
Medical Record Analysis
• Looking for patterns that are
  • valid
  • novel
  • useful
  • understandable
Gene Expression Analysis
• Classifying gene expression profiles
  • find stable differentially expressed genes
  • find significant gene groups
  • derive coordinated gene expression
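One common way to look for differentially expressed genes is to score each gene by how well its expression separates two sample classes. The sketch below ranks genes by Welch's t-statistic using only the Python standard library. The talk does not say which statistic was used, so this choice is an assumption, and the gene names and expression values are made up.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two independent samples."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def rank_genes(expr_a, expr_b):
    """Rank gene names by absolute t-statistic, largest first."""
    scores = {g: welch_t(expr_a[g], expr_b[g]) for g in expr_a}
    return sorted(scores, key=lambda g: abs(scores[g]), reverse=True)


# Hypothetical expression levels per gene, four samples per class.
class_a = {"geneX": [1.0, 1.2, 0.9, 1.1], "geneY": [5.0, 5.5, 4.8, 5.2]}
class_b = {"geneX": [3.0, 3.1, 2.9, 3.2], "geneY": [5.1, 5.3, 4.9, 5.0]}

print(rank_genes(class_a, class_b))
```

Here `geneX` ranks first because its class means differ by about 2 with little spread, while `geneY` barely moves between classes.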
Medical Record & Gene Expression Analysis Results
• PCL, a novel “emerging pattern” method
  • Beats C4.5, CBA, LB, NB, TAN in 21 out of 32 UCI benchmarks
  • Works well for gene expression data
Cancer Cell, March 2002, 1(2)
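An emerging pattern is usually described as an itemset whose support grows sharply from one class of data to another. The sketch below computes that growth rate for a candidate pattern. It illustrates only the underlying notion, not the PCL classifier itself, and the discretised "gene=level" items and sample sets are hypothetical.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = frozenset(itemset)
    if not transactions:
        return 0.0
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def growth_rate(itemset, background, target):
    """Support in the target class divided by support in the background class."""
    s_bg = support(itemset, background)
    s_tg = support(itemset, target)
    if s_bg == 0:
        return float("inf") if s_tg > 0 else 0.0
    return s_tg / s_bg


# Hypothetical discretised expression profiles, one set of items per sample.
healthy = [
    {"geneA=low", "geneB=high"},
    {"geneA=low", "geneB=low"},
    {"geneA=high", "geneB=high"},
    {"geneA=low", "geneB=high"},
]
tumour = [
    {"geneA=high", "geneB=low"},
    {"geneA=high", "geneB=low"},
    {"geneA=high", "geneB=high"},
    {"geneA=low", "geneB=low"},
]

print(growth_rate({"geneA=high"}, healthy, tumour))               # 3.0
print(growth_rate({"geneA=high", "geneB=low"}, healthy, tumour))  # inf
```

A pattern with infinite growth rate never occurs in the background class, which is what makes such patterns readable as rules about the target class.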
Protein Interaction Extraction
[Figure label: WEB]
“What are the protein-protein interaction pathways from the latest reported discoveries?”
Protein Interaction Extraction Results
• Rule-based system for processing free text in scientific abstracts
• Specialized in
  • extracting protein names
  • extracting protein-protein interactions
[Figure label: Jak1]
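A rule-based extractor of this kind can be illustrated with a single regular-expression rule in Python. This is a toy sketch, not the PIES rule set: the protein-name pattern, the verb list, and the example sentences are all assumptions made up for illustration, and real abstracts need far richer rules.

```python
import re

# Hypothetical name rule: a capitalised token containing a digit, e.g. Jak1.
PROTEIN = r"[A-Z][A-Za-z]*\d+[A-Za-z]*"
# Hypothetical list of interaction verbs.
VERBS = r"(?:binds|activates|phosphorylates|interacts with|inhibits)"
# One rule: <protein> <interaction verb> <protein>.
RULE = re.compile(rf"\b({PROTEIN})\s+({VERBS})\s+({PROTEIN})\b")


def extract_interactions(text):
    """Return (protein, verb, protein) triples matched by the single rule."""
    return [m.groups() for m in RULE.finditer(text)]


# Made-up example text in the style of an abstract.
abstract = ("We show that Jak1 phosphorylates Stat3 in vitro. "
            "In contrast, Jak2 binds Stat5.")
print(extract_interactions(abstract))
```

Each triple is an edge between two proteins, so the output of many abstracts can be assembled into an interaction graph and queried for pathways.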
Behind the Scene
• Vladimir Bajic
• Vladimir Brusic
• Jinyan Li
• See-Kiong Ng
• Limsoon Wong
• Louxin Zhang
• Allen Chong
• Judice Koh
• SPT Krishnan
• Huiqing Liu
• Seng Hong Seah
• Soon Heng Tan
• Guanglan Zhang
• Zhuo Zhang
and many more: students, folks from geneticXchange, MolecularConnections, and other collaborators.