From Informatics to Bioinformatics Limsoon Wong Kent Ridge Digital Labs Singapore
What are the Themes of Bioinformatics? Bioinformatics = Data Mgmt + Knowledge Discovery Data Mgmt = Integration + Transformation + Cleansing Knowledge Discovery = Statistics + Algorithms + Databases
What are the Benefits of Bioinformatics? • To the patient: • Better drug, better treatment • To the pharma: • Save time, save cost, make more $ • To the scientist: • Better science
Data Integration • A DOE “impossible query”: For each gene on a given cytogenetic band, find its non-human homologs.
Data Integration Results

sybase-add (#name: "GDB", ...);
create view L from locus_cyto_location using GDB;
create view E from object_genbank_eref using GDB;
select #accn: g.#genbank_ref, #nonhuman-homologs: H
from L as c, E as g,
  (select u
   from g.#genbank_ref.na-get-homolog-summary as u
   where not(u.#title string-islike "%Human%")
   andalso not(u.#title string-islike "%H.sapien%")) as H
where c.#chrom_num = "22"
andalso g.#object_id = c.#locus_id
andalso not (H = { });

• Using Kleisli: • Clear • Succinct • Efficient • Handles heterogeneity and complexity
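The logic of the query on this slide (restrict loci to one chromosome, join them to their GenBank references, keep only homologs whose titles do not look human, and drop genes left with no homologs) can be sketched in plain Python. This is only an illustration, not Kleisli: the three in-memory tables, the accession strings, and the titles are hypothetical stand-ins for the GDB views and the remote homolog lookup, and the substring test is an assumed reading of `string-islike`.

```python
# Hypothetical stand-ins for the GDB views and the GenBank homolog lookup.
LOCUS_CYTO_LOCATION = [
    {"locus_id": 1, "chrom_num": "22"},
    {"locus_id": 2, "chrom_num": "22"},
    {"locus_id": 3, "chrom_num": "7"},
]
OBJECT_GENBANK_EREF = [
    {"object_id": 1, "genbank_ref": "ACC0001"},
    {"object_id": 2, "genbank_ref": "ACC0002"},
    {"object_id": 3, "genbank_ref": "ACC0003"},
]
HOMOLOG_TITLES = {
    "ACC0001": ["Mouse example gene", "Human example gene"],
    "ACC0002": ["H.sapiens example gene"],
    "ACC0003": ["Rat example gene"],
}


def looks_human(title):
    """Assumed equivalent of the two string-islike tests in the query."""
    return "Human" in title or "H.sapien" in title


def nonhuman_homologs(chrom_num):
    """For each gene on the chromosome, list its non-human homolog titles."""
    results = []
    for c in LOCUS_CYTO_LOCATION:
        if c["chrom_num"] != chrom_num:
            continue
        for g in OBJECT_GENBANK_EREF:
            if g["object_id"] != c["locus_id"]:
                continue
            titles = HOMOLOG_TITLES.get(g["genbank_ref"], [])
            homologs = [t for t in titles if not looks_human(t)]
            if homologs:  # mirrors "not (H = { })"
                results.append(
                    {"accn": g["genbank_ref"], "nonhuman_homologs": homologs}
                )
    return results
```

Calling `nonhuman_homologs("22")` on this toy data keeps the first gene with its mouse homolog and drops the second, whose only homolog looks human.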
Data Warehousing

{(#uid: 6138971,
  #title: "Homo sapiens adrenergic ...",
  #accession: "NM_001619",
  #organism: "Homo sapiens",
  #taxon: 9606,
  #lineage: ["Eukaryota", "Metazoa", …],
  #seq: "CTCGGCCTCGGGCGCGGC...",
  #feature: {
    (#name: "source",
     #continuous: true,
     #position: [ (#accn: "NM_001619", #start: 0, #end: 3602, #negative: false)],
     #anno: [ (#anno_name: "organism", #descr: "Homo sapiens"), …] ),
    …)}

• Motivation • efficiency • availability • “denial of service” • data cleansing
• Requirements • efficient to query • easy to update • model data naturally
Data Warehousing Results
• A relational DBMS is insufficient because it forces us to fragment data into 3NF.
• Kleisli turns a flat relational DBMS into a nested relational DBMS. It can use flat relational DBMSs such as Sybase, Oracle, MySQL, etc. as its updatable complex object store. It can even use all of these systems simultaneously!

! Log in
oracle-cplobj-add (#name: "db", ...);
! Define table
create table GP (#uid: "NUMBER", #detail: "LONG") using db;
! Populate table with GenPept reports
select #uid: x.#uid, #detail: x into GP
from aa-get-seqfeat-general "PTP" as x using db;
! Map GP to that table
create view GP from GP using db;
! Run a query to get title of 131470
select x.#detail.#title from GP as x where x.#uid = 131470;
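The idea of keeping a nested record whole inside a flat table (a key column plus one large "detail" column) can be sketched with Python's standard library. This is a rough analogy under stated assumptions, not how Kleisli works internally: SQLite stands in for Oracle, JSON stands in for Kleisli's complex object encoding, and the record contents below are placeholders rather than a real GenPept report.

```python
import json
import sqlite3


def make_store():
    """Create a flat two-column table, analogous to GP(#uid, #detail)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE GP (uid INTEGER PRIMARY KEY, detail TEXT)")
    return conn


def put_record(conn, record):
    """Store the whole nested record in one column instead of splitting it into 3NF."""
    conn.execute(
        "INSERT OR REPLACE INTO GP (uid, detail) VALUES (?, ?)",
        (record["uid"], json.dumps(record)),
    )
    conn.commit()


def get_title(conn, uid):
    """Analogue of: select x.#detail.#title from GP as x where x.#uid = uid."""
    row = conn.execute("SELECT detail FROM GP WHERE uid = ?", (uid,)).fetchone()
    return None if row is None else json.loads(row[0])["title"]


conn = make_store()
put_record(
    conn,
    {
        "uid": 131470,
        "title": "placeholder title",  # placeholder, not the real report title
        "feature": [{"name": "source", "position": [{"start": 0, "end": 10}]}],
    },
)
```

Here `get_title(conn, 131470)` returns the stored title, and the nested `feature` list comes back intact from a single row lookup.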
Epitope Prediction TRAP-559AA MNHLGNVKYLVIVFLIFFDLFLVNGRDVQNNIVDEIKYSE EVCNDQVDLYLLMDCSGSIRRHNWVNHAVPLAMKLIQQLN LNDNAIHLYVNVFSNNAKEIIRLHSDASKNKEKALIIIRS LLSTNLPYGRTNLTDALLQVRKHLNDRINRENANQLVVIL TDGIPDSIQDSLKESRKLSDRGVKIAVFGIGQGINVAFNR FLVGCHPSDGKCNLYADSAWENVKNVIGPFMKAVCVEVEK TASCGVWDEWSPCSVTCGKGTRSRKREILHEGCTSEIQEQ CEEERCPPKWEPLDVPDEPEDDQPRPRGDNSSVQKPEENI IDNNPQEPSPNPEEGKDENPNGFDLDENPENPPNPDIPEQ KPNIPEDSEKEVPSDVPKNPEDDREENFDIPKKPENKHDN QNNLPNDKSDRNIPYSPLPPKVLDNERKQSDPQSQDNNGN RHVPNSEDRETRPHGRNNENRSYNRKYNDTPKHPEREEHE KPDNNKKKGESDNKYKIAGGIAGGLALLACAGLAYKFVVP GAATPYAGEPAPFDETLGEEDKDLDEPEQFRLPEENEWN
Epitope Prediction Results • Prediction by our ANN model for HLA-A11 • 29 predictions • 22 epitopes • 76% specificity • Prediction by BIMAS matrix for HLA-A*1101 [Figure: number of experimental binders by BIMAS rank (axis ticks 1, 66, 100): 19 (52.8%), 5 (13.9%), 12 (33.3%)]
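The percentages on this slide can be checked from the counts it gives. The short sketch below assumes that the 76% figure is 22 epitopes out of 29 predictions and that the three BIMAS counts (19, 5, 12) partition a single set of experimental binders; both readings are inferences from the numbers, not statements made on the slide.

```python
def percent(part, whole):
    """Percentage of part in whole, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# ANN model: assumed reading, 22 confirmed epitopes among 29 predictions.
ann_rate = percent(22, 29)

# BIMAS: assumed reading, the three counts split one set of binders.
bimas_counts = [19, 5, 12]
bimas_total = sum(bimas_counts)
bimas_shares = [percent(n, bimas_total) for n in bimas_counts]
```

Under these assumptions `ann_rate` is 75.9, which rounds to the 76% quoted, and `bimas_shares` reproduces 52.8, 13.9, and 33.3 over a total of 36 binders.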
Gene Expression Analysis • Clustering gene expression profiles • Classifying gene expression profiles • find stable differentially expressed genes
Gene Expression Analysis Results • The Discovery System • Correlation test • Voter selection • Class prediction
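The slide names three steps without giving details. One common way to realize such a pipeline is a signal-to-noise weighted-voting classifier, sketched below; this is a swapped-in illustration and should not be read as the Discovery System's actual method. The function names, the two class labels "A" and "B", and the toy profiles are all hypothetical.

```python
from statistics import mean, stdev


def signal_to_noise(values_a, values_b):
    """Correlation-style score: how well one gene separates class A from class B.

    Assumes the gene is not constant in both classes (nonzero denominator).
    """
    return (mean(values_a) - mean(values_b)) / (stdev(values_a) + stdev(values_b))


def select_voters(profiles_a, profiles_b, k):
    """Pick the k genes with the largest absolute score as voters.

    Returns (gene_index, weight, midpoint) triples.
    """
    scored = []
    for g in range(len(profiles_a[0])):
        a = [p[g] for p in profiles_a]
        b = [p[g] for p in profiles_b]
        scored.append((g, signal_to_noise(a, b), (mean(a) + mean(b)) / 2))
    scored.sort(key=lambda t: abs(t[1]), reverse=True)
    return scored[:k]


def predict(sample, voters):
    """Each voter casts weight * (expression - midpoint); the sign decides the class."""
    total = sum(w * (sample[g] - mid) for g, w, mid in voters)
    return "A" if total > 0 else "B"


# Toy expression profiles (rows are samples, columns are genes); values are made up.
profiles_a = [[5.0, 1.0, 3.0], [6.0, 1.2, 2.0], [5.5, 0.8, 4.0]]
profiles_b = [[1.0, 1.1, 3.5], [1.5, 0.9, 2.5], [0.5, 1.0, 3.0]]
voters = select_voters(profiles_a, profiles_b, k=1)
```

On this toy data only the first gene differs between the classes, so it is the single voter selected, and a new sample is assigned to whichever class mean its expression of that gene is closer to.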
Protein Interaction Extraction “What are the protein-protein interaction pathways from the latest reported discoveries?” [Figure label: WEB]
Protein Interaction Extraction Results • Rule-based system for processing free text in scientific abstracts • Specialized in • extracting protein names • extracting protein-protein interactions [Figure label: Jak1]
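A rule-based extractor of this kind can be illustrated with a single pattern of the form "protein, interaction verb, protein". The sketch below is a deliberately tiny stand-in, not the system described on the slide: the name dictionary, the verb list, and the example sentence are hypothetical, and a real system would need many more rules.

```python
import re

# Hypothetical dictionaries; a real system would use far larger resources.
PROTEIN_NAMES = ["Jak1", "Stat3", "Tyk2"]
INTERACTION_VERBS = ["binds", "phosphorylates", "activates", "inhibits", "interacts with"]


def build_rule(names, verbs):
    """Compile one rule: <protein> <verb> <protein>."""
    name = "|".join(re.escape(n) for n in names)
    verb = "|".join(re.escape(v) for v in verbs)
    return re.compile(rf"\b({name})\b\s+({verb})\s+\b({name})\b")


def extract_interactions(text):
    """Return (protein, verb, protein) triples matched by the rule."""
    rule = build_rule(PROTEIN_NAMES, INTERACTION_VERBS)
    return [m.groups() for m in rule.finditer(text)]


# Toy sentence written for this example, not taken from a real abstract.
example = "In this toy sentence, Jak1 phosphorylates Stat3. Tyk2 is mentioned alone."
triples = extract_interactions(example)
```

The rule finds one triple in the toy sentence and ignores the protein that appears without an interaction verb and a partner.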
Medical Record Analysis • Looking for patterns that are • valid • novel • useful • understandable
Medical Record Analysis Results • DeEPs, a novel “emerging pattern” method • Beats C4.5, CBA, LB, NB, TAN in 21 out of 32 UCI benchmarks • Works for gene expressions
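An emerging pattern is an itemset whose support grows markedly from one class of data to another, and the usual measure is the growth rate, the ratio of the two supports. The sketch below computes only that measure on made-up data; it is not an implementation of DeEPs, which builds on this notion, and the function names and toy transactions are hypothetical.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def growth_rate(itemset, background, target):
    """Support in target divided by support in background.

    Infinite when the itemset occurs only in the target class.
    """
    s_bg = support(itemset, background)
    s_tg = support(itemset, target)
    if s_bg == 0.0:
        return float("inf") if s_tg > 0.0 else 0.0
    return s_tg / s_bg


# Toy transactions for two classes; items are arbitrary labels.
background = [{"a", "b"}, {"a"}, {"b", "c"}, {"c"}]
target = [{"a", "b"}, {"a", "b", "c"}, {"a", "b"}, {"c"}]
```

On this data `{"a", "b"}` has growth rate 3.0 (support 0.75 against 0.25), `{"c"}` has growth rate 1.0 and so does not distinguish the classes, and `{"a", "b", "c"}` occurs only in the target class, giving an infinite growth rate.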
Behind the Scene • Research: Vladimir Bajic, Vladimir Brusic, Jinyan Li, See-Kiong Ng, Limsoon Wong, Louxin Zhang • Business: Peter Saunders • Industry Assignees: Hao Han (gX), Rahul Despande (MC) • Engineering: Allen Chong, Judice Koh, SPT Krishnan, Seng Hong Seah, Guanglan Zhang, Zhuo Zhang • Students: Huiqing Liu, Song Zhu, Kun Yu