geneCBR: a case-based reasoning tool for cancer diagnosis using microarray datasets
Dr. Florentino Fdez-Riverola, University of Vigo
Computer System of New Generation
Outline
• DNA Microarray Technology: characteristics and model operation overview
• Bioinformatics and AI: new challenges and emerging research areas
• CBR systems: case-based reasoning
• GENE-CBR: human genome analysis using CBR systems
• Demo: geneCBR in action, cancer diagnosis using microarrays
Microarrays: characteristics
• silicon chips that can measure the expression levels of thousands of genes simultaneously
• microarrays are based on a database of over 40,000 gene fragments called expressed sequence tags (ESTs)
• allow us, for the first time, to obtain a “global” view of the cells belonging to:
  • different individuals
  • different time intervals for the same individual
  • different tissues of the same individual
• gene expression profiles can be used as inputs to large-scale data analysis:
  • as fingerprints to build more accurate molecular classifications
  • to discover hidden taxonomies
  • to increase our understanding of normal and disease states
Microarrays: model operation overview
• how does the chip work?
  • microarray chips incorporate different dyed genes tiled in a grid-like fashion
  • the individual’s DNA to be analysed is dyed with a different colour
  • both sets of labelled DNA strands are allowed to hybridize (bind)
  • hybridization events are detected by identifying fluorescence changes in the strands of DNA
  • a scanner and its associated software perform various forms of image analysis to measure and report raw gene expression values
• the scanned intensities show how active the genes represented by the ESTs are in the cell:
  • strong fluorescence indicates that the gene is very active in the cell
  • no fluorescence indicates that the gene is inactive in the cell
[Figure: scanner, preprocessing, microarray data file]
Available data
• bone marrow samples from 43 adult patients with Acute Myeloid Leukemia (AML) plus 6 healthy individuals
  • 10 patients with Acute Promyelocytic Leukemia [APL]
  • 4 patients with Acute Myeloid Leukemia with inv(16) [AML-inv(16)]
  • 7 patients with Acute Monocytic Leukemia [AML-mono]
  • 22 patients with Acute non-Monocytic Leukemia [AML-other]
  • 6 samples belonging to healthy individuals [control samples]
• volume of information processed
  • each microarray contains 22,283 ESTs (genes)
  • 49 microarrays = 1,091,867 gene expression values
• data available today
  • 150 microarrays (Human Genome 133A) + 210 microarrays (Human Genome - plus)
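The volume figure on this slide is a plain product of microarrays and probes per microarray. A minimal Java sketch of that arithmetic follows; the class and method names are hypothetical, and the memory estimate assumes one 8-byte double per expression value, which the slides do not state.

```java
// Sketch only: size of the expression matrix described on the slide.
final class DataVolume {
    // Number of expression values = microarrays x probes (ESTs) per microarray.
    static long expressionValues(int microarrays, int probesPerArray) {
        return (long) microarrays * probesPerArray;
    }

    // Rough memory footprint, ASSUMING each value is stored as an 8-byte double.
    static long bytesAsDoubles(long values) {
        return values * Double.BYTES;
    }
}
```

For the 49 microarrays of 22,283 ESTs each, `DataVolume.expressionValues(49, 22283)` gives 1,091,867 values, which is roughly 8.7 MB under the double-precision assumption.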
Challenges for microarray Data Mining
• three main types of data analysis needed for biomedical applications:
  • gene selection (attribute selection in AI):
    • find the genes most strongly related to a particular class
  • classification (supervised classification in AI)
    • classifying diseases or predicting outcomes based on gene expression patterns, and perhaps even identifying the best treatment for a given genetic signature
  • clustering (unsupervised classification in AI)
    • finding new biological classes or refining existing ones
• three parallel research areas:
  • convenient visualization of experiments and results
  • discovery of biological knowledge (metabolic pathways, etc.)
  • low-level analysis providing better readouts (preprocessing, normalization, etc.)
Problems with existing data
The analysis of microarrays presents a number of unique challenges for Machine Learning and Data Mining techniques. The technology’s capacity for generating enormous amounts of data is also a handicap:
• great amount of data belonging to each individual (thousands of genes)
  • efficiency and memory problems
• lack of initial knowledge
  • what is the significance level of each gene?
• given the difficulty of collecting microarray samples, the number of samples is likely to remain small in many interesting cases
• having so many fields relative to so few samples creates a high likelihood of finding false positives
• these problems are aggravated by the potential errors that can be present in microarray data (systematic and random errors)
Sophisticated data analysis techniques and robust methods are required, capable of extracting biologically meaningful knowledge from the raw data.
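The false-positive risk mentioned on this slide can be made concrete with a small calculation. The Java sketch below assumes a per-gene statistical test at significance level alpha and, for the first method, that no gene is truly related to the class; the Bonferroni correction is a standard technique introduced here for illustration, not something the slides describe, and all names are hypothetical.

```java
// Sketch only: why many genes and few samples invite false positives.
final class MultipleTesting {
    // Expected number of genes passing a per-gene test at level alpha purely
    // by chance, ASSUMING no gene is truly related to the class.
    static double expectedFalsePositives(int genes, double alpha) {
        return genes * alpha;
    }

    // Bonferroni-corrected per-gene threshold (an ASSUMED countermeasure,
    // not taken from the slides): divide alpha by the number of tests.
    static double bonferroniThreshold(int genes, double alpha) {
        return alpha / genes;
    }
}
```

With 22,283 ESTs and alpha = 0.05, about 1,114 genes would look significant by chance alone, which is why a selection step that accounts for the number of genes is needed.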
CBR systems (Case-Based Reasoning)
• Kolodner (1983a, 1983b): problem-solving paradigm in AI that can be viewed as a methodology for reasoning and learning
• “reasoning by re-using past cases is a powerful and frequently applied way to solve problems for humans” Joh (1997)
• the memory of the system (case base) stores a certain number of previously experienced situations
  • CASE = PROBLEM description + applied SOLUTION [+ RESULT]
• a new problem is solved by finding similar past cases and reusing them in the new problem situation
  • Riesbeck et al. (1989)
• four cyclical steps are performed whenever a new problem must be solved
  • Kolodner (1993); Aamodt and Plaza (1994); Watson (1997)
• case-based reasoning is, in effect, a cyclic and integrated process of solving a problem, learning from this experience, solving a new problem, and so on
The CBR cycle
MEMORY: CASE BASE
• New problem
• (1) RETRIEVE: retrieving one or more previously experienced cases (most similar cases)
• (2) REUSE: reusing the case(s) in one way or another (proposed solution)
• (3) REVISE: revising the solution based on reusing previous case(s) (confirmed solution)
• (4) RETAIN: retaining the new experience by incorporating it into the existing knowledge base (case base)
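The four steps of the cycle can be sketched as a tiny Java class. This is an illustration under stated assumptions, not the geneCBR implementation: the names are hypothetical, a case is reduced to a numeric problem vector plus a solution label, retrieval uses a single nearest neighbour with Euclidean distance, and revision is modelled as an optional expert correction.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: a minimal retrieve / reuse / revise / retain loop.
final class SimpleCbr {
    // CASE = PROBLEM description + applied SOLUTION
    static final class Case {
        final double[] problem;
        final String solution;
        Case(double[] problem, String solution) {
            this.problem = problem.clone();
            this.solution = solution;
        }
    }

    private final List<Case> caseBase = new ArrayList<>();

    // (1) RETRIEVE: most similar stored case (ASSUMPTION: Euclidean distance,
    // all problem vectors of equal length). Returns null on an empty case base.
    Case retrieve(double[] problem) {
        Case best = null;
        double bestDist = Double.POSITIVE_INFINITY;
        for (Case c : caseBase) {
            double d = distance(c.problem, problem);
            if (d < bestDist) { bestDist = d; best = c; }
        }
        return best;
    }

    // (2) REUSE: propose the retrieved case's solution unchanged.
    String reuse(Case retrieved) {
        return retrieved == null ? null : retrieved.solution;
    }

    // (3) REVISE: an expert may override the proposal; null means "accept".
    String revise(String proposed, String expertCorrection) {
        return expertCorrection != null ? expertCorrection : proposed;
    }

    // (4) RETAIN: store the confirmed experience in the case base.
    void retain(double[] problem, String confirmed) {
        caseBase.add(new Case(problem, confirmed));
    }

    // One full pass through the cycle for a new problem.
    String solve(double[] problem, String expertCorrection) {
        String proposed = reuse(retrieve(problem));
        String confirmed = revise(proposed, expertCorrection);
        if (confirmed != null) retain(problem, confirmed);
        return confirmed;
    }

    int size() { return caseBase.size(); }

    private static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }
}
```

Each call to `solve` grows the case base, which is the sense in which the cycle both solves a problem and learns from it.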
Main characteristics of CBR systems
• adaptive and dynamic systems: the number of cases stored in the memory of the model changes, allowing the system to adapt to new situations
• CBR systems allow general knowledge to be used in solving a particular problem
• CBR systems facilitate the indexing of the available information
• CBR systems can use incomplete cases
• CBR systems are aware of their limitations (a problem may have no solution)
• CBR systems facilitate the use of representative and flexible data structures
• case adaptation helps to discover interconnections and hidden structures in the available data
• CBR systems can be completely automated
GENE-CBR
Goals
Users: doctor, research group, programmer
Objectives of GENE-CBR:
• develop an effective and reliable system able to diagnose cancer subtypes based on the analysis of microarray data
  • CBR system (Case-Based Reasoning): “Solve new problems (new patient) based on the previous experience (diagnosed patients)”
• implement a flexible tool for designing and testing new techniques and experiments
  • AI techniques: selection, clustering, inference…
• construct an advanced editing module for run-time modification of coded techniques
  • BeanShell programmer interface
Logic architecture
[Figure: architecture diagram. Labels: wizard; JCBR; users (doctor, research group, programmer); modes: Expert Mode (testing techniques), Diagnostic Mode (diagnosing), Programming Mode (BeanShell); techniques: DFP, GCS; CBR cycle over the CASE BASE: [1] RETRIEVE, [2] REUSE, [3] REVISE, [4] RETAIN]
gene-CBR: model overview
[Figure: model overview. Labels: Gene Selection (most relevant genes = DFP); Clustering (genetically similar patients); Prediction (initial prediction); Knowledge Discovery (revised prediction and final diagnosis); reclassification]
GENE-CBR::[i] retrieval
• objectives:
  • perform gene selection without losing information
  • extract simplified fuzzy patterns (FP) for each pathology
  • make it possible to use AI techniques that were initially discarded
• main phases:
  • supervised fuzzy discretisation of gene expression values
    • Low, Medium, High and overlapping labels (LM, MH)
  • supervised gene selection for each pathology
• advantages:
  • independence from the ordering of the data
  • takes data variability into account
  • allows new knowledge to be discovered
  • the results obtained are interpretable
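To illustrate the idea of Low, Medium and High labels with overlapping LM and MH labels, here is a minimal Java sketch. It is not the discretisation used by geneCBR: the slides describe a supervised procedure, whereas this sketch simply takes three breakpoints (`lo`, `mid`, `hi`) and a membership threshold (`theta`) as given. The piecewise-linear membership functions and all names are assumptions made for illustration.

```java
// Sketch only: fuzzy labelling of one gene expression value.
final class FuzzyDiscretizer {
    private final double lo, mid, hi, theta;

    // lo < mid < hi are ASSUMED breakpoints for one gene; theta is the
    // membership level a label must reach to count as "active".
    FuzzyDiscretizer(double lo, double mid, double hi, double theta) {
        if (!(lo < mid && mid < hi)) {
            throw new IllegalArgumentException("need lo < mid < hi");
        }
        this.lo = lo; this.mid = mid; this.hi = hi; this.theta = theta;
    }

    // Low: 1 at or below lo, falling linearly to 0 at mid.
    double low(double x) { return clamp((mid - x) / (mid - lo)); }

    // Medium: triangular, 0 at lo, 1 at mid, 0 at hi.
    double medium(double x) {
        return x <= mid ? clamp((x - lo) / (mid - lo)) : clamp((hi - x) / (hi - mid));
    }

    // High: 0 at or below mid, rising linearly to 1 at hi.
    double high(double x) { return clamp((x - mid) / (hi - mid)); }

    // Two adjacent labels both reaching theta yield an overlapping label;
    // otherwise the label with the largest membership wins.
    String label(double x) {
        double l = low(x), m = medium(x), h = high(x);
        if (l >= theta && m >= theta) return "LM";
        if (m >= theta && h >= theta) return "MH";
        if (l >= m && l >= h) return "L";
        return m >= h ? "M" : "H";
    }

    private static double clamp(double v) { return Math.max(0.0, Math.min(1.0, v)); }
}
```

With breakpoints 0, 10, 20 and theta = 0.4, a value halfway between two breakpoints receives an overlapping label, while values close to a breakpoint receive a single label.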
GENE-CBR::[i] retrieval
[Figure: samples grouped by class: APL (Acute Promyelocytic Leukemia), AML-inv(), AML-mono, AML-other, healthy]
GENE-CBR::[i] retrieval
[Figure: one fuzzy pattern per class (FP_healthy, FP_APL, FP_AML-inv(), FP_AML-monocytic, FP_AML-other, ...) leading to the DFP]
GENE-CBR::[ii] reuse
• objectives:
  • unsupervised identification of genetic similarities between patients
  • taking into account only the previously selected genes (DFP)
• main phases:
  • training a DFP-dimensional GCS network
    • Growing Cell Structures. Fritzke, B. (1993)
  • presenting the new patient to the network
  • classifying using a proportional weighted voting scheme
• advantages:
  • clustering without taking the patient class into account
  • definition of an indexing and similarity structure between nodes (relating patients)
  • generation of clusters containing new subtypes of unknown cancer (knowledge discovery)
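The voting step on this slide can be illustrated with a short Java sketch. The slides do not give the weighting formula, so this version assumes each similar patient votes for its own class with a weight inversely proportional to its distance from the new patient, and that the weights are normalised to sum to 1. The GCS network itself is not modelled, and all names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: distance-weighted class voting among retrieved patients.
final class WeightedVote {
    // Returns the share of the total vote obtained by each class.
    // ASSUMPTION: weight = 1 / (distance + epsilon); epsilon avoids division by zero.
    static Map<String, Double> shares(String[] classes, double[] distances) {
        if (classes.length == 0 || classes.length != distances.length) {
            throw new IllegalArgumentException("need one distance per class label");
        }
        final double epsilon = 1e-9;
        Map<String, Double> votes = new LinkedHashMap<>();
        double total = 0.0;
        for (int i = 0; i < classes.length; i++) {
            double w = 1.0 / (distances[i] + epsilon);
            votes.merge(classes[i], w, Double::sum);
            total += w;
        }
        // Normalise so the shares add up to 1.
        for (Map.Entry<String, Double> e : votes.entrySet()) {
            e.setValue(e.getValue() / total);
        }
        return votes;
    }

    // The predicted class is the one with the largest share.
    static String winner(Map<String, Double> shares) {
        String best = null;
        double bestShare = -1.0;
        for (Map.Entry<String, Double> e : shares.entrySet()) {
            if (e.getValue() > bestShare) { bestShare = e.getValue(); best = e.getKey(); }
        }
        return best;
    }
}
```

Returning the per-class shares, and not only the winner, matches the later revise step, where the assigned weights are shown to the doctor.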
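GENE-CBR::[ii] reuse
[Figure: patients (PAT.) described by their DFP gene expression values and CLASS (AML-inv(), AML-other, ...); a new patient of unknown class (?) is compared with them on a scale from + Similarity to - Similarity; result shown: AML-inv()]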
GENE-CBR::[iii] revise
• objectives:
  • provide doctors with meaningful information about the classification carried out by the system
  • help in discovering new knowledge
  • if-then rules as a decision-support mechanism
• information supplied:
  • identification of similar patients (from a genetic point of view)
  • proportional weighted voting and assigned weights
  • rule generation using See5. Quinlan, J.R. (2000)
    • DFP genes belonging to the set of patients retrieved by the GCS network
• advantages:
  • doctors can supervise the final decision proposed by the system
  • new knowledge is generated in the form of easily understandable rules
GENE-CBR::[iii] revise
[Figure: retrieved patients (AML-inv(), AML-inv(), AML-inv(), AML-other, AML-inv()) with their BIOLOGICAL AND CLINICAL CHARACTERISTICS and KARYOTYPE]
Rule 6: (45 / 4, lift 1.1)
If X65962 (AFFX-HSAC07/X00351_5_at) is LOW then
If U96781 (AFFX-BioDn-3_at) is LOW-MEDIUM then AML-other
Else If D87845 (AFFX-hum_alu_at) is HIGH then AML-inv() [0.968]
GENE-CBR::[iv] retain
• objectives:
  • feed new knowledge back into the system
    • new subclassification of existing cancer pathologies
    • reclassification of existing patients
    • identification of correlated genes
    • discovery of new markers able to distinguish new pathologies
    • identification of prototypical patients and rare cases
• main phases:
  • update the case base with a new microarray every time a new classification is generated
  • modification of the parameters of the model
• advantages:
  • new biological knowledge can be easily integrated into the hybrid system
Applied technologies
• 100% Java
  • Swing
  • BeanShell
  • Log4j
  • JFreeChart
• Design patterns
  • Action
  • Future
  • MVC
  • Singleton
  • Wizard
• Unified Modeling Language
  • Poseidon for UML
Future work
• moving towards a plug-in architecture
  • designing a core where each technique is implemented as a plug-in => aiBENCH
• implementing k-fold cross-validation
  • automatic generation of multiple training and test sets
• supporting standard microarray data formats
  • MIAME: Minimum Information About a Microarray Experiment
• deploying GENE-CBR with Java Web Start
  • remote and automatic access to the latest versions of the GENE-CBR project
• on-line access to genetic sequence databases
  • GenBank (http://www.ncbi.nlm.nih.gov/Genbank)
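The cross-validation item above amounts to generating several training and test partitions automatically. A minimal Java sketch of one way to do this follows; it assumes samples are identified by their index, uses a seeded shuffle for reproducibility, and does not stratify by class. The class and method names are hypothetical and not part of GENE-CBR.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Sketch only: plain (non-stratified) k-fold partitioning of sample indices.
final class KFold {
    // Splits indices 0..n-1 into k folds of near-equal size after a seeded shuffle.
    static List<List<Integer>> folds(int n, int k, long seed) {
        if (k < 2 || k > n) {
            throw new IllegalArgumentException("need 2 <= k <= n");
        }
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) indices.add(i);
        Collections.shuffle(indices, new Random(seed));

        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) folds.add(new ArrayList<>());
        // Deal the shuffled indices round-robin so fold sizes differ by at most one.
        for (int i = 0; i < n; i++) folds.get(i % k).add(indices.get(i));
        return folds;
    }

    // Training set for a given test fold = every index outside that fold.
    static List<Integer> trainingIndices(List<List<Integer>> folds, int testFold) {
        List<Integer> train = new ArrayList<>();
        for (int f = 0; f < folds.size(); f++) {
            if (f != testFold) train.addAll(folds.get(f));
        }
        return train;
    }
}
```

With classes as small as the 4 AML-inv(16) samples in the dataset described earlier, a stratified split that keeps class proportions in every fold would probably be preferable; that refinement is left out here to keep the sketch short.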