Compact and Understandable Descriptions of Mixtures of Bernoulli Distributions
Jaakko Hollmén and Jarkko Tikka
Helsinki Institute of Information Technology, Helsinki University of Technology, Espoo, Finland
Hollmén, Tikka: Compact and Understandable Descriptions of Mixtures of Bernoulli Models
Background on the problem
• Collaboration: Knuutila, Myllykangas at the University of Helsinki
• DNA copy number amplifications are mutations in the DNA structure in cancer
• Bibliomics survey of 838 journal articles published during 1992-2002
• Data: chromosomal mutations of 4500 cancer patients
Example on the data collection
S. Myllykangas, J. Himberg, T. Böhling, B. Nagy, J. Hollmén, and S. Knuutila. DNA copy number amplification profiling of human neoplasms. Oncogene, 25(55):7324-7332, November 2006.
Chromosomal regions: names
• Standardized nomenclature for chromosomal regions (spatial)
• 1p36.2: chromosome 1, the arm p, region 36, subregion 2
• Ranges: 1p36.1-p36.3
• Hierarchical, irregular naming scheme used in literature
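The naming scheme above can be sketched as a small parser. This is only an illustration: the regular expression `BAND` and the functions `parse_band` and `parse_range` are hypothetical names, and the pattern covers only the simple forms shown on the slide (a single band such as 1p36.2 and a range such as 1p36.1-p36.3), not every irregular variant found in the literature.

```python
import re

# Hypothetical pattern for simple band names such as "1p36.2" or "Xq28".
# Field names follow the slide: chromosome, arm, region, subregion.
BAND = re.compile(
    r"^(?P<chrom>[0-9]{1,2}|X|Y)(?P<arm>[pq])(?P<region>\d+)(?:\.(?P<sub>\d+))?$"
)

def parse_band(name):
    """Split one band name into its hierarchical parts (all kept as strings)."""
    m = BAND.match(name.strip())
    if m is None:
        raise ValueError(f"unrecognized band name: {name!r}")
    return {
        "chromosome": m["chrom"],
        "arm": m["arm"],
        "region": m["region"],
        "subregion": m["sub"],  # None when no subregion is given
    }

def parse_range(name):
    """Split a range such as '1p36.1-p36.3' into (start, end) bands.

    When the end omits the chromosome, it is inherited from the start.
    A single band is returned as a range from itself to itself.
    """
    start_txt, sep, end_txt = name.partition("-")
    start = parse_band(start_txt)
    if not sep:
        return start, start
    end_txt = end_txt.strip()
    if end_txt[:1] in ("p", "q"):
        end_txt = start["chromosome"] + end_txt
    return start, parse_band(end_txt)
```

For example, `parse_range("1p36.1-p36.3")` yields two bands on chromosome 1, arm p, region 36, with subregions 1 and 3. Names that do not match the simple pattern raise `ValueError` rather than being guessed at.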
DNA copy number amplification data as 0-1 data
[Figure: 0-1 data matrix; cancer patients (i) against chromosomal areas, spatial coordinates (j)]
Mixture models for 0-1 data
• Cancer is a collection of diseases
• Finite mixture model of multivariate Bernoulli distributions
• Learn the model with the EM algorithm
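A minimal sketch of EM for a finite mixture of multivariate Bernoulli distributions follows, using only the Python standard library. It assumes the usual form of the model, in which a 0-1 vector x has probability sum over components j of pi_j times the product over areas of theta_j^x (1 - theta_j)^(1 - x). The symbols `pi` and `theta`, the function names, the random initialization, and the small pseudo-count `pseudo` (added to keep all parameters strictly inside (0, 1)) are assumptions of this sketch, not details taken from the slides.

```python
import math
import random

def component_log_joint(x, pi, theta):
    """log(pi_j) + log p(x | theta_j) for each component j, for one 0-1 vector x."""
    return [
        math.log(p_j) + sum(math.log(t if xi else 1.0 - t) for xi, t in zip(x, theta_j))
        for p_j, theta_j in zip(pi, theta)
    ]

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def fit_bernoulli_mixture(X, J, n_iter=50, seed=0, pseudo=1e-3):
    """Fit J components to 0-1 rows X; returns (pi, theta)."""
    n, d = len(X), len(X[0])
    rng = random.Random(seed)
    pi = [1.0 / J] * J
    # Assumed initialization: random parameters away from 0 and 1.
    theta = [[rng.uniform(0.25, 0.75) for _ in range(d)] for _ in range(J)]
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each row.
        resp = []
        for x in X:
            lj = component_log_joint(x, pi, theta)
            z = log_sum_exp(lj)
            resp.append([math.exp(v - z) for v in lj])
        # M-step: weighted frequencies, smoothed by the pseudo-count.
        for j in range(J):
            n_j = sum(r[j] for r in resp)
            pi[j] = (n_j + pseudo) / (n + J * pseudo)
            theta[j] = [
                (sum(r[j] * x[a] for r, x in zip(resp, X)) + pseudo) / (n_j + 2.0 * pseudo)
                for a in range(d)
            ]
    return pi, theta

def average_log_likelihood(X, pi, theta):
    """Mean log-likelihood per row under the fitted mixture."""
    return sum(log_sum_exp(component_log_joint(x, pi, theta)) for x in X) / len(X)
```

Because of the pseudo-count, the M-step is a lightly smoothed version of the plain maximum-likelihood update; setting `pseudo` very small brings it close to the unsmoothed estimate at the cost of possible `log(0)` problems.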
Model selection: how many components in a mixture?
• 5-fold cross-validation repeated 10 times
• Compare solutions with different numbers of components, based on the average likelihood for a validation set
[Figure: training and validation curves; J=6]
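The selection procedure above can be sketched as follows. The callables `fit` and `avg_loglik` are hypothetical placeholders for a model-fitting routine and a validation-likelihood routine; the function names, the shuffling scheme, and the rule of picking the candidate with the highest mean validation score are assumptions of this sketch.

```python
import random

def repeated_kfold(n, k=5, repeats=10, seed=0):
    """Yield (train_indices, validation_indices) for k-fold CV repeated several times."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)  # a fresh random partition for each repetition
        for f in range(k):
            val = idx[f::k]  # every k-th shuffled index forms one fold
            held_out = set(val)
            train = [i for i in idx if i not in held_out]
            yield train, val

def select_num_components(X, candidates, fit, avg_loglik, k=5, repeats=10, seed=0):
    """Return (best_J, scores), scoring each J by its mean validation likelihood.

    fit(train_rows, J) -> model and avg_loglik(model, val_rows) -> float are
    supplied by the caller; they are placeholders, not a fixed API.
    """
    scores = {}
    for J in candidates:
        vals = []
        for train, val in repeated_kfold(len(X), k, repeats, seed):
            model = fit([X[i] for i in train], J)
            vals.append(avg_loglik(model, [X[i] for i in val]))
        scores[J] = sum(vals) / len(vals)
    best = max(scores, key=scores.get)
    return best, scores
```

Using the same seed for every candidate J means all candidates are compared on identical splits, which removes one source of noise from the comparison.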
Mixture model: Chromosome 1
• Model is summarized by J + Jd parameters: J mixing proportions and Jd Bernoulli parameters, where d is the number of chromosomal areas (about 200 parameters altogether)
[Figure: mixture components j=1,...,6 against chromosomal areas (spatial coordinates)]
Mixture model in clustering
[Figure: clustered cancer patients against chromosomal areas (spatial coordinates)]
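One common way to use a fitted mixture for clustering is to assign each patient to the component with the highest posterior probability. The slides do not state the assignment rule, so the sketch below is an assumption; `assign_cluster`, `pi`, and `theta` are illustrative names.

```python
import math

def assign_cluster(x, pi, theta):
    """Index of the component maximizing pi_j * p(x | theta_j) for a 0-1 vector x.

    Assumes every pi_j and theta value lies strictly between 0 and 1.
    """
    best_j, best_score = None, -math.inf
    for j, (p_j, theta_j) in enumerate(zip(pi, theta)):
        # The log of the unnormalized posterior is enough for the argmax.
        score = math.log(p_j) + sum(
            math.log(t if xi else 1.0 - t) for xi, t in zip(x, theta_j)
        )
        if score > best_score:
            best_j, best_score = j, score
    return best_j

def cluster_rows(X, pi, theta):
    """Group row indices of X by their assigned component."""
    clusters = {j: [] for j in range(len(pi))}
    for i, x in enumerate(X):
        clusters[assign_cluster(x, pi, theta)].append(i)
    return clusters
```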
Solution creates a problem
• We solved the modeling problem, but created a communications problem!
• How do the cancer experts understand and refer to our models? Names?
Compact and Understandable Descriptions
• Understandable (language, nomenclature)
• Compact (size of the description)
• Describe the parameters of the model
• Use the model to cluster the data and describe the data in the clusters
Describe the model parameters
• Mode of the component distribution
  • most probable chromosomal area
• Hypothetical mean organism (HMO)
  • quantize the parameters to represent a hypothetical case of data
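The two parameter-based descriptions can be illustrated with a short sketch. It reads "most probable chromosomal area" as the area with the largest Bernoulli parameter in a component, and it quantizes parameters to 0-1 values with a threshold of 0.5; both the reading and the threshold are assumptions, as are the function names.

```python
def most_probable_area(theta_j, names):
    """Name of the chromosomal area with the largest parameter in one component."""
    best = max(range(len(theta_j)), key=lambda a: theta_j[a])
    return names[best]

def hypothetical_mean_organism(theta_j, threshold=0.5):
    """Quantize one component's parameters into a 0-1 vector.

    The 0.5 threshold is an assumed choice; the slide only says the
    parameters are quantized to represent a hypothetical case of data.
    """
    return [1 if t > threshold else 0 for t in theta_j]
```

For instance, with parameters `[0.1, 0.7, 0.9, 0.2]` over areas `["1q21", "1q22", "1q23", "1q24"]`, the most probable area is `"1q23"` and the quantized vector is `[0, 1, 1, 0]`.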
Describe the clustered data
• Describe the margins of the clusters with maximal frequent itemsets
• Why maximal: they describe the largest representative commonality in the data; extracting all frequent itemsets is not feasible
• Express the itemsets as ranges of contiguous chromosomal areas
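The idea of maximal frequent itemsets and their expression as contiguous ranges can be sketched on toy data. The level-wise search below enumerates every frequent itemset before filtering, which is exactly what is infeasible on real data; it is meant only to make the definitions concrete, and a dedicated maximal-itemset miner would be needed in practice. All function names and the `min_support` parameter are assumptions, and the range labels simply join the first and last area names rather than following the abbreviated literature style.

```python
def frequent_itemsets(rows, min_support):
    """All itemsets contained in at least a fraction min_support of rows (toy data only).

    rows is a list of sets of area indices (the positions of the 1s in a 0-1 row).
    """
    n = len(rows)

    def support(itemset):
        return sum(1 for r in rows if itemset <= r) / n

    items = sorted({i for r in rows for i in r})
    level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    found = []
    while level:
        found.extend(level)
        # Join pairs of frequent sets that differ in exactly one item.
        candidates = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
        level = [c for c in candidates if support(c) >= min_support]
    return found

def maximal_itemsets(itemsets):
    """Keep only itemsets that have no frequent proper superset."""
    return [s for s in itemsets if not any(s < t for t in itemsets)]

def as_ranges(itemset, names):
    """Express a non-empty set of area indices as runs of contiguous areas."""
    idx = sorted(itemset)
    runs, start, prev = [], idx[0], idx[0]
    for i in idx[1:]:
        if i != prev + 1:  # a gap closes the current run
            runs.append((start, prev))
            start = i
        prev = i
    runs.append((start, prev))
    return [names[a] if a == b else f"{names[a]}-{names[b]}" for a, b in runs]
```

On rows `[{0, 1, 2}, {0, 1, 2}, {0, 1}, {3}]` with `min_support=0.5`, seven itemsets are frequent but only `{0, 1, 2}` is maximal, which shows how much more compact the maximal description is.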
Descriptions, Chromosome 1
• Maximal frequent itemsets extracted globally: 1q21-q22, 1q22-q23
• Shadowing and spurious mutations
Amplification models and patterns
1q32-q44, 1q11-q44, 1q21-q25, 1q21-q23, 1p35-p32,
2p15-p14, 2q32, 2p25-2p24, 2p24-p23, 2p25-2p11.1,
3q26.1-q26.3, 3q11.1-q29, 3p26-q29, 3q25-q29, 3p24, 3q27-q29,
4q12, 4p15.3-p12,
5p13-p12, 5p15.3-p11, 5p15.3 5q35,
6q22, 6p25-q27, 6p25-p22, 6p12, 6p25-p11.1, 6q21-q27,
7q3-q36, 7p21, 7p13-p11.2, 7q21, 7p22-q36, 7p22-p11.1,
8p23-q24.3, 8q24.1-q24.3, 8q23, 8q21.1-q22, 8q21.1-q24.3, 8q11.1-q24.3,
9q11-q34, 9p24-q34, 9q34, 9p24-p21,
10q11.1-q26, 10p15-p12,
11q11-q25, 11p15-q25, 11q23, 11q13, 11q14-q22, 11p12-p11.2, 11q12-q13,
12p13-p11.1, 12q13-q15, 12q11-q21, 12q12-12q23, 12q24.1-q24.3, 12p12, 12q14-q15,
13q32-q34, 13p13-q34, 13q13-q14, 13q22-q34, 13q22-q31, 13q11-q34,
14q12-q21, 14q12-q32, 14q32,
15q11.1-q26, 15q24-q25,
16p13.3-q24, 16p13.3-p11.1, 16q22, 16p13.1-p12,
17q11.1-q25, 17p13-11.1, 17q21-q25, 17q12-q21, 17p13-q25, 17q24-q25, 17q22,
18q11.1-q23, 18q21, 18p11.3-18q23, 18p11.3-11.1,
19q13.1, 19p13.3-p13.2, 19p13.3-q13.4, 19q13.1-q13.4,
20q12, 20p12-p11.2, 20q11.1-q13.3, 20p13-q13.3, 20q13.1-q13.3, 20q11.1-q12,
21p13-q22, 21q11.2-q21, 21q21-q22, 21q11.1-q22,
22q11.1-q13, 22q13, 22p13-q13,
Xp22.3-q28, Xp22.1-p11.2, Xq26-q28, Xq11-q28
Summary and Conclusions
• DNA copy number amplifications (mutations) in cancer: database collected from literature
• Mixture modeling of 0-1 data
• Models summarized based on their parameters and on the clustered data, described with maximal frequent itemsets
• The collection of DNA copy number amplifications forms a new basis for cancer classification