Discovery of (new) phenotypes in a genome-wide RNAi HeLa cell imaging screen
Gregoire Pau, Oleg Sklyar, Wolfgang Huber (EMBL-EBI Cambridge)
Florian Fuchs, Michael Boutros (DKFZ Heidelberg)
Experimental setup
• Genome-wide cell array screen with HeLa cells
• Seeded, incubated for ~48h and stained with 3 markers:
  • Actin (TRITC)
  • Tubulin (Alexa 488)
  • DNA (Hoechst)
• ~18000 gene knockdowns (1 gene per well)
Florian Fuchs, Michael Boutros (DKFZ Heidelberg)
Gene phenotypes
• Gene phenotype: phenotype expressed by a population of cells
• Gene phenotype ≠ cell phenotype!
• Examples:
  • No phenotype (observed on empty negative-control wells)
Gene phenotypes
• Examples:
  • Apoptotic phenotype (observed on a COPB well)
  • Elongated phenotype (well LOC51693)
Cell phenotypes
• Frequently observed cells
• But there are many more!
[Figure legend: interphase, mitotic, dead cell]
Goal
• Find new gene phenotypes
• "Given an input phenotype, how close is it to a known gene phenotype?"
[Figure: an input well image compared with known gene phenotypes: no phenotype, apoptotic, elongated]
Probabilistic point of view
• Denote the features of cell i by $X_i$, where $X_i \in \mathbb{R}^p$
• Each cell has p features:
  • Cell size, nucleus-to-cell size ratio, nucleus eccentricity…
  • Actin Haralick moment, total tubulin, nucleus-to-cell actin ratio…
• A gene phenotype is then characterized by a multivariate (m.v.) distribution Z
  • A realization is a set of n cells $(X_1, \dots, X_n)$ drawn from Z
[Figure: cells drawn from Z, plotted on axes "Cell feature 1 (X*1)" and "Cell feature 2 (X*2)"]
Models
• Outlier detection problem
  • Given n cells $(X_1, \dots, X_n)$ where $X_i \in \mathbb{R}^p$
  • How well do they fit a phenotype distribution (model) Z?
  • Requires estimating the density of Z (few samples, n ≈ p, hard!)
  • Requires a m.v. goodness-of-fit test (hard?!)
  • Hard!
• Different workarounds
  • Shrinking (by binning) the space $\mathbb{R}^p$ by defining K cell classes
  • Modeling Z by a simpler (and tractable) distribution
Defining cell classes
• Defining K classes (here, K = 3)
• Counting the number of cells belonging to each class
• Classical approach, robust
• Needs good a priori biological knowledge
• Adapted to clustering but maybe not to novelty detection
[Figure: two scatter plots on axes "Cell feature 1 (X*1)" and "Cell feature 2 (X*2)"; legend: interphase, mitotic, dead cell]
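The class-counting approach can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes each cell has already been assigned to one of the K = 3 classes by some classifier (the slides do not say which), and the labels in `well_labels` are made up.

```python
from collections import Counter

# K = 3 cell classes, named after the legend on the slide.
CLASSES = ("interphase", "mitotic", "dead")

def class_proportions(labels, classes=CLASSES):
    """Return the fraction of cells in each class, in the order of `classes`."""
    n = len(labels)
    if n == 0:
        raise ValueError("well contains no cells")
    counts = Counter(labels)
    return [counts.get(c, 0) / n for c in classes]

# Hypothetical per-cell labels for one well (illustrative data only).
well_labels = ["interphase"] * 7 + ["mitotic"] * 2 + ["dead"] * 1
print(class_proportions(well_labels))
```

Each well is thereby reduced from n points in $\mathbb{R}^p$ to a vector of K proportions, which is the "shrinking by binning" idea: simple and robust, but blind to any phenotype that the predefined classes do not capture.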
Modeling Z
• Assuming the phenotype distribution Z is known
• Assuming a set of n cells $(x_1, \dots, x_n)$
• $P(X_1 = x_1, \dots, X_n = x_n)$ can be computed
  • Cell features $X_i$ are assumed independent
• Two models:
  • Z is a normal distribution
  • Z is a mixture of 3 normal distributions
First model: Z is normal
• Independence and normal assumption
  • $A = \log P(X_1 = x_1, \dots, X_n = x_n) = \sum_{i=1}^{n} \log p(X_i = x_i)$
  • A is the log-probability that the cell features are similar to Z
• Here Z is the distribution of the 'no phenotype' phenotype
  • Goal: finding phenotypes far away from the 'no phenotype'
• $p(X = x) = \mathcal{N}(\mu_X, \Sigma_X)$ can be easily estimated on a training set of wells showing no phenotype
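A minimal sketch of the score A under the normal model, assuming the mean and covariance are estimated from cells of 'no phenotype' training wells. It is written in plain Python to stay dependency-free (a numerical library would normally be used), it uses p = 2 for brevity, and all function names and data values are hypothetical.

```python
import math

def mean_and_cov(rows):
    """Sample mean vector and unbiased covariance matrix of feature vectors."""
    n, p = len(rows), len(rows[0])
    mu = [sum(r[j] for r in rows) / n for j in range(p)]
    cov = [[sum((r[j] - mu[j]) * (r[k] - mu[k]) for r in rows) / (n - 1)
            for k in range(p)] for j in range(p)]
    return mu, cov

def cholesky(a):
    """Lower-triangular factor `low` with low * low^T = a (a must be positive definite)."""
    p = len(a)
    low = [[0.0] * p for _ in range(p)]
    for j in range(p):
        for k in range(j + 1):
            s = sum(low[j][m] * low[k][m] for m in range(k))
            if j == k:
                low[j][j] = math.sqrt(a[j][j] - s)
            else:
                low[j][k] = (a[j][k] - s) / low[k][k]
    return low

def log_density(x, mu, low):
    """Log of the normal density at x, given the mean and the Cholesky factor of Sigma."""
    p = len(mu)
    z = []  # solves low * z = x - mu by forward substitution
    for j in range(p):
        s = sum(low[j][m] * z[m] for m in range(j))
        z.append((x[j] - mu[j] - s) / low[j][j])
    log_det = 2.0 * sum(math.log(low[j][j]) for j in range(p))
    return -0.5 * (p * math.log(2.0 * math.pi) + log_det + sum(v * v for v in z))

def well_score(cells, mu, low):
    """A = sum over cells of log p(X_i = x_i), using the independence assumption."""
    return sum(log_density(x, mu, low) for x in cells)

# Hypothetical training cells from 'no phenotype' wells (illustrative values only).
training = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [2.5, 2.5]]
mu, cov = mean_and_cov(training)
low = cholesky(cov)

typical_well = [[2.4, 2.6], [2.6, 2.4]]
unusual_well = [[6.0, 0.0], [7.0, -1.0]]
print(well_score(typical_well, mu, low), well_score(unusual_well, mu, low))
```

Because A is a sum over cells, it also grows in magnitude with the number of cells in the well, so scores of wells with very different cell counts are not directly comparable.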
Result
• Using p = 5 cell features
  • Geometric: nucleus-to-cell size ratio, cell size, cell eccentricity
  • Protein: nucleus-to-cell actin ratio, nucleus-to-cell intensity ratio
• Log-probability A can be computed on every well (~17000)
• Sorting the wells by A, lowest values first
  • Gives wells with some bluish dead cells, whose very low p(X=x) values 'spoil' the sum of log p
  • Boring phenotypes: too close to the 'no phenotype'
Workarounds
• Naïve solutions?
  • Trimming: A', keeping only the central 50% (interquantile range) of the p(X=x) values
  • Median: using $A'' = \mathrm{median}_i \, \log p(X_i = x_i)$
• Sorting the wells by A'', lowest values first, 5 new phenotypic classes can be found:
  • Condensed phenotype
  • Elongated phenotype
  • Bi-nucleated phenotype
  • 'Large cells' phenotype
  • 'Densely packed small cells' phenotype
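The two robust scores can be sketched as below. This assumes that "50% interquantile" means dropping the lowest and highest quarter of the per-cell log-densities before summing; that reading is an interpretation, and the values in `log_ps` are invented to mimic a well with one dead cell.

```python
import statistics

def trimmed_score(log_ps, keep=0.5):
    """A': sum of the central `keep` fraction of the sorted per-cell log-densities."""
    s = sorted(log_ps)
    drop = int(len(s) * (1.0 - keep) / 2.0)  # number of cells dropped at each end
    return sum(s[drop:len(s) - drop])

def median_score(log_ps):
    """A'': median of the per-cell log-densities."""
    return statistics.median(log_ps)

# Hypothetical per-cell log p(X_i = x_i) for one well; -40.0 stands for a dead cell.
log_ps = [-3.1, -2.9, -3.0, -3.2, -2.8, -3.0, -40.0, -3.1]

print("plain sum A :", sum(log_ps))
print("trimmed   A':", trimmed_score(log_ps))
print("median   A'':", median_score(log_ps))
```

A single extreme cell dominates the plain sum but leaves both robust scores almost unchanged, which is why a few dead cells no longer push an otherwise normal well to the top of the ranking.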
Results
• Condensed phenotype
• Elongated phenotype
[Figure: well images labeled STK39, TENC1, LOC51693, KCNT1; annotation: "Curly shaped cells"]
Results
• Binucleated phenotype
• Large cells phenotype
[Figure: well images labeled KIAA0363, ADRB2]
Results
• Densely packed cells phenotype (empty spot)
• Artefact?
[Figure: well image labeled AFAR3]
Note
• Cell features
  • $A = \log P(X_1 = x_1, \dots, X_n = x_n) = \sum_{i=1}^{n} \log p(X_i = x_i)$
  • A is the log-probability that the cell features are distributed in the same way as the model phenotype
• Cell numbers
  • The number of cells N can also be a discriminating factor!
  • Example: in an apoptotic phenotype
  • $B = \log P(N = n)$ is easy to compute
• But how to combine A and B into a 'global outlier' score?
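One possible way to compute B, shown only as a sketch: the slides do not state which distribution is used for N, so this assumes a normal approximation to the cell count fitted on 'no phenotype' wells, and it returns a log-density rather than a true log-probability. The function name and the counts are hypothetical.

```python
import math
import statistics

def count_log_density(n, control_counts):
    """B: log of a normal density at cell count n, fitted on control-well counts.

    The normal model for N is an assumption made for this sketch.
    """
    mu = statistics.mean(control_counts)
    sd = statistics.stdev(control_counts)
    return -0.5 * math.log(2.0 * math.pi * sd * sd) - (n - mu) ** 2 / (2.0 * sd * sd)

# Hypothetical cell counts in 'no phenotype' wells (illustrative values only).
control_counts = [480, 510, 495, 520, 505]

print(count_log_density(500, control_counts))  # a well with a typical cell count
print(count_log_density(120, control_counts))  # a sparse well, e.g. after cell death
```

How B should be weighted against A is left open here, as it is on the slide.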
Second model: Z is a mixture of 3 normal distributions
• Previous model was a coarse approximation
  • The normal assumption is too simple: 'no phenotype' cell populations exhibit at least 3 different cell phenotypes (mitotic, interphase and dead cells)
• New model
  • Z is a mixture of 3 normal distributions
[Figure legend: interphase, mitotic, dead cells]
Model
• Density of a cell feature X
  • $p(X = x) = (1 - \pi_M - \pi_D) f_I(x) + \pi_M f_M(x) + \pi_D f_D(x)$
  • Where $\pi_M$, $\pi_D$ are the mixture proportions of mitotic and dead cells
  • Where $f_I$, $f_M$ and $f_D$ are the normal densities of the components
• Fitting X on a phenotype
  • Gives A, B but also the mixture proportions $\pi_M$, $\pi_D$
  • Can they be used as discriminative parameters?
  • Approach similar to the definition of cell classes?
  • How to combine A, B, $\pi_M$ and $\pi_D$ into a global 'outlier' score?
• Ongoing work…
  • … not yet!
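The mixture density can be evaluated as in the sketch below. For readability it uses a single (univariate) feature, whereas the slides work with p-dimensional features; the component parameters are invented, and estimating them and the proportions from data (for instance by EM) is not shown.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a univariate normal distribution with mean mu and std sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, pi_m, pi_d, comp_i, comp_m, comp_d):
    """p(x) = (1 - pi_m - pi_d) f_I(x) + pi_m f_M(x) + pi_d f_D(x).

    Each comp_* is a (mu, sigma) pair for the interphase, mitotic or dead component.
    """
    if pi_m < 0.0 or pi_d < 0.0 or pi_m + pi_d > 1.0:
        raise ValueError("mixture proportions must be non-negative and sum to at most 1")
    return ((1.0 - pi_m - pi_d) * normal_pdf(x, *comp_i)
            + pi_m * normal_pdf(x, *comp_m)
            + pi_d * normal_pdf(x, *comp_d))

def well_log_likelihood(xs, pi_m, pi_d, comp_i, comp_m, comp_d):
    """A under the mixture model: sum of per-cell log-densities (independent cells)."""
    return sum(math.log(mixture_pdf(x, pi_m, pi_d, comp_i, comp_m, comp_d)) for x in xs)

# Hypothetical parameters for one feature, e.g. nucleus-to-cell size ratio.
COMP_I, COMP_M, COMP_D = (1.0, 0.2), (0.5, 0.1), (0.2, 0.05)
PI_M, PI_D = 0.10, 0.05

cells = [0.95, 1.10, 0.52, 0.21, 1.02]  # illustrative feature values for one well
print(well_log_likelihood(cells, PI_M, PI_D, COMP_I, COMP_M, COMP_D))
```

Unlike the single normal model, a dead or mitotic cell is not penalized heavily here as long as it falls near its own component, so the proportions $\pi_M$ and $\pi_D$ carry the information about how many such cells a well contains.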
Conclusion
• Probabilistic approach
  • Suitable for novelty detection
  • Even the normal model led to several phenotype discoveries
  • May not be extended to a clustering approach
• Ongoing work
  • Results using the 3-component mixture model should be promising
  • … not ready yet!