Patterns from Gene Expression: Efficient Data Analysis Tool for Differential Gene Expression Experiments

Generation of patterns from gene expression by assigning confidence to differentially expressed genes. Elisabetta Manduchi, Gregory R. Grant, Steven E. McKenzie, G. Christian Overton, Saul Surrey, Christian J. Stoeckert. Presented by Keith Betts

Goal: • Provide tools to aid in the analysis of data collected from highly parallel gene expression experiments. • Generate descriptive and dependable expression patterns representing the differential expression of genes across cell types.

• Identify those genes that are ‘most likely’ to be differentially expressed. • Transform typical ‘raw’ input into an easily interpretable list of patterns.

Patterns from Gene Expression

What is PaGE? • PaGE is freely downloadable Perl software (tested mainly on Unix systems) that can be used as a statistical test for differentially expressed genes between two experimental conditions, given replicated experiments. • Available at: http://www.cbil.upenn.edu/PaGE/

Methods and Algorithm • Input consists of normalized data (the normalization procedure depends on the kind of experiments conducted) • The input normalized intensities are subjected to preprocessing steps.

Methods and Algorithm Cont. • In each gene tag’s expression pattern there will be one symbol for each homotypic group (set of samples of the same type). • For each homotypic group and for each gene tag, compute the average intensity of that tag over the samples in the group that have values for that tag. • This average will represent the intensity of that tag at that group.
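
The averaging step can be sketched in a few lines of Python. This is only an illustration, not PaGE's own Perl code: the function name `group_averages`, the dictionary layout, and the use of `None` for a missing value are all assumptions made for the example.

```python
from statistics import mean


def group_averages(tag_intensities, groups):
    """Average one gene tag's intensity within each homotypic group.

    tag_intensities: dict mapping sample name -> intensity, or None when
                     the sample has no value for this tag (assumed layout).
    groups:          dict mapping group name -> list of sample names.
    Returns a dict mapping group name -> average, or None if no sample
    in the group has a value.
    """
    averages = {}
    for group, samples in groups.items():
        # Only samples that actually have a value contribute to the average.
        values = [tag_intensities[s] for s in samples
                  if tag_intensities.get(s) is not None]
        averages[group] = mean(values) if values else None
    return averages
```

For instance, a group whose three replicates read 10.0, missing, and 14.0 would be represented by the average of the two available values.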

Two Stage Approach First: Attach an ordered list of real numbers to each tag. Second: Bin the numbers in this list, resulting in a pattern of integers.

First Stage • Fix an ordering of the groups in the collection. • Attach to each tag the ordered list of real numbers obtained by dividing each of its non-reference group intensities by the median of its group intensities. • The result is the list of ratios attached to the tag.
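
A minimal sketch of the first stage follows. The slide mentions both a reference group and the median of the group intensities, so the hypothetical function `ratio_list` below supports both readings as an assumption: when a reference group is named, each non-reference intensity is divided by the reference intensity (as in the later cutoff slides); otherwise every group intensity is divided by the median. Names and data layout are illustrative only.

```python
from statistics import median


def ratio_list(averages, order, reference=None):
    """Turn one tag's group intensities into an ordered list of ratios.

    averages:  dict mapping group name -> average intensity (all > 0).
    order:     list of group names giving the fixed ordering.
    reference: optional name of the reference group (assumed behaviour:
               divide by the reference if given, else by the median).
    """
    if reference is None:
        denominator = median(averages[g] for g in order)
        kept = list(order)
    else:
        denominator = averages[reference]
        # The reference group itself gets no ratio.
        kept = [g for g in order if g != reference]
    return [averages[g] / denominator for g in kept]
```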

Second Stage • For each non-reference group, partition the range of ratios into disjoint subintervals (bins). • Number the bins using consecutive integers $-m, \dots, 0, \dots, n$ (where 0 corresponds to ratio 1). • Attach the ordered list of integers to each gene tag.

Example • For group $i$: divide the range into $m_i + n_i + 1$ bins. • The list of ratios from the first stage for a certain gene tag is $(r_1, r_2, \dots, r_l)$. • Each $r_i$ belongs to exactly one of the bins $B_{i,j}$; call its index $j_i$. • The expression pattern associated with this tag is then $(j_1, j_2, \dots, j_l)$.
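
The binning in this example can be sketched as a lookup against sorted bin edges. The sketch assumes each group's bins are described by two ascending lists of edges, one below ratio 1 and one above it, and that a ratio exactly on an edge falls into the bin further from 0; the slides do not state the boundary convention, and the names `bin_index` and `expression_pattern` are hypothetical.

```python
from bisect import bisect_left, bisect_right


def bin_index(ratio, down_edges, up_edges):
    """Return the integer bin of a ratio.

    down_edges: ascending edges below 1 (m of them, giving bins -1..-m).
    up_edges:   ascending edges above 1 (n of them, giving bins 1..n).
    Bin 0 is the interval between the largest down edge and the
    smallest up edge, so it contains ratio 1.
    """
    ups = bisect_right(up_edges, ratio)                        # up edges <= ratio
    downs = len(down_edges) - bisect_left(down_edges, ratio)   # down edges >= ratio
    return ups - downs


def expression_pattern(ratios, edges_per_group):
    """Map a tag's ratio list to its pattern of integers, one per group."""
    return [bin_index(r, down, up)
            for r, (down, up) in zip(ratios, edges_per_group)]
```

With two down edges and two up edges a group has 2 + 2 + 1 = 5 bins, matching the $m_i + n_i + 1$ count on the slide.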

Choose level cutoffs • Suppose we are taking ratios to a reference homotypic group (group 0) and are focusing on a fixed group (group i). • Suppose also that we have replicate experiments for each of the two groups. • Concentrate on up-regulation

Goal • Goal is to achieve a certain degree of confidence in the assertion: ‘this gene is up-regulated at group i as compared to the reference group’

• Each gene will have a distribution of intensities in a group, whose mean will be called ‘the true mean intensity of the gene at that group’. • Denote the random variable giving the intensity of gene $g$ at group $j$ by $X_{g,j}$, and denote its mean and standard deviation by $\mu_{g,j}$ and $\sigma_{g,j}$.

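False Positive Rate: $\mathrm{Prob}\left(\bar{X}_{g,i} / \bar{X}_{g,0} > C_i \mid \mu_{g,i} / \mu_{g,0} < 1\right)$, where $\bar{X}_{g,j}$ is the average intensity of gene $g$ over the replicates in group $j$ and $\mu_{g,j}$ is its true mean intensity.

Claim that $(\bar{X}_{g,i} / \mu_{g,i}) / (\bar{X}_{g,0} / \mu_{g,0}) > C_i$ and $\mu_{g,i} / \mu_{g,0} < 1$ are independent events.

Seek $C_i$ as small as possible such that: $\mathrm{Prob}\left((\bar{X}_{g,i} / \mu_{g,i}) / (\bar{X}_{g,0} / \mu_{g,0}) > C_i\right) < s\%$. Because $\bar{X}_{g,i} / \bar{X}_{g,0}$ equals this normalized ratio multiplied by $\mu_{g,i} / \mu_{g,0}$, the false positive rate cannot exceed this probability when $\mu_{g,i} / \mu_{g,0} < 1$.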

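Approximate the distribution of $\bar{X}_{g,j} / \mu_{g,j}$ (for $j = 0, i$) by the values $\left(X_{g,j,k} / \bar{X}_{g,j} - 1\right) / \sqrt{t_j - 1} + 1$, where $X_{g,j,k}$ is the intensity of gene $g$ in replicate $k$ of group $j$, $\bar{X}_{g,j}$ is the average over those replicates, and $t_j$ is the number of replicates in group $j$.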
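
As a rough illustration, the values used for this approximation can be collected over all genes and replicates of one group. The sketch below assumes the replicate intensities arrive as one list per gene and that genes with fewer than two replicates or a non-positive average are skipped; both choices, and the name `normalized_deviations`, are assumptions for the example.

```python
from math import sqrt
from statistics import mean


def normalized_deviations(replicates_by_gene):
    """Collect (x / average - 1) / sqrt(t - 1) + 1 for every replicate.

    replicates_by_gene: list of lists; each inner list holds the replicate
                        intensities of one gene within a single group.
    The pooled values serve as an empirical stand-in for the distribution
    of (average intensity) / (true mean intensity) in that group.
    """
    samples = []
    for replicates in replicates_by_gene:
        t = len(replicates)
        if t < 2:
            continue  # the sqrt(t - 1) rescaling needs at least two replicates
        average = mean(replicates)
        if average <= 0:
            continue  # ratios to a non-positive average are not meaningful
        scale = sqrt(t - 1)
        samples.extend((x / average - 1) / scale + 1 for x in replicates)
    return samples
```

A gene with replicates 8.0 and 12.0 contributes values close to 0.8 and 1.2, centred on 1 as expected for a ratio to the true mean.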

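Compute the desired $C_i$ through integration • If $f_j$ ($j = 0, i$) is the density function for the ratio of the average intensity to the true mean intensity, $\bar{X}_{g,j} / \mu_{g,j}$, and $C$ is fixed, then $\mathrm{Prob}\left((\bar{X}_{g,i} / \mu_{g,i}) / (\bar{X}_{g,0} / \mu_{g,0}) > C\right)$ is evaluated by integrating over $f_i$ and $f_0$.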
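
The integral itself is not shown in the text above. One form consistent with the definitions, assuming the two normalized averages are independent and take only positive values, would be the following; it is a reconstruction from those assumptions, not necessarily the authors' exact expression. Here $f_i$ and $f_0$ are the densities of the average-to-true-mean ratios for group $i$ and the reference group.

```latex
\[
P(C)
  \;=\; \operatorname{Prob}\!\left(
          \frac{\bar{X}_{g,i}/\mu_{g,i}}{\bar{X}_{g,0}/\mu_{g,0}} > C
        \right)
  \;=\; \int_{0}^{\infty} f_0(y)
        \left( \int_{C y}^{\infty} f_i(x)\, dx \right) dy
\]
```

The inner integral is the probability that the group $i$ ratio exceeds $C y$ when the reference ratio equals $y$; the outer integral averages that over the reference density.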

If this is above the desired false positive rate, then $C$ is raised and the integral is recalculated. • Repeat the process until the desired false positive rate is attained.
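
The raise-and-recalculate loop can be sketched as below. To keep the example self-contained it swaps the integral for an empirical estimate: the fraction of all pairs of sampled values whose ratio exceeds the cutoff. That substitution, the fixed step size, and the names `exceed_probability` and `find_up_cutoff` are assumptions of the sketch, not a description of PaGE's implementation.

```python
from bisect import bisect_right


def exceed_probability(samples_i, samples_0, cutoff):
    """Fraction of pairs (u_i, u_0) with u_i / u_0 > cutoff.

    Both sample lists must be non-empty and samples_0 must be positive.
    """
    ordered = sorted(samples_i)
    exceeding = 0
    for u0 in samples_0:
        # Count the u_i values strictly greater than cutoff * u0.
        exceeding += len(ordered) - bisect_right(ordered, cutoff * u0)
    return exceeding / (len(ordered) * len(samples_0))


def find_up_cutoff(samples_i, samples_0, rate, start=1.0, step=0.01,
                   max_steps=10000):
    """Raise the cutoff from `start` until the estimate drops below `rate`."""
    for n in range(max_steps + 1):
        cutoff = start + n * step
        if exceed_probability(samples_i, samples_0, cutoff) < rate:
            return cutoff
    raise ValueError("no cutoff found within the search range")
```

Because the cutoff only moves upward in fixed steps, the returned value is the smallest cutoff on that grid meeting the target rate; a finer step trades running time for precision.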

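Down-regulation • Proceed in a similar manner. • Seek $c_i$ as large as possible such that: $\mathrm{Prob}\left((\bar{X}_{g,i} / \mu_{g,i}) / (\bar{X}_{g,0} / \mu_{g,0}) < c_i\right) < s\%$, where $\bar{X}_{g,j}$ is the average intensity of gene $g$ in group $j$ and $\mu_{g,j}$ its true mean intensity.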

Once the $C_i$’s and the $c_i$’s are determined for each non-reference group $i$, if the ratio of the average intensity of a gene tag at group $i$ to the average intensity of the same gene tag at the reference group is between $C_i$ and $C_i^2$, we say that the gene tag is up-regulated one level at this group as compared to the reference group.
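
The level rule can be illustrated with a short sketch. The text only spells out the first up-regulation level; the sketch assumes that higher levels continue with successive powers of the up cutoff, that down-regulation levels mirror this with powers of the down cutoff, and that a ratio exactly on a cutoff counts toward the higher level. The function name `regulation_level` is hypothetical.

```python
def regulation_level(ratio, up_cutoff, down_cutoff):
    """Return the signed level of a ratio relative to the reference group.

    up_cutoff   (> 1): ratios in [up_cutoff, up_cutoff**2) give level 1, etc.
    down_cutoff (< 1): assumed mirror rule, giving levels -1, -2, ...
    Ratios between the two cutoffs give level 0.
    """
    if ratio <= 0 or up_cutoff <= 1 or not 0 < down_cutoff < 1:
        raise ValueError("need ratio > 0, up_cutoff > 1, 0 < down_cutoff < 1")
    level = 0
    threshold = up_cutoff
    while ratio >= threshold:       # climb through C, C^2, C^3, ...
        level += 1
        threshold *= up_cutoff
    if level:
        return level
    threshold = down_cutoff
    while ratio <= threshold:       # descend through c, c^2, c^3, ...
        level -= 1
        threshold *= down_cutoff
    return level
```

With hypothetical cutoffs of 2.0 and 0.5, a ratio of 3.0 lands at level 1 and a ratio of 0.2 at level -2.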

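One can now estimate the probability $\mathrm{Prob}(\text{not up} \mid \text{predicted up}) = \mathrm{Prob}(\text{not up}) \cdot \mathrm{Prob}(\text{predicted up} \mid \text{not up}) / \mathrm{Prob}(\text{predicted up}) \le \mathrm{Prob}(\text{predicted up} \mid \text{not up}) / \mathrm{Prob}(\text{predicted up})$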
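
A small numeric sketch of this bound: it assumes that Prob(predicted up | not up) is replaced by the chosen false positive rate and that Prob(predicted up) is estimated by the fraction of gene tags predicted up. Both substitutions, the function name, and the counts in the usage note are illustrative assumptions.

```python
def false_discovery_bound(false_positive_rate, predicted_up, total):
    """Upper bound on Prob(not up | predicted up).

    false_positive_rate: stand-in for Prob(predicted up | not up).
    predicted_up:        number of gene tags predicted up-regulated.
    total:               number of gene tags examined.
    Returns None when nothing was predicted up (the bound is undefined).
    """
    if predicted_up == 0:
        return None
    fraction_predicted_up = predicted_up / total
    # A probability can never exceed 1, so cap the bound there.
    return min(1.0, false_positive_rate / fraction_predicted_up)
```

For example, with a 1% false positive rate and a hypothetical 50 of 1,000 tags predicted up, the bound is about 0.2, meaning at least roughly 80% confidence that a predicted tag is truly up-regulated.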

As a consequence of this approach, when we see a level different from 0, we have a certain confidence in the gene tag being up-regulated or down-regulated as compared to the reference group. • However, when we see a 0 there is no confidence implied. • We can only take 0 to mean that we do not have enough evidence to support a change in level.

Results Application to an erythroid development nylon filter dataset

Background • Erythroid development dataset contains 5 homotypic groups representing an erythroleukemic cell line and normal cells under different conditions. • There are replicate data for each of the groups.

Background Continued • The groups are: • CD34 positive cells • Human adult erythroblasts • Cord erythroblasts • HEL cells • HEL cells treated with hemin

Application • Available replicates: • Two CD34 • Three adult erythroblasts • Two cord blood erythroblasts • Three HEL • Two HEL + hemin

The value of d is set at 15 • Only the moderate to highly abundant mRNA classes are likely to have given hybridization signals above background on the filter array. • Set the HEL group as reference

Two approaches • PaGE was run once merging the adult and the cord erythroblasts into one group with five replicates. • PaGE was run a second time keeping the adult and cord erythroblasts in separate groups.

Performance • Running time was always under 90 seconds on an UltraSPARC IIi CPU at 300 MHz with 128 MB RAM.

Adult and Cord Merged Results • Total of 18,123 clones • 540 were above the minimum useful value in every group • 5,063 were above the minimum useful value in at least one group.

Merged Results Cont. • For s% = 1% (false positive rate): • 5 levels for CD34 (0 to 4) • 10 levels for erythroblasts (-1 to 8) • 6 levels for HEL + hemin (-1 to 4)

Findings • Clones representing the same gene were usually found to have identical or very similar patterns. • Clones representing genes whose expression is known in these cells presented patterns compatible with what was expected.

New Application • Ask which genes are differentially expressed between normal and leukemic cells. • Ask which genes are induced by hemin to adopt a normal expression pattern.

Findings • Having more genes available to start with led to more genes identified as differentially expressed but at lower confidence. • At similar confidence levels, starting with more genes did not necessarily lead to more genes identified as differentially expressed between normal and HEL cells.
