Looking for signals in tens of thousands of GeneChips. Dr Andrew Harrison Departments of Mathematical Sciences and Biological Sciences University of Essex harry@essex.ac.uk.
There are >10^5 GeneChip experiments in the public domain, which cost ~$10^9 to produce. Extracting further information from this resource will be very cost effective.
Microarray informatics at Essex University
Departments of Mathematical Sciences and Biological Sciences

Faculty                                  Degrees in …..
Dr Andrew Harrison                       Physics
Professor Graham Upton                   Statistics
Dr Berthold Lausen                       Statistics
+ Dr Hugh Shanahan (Royal Holloway)      Physics

PhD students
Farhat Memon                             Computer Science
Anne Owen                                Mathematics
Fajriyah Rohmatul                        Statistics

Current MSc and UG students
Aleksandra Iljina                        Statistics and Data Analysis
Lina Hamadeh                             Statistics and Data Analysis
Madalina Ghita                           Mathematics

Alumni
Dr Jose Arteaga-Salas                    Statistics
Dr Renata Camargo                        Computer Science
Dr Caroline Johnston                     Molecular Biology and Bioinformatics
Dr William Langdon                       Computer Science and Physics
Dr Joanna Rowsell                        Mathematics
Dr Olivia Sanchez-Graillet               Computer Science and Bioinformatics
Dr Maria Stalteri                        Inorganic Chemistry and Bioinformatics
+ 4 former MSc students
Perfect Match (PM) and Mismatch (MM) probes. m = log2(Fold Change), a = log2(Average Intensity). The biggest uncertainty in GeneChip analysis is how to merge all the probe information for one gene (Harrison, Johnston and Orengo, 2007, BMC Bioinformatics, 8:195). There is a huge multiple-testing problem. What can be learnt from comparing different experiments?
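As a rough illustration of the m and a quantities defined on this slide, the sketch below computes both for a single probe measured in two conditions. The function name `ma_values` and the example intensities are hypothetical. It also assumes the common MA-plot convention in which a is the mean of the two log2 intensities (the log2 of their geometric mean), which may not be exactly what the slide intends by "Average Intensity".

```python
import math


def ma_values(intensity_x, intensity_y):
    """Return (m, a) for one probe measured in two conditions.

    m is the log2 fold change between the two intensities.
    a is the mean of the two log2 intensities (an assumed convention).
    Both intensities must be strictly positive.
    """
    log_x = math.log2(intensity_x)
    log_y = math.log2(intensity_y)
    m = log_x - log_y          # log2(intensity_x / intensity_y)
    a = 0.5 * (log_x + log_y)  # log2 of the geometric mean
    return m, a


# Hypothetical intensities, chosen as powers of two so the logs are exact.
m, a = ma_values(1024.0, 256.0)
print(m, a)
```

Working on the log2 scale makes a doubling and a halving symmetric about zero, which is why fold changes are usually plotted this way.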
Some genes are represented by multiple probe-sets (Probe-set A, Probe-set B). If they are measuring the same thing, the signals should be up- and down-regulated together. Is that always true? No (Stalteri and Harrison, 2007, BMC Bioinformatics, 8:13).
Probes map to different exons. Alternative splicing may cause some exons to be upregulated and others to be downregulated.
Genes come in pieces. But exons do not. Multiple probes mapping to the same exon should measure the same thing.
We are studying the correlations in expression across >6,000 GeneChips (HG-U133A), sampling RNA from many tissues and phenotypes.
The correlations in intensities (log2) between probes in probeset 208772_at on the HG-U133A array. The number in each square is the correlation ×10. Blue = low correlation; yellow = high correlation. Figure labels: average intensity in GEO; probe order along the gene. The correlation calculated for PM probes 9 and 11, the data in the earlier scatter plot, is reported as 8 (0.76 multiplied by 10 and rounded).
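The grid of numbers described on this slide could be produced along the following lines. This is only a sketch: it assumes the correlation is Pearson's r computed on log2 intensities across chips, and the helper names (`pearson`, `correlation_grid`) and the toy data are hypothetical rather than taken from the talk.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists.

    Assumes neither list is constant (a constant list would divide by zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def correlation_grid(probe_log_intensities):
    """Return a square grid of correlations multiplied by 10 and rounded.

    probe_log_intensities holds one list per probe, with one log2
    intensity per chip, ordered by probe position along the gene.
    """
    n = len(probe_log_intensities)
    return [
        [round(10 * pearson(probe_log_intensities[i], probe_log_intensities[j]))
         for j in range(n)]
        for i in range(n)
    ]


# Toy data: three probes measured on four chips (made-up values).
toy_probes = [
    [7.0, 8.0, 9.0, 10.0],
    [7.5, 8.5, 9.5, 10.5],
    [10.0, 9.0, 8.0, 7.0],
]
for row in correlation_grid(toy_probes):
    print(row)
```

Scaling by 10 and rounding keeps each cell to one or two characters, so a correlation of 0.76 appears as 8, as on the slide.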
This probeset shows no coherent correlations amongst its probes.
Probes 1-11 all map to the same exon. This is a different probe-set mapping to the same exon – there seems to be one outlier.
There is little sequence similarity between the probes, and they are from probe-sets picking up different biology, yet they are correlated!
TCCTGGACTGAGAAAGGGGGTTCCT
GAGACACACTGTACGTGGGGACCAC
GGTAGACTGGGGGTCATTTGCTTCC
Virtually all of the probes in the group have runs of guanines within their 25 bases.
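One simple way to flag such probes is to measure the longest run of contiguous guanines in each 25-base sequence. The sketch below does this for the three probes shown on the slide. The function name `longest_g_run` is hypothetical, and this is not necessarily how the authors screened their probes.

```python
def longest_g_run(sequence):
    """Return the length of the longest run of contiguous G bases."""
    longest = 0
    current = 0
    for base in sequence.upper():
        if base == "G":
            current += 1
            longest = max(longest, current)
        else:
            current = 0  # any other base breaks the run
    return longest


# The three probe sequences quoted on the slide.
probes = [
    "TCCTGGACTGAGAAAGGGGGTTCCT",
    "GAGACACACTGTACGTGGGGACCAC",
    "GGTAGACTGGGGGTCATTTGCTTCC",
]
runs = [longest_g_run(p) for p in probes]
for probe, run in zip(probes, runs):
    print(probe, run)
```

Each of the three probes contains a run of at least four guanines.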
Comparing probes with runs of Gs.

Number of contiguous Gs    Mean correlation
3                          0.14
4                          0.42
5                          0.49
6                          0.62
7                          0.75

We are only looking at a small fraction of the entire probe, yet it is dominating the effects across all experiments.
G-quadruplexes
GGGG GGGG GGGG
All the probes within a cell have the same sequence – a run of guanines will result in closely packed DNA with just the right properties to form G-quadruplexes. Upton et al. 2008 BMC Genomics, 9, 613
How do we deal with known outliers such as G-quadruplexes? What is the best way to calculate expression in the presence of outliers?
G-stacks bias which genes are reported to be clustered together within published experiments.
Kerkhoven et al. 2008, PLoS ONE 3(4): e1980 Probes containing GCCTCCC will hybridize to the primer spacer sequence that is attached to all aRNA prior to hybridization.
Log(magnitude) of averaged probe values, colour coded by size. Note the perimeter of bright-dark pairs. Cell (0,0) contains a probe which does not measure any biology.
Corner correlations (correlations with values in cell (0,0)). Numbers are correlations times 10 (red: greater than 0.8). Negative correlations appear as blanks. Filled circles indicate probes not listed in the CDF file. Large circles indicate correlations greater than 0.8.
Correlations with cell (0,0). Being in the opposite corner has not reduced the correlations of the interior row and column.
What is in the sheep pens? Panels: entries are correlation with cell (0,0); entries are log(mean(Intensity)). Sheep!
Many thousands of probes are correlated with each other simply because they are adjacent to bright probes. We believe that the focus of the scanner may be responsible – regions adjacent to bright spots will gain the same fraction of light. A comparison of many images at different levels of blurriness will appear to indicate that dark regions adjacent to bright regions are correlated in their intensities.
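The blur argument can be illustrated with a toy simulation. It is only a sketch under stated assumptions, not the authors' analysis. Each simulated image has its own blur level, modelled as a fraction of a neighbouring bright cell's light that leaks into adjacent dark cells. The bright intensity is held constant, the dark cells' own signals are independent noise, and all names and numbers are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_blur(n_images=500, bright=10000.0, seed=0):
    """Simulate dark cells across many images with varying blur.

    Returns (r_adjacent, r_isolated):
      r_adjacent  correlation between two dark cells that each sit next
                  to a bright cell (in different places on the array),
      r_isolated  correlation between one of those cells and a dark
                  cell with no bright neighbour.
    """
    rng = random.Random(seed)
    near_a, near_b, isolated = [], [], []
    for _ in range(n_images):
        # Assumed blur model: one leak fraction per image, shared by
        # every dark cell that is adjacent to a bright cell.
        leak = rng.uniform(0.0, 0.05)
        near_a.append(rng.gauss(100.0, 10.0) + leak * bright)
        near_b.append(rng.gauss(100.0, 10.0) + leak * bright)
        isolated.append(rng.gauss(100.0, 10.0))
    return pearson(near_a, near_b), pearson(near_a, isolated)


r_adjacent, r_isolated = simulate_blur()
print(round(r_adjacent, 2), round(r_isolated, 2))
```

In this toy model the two dark cells share no biological signal, yet they come out strongly correlated across images because both inherit the same image-level blur, while the isolated dark cell does not.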
A CEL file contains information about the ID of the scanner as well as the date on which the image was scanned – how does the impact of blur change over time for each scanner? Upton and Harrison, 2010, Stat Appl Genet Mol Biol, 9(1), Article 37
How best to transform a DAT image into a CEL file? We are testing whether ideas from astronomy are applicable. We are checking whether the temporal patterns in scanner performance for human GeneChips and for those of other organisms are related.
Bioinformatix, Genomix, Mathematix, Physix, Statistix, Transcriptomix are needed in order to extract reliable information from Affymetrix GeneChips. Thank you for your attention.