Analyzing genome tiling microarrays for the detection of novel expressed genes Preliminary version 23 Feb 2007 Yang Li and Rainer Breitling Dagstuhl seminar, March 2007 Groningen Bioinformatics Centre
Outline • Introduction to tiling arrays • Published research on exon finding • Our data set • Machine learning for exon finding • Results
Background • Genomic tiling array • Probes are designed to blanket an entire genomic region of interest and are used to detect the presence or absence of transcription. • Tiling • A sequence of probes spanning a genomic region is called a “tile path”, or a “tiling”.
Two types of tiling array construction: • 1) Oligonucleotide tiling array • 2) Tiling array constructed using PCR products. Trends in Genetics 2005, 21: 466
Detection of transcription • 1) Discovery of novel genes • 2) Discovery of novel non-coding RNAs • 3) Alternative splicing study • Advantages: • The sensitivity of microarrays enables rare transcripts to be detected. • The parallel nature of the arrays enables numerous samples and genomic sequences to be analyzed. • The experimental design does not depend on current genome annotations.
Recent Research • Surprising amounts of genomic ‘dark matter’ • More than 50% of animal genomes may be transcribed • Novel protein-coding genes • Novel non-coding genes (rRNA, tRNA, snoRNA, miRNA…) • Antisense transcripts • Alternative isoforms and gene ‘extensions’ • Leaky transcription • Technical noise/artifacts
Exon-intron discriminators • Kampa et al.: Hodges–Lehmann estimator (pseudo-median)
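Kampa et al. summarize the probe intensities in a window with the Hodges–Lehmann estimator, the median of all pairwise averages (Walsh averages), which is why it is also called the pseudo-median. A minimal sketch of this statistic, assuming a simple per-window summary (the window and intensities below are hypothetical):

import numpy as np
from itertools import combinations_with_replacement

def pseudo_median(intensities):
    # Hodges-Lehmann estimator: median of all pairwise averages (Walsh averages)
    x = np.asarray(intensities, dtype=float)
    walsh = [(a + b) / 2.0 for a, b in combinations_with_replacement(x, 2)]
    return float(np.median(walsh))

# hypothetical sliding window of PM probe intensities
window = [120.0, 135.0, 98.0, 410.0, 140.0]
print(pseudo_median(window))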
Exon-intron discriminators • Schadt et al.: PCA • 1. Probes are separated into 15 kb sliding windows • 2. Calculate a robust principal component (from the between-sample correlation matrix) • 3. Calculate the Mahalanobis distance (probe location minus the center of the data in the first two dimensions of the principal component scores, PCS) • 4. Decide on exon vs. intron • 5. Assign probes to transcriptional units
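A rough sketch of the Schadt et al. idea, not their implementation: within each 15 kb window, project the probe-by-sample intensity matrix onto its first two principal components and flag probes whose Mahalanobis distance from the centre of the scores is large. Using a plain (rather than robust) PCA on the centred matrix and a fixed distance cut-off are simplifying assumptions here:

import numpy as np

def window_exon_calls(window_probes, cutoff=3.0):
    # window_probes: (n_probes, n_samples) intensity matrix for one 15 kb window.
    # Returns a boolean array, True where a probe lies far from the bulk (exon-like).
    X = np.log2(window_probes + 1.0)
    X = X - X.mean(axis=0)                               # centre each sample
    U, s, Vt = np.linalg.svd(X, full_matrices=False)     # plain PCA via SVD
    scores = U[:, :2] * s[:2]                            # probe scores in the first two PCs
    diff = scores - scores.mean(axis=0)
    cov = np.cov(scores, rowvar=False)
    d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)
    return np.sqrt(d2) > cutoff                          # Mahalanobis distance threshold

# hypothetical window: 600 probes x 21 samples
rng = np.random.default_rng(0)
calls = window_exon_calls(rng.gamma(2.0, 100.0, size=(600, 21)))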
Exon-intron discriminators • Our collaborators’ approach (Andrew Fraser and Tom Gingeras): • use negative bacterial controls to calculate an intensity threshold corresponding to a 5% false positive rate in a given region • apply these intensity thresholds to generate positive probe maps, which are then joined together using two parameters: maxgap, the maximal distance between two positive probes, and minrun, the minimal size of a transfrag • minrun values of 40 (two positive probes) or 80 (three positive probes) are a good starting point for these parameters
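A minimal sketch of the maxgap/minrun joining step described above; the probe positions, intensities, intensity threshold and default parameter values are illustrative, and the real pipeline derives the threshold from the negative bacterial controls:

def call_transfrags(positions, intensities, threshold, maxgap=40, minrun=40):
    # Join positive probes (intensity > threshold) into transfrags.
    # positions: sorted probe start coordinates in bp; returns (start, end) tuples.
    positive = [p for p, i in zip(positions, intensities) if i > threshold]
    transfrags, start, prev = [], None, None
    for p in positive:
        if start is None:
            start = prev = p
        elif p - prev <= maxgap:
            prev = p                          # still within the same transfrag
        else:
            if prev - start >= minrun:        # keep only runs of sufficient size
                transfrags.append((start, prev))
            start = prev = p
    if start is not None and prev - start >= minrun:
        transfrags.append((start, prev))
    return transfrags

# hypothetical example: probes every 25 bp
pos = list(range(0, 500, 25))
inten = [50, 300, 320, 310, 60, 55, 280, 290, 300, 310, 40, 35, 30, 25, 20, 400, 40, 30, 20, 10]
print(call_transfrags(pos, inten, threshold=100))   # -> [(25, 75), (150, 225)]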
About our tiling data • Affymetrix C. elegans Tiling 1.0R Array • Genome-wide gene expression: • Chr I–V, Chr X and Chr M (mitochondrion) • Resolution: on average 25 bp • Negative bacterial controls • Samples: 21 samples across development (plus mutant) • Probes: 2,942,364 PM/MM pairs
About tiling data • (table of sample numbers across developmental stages) • * smg-1: deficient in nonsense-mediated decay
Examples • LAP-1 (ZK353.6) • Genomic position: III:8401845..8399119 bp • LAP-1 is expressed throughout the life cycle. While there appears to be marginally less LAP-1 message at 2 h and 40 h, corresponding to early L1 and young adult stages respectively, LAP-1 appears to be constitutively expressed. Densitometric analysis of LAP-1 expression compared to the housekeeping gene ama-1 shows some variation in LAP-1 expression, but this appears to be unrelated to moulting.
Example • Probe intensity profile across the gene, with exon and intron regions marked (figure)
Example 2 • Probe intensity profile (figure)
General impression • Chr III, 2866 genes (figure)
Methods: machine learning • Aim • Find the most effective (correct) machine learning method that distinguishes between True exons and True introns • Find the simplest (fastest, most intuitive) method that achieves this task
Methods: machine learning • Main challenge • True exons and True introns are not known: • Annotated exons may be unexpressed • Annotated introns may be novel transcripts • Our approach • Ignore the problem and optimize supervised performance • Assumption • True novel transcripts will be similar to known ones
Methods: machine learning • 1. Classification and regression trees (CART) • binary recursive partitioning • Advantages: • Easy to understand • Easy to implement • Computationally cheap
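As a sketch of what such a tree looks like in practice, here is a small CART model on two per-probe features, using scikit-learn's DecisionTreeClassifier; the features, the simulated data and the labels standing in for annotated exons/introns are all hypothetical:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
n = 1000
# hypothetical per-probe features: median PM intensity and correlation with neighbours
X = np.column_stack([rng.normal(8, 2, n), rng.uniform(-1, 1, n)])
y = (X[:, 0] + 2 * X[:, 1] > 8).astype(int)   # stand-in for annotated exon (1) / intron (0)

tree = DecisionTreeClassifier(max_depth=4, random_state=0)  # shallow tree stays interpretable
tree.fit(X, y)
print(tree.score(X, y))                       # training accuracy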
Methods: machine learning • 2. Support vector machines (SVM) • How would you classify this data? (scatter plot of two probe classes, labelled +1 and 0)
Maximum Margin • The classifier with the maximum margin is the ideal one.
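A short sketch of a maximum-margin (linear) SVM on the same kind of two-feature probe data; the simulated clusters below stand in for exon-like (+1) and intron-like (0) probes and are not the actual tiling features:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
exon = rng.normal([10.0, 0.6], 1.0, size=(200, 2))      # simulated exon-like probes (+1)
intron = rng.normal([6.0, 0.0], 1.0, size=(200, 2))     # simulated intron-like probes (0)
X = np.vstack([exon, intron])
y = np.array([1] * 200 + [0] * 200)

svm = SVC(kernel='linear', C=1.0)   # linear kernel: maximum-margin separating hyperplane
svm.fit(X, y)
print(svm.coef_, svm.intercept_)    # the hyperplane w.x + b = 0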
Evaluation • Receiver Operating Characteristic (ROC) curve: True Positive Rate (sensitivity) plotted against False Positive Rate (1 − specificity)
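Classifiers are then compared by the area under this curve (AUC). A small sketch of how the curve and the AUC can be computed from classifier scores and annotation labels (both simulated here):

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(3)
y_true = rng.integers(0, 2, 500)                     # annotated exon (1) / intron (0)
scores = y_true * 0.8 + rng.normal(0, 0.5, 500)      # hypothetical classifier scores
fpr, tpr, thresholds = roc_curve(y_true, scores)     # points of the ROC curve
print(roc_auc_score(y_true, scores))                 # area under the ROC curve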
Selection of informative features – intensities • Summary statistics (mean, median, max, max_1) of raw vs. quantile-normalized PM and MM intensities for each probe and its neighbours (pm.i, pm±1, pm±2; mm.i, mm±1, mm±2) (figures)
Selection of informative features – correlation • Pearson and Spearman correlation with the neighbouring probes (pm±1, mm±1), on raw vs. quantile-normalized data (figures)
Selection of informative features • Summary • Almost all reasonable features are informative • No striking difference between mean and median, but both seem better than max and max_1 • Correlation coefficients (CC) are also informative; no striking difference between Pearson and Spearman • Quantile normalization does not improve the results • Decision • Median and Pearson CC of the non-normalized data are used to generate features • GC content or melting temperature can also be informative
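A sketch of how the chosen features could be generated per probe, i.e. the median of the raw PM intensity across the samples and the Pearson correlation with the neighbouring probe across the samples; treating only the next probe as the neighbour is a simplification, not the exact feature definition used here:

import numpy as np

def probe_features(pm):
    # pm: (n_probes, n_samples) raw PM intensity matrix, probes in genomic order.
    # Returns per-probe features: median intensity and Pearson CC with the next probe.
    med = np.median(pm, axis=1)
    cc = np.full(pm.shape[0], np.nan)
    for i in range(pm.shape[0] - 1):
        cc[i] = np.corrcoef(pm[i], pm[i + 1])[0, 1]   # correlation across samples
    return np.column_stack([med, cc])

# hypothetical data: 50 probes x 21 samples
rng = np.random.default_rng(4)
features = probe_features(rng.gamma(2.0, 100.0, size=(50, 21)))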
Selection of informative features • Neighbours • MM • CC.PM • CC.MM • Tm • ANOVA results (figure)
AUC as a function of other factors: expression level, exon length, melting temperature, relative position (figures)
Can minrun and maxgap improve the results? maxgap = 1, minrun = 3
Can minrun and maxgap improve the results? minrun = 3, maxgap = 1
Maxgap and minrun optimization (maxgap/minrun vs. minrun/maxgap)
thres = threshold, ccr = correct classification rate, fpr = false positive rate, tpr = true positive rate
thres   ccr     fpr     tpr
0.936   0.806   0.009   0.464
0.718   0.850   0.030   0.627
0.500   0.856   0.059   0.700
0.300   0.815   0.216   0.851
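A sketch of how such a table can be produced: sweep the classifier score threshold (before or after the maxgap/minrun post-processing), compare the resulting calls to the annotation, and report the correct classification rate (ccr), false positive rate (fpr) and true positive rate (tpr). The scores and labels below are simulated, so the numbers will not match the table above:

import numpy as np

def sweep(scores, labels, thresholds):
    rows = []
    for t in thresholds:
        calls = scores >= t
        tp = np.sum(calls & (labels == 1))
        fp = np.sum(calls & (labels == 0))
        tn = np.sum(~calls & (labels == 0))
        fn = np.sum(~calls & (labels == 1))
        rows.append((t,
                     (tp + tn) / labels.size,   # ccr
                     fp / (fp + tn),            # fpr
                     tp / (tp + fn)))           # tpr
    return rows

rng = np.random.default_rng(5)
labels = rng.integers(0, 2, 2000)
scores = labels * 0.6 + rng.normal(0.3, 0.25, 2000)
for t, ccr, fpr, tpr in sweep(scores, labels, [0.936, 0.718, 0.500, 0.300]):
    print(f"thres={t:.3f} ccr={ccr:.3f} fpr={fpr:.3f} tpr={tpr:.3f}")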