
Yang Li and Rainer Breitling Dagstuhl seminar, March 2007

Analyzing genome tiling microarrays for the detection of novel expressed genes. Preliminary version 23 Feb 2007. Yang Li and Rainer Breitling Dagstuhl seminar, March 2007. Groningen Bioinformatics Centre. Outline. Introduction to tiling arrays Published research on exon finding



  1. Analyzing genome tiling microarrays for the detection of novel expressed genes Preliminary version 23 Feb 2007 Yang Li and Rainer Breitling Dagstuhl seminar, March 2007 Groningen Bioinformatics Centre

  2. Outline • Introduction to tiling arrays • Published research on exon finding • Our data set • Machine learning for exon finding • Results

  3. Background • Genomic tiling array • Probes are designed to blanket an entire genomic region of interest and used to detect the presence or absence of transcription. • Tiling • A sequence of probes spanning a genomic region is called a “tile path”, or a “tiling”.

  4. Two types of tiling array construction: • 1) Oligonucleotide tiling array • 2) Tiling array constructed using PCR products (Trends in Genetics 2005, v21, 466)

  5. Detection of transcription • 1) Discovery of novel genes • 2) Discovery of novel non-coding RNAs • 3) Alternative splicing study • Advantages: • The sensitivity of microarrays enables rare transcripts to be detected. • The parallel nature of the arrays enables numerous samples and genomic sequences to be analyzed. • The experimental design is not dependent on current genome annotations.

  6. Recent Research

  7. Recent Research • Surprising amounts of genomic ‘dark matter’ • More than 50% of animal genomes may be transcribed • Novel protein-coding genes • Novel non-coding genes (rRNA, tRNA, snoRNA, miRNA…) • Antisense transcripts • Alternative isoforms and gene ‘extensions’ • Leaky transcription • Technical noise/artifacts

  8. Exon-intron discriminators • Kampa et al.: Hodges–Lehmann estimator (pseudo-median)
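The slide only names the estimator, so the following is a minimal sketch of the one-sample Hodges–Lehmann estimator itself: the median of all pairwise (Walsh) averages. The function name `hodges_lehmann` is hypothetical, and applying it to the intensities of probes in a local window is an assumption about usage, not something stated on the slide.

```python
from itertools import combinations_with_replacement
from statistics import median


def hodges_lehmann(values):
    """One-sample Hodges-Lehmann estimator (pseudo-median).

    Returns the median of all Walsh averages (x_i + x_j) / 2 with i <= j.
    """
    walsh_averages = [
        (a + b) / 2 for a, b in combinations_with_replacement(values, 2)
    ]
    return median(walsh_averages)


# Hypothetical window of probe intensities with one outlier.
window = [1, 2, 3, 100]
pseudo_median = hodges_lehmann(window)
```

The pseudo-median is far less sensitive to a single aberrant probe than the mean, which is the usual motivation for using it on noisy probe-level data.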

  9. Exon-intron discriminators • Schadt et al.: PCA • 1. Probes are separated into 15 kb sliding windows • 2. Calculate robust principal components (from the between-sample correlation matrix) • 3. Calculate the Mahalanobis distance (probe location minus the center of the data in the first two dimensions of the principal component score, PCS) • 4. Decide on exon vs. intron • 5. Assign probes to transcriptional units
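Step 3 above can be illustrated with a small sketch of the Mahalanobis distance in two dimensions. It assumes the probes have already been projected onto the first two principal component scores, and it uses the ordinary sample mean and covariance rather than the robust estimates mentioned on the slide. The function name `mahalanobis_2d` is hypothetical.

```python
def mahalanobis_2d(points):
    """Mahalanobis distance of each 2-D point from the centre of the data.

    Uses the plain sample mean and covariance (an assumption; the slide
    refers to robust estimates). Fails if the points are collinear,
    because the covariance matrix is then singular.
    """
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2

    distances = []
    for x, y in points:
        dx, dy = x - mean_x, y - mean_y
        # d^2 = v^T S^-1 v, with the 2x2 inverse written out explicitly.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(d2 ** 0.5)
    return distances


# Hypothetical principal component scores for five probes.
scores = [(0, 0), (1, 0), (0, 1), (1, 1), (5, 5)]
dist = mahalanobis_2d(scores)
```

Probes with a large distance from the centre of their window would then be the candidates for the exon versus intron decision in step 4.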

  10. Exon-intron discriminators • Our collaborators’ approach (Andrew Fraser and Tom Gingeras): • use negative bacterial controls to calculate an intensity threshold corresponding to a 5% false positive rate in a given region • apply these intensity thresholds to generate positive probe maps, which are then joined together using two parameters: maxgap, the maximal distance between two positive probes, and minrun, the minimal size of a transfrag • a minrun of 40 (two positive probes) or 80 (three positive probes) is a good starting point
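The thresholding and joining steps described above can be sketched as follows. This is an illustration under stated assumptions, not the collaborators' implementation: the function names, the `probe_len=25` default, measuring gaps between probe start positions, and measuring a run as last start minus first start plus probe length are all hypothetical choices.

```python
def control_threshold(control_intensities, fpr=0.05):
    """Intensity that at most a fraction `fpr` of negative controls exceed."""
    ordered = sorted(control_intensities)
    allowed = min(int(fpr * len(ordered)), len(ordered) - 1)
    return ordered[len(ordered) - 1 - allowed]


def call_transfrags(positions, positive, maxgap, minrun, probe_len=25):
    """Join positive probes into transfrags.

    positions: probe start coordinates in ascending order (bp).
    positive:  booleans, True where the probe intensity exceeds the threshold.
    Positive probes at most `maxgap` bp apart are joined; runs spanning
    fewer than `minrun` bp are discarded.
    """
    transfrags = []
    start = end = None
    for pos, is_positive in zip(positions, positive):
        if not is_positive:
            continue
        if start is not None and pos - end <= maxgap:
            end = pos                      # extend the current run
        else:
            if start is not None and end - start + probe_len >= minrun:
                transfrags.append((start, end))
            start = end = pos              # open a new run
    if start is not None and end - start + probe_len >= minrun:
        transfrags.append((start, end))
    return transfrags


# Hypothetical data: 100 control intensities and 8 probes.
controls = list(range(1, 101))
threshold = control_threshold(controls)
positions = [0, 35, 70, 105, 140, 175, 210, 245]
positive = [True, True, True, False, False, False, True, False]
frags = call_transfrags(positions, positive, maxgap=50, minrun=80)
```

In this toy example the isolated positive probe at position 210 is dropped because its run is shorter than minrun, while the three adjacent positive probes form one transfrag.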

  11. About our tiling data • Affymetrix C. elegans Tiling 1.0R Array • Genome-wide gene expression: • Chr I–V, Chr X and Chr M (mitochondrion) • Resolution: on average 25 bp • Negative bacterial controls • Samples: 21 samples across development (plus mutant) • Probes: 2,942,364 PM/MM pairs

  12. About tiling data [Table; column label: sample number] * smg-1: deficient in nonsense-mediated decay

  13. Examples • LAP-1 (ZK353.6) • Genomic position: III:8401845..8399119 bp • LAP-1 is expressed throughout the life cycle. While there appears to be marginally less LAP-1 message at 2 h and 40 h, corresponding to early L1 and young adult stages respectively, LAP-1 appears to be constitutively expressed. Densitometric analysis of LAP-1 expression compared to the housekeeping gene ama-1 shows some variation in LAP-1 expression but this appears to be unrelated to moulting.

  14. Example [Figure; axis: probe intensity; labels: intron, exon]

  15. Example

  16. Example 2 [Figure; axis: probe intensity]

  17. Example 2

  18. General impression • Chr III: 2866 genes

  19. General impression

  20. General impression

  21. PCA

  22. Methods: machine learning • Aim • Find the most effective (correct) machine learning method that distinguishes between true exons and true introns • Find the simplest (fastest, most intuitive) method that achieves this task

  23. Methods: machine learning • Main challenge: true exons and true introns are not known • Annotated exons may be unexpressed • Annotated introns may be novel transcripts • Our approach: ignore the problem and optimize supervised performance • Assumption: true novel transcripts will be similar to known ones

  24. Methods: machine learning • 1. Classification and regression tree (CART) • binary recursive partitioning • Advantages: • Easy to understand • Easy to implement • Computationally cheap
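To make "binary recursive partitioning" concrete, here is a minimal sketch of a single partitioning step: finding the threshold on one feature that minimises the weighted Gini impurity of the two child nodes. A full tree would repeat this step recursively on each child. The function names and the choice of Gini impurity as the split criterion are assumptions; the slides do not state which criterion or software was used.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(values, labels):
    """Best threshold for the rule `value <= threshold` on one feature.

    Returns (threshold, weighted_gini). The threshold is None when no
    split improves on the impurity of the unsplit node.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                       # cannot split between equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, score)
    return best


# Hypothetical median probe intensities with intron (0) / exon (1) labels.
intensity = [1, 2, 3, 10, 11, 12]
is_exon = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_split(intensity, is_exon)
```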

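  25. Methods: machine learning • 2. Support vector machines (SVM) [Figure: two classes of points; one marker denotes +1, the other denotes 0] How would you classify this data?

  26. 2. Support vector machines (SVM) [Figure: two classes of points; one marker denotes +1, the other denotes 0] How would you classify this data?

  27. 2. Support vector machines (SVM) [Figure: two classes of points; one marker denotes +1, the other denotes 0] How would you classify this data?

  28. 2. Support vector machines (SVM) [Figure: two classes of points; one marker denotes +1, the other denotes 0] How would you classify this data?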

  29. Maximum Margin denotes +1 denotes 0 The classifier with the maximum margin is the ideal one.
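The idea of comparing candidate separating lines by their margin can be sketched as below. The code only evaluates the geometric margin of a given line w·x + b = 0 in two dimensions; it does not train an SVM. The function name, the toy points, and the two candidate lines are hypothetical.

```python
import math


def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any point to the line w.x + b = 0.

    Labels follow the slide convention (+1 and 0). The result is negative
    if at least one point lies on the wrong side of the line.
    """
    norm = math.hypot(w[0], w[1])
    margins = []
    for (x1, x2), label in zip(points, labels):
        sign = 1 if label == 1 else -1
        margins.append(sign * (w[0] * x1 + w[1] * x2 + b) / norm)
    return min(margins)


# Hypothetical toy data: class 0 on the left, class +1 on the right.
points = [(0, 0), (0, 1), (3, 0), (3, 1)]
labels = [0, 0, 1, 1]

margin_centered = geometric_margin((1, 0), -1.5, points, labels)  # line x = 1.5
margin_shifted = geometric_margin((1, 0), -1.0, points, labels)   # line x = 1.0
```

Both lines separate the two classes, but the centred line keeps every point further away, which is the property the maximum margin classifier optimises.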

  30. Evaluation • Receiver Operating Characteristic curve (ROC curve) [Figure: ROC curve; x-axis: False Positive Rate (1 − specificity); y-axis: True Positive Rate (sensitivity)]

  31. The Area Under an ROC Curve (AUC)
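As a worked illustration, the AUC can be computed directly from its rank interpretation: the probability that a randomly chosen positive (exon) probe scores higher than a randomly chosen negative (intron) probe, with ties counted as one half. The function name `auc` and the toy scores are hypothetical; this quadratic-time version is meant for clarity, not for millions of probes.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for positives (e.g. exon probes), 0 for negatives.
    Assumes both classes are present.
    """
    positives = [s for s, label in zip(scores, labels) if label == 1]
    negatives = [s for s, label in zip(scores, labels) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical classifier scores for four probes.
perfect = auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
imperfect = auc([0.9, 0.3, 0.8, 0.1], [1, 1, 0, 0])
```

An AUC of 1.0 means every exon probe outscores every intron probe, and 0.5 corresponds to random ranking.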

  32. Selection of informative features – intensities [Figure panels: Raw, Normalized; summaries: Mean, Median, Max, Max_1; features: pm.i, pm.1, pm_1, pm.2, pm_2, mm.i, mm.1, mm_1, mm.2, mm_2]

  33. Selection of informative features – correlation [Figure panels: Raw, Normalized; coefficients: Pearson, Spearman; features: pm1, pm-1, mm1, mm-1]

  34. Selection of informative features • Summary • Almost all reasonable features are informative • No striking difference between mean and median, but both seem better than max and max_1 • The correlation coefficient (CC) is also informative, with no striking difference between Pearson and Spearman • Quantile normalization doesn’t improve the result • Decision • Median and CC (Pearson) of non-normalized data are used to generate features • GC content or melting temperature can also be informative
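The two chosen feature types can be sketched as follows: the median intensity of a probe across samples, and the Pearson correlation of its across-sample profile with that of its immediate neighbours. The data layout (one list of per-sample intensities per probe, in genome order) and the names `pearson`, `probe_features`, `cc_left` and `cc_right` are assumptions for illustration.

```python
from statistics import mean, median


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def probe_features(pm, i):
    """Features for probe i; valid for interior probes (1 <= i <= len(pm) - 2).

    pm: list of per-probe intensity lists across samples, in genome order.
    """
    return {
        "median": median(pm[i]),
        "cc_left": pearson(pm[i], pm[i - 1]),    # correlation with neighbour -1
        "cc_right": pearson(pm[i], pm[i + 1]),   # correlation with neighbour +1
    }


# Hypothetical PM intensities for three neighbouring probes in three samples.
pm = [[1, 2, 3], [2, 4, 6], [3, 2, 1]]
features = probe_features(pm, 1)
```

The intuition is that neighbouring probes within the same expressed exon should rise and fall together across samples, so a high correlation with a neighbour supports an exon call.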

  35. Selection of informative features – neighbors CART

  36. Selection of informative features – neighbors SVM CART

  37. Selection of informative features • Neighbours • MM • CC.PM • CC.MM • Tm • ANOVA results

  38. Results

  39. Example tree

  40. AUC ~ expression level

  41. AUC ~ length(exon)

  42. AUC ~ Tm

  43. AUC ~ probe position within exon

  44. AUC ~ other factors • expression • exon length • melting temperature • relative position

  45. Can minrun and maxgap improve the results? maxgap = 1 minrun = 3

  46. Can minrun and maxgap improve the results? minrun = 3 maxgap = 1

  47. Maxgap and minrun optimization [Figure panels: Maxgap/minrun, Minrun/maxgap] • thres = 0.936 • ccr = 0.806 • fpr = 0.009 • tpr = 0.464

  48. Maxgap and minrun optimization [Figure panels: Maxgap/minrun, Minrun/maxgap] • thres = 0.718 • ccr = 0.850 • fpr = 0.030 • tpr = 0.627

  49. Maxgap and minrun optimization [Figure panels: Maxgap/minrun, Minrun/maxgap] • thres = 0.500 • ccr = 0.856 • fpr = 0.059 • tpr = 0.700

  50. Maxgap and minrun optimization [Figure panels: Maxgap/minrun, Minrun/maxgap] • thres = 0.300 • ccr = 0.815 • fpr = 0.216 • tpr = 0.851
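The quantities reported on slides 47 to 50 can be computed from classifier scores as sketched below. This assumes that "ccr" stands for the correct classification rate (accuracy) and that a probe is called positive when its score is at least the threshold, which matches the trend in the reported numbers (higher threshold, lower fpr and tpr). The function name and toy data are hypothetical.

```python
def threshold_rates(scores, labels, threshold):
    """Return ccr, fpr and tpr for the rule `score >= threshold`.

    labels: 1 for true exon probes, 0 for true intron probes.
    Assumes both classes are present.
    """
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return {
        "ccr": (tp + tn) / (tp + fp + tn + fn),  # correct classification rate
        "fpr": fp / (fp + tn),                   # false positive rate
        "tpr": tp / (tp + fn),                   # true positive rate
    }


# Hypothetical scores for four probes.
rates = threshold_rates([0.9, 0.6, 0.4, 0.2], [1, 0, 1, 0], threshold=0.5)
```

Scanning the threshold from high to low and recording (fpr, tpr) at each value traces out the ROC curve.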
