Analyzing genome tiling microarrays for the detection of novel expressed genes Preliminary version 23 Feb 2007 Yang Li and Rainer Breitling Dagstuhl seminar, March 2007 Groningen Bioinformatics Centre
Outline • Introduction to tiling arrays • Published research on exon finding • Our data set • Machine learning for exon finding • Results
Background • Genomic tiling array • Probes are designed to blanket an entire genomic region of interest and are used to detect the presence or absence of transcription. • Tiling • A sequence of probes spanning a genomic region is called a “tile path”, or a “tiling”.
Two types of tiling array construction: • 1) Oligonucleotide tiling array • 2) Tiling array constructed using PCR products. Trends in Genetics 2005, 21: 466
Detection of transcription • 1) Discovery of novel genes • 2) Discovery of novel non-coding RNAs • 3) Alternative splicing study • Advantages: • The sensitivity of microarrays enables rare transcripts to be detected. • The parallel nature of the arrays enables numerous samples and genomic sequences to be analyzed. • The experimental design does not depend on current genome annotations.
Recent Research • Surprising amounts of genomic ‘dark matter’ • More than 50% of animal genomes may be transcribed • Novel protein-coding genes • Novel non-coding genes (rRNA, tRNA, snoRNA, miRNA…) • Antisense transcripts • Alternative isoforms and gene ‘extensions’ • Leaky transcription • Technical noise/artifacts
Exon-intron discriminators • Kampa et al.: Hodges–Lehmann estimator (pseudo-median)
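Kampa et al. summarize the probe intensities in a window with the Hodges–Lehmann estimator, the median of all pairwise averages (Walsh averages), which is why it is also called the pseudo-median. A minimal sketch of this statistic, assuming a simple per-window summary (the window and intensities below are hypothetical):

import numpy as np
from itertools import combinations_with_replacement

def pseudo_median(intensities):
    # Hodges-Lehmann estimator: median of all pairwise averages (Walsh averages)
    x = np.asarray(intensities, dtype=float)
    walsh = [(a + b) / 2.0 for a, b in combinations_with_replacement(x, 2)]
    return float(np.median(walsh))

# hypothetical sliding window of PM probe intensities
window = [120.0, 135.0, 98.0, 410.0, 140.0]
print(pseudo_median(window))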
Exon-intron discriminators • Schadt et al.: PCA • 1. Probes are separated into 15 kb sliding windows • 2. Calculate a robust principal component (from the between-sample correlation matrix) • 3. Calculate the Mahalanobis distance (probe location minus the center of the data in the first two dimensions of the principal component scores, PCS) • 4. Decide on exon vs. intron • 5. Assign probes to transcriptional units
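A rough sketch of the Schadt et al. idea, not their implementation: within each 15 kb window, project the probe-by-sample intensity matrix onto its first two principal components and flag probes whose Mahalanobis distance from the centre of the scores is large. Using a plain (rather than robust) PCA on the centred matrix and a fixed distance cut-off are simplifying assumptions here:

import numpy as np

def window_exon_calls(window_probes, cutoff=3.0):
    # window_probes: (n_probes, n_samples) intensity matrix for one 15 kb window.
    # Returns a boolean array, True where a probe lies far from the bulk (exon-like).
    X = np.log2(window_probes + 1.0)
    X = X - X.mean(axis=0)                               # centre each sample
    U, s, Vt = np.linalg.svd(X, full_matrices=False)     # plain PCA via SVD
    scores = U[:, :2] * s[:2]                            # probe scores in the first two PCs
    diff = scores - scores.mean(axis=0)
    cov = np.cov(scores, rowvar=False)
    d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)
    return np.sqrt(d2) > cutoff                          # Mahalanobis distance threshold

# hypothetical window: 600 probes x 21 samples
rng = np.random.default_rng(0)
calls = window_exon_calls(rng.gamma(2.0, 100.0, size=(600, 21)))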
Exon-intron discriminators • Our collaborators’ approach (Andrew Fraser and Tom Gingeras): • use negative bacterial controls to calculate an intensity threshold corresponding to a 5% false positive rate in a given region • apply these intensity thresholds to generate positive probe maps, which are then joined together using two parameters: maxgap, the maximal distance between two positive probes, and minrun, the minimal size of a transfrag • minrun values of 40 (two positive probes) or 80 (three positive probes) are a good starting point for these parameters
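A minimal sketch of the maxgap/minrun joining step described above; the probe positions, intensities, intensity threshold and default parameter values are illustrative, and the real pipeline derives the threshold from the negative bacterial controls:

def call_transfrags(positions, intensities, threshold, maxgap=40, minrun=40):
    # Join positive probes (intensity > threshold) into transfrags.
    # positions: sorted probe start coordinates in bp; returns (start, end) tuples.
    positive = [p for p, i in zip(positions, intensities) if i > threshold]
    transfrags, start, prev = [], None, None
    for p in positive:
        if start is None:
            start = prev = p
        elif p - prev <= maxgap:
            prev = p                          # still within the same transfrag
        else:
            if prev - start >= minrun:        # keep only runs of sufficient size
                transfrags.append((start, prev))
            start = prev = p
    if start is not None and prev - start >= minrun:
        transfrags.append((start, prev))
    return transfrags

# hypothetical example: probes every 25 bp
pos = list(range(0, 500, 25))
inten = [50, 300, 320, 310, 60, 55, 280, 290, 300, 310, 40, 35, 30, 25, 20, 400, 40, 30, 20, 10]
print(call_transfrags(pos, inten, threshold=100))   # -> [(25, 75), (150, 225)]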
About our tiling data • Affymetrix C. elegans Tiling 1.0R Array • Genome-wide gene expression: • Chr I–V, Chr X and Chr M (mitochondrion) • Resolution: on average 25 bp • Negative bacterial controls • Samples: 21 samples across development (plus mutant) • Probes: 2,942,364 PM/MM pairs
About tiling data • (table of sample numbers across developmental stages) • * smg-1: deficient in nonsense-mediated decay
Examples • LAP-1 (ZK353.6) • Genomic position: III:8401845..8399119 bp • LAP-1 is expressed throughout the life cycle. While there appears to be marginally less LAP-1 message at 2 h and 40 h, corresponding to early L1 and young adult stages respectively, LAP-1 appears to be constitutively expressed. Densitometric analysis of LAP-1 expression compared to the housekeeping gene ama-1 shows some variation in LAP-1 expression, but this appears to be unrelated to moulting.
Example • Probe intensity profile across the gene, with exon and intron regions marked (figure)
Example 2 • Probe intensity profile (figure)
General impression • Chr III, 2866 genes (figure)
Methods: machine learning • Aim • Find the most effective (correct) machine learning method that distinguishes between True exons and True introns • Find the simplest (fastest, most intuitive) method that achieves this task
Methods: machine learning • Main challenge • True exons and True introns are not known: • Annotated exons may be unexpressed • Annotated introns may be novel transcripts • Our approach • Ignore the problem and optimize supervised performance • Assumption • True novel transcripts will be similar to known ones
Methods: machine learning • 1. Classification and regression trees (CART) • binary recursive partitioning • Advantages: • Easy to understand • Easy to implement • Computationally cheap
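As a sketch of what such a tree looks like in practice, here is a small CART model on two per-probe features, using scikit-learn's DecisionTreeClassifier; the features, the simulated data and the labels standing in for annotated exons/introns are all hypothetical:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
n = 1000
# hypothetical per-probe features: median PM intensity and correlation with neighbours
X = np.column_stack([rng.normal(8, 2, n), rng.uniform(-1, 1, n)])
y = (X[:, 0] + 2 * X[:, 1] > 8).astype(int)   # stand-in for annotated exon (1) / intron (0)

tree = DecisionTreeClassifier(max_depth=4, random_state=0)  # shallow tree stays interpretable
tree.fit(X, y)
print(tree.score(X, y))                       # training accuracy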
Methods: machine learning • 2. Support vector machines (SVM) • How would you classify this data? (scatter plot of two probe classes, labelled +1 and 0)
Maximum Margin • The classifier with the maximum margin is the ideal one.
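A short sketch of a maximum-margin (linear) SVM on the same kind of two-feature probe data; the simulated clusters below stand in for exon-like (+1) and intron-like (0) probes and are not the actual tiling features:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
exon = rng.normal([10.0, 0.6], 1.0, size=(200, 2))      # simulated exon-like probes (+1)
intron = rng.normal([6.0, 0.0], 1.0, size=(200, 2))     # simulated intron-like probes (0)
X = np.vstack([exon, intron])
y = np.array([1] * 200 + [0] * 200)

svm = SVC(kernel='linear', C=1.0)   # linear kernel: maximum-margin separating hyperplane
svm.fit(X, y)
print(svm.coef_, svm.intercept_)    # the hyperplane w.x + b = 0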
Evaluation • Receiver Operating Characteristic (ROC) curve: True Positive Rate (sensitivity) plotted against False Positive Rate (1 − specificity)
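Classifiers are then compared by the area under this curve (AUC). A small sketch of how the curve and the AUC can be computed from classifier scores and annotation labels (both simulated here):

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(3)
y_true = rng.integers(0, 2, 500)                     # annotated exon (1) / intron (0)
scores = y_true * 0.8 + rng.normal(0, 0.5, 500)      # hypothetical classifier scores
fpr, tpr, thresholds = roc_curve(y_true, scores)     # points of the ROC curve
print(roc_auc_score(y_true, scores))                 # area under the ROC curve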
Selection of informative features – intensities • Summary statistics (mean, median, max, max_1) of raw vs. quantile-normalized PM and MM intensities for each probe and its neighbours (pm.i, pm±1, pm±2; mm.i, mm±1, mm±2) (figures)
Selection of informative features – correlation • Pearson and Spearman correlation with the neighbouring probes (pm±1, mm±1), on raw vs. quantile-normalized data (figures)
Selection of informative features • Summary • Almost all reasonable features are informative • No striking difference between mean and median, but both seem better than max and max_1 • Correlation coefficients (CC) are also informative; no striking difference between Pearson and Spearman • Quantile normalization does not improve the results • Decision • Median and Pearson CC of the non-normalized data are used to generate features • GC content or melting temperature can also be informative
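A sketch of how the chosen features could be generated per probe, i.e. the median of the raw PM intensity across the samples and the Pearson correlation with the neighbouring probe across the samples; treating only the next probe as the neighbour is a simplification, not the exact feature definition used here:

import numpy as np

def probe_features(pm):
    # pm: (n_probes, n_samples) raw PM intensity matrix, probes in genomic order.
    # Returns per-probe features: median intensity and Pearson CC with the next probe.
    med = np.median(pm, axis=1)
    cc = np.full(pm.shape[0], np.nan)
    for i in range(pm.shape[0] - 1):
        cc[i] = np.corrcoef(pm[i], pm[i + 1])[0, 1]   # correlation across samples
    return np.column_stack([med, cc])

# hypothetical data: 50 probes x 21 samples
rng = np.random.default_rng(4)
features = probe_features(rng.gamma(2.0, 100.0, size=(50, 21)))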
Selection of informative features • Neighbours • MM • CC.PM • CC.MM • Tm • ANOVA results (figure)
AUC as a function of other factors: expression level, exon length, melting temperature, relative position (figures)
Can minrun and maxgap improve the results? maxgap = 1, minrun = 3
Can minrun and maxgap improve the results? minrun = 3, maxgap = 1
Maxgap and minrun optimization (maxgap/minrun vs. minrun/maxgap)
thres = threshold, ccr = correct classification rate, fpr = false positive rate, tpr = true positive rate
thres   ccr     fpr     tpr
0.936   0.806   0.009   0.464
0.718   0.850   0.030   0.627
0.500   0.856   0.059   0.700
0.300   0.815   0.216   0.851
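A sketch of how such a table can be produced: sweep the classifier score threshold (before or after the maxgap/minrun post-processing), compare the resulting calls to the annotation, and report the correct classification rate (ccr), false positive rate (fpr) and true positive rate (tpr). The scores and labels below are simulated, so the numbers will not match the table above:

import numpy as np

def sweep(scores, labels, thresholds):
    rows = []
    for t in thresholds:
        calls = scores >= t
        tp = np.sum(calls & (labels == 1))
        fp = np.sum(calls & (labels == 0))
        tn = np.sum(~calls & (labels == 0))
        fn = np.sum(~calls & (labels == 1))
        rows.append((t,
                     (tp + tn) / labels.size,   # ccr
                     fp / (fp + tn),            # fpr
                     tp / (tp + fn)))           # tpr
    return rows

rng = np.random.default_rng(5)
labels = rng.integers(0, 2, 2000)
scores = labels * 0.6 + rng.normal(0.3, 0.25, 2000)
for t, ccr, fpr, tpr in sweep(scores, labels, [0.936, 0.718, 0.500, 0.300]):
    print(f"thres={t:.3f} ccr={ccr:.3f} fpr={fpr:.3f} tpr={tpr:.3f}")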