Post-processing of the IMGAG M.t. 2.0 Release
Affymetrix Medicago Probe Set – IMGAG 2.0 / MTGI 8.0 Mapping
Zhao Bioinformatics Lab
IMGAG M.t. 2.0
Data downloaded from ftp://ftpmips.gsf.de/plants/medicago/MT_2_0/MT2.0_medicago_chrX_20080303_NoOverlap.xml.tar.gz
Summary
• 38,844 TUs and 38,844 gene models, in one-to-one correspondence
• 38,759 distinct gene names, so 82 models are redundant in gene name
• For 85 of the 38,844 models, the CDS region is not compatible with the FASTA file
• 4,644 models with 5’-UTR + CDS
• 5,846 models with CDS + 3’-UTR
• 11,656 models with 5’-UTR + CDS + 3’-UTR
• 16,698 models with CDS only
Evidence Code
• F (5,036 genes) full coverage/FL-cDNA: the complete gene model, from translation start to translation stop, is covered by expressed Medicago sequence, e.g. FL-cDNA or EST alignments across the full length of the coding sequence.
• E (14,737 genes) expressed/EST matches: expression of the gene is supported by Medicago EST sequence that (partially) matches the gene call.
• H (14,209 genes) homology/heterologous: the gene call is supported by similarity to Medicago or other ESTs, protein, FL-cDNA, genomic or other sequences, with partial or full-length alignments.
• I (1,375 genes) intrinsic/ab initio/inferred/hypothetical: the gene call is based only on intrinsic prediction tools such as FGENESH, Genscan or Eugene, and no significant alignments to other sequences are available. The prediction is longer than 300 bp, or there is a significant domain match in InterPro.
• L (3,830 genes) 'low quality' gene calls: gene calls not in F, E, or H, with no significant InterPro domain match and a length of less than 300 bp, i.e. short, unsupported intrinsic predictions that statistically contain many false predictions.
Total genes: 38,334 non-overlapping genes
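The evidence codes above read like a decision cascade, which can be sketched roughly as follows. This is only an illustration: the function name, its boolean inputs, the F > E > H precedence, and the handling of a prediction of exactly 300 bp (placed in L here) are all assumptions, not the IMGAG annotation pipeline.

```python
def evidence_code(full_length_support: bool,
                  est_support: bool,
                  homology_support: bool,
                  has_interpro_domain: bool,
                  length_bp: int) -> str:
    """Assign an evidence code following the slide's definitions.

    Assumed precedence: F, then E, then H, then the intrinsic-only codes.
    """
    if full_length_support:        # covered start-to-stop by expressed sequence
        return "F"
    if est_support:                # partial Medicago EST match
        return "E"
    if homology_support:           # similarity to other sequences
        return "H"
    # Intrinsic-only predictions: I if long enough or domain-supported.
    if length_bp > 300 or has_interpro_domain:
        return "I"
    # Short, unsupported predictions (exactly 300 bp lands here by assumption).
    return "L"


if __name__ == "__main__":
    # Hypothetical gene call: ab initio only, 250 bp, no domain hit.
    print(evidence_code(False, False, False, False, 250))
```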
Affymetrix Medicago Probe Set – IMGAG Gene Mapping
Two approaches
A. BLAST-based approach
(1) HSP length / Affymetrix probe set target length >= threshold 1
(2) Matching identity length / max HSP length >= threshold 2
B. Affymetrix probe-level matching
(1) IMGAG gene sequences were matched to the corresponding Affymetrix probe sets using a position-weighted scoring index in which mismatches near the middle of a 25-mer probe were penalized most heavily. The position weights are (1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,2,2,2,2,2,1,1,1,1,1).
(2) A perfect match for a probe yields a score of 45, the sum of the weights. A match was declared when at least 8 of the 11 probes in a probe set had scores of 43 or higher.
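The position-weighted scoring in approach B can be illustrated with a short sketch. The weights, the perfect score of 45, and the 43 and 8-of-11 cut-offs come from the slide. The function names are hypothetical, the score is assumed to be the sum of weights at matching positions in an ungapped 25-base comparison, and the 8-of-11 rule is assumed to count the probes within one probe set.

```python
# Position weights for a 25-mer probe, as listed on the slide.
WEIGHTS = [1] * 5 + [2] * 5 + [3] * 5 + [2] * 5 + [1] * 5
PERFECT_SCORE = sum(WEIGHTS)  # 45


def probe_score(probe: str, target: str) -> int:
    """Sum the weights of positions where probe and target agree.

    Assumption: ungapped, position-by-position comparison of 25 bases.
    """
    if len(probe) != len(WEIGHTS) or len(target) != len(WEIGHTS):
        raise ValueError("probe and target window must both be 25 bases")
    return sum(w for w, p, t in zip(WEIGHTS, probe.upper(), target.upper())
               if p == t)


def probe_set_matches(scores, min_score=43, min_probes=8) -> bool:
    """Declare a match when enough probes reach the score cut-off."""
    return sum(s >= min_score for s in scores) >= min_probes


if __name__ == "__main__":
    probe = "ACGTACGTACGTACGTACGTACGTA"             # hypothetical 25-mer
    central_mismatch = probe[:12] + "C" + probe[13:]  # weight-3 position
    print(probe_score(probe, probe), probe_score(probe, central_mismatch))
```

Under these assumptions a cut-off of 43 tolerates one mismatch at a weight-1 or weight-2 position, or two mismatches in the weight-1 flanks, but no mismatch in the five central bases.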
Statistics on Approach A – scenario #1: less stringent
• Affy probe set target BLAST against IMGAG cDNA
• Threshold 1 = 0.7; Threshold 2 = 0.7
Statistics on Approach A – scenario #2: perfect matches
• Affy probe set target BLAST against IMGAG cDNA
• Threshold 1 = 1.0; Threshold 2 = 1.0
Statistics of our probe set vs. EST mapping
Overlap between our probe set vs. EST mapping (37,872) and the original Affymetrix probe set vs. EST mapping (32,108): 37,872 ∩ 32,108 = 32,106. Our method covered 32,106 / 32,108 = 99.994% of the original Affymetrix mapping.
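The overlap figure is a set intersection divided by the size of the reference mapping. A minimal sketch, assuming each mapping is held as a set of (probe set ID, EST ID) pairs; the IDs below are invented for illustration, and only the counts 32,106 and 32,108 come from the slide.

```python
def coverage(ours: set, reference: set) -> float:
    """Fraction of the reference mapping that is also present in ours."""
    if not reference:
        raise ValueError("reference mapping is empty")
    return len(ours & reference) / len(reference)


if __name__ == "__main__":
    # Toy mappings with hypothetical identifiers.
    ours = {("ps1", "est1"), ("ps2", "est2"), ("ps3", "est3")}
    affy = {("ps1", "est1"), ("ps2", "est2"), ("ps4", "est4")}
    print(f"toy coverage: {coverage(ours, affy):.2%}")

    # The same ratio computed from the counts reported on the slide.
    shared, reference_total = 32106, 32108
    print(f"slide coverage: {shared / reference_total:.4%}")
```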
Statistics on Approach B IMGAG cDNA versus Probe_set
Probe sets mapped to IMGAG or ESTs
[Figure: diagram of probe set mapping; labels as shown: 14.72; EST 41.82; (28.22); IMGAG 15.25]
MTGI 8 vs. IMGAG Gene Mapping
Mt2.0 cDNA BLASTN against MTGI 8 (expectation 1e-04), followed by two filters:
(a) HSP length / unigene length
(b) Identity length / HSP length
Result:
• 9,333 (24.0%) cDNAs are mapped to 9,255 (25.1%) unigenes (a > 0.9, b > 0.9)
• 11,517 (29.6%) cDNAs are mapped to 11,383 (30.9%) unigenes (a > 0.8, b > 0.8)
• 13,284 (34.2%) cDNAs are mapped to 13,092 (35.5%) unigenes (a > 0.7, b > 0.7)
• 9,959 (25.64%) cDNAs are mapped to 10,543 (28.59%) unigenes (a > 0.8, b > 0.95)
• 13,063 (33.63%) cDNAs are mapped to 14,585 (39.55%) unigenes (a > 0.5, b > 0.95)
Total cDNAs: 38,844; total unigenes: 36,878
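The two ratio filters (a) and (b) can be applied to parsed BLAST hits along these lines. This is a sketch under stated assumptions: the `Hsp` record and its field names are hypothetical (they are not a BLAST output format), the cDNA is taken as the query and the unigene as the subject, and the toy hits are invented.

```python
from dataclasses import dataclass


@dataclass
class Hsp:
    query_id: str        # Mt2.0 cDNA identifier (hypothetical field name)
    subject_id: str      # MTGI 8 unigene identifier
    hsp_length: int      # alignment length of the HSP
    identities: int      # number of identical positions in the HSP
    subject_length: int  # full length of the unigene


def passes(hsp: Hsp, a_min: float, b_min: float) -> bool:
    """Apply filter (a) HSP/unigene length and (b) identity/HSP length."""
    a = hsp.hsp_length / hsp.subject_length
    b = hsp.identities / hsp.hsp_length
    return a > a_min and b > b_min


def count_mapped(hsps, a_min: float, b_min: float):
    """Return (distinct cDNAs, distinct unigenes) among the retained HSPs."""
    kept = [h for h in hsps if passes(h, a_min, b_min)]
    return len({h.query_id for h in kept}), len({h.subject_id for h in kept})


if __name__ == "__main__":
    toy_hits = [  # invented values, for illustration only
        Hsp("cdna1", "tc1", 950, 940, 1000),
        Hsp("cdna2", "tc1", 600, 590, 1000),
        Hsp("cdna3", "tc2", 400, 300, 1000),
    ]
    for a_min, b_min in [(0.9, 0.9), (0.5, 0.95)]:
        print(a_min, b_min, count_mapped(toy_hits, a_min, b_min))
```

Counting distinct cDNAs and distinct unigenes separately is what allows the two totals in each result line to differ, since several cDNAs may hit one unigene and vice versa.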
MTGI 8 High Quality TC vs. IMGAG Gene Mapping
I. Retrieved 9,396 High Quality TCs based on IMGAG’s criteria
BLAST of TIGR’s High Quality TCs vs. BACs:
(1) >95% identity over 80% of the TC length = 64.3% (current 2,500 BACs) -> 73.2% projected for 2,800 BACs to be sequenced
(2) >95% identity over 50% of the TC length = 68.6% (current 2,500 BACs) -> 77.0% projected for 2,800 BACs to be sequenced
II. Our Mt2.0 cDNA BLASTN against the 9,396 MTGI 8 High Quality TCs (expectation 1e-04), followed by two filters:
(a) HSP length / unigene length
(b) Identity length / HSP length
Result:
• 3,550 (9.14%) cDNAs are mapped to 3,294 (35.06%) unigenes (a > 0.8, b > 0.95)
• 5,052 (13.0%) cDNAs are mapped to 4,613 (49.10%) unigenes (a > 0.5, b > 0.95)
Total cDNAs: 38,844; total High Quality TCs: 9,396
Thank You! • Suggestions / Comments