Annotation Parsing
Affymetrix File Format
• Comma-separated file containing many kinds of annotation data:
  • UniGene, Ensembl, Entrez and SwissProt IDs
  • Genome version, chromosomal location, alignment info
  • Gene Ontology info
  • Pathway membership
  • Protein families and domains
• Looks like:
"1000_at","Human Genome U95Av2 Array","Homo sapiens","Dec 18, 2005","Exemplar sequence","GenBank","X60188mRNA","X60188 /FEATURE=mRNA /DEFINITION=HSERK1 Human ERK1 mRNA for protein serine/threonine kinase","X60188","---","Hs.861","May 2004 (NCBI 35)","chr16:30032927-30042040 (-) // 93.03 // p11.2","mitogen-activated protein kinase 3","MAPK3","chr16p12-p11.2","full length","ENSG00000102882","5595","P27361 /// Q9BWJ1 /// Q7Z3H5 /// Q8NHX0 /// Q8NHX1","EC:2.7.1.-","601795","NP_002737.1","NM_002746","---","---","---","---","---","---","74 // regulation of progression through cell cycle // non-traceable author statement /// 6468 // protein amino acid phosphorylation // inferred from direct assay /// 6468 // protein amino acid phosphorylation // inferred from electronic annotation /// 7049 // cell cycle // inferred from electronic annotation","---","166 // nucleotide binding // inferred from electronic annotation /// 4674 // protein serine/threonine kinase activity // inferred from electronic annotation /// 4707 // MAP kinase activity // non-traceable author statement /// 4713 // protein-tyrosine kinase activity // inferred from electronic annotation /// 5515 // protein binding // inferred from physical interaction /// 5524 // ATP binding // non-traceable author statement /// 16740 // transferase activity // inferred from electronic annotation /// 4672 // protein kinase activity // inferred from electronic annotation /// 4707 // MAP kinase activity // inferred from electronic annotation /// 5524 // ATP binding // inferred from electronic annotation /// 16301 // kinase activity // inferred from electronic annotation","MAPK_Cascade // GenMAPP /// S1P_Signaling // GenMAPP /// TGF_Beta_Signaling_Pathway // GenMAPP","ec // A2S7_HUMAN // (Q96Q40) Serine/threonine-protein kinase ALS2CR7 (EC 2.7.1.37) (Amyotrophic lateral sclerosis 2 chromosomal region candidate gene protein 7) // 1.0E-77 /// ec // A2S7_HUMAN // (Q96Q40) Serine/threonine-protein kinase ALS2CR7 (EC 2.7.1.37) (Amyotrophic lateral sclerosis 2 chromosomal region candidate gene protein 7) // 2.0E-85 /// hanks // 3.1.1 // CMCG Group; CMGC I Cyclin-dependent (CDKs) and close relatives; CDC2Hs // 1.0E-85 /// hanks // 3.1.1 // CMCG Group; CMGC I Cyclin-dependent (CDKs) and close relatives; CDC2Hs // 1.0E-79","---","IPR000719 // Protein kinase","---","---","This probe set was annotated using the Matching Probes based pipeline to a Entrez Gene identifier using 3 transcripts. // false // Matching Probes // A","BC000205(15),BX537897(15),NM_002746(16)","NM_002746 // Homo sapiens mitogen-activated protein kinase 3 (MAPK3), mRNA. // refseq // 16 // --- /// CR603463 // full-length cDNA clone CS0DN005YA14 of Adult brain of Homo sapiens (human). // gb // 15 // --- /// ENSESTT00000097559 // --- // ensembl_est // 15 // --- /// ENST00000263025 // cdna:known-ccds chromosome:NCBI35:16:30032928:30042042:-1 gene:ENSG00000102882 CCDS10672.1 // ensembl_transcript // 15 // --- /// BC000205 // Homo sapiens, clone IMAGE:3350666, mRNA, partial cds. // gb // 15 // --- /// BX537897 // Homo sapiens mRNA; cDNA DKFZp686O0215 (from clone DKFZp686O0215). // gb // 15 // ---","ENSESTT00000097558 // ensembl_est // 4 // Cross Hyb Matching Probes /// AK096992 // gb // 1 // Cross Hyb Matching Probes"
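The sample record suggests three levels of structure: quoted, comma-separated fields; " /// " between the entries of a multi-valued field; and " // " between the parts of one entry. Below is a minimal Java sketch of splitting along those lines. The class name AffyLineSplitter and both method names are hypothetical. Treating "---" as a missing value is an assumption drawn from the sample, and escaped quotes inside a field are not handled.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative helper (hypothetical name) for splitting one annotation record. */
final class AffyLineSplitter {

    private AffyLineSplitter() {}

    /** Splits a record of the form "a","b","c" into its unquoted fields. */
    static List<String> splitRecord(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;            // quotes delimit a field and are dropped
            } else if (c == ',' && !inQuotes) {
                fields.add(current.toString());  // a comma outside quotes ends the field
                current.setLength(0);
            } else {
                current.append(c);               // includes commas inside quotes
            }
        }
        fields.add(current.toString());
        return fields;
    }

    /** Splits a multi-valued field on "///", then each entry on "//". */
    static List<List<String>> splitEntries(String field) {
        List<List<String>> entries = new ArrayList<>();
        if (field.trim().equals("---")) {
            return entries;                      // assumption: "---" marks a missing value
        }
        for (String entry : field.split("\\s*///\\s*")) {
            List<String> parts = new ArrayList<>();
            for (String part : entry.split("\\s*//\\s*")) {
                parts.add(part.trim());
            }
            entries.add(parts);
        }
        return entries;
    }
}
```

Applied to the SwissProt field of the sample, "P27361 /// Q9BWJ1 /// Q7Z3H5 /// Q8NHX0 /// Q8NHX1", splitEntries would return five single-part entries. A Gene Ontology entry such as "7049 // cell cycle // inferred from electronic annotation" would come back as three parts.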
WorkBench Model
1. Automatically identify the chip type by the presence of specific markers.
2. Parse and filter the appropriate annotation file to produce a smaller version of the annotations, called the idx file.
3. Store all annotations in a Map from marker ID to annotation line.
4. On future accesses, skip the filtering in step 2.
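To make the model concrete, here is a rough Java sketch of the two lookup structures it implies: a table from signature marker to chip name, and a Map from marker ID to the raw annotation line. This is not the WorkBench source. The class AnnotationIndexSketch, its method names, and the rule of keying each line by its first quoted field are all assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of the model's lookup structures; not the WorkBench source. */
final class AnnotationIndexSketch {
    // Signature marker ID -> chip name. Entries are supplied by the caller.
    private final Map<String, String> chipBySignatureMarker = new HashMap<>();
    // Marker ID -> the whole, still unparsed annotation line.
    private final Map<String, String> lineByMarker = new HashMap<>();

    void registerChip(String signatureMarker, String chipName) {
        chipBySignatureMarker.put(signatureMarker, chipName);
    }

    /** Chip identification: returns the chip whose signature marker is present, or null. */
    String identifyChip(Set<String> markerIds) {
        for (Map.Entry<String, String> e : chipBySignatureMarker.entrySet()) {
            if (markerIds.contains(e.getKey())) {
                return e.getValue();
            }
        }
        return null; // no signature marker found: the chip type stays unknown
    }

    /** Storage: keys every line by its first quoted field and keeps the line as-is. */
    void load(Reader source) throws IOException {
        BufferedReader in = new BufferedReader(source);
        String line;
        while ((line = in.readLine()) != null) {
            int end = line.startsWith("\"") ? line.indexOf('"', 1) : -1;
            if (end > 1) { // skip lines without a non-empty quoted first field
                lineByMarker.put(line.substring(1, end), line);
            }
        }
    }

    String rawLine(String markerId) {
        return lineByMarker.get(markerId);
    }
}
```

In this sketch identifyChip returns null as soon as a data set lacks every registered signature marker, and rawLine hands back a string that the caller must split again on each access.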
Issues
• Many hardcoded values in the parser: chip names, annotation names, etc. (100, 147, 393)
• Hardcoded list of included annotations. (393)
• Chip type map is fragile: it depends on specific markers being present.
• Annotations are stored in memory in an unparsed state, which forces the annotation line to be parsed again on every element access. (368, 511)
• All included annotations are stored in memory. (42)
• Would benefit from a Singleton pattern: file access could then move out of the static constructor, methods would not need to be static, etc.
• Includes GUI elements, which makes test cases and programmatic usage difficult. (108)
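Two of these points, repeated parsing and the static design, could be addressed together. The Java sketch below shows one possible shape: a lazily created singleton whose constructor touches neither files nor the GUI, and which stores each record already split into fields. AnnotationStore and its methods are hypothetical names, and the plain HashMap assumes single-threaded use.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical singleton store that keeps annotations already split into fields. */
final class AnnotationStore {
    // Initialization-on-demand holder: the instance is created on first use.
    private static final class Holder {
        static final AnnotationStore INSTANCE = new AnnotationStore();
    }

    private final Map<String, List<String>> fieldsByMarker = new HashMap<>();

    private AnnotationStore() {
        // Deliberately empty: no file access and no GUI calls during construction.
    }

    static AnnotationStore getInstance() {
        return Holder.INSTANCE;
    }

    /** Stores the fields of one record; the caller parses the line once beforehand. */
    void put(String markerId, List<String> parsedFields) {
        fieldsByMarker.put(markerId,
                Collections.unmodifiableList(new ArrayList<>(parsedFields)));
    }

    /** Returns one field by column index, or null if the marker or column is unknown. */
    String field(String markerId, int column) {
        List<String> fields = fieldsByMarker.get(markerId);
        if (fields == null || column < 0 || column >= fields.size()) {
            return null;
        }
        return fields.get(column);
    }
}
```

Because the methods are instance methods, a test can fill the store directly with put and never needs an annotation file on disk. The sketch does nothing about memory use: every stored record still lives on the heap.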
Proposed fixes
• Determine and specify the relationship between Microarray data objects and annotation information. What will the impact be if annotations are not available?
• Make annotation loading a separate, user-requested step.
• Allow multiple annotation formats; support non-Affymetrix and custom formats.
• Do not create a custom index file.
• Allow user-specified filtering of annotations.
• Explore open source disk-based indexes and databases, for example Berkeley DB and HSQLDB.
• Use a proper MVC structure: the AnnotationParser class is only for loading and parsing data and does not cause GUI events. (Although this can be said of many Data classes; see CSExprMicroarraySet.java:125.)
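Several of these fixes (multiple formats, user-specified filtering, no GUI events from the parser) can be pictured as one small API. The Java sketch below is one possible arrangement, not a settled design. AnnotationFormat, TabSeparatedFormat and AnnotationLoader are hypothetical names, and the tab-separated format merely stands in for "custom".

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical extension point: one implementation per annotation file format. */
interface AnnotationFormat {
    /** Returns column name -> value for one line, or null if the line is skipped. */
    Map<String, String> parseLine(String line);
}

/** Example custom format: tab-separated values with caller-supplied column names. */
final class TabSeparatedFormat implements AnnotationFormat {
    private final String[] columns;

    TabSeparatedFormat(String... columns) {
        this.columns = columns.clone();
    }

    @Override
    public Map<String, String> parseLine(String line) {
        String[] values = line.split("\t", -1);
        if (values.length != columns.length) {
            return null; // column count mismatch: skip the line
        }
        Map<String, String> record = new LinkedHashMap<>();
        for (int i = 0; i < columns.length; i++) {
            record.put(columns[i], values[i]);
        }
        return record;
    }
}

/** Loads and parses only; errors surface as exceptions, never as dialogs. */
final class AnnotationLoader {
    private AnnotationLoader() {}

    static List<Map<String, String>> load(Reader source, AnnotationFormat format,
            Predicate<Map<String, String>> keep) throws IOException {
        List<Map<String, String>> records = new ArrayList<>();
        BufferedReader in = new BufferedReader(source);
        String line;
        while ((line = in.readLine()) != null) {
            Map<String, String> record = format.parseLine(line);
            if (record != null && keep.test(record)) {
                records.add(record); // the user-specified filter decides what is kept
            }
        }
        return records;
    }
}
```

A GUI layer would call AnnotationLoader.load when the user asks for annotations and decide for itself how to report an IOException. A test or script can call the same method with a StringReader and a lambda filter.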