270 likes | 442 Views
TRANSCRIPTOME 2002 Seattle, WA. The Integrated Molecular Analysis of Genomes and their Expression Consortium’s Data Mining Tools: Introducing the IQ. Peg Folta Lawrence Livermore National Laboratory 3/12/02. I.M.A.G.E. maintains world’s largest publicly available cDNA collection.
E N D
TRANSCRIPTOME 2002 Seattle, WA The Integrated Molecular Analysis of Genomes and their Expression Consortium’s Data Mining Tools: Introducing the IQ Peg Folta Lawrence Livermore National Laboratory 3/12/02
I.M.A.G.E. maintains world’s largest publicly available cDNA collection 5,819,514 clones arrayed cumulative arrayed * I.M.A.G.E. clones account for 64% of human ESTs in GenBank
The I.M.A.G.E. collection has been shaped by projects (C-GAP, MGC…) Species Developmental state Clonesequence Tissue Library Method
Redesign of data management system Informatics focus this year was on tools to characterize and query the collection. • IMAGEne – mature clustering tool • IMAGEne Tissue – allows searching of tissue type dominance in clusters • IQ – Intelligent Query tool allows mining of I.M.A.G.E. data • Library/plate query – allows selective searching of libraries and plates • Problem report and query – allows users to report or query problems related to I.M.A.G.E. clones
Known Clusters I.M.A.G.E. Singletons IMAGEne-Human Process 279,262 Lower quality I.M.A.G.E Sequences 14,566 NCBI Ref Seq IMAGEne 2,289,020 Quality I.M.A.G.E sequences 623,294 Sequences 1,676,516 Sequences Remaining Sequences >50 basepairs of contiguous, non-repeat sequence 67,521 14,566 268,472 Candidate Clusters w/consensus
Known gene clusters with full length I.M.A.G.E. clones have doubled in number. Cluster coverage Avg. gene length 1578 3392 2763 3380 1896
Known Gene Cluster distribution of full length clones avg. length = 948
Candidate gene clusters consensus sequence and contigs are generated by CAP4 61,314 4,971 824 227 95 40
Singleton: Wheat within the chaff 305 full insert sequences are singletons. 62,143 singletons have a 3’ PolyA site. Avg. length is 547
IMAGEne Tissue query allows searching for tissue proportions within clusters.
Introducing the Intelligent Query - IQ • For a given category (currently clone and library) a user can specify a query based on key database attributes. • The user can specify the fields returned. • Various result format options (HTML, text) • Initial version was rolled out last summer • New functionality to be added this year (additional categories, etc.)
Next specify what clone centric results will be provided and in what format.
Similarly specify what library centric results will be provided.
Other tools to mine I.M.A.G.E. information Query for reported problems. Query plates from libraries.
Master vs. GenBank LLNL Replication QC on-going
Ongoing QC results Error in replication @ LLNL On-going Comparing master to GenBank
Next for I.M.A.G.E. Informatics • Extensive expansion of query tools and data access • IMAGEne non-species specific • Analysis of human cluster candidate genes and singletons • Redo of web site, easier to navigate MUCH influenced by public needs…..you have a say!
This work was partially funded by the NIH and was performed under the auspices of the U.S. Department of Energy by the University of California, Lawrence Livermore National Laboratory under contract no. W-7405-Eng-48. Acknowledgements • LLNL • Christa Prange, I.M.A.G.E. PI • Tim Harsch, Amber Johnston, Julie Amundson • Sponsors • DOE, Marv Stodolsky • NIH, Bob Strausberg image.llnl.gov