The GeneCards™ Project at the Weizmann Institute of Science
http://bioinformatics.weizmann.ac.il/cards/
• For each gene, a card with displayed data and links to entries in major databases
• Genes with HUGO nomenclature symbols, and others
• Automatic data mining and integration
• Advanced human-computer interaction
[Figure: scattered gene-related terms: chromosome, gene, DNA sequence, disease, mutation, medical applications, protein, research article, RNA, genetic, chromosomal, alias, map location, marker, similar mouse gene]
GeneCards: From Chaos to Order
A card for each gene
o Aliases
o DNA, RNA
o Protein
o Chromosomal location
o Disorders
o Medical applications
o Related mouse gene
o Research articles
o Links to more data
o Data is retrieved and integrated automatically
GeneCard: Integrated Data and Starting Point
[Diagram: entries in the GeneCards data sources pass through mining and integration of data into a GeneCard. The GeneCard is a starting point for more data, with "link to" arrows back to the GeneCards data sources and on to other data sources.]
A typical GeneCard: RUNX1
[Screenshot callouts, in order of appearance]
• HUGO nomenclature gene symbol
• Accession ID to other databases
• LocusLink or HUGO location
• If chromosome 21
• For chromosome 21 only
• Sequence accessions
• Information on proteins
• Homologues
• Single nucleotide polymorphisms
• Disorders and mutations
• Medical news from Doctor’s Guide
• Published literature
Snapshot of additional GeneCard fields
• Start new search
• Additional information
Current GeneCards Data Sources and Links
HUGO, GDB, OMIM, SWISS-PROT, LocusLink, UDB, UniGene, MGD, DOTS, UCSC, GenBank, PubMed, CroW 21, Doctor’s Guide, HUGE, euGenes, Genatlas, ATLAS, HGMD, TGDB, BCGD, MTDB, RZPD, MIPS, PDB, BLOCKS, HORDE, dbSNP, ENSEMBL, SBCELEGANS, GeneLynx, IMGT, SOURCE
Gene sources
[Chart: labels and values as extracted, in their original order: 13,046; HUGO; 360; LocusLink; MGD; 8,951; CroW 21; 63]
How to search and find?
[Diagram: a simple search box takes search keywords. With results, the matching genes are listed (gene 1, gene 2, ...), each with its name and the keyword shown in context. With no results, the options are spell corrections, query modification and outside resources.]
Some GeneCards Statistics
• 27,612 GeneCards (November 2001)
• 13,548 HUGO approved genes
• 2,646,185 accesses to GeneCards (at WIS since January 1, 1998)
• 25 mirror sites around the world
GeneChip Procedure
• Steps: sample preparation, hybridization, signal detection, data analysis
• Equipment: fluidic station, scanner, software
ChipCards - A Functional Integration Tool for DNA Array Data
Tsviya Olender, Shirley Horn-Saban, Marilyn Safran, Vered Chalifa-Caspi, Michal Ronen and Doron Lancet
The Crown Human Genome Center, The Weizmann Institute of Science, Rehovot 76100
About ChipCards
• ChipCards correlates DNA array data with comprehensive information from gene-specific databases. It is currently implemented for the Affymetrix GeneChip.
• ChipCards output is an HTML table with essential additional information for each gene, including gene symbol, functional definition, accession number, protein information, chromosomal location and EST data.
• Human data is integrated with GeneCards, UDB and UniGene.
• Mouse data is integrated with information about the human orthologue via GeneCards, HomoloGene and MGD.
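The correlation step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the probe-set IDs, the annotation dictionary, the column names and both function names are hypothetical, and the real ChipCards draws its annotations from GeneCards, UDB and UniGene rather than from an in-memory table.

```python
from html import escape

# Hypothetical annotations keyed by probe-set ID; the values are illustrative only.
ANNOTATIONS = {
    "probe_A": {
        "symbol": "RUNX1",
        "definition": "runt-related transcription factor 1",
        "location": "chromosome 21",
    },
}

# Annotation columns appended to every expression row (assumed names).
COLUMNS = ("symbol", "definition", "location")


def annotate(expression, annotations):
    """Attach gene annotations to each probe set; unknown probes get empty cells."""
    rows = []
    for probe_id, signal in expression.items():
        info = annotations.get(probe_id, {})
        row = {"probe": probe_id, "signal": signal}
        row.update({column: info.get(column, "") for column in COLUMNS})
        rows.append(row)
    return rows


def to_html_table(rows):
    """Render the annotated rows as a plain HTML table, one row per probe set."""
    header = ("probe", "signal") + COLUMNS
    lines = ["<table>"]
    lines.append("<tr>" + "".join(f"<th>{escape(h)}</th>" for h in header) + "</tr>")
    for row in rows:
        cells = "".join(f"<td>{escape(str(row[h]))}</td>" for h in header)
        lines.append("<tr>" + cells + "</tr>")
    lines.append("</table>")
    return "\n".join(lines)
```

Calling `to_html_table(annotate({"probe_A": 812.5, "probe_B": 40.0}, ANNOTATIONS))` yields one header row and two data rows; the unannotated probe keeps its expression value and simply has empty annotation cells.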
An Extract of Human Expression Data After ChipCards Processing
[Screenshot callouts: NCBI link, GeneCards link, UDB link]
A snapshot of ChipCards output, with human Affymetrix expression data as input. Each probe set has a link to NCBI, GeneCards and UDB. Information about the cDNA sources of the gene is extracted from UniGene and given as a separate column in the table, as are the UDB coordinates.
Murine Expression Data After ChipCards Processing
[Screenshot callouts: murine NCBI link, murine UniGene link, human orthologue data, human NCBI link, human GeneCards link, human UniGene link]
A snapshot of ChipCards output for mouse Affymetrix expression data. Each probe set is linked to NCBI and UniGene. Information about the human orthologue is integrated into the table and includes links to NCBI, GeneCards and UniGene.
Current Research - Adding Cards for Genes that Don’t Yet Have a Name
[Diagram, numbered 1 to 5, with the labels: assembly-based resources, UniGene cluster, gene sequence tag, unique persistent gene identifier, GeneCard for novel gene]
Version 3.0 Project Goals
• Improving flexibility, allowing automated parameterized generation from partial sets of sources and/or genes, and appending to an existing database
• Providing an Application Programming Interface for users of the generation software to incorporate their own data
• Standardizing the format of the database to use XML
Project Goals (cont’d)
• Providing a foundation for supplying a stable identifier for each GeneCard, even when no known gene symbol exists
• Improving the maintainability, testability, and quality of the software
• Providing a seamless migration path from Version 2.xx while maintaining the current look and feel and functionality
Pros and Cons of Using OOP
• Allows for more robust implementations
• Greater modularity
• More comprehensible interface to modules
• Better abstraction of software components
• Less namespace pollution
• Greater code reusability
• Software scalability
• Cleaner and more compact code
BUT
• Perl was not originally designed as an OOP language
• Type safety, proper encapsulation and aggregation aren’t enforced
• Can be between 20 and 50% slower
The 3.0 Hybrid Solution
• Combines an object-oriented skeleton with some non-object-oriented internals
• The large data structure of gene-based data is implemented as a hash of hashes, avoiding numerous costly instantiations
• All other major components, including the extractors and administration classes, are implemented as objects
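The hybrid design can be illustrated with a small Python sketch in which a dict of dicts stands in for the Perl hash of hashes. The class name, method names and sample records below are hypothetical and are not taken from the GeneCards code; the point is only that the components are objects while the per-gene data stays in plain nested mappings.

```python
class SourceExtractor:
    """Hypothetical object-oriented component: one instance per data source."""

    def __init__(self, source, records):
        self.source = source
        # records maps a gene symbol to {field: value}; sample data only
        self._records = records

    def extract_into(self, cards):
        """Merge this source's fields into the shared nested dictionary."""
        for symbol, fields in self._records.items():
            # Gene data stays in plain nested dicts: no object is created per gene.
            cards.setdefault(symbol, {}).setdefault(self.source, {}).update(fields)


def build_cards(extractors):
    """Run every extractor against one shared dict-of-dicts store."""
    cards = {}
    for extractor in extractors:
        extractor.extract_into(cards)
    return cards


if __name__ == "__main__":
    demo = build_cards([
        SourceExtractor("SOURCE_A", {"RUNX1": {"field_1": "value from A"}}),
        SourceExtractor("SOURCE_B", {"RUNX1": {"field_2": "value from B"}}),
    ])
    print(demo)
```

With tens of thousands of genes, only a handful of extractor objects exist, while each card is an ordinary nested dictionary keyed first by gene symbol and then by source.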
GeneCards Architecture
[Diagram components]
• Generation software: UniGene extractor, SwissProt extractor, customized extractor, support functions, API
• GeneCards database
• Display software
Generation Software Classes
• An underlying layer of support tools that manage extracting data from locally mirrored files and the internet, proxy connections, verification, security, file management, caching, conflict detection, error handling, statistics, and XML output formatting
• A set of extractor classes, one for each source of information, using source-specific algorithms and heuristics (adapted from previous versions of GeneCards). Methods include new, prepare and search
• A template for building extractor classes. All such classes can create new entries or append to old ones, and can generate data for all entries (genes) at once or one at a time
• A main class that handles building sets of cards according to parameterized partial ordering rules
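A template of this kind might look as follows in Python. This is a sketch under assumptions, not the GeneCards implementation: only the method names prepare and search come from the slide (with the constructor playing the role of new), while `ExtractorTemplate`, `ToyExtractor`, `update_card`, `update_all` and the sample data are hypothetical.

```python
from abc import ABC, abstractmethod


class ExtractorTemplate(ABC):
    """Hypothetical template: subclasses supply the source-specific logic."""

    source = "UNKNOWN"

    def __init__(self):  # plays the role of `new`
        self._prepared = False

    @abstractmethod
    def prepare(self):
        """Load or index the source data once before any search."""

    @abstractmethod
    def search(self, symbol):
        """Return a dict of fields for one gene, or an empty dict."""

    def update_card(self, card, symbol):
        """One gene at a time: create or append this source's section."""
        if not self._prepared:
            self.prepare()
            self._prepared = True
        found = self.search(symbol)
        if found:
            card.setdefault(self.source, {}).update(found)
        return card

    def update_all(self, cards):
        """All genes at once: cards maps gene symbol to its card dict."""
        for symbol, card in cards.items():
            self.update_card(card, symbol)
        return cards


class ToyExtractor(ExtractorTemplate):
    """Illustrative subclass backed by an in-memory index."""

    source = "TOY"

    def prepare(self):
        self._index = {"RUNX1": {"note": "illustrative value"}}

    def search(self, symbol):
        return self._index.get(symbol, {})
```

A new source then needs only a subclass with its own prepare and search; the shared methods handle appending to existing cards and batch generation.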
The XML-Based Database
• XML is a meta-language that supports customized tags for describing and providing semantic meaning to structured data
• Typed elements are arranged within other elements to form a nested hierarchy
• The data is grouped by source in the XML files, but can be retrieved by function:

    <GCresource>SWISSPROT
      <protein>
        <disorder>Germline Cancer</disorder>
      </protein>
    </GCresource>
    <GCresource>OMIM
      <disorder>Colorectal Cancer</disorder>
    </GCresource>
    <GCresource>GENECLINICS
      <disorder>Li-Fraumeni Syndrome</disorder>
    </GCresource>

• Each extractor module is responsible for its own Document Type Definition (DTD) specification, so that its XML can be checked as valid as well as well formed
• Files are stored in a hierarchical directory structure, one file per gene
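Retrieval by function from source-grouped XML can be sketched with Python's standard `xml.etree.ElementTree` module. The snippet assumes the slide's fragment has one `disorder` element per disease nested under each `GCresource`; the wrapping `GeneCard` root element and the function name `by_function` are hypothetical additions needed to make a parseable, self-contained example.

```python
import xml.etree.ElementTree as ET

# Assumed reading of the slide's fragment, wrapped in a hypothetical root element.
CARD_XML = """
<GeneCard>
  <GCresource>SWISSPROT
    <protein>
      <disorder>Germline Cancer</disorder>
    </protein>
  </GCresource>
  <GCresource>OMIM
    <disorder>Colorectal Cancer</disorder>
  </GCresource>
  <GCresource>GENECLINICS
    <disorder>Li-Fraumeni Syndrome</disorder>
  </GCresource>
</GeneCard>
"""


def by_function(xml_text, tag):
    """Collect (source, value) pairs for one functional tag across all sources."""
    root = ET.fromstring(xml_text)
    hits = []
    for resource in root.iter("GCresource"):
        # The source name is the text that precedes the first child element.
        source = (resource.text or "").strip()
        for element in resource.iter(tag):
            hits.append((source, (element.text or "").strip()))
    return hits


if __name__ == "__main__":
    for source, disorder in by_function(CARD_XML, "disorder"):
        print(f"{source}: {disorder}")
```

The file stays organized by source, yet a single pass over one tag name gathers every disorder regardless of which resource reported it or how deeply it is nested.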
The Display Software
• Currently in the design phase
• Aims to maintain the current look and feel while providing the flexibility of easy customization
• Will use XML Perl parser modules in CGI scripts
• Search will be expanded beyond current text-based capabilities to include context-specific searches
3.0 Project Status and Open Issues
• From procedural programs and an ad hoc flat-file format to an object-oriented methodology and standardized XML
• Easy to add new extractors
• Flexible and extensible
• Open issues: performance, searching strategies
Unified Database (UDB): Data mining and integration
[Diagram labels: original public databases; data mining; source-specific information; semantic integration; thesaurus; megabase integration; integrated chromosomal maps; UDB]
Sequence-Based Repositioning (SBR)
• Placing finished genomic sequences on the UDB map
• Map fine-tuning in sequenced regions
SBR (Sequence-Based Repositioning)
• Elimination of overlaps between contigs
• Object repositioning
[Figure: UDB original map compared with SBR map]
Search Results - a Map Slice
[Screenshot callouts: to GeneCard, to UniGene, to MarkerCard]
GeneCards Success Stories
• GeneCards as a bookmark for linkage analysis
• Mutations that were polymorphisms and not disease-causing
• Adult-onset diabetes without obesity in India
• Work on Chromosome 21 at the Weizmann Institute
• PVT - a heart disease found in Israeli Bedouins
• Parkinson’s disease paper
Frequently Asked Questions
• What’s special about GeneCards?
• Can I interface my own data?
• Can I access my own in-house database mirrors instead of public internet sites?
GeneCards/UDB Team
Current: Avital Adato, Vered Chalifa-Caspi, Michal Lapidot, Zvia Olender, Naomi Rosen, Marilyn Safran (head), Orit Shmueli, Irina Solomon, Doron Lancet (PI)
Alumni: Michael Rebhan, Shai Shen-Orr, Inga Peter, Jaime Prilusky, Michal Ronen, Hershel Safer, Julie Stampnitzky, Liora Yaar