The GeneCards™ Project at the Weizmann Institute of Science

http://bioinformatics.weizmann.ac.il/cards/
• For each gene - a card with displayed data and links to entries in major databases
• Genes with HUGO nomenclature symbols and others
• Automatic data mining and integration
• Advanced human-computer interaction

[Figure: scattered, unorganized terms: chromosome, gene, DNA sequence, disease, mutation, medical applications, protein, research article, RNA, gene, genetic, chromosomal, alias, map location, marker, similar mouse gene]

GeneCards: From Chaos to Order
• A card for each gene
  o Aliases
  o DNA, RNA
  o Protein
  o Chromosomal location
  o Disorders
  o Medical applications
  o Related mouse gene
  o Research articles
  o Links to more data
• Data is retrieved and integrated automatically

GeneCard: Integrated Data and Starting Point
[Diagram: data from the GeneCards data sources is mined and integrated into a GeneCard. The GeneCard is a starting point for more data, with links to entries in the GeneCards data sources and to other data sources.]

A typical GeneCard: RUNX1
[Screenshot callouts: HUGO nomenclature gene symbol; accession ID to other databases; LocusLink or HUGO location; if chromosome 21]

[Screenshot callouts: for chromosome 21 only; sequence accessions; information on proteins]

[Screenshot callouts: homologues; single nucleotide polymorphisms; disorders and mutations; medical news from Doctor’s Guide; published literature]

Snapshot of additional GeneCard fields
[Screenshot callouts: start new search; additional information]

Improved Single Nucleotide Polymorphisms Summaries

Current GeneCards Data Sources and Links
HUGO, GDB, OMIM, SWISS-PROT, LocusLink, UDB, UniGene, MGD, DOTS, UCSC, GenBank, PubMed, CroW 21, Doctor’s Guide, HUGE, euGenes, Genatlas, ATLAS, HGMD, TGDB, BCGD, MTDB, RZPD, MIPS, PDB, BLOCKS, HORDE, dbSNP, ENSEMBL, SBCELEGANS, GeneLynx, IMGT, SOURCE

Gene sources
[Chart; labels and values in extraction order: 13,046; HUGO; 360; LocusLink; MGD; 8,951; CroW 21; 63]

How to search and find?
[Flowchart: a simple search box takes search keywords. With results, each matching gene is listed by name with the keyword shown in context. With no results, the options are spell corrections, query modification and outside resources.]
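The search flow on this slide (keywords in, per-gene results out, spelling corrections when nothing matches) might be sketched as follows. This is only an illustration under stated assumptions: the `CARDS` records, the `search` function and the similarity cutoff are hypothetical and are not taken from the GeneCards software.

```python
import difflib

# Placeholder card text keyed by gene symbol (hypothetical sample data).
CARDS = {
    "RUNX1": "runt-related transcription factor 1, chromosome 21, leukemia",
    "TP53": "tumor protein p53, Li-Fraumeni syndrome",
}


def search(keyword, cards=CARDS):
    """Return matching gene symbols, or spelling suggestions if none match."""
    needle = keyword.lower()
    hits = [
        symbol
        for symbol, text in cards.items()
        if needle in symbol.lower() or needle in text.lower()
    ]
    if hits:
        return {"results": hits, "suggestions": []}
    # No results: propose close matches drawn from the words in the cards.
    vocabulary = {
        word.strip(",")
        for text in cards.values()
        for word in text.lower().split()
    }
    suggestions = difflib.get_close_matches(
        needle, sorted(vocabulary), n=3, cutoff=0.75
    )
    return {"results": [], "suggestions": suggestions}


found = search("leukemia")
missed = search("leukemai")  # misspelled on purpose
```

Query modification and hand-off to outside resources, the other two "no results" branches, are left out of this sketch.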

Some GeneCards Statistics
• 27,612 GeneCards (November 2001)
• 13,548 HUGO approved genes
• 2,646,185 accesses to GeneCards (at WIS since January 1, 1998)
• 25 mirror sites around the world

The Affymetrix System

GeneChip Procedure
• Steps: sample preparation, hybridization, signal detection, data analysis
• Equipment: fluidic station, scanner, software

ChipCards - A Functional Integration Tool for DNA Array Data
Tsviya Olender, Shirley Horn-Saban, Marilyn Safran, Vered Chalifa-Caspi, Michal Ronen and Doron Lancet
The Crown Human Genome Center, The Weizmann Institute of Science, Rehovot 76100

About ChipCards
• ChipCards correlates DNA array data with comprehensive information from gene-specific databases. It is currently implemented for the Affymetrix GeneChip.
• The ChipCards output is an HTML table with essential additional information for each gene, including gene symbol, functional definition, accession number, protein information, chromosomal location and EST data.
• Human data is integrated with GeneCards, UDB and UniGene.
• Mouse data is integrated with information about the human orthologue via GeneCards, HomoloGene and MGD.
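The kind of join described here, expression rows matched to gene annotation and rendered as an HTML table, can be sketched roughly as below. All names (`ANNOTATIONS`, `annotate`, `to_html_table`), the column set and the sample values are hypothetical placeholders; the actual ChipCards implementation is not shown in the slides.

```python
from html import escape

# Hypothetical annotation records keyed by accession number. ChipCards draws
# such data from GeneCards, UDB and UniGene; these values are placeholders.
ANNOTATIONS = {
    "ACC0001": {
        "symbol": "GENE1",
        "definition": "example kinase",
        "location": "21q22",
    },
}

COLUMNS = ["probe_set", "accession", "signal", "symbol", "definition", "location"]


def annotate(rows, annotations):
    """Merge each expression row with the annotation for its accession."""
    merged = []
    for row in rows:
        extra = annotations.get(row["accession"], {})
        merged.append({**row, **extra})
    return merged


def to_html_table(rows, columns=COLUMNS):
    """Render rows as an HTML table; missing fields become empty cells."""
    head = "".join(f"<th>{escape(c)}</th>" for c in columns)
    body = []
    for row in rows:
        cells = "".join(
            f"<td>{escape(str(row.get(c, '')))}</td>" for c in columns
        )
        body.append(f"<tr>{cells}</tr>")
    return "<table><tr>" + head + "</tr>" + "".join(body) + "</table>"


expression = [
    {"probe_set": "probe_1", "accession": "ACC0001", "signal": 812.5},
    {"probe_set": "probe_2", "accession": "ACC9999", "signal": 40.0},
]
html_table = to_html_table(annotate(expression, ANNOTATIONS))
```

Probe sets without a matching annotation keep their row and simply show empty annotation cells, so no array data is dropped by the join.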

Example of GeneChip output before ChipCards processing

An Extract of Human Expression Data After ChipCards Processing
[Screenshot callouts: NCBI link; GeneCards link; UDB link]
A snapshot of the ChipCards result, with human Affymetrix expression data as input. Each probe set has a link to NCBI, GeneCards and UDB. Information about the cDNA sources of the gene is extracted from UniGene and is given as a separate column in the table. UDB coordinates are likewise given in a separate column.

Murine Expression Data After ChipCards Processing
[Screenshot callouts: human orthologue data; human UniGene link; NCBI link; GeneCards link; NCBI link; murine UniGene link]
A snapshot of ChipCards output for mouse Affymetrix expression data. Each probe set is linked to NCBI and UniGene. Information about the human orthologue is integrated into the table and includes links to NCBI, GeneCards and UniGene.

Current Research - Adding Cards for Genes that Don’t Yet Have a Name
[Diagram with items numbered 1 to 5; recoverable labels: assembly-based resources; UniGene cluster; gene sequence tag; GeneCard for novel gene; unique persistent gene identifier]

Version 3.0 Project Goals
• Improving flexibility, allowing automated parameterized generation from partial sets of sources and/or genes, and appending to an existing database
• Providing an Application Programming Interface for users of the generation software to incorporate their own data
• Standardizing the format of the database to use XML

Project Goals (cont’d)
• Providing a foundation for supplying a stable identifier for each GeneCard, even when no known gene symbol exists
• Improving the maintainability, testability, and quality of the software
• Providing a seamless migration path from Version 2.xx while maintaining the current look and feel and functionality

Pros and Cons of Using OOP
• Allows for more robust implementations
• Greater modularity
• More comprehensible interface to modules
• Better abstraction of software components
• Less namespace pollution
• Greater code reusability
• Software scalability
• Cleaner and more compact code
BUT
• Perl was not originally designed as an OOP language
• Type safety, proper encapsulation and aggregation aren’t enforced
• Can be between 20 and 50% slower

The 3.0 Hybrid Solution
• Combines an object-oriented skeleton with some non-object-oriented internals
• The large data structure of gene-based data is implemented as a hash of hashes, avoiding numerous costly instantiations
• All other major components, including the extractors and administration classes, are implemented as objects
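The hybrid layout can be illustrated with a short sketch. It is written in Python rather than the project's Perl, purely for illustration: a nested dict stands in for the hash of hashes, while the extractors are ordinary objects. The class names, the `update` method and the sample location value are assumptions, not the project's actual code.

```python
from collections import defaultdict


class Extractor:
    """Base class for the object-oriented skeleton (hypothetical interface)."""

    source = "UNKNOWN"

    def extract(self, symbol):
        raise NotImplementedError

    def update(self, cards):
        # Write plain values into the shared nested dict; no per-gene objects
        # are created, which is the point of the hybrid design.
        for symbol in list(cards):
            cards[symbol][self.source] = self.extract(symbol)


class LocationExtractor(Extractor):
    """Toy extractor that looks up a location in an in-memory table."""

    source = "LOCATION"

    def __init__(self, table):
        self.table = table

    def extract(self, symbol):
        return self.table.get(symbol, "unknown")


# The "hash of hashes": gene symbol -> field name -> value.
cards = defaultdict(dict)
cards["RUNX1"]["symbol"] = "RUNX1"

LocationExtractor({"RUNX1": "chromosome 21"}).update(cards)
```

Only a handful of extractor objects exist regardless of how many genes are processed, so the cost of object instantiation does not grow with the size of the gene set.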

GeneCards Architecture
• Generation Software: UniGene Extractor, SwissProt Extractor, Customized Extractor, API, Support Functions
• GeneCards Database
• Display Software

Generation Software Classes
• An underlying layer of support tools that manage extracting data from locally mirrored files and the internet, proxy connections, verification, security, file management, caching, conflict detection, error handling, statistics, and XML output formatting
• A set of extractor classes, one for each source of information, using source-specific algorithms and heuristics (adapted from previous versions of GeneCards). Methods include new, prepare and search
• A template for building extractor classes. All such classes can create new entries or append to old ones, and can generate data for all entries (genes) at once or one at a time
• A main class that handles building sets of cards according to parameterized partial ordering rules
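A rough sketch of how an extractor template and a main building class could fit together is given below, in Python for illustration only. The `prepare` and `search` method names come from the slide; everything else (`ExtractorTemplate`, `depends_on`, `CardBuilder`, the sample alias data) is hypothetical, and a topological sort from the standard `graphlib` module stands in for the "partial ordering rules", whose real form the slides do not describe.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


class ExtractorTemplate:
    """Hypothetical template: subclasses override prepare() and search()."""

    name = "template"
    depends_on = ()  # names of extractors that must run first

    def prepare(self):
        """Load or index the source once, before any lookups."""

    def search(self, symbol, card):
        """Return a dict of fields to add to this gene's card."""
        return {}


class SymbolExtractor(ExtractorTemplate):
    name = "symbols"

    def search(self, symbol, card):
        return {"symbol": symbol}


class AliasExtractor(ExtractorTemplate):
    name = "aliases"
    depends_on = ("symbols",)

    def prepare(self):
        self.aliases = {"RUNX1": ["AML1"]}  # placeholder data

    def search(self, symbol, card):
        return {"aliases": self.aliases.get(card["symbol"], [])}


class CardBuilder:
    """Runs extractors in an order consistent with their dependencies."""

    def __init__(self, extractors):
        self.by_name = {e.name: e for e in extractors}

    def order(self):
        graph = {e.name: set(e.depends_on) for e in self.by_name.values()}
        return list(TopologicalSorter(graph).static_order())

    def build(self, symbols, existing=None):
        cards = dict(existing or {})  # start from old entries if given
        names = self.order()
        for name in names:
            self.by_name[name].prepare()
        for symbol in symbols:
            card = cards.setdefault(symbol, {})
            for name in names:
                card.update(self.by_name[name].search(symbol, card))
        return cards


builder = CardBuilder([AliasExtractor(), SymbolExtractor()])
cards = builder.build(["RUNX1"])
```

Passing a subset of extractors or gene symbols to `CardBuilder` gives the partial, parameterized generation mentioned in the project goals, and `existing` covers appending to a previously built set.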

The XML-Based Database
• XML is a meta-language that supports customized tags for describing and providing semantic meaning to structured data
• Typed elements are arranged within other elements to form a nested hierarchy
• The data is grouped by source in the XML files, but can be retrieved by function:

    <GCresource>SWISSPROT
      <protein> </protein>
    </GCresource>
    <GCresource>OMIM
      <disorder>Colorectal Cancer</disorder>
      <disorder>Germline Cancer</disorder>
    </GCresource>
    <GCresource>GENECLINICS
      <disorder>Li-Fraumeni Syndrome</disorder>
    </GCresource>

• Each extractor module is responsible for its own Document Type Definition (DTD) specification to ensure that the XML is well formed and valid
• Files are stored in a hierarchical directory structure, one file per gene
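Retrieval "by function" from data that is stored "by source" can be sketched with Python's standard `xml.etree.ElementTree`. The `<GeneCard>` root element, the `name` attribute on `<GCresource>` and the `by_function` helper are assumptions made for this illustration; the slide shows the source name as element text and does not give the full file layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical single-gene file. The <GeneCard> root and the name attribute
# are assumptions; the element names follow the slide's example.
GENE_XML = """
<GeneCard>
  <GCresource name="OMIM">
    <disorder>Colorectal Cancer</disorder>
    <disorder>Germline Cancer</disorder>
  </GCresource>
  <GCresource name="GENECLINICS">
    <disorder>Li-Fraumeni Syndrome</disorder>
  </GCresource>
</GeneCard>
"""


def by_function(root, tag):
    """Collect (source, text) pairs for one element type across all sources."""
    found = []
    for resource in root.iter("GCresource"):
        for element in resource.iter(tag):
            found.append((resource.get("name"), element.text.strip()))
    return found


root = ET.fromstring(GENE_XML)
disorders = by_function(root, "disorder")
```

The same helper called with `"protein"` would gather protein data from whichever sources supply it, which is what lets a display layer group a card by function instead of by source.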

The Display Software
• Currently in the design phase
• Aims to maintain the current look and feel while providing the flexibility of easy customization
• Will use XML Perl parser modules in CGI scripts
• Search will be expanded beyond current text-based capabilities to include context-specific searches

3.0 Project Status and Open Issues
• From: procedural programs / ad hoc flat file format
• To: object-oriented methodology / standardized XML
• Easy to add new extractors
• Flexible and extensible
• Open issues: performance, searching strategies

Unified Database (UDB): Data mining and integration
[Diagram labels: original public databases; data mining; source-specific information; semantic integration; thesaurus; megabase integration; integrated chromosomal maps; UDB]

Sequence-Based Repositioning (SBR)
• Placing finished genomic sequences on the UDB map
• Map fine tuning in sequenced regions

SBR (Sequence-Based Repositioning)
• Elimination of overlaps between contigs
• Object repositioning
[Figure panels: UDB original map; SBR map]

Search Results - a Map Slice
[Screenshot callouts: to GeneCard; to UniGene; to MarkerCard]

A MarkerCard

GeneCards Success Stories
• GeneCards as a bookmark for linkage analysis
• Mutations that were polymorphisms and not disease-causing
• Adult-onset diabetes without obesity in India
• Work on Chromosome 21 at the Weizmann Institute
• PVT – a heart disease found in Israeli Bedouins
• Parkinson’s disease paper

Frequently Asked Questions
• What’s special about GeneCards?
• Can I interface my own data?
• Can I access my own in-house database mirrors instead of public internet sites?

GeneCards/UDB Team
current: Avital Adato, Vered Chalifa-Caspi, Michal Lapidot, Zvia Olender, Naomi Rosen, Marilyn Safran (head), Orit Shmueli, Irina Solomon, Doron Lancet (PI)
alumni: Michael Rebhan, Shai Shen-Orr, Inga Peter, Jaime Prilusky, Michal Ronen, Hershel Safer, Julie Stampnitzky, Liora Yaar
