Bioinformatics: Glycomics
A post-genomic paradigm
Ram Sasisekharan
Biological Engineering Division, Massachusetts Institute of Technology, Cambridge, MA
Outline
• Overview – “omics”
• Systems Biology
• Pre- and Post-Genomic Bioinformatics
• Issues with Glycomics
• Addressing the Challenges
• New Research Models for Post-Genomics Bioinformatics – the Glue Grants
• Consortium for Functional Glycomics
• Conclusions
Central dogma in biology and the age of the “omics”
[Figure: genotype (DNA sequence), RNA, translation of protein sequence, posttranslational modification, and phenotype, annotated with Genomics, Proteomics, and Glycomics (an emerging paradigm).]
Information Content in Biological Systems
• Sequence: DNA (nucleotides), protein (amino acids), carbohydrates (monosaccharides)
• Structure: secondary, tertiary, quaternary
• Interactions between molecules
• Biological activity: molecular, cellular, tissue
Systems Biology
• Model: Bayesian networks, Boolean networks
• Manipulate: molecular genetics, chemical genetics, cell engineering
• Mine: bioinformatics
• Measure: biochemistry, imaging, bioelectronics
What is Bioinformatics?
• Assimilation, cataloging, and classification of biological information for model creation and for prediction of the behavior of a biological system for a given set of inputs
• Data Acquisition: web interface, data curation, knowledge base
• Data Storage: database infrastructure, database design
• Data Dissemination: search engines, simple/advanced queries
• Model Creation: network of relationships between structural and functional attributes of biological macromolecules
• Tools for Data Analysis: statistical analysis, comparison, scoring functions
• Prediction: system behavior in response to a set of inputs
Landmarks in Genomics
• 1970s: Advent of DNA sequencing
• 1980s: DNA sequencing automated
• 1990s: Era of bioinformatics, with rapid computational manipulation, storage, and dissemination of sequence information
• 1995: First whole genome sequenced
• 1999: First human chromosome sequenced
• 2001: Draft of human genome
Evolving framework for Bioinformatics: Pre-genomics Bioinformatics
• Representing sequence information: single-alphabet code
• DNA: {A,T,G,C}
• Proteins: {A,C,D-I,K-N,P-T,V,W,Y}
• Carbohydrates: not well defined
• Storing information: simple flat-file databases
• Sequence databases (GenBank, SwissProt): flat-file databases without any annotation or structuring of gene and protein sequence information
• Structure databases (Protein Data Bank): flat-file database. Structural annotations such as the classification of structural superfamilies (SCOP) were created from PDB entries
• Biological activity: there was no real database that catalogued the important biological roles of biopolymers. Part of this information was stored as additional text fields in the sequence and structure databases
Limited development of bioinformatics platforms for carbohydrates
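The single-alphabet codes on this slide can be checked mechanically. The following is a minimal Python sketch, assuming the range shorthand (D-I, K-N, P-T) denotes consecutive letters of the alphabet. The helper names `expand_alphabet` and `is_valid_sequence` are illustrative and do not come from the slides.

```python
import string

def expand_alphabet(spec: str) -> set[str]:
    """Expand shorthand such as 'A,C,D-I' into a set of one-letter codes."""
    letters: set[str] = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            # A range like 'D-I' covers every letter from D through I.
            start, end = part.split("-")
            lo = string.ascii_uppercase.index(start)
            hi = string.ascii_uppercase.index(end)
            letters.update(string.ascii_uppercase[lo:hi + 1])
        else:
            letters.add(part)
    return letters

DNA = expand_alphabet("A,T,G,C")
PROTEIN = expand_alphabet("A,C,D-I,K-N,P-T,V,W,Y")

def is_valid_sequence(seq: str, alphabet: set[str]) -> bool:
    """True if every character of seq belongs to the given alphabet."""
    return set(seq.upper()) <= alphabet

print(len(DNA), len(PROTEIN))  # 4 20
```

The check confirms that the shorthand expands to 4 bases and 20 amino acids. No comparable one-letter alphabet is defined here for carbohydrates, which is the gap the later slides address.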
Evolving framework for Bioinformatics: Post-genomics Bioinformatics – Proteomics, Glycomics
• Types of information
• Data sets from high-throughput experiments: microarrays, mass spectrometry, and other analytical tools
• Data sets from diverse experiments: mouse models to study the biological macromolecule in vivo, sensitive assays for studying interactions between proteins in a biological pathway
• Types of databases: complex relational databases
• Relational databases store different attributes obtained from high-throughput experimental data and the relationships between these attributes
Awareness of the importance of carbohydrates in fundamental biological functions is increasing, yet there has been little development of bioinformatics applications to represent, store, and manipulate information in carbohydrates
Types of Carbohydrates: Branched Sugars (N-Glycans)
[Figure labels: cytosol, ER, Golgi, nascent polypeptide, Asn-X-Thr/Ser, OST, N-glycan diversity.]
Types of Carbohydrates: Linear Sugars (Glycosaminoglycans)
• GAGs are the most acidic and information-dense linear sugars
Representing Information in Carbohydrates
• Proteins and DNA: the backbone is mostly fixed, and variation among building blocks comes primarily from the side-chain R groups (R = 4 for DNA, R = 20 for proteins)
• Carbohydrates: the monosaccharide building blocks vary in chemical configuration, in the linkage between monosaccharides, and in exocyclic substitutions (R groups), which makes both linear and branched sugar structures highly information dense
• X: SO3-; Y: Ac/SO3-. Variation in the chemical configuration (I/G) and in the exocyclic sulfation pattern gives 32 building blocks, compared with 20 amino acids and 4 bases
• High information density makes representing the information content of carbohydrates challenging: simple alphabetic codes do not capture it efficiently
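The count of 32 building blocks can be reproduced by enumeration. This is a minimal Python sketch that assumes three O-sulfation sites (here labelled 2X, 6X, 3X, each either H or SO3-), one N-position (Y, either Ac or SO3-), and two uronic acid configurations (I or G). The labels are illustrative and chosen for this example.

```python
from itertools import product

uronic = ["I", "G"]               # chemical configuration of the uronic acid
x_positions = ["2X", "6X", "3X"]  # assumed O-sulfation sites: H or SO3-
n_options = ["Ac", "SO3-"]        # Y at the N-position

# Each building block = configuration + sulfation state per X site + Y.
blocks = [
    (u, x_state, y)
    for u in uronic
    for x_state in product([False, True], repeat=len(x_positions))
    for y in n_options
]

print(len(blocks))  # 32
```

Under these assumptions the total is 2 × 2³ × 2 = 32, which matches the figure quoted on the slide.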
Carbohydrate-Protein Interaction
• Carbohydrate-protein interactions are key in modulating cell-cell communication
• Glycosylation on cell surface proteins acts as a recognition motif for proteins on multiple cell types, including immune cells and pathogens
• Binding between a single carbohydrate and a lectin is weak and depends on multivalent interactions, which makes it hard to characterize
Multivalent interactions between carbohydrates and proteins complicate the relationship between these interacting species
Glycosaminoglycan Paradigm
• Cell surface proteoglycans comprise long GAG polysaccharides that provide the cell with a “sugar coat”
• GAGs interact with a multitude of signaling molecules in a sequence-specific manner and modulate important biological processes
• Different GAG sequences have differential affinities for a particular signaling molecule, and this gradient in affinity plays a key role in “analog” regulation of biological function
[Figure labels: IL-8, TGF-β, FGF, INF-, VEGF, TNF, Chemokine, Enzymes, Integrins, Pathogens.]
Characterization of Carbohydrate Structures: Complex Biosynthesis
• Biosynthesis is not template based and involves several enzymes (labelled on the slide: NDST, Epimerase, 2OST, 6OST, 3OST)
• Multiple isoforms of these enzymes, with different substrate specificities, further increase the diversity of structures
• Tissue-derived carbohydrates cannot currently be amplified because of their complex biosynthesis, so only low amounts of biological sample are available
Characterization of Carbohydrate Structures: Challenges in Isolation and Purification
• Because of chemical heterogeneity, it is difficult to obtain sufficient amounts of pure, homogeneous sample
• The sample analyzed is often a mixture, so in many cases the sequence information cannot be fully determined (a non-deterministic system)
Partial information on carbohydrate structure, a result of limitations in structural characterization, poses significant challenges for storing and manipulating the information content of carbohydrates
Advancing Glycomics – Key Issues
• Representing information in carbohydrates is complicated: alphabetic codes are too cumbersome to handle the information density
• Analysis must deal with low amounts of tissue-derived material
• Because of the challenges in the structural analysis of carbohydrates, tools are needed to represent and manipulate partial/non-deterministic information on carbohydrate structures
• “Analog” regulation of biological function by carbohydrates makes it challenging to assign functional attributes to specific carbohydrate structures
Addressing the Challenges: Representing information in Carbohydrates – HSGAG as model system
• Property encoded nomenclature (PEN)
• Numerical scheme that optimally allocates bits to encode “properties” and the identity of the building blocks of biopolymers
• Facilitates the use of mathematical operations to manipulate the information
Features of PEN framework
• Derived from an internal logic that is based on the chemical nature of the building block
• Uses a numerical system and mathematical operations to perform manipulations
• Can be easily extended to encode more variations, either by using more bits or a higher numerical base, owing to the flexibility of the number system
• Facilitates direct comparison of “properties”, since the encoded property is a function of the chemical identity of the building block
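The idea of allocating bits to properties and comparing them with arithmetic operations can be illustrated with a small Python sketch. The bit layout below is a hypothetical one chosen for this example (one bit each for uronic acid isomer, three O-sulfation sites, and N-sulfation); it is not the published PEN code table, and the function names are illustrative.

```python
# Hypothetical bit layout, assumed for illustration only.
FLAGS = {
    "glucuronic":  0b10000,  # 1 = glucuronic (G), 0 = iduronic (I)
    "2-O-sulfate": 0b01000,
    "6-O-sulfate": 0b00100,
    "3-O-sulfate": 0b00010,
    "N-sulfate":   0b00001,  # 1 = N-sulfated, 0 = N-acetylated
}
SULFATE_MASK = 0b01111  # every bit that represents a sulfate group

def encode(properties: set[str]) -> int:
    """Pack a set of property names into one integer code."""
    code = 0
    for name in properties:
        code |= FLAGS[name]
    return code

def decode(code: int) -> set[str]:
    """Recover the property names that are set in a code."""
    return {name for name, bit in FLAGS.items() if code & bit}

def sulfate_count(code: int) -> int:
    """Count sulfate groups by masking and counting set bits."""
    return bin(code & SULFATE_MASK).count("1")

def differing_properties(a: int, b: int) -> set[str]:
    """Properties in which two building blocks differ (bitwise XOR)."""
    return decode(a ^ b)

a = encode({"2-O-sulfate", "6-O-sulfate", "N-sulfate"})
b = encode({"glucuronic", "N-sulfate"})
print(hex(a), sulfate_count(a))  # 0xd 3
```

Because each property owns one bit, a property comparison reduces to a mask or XOR, and adding a new property only needs one more bit, which mirrors the extensibility point on the slide.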
Dealing with low sample amounts – Sensitive MALDI-MS analysis
• Matrix: caffeic acid
• The saccharide is detected as a complex with a basic peptide, (RG)nR
• Laser-induced ionization leads to formation of molecular ions
• The mass of the saccharide is obtained to an accuracy of <1 Dalton by subtracting the mass of the peptide from the mass of the complex
• Accurately determines masses of picomolar amounts of sample, typical of biologically important HSGAG oligosaccharides
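The subtraction step can be written out as a short Python sketch. It assumes neutral average masses, uses standard average residue masses for arginine and glycine, and ignores ionization adducts. The repeat count and the complex mass in the usage lines are hypothetical values chosen only to exercise the functions.

```python
# Standard average residue masses in Daltons.
ARG_RESIDUE = 156.1875
GLY_RESIDUE = 57.0519
WATER = 18.0153

def peptide_mass(n_repeats: int) -> float:
    """Average mass of (RG)nR: n+1 Arg residues, n Gly residues, one water."""
    return (n_repeats + 1) * ARG_RESIDUE + n_repeats * GLY_RESIDUE + WATER

def saccharide_mass(complex_mass: float, n_repeats: int) -> float:
    """Mass of the saccharide = mass of complex minus mass of peptide."""
    return complex_mass - peptide_mass(n_repeats)

def matches(observed: float, expected: float, tolerance: float = 1.0) -> bool:
    """True if observed and expected masses agree within the tolerance (Da)."""
    return abs(observed - expected) < tolerance

# Hypothetical usage: n = 15 and a made-up complex mass.
n = 15
observed = saccharide_mass(5000.0, n)
print(round(peptide_mass(n), 2), round(observed, 2))
```

The 1 Da default tolerance in `matches` reflects the accuracy stated on the slide.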
Applications of PEN: Mass-Composition Relationship
• The length and the numbers of sulfates and acetates of an HLGAG oligomer can be unambiguously assigned for oligomers up to a tetradecasaccharide
Applications of PEN: PEN-MALDI Sequencing Strategy
[Figure labels: Formalism (hexadecimal, binary notation based mass-line); Experimental (MALDI-MS; composition, chemical, enzymatic; iterative); All Possible Sequences; Unique Solution.]
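The flow from all possible sequences to a unique solution can be sketched as enumerate-then-eliminate. The Python below is a toy illustration under stated assumptions: a hypothetical 5-bit code per disaccharide unit, a composition (unit count, sulfates, acetates) assumed to be known from mass, and invented constraints standing in for experimental observations. It is not the published procedure.

```python
from itertools import product

# Hypothetical 5-bit layout per disaccharide unit (illustration only).
GLCA, S2O, S6O, S3O, NS = 0b10000, 0b01000, 0b00100, 0b00010, 0b00001
SULFATE_MASK = S2O | S6O | S3O | NS

def sulfates(code: int) -> int:
    return bin(code & SULFATE_MASK).count("1")

def acetates(code: int) -> int:
    # In this toy model the N-position is acetylated when not sulfated.
    return 0 if code & NS else 1

def candidates(n_units: int, total_sulfates: int, total_acetates: int):
    """All sequences of unit codes that match the measured composition."""
    return [
        seq for seq in product(range(32), repeat=n_units)
        if sum(map(sulfates, seq)) == total_sulfates
        and sum(map(acetates, seq)) == total_acetates
    ]

def apply_constraints(pool, constraints):
    """Filter the pool one constraint at a time until one sequence remains."""
    for label, keep in constraints:
        pool = [seq for seq in pool if keep(seq)]
        print(f"{label}: {len(pool)} candidate(s) remain")
        if len(pool) <= 1:
            break
    return pool

# Invented constraints, standing in for chemical/enzymatic observations.
constraints = [
    ("unit 2 is unsulfated glucuronic",
     lambda s: bool(s[1] & GLCA) and sulfates(s[1]) == 0),
    ("unit 1 is 2-O-sulfated iduronic",
     lambda s: not s[0] & GLCA and bool(s[0] & S2O)),
    ("unit 1 lacks 3-O-sulfate",
     lambda s: not s[0] & S3O),
]

pool = apply_constraints(candidates(2, 3, 1), constraints)
```

Each constraint shrinks the candidate list, and the loop stops as soon as a single sequence is left, which corresponds to the iterative convergence shown on the slide.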
New Research Models for Post-Genomics Bioinformatics – the Glue Grants
• Alliance for Cellular Signaling (AfCS): to understand as completely as possible the relationships between sets of inputs and outputs in signaling cells that vary both temporally and spatially, i.e. how cells interpret signals in a context-dependent manner
• Cell Migration Consortium: to accelerate progress in cell migration-related research by fostering interdisciplinary research activities and producing novel reagents and information
• Consortium for Functional Glycomics: to define the paradigms by which carbohydrate binding proteins function in cellular communication
• Inflammation and Host Response Consortium: to acquire new scientific knowledge about the biological basis for different outcomes in injured patients
Consortium for Functional Glycomics: Organization of the Core Facilities
Data Storage: Database Design
• Overview
• Classification of data
• 6 key identifiers (name tags for data): CBP ID, GT ID, Carb ID, Project ID, Microarray ID, Mouse Strain ID
• Data fields provide structure to the type of data being entered. Selection of the appropriate data fields depends on what kinds of data will be entered
• Linking data
• Data fields pertaining to a specific attribute are stored in a table
• Each table is linked to other tables via common data fields or identifiers
• The data tables and their links form an “Ontology Diagram”
Data Storage: Database Ontology
[Ontology diagram; labels: CBP-Carbohydrate Interaction, Biochemical pathway, Carbohydrate Ligand, Binding Data, CBP, Cell line, Histology, cDNA sequence, Mouse Strain, Immunology, Yield, Expression, Mouse Studies. Contributors shown: Core C, Core D, Core F, Core G, Core H, PI.]
Data Storage: Relational Databases
• Author: Name (XYZ, …), Email (1@2.3), Institution (ABC …)
• Protein: CBP ID (CBP001), GenBank (GB0001), SwissProt (SP00001), PDB (1XXX), …
• Characterization: Carb ID (Carb0001), Mass Spec (MS-1.jpg), NMR (NMR.jpg), Structure notation, …
• Protein Expression: Cell Line (BL21), Gel Image (Img.jpg), cDNA clone (GB0002), …
• Carbohydrate: Carb ID (CBP001), Structure notation, Carb DB (Carb0001), PDB (1XYZ), …
• Author: Name (MNO, …), Email (2@3.4), Institution (IJK …)
• Relationship labels between tables: characterized; was expressed using; interacts with; characterized using; characterized
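The idea of tables linked through shared identifiers can be sketched with Python's built-in `sqlite3` module. The table and column names below are assumptions made for this example, not the consortium's actual schema; the sample values are the placeholder entries shown on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Hypothetical schema: two entity tables plus a link table keyed on the IDs.
conn.executescript("""
CREATE TABLE protein (
    cbp_id    TEXT PRIMARY KEY,
    genbank   TEXT,
    swissprot TEXT,
    pdb       TEXT
);
CREATE TABLE carbohydrate (
    carb_id            TEXT PRIMARY KEY,
    structure_notation TEXT,
    pdb                TEXT
);
CREATE TABLE interaction (
    cbp_id  TEXT REFERENCES protein(cbp_id),
    carb_id TEXT REFERENCES carbohydrate(carb_id),
    PRIMARY KEY (cbp_id, carb_id)
);
""")

conn.execute("INSERT INTO protein VALUES (?, ?, ?, ?)",
             ("CBP001", "GB0001", "SP00001", "1XXX"))
conn.execute("INSERT INTO carbohydrate VALUES (?, ?, ?)",
             ("Carb0001", None, "1XYZ"))
conn.execute("INSERT INTO interaction VALUES (?, ?)",
             ("CBP001", "Carb0001"))

# Follow the "interacts with" link from protein to carbohydrate.
rows = conn.execute("""
    SELECT p.cbp_id, p.swissprot, c.carb_id, c.pdb
    FROM interaction AS i
    JOIN protein AS p ON p.cbp_id = i.cbp_id
    JOIN carbohydrate AS c ON c.carb_id = i.carb_id
""").fetchall()
print(rows)  # [('CBP001', 'SP00001', 'Carb0001', '1XYZ')]
```

Keeping the relationship in its own table lets one protein link to many carbohydrates and vice versa without duplicating either record.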
Conclusions
• In the post-genomics era, high-throughput experimental methods are generating large data sets pertaining to multiple sequence, structure, and functional attributes of genes and proteins: a transition from traditional biology to information-driven “Systems Biology”
• With constantly increasing computational power, there has been a big leap in the development of bioinformatics tools to deal with large data sets
• There is increasing awareness of the role of carbohydrates in fundamental biological processes modulating cell-cell and cell-matrix interactions
• Development of bioinformatics applications for carbohydrates faces many challenges because of their complexity and heterogeneity
• Addressing these challenges would enable the development of bioinformatics for glycobiology, to provide a more comprehensive description of the “state” of a biological system and to better predict the “response” of a biological system to a given set of “inputs”