370 likes | 488 Views
Pavel Morozov March 3. Legionella Functional Genomics Project. Modulation of host-cell gene expression. Adhesion, invasion. Inhibition of lysosome fusion. Evasion. Recruitment of ER. Replication. Legionella pneumophila.
E N D
Pavel Morozov March 3 Legionella Functional Genomics Project.
Modulation of host-cell gene expression Adhesion, invasion Inhibition of lysosome fusion Evasion Recruitment of ER Replication Legionella pneumophila • An intracellular pathogen that can invade and replicate inside human macrophages and causes potentially fatal human infection Legionaires' disease. • Transmitted through inhaling mist droplets containing the bacteria. • Has extraordinary ability to survive in many different ecological niches (axenic cultures, biofilms with other organisms and intracellular vacuoles of amoebae, ciliates and human cells). • In order to relpicate Legionella should be inside if protozoa (amobae, acanthamoeba) which are single-cell eukaryotes, or macrophages of human lungs or monocites.
Complete genome of LEgionella pneumophila (strain Phyladelphia 1). region 3: efflux genes in direct chain Legionella pneumophila (strain Phyladelphya 1) genome. The highlighted regions were noteworthy due to their possession of different than average G+C content and GC skew in addition to skewed strand preference of ORFs. These computationally determined regions turn out to contain gene clusters that belong to specific categories (e.g., ribosomal protein cluster), or those corresponding to points of genome rearrangements or acquired by horizontal transfer. Some examples are shown in more detail below. genes inreverse chain C+G content GC skew region 7: tra/trb region (F-plasmid)
Project goals • Study molecular mechanisms (genetics and regulation) of • Legionella ability to survive in different ecological niches. • Legionella infection. • Extended genome annotation of Legionella species (Phyladelphia, Paris, Lens strains). • Custom whole-genome microarrays. • Network reconstruction and modeling.
Microarray Design. History of Legionella Microarrays. September 2005 2,997 70-mer oligos Whole-genome array 3,005 genes in duplicates 640 reference controls October 2003 3,230 clones 90% of the genome June 2001 • 1344 clones in triplicate • 40% of the genome
Requirements for Microarray Probes. The goal was to design 70-mer probes covering all protein- and RNA- coding genes and control probes for testing background and array properties. Requirements common to all probes: should not contain short nucleotide stretches that are too abundant; should be free from secondary structure elements; should have approximately same melting temperature Requirements specific to probes specific to genes: 70-mers should be unique (occure once) in experimental system (Legionella, Human, E.coli); Requirements specific to array control probes: should not not exist in experimental system (Legionella, Human, E.coli)
Microarray probe design using unique oligonucleotides of particular length. 5’ CDS or genomic sequence 3’14-meroligonucleotides uniqueoligonuclleotides overrepresented 8-mers 70-mer microarray probe In simplified form probe selection can be described like selection of regions with maximum number of unique oligonucleotides (in this case of length 14 bp) and minimal number of overrepresented shorter oligonucleotides (in this case 8 bp). In actual study we have to use oligonucleotides of different length and also check for the probe melting temperature. Using unique oligonucleotide for designing probes automatically removes secondary structure issues.
ancestors descendants Chosing length of oligonucleotides DNA or RNA (genomic or mRNA sequence). n n+1 n+2 n+3 n+k
Distribution of ancestors and descendant of various length. For each position we can define the length L at which the nucleotide, starting at this position became unique. All oligonucleotides in this position longer than L will be also unique. Also there are two types of unique oligonucleotides: those who contain unique oligonucleotide of smaller length and those who do not, we name them ancestors and descendants. It is enough to keep information about first occurrence of oligonucleotide for each position in order to have complete information about distribution of unique oligonucleotides for particular sequence region. Distributions of ancestral and descendant unique oligonucleotides by it’s length. Solid line denote sum of two distributions, dotted line denote distribution of ancestral oligonucleotides and dashed line stands for descendats. A) Results of simulation for genomes of size 1mb. B) Real data for human chromosome X.
Design of probes using unique oligonucleotides positional information. Sequence region and ancestors for each position (-1 if not known) : a t g c a c t a g c t a g c t a g t c g … 12,14,-1,-1,15,10,10,11,10,14,-1,-1,13,15,12,-1,-1,-1,12,16… P1 P2 Pi For each potential probe Pi can be defined vector of number of unique oligonucleotides of various length (both ancestors and descendants):Vi={0,0,0,0,0,0,0,2,3,4,2,3,5,6,7}. A Golden Standard vector can be defined asG={0,0,0,0,0,0,n1,n1-1,n1-2,n1-3…}. An Euclidian distance is a relible choise of a measure for the estimation of distance between Vi and G: D(Vi,G)=√ ∑L (Vi(j)-G(j))² where L set of oligonucleotide length used. A probes with minimal distance to golden standard we choosed.
Finding unique oligonucleotides. Olig Space without Space with length coding coding 4 256 32 5 1,024 128 6 4,096 512 7 16,384 2,048 8 65,536 8,192 9 262,144 32,768 10 1,048,576 131,072 11 4,194,304 524,288 12 16,777,216 2,097,152 13 67,108,864 8,388,608 14 268,435,456 33,554,432 15 1,073,741,824 134,217,728 16 4,294,967,296 536,870,912 17 17,179,869,184 2,147,483,648 18 68,719,476,736 8,589,934,592 19 274,877,906,944 34,359,738,368 20 1,099,511,627,776 137,438,953,472 • Enumerating oligonucleotides • Binary arithmetic : 00 stands for A, 01 for T, 10 for C and 11 for G. Binary:01110001 Decimal:142 T G A T • Enumeration is complete, dense, and nonredundant. • Counting oligonucleotides • direct counting • Complete space of possible oligonucleotides grows as 4n. • Memory size of current computers allows to handle oligonucleotides up to 16 on PC, up to 18 on Sun Solaris. With algorithm enhancements we can go up to 24 (but no need).The best resolution for human genome provided by length 18 and most bacterial genomes 12-14. • computable on desktop- computable on workstation with big memory- computable on workstation with big memory with enhanced algorithm- hardly computable
0 1 0 0 1 1 0 0 • Symbol Length of first Overrepresented flag unique oligonucleotides • in this position Program realization and data formats. u_find.exeu_findm.exe for minimal oligonucleotidelength Marked for unique oligonucleotides fasta file List of fasta files (genomes etc.) u_find.exeu_findm.exefor all desired oligonucleotidelength u_code.exe Storing data in Rich FASTA format Results of the search for unique oligonucleotides are stored in “rich” Fasta format. Essentially it is linear record of positional information like regular Fasta file, but with coded additional information. u_design.exe Microarray probes
Design of control probes using non-existing oligonucleotides information. Goal: sequence which have no homology to any genome ( no blast hits over threshold) • Selecting nonexistent oligonucleotides • Overlapping and merging oligonucleotides • Choosing probes from merged sequences AATGCTAGCTA ATGCTAGCTAC CTAGCTACGGA AGCTACGGAAT AATGCTAGCTACGGAAT . . . . . . ATGCTAGCTACGGA Nonexisting oligonuclleotides Nonexisting sequence. Probe selection (temperature, secondary structure)
12 –mers; 39855 nonexistant out of 16777216 (0.24%); 640 probes selected
Properties of proposed probe design method. • Finding of unique and nonexistent oligonucleotides have linear computational time on the size of genomes used. • Once the unique and system is represented in “rich” fasta format, design of new probes became extremely fast and can be repeated as much as needed in order to create probes for new set of CDS or genomic region. • Probes, selected by using unique oligonucleotides automatically reduce the presence of hairpins on RNA secondary structure. • Method can be applied to experimental systems with multiple non-related genomes (genomes can be as far from each other as eu- and prokaryotes). • Method is efficient for control probe selection. • Problem: Method did not provide robust estimation of sequence homology between probe and the rest of genomes, at the same time selected probes have the lowest homology to the rest of genome possible. • Method provides valuable statstics about oligonucleotide usage in particular genomes and genome sets.
2,997 70-mer oligos, 3,005 genes in all (with duplicates) 640 reference controls
Legionella in Microbial Communities. • Biofilms are not just a bunch of microbes, they are a special environment, protected from harsh outside by a special polysaccharide layer, which is produced by other microbes in the community. • Microbial community in biofilms have shared metabolic and regulatory networks. • Biofilms provide excellent environment for horizontal gene transfer. • Since biofilms prevent antibiotics and other biocide from getting to the pathogens biofilms are significant reservoir of health-hazardous pathogens. • Legionella can survive in biofilms, but cannot form it by itself, only as part of the microbial community.
Similar applications and potential use of proposed method. • Evolutionary studies (Traces of ancient events?). Hsieh et.al., Minimal model for genome evolution and growth. Phys Rev Lett. 2003 Jan 10;90(1):018101. Jordan et.al., A universal trend of amino acid gain and loss in protein evolution.Nature. 2005 Feb 10;433(7026):633-8. Epub 2005 Jan 19. • Use in organism and sequence identification – metagenomics. Metagenomics: "the application of modern genomics techniques to the study of communities of microbial organisms directly in their natural environments, bypassing the need for isolation and lab cultivation of individual species.“ (Chen and Pachter, University of California, Berkeley) Bailey & Ulrich, Molecular profiling approaches for identifying novel biomarkers.Expert Opin Drug Saf. 2004 Mar;3(2):137-51. Review. Palmer et.al., Rapid quantitative profiling of complex microbial populations.Nucleic Acids Res. 2006 Jan 10;34(1):e5.
Clickable Interactive Interface ADAPTERS LAYER: Converting and performing requests, formatting output.UNIX web server, Perl scripts, JAVA, C. Browser HTML JAVA Local Databases Remote Databases Remote Methods Local Methods Memory Engine UpdateEngine SQL Engine request data transfer supervision Client Side Server Side
Solved technical problems: solved ongoing Setting up the server side Setting up mySQL server and services Tools for importing and parsing external databases scripts to process flat files (perl, mySQL): extracting related information fomatting into SQL database Formatting into static HTML scripts to pars remote databases (perl, java, mySQL): extracting related information fomatting into SQL database Formatting into static HTML update engine (under construction) WEB page development (HTML, JavaScript, CSS) Testing with Explorer, Fire Fox, Opera, Safari.
Sources of Information Proprietary data Publicly available data Results of computations Sequence/Genome NCBI EMBL TIGR Individual genomes Functional Domains Function and annotation PFAM PDB PRODOM PROSITE TRANSFAC SMART Pathways Categories Literature GO GeneNet MetaCyc MEDLINE
Current list of integrated databases Parsed for Legionella-related information, organized and stored locally: • NCBI • EMBL • UniProt • InterPro • PIRSF (PIR superfamily/family) • Pfam • PRINTS • PRODOM • PROSITE • HSSP • MedLine/PubMed • MetaCyc • NMPDR/FIG • KEGG
IntegratedTools WEB site scheme Precompiled Static interactive tables WEB server and scripts SQL database Static interactive gene descriptions Dynamic (by user requests) Interactive data retrieval into interactive tables Interactive genome map Interactive toolsBLAST, HMM, REMOTE_SEARCH (SMART, PROSITE etc.) Semi Dynamic Search History
Legionella Genome Browser. • Interactive. • You can: • Choose scale and region • Links to tables and annotation data • Choose annotation tracks to display and track parameters • Choose various color schemes • Add custom annotation tracks
Row operations: Select/Unselect, Show/Hiderows Columns (fields) operations: Show/hide column Sorting columns Interactive tables
Snapshots of the NMPDR annotation pages Region comparisons in other genomes by sequence homology: icmR icmP L.pn Phil1 Coxiella burnetii
Visualization of the gene expression in NMPDR system pathway reactions expression ratios Legionella gene info
Study gene expression: • during intracellular growth and under various environmental stresses • axenically- and protozoan-grown Legionella • in Legionella-containing biofilms 4. Develop models (gene networks and reporter genes) that describe relevant patterns of gene expression: (gene networks =expressed genes + their regulators)
560 assignments LegCyc:181 pathways 678 assignments 72% Expressed genes: Original Gene Function Assignments ~3000 genes BLAST ORF Finders KEGG Pathways MetaCyc GeneOntology Plus Missing Members • Use lower stringency search • BLAST expected genes to Legionella genome sequence • Search for probable motif combinations Confirm absence of these genes:
histidine biosynthesis LegCyc Legionella metabolic pathway overview(a portion)
1 2 3 4 5 6 7 8 Search for transcription factor binding sites + Predicted operons Clusters of co-expressed genes lvrA lvh TF site prediction (in silico). • Promoter manipulations • Co-expressed gene sets • Regulatory networks Experimental confirmation of the predicted promoters. Transcription start sites. Use of confirmed motifs to identify additional co-regulated genes.
Columbia Genome Center • Jing Ju lab, S. Kalachikov, S. Pompu • Gene expression microarrays • Clusters of coexpressed genes • Regulatory genes knockout results (expression) • Molecular biology methods • Gene expression microarrays • RT-PCR • Transcriptional factors • promotor verification • Microbiology Department • Prof. Shuman • Gene knockout • Phenotypic analysis • Computational Analysis • Morozov Pavel, Morozova Irina • operon structures • putative promotors and transcriptional regulation sites • detailed gene annotation • regulatory network reconstruction