Comparative genomics approach in promoter analysis - Orthologous promoter databases and conserved motif search
Endre Barta
Conserved motifs in the non-coding regions of the genome
• Transcription Factor Binding Sites in the promoter region
• 3’UTR binding sites, miRNA target sequences
• Intronic binding sites
• Multiple Conserved Sequences (MCS)
Objectives:
• Finding orthologous promoter regions
• Defining conserved motifs
• Searching in conserved motifs
• Analysing the data
DoOP (Database of Orthologous Promoters, http://doop.abc.hu)
• Reference species: H. sapiens / A. thaliana
• Genes: based on NCBI annotation
• Choosing first exons
• Genomic sequences (from different DNA databanks; genome projects)
• Aligning the first exons
• Orthologous first exons
• 500, 1000 and 3000 bp upstream regions from orthologous first exons
Exon data are from the NCBI’s reference sequence annotations of the human and Arabidopsis genome sequences
     gene            complement(1279..4993)
                     /locus_tag="At5g01010"
                     /note="synonym: TOPTELOMERE.1; expressed protein"
                     /db_xref="GeneID:831893"
     mRNA            complement(join(1279..1646,1745..1780,1914..1961,
                     2435..2509,2748..2799,2872..2934,3303..3383,3602..3658,
                     3761..3801,3926..4004,4101..4257,4334..4466,4551..4678,
                     4764..4993))
                     /locus_tag="At5g01010"
                     /product="expressed protein"
                     /transcript_id="NM_120177.3"
                     /db_xref="GI:42567550"
                     /db_xref="GeneID:831893"
     CDS             complement(join(1527..1646,1745..1780,1914..1961,
                     2435..2509,2748..2799,2872..2934,3303..3383,3602..3658,
                     3761..3801,3926..4004,4101..4257,4334..4466,4551..4678,
                     4764..4923))
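To illustrate how a first exon can be read off such an annotation, here is a minimal Python sketch. It is not the DoOP pipeline itself; the function names (`parse_location`, `first_exon`, `utr5_in_first_exon`) are hypothetical, and it only handles simple `start..end` spans with an optional `complement(...)` wrapper. On the minus strand the first exon in transcription order is the span with the highest coordinates.

```python
import re

def parse_location(location):
    """Return (strand, spans); spans are sorted (start, end) pairs, 1-based inclusive."""
    strand = "-" if location.strip().startswith("complement(") else "+"
    spans = sorted((int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location))
    return strand, spans

def first_exon(location):
    """First span in transcription order: rightmost on '-', leftmost on '+'."""
    strand, spans = parse_location(location)
    return spans[-1] if strand == "-" else spans[0]

def utr5_in_first_exon(mrna_location, cds_location):
    """Length of the 5' UTR inside the first exon.

    Assumption: the CDS starts in the first mRNA exon (as in the example above).
    """
    strand, _ = parse_location(mrna_location)
    exon, cds = first_exon(mrna_location), first_exon(cds_location)
    return exon[1] - cds[1] if strand == "-" else cds[0] - exon[0]

# Location strings copied from the At5g01010 annotation on the slide.
MRNA = ("complement(join(1279..1646,1745..1780,1914..1961,2435..2509,"
        "2748..2799,2872..2934,3303..3383,3602..3658,3761..3801,3926..4004,"
        "4101..4257,4334..4466,4551..4678,4764..4993))")
CDS = ("complement(join(1527..1646,1745..1780,1914..1961,2435..2509,"
       "2748..2799,2872..2934,3303..3383,3602..3658,3761..3801,3926..4004,"
       "4101..4257,4334..4466,4551..4678,4764..4923))")

start, end = first_exon(MRNA)
print(start, end, end - start + 1, utr5_in_first_exon(MRNA, CDS))
```

For this gene the first exon is 4764..4993 (230 bp), of which 70 bp are annotated 5’ UTR.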
Different types of annotated first exons
1. No annotated 5’ UTR, the length of the first exon is > 50 bp.
2. No annotated 5’ UTR, the length of the first exon is < 50 bp.
3. There is an annotated 5’ UTR, the CDS starts in the first exon and it is > 50 bp.
4. Same as No. 3, but < 50 bp.
5. There is one or more 5’ UTR exon(s), the first UTR exon is > 50 bp.
6. Same as No. 5, but < 50 bp.
7. Wrong annotation
[Figure: Frequency of H. sapiens gene types (5n and 6n collapsed) in the chordate section of the DoOP database, version 1.4]
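The first six types can be expressed as a small decision function. This is only a sketch under assumptions: the function name and arguments are hypothetical, "it" in type 3 is read as the first exon, an exon of exactly 50 bp is treated as short (the slide does not say), and type 7 (wrong annotation) cannot be detected automatically here.

```python
def first_exon_type(first_exon_len, has_5utr, cds_in_first_exon, min_len=50):
    """Return the type number (1-6) of an annotated first exon.

    first_exon_len    -- length of the first exon in bp
    has_5utr          -- True if the annotation contains a 5' UTR
    cds_in_first_exon -- True if the CDS starts in the first exon
    """
    long_enough = first_exon_len > min_len  # assumption: exactly 50 bp counts as short
    if not has_5utr:
        return 1 if long_enough else 2
    if cds_in_first_exon:
        return 3 if long_enough else 4
    return 5 if long_enough else 6

# At5g01010: 230 bp first exon, annotated 5' UTR, CDS starts in the first exon.
print(first_exon_type(230, True, True))
```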
Using BLAST to find orthologous first exons
• A critical step, and the most time-consuming one
• TBLASTN and WU-BLAST give too many false positives, so we use BLASTN (which gives more false negatives)
• We use Bioperl modules to parse the BLAST results
• The most difficult task is to find the real orthologs among the paralogs
• We use a simple algorithm: the best hit (considering the alignment length and the score) is most probably from the orthologous gene
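A best-hit rule of this kind can be sketched as follows. The slide does not say how alignment length and score are combined, so the ordering used here (highest score first, alignment length as tie-breaker, optional minimum length) is an assumption, and the `Hit` record and its values are made up for illustration. DoOP itself parses BLAST output with Bioperl; this Python sketch starts from already-parsed hits.

```python
from typing import NamedTuple, Optional, Sequence

class Hit(NamedTuple):
    subject: str      # identifier of the matched genomic sequence
    aln_len: int      # alignment length in bp
    bit_score: float  # BLAST bit score

def best_hit(hits: Sequence[Hit], min_len: int = 0) -> Optional[Hit]:
    """Pick the most likely ortholog: best score, longer alignment wins ties."""
    candidates = [h for h in hits if h.aln_len >= min_len]
    if not candidates:
        return None
    return max(candidates, key=lambda h: (h.bit_score, h.aln_len))

# Hypothetical hits for one query first exon.
hits = [Hit("geneA", 180, 250.0), Hit("geneB", 60, 90.0), Hit("geneC", 175, 250.0)]
print(best_hit(hits))
```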
Defining conserved motifs in the DoOP database
Workflow: orthologous promoter group, multiple sequence alignment, conserved motifs, consensus sequence, consensus database (all consensus sequences from all groups)
Multiple sequence alignment, for example (row breaks lost in extraction): tGGGAGTGCG ATTTCGGGTC CACAGAGCTC tctgcgcggt gctggggcat -GGGAGCCCG AGGCCGAGGC CGCCGAGCTC G-cgtacggt ---------- ---------- ---------- CACTGGGATC A-acgtcaac ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- CGCTGAGATC A--------- ----------
Conserved block in the four aligned sequences:
CACAGAGCTC
CGCCGAGCTC
CACTGGGATC
CGCTGAGATC
Consensus sequence: CRCNGaGMTC
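The consensus notation can be reproduced with a simple column rule. The rule below is inferred from the example on this slide and is an assumption, not the documented DoOP procedure: a fully conserved column gives an upper-case base, a base present in at least 75% of the sequences gives a lower-case base, two bases give the two-base IUPAC code, and anything else gives N.

```python
from collections import Counter

# IUPAC codes for two-base ambiguities.
IUPAC_TWO = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("AC"): "M",
    frozenset("GT"): "K", frozenset("CG"): "S", frozenset("AT"): "W",
}

def consensus_column(bases, majority=0.75):
    """Consensus symbol for one alignment column (gaps are ignored)."""
    counts = Counter(b.upper() for b in bases if b != "-")
    if not counts:
        return "-"
    top, n = counts.most_common(1)[0]
    if len(counts) == 1:
        return top                      # fully conserved: upper case
    if n / sum(counts.values()) >= majority:
        return top.lower()              # mostly conserved: lower case
    if len(counts) == 2:
        return IUPAC_TWO.get(frozenset(counts), "N")
    return "N"

def consensus(rows):
    """Consensus of equally long aligned rows."""
    return "".join(consensus_column(col) for col in zip(*rows))

block = ["CACAGAGCTC", "CGCCGAGCTC", "CACTGGGATC", "CGCTGAGATC"]
print(consensus(block))  # CRCNGaGMTC
```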
Creating the web interface
• ENSEMBL / EPD / TAIR links
• Annotating repetitive elements
• Multiple alignment (DIALIGN)
• Defining conserved motifs
• Creating MySQL database
• PHP / HTML programming
Barta, E., Sebestyén, E., Pálfy, T.B., Tóth, G., Ortutay, C.P. and Patthy, L. (2005) DoOP: Databases of Orthologous Promoters, collections of clusters of orthologous upstream sequences from chordates and plants. Nucleic Acids Res. 33: D86-D90.
Are the conserved motifs indeed binding sites for transcription factors?
• For good-quality motifs we may safely assume that the answer is yes (in other words, they probably take part in transcription regulation)
How to prove it?
• Experimentally (see the case study later)
• By comparing conserved motifs with known TFBSs
• By comparing them with ChIP-on-chip results (in a few years)
The number of motifs depends on the evolutionary distance between species in a promoter group
[Figure panels: only primates; only mammals; chordates]
Conserved known TFBSs in the promoter region of the Centromere protein A (CENP-A) gene
• motif m3: ggGTCAcgTGAc
• motif m4: cCcggcccGgaGc
Known binding sites labelled in the figure: N-Myc, SAP-1
Conserved TATA box in the promoter of COL1A2 (Collagen alpha 2(I))
Homo sapiens              GGAGGGCG---------------------------GGAGGATGCGGAGGGCGGAGG----
Gallus gallus             GCAGGGCG---------------------------AGGGGCGGGGAACGTCTGAAAAAAA
Sus scrofa                GAAGGCCG---------------------------GGGGGATGGGGAGGGCGGAGG----
Rattus norvegicus         AGAGGGCG---------------------------GGTGGCTGGGGAGGGCGGAGG----
Danio rerio               AACAGGA--------------------------------------------GGAG-----
Takifugu rubripes         AACAGGA--------------------------------------------GGAG-----
Ornithorhynchus anatinus  GGAGGct-----------------------------------------------------
Dasypus novemcinctus      AGAGGACG---------------------------GGTGGATGGGGAGGGCGGAGG----
Callithrix jacchus        GGAGGGCG---------------------------GGAGGATAGGGAGGGCGGAGG----
Sorex araneus             GGAG--------------------------------------GGGGAGGGCGGAGG----
Mus musculus              AGAGGGCG---------------------------GGTGGCTGGGGAGGGCGGAGG----
Meleagris gallopavo       GCAGGGCG---------------------------AGGGGCGGGGAACGTCTGAAAAAAA
Taeniopygia guttata       GGGGCGAG---------------------------AGGGGCGTGGGACGGCTGAGGGGAA
Papio anubis              GGAGGGCG---------------------------GGAGGATGGGGAGGGCGGAGG----
Bos taurus                GAggtggggggagttggggggaggaaggccagagcGGGGGATGGGGAGGGCGGAGG----
Canis familiaris          GAAGGGCG---------------------------GGGGGATGGGGAGGGCGGAGG----
Felis catus               GAAGAGAG---------------------------GGGGGATGGGGAGGGCGGAGG----
Tetraodon nigroviridis    AACAGGA--------------------------------------------GGAGG----
Pan troglodytes           GGAGAGCG---------------------------GGAGGATGCGGAGGGCGGAGG----
"Box 3A, Sp1 binding site GGGCGG": Inagaki et al. 1994. JBC, 269, 14828-34
Searching between conserved motifs (MOFEXT program)
Search algorithm:
• The query sequence is compared with every sequence of the consensus database, one after the other
• A window of given length (wordsize) is moved along the sequences, comparing them and calculating scores
• A hit above the cutoff score is extended (extended hit)
• The search continues with the next sequence
Example:
Query sequence:    ttRcGGWACCTgTaa
Database sequence: atGCTGAgRCGgAACCTGcGGAAC
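The window-and-extend idea can be sketched in a few lines of Python. This is not the MOFEXT source: the real program scores positions with a similarity matrix file, whereas this sketch assumes that two IUPAC symbols match when their base sets overlap, and it extends a seed only across matching positions. All function names are hypothetical.

```python
# Bases represented by each IUPAC nucleotide code.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def compatible(a, b):
    """True if two IUPAC symbols share at least one base (case-insensitive)."""
    return bool(set(IUPAC[a.upper()]) & set(IUPAC[b.upper()]))

def extend(query, target, qi, ti, wordsize):
    """Extend a seed in both directions while positions stay compatible."""
    qs, ts = qi, ti
    while qs > 0 and ts > 0 and compatible(query[qs - 1], target[ts - 1]):
        qs, ts = qs - 1, ts - 1
    qe, te = qi + wordsize, ti + wordsize
    while qe < len(query) and te < len(target) and compatible(query[qe], target[te]):
        qe, te = qe + 1, te + 1
    return qs, ts, qe - qs  # (query start, target start, hit length), 0-based

def find_hits(query, target, wordsize=6, cutoff=80):
    """Seed with windows whose percentage of matching positions reaches the cutoff."""
    hits = set()
    for qi in range(len(query) - wordsize + 1):
        for ti in range(len(target) - wordsize + 1):
            matches = sum(compatible(query[qi + k], target[ti + k])
                          for k in range(wordsize))
            if 100 * matches >= cutoff * wordsize:
                hits.add(extend(query, target, qi, ti, wordsize))
    return sorted(hits)

query = "ttRcGGWACCTgTaa"
target = "atGCTGAgRCGgAACCTGcGGAAC"
for qs, ts, length in find_hits(query, target, wordsize=6, cutoff=100):
    print(query[qs:qs + length], target[ts:ts + length])
```

With the two sequences from the slide, one of the extended hits pairs `RcGGWACCTg` in the query with `RCGgAACCTG` in the database sequence.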
The MOFEXT (MOtif Find and EXTend) program
Written in standard C, available upon request
Usage: mofext -l mypatterns1.list mypatterns2.list -p query1 query2 -m matrix.txt -w 10 -s 95
Options:
  -h  Display this, but you know it, because you see it :-)
  -l  The databases. Space separated, maximum 50
  -p  Query patterns. Space separated, maximum 50
  -m  The similarity matrix filename
  -w  Word size. Default: 6
  -s  The similarity percentage limit. Default: 80
  -e  If you add this, the output is not the similarity score but the percentage
  -a  If you add this, instead of print the database sequence, print the matched region
Output is a plain text file in table format
The program is also suitable for searching protein sequences
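The word size (-w) and the similarity limit (-s) together decide how many mismatches a seed window tolerates; the SOX9 slide later equates wordsize 7 with cutoff 81% to one mismatch. The helper below is a hypothetical illustration of that arithmetic, assuming every position scores either a full match or nothing. With a graded similarity matrix the real behaviour can differ.

```python
def max_mismatches(wordsize, cutoff_percent):
    """Largest number of mismatching positions that still reaches the cutoff."""
    # Ceiling of wordsize * cutoff / 100, computed with integers only.
    needed_matches = -(-wordsize * cutoff_percent // 100)
    return wordsize - needed_matches

for w, s in [(7, 81), (8, 95), (10, 95), (16, 71)]:
    print(f"-w {w} -s {s}: up to {max_mismatches(w, s)} mismatch(es)")
```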
The DoOPSearch website (http://doopsearch.abc.hu)
• Conserved motifs database: from the current DoOP database
• The web interface exposes the same parameters as the command-line version
• The results are linked to the DoOP database
• The results can be sorted and/or filtered (score, ext. score, length, GO annotation!)
• FUZZNUC search in the DoOP promoter sequences (MOFEXT searches only the conserved motifs, while FUZZNUC searches the whole promoters)
Utilization of DoOP data and the MOFEXT program
• Studying the promoter region of different genes and making guesses about putative TFBSs
• Finding conserved motifs in the promoter regions of other genes (possible co-regulation)
  Case study: SOX9 binding sites in the promoter region of the matrilin-1 gene
• Studying the evolution of TFBSs
• Drawing regulation networks based on similar conserved motifs
• Studying conservation in the core promoter region
PE1 element in the DoOP database (in silico data)
Human     CTTCTGCAAGCAAAGGAGCCCTTGTGGTCAG
Chimp     CTTCTGCAAGCAAAGGAGCCCTTGTGGTCAG
Macaque   CTTCTGCAGGCAAAGGAGCCCTTGTGGTCAG
Dog       CTTCTGCAGGCAAAGGGGCCCTTGTGGTCCG
Cattle    CTTCTGCAGGCAAAGGAGCCCTTGTGGTCAG
Elephant  CTTCTGCAGGCAATGGAGCCCTTGTGGTTAG
Mouse     CTTCTGCAGGCAAAGGGGCCCTTGTGGTCAG
Rat       CTTCTGCAGGCAAAGGGGCCCTTGTGGTCAG
Chicken   CTTCTCCGAGCAATGGAGCCATTGTGGAGGG
Consensus CTTCTgCaRGCAAaGGRGCCcTTGTGGtcaG
• Finding further orthologous sequences by hand
• Defining the consensus sequence
Search in the DoOP CH-1.3 1000 bp consensus motifs database: MOFEXT program, wordsize: 8, cutoff: 95%
Similarity to the motif found in the promoter of the MyBP-H gene
ARGCAAAGGRGCCATTG   Matrilin-1
:.:::: :..:::::::
AGGCAAGGSAGCCATTG   MyBP-H
Hits from other extracellular matrix specific genes
Matrilin-1 consensus: CTTCTgCaRGCAAaGGRGCCaTTGTGGtcaG

TCTgCaRGCAAaGGRGCCa   Matrilin-1
: : ::.:::::::. :::
TTTCCAGGCAAAGGGCCCA   Collagen, type IV, alpha 2

CTgCaRGCA   Matrilin-1
:::::.:::
CTGCAGGCA   Brain link protein 2

CTgCaRGCAAaGGRGCC   Matrilin-1
:: : .: ..:::.:::
CTCCGAGGRRAGGGGCC   Collagen alpha 1(II)

aRGCAAaGGRGCC   Matrilin-1
:.:::::: .:::
AGGCAAAGCAGCC   melanoma-associated chondroitin sulfate proteoglycan 4

aGGRGCCaTTGTGGTcaG   Matrilin-1
::..:::: ::. ::: :
AGRGGCCACTGKAGTCGG   Cartilage intermediate-layer protein
PE1 element in the promoter region of the chicken matrilin-1 gene (experimental data) Rentsendorj, O., Nagy, A., Sinko, I., Daraba, A., Barta, E. and Kiss I. (2005) „Highly conserved proximal promoter element harbouring paired Sox9-binding sites contributes to the tissue- and developmental stage-specific activity of the matrilin-1 gene.” Biochem J. 389:705-716
Searching for SOX9 binding sites in the DoOP promoter database
• Using the WWCAAWG consensus with FUZZNUC against all human 1000 bp promoters (no mismatch, search in complement): 21991 hits (0 mismatch), 370531 hits (1 mismatch)
• Using the [AT][AT]CAA[AT]GN(1,3)C[AT]TTG[AT][AT] paired consensus (Bridgewater et al. 2003, NAR 31:1541-53) with FUZZNUC (max two mismatches): 21492 hits
• Using the WWCAAWG consensus with the MOFEXT program (conserved motifs from 1000 bp DoOP promoters, wordsize 7, cutoff 81% = 1 mismatch): 51865 hits
• Using the WWCAAWGNNNNCWTTGWW paired consensus with the MOFEXT program (conserved motifs from 1000 bp DoOP promoters, wordsize 16, cutoff 71%): 358 hits
• Matrilin-1 motif m26, score: 26
Hit    GcTCTRGTTGCTTCTGCARGCAAAGGAGCCCTTGTGGTCaGAGGGGCCTCYgRAGCCY
                           |||.|    | |||.
Query                    WWCAAWGNNNNCWTTGWW
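An exact-match search for an IUPAC consensus on both strands, comparable to the zero-mismatch FUZZNUC run above, can be sketched with regular expressions. This is an illustration only: it does not call EMBOSS, it does not support mismatches, and the function names and the test promoter sequence are made up.

```python
import re

# Regular-expression character class for each IUPAC nucleotide code.
IUPAC_RE = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "M": "[AC]", "K": "[GT]", "S": "[CG]", "W": "[AT]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYMKSWBDHVN", "TGCAYRKMSWVHDBN")

def reverse_complement(pattern):
    """Reverse complement of an IUPAC pattern, e.g. WWCAAWG -> CWTTGWW."""
    return pattern.upper().translate(COMPLEMENT)[::-1]

def search_both_strands(consensus, promoter):
    """Return sorted (position, strand, matched text) tuples, 0-based, no mismatches."""
    hits = []
    for strand, pattern in (("+", consensus), ("-", reverse_complement(consensus))):
        body = "".join(IUPAC_RE[c] for c in pattern.upper())
        regex = re.compile("(?=(" + body + "))")  # lookahead keeps overlapping hits
        hits += [(m.start(), strand, m.group(1)) for m in regex.finditer(promoter.upper())]
    return sorted(hits)

promoter = "GGAACAATGCCCATTGTTGG"  # made-up sequence with one site on each strand
print(search_both_strands("WWCAAWG", promoter))
```

Matches on the minus strand are reported as they appear on the forward strand, which is why the second hit reads CATTGTT.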
Studying the evolution of TFBSs • Motifs conserved between fishes and mammals can be considered as ancient motifs • We need to refine the automatically generated motif consensus collections in order to find real common motifs
Paired mesoderm homeobox protein 2B (Paired-like homeobox 2B) (PHOX2B homeodomain protein)
DoOP, chordate v1.3, 1000, 80400097
• Choosing motif m16
• Searching all chordate 1000 bp consensus sequences with the MOFEXT program using wordsize=10
Hits from MOFEXT search with the M16 motif of PHOX2B homeodomain protein gene promoter
PHOX2B M16 motif: AAATTGGATCAGgGagAATCGTCACCCAACTTTCATTATTTcCAAgtaGtGTGATTGAATTAAAGGGCAgGgag

AAGGGCAGGGA   PHOX2B
:::::::::::
AAGGGCAGGGA   ATPase family, AAA domain containing 1; Voltage-dependent L-type calcium channel alpha-1D 17; FAM35A; Digestive tract-specific calpain; etc.

GAGAATCGTCACCCAACTTTCATTATTTCCA   PHOX2B
::::::  : :   :  ::::::::::::::
GAGAATTCTAATATATTTTTCATTATTTCCA   Unknown

GATTGAATTAAAGGGCAGGGAG   PHOX2B
:::::  : : :::::::::::
GATTGTTTAACAGGGCAGGGAG   Protein FAM3D

CCCAACTTTCATTATTTCCAAG   PHOX2B
::::::::::::::   ::: :
CCCAACTTTCATTAGCACCAGG   PERB11 family member in MHC class I region24

TTTCATTATTTCCAAGTA   PHOX2B
::::::: ::::::: ::
TTTCATTCTTTCCAACTA   RNA binding motif protein 18
Drawing regulation networks based on similar conserved motifs
• Four model cartilage-specific genes
• Finding long (30-35 bp) conserved motifs using MEME
• Searching in DoOP 1000 bp motifs using the MOFEXT program (ws: 8, cutoff: 81)
AGC1: Aggrecan (Chondroitin sulfate proteoglycan)
HAPLN1: Link protein (Proteoglycan link protein)
MATN1: Matrilin-1 (Cartilage matrix protein)
MATN3: Matrilin-3
Link protein, MEME motif 1:
Conclusions and future plans
• The DoOP database and the MOFEXT program are suitable for the analysis of transcription regulation
• The Matrilin-1 promoter shows interesting features: a longer conserved block with paired binding sites (1-2 mismatches are possible in each site)
• These methods may be suitable for finding regulatory networks based on conserved motifs and for studying the evolution of TFBSs
• We plan to further improve these methods:
  • To cluster conserved motifs
  • To study paralogue evolution in promoter regions (sub- or neofunctionalisation)
• What about plants?
Bioinformatics group: Endre Sebestyén, Tamás Pálfy, Tibor Nagy, Gábor Tóth
Students from the ELTE: Áron Szenes, János Molnar
Collaborative partners: Ibolya Kiss (BRC, Szeged) and László Nagy (DTE, Debrecen)
Swedish EMBnet node, UPPMAX computer facility