2 - SMILE and the searching strategy used
To study structured motifs we used SMILE (Structured Motifs Inference, Localization and Evaluation), a program based on an algorithm introduced by Marsan and Sagot (Marsan, L. and M.-F. Sagot, Algorithms for extracting structured motifs using a suffix tree with application to promoter and regulatory site consensus identification. J. of Comput. Biol. 7, 345–362, 2000). Rather than working directly on the sequences, it works on an index (a suffix tree) of the set of sequences.

SMILE takes as input a set of unaligned biological sequences and a list of parameters. The parameters describe the properties that the sought patterns must satisfy, and SMILE outputs all motifs in the input sequences that satisfy them. The motifs SMILE can handle are complex: they may be composed of any specified number of parts, or sub-motifs, which we call the boxes of the motif. The occurrences of the boxes of a motif are assumed to always appear in the same relative order in the sequences. Each box of a structured motif has its own user-defined characteristics, while other parameters describe the motif as a whole.

We are mainly interested in the HSE structured motif, which has a very particular structure: it is composed of three boxes, the first and the last of which are identical, while the middle box is the reverse complement of the other two. This strongly suggests a particular conformation of the DNA segment, which may be responsible for its gene-regulatory function. SMILE's parameters can be set so as to focus the search on all possible three-box motifs of the general form

XYZ_Z*Y*X*_XYZ

where X, Y, Z represent any DNA base and X*, Y*, Z* represent their complements. We also tried to investigate possible spatial correlations between the patterns found.
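As an illustration of the pattern family XYZ_Z*Y*X*_XYZ, the following minimal Python sketch enumerates the 64 candidate three-box motifs and scans a sequence for exact occurrences of one of them. It is not SMILE and does not use a suffix tree: it is a plain regular-expression scan, written only to make the pattern concrete. The function names are hypothetical, and the reading of the box-to-box "distance" as the number of bases between two consecutive boxes (default range 2 to 14, the range stated for the queries in this section) is an assumption.

```python
import re
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(box):
    """Return Z*Y*X* for a box XYZ (complement each base, reverse the order)."""
    return "".join(COMPLEMENT[base] for base in reversed(box))


def three_box_motifs():
    """Yield every motif (XYZ, Z*Y*X*, XYZ) over the DNA alphabet: 4**3 = 64."""
    for bases in product("ACGT", repeat=3):
        box = "".join(bases)
        yield box, reverse_complement(box), box


def find_occurrences(sequence, motif, min_gap=2, max_gap=14):
    """Return the start positions of exact occurrences of a three-box motif.

    Assumption: consecutive boxes are separated by min_gap..max_gap bases.
    The sequence is assumed to be upper-case and to contain only A, C, G, T.
    """
    box1, box2, box3 = motif
    gap = "[ACGT]{%d,%d}" % (min_gap, max_gap)
    # A lookahead is used so that overlapping occurrences are all reported.
    pattern = re.compile("(?=%s%s%s%s%s)" % (box1, gap, box2, gap, box3))
    return [match.start() for match in pattern.finditer(sequence)]


if __name__ == "__main__":
    hse = ("GAA", reverse_complement("GAA"), "GAA")
    print("_".join(hse))
    print(find_occurrences("TGAATTTTCAAGAAC", hse))
```

For the HSE motif the sketch builds GAA_TTC_GAA, and in the toy sequence TGAATTTTCAAGAAC it reports a single occurrence starting at position 1 (0-based), with two bases between consecutive boxes.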
We also ran searches for the GATA motif, in order to assess its statistical relevance.

For the whole motif, our queries required that it occur in exact form in all the input sequences and that it be composed of all three boxes. For each box we required a length of 3 and a distance from the next box ranging from 2 to 14.

All the motifs extracted according to the specified structural parameters are classified according to their statistical significance. SMILE offers two ways of performing this evaluation. We used the one that compares the number of occurrences of the motifs found in the original sequences (and their distribution over the input sequences) with their occurrences in another set of related biological sequences that are not expected to contain the motif. This control set is obtained by randomly shuffling the original sequences while preserving the distribution of fragments of length 3 (a value chosen because it equals the length of the boxes).

1 - BIOLOGICAL BACKGROUND: Gene regulation and structured motifs

Structured genetic motifs, functioning as regulatory elements, are short DNA sequences that determine the timing, location and level of gene expression. Although often only 5 to 20 bp in length, they are critical to understanding gene regulation. Experimental procedures for regulatory element discovery, such as electrophoretic mobility shift assays or in vivo analyses such as DNA transformation with reporter genes, are lengthy and typically verify one element at a time. Computational methods have therefore been developed to predict regulatory elements and their locations in a high-throughput manner.

Tetrahymena thermophila and heat shock protein 70 genes

Genes induced by stress are excellent models for identifying new genetic elements involved in the control of gene expression. Our attention is mainly focused on genes of the heat-shock protein family.
The expression of the heat shock genes is known to be regulated mainly at the transcriptional level. The inducibility of the heat shock genes in response to various environmental stresses depends on the activation of the heat shock factors (HSF). HSF bind to highly evolutionarily conserved heat shock regulatory elements (HSE), which are composed of at least three adjacent and inverted repeats of the motif 5’-nGAAn-3’. One inducible hsp70 gene was cloned from Tetrahymena thermophila and its promoter was characterized. It was shown to contain several HSE motifs with canonical and non-canonical sequences, and a new genetic element with repetitive GATA sequences that resembles the element specific for GATA binding factors (Fig. 1). Electrophoretic mobility shift assays and mutational changes followed by in vivo analysis with a reporter gene revealed that the canonical HSE plays a determinant role in the induction of hsp70 gene transcription and that the repetitive GATA sequences are necessary for hsp70 expression. By searching the entire Tetrahymena genome, which has recently been completely sequenced, other genes of the same family (and also other stress genes) were identified. Their promoter sequences are the data we analyzed using SMILE.

Fig. 1
[Figure labels: motif legend ACA_TGT_ACA, GTT_AAC_GTT, ATC_GAT_ATC, ATG_CAT_ATG, GAA_TTC_GAA (HSE motif); genes hsp70 div1 **, hsp70 *, hsp70 div2 **]

3 - RESULTS

a) HSE-motif and other similar motifs

The following table summarizes the results of the search for three-box structured motifs in the hsp70 genes of Tetrahymena thermophila.

Motif          Score
ACA_TGT_ACA    1.01
ATG_CAT_ATG    0.70
GTT_AAC_GTT    0.70
ATC_GAT_ATC    0.55
TGA_TCA_TGA    0.44
CTA_TAG_CTA    0.38
TAG_CTA_TAG    0.34
TTG_CAA_TTG    0.25
CAA_TTG_CAA    0.22
AGA_TCT_AGA    0.22
TCT_AGA_TCT    0.21
GAA_TTC_GAA    0.12
TTC_GAA_TTC    0.11
CTT_AAG_CTT    0.10

The score indicates the deviation from randomness; a score > 0 indicates that the pattern is statistically significant.
The HSE pattern (GAA_TTC_GAA, highlighted by the yellow box in the original table) has been experimentally proven to be involved in gene regulation. No experimental evidence is currently available for the other motifs. Figure 2 gives a schematic representation of the localization of the most significant motifs found, including the HSE motif. A preliminary correlation analysis has given no indication of possible cooperation among these motifs in gene regulation, but more work is necessary to address this problem.

b) GATA-motif

The GATA motif is very frequent in the searched genes and highly repeated along the upstream sequences. This results in a low (but significant) score, and its abundance makes it very difficult to represent in a graph similar to that of Fig. 2. Correlation studies are in progress to investigate the possible association of several GATA boxes into a single functional motif.

[Figure labels: genes hsp70 div3 **, hsp70 div4 **; promoter diagram drawn 5’ to 3’ with boxes HSE (nGAAnnTTCnnGAAn) and GATA (WGATAR), and marks for initiation of transcription and initiation of translation (ATG)]

Diagrammatic representation of the T. thermophila hsp70 promoter region including, among others, the HSE and GATA regulatory motifs involved in stress gene activation, as shown by experimental analysis. The canonical sequence of each motif is reported above the corresponding box (n: any nucleotide; W: A/T; R: A/G).

* Gene characterized by experimental analysis
** Genes identified by searching the T. thermophila genome

Fig. 2

Bioinformatics tools to identify structured motifs in the upstream regions of stress-response-involved genes in Tetrahymena thermophila

Antonietta La Terza*, Roberto Marangoni^, Nadia Pisanti^, Sabrina Barchetta*, Cristina Miceli*
antonietta.laterza@unicam.it marangon@di.unipi.it pisanti@di.unipi.it sabrina.barchetta@unicam.it cristina.miceli@unicam.it
* Dipartimento di Biologia M.C.A., Università di Camerino
^ Dipartimento di Informatica, Università di Pisa