Sequence analysis of CpG islands reveals possible functional correlation between genes and its CpG island sequence. Henry Hyun-il Paik Bioinformatics, School of Informatics Indiana University. Outline. What CpG islands are The Known Relations between CpG islands and Genes
Sequence analysis of CpG islands reveals possible functional correlation between genes and its CpG island sequence • Henry Hyun-il Paik • Bioinformatics, School of Informatics, Indiana University
Outline • What CpG islands are • The Known Relations between CpG islands and Genes • Motivation and Goal • Data set • Procedures • Results • Discussion
What are CpG islands? • CpG dinucleotides are rare in mammalian DNA • In mammals, DNA methylation occurs mainly at CpG sites • Methylated cytosines may be converted to thymine by deamination over evolutionary time • CpG → TpG • CpG islands are short stretches of DNA with a higher frequency of the CG dinucleotide • They are usually not methylated
What are CpG islands? • Definition from Gardiner-Garden & Frommer • At least 200 bases long • G+C content: > 50% • Observed CpG/expected CpG ratio: >= 0.6 • Definition from Takai & Jones • Longer than 500 bp • G+C content: > 55% • Observed CpG/expected CpG ratio: >= 0.65 • Under this stricter definition, the CpG islands are more likely to be associated with the 5’ regions of genes, and most Alu repeats are excluded • There are about 29,000 such regions in the human genome
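The two definitions above can be sketched as a simple check on a single sequence. This is a minimal illustration, not the tool used in this project: the function names are hypothetical, the observed/expected ratio is computed as (number of CpG × length) / (number of C × number of G), and the whole sequence is scored at once rather than scanned with a sliding window.

```python
def cpg_stats(seq):
    """Return (length, G+C fraction, observed/expected CpG ratio) for a DNA string."""
    s = seq.upper()
    n = len(s)
    c = s.count("C")
    g = s.count("G")
    cpg = s.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_fraction = (c + g) / n if n else 0.0
    # Expected CpG count under independence is (C * G) / N
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return n, gc_fraction, obs_exp


def is_cpg_island(seq, min_len=200, min_gc=0.50, min_ratio=0.60):
    """Gardiner-Garden & Frommer thresholds as listed on the slide (defaults)."""
    n, gc_fraction, obs_exp = cpg_stats(seq)
    return n >= min_len and gc_fraction > min_gc and obs_exp >= min_ratio


def is_cpg_island_takai_jones(seq):
    """Takai & Jones thresholds as listed on the slide ("longer than 500 bp")."""
    return is_cpg_island(seq, min_len=501, min_gc=0.55, min_ratio=0.65)
```

For example, a 300-base run of alternating C and G passes the first definition but is too short for the second.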
CpG islands & Genes • CpG islands located in the promoter regions of genes can play important roles in gene silencing • Housekeeping genes • Almost all housekeeping genes are associated with at least one CpG island • These islands start 5’ of the transcription start site and cover one or more exons and introns • Tissue-specific genes • About 40% of tissue-specific genes are associated with islands • The position of these islands is not as strongly biased toward the transcription start site as in housekeeping genes
CpG islands & Genes • Not all CpG islands are associated with genes • Ioshikhes & Zhang identified features that discriminate promoter-associated from non-associated CpG islands • There are methylation-prone and methylation-resistant CpG islands • Feltus et al. found patterns that discriminate methylation-prone from methylation-resistant CpG islands
CpG islands & Genes • [Figure: positions of CpG islands relative to a gene: promoter (5’ end) CpG islands, CpG islands in the gene body, and 3’ end CpG islands]
Motivation and Objective • Our project was inspired by these ideas • The mechanical definition applies the standard criteria as they are • At least 200 bases long • G+C content: > 50% • Observed CpG/expected CpG ratio: >= 0.6 • We tried to find a “semantic meaning” of CpG islands: a correlation between CpG islands and gene functions • Are there any significant CpG island patterns related to gene function?
Motivation and Objective • [Figure: CpGi 1 paired with Gene 1, and CpGi 2 paired with Gene 2] • Assume that gene 1 and gene 2 have similar functions • Then the sequences of gene 1 and gene 2 are probably similar • Our goal is to find CpG island patterns shared by genes with similar functions
Data Set • Reference: Larsen F., Gundersen G., Lopez L., Prydz H. CpG islands as gene markers in the human genome. Genomics 13:1095-1107 (1992) • Total number of entries: 1711 • Entries with no islands: 1212 • Entries with islands: 499 • Total number of islands: 928 • Length of CpG islands • Average size of islands: 465 bp • Shortest detectable island: 200 bp • Largest island: 3340 bp
Procedures • All-to-all comparison (FASTA) • Clustering (BAG) • Motif (pattern) discovery and search for each cluster (MEME, MAST) • Database search with CpG island patterns (BLAST)
Clustering • We use the clustering program BAG, by Sun Kim • Each CpG island is compared with all other CpG islands using FASTA, and the results are the input to BAG • BAG builds clusters based on sequence similarity
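The idea of turning all-to-all comparison scores into clusters can be sketched as follows. This is not BAG's algorithm, which the slides do not describe; it is a simplified stand-in that uses single-linkage grouping with a union-find structure. The function name, the `(id_a, id_b, score)` input format, and the score cutoff are all assumptions made for illustration.

```python
def cluster_by_similarity(ids, scored_pairs, min_score):
    """Group sequence ids into clusters: any pair scoring >= min_score is linked.

    ids          -- iterable of sequence identifiers
    scored_pairs -- iterable of (id_a, id_b, score) from an all-to-all comparison
    min_score    -- hypothetical cutoff; pairs below it are ignored
    """
    ids = list(ids)
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the cluster root, halving the path as we go
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score in scored_pairs:
        if score >= min_score:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    # Largest clusters first; ties broken by member names for stable output
    return sorted((sorted(m) for m in clusters.values()),
                  key=lambda m: (-len(m), m))
```

With islands `i1` to `i4` and only the pairs `i1`/`i2` and `i2`/`i3` scoring above the cutoff, the first three islands form one cluster and `i4` stays on its own.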
Motif Discovery & Search • MEME discovers patterns for each cluster • To assess the significance of a pattern, MAST searches all CpG islands with it • The E-value shows how significant the pattern is and how often it occurs • Profiles are made to represent each cluster
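How a profile can summarize a set of motif occurrences is sketched below. This is not MEME, which has to discover the sites itself; the sketch assumes the sites are already aligned and of equal length, and the function names are hypothetical.

```python
from collections import Counter


def position_counts(sites):
    """Count the bases seen at each column of equal-length, pre-aligned sites."""
    if not sites:
        raise ValueError("at least one site is required")
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same length")
    return [Counter(s[i] for s in sites) for i in range(width)]


def consensus(sites):
    """Most frequent base per column; ties are broken alphabetically."""
    result = []
    for counts in position_counts(sites):
        best = min(counts, key=lambda base: (-counts[base], base))
        result.append(best)
    return "".join(result)
```

The per-column counts are the raw material of a profile; the consensus string is only a compact summary of them, comparable in form to the profile sequences shown in the Results slides.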
BLAST • The entire GenBank was searched with each CpG island profile rather than with the gene sequence • This shows how efficiently the profile finds genes that have similar functions • This checks the validity of the profile
Results • Of the 115 clusters in total, 26 contain members whose genes have similar functions • These 26 clusters fall into two categories depending on CpG island location • 18 clusters have CpG islands in the coding region • 8 clusters have CpG islands in the promoter region
Results • An example of a CpG island in the gene body • Cluster #18: Human heat-shock protein HSP70B' gene • MEME • MAST • Profile sequence: ATCATCGCCAACGACCAGGGCAACCGCACCACCCCCAGCTACGTGGCCTT • BLAST
Results • An example of a promoter CpG island • Cluster #25: Human gene for creatine kinase B • MEME • MAST • Profile sequence: GAGGAGTCCTACGAAGTGTTCAAGGATCTCTTCGACCCCATCATTGAGGA • BLAST
Discussion • The BLAST results suggest that CpG islands in both the promoter region and the CDS are good markers for gene sequences • Although there are few promoter CpG islands, they represent their clusters significantly • Since many CpG islands tend to cover exons, they can be used to identify transcripts • More data are needed to support this result and to derive generic patterns
Acknowledgement • Dr. Sun Kim • Dr. Paul Ma • Arvind • Bioperl community