Open access toolkit for nonparametric explorative pattern mining to detect events relating to disease in large scale genome sequences. Thahir P. Mohamed, Asia D. Mitchell and Madhavi Ganapathiraju Department of Biomedical Informatics University of Pittsburgh School of Medicine
Open access toolkit for nonparametric explorative pattern mining to detect events relating to disease in large scale genome sequences
Thahir P. Mohamed, Asia D. Mitchell and Madhavi Ganapathiraju
Department of Biomedical Informatics, University of Pittsburgh School of Medicine, Pittsburgh PA USA
Advancing Practice, Innovation, and Instruction through Informatics, October 20, 2008
The Genome Sequence • The human genome contains… 3 billion nucleotides 20 to 25 thousand genes Two-thirds of the genome made of repetitive elements (2 billion nucleotides) ATGGCACTGAGCTCCCAGATCTGGGCCGCTTGCCTCCTGCTCCTCCTCCTCCTCGCCAGCCTGACCAGTGGCTCTGTTTTCCCACAACAGGTGAGAGCCCAGTGGCCTGGGTCCTTAGCAGGGCAGCAGGGATGGGAGAGCCAGGCCTCAGCCTAGGGCACTGGAGACACCCGAGCACTGAGCAGAGCTCAGGACGTCTCAGGAGTACTGGCAGCTGAACAGGAACCAGGACAGGCACGGTGGCTCATGCCTGTAATCCCAGCACTTTGGGAGGTTGAGGCAGGCAGCCCACTTGAGGTCAGTTTGAGACCAGCCTGGCCAACATGGTAAAACCCCGTCTCTACTAAAAATACAAAAGTTAGCCAGGCTTGGTGGCAGGTGCCTGTAATCCCAGCTACTCGGGAGACTGAGGCAGGAGAATTGCTTGAACCCGCAAGGTGGAGGTTGCACAGTGAGCTGAGATTGCACCACTGCACTCCAGCCTGGCAACAGAGCAAGACTCCATCTCCAAAAAAGAACAGAAATCAATGAAGCACCGAGTGACAGGGACTGGAAGGTCCTAATTCCATGGGTATTTACGGAACCCCTACGCCGTGTGGAGTCTTATTCTAGACAGTGGGGACGAGGCCATGAACAAGGTAGATGAGAGAGGAGATTTCTCCATCCTGGTCAGGGAATTTGTTAAAGACTGATGAAAACATGAATAAATAATTGTGTCTAGTACATTCTATTCGTGAATCTCATAACAGACAGTGGTAGAGTGACCGTGACCCATTCGCCACACAGTAGAGTCACTTTTTTGGTTTGTTTTTTAGAGACAGGGTCTTCCTCTGTTGCTGAGGCTGGAGTGCAGTGGTGCAGTCATAGTTCACTGCAGCCTCAACCTCCTGTGCTCAAGCAATCCTCCCACCTCAGCGTCCCAAGTAGCTGGGACAGCAGGCACATGCCACGGGTTGGGGGACCACAGGCATGGTCAAGGGGCTGGCAGTCAAGCAAGTG
Genomic Patterns
• Short Tandem Repeats (STRs)
  • 1 to 6 nucleotides repeated in tandem
• Variable Number Tandem Repeats (VNTRs)
  • Same as short tandem repeats
  • Number of repeats variable across individuals
• CpG Islands
  • A sequence of > 500 nucleotides
  • C+G content of > 55%
  • High frequency of CG dinucleotides
  • Example: …CGCGCCGGACGTTACGCGCGCCGCGAAACGCGCGCCGGACGGCGCCGCAAACGGCCGCGCGTAC…
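The STR and CpG island definitions above can be sketched in a few lines of Python. This is only an illustration, not the toolkit's implementation: the function names `find_strs` and `cpg_stats`, the `min_copies` threshold, and the observed/expected CpG ratio (a common way to quantify "high frequency of CG dinucleotides") are assumptions that do not appear on the slides.

```python
import re


def find_strs(seq, min_copies=3):
    """Return (start, unit, copies) for tandem repeats of a 1-6 nt unit.

    Hypothetical helper: min_copies is an assumed threshold, not from the slides.
    """
    # The lazy {1,6}? tries the shortest unit first; \1{k,} demands further copies.
    pattern = re.compile(r"([ACGT]{1,6}?)\1{%d,}" % (min_copies - 1))
    hits = []
    for match in pattern.finditer(seq):
        unit = match.group(1)
        hits.append((match.start(), unit, len(match.group(0)) // len(unit)))
    return hits


def cpg_stats(window):
    """Return (C+G fraction, observed/expected CpG ratio) for one window.

    The obs/exp ratio is an assumed measure of CG dinucleotide frequency.
    """
    n = len(window)
    c, g = window.count("C"), window.count("G")
    cg = window.count("CG")
    gc_fraction = (c + g) / n if n else 0.0
    obs_exp = (cg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp
```

A window would be a CpG island candidate under the slide's criteria if it is longer than 500 nucleotides and its C+G fraction exceeds 0.55; the cutoff for the obs/exp ratio is left to the caller because the slides do not give one.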
Genomic Patterns
• Palindromes
  • A sequence that is like a normal palindrome (mom, racecar, …)
  • One half is the reverse complement of the other
• LINE-1 Elements
  • Retrotransposon of >1,000 nucleotides
  • High A+T content
  • Poly A tail
• ALU Elements
  • Retrotransposon of ~300 nucleotides
  • High G+C content
  • Recognition site for the AluI endonuclease
  • Segment high in A content
  • A poly A tail
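The DNA palindrome definition can be made concrete with a short Python sketch. It assumes exact, even-length palindromes with no spacer between the two halves; the names `reverse_complement` and `find_palindromes` and the `min_half` parameter are hypothetical and are not taken from the toolkit.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_palindromes(seq, min_half=3):
    """Return (start, palindrome) for maximal exact palindromes.

    Assumption: no mismatches and no spacer are allowed; min_half is the
    minimum length of each half.
    """
    hits = []
    for centre in range(1, len(seq)):
        left, right = centre - 1, centre
        # Grow outwards while the two flanking bases are complementary.
        while left >= 0 and right < len(seq) and _PAIRS.get(seq[left]) == seq[right]:
            left -= 1
            right += 1
        half = centre - 1 - left
        if half >= min_half:
            hits.append((left + 1, seq[left + 1:right]))
    return hits
```

For instance, `GAATTC` reads the same as its own reverse complement, so it counts as a palindrome in this sense. The scan is quadratic in the worst case, which is fine for illustration but not for whole chromosomes.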
Disease Relevance
[Diagram. Patterns: ALU/LINE-1 expansions, VNTRs, palindromes, STRs, CpG islands. Effects: abnormal methylation, alternative structures, high mutability, genomic instability. Outcomes: cancer, disease.]
Challenges in Pattern Mining
Computational tools for pattern mining must be…
• Scalable
  • Genomes are large: 3 billion nucleotides
  • Genes are small: 3 thousand nucleotides
  • Genomes of different organisms vary greatly in size
• Flexible
  • Types of patterns differ
  • There are variations within a single type of pattern
  • Flexibility in resolution of analysis
• Nonparametric
  • New and unknown patterns
  • Explorative analysis
Currently, there are no tools for genomic pattern mining that are scalable, flexible, and nonparametric.
Pattern Mining Toolkit
The applications layer contains programs that use features computed by the tools layer and by the preprocessing (foundation) layer to compute specific, commonly known patterns such as short tandem repeats, DNA palindromes, and short and long interspersed nuclear elements.
Foundation Layer
Efficient Preprocessing of Genome Sequence
• Repetitive patterns appear next to each other
• Allows for efficient computation of patterns
Data Preprocessing:
• Suffix array computation
• Longest common prefix array computation
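A small Python sketch can show why these two arrays help: once suffixes are sorted, repeated substrings sit in adjacent rows, and the longest common prefix (LCP) array records how much each row shares with the one before it. The suffix array below is built by naive sorting for clarity, which would not scale to a genome; the LCP step follows Kasai's linear-time method. Whether the toolkit uses these particular algorithms is not stated on the slides, and the function names are hypothetical.

```python
def suffix_array(s):
    """Start positions of all suffixes of s, in lexicographic order.

    Naive construction for illustration only; real genome-scale tools
    need a dedicated construction algorithm.
    """
    return sorted(range(len(s)), key=lambda i: s[i:])


def lcp_array(s, sa):
    """lcp[k] = length of the common prefix of suffixes sa[k-1] and sa[k].

    lcp[0] is 0 by convention. Uses Kasai's method.
    """
    n = len(s)
    rank = [0] * n
    for k, start in enumerate(sa):
        rank[start] = k
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp
```

For the string `ACGACG`, the sorted suffixes start at positions 3, 0, 4, 1, 5, 2, and the LCP value 3 between the first two rows reveals that `ACG` occurs twice.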
Tools Layer
• Find Ngram Counts
• Compare Ngram Counts
• Locate Specific Patterns
Sample output (columns are not labeled on the slide):
TTAAAAAAAA-TTTTTTAAAA 10 251555
TAAAAAAC-GTTTTTAA 8 276649
CAAAAAAG-CTTTTTAG 8 312629
TCTCTACTAAAAAT-ATTTTTAAAAAAAA 14 364179
TGAAAAACA-TGTTTTAAA 9 449648
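As a rough illustration of what counting and comparing n-grams involves, the following Python sketch tallies every overlapping n-gram in a sequence and reports the n-grams whose counts differ between two regions. It uses a plain in-memory counter rather than the suffix array, and the function names are hypothetical, so it should be read as a sketch of the idea rather than of the toolkit's tools.

```python
from collections import Counter


def ngram_counts(seq, n):
    """Count every overlapping n-gram (substring of length n) in seq."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def compare_ngram_counts(counts_a, counts_b):
    """Return {ngram: count_a - count_b} for n-grams whose counts differ."""
    keys = set(counts_a) | set(counts_b)
    # Counter returns 0 for missing keys, so absent n-grams compare cleanly.
    return {k: counts_a[k] - counts_b[k] for k in keys if counts_a[k] != counts_b[k]}
```

A positive difference marks an n-gram enriched in the first region, a negative one an n-gram enriched in the second.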
Tools Layer
• Large Repeats
• Find RegEx
• Find Perplexity
Sample output (columns are not labeled on the slide):
23 17 29441 CAGATTTGAAACACTCTTTTTGT
24 93 4161 ATATCTTCGTATAAAAACAAGACA
25 123 292054 TTTTCAGAAACTGCTTTGTGATGTG
31 255 3983 GAAACGGGATTTCTTTATATTATGCTAGACA
Explorative pattern analysis in chromosome 19
[Figure: successive zoom levels at 5 MB, 250 KB, 10 KB, and 1 KB]
Feature analysis of the centromere of the X chromosome Perplexity drops near the centromere region that is highly repetitive, containing ngrams that are unique to this region.
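The slides do not say how perplexity is computed, so the sketch below makes an assumption: it treats perplexity as 2 raised to the Shannon entropy of the n-gram distribution observed inside a window. Under that definition a highly repetitive window, which contains few distinct n-grams, scores low. The names `window_perplexity` and `perplexity_profile` and all parameters are hypothetical.

```python
import math
from collections import Counter


def window_perplexity(window, n=3):
    """2 ** (Shannon entropy of the n-gram distribution in the window).

    Assumed definition; the toolkit's own formula is not given on the slides.
    """
    counts = Counter(window[i:i + n] for i in range(len(window) - n + 1))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return 2.0 ** entropy


def perplexity_profile(seq, window_size, step, n=3):
    """Return (start, perplexity) for windows sliding along seq."""
    return [
        (start, window_perplexity(seq[start:start + window_size], n))
        for start in range(0, len(seq) - window_size + 1, step)
    ]
```

Plotting such a profile along a chromosome would show dips wherever the sequence becomes repetitive, which is the kind of drop described for the centromere region.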
Pattern landscape of chromosome 19 Duplication events
Acknowledgements
Madhavi Ganapathiraju, Thahir Mohamed, Kamiya Mopwani
Thank you!
Visit us at Department of Biomedical Informatics, University of Pittsburgh: www.dbmi.pitt.edu/madhavi
[Photo: Cathedral of Learning, University of Pittsburgh]