Detection of Transcription Factor Binding Sites Michael Morra CSE 4939W
Background • DNA is composed of four chemical bases • Adenine – A • Thymine – T • Guanine – G • Cytosine – C Image from: http://www.genetest.org/page5.html
Background (Continued) • Each individual organism has a unique DNA sequence • The DNA sequence contains information which can be used by a cell to construct proteins • Each set of instructions within this sequence is called a gene Image from: http://www.buzzle.com/articles/point-mutations.html
Transcription Factors • Cells regulate the expression of genes using proteins known as transcription factors • Each transcription factor binds to the DNA sequence, turning a gene on or off Image from: http://www.cs.uiuc.edu/homes/sinhas/work.html
Binding Sites • The portions of the DNA where transcription factors are able to bind are known as binding sites • The binding sites of a single transcription factor may vary in sequence
Introduction • The detection of binding sites is important to understanding the regulatory network of an organism • As binding sites can vary considerably, searching for them within a DNA sequence is tedious
Project • Implement a method used to accurately and precisely discover the locations of transcription factor binding sites within a DNA sequence.
Data • 4 species (Human, Mouse, Fruit Fly & Yeast) • Human: 26 transcription factors, 300 binding sites • Mouse: 12 transcription factors, 98 binding sites • Fruit Fly: 6 transcription factors, 51 binding sites • Yeast: 8 transcription factors, 75 binding sites
Multiple Sequence Alignment • To be able to analyze the data effectively, each transcription factor’s binding sites need to be aligned • http://www.ebi.ac.uk/Tools/clustalw2/index.html >s1 GACTTTTCGCT >s2 CGATTTTCTCG >s3 GCATTTTCCCA >s4 AGAGAAAACCC >s5 GAATAACCCAAGAGAAA >s6 ACAGAAAAATC >s7 CGAGAAAATCG >s8 TGGTTTTCCCG >s9 GGGTTTCTCCC
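Scoring • Berg and von Hippel method • l = length of the sequence to be scored • j = position in the sequence (1 to l) • nj(b) = number of times base b occurs at position j in the alignment • tj = base at position j in the sequence to be scored • nj(0) = count of the most common base at position j in the alignment • Score = Σj log10[(nj(tj) + 0.5) / (nj(0) + 0.5)] for j = 1..l (the form used in the scoring example)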
Scoring Example • Sequence to score: ACTCA • n1(0) = 3 • n2(0) = 2 • n3(0) = 2 • n4(0) = 2 • n5(0) = 2 • n1(A) = 3 • n2(C) = 1 • n3(T) = 2 • n4(C) = 1 • n5(A) = 2 • Score = log10(1) + log10(1.5/2.5) + log10(1) + log10(1.5/2.5) + log10(1) = 2 × log10(0.6) = -0.443697499
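The worked example above is consistent with a score that sums log10((nj(tj) + 0.5) / (nj(0) + 0.5)) over the positions of the sequence. The following is a minimal C++ sketch of that calculation, not the project's actual code. The names `ColumnCounts`, `baseIndex`, `countColumns` and `scoreSequence` are hypothetical, and the 0.5 pseudocount and base-10 logarithm are inferred from the numbers in the example. It assumes the alignment rows all have the same length and that the sequence to be scored has that length too.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Counts of A, C, G, T in one alignment column.
using ColumnCounts = std::array<int, 4>;

// Map a base to an index 0..3; returns -1 for gaps or unknown characters.
inline int baseIndex(char b) {
    switch (b) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  return -1;
    }
}

// Build per-column base counts from equal-length aligned sequences.
// Gap characters such as '-' are simply not counted.
inline std::vector<ColumnCounts> countColumns(const std::vector<std::string>& aligned) {
    assert(!aligned.empty());
    std::vector<ColumnCounts> counts(aligned.front().size(), ColumnCounts{});
    for (const std::string& seq : aligned) {
        assert(seq.size() == counts.size());
        for (std::size_t j = 0; j < seq.size(); ++j) {
            int idx = baseIndex(seq[j]);
            if (idx >= 0) ++counts[j][idx];
        }
    }
    return counts;
}

// Sum over positions of log10((n_j(t_j) + 0.5) / (n_j(0) + 0.5)).
// The 0.5 pseudocount is an assumption taken from the worked example.
inline double scoreSequence(const std::vector<ColumnCounts>& counts, const std::string& seq) {
    assert(seq.size() == counts.size());
    double score = 0.0;
    for (std::size_t j = 0; j < seq.size(); ++j) {
        int idx = baseIndex(seq[j]);
        int observed = idx >= 0 ? counts[j][idx] : 0;
        int mostCommon = *std::max_element(counts[j].begin(), counts[j].end());
        score += std::log10((observed + 0.5) / (mostCommon + 0.5));
    }
    return score;
}
```

As a check, the made-up alignment AGTAA, AGTAA, ACGCT, CTGGT (chosen only because it reproduces the counts listed in the example) gives `scoreSequence(countColumns(...), "ACTCA")` equal to 2 × log10(0.6), about -0.4437. A sequence that matches the most common base in every column scores 0, which is the maximum.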
Leave One Out Cross Validation • A cross-validation technique is used to determine the effectiveness of the algorithm • One binding site is left out when the multiple sequence alignment is performed, and the left-out sequence is then scored against that alignment • If the algorithm is effective, the left-out sequence should score higher than most other binding sites within that species (more than 80-90% of them)
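The leave-one-out loop can be sketched in C++ as below. This is an illustration under simplifying assumptions rather than the project's implementation: it treats every site as already aligned and of equal length, so it skips re-running the multiple sequence alignment after a site is removed. The function names `scoreAgainst` and `leaveOneOut` are hypothetical, and the scoring formula with its 0.5 pseudocount is inferred from the scoring example.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Score `seq` against the column counts of `sites` (all the same length),
// using log10((n_j(t_j) + 0.5) / (n_j(0) + 0.5)) summed over positions.
inline double scoreAgainst(const std::vector<std::string>& sites, const std::string& seq) {
    assert(!sites.empty());
    const std::string bases = "ACGT";
    double score = 0.0;
    for (std::size_t j = 0; j < seq.size(); ++j) {
        std::array<int, 4> n{};  // counts of A, C, G, T in column j
        for (const std::string& s : sites) {
            assert(s.size() == seq.size());
            std::size_t idx = bases.find(s[j]);
            if (idx != std::string::npos) ++n[idx];
        }
        std::size_t t = bases.find(seq[j]);
        int observed = (t != std::string::npos) ? n[t] : 0;
        int mostCommon = *std::max_element(n.begin(), n.end());
        score += std::log10((observed + 0.5) / (mostCommon + 0.5));
    }
    return score;
}

// For each site of one transcription factor: leave it out, score it against the
// remaining sites, and report the fraction of `background` sites it outscores.
inline std::vector<double> leaveOneOut(const std::vector<std::string>& tfSites,
                                       const std::vector<std::string>& background) {
    assert(tfSites.size() >= 2 && !background.empty());
    std::vector<double> fractions;
    for (std::size_t i = 0; i < tfSites.size(); ++i) {
        std::vector<std::string> rest;
        for (std::size_t k = 0; k < tfSites.size(); ++k)
            if (k != i) rest.push_back(tfSites[k]);
        double held = scoreAgainst(rest, tfSites[i]);
        std::size_t beaten = 0;
        for (const std::string& other : background)
            if (scoreAgainst(rest, other) < held) ++beaten;
        fractions.push_back(static_cast<double>(beaten) / background.size());
    }
    return fractions;
}
```

Each returned value is the fraction of background sites that the held-out site outscores, so a value above roughly 0.8 to 0.9 would correspond to the target stated on the slide. Here `background` stands for the other binding sites of the same species.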
Implementation • C++ • Input • Multiple Sequence Alignment of a transcription factor’s binding sites • All binding sites of a species • Output • Scores • Results of Leave One Out Cross Validation
Desired Functionality • Handle cases where the sequence to be scored is longer or shorter than the multiple sequence alignment • Slide the sequence over the alignment and take the highest-scoring portion
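One way to read the sliding step is to move the shorter of the two (sequence or alignment) across the longer one and keep the best score over all fully overlapping placements. The C++ sketch below follows that reading; it is an assumption about the intended behaviour, not the project's code. The names `ColumnCounts`, `scoreWindow` and `bestSlidingScore` are hypothetical, and the scoring formula with its 0.5 pseudocount is inferred from the scoring example.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <string>
#include <vector>

using ColumnCounts = std::array<int, 4>;  // counts of A, C, G, T in one column

// Score `len` bases of `seq` starting at `seqStart` against the alignment
// columns starting at `colStart`.
inline double scoreWindow(const std::vector<ColumnCounts>& counts, const std::string& seq,
                          std::size_t seqStart, std::size_t colStart, std::size_t len) {
    const std::string bases = "ACGT";
    double score = 0.0;
    for (std::size_t k = 0; k < len; ++k) {
        const ColumnCounts& col = counts[colStart + k];
        std::size_t t = bases.find(seq[seqStart + k]);
        int observed = (t != std::string::npos) ? col[t] : 0;
        int mostCommon = *std::max_element(col.begin(), col.end());
        score += std::log10((observed + 0.5) / (mostCommon + 0.5));
    }
    return score;
}

// Slide the shorter of (sequence, alignment) across the longer one and
// return the highest score over all fully overlapping placements.
inline double bestSlidingScore(const std::vector<ColumnCounts>& counts, const std::string& seq) {
    assert(!counts.empty() && !seq.empty());
    const std::size_t len = std::min(counts.size(), seq.size());
    double best = -std::numeric_limits<double>::infinity();
    for (std::size_t s = 0; s + len <= seq.size(); ++s)
        for (std::size_t c = 0; c + len <= counts.size(); ++c)
            best = std::max(best, scoreWindow(counts, seq, s, c, len));
    return best;
}
```

When the sequence is shorter than the alignment, the score sums over fewer positions, so such scores may not be directly comparable with full-length ones. Whether and how to normalise for that is left open here.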
Timeline • Oct 4th – Oct 18th: Create multiple sequence alignments for all transcription factors • Oct 18th – Nov 15th: Implement scoring algorithm in C++ • Nov 15th – Nov 29th: Implement leave one out methods • Nov 29th – Dec 6th: Tweaks and improvements
Questions? Image from: http://www.ideacenter.org/contentmgr/showdetails.php/id/954