Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence. LO Leung Yau, 7th May, 2009. Outline: Biological Background • Objective • Current Approaches • Various Models • Problem: Insufficient Data • Proposed Approach • Predict TFBS from protein sequence
Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence LO Leung Yau 7th May, 2009
Outline • Biological Background • Objective • Current Approaches • Various Models • Problem: Insufficient Data • Proposed Approach • Predict TFBS from protein sequence • Predict from protein sequence whether it is a TF • Gene sequence → Protein sequence • Short Term Subtasks • Better Motif Model • Better tool to calculate P-value of Patterns
Biological Background – Cell • Basic unit of organisms • Prokaryotic • Eukaryotic • A bag of chemicals • Metabolism controlled by various enzymes • Correct working needs • Suitable amounts of various proteins Picture taken from http://en.wikipedia.org/wiki/Cell_(biology)
Biological Background – Protein • Polymer of 20 types of Amino Acids • Folds into 3D structure • Shape determines the function • Many types • Transcription Factors • Enzymes • Structural Proteins • … Picture taken from http://en.wikipedia.org/wiki/Protein http://en.wikipedia.org/wiki/Amino_acid
Biological Background – DNA & RNA • DNA • Double stranded • Adenine, Cytosine, Guanine, Thymine • A-T, G-C • Those parts coding for proteins are called genes • RNA • Single stranded • Adenine, Cytosine, Guanine, Uracil Picture taken from http://en.wikipedia.org/wiki/Gene
Biological Background – DNA → RNA → Protein [Figure label: gene] Picture taken from http://en.wikipedia.org/wiki/Gene
Biological Background – DNA → RNA → Protein • A Transcriptional Regulatory Network is the complex interaction between genes, transcription factors (TF) and transcription factor binding sites (TFBS). [Figure labels: Promoter regions, Genes, Transcription Factors, Binding sites, Other functions]
Importance of Inferring Transcriptional Regulatory Network • Reveals the workings of a cell and of life • Related to many diseases • Genetic disorders • Understanding the network will help us • Understand the diseases • Design drugs to cure the diseases • Carry out genetic engineering
Objective • To infer the transcriptional regulatory network (gene network) from genetic and experimental data, utilizing different data sources as and when appropriate
Current Approaches • Main Data Source • Gene Expression Microarray Data • Models • Parts Lists • Topology Models • Control Logic Models • Dynamic Models • Problem • Insufficient Data
Gene Expression Microarray Data • High throughput • Measures RNA level • Relies on A-T, G-C pairing • Can monitor expression of many genes Picture taken from http://en.wikipedia.org/wiki/DNA_microarray_experiment
Gene Expression Microarray Data Picture taken from http://en.wikipedia.org/wiki/DNA_microarray
Various Models of Transcriptional Regulatory Network (Gene Network) • Different levels of detail • Parts Lists • Topology Models • Control Logic Models • Dynamic Models • Boolean Network • Petri Nets • Difference and Differential Equations • Finite State Linear Model (FSLM) • Stochastic Networks [86, 87, 88]
Parts List • The basic components of the gene network that we model • Including • Genes • Transcription Factors • Promoters • Transcription Factor Binding Sites • … [Figure label: gene]
Dynamic Models • Describe and simulate the dynamic changes in the state of the system • Predicting the network’s response to various environmental changes and stimuli. • Boolean Network • Petri Nets • Difference and Differential Equations • Hybrid: Finite State Linear Model (FSLM) • Stochastic Networks
Boolean Network [42, 93, 1, 55]
Boolean Network – Fission Yeast Example • 10 genes • 2^10 = 1024 states [22]
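A minimal sketch of a synchronous Boolean network, to make the "10 genes, 1024 states" count concrete. The three update rules below are a made-up toy network, not the fission yeast model from [22]; the names `step`, `find_attractor` and `all_states` are illustrative only.

```python
from itertools import product

# Hypothetical toy rules (NOT the fission yeast network): each rule maps the
# current state (a tuple of 0/1 values) to the next value of one gene.
TOY_RULES = [
    lambda s: 1 - s[2],     # gene 0 is repressed by gene 2
    lambda s: s[0],         # gene 1 is activated by gene 0
    lambda s: s[0] & s[1],  # gene 2 needs both gene 0 and gene 1
]

def step(state, rules):
    """Synchronous update: every gene is updated from the same old state."""
    return tuple(rule(state) for rule in rules)

def find_attractor(state, rules):
    """Follow the trajectory until a state repeats; return the repeating cycle."""
    seen = {}
    trajectory = []
    while state not in seen:
        seen[state] = len(trajectory)
        trajectory.append(state)
        state = step(state, rules)
    return trajectory[seen[state]:]

def all_states(num_genes):
    """Enumerate the full state space: 2 ** num_genes Boolean states."""
    return list(product((0, 1), repeat=num_genes))

if __name__ == "__main__":
    print(len(all_states(10)))                 # size of a 10-gene state space
    print(find_attractor((0, 0, 0), TOY_RULES))
```

Because the state space doubles with every added gene, exhaustive enumeration like this is only practical for small networks.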
Petri Nets - Example [79, 34, 67, 92]
Difference and Differential Equations • Continuous concentration of various molecules • For difference equation, time is discrete • For differential equation, time is continuous • In general, they have the form [15, 24, 96]
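The formula shown on this slide did not survive in the transcript. A commonly used general form is given here as an assumption, not as the author's exact notation: $x_i$ stands for the concentration of molecule $i$, $n$ for the number of molecules, and $f_i$ for an unspecified regulation function.

$$x_i(t+1) = f_i\big(x_1(t), \ldots, x_n(t)\big) \qquad \text{(difference equation, discrete time)}$$

$$\frac{dx_i}{dt} = f_i\big(x_1, \ldots, x_n\big) \qquad \text{(differential equation, continuous time)}$$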
Difference and Differential Equations • Usually, the interactions are assumed to be linear • The model needs many parameters • Interpretation: a coefficient ≫ 0 for the effect of gene n on gene 1 means gene n activates gene 1
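A minimal sketch of the linear case, assuming a discrete-time update of the form x_i(t+1) = sum_j w[i][j] * x_j(t) + b[i]. The names `weights`, `bias`, `linear_step` and `parameter_count`, and all numbers, are illustrative assumptions rather than anything taken from the slides.

```python
def linear_step(x, weights, bias):
    """One discrete-time step of a linear gene network model.

    weights[i][j] is the assumed effect of gene j on gene i:
    a large positive value means activation, a large negative value repression.
    """
    return [
        sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
        for row, b_i in zip(weights, bias)
    ]

def simulate(x0, weights, bias, steps):
    """Return the trajectory [x(0), x(1), ..., x(steps)]."""
    trajectory = [list(x0)]
    for _ in range(steps):
        trajectory.append(linear_step(trajectory[-1], weights, bias))
    return trajectory

def parameter_count(num_genes):
    """n*n interaction weights plus n bias terms."""
    return num_genes * num_genes + num_genes

# Made-up 3-gene example: gene 2 (last index) activates gene 0.
WEIGHTS = [
    [0.5, 0.0, 0.8],
    [0.0, 0.5, 0.0],
    [0.0, 0.0, 0.5],
]
BIAS = [0.0, 0.0, 0.0]

if __name__ == "__main__":
    print(simulate([1.0, 1.0, 1.0], WEIGHTS, BIAS, steps=3))
    print(parameter_count(10))
```

The quadratic growth in `parameter_count` is why such models need many parameters relative to the number of measurements typically available.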
Finite State Linear Model (FSLM) [91, 2, 66]
Stochastic Networks • In the real world, stochastic effects may play an important role • Some stochastic models have been proposed • Noisy Networks • Probabilistic Boolean Networks • Simulating a stochastic model is more computationally expensive • Depending on the purpose, stochastic models may not be necessary
Problem – Insufficient Data • In microarray data • Many genes • Small number of conditions/time points • Leads to unreliable estimated models [17, 53]
Current Directions to Solve Insufficiency Problem • Analysis Techniques for Small Sample Size • Regularization • Akaike Information Criterion (AIC) • Bayesian Information Criterion (BIC) • Minimum Description Length (MDL) • … • New model • Integrate Multiple Microarray Data • Heterogeneous sources • Different experiment settings [21, 77, 54, 62, 104, 72, 84] [60, 107, 48, 8, 38]
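The information criteria listed above can be sketched as follows, using their standard definitions: AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L, where k is the number of free parameters, n the number of samples and L the maximized likelihood. The function names and the example numbers are illustrative assumptions.

```python
import math

def aic(log_likelihood, num_params):
    """Akaike Information Criterion: lower is better."""
    return 2 * num_params - 2 * log_likelihood

def bic(log_likelihood, num_params, num_samples):
    """Bayesian Information Criterion: penalizes parameters more as samples grow."""
    return num_params * math.log(num_samples) - 2 * log_likelihood

if __name__ == "__main__":
    # Made-up comparison: a sparse model versus a dense one fitted to 20 samples.
    sparse = {"log_likelihood": -52.0, "num_params": 5}
    dense = {"log_likelihood": -48.0, "num_params": 30}
    for name, model in (("sparse", sparse), ("dense", dense)):
        print(name,
              aic(model["log_likelihood"], model["num_params"]),
              bic(model["log_likelihood"], model["num_params"], num_samples=20))
```

With few samples, the penalty terms make a slightly worse-fitting but much smaller model preferable, which is the intent of these criteria in the small-sample setting.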
Proposed Approach – Use Sequence • There is a lot of information in the genome sequence • We should try to use it! [Figure labels: Promoter regions, Genes, Transcription Factors, Binding sites, Other functions]
Proposed Approach – Core Components • Component 1: Binding Sites? • Component 2: Transcription Factor? • Component 3: DNA → RNA → Protein • The interaction between genes can therefore be inferred.
Proposed Approach – Core Components • Our approach gives an initial network! • Can be used together with other approaches [Figure: TF and Gene nodes shown alongside Microarray Data, with some links marked “Missed!” and one marked “Extra!”]
Component 1: Protein Sequence → Binding Sites • Need to predict • Binding domains of a protein • The DNA segment bound by the domain • The pattern bound by the protein • Need to search for occurrences of the pattern • Better motif model is helpful • Protein: …LYDVAEYAGVSYQTVSRVV… • DNA: …gaaggGGTCAAGGTGACCgg… Picture taken from http://en.wikipedia.org/wiki/DNA-binding_domain
Component 2: Protein Sequence → Transcription Factor? • Need to distinguish between • Transcription factors, and • Other proteins • Characteristic motifs in binding domains are helpful features • Protein: …LYDVAEYAGVSYQTVSRVV… [Figure labels: Transcription Factor, Other Proteins]
Component 3: DNA → RNA → Protein Sequence • DNA → pre-mRNA (trivial, only T → U) • Pre-mRNA → mRNA (alternative splicing!) • mRNA → Protein sequence (genetic code of amino acids is known and quite universal) Picture taken from http://en.wikipedia.org/wiki/Alternative_splicing
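A minimal sketch of the two mechanical steps of Component 3, assuming the input is the coding strand and using the standard genetic code. Splicing is deliberately not modelled, since alternative splicing is exactly the non-trivial part; the function names are illustrative.

```python
from itertools import product

# Standard genetic code, with codons ordered U, C, A, G at each position.
_BASES = "UCAG"
_AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
                "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(_BASES, repeat=3), _AMINO_ACIDS)
}

def transcribe(coding_dna):
    """DNA -> RNA for the coding strand: only T is replaced by U."""
    return coding_dna.upper().replace("T", "U")

def translate(mrna):
    """mRNA -> protein, reading codons from position 0 until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)

if __name__ == "__main__":
    mrna = transcribe("ATGTTTGGGTAA")
    print(mrna, translate(mrna))
```

A real pipeline would also need to locate the reading frame and the exon boundaries, which this sketch assumes are already known.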
Proposed Plan and Phases • Preparatory Stage • Main Classifiers Stage • Initial Network Construction & Testing Stage [Status labels on the slide: “Started!”, “Will start soon”]
Short Term Subtasks • Q-gram Indexed Approximate String Matching Tool • Exploring Different Motif Models • Motifs with gaps • Develop an Improved Tool to Search Significant Patterns and Calculate p-value • Deterministic Finite Automata (DFA) • Finite Markov Chain Imbedding (FMCI) • Pattern Markov Chain (PMC) Already Done.
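A minimal sketch of how a DFA combined with a Markov chain over its states can give the probability that a pattern occurs at least once in a random sequence. It assumes an independent, identically distributed background and a single exact pattern; this is a simplified illustration in the spirit of the DFA/FMCI/PMC approaches listed above, not the tool described in the talk. All names are illustrative.

```python
def build_dfa(pattern, alphabet):
    """dfa[state][ch] = length of the longest pattern prefix that is a suffix
    of the text read so far, after reading ch in the given state."""
    m = len(pattern)
    dfa = []
    for state in range(m):
        transitions = {}
        for ch in alphabet:
            seen = pattern[:state] + ch
            k = len(seen)
            while k > 0 and seen[-k:] != pattern[:k]:
                k -= 1
            transitions[ch] = k
        dfa.append(transitions)
    return dfa

def occurrence_probability(pattern, length, probs):
    """P(pattern occurs at least once in an i.i.d. sequence of given length).

    probs maps each symbol to its background probability.
    State len(pattern) is absorbing: the pattern has already been seen.
    """
    m = len(pattern)
    dfa = build_dfa(pattern, list(probs))
    dist = [0.0] * (m + 1)
    dist[0] = 1.0
    for _ in range(length):
        new_dist = [0.0] * (m + 1)
        new_dist[m] = dist[m]  # stay absorbed once matched
        for state in range(m):
            if dist[state] == 0.0:
                continue
            for ch, p in probs.items():
                new_dist[dfa[state][ch]] += dist[state] * p
        dist = new_dist
    return dist[m]

if __name__ == "__main__":
    uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    print(occurrence_probability("TGACC", 1000, uniform))
```

The same dynamic programme extends to counting occurrences (for a p-value of "at least X times") by tracking the count alongside the automaton state.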
Q-gram Indexed Approximate String Matching Tool • IDEA: quickly discard parts of the target which CANNOT contain a match • A kind of pruning • Pruning is a successful strategy in many problems [Figure labels: Pattern; Target (Text/DB/…) sequence; Filtered-out regions, do not bother to do fully sensitive checking]
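A minimal sketch of the filtering idea, under a simplifying assumption: only mismatches are allowed (Hamming distance), not insertions or deletions. If a pattern matches with at most k mismatches, then at least one of k+1 disjoint pieces of the pattern must match exactly, so a q-gram index of the target yields a small set of candidate positions and everything else is discarded without checking. The function names are illustrative and this is not the tool from the talk.

```python
from collections import defaultdict

def build_qgram_index(text, q):
    """Map every q-gram of the target to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(text) - q + 1):
        index[text[i:i + q]].append(i)
    return index

def approx_find(text, pattern, k):
    """Start positions where pattern matches text with at most k mismatches."""
    m = len(pattern)
    q = m // (k + 1)
    if q == 0:
        raise ValueError("pattern too short for this many mismatches")
    index = build_qgram_index(text, q)

    # Filtering: only positions supported by an exact piece hit survive.
    candidates = set()
    for piece in range(k + 1):
        offset = piece * q
        for hit in index.get(pattern[offset:offset + q], ()):
            start = hit - offset
            if 0 <= start <= len(text) - m:
                candidates.add(start)

    # Verification: the full (expensive) check runs on candidates only.
    matches = []
    for start in sorted(candidates):
        window = text[start:start + m]
        mismatches = sum(a != b for a, b in zip(pattern, window))
        if mismatches <= k:
            matches.append(start)
    return matches

if __name__ == "__main__":
    target = "GAAGGGGTCAAGGTGACCGG"
    print(approx_find(target, "GGTCATGGTGACC", k=1))
```

Supporting insertions and deletions would require an edit-distance verification step and a wider candidate window, which are omitted here.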
Exploring Different Motif Model • Popular Motif Model • Position Weight Matrix (PWM) • Assumptions • Fixed-length, contiguous motif • Independence between nucleotide positions • Easily handles wildcards • But difficult to handle gaps • Has been successful on some datasets • But performs poorly on the Tompa (2005) dataset
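A minimal sketch of a PWM, to show where the two assumptions enter: the matrix has one column per position (fixed length, contiguous) and the score of a site is a plain sum over columns (independence between positions). The aligned sites, the pseudocount and the uniform background are made-up illustrative choices.

```python
import math

def build_pwm(sites, alphabet="ACGT", pseudocount=1.0, background=0.25):
    """Log-odds PWM from equal-length aligned sites (one dict per position)."""
    total = len(sites) + pseudocount * len(alphabet)
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in alphabet
        })
    return pwm

def score(pwm, window):
    """Sum of per-position log-odds: positions contribute independently."""
    return sum(column[base] for column, base in zip(pwm, window))

def best_position(pwm, sequence):
    """Start index of the highest-scoring window in the sequence."""
    width = len(pwm)
    return max(range(len(sequence) - width + 1),
               key=lambda i: score(pwm, sequence[i:i + width]))

if __name__ == "__main__":
    # Hypothetical aligned binding sites, for illustration only.
    sites = ["TGACC", "TGACC", "TGACT", "TGATC"]
    pwm = build_pwm(sites)
    print(best_position(pwm, "GGTCAAGGTGACCGG"))
```

A gapped motif does not fit this representation directly, because the number of columns between the two halves would vary from site to site.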
Exploring Different Motif Model • Aim: • To explore if motifs with gaps fit the data • To explore different notions of “over-represented” • Approach: • de novo motif discovery on existing dataset • Assuming different models • Assuming different notions of “over-represented”
Exploring Different Motif Model • Models Tested
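Exploring Different Motif Model – Notions of “over-represented” • Count score: $(s_1 + s_2 + s_3 + s_4)/4$ • P-value: $P(> X \text{ times in background})$, for a pattern observed $X$ times, computed under the background model • Estimated probability: $P(\mathrm{TFBS} \mid c_1, c_2, \ldots, c_4) = \dfrac{P(\mathrm{TFBS})\, P(c_1, c_2, \ldots, c_4 \mid \mathrm{TFBS})}{P(c_1, c_2, \ldots, c_4)}$ [Figure labels: scores $s_1, \ldots, s_4$ and $c_1, \ldots, c_4$]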
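The three notions named on this slide can be sketched as small functions. The slide does not define how the scores, the background distribution or the likelihood terms are obtained, so everything here is an assumption: the p-value is estimated empirically from simulated background counts, and the Bayes terms are passed in directly. All names and numbers are illustrative.

```python
def count_score(scores):
    """Average of the per-sequence scores, e.g. (s1 + s2 + s3 + s4) / 4."""
    return sum(scores) / len(scores)

def empirical_p_value(observed_count, background_counts):
    """Fraction of background samples in which the pattern occurs more than
    observed_count times (strict '>' as written on the slide)."""
    exceed = sum(1 for count in background_counts if count > observed_count)
    return exceed / len(background_counts)

def posterior_tfbs(prior, likelihood, evidence):
    """Bayes' rule: P(TFBS | c) = P(TFBS) * P(c | TFBS) / P(c)."""
    return prior * likelihood / evidence

if __name__ == "__main__":
    print(count_score([1, 2, 3, 4]))
    print(empirical_p_value(3, [1, 2, 3, 4, 5]))
    print(posterior_tfbs(prior=0.1, likelihood=0.8, evidence=0.2))
```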
Preliminary Results – Max F-measure • Recall r = TP/(TP+FN) • Precision p = TP/(TP+FP) • F-Measure = 2pr/(p+r)
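The evaluation measures above can be computed as follows. This is a minimal sketch; the function name and the convention of returning 0.0 when a denominator is zero are assumptions, and the example counts are made up rather than taken from the results.

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F-measure) from raw counts.

    A ratio with a zero denominator is reported as 0.0 (an assumed convention).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

if __name__ == "__main__":
    # Made-up counts: 8 true positives, 2 false positives, 8 false negatives.
    print(precision_recall_f(tp=8, fp=2, fn=8))
```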
Preliminary Results – Tompa • Recall r = TP/(TP+FN) • Precision p = TP/(TP+FP) • F-Measure = 2pr/(p+r)