HISPIG – A Discriminative Model Refinement Approach with Iterations for Detecting Regulatory Regions

HISPIG – A Discriminative Model Refinement Approach with Iterations for Detecting Regulatory Regions Takuma Tsukahara ttsukaha@indiana.edu kobestory@hotmail.com

Milton Taylor Laboratory • Using microarrays and bioinformatics technologies to develop better treatments for HCV (Virahep-C project) • Only known treatment for HCV is treatment with interferon-alpha (IFN-a), or more recently combination treatment of pegylated IFN-a and Ribavirin • Interferons were discovered as proteins that inhibit virus replication, and are induced in mammalian cells in response to virus infection

PBMC Experiment • PBMC was isolated from group of healthy individuals, and treated with IFN-a alone, or with Ribavirin. • By microarray experiment results, expression of large number of genes were either up-regulated or down-regulated • It was of interest to analyze the upstream region of these genes for the presence of motifs (ISRE and GAS)

Goal of My Project • Build a computer model that effectively searches ISRE and GAS sequences in human genes • ISRE/GAS both work as a promoter • ISRE drives the expression of most of type I IFN stimulated genes (and some gamma) • GAS drives the expression of type II IFN stimulated genes • Genes that contain ISRE / GAS express more with IFN than ones that do not • Generalize to be able to search any motif in the future

Type I IFN Signal Transduction IFN/ HETERODIMER 2 p48 1 (IRF-9) JAK1 TYK2 P STAT2 ISGF3 STAT1 Transcription ISRE CYTOPLASM NUCLEUS

The Situation • We have a list of known motifs to refer to • Numerous ISRE and GAS are known and published • We have sets of sequences from microarray experiments that is • likely to contain motifs…S1 (up-regulated genes) • unlikely to contain motifs…S2 (down-regulated genes, and random genes) • To detect motifs, build a model M(+) using the list of known motifs • Occurrences of the model will be detected in both S1 and S2

How to Solve • Still, it is difficult to accurately predict motifs • Motifs are short in length, and also divergent • So, occurrences in S1 and S2 are difficult to distinguish • We overcome this problem by a discriminative model refinement approach • We make two models: M(+)…from known motifs M(-)…from false motifs • Iteratively refine the models, and separate the occurrences in S1 and S2

HISPIG

Methods Used • HMMER • Log-likelihood Method • Both with iterative model refinement approach

HMMER • Detects ISRE and GAS sequences (up-regulated genes, down-regulated genes and random genes) 1. Build a model with a list including known and functional motifs from journals by hmmbuild hmm consensus sequence 2. Parse promoter region of each gene 3. Look for occurrences of the consensus within the promoter region of the three gene groups by hmmsearch

Alignment File (.aln) • List of known motifs – as .aln file • Example of ISRE: IP10 AGGTTTCACTTTCCA ISG15 CAGTTTCGGTTTCCC Factor CAGTTTCTGTTTCCT Tla TAGTTTCACTTTTTG GBP TACTTTCAGTTTCAT ISG20 ATCTTTGACTTTGTC *** ***

Result for INDO gene (2 ISREs) Alignments of top-scoring domains: INDO: domain 1 of 2, from 4901 to 4915: E = 0.0097 *-> g g g a a a . t g a a a c t a<-* + g a a a + t g a a a c + a INDO 4901 TAGAAA a TGAAACCA 4915 INDO: domain 2 of 2, from 5370 to 5384: E = 0.18 *-> g g g a a a . t g a a a c t a <-* g ++ a a + g a a a c t a INDO 5370 TGAGAA a GGAAACTA 5384 negative strand

Iterative Model Refinement Model S1 : Sm n sequences were significant (may be functional) But that is too many to add 1. look for more occurrences Model S1 : Sm+n 2. rank the new sequences Let’s add only relevant k sequences 3. add top k sequences This is my new model for next iteration Model S1 : Sm+k

hmmsearch results (ISRE)

hmmsearch results (GAS)

Problems of hmmsearch • Number of significant motifs detected • ISRE >>> GAS (in terms of e-value) • Cannot tell whether the detected motifs are functional or not • E-value is the only measure • GAS overlap between different gene groups • 25% between up-regulated and random • As in previous slides, occurrences detected from the different gene groups are hard to distinguish

Log Likelihood Method • Calculate scores for each detected motif to tell whether functional, and to discriminate gene groups • Score = log (M(+) / M(-)) • M(+)… Known motifs, M(-)… False motifs • 1 pseudo count for each nucleotide per 10 sequences • If the log-likelihood score for the given motif is • positive… the motif is functional if also have significantly low e-value • negative… the motif is not functional

List of known & functional motifs 1. build model ISRE1 CAGTTT.. ISRE2 TAGTTT.. GAS1 TTTCAA.. Model(+) 2. search occurrences of M(+) in negative model List of false positive motifs 3. build model ISRE1 TACTTT.. ISRE2 AGGCTT.. GAS1 TATGAA.. Model(-) Concept of Models(+/-)

A G C T -3% -14% -12% -71% Base Composition Tweaking • All known functional ISRE has two “TTT”s • Without tweaking, a motif with a “TTT” and a “TCC” will receive high log-likelihood score • To solve this problem, we look for high percentage nucleotides, and make them dominant • Example: base composition of a certain column A G C T -0.1% -0.1% -0.1% -99.7% tweak!

Iteration and Model Refinement Model(+) S(+)1 : S(+)n Model(-) S(-)1 : S(-)n First iteration (model refinement) Model(+) S(+)1 : S(+)n Model(-) S(-)1 : S(-)n Second iteration (model refinement) Model(+) S(+)1 : S(+)n Model(-) S(-)1 : S(-)n

up-regulated genes AVG random genes AVG

Search Result of HISPIG • Numerous potentially functional ISRE and GAS were detected from 100 most up-regulated genes (both known and unknown) • Approximately 80% of the genes had either functional ISRE or GAS • Numerous genes contain unknown functional motifs that match with other gene expression experiments previously shown in journals • All motifs included in the model were concluded to be functional

Improvement of log-likelihood • Re-aligning process of model refinement • Rank sequences that match criteria by 1. e-value 2. log-likelihood score 3. both (not easy to implement algorithm) • Convincing if 2. works better than others • Which model to refine each iteration • Only positive? Only negative? Both?

Measuring the Reliability of the Program • Best Way – Do wet lab experiments to see if a detected unknown motif is really functional • Alternative 1. Remove some known and functional sequences from the initial model 2. See if the program still detects those in the end

Reliability Experiment (ISRE)

AcknowledgementsSun KimMilton TaylorStuart Young

HISPIG – A Discriminative Model Refinement Approach with Iterations for Detecting Regulatory Regions