More on iParameterEstimation
Bob Zimmermann
27 April 2005
iParameterEstimation Data Flow
[Diagram: data flow. Inputs: Sequence (ACTATTACGTATTAGGATCCGAATGAGGATTA…) and Parameter Template. Stages shown: Dispatcher, Feature Mapping, State Annotation, Backend. Other labels: Annotation, gHMM, Model1, Model2, Model3, Model4, and example subsequences (TTAA, AAGG, CCTT, GTATT, TGCA, GCTC, TCCA).]
What Kind of Model?
• Existing Model: Learn XML (not hard)
  • (SDT, LUT, WMM, WAM, CDS, ISO)
• New Model: Learn OO Perl (not too hard)
  • Inherit from the iPE::Model class and count
• AltSplice Model: Invent an algorithm to locate instances (kinda hard)
Overview on Adding Models
• Update DTD
• Update XML*
• Update gHMM.pm
• Add a file YourModel.pm to iPE/Models
• Count and test!
DTD
• XML files are (sometimes) linked to a DTD: Document Type Definition
• Allows us to do some simple error checking before processing
• Is a human- and computer-readable definition of the expected pattern of the data
• DTD : XML :: Class Definition : Object (a DTD is to an XML file what a class definition is to an object)
XML Hierarchy
• XML is a hierarchical data description language
• Example:
    param_template
      …
      sequence_models
        …
        string_model “Stop”
          string_submodel “TAG”
          string_submodel “TGA”
          string_submodel “TAA”
ELEMENTs, ATTLISTs
• The generic hierarchy is described in the DTD:
    <!ELEMENT param_template (author, date)>
    <!ELEMENT author (#PCDATA)>
    <!ELEMENT date (#PCDATA)>
• This might describe the following:
    <?xml version="1.0"?>
    <!DOCTYPE param_template SYSTEM "param_template.dtd">
    <param_template>
      <author>Bob Zimmermann</author>
      <date>4/26/05</date>
    </param_template>
ELEMENTs, ATTLISTs, cont’d
    <!ELEMENT param_template (author, date, states, init_model, trans_model,
                              state_durations, sequence_models, conservation_models)>
    …
    <!ELEMENT sequence_models (string_model|fixed_string_model)+>
    <!ELEMENT string_model (string_submodel|fixed_string_submodel)*>
    <!ELEMENT string_submodel (string_submodel|fixed_string_submodel)*>
    <!ELEMENT fixed_string_submodel (#PCDATA)>
    <!ELEMENT fixed_string_model (#PCDATA)>
    …
ELEMENTs, ATTLISTs, cont’d
• Elements have zero or more attributes and data
    <!ATTLIST sequence_models >
    <!ATTLIST conservation_models >
    <!ATTLIST string_model
        name      CDATA #REQUIRED
        source    CDATA #REQUIRED
        states    CDATA #REQUIRED
        focus     CDATA #REQUIRED
        length    CDATA #REQUIRED
        begin     CDATA #REQUIRED
        end       CDATA #REQUIRED
        data      CDATA ""
        model     (SDT|WAM|WMM|WWAM|LUT|CDS|MIX|ISO|SIG) #REQUIRED
        submodels CDATA #REQUIRED>
Bringing it all together
    <string_model name="Start" model="SDT" source="DNA"
                  states="Einit0 Einit1 Einit2 Einit- Esngl"
                  begin="-6" end="3" focus="6" length="12" submodels="2">
      <string_submodel name="ATG" model="WMM" submodels="0" />
      <fixed_string_submodel name="NNN" model="WMM">
        . . . .
      </fixed_string_submodel>
    </string_model>
    <string_model name="Stop" model="SDT" source="DNA"
                  states="Eterm Eterm0- Eterm1- Eterm2- Esngl"
                  begin="L" end="L+12" focus="3" length="12" submodels="2">
      <string_submodel name="TAA" model="WMM" submodels="0" />
      <string_submodel name="TAG" model="WMM" submodels="0" />
      <string_submodel name="TGA" model="WMM" submodels="0" />
      <fixed_string_submodel name="NNN" model="WMM">
        . . . .
      </fixed_string_submodel>
    </string_model>
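To make the effect of #REQUIRED and the enumerated model attribute concrete, here is a minimal Perl sketch of the same check done by hand. It assumes the attributes of one string_model element have already been parsed into a hash; the helper name check_string_model is hypothetical and not part of iPE. In practice a validating XML parser performs this check from the DTD itself.

```perl
use strict;
use warnings;

# Attribute names and model types copied from the string_model ATTLIST.
my @REQUIRED = qw(name source states focus length begin end model submodels);
my %MODEL_OK = map { $_ => 1 } qw(SDT WAM WMM WWAM LUT CDS MIX ISO SIG);

# Hypothetical helper (not part of iPE): returns a list of problems found
# in a hash of attribute name => value pairs for one string_model element.
sub check_string_model {
    my (%attr) = @_;
    my @errors;
    for my $name (@REQUIRED) {
        push @errors, "missing required attribute '$name'"
            unless defined $attr{$name};
    }
    if (defined $attr{model} && !$MODEL_OK{ $attr{model} }) {
        push @errors, "model '$attr{model}' is not an allowed value";
    }
    return @errors;
}

# The "Stop" model, with the length attribute deliberately left out.
my @errors = check_string_model(
    name   => 'Stop', model => 'SDT', source => 'DNA',
    states => 'Eterm Eterm0- Eterm1- Eterm2- Esngl',
    begin  => 'L', end => 'L+12', focus => '3', submodels => '2',
);
print "$_\n" for @errors;   # prints: missing required attribute 'length'
```

The data attribute is not in the required list because the ATTLIST gives it a default of "".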
OO PERL
• Objects: hash references ($object->{member})
• Classes: Packages
• Methods: subs ($object->method(arg);)
• Inheritance: @ISA, use base ("");
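The four points above fit in one small file. This is a minimal sketch with invented class names (Counter, WeightedCounter) that have nothing to do with iPE; it only illustrates a blessed hash reference as the object, a package as the class, subs as methods, and @ISA for inheritance.

```perl
use strict;
use warnings;

# A class is a package; an object is a blessed hash reference.
package Counter;

sub new {
    my ($class, %args) = @_;
    my $self = { count => $args{start} // 0 };   # members live in the hash
    return bless $self, $class;
}

# A method is a sub whose first argument is the object.
sub add {
    my ($self, $n) = @_;
    $self->{count} += $n;
    return $self;
}

sub count {
    my ($self) = @_;
    return $self->{count};
}

# Inheritance: list the parent class in @ISA.
package WeightedCounter;
our @ISA = ('Counter');

# Override add() to scale each increment, then defer to the parent.
sub add {
    my ($self, $n, $weight) = @_;
    return $self->SUPER::add($n * $weight);
}

package main;

my $c = WeightedCounter->new(start => 1);   # new() is inherited from Counter
$c->add(2, 0.5);
print $c->count, "\n";   # prints 2
```

When each class lives in its own .pm file, `use base` or `use parent` loads the parent module and sets @ISA in one step, which is why the slide lists both forms.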
iPE Object Hierarchy, Revisited
[Diagram: class hierarchy. Labels shown: Estimator, AnnotatedSequence, Model, Locus, Duration, Emission, Transition, Initial, Explicit, WMM, Geometric, LUT, …]
Extending the Model Base Class
• Container for an array of scalar values, representing the parameters
• Update iPE/gHMM.pm
• Add a new .pm file to the Model directory
What You Will Be Responsible For
• Construction
• Zeroing out (pseudocounting) your parameters and null parameters
• Counting positives and nulls
  • Apply a weighted count to every base you see
• Normalizing, calculating the log-odds
• Outputting a Zoe header
  • Other output formats will have an auto-generated header
• Outputting your parameters
  • Whatever state you are in (counted, normalized, or converted to log-odds), print the parameters tab-separated and human-readable
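The responsibilities above can be sketched with a toy base-composition model that keeps one parameter per nucleotide. This is only an illustration under stated assumptions: the class ToyModel and all of its method names are invented, it does not inherit from the real iPE::Model (whose interface is not shown in these slides), and the Zoe header is left out because its format is not given here.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a model class; not the iPE::Model interface.
package ToyModel;

my @BASES = qw(A C G T);

# Construction: create empty tables, then pseudocount them.
sub new {
    my ($class, %args) = @_;
    my $self = bless { name => $args{name}, pos => {}, null => {} }, $class;
    $self->clear($args{pseudocount} // 1);
    return $self;
}

# Reset every positive and null parameter to the pseudocount.
sub clear {
    my ($self, $pseudo) = @_;
    for my $base (@BASES) {
        $self->{pos}{$base}  = $pseudo;
        $self->{null}{$base} = $pseudo;
    }
}

# Add a weighted count for every known base seen in the sequence.
# $which is 'pos' or 'null'.
sub count {
    my ($self, $which, $seq, $weight) = @_;
    for my $base (split //, uc $seq) {
        $self->{$which}{$base} += $weight if exists $self->{$which}{$base};
    }
}

# Turn counts into probabilities that sum to 1 within each table.
sub normalize {
    my ($self) = @_;
    for my $which (qw(pos null)) {
        my $total = 0;
        $total += $_ for values %{ $self->{$which} };
        $_ /= $total for values %{ $self->{$which} };
    }
}

# Log-odds (base 2) of the positive over the null parameter for one base.
sub log_odds {
    my ($self, $base) = @_;
    return log($self->{pos}{$base} / $self->{null}{$base}) / log(2);
}

# Parameters as one tab-separated, human-readable line (order: A C G T).
sub output {
    my ($self) = @_;
    return join("\t", map { sprintf "%.3f", $self->log_odds($_) } @BASES) . "\n";
}

package main;

my $model = ToyModel->new(name => 'toy', pseudocount => 1);
$model->count('pos',  'A' x 12, 0.5);   # twelve As at weight 0.5 add 6 to A
$model->count('null', 'ACGT',   1);     # uniform background
$model->normalize;
print $model->output;   # prints 1.485, -1.322, -1.322, -1.322 separated by tabs
```

Starting every parameter at a pseudocount keeps the log-odds finite even for bases that never appear in the training data.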
Future Topics
• Feature-level Parallelization
• Cluster Parallelization
• EM (Baum-Welch) Estimator