130 likes | 292 Views
Annotating States Poorly, The Hazards of Perl SVs, and Annotating States Well. or, “Oh, the things I know” Bob Zimmermann 2005 June 6. The Problem. In iParameterEstimation, we want to count overlapping transcripts with the appropriate amount of “confidence” we have in the state being there.
E N D
Annotating States Poorly, The Hazards of Perl SVs, and Annotating States Well or, “Oh, the things I know” Bob Zimmermann 2005 June 6
The Problem • In iParameterEstimation, we want to count overlapping transcripts with the appropriate amount of “confidence” we have in the state being there. Einit0 Intron0 Exon0 Einit0 Intron0 Exon0
The Solution • Create a “Layered Annotation” of states, and weight each count depending on the number of features we see at this position Einit0 Intron0 Exon0 Einit0 Intron0 Exon0
The Solution, Cont’d • Group these positions into regions of consistent layering Einit0 Intron0 Exon0 Einit0 Intron0 Exon0
Original Approach • Create a set of layers on the fly, as necessary. • Each layer is an array of positions indicating if a state is present. for each transcript { for each layer { check if layer is covered for the transcript region.* if not, our_layer = this_layer. } if our_layer is not defined, create our_layer = new layer put transcript on our_layer.** } • * is O(#transcripts*mean_trans_len*#layers) • ** is O(#transcripts*mean_trans_len)
Nope! • It ran for 20 minutes before running out of memory on my machine. • What I should have checked: /bio/pro/Homo_sapiens/assembly/hg17/nscan_mgc_1/predictions/1/chr1.eval: Transcript All Count 1790.00 Average Length 44306.74 • This makes O(#transcripts*mean_trans_len) > O(#transcripts2)
Why Did It Run Out of Memory, Though? • It was a ridiculous idea. • In C, we would use a set of pointers to the state sequences. • Pointer = 8 bytes, chr1 = 246 mb • 2 GB for each layer. • chr1 required 15 layers (ugh!) • The situation is much worse in Perl, though.
Perl’s SV • Perl variables are represented as a Scalar Value: • $a = 10; • This value can be converted to a string: • $a .= “ Ten”; • ($a eq “10 Ten”) == 1 • Perl retains the string value and integer value.
HOW? • Perl interpreter is a C program. • All SVs are represented with a two tier struct: $a = “hello world”; $a = 15602;
More Concretely, struct STRUCT_SV { /* struct sv { */ void* sv_any; /* pointer to something */ U32 sv_refcnt; /* how many references to us */ U32 sv_flags; /* what we are */ }; struct xpvnv { char * xpv_pv; /* pointer to malloced string */ STRLEN xpv_cur; /* length of xpv_pv as a C string */ STRLEN xpv_len; /* allocated size */ IV xiv_iv; /* integer value or pv offset */ NV xnv_nv; /* numeric value, if any */ };
What Can We Learn From This? • Size is 28 bytes, at least. • For a long sequence, something like • @arr = split(“”, $seq); is a very bad idea. • The good news: • Perl strings scale well in memory with length (about the same as C strings) • Allocation is sophisticated, and often smarter than most C code
#!/usr/bin/perl use Devel::Peek ‘Dump’; $a = 10; Dump $a; print $a."\n"; Dump $a; SV = IV(0x80ad60) at 0x809f90 REFCNT = 1 FLAGS = (IOK,pIOK) IV = 10 10 SV = PVIV(0x801820) at 0x809f90 REFCNT = 1 FLAGS = (IOK,POK,pIOK,pPOK) IV = 10 PV = 0x101bd0 "10"\0 CUR = 2 LEN = 3 Another Cool Thing to Note:
The Happy Ending • Transcripts are represented by coordinate tuples (min,max) • Sort by min coordinate - O(Nlog N) • N is the total number of transcripts • Gather overlapping transcripts into a Layered Annotation O(n2). • n is the average number of overlapping transcripts (something slightly more than 1.3). • When a transcript doesn’t overlap, tie it up and repeat. • Takes <1 minute for chr1 and chr2 and uses little memory.