1 / 13

Annotating States Poorly, The Hazards of Perl SVs, and Annotating States Well

Annotating States Poorly, The Hazards of Perl SVs, and Annotating States Well. or, “Oh, the things I know” Bob Zimmermann 2005 June 6. The Problem. In iParameterEstimation, we want to count overlapping transcripts with the appropriate amount of “confidence” we have in the state being there.

noel
Download Presentation

Annotating States Poorly, The Hazards of Perl SVs, and Annotating States Well

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Annotating States Poorly, The Hazards of Perl SVs, and Annotating States Well or, “Oh, the things I know” Bob Zimmermann 2005 June 6

  2. The Problem • In iParameterEstimation, we want to count overlapping transcripts with the appropriate amount of “confidence” we have in the state being there. Einit0 Intron0 Exon0 Einit0 Intron0 Exon0

  3. The Solution • Create a “Layered Annotation” of states, and weight each count depending on the number of features we see at this position Einit0 Intron0 Exon0 Einit0 Intron0 Exon0

  4. The Solution, Cont’d • Group these positions into regions of consistent layering Einit0 Intron0 Exon0 Einit0 Intron0 Exon0

  5. Original Approach • Create a set of layers on the fly, as necessary. • Each layer is an array of positions indicating if a state is present. for each transcript { for each layer { check if layer is covered for the transcript region.* if not, our_layer = this_layer. } if our_layer is not defined, create our_layer = new layer put transcript on our_layer.** } • * is O(#transcripts*mean_trans_len*#layers) • ** is O(#transcripts*mean_trans_len)

  6. Nope! • It ran for 20 minutes before running out of memory on my machine. • What I should have checked: /bio/pro/Homo_sapiens/assembly/hg17/nscan_mgc_1/predictions/1/chr1.eval: Transcript All Count 1790.00 Average Length 44306.74 • This makes O(#transcripts*mean_trans_len) > O(#transcripts2)

  7. Why Did It Run Out of Memory, Though? • It was a ridiculous idea. • In C, we would use a set of pointers to the state sequences. • Pointer = 8 bytes, chr1 = 246 mb • 2 GB for each layer. • chr1 required 15 layers (ugh!) • The situation is much worse in Perl, though.

  8. Perl’s SV • Perl variables are represented as a Scalar Value: • $a = 10; • This value can be converted to a string: • $a .= “ Ten”; • ($a eq “10 Ten”) == 1 • Perl retains the string value and integer value.

  9. HOW? • Perl interpreter is a C program. • All SVs are represented with a two tier struct: $a = “hello world”; $a = 15602;

  10. More Concretely, struct STRUCT_SV { /* struct sv { */ void* sv_any; /* pointer to something */ U32 sv_refcnt; /* how many references to us */ U32 sv_flags; /* what we are */ }; struct xpvnv { char * xpv_pv; /* pointer to malloced string */ STRLEN xpv_cur; /* length of xpv_pv as a C string */ STRLEN xpv_len; /* allocated size */ IV xiv_iv; /* integer value or pv offset */ NV xnv_nv; /* numeric value, if any */ };

  11. What Can We Learn From This? • Size is 28 bytes, at least. • For a long sequence, something like • @arr = split(“”, $seq); is a very bad idea. • The good news: • Perl strings scale well in memory with length (about the same as C strings) • Allocation is sophisticated, and often smarter than most C code

  12. #!/usr/bin/perl use Devel::Peek ‘Dump’; $a = 10; Dump $a; print $a."\n"; Dump $a; SV = IV(0x80ad60) at 0x809f90 REFCNT = 1 FLAGS = (IOK,pIOK) IV = 10 10 SV = PVIV(0x801820) at 0x809f90 REFCNT = 1 FLAGS = (IOK,POK,pIOK,pPOK) IV = 10 PV = 0x101bd0 "10"\0 CUR = 2 LEN = 3 Another Cool Thing to Note:

  13. The Happy Ending • Transcripts are represented by coordinate tuples (min,max) • Sort by min coordinate - O(Nlog N) • N is the total number of transcripts • Gather overlapping transcripts into a Layered Annotation O(n2). • n is the average number of overlapping transcripts (something slightly more than 1.3). • When a transcript doesn’t overlap, tie it up and repeat. • Takes <1 minute for chr1 and chr2 and uses little memory.

More Related