Biology is the science of reverse-engineering life
• Living organisms are molecular machines capable of replicating themselves
• The functional unit of life is the “cell”
• A cell is 1-100 µm in diameter
• Each cell contains a primary information store: the genome
The Structure of a Genome
• Generally one or several strands of a polymer, called DNA, packaged into “chromosomes”
• Information is encoded as the order of the monomer subunits (of types “A”, “C”, “G”, “T”) in the linear polymer
• Each cell carries the entire genome of the organism
The Nature of DNA
• Linear, water-soluble, molecular data storage
• The polymer is actually “double-stranded”: each strand is the “reverse-complement” of the other
• The double strand is 2 nm in diameter
• Each monomer unit (“base”) added lengthens the strand by 0.34 nm
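To make the "reverse-complement" relationship concrete, here is a minimal Perl sketch. The function name `reverse_complement` is chosen for illustration only and does not come from the slides; the sketch assumes the standard pairing A with T and C with G, and leaves any other character untouched.

```perl
use strict;
use warnings;

# Return the reverse complement of a DNA string (A<->T, C<->G).
# Characters other than ACGT/acgt (for example "N" or "X") are left unchanged.
sub reverse_complement {
    my ($dna) = @_;
    my $rc = reverse $dna;           # scalar context: reverse the string
    $rc =~ tr/ACGTacgt/TGCAtgca/;    # swap each base for its pairing partner
    return $rc;
}

print reverse_complement("GATTACA"), "\n";   # prints TGTAATC
```

Applying the function twice returns the original string, which is one way to see that the two strands carry the same information.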
DNA Storage Density
• Genome length of an average bacterium is 2 megabases (Mb)
• The human genome is 3 gigabases (Gb)
• A typical DNA “prep” solution contains about 25 petabytes/ml (1 ml is about 20 drops of a liquid)
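The genome sizes above can be turned into physical length and raw information content with a little arithmetic. The sketch below uses the 0.34 nm per base figure from the "Nature of DNA" slide and assumes 2 bits per base (four symbols, so log2(4) = 2); that encoding assumption and the helper names are not from the slides.

```perl
use strict;
use warnings;

my $NM_PER_BASE   = 0.34;   # strand length added per base, in nanometres
my $BITS_PER_BASE = 2;      # assumption: 4 possible bases => 2 bits each

# Physical length of a genome in metres.
sub genome_length_m {
    my ($bases) = @_;
    return $bases * $NM_PER_BASE * 1e-9;
}

# Raw information content of a genome in bytes.
sub genome_bytes {
    my ($bases) = @_;
    return $bases * $BITS_PER_BASE / 8;
}

for my $genome ( [ "average bacterium", 2e6 ], [ "human", 3e9 ] ) {
    my ( $name, $bases ) = @$genome;
    printf "%-18s %.3g m of DNA, %.3g bytes\n",
        $name, genome_length_m($bases), genome_bytes($bases);
}
```

Under these assumptions a 2 Mb bacterial genome is about 0.68 mm long and holds roughly 500 kB, while the 3 Gb human genome is about 1 m long and holds roughly 750 MB.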
Perl and Genomics
• Good:
  • Perl is quick to write
  • Excellent for parsing
  • DWIM (“do what I mean”) is good for the typical biologist
• Bad:
  • Not as fast at run time
• Result: Perl is frequently used as middleware
My Perl scripts
• 167 in my /bin
• Most are for either dealing with system stuff or parsing output from other programs
• A few are meant to directly analyze “sequence data”
Example Sequence Analysis Program
• ssr3.pl
• “ssr” is “simple sequence repeat”, aka “microsatellite”
• E.g.:

>Echinomicrosat_01_B04_T7
XXXXXCAGAAGCGCTTCACAATTAAAAGCAAATCATACAAATATGATCAT
CAGGCAGGCTATTTGAACACACTGTTTCGCACTGAACTCATAGTCACATT
TCAGTCGTTCAGTGAGATGATTCATATGGCATAATTTGAACTGACGTTCG
CTCTGACTATCGTTCAGCTCGTTGTGGGCACAATCGTTAGTCAGTTCGTT
CACTCAACCACACACACACACACACACACGGAAACATCAGATTCGAGCTA
AGCTCTTATTACAGCTGATCAGTAGGAGCACTGTTAGACAGTCTACTAAA
TCAATATCAATTATCCCCCCCACACAACCATGGCTTCTGXXXXX
Example run of ssr3.pl

% ssr3.pl Echinomicrosat_01_B04_T7.fasta
Name                        Seq Len    Range      # of repetitions of sub unit    Sub unit
Echinomicrosat_01_B04_T7    344        209-228    10                              of repeat "CA"
-----------------------------
>Echinomicrosat_01_B04_T7
XXXXXCAGAAGCGCTTCACAATTAAAAGCAAATCATACAAATATGATCAT
CAGGCAGGCTATTTGAACACACTGTTTCGCACTGAACTCATAGTCACATT
TCAGTCGTTCAGTGAGATGATTCATATGGCATAATTTGAACTGACGTTCG
CTCTGACTATCGTTCAGCTCGTTGTGGGCACAATCGTTAGTCAGTTCGTT
CACTCAACCACACACACACACACACACACGGAAACATCAGATTCGAGCTA
AGCTCTTATTACAGCTGATCAGTAGGAGCACTGTTAGACAGTCTACTAAA
TCAATATCAATTATCCCCCCCACACAACCATGGCTTCTGXXXXX
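The slides do not show how ssr3.pl reads its FASTA input. As a rough sketch of one common approach, the code below loads records into a hash keyed by sequence name. The function name `read_fasta` and the in-memory demo record are hypothetical, and this is not necessarily how the original script does it.

```perl
use strict;
use warnings;

# Parse FASTA text from a filehandle into a hash: name => sequence.
sub read_fasta {
    my ($fh) = @_;
    my ( %sequences, $name );
    while ( my $line = <$fh> ) {
        $line =~ s/\s+\z//;              # strip newline and trailing whitespace
        next if $line eq '';
        if ( $line =~ /^>(\S+)/ ) {      # header line: first word is the name
            $name = $1;
            $sequences{$name} = '';
        }
        elsif ( defined $name ) {
            $line =~ s/\s+//g;           # drop any whitespace inside the line
            $sequences{$name} .= $line;  # a sequence may span many lines
        }
    }
    return %sequences;
}

# Usage with an in-memory filehandle (hypothetical two-line record).
my $fasta = ">demo_read\nXXXXXCAGAAG\nCGCTTCACAA\n";
open my $fh, '<', \$fasta or die "cannot open in-memory file: $!";
my %sequences = read_fasta($fh);
close $fh;

print "$_\t", length( $sequences{$_} ), "\n" for sort keys %sequences;
```

Joining the lines of each record into one string matters for repeat finding, because a repeat may straddle a line break in the file.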
ssr3.pl core routine:

    while ( $sequences{$x} =~ m/
                # Capture each ssr sub-unit within tolerance
                # Note "?" for lazy capture. Ensures "AC" is
                # the repeat unit instead of "ACAC" for example
                ([ACGT]{$min_repeat_unit_len,$max_repeat_unit_len}?)
                \1{$min_repeat_num,}
            /gix ) {
        my $repeat_unit  = $1;
        my $start_of_ssr = $-[0]+1;
        my $end_of_ssr   = $+[0];
        my $ssr          = $&;
        my $ssr_length   = length($ssr)/length($repeat_unit);
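The excerpt above stops partway through the loop. The following self-contained sketch applies the same lazy-capture-plus-backreference idea to the example read. It is not the original ssr3.pl: the function name `find_ssrs`, the option names, and the thresholds (repeat units of 2 to 6 bases, at least 5 copies) are assumptions made for illustration, since the slides do not give the real parameter values.

```perl
use strict;
use warnings;

# Find simple sequence repeats in a DNA string.
# Option names and defaults are illustrative assumptions, not ssr3.pl's.
sub find_ssrs {
    my ( $seq, %opt ) = @_;
    my $min_unit    = $opt{min_unit}    // 2;
    my $max_unit    = $opt{max_unit}    // 6;
    my $min_repeats = $opt{min_repeats} // 5;    # total copies of the unit
    my $extra       = $min_repeats - 1;          # copies needed after the first

    # Lazy quantifier: try the shortest unit first, so "CA" wins over "CACA".
    my $re = qr/([ACGT]{${min_unit},${max_unit}}?)\1{${extra},}/i;

    my @hits;
    while ( $seq =~ /$re/g ) {
        my $unit  = $1;
        my $start = $-[0] + 1;                   # 1-based start of the match
        my $end   = $+[0];                       # 1-based inclusive end
        push @hits, {
            unit    => uc($unit),
            start   => $start,
            end     => $end,
            repeats => ( $end - $start + 1 ) / length($unit),
        };
    }
    return @hits;
}

# The example read from the slides, joined into one string.
my $demo_seq = join '', qw(
    XXXXXCAGAAGCGCTTCACAATTAAAAGCAAATCATACAAATATGATCAT
    CAGGCAGGCTATTTGAACACACTGTTTCGCACTGAACTCATAGTCACATT
    TCAGTCGTTCAGTGAGATGATTCATATGGCATAATTTGAACTGACGTTCG
    CTCTGACTATCGTTCAGCTCGTTGTGGGCACAATCGTTAGTCAGTTCGTT
    CACTCAACCACACACACACACACACACACGGAAACATCAGATTCGAGCTA
    AGCTCTTATTACAGCTGATCAGTAGGAGCACTGTTAGACAGTCTACTAAA
    TCAATATCAATTATCCCCCCCACACAACCATGGCTTCTGXXXXX
);

my @hits = find_ssrs($demo_seq);
printf "%d-%d\t%d of repeat \"%s\"\n", @{$_}{qw(start end repeats unit)}
    for @hits;
```

With these assumed thresholds the hits should include the "CA" repeat at positions 209-228 (10 copies) reported in the example run. Using `@-` and `@+` for the match offsets, as the original excerpt does, avoids a separate search for the match position.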