Short introduction to Perl & GFF
Marcus Ronninger
The Linnaeus Centre for Bioinformatics

Motivation
• Bioinformatics yields lots of information
• The information has to be mined
• Text files have to be built or modified
• Small changes can take a long time when there is a lot of data
• Example: change every letter to lower case
• With a script this can be done in less than a second

Perl
• Practical Extraction and Report Language
• Scripts
• Object-oriented programming
• Graphical web interfaces, CGI
• Possibilities
• BioPerl

Example
Example of a very simple Perl script, to_lower_case.pl:

    #!/usr/bin/perl -w
    use strict;

    my $seqfile = $ARGV[0];
    my $outfile = $ARGV[1];

    open(my $seq, '<', $seqfile) || die "Can't open file: $seqfile ($!)";
    open(my $out, '>', $outfile) || die "Can't open file: $outfile ($!)";

    while (<$seq>) {
        if (/^>/) {
            print $out $_;        # line starting with >: copy unchanged
        }
        else {
            print $out lc($_);    # any other line: convert to lower case
        }
    }

    close($seq);
    close($out);

Useful tools for parsing files
• Scalar: $
• Array: @
• Regular expression: /\.fasta/
• Split: @chars = split //, $word
• Substitute: s/old-regex/new-string/
• Upper and lower case: uc, lc
• Escape sequences and character classes: \n, \t, \s, etc.
• Subroutines: sub
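
The tools listed above can be combined in a few lines. The following is only a minimal sketch: the header string, the sequence and the subroutine name count_chars are made up for illustration and do not come from the slides.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical header and sequence, for illustration only.
my $header = ">seq1.fasta\tHomo sapiens";
my $word   = "acgt";

# Scalar ($) and array (@): split a word into single characters.
my @chars = split //, $word;

# Regular expression: does the header mention ".fasta"?
my $is_fasta = ($header =~ /\.fasta/) ? 1 : 0;

# Substitute: copy the word, then replace every t with u in the copy.
(my $rna = $word) =~ s/t/u/g;

# Upper and lower case.
my $upper = uc $word;
my $lower = lc $upper;

# sub: a small reusable subroutine.
sub count_chars {
    my ($string) = @_;
    return length $string;
}

print "chars: @chars\n";
print "is fasta: $is_fasta\n";
print "rna: $rna\n";
print "upper: $upper, lower: $lower\n";
print "length: ", count_chars($word), "\n";
```

Copying the scalar before substituting keeps the original $word untouched, which is usually what is wanted when the same sequence is reused later in a script.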

General feature format, GFF
• Also known as "gene-finding format"
• A format for handling output from different feature-finding programs
• Processes can be decoupled, but the results can still be put together
• Makes it easy to include external algorithms

General feature format
The construction of the format is very simple. The values are tab-delimited.

    SEQ1    EMBL    atg     103    105    .    +    0
    SEQ1    EMBL    exon    103    172    .    +    0
    1.      2.      3.      4.     5.     6.   7.   8.

1. Sequence name
2. Source of the feature
3. Feature type
4. Start
5. End
6. Score: most feature-finding programs have some kind of score for the found motif
7. Strand: either + or -
8. Frame: 0, 1, 2 or .
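
Because the values are tab-delimited, a GFF line can be taken apart with a single split. The sketch below assumes the eight columns listed above; the column names in @columns and the subroutine name parse_gff_line are chosen here for illustration and are not part of the format.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative names for the eight fields listed on the slide.
my @columns = qw(seqname source feature start end score strand frame);

# Split one tab-delimited GFF line into a hash keyed by column name.
# Returns nothing (undef in scalar context) if the line has too few fields.
sub parse_gff_line {
    my ($line) = @_;
    chomp $line;
    my @values = split /\t/, $line;
    return unless @values >= @columns;
    my %record;
    @record{@columns} = @values[0 .. $#columns];
    return \%record;
}

# The exon line from the slide, written with explicit tabs.
my $line   = join("\t", qw(SEQ1 EMBL exon 103 172 . + 0)) . "\n";
my $record = parse_gff_line($line);

# Start and end are both inclusive, so the length is end - start + 1.
my $length = $record->{end} - $record->{start} + 1;

print "$record->{feature} on $record->{seqname}: ",
      "$record->{start}..$record->{end} ($length bp, strand $record->{strand})\n";
```

Keying the fields by name rather than by index makes later filtering, for example on feature type or strand, easier to read.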

Small example
A small script that transforms known transcription factor binding sites into a .gff file. The input file:

    #Gfap
    #Known TFBS (Besnard et al 1991)
    #count backwards from the TSS
    #start -14
    AP-2: ccccaccccc -101
    NF-1: tgggctgcggccca -116
    Hgcs: ctgggctgcggc -117
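
Example
Basically the same procedure as the Perl example above:

    my $seqlength = 5000;
    my $gff = "";
    my $rel_start;

    # LIT is a filehandle already opened on the input file
    while (<LIT>) {
        if (/^#start/) {
            $rel_start = $';    # text after the match, e.g. " -14"
        }
        elsif (!/^#/ && /\w+/) {
            # not a comment and not empty: a binding site line
            make_gff($_, $rel_start, "Literature");
        }
    }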
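
The loop picks up the offset after #start through $', Perl's post-match variable, which holds everything that follows the matched text. The sketch below shows what it contains for the header line of the example input, next to a capture-group alternative. The alternative is a stylistic suggestion, not the author's code.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line = "#start -14\n";   # header line from the example input

# Post-match variable: everything after the matched text.
my $postmatch;
if ($line =~ /^#start/) {
    $postmatch = $';         # " -14\n"
}

# Used as a number, the leading space and trailing newline are ignored.
my $from_postmatch = $postmatch + 0;

# Alternative: capture the signed integer directly.
my $from_capture;
if ($line =~ /^#start\s+(-?\d+)/) {
    $from_capture = $1;
}

print "post-match: $from_postmatch, capture: $from_capture\n";
```

The capture group states exactly which part of the line is expected to be a number, so a malformed #start line simply fails to match instead of producing an unexpected value.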

Example
The subroutine make_gff, called from the loop, builds the GFF lines:

    sub make_gff {
        my ($seq, $rs, $type) = @_;
        my ($start, $stop, $sign);
        my @feature = split(/\s+/, $seq);
        # now the array has the feature information
        if ($type eq "Literature") {
            $start = $seqlength + $rs + $feature[2];
            $stop  = $start + length($feature[1]) - 1;
            $sign  = '.';
            # $seqlength and $gff are set in the main script
            $gff .= "$feature[0]\t$type\t$feature[0]\t$start\t$stop\tundef\t$sign\t$sign\n";
        }
        etc.

Example
Output: a file named lit.gff with the following contents

    AP-2:   Literature   AP-2:   4886   4895   undef   .   .
    NF-1:   Literature   NF-1:   4871   4884   undef   .   .
    Hgcs:   Literature   Hgcs:   4870   4881   undef   .   .

This can now be loaded into programs that support the GFF format, e.g. Apollo.

Apollo
• GFF files are tedious to read as plain text
• Use graphical software instead
• Apollo, a sequence annotation editor
• Great for viewing GFF files together with the sequence

References
• Tisdall, J. D., "Beginning Perl for Bioinformatics", O'Reilly, 2001
• http://www.sanger.ac.uk/Software/formats/GFF/
• http://www.fruitfly.org/annot/apollo/