BioPerl

BioPerl Group Presentation 28th April, 2003

BioPerl Group Members • Manish Anand • Jonathan Nowacki • Ravi Bhatt • Arvind Gopu

Introduction • Objective of BioPerl: • Develop reusable, extensible core Perl modules for use as a standard for manipulating molecular biological data. • Background: • Started in 1995 • One of the oldest open source Bioinformatics Toolkit Project

Motivations: • Biological data is complex. Standard modules will provide consistent encapsulation that will avoid the independent development of similar but incompatible modules. • Functions created by other developers that operate on common BioPerl modules can be linked at will. • Existing functionality of well-tested modules can be reused and extended through inheritance. • Component-based design allows for complex yet maintainable programs.

Why Perl ? • Powerful text parser & manipulator. • Flexible syntax for either OO or non-OO design. • Rapid development of easily extendible, adaptable OO modules and schemas. • Component-oriented. • Easy Web CGI scripting.

So what is BioPerl? • Higher level of abstraction • Re-usable collection of Perl modules that facilitate bioinformatics application development: • Accessing databases with different formats • Sequence manipulation • Execution and Parsing of the results of molecular biology programs • Catch? BioPerl does not include programs like Blast, ClustalW, etc • Uses system calls to execute external programs

So what is BioPerl? (continued…) • 551 modules (incl. 82 interface modules) • 37 module groups • 79,582 lines of code (223,310 lines total) • 144 lines of code per module • For More info: BioPerl Module Listing

mean 144 Some statistics

Downloading modules • Modules can be obtained from: • www.CPAN.org (Perl Modules) • www.BioPerl.org (BioPerl Modules) • Downloading modules from CPAN • Interactive mode • perl -MCPAN -e shell • Batch mode • use CPAN; • clean, install, make, recompile, test

Installing Modules • Steps for installing modules • Uncompress the module • Gunzip file.tar.gz • Tar –xvf file.tar • perl Makefile.PL • make • make test • make install

Installation on Windows (ActiveState) • Using PPM shell to install BioPerl • Get the number of the BioPerl repository: • PPM>repository • Set the BioPerl repository, find BioPerl, install BioPerl: • PPM>repository set <BioPerl repository number> • PPM>search * • PPM>install <BioPerl package number> • Download BioPerl in archive form from • http://www.BioPerl.org/Core/Latest/index.shtml • Use winzip to uncompress and install

Directory Structure • BioPerl directory structure organization: • Bio/ BioPerl modules • models/ UML for BioPerl classes • t/ Perl built-in tests • t/data/ Data files used for the tests • scripts/ Reusable scripts that use BioPerl • scripts/contributed/ Contributed scripts not necessarily integrated into BioPerl. • doc/ "How To" files and the FAQ as XML

Platforms • HP-UX / Itanium ('ia64') 11.0 • HP-UX / PA-RISC 11.0 • MacOS • MacOS X • CygWin NT_5 v. 1.3.10 on Windows 2000 5.00 • Win32, WinNT i386 • IRIX64 6.5 SGI • Solaris 2.8 UltraSparc • OpenBSD 2.8 i386 • RedHat Linux 7.2 i686 • Linux i386

Sequence Object Creation Sequence Creation : $sequence = Bio::Seq->new( -seq => ‘AATGCAA’); $sequence = Bio::Seq->new( -file => ‘sequencefile.fasta’); $sequence = Bio::Seq->new( -file => ‘sequencefile’, -ffmt => ‘gcg’ ); Flat File Format Support : Raw, FASTA, GCG, GenBank, EMBL, PIR Via ReadSeq: IG, NBRF, DnaStrider, Fitch, Phylip, MSF, PAUP

Sequence Types Different Sequence Objects: • Seq – Some annotations • RichSeq – Additional annotations • PrimarySeq – Bare minimum annotation ( id, accession number, alphabet) • LocatableSeq – Start, stop and gap information also • LargeSeq – Very long sequences • LiveSeq – Newly sequenced genomes

Accessing local database $seq = Bio::Seq->new( ‘-seq'=>'actgtggcgtcaact', ‘-desc'=>'Sample Bio::Seq object', '-display_id' => 'something', '-accession_number' => 'accnum', '- alphabet' => 'dna' );

Accessing remote database $gb = new Bio::DB::GenBank(); $seq1 = $gb->get_Seq_by_id('MUSIGHBA1'); $seq2 = $gb->get_Seq_by_acc('AF303112'); $seqio = $gb-> get_Stream_by_id(["J00522","AF303112","2981014"]);

Sequences – By description use Bio::Biblio; my $biblio = new Bio::Biblio; my $collection = $biblio->find(“cancer”); while ($collection->has_next) { print $collection->get_next; }

Sequence – Accession numbers # Get a sequence from RefSeq by accession number use Bio::DB::RefSeq; $gb = new Bio::DB::RefSeq; $seq = $gb->get_Seq_by_acc(“NM_007304”); print $seq->seq();

Sequences Manipulations use Bio::Seq; my $seq = Bio::Seq->new( -seq => 'ATGGGGGTGGTGGTACCCT', -id => 'human_id', -accession_number => 'AL000012', ); print $seq->seq() ; # create seq object print $seq->revcom->seq() ; #Reverse-Comp print $seq->translate->seq() ; # Translate DNA to RNA

Manipulating sequences Jonathan Nowacki Information from: bioperl.org NCBI Pubmed Tigr.org

Manipulating sequence data with Seq methods • Allows the easy manipulation of Bioinformatics Data • Specific Parts of various annotated formats can be selected and rearranged. • Unwanted information can be voided out of reports • Important information can be highlighted, processed, stored in arrays for graphs/charts/etc with relative ease • Information can be added and subtracted in a flash

The Code #!/usr/local/bin/perl use Bio::Seq; use Bio::SeqIO; my $seqin = Bio::SeqIO->new('-file' => "genes.fasta" , '-format' =>'Fasta'); my $seqobj = $seqin->next_seq(); my $seq = $seqobj->seq(),"\n"; #plain sequence print ">",$seqobj->display_id()," Description: ",$seqobj->desc(), " Alphabet: ",$seqobj->alphabet(),"\n"; $seq =~ s/(.{60})/$1\n/g; # convert to 60 char lines print $seq,"\n";

Before

After

Practical Applications (Phred Phrap) Phred is a base-calling program for DNA sequence traces. SeqIO can parse tracefiles in alf, ztr, abi, ctf, and ctr format (important note abi is in binary) Good for program compatibility, compression, automation, and many other reasons.

Practical Applications (Phred Phrap) An Abi trace file sample:

Practical Applications (Phred Phrap) Phrap is a leading program for DNA sequence assembly. Phrap is routinely used in some of the largest sequencing projects in the Human Genome Sequencing Project and in the biotech industry. Bio::SeqFeature objects allow for easy sorting and rearranging of both plain data and data contained in multiple files and output them in a format that is compatible with Phrap.

Practical Applications Phred Phrap… a simple example

Programs that can benefit • Phred/Phrap • Consed • Accelrys: GCG Wisconsin Package • Sam-T99 • Staden Package • Some Tigr and SWBIC programs • Any other program that requires information in a special format

Obtaining basic sequence statistics- molecular weights, residue & codon frequencies (SeqStats, SeqWord) • Molecular Weight • Monomer Counter • Codon Counter • DNA weights • RNA weights • Amino Weights • More

The Code #!/usr/local/bin/perl use Bio::PrimarySeq; use Bio::Tools::SeqStats; my $seqobj = new Bio::PrimarySeq(-seq => 'ATCGTAGCTAGCTGA', -display_id => 'example1'); $seq_stats = Bio::Tools::SeqStats->new(-seq=>$seqobj); $hash_ref = $seq_stats->count_monomers(); foreach $base (sort keys %$hash_ref) { print "Number of bases of type ", $base, "= ",%$hash_ref->{$base},"\n";}

The Results

More Code use SeqStats; $seq_stats = Bio::Tools::SeqStats->new($seqobj); $weight = $seq_stats->get_mol_wt(); -returns the molecular weight $monomer_ref = $seq_stats->count_monomers(); -counts the number of monomers $codon_ref = $seq_stats->count_codons(); # for nucleic acid sequence -counts the number of codons

Monomer

Why the Large and The Small MW? Note that since sequences may contain ambiguous monomers (eg "M"meaning "A" or "C" in a nucleic acid sequence), the method get_mol_wtreturns a two-element array containing the greatest lower bound andleast upper bound of the molecule. (For a sequence with no ambiguousmonomers, the two elements of the returned array will be equal.)

The Greatest Bounds Just for formalities here is a definition of the greatest lower bound andleast upper bound of the molecule Let A be an ordered set and X a subset of A. An element b is called an upper bound for the set X if every element in X is less than or equal to b. If such an upper bound exists, the set X is called bounded above. Let A be an ordered set, and X a subset of A. An element b in A is called a least upper bound(or supremum) for X if b is an upper bound for X and there is no other upper bound b' for X that is less than b. We write b = sup(X). By its definition, if a least upper bound exists, it is unique. http://www.shu.edu/projects/reals/infinity/defs/lub_sup.html

Statistics Specifics For DNA (and RNA) sequences, single-stranded weights are returned. The molecular weights are calculated for neutral - ie not ionized - nucleic acids. The returned weight is the sum of the base-sugar-phosphate residues of the chain plus one weight of water to account for the additional OH on the phosphate of the 5' residue and the additional H on the sugar ring of the 3' residue.

Identifying restriction enzyme sites (Restriction Enzyme) • Bioperl's standard RestrictionEnzyme object comes with data for more than 150 different restriction enzymes. • To select all available enzymes with cutting patterns that are six bases long: • $re = new Bio::Tools::RestrictionEnzyme('-name'=>'EcoRI'); @sixcutters = $re->available_list(6); • sites for that enzyme on a given nucleic acid sequence can be obtained using • $re1 = new Bio::Tools::RestrictionEnzyme(-name=>'EcoRI'); # $seqobj is the Seq object for the dna to be cut @fragments = $re1->cut_seq($seqobj);

Identifying restriction enzyme sites (Restriction Enzyme) (more) • Adding an enzyme not in the default list is easily as this: • $re2 = new Bio::Tools::RestrictionEnzyme('-NAME' =>'EcoRV--GAT^ATC', '-MAKE' =>'custom');

Identifying amino acid cleavage sites (Sigcleave) • Sigcleave is a program (originally part of the EGCG molecular biology package) to predict signal sequences, and to identify the cleavage site based on the von Heijne algorithm. • This help us find out whether the amino acid sequence contains a cleavable signal sequence for directing the transport of the protein within the cell.

Identifying amino acid cleavage sites (Sigcleave)(The Code) $seqobj = Bio::Seq->new(-seq => "AALLHHHHHHGGGGPPRTTTTTVVVVVVVVVVVVVVV"); use Bio::Tools::Sigcleave; $sigcleave_object = new Bio::Tools::Sigcleave ( -seq => $seqobj, threshold => 3.5, -desc => 'test sigcleave protein seq', -type => 'AMINO'); %raw_results = $sigcleave_object->signals; $formatted_output = $sigcleave_object->pretty_print;

Identifying amino acid cleavage sites (Sigcleave)(more) threshold setting controls the score reporting • default 3.5 (see G. Heijne, Nucleic Acids Res 1986 Jun 11;14(11):4683-90) • signals will return a perl hash containing the sigcleave scores • keyed by amino acid position • pretty_print returns a formatted string of results • input is raw sequence (or file containing a sequence) • not a sequence object such as PrimarySeq, • “type” in the Sigcleave object is “amino” • whereas in a Seq object it is “protein”

Miscellaneous sequence utilities: OddCodes, SeqPattern For some purposes it's useful to have a listing of an amino acid sequence showing where the hydrophobic amino acids are located or where the positively charged ones are. Bioperl provides this capability via the module Bio::Tools::OddCodes. To quickly see where the charged amino acids are located along the sequence we perform: use Bio::Tools::OddCodes; $oddcode_obj = Bio::Tools::OddCodes->new($amino_obj); $output = $oddcode_obj->charge( );

Practical Applications Specificity and stability of four-chain coiled coils Charged Amino Acids Conserved in the Aromatic Acid/H+ Symporter Family of Permeases Are Required for 4-Hydroxybenzoate Transport by PcaK from Pseudomonas putida Rings of negatively charged amino acids determine the acetylcholine receptor channel conductance. Play role of positively charged amino acids in ATP recognition by human P2X(1) receptors. And much much more Information from NCBI and PubMed articles

More OddCode The SeqPattern object is used to manipulate sequences that include perl ``regular expressions''. Use Bio::Tools::SeqPattern; $pattern = '(CCCCT)N{1,200}(agggg)N{1,200}(agggg)'; $pattern_obj = new Bio::Tools::SeqPattern('-SEQ' => $pattern, '-TYPE' => 'dna'); $pattern_obj2 = $pattern_obj->revcom(); $pattern_obj->revcom(1); # returns expanded rev complement pattern.

Converting coordinate systems (Coordinate::Pair, RelSegment) • Coordinate systems are important for figuring out the relative and absolute positions of a sequence/gene on a Contig or chromosome. • Due to the finite size of reads and contigs, and the reading of negative strands mapping coordinate systems can become complex. • Bioperl supplies two methodologies • The Coordinate::Pair approach (low level) • The Bio::DB::GFF::RelSegment approach

Coordinate::Pair approach First you define an input coordinate system and an output coordinate system Both of which contains a start position, end position and strand. The end is paramount to dealing with unfinished assemblies where the coordinate system ends when one reaches the end of the sequence of a clone or Contig. Once one has defined the two coordinate systems, one defines a ``Coordinate::Pair'' to map between them. Then you use the following code:

Coordinate::Pair approach $input_coordinates = Bio::Location::Simple->new (-seq_id => 'propeptide', -start => 1000, -end => 2000, -strand=>1 ); $output_coordinates = Bio::Location::Simple->new (-seq_id => 'peptide', -start => 1100, -end => 2100, -strand=>1 ); $pair = Bio::Coordinate::Pair->new (-in => $input_coordinates, -out => $output_coordinates ); $pos = Bio::Location::Simple->new (-start => 500, -end => 500 ); $res = $pair->map($pos); $converted_pos = $res->gap->start;

Bio::DB::GFF::RelSegment approach Bio::DB::GFF::RelSegment approach is designed more for handling coordinate transformations of sequence features rather than for transforming arbitrary coordinate systems. With Bio::DB::GFF::RelSegment you define a coordinate system relative to a specific feature (called the ``refseq''). You also have access to the ``absolute'' coordinate system (typically of the entire chromosome.) You can determine the position of a feature relative to some other feature simply by redefining the relevant reference feature (ie the ``refseq'') with code like this:

BioPerl

BioPerl

Presentation Transcript

Bioperl modules

BioPerl

BioPerl

Beginning BioPerl for Biologists

Bioinformatica BioPerl

BioPerl - documentation

BioPerl – An Overview

BioPerl

Bibliographic Query Service in bioperl

BioPerl

BioPerl

Introducing Bioperl

Installing Bioperl

Introduction to bioperl

BioPerl

Introducción a Bioperl

Bioperl modules

BioPerl