1 / 37

Bioinformatics Course Day 4

Perl Extensions: BioPerl and Ensembl API. Bioinformatics Course Day 4. BioPerl. Collection of Perl scripts and modules Facilitate development of Perl scripts for bioinformatics applications – not ready-to-use programs! Object-orientated Perl code

betrys
Download Presentation

Bioinformatics Course Day 4

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Perl Extensions: BioPerl and Ensembl API Bioinformatics CourseDay 4

  2. BioPerl • Collection of Perl scripts and modules • Facilitate development of Perl scripts for bioinformatics applications – not ready-to-use programs! • Object-orientated Perl code • Generated by biologists, bioinformaticians, computer scientists • Different levels of complexity

  3. What are modules? • Perl extensions • .pm ending • inserted via 'use' or 'require' statement • can be nested: • Bio::DB::GenBank • /sw/lib/perl5/5.8.6/Bio/DB/GenBank.pm • contain code and documentation • perldoc Bio::DB::GenBank

  4. Applications for BioPerl • Sequence retrieval and manipulation • Data re-formatting • Run and parse output from bioinformatics programs: • Blast • ClustalW • Tcoffee • GenScan • HMMER • ...

  5. What is an Object? • Example: Bio::Seq Object use Bio::Perl $seqobj = get_sequence('swissprot', 'TLR4_HUMAN');

  6. What is an Object? • Example: Bio::Seq Object new function available load module use Bio::Perl $seqobj = get_sequence('swissprot', 'TLR4_HUMAN'); generates object

  7. What is an Object? • Example: Bio::Seq Object new function available load module use Bio::Perl $seqobj = get_sequence('swissprot', 'TLR4_HUMAN'); generates object provides methods $sequence_part = $seqobj->subseq(1,100); $translation = $seqobj->translate();

  8. What is an Object? • Example: Bio::Seq Object new function available load module use Bio::Perl $seqobj = get_sequence('swissprot', 'TLR4_HUMAN'); generates object provides methods $sequence_part = $seqobj->subseq(1,100); $translation = $seqobj->translate(); combinations possible $trans_trunc_rev = $seqobj->trunc(100,200)->revcom->translate();

  9. BioPerl's Objects • Sequence Objects (representation of various types of sequences): • Seq • PrimarySeq • LocatableSeq • RelSegment • LiveSeq • LargeSeq • RichSeq • SeqWithQuality

  10. BioPerl's Objects • Location Objects: • Where is a feature on a sequence? • Alignment objects • Blast reports • ...

  11. Objects vs Functions • Different access method • Example: reading an EMBL file: Function: $seq = read_sequence($file, 'embl') Object: $seqio = Bio::SeqIO->new( -format => 'embl' , -file => $file ); $seqobj = $seqio->next_seq();

  12. Bio::Perl example • Load the module use Bio::Perl; • Get the sequence $seqobj = get_sequence('swissprot', 'TLR4_HUMAN'); • Now you have a Bio::Seq object!

  13. Bio::Perl is not BioPerl! • BioPerl: • whole collection of Bio-related Perl extension • Bio::Perl • just one of many modules • “Easy first time access to BioPerl via functions” • “Functional access to BioPerl for people who don't know objects” • limited functionality • nice starter

  14. DB entry to Objects • Conversion of parts of the data into objects: ID TLR4_HUMAN STANDARD; PRT; 839 AA. AC O00206; Q9UK78; Q9UM57; Bio::PrimarySeq OS Homo sapiens (Human). OC Eukaryota; Metazoa; Chordata; Craniata; ... Bio::Species FT REPEAT 52 76 LRR 1.. FT REPEAT 77 100 LRR 2.. Bio::SeqFeatureI

  15. Bio::Seq – Formats AB1,ABI ABI tracefile format ALF ALF tracefile format CTF CTF tracefile format EMBL EMBL format EXP Staden tagged experiment tracefile format Fasta FASTA format Fastq Fastq format GCG GCG format GenBank GenBank format PIR Protein Information Resource format PLN Staden plain tracefile format SCF SCF tracefile format ZTR ZTR tracefile format ace ACeDB sequence format game GAME XML format locuslink LocusLink annotation (LL_tmpl format only) phd phred output qual Quality values (get a sequence of quality scores) raw Raw format (one sequence per line, no ID) swiss Swissprot format

  16. Access through Bio::Seq object perldoc Bio::Seq (methods returning strings): $seqobj->seq(); # string of sequence $seqobj->subseq(5,10); # part of the sequence as a string $seqobj->accession_number(); # when there, the accession number $seqobj->alphabet(); # one of 'dna','rna',or 'protein' $seqobj->seq_version() # when there, the version $seqobj->keywords(); # when there, the Keywords line $seqobj->length() # length $seqobj->desc(); # description $seqobj->display_id(); # the human readable id of the sequence

  17. Derived Bio::Seq objects perldoc Bio::Seq (methods returning strings): $seqobj->trunc(5,10) # truncation from 5 to 10 as new object $seqobj->revcom # reverse complements sequence $seqobj->translate # translation of the sequence • Example: $seqobj = read_sequence($file); if ($seqobj->alphabet eq 'dna' or $seqobj->alphabet eq 'rna') { $revcom = $seqobj->revcom; write_sequence('', 'fasta', $revcom); }

  18. Other Bio::Perl features • Remote Blast: $seqobj = read_sequence($file); $blast_report = blast_sequence($seqobj); write_blast(">blast.out", $blast_report); • Also possible to run stand-alone Blast

  19. Other Bio::Perl features • Generate alignments: # load module use Bio::Tools::Run::Alignment::Clustalw; # define parameters @params = ('ktuple' => 2, 'matrix' => 'BLOSUM'); # build a clustalw alignment factory $factory = Bio::Tools::Run::Alignment::Clustalw->new(@params); # Pass the factory a list of sequences to be aligned. $aln = $factory->align('TLRs.fa'); # $aln is a SimpleAlign object

  20. Other Bio::Perl features • Work with alignments: $aln->length $aln->no_residues $aln->is_flush $aln->no_sequences $aln->score $aln->percentage_identity $aln->consensus_string(75) ??????L?LS?N?I??????????L??L??L?L??N?????????????? ???F?????L??L?L??N???????????????L??L?L??????????? ?????????????????????????????????????????????F??L? ?L??L?L????????????????L?????L???????????????????? ???????L?????????????????????L???????????????????? ?????????L???????????????????????????????????????? mostly Leucines conserved

  21. Other Features • sequence statistics (chemical description, residue count, word frequency) • finding restriction enzyme sites • finding amino acid cleavage sites • show and add sequence annotation • gene detection • manipulate phylogenetic trees • statistics for population genetics

  22. BioPerl Documentation NAME Bio::Perl - Functional access to BioPerl for people who don't know objects SYNOPSIS use Bio::Perl; # will guess file format from extension $seq_object = read_sequence($filename); # forces genbank format $seq_object = read_sequence($filename,'genbank'); # reads an array of sequences @seq_object_array = read_all_sequences($filename,'fasta'); # sequences are Bio::Seq objects, so the following methods work # for more info see Bio::Seq, or do 'perldoc Bio/Seq.pm' print "Sequence name is ",$seq_object->display_id,"\n"; print "Sequence acc is ",$seq_object->accession_number,"\n"; print "First 5 bases is ",$seq_object->subseq(1,5),"\n";

  23. More Info • less `locate bptutorial` (or perl -d `locate bptutorial`) • perldoc Bio::Perl • www.bioperl.org • BioPerl course: www.pasteur.fr/recherche/unites/sis/formation/bioperl/

  24. Ensembl • joint project between EMBL-EBI and the Sanger Centre • automatic annotation on selected eukaryotic genomes • free access to all the data and software • ww.ensembl.org

  25. dog chimp zebrafish pufferfish mosquito honey bee yeast Ensembl Organisms • Human • mouse • fly • worm • chicken • cow • rat • local installations (Arabidopsis)

  26. Ensembl Website

  27. Ensembl Website

  28. Ensembl Pipeline • Perl-based scripts • run programs to detect and annotate genes • compare genomes • provide Web graphics • all data stored in MySQL database

  29. Ensembl release cycle • Data sets and software updates approximately ten times a year • Versions for web code and databases • All older versions (back to 2004) accessible • Registry allows easy switch between versions

  30. Ensembl database • very rich data set • complex database layout • user-friendly Web interface • DB components: • Core • Compara • abstract layers through API (Perl, Java) • anonymous @ ensembldb.ensembl.org • e.g. homo_sapiens_core_38_36

  31. Ensembl core • Genome sequences and annotation info • Gene transcripts • Protein models • Assembly information • CDNA and protein alignments • External references • Markers • Repeats regions

  32. Other Ensembl data sets • EST databases • Variation databases • Both with application programming interface (API), e.g. Perl modules

  33. Ensembl compara • Multi-species database • Genome-wide species comparison • Re-calculated for each release • Pair-wise whole genome alignments • Synteny sets • Orthologue predictions • Protein family clusters

  34. Compara: Genome comparison • Which methods are available? mysql> SELECT * FROM method_link; +----------------+----------------------+ | method_link_id | type | +----------------+----------------------+ | 1 | BLASTZ_NET | | 2 | BLASTZ_NET_TIGHT | | 3 | BLASTZ_RECIP_NET | | 4 | PHUSION_BLASTN | | 5 | PHUSION_BLASTN_TIGHT | | 6 | TRANSLATED_BLAT | | 7 | BLASTZ_GROUP | | 8 | BLASTZ_GROUP_TIGHT | | 101 | SYNTENY | | 201 | ENSEMBL_ORTHOLOGUES | | 202 | ENSEMBL_PARALOGUES | | 301 | FAMILY | +----------------+----------------------+

  35. Compara: Genome comparison • Which genomes are available? mysql> SELECT * FROM genome_db WHERE genome_db_id IN (1, 11); +--------------+----------+-------------------------+------------+ | genome_db_id | taxon_id | name | assembly | +--------------+----------+-------------------------+------------+ | 1 | 9606 | Homo sapiens | NCBI34 | | 11 | 9031 | Gallus gallus | WASHUC1 | +--------------+----------+-------------------------+------------+

  36. Compara: Genome comparison • Which genomes were compared? • mysql> SELECT * FROM method_link_species_set WHERE method_link_species_set_id = 71; • +----------------------------+----------------+--------------+ • | method_link_species_set_id | method_link_id | genome_db_id | • +----------------------------+----------------+--------------+ • | 71 | 1 | 1 | • | 71 | 1 | 11 | • +----------------------------+----------------+--------------+ • BLASTZ_NET (method_link_id = 1) has been used for linking all the species of this set: Human (genome_db_id = 1) and Chicken (genome_db_id = 11).

  37. More Info • www.ensembl.org • www.ensembl.org/info/software • Tutorials • Database schema outline • Man pages, e.g. perldoc Bio::Ensembl::DBSQL::DBAdaptor

More Related