BioJava Core API
Java for Bioinformatics? • Cross platform means develop on one platform deploy on any. • Widely accepted industry standard. • Lots of support libraries for modern technologies (XML, WebServices, JDBC). • Scales well from small to industrial strength enterprise sized programs.
Java for Bioinformatics? • Object Oriented. • Rapid development due to • Very strict types • Simple clear syntax • Exception handling and recovery • Cross platform • Extensive class library • Code reuse
What is BioJava? • A collection of Java objects that represent and manipulate biological data • Not a program, rather a programming library • Open source under the LGPL, so it is open for all development, even commercial. The licence is not ‘sticky’ or ‘viral’.
What is BioJava? • Collection of objects to assist bioinformatics research • Started at EBI/Sanger in 1998 by Matthew Pocock and Thomas Down • 25+ developers have contributed (5 core)
What is BioJava? • BioJava has acquired 1100+ classes and 130,000+ lines of code. • Uses CVS version control, JUnit testing and ANT builds. • It now has a fairly stable API. • 76 packages!
Where is BioJava • Home Page • www.biojava.org • BioJava in Anger • http://www.biojava.org/docs/bj_in_anger/ • Mailing Lists • biojava-l@biojava.org • biojava-dev@biojava.org • Nightly Builds • http://www.derkholm.net/autobuild/
Obtaining BioJava • Download • http://www.biojava.org/download/ • Get binaries, source and docs • biojava-live (requires cvs) • cvs -d :pserver:cvs@cvs.open-bio.org:/home/repository/biojava login • Password is ‘cvs’ • cvs -d :pserver:cvs@cvs.open-bio.org:/home/repository/biojava checkout biojava-live • cvs update -Pd
Compiling biojava-live • Requires the ANT build tool • http://jakarta.apache.org/ant/ • The ANT tool will use build.xml to • Arrange source code • Compile source • Make jar file • Make Java docs • Build demos • Build and Run tests • Change to biojava-live; type ant • Unit testing requires JUnit • http://junit.sourceforge.net/
Setting up BioJava • Put the following JAR files on your class path: • biojava.jar • bytecode-0.92.jar • commons-cli.jar • commons-collections-2.1.jar • commons-dbcp-1.1.jar • commons-pool-1.1.jar
BioJava Design • Uses some reasonably “advanced” concepts • Design by Interface • Protected or Private constructors • Factory classes and Methods • Flyweight/ Singleton objects
Interfaces Hide Implementation • In BioJava there are several implementations of the Distribution interface. • Any can be legally returned by a method that returns a Distribution (the returning method may even return different ones depending on the situation). • Any can be legally used as an argument to a method that requires a Distribution. • All are guaranteed to contain a minimal set of common methods.
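The points above can be sketched in plain Java without BioJava. The interface `Weights` and the classes `UniformWeights` and `TableWeights` are hypothetical names made up for this illustration; they are not BioJava's Distribution classes, only a small stand-in for the same design idea.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical interface: the minimal set of methods every implementation shares.
interface Weights {
    double weightOf(char symbol);
}

// Implementation 1: every symbol of a fixed alphabet has the same weight.
class UniformWeights implements Weights {
    private final String alphabet;

    UniformWeights(String alphabet) {
        this.alphabet = alphabet;
    }

    @Override
    public double weightOf(char symbol) {
        return alphabet.indexOf(symbol) >= 0 ? 1.0 / alphabet.length() : 0.0;
    }
}

// Implementation 2: weights are estimated from observed counts.
class TableWeights implements Weights {
    private final Map<Character, Integer> counts = new HashMap<>();
    private int total = 0;

    TableWeights(String observed) {
        for (char c : observed.toCharArray()) {
            counts.merge(c, 1, Integer::sum);
            total++;
        }
    }

    @Override
    public double weightOf(char symbol) {
        return total == 0 ? 0.0 : counts.getOrDefault(symbol, 0) / (double) total;
    }
}

class InterfaceDemo {
    // The return type is the interface; the caller never learns which class it got.
    static Weights weightsFor(String observed) {
        return observed.isEmpty() ? new UniformWeights("acgt") : new TableWeights(observed);
    }

    // Any implementation is a legal argument here.
    static char heaviest(Weights w, String alphabet) {
        char best = alphabet.charAt(0);
        for (char c : alphabet.toCharArray()) {
            if (w.weightOf(c) > w.weightOf(best)) {
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(heaviest(weightsFor("aacg"), "acgt"));
        System.out.println(weightsFor("").weightOf('a'));
    }
}
```

Callers of `weightsFor` and `heaviest` depend only on `Weights`, so a new implementation can be added later without changing them.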
Flyweight and Singleton Objects • A Singleton is a class with only one instance and only one access point. • A Singleton will need a private constructor and may be static (e.g. AlphabetManager). • A Flyweight object uses sharing to support large numbers of fine-grained objects efficiently. • For example, in BioJava there is only ever one instance of the DNA Symbol “A”. A sequence of A’s is really just a list of pointers to that one object.
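A minimal flyweight sketch in plain Java. The class `Base` and its `forChar` method are hypothetical and only imitate the idea of one shared object per DNA symbol; BioJava's own Symbol classes are more elaborate.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical flyweight: exactly one shared instance per nucleotide.
final class Base {
    static final Base A = new Base("adenine");
    static final Base C = new Base("cytosine");
    static final Base G = new Base("guanine");
    static final Base T = new Base("thymine");

    private final String name;

    // Private constructor: no code outside this class can create a fifth Base.
    private Base(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }

    // Single access point that maps a character to the shared instance.
    static Base forChar(char c) {
        switch (Character.toLowerCase(c)) {
            case 'a': return A;
            case 'c': return C;
            case 'g': return G;
            case 't': return T;
            default: throw new IllegalArgumentException("Not a DNA base: " + c);
        }
    }
}

class FlyweightDemo {
    // Builds a sequence as a list of references to the four shared objects.
    static List<Base> parse(String text) {
        List<Base> bases = new ArrayList<>(text.length());
        for (char c : text.toCharArray()) {
            bases.add(Base.forChar(c));
        }
        return bases;
    }

    public static void main(String[] args) {
        List<Base> poly = parse("aaaa");
        // Every element is the same object, so identity comparison is safe.
        System.out.println(poly.get(0) == poly.get(3)); // prints true
    }
}
```

However long the parsed sequence is, only four `Base` objects ever exist; the list stores references to them.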
Factory and Static methods • Sometimes it is useful to prevent a user from directly constructing an object via a constructor: • If the construction is complex. • If the choice of the optimal implementation is best left to the API developer. • If important resources are best protected from end users, e.g. Singletons/Flyweights. • Rather than instantiating the object via its constructor, a static method or Factory object is used.
Examples • Static method: • FiniteAlphabet dna = DNATools.getDNA(); • Static field: • DistributionFactory df = DistributionFactory.DEFAULT; • Factory method: • Distribution d = df.createDistribution(dna);
Two Levels of BioJava • Macro type programming • Tools classes (SeqIOTools, DistributionTools etc). • Static methods for common tasks. • Full programming • Lots of customizations and ‘plug and play’ possible. • More exposure to the sharp edges of the API. Less documentation.
Symbols • In BioJava the DNA residue “A” is an object. • In Bioperl “A” would be a String. • The “A” object is an element of a sequence, not a sequence itself. • “A” from DNA is not equal to “A” from RNA or “A” from Protein.
Why not Strings? • DNA A != RNA A != Protein A • For Strings, “A”.equals(“A”) is always true, whichever alphabet the “A” came from. • The DNA Alphabet also contains the ambiguity symbols K, Y, W, S, R, M, B, D, H, V, N
Why not Strings? • The object Y contains C and T; the String “Y” doesn’t contain anything. • Translation HashMaps with Strings are flawed. • In BioJava, GGN translates to GLY • With Strings, GGN maps to null • A fully redundant String to String HashMap translation table requires 4096 keys!
Symbols are Canonical • DNATools.a() == DNATools.a(); • There is only one instance of ‘a’ • DNATools.a().equals(DNATools.a()); • ProteinTools.a() != DNATools.a(); • Even on remote JVMs! • During serialization Alphabet indexing is transient and ‘reconnected’ via readResolve() methods.
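The readResolve() mechanism mentioned above is part of standard Java serialization. The sketch below shows how it can keep a symbol canonical across a serialize/deserialize round trip. `CanonicalSymbol` and `SerializationDemo` are hypothetical classes written for this illustration; this is not how BioJava itself is implemented.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InvalidObjectException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamException;
import java.io.Serializable;

// Hypothetical canonical symbol with one shared instance per (alphabet, token) pair.
final class CanonicalSymbol implements Serializable {
    private static final long serialVersionUID = 1L;

    static final CanonicalSymbol DNA_A = new CanonicalSymbol("DNA", "a");
    static final CanonicalSymbol RNA_A = new CanonicalSymbol("RNA", "a");

    private final String alphabet;
    private final String token;

    private CanonicalSymbol(String alphabet, String token) {
        this.alphabet = alphabet;
        this.token = token;
    }

    // Called by Java serialization after reading; swaps the fresh copy for the shared instance.
    private Object readResolve() throws ObjectStreamException {
        if (alphabet.equals("DNA") && token.equals("a")) return DNA_A;
        if (alphabet.equals("RNA") && token.equals("a")) return RNA_A;
        throw new InvalidObjectException("Unknown symbol " + alphabet + ":" + token);
    }
}

class SerializationDemo {
    // Serializes an object to a byte array and reads it back.
    static Object roundTrip(Object in) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(in);
            }
            try (ObjectInputStream input =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return input.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The deserialized object is the very same instance, so == still works.
        System.out.println(roundTrip(CanonicalSymbol.DNA_A) == CanonicalSymbol.DNA_A);
        System.out.println(roundTrip(CanonicalSymbol.DNA_A) == CanonicalSymbol.RNA_A);
    }
}
```

Without `readResolve()`, deserialization would produce a second "a" object and identity comparison would silently fail.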
Alphabets • A set of Symbols • Alphabets can be infinite • DoubleAlphabet, IntegerAlphabet • Some Alphabets have a Finite number of Symbols • DNA, RNA etc • Alphabet and FiniteAlphabet interfaces
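The split between finite and infinite alphabets can be sketched with two small interfaces. The names `SimpleAlphabet`, `SimpleFiniteAlphabet`, `AllIntegers` and `SmallAlphabet` are hypothetical, simplified stand-ins and are not BioJava's Alphabet and FiniteAlphabet interfaces.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical base interface: membership can always be tested.
interface SimpleAlphabet<T> {
    String getName();
    boolean contains(T symbol);
}

// Hypothetical finite variant: members can also be counted and listed.
interface SimpleFiniteAlphabet<T> extends SimpleAlphabet<T>, Iterable<T> {
    int size();
}

// Infinite alphabet: every Integer is a member, so there is no size() or iterator().
final class AllIntegers implements SimpleAlphabet<Integer> {
    @Override
    public String getName() {
        return "INTEGER";
    }

    @Override
    public boolean contains(Integer symbol) {
        return symbol != null;
    }
}

// Finite alphabet built from the characters of a token string.
final class SmallAlphabet implements SimpleFiniteAlphabet<Character> {
    private final String name;
    private final Set<Character> symbols = new LinkedHashSet<>();

    SmallAlphabet(String name, String tokens) {
        this.name = name;
        for (char c : tokens.toCharArray()) {
            symbols.add(c);
        }
    }

    @Override
    public String getName() {
        return name;
    }

    @Override
    public boolean contains(Character symbol) {
        return symbols.contains(symbol);
    }

    @Override
    public int size() {
        return symbols.size();
    }

    @Override
    public Iterator<Character> iterator() {
        return Collections.unmodifiableSet(symbols).iterator();
    }
}

class AlphabetSketchDemo {
    public static void main(String[] args) {
        SmallAlphabet dna = new SmallAlphabet("DNA", "acgt");
        System.out.println(dna.getName() + " has " + dna.size() + " symbols");
        System.out.println(new AllIntegers().contains(42));
    }
}
```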
org.biojava.bio.symbol.Alphabet • boolean contains(Symbol s): Returns whether or not this Alphabet contains the symbol. • List getAlphabets(): Returns an ordered List of the alphabets which make up a compound alphabet. • Symbol getAmbiguity(java.util.Set syms): Gets a symbol that represents the set of symbols in syms. • Symbol getGapSymbol(): Gets the 'gap' ambiguity symbol that is most appropriate for this alphabet. • String getName(): Gets the name of the alphabet. • Symbol getSymbol(java.util.List rl): Gets a symbol from the Alphabet which corresponds to the specified ordered list of symbols. • SymbolTokenization getTokenization(java.lang.String name): Gets a SymbolTokenization by name. • void validate(Symbol s): Throws a precanned IllegalSymbolException if the symbol is not contained within this Alphabet.
org.biojava.bio.symbol.FiniteAlphabet • In addition to the previous methods: • void addSymbol(Symbol s): Adds a symbol to this Alphabet. • Iterator iterator(): Retrieves an Iterator over the Symbols in this Alphabet. • void removeSymbol(Symbol s): Removes a symbol from this alphabet. • int size(): The number of symbols in the alphabet.
The Default Alphabets • DNA (a,c,g,t) • RNA (a,c,g,u) • PROTEIN (all amino acids including ‘Sel’) • PROTEIN-TERM (all PROTEIN plus “*”) • STRUCTURE (PDB structure symbols) • Alphabet of all integers (Infinite Alphabet) • Can generate SubIntegerAlphabets • Alphabet of all doubles (Infinite Alphabet)
Getting the common Alphabets

    import org.biojava.bio.symbol.*;
    import java.util.*;
    import org.biojava.bio.seq.*;

    public class AlphabetExample {
        public static void main(String[] args) {
            Alphabet dna, rna, prot;

            //get the DNA alphabet by name
            dna = AlphabetManager.alphabetForName("DNA");
            //get the RNA alphabet by name
            rna = AlphabetManager.alphabetForName("RNA");
            //get the Protein alphabet by name
            prot = AlphabetManager.alphabetForName("PROTEIN");
            //get the protein alphabet that includes the * termination Symbol
            prot = AlphabetManager.alphabetForName("PROTEIN-TERM");

            //get those same Alphabets from the Tools classes
            dna = DNATools.getDNA();
            rna = RNATools.getRNA();
            prot = ProteinTools.getAlphabet();
            //or the one with the * symbol
            prot = ProteinTools.getTAlphabet();
        }
    }
SymbolLists are made of Symbols • org.biojava.bio.symbol.SymbolList • A sequence of Symbols from the same Alphabet. • Uses biological coordinates from 1 to length • cf String from 0 to length-1
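The difference between biological coordinates (1 to length, inclusive) and String coordinates (0 to length-1) can be sketched on a plain String. `BioCoords` is a hypothetical helper written for this illustration; its two methods only mimic the coordinate convention and are not BioJava code.

```java
// Hypothetical helper showing 1-based, inclusive coordinates over a plain String.
final class BioCoords {
    private BioCoords() {
    }

    // Residue at a biological position, counting from 1.
    static char symbolAt(String seq, int index) {
        if (index < 1 || index > seq.length()) {
            throw new IndexOutOfBoundsException(
                "Position " + index + " is outside 1.." + seq.length());
        }
        return seq.charAt(index - 1);
    }

    // Region from start to end, both inclusive, counting from 1.
    static String subStr(String seq, int start, int end) {
        if (start < 1 || end > seq.length() || start > end) {
            throw new IndexOutOfBoundsException("Bad range " + start + ".." + end);
        }
        // String.substring is 0-based with an exclusive end, hence start - 1 and end.
        return seq.substring(start - 1, end);
    }

    public static void main(String[] args) {
        String seq = "atcggt";
        System.out.println(symbolAt(seq, 1));   // prints a
        System.out.println(subStr(seq, 2, 4));  // prints tcg
    }
}
```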
Doesn’t this waste memory? • A SymbolList is not really a List of Symbol Objects. • Rather a List of Object references. • Still a bit heavier than a char[] but not serious. • [Figure: the sequence AACGTGGGTTCCAACT and the four Symbols T, C, A, G.]
The Bigger Picture • [Figure: AlphabetManager, the “DNA” and “Protein” alphabets, the Symbols T, C, A, G, and the sequence AACGTGGGTTCCAACT.]
The SymbolList interface • void edit(Edit edit): Apply an edit to the SymbolList as specified by the edit object. • Alphabet getAlphabet(): The alphabet that this SymbolList is over. • Iterator iterator(): An Iterator over all Symbols in this SymbolList. • int length(): The number of symbols in this SymbolList. • String seqString(): Stringify this symbol list. • SymbolList subList(int start, int end): Return a new SymbolList for the symbols start to end inclusive. • String subStr(int start, int end): Return a region of this symbol list as a String. • Symbol symbolAt(int index): Return the symbol at index, counting from 1. • List toList(): Returns a List of symbols.
String to SymbolList

    import org.biojava.bio.seq.*;
    import org.biojava.bio.symbol.*;

    public class StringToSymbolList {
        public static void main(String[] args) {
            try {
                //create a DNA SymbolList from a String
                SymbolList dna = DNATools.createDNA("atcggtcggctta");

                //create a RNA SymbolList from a String
                SymbolList rna = RNATools.createRNA("auugccuacauaggc");

                //create a Protein SymbolList from a String
                SymbolList aa = ProteinTools.createProtein("AGFAVENDSA");
            } catch (IllegalSymbolException ex) {
                //this will happen if you use a character in one of your strings that is
                //not an accepted IUB Character for that Symbol.
                ex.printStackTrace();
            }
        }
    }
SymbolList to String

    import org.biojava.bio.symbol.*;

    public class SymbolListToString {
        public static void main(String[] args) {
            SymbolList sl = null;
            //code here to instantiate sl

            //convert sl into a String
            String s = sl.seqString();
        }
    }
The Sequence Interface • A Sequence is a SymbolList with more information. • In addition to Annotatable and SymbolList: • String getName(): The name of this sequence. • String getURN(): A Uniform Resource Identifier (URI) which identifies the sequence represented by this object. • Also implements FeatureHolder, which allows addition of Feature Objects.
Quickly generate a Sequence

    import org.biojava.bio.seq.*;
    import org.biojava.bio.symbol.*;

    public class StringToSequence {
        public static void main(String[] args) {
            try {
                //create a DNA sequence with the name dna_1
                Sequence dna = DNATools.createDNASequence("atgctg", "dna_1");

                //create an RNA sequence with the name rna_1
                Sequence rna = RNATools.createRNASequence("augcug", "rna_1");

                //create a Protein sequence with the name prot_1
                Sequence prot = ProteinTools.createProteinSequence("AFHS", "prot_1");
            } catch (IllegalSymbolException ex) {
                //an exception is thrown if you use a non IUB symbol
                ex.printStackTrace();
            }
        }
    }
Ambiguity Symbols • Ambiguous or fuzzy data is a fact of life, especially with sequencing. • DNA traces can contain symbols such as n, r, w, v, h, k, y, etc. • In BioJava the DNA symbols a, c, g, t are AtomicSymbols. • Ambiguous symbols like y are BasisSymbols.
BasisSymbols • A BasisSymbol may be represented as a list of one or more Symbols. • BasisSymbol extends Symbol. • Ambiguity Symbols are always BasisSymbols. • getSymbols(): The list of symbols that this symbol is composed from.
AtomicSymbols • AtomicSymbols are not ambiguous. • They cannot be further divided into Symbols that are valid members of the parent Alphabet. • In the case of compound Alphabets they can be divided into valid Symbols from component Alphabets.
AtomicSymbols • The AtomicSymbol interface extends BasisSymbol but adds no new methods only behaviour contracts. • AtomicSymbol instances guarantee that getMatches() returns an Alphabet containing just that Symbol and each element of the List returned by getSymbols() is also atomic.
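One way to picture the atomic/ambiguous distinction is to define every symbol by the set of atomic symbols it matches. The sketch below does this in plain Java; `Nucleotide` and `DnaSymbol` are hypothetical names and the model is a simplification, not BioJava's Symbol hierarchy.

```java
import java.util.Collections;
import java.util.EnumSet;
import java.util.Set;

// The four unambiguous DNA bases.
enum Nucleotide { A, C, G, T }

// Hypothetical model: a symbol is defined by the set of atomic bases it matches.
final class DnaSymbol {
    private final String token;
    private final Set<Nucleotide> matches;

    private DnaSymbol(String token, Set<Nucleotide> matches) {
        this.token = token;
        this.matches = Collections.unmodifiableSet(matches);
    }

    // Requiring a first element guarantees the match set is never empty.
    static DnaSymbol of(String token, Nucleotide first, Nucleotide... rest) {
        return new DnaSymbol(token, EnumSet.of(first, rest));
    }

    String getToken() {
        return token;
    }

    Set<Nucleotide> getMatches() {
        return matches;
    }

    // Atomic means the symbol matches exactly one base and cannot be divided further.
    boolean isAtomic() {
        return matches.size() == 1;
    }

    boolean matches(Nucleotide n) {
        return matches.contains(n);
    }
}

class AmbiguityDemo {
    public static void main(String[] args) {
        DnaSymbol a = DnaSymbol.of("a", Nucleotide.A);
        DnaSymbol y = DnaSymbol.of("y", Nucleotide.C, Nucleotide.T);
        System.out.println(a.getToken() + " atomic? " + a.isAtomic());
        System.out.println(y.getToken() + " matches " + y.getMatches());
    }
}
```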
Atomic and Basis • [Figure: AlphabetManager and the “DNA” alphabet, the BasisSymbol W, the AtomicSymbols T and A, and the sequence AATW.]
Translating Ambiguity • BioJava handles translation of ambiguity very smoothly. • DNA ‘n’ = [a,c,g,t] • Transcribes to RNA ‘n’ [a,c,g,u] • ggn translates to Gly • agn translates to [Ser, Arg] • Most protein ambiguities have no ‘token’ and are printed as ‘X’
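The ggn and agn cases above can be reproduced with a small plain-Java sketch: expand each ambiguous position into the bases it stands for, translate every resulting codon with the standard genetic code, and collect the distinct amino acids. `AmbiguousTranslator` is a hypothetical class written for this illustration and says nothing about how BioJava implements translation internally.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical translator that resolves IUPAC ambiguity codes before translating.
final class AmbiguousTranslator {
    private static final String BASES = "tcag";
    // Standard genetic code as one-letter amino acids; codons ordered t, c, a, g per position.
    private static final String AMINO_ACIDS =
        "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

    private AmbiguousTranslator() {
    }

    // Expands one IUPAC nucleotide code into the unambiguous bases it stands for.
    static String expand(char code) {
        switch (Character.toLowerCase(code)) {
            case 'a': return "a";
            case 'c': return "c";
            case 'g': return "g";
            case 't': return "t";
            case 'r': return "ag";
            case 'y': return "ct";
            case 's': return "cg";
            case 'w': return "at";
            case 'k': return "gt";
            case 'm': return "ac";
            case 'b': return "cgt";
            case 'd': return "agt";
            case 'h': return "act";
            case 'v': return "acg";
            case 'n': return "acgt";
            default: throw new IllegalArgumentException("Not an IUPAC DNA code: " + code);
        }
    }

    // Translates one unambiguous codon given as three lower-case bases.
    static char translateExact(char b1, char b2, char b3) {
        int index = 16 * BASES.indexOf(b1) + 4 * BASES.indexOf(b2) + BASES.indexOf(b3);
        return AMINO_ACIDS.charAt(index);
    }

    // Returns every amino acid (or '*' for stop) the possibly ambiguous codon can encode.
    static Set<Character> translate(String codon) {
        if (codon.length() != 3) {
            throw new IllegalArgumentException("A codon has three bases: " + codon);
        }
        Set<Character> result = new TreeSet<>();
        for (char b1 : expand(codon.charAt(0)).toCharArray()) {
            for (char b2 : expand(codon.charAt(1)).toCharArray()) {
                for (char b3 : expand(codon.charAt(2)).toCharArray()) {
                    result.add(translateExact(b1, b2, b3));
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(translate("ggn")); // prints [G]
        System.out.println(translate("agn")); // prints [R, S]
    }
}
```

A plain String-to-String lookup table would need an entry for every ambiguous spelling; here the 64 unambiguous codons are enough.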
CrossProduct Alphabets • A CrossProductAlphabet is a combination of two or more Alphabets. • Any type of CrossProductAlphabet is possible • Dimers (DNA x DNA) • Codon (DNA x DNA x DNA) • Conditional ((DNA x DNA) x DNA) • Mixed ((DNA x DNA x DNA) x PROTEIN)
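Conceptually, a cross-product alphabet contains one compound symbol for every ordered combination of symbols from its component alphabets. The sketch below enumerates those combinations for plain string lists. `CrossProductSketch` is a hypothetical class for illustration only and does not reflect how BioJava stores its cross-product alphabets.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: enumerate the members of a cross product of alphabets.
final class CrossProductSketch {
    private CrossProductSketch() {
    }

    // Returns every ordered combination that takes one element from each component list.
    static List<List<String>> crossProduct(List<List<String>> components) {
        List<List<String>> result = new ArrayList<>();
        result.add(new ArrayList<>());
        for (List<String> component : components) {
            List<List<String>> next = new ArrayList<>();
            for (List<String> prefix : result) {
                for (String symbol : component) {
                    List<String> extended = new ArrayList<>(prefix);
                    extended.add(symbol);
                    next.add(extended);
                }
            }
            result = next;
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> dna = Arrays.asList("a", "c", "g", "t");
        List<List<String>> dimers = crossProduct(Collections.nCopies(2, dna));
        List<List<String>> codons = crossProduct(Collections.nCopies(3, dna));
        System.out.println(dimers.size()); // prints 16
        System.out.println(codons.size()); // prints 64
        System.out.println(codons.get(0)); // prints [a, a, a]
    }
}
```

The sizes multiply: 4 x 4 = 16 dimers and 4 x 4 x 4 = 64 codons, which is why only finite components give a finite cross product.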
Finite and Compound Alphas • [Figure: the (DNA x DNA x DNA) BasisSymbol GNG; the (DNA x DNA x DNA) AtomicSymbols ACA and GTG; the DNA AtomicSymbols T, C, A, G; and the sequence [AAC][GTG]GGTTCCAACT.]
What are they good for? • Codon Symbols (DNA x DNA x DNA). • Many analysis Classes such as Count and Distribution use Symbol as an argument. A hexamer can be an AtomicSymbol. • Phred is DNA x Integer • 1st and Higher order Markov Models use CrossProductAlphabets.
How do I make a CrossProductAlphabet?

    import java.util.*;
    import org.biojava.bio.seq.*;
    import org.biojava.bio.symbol.*;

    public class CrossProduct {
        public static void main(String[] args) {
            //make a CrossProductAlphabet from a List
            List l = Collections.nCopies(3, DNATools.getDNA());
            Alphabet codon = AlphabetManager.getCrossProductAlphabet(l);

            //get the same Alphabet by name
            Alphabet codon2 =
                AlphabetManager.generateCrossProductAlphaFromName("(DNA x DNA x DNA)");

            //show that the two Alphabets are canonical
            System.out.println(codon == codon2);
        }
    }
Making Triplet Views on a SymbolList

    import org.biojava.bio.seq.*;
    import org.biojava.bio.symbol.*;

    public class CodonView {
        public static void main(String[] args) {
            try {
                //make a DNA SymbolList
                SymbolList dna = DNATools.createDNA("atgcccgcgtaa");
                System.out.println("Length of dna " + dna.length());

                //get a Codon View (window size of three)
                SymbolList codons = SymbolListViews.windowedSymbolList(dna, 3);
                System.out.println("Length of codons " + codons.length());

                //get a Triplet View
                SymbolList triplets = SymbolListViews.orderNSymbolList(dna, 3);
                System.out.println("Length of triplets " + triplets.length());
            } catch (Exception ex) {
                ex.printStackTrace();
            }
        }
    }
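The two views differ in how the windows are laid out. The plain-Java sketch below assumes that the windowed view uses non-overlapping windows and the order-N view uses overlapping ones, which is an assumption about the BioJava methods rather than something the slide states. `ViewSketch` and its methods are hypothetical and work on Strings instead of SymbolLists.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical String-based imitation of windowed and order-N views.
final class ViewSketch {
    private ViewSketch() {
    }

    // Non-overlapping windows: positions 1-3, 4-6, 7-9, ...
    static List<String> windowed(String seq, int size) {
        if (size < 1 || seq.length() % size != 0) {
            throw new IllegalArgumentException("Length must be a multiple of " + size);
        }
        List<String> windows = new ArrayList<>();
        for (int i = 0; i < seq.length(); i += size) {
            windows.add(seq.substring(i, i + size));
        }
        return windows;
    }

    // Overlapping windows: positions 1-3, 2-4, 3-5, ...
    static List<String> overlapping(String seq, int order) {
        if (order < 1) {
            throw new IllegalArgumentException("Order must be positive");
        }
        List<String> windows = new ArrayList<>();
        for (int i = 0; i + order <= seq.length(); i++) {
            windows.add(seq.substring(i, i + order));
        }
        return windows;
    }

    public static void main(String[] args) {
        String dna = "atgcccgcgtaa";
        System.out.println(windowed(dna, 3));           // prints [atg, ccc, gcg, taa]
        System.out.println(overlapping(dna, 3).size()); // prints 10
    }
}
```

Under that assumption, a sequence of length 12 gives 12 / 3 = 4 codon windows and 12 - 3 + 1 = 10 overlapping triplets.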
Getting a Symbol for a Codon

    import java.util.*;
    import org.biojava.bio.seq.*;
    import org.biojava.bio.symbol.*;

    public class MakeATG {
        public static void main(String[] args) {
            //make a CrossProductAlphabet from a List
            List l = Collections.nCopies(3, DNATools.getDNA());
            Alphabet codon = AlphabetManager.getCrossProductAlphabet(l);

            //get the codon made of atg
            List syms = new ArrayList(3);
            syms.add(DNATools.a());
            syms.add(DNATools.t());
            syms.add(DNATools.g());

            try {
                Symbol atg = codon.getSymbol(syms);
                //print inside the try block so atg is never null here
                System.out.println("Name of atg: " + atg.getName());
            } catch (IllegalSymbolException ex) {
                //used Symbol from Alphabet that is not a component of codon
                ex.printStackTrace();
            }
        }
    }
Breaking a Codon into its Parts

    import java.util.*;
    import org.biojava.bio.seq.*;
    import org.biojava.bio.symbol.*;

    public class BreakingComponents {
        public static void main(String[] args) {
            //make the 'codon' alphabet
            List l = Collections.nCopies(3, DNATools.getDNA());
            Alphabet alpha = AlphabetManager.getCrossProductAlphabet(l);

            //get the first symbol in the alphabet
            Iterator iter = ((FiniteAlphabet) alpha).iterator();
            AtomicSymbol codon = (AtomicSymbol) iter.next();
            System.out.print(codon.getName() + " is made of: ");

            //break it into a list of its components
            List symbols = codon.getSymbols();
            for (int i = 0; i < symbols.size(); i++) {
                if (i != 0) {
                    System.out.print(", ");
                }
                Symbol sym = (Symbol) symbols.get(i);
                System.out.print(sym.getName());
            }
        }
    }