460 likes | 652 Views
A search engine for phylogenetic tree databases. David Fernández-Baca Joint work with Mukul Bansal, Duhong Chen (Computer Science, ISU) and J. Gordon Burleigh (NESCent). PhyloFinder. http://pilin.cs.iastate.edu/phylofinder/. Outline. Introduction PhyloFinder queries Implementation
E N D
A search engine for phylogenetic tree databases David Fernández-Baca Joint work with Mukul Bansal, Duhong Chen (Computer Science, ISU) and J. Gordon Burleigh (NESCent)
PhyloFinder http://pilin.cs.iastate.edu/phylofinder/
Outline • Introduction • PhyloFinder queries • Implementation • Future directions • Acknowledgements
Outline • Introduction • PhyloFinder queries • Implementation • Future directions • Acknowledgements
Issues in Phylogenetic Databases • Taxonomic consistency • Species may appear in multiple trees by different but synonymous names. • Homonyms • Misspellings • Querying capability • Storage/representation • Exploiting classification trees (e.g., NCBI) • Clustering capabilities • Distance measures • Aggregation (synthesis) capabilities • Supertrees • Visualization
Exploiting taxonomic classifications • The leaves in phylogenetic trees may represent different taxonomic levels • A classification tree can allows us to locate trees that contain a taxon, as well as its descendants or ancestors. • E.g., a “Pinaceae" query would identify trees that contain “Pinus thunbergii” or “Abies alba”
TreeBASE capabilities • Search by • taxon • author • citation • study accession number • matrix accession number • structure (topology) • Tree surfing
TreeBASE limitations • Taxonomic name consistency • Querying • Few options • Does not exploit classification • Can’t identify ancestors/descendants • Visualization • Clustering and aggregation (supertrees)
PhyloFinder • A search engine for tree databases • Not a database • Allows powerful phylogenetic queries • Handles synonymous taxonomic names (via TBMap) • Handles misspellings. • Exploits taxonomic classification • Offers precise options for identifying different types of subtrees and metrics for identifying similar trees. • Provides a visualization tool with links to GenBank and TBMap. • Fast • Efficient storage and filtering
PhyloFinder Design • Uses simple but powerful techniques • Inverted index for filtering • Nested-set representation of trees • Least common ancestor queries directly on database • Off-the-shelf spell-checking technology • Can be used with anyphylogenetic database • E.g., PhyLoTA browser • However, set-up is not (yet) automatic
Outline • Introduction • Queries • Storage and querying • Acknowledgements
PhyloFinder Queries • Taxonomic queries involve a single taxon or set of taxa. • Phylogenetic queries take as input a phylogenetic tree • Locate trees that match it in some specified way.
Taxonomic Queries • Contains: Given a list of taxa, return all trees that contain all or any of these names. • Similar to Boolean “AND” and “OR” searches. • Automatically searches for synonymous taxa • Related: Given a taxon, find all trees involving it or any of its descendants in the NCBI taxonomy. • E.g., if the query taxon is “birds", identify all trees that contain bird taxa. • Pathlength: Given a pair of taxa, return all trees containing them, along with the distance between them in each tree.
Phylogenetic Queries • Tree mining: Given a query tree Q, find the database trees that exhibit Qin some way. Options: • Return the trees that have Qas an embeddedsubtree. • Return the trees that refine Q. • Similarity: Given a query tree Q and a specified similarity measure, return trees in database ranked by decreasing similarity from Q. • Requires at least 3 taxon overlap
Phylogenetic queries: Notation 1 • T(A) is the minimal subtree of T that contains the leaves in A. a b c d e f g
Phylogenetic queries: Notation 1 • T(A) is the minimal subtree of T that contains the leaves in A. a b c d e f g
Phylogenetic queries: Notation 1 • T(A) is the minimal subtree of T that contains the leaves in A. a b c d e f g
Phylogenetic queries: Notation 2 • T|A is obtained from T(A) by suppressing all internal nodes that have only one child. a b c d e f g
Phylogenetic queries • Let Q be a query tree with leaf set A. • Q is an embedded subtree of T if and only if it is identical to T|A. • Q is refined by T (T refines Q) if T|A is a refinement of Q.
Phylogenetic queries: Refined by Q embedded in T Q refined by T
Similarity queries • Return trees ranked by a similarity score • Score is a percentage between 0 and 100% reflecting how similar query tree is to candidate tree. • PhyloFinder’s similarity measures: • Robinson-Foulds (RF) similarity • Least common ancestor (LCA) similarity • Score takes degree of taxon overlap into account.
Outline • Introduction • PhyloFinder queries • Implementation • Future directions • Acknowledgements
Least Common Ancestors (LCAs) a b c d e f g
Least Common Ancestors (LCAs) a b c d e f g LCA(b,e)
Storage: Nested intervals (4,4) (5,5) (6,6) (8,8) (9,9) (10,10) b c e a d f • Ancestor/descendant relationship is easy to determine • The between predicate defines subtrees • LCAs are easily computed • Find common ancestor with largest Node_ID (Node_ID,RMD_ID) (3,5) (7,9) (2,9) (1,10)
Cornus 2 4 8 16 32 64 128 Spigelia 1 2 3 5 8 13 21 34 Hedera 13 16 Storage: Inverted index • For each taxon, store a list of all trees that contain it. • Easy to find trees containing any or all elements in a list of taxa • Used as a filter
Building the inverted index • Input trees:1: (((man,pan),gorilla),pongo),2. (((human, coprinus),cryptomonas),zea_mays), . . . ,N: (((dogs,homo_sapiens),pig),lambs) • Convert trees into lists of taxa:man pan gorilla pongo, human coprinus . . . . . . • Synonymy preprocessing: Replace names by TBMap name clusters: tc1 tc2 tc3 tc4 , tc1 tc5 . . . . . . • Build index consisting of (i) dictionary (mapping of taxon names to name clusters) and (ii) postings (lists of tree IDs). tc1 1 2 3 4 tc2 1
Query Processing: Outline Candidate trees: Q: Consult inverted index Results: Compare against Q using LCA queries
Implementing Phylogenetic Queries • Idea: Use LCA queries to compare ancestor-descendant relationships in Q with those in T. • M(x) and M(y) have the same relationship in T as x and y have in Q Q can be embedded in T. • Advantage: Database trees need not be read into main memory.
Implementing Taxonomic Queries • Use Boolean (union/intersection) operations on the inverted index • Example: Querying for “birds" • Find all bird species in the database trees using the NCBI taxonomy tree. • Use inverted index to retrieves the tree ID lists for each bird species. • Return the union of these lists.
Tree visualization • Other tree visualization tools are available: • Hillis, Heath, & St. John 2005. Syst. Biol. 54: 471-482. • Sanderson 2006. Bioinformatics 22: 1004-1006. • Zmasek & Eddy. 2001. Bioinformatics 17: 383-384. • We developed our own to • avoid plug-ins, and • easily highlight query results and provide outlinks to GenBank and TBMap.
Spelling • Suggestions come from TreeBASE and NCBI • Uses GNU Aspell • Modified to handle special characters (`-', `&', '.') and compound words.
Outline • Introduction • PhyloFinder queries • Implementation • Future directions • Acknowledgements
Under construction • Unrooted trees • Supertree methods • MRP, MRF, MMC • Desktop version • Automatic update • Suggestions?
Outline • Introduction • PhyloFinder queries • Implementation • Future directions • Acknowledgements
Thanks to • Rod Page for TBMap • Bill Piel for TreeBASE data • Mike Sanderson • Oliver Eulenstein • National Science Foundation (grant EF-0334832)