BNS: An LDAP-based Biomolecule Naming Service • Robert Kincaid, Daniel Kluesing, Aditya Vailaya
Outline • Problem statement and design goals • BNS architecture • BNS use cases • LDAP • Final thoughts
Problem • There is an increasing need to connect related genomic and proteomic measurements • However, no universally accepted/used identifiers exist for biomolecules (GenBank, RefSeq, Unigene, PIR, Swiss-Prot … ) • High-throughput measurements make manual association of related measurements impractical • We need a practical solution that uses today’s data
Initial Motivating Use Cases • Generate a “view” of data that is formed by the “join” of: • A microarray and a protein array • A microarray and mass spec proteomics data • An Agilent and a brand X microarray • A commercial oligo array and a home-brew cDNA array
Solution • A high-speed biomolecule Name/ID resolver • Converts between different identifier schemes based on gene locus or transcript • Converts between different states of transcription: gene -> transcript -> protein • Converts between gene symbols and aliases • Easy to deploy and to code applications against • Platform and language neutral • Explores the research questions of feasibility and usefulness of: (1) a Name/ID resolver, (2) LDAP
System Is Not • A sequence database • Primarily an annotation system • Intended to be updated by users • An object/interface naming service • A complete, definitive system
BNS – Biomolecule Naming Service • Research Prototype: • Based on LDAP for easy deployment and wide platform support • Derived from LocusLink data
[Architecture diagram. Labels: client application; BNS API; BNS name/ID resolver; LDAP API; LDAP protocol; LDAP-based name server; LocusLink download and conversion scripts; FTP (via HTTP proxy); NCBI]
Example Entry (LDIF)
dn: locus=1,org=Homo sapiens,dc=BNS
objectClass: bnsobject
locus: 1
sym: A1BG
name: alpha-1-B glycoprotein
ug: Hs.373554
summary: The protein encoded by this gene is a plasma glycopro . . .
org: Homo sapiens
chr: 19q13.4
altsym: A1B
altsym: ABG
altsym: GAB
gbaccn: AC010642
. . .
gbaccn: W25099

dn: transcript=NM_130786,locus=1,org=Homo sapiens,dc=BNS
objectClass: bnstranscript
locus: 1
transcript: NM_130786
nm: NM_130786
np: NP_570602
prod: alpha 1B-glycoprotein
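The entry names above encode the hierarchy directly: a transcript entry sits under its locus, which sits under its organism. As a minimal sketch, the JDK's `javax.naming.ldap.LdapName` can pull the locus and organism back out of a transcript DN. The class name `BnsDnSketch` and the helper `rdnValue` are hypothetical and are not part of the BNS API; only the DN string comes from the slide.

```java
import javax.naming.InvalidNameException;
import javax.naming.ldap.LdapName;
import javax.naming.ldap.Rdn;

// Hypothetical helper, not part of BNS: reads one component out of an entry DN.
class BnsDnSketch {

    /** Returns the value of the RDN with the given type, or null if the DN has none. */
    static String rdnValue(String dn, String type) {
        try {
            LdapName name = new LdapName(dn);
            // getRdns() lists components from the root (dc=BNS) towards the leaf.
            for (Rdn rdn : name.getRdns()) {
                if (rdn.getType().equalsIgnoreCase(type)) {
                    return String.valueOf(rdn.getValue());
                }
            }
            return null;
        } catch (InvalidNameException e) {
            throw new IllegalArgumentException("Not a valid DN: " + dn, e);
        }
    }

    public static void main(String[] args) {
        // DN copied from the example entry above.
        String dn = "transcript=NM_130786,locus=1,org=Homo sapiens,dc=BNS";
        System.out.println(rdnValue(dn, "locus"));
        System.out.println(rdnValue(dn, "org"));
    }
}
```

Because the locus is part of the transcript's DN, a client can find the parent gene of a transcript without a second search.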
Object Model
• BNSConnection
  • Connect/Disconnect to LDAP server (local or remote)
    connect(String url, String org)
  • Query, Lookup functions
    BNSObject lookupID(String id)
    String resolveTranscriptPair(String refseqID)
    List lookupSymbolList(String symbol)
• BNSObject
  • Returned by query/lookup methods
  • Get/Set methods for attributes
  • Various text output functions provided for convenience
    toString(), toText(), toTabbedText(), toHTML()
Example
Java code:
try {
    // STEP 1: Connect to the ldap server
    conn.connect("ldap://localhost");
    // STEP 2: Do some BNS calls
    System.out.println(conn.lookupSymbol("ABL1").toText());
    // STEP 3: Disconnect - That's all there is to it!
    conn.disconnect();
}
catch (BNSException e) {
    e.printStackTrace();
}
Output:
LOCUS          25
SYMBOL         ABL1
ALIAS          ABL, JTK7, p150, c-ABL
DESCRIPTION    v-abl Abelson murine leukemia viral oncogene homolog 1
UNIGENE ID     Hs.14635
GENBANK        K00009, AAA51895, M13099, AAA51896, U07563, AAB60393, AAB60394, . . .
TRANSCRIPTS    NM_005157, NP_005148, , v-abl Abelson murine leukemia viral oncogene homolog 1 isoform a
               NM_007313, NP_009297, , v-abl Abelson murine leukemia viral oncogene homolog 1 isoform b
GENE ONTOLOGY  cellular component : 0005634 : nucleus
               biological process : 0007048 : oncogenesis
. . .
Example
Java code:
try {
    // STEP 1: Connect to the ldap server
    conn.connect("ldap://localhost");
    // STEP 2: Do some BNS calls
    System.out.println(conn.resolveTranscriptionPair("NM_000018"));
    System.out.println(conn.resolveTranscriptionPair("NP_000009"));
    System.out.println(conn.resolveSymbol("PSCP"));
    System.out.println(conn.lookupSymbol("A1BG").get_description());
    // STEP 3: Disconnect - That's all there is to it!
    conn.disconnect();
}
catch (BNSException e) {
    e.printStackTrace();
}
Output:
NP_000009
NM_000018
BRCA1
alpha-1-B glycoprotein
A real case – joining Microarray and MS data* • Microarray: 12626 genes (GenBank/UniGene IDs), of which BNS mapped 9419 (75%) to a locus • Mass spec proteomics: 741 protein IDs (RefSeq/GenBank IDs), of which BNS mapped 441 (60%) to a locus • 359 MS IDs matched to microarray features (48%)
* Data provided by Joel Sevinsky and Natalie Ahn, Dept. of Chemistry and Biochemistry, University of Colorado, Boulder
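The join in this case study goes through the locus: each identifier on either platform is resolved to a locus, and a mass spec ID counts as matched when its locus also carries at least one microarray feature. A minimal in-memory sketch of that step follows. The class `LocusJoinSketch`, its method names, and the plain `Map` standing in for the resolver are all assumptions for illustration; in BNS the ID-to-locus mapping would come from the name server. The accessions are taken from the example entries earlier in the deck, and `UNMAPPED_ID` is a made-up placeholder.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: join two ID lists through a shared locus.
class LocusJoinSketch {

    /** Collects the loci that the given IDs resolve to; unresolved IDs are skipped. */
    static Set<String> lociOf(Collection<String> ids, Map<String, String> idToLocus) {
        Set<String> loci = new HashSet<>();
        for (String id : ids) {
            String locus = idToLocus.get(id);
            if (locus != null) {
                loci.add(locus);
            }
        }
        return loci;
    }

    /** Returns the MS IDs whose locus also has at least one microarray feature. */
    static List<String> matched(Collection<String> arrayIds,
                                Collection<String> msIds,
                                Map<String, String> idToLocus) {
        Set<String> arrayLoci = lociOf(arrayIds, idToLocus);
        List<String> hits = new ArrayList<>();
        for (String id : msIds) {
            String locus = idToLocus.get(id);
            if (locus != null && arrayLoci.contains(locus)) {
                hits.add(id);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Stand-in for the resolver: accession -> locus.
        Map<String, String> idToLocus = new HashMap<>();
        idToLocus.put("M13099", "25");     // GenBank accession listed under ABL1
        idToLocus.put("NP_005148", "25");  // protein product listed under ABL1
        idToLocus.put("NP_570602", "1");   // protein product listed under A1BG

        List<String> arrayIds = Arrays.asList("M13099", "UNMAPPED_ID");
        List<String> msIds = Arrays.asList("NP_005148", "NP_570602");

        System.out.println(matched(arrayIds, msIds, idToLocus));
    }
}
```

Matching at the locus level is the loosest association; a stricter join would require the same transcript/product pair on both sides.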
High Throughput Use Cases
• Annotation of biomolecule lists
  example: microarray annotation, analysis
  bnsConnection.lookupID("NM_00018").toTabbedText();
• Ad-hoc creation of biomolecule lists via query
  example: create a theme-based microarray
  List bnsObjects = bnsConnection.query("godesc=*onco*");
• Merging biomolecule data with varied identifiers
  example: joining high throughput measurements
  bnsConnection.resolveTranscript("NM_00018");
  bnsConnection.lookupID("NM_00018").getUnigene();
High Throughput Use Cases
• Normalizing biomolecule IDs to a common scheme
  example: microarray annotation
  bnsConnection.lookupID("NM_00018").get_unigene();
• Validating gene symbols
  example: text mining
  if (bnsConnection.lookupSymbol("PSCP") != null)
• Normalizing symbols to the official/preferred symbol
  example: text mining, microarray annotation
  officialSym = bnsConnection.lookupSymbol("PSCP").get_sym();
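Symbol validation and normalization both come down to finding the entry whose official symbol or one of whose aliases equals the input. One way such a lookup could be phrased at the LDAP level is an OR filter over the `sym` and `altsym` attributes shown in the example entry. The sketch below only builds the filter string, escaping special characters as RFC 4515 requires; whether BNS builds its filters this way is an assumption, and `SymbolFilterSketch` is a hypothetical name.

```java
// Hypothetical sketch: build an LDAP search filter matching a symbol or alias.
class SymbolFilterSketch {

    /** Escapes the characters that RFC 4515 reserves inside a filter value. */
    static String escape(String value) {
        StringBuilder sb = new StringBuilder();
        for (char c : value.toCharArray()) {
            switch (c) {
                case '\\': sb.append("\\5c"); break;
                case '*':  sb.append("\\2a"); break;
                case '(':  sb.append("\\28"); break;
                case ')':  sb.append("\\29"); break;
                case '\0': sb.append("\\00"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Filter that matches entries whose sym or altsym equals the given symbol. */
    static String symbolFilter(String symbol) {
        String v = escape(symbol);
        return "(|(sym=" + v + ")(altsym=" + v + "))";
    }

    public static void main(String[] args) {
        System.out.println(symbolFilter("PSCP"));
    }
}
```

Escaping matters for text mining input, where a candidate token may contain parentheses or asterisks that would otherwise change the meaning of the filter.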
Low Throughput Use Cases
• Lookup single ID
  BNSObject bnsObj = bnsConnection.lookupID("NM_00018");
• Display data for single ID
  example: popup information dialog
  bnsObj.toHTML();
BNS Findings • A system like BNS is extremely useful and efficient: • Novel uses of genomic/proteomic data emerged beyond simple joins – text mining, annotation operations, chromosome mapping, etc. • Flexible range of associations possible - exact ID matches, transcript/product matches or looser locus matches • Simpler programming model than typical database access methods • Standardized object models and interfaces for performing “routine” name/ID operations would enable rapid development of applications
Why LDAP • Sequence data easily conforms to a hierarchical directory structure • Sequence databases are often lookup only and are not updated by users (cf. SRS and flat file databases) • LDAP is scalable from very low end systems (slow laptops) to shared high-end servers • Cross-platform, variety of language support, flexible back-ends, open standard • Access control and security • Good performance for minimal cost
LDAP Issues • Approaches problem in a unique way • Can be confusing to newcomers • Easily overcome with modest experience • Potential rate and quantity of individual BNS queries is far beyond the expectations of email address book applications • Seems to work in practice • Assumed solvable by scalability • More difficult to proxy through firewalls than HTTP-based solutions • Socksification possible (trivial with Java)
LDAP Supports Distributed Architecture • Query referral enables transparent federated searching across widely distributed data servers • Data is replicated from central curation server
[Architecture diagram. Labels: client application; BNS API; BNS name/ID resolver; LDAP API; LDAP protocol; three LDAP-based name servers; LocusLink (twice); "outasight" private data]
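On the client side, following referrals is a configuration choice. With the JDK's JNDI LDAP provider it is controlled by the standard `Context.REFERRAL` environment property. The sketch below only assembles the environment and opens no connection; the class `ReferralEnvSketch` is hypothetical, the server URL is the one used in the earlier examples, and nothing here is taken from the BNS source.

```java
import java.util.Hashtable;
import javax.naming.Context;

// Hypothetical sketch: JNDI environment asking the LDAP provider to chase referrals.
class ReferralEnvSketch {

    /** Builds a JNDI environment for the given server URL with referral following on. */
    static Hashtable<String, String> environment(String url) {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, url);
        // "follow" makes the provider chase referrals automatically;
        // the other standard values are "ignore" and "throw".
        env.put(Context.REFERRAL, "follow");
        return env;
    }

    public static void main(String[] args) {
        Hashtable<String, String> env = environment("ldap://localhost");
        System.out.println(env.get(Context.REFERRAL));
        // Passing env to new javax.naming.directory.InitialDirContext(env)
        // would open the connection; omitted here because it needs a live server.
    }
}
```

With referrals followed automatically, the application code sees one logical directory even when entries live on several servers.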
LDAP Findings
• LDAP appears quite suitable for deploying this kind of system:
• Performance appears to be good
  ~20-200+ lookups/sec – usually bandwidth limited
  queries can be roundtrip optimized
  server-side in-memory caching possible
  low footprint allows client-side instance for special high-throughput needs
  substantially faster than web-services equivalent*
• Minimal infrastructure is required
  scalable from laptop to high-end multi-processor server
  accessible from many environments (Java, Perl, C/C++, Matlab, etc.)
• Replication/Referral show promise for building distributed systems of biomolecule data
* Based on data from Don Gilbert, Indiana Univ. (http://iubio.bio.indiana.edu/grid/directories)
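Since lookups are usually bandwidth limited, repeated requests for the same ID are an obvious thing to avoid. The slide mentions server-side caching; the sketch below shows a related but different technique, a small client-side LRU cache in front of whatever function performs the real lookup. The class `LookupCacheSketch` and the `loader` function are hypothetical and stand in for a call such as a BNS lookup; the sketch is not thread-safe.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: client-side LRU cache wrapped around a lookup function.
class LookupCacheSketch<K, V> {
    private final Map<K, V> cache;
    private final Function<K, V> loader;
    int misses = 0;  // number of times the loader was actually called

    LookupCacheSketch(final int capacity, Function<K, V> loader) {
        this.loader = loader;
        // accessOrder=true keeps the least recently used entry first.
        this.cache = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > capacity;
            }
        };
    }

    /** Returns the cached value, calling the loader only on a miss. */
    V get(K key) {
        V value = cache.get(key);
        if (value == null) {
            misses++;
            value = loader.apply(key);
            if (value != null) {
                cache.put(key, value);
            }
        }
        return value;
    }

    public static void main(String[] args) {
        // The lambda stands in for a round trip to the name server.
        LookupCacheSketch<String, String> cache =
                new LookupCacheSketch<>(2, id -> "entry for " + id);
        cache.get("NM_000018");
        cache.get("NM_000018");  // served from the cache
        System.out.println(cache.misses);
    }
}
```

A cache like this only pays off when the same IDs recur, as they do when annotating several arrays that share probes.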
Conclusion • Some form of consistent ubiquitous interface for performing BNS-like operations is useful and desirable • Efforts to create unified identifier schemes should consider a LocusLink-like organizing principle as these transcript/product relationships are important to emerging analyses • Properly overloaded ID conventions could eliminate the need for ID conversions (e.g. Hs12345M6789 vs. Hs12345P6789, Hs12346M*, etc) • LDAP shows promise as a useful lightweight high-performance delivery mechanism for biomolecule information
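To make the overloaded-ID idea concrete, here is a minimal sketch that reads the example IDs as organism prefix, locus number, a type letter (`M` or `P`), and a serial number. That reading is an assumption based only on the examples on the slide, as is the idea that a transcript and its product share the serial number; the class `OverloadedIdSketch` and its methods are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: IDs such as Hs12345M6789 read as organism + locus + type + serial.
class OverloadedIdSketch {
    private static final Pattern ID = Pattern.compile("([A-Z][a-z])(\\d+)([MP])(\\d+)");

    /** Returns organism prefix plus locus number, e.g. "Hs12345", or null if unparsable. */
    static String locusKey(String id) {
        Matcher m = ID.matcher(id);
        return m.matches() ? m.group(1) + m.group(2) : null;
    }

    /** True when both IDs parse and carry the same organism and locus. */
    static boolean sameLocus(String a, String b) {
        String key = locusKey(a);
        return key != null && key.equals(locusKey(b));
    }

    /** Swaps the type letter (M <-> P), keeping locus and serial; null if unparsable. */
    static String counterpart(String id) {
        Matcher m = ID.matcher(id);
        if (!m.matches()) {
            return null;
        }
        String other = m.group(3).equals("M") ? "P" : "M";
        return m.group(1) + m.group(2) + other + m.group(4);
    }

    public static void main(String[] args) {
        System.out.println(sameLocus("Hs12345M6789", "Hs12345P6789"));
        System.out.println(counterpart("Hs12345M6789"));
    }
}
```

Under such a convention the gene/transcript/product relationships become string operations, so no lookup service is needed for them.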
Availability • http://openbns.sourceforge.net
Acknowledgements • University of Colorado: Natalie Ahn, Joel Sevinsky • Agilent: Paul Wolber, Karen Shannon, Dean Thompson, Annette Adler, Amir Ben-Dor • Indiana University: Don Gilbert • Daniel Kluesing • Aditya Vailaya