Robert Kincaid Daniel Kluesing Aditya Vailaya

BNS: An LDAP-based Biomolecule Naming Service • Robert Kincaid • Daniel Kluesing • Aditya Vailaya

Outline • Problem statement and design goals • BNS architecture • BNS use cases • LDAP • Final thoughts

Problem • There is an increasing need to connect related genomic and proteomic measurements • However, no universally accepted/used identifiers exist for biomolecules (GenBank, RefSeq, Unigene, PIR, Swiss-Prot … ) • High-throughput measurements make manual association of related measurements impractical • We need a practical solution that uses today’s data

Initial Motivating Use Cases • Generate a “view” of data that is formed by the “join” of: • A microarray and a protein array • A microarray and mass spec proteomics data • An Agilent and a brand X microarray • A commercial oligo array and a home-brew cDNA array

Solution • A high-speed biomolecule name/ID resolver • Converts between different identifier schemes based on gene locus or transcript • Converts between different states of transcription: gene -> transcript -> protein • Converts between gene symbols and aliases • Easy to deploy and to code applications against • Platform and language neutral • Explores the research questions of feasibility and usefulness of: - Name/ID resolver - LDAP

System Is Not • A sequence database • Primarily an annotation system • Intended to be updated by users • An object/interface naming service • A complete, definitive system

BNS – Biomolecule Naming Service • Research prototype • Based on LDAP for easy deployment and wide platform support • Derived from LocusLink data
[Architecture diagram. Labels: client application; BNS API; BNS name/ID resolver; LDAP API; LDAP protocol; LDAP-based name server; LocusLink download and conversion scripts; FTP (via HTTP proxy); NCBI]

Directory Structure • LDAP Schema

Example Entry (LDIF)
    dn: locus=1,org=Homo sapiens,dc=BNS
    objectClass: bnsobject
    locus: 1
    sym: A1BG
    name: alpha-1-B glycoprotein
    ug: Hs.373554
    summary: The protein encoded by this gene is a plasma glycopro . . .
    org: Homo sapiens
    chr: 19q13.4
    altsym: A1B
    altsym: ABG
    altsym: GAB
    gbaccn: AC010642
    . . .
    gbaccn: W25099

    dn: transcript=NM_130786,locus=1,org=Homo sapiens,dc=BNS
    objectClass: bnstranscript
    locus: 1
    transcript: NM_130786
    nm: NM_130786
    np: NP_570602
    prod: alpha 1B-glycoprotein
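The entry above carries multi-valued attributes (altsym, gbaccn), which is the main thing client code has to handle when reading it. A minimal sketch of reading one simple LDIF entry into an attribute-to-values map follows. The class name LdifEntrySketch is hypothetical and not part of BNS, and the parser deliberately ignores LDIF features such as base64 values ("::") and folded continuation lines.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of BNS: parses one simple LDIF entry.
final class LdifEntrySketch {

    // Maps each attribute name to all of its values, in file order.
    static Map<String, List<String>> parse(String ldif) {
        Map<String, List<String>> attrs = new LinkedHashMap<>();
        for (String line : ldif.split("\n")) {
            int colon = line.indexOf(':');
            if (colon <= 0) {
                continue; // skip blank or malformed lines
            }
            String name = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            attrs.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }
        return attrs;
    }

    public static void main(String[] args) {
        String entry = String.join("\n",
                "dn: locus=1,org=Homo sapiens,dc=BNS",
                "objectClass: bnsobject",
                "sym: A1BG",
                "altsym: A1B",
                "altsym: ABG",
                "altsym: GAB");
        Map<String, List<String>> attrs = parse(entry);
        System.out.println(attrs.get("sym"));
        System.out.println(attrs.get("altsym"));
    }
}
```

Keeping every attribute as a list avoids special-casing single-valued fields such as sym.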

Object Model • BNSConnection - Connect/disconnect to LDAP server (local or remote): connect(String url, String org) - Query and lookup functions: BNSObject lookupID(String id); String resolveTranscriptPair(String refseqID); List lookupSymbolList(String symbol) • BNSObject - Returned by query/lookup methods - Get/set methods for attributes - Various text output functions provided for convenience: toString(), toText(), toTabbedText(), toHTML()

Example
Java code:
    try {
        // STEP 1: Connect to the LDAP server
        conn.connect("ldap://localhost");
        // STEP 2: Do some BNS calls
        System.out.println(conn.lookupSymbol("ABL1").toText());
        // STEP 3: Disconnect - that's all there is to it!
        conn.disconnect();
    }
    catch (BNSException e) {
        e.printStackTrace();
    }
Output:
    LOCUS          25
    SYMBOL         ABL1
    ALIAS          ABL, JTK7, p150, c-ABL
    DESCRIPTION    v-abl Abelson murine leukemia viral oncogene homolog 1
    UNIGENE ID     Hs.14635
    GENBANK        K00009, AAA51895, M13099, AAA51896, U07563, AAB60393, AAB60394, . . .
    TRANSCRIPTS    NM_005157, NP_005148, , v-abl Abelson murine leukemia viral oncogene homolog 1 isoform a
                   NM_007313, NP_009297, , v-abl Abelson murine leukemia viral oncogene homolog 1 isoform b
    GENE ONTOLOGY  cellular component : 0005634 : nucleus
                   biological process : 0007048 : oncogenesis
    . . .

Example
Java code:
    try {
        // STEP 1: Connect to the LDAP server
        conn.connect("ldap://localhost");
        // STEP 2: Do some BNS calls
        System.out.println(conn.resolveTranscriptionPair("NM_000018"));
        System.out.println(conn.resolveTranscriptionPair("NP_000009"));
        System.out.println(conn.resolveSymbol("PSCP"));
        System.out.println(conn.lookupSymbol("A1BG").get_description());
        // STEP 3: Disconnect - that's all there is to it!
        conn.disconnect();
    }
    catch (BNSException e) {
        e.printStackTrace();
    }
Output:
    NP_000009
    NM_000018
    BRCA1
    alpha-1-B glycoprotein
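The first two calls in the example above behave like a bidirectional lookup: a transcript accession resolves to its protein product and back. A minimal in-memory sketch of that behavior follows. TranscriptPairSketch is a hypothetical stand-in that does not talk to LDAP and is not the BNS implementation; the accession pairs are the ones shown on the slides.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a transcript/product resolver; no LDAP involved.
final class TranscriptPairSketch {
    private final Map<String, String> partner = new HashMap<>();

    // Registers the pair in both directions so either ID finds the other.
    void addPair(String transcriptId, String proteinId) {
        partner.put(transcriptId, proteinId);
        partner.put(proteinId, transcriptId);
    }

    // Returns the paired accession, or null when the ID is unknown.
    String resolve(String refseqId) {
        return partner.get(refseqId);
    }

    public static void main(String[] args) {
        TranscriptPairSketch pairs = new TranscriptPairSketch();
        pairs.addPair("NM_000018", "NP_000009");
        pairs.addPair("NM_130786", "NP_570602");
        System.out.println(pairs.resolve("NM_000018"));
        System.out.println(pairs.resolve("NP_000009"));
    }
}
```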

A real case – joining Microarray and MS data* • Microarray: 12626 genes, identified by GenBank/UniGene IDs; BNS resolved 9419 (75%) to a locus • Mass spec proteomics: 741 protein IDs, identified by RefSeq/GenBank IDs; BNS resolved 441 (60%) to a locus • 359 MS IDs matched to microarray features (48%) * Data provided by Joel Sevinsky and Natalie Ahn, Dept. of Chemistry and Biochemistry, University of Colorado, Boulder
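The join in this case can be sketched as two steps: resolve every identifier on each platform to a locus, then match the platforms on locus. In the sketch below a plain map stands in for the BNS lookups, and the class and method names are hypothetical. The sample IDs come from the A1BG (locus 1) and ABL1 (locus 25) entries shown earlier.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a locus-level join; a map replaces the BNS server.
final class LocusJoinSketch {

    // For each locus seen on both platforms, returns {arrayId, msId}.
    static Map<String, String[]> join(List<String> arrayIds,
                                      List<String> msIds,
                                      Map<String, String> idToLocus) {
        Map<String, String> locusToArrayId = new HashMap<>();
        for (String id : arrayIds) {
            String locus = idToLocus.get(id);
            if (locus != null) {
                locusToArrayId.putIfAbsent(locus, id);
            }
        }
        Map<String, String[]> joined = new TreeMap<>();
        for (String id : msIds) {
            String locus = idToLocus.get(id);
            if (locus != null && locusToArrayId.containsKey(locus)) {
                joined.putIfAbsent(locus, new String[] {locusToArrayId.get(locus), id});
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<String, String> idToLocus = new HashMap<>();
        idToLocus.put("AC010642", "1");   // GenBank accession of A1BG
        idToLocus.put("NP_570602", "1");  // protein product of A1BG
        idToLocus.put("K00009", "25");    // GenBank accession of ABL1
        idToLocus.put("NP_005148", "25"); // protein product of ABL1

        Map<String, String[]> joined = join(
                List.of("AC010642", "K00009"),
                List.of("NP_570602", "NP_005148", "UNRESOLVED_ID"),
                idToLocus);
        joined.forEach((locus, ids) ->
                System.out.println(locus + ": " + ids[0] + " <-> " + ids[1]));
    }
}
```

Identifiers that do not resolve to a locus simply drop out of the join, which is why the matched fraction is lower than either platform's resolved fraction.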

High Throughput Use Cases • Annotation of biomolecule lists (example: microarray annotation, analysis): bnsConnection.lookupID("NM_000018").toTabbedText(); • Ad-hoc creation of biomolecule lists via query (example: create a theme-based microarray): List bnsObjects = bnsConnection.query("godesc=*onco*"); • Merging biomolecule data with varied identifiers (example: joining high throughput measurements): bnsConnection.resolveTranscript("NM_000018"); bnsConnection.lookupID("NM_000018").getUnigene();
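The query string "godesc=*onco*" is an LDAP substring filter. When the search text comes from user input, characters that are special in filters should be escaped as described in RFC 4515. The sketch below builds such filters; LdapFilterSketch is a hypothetical helper, and whether the BNS query() method expects the surrounding parentheses is not stated on the slide, so treat that as an assumption.

```java
// Hypothetical helper for building LDAP search filters (RFC 4515 escaping).
final class LdapFilterSketch {

    // Escapes the characters that have special meaning inside a filter value.
    static String escape(String value) {
        StringBuilder sb = new StringBuilder();
        for (char c : value.toCharArray()) {
            switch (c) {
                case '\\': sb.append("\\5c"); break;
                case '*':  sb.append("\\2a"); break;
                case '(':  sb.append("\\28"); break;
                case ')':  sb.append("\\29"); break;
                case '\0': sb.append("\\00"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    // Substring match: attribute contains the given text.
    static String contains(String attr, String text) {
        return "(" + attr + "=*" + escape(text) + "*)";
    }

    // Exact match on a single attribute value.
    static String equalTo(String attr, String value) {
        return "(" + attr + "=" + escape(value) + ")";
    }

    public static void main(String[] args) {
        System.out.println(contains("godesc", "onco"));
        System.out.println(equalTo("sym", "ABL1"));
    }
}
```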

High Throughput Use Cases • Normalizing biomolecule IDs to a common scheme (example: microarray annotation): bnsConnection.lookupID("NM_000018").get_unigene(); • Validating gene symbols (example: text mining): if (bnsConnection.lookupSymbol("PSCP") != null) • Normalizing symbols to the official/preferred symbol (example: text mining, microarray annotation): officialSym = bnsConnection.lookupSymbol("PSCP").get_sym();
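Symbol validation and normalization both reduce to one lookup from any known symbol or alias to the official symbol. A minimal in-memory sketch follows, using the aliases listed for ABL1 and A1BG on earlier slides. SymbolNormalizerSketch is hypothetical and not the BNS implementation. It assumes exact, case-sensitive matching and keeps the first registration when an alias is claimed by more than one gene.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical alias table; the real service answers these lookups over LDAP.
final class SymbolNormalizerSketch {
    private final Map<String, String> toOfficial = new HashMap<>();

    // An official symbol always maps to itself; aliases never override
    // an earlier registration.
    void register(String official, String... aliases) {
        toOfficial.put(official, official);
        for (String alias : aliases) {
            toOfficial.putIfAbsent(alias, official);
        }
    }

    // Empty result means the symbol is not known (validation failure).
    Optional<String> normalize(String symbol) {
        return Optional.ofNullable(toOfficial.get(symbol));
    }

    public static void main(String[] args) {
        SymbolNormalizerSketch symbols = new SymbolNormalizerSketch();
        symbols.register("ABL1", "ABL", "JTK7", "p150", "c-ABL");
        symbols.register("A1BG", "A1B", "ABG", "GAB");
        System.out.println(symbols.normalize("JTK7").orElse("(unknown)"));
        System.out.println(symbols.normalize("NOT_A_GENE").isPresent());
    }
}
```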

Low Throughput Use Cases • Lookup single ID: BNSObject bnsObj = bnsConnection.lookupID("NM_000018"); • Display data for single ID (example: popup information dialog): bnsObj.toHTML();

BNS Findings • A system like BNS is extremely useful and efficient: • Novel uses of genomic/proteomic data emerged beyond simple joins: text mining, annotation operations, chromosome mapping, etc. • Flexible range of associations possible: exact ID matches, transcript/product matches, or looser locus matches • Simpler programming model than typical database access methods • Standardized object models and interfaces for performing "routine" name/ID operations would enable rapid development of applications

Why LDAP • Sequence data easily conforms to a hierarchical directory structure • Sequence databases are often lookup only and are not updated by users (cf. SRS and flat file databases) • LDAP is scalable from very low end systems (slow laptops) to shared high-end servers • Cross-platform, variety of language support, flexible back-ends, open standard • Access control and security • Good performance for minimal cost

LDAP Issues • Approaches problem in a unique way - Can be confusing to newcomers - Easily overcome with modest experience • Potential rate and quantity of individual BNS queries is far beyond the expectations of email address book applications - Seems to work in practice - Assumed solvable by scalability • More difficult to proxy through firewalls than HTTP-based solutions - Socksification possible (trivial with Java)
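On the Java side, "socksification" usually means setting the JDK's standard socksProxyHost and socksProxyPort networking properties before any sockets are opened. A minimal sketch follows. The proxy host name is a placeholder, and whether a given LDAP client library routes its connections through these properties is an assumption to verify in your own environment.

```java
// Sketch: route plain Java sockets through a SOCKS proxy via JDK properties.
final class SocksConfigSketch {

    // Must run before the first connection is opened to take effect.
    static void useSocksProxy(String host, int port) {
        System.setProperty("socksProxyHost", host);
        System.setProperty("socksProxyPort", Integer.toString(port));
    }

    public static void main(String[] args) {
        // "socks.example.org" is a placeholder, not a real proxy.
        useSocksProxy("socks.example.org", 1080);
        System.out.println(System.getProperty("socksProxyHost")
                + ":" + System.getProperty("socksProxyPort"));
    }
}
```

The same properties can be passed on the command line with -DsocksProxyHost=... and -DsocksProxyPort=... so that application code stays unchanged.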

LDAP Supports Distributed Architecture • Query referral enables transparent federated searching across widely distributed data servers • Data is replicated from central curation server
[Architecture diagram. Labels: client application; BNS API; BNS name/ID resolver; LDAP API; LDAP protocol; LDAP-based name server (three instances); LocusLink (two instances); OUTASIGHT; private data]

LDAP Findings • LDAP appears quite suitable for deploying this kind of system: • Performance appears to be good - ~20-200+ lookups/sec, usually bandwidth limited - queries can be round-trip optimized - server-side in-memory caching possible - low footprint allows client-side instance for special high-throughput needs - substantially faster than web-services equivalent* • Minimal infrastructure is required - scalable from laptop to high-end multi-processor server - accessible from many environments (Java, Perl, C/C++, Matlab, etc.) • Replication/referral show promise for building distributed systems of biomolecule data * Based on data from Don Gilbert, Indiana Univ. (http://iubio.bio.indiana.edu/grid/directories)
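To put the quoted lookup rates in context, the sketch below estimates how long one lookup per gene would take for the 12626-gene microarray from the real case. It is back-of-the-envelope arithmetic only: ThroughputSketch is a hypothetical helper, and it assumes strictly sequential lookups at a constant rate with no batching or caching.

```java
// Rough wall-clock estimate for a batch of sequential lookups.
final class ThroughputSketch {

    static double secondsFor(int lookups, double lookupsPerSecond) {
        if (lookupsPerSecond <= 0) {
            throw new IllegalArgumentException("rate must be positive");
        }
        return lookups / lookupsPerSecond;
    }

    public static void main(String[] args) {
        int genes = 12626; // microarray size from the real case
        System.out.printf("at 20/s:  %.1f s%n", secondsFor(genes, 20.0));
        System.out.printf("at 200/s: %.1f s%n", secondsFor(genes, 200.0));
    }
}
```

At the low end of the range this is on the order of ten minutes, and at the high end about a minute, which is where round-trip optimization and a client-side instance become attractive.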

Conclusion • Some form of consistent ubiquitous interface for performing BNS-like operations is useful and desirable • Efforts to create unified identifier schemes should consider a LocusLink-like organizing principle as these transcript/product relationships are important to emerging analyses • Properly overloaded ID conventions could eliminate the need for ID conversions (e.g. Hs12345M6789 vs. Hs12345P6789, Hs12346M*, etc) • LDAP shows promise as a useful lightweight high-performance delivery mechanism for biomolecule information
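The overloaded ID idea can be illustrated with a small parser. The sketch assumes one reading of the example IDs: a two-letter organism code, a locus number, a type letter (M for transcript, P for protein), and a serial number. That reading is an assumption, not a convention defined on the slide, and OverloadedIdSketch is a hypothetical name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for IDs shaped like Hs12345M6789 / Hs12345P6789.
final class OverloadedIdSketch {
    // Groups: organism code, locus number, type letter, serial number.
    private static final Pattern ID =
            Pattern.compile("([A-Z][a-z])(\\d+)([MP])(\\d+)");

    private static Matcher parse(String id) {
        Matcher m = ID.matcher(id);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not an overloaded ID: " + id);
        }
        return m;
    }

    // Organism code plus locus number, e.g. "Hs12345".
    static String locusKey(String id) {
        Matcher m = parse(id);
        return m.group(1) + m.group(2);
    }

    // 'M' or 'P', under the assumed reading of the type letter.
    static char kind(String id) {
        return parse(id).group(3).charAt(0);
    }

    // Two IDs relate to the same locus when their prefixes agree,
    // so no conversion table is needed.
    static boolean sameLocus(String a, String b) {
        return locusKey(a).equals(locusKey(b));
    }

    public static void main(String[] args) {
        System.out.println(sameLocus("Hs12345M6789", "Hs12345P6789"));
        System.out.println(kind("Hs12345P6789"));
    }
}
```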

Availability • http://openbns.sourceforge.net

Acknowledgements • University of Colorado: Natalie Ahn, Joel Sevinsky • Agilent: Paul Wolber, Karen Shannon, Dean Thompson, Annette Adler, Amir Ben-Dor • Indiana University: Don Gilbert • Daniel Kluesing • Aditya Vailaya
