Research in the Verspoor Lab

Research in the Verspoor Lab

Linguistics, Lexicons, and Biomedical Verbs • I could go on and on and on and on • But I probably won’t…

Biological Knowledge Discovery Gene Normalization

Gene Normalization • Mapping a gene or protein name to an identifier (e.g. in GenBank) • Very important task for using extracted information (more useful than just a name) • Ambiguity • with English words (“to”“dunce”“wingless”) • in naming (1168 genes in Entrez named “p60”) • in species (949 species have a gene named “p53”)

Normalization methods • Heuristic approach can be effective • Edit distance is too coarse (some characters matter more than others) • Some heuristics that appear to help • Ignore hyphens, commas, some other interrupting punctuation (but not, e.g., ' ) • Ignore parenthetical elements • Consider translations among arabic/roman numerals, and latin/greek letters • Special words for compound noun phrases: receptor, precursor, mRNA, gene, protein, greek letter names, etc.

Gene Normalization:a species-based approach • Based on species detection (NCBI Taxonomy terms) • Global cues: • First (first species mention) • Abstract (most frequent species in abstract) • Majority (most frequent species in doc) • Local cues, close to gene reference: • Recency • Window (most frequent in window) • “mixed” strategy setting confidence • “First” >> “Recency” >> “Window” >> “Majority

Putting it all together: BioCreative II.5 System Architecture Concept Recognition OpenDMAP Gene Normalization

UniProt Dictionary Match • Trie-based data structure • Protein names and synonyms normalized upon insertion • reduces number of variants • same form we search for in the text

Gene candidate selection • Normalized string match against SwissProt names and synonyms • lowercase • eliminating punctuation (apostrophes, hyphens, and parentheses) • converting Greek letters and Roman numerals to a standard form • removing spaces • Left and right token boundary constraints (right constraint relaxed for plurals)

Protein Match example • Sentence: Affixin/β-parvin is an integrin-linked kinase (ILK)-binding focal adhesion protein highly expressed in skeletal muscle and heart. • Normalized Sentence:affixinbparvinisanintegrinlinkedkinaseilkbindingfocaladhesion proteinhighlyexpressedinskeletalmuscleandheart • Match Affixin to affixin (ID: Q9HBI1) • Match β-parvin to bparvin (ID: Q9HBI1)

Species detection • Dictionary lookup using UIMA Concept Mapper loaded with NCBI Taxonomy • Match species and sub-species; traverse is-a hierarchy for sub-species

BC II.5 results RAW Homonym/Ortholog

KNoGM and KaBOB • KNoGM: Knowledge-based Normalization of Gene Mentions • Strategy based on WSD methods from Agirre and Soroa, based on knowledge graphs • Taking advantage of biological knowledge resources • KaBOB: Knowledge Base Of Biology • Integrated resource across biological databases

Knowledge-based methods in Word Sense Disambiguation • Disambiguate words based on relations represented in a semantic graph • Take advantage of connections among word senses and prefer word senses that are semantically connected • Intuition: Spreading Activation • Can perform static analysis of the graph to determine most likely disambiguations based only on the state of connections in the graph • More effective: dynamic, consider words in context

UKB: Agirre & Soroa knowledge-based WSD • Knowledge-based word sense disambiguation method • knowledge = WordNet graph • algorithm = (personalized) page rank PageRank: ranks vertices in a graph according to their relative structural importance Personalized PageRank: bias certain vertices; “activation” from a vertex increases

Knowledge-based methods in Gene Normalization • Knowledge typically brought to bear based on textual matching of concepts known to be associated with genes • Gene ontology concepts • Chromosome locations • Species names • KNoGM takes advantage of such knowledge in a broader relational context

KaBOB: Knowledge Base of Biology • Goal: construction of an integrated, broad-coverage semantic resource of biological knowledge • information artifacts • abstracted biological knowledge • RDF representation using ontological relations • KaBOB v.0 • iRefWeb protein interaction data • GO annotations • Homologene • NCBI Taxonomy

From knowledge-based WSD to KNoGM • knowledge: KaBOB • dictionary: gene name → gene identifiers • context: mentions of gene names, GO terms, NCBI Taxonomy terms

KNoGM

Training Set 1, BCIII

Biological Knowledge Discovery Protein Active Sites

Automated validation of high-throughput predictions • Collaboration with Mike Wall @ LANL • Combine structure-based predictions of active sites on proteins with literature-based validation • Given a PDB protein structure, and a prediction for residues in that structure that are active (ligand binding sites, catalytic sites, etc.) • Search the literature for evidence supporting the prediction

Protein Fold vs. Function • Many amino acids in a protein are responsible for defining the overall fold • However, only a small fraction of the residues in a protein are directly responsible for its behavior • The evolutionary pressures on these residues are different from other residues, and can cause mutations to be correlated with function (Lichtarge)

Functional Residues Are OftenRemote in Sequence • Difficult to identify as motifs >1AQM:A|PDBID|CHAIN|SEQUENCE TPTTFVHLFEWNWQDVAQECEQYLGPKGYAAVQVSPPNEHITGSQWWTRYQPVSYELQSRGGNRAQFIDMVNRCSAAGVD IYVDTLINHMAAGSGTGTAGNSFGNKSFPIYSPQDFHESCTINNSDYGNDRYRVQNCELVGLADLDTASNYVQNTIAAYI NDLQAIGVKGFRFDASKHVAASDIQSLMAKVNGSPVVFQEVIDQGGEAVGASEYLSTGLVTEFKYSTELGNTFRNGSLAW LSNFGEGWGFMPSSSAVVFVDNHDNQRGHGGAGNVITFEDGRLYDLANVFMLAYPYGYPKVMSSYDFHGDTDAGGPNVPV HNNGNLECFASNWKCEHRWSYIAGGVDFRNNTADNWAVTNWWDNTNNQISFGRGSSGHMAINKEDSTLTATVQTDMASGQ YCNVLKGELSADAKSCSGEVITVNSDGTINLNIGAWDAMAIHKNAKLNTSSAS -amylase from Alteromonas haloplanctis Asp174, Glu200, Asp264

The Same Residues are OftenNearby in 3D Structure Glu200 Asp264 Asp174 1AQM

Functional Sites • Types of Functional Sites • Catalytic sites • Allosteric Sites • Ligand-binding sites • Protein-protein interaction sites • Used to define motifs • Geometric hashing and other methods (TESS, Thornton lab) • Targets for Drug Design

DPA Prediction of Functional Sites Glu200 Asp264 Asp174 Catalytic Triad Predicted Residues

NLP Validation of Protein Active Site Predictions • Combine structure-based predictions of active sites on proteins with literature-based validation • Given a PDB protein structure, and a prediction for residues in that structure that are active (ligand binding sites, catalytic sites, etc.) • Search the literature for evidence supporting the prediction

NLP validation: approach

NLP validation: NLP analysis

Residue mention detection,examples • This missense mutation converts a highly conserved glycine (Gly17 of neurophysin) to a valine residue. • Killer of prune (Kpn) is a mutation in the awd gene which substitutes Ser for Pro at position 97 and causes dominant lethality in individuals that do not have a functional prune gene. • Residues in both the N-terminal (Arg-66 and Glu-70) and C-terminal (Arg-200, Asp-254, Asp-255, and Asp-276) thirds of the protein are implicated in binding to cells. • … where cysteines at positions 6, 42, 48, 90 and 393 were replaced by serine. • Other outliers of possible functional relevance include D18, R23, R59, R390 and A391. Patterns must handle 3-letter and 1-letter abbrevations; various connectors, mutations, linguistic constructs such as coordination, and other variations in surface forms.

Some regular expressions for AA mentions AA_long= "(alanine|asparagine|aspartic|cysteine|glutamic|glutamic acid|glutamine|glycine|histidine|allo\ |leucine|lysine|methionine|penylalanine|proline|serine|threonine|tryptophane|\ tyrosine|valinealanine|arginine|alanyl|arginyl|asparaginyl|aspartyl|cysteinyl|glutaminyl\ |glycyl|glutamyl|histidyl|isoleucyl|leucyl|lysyl|methionyl\ |phenylalanyl|prolyl|seryl|threonyl|tryptophanyl|tyrosyl|isoleucine|valyl)" AA_short = "(arg|asn|asp|cys|gln|gly|glu|his|ile|leu|lys|met|phe|pro|ser|thr|trp|tyr|val|asx|glx|xle|xaa|ala|ctt)" AA_initial = "(A|C|D|E|F|G|H|I|K|L|M|N|P|Q|R|S|T|V|W|Y)” AA_unbounded = AA_long + "|" AA_short AA_bounded = "\b" + AA_unbounded + "\b" AA_position_variant1 = "(\d+)([ \-]+)" + AA_bounded #AA plus the position tyr85 with optional parenthesis around the position tyr(85) AA_position_variant2 = AA_unbounded + "[ \-]*\(?\d+\)*?" # (tyr85 to ser85, Tyr 85 Ser 85, trp27-gly360) connection = "[ \-]?(\-|to|\s|\\)[ \-]?" grammatical_expressions = "([ \-]?(to|substitution of|at position|acid)[ \-]?)” pattern3 = AA_unbounded + ".?\d+" + connection + AA_unbounded + ".?\d+"

Current pattern performance Corpus 1: 61 full-text journal publications derived from Protein Data Bank (PDB) records that have known functional sites Corpus 2: 7 full-text journal publications; 5 abstracts. Derived from PDB records that are known drug targets. Corpus 3: 100 journal abstracts; obtained from Nagel et al (2009).

NLP analysis, refined

Some initial results of integration • For 32,195 PDB entries: • 26,829 entries map to a PubMed ID • 14,851 unique PubMed abstracts processed • 23,477 residues identified • 69% match surface residues on the relevant protein • 50% of these match predicted active sites • 79% of PDB entries have at least one residue identified

Complicating factors • AA numbering in sequences may not be consistent • Different “reference” sequences for the protein • Mutant or other variant sequences • Explicit mentions of mutations • Namespace ambiguity, possibly

BioNLP Technical and Representational issues

NLP validation: infrastructure • Requires scaling our architecture to process full text publications on a large scale • UIMA-AS (Asynchronous Scaleout) • Cloud/cluster computing • Take software engineering seriously • Robust, scaleable, modular architectures • Consider the kinds of knowledge structures we need to be able to represent and manipulate • hierarchical controlled vocabularies • patterns of expression

Annotation Representation a3 a2 a1 a4 EG:23939 GO:0006350 GO:0045449 GO:0065007 “biological regulation” “transcription” rdfs:label rdfs:label kiao:denotesResource kiao:denotesResource kiao:denotesResource rdfs:label has_location has_location has_location t1 t2 t3 “M. musculus Mapk7” …regulation of transcription of mouse Mapk7… t4 “regulation of transcription” rdfs:Resource rdfs:label kiao:ResourceAnnotation kiao:StatementSetAnnotation rdf:Property (s p o) p

In a nutshell • Ontologies and Semantic graph analysis • Vocabularies and Linguistic knowledge for the biomedical domain • Text Mining • Information Extraction • Addressing the needs of the biological user • Biological data analysis integrating multiple data sources

Larry Hunter (Lab director) Eneko Agirre and Aitor Soroa at EHU (UKB) Kevin Livingston (KaBOB) Kevin Cohen (NLP) Helen Johnson (Linguist) (Software engineers) Bill Baumgartner Chris Roeder Tom Christiansen Other Lab members: Mike Bada, Hannah Tipney, Yuriy Malenkiy, Lynne Fox Mike Wall and Judith Cohn at LANL NIH grants R01 LM 010120-01 R01 LM 009254 R01 LM 008111 R01 GM 083649 G08 LM 009639 T15 LM 009451 Guillaume Achaz for the gnome image Acknowledgements

Research in the Verspoor Lab