Literature Mining and Systems Biology

Literature Mining and Systems Biology Lars Juhl Jensen EMBL

Why?

Overview • Information retrieval: finding the papers • Entity recognition: identifying the substance(s) • Information extraction: formalizing the facts • Text mining: finding nuggets in the literature • Integration: combining text and biological data

Status • IR, ER, and simple IE methods are fairly well established • Advanced NLP-based IE systems are rapidly being improved • Methods for text mining and text/data integration are still in their infancy

Example Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation

Information retrieval • Ad hoc information retrieval • The user enters a query/a set of keywords • The system attempts to retrieve the relevant texts from a large text corpus (typically Medline) • Text categorization • A training set of texts is created in which texts are manually assigned to classes (often only yes/no) • A machine learning methods is trained to classify texts • This method can subsequently be used to classify a much larger text corpus

Ad hoc IR • These systems are very useful since the user can provide any query • The query is typically Boolean (yeast AND cell cycle) • A few systems instead allow the relative weight of each search term to be specified by the user • The art is to find the relevant papers even if they do not actually match the query • Ideally our example sentence should be extracted by the query yeast cell cycle although none of these words are mentioned

Automatic query expansion • In a typical query, the user will not have provided all relevant words and variants thereof • By automatically expanding queries with additional search terms, recall can be improved • Stemming removes common endings (yeast / yeasts) • Thesauri can be used to expand queries with synonyms and/or abbreviations (yeast / S. cerevisiae) • The next logical step is to use ontologies to make complex inferences (yeast cell cycle / Cdc28 )

Document similarity • The similarity of two documents can be defined based on their word content • Each document can be represented by a word vector • Words should be weighted based on their frequency and background frequency • The most commonly used scheme is tf*idf weighting • Document similarity can be used in ad hoc IR • Rather than matching the query against each document only, the N most similar documents are also considered

Document clustering • Unsupervised clustering algorithms can be applied to a document similarity matrix • All pairwise document similarities are calculated • Clusters of “similar documents” can be constructed using one of numerous standard clustering methods • Practical uses of document clustering • The “related documents” function in PubMed • Logical organization of the documents found by IR

Text categorization • These systems are a lot less flexible than ad hoc systems but can attain better accuracy • Works on a pre-defined set of document classes • Each class is defined by manually assigning a number of documents to it • Method • Rules may be manually crafted based on a very small set of manually classified documents • Statistical machine learning methods can be trained on a large number of classified documents

Example Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation Hints in the text • Strong: Cdc28 and Swe1 (“cell cycle” and “yeast”) • Weaker: mitotic cyclin, Clb2, and Cdk1 ( “cell cycle)

Machine learning • Input features • Word content or bi-/tri-grams • Part-of-speech tags • Filtering (stop words, part-of-speech) • Singular value decomposition • Training • Support vector machines are best suited • Choice of kernel function • Separate training and evaluation sets, cross validation

Entity recognition • An important but boring problem • The genes/proteins/drugs mentioned within a given text must be identified • Recognition vs. identification • Recognition: find the words that are names of entities • Identification: figure out which entities they refer to • Recognition without identification is of limited use

Example Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation Entities identified • S. cerevisiae proteins: Clb2 (YPR119W), Cdc28 (YBR160W), Swe1 (YJL187C), and Cdc5 (YMR001C)

Recognition • Features • Morphological: mixes letters and digits or ends on -ase • Context: followed by “protein” or “gene” • Grammar: should occur as a noun • Methodologies • Manually crafted rule-based systems • Machine learning (SVMs) • But what can it be used for?

Identification • A good synonyms list is the key • Combine many sources • Curate to eliminate stop words • Flexible matching to handle orthographic variation • Case variation: CDC28, Cdc28, and cdc28 • Prefixes: myc and c-myc • Postfixes: Cdc28 and Cdc28p • Spaces and hyphens: cdc28 and cdc-28 • Latin vs. Greek letters: TNF-alpha and TNFA

Disambiguation • The same word may mean many different things • Entity names may also be common English words (hairy) or technical terms (SDS) • Protein names may refer to related or unrelated proteins in other species (cdc2) • The meaning can be resolved from the context • ER can distinguish between names and common words • Disambiguating non-unique names is a hard problem • Ambiguity between orthologs can be safely be ignored

Co-occurrence extraction • Relations are extracted for co-occurring entities • Relations are always symmetric • The type of relation is not given • Scoring the relations • More co-occurrences  more significant • Ubiquitous entities  less significant • Same sentence vs. same paragraph • Simple, good recall, poor precision

Example Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation Relations • Correct: Clb2–Cdc28, Clb2–Swe1, Cdc28–Swe1, and Cdc5–Swe1 • Wrong: Clb2–Cdc5 and Cdc28–Cdc5

Categorization of relations • Extracting specific types of relations • Text categorization methods can be used to identify sentences that mention a certain type of relations • Filtering can be done before or after relation extraction • Well suited for database curation • Text categorization can be reused • High recall is most important • Curators can compensate for the lack of precision

Relation extraction by NLP • Information is extracted based on parsing and interpreting phrases or full sentences • Good at extracting specific types of relations • Handles directed relations • Complex, good precision, poor recall

Example Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation Relations: • Complex: Clb2–Cdc28 • Phosphorylation: Clb2Swe1, Cdc28Swe1, and Cdc5Swe1

An NLP architecture • Tokenization • Entity recognition with synonyms list • Word boundaries (multi words) • Sentence boundaries (abbreviations) • Part-of-speech tagging • TreeTagger trained on GENIA • Semantic labeling • Dictionary of regular expressions • Entity and relation chunking • Rule-based system implemented in CASS

Semantic labeling • Gene and protein names • Cue words for entity recognition • Cue words for relation extraction Named entity chunking • A CASS grammar recognizes noun chunks related to gene expression:[nxgene The GAL4gene] Relation chunking • Our CASS grammar also extracts relations between entities:[nxexpr The expression of [nxgenethe cytochrome genes[nxpgCYC1 and CYC7]]]is controlled by[nxpgHAP1]

[phosphorylation_activeLyn, [negation but not Jak2]phosphorylated CrkL] [phosphorylation_activeLyn, [negation but not Jak2] phosphorylated CrkL] [expression_repression_activeBtk regulates the IL-2 gene] [expression_repression_activeBtk regulates the IL-2 gene] [phosphorylation_activeLyn also participates in [phosphorylation the tyrosine phosphorylation and activation of syk]] [phosphorylation_activeLyn also participates in [phosphorylation the tyrosine phosphorylation and activation of syk]] [expression_repression_activeIL-10 also decreased [expression mRNA expression of IL-2 and IL18 cytokine receptors] [expression_repression_activeIL-10 also decreased [expressionmRNA expression of IL-2 and IL18 cytokine receptors] [phosphorylation_nominal the phosphorylation of the adapter protein SHC by the Src-related kinase Lyn] [phosphorylation_nominal the phosphorylation of the adapter proteinSHC by the Src-related kinaseLyn] [phosphorylation_nominalphosphorylation of Shc by the hematopoietic cell-specific tyrosine kinaseSyk] [phosphorylation_nominal phosphorylation of Shc by the hematopoietic cell-specific tyrosine kinase Syk] [dephosphorylation_nominal Dephosphorylation of Syk and Btk mediated by SHP-1] [dephosphorylation_nominalDephosphorylation of Syk and Btkmediated by SHP-1] [expression_activation_passive [expressionIL-13 expression] induced by IL-2 + IL-18] [expression_activation_passive [expressionIL-13 expression] induced by IL-2 + IL-18]

Mining text for nuggets • New relations can be inferred from published ones • This can lead to actual discoveries if no person knows all the facts required for making the inference • Combining facts from disconnected literatures • Swanson’s pioneering work • Fish oil and Reynaud's disease • Magnesium and migraine

Trends • Most similar to existing data mining approaches • Although all the detailed data is in the text, people may have missed the big picture • Temporal trends • Historical summaries • Forecasting • Correlations • “Customers who bought this item also bought …”

Time

Buzzwords

Correlations • “Customers who bought this item also bought …” • Protein networks • “Proteins that regulate expression …” • “Proteins that control phosphorylation …” • “Proteins that are phosphorylated …” • Co-author networks

Transcriptional networks 3592 79 32 83 Regulates Regulated P < 910-9

Signaling pathways 3704 27 11 44 Phosphorylates Phosphorylated P < 210-7

Integration • Automatic annotation of high-throughput data • Loads of fairly trivial methods • Protein interaction networks • Can unify many types of interactions • Powerful as exploratory visualization tools • More creative strategies • Identification of candidate genes for genetic diseases • Linking genes to traits based on species distributions

Literature Mining and Systems Biology

Literature Mining and Systems Biology

Presentation Transcript

Mining Medical Literature

Systems Biology

Mining the Medical Literature

Biological literature mining

Literature Mining for the Biologists

Systems Biology -

Systems Biology

Literature Retrieval and Mining

Systems Biology

Data Mining Systems and Languages

Literature Data Mining and Protein Ontology Development

Systems Biology

Mining the Biomedical Literature

NLP Tools for Biology Literature Mining

Searching, finding and organising literature: Biology

Systems Biology

Literature Mining BMI 730

Mining Biomedical Literature for Neuroanatomy

Systems Biology

Biology Literature Review

Searching, finding and organising literature: Biology