Department of Clinical and Biological Sciences, Turin University.

Mining literature to improve biological knowledge extraction by microarray transcriptional profiling
R.A. Calogero, G. Iazzetti, S. Motta, G. Pedrazzi, S. Rago, E. Rossi, R. Turra
Department of Clinical and Biological Sciences, Turin University. Department of Genetics, General and Molecular Biology, Naples. Department of Mathematics and Information Science, Italy. CINECA, Italy.

Data Mining applications in biological fields
• On sequence databases / molecular structure
  • Protein structure prediction, homology search, genomic sequence analysis, gene identification and mapping, gene expression microarrays, …
• On biomedical literature databases
  • Identification and classification of biological terms, identification of keywords and concepts, clustering, supervised classification, …

Biomedical literature analysis: state of the art
• Two different approaches:
  • Information Extraction: application of Natural Language Processing techniques that produce structured representations (templates). Entities and relations must be defined before extraction from the texts. Syntactic and semantic analysis drives the extraction.
  • Text Mining: identification of word patterns inside the document corpus. No entities are defined in advance, which allows new concepts and new relations to be identified. No semantics.

Text Mining - The KDD process
• Databases, Web sites, … → Documents selection (Keywords) → Target documents
• Target documents → Grammatical analysis and lemmatization, meta-information extraction, tagging → Transformed documents
• Transformed documents → Text Mining → Patterns
• Patterns → Interpretation and validation of results → Knowledge
Example target document:
>>> 35: TOYOTA: Avalon Receives Top Score in Frontal Offset Crash Tests
Toyota Motor Corp.'s Avalon received the top score -- a "good" rating earning a "best pick" -- in the 40 mile per hour frontal offset crash tests on new or updated vehicles. The tests were conducted by the Insurance Institute for Highway Safety, a nonprofit group funded by automobile insurers. Nissan Motor Co. Ltd.'s Maxima midsize sedan and Infiniti I30 luxury sedan, the Nissan Sentra small car and Mazda Motor Corp.'s Mazda MPV minivan all scored "average" marks. Isuzu Motors Ltd.'s Rodeo sport utility, also sold by Honda Motor Co. Ltd. as the Honda Passport, earned a "poor" rating due to high crash forces recorded on the crash dummy's head, indicating an increased likelihood of injury.
In the crash tests, the vehicles were driven into a deformable barrier at 40 mph, with the driver's side of the vehicle taking the impact. The tests measured the potential for injury to the head, neck, chest and foot areas, and the risk of intrusion into the passenger compartment.
SUBJECTS: Japan; Safety; Passenger Vehicles; SOURCE: Reuters, June 21, 2000; Japan; English
Example transformed document (one "tn.5.26.35 FIELD value" entry per value):
tn.5.26.35 SOURCE Reuters
tn.5.26.35 DATE 6/21/2000
tn.5.26.35 MONTHYEAR 2000_06
tn.5.26.35 SUBJECTS Japan
tn.5.26.35 SUBJECTS Passenger_Vehicles
tn.5.26.35 SUBJECTS Safety
tn.5.26.35 STATE Japan
tn.5.26.35 LANGUAGE English
tn.5.26.35 ORG2 TOYOTA
tn.5.26.35 NN (one entry per noun): area, automobile, average, barrier, car, chest, compartment, crash, driver, dummy, foot, force, group, head, hour, impact, injury, insurer, intrusion, likelihood, luxury, mark, mile, neck, offset, passenger, potential, rating, risk, safety, score, sedan, side, sport, test, utility, vehicle

Documents selection
• 1,400,000 abstracts
• Gene or protein

The process
• Identification of the different parts of a document (marking)
• Grammatical analysis (and lemmatisation)
• Information extraction
Textual part: TITLE (TI), TEXT (AB)
Meta-information: Affiliation (AD), Date (EDAT), Journal (TA), Publ. Type (PT), Country (CY)

Phase 1: marking the document
• Marking up the different parts of documents: Title, Abstract, …
• Tags: TI, AB, ORG, AU, LA, TY, CO, JN
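A minimal sketch of the marking step, assuming the abstracts arrive in a MEDLINE-style tagged layout (`TAG - value` per line, with indented continuation lines). The function name `mark_document` and the sample record are hypothetical and only illustrate the split between the textual part (TI, AB) and the meta-information.

```python
import re

# Tags treated as the textual part; every other tag counts as meta-information.
TEXT_FIELDS = {"TI", "AB"}

def mark_document(record: str) -> dict:
    """Split a tagged record into its textual part and its meta-information."""
    fields = {}
    tag = None
    for line in record.splitlines():
        match = re.match(r"^([A-Z]{2,4})\s*-\s*(.*)$", line)
        if match:
            tag = match.group(1)
            fields.setdefault(tag, []).append(match.group(2).strip())
        elif tag and line.strip():
            # Continuation line: append it to the last value of the current tag.
            fields[tag][-1] += " " + line.strip()
    text = {t: " ".join(v) for t, v in fields.items() if t in TEXT_FIELDS}
    meta = {t: v for t, v in fields.items() if t not in TEXT_FIELDS}
    return {"text": text, "meta": meta}

# Made-up record, for illustration only.
SAMPLE = """TI  - A hypothetical title about the RET gene
AB  - First line of a made-up abstract
      that continues on a second line.
TA  - Example Journal
CY  - Italy"""

marked = mark_document(SAMPLE)
```

Keeping the two parts separate means the grammatical analysis can be run on the textual part only, while the meta-information is carried along unchanged.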

Phase 2: grammatical analysis
• Automatic identification of: NOUNS, ADJECTIVES, VERBS, PROPER NOUNS

Phase 2: grammatical analysis
• Selection: NOUNS
• Lemmatisation: KEYWORDS

New document format

Phase 3: information extraction
Gene “Dictionary”:
gene name   alias
CDKN1B      P27KIP1
IFI27       P27
P27         P27
Extracted records:
20000219 gene CDKN1B
20000219 gene IFI27
20000219 gene P27
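A dictionary lookup of this kind could be sketched as follows. This is an illustration, not the authors' implementation: it assumes the dictionary is read as (gene name, alias) pairs, that matching is case-insensitive on whole tokens, and that the output is one `document-id gene NAME` line per matched gene. The function names and the sample sentence are hypothetical.

```python
import re
from collections import defaultdict

# Dictionary rows read as (gene name, alias) pairs; this reading is an assumption.
GENE_DICTIONARY = [("CDKN1B", "P27KIP1"), ("IFI27", "P27"), ("P27", "P27")]

def build_alias_index(rows):
    """Map every alias and every gene name to the set of genes it may denote."""
    index = defaultdict(set)
    for gene, alias in rows:
        index[alias.upper()].add(gene)
        index[gene.upper()].add(gene)
    return index

def extract_genes(doc_id, text, index):
    """Return 'doc_id gene NAME' lines for each dictionary term found in the text."""
    tokens = set(re.findall(r"[A-Za-z0-9]+", text.upper()))
    genes = set()
    for token in tokens:
        genes |= index.get(token, set())
    return [f"{doc_id} gene {name}" for name in sorted(genes)]

alias_index = build_alias_index(GENE_DICTIONARY)
# Made-up sentence, for illustration only.
records = extract_genes(
    "20000219",
    "The CDK inhibitor p27Kip1 (p27) was reduced in tumour samples.",
    alias_index,
)
```

Because an ambiguous alias such as P27 maps to more than one gene, the document is tagged with every candidate; disambiguation is left to later steps.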

Final document
20000219 gene CDKN1B
20000219 gene IFI27
20000219 gene P27

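Clustering
• Co-occurrence counts for documents $i$ and $j$ over $m$ terms, where $x_{ik} = 1$ if term $k$ occurs in document $i$ and $0$ otherwise:
  $N_{11} = \sum_{k=1}^{m} x_{ik}\, x_{jk}$
  $N_{10} = \sum_{k=1}^{m} x_{ik}\, (1 - x_{jk})$
  $N_{01} = \sum_{k=1}^{m} (1 - x_{ik})\, x_{jk}$
  $N_{00} = \sum_{k=1}^{m} (1 - x_{ik})\, (1 - x_{jk})$
• Similarity index: $s(i,j) = \dfrac{a\, N_{11}}{b\, N_{11} + c\, (N_{10} + N_{01})}$
  • Condorcet: $a = b = 1$, $c = 1/2$
  • Dice: $a = b = 1$, $c = 1/4$
• Similarity threshold: if $s(i,j) > \alpha$, Doc$_i$ and Doc$_j$ are similar, with $\alpha \in [0,1]$
  • default: $\alpha = 0.5$
• Weighting system: $N_{11} = \sum_{k=1}^{m} x_{ik}\, x_{jk}\, w_k$ (and likewise for $N_{10}$ and $N_{01}$)
  • $w_k = 1 / x_{\cdot k}$
  • $w_k = \log(N / x_{\cdot k})$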
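The similarity index can be illustrated with a short sketch. It assumes binary term vectors, reads x.k as the number of documents that contain term k and N as the total number of documents, and applies the weights to N10 and N01 as well as N11; the function names and the toy documents are hypothetical.

```python
import math

def similarity(x_i, x_j, a=1.0, b=1.0, c=0.5, weights=None):
    """s(i,j) = a*N11 / (b*N11 + c*(N10 + N01)) for two binary term vectors."""
    if weights is None:
        weights = [1.0] * len(x_i)
    n11 = sum(w * p * q for w, p, q in zip(weights, x_i, x_j))
    n10 = sum(w * p * (1 - q) for w, p, q in zip(weights, x_i, x_j))
    n01 = sum(w * (1 - p) * q for w, p, q in zip(weights, x_i, x_j))
    denominator = b * n11 + c * (n10 + n01)
    return a * n11 / denominator if denominator > 0 else 0.0

def log_weights(docs):
    """w_k = log(N / x.k), with x.k read as the number of documents containing term k."""
    n_docs = len(docs)
    doc_freq = [sum(column) for column in zip(*docs)]
    return [math.log(n_docs / f) if f > 0 else 0.0 for f in doc_freq]

def are_similar(x_i, x_j, threshold=0.5, **kwargs):
    """Apply the similarity threshold (default 0.5)."""
    return similarity(x_i, x_j, **kwargs) > threshold

# Three toy documents over four terms (made-up data).
docs = [[1, 1, 0, 1], [1, 0, 0, 1], [0, 1, 1, 0]]
s01 = similarity(docs[0], docs[1])            # N11=2, N10=1, N01=0 -> 2 / 2.5 = 0.8
s02 = similarity(docs[0], docs[2])            # N11=1, N10=2, N01=1 -> 1 / 2.5 = 0.4
s01_c_quarter = similarity(docs[0], docs[1], c=0.25)
s01_weighted = similarity(docs[0], docs[1], weights=log_weights(docs))
```

With the default threshold, documents 0 and 1 are grouped together while documents 0 and 2 are not.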

Results example: RET <OR> BRCA1
[Figures: clusters and cluster keywords for the query]

Locus Link Filter
• Extract OFFICIAL_SYMBOL and ALIAS_SYMBOL
• Put them on one row
• Select the terms with at least 3 characters
• Make GENE/ALIAS pairs to search for in the Medline documents
Locus Link (http://www.ncbi.nlm.nih.gov/LocusLink/) example: SERPINA3, alternate symbols: ACT, AACT
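The filter steps above can be sketched as follows. The `KEY: value` record layout, the function names, and the two-character alias `XY` are assumptions made for illustration; only the SERPINA3, ACT and AACT symbols come from the slide.

```python
def parse_record(record):
    """Collect the official symbol and the alias symbols from 'KEY: value' lines."""
    official, aliases = None, []
    for line in record.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "OFFICIAL_SYMBOL":
            official = value
        elif key == "ALIAS_SYMBOL":
            aliases.append(value)
    return official, aliases

def gene_alias_pairs(official, aliases, min_length=3):
    """Put the symbols on one row, keep terms of at least min_length characters, pair them."""
    row = [official] + list(aliases)
    return [(official, term) for term in row if len(term) >= min_length]

# Assumed record layout; 'XY' is a made-up alias that the length filter drops.
RECORD = """OFFICIAL_SYMBOL: SERPINA3
ALIAS_SYMBOL: ACT
ALIAS_SYMBOL: AACT
ALIAS_SYMBOL: XY"""

official_symbol, alias_symbols = parse_record(RECORD)
pairs = gene_alias_pairs(official_symbol, alias_symbols)
```

Each pair keeps the official symbol next to the search term, so a hit on an alias can be reported under the gene it belongs to.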

Meta-information
• Extracted records: 20000219 gene CDKN1B; 20000219 gene IFI27; 20000219 gene P27
• Filter
• Index:
  A1BG 18650110 45822308 69800214
  A2M 78121104 74300722 51024679
  A2MP …
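An index of this shape (gene symbol followed by the identifiers of the documents that mention it) could be built from the extracted records roughly as below. This is a sketch under the assumption that each record is a whitespace-separated `doc_id gene NAME` triple; the function name and the second document identifier are hypothetical.

```python
from collections import defaultdict

def build_gene_index(records):
    """Invert 'doc_id gene NAME' records into a NAME -> [doc_id, ...] index."""
    index = defaultdict(list)
    for record in records:
        doc_id, kind, name = record.split()
        # Only gene records are indexed; repeated mentions are stored once.
        if kind == "gene" and doc_id not in index[name]:
            index[name].append(doc_id)
    return dict(index)

records = [
    "20000219 gene CDKN1B",
    "20000219 gene IFI27",
    "20000219 gene P27",
    "20000333 gene P27",  # made-up second document
]
gene_index = build_gene_index(records)
index_lines = [f"{name} {' '.join(ids)}" for name, ids in sorted(gene_index.items())]
```

Sorting by gene symbol gives one line per gene, in the same layout as the index shown on the slide.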

“Stop words” list
• For these aliases we ran a constrained search, or no search at all
• Example of a constrained search: KIT <near/6> (protein <or> gene <or> product)
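The constrained search can be approximated with a simple proximity test. The sketch assumes that `near/6` means "within six word positions, in either direction" and that matching is case-insensitive; the exact semantics of the search engine used are not given in the slides, and the function name and example sentences are hypothetical.

```python
import re

def near(text, term, context_terms, window=6):
    """True if `term` occurs within `window` words of any of the context terms."""
    tokens = re.findall(r"\w+", text.lower())
    term = term.lower()
    context = {c.lower() for c in context_terms}
    term_positions = [i for i, t in enumerate(tokens) if t == term]
    context_positions = [i for i, t in enumerate(tokens) if t in context]
    return any(abs(i - j) <= window for i in term_positions for j in context_positions)

CONTEXT = {"protein", "gene", "product"}

hit = near("The KIT gene encodes a receptor tyrosine kinase.", "KIT", CONTEXT)
miss = near("We used a commercial kit to purify the samples.", "KIT", CONTEXT)
```

The second sentence uses "kit" as an ordinary word and is rejected because none of the context terms appears close to it.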

Terms recognition - open problems
• Non-standardised terminology (different conventions)
• Open vocabulary (new terms are continually added)
• Use of abbreviations, upper case/lower case, names that describe the function, …
• Synonyms
• Term class cross-over (e.g. proteins named after the related DNA)
• Prepositions and conjunctions (ambiguity in the interpretation of dependence)
• Co-reference

Terms recognition - approaches
• Manual coding of knowledge *
• Learning methods
  • Maximum entropy * *
  • Hidden Markov models *
  • Decision trees * *
  • Naive Bayes *
• Statistical methods
  • Naive Bayes + “word lists” *
• Hybrid methods * * *
  • LTG (Language Technology Group)
• Legend, method uses: * training, * dictionary, * hand coded rules
Test in biological field, F-score = 2*P*R/(P+R)
• Columns: genes (DNA), proteins
• Values: 96.7 ----- ----- 47.2 75.9 17.8 - 44.6 83.4 - 87.5 84.4 84.5 83.8 70.3 ----- -----
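For reference, the F-score formula combines precision P and recall R into a single figure. The sketch below computes it for a set of predicted gene names against a gold standard; the function names and both sets are made up for illustration and are unrelated to the values reported on the slide.

```python
def precision_recall(predicted, gold):
    """Precision and recall of a predicted term set against a gold-standard set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

def f_score(precision, recall):
    """F-score = 2*P*R / (P + R), defined as 0 when both P and R are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Made-up example: 2 of 4 predictions are correct, 2 of 3 gold terms are found.
predicted = {"CDKN1B", "IFI27", "P27", "KIT"}
gold = {"CDKN1B", "KIT", "RET"}
p, r = precision_recall(predicted, gold)
f = f_score(p, r)  # P = 1/2, R = 2/3, F = 4/7
```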

Clustering approaches
• Vectorial representation
  • descriptive terms (nouns, verbs, adjectives, …)
  • representation (binary, quantitative)
• Metric
  • similarity index
  • Euclidean metric
  • cosine angle
• Algorithms
  • hierarchical
  • partitive (K-means, Self Organizing Maps, Autoclass, …)
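To make the difference between the representations and metrics concrete, here is a small sketch that compares two documents under a quantitative (term count) and a binary representation, using the Euclidean metric and the cosine of the angle between the vectors. The function names and the term counts are hypothetical.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two term vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    """Cosine of the angle between two term vectors (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def to_binary(counts):
    """Binary representation: 1 if the term occurs at all, else 0."""
    return [1 if c > 0 else 0 for c in counts]

# Made-up term counts: the second document uses the same terms twice as often.
doc_a = [2, 0, 1]
doc_b = [4, 0, 2]

cos_quantitative = cosine(doc_a, doc_b)
dist_quantitative = euclidean(doc_a, doc_b)
dist_binary = euclidean(to_binary(doc_a), to_binary(doc_b))
```

The cosine treats the two documents as identical in direction even though their counts differ, whereas the Euclidean distance on counts separates them; in the binary representation both metrics see the same vector.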

