
Department of Clinical and Biological Sciences, Turin University.

Mining literature to improve biological knowledge extraction by microarray transcriptional profiling. R.A. Calogero, G. Iazzetti, S. Motta, G. Pedrazzi, S. Rago, E. Rossi, R. Turra. Department of Clinical and Biological Sciences, Turin University.


Presentation Transcript


  1. Mining literature to improve biological knowledge extraction by microarray transcriptional profiling • R.A. Calogero, G. Iazzetti, S. Motta, G. Pedrazzi, S. Rago, E. Rossi, R. Turra • Department of Clinical and Biological Sciences, Turin University. Department of Genetics, General and Molecular Biology, Naples. Department of Mathematics and Information Science, Italy. CINECA, Italy.

  2. Data Mining applications in biological fields • On sequence databases / molecular structure • Protein structure prediction, homology search, genomic sequence analysis, gene identification and mapping, gene expression microarrays, … • On biomedical literature databases • Identification and classification of biological terms, identification of keywords and concepts, clustering, supervised classification, …

  3. Biomedical literature analysis: state of the art • Two different approaches: • Information Extraction • Application of Natural Language Processing techniques to produce structured representations (templates). Entities and relations must be defined before extraction from the texts. Syntactic and semantic analysis drive the extraction. • Text Mining • Identification of word patterns within the document corpus. No entities are defined in advance, which allows new concepts and new relations to be identified. No semantics.

  4. Text Mining - The KDD process • Diagram labels: Databases, Web sites, … • Documents selection • Keywords • Target documents • Grammatical analysis and lemmatization • Meta-information extraction • Tagging • Transformed documents • Text Mining • Patterns • Interpretation and validation of results • Knowledge
     Example source document: >>> 35:TOYOTA: Avalon Receives Top Score in Frontal Offset Crash Tests. Toyota Motor Corp.'s Avalon received the top score -- a "good" rating earning a "best pick" -- in the 40 mile per hour frontal offset crash tests on new or updated vehicles. The tests were conducted by the Insurance Institute for Highway Safety, a nonprofit group funded by automobile insurers. Nissan Motor Co. Ltd.'s Maxima midsize sedan and Infiniti I30 luxury sedan, the Nissan Sentra small car and Mazda Motor Corp.'s Mazda MPV minivan all scored "average" marks. Isuzu Motors Ltd.'s Rodeo sport utility, also sold by Honda Motor Co. Ltd. as the Honda Passport, earned a "poor" rating due to high crash forces recorded on the crash dummy's head, indicating an increased likelihood of injury. In the crash tests, the vehicles were driven into a deformable barrier at 40 mph, with the driver's side of the vehicle taking the impact. The tests measured the potential for injury to the head, neck, chest and foot areas, and the risk of intrusion into the passenger compartment. SUBJECTS: Japan; Safety; Passenger Vehicles; SOURCE: Reuters, June 21, 2000; Japan; English
     Example transformed document (tn.5.26.35): SOURCE Reuters • DATE 6/21/2000 • MONTHYEAR 2000_06 • SUBJECTS Japan, Passenger_Vehicles, Safety • STATE Japan • LANGUAGE English • ORG2 TOYOTA • NN area, automobile, average, barrier, car, chest, compartment, crash, driver, dummy, foot, force, group, head, hour, impact, injury, insurer, intrusion, likelihood, luxury, mark, mile, neck, offset, passenger, potential, rating, risk, safety, score, sedan, side, sport, test, utility, vehicle

  5. Documents selection • 1,400,000 abstracts • Gene or protein

  6. The process • Identification of the different parts of a document (marking) • Textual part: TITLE (TI), TEXT (AB) • Meta-information: Affiliation (AD), Date (EDAT), Journal (TA), Publ. Type (PT), Country (CY) • Grammatical analysis (and lemmatisation) • Information Extraction

  7. Phase 1: marking the document • Marking up the different parts of the documents: Title, Abstract, … • Tags: TI, AB, ORG, AU, LA, TY, CO, JN

  8. Phase 2: grammatical analysis Automatic identification of: NOUNS

  9. Phase 2: grammatical analysis Automatic identification of: NOUNS ADJECTIVES

  10. Phase 2: grammatical analysis Automatic identification of: NOUNS ADJECTIVES VERBS

  11. Phase 2: grammatical analysis Automatic identification of: NOUNS ADJECTIVES VERBS PROPER NOUNS

  12. Phase 2: grammatical analysis • Selection: NOUNS • Lemmatisation • KEYWORDS
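
The noun selection and lemmatisation step can be sketched as follows. This is only an illustration: the input is assumed to be already part-of-speech tagged with Penn Treebank style tags (`NN`, `NNS`), and the small `LEMMAS` lookup table is hypothetical, standing in for whatever lemmatiser the authors actually used.

```python
# Hypothetical, pre-tagged tokens taken from the crash-test example text.
TAGGED = [
    ("vehicles", "NNS"), ("were", "VBD"), ("driven", "VBN"),
    ("into", "IN"), ("a", "DT"), ("deformable", "JJ"),
    ("barrier", "NN"), ("tests", "NNS"), ("vehicles", "NNS"),
]

# Hypothetical lemma table; a real system would use a full lemmatiser.
LEMMAS = {"vehicles": "vehicle", "tests": "test"}


def noun_keywords(tagged, lemmas):
    """Keep only nouns, map each to its lemma, and return unique keywords."""
    keywords = {
        lemmas.get(word.lower(), word.lower())
        for word, tag in tagged
        if tag in ("NN", "NNS")
    }
    return sorted(keywords)


keywords = noun_keywords(TAGGED, LEMMAS)
print(keywords)
```

The resulting keywords correspond to the `NN` entries of the transformed document shown on slide 4.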

  13. New document format

  14. Phase 3: information extraction • Gene “Dictionary” (gene name / alias): CDKN1B / P27KIP1 • IFI27 / P27 • P27 / P27 • Result: 20000219 gene CDKN1B • 20000219 gene IFI27 • 20000219 gene P27

  15. Final document • 20000219 gene CDKN1B • 20000219 gene IFI27 • 20000219 gene P27
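
A minimal sketch of the dictionary lookup that could produce such records. The dictionary rows and the document identifier come from the slides; the abstract text, the tokenisation rule and the function name `extract_gene_records` are assumptions made for illustration.

```python
import re

# (gene name, alias) rows as shown in the gene "dictionary" slide.
GENE_DICTIONARY = [
    ("CDKN1B", "P27KIP1"),
    ("IFI27", "P27"),
    ("P27", "P27"),
]


def extract_gene_records(doc_id, text, dictionary):
    """Emit one 'doc_id gene NAME' record per gene whose alias occurs in text."""
    # Assumed tokenisation: upper-cased alphanumeric runs.
    tokens = set(re.findall(r"[A-Za-z0-9]+", text.upper()))
    genes = sorted({gene for gene, alias in dictionary if alias in tokens})
    return [f"{doc_id} gene {gene}" for gene in genes]


# Hypothetical abstract text mentioning two aliases.
records = extract_gene_records(
    "20000219", "Expression of p27Kip1 and p27 was reduced.", GENE_DICTIONARY
)
print("\n".join(records))
```

Because the alias P27 is shared by two dictionary rows, a single mention yields records for both IFI27 and P27, which is the ambiguity the later "stop words" slide addresses.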

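  16. Clustering • Similarity index: $s(i,j) = \dfrac{a\,N_{11}}{b\,N_{11} + c\,(N_{10} + N_{01})}$
      with $N_{11} = \sum_{k=1}^{m} x_{ik}\,x_{jk}$, $N_{10} = \sum_{k=1}^{m} x_{ik}\,(1 - x_{jk})$, $N_{01} = \sum_{k=1}^{m} (1 - x_{ik})\,x_{jk}$, $N_{00} = \sum_{k=1}^{m} (1 - x_{ik})\,(1 - x_{jk})$
      • Condorcet: $a = b = 1$, $c = 1/2$ • Dice: $a = b = 1$, $c = 1/4$
      • Similarity threshold: if $s(i,j) > \alpha$, Doc$_i$ and Doc$_j$ are similar, with $\alpha \in [0,1]$ • default: $\alpha = 0.5$
      • Weighting system: $w_k = 1 / x_{\cdot k}$ or $w_k = \log(N / x_{\cdot k})$, giving $N_{11} = \sum_{k=1}^{m} x_{ik}\,x_{jk}\,w_k$ ($N_{10}$ and $N_{01}$ analogously)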
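
The similarity index s(i,j) = a N11 / (b N11 + c (N10 + N01)) can be tried out with a short sketch. The coefficient values for "Condorcet" and "Dice" are the ones given on the slide; the example vectors are invented, and reading N in the logarithmic weight as the number of documents and x.k as the number of documents containing term k is an assumption.

```python
import math


def similarity(x_i, x_j, a=1.0, b=1.0, c=0.5, weights=None):
    """s(i,j) = a*N11 / (b*N11 + c*(N10 + N01)) for binary term vectors."""
    if weights is None:
        weights = [1.0] * len(x_i)
    n11 = sum(w * p * q for p, q, w in zip(x_i, x_j, weights))
    n10 = sum(w * p * (1 - q) for p, q, w in zip(x_i, x_j, weights))
    n01 = sum(w * (1 - p) * q for p, q, w in zip(x_i, x_j, weights))
    denom = b * n11 + c * (n10 + n01)
    return a * n11 / denom if denom else 0.0


def log_weights(docs):
    """w_k = log(N / x.k), assuming N = number of documents."""
    n_docs = len(docs)
    column_sums = [sum(doc[k] for doc in docs) for k in range(len(docs[0]))]
    return [math.log(n_docs / x) if x else 0.0 for x in column_sums]


# Invented binary document-by-term vectors (1 = keyword present).
docs = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
]

s_condorcet = similarity(docs[0], docs[1], c=0.5)   # slide: a=b=1, c=1/2
s_dice = similarity(docs[0], docs[1], c=0.25)       # slide: a=b=1, c=1/4
s_weighted = similarity(docs[0], docs[1], weights=log_weights(docs))

threshold = 0.5  # default threshold from the slide
print(s_condorcet, s_dice, s_weighted, s_condorcet > threshold)
```

For the first two documents N11 = 2, N10 = 1 and N01 = 0, so the unweighted Condorcet value is 2 / 2.5 = 0.8, above the default threshold.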

  17. Results example: RET <OR> BRCA1

  18. Cluster • Results example: RET <OR> BRCA1

  19. Cluster keywords • Results example: RET <OR> BRCA1

  20. Results example: RET <OR> BRCA1

  21. Results example: RET <OR> BRCA1

  22. Locus Link Filter • Locus Link (http://www.ncbi.nlm.nih.gov/LocusLink/) • Example: SERPINA3, alternate symbols: ACT, AACT • extract OFFICIAL_SYMBOL and ALIAS_SYMBOL • put them on one row • select the terms with at least 3 characters • build GENE/ALIAS pairs to search for in the Medline documents
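
The filter steps listed on this slide can be sketched as below. The input layout (an official symbol plus a list of aliases per record) is an assumed simplification of the Locus Link data, and the second record is hypothetical, included only to show the length filter.

```python
def gene_alias_pairs(records, min_len=3):
    """Build (official symbol, search term) pairs, dropping short terms."""
    pairs = []
    for official, aliases in records:
        row = [official] + list(aliases)               # symbols on one row
        terms = [t for t in row if len(t) >= min_len]  # at least 3 characters
        pairs.extend((official, term) for term in terms)
    return pairs


records = [
    ("SERPINA3", ["ACT", "AACT"]),  # example from the slide
    ("GENE1", ["XY"]),              # hypothetical record with a short alias
]

pairs = gene_alias_pairs(records)
for gene, term in pairs:
    print(gene, term)
```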

  23. Meta-information: 20000219 gene CDKN1B • 20000219 gene IFI27 • 20000219 gene P27 • Filter • Index: A1BG 18650110 45822308 69800214 • A2M 78121104 74300722 51024679 • A2MP …
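
One way to read this slide is that the per-document gene records are inverted into an index from gene symbol to document identifiers. The sketch below assumes that reading; the function name `build_index` is hypothetical.

```python
from collections import defaultdict


def build_index(records):
    """Invert 'doc_id gene NAME' records into {NAME: [doc_id, ...]}."""
    index = defaultdict(list)
    for line in records:
        doc_id, kind, name = line.split()
        if kind == "gene":
            index[name].append(doc_id)
    return dict(sorted(index.items()))  # gene symbols in alphabetical order


index = build_index([
    "20000219 gene CDKN1B",
    "20000219 gene IFI27",
    "20000219 gene P27",
])
for gene, doc_ids in index.items():
    print(gene, " ".join(doc_ids))
```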

  24. “Stop words” list • For these aliases we ran a constrained search or no search at all • Example: KIT <near/6> (protein <or> gene <or> product)
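
A constrained search such as `KIT <near/6> (protein <or> gene <or> product)` can be approximated as follows. The exact semantics of `<near/6>` in the authors' search engine are not given; this sketch assumes it means "within six words on either side", with case-insensitive matching.

```python
import re


def near(text, term, context_words, window=6):
    """True if `term` occurs within `window` words of any context word."""
    tokens = [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]
    context = {w.lower() for w in context_words}
    for i, tok in enumerate(tokens):
        if tok == term.lower():
            neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            if any(n in context for n in neighbours):
                return True
    return False


CONTEXT = ["protein", "gene", "product"]

# Invented example sentences.
hit = near("Mutations in the KIT gene were found in tumours.", "KIT", CONTEXT)
miss = near(
    "The kit was shipped on ice and stored overnight before any analysis "
    "of the protein.",
    "KIT",
    CONTEXT,
)
print(hit, miss)
```

The second sentence uses "kit" as an ordinary English word; the proximity constraint rejects it because "protein" is more than six words away.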

  25. Terms recognition - open problems • Non-standardised terminology (different conventions) • Open vocabulary (new terms keep being added) • Use of abbreviations, upper case/lower case, names that describe the function, … • Synonyms • Term class cross-over (e.g. proteins named after the related DNA) • Prepositions and conjunctions (ambiguity in the interpretation of dependence) • Co-reference

  26. Terms recognition - approaches • Manual coding of knowledge * • Learning methods • Maximum entropy * * • Hidden Markov models * • Decision trees * * • Naive Bayes * • Statistical methods • Naive Bayes + “word lists” * • Hybrid methods * * * • LTG (Language Technology Group)

  27. Test in biological field • genes (DNA) / proteins: 96,7 ----- ----- 47,2 75,9 17,8 - 44,6 83,4 - 87,5 84,4 84,5 83,8 70,3 ----- ----- • F-score = 2*P*R/(P+R) • Terms recognition - approaches • Manual coding of knowledge * • Learning methods • Maximum entropy * * • Hidden Markov models * • Decision trees * * • Naive Bayes * • Statistical methods • Naive Bayes + “word lists” * • Hybrid methods * * * • LTG (Language Technology Group) • * training * dictionary using * hand coded rules
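
For reference, the F-score quoted on this slide combines precision P and recall R as their harmonic mean. The values in the example below are made up and are not taken from the slide's table.

```python
def f_score(precision, recall):
    """F-score = 2*P*R / (P + R); defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical precision and recall, for illustration only.
example = f_score(0.8, 0.6)
print(round(example, 4))
```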

  28. Clustering approaches • Vectorial representation: descriptive terms (nouns, verbs, adjectives, …); representation (binary, quantitative) • Metric: similarity index; Euclidean metric; cosine angle • Algorithms: hierarchical; partitional (K-means, Self Organizing Maps, Autoclass, …)
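
Two of the metrics named on this slide, the Euclidean metric and the cosine of the angle between document vectors, can be sketched on a binary vectorial representation. The example vectors are invented.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two term vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine(u, v):
    """Cosine of the angle between two term vectors (0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Invented binary term vectors for two documents.
doc_a = [1, 1, 0, 1]
doc_b = [1, 0, 0, 1]

print(euclidean(doc_a, doc_b), cosine(doc_a, doc_b))
```

Note that the Euclidean metric is a distance (smaller means more alike) while the cosine is a similarity (larger means more alike), so a clustering algorithm has to treat the two accordingly.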

  29. References • Humphreys, K., et al. (2000): "Two Applications of Information Extraction to Biological Science Journal Articles: Enzyme Interactions and Protein Structures", in Proceedings of Pacific Symposium on Biocomputing, pp. 72-80, World Scientific Press. • Milward, T., et al. (2000): "Automatic Extraction of Protein Interactions from Scientific Abstracts", in Proceedings of Pacific Symposium on Biocomputing, pp. 538-549, World Scientific Press. • Rindflesch, T. C., et al. (2000): "EDGAR: Extraction of Drugs, Genes and Relations from the Biomedical Literature", PSB'2000. • Iliopoulos, et al.: "TEXTQUEST: Document Clustering of Medline Abstracts for Concept Discovery in Molecular Biology". • Stapley, B. J., et al.: "Biobibliometrics: Information Retrieval and Visualization from Co-occurrences of Gene Names in Medline Abstracts". • Chang, J. T., et al.: "Including Biological Literature Improves Homology Search". • Leung, S., et al.: "Basic Gene Grammars and DNA-ChartParser for language processing of Escherichia coli promoter DNA sequences". • Andrade, M. A., et al.: "Automatic extraction of keywords from scientific text: application to the knowledge domain of protein families". • Marcotte, E. M., et al.: "Mining literature for protein-protein interactions". • Masys, D. R., et al.: "Use of keyword hierarchies to interpret gene expression patterns". • Eckman, B. A., et al.: "The Merck Gene Index browser: an extensible data integration system for gene finding, gene characterization and EST data mining". • Fukuda, et al. (1999): "Toward Information extraction: Identifying protein names from biological papers", PSB 98. • Collier, N., Nobata, C., and Tsujii, J. (2000): "Extracting the Names of Genes and Gene Products with a Hidden Markov Model", COLING-2000. • Nobata, C., et al. (1999): "Automatic Term Identification and Classification in Biology Texts", in Proceedings of 5th Natural Language Processing Pacific Rim Symposium. • Borthwick, A., et al. (1998): "Exploiting Diverse Knowledge Sources via Maximum Entropy in Named Entity Recognition", Proceedings of the Sixth Workshop on Very Large Corpora, pp. 152-160. • Hatzivassiloglou, V., et al.: "Disambiguating Proteins, Genes, and RNA in Text: A Machine Learning Approach". • Mikheev, A., et al.: "Description of the LTG System used for MUC-7". • Andrade, M. A., et al.: "Automatic Annotation for Biological Sequences by Extraction of Keywords from Medline Abstracts. Development of a prototype system."
