
Opportunities for Text Mining in Bioinformatics

Opportunities for Text Mining in Bioinformatics (CS591-CXZ Text Data Mining Seminar), Dec. 8, 2004. ChengXiang Zhai, Department of Computer Science, University of Illinois, Urbana-Champaign.


Presentation Transcript


  1. Opportunities for Text Mining in Bioinformatics (CS591-CXZ Text Data Mining Seminar), Dec. 8, 2004. ChengXiang Zhai, Department of Computer Science, University of Illinois, Urbana-Champaign

  2. Why Biology Text Mining?
     • Strong motivations from the biology side
     • Difficulty for biologists to access literature
     • No theory in biology, so we must keep all literature “alive”
     • Observations about the same biological mechanism may be described in different terms (e.g., due to different perspectives of study)
     • Many unanswered research questions
     • Text mining may help better organize and link the biology literature, and answer simple questions… (e.g., what do we know about this gene?)

  3. Why Biology Text Mining? (cont.)
     • Potentially high impact from CS side
     • Any “discovery” from biology text could be potentially significant
     • Biology text is relatively “easy” for mining
     • Literature is cleaner (compared with web data)
     • Biology text often has many annotations
     • Many other kinds of biology data can be exploited (e.g., DNA/protein sequences, gene expression information, metabolic networks)
     • Simple techniques may work

  4. Characteristics of Biology Text
     • Large number of entities (e.g., genes, proteins) that have well-defined semantics
     • No standard for terminology (inconsistencies)
     • Ambiguities (e.g., many acronyms)
     • Synonyms
     • High complexity in phrases and sentence structures
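To make the terminology problems on this slide concrete, here is a minimal sketch of how synonyms and ambiguous acronyms show up once names are collected per entity. The lexicon, the `GENE:...` identifiers, and the function names are assumptions made for illustration; they do not come from the slides or from any real database.

```python
from collections import defaultdict

# Toy lexicon: entity ID -> surface forms seen in text.
# The IDs are hypothetical; a real system would load a curated resource.
LEXICON = {
    "GENE:0001": ["TP53", "p53", "tumor protein 53"],
    "GENE:0002": ["CAT", "catalase"],
    "GENE:0003": ["CAT", "chloramphenicol acetyltransferase"],
}

def normalize(form: str) -> str:
    """Lowercase and drop everything except letters and digits,
    so that 'P-53' and 'p53' compare equal."""
    return "".join(ch for ch in form.lower() if ch.isalnum())

def build_index(lexicon):
    """Map each normalized surface form to the set of entities it may name."""
    index = defaultdict(set)
    for entity_id, forms in lexicon.items():
        for form in forms:
            index[normalize(form)].add(entity_id)
    return index

def ambiguous_forms(index):
    """Return the surface forms that point to more than one entity."""
    return {form: ids for form, ids in index.items() if len(ids) > 1}

index = build_index(LEXICON)
print(index[normalize("P-53")])   # synonym lookup resolves to one entity
print(ambiguous_forms(index))     # acronyms shared by several entities
```

Several forms mapping to one ID are synonyms; one form mapping to several IDs is an ambiguity. Aggressive normalization is a trade-off: it merges spelling variants but can also create new collisions between names that differ only in case or punctuation.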

  5. Research Topics
     • General goal: Applying known text mining techniques to help biology research
     • Problem 1: Data/Information Integration
       • How can we integrate text information (discovering terminology linkages)?
       • How can we link text with databases (semantic interpretations of text on top of entities/relations in DB, e.g., entity extraction)?
       • How can we integrate biology DBs (many fields are text)?
     • Problem 2: Functional annotations
       • How can we annotate a biological entity (e.g., a gene) with functional information extracted from literature?
       • How can we annotate a set of related genes with functional information?
       • How can we exploit the ontologies/thesauri in biology?
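As one possible reading of "linking text with databases" through entity extraction, the sketch below tags gene mentions in a sentence and attaches the ID of the matching database record. It uses plain dictionary matching, which is a simple stand-in and not a method named in the slides. The gene names, the `DB:...` record IDs, and the example sentence are all invented for illustration.

```python
import re

# Hypothetical database table: record ID -> names used for that gene.
GENE_DB = {
    "DB:101": ["abcA", "alpha binding cassette A"],
    "DB:202": ["xyz1", "Xyz-1"],
}

def compile_patterns(db):
    """Build one case-insensitive, whole-word regex per database record."""
    patterns = []
    for record_id, names in db.items():
        # Try longer names first so multi-word names win over shorter ones.
        ordered = sorted(names, key=len, reverse=True)
        alternatives = "|".join(re.escape(name) for name in ordered)
        pattern = re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)
        patterns.append((record_id, pattern))
    return patterns

def link_mentions(text, patterns):
    """Return (start offset, matched text, record ID), ordered by position."""
    mentions = []
    for record_id, pattern in patterns:
        for match in pattern.finditer(text):
            mentions.append((match.start(), match.group(0), record_id))
    return sorted(mentions)

sentence = "Loss of Xyz-1 function was observed together with abcA mutations."
found = link_mentions(sentence, compile_patterns(GENE_DB))
for start, surface, record_id in found:
    print(start, surface, record_id)
```

Dictionary matching is easy to build but inherits the terminology problems noted earlier: it misses unseen variants and cannot tell which entity an ambiguous acronym refers to.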

  6. Research Topics (cont.)
     • Problem 3: Data/Information Cleanup & Curation
       • How can we detect suspicious data/information in existing databases?
       • How can we automate many manual tasks of database curation?
     • Problem 4: Research question answering
       • How can we answer simple research questions? (e.g., what functional connections are there between these two genes?)
       • How can we support exploratory access to and digestion of literature information? (e.g., a biology research workbench)
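For the example question about connections between two genes, a very simple baseline is to collect the sentences in which both genes are mentioned and present them as candidate evidence. The sketch below assumes this co-occurrence heuristic, a naive sentence splitter, and invented abstracts with placeholder gene names; none of these come from the slides.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '?' or '!' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.?!])\s+", text) if s.strip()]

def mentions(sentence, names):
    """True if any of the given names occurs as a whole word in the sentence."""
    return any(
        re.search(rf"\b{re.escape(name)}\b", sentence, re.IGNORECASE)
        for name in names
    )

def cooccurrence_evidence(abstracts, gene_a, gene_b):
    """Collect (document ID, sentence) pairs that mention both genes."""
    evidence = []
    for doc_id, text in abstracts.items():
        for sentence in split_sentences(text):
            if mentions(sentence, gene_a) and mentions(sentence, gene_b):
                evidence.append((doc_id, sentence))
    return evidence

# Invented abstracts with placeholder gene names, for illustration only.
ABSTRACTS = {
    "doc1": "GeneA is expressed in the liver. GeneA activates GeneB under stress.",
    "doc2": "GeneB was not detected. GeneC binds GeneA.",
}

for doc_id, sentence in cooccurrence_evidence(ABSTRACTS, ["GeneA"], ["GeneB"]):
    print(doc_id, sentence)
```

Co-occurrence only suggests that a relationship may exist; it does not say what kind. Each gene is passed as a list of names so that known synonyms can be supplied together.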
