Opportunities for Text Mining in Bioinformatics
(CS591-CXZ Text Data Mining Seminar)
Dec. 8, 2004
ChengXiang Zhai
Department of Computer Science, University of Illinois, Urbana-Champaign
Why Biology Text Mining?
• Strong motivations from the biology side
  • Biologists have difficulty accessing the literature
  • No theory in biology, so we must keep all literature “alive”
  • Observations about the same biological mechanism may be described in different terms (e.g., due to different perspectives of study)
  • Many unanswered research questions
• Text mining may help better organize and link the biology literature, and answer simple questions (e.g., what do we know about this gene?)
Why Biology Text Mining? (cont.)
• Potentially high impact from the CS side
  • Any “discovery” from biology text could be significant
• Biology text is relatively “easy” to mine
  • The literature is cleaner than web data
  • Biology text often has many annotations
  • Many other kinds of biology data can be exploited (e.g., DNA/protein sequences, gene expression information, metabolic networks)
  • Simple techniques may work
Characteristics of Biology Text
• Large number of entities (e.g., genes, proteins) that have well-defined semantics
• No standard for terminology (inconsistencies)
• Ambiguities (e.g., many acronyms)
• Synonyms
• High complexity in phrases and sentence structures
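To make the synonym and acronym-ambiguity points concrete, here is a minimal sketch of a lexicon lookup that maps a surface form to its candidate entities. The lexicon entries, the canonical labels, and the function names `build_index` and `resolve` are all illustrative assumptions, not something taken from the slides; a real system would draw on a curated gene/protein vocabulary.

```python
from collections import defaultdict

# Toy lexicon (an assumption for illustration): canonical label -> surface forms.
TOY_LEXICON = {
    "TP53": ["TP53", "p53", "tumor protein p53"],
    "CAT_catalase": ["CAT", "catalase"],
    "CAT_acetyltransferase": ["CAT", "chloramphenicol acetyltransferase"],
}


def build_index(lexicon):
    """Invert the lexicon: lowercased surface form -> set of canonical labels."""
    index = defaultdict(set)
    for canonical, forms in lexicon.items():
        for form in forms:
            index[form.lower()].add(canonical)
    return index


def resolve(term, index):
    """Return all candidate labels for a term, sorted; empty list if unknown."""
    return sorted(index.get(term.lower(), set()))


INDEX = build_index(TOY_LEXICON)
print(resolve("p53", INDEX))  # a synonym resolves to a single candidate
print(resolve("CAT", INDEX))  # an ambiguous acronym yields several candidates
```

A term with one candidate is a plain synonym; a term with several candidates is ambiguous and would need context (surrounding words, organism, document topic) to disambiguate, which this sketch does not attempt.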
Research Topics
• General goal: apply known text mining techniques to help biology research
• Problem 1: Data/Information Integration
  • How can we integrate text information (discovering terminology linkages)?
  • How can we link text with databases (semantic interpretation of text on top of entities/relations in the DB, e.g., entity extraction)?
  • How can we integrate biology DBs (many fields are text)?
• Problem 2: Functional Annotation
  • How can we annotate a biological entity (e.g., a gene) with functional information extracted from the literature?
  • How can we annotate a set of related genes with functional information?
  • How can we exploit the ontologies/thesauri in biology?
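As one possible illustration of linking text with databases (Problem 1), the sketch below does simple dictionary-based entity extraction: it scans a sentence for known gene names and returns the matching database record IDs. The record IDs (`G001`, `G002`), the `TOY_DB` layout, and the function names are hypothetical; this is only one simple way to approach the problem, not the method proposed in the talk.

```python
import re

# Hypothetical database records (assumed layout): record ID -> symbol and aliases.
TOY_DB = {
    "G001": {"symbol": "TP53", "aliases": ["p53"]},
    "G002": {"symbol": "BRCA1", "aliases": ["breast cancer 1"]},
}


def build_matcher(db):
    """Compile one regex over all known names; longer names are tried first."""
    surface_to_id = {}
    for rec_id, rec in db.items():
        for name in [rec["symbol"], *rec["aliases"]]:
            surface_to_id[name.lower()] = rec_id
    names = sorted(surface_to_id, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,
    )
    return pattern, surface_to_id


def link_mentions(text, pattern, surface_to_id):
    """Return (offset, mention, record ID) for every known name found in text."""
    return [
        (m.start(), m.group(0), surface_to_id[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


PATTERN, SURFACE_TO_ID = build_matcher(TOY_DB)
for hit in link_mentions("Loss of p53 and BRCA1 was observed.", PATTERN, SURFACE_TO_ID):
    print(hit)
```

Exact string matching like this ignores the terminology inconsistencies and ambiguities noted earlier, so it should be read as a baseline that a real entity extractor would improve on.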
Research Topics (cont.)
• Problem 3: Data/Information Cleanup & Curation
  • How can we detect suspicious data/information in existing databases?
  • How can we automate many manual tasks of database curation?
• Problem 4: Research Question Answering
  • How can we answer simple research questions (e.g., what functional connections are there between these two genes)?
  • How can we support exploratory access to, and digestion of, literature information (e.g., a biology research workbench)?
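For the example question in Problem 4 (connections between two genes), a very simple starting point is sentence-level co-occurrence: collect every sentence that mentions both genes as candidate evidence for a reader to inspect. The sketch below assumes a naive punctuation-based sentence splitter and uses placeholder gene names and made-up documents; the names `split_sentences` and `cooccurring_sentences` are hypothetical.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def cooccurring_sentences(docs, gene_a, gene_b):
    """Return (doc ID, sentence) pairs where both gene names appear."""
    pat_a = re.compile(r"\b" + re.escape(gene_a) + r"\b", re.IGNORECASE)
    pat_b = re.compile(r"\b" + re.escape(gene_b) + r"\b", re.IGNORECASE)
    hits = []
    for doc_id, text in docs.items():
        for sentence in split_sentences(text):
            if pat_a.search(sentence) and pat_b.search(sentence):
                hits.append((doc_id, sentence))
    return hits


# Placeholder documents with made-up gene names (not real findings).
TOY_DOCS = {
    "doc1": "GeneA is expressed in liver. GeneA binds GeneB under stress.",
    "doc2": "GeneB is studied here. GeneC is unrelated.",
}

for doc_id, sentence in cooccurring_sentences(TOY_DOCS, "GeneA", "GeneB"):
    print(doc_id, "->", sentence)
```

Co-occurrence only suggests that a connection may exist; deciding what kind of functional connection a sentence describes would require relation extraction on top of this.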