Mining the Medical Literature Chirag Bhatt October 14th, 2004
Why mine data? • Medical, genomics, and proteomics research • Find causal links between symptoms or diseases and drugs or chemicals • Gene comparison
An example Problem • What is causing an uncharacteristic behavior in protein production? Solution • Find which genes have a role to play in amino acid synthesis How? • Search through online literature for genes that play a role in amino acid synthesis
Data Retrieval • Company Database • e.g. Customer records, product inventory • Search entity (structured) • records • Query (goal-driven) • What is the address of our client? • How many widgets are in stock? • SQL, Oracle, DB2, etc
Information Retrieval • Google, A9, AltaVista • Query (goal-driven) • Search entity (unstructured) • documents • variable format • html, pdf, etc
Data Mining • Structured data set • Generally a large amount of (historical) data • Find relations, patterns, or trends in the database (opportunistic) • E.g. “beer and diapers”
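The "beer and diapers" pattern is usually described as an association rule. The sketch below computes the support and confidence of such a rule over a handful of shopping baskets. The baskets are invented for illustration, and the function names are assumptions, not part of the original talk.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) over the transactions."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Made-up baskets, not real data.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"diapers", "milk"},
]

print(support(baskets, {"beer", "diapers"}))       # 0.5
print(confidence(baskets, {"beer"}, {"diapers"}))  # 1.0
```

A rule with high support and high confidence is the kind of opportunistic finding data mining looks for. No query asked for it in advance.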
Text Mining • Unstructured data set • Documents, publications, abstracts, web pages • Discover useful and previously unknown “gems” of information in large text collections using patterns, trends and domain knowledge
Need for mining text • Approximately 90% of the world’s data is held in unstructured formats (source: Oracle Corporation)
Why Text Mining in Medical Literature? • Many multi-functional genes • Screen functionally interesting ones • Complexity of needs increasing • Individual genes -> families of genes • Manual text mining? Not really! • Availability of published literature online
Functionally Coherent Genes • Group of genes that exhibit similar experimental features • Amino acid metabolism, electron transport, stress response
Difficulties • Difficulties faced in finding functionally coherent genes • Most genes express multi-functionality • Some genes studied extensively and some only just discovered
Semantic neighbor • Two articles are semantic neighbors if they have similar word usage • Use statistical natural language processing to access and interpret online text
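One simple way to make "similar word usage" concrete is cosine similarity between word-count vectors. The slide only says that statistical natural language processing is used, so the raw word counts, the cosine measure, and the function names below are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Bag-of-words vector: lowercase word -> number of occurrences."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def semantic_neighbors(docs, index, k):
    """Indices of the k documents whose word usage is closest to docs[index]."""
    vecs = [word_counts(d) for d in docs]
    others = [j for j in range(len(docs)) if j != index]
    others.sort(key=lambda j: cosine(vecs[index], vecs[j]), reverse=True)
    return others[:k]


# Toy abstracts, invented for illustration.
docs = [
    "amino acid synthesis gene",
    "gene regulates amino acid synthesis",
    "electron transport chain",
]
print(semantic_neighbors(docs, 0, 1))  # [1]
```

A real system would likely weight words (for example, down-weighting very common ones) before comparing articles. This sketch skips that step.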
Methodology • Find semantic neighbors in the document set • If any article about a common function refers to at least one gene in the group, then the group is functionally coherent
Neighbor divergence • Scoring method • Each article’s relevance to the gene group is scored by: • the number of its semantic neighbors that reference genes in the group
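The scoring rule above can be sketched as a count over each article's neighbor list. The input layout (two dictionaries keyed by article id) and the gene names are placeholders chosen for illustration, not taken from the talk.

```python
def neighbor_scores(neighbors, article_genes, gene_group):
    """Score each article by how many of its semantic neighbors
    reference at least one gene in the group.

    neighbors:     article id -> list of neighbor article ids (assumed layout)
    article_genes: article id -> set of genes the article references
    """
    group = set(gene_group)
    return {
        article: sum(1 for n in nbrs if article_genes.get(n, set()) & group)
        for article, nbrs in neighbors.items()
    }


# Placeholder articles and gene names.
neighbors = {"a1": ["a2", "a3"], "a2": ["a1", "a3"], "a3": ["a1", "a2"]}
article_genes = {"a1": {"ARO1"}, "a2": {"ARO1", "TRP2"}, "a3": {"CYC1"}}

print(neighbor_scores(neighbors, article_genes, {"ARO1", "TRP2"}))
# {'a1': 1, 'a2': 1, 'a3': 2}
```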
Neighbor divergence scores • If the score distribution differs from a Poisson distribution, the gene group represents a biological function • For Poisson-distributed scores, the log ratio should be flat along the horizontal axis
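A rough sketch of comparing article scores with a Poisson distribution follows. Fitting the Poisson mean to the sample mean, and summarizing the difference with a Kullback-Leibler divergence, are assumptions made here. The slide states only that the distribution should differ from Poisson and that the log ratio is flat in the Poisson case.

```python
import math
from collections import Counter


def poisson_pmf(k, lam):
    """Probability of observing k under a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def log_ratio_to_poisson(scores):
    """log(observed fraction / Poisson probability) for each observed score.

    The Poisson mean is set to the sample mean (an assumption).
    Values near zero for every score mean the curve is flat.
    """
    n = len(scores)
    lam = sum(scores) / n
    observed = Counter(scores)
    return {
        k: math.log((c / n) / poisson_pmf(k, lam))
        for k, c in sorted(observed.items())
    }


def divergence(scores):
    """KL divergence of the observed score distribution from the fitted Poisson."""
    n = len(scores)
    observed = Counter(scores)
    ratios = log_ratio_to_poisson(scores)
    return sum((observed[k] / n) * r for k, r in ratios.items())


# Invented scores: most articles score 0, one article scores very high.
skewed = [0] * 9 + [9]
print(log_ratio_to_poisson(skewed))
print(divergence(skewed))
```

A heavy tail of high-scoring articles, as in the invented example, pushes the log ratio far from zero at the high scores and produces a large divergence.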
Need to filter results • Well-studied genes generally tend to have semantic neighbors that refer to the same gene • A neighbor may not be relevant to the group function but still increases the score: a false positive • So only articles that refer to different genes are considered
Evaluation • Report the percentile of a functional group of genes • Calculate precision and recall at different cutoff levels (next slide) • Replace legitimate genes in a group with irrelevant genes
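Precision and recall at a given cutoff can be computed as below. The scores and labels are invented, and treating "score at or above the cutoff" as a positive prediction is an assumption about how the cutoff is applied.

```python
def precision_recall(scored_groups, cutoff):
    """scored_groups: list of (score, is_functional) pairs.

    A group is predicted functional when its score is >= cutoff.
    """
    tp = sum(1 for s, truth in scored_groups if s >= cutoff and truth)
    fp = sum(1 for s, truth in scored_groups if s >= cutoff and not truth)
    fn = sum(1 for s, truth in scored_groups if s < cutoff and truth)
    # Convention: precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Invented scores for three functional and two random groups.
groups = [(0.9, True), (0.8, True), (0.7, False), (0.4, True), (0.2, False)]

for cutoff in (0.75, 0.3):
    print(cutoff, precision_recall(groups, cutoff))
```

Lowering the cutoff raises recall and usually lowers precision, which is why both are reported across several cutoff levels.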
Results • Sample Space: 19 known yeast groups and 1900 random groups
Limitations of neighbor divergence • Neighbor divergence helps group genes; it does not tell us their function • Work based on abstracts only • Searching the entire literature may prove challenging • Break into smaller components
Another mining approach Extracting synonymous gene and protein terms
Why find synonyms? • Genes and proteins are often associated with multiple names across articles and subdomains • More names keep getting added • as new functional or structural information is discovered • Improve search and analysis
Current work • Biological databases such as GenBank and SWISSPROT include synonyms • Not up to date • Disagreement on some synonyms • Laborious manual curation and review • Need for automation
Two-step problem • Identifying gene and protein names • Done by state-of-the-art taggers • Determining whether these names are synonymous • We’ll discuss more on this…
Current synonym approaches • Synonymous gene and protein names represent same biological substance • Exhibit identical biological functions • Same gene or amino acid sequences • Other approaches • String matching • Matching abbreviations to full-forms
Gene and Protein Tagging • Identification step • Uses BLAST techniques and domain knowledge to pick out genes and protein terms • Heuristics • Synonyms usually occur within same sentence • Synonyms mentioned in first few pages of article
Synonym detection approaches • Unsupervised - ‘Similarity’ • based on contextual similarity • Semi-supervised - ‘Snowball’ • extracts structured relations using patterns • Supervised - Text Classification/SVM • Hand-crafted extraction – GPE • Combined system
Combined Approach • Combine the output of SnowBall, SVM, and GPE • Each system gives a confidence score for each synonym pair, where s = <p1, p2> is a synonym pair and Conf_E(s) is the confidence assigned to s by the individual extraction system E
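The text above does not spell out how the per-system scores are merged. One common way to combine independent confidences is a noisy-OR, Conf(s) = 1 - prod over E of (1 - Conf_E(s)). It is used here only as an assumed illustration, not as the formula from the talk.

```python
def combined_confidence(confidences):
    """Noisy-OR combination (an assumption): the pair is rejected only if
    every individual system rejects it."""
    remaining = 1.0
    for c in confidences:
        remaining *= 1.0 - c
    return 1.0 - remaining


# Hypothetical scores from SnowBall, SVM and GPE for one synonym pair.
print(combined_confidence([0.5, 0.5, 0.0]))  # 0.75
```

Under this rule a pair found by several systems ends up with a higher confidence than any single system assigned it.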
Unsupervised - Similarity • Context based • All words occurring within an ‘x’-word window • False positives are very common • Run time – O(|lexicon|^3)
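The context-window idea can be sketched as follows: collect the words within a fixed window around each occurrence of a term, then compare two terms by how much their contexts overlap. The overlap measure (a Jaccard-style ratio on counts) and the placeholder terms geneA and geneB are assumptions.

```python
from collections import Counter


def context_vector(tokens, term, window):
    """Count the words within `window` positions of each occurrence of `term`."""
    ctx = Counter()
    for i, tok in enumerate(tokens):
        if tok == term:
            lo = max(0, i - window)
            ctx.update(tokens[lo:i] + tokens[i + 1:i + 1 + window])
    return ctx


def overlap(a, b):
    """Shared context mass divided by total context mass."""
    shared = sum((a & b).values())  # Counter & keeps the minimum counts
    total = sum((a | b).values())   # Counter | keeps the maximum counts
    return shared / total if total else 0.0


tokens = "geneA activates kinase signaling and geneB activates kinase signaling".split()
a = context_vector(tokens, "geneA", 2)
b = context_vector(tokens, "geneB", 2)
print(overlap(a, b))  # 0.5
```

Two terms that merely appear in similar sentences score well here even when they are not synonyms, which is one way to see why false positives are common.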
Semi-supervised - Snowball • Manual feedback mechanism
Supervised – Text Classification • Input: known synonym pairs • Automatically find contexts and assign weights • Train classifier to distinguish between ‘positive’ and ‘negative’ contexts • E.g. ‘A also known as B’ (positive) and ‘A regulates B’ (negative)
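To illustrate learning weights for contexts, here is a very small stand-in classifier. It is not an SVM: it uses smoothed log-odds word weights in the style of naive Bayes, without a class prior, because that fits in a few lines. The training contexts are invented.

```python
import math
from collections import Counter


def train(contexts):
    """contexts: list of (text, label), label True for synonym contexts.

    Returns word -> weight, the add-one-smoothed log-odds of the word
    appearing in positive versus negative contexts.
    """
    pos, neg = Counter(), Counter()
    for text, label in contexts:
        (pos if label else neg).update(text.lower().split())
    vocab = set(pos) | set(neg)
    pos_total = sum(pos.values()) + len(vocab)
    neg_total = sum(neg.values()) + len(vocab)
    return {
        w: math.log((pos[w] + 1) / pos_total) - math.log((neg[w] + 1) / neg_total)
        for w in vocab
    }


def is_synonym_context(weights, text):
    """Positive when the summed word weights exceed zero; unseen words count as 0."""
    return sum(weights.get(w, 0.0) for w in text.lower().split()) > 0


# Invented training contexts.
examples = [
    ("A also known as B", True),
    ("A also called B", True),
    ("A regulates B", False),
    ("A binds to B", False),
]
weights = train(examples)
print(is_synonym_context(weights, "X also known as Y"))  # True
print(is_synonym_context(weights, "X regulates Y"))      # False
```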
Why Combined Approach? • SnowBall and SVM, machine-learning based • captures synonyms that may be missed by GPE • GPE, knowledge-based • SnowBall and SVM have many false positives • Combine both advantages
Summary • Text mining • Semantic neighbor • Neighbor divergence • Precision and Recall • Synonym detection Approaches • Comments / Questions?