Mining Medical Literature Vignesh Ganapathy (CS 374 : Algorithms in Biology) (FALL 2005)
Outline • Introduction and Background • Mining Technique 1: Identifying Functionally Coherent Gene Groups • Mining Technique 2: Extracting Synonymous gene and protein terms • Conclusions
Introduction • Medical literature holds vast amounts of knowledge and information • PubMed Central (PMC), the U.S. National Institutes of Health (NIH) free digital archive of biomedical and life sciences journal literature • Amedeo.com (The Medical Literature Guide) • Journals such as Science, Nature, Cell, EMBO, Cell Biology, PNAS • (and many more)
The Problem • Major task is finding out ways to extract useful information from these resources.
What is Data Mining? “Data Mining is the Process of discovering meaningful, new correlation patterns and trends by sifting through large amount of data stored in repositories, using pattern recognition techniques as well as statistical and mathematical techniques.”
Example Data! • Large amounts of data but no information • Daily transactions at a supermarket • Daily website visit histories • Books/videos rented at a Library • Newspaper, Journal archives
Google News • Clustering News items (Google News)
More Applications • Improving sales strategy • Finding items that sell together (a common example relates beer and diapers: a supermarket found that 50% of the time beer was purchased together with diapers) • Anomaly detection and many more…
Information Retrieval (IR) • Collecting information from text data (Unstructured Data) • Applications • Search web documents • Natural Language Processing • Term also extends to include multimedia or other forms of unstructured data
IR System Evaluation • Some measures are • Precision • Recall • F1 measure – combined measure: the harmonic mean of precision and recall (the general F-measure is a weighted harmonic mean) • Sensitivity • Specificity
Precision and Recall How are Precision and Recall related?
Problems with Precision and Recall • Deciding which documents are relevant and which are not is not easy • For recall, it is difficult to measure the number of relevant documents in the database • Creating a pool of relevant records is one solution • In practice, these are still good measures
Sensitivity and Specificity • Sensitivity – probability that a positive example is classified as positive, TP / (TP + FN) • Specificity – probability that a negative example is classified as negative, TN / (TN + FP) What is the relation between Sensitivity, Specificity, Precision and Recall?
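The measures above can all be computed from the four confusion-matrix counts. The sketch below is illustrative only: the function name `ir_metrics` and the example counts are made up, not taken from the slides. It also answers part of the question on the slide, since sensitivity is the same quantity as recall.

```python
def ir_metrics(tp, fp, fn, tn):
    """Compute evaluation measures from confusion-matrix counts.

    tp/fp: relevant/non-relevant documents that were retrieved,
    fn/tn: relevant/non-relevant documents that were not retrieved.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # F1 is the (unweighted) harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "precision": precision,
        "recall": recall,
        "sensitivity": recall,  # sensitivity and recall are the same ratio
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts for a single query against 1000 documents.
metrics = ir_metrics(tp=40, fp=10, fn=20, tn=930)
```

Note that precision has no counterpart among sensitivity and specificity: it depends on how many documents were retrieved, not only on how each class was handled.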
Outline • Introduction and Background • Mining Technique 1: Identifying Functionally Coherent Gene Groups • Mining Technique 2: Extracting Synonymous gene and protein terms • Conclusion
Introduction • Analysis shifting from single gene to family of genes • Examples of these are: • Sequence Data • Gene Expression Clustering • Deletion Phenotypes • Yeast-2-Hybrid screens
HOVERGEN: a Database of Homologous Vertebrate Genes Useful for comparative sequence analysis, or molecular evolution studies 10 biggest gene families
Why identify functional gene groups? • Interesting to know functionally relevant groups for large gene group sets • Helps to assess the significance of experimentally derived gene sets • Refine gene groups to find more functionally relevant groups • Existing algorithms can make use of this information in finding gene groups
Existing Approaches • Use of co occurrence of gene names in abstracts to create networks of related genes automatically • Use existing vocabulary of gene functions and assigned genes to decide a functionally relevant group (Gene Ontology (GO) consortium and Munich Information Center for Protein Sequences (MIPS) )
Statistical NLP approach • Used for annotating individual genes • Determining gene and protein interactions • Assigning keywords to genes or group of genes
Neighbor Divergence Approach • Statistical NLP technique • Will always be up to date if provided with a current literature base • Cannot specify what the actual function is!
Challenges in the Problem • Large number of genes • Genes have multiple functions • Some genes have been extensively studied, others recently discovered So the literature about genes reflects these differences
Neighbor Divergence Algorithm • Representation Of Articles • Identifying Semantic Neighbors for Corpus Articles • Scoring Articles Relative to Gene Group • Calculating a Theoretical distribution of Scores • Calculating the Difference between empirical and theoretical distribution
ND – Article Representation • Words in articles are weighted by term frequency and inverse document frequency (to reduce the impact of common words) • $W_{i,j} = (1 + \log_2 tf_{i,j}) \log_2 (N / df_i)$ if $tf_{i,j} > 0$ • $W_{i,j} = 0$ if $tf_{i,j} = 0$ • where $W_{i,j}$: weighted count of word i in document j; $tf_{i,j}$: the number of times word i occurs in document j; $df_i$: the number of documents containing word i; $N$: the total number of documents
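A minimal sketch of this weighting, assuming each article is already tokenized into a list of words and reading the formula as (1 + log2 tf) multiplied by log2(N / df). The function name `weight_documents` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def weight_documents(docs):
    """Return one {word: weight} dict per document.

    docs is a list of token lists. Words absent from a document are
    simply left out of its dict, which corresponds to a weight of 0.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    weighted = []
    for tokens in docs:
        tf = Counter(tokens)  # term frequency within this document
        weighted.append({
            word: (1 + math.log2(count)) * math.log2(n_docs / df[word])
            for word, count in tf.items()
        })
    return weighted


# Toy corpus (made up for illustration).
docs = [
    ["the", "gene", "gene", "kinase"],
    ["the", "gene", "yeast"],
    ["the", "kinase", "kinase", "kinase", "kinase"],
    ["the", "yeast", "protein"],
]
weights = weight_documents(docs)
```

A word that occurs in every document, such as "the" here, receives weight 0 because log2(N / df) is 0, which is how the weighting reduces the impact of common words.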
ND – Identifying Semantic Neighbors • For each article, the k most similar articles are precomputed (k = 20 was used) • Cosine similarity is used as the measure (the cosine of the angle between two weighted article vectors)
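The neighbor search can be sketched as a brute-force comparison of every pair of weighted article vectors. This assumes sparse vectors stored as `{word: weight}` dicts and assumes an article is not counted as its own neighbor; the slide does not say either way. The names `cosine` and `semantic_neighbors` and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two sparse {word: weight} vectors."""
    dot = sum(weight * v[word] for word, weight in u.items() if word in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_neighbors(vectors, k=20):
    """For each article index, list the indices of its k most similar articles."""
    neighbors = []
    for i, u in enumerate(vectors):
        scored = [(cosine(u, v), j) for j, v in enumerate(vectors) if j != i]
        # Highest similarity first; ties broken by lower article index.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        neighbors.append([j for _, j in scored[:k]])
    return neighbors


# Toy weighted vectors (made up for illustration).
vectors = [
    {"kinase": 2.0, "yeast": 1.0},
    {"kinase": 1.0, "yeast": 0.5},
    {"protein": 3.0},
    {"yeast": 1.0, "protein": 1.0},
]
neighbors = semantic_neighbors(vectors, k=2)
```

The all-pairs loop is quadratic in the number of articles, which is presumably why the neighbors are precomputed once rather than per gene group.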
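ND – Scoring Articles • Given a gene group g, ND assigns a score $S_{i,g}$ to each article i • The score is a count of the article's semantic neighbors that refer to group genes • Fractional reference of neighbor k: $fr_{k,g} = n_{k,g} / n_k$, where $n_k$ is the number of genes article k refers to and $n_{k,g}$ is the number of those genes that belong to group g • Score value: $S_{i,g} = \mathrm{round}\left(\sum_{j=1}^{20} fr_{\mathrm{sem}(i,j),g}\right)$, where $\mathrm{sem}(i,j)$ is the j-th semantic neighbor of article i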
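A sketch of the scoring step, assuming the fractional reference of a neighbor is the share of the genes it refers to that belong to the group, and that the article-to-gene references are available as sets. The function names, article ids, and gene names below are all hypothetical.

```python
def fractional_reference(article_genes, group):
    """Fraction of the genes an article refers to that belong to the group."""
    if not article_genes:
        return 0.0
    return len(article_genes & group) / len(article_genes)


def article_score(neighbor_ids, genes_by_article, group):
    """Sum the fractional references of an article's semantic neighbors
    and round to the nearest integer.

    Python's round() sends exact .5 ties to the even integer, which may
    differ from the rounding used in the original work.
    """
    total = sum(
        fractional_reference(genes_by_article[k], group) for k in neighbor_ids
    )
    return round(total)


# Toy data: article id -> set of genes the article refers to.
genes_by_article = {
    1: {"A", "B"},        # both genes in the group  -> 1.0
    2: {"A", "C", "D"},   # one of three in the group -> 1/3
    3: {"D"},             # none in the group         -> 0.0
    4: {"B"},             # only gene is in the group -> 1.0
}
group = {"A", "B"}
score = article_score([1, 2, 3, 4], genes_by_article, group)
```

With 20 neighbors per article the score is an integer between 0 and 20, so the scores of all corpus articles form a discrete distribution.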
ND – Difference in Distributions • Calculating a theoretical distribution of scores • Use of a Poisson distribution to represent the absence of coherent functional structure: $P(S = n) = \frac{\lambda^n}{n!} e^{-\lambda}$ • KL divergence between the empirical and theoretical distributions • If the two distributions are the same, the divergence is zero • The more disparate the distributions, the larger the divergence • $D(g \| h) = \sum_i g_i \log (g_i / h_i)$
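A sketch of the final step: build the empirical distribution of article scores, compare it with a Poisson distribution, and report the KL divergence. Several choices here are assumptions not stated on the slide: the natural logarithm is used, the Poisson distribution is truncated at the maximum score without renormalizing, and in the example λ is set to the mean score. All function names are hypothetical.

```python
import math
from collections import Counter


def poisson_pmf(n, lam):
    """P(S = n) for a Poisson distribution with rate lam."""
    return lam ** n * math.exp(-lam) / math.factorial(n)


def kl_divergence(g, h):
    """D(g || h) = sum_i g_i * log(g_i / h_i); terms with g_i = 0 contribute 0."""
    return sum(gi * math.log(gi / hi) for gi, hi in zip(g, h) if gi > 0)


def neighbor_divergence(scores, lam, max_score=20):
    """KL divergence of the empirical score distribution from a Poisson one."""
    counts = Counter(scores)
    total = len(scores)
    empirical = [counts.get(n, 0) / total for n in range(max_score + 1)]
    theoretical = [poisson_pmf(n, lam) for n in range(max_score + 1)]
    return kl_divergence(empirical, theoretical)


# Toy scores: most articles score 0, a small cluster scores very high.
scores = [0] * 90 + [15] * 10
lam = sum(scores) / len(scores)  # assumption: lambda taken as the mean score
divergence = neighbor_divergence(scores, lam)
```

A cluster of articles with unusually high scores, as in the toy data, is far from what a Poisson distribution predicts, so it produces a large divergence.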
Other methods • Word Divergence
Other methods • Best Article Score • Highest article score is used as a measure of the gene group’s functional coherence • Best p-Value • Summed probability of an article having equal or more neighbors than it has • Neighborhood Divergence –No Filter • Filter used is: When calculating semantic neighbors, only articles that refer to different genes are considered.
Outline • Introduction and Background • Mining Technique 1: Identifying Functionally Coherent Gene Groups • Mining Technique 2: Extracting Synonymous gene and protein terms • Conclusion
Introduction • Genes and proteins are associated with multiple names • LARD, DR3, TR3, Wsl, DDR3, APO-3, TRAMP, WSL-1, WSL-LR, Tnfrsf12 • PS2, Alg2, MA-3, alg-2, Pdcd6 • GRIP-1, TIF2, 9530095N19, D1Ertd433e, Ncoa2 (http://bioinformatics.org/textknowledge/synonym.php)
Advantage • Automated method will keep the database updated • Extracting synonyms will help • Information retrieval and extraction • Human curators of biological resource
Existing approaches • Detecting semantically related words • “beer” and “wine” are related terms • Use of WORDNET (a large lexical database of English words) to evaluate semantic similarity • Most synonym identification methods do not consider the surrounding context of words
Information Extraction and Machine Learning • Constructing and tuning extraction systems requires a large amount of manual labor • Machine learning techniques help to reduce the manual labor by automatically acquiring rules from labeled and unlabeled data
ML techniques • Supervised Learning • Labeled training data available • Semi-supervised Learning • Small amount of labeled training data • Unsupervised Learning • Data with no labeling • Reinforcement Learning • Learn a mapping from situations to actions by trial-and-error interactions
Approach Used Here • Obtain tagged genes and proteins in text using existing gene taggers • Four approaches used • Unsupervised Learning • Partially Supervised Learning • Supervised Learning • Hand-Crafted System • Use of a final COMBINED system
Unsupervised Learning – Contextual Similarity • Finds set of words that appear in similar context using mutual information between the words
Unsupervised Learning – Contextual Similarity • Mutual Information • Similarity Measure:
Contextual Similarity • Computing the similarity for all terms takes O(|lexicon|^3) time, so a heuristic search is used • Many false positives are returned, so it is useful to incorporate some domain knowledge
Snowball • Confidence of a pattern • Calculates confidence of extracted tuples and discards low confidence tuples
Supervised Learning – Text classification • User provided positive and negative example gene and protein pairs • Use SVM to train using this data (radial basis kernel function of SVMLight) • Classifies pairs of identified genes and proteins using a confidence score Conf(s)(score assigned by classifier) • Does not combine evidence from multiple occurrences of same gene or protein pair
Hand Crafted Extraction System- GPE system • Most labor intensive but high quality result approach • Starts with set of known pairs of synonyms • Manual examination to find patterns of occurrences • Use of “known as” or “also called” • Scans for more synonyms and uses heuristics and filters to ignore non gene/protein terms • Confidence value of 1 assigned to every returned result
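The pattern-based idea can be illustrated with a small regular-expression sketch. This is not the GPE system itself: the cue list contains only the two phrases named on the slide, and the token pattern, stopword filter, function name, and example sentences are all assumptions made for illustration.

```python
import re

# Cue phrases named on the slide ("known as", "also called") plus the
# combined form "also known as".
CUE = r"(?:also\s+known\s+as|known\s+as|also\s+called)"
# Crude stand-in for a gene/protein name: letters, digits, hyphens.
TOKEN = r"[A-Za-z0-9][A-Za-z0-9-]*"
PATTERN = re.compile(rf"({TOKEN})\s*[,(]?\s*{CUE}\s+({TOKEN})")

# Hypothetical filter for words that match the pattern but are not names.
STOPWORDS = {"is", "are", "was", "were", "gene", "protein"}


def extract_synonyms(text):
    """Return (name, synonym, confidence) triples found in the text."""
    pairs = []
    for match in PATTERN.finditer(text):
        name, synonym = match.group(1), match.group(2)
        if name.lower() in STOPWORDS or synonym.lower() in STOPWORDS:
            continue  # skip non gene/protein terms
        # Every returned result gets a confidence value of 1.
        pairs.append((name, synonym, 1.0))
    return pairs


# Made-up sentences using synonym pairs from the earlier slide.
text = (
    "DR3 (also known as LARD) is a receptor. "
    "GRIP-1, also called TIF2, is a coactivator."
)
pairs = extract_synonyms(text)
```

A sentence such as "Ncoa2 is also known as TIF2" would make the pattern capture "is" as the name, which the stopword filter then drops. That shows why hand-tuned heuristics and filters are needed, and why this approach is the most labor intensive.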