Relevance Detection Approach to Gene Annotation • Aid to automatic annotation of databases • Annotation flow • Extraction of the molecular function of a gene from the literature • Then annotation of this function with a term from a controlled vocabulary • Premise • If the document sets retrieved by a GeneRIF and a GO concept are similar, then a link can be made between them
Data • GeneRIF/GO term pairs • Paired if both reference the same MEDLINE article • Manually filtered for obvious errors • 550 pairs from 335 distinct genes • GO concept = GO term + definition • GeneRIFs and GO concepts are too short for simple keyword matching • Treated as an IR problem • Similar to the TREC novelty track • Compute relevance and similarity of 2 sentences
Document set - TREC Genomics 2003 docs • Each sentence within a GeneRIF/GO concept pair treated as an IR query • Similarity between the 2 computed based on the top 200 docs retrieved by each query • Best Recall = 78.2% (prec = 22.1%) • Best Precision = 66.2% (rec = 46.9%)
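The premise above (link a GeneRIF to a GO concept when the document sets they retrieve are similar) can be sketched as a set-overlap score. The slides do not say which similarity measure or threshold was used, so the Jaccard coefficient, the `threshold` value, and the document IDs below are assumptions for illustration only.

```python
def jaccard(docs_a, docs_b):
    """Overlap between two retrieved document sets (assumed measure)."""
    a, b = set(docs_a), set(docs_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def linked(generif_docs, go_docs, k=200, threshold=0.1):
    """Link the pair if the top-k result lists overlap enough.

    k=200 follows the slide; the threshold is a hypothetical value.
    """
    return jaccard(generif_docs[:k], go_docs[:k]) >= threshold


# Toy ranked result lists (hypothetical document IDs).
generif_hits = ["d1", "d2", "d3", "d4"]
go_hits = ["d3", "d4", "d5", "d6"]
score = jaccard(generif_hits, go_hits)  # 2 shared documents out of 6 distinct
```

Raising the threshold trades recall for precision, which is one plausible reading of the two operating points reported on the slide.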
GO Dependence Relations • Previous work (PSB) • Using substring matching between GO codes • Derived from annotation databases, using vector space models, co-occurrence, association rule-mining. • ChEBI: www.ebi.ac.uk/chebi/ • Chemical Entities of Biological Interest • Preferred names + synonyms • IS_A (poly)hierarchy
Methods • String matching • If the same ChEBI entity is used within 2 GO codes, they are in a dependence relationship • First order relationship • ChEBI term must be a whole word or surrounded by punctuation, e.g. carbonic anhydrase activity is not related to carbon-oxygen lyase activity • Also in a dependence relationship with the ancestors (via the IS_A hierarchy) • Second order relationship
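A minimal sketch of the matching rule described on this slide, assuming a regular-expression word-boundary check and a toy IS_A table. The entity list, the `parents` mapping, and the GO term strings are illustrative stand-ins, not excerpts from ChEBI or GO, and the function names are hypothetical.

```python
import re
from itertools import combinations


def contains_entity(go_term, entity):
    """True if entity occurs as a whole word or is bounded by punctuation."""
    pattern = r"(?<![A-Za-z0-9])" + re.escape(entity) + r"(?![A-Za-z0-9])"
    return re.search(pattern, go_term, flags=re.IGNORECASE) is not None


def ancestors(entity, parents):
    """All IS_A ancestors of an entity in a (poly)hierarchy."""
    seen, stack = set(), list(parents.get(entity, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(parents.get(parent, []))
    return seen


def dependence_pairs(go_terms, entities, parents):
    """Return (first_order, second_order) sets of GO term pairs."""
    direct = {t: {e for e in entities if contains_entity(t, e)} for t in go_terms}
    anc = {t: set().union(*(ancestors(e, parents) for e in direct[t]))
           for t in go_terms}
    first, second = set(), set()
    for a, b in combinations(go_terms, 2):
        if direct[a] & direct[b]:
            first.add((a, b))        # same entity in both terms
        elif direct[a] & anc[b] or direct[b] & anc[a]:
            second.add((a, b))       # one term uses an ancestor of the other's entity
    return first, second


# Toy inputs (assumed for illustration).
entities = ["carbon", "glucose", "hexose"]
parents = {"glucose": ["hexose"]}
go_terms = [
    "carbonic anhydrase activity",
    "carbon-oxygen lyase activity",
    "glucose transport",
    "glucose metabolic process",
    "hexose transport",
]
first, second = dependence_pairs(go_terms, entities, parents)
```

With these inputs, "carbon" matches "carbon-oxygen lyase activity" but not "carbonic anhydrase activity", so that pair is not linked, mirroring the example on the slide.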
Results • 55% of GO terms contain a ChEBI entity • 56% of dependent pairs with a ChEBI term found in PSB study were identified in this study • Less than 1% of GO term pairs found in this study were identified by the PSB study • Issues • How to validate potential relationships? • Usual naming/synonym ambiguity! • Substrings not used: imidazolonepropionase
Disease Text Classification • Task: Classification of text into one of 26 disease classes • Used full text and weighted sections according to information distribution published by other groups
Data Preparation • HTML full text documents, semi-automatic section division • Tokenisation, Stemming, Stop word filtering, Part of speech tagging • Dataset: 21*25 positive full text articles, 33 negative full text articles • 10-fold cross-validation • Nearest centroid classifier
Results • Baseline: 56% F-score • Additional preprocessing: 67% • 10,000 stopword filter • Only nouns • Section weighting: 74% • Abstract and Introduction weighted highest
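The combination of section weighting and a nearest centroid classifier can be sketched as follows. The weight values, the cosine similarity, the whitespace tokeniser, and the toy documents are all assumptions; the slides only state that Abstract and Introduction were weighted highest, and the stemming, stop word, and noun filters are omitted here for brevity.

```python
import math
from collections import Counter, defaultdict

# Hypothetical weights: the source only says Abstract and Introduction rank highest.
SECTION_WEIGHTS = {"abstract": 3.0, "introduction": 3.0, "methods": 1.0, "results": 1.0}


def doc_vector(sections, weights=SECTION_WEIGHTS):
    """Term vector in which each token counts as much as its section's weight."""
    vec = Counter()
    for name, text in sections.items():
        w = weights.get(name, 1.0)
        for tok in text.lower().split():
            vec[tok] += w
    return vec


def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def train_centroids(labelled_docs):
    """Average the weighted vectors of each class into one centroid."""
    sums, counts = defaultdict(Counter), Counter()
    for label, sections in labelled_docs:
        sums[label].update(doc_vector(sections))
        counts[label] += 1
    return {label: Counter({t: w / counts[label] for t, w in vec.items()})
            for label, vec in sums.items()}


def classify(sections, centroids):
    """Assign the class whose centroid is most similar to the document."""
    vec = doc_vector(sections)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Toy training data (hypothetical classes and text).
train = [
    ("diabetes", {"abstract": "insulin glucose", "methods": "patients cohort"}),
    ("asthma", {"abstract": "airway inflammation", "methods": "patients cohort"}),
]
centroids = train_centroids(train)
predicted = classify({"abstract": "glucose insulin resistance", "methods": "cohort"}, centroids)
```

Because abstract tokens carry three times the weight of methods tokens in this sketch, shared boilerplate such as "patients cohort" contributes little to the class decision.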
From Nonsense to Sense in Healthcare Questions • Diagnosis, Prognosis, Therapy, Prevention • Medicine finds disease mechanisms by first finding cures • Currently by trial and error • Try drug, then test • Future - test, then try drug • Biomarkers • Normality -> dysfunction -> disease • There are prognostic markers before any diagnostic markers
Integrative Genomics • Looking for hidden connections over wide field, e.g. • Immune system works too hard = rheumatoid arthritis • Immune system doesn’t work hard enough = infectious diseases
Term Disambiguation • 40% of genes have a homonym problem • For 300 genes = 1 million MEDLINE articles • After disambiguation = 60,000 articles • 93% accuracy in assigning the correct ID to ambiguous genes • Use contextual fingerprints: • Experts choose 5 abstracts about a concept • Fingerprint then created for that concept
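One way to read the fingerprint idea is as a term profile built from the expert-chosen abstracts, against which the context of an ambiguous mention is compared. The slides do not describe how fingerprints are built or compared, so the bag-of-words profile, the cosine comparison, the concept IDs, and the example texts below are assumptions (the toy uses two short texts per concept rather than five abstracts).

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def fingerprint(abstracts):
    """Aggregate term counts over the abstracts chosen for one concept."""
    fp = Counter()
    for text in abstracts:
        fp.update(tokenize(text))
    return fp


def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def disambiguate(context, fingerprints):
    """Return the concept ID whose fingerprint best matches the context."""
    ctx = Counter(tokenize(context))
    return max(fingerprints, key=lambda cid: cosine(ctx, fingerprints[cid]))


# Hypothetical concept IDs sharing one ambiguous symbol.
fingerprints = {
    "GENE_A": fingerprint(["serum antigen marker in prostate cancer",
                           "prostate tumour screening marker"]),
    "GENE_B": fingerprint(["aminopeptidase enzyme cleaves peptides",
                           "peptide degradation by aminopeptidase"]),
}
assigned = disambiguate("elevated marker levels in prostate cancer patients", fingerprints)
```

Articles whose context matches none of the fingerprints well could be discarded, which is consistent with the drop from 1 million to 60,000 articles reported on the slide, though the slides do not state how that filtering was done.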