
Relevance Detection Approach to Gene Annotation


elgin




  1. Relevance Detection Approach to Gene Annotation • Aid to automatic annotation of databases • Annotation flow • Extraction of molecular function of a gene from literature • Then annotation of this function with a term in a controlled vocabulary • Premise • If the document sets retrieved by a GeneRIF and a GO concept are similar then a link can be made between them

  2. Data • GeneRIF/GO term pairs • Paired if reference same MEDLINE article • Manually filtered for obvious errors • 550 pairs from 335 distinct genes • GO concept = GO term + definition • GeneRIFs and GO concepts too short for simple keyword matching • Treated as an IR problem • Similar to TREC novelty track • Compute relevance and similarity of 2 sentences

  3. Document set - TREC Genomics 2003 docs • Each sentence within GeneRIF/GO concept pair treated as IR query • Similarity between the 2 computed based on top 200 docs retrieved by each query • Best Recall = 78.2% (prec = 22.1%) • Best Precision = 66.2% (rec = 46.9%)
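
The premise on slides 1 and 3 (link a GeneRIF to a GO concept when their retrieved document sets are similar) can be illustrated with a minimal sketch. The slides do not say which similarity measure was used, so the Jaccard overlap below is an assumption, as are the function names and the `threshold` value; only the cut-off of 200 documents comes from the slide.

```python
def top_k_overlap(ranked_a, ranked_b, k=200):
    """Jaccard overlap between the top-k document IDs of two ranked lists.

    Assumption: Jaccard is used here purely for illustration; the
    slides do not name the similarity measure.
    """
    top_a, top_b = set(ranked_a[:k]), set(ranked_b[:k])
    if not top_a and not top_b:
        return 0.0
    return len(top_a & top_b) / len(top_a | top_b)


def is_linked(ranked_a, ranked_b, k=200, threshold=0.1):
    """Propose a GeneRIF/GO link when the overlap reaches a threshold.

    The threshold is hypothetical; raising it would trade recall
    for precision.
    """
    return top_k_overlap(ranked_a, ranked_b, k) >= threshold


# Toy ranked result lists (document IDs are made up).
generif_hits = ["d1", "d2", "d3", "d4"]
go_hits = ["d3", "d4", "d5", "d6"]
overlap = top_k_overlap(generif_hits, go_hits)
```

Varying the threshold is one plausible way to move between a high-recall and a high-precision operating point like the two reported on the slide.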

  4. GO Dependence Relations • Previous work (PSB) • Using substring matching between GO codes • Derived from annotation databases, using vector space models, co-occurrence, association rule-mining. • ChEBI: www.ebi.ac.uk/chebi/ • Chemical Entities of Biological Interest • Preferred names + synonyms • IS_A (poly)hierarchy

  5. Methods • String matching • If the same ChEBI entity is used within 2 GO codes, they are in a dependence relationship • First order relationship • ChEBI term must be whole word or surrounded by punctuation, e.g. carbonic anhydrase activity is not related to carbon-oxygen lyase activity • Also, in a dependence relationship with the ancestors • Second order relationship
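
The whole-word rule for first order relationships can be sketched as below. This is an illustrative reading of the slide, not the authors' implementation: the regular expression, the function names and the toy term lists are assumptions, and the second order (ancestor) step is not covered.

```python
import re
from itertools import combinations


def contains_entity(go_term, entity):
    """True if `entity` occurs in `go_term` as a whole word.

    A match must not be preceded or followed by a letter or digit,
    so "carbon" matches "carbon-oxygen" but not "carbonic".
    """
    pattern = r"(?<![A-Za-z0-9])" + re.escape(entity) + r"(?![A-Za-z0-9])"
    return re.search(pattern, go_term, flags=re.IGNORECASE) is not None


def first_order_pairs(go_terms, entities):
    """Return (term_a, term_b, entity) for GO terms sharing an entity."""
    pairs = set()
    for entity in entities:
        hits = sorted(t for t in go_terms if contains_entity(t, entity))
        for a, b in combinations(hits, 2):
            pairs.add((a, b, entity))
    return pairs


# Toy inputs; real runs would use GO term names and ChEBI names/synonyms.
go_terms = [
    "carbonic anhydrase activity",
    "carbon-oxygen lyase activity",
    "oxygen binding",
]
entities = ["carbon", "oxygen"]
pairs = first_order_pairs(go_terms, entities)
```

In this toy run the only pair is the one sharing "oxygen"; "carbonic anhydrase activity" is left out, matching the example on the slide.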

  6. Results • 55% of GO terms contain a ChEBI entity • 56% of dependent pairs with a ChEBI term found in PSB study were identified in this study • Less than 1% of GO term pairs found in this study were identified by the PSB study • Issues • How to validate potential relationships? • Usual naming/synonym ambiguity! • Substrings not used: imidazolonepropionase

  7. Disease Text Classification • Task: Classification of text into one of 26 disease classes • Used full text and weighted sections according to information distribution published by other groups

  8. Data Preparation • HTML full text documents, semi automatic section division • Tokenisation, Stemming, Stop word filtering, Part of speech tagging • Dataset: 21*25 positive full text articles, 33 negative full text articles • 10 fold cross validation • Nearest centroid classifier
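
A nearest centroid classifier of the kind named on this slide can be sketched in a few lines. The sketch assumes plain term-frequency vectors and cosine similarity, neither of which is stated on the slides; the function names and the toy training data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def normalise(vec):
    """Scale a sparse term vector to unit length (empty if all zero)."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


def train_centroids(docs):
    """Build one unit-length centroid per class.

    docs: iterable of (label, list_of_tokens) pairs.
    """
    sums = defaultdict(Counter)
    for label, tokens in docs:
        for term, weight in normalise(Counter(tokens)).items():
            sums[label][term] += weight
    return {label: normalise(vec) for label, vec in sums.items()}


def classify(tokens, centroids):
    """Return the label whose centroid has the highest cosine similarity."""
    doc = normalise(Counter(tokens))

    def cosine(centroid):
        return sum(w * centroid.get(t, 0.0) for t, w in doc.items())

    return max(centroids, key=lambda label: cosine(centroids[label]))


# Toy training data (already tokenised, stemmed and stop-word filtered).
training = [
    ("asthma", ["airway", "inflammation", "asthma"]),
    ("diabetes", ["insulin", "glucose", "diabetes"]),
]
centroids = train_centroids(training)
predicted = classify(["glucose", "insulin", "level"], centroids)
```

Under 10 fold cross validation, `train_centroids` would be rebuilt on nine folds and `classify` applied to the held-out fold each time.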

  9. Results • Baseline: 56% F-score • Additional preprocessing: 67% • 10,000 stopword filter • Only nouns • Section weighting: 74% • Abstract and Introduction weighted highest
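
Section weighting can be sketched as scaling each token's contribution by the weight of the section it appears in. The slide says only that Abstract and Introduction were weighted highest; the numeric weights, the section names other than those two, and the function name below are hypothetical.

```python
from collections import Counter

# Hypothetical weights: only the ordering (Abstract and Introduction
# highest) is taken from the slide.
SECTION_WEIGHTS = {
    "abstract": 3.0,
    "introduction": 3.0,
    "methods": 1.0,
    "results": 1.0,
    "discussion": 1.0,
}


def weighted_term_vector(sections, weights=SECTION_WEIGHTS, default=1.0):
    """Sum token counts across sections, scaled by section weight.

    sections: dict mapping section name to a list of tokens.
    Sections without a configured weight fall back to `default`.
    """
    vec = Counter()
    for name, tokens in sections.items():
        weight = weights.get(name.lower(), default)
        for token in tokens:
            vec[token] += weight
    return vec


# Toy document split into sections.
doc = {"Abstract": ["asthma"], "Methods": ["asthma", "assay"]}
vec = weighted_term_vector(doc)
```

The resulting vector can be fed to a classifier in place of raw term counts, so terms from highly weighted sections pull the document more strongly towards a class.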

  10. From Nonsense to Sense in Healthcare Questions • Diagnosis, Prognosis, Therapy, Prevention • Medicine finds disease mechanisms by first finding cures • Currently by trial and error • Try drug then test • Future - test then try drug • Biomarkers • Normality -> dysfunction -> disease • There are prognostic markers before any diagnostic markers

  11. Integrative Genomics • Looking for hidden connections over wide field, e.g. • Immune system works too hard = rheumatoid arthritis • Immune system doesn’t work hard enough = infectious diseases

  12. Term Disambiguation • 40% of genes have homonym problem • For 300 genes = 1 million MEDLINE articles • After disambiguation = 60,000 articles • 93% accuracy in assigning correct ID to ambiguous genes • Use contextual fingerprints: • Experts choose 5 abstracts about a concept • Fingerprint then created for that concept
