Playing Biology’s Name Game: Identifying Protein Names In Scientific Text

Playing Biology’s Name Game: Identifying Protein Names In Scientific Text. Daniel Hanisch, Juliane Fluck, Heinz-Theodor Mevissen and Ralf Zimmer. Pac Symp Biocomput. 2003;:403-14. Abstract. Construction of a comprehensive general purpose name dictionary

Presentation Transcript


  1. Playing Biology’s Name Game: Identifying Protein Names In Scientific Text Daniel Hanisch, Juliane Fluck, Heinz-Theodor Mevissen and Ralf Zimmer Pac Symp Biocomput. 2003;:403-14.

  2. Abstract • Construction of a comprehensive general purpose name dictionary • An accompanying automatic curation procedure based on a simple token model of protein names • An efficient search algorithm to analyze all abstracts in MEDLINE • Parameters are optimized using machine learning techniques

  3. Model for protein and gene names • Protein names are often composed of more than one word (token) • The “order” of these words is not very important – permutation of tokens may occur • General-purpose dictionaries of protein names must be automatically composed
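The token model above can be illustrated with a minimal sketch: a name is split into tokens, and two names are compared as bags of tokens so that permutations still match. The tokenizer rule and the function names `tokenize` and `same_tokens` are assumptions made for illustration; the slides do not describe the tokenizer actually used.

```python
import re
from collections import Counter


def tokenize(name):
    """Lowercase a name and split it into alphabetic and numeric tokens.

    Assumption: punctuation and whitespace only separate tokens.
    """
    return re.findall(r"[a-z]+|\d+", name.lower())


def same_tokens(name_a, name_b):
    """Return True if both names contain the same tokens, ignoring order."""
    return Counter(tokenize(name_a)) == Counter(tokenize(name_b))


if __name__ == "__main__":
    print(tokenize("IL-2 receptor"))
    print(same_tokens("interleukin 2 receptor", "receptor, interleukin-2"))
```

Comparing token multisets rather than strings is what makes "receptor, interleukin-2" and "interleukin 2 receptor" equivalent in this sketch.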

  4. Token classes (1/3)

  5. Token classes (2/3) • Extract all words from the dictionary with frequency of occurrence > 100 • Non-descriptive tokens: words that occur in databases but are rarely used in free text, or that have no influence on the significance of a match • Modifier tokens: words crucial for correct recognition

  6. Token classes (3/3) • Specifier tokens: Arabic and Roman numerals and Greek letters • Delimiter tokens: used to gain specificity in the matching procedure – help identify name boundaries • Common words: obtained by comparison to a standard English dictionary • Standard tokens: gene identifiers, as they cannot be easily assigned to a separate class
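A small sketch of how a token might be assigned to one of the classes listed above. Only "precursor" (non-descriptive) and "receptor" (modifier) are taken from the slides; every other word list entry, the simplistic Roman-numeral test, and the order in which classes are checked are assumptions for illustration.

```python
import re

# Hypothetical word lists; only "precursor" and "receptor" come from the slides.
NON_DESCRIPTIVE = {"precursor", "fragment", "protein"}
MODIFIER = {"receptor", "inhibitor", "ligand"}
DELIMITER = {",", ";", "(", ")"}
COMMON = {"the", "of", "for", "and"}
GREEK = {"alpha", "beta", "gamma", "delta", "kappa"}


def classify(token):
    """Return the token class of a single token (checked in a fixed order)."""
    tok = token.lower()
    if tok in DELIMITER:
        return "delimiter"
    # Specifiers: Arabic numbers, Greek letters, and (crudely) Roman numerals.
    if tok.isdigit() or tok in GREEK or re.fullmatch(r"[ivx]+", tok):
        return "specifier"
    if tok in MODIFIER:
        return "modifier"
    if tok in NON_DESCRIPTIVE:
        return "non-descriptive"
    if tok in COMMON:
        return "common"
    # Everything else, e.g. gene identifiers, is treated as a standard token.
    return "standard"


if __name__ == "__main__":
    for t in ["TNF", "receptor", "2", "beta", "precursor"]:
        print(t, "->", classify(t))
```

In practice the word lists would be derived from the dictionary itself (words with frequency > 100) and from a standard English dictionary, as the slides describe, rather than written by hand.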

  7. Automatic generation of the dictionary • Extract gene symbols, alias names, and full names for all human genes from the HUGO Nomenclature database • Create an entry for each official gene symbol and add the corresponding names in the OMIM database • Extract all synonyms in the SWISSPROT and TREMBL databases and match these to HUGO entries
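The three steps above can be sketched as a merge keyed by official gene symbol. The input shapes (a list of HUGO-like records, an OMIM-like mapping, and lists of SWISSPROT/TREMBL-like synonyms) and the matching rule (case-insensitive exact match against a known name) are assumptions; the slides do not specify the real file formats or matching criteria.

```python
from collections import defaultdict


def build_dictionary(hugo, omim, swissprot):
    """Merge names from three sources into {official symbol: set of names}."""
    dictionary = defaultdict(set)
    index = {}  # lowercase name -> official symbol, used for matching

    # Step 1: one entry per official symbol with aliases and full name.
    for rec in hugo:
        symbol = rec["symbol"]
        for name in [symbol, rec["name"], *rec["aliases"]]:
            dictionary[symbol].add(name)
            index[name.lower()] = symbol

    # Step 2: add OMIM names for symbols that already have an entry.
    for symbol, names in omim.items():
        if symbol in dictionary:
            dictionary[symbol].update(names)

    # Step 3: attach a synonym group if it points to exactly one entry.
    for synonyms in swissprot:
        hits = {index[s.lower()] for s in synonyms if s.lower() in index}
        if len(hits) == 1:
            dictionary[hits.pop()].update(synonyms)

    return dict(dictionary)


if __name__ == "__main__":
    # Toy input for illustration only.
    hugo = [{"symbol": "TNF", "name": "tumor necrosis factor", "aliases": ["TNFA"]}]
    omim = {"TNF": ["tumor necrosis factor, alpha"]}
    swissprot = [["TNFA", "cachectin"], ["unrelated protein"]]
    print(build_dictionary(hugo, omim, swissprot))
```

Synonym groups that match no entry, or more than one, are left out here; a real pipeline would need a policy for them.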

  8. Curation of the dictionary (1/3) • Goal: resolve ambiguities and remove nonsensical names from the dictionary • The curation procedure consists of two phases: expansion and pruning • Expansion:

  9. Curation of the dictionary (2/3) • Pruning: remove redundancies, ambiguities, and irrelevant synonyms • First: synonym → a sequence of token class identifiers • Use regular expressions to search for unspecific synonyms (e.g. only non-descriptive tokens, only specifier tokens, etc.) • Finally, a list of ambiguous names is stored separately with reference to their original records
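A minimal sketch of the pruning idea: each synonym is turned into a string of one-letter class identifiers, and a regular expression over that string flags unspecific synonyms. The one-letter codes, the tiny class table, and the exact pattern are assumptions; the slides only name "only non-descriptive tokens" and "only specifier tokens" as examples.

```python
import re

# Hypothetical one-letter class codes: N non-descriptive, M modifier,
# S specifier, C common, D delimiter, T standard (default).
CLASS_CODE = {
    "precursor": "N", "protein": "N",
    "receptor": "M",
    "alpha": "S", "beta": "S",
    "of": "C",
    ",": "D",
}


def class_sequence(synonym):
    """Map a synonym to its sequence of token class identifiers."""
    tokens = re.findall(r"[a-z]+|\d+|[^\sa-z\d]", synonym.lower())
    return "".join("S" if t.isdigit() else CLASS_CODE.get(t, "T") for t in tokens)


# Unspecific: only non-descriptive/common/delimiter tokens,
# or only specifier/delimiter tokens.
UNSPECIFIC = re.compile(r"[NCD]+|[SD]+")


def prune(synonyms):
    """Keep only synonyms whose class sequence is not unspecific."""
    return [s for s in synonyms if not UNSPECIFIC.fullmatch(class_sequence(s))]


if __name__ == "__main__":
    print(class_sequence("TNF receptor 1"))
    print(prune(["TNF receptor 1", "protein precursor", "alpha 2", "p53"]))
```

Working on class sequences rather than raw text keeps the pruning rules short: one pattern covers every synonym made up solely of, say, numbers and Greek letters.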

  10. Curation of the dictionary (3/3) • The ambiguity list can be used to identify such entries and move them to the manual curation list based on their frequency of occurrence.
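The ambiguity list can be sketched as a reverse index from each synonym to the records that contain it. The record identifiers, the frequency table, and the rule "frequency at or above `min_count` goes to manual curation" are hypothetical; the slides say only that the decision is based on frequency of occurrence.

```python
from collections import defaultdict


def ambiguity_list(dictionary):
    """Return {synonym: sorted record ids} for synonyms found in >1 record."""
    owners = defaultdict(set)
    for record_id, synonyms in dictionary.items():
        for synonym in synonyms:
            owners[synonym.lower()].add(record_id)
    return {name: sorted(ids) for name, ids in owners.items() if len(ids) > 1}


def manual_curation_candidates(ambiguous, frequency, min_count):
    """Pick ambiguous names whose corpus frequency reaches min_count (assumed rule)."""
    return sorted(n for n in ambiguous if frequency.get(n, 0) >= min_count)


if __name__ == "__main__":
    # Toy records with made-up identifiers, for illustration only.
    dictionary = {
        "GENE_A": {"p40", "alpha factor"},
        "GENE_B": {"P40"},
        "GENE_C": {"cap", "alpha factor"},
    }
    ambiguous = ambiguity_list(dictionary)
    print(ambiguous)
    print(manual_curation_candidates(ambiguous, {"p40": 250, "alpha factor": 3}, 100))
```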

  11. Efficient detection of names (1/3) • MEDLINE contains about 11 million abstracts • The search runs in time linear in the number of tokens of the parsed text • The algorithm sweeps over each abstract one token at a time, keeping for the current position a set of candidate solutions and two associated scoring measures: a boundary score and an acceptance score

  12. Efficient detection of names (2/3) • Boundary score: controls where the extension of a candidate match ends and is increased on each token mismatch. The candidate is pruned once its boundary score exceeds the boundary threshold • Acceptance score: determines whether the candidate is reported as a match. It is a linear combination of token-class-specific match and mismatch terms, so the significance of a token varies with its class

  13. Efficient detection of names (3/3) • Example: • Only the non-descriptive token “precursor” is unmatched in the candidate → a nearly maximal match score would be computed (if non-descriptive tokens receive a small weight) • However, the semantically significant modifier token “receptor” leads to a substantial mismatch term (if weights are set appropriately)
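The two scores described on the preceding slides can be sketched for a single candidate as follows. All weights, the boundary threshold, the normalization by the best achievable match score, and the example names are assumptions chosen to reproduce the qualitative behaviour described; the real algorithm tracks many candidates at once during one sweep and uses learned weights.

```python
# Hypothetical per-class weights (the slides tune these by machine learning).
MATCH = {"standard": 1.0, "modifier": 1.0, "specifier": 1.0, "non-descriptive": 0.1}
MISMATCH = {"standard": 1.0, "modifier": 2.0, "specifier": 2.0, "non-descriptive": 0.1}
TOKEN_CLASS = {"receptor": "modifier", "precursor": "non-descriptive"}


def token_class(token):
    """Tiny stand-in classifier: numbers are specifiers, unknowns are standard."""
    if token.isdigit():
        return "specifier"
    return TOKEN_CLASS.get(token, "standard")


def acceptance_score(candidate, text, start, boundary_threshold=1.0):
    """Extend one candidate over text[start:] and return its normalized score."""
    remaining = list(candidate)  # candidate tokens not yet seen in the text
    boundary = 0.0               # grows on each text token that does not match
    matched = 0.0                # sum of match terms
    for token in text[start:]:
        if token in remaining:
            remaining.remove(token)
            matched += MATCH[token_class(token)]
            if not remaining:
                break
        else:
            boundary += 1.0
            if boundary > boundary_threshold:
                break  # extension ends: the candidate is not extended further
    # Unmatched candidate tokens contribute class-specific mismatch terms.
    missed = sum(MISMATCH[token_class(t)] for t in remaining)
    best = sum(MATCH[token_class(t)] for t in candidate)
    return (matched - missed) / best


if __name__ == "__main__":
    text = ["the", "tnf", "receptor", "binds"]
    print(acceptance_score(["tnf", "receptor", "precursor"], text, start=1))
    print(acceptance_score(["tnf", "receptor"], ["tnf", "is", "a", "cytokine"], start=0))
```

With these assumed weights, a missing "precursor" costs almost nothing, while a missing "receptor" pushes the score below zero, so a candidate would be reported only in the first case for any reasonable acceptance threshold.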

  14. Parameter optimization • Robust linear programming (RLP) was used to compute a set of sensible weights • This supervised machine learning technique uses a set of positive samples, i.e. correctly identified protein names, and a set of negative ones • The match and mismatch weighting parameters for delimiter, specifier, modifier, and standard tokens were tuned • The optimized weights penalize mismatches of modifier and number tokens and reward matches of other token classes to varying extents
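For orientation, the standard robust linear programming formulation of Bennett and Mangasarian is given below. The slides do not state which variant was used, so this is an assumption. Here each sample is taken to be a feature vector of token-class match and mismatch terms, with $x_i$ ($i = 1,\dots,m$) the positive samples, $x'_j$ ($j = 1,\dots,k$) the negative ones, $w$ the weight vector, $\gamma$ the decision threshold, and $y_i$, $z_j$ slack variables.

```latex
\begin{aligned}
\min_{w,\,\gamma,\,y,\,z}\quad
  & \frac{1}{m}\sum_{i=1}^{m} y_i \;+\; \frac{1}{k}\sum_{j=1}^{k} z_j \\
\text{subject to}\quad
  & w^{\top} x_i - \gamma + y_i \;\ge\; 1, \qquad i = 1,\dots,m, \\
  & -\,w^{\top} x'_j + \gamma + z_j \;\ge\; 1, \qquad j = 1,\dots,k, \\
  & y_i \ge 0, \quad z_j \ge 0 .
\end{aligned}
```

The objective averages the margin violations separately over the positive and the negative set, so the learned weights are not dominated by whichever set is larger.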

  15. Evaluation • The test dataset is based on the TRANSPATH database on regulatory interactions. • Extracted all human proteins with SWISSPROT annotations • Discarded abstracts if no text was available or if a protein was described for the first time • Resulting benchmark set consists of 611 associations (141 objects in 470 abstracts)

  16. Results – 5-fold cross-validation
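A minimal sketch of how a 5-fold cross-validation split can be produced. Splitting at the level of abstracts, the shuffling seed, and the function name `k_fold_splits` are assumptions; the slides do not say how the folds were formed.

```python
import random


def k_fold_splits(items, k=5, seed=0):
    """Yield (train, test) lists for k-fold cross-validation."""
    items = list(items)
    random.Random(seed).shuffle(items)        # fixed seed for reproducibility
    folds = [items[i::k] for i in range(k)]   # k nearly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


if __name__ == "__main__":
    # 470 abstract indices, matching the size of the benchmark set.
    for train, test in k_fold_splits(range(470), k=5):
        print(len(train), len(test))
```

Each item lands in exactly one test fold, so weights tuned on the training part are always evaluated on abstracts they were not fitted to.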
