Playing Biology’s Name Game: Identifying Protein Names In Scientific Text. Daniel Hanisch, Juliane Fluck, Heinz-Theodor Mevissen and Ralf Zimmer. Pac Symp Biocomput. 2003;:403-14.
Abstract • Construction of a comprehensive general purpose name dictionary • An accompanying automatic curation procedure based on a simple token model of protein names • An efficient search algorithm to analyze all abstracts in MEDLINE • Parameters are optimized using machine learning techniques
Model for protein and gene names • Protein names are often composed of more than one word (token) • The “order” of these words is not very important – permutation of tokens may occur • General-purpose dictionaries of protein names must be automatically composed
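The token model above can be illustrated with a minimal sketch. It assumes a simple tokenizer that splits names into runs of ASCII letters and runs of digits; the function names `tokenize` and `same_tokens` are hypothetical and not taken from the paper.

```python
import re
from collections import Counter

def tokenize(name):
    """Split a name into lowercase tokens: runs of ASCII letters or runs of digits."""
    return re.findall(r"[a-z]+|\d+", name.lower())

def same_tokens(a, b):
    """True if both names consist of the same tokens, regardless of their order."""
    return Counter(tokenize(a)) == Counter(tokenize(b))

# Token order and punctuation differ, but the token multiset is the same.
print(same_tokens("interleukin 2 receptor", "Receptor, interleukin-2"))
```

Comparing token multisets rather than strings is one simple way to tolerate permuted tokens; the paper's matching procedure is more elaborate.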
Token classes (2/3) • Extract all words from the dictionary with frequency of occurrence > 100 • Non-descriptive tokens: words that occur in databases but are rarely used in free text, or that have no influence on the significance of a match • Modifier tokens: words crucial for correct recognition
Token classes (3/3) • Specifier tokens: Arabic and Roman numerals and Greek letters • Delimiter tokens: used to gain specificity in the matching procedure; they help identify name boundaries • Common words: obtained by comparison to a standard English dictionary • Standard tokens: gene identifiers, which cannot easily be assigned to a separate class
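A token classifier along these lines might look like the sketch below. The word lists are small hypothetical examples chosen for illustration; in the paper they are derived from the dictionary and from a standard English word list. The order in which classes are tested is also an assumption.

```python
# Hypothetical example word lists, for illustration only.
NON_DESCRIPTIVE = {"precursor", "protein", "fragment"}
MODIFIER = {"receptor", "ligand", "inhibitor"}
DELIMITER = {",", "(", ")", ".", ";"}
COMMON = {"the", "of", "and", "in"}
GREEK = {"alpha", "beta", "gamma", "delta", "kappa"}
ROMAN = {"i", "ii", "iii", "iv", "v", "vi", "vii", "viii", "ix", "x"}

def token_class(token):
    """Assign a token to one of the token classes; unknown tokens are 'standard'."""
    t = token.lower()
    if t in DELIMITER:
        return "delimiter"
    if t.isdigit() or t in GREEK or t in ROMAN:
        return "specifier"
    if t in MODIFIER:
        return "modifier"
    if t in NON_DESCRIPTIVE:
        return "non-descriptive"
    if t in COMMON:
        return "common"
    return "standard"

print([token_class(t) for t in ["IL", "2", "receptor", "beta", "precursor"]])
```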
Automatic generation of the dictionary • Extract gene symbols, alias names, and full names for all human genes from the HUGO Nomenclature database • Create an entry for each official gene symbol and add the corresponding names from the OMIM database • Extract all synonyms from the SWISSPROT and TREMBL databases and match these to HUGO entries
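The three steps could be sketched as follows. The input shapes (plain Python dicts and lists) and the matching rule (a protein record is attached to a gene symbol if it shares exactly one known name with it) are assumptions made for illustration; the slides do not specify how synonyms are matched to HUGO entries.

```python
from collections import defaultdict

def build_dictionary(hugo, omim, protein_records):
    """Merge name sources into {official symbol: set of names}.

    hugo:            {symbol: [alias and full names]}   (hypothetical input shape)
    omim:            {symbol: [names]}                  (hypothetical input shape)
    protein_records: list of synonym lists, one per SWISSPROT/TREMBL-like record
    """
    dictionary = defaultdict(set)
    # Step 1: one entry per official gene symbol, seeded with HUGO names.
    for symbol, names in hugo.items():
        dictionary[symbol].add(symbol)
        dictionary[symbol].update(names)
    # Step 2: add OMIM names for symbols that already have an entry.
    for symbol, names in omim.items():
        if symbol in dictionary:
            dictionary[symbol].update(names)
    # Step 3: attach protein synonyms when a record maps to exactly one symbol.
    index = {}
    for symbol, names in dictionary.items():
        for name in names:
            index.setdefault(name.lower(), symbol)
    for synonyms in protein_records:
        hits = {index[s.lower()] for s in synonyms if s.lower() in index}
        if len(hits) == 1:
            dictionary[hits.pop()].update(synonyms)
    return dict(dictionary)

merged = build_dictionary(
    {"IL2RA": ["interleukin 2 receptor, alpha", "CD25"]},
    {"IL2RA": ["TAC antigen"]},
    [["CD25", "IL-2 receptor alpha subunit"], ["unrelated protein"]],
)
print(sorted(merged["IL2RA"]))
```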
Curation of the dictionary (1/3) • Goal: resolve ambiguities and remove nonsensical names from the dictionary • The curation procedure consists of two phases: expansion and pruning • Expansion:
Curation of the dictionary (2/3) • Pruning: remove redundancies, ambiguities, and irrelevant synonyms • First, each synonym is converted to a sequence of token class identifiers • Regular expressions are then used to find unspecific synonyms (e.g. only non-descriptive tokens, only specifier tokens, etc.) • Finally, a list of ambiguous names is stored separately with references to their original records
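The pruning idea can be sketched by encoding each synonym as a string of one-letter class codes and testing it against a regular expression. The codes and the pattern below are hypothetical; the slides only state that synonyms made up solely of, for example, non-descriptive or specifier tokens count as unspecific.

```python
import re

# Hypothetical one-letter codes for the token classes.
CLASS_CODE = {
    "non-descriptive": "N",
    "modifier": "M",
    "specifier": "S",
    "delimiter": "D",
    "common": "C",
    "standard": "T",
}

# Assumed pattern: a synonym is unspecific if, apart from delimiters, it
# contains only non-descriptive, only specifier, or only common tokens.
UNSPECIFIC = re.compile(r"^(?:[ND]+|[SD]+|[CD]+)$")

def class_signature(token_classes):
    """Encode a sequence of token class names as a string of class codes."""
    return "".join(CLASS_CODE[c] for c in token_classes)

def is_unspecific(token_classes):
    """True if the class signature matches the unspecific-synonym pattern."""
    return bool(UNSPECIFIC.match(class_signature(token_classes)))

print(is_unspecific(["non-descriptive", "non-descriptive"]))  # e.g. "protein precursor"
print(is_unspecific(["standard", "specifier"]))               # e.g. "IL 2"
```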
Curation of the dictionary (3/3) • The ambiguity list can be used to identify ambiguous entries and move them to the manual curation list based on their frequency of occurrence.
Efficient detection of names (1/3) • MEDLINE contains about 11 million abstracts • The search runs in time linear in the number of tokens of the parsed text • The algorithm sweeps over the abstract, processing one token at a time, and keeps a set of candidate solutions with two associated scoring measures for the present position: a boundary score s_b and an acceptance score s_a
Efficient detection of names (2/3) • Boundary score s_b: controls the end of the extension of a candidate match and is increased on a token mismatch. The candidate is pruned if s_b > boundary threshold • Acceptance score s_a: determines whether the candidate is reported as a match. s_a is a linear combination of token-class-specific match and mismatch terms. In other words, the significance of the token classes varies.
Efficient detection of names (3/3) • Example: • If only the non-descriptive token “precursor” is unmatched in the candidate, a nearly maximal match score is computed (provided non-descriptive tokens receive a small weight) • However, an unmatched, semantically significant modifier token such as “receptor” leads to a substantial mismatch term (provided the weights are set appropriately)
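The sweep with its two scores can be sketched as below. This is a simplified illustration and not the authors' implementation: the weights, both thresholds, the toy dictionary, and the names `Candidate`, `acceptance` and `find_names` are all assumptions, and overlapping hits are not resolved.

```python
from dataclasses import dataclass, field

# Hypothetical (match reward, mismatch penalty) per token class.
WEIGHTS = {
    "standard": (1.0, 1.0),
    "modifier": (1.0, 2.0),
    "specifier": (1.0, 2.0),
    "non-descriptive": (0.1, 0.1),
}

@dataclass
class Candidate:
    entry: str                      # dictionary entry being matched
    start: int                      # position of first matched text token
    end: int                        # position of last matched text token
    matched: set = field(default_factory=set)
    boundary: float = 0.0           # boundary score, grows on mismatches

def acceptance(entry_tokens, matched):
    """Linear combination of class-specific match and mismatch terms."""
    score = 0.0
    for tok, cls in entry_tokens.items():
        reward, penalty = WEIGHTS[cls]
        score += reward if tok in matched else -penalty
    return score

def find_names(text_tokens, dictionary, boundary_max=1.0, accept_min=1.5):
    """Sweep over the text once, extending or pruning candidates per token."""
    hits, active = [], []

    def close(c):
        s = acceptance(dictionary[c.entry], c.matched)
        if s >= accept_min:
            hits.append((c.entry, c.start, c.end, s))

    for pos, tok in enumerate(text_tokens):
        survivors, extended = [], set()
        for c in active:
            if tok in dictionary[c.entry] and tok not in c.matched:
                c.matched.add(tok)
                c.end = pos
                extended.add(c.entry)
                survivors.append(c)
            else:
                c.boundary += 1.0   # token mismatch raises the boundary score
                if c.boundary > boundary_max:
                    close(c)        # candidate can no longer be extended
                else:
                    survivors.append(c)
        # Start a new candidate for entries containing this token.
        for entry, toks in dictionary.items():
            if tok in toks and entry not in extended:
                survivors.append(Candidate(entry, pos, pos, {tok}))
        active = survivors
    for c in active:
        close(c)
    return hits

# Toy dictionary: {entry: {token: token class}}.
DICTIONARY = {
    "IL2R": {"interleukin": "standard", "2": "specifier",
             "receptor": "modifier", "precursor": "non-descriptive"},
    "IL2": {"interleukin": "standard", "2": "specifier"},
}
print(find_names("the interleukin 2 receptor binds".split(), DICTIONARY))
print(find_names("interleukin 2".split(), DICTIONARY))
```

With these assumed weights, the missing "precursor" barely lowers the score of the IL2R candidate in the first text, while in the second text the missing "receptor" pushes IL2R below the acceptance threshold, so only IL2 is reported.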
Parameter optimization • Robust linear programming (RLP) was used to compute a set of sensible weights • This supervised machine learning technique uses a set of positive samples, i.e. correctly identified protein names, and a set of negative ones. • The match and mismatch weighting parameters for delimiter, specifier, modifier, and standard tokens were tuned. • The optimized weightings penalize mismatches of modifier and number tokens and reward matches of other token classes to varying extents
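For reference, robust linear programming is commonly stated in the following form. This is the textbook formulation, assumed here rather than taken from the slides; the symbols are introduced for illustration: $a_i$ ($i = 1..m$) are feature vectors of positive samples, $b_j$ ($j = 1..k$) those of negative samples, $w$ the weight vector, $\gamma$ the decision threshold, and $y_i$, $z_j$ slack variables measuring violations.

```latex
\begin{aligned}
\min_{w,\,\gamma,\,y,\,z}\quad & \frac{1}{m}\sum_{i=1}^{m} y_i \;+\; \frac{1}{k}\sum_{j=1}^{k} z_j \\
\text{subject to}\quad & w^{\top} a_i - \gamma + y_i \ge 1, \qquad i = 1,\dots,m, \\
& -\,w^{\top} b_j + \gamma + z_j \ge 1, \qquad j = 1,\dots,k, \\
& y_i \ge 0, \quad z_j \ge 0 .
\end{aligned}
```

In this reading, a feature vector would hold the per-class match and mismatch counts of a candidate, so that $w^{\top} a_i$ corresponds to an acceptance score.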
Evaluation • The test dataset is based on the TRANSPATH database on regulatory interactions. • Extracted all human proteins with SWISSPROT annotations • Discarded abstracts if no text was available or if a protein was described for the first time • Resulting benchmark set consists of 611 associations (141 objects in 470 abstracts)