ISMB 2003 presentation
Extracting Synonymous Gene and Protein Terms from Biological Literature
Hong Yu and Eugene Agichtein
Dept. of Computer Science, Columbia University, New York, USA
{hongyu, eugene}@cs.columbia.edu, 212-939-7028
Significance and Introduction
• Genes and proteins are often associated with multiple names
  • e.g., Apo3, DR3, TRAMP, LARD, and lymphocyte associated receptor of death
• Authors often use different synonyms for the same gene or protein
• Information extraction benefits from identifying these synonyms
• Existing synonym knowledge sources are incomplete
• Goal: develop automated approaches for identifying gene/protein synonyms from the literature
Background: synonym identification
• Semantically related words
  • Distributional similarity [Lin 98] [Li and Abe 98] [Dagan et al 95]
  • e.g., "beer" and "wine" share context words such as "drink", "people", "bottle", and "make"
• Mapping abbreviations to full forms
  • e.g., map LARD to lymphocyte associated receptor of death
  • [Bowden et al. 98] [Hisamitsu and Niwa 98] [Liu and Friedman 03] [Pakhomov 02] [Park and Byrd 01] [Schwartz and Hearst 03] [Yoshida et al. 00] [Yu et al. 02]
• Methods for detecting biomedical multiword synonyms
  • Terms sharing one or more words [Hole 00]
    • cerebrospinal fluid vs. cerebrospinal fluid protein assay
  • Information retrieval approach: trigram matching algorithm [Wilbur and Kim 01], a vector space model over character trigrams (a sketch follows this list)
    • cerebrospinal fluid → cer, ere, …, uid
    • cerebrospinal fluid protein assay → cer, ere, …, say
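A minimal sketch of the trigram-matching idea, assuming character trigrams weighted uniformly and compared with cosine similarity; [Wilbur and Kim 01] describe the actual algorithm, and the function names here are illustrative.

```python
from collections import Counter
from math import sqrt

def trigrams(term):
    """Map a term to its bag of character trigrams,
    e.g. 'cerebrospinal fluid' -> cer, ere, ..., uid."""
    t = term.lower()
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

def trigram_similarity(a, b):
    """Cosine similarity between the trigram vectors of two terms."""
    va, vb = trigrams(a), trigrams(b)
    dot = sum(va[g] * vb[g] for g in va if g in vb)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

# The shared-word pair from the slide scores high despite differing lengths.
print(trigram_similarity("cerebrospinal fluid", "cerebrospinal fluid protein assay"))
```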
Background: synonym identification (continued)
• GPE [Yu et al 02]
  • A rule-based approach for detecting synonymous gene/protein terms
  • Manually recognize patterns authors use to list synonyms
    • e.g., Apo3/TRAMP/WSL/DR3/LARD
  • Extract synonym candidates and apply heuristics to filter out unrelated terms (a hypothetical rule sketch follows this list)
    • e.g., ng/kg/min is a unit expression, not a synonym list
• Advantages and disadvantages
  • High precision (90%)
  • Recall may be low, and the rules are expensive to build
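GPE's actual rules and filters are not reproduced here; the following is a hypothetical sketch of one such rule, treating slash-separated term lists as synonym candidates and filtering unit expressions with a toy stop list.

```python
import re

# Hypothetical GPE-style rule: slash-separated term lists such as
# "Apo3/TRAMP/WSL/DR3/LARD" become synonym candidates, while unit
# strings like "ng/kg/min" are filtered by a simple heuristic.
SLASH_LIST = re.compile(r"\b(?:[A-Za-z][A-Za-z0-9-]*/){2,}[A-Za-z][A-Za-z0-9-]*\b")
UNITS = {"ng", "kg", "min", "ml", "mg", "h", "s"}

def candidate_synonym_sets(sentence):
    for match in SLASH_LIST.finditer(sentence):
        terms = match.group().split("/")
        if not all(t.lower() in UNITS for t in terms):  # drop unit expressions
            yield terms

print(list(candidate_synonym_sets(
    "Apo3/TRAMP/WSL/DR3/LARD was infused at 5 ng/kg/min")))
```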
Background: machine learning
• Machine learning reduces manual effort by automatically acquiring rules from data
• Unsupervised and supervised methods
• Semi-supervised methods
  • Bootstrapping [Hearst 92] [Yarowsky 95] [Agichtein and Gravano 00]
  • Hyponym detection [Hearst 92] (see the sketch after this list)
    • "The bow lute, such as the Bambara ndang, is plucked and has an individual curved neck for each string."
    • → A Bambara ndang is a kind of bow lute
  • Co-training [Blum and Mitchell 98]
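A minimal sketch of one Hearst pattern ("X, such as Y") applied to the example above; the regular expression and the crude noun-phrase approximation are simplifying assumptions, not Hearst's implementation.

```python
import re

# One Hearst pattern: "<hypernym>, such as <hyponym>". Noun phrases are
# approximated crudely as short word runs; a real system would use a
# parser or chunker.
SUCH_AS = re.compile(r"([A-Za-z][A-Za-z ]+?),? such as (?:the )?([A-Z]\w*(?: [a-z]\w*)?)")

def hyponyms(sentence):
    for m in SUCH_AS.finditer(sentence):
        yield m.group(2).strip(), m.group(1).strip()

s = "The bow lute, such as the Bambara ndang, is plucked."
for hypo, hyper in hyponyms(s):
    print(f"{hypo} is a kind of {hyper}")  # Bambara ndang is a kind of The bow lute
```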
Method: outline
• Machine learning
  • Unsupervised: contextual similarity [Dagan et al 95]
  • Semi-supervised: bootstrapping with SNOWBALL [Agichtein and Gravano 02]
  • Supervised: Support Vector Machine
• Comparison between the machine-learning approaches and GPE
• Combined approach
Method: unsupervised
• Contextual similarity [Dagan et al 95]
  • Hypothesis: synonyms have similar surrounding words
  • Mutual information weights each (term, context word) pair
  • Similarity compares the weighted context vectors of two terms (a sketch follows)
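A minimal sketch of contextual similarity, assuming pointwise mutual information as the association weight and cosine similarity between context vectors; [Dagan et al 95] use a related but not identical similarity measure, and the co-occurrence counts below are invented.

```python
from collections import Counter
from math import log, sqrt

def pmi_vectors(term_contexts, total):
    """term_contexts: {term: Counter(context word -> co-occurrence count)}.
    Returns {term: {context word: PMI weight}} with
    PMI(t, w) = log(P(t, w) / (P(t) * P(w)))."""
    term_freq = {t: sum(c.values()) for t, c in term_contexts.items()}
    ctx_freq = Counter()
    for c in term_contexts.values():
        ctx_freq.update(c)
    return {t: {w: log((n * total) / (term_freq[t] * ctx_freq[w]))
                for w, n in c.items()}
            for t, c in term_contexts.items()}

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Invented counts echoing the slide's "beer"/"wine" example.
contexts = {"beer": Counter({"drink": 5, "bottle": 3, "make": 2}),
            "wine": Counter({"drink": 4, "bottle": 2, "people": 1})}
vecs = pmi_vectors(contexts, total=17)
print(cosine(vecs["beer"], vecs["wine"]))
```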
Methods: semi-supervised
• SNOWBALL [Agichtein and Gravano 02]
  • Bootstrapping
  • Starts with a small set of user-provided seed tuples for the relation, then automatically generates and evaluates patterns for extracting new tuples (a sketch follows this list)
  • Example iteration:
    • Seed tuple {Apo3, DR3} matches "Apo3, also known as DR3…" → learn pattern "<GENE>, also known as <GENE>"
    • "DR3, also called LARD…" → learn pattern "<GENE>, also called <GENE>" → new tuples {DR3, LARD}, {LARD, Apo3}
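A minimal bootstrapping sketch in the spirit of SNOWBALL, with exact cue phrases standing in for SNOWBALL's weighted context vectors and no pattern or tuple confidence scoring; the sentences are paraphrased from the slide's example.

```python
import re

SENTENCES = [
    "Apo3, also known as DR3, mediates apoptosis.",
    "Apo3, also called DR3, is a death-domain receptor.",
    "TRAMP, also called LARD, is expressed in lymphocytes.",
]
CUES = ["also known as", "also called", "also termed"]

def bootstrap(seeds, sentences, rounds=3):
    """Grow a set of synonym pairs from seed pairs by learning cue phrases
    that connect known synonyms, then applying those cues to find new pairs."""
    pairs, patterns = set(seeds), set()
    for _ in range(rounds):
        # Learn: a cue phrase that joins two known synonyms becomes a pattern.
        for a, b in pairs:
            for cue in CUES:
                if any(f"{a}, {cue} {b}" in s for s in sentences):
                    patterns.add(cue)
        # Apply: extract new pairs with every learned pattern.
        for cue in patterns:
            for s in sentences:
                for a, b in re.findall(rf"(\w+), {cue} (\w+)", s):
                    pairs.add((a, b))
    return pairs

# Starting from {Apo3, DR3}, the cues "also known as" and "also called"
# are learned and the new pair {TRAMP, LARD} is extracted.
print(bootstrap({("Apo3", "DR3")}, SENTENCES))
```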
Method: supervised
• Support Vector Machine
  • State-of-the-art text classification method
  • Implementation: SVMlight
• Training sets: the same sets of positive and negative tuples as SNOWBALL
• Features: the same terms and term weights used by SNOWBALL
• Kernel function: radial basis function (RBF) kernel (a sketch with a modern library follows)
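The talk used SVMlight; as a rough modern stand-in, the same setup (term-weight feature vectors, RBF kernel) can be sketched with scikit-learn. The feature values and labels below are placeholders, not the SNOWBALL-derived features.

```python
from sklearn.svm import SVC

# Each candidate pair's context is a vector of term weights (toy values
# standing in for the SNOWBALL term weights); label 1 = synonym pair.
X = [
    [0.53, 0.47, 0.00],  # weights for e.g. "also", "known", "termed"
    [0.10, 0.00, 0.05],
]
y = [1, 0]

clf = SVC(kernel="rbf")  # RBF kernel, as in the talk
clf.fit(X, y)
print(clf.predict([[0.50, 0.40, 0.00]]))
```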
Methods: combined
• Rationale
  • The machine-learning approaches increase recall
  • The manual rule-based approach, GPE, has high precision but lower recall
  • Combining them should boost both recall and precision
• Method (a worked sketch follows this list)
  • Assume each system is an independent predictor
  • P(pair is correct) = 1 − ∏ᵢ (1 − pᵢ), i.e., one minus the probability that all systems extracted it incorrectly
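A worked sketch of the noisy-OR combination stated above; the 0.90 figure echoes GPE's reported precision, while 0.75 is just an illustrative SNOWBALL confidence.

```python
def combined_confidence(confidences):
    """Noisy-OR: probability the pair is correct = 1 minus the probability
    that every system extracted it incorrectly (independence assumed)."""
    p_all_wrong = 1.0
    for p in confidences:
        p_all_wrong *= (1.0 - p)
    return 1.0 - p_all_wrong

# e.g. SNOWBALL at 0.75 and GPE at 0.90 combine to 0.975
print(combined_confidence([0.75, 0.90]))
```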
Evaluation: data
• Data
  • GeneWays corpus [Friedman et al 01]
  • 52,000 full-text journal articles
  • Science, Nature, Cell, EMBO, Cell Biology, PNAS, Journal of Biochemistry
• Preprocessing
  • Gene/protein named-entity tagging with AbGene [Tanabe and Wilbur 02]
  • Sentence segmentation with SentenceSplitter
• Training and testing
  • 20,000 articles for training (tuning SNOWBALL parameters such as the context window size)
  • 32,000 articles for testing
Evaluation: metrics
• Estimating precision
  • Randomly select 20 synonym pairs from each confidence-score interval (0.0–0.1, 0.1–0.2, …, 0.9–1.0)
  • Biological experts judged the correctness of the synonym pairs
• Estimating recall
  • SWISSPROT as the gold standard
  • 989 pairs of SWISSPROT synonyms co-appear in at least one sentence in the test set
  • Biological experts judged that 588 of these pairs were indeed used as synonyms; in the rest the terms co-occur without being synonymous, e.g. "…and cdc47, cdc21, and mis5 form another complex, which relatively weakly associates with mcm2…"
Results
• Patterns SNOWBALL found:

  Conf   Left   Middle                               Right
  0.75   -      <( 0.55> <ALSO 0.53> <CALLED 0.53>   -
  0.54   -      <ALSO 0.47> <KNOWN 0.47> <AS 0.47>   -
  0.47   -      <( 0.54> <ALSO 0.54> <TERMED 0.54>   -

• Of 148 evaluated synonym pairs, 62 (42%) were not listed as synonyms in SWISSPROT
Results
• System performance (running time):

  System       Time
  Tagging      7 hrs
  Similarity   40 min
  Snowball     2 hrs
  SVM          1.5 hrs
  GPE          35 min
Conclusions
• Extraction techniques can serve as a valuable supplement to resources such as SWISSPROT
• Identifying synonym relations can be automated through machine-learning approaches
• SNOWBALL can be applied successfully to recognize synonym-listing patterns