This presentation discusses an innovative approach to scalable semantic embedding and its potential benefits for information retrieval, entity disambiguation, de-duplication, recommendation, clustering, and subject prediction. It introduces embedding approaches, including global co-occurrence count-based methods and local context predictive methods, and explores the use of random projection for entity and document embeddings. The approach is evaluated in a subject prediction experiment on Medline articles. The authors emphasize the need for human involvement and appropriate evaluation measures in embedding efforts.
AIDR2019 • May 13, 2019 An innovative approach to scalable semantic search embedding Shenghui Wang, Rob Koopman, Titia van der Werf and Jean Godby OCLC Research
Why semantic embedding? Many of our tasks could be improved by semantic embedding • Information retrieval • Entity disambiguation • De-duplication • Recommendation • Clustering • Subject prediction • ...
Foundation for semantic embedding • Distributional Hypothesis (Harris, 1954) “words that occur in similar contexts tend to have similar meanings” • Statistical Semantics (Weaver, 1955; Firth, 1957) “a word is characterized by the company it keeps”
An example by Stefan Evert: what is bardiwac? • He handed her her glass of bardiwac. • Beef dishes are made to complement the bardiwacs. • Nigel staggered to his feet, face flushed from too much bardiwac. • Malbec, one of the lesser-known bardiwac grapes, responds well to Australia’s sunshine. • I dined on bread and cheese and this excellent bardiwac. • The drinks were delicious: blood-red bardiwac as well as light, sweet Rhenish. ⇒ ‘bardiwac’ is a heavy red alcoholic beverage made from grapes
Word embedding • Each word is represented as a vector of numbers that captures its meaning • Semantically similar words are mapped to nearby points, that is, they “are embedded nearby each other” • A desirable property: computable similarity https://medium.com/@jayeshbahire/introduction-to-word-vectors-ea1d4e4b84bf
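As a rough illustration of "computable similarity", the sketch below compares toy word vectors by cosine similarity. The three-dimensional vectors and the helper name `cosine_similarity` are invented for this example; real embeddings have hundreds of dimensions and are learned from a corpus.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# Hypothetical embeddings, made up for illustration only.
toy_vectors = {
    "wine": [0.9, 0.1, 0.2],
    "bardiwac": [0.8, 0.2, 0.1],
    "keyboard": [0.1, 0.9, 0.3],
}

wine_vs_bardiwac = cosine_similarity(toy_vectors["wine"], toy_vectors["bardiwac"])
wine_vs_keyboard = cosine_similarity(toy_vectors["wine"], toy_vectors["keyboard"])
```

With these toy values, "wine" is far closer to "bardiwac" than to "keyboard", which is the behaviour a useful embedding should show.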
Embedding approaches • Global co-occurrence count-based methods (e.g. LSA) • Based on statistics of how often some word co-occurs with its neighbour words in a large text corpus • Dimension reduction methods • Local context predictive methods (e.g. word2vec) • Learning to predict a word from its neighbours or vice versa • More complex and powerful deep learning models for embedding words, sentences and documents
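The counting step behind global co-occurrence methods can be sketched as follows. This assumes already-tokenized sentences and a symmetric context window; the function name `cooccurrence_counts` and the window size are illustrative choices, not details from the slides.

```python
from collections import Counter, defaultdict


def cooccurrence_counts(sentences, window=2):
    """Count how often each word appears within `window` positions of another."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts


# Tiny invented corpus, for illustration only.
example_counts = cooccurrence_counts(
    [["blood", "red", "bardiwac"], ["red", "wine"]], window=1
)
```

Count-based methods such as LSA then apply dimension reduction to the resulting (very large, sparse) matrix, whereas predictive methods learn vectors directly from local contexts.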
Why not deep learning methods? • Computationally expensive to train from scratch • Often requiring GPUs • Difficult to find the optimal hyperparameter settings • Pre-trained word embeddings may not capture domain-specific semantics • Medical information retrieval, special collection exploration, etc. • Standard benchmarks and evaluation methods often do not answer practical needs
After random projection • Each entity (words, subjects, authors, …) is embedded as a D-dimensional vector (in our case, a 256-byte vector) • Each document is also embedded as a vector in the same semantic space • A document is represented as the weighted average of the vectors of its associated entities • Cosine similarity reflects semantic similarity
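A minimal sketch of the document step, assuming entity vectors and per-entity weights are already available as dictionaries. The function name and the toy values are hypothetical, and how the weights are obtained is left open here.

```python
def document_embedding(entity_vectors, weights):
    """Weighted average of entity vectors; both dicts are keyed by entity name."""
    dim = len(next(iter(entity_vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for name, vec in entity_vectors.items():
        w = weights.get(name, 0.0)  # entities without a weight are ignored
        weight_sum += w
        for k in range(dim):
            total[k] += w * vec[k]
    if weight_sum == 0.0:
        return total
    return [x / weight_sum for x in total]


# Toy 2-dimensional example, invented for illustration.
doc_vec = document_embedding(
    {"a": [1.0, 0.0], "b": [0.0, 1.0]},
    {"a": 3.0, "b": 1.0},
)
```

Because documents and entities end up in the same space, a document vector can be compared directly with any entity vector using cosine similarity.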
What’s special in Ariadne RP • Entity embeddings are updated online while going through the corpus once • No need to store the original co-occurrence matrix • No iterations over the corpus, highly efficient implementation
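The single-pass idea can be sketched in the style of generic random indexing: each term gets a fixed pseudo-random vector, and an entity's embedding is the running sum of the vectors of the terms it co-occurs with. The seeded ±1 vectors, the document-level context, and the names `index_vector` and `embed_stream` are all assumptions made for this sketch; the slides do not describe Ariadne's internals at this level.

```python
import random
from collections import defaultdict

DIM = 16  # kept small for the example; the dimensionality is a free parameter


def index_vector(term, dim=DIM):
    """Deterministic pseudo-random +/-1 vector for a term (assumed scheme)."""
    rng = random.Random(term)  # string seeds give reproducible sequences
    return [rng.choice((-1.0, 1.0)) for _ in range(dim)]


def embed_stream(documents, dim=DIM):
    """Update entity embeddings in one pass, without storing a co-occurrence matrix."""
    embeddings = defaultdict(lambda: [0.0] * dim)
    for tokens in documents:  # each document is visited exactly once
        vectors = {t: index_vector(t, dim) for t in set(tokens)}
        for target in tokens:
            emb = embeddings[target]
            for context in tokens:
                if context != target:
                    vec = vectors[context]
                    for k in range(dim):
                        emb[k] += vec[k]
    return embeddings


# Tiny invented corpus, for illustration only.
toy_embeddings = embed_stream([["red", "wine"], ["red", "bardiwac"]])
```

In this toy corpus "wine" and "bardiwac" share the same single context ("red"), so they receive identical embeddings, which mirrors the distributional hypothesis in miniature.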
Orthogonal projection and weight adjustment • Vectors are projected onto the hyperplane orthogonal to an average language vector • Removing the stop-wordiness improves the discriminating power • Weights are calculated automatically • The more similar a vector is to the average vector, the less weight it gets • No need to remove stop words first • Crucial to get distinctive document embeddings
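The two operations can be sketched as below. The projection is standard linear algebra: subtract from each vector its component along the average vector. The weight formula, one minus the absolute cosine similarity to the average vector, is only an assumption that satisfies "more similar, less weight"; the slides do not state the formula actually used.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def project_out(v, avg):
    """Remove from v its component along avg (projection onto the orthogonal hyperplane)."""
    denom = dot(avg, avg)
    if denom == 0.0:
        return list(v)
    scale = dot(v, avg) / denom
    return [x - scale * a for x, a in zip(v, avg)]


def weight(v, avg):
    """Assumed weighting: 1 - |cos(v, avg)|, so stop-word-like vectors get weights near 0."""
    norm_v = math.sqrt(dot(v, v))
    norm_avg = math.sqrt(dot(avg, avg))
    if norm_v == 0.0 or norm_avg == 0.0:
        return 1.0
    return 1.0 - abs(dot(v, avg) / (norm_v * norm_avg))


# Toy 2-dimensional example, invented for illustration.
average_vector = [1.0, 0.0]
projected = project_out([3.0, 4.0], average_vector)
```

After projection, every vector is orthogonal to the average vector, so whatever all entities have in common no longer contributes to their cosine similarities.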
Effect of orthogonal projection and weight adjustment Rob Koopman, Shenghui Wang and Gwenn Englebienne. Fast and discriminative semantic embedding. Proceedings of The 13th International Conference on Computational Semantics (IWCS 2019). To appear.
Automatic subject prediction • Naive similarity based method • Subjects and documents are embedded in the same semantic space • A document is likely to be indexed with subjects that are most related to it (with highest cosine similarities)
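The naive similarity-based method amounts to ranking subjects by cosine similarity to the document vector. In this sketch the subject names, the vectors, and the function name `predict_subjects` are invented for illustration.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def predict_subjects(doc_vec, subject_vecs, top_n=3):
    """Return the top_n subject names, most similar to the document first."""
    ranked = sorted(
        subject_vecs,
        key=lambda name: cosine(doc_vec, subject_vecs[name]),
        reverse=True,
    )
    return ranked[:top_n]


# Hypothetical subject embeddings in a toy 2-dimensional space.
toy_subjects = {
    "Wine": [1.0, 0.0],
    "Grapes": [0.8, 0.6],
    "Keyboards": [0.0, 1.0],
}
top_two = predict_subjects([0.9, 0.1], toy_subjects, top_n=2)
```

No classifier is trained here: prediction quality depends entirely on how well the shared embedding space places documents near their subjects.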
Experiment: Predicting MeSH subjects • Metadata of one million Medline articles with abstracts • The training set contains 147,837 unique MeSH subjects, on average 16 per article • 10,000 articles for testing • Measure precision/recall of top N predictions
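Precision and recall of the top N predictions for a single article can be computed as in the sketch below. The function name and the example labels are hypothetical; in an experiment like the one described, the scores would be averaged over all test articles.

```python
def precision_recall_at_n(predicted, gold, n):
    """Precision and recall of the first n predictions against the gold subject set."""
    top = predicted[:n]
    hits = sum(1 for subject in top if subject in gold)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Invented ranking and gold labels, for illustration only.
example_precision, example_recall = precision_recall_at_n(
    ["a", "b", "c", "d"], {"a", "c", "e"}, n=2
)
```

With an average of 16 gold subjects per article, recall at small N is bounded well below 1, which is worth keeping in mind when reading precision/recall curves.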
Recall and precision @ n: Medline Rob Koopman, Shenghui Wang and Gwenn Englebienne. Fast and discriminative semantic embedding. Proceedings of The 13th International Conference on Computational Semantics (IWCS 2019). To appear.
Example https://www.ncbi.nlm.nih.gov/pubmed/14670424
Summary • Alongside deep learning efforts, we can take an alternative, practical approach to fast and discriminative semantic embedding • We need human involvement as well as more appropriate evaluation measures
Thank you Shenghui Wang (shenghui.wang@oclc.org) Rob Koopman (rob.koopman@oclc.org) Titia van der Werf (titia.vanderwerf@oclc.org) Jean Godby (godby@oclc.org)