Institute for Systems Analysis Federal Research Center «Computer Science and Control» of Russian Academy of Sciences Evolution of Russian Search Engines Ilya Tikhomirov PhD +7 (499) 135-04-63 117312, Moscow pr. 60-letiya Oktyabrya, 9 www.isa.ru
Outline • Introduction • History of Russian search engines • Yandex • Aport • Rambler • Mail.ru • Sputnik • Technologies behind Russian search engines • TextAppliance – new solution in B2B • The Future of Information Search
The Dawn of Search Engines • Archie (1990): FTP, index of directory listings • Veronica (1992), Jughead (1993): Gopher, search for filenames and titles • VLib (1993): “Virtual Library”, a list of web servers on the Internet • Primitive Web Search (1993): linear search, WWW Worm (indexed URLs and titles) • AltaVista (1995): natural language queries • Yahoo! Search (1994) • WebCrawler (1994): indexed entire pages
First Russian Search Engines • Limitations of foreign web search engines in the mid-1990s: • Poor indexing depth of the Russian segment of the Web • Limited support for Cyrillic encodings • Different word forms were not considered (Russian is an inflected language) • Birth of Russian search engines: • Aport (1996) • Rambler (1996) • Yandex (1997)
Search Engines in Russia (1999) [Chart of search engine market shares not preserved; the only surviving label is “Others (Aport)”]
History of Russian Search Engines • We will discuss the technological and scientific aspects of search engines • The most significant Russian web search engines from the 1990s to the present: • Yandex • Aport • Rambler • Mail.Ru • Sputnik
Yandex Origins www.yandex.ru • The name Yandex stands for “yet another indexer” • Started in 1993 as an indexer for local computers • The web indexer was developed in 1995–1997 • Yandex was presented to the public in 1997 • Its main feature was morphological analysis with lemmatization • Other novel features included: • Similar document search • Sorting options (time/relevance)
Morphological Analysis and Lemmatization • Example: “The mother brought her son to school.” • Lemmas with parts of speech: the (det), mother (noun), bring (verb), her (pronoun), son (noun), to (prep), school (noun) • Morphological analysis determines the morphological features of words • Lemmatization transforms words into their canonical form • Lemmatization is very important for search engines that work with inflected languages (e.g., Russian)
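A minimal sketch of dictionary-based lemmatization for the example sentence on this slide. The tiny `LEXICON` table and the `lemmatize` helper are hypothetical and purely illustrative; analyzers for inflected languages such as Russian rely on large morphological dictionaries and rules rather than a hand-written table.

```python
# Toy lexicon: surface form -> (lemma, part of speech). Illustrative only.
LEXICON = {
    "the": ("the", "det"),
    "mother": ("mother", "noun"),
    "brought": ("bring", "verb"),
    "her": ("her", "pronoun"),
    "son": ("son", "noun"),
    "to": ("to", "prep"),
    "school": ("school", "noun"),
}


def lemmatize(sentence):
    """Return (lemma, pos) pairs; unknown words map to themselves."""
    result = []
    for token in sentence.lower().split():
        word = token.strip(".,!?")
        if not word:
            continue  # skip tokens that were pure punctuation
        result.append(LEXICON.get(word, (word, "unknown")))
    return result


analysis = lemmatize("The mother brought her son to school .")
```

If both documents and queries are indexed by lemma, a query containing "bring" can match a document containing "brought".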
Yandex in the Early 2000s (1) www.yandex.ru • Yandex rapidly increased its market share due to: • Development of new services (Yandex.Mail, Yandex.News, Yandex.Goods, etc.) • This made Yandex a universal Internet portal that people could rely on for many tasks besides search • An advertising campaign • Big investment in research and development • The search engine was constantly improving
Yandex in the Early 2000s (2) www.yandex.ru • Increased the indexing database • The engine got new features: • Shallow syntactic parsing • Popular finds for user queries • Popular finds: • Many user queries are similar • The engine can determine which answers users clicked • It gathers statistics on popular clicks for groups of queries • The statistics can be used to show answers that other users considered useful
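The "popular finds" idea can be sketched as click counting per group of similar queries. This is an assumption-laden illustration, not Yandex's implementation: the `ClickStats` class, the `normalize` grouping rule (lowercasing and ignoring word order), and the example URLs are all hypothetical.

```python
from collections import Counter, defaultdict


def normalize(query):
    """Group similar queries: lowercase and ignore word order (an assumption)."""
    return " ".join(sorted(query.lower().split()))


class ClickStats:
    """Collects clicks per query group and reports the most clicked answers."""

    def __init__(self):
        self._clicks = defaultdict(Counter)

    def record_click(self, query, url):
        self._clicks[normalize(query)][url] += 1

    def popular_finds(self, query, top_n=3):
        counts = self._clicks.get(normalize(query))
        if not counts:
            return []
        return [url for url, _ in counts.most_common(top_n)]


stats = ClickStats()
stats.record_click("moscow weather", "weather.example/moscow")
stats.record_click("Weather Moscow", "weather.example/moscow")
stats.record_click("weather moscow", "news.example/forecast")
top = stats.popular_finds("MOSCOW weather", top_n=1)
```

Here the three differently written queries fall into one group, so the answer clicked twice is returned first.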
Yandex in the Early 2000s (3) www.yandex.ru • In 2001 Yandex gained the largest market share among search engines • It overtook Rambler, its competitor and previously the largest search engine
Yandex in the Mid-2000s (1) www.yandex.ru • Search engine development: • Yandex starts indexing documents in multiple formats: PDF, DOC, RTF, PPT, XLS, etc. • Rapid web crawler (updates the index every 1.5–2 hours) • New search algorithms: “Magadan”, “Nahodka” • New services: • Yandex.Maps, Yandex.Address, etc.
Yandex in the Mid-2000s (2) www.yandex.ru • Search algorithms (“Magadan”, “Nahodka”): • Geo-classification of user queries (retrieves results from specific geographical regions) • Yandex increased the number of features taken into account by the search engine • Increased speed • Abbreviation detection, transliteration, word translation, and other minor features • More languages (English, German) • Understanding of cognate words (verb “fix” ~ noun “fixing”)
Yandex in the Late 2000s www.yandex.ru • The search engine started using a new technology called “MatrixNet”
MatrixNet (1) • MatrixNet is a machine learning algorithm that ranks search engine answers • Gradient boosting on oblivious decision trees • Balanced trees with constant depth • Example: [figure not preserved]
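A minimal sketch of how an ensemble of oblivious decision trees can be evaluated. It is not MatrixNet itself and covers only scoring, not the gradient boosting training step; the `ObliviousTree` class, the feature values, the thresholds, and the `learning_rate` are hypothetical. In an oblivious tree every node on the same level uses the same (feature, threshold) split, so the tree is balanced, has constant depth, and the leaf can be found by collecting one bit per level.

```python
class ObliviousTree:
    """Decision tree in which each level shares one (feature, threshold) split."""

    def __init__(self, splits, leaf_values):
        # splits: list of (feature_index, threshold), one entry per level
        # leaf_values: 2 ** depth values, indexed by the comparison bits
        if len(leaf_values) != 2 ** len(splits):
            raise ValueError("expected 2 ** depth leaf values")
        self.splits = splits
        self.leaf_values = leaf_values

    def predict(self, features):
        index = 0
        for feature_index, threshold in self.splits:
            bit = 1 if features[feature_index] > threshold else 0
            index = (index << 1) | bit  # append this level's decision
        return self.leaf_values[index]


def ensemble_score(trees, features, learning_rate=0.1):
    """A boosted ensemble sums the scaled outputs of all its trees."""
    return sum(learning_rate * tree.predict(features) for tree in trees)


trees = [
    ObliviousTree([(0, 0.5), (1, 2.0)], [0.0, 1.0, 2.0, 3.0]),
    ObliviousTree([(1, 1.0)], [-1.0, 1.0]),
]
doc_features = [0.9, 1.5]  # hypothetical ranking features of one document
score = ensemble_score(trees, doc_features)
```

Documents would then be ordered by `score`; because each level needs only one comparison, evaluation is cheap even with many trees.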
MatrixNet (2) • Advantages: • Resistant to overfitting • Takes many criteria into consideration (>800) • Does not need many human assessors • Can use less training data than other methods (50,000 assessor evaluations per month) • Can be tuned for a particular domain (e.g., music) • Highly scalable
MatrixNet (3) • Disadvantages: • Sometimes it is difficult to adjust search ranking to common-sense rules • Unpredictable for site optimization • Search results are unstable: even popular relevant services can disappear from query results from time to time
Yandex Today www.yandex.ru • The search engine with the largest market share in Russia • The 4th largest search service in the world • Can search documents in Russian, English, German, French, Ukrainian, Belarusian, Kazakh, Tatar, and Turkish (Yandex takes morphology into account for all of these languages) • 52.5 million users per month in Russia • >150 million search queries per day • Yandex is developing rapidly: • The number of features used by MatrixNet keeps growing, which improves search quality • New services are being developed • However, Yandex is losing users in Russia to Google • In 2011 Yandex had 65% of the search market • Today it has only 51% • It lost 14 percentage points in 5 years
Aport – the First Search Engine in Russia www.aport.ru • Developed in early 1996 • Became the first national search engine • Had great success in the beginning • Used simple approaches with no natural language processing
Aport – the End www.aport.ru • After Aport became popular, its development stopped • Its simple search algorithms, which did not use powerful natural language techniques, were defeated by the constantly improving solutions of competitors (Rambler, Yandex) • Aport's market share declined rapidly in the late 1990s and early 2000s • The search engine was not completely shut down until 2011 • Some of Aport's developments and experience were used by Mail.ru
Rambler, the Start www.rambler.ru • Officially started in 1996 • Took a big share of the information search market • Was the leading search engine until 2001 • Was developing as an Internet media portal
Rambler in the Early 2000s www.rambler.ru • Development of the search engine was abandoned • The creator of the engine left the company • The engine became technologically outdated • In 2001 the company presented a new version of the engine • However, Rambler lost market share to Yandex • New services were developed: Rambler.Mail, Rambler.News, etc. • Rambler added support for the German and Bulgarian languages, but these were shut down very quickly
Rambler in the Mid-2000s and the End of the Search Engine www.rambler.ru • The Rambler company focused on providing media services • The role of the search engine itself started to decrease • The lack of research and development led to a loss of share in the information search market, since the competitors (Yandex, Google) provided better search quality • In 2011, the search engine was completely shut down • Since then, the Rambler portal has used Yandex’s services for search
Mail.Ru Search Engine (1) www.mail.ru • Mail.Ru was originally an e-mail service • In 2004 the company started developing its own search engine • However, in 2004–2006 and in 2010–2013 the Mail.Ru search service mainly used the Google search engine, and its own engine was supplementary: • The search service merged results from both engines • Mail.Ru used the Yandex engine in 2007–2009 in the same way • Only in 2013 did Mail.Ru start using entirely its own search software, “GoGo.Ru”
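The slides do not say how results from the two engines were merged. As one simple possibility, the sketch below interleaves two ranked lists and drops duplicates; the `merge_results` function and the example URLs are hypothetical.

```python
from itertools import zip_longest


def merge_results(primary, supplementary, limit=10):
    """Interleave two ranked URL lists, keeping the first occurrence of each URL."""
    merged, seen = [], set()
    for pair in zip_longest(primary, supplementary):
        for url in pair:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged[:limit]


external = ["a.example", "b.example", "c.example"]  # results of the main engine
own = ["b.example", "d.example"]                    # results of the own engine
merged = merge_results(external, own)
```

Round-robin interleaving keeps the top answers of both engines near the top; a production system would more likely re-rank the combined list.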
Mail.Ru Search Engine (2) www.mail.ru • Mail.Ru engine features: • Multiple languages: English, Russian, Kazakh • Search results can be tuned directly by webmasters of big Internet services • Webmasters can insert links to their resources into the search result list for certain user queries • This means that big Internet services do not need “search engine optimization” (SEO) for Mail.Ru
Sputnik www.sputnik.ru • Started in 2014 and is still considered to be in a beta stage • Sputnik is owned by the Russian government • Its main distinguishing feature is considered to be socially oriented services • Has a small database of sites • The quality of search is mediocre • Fewer than 20,000 users today, which is far less than 0.1% of the Russian search market, so it is not a success
TextAppliance – Solution for B2B and B2G www.textapp.ru • Started in 2015 • Based on Exactus technologies • TextAppliance is a system for intelligent search and analysis of large-scale text collections. Functions: • Semantic and explorative search • Search for topically similar documents • Semantic plagiarism detection • Formation, comparison, and topic analysis of user collections • Automatic extraction of keywords • Automatic generation of document summaries • Topic analysis for document collections
TextAppliance Semantic Search Method • Perform deep natural language processing of the user query: • POS tagging • Syntactic parsing • Semantic role labeling • Semantic relation extraction • Named entity recognition • Compare the linguistic structure of the query with the structures of documents in a large indexed text collection
Relational-Situational Model of Text Example: “Oxygen arrives at tissues from lungs through blood. There it is spent on oxidation of various substances.” • Syntactic relations • Semantic roles and values of syntaxemes • Semantic relations between syntaxemes • Coreference relations • Other information extracted from texts: • names of persons • names of companies • geographical objects • etc.
Tendencies of Russian Search Engines • From search engine to media portal (Rambler, Mail.Ru, Yandex, Sputnik) • One site, multiple services: web search, multimedia search, news search, navigation/maps, commercials, weather, games, social networks, e-mail, etc. • Specialized search engines (Yandex.Auto, TextAppliance) • Metasearch (Exactus, AskNet, Nigma, FindBook.ru) with regrouping and re-ranking of search results (research projects)
Technologies behind Russian Search Engines (1) • Crawling: • Global and vertical search (news, auto, e-commerce, travel, realty, social networks, books and sci-tech documents) • Multiple crawlers: different crawlers for different tasks • Natural language processing: • Stemming (like the Porter stemmer, 1980) • Lemmatization (Yandex: MyStem, 1998) • Word sense disambiguation (Yandex, 2005, 2009) • Semantic analysis (TextAppliance, 2015)
Technologies behind Russian Search Engines (2) • Request processing: • Misprint correction • Query completion: guessing the request from its incomplete part • Translation into other languages
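A minimal sketch of two of these request-processing steps: misprint correction by edit distance and completion of an incomplete request from a query log. The slides do not describe the methods actually used by the engines; the Levenshtein-based `correct`, the prefix-based `complete`, and the `VOCABULARY` and `QUERY_LOG` data are assumptions for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with two rows of dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def correct(word, vocabulary, max_distance=2):
    """Return the closest vocabulary word, or the input if none is close enough."""
    best = min(vocabulary, key=lambda candidate: edit_distance(word, candidate))
    return best if edit_distance(word, best) <= max_distance else word


def complete(prefix, query_log, limit=3):
    """Suggest the most frequent logged queries that start with the prefix."""
    matches = [(count, q) for q, count in query_log.items() if q.startswith(prefix)]
    ranked = sorted(matches, key=lambda m: (-m[0], m[1]))
    return [q for _, q in ranked[:limit]]


VOCABULARY = ["search", "engine", "yandex", "rambler"]  # hypothetical
QUERY_LOG = {"yandex maps": 120, "yandex mail": 300, "rambler news": 80}

fixed = correct("serach", VOCABULARY)
suggestions = complete("yandex", QUERY_LOG)
```

Plain Levenshtein distance counts a swapped letter pair as two edits, which is why `max_distance` defaults to 2 here.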
Technologies behind Russian Search Engines (3) • Ranking: • TF-IDF evolution (BM formulas, early 2000s) • Hyperlink analysis (Topical PageRank, similar to Google PageRank, uses Yandex.Directory as an alternative to DMOZ, early 2000s) • Word positions, tagging, and collocation analysis (2000s) • Request classification (2000s) • Relevance feedback (2000s) • Geotargeting (late 2000s) • User behavior analysis (late 2000s) • Machine learning (MatrixNet by Yandex, 2009)
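The slides mention BM formulas without giving one. As an illustration, the sketch below scores documents with the widely used Okapi BM25 formula; the choice of BM25, the non-negative IDF variant, the conventional parameters `k1=1.2` and `b=0.75`, and the toy documents are all assumptions, not the formula of any particular Russian engine.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25."""
    n_docs = len(documents)
    avg_len = sum(len(doc) for doc in documents) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        term_freq = Counter(doc)
        score = 0.0
        for term in query:
            tf = term_freq[term]
            if tf == 0:
                continue
            df = doc_freq[term]
            # Rare terms get a higher weight; "+ 1.0" keeps the weight positive.
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
            # Term frequency saturates and is normalized by document length.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "russian search engine history".split(),
    "search engine ranking with machine learning".split(),
    "weather forecast for moscow".split(),
]
scores = bm25_scores("russian search".split(), docs)
```

Compared with plain TF-IDF, BM25 limits the effect of repeated terms and penalizes long documents, which is the kind of refinement the "TF-IDF evolution" bullet refers to.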
Information Search Services in Russia Today (1) • The Russian search market is mostly divided between two search engines: Yandex and Google • Rapid research and development in information retrieval, natural language processing, and computer science in general allowed Yandex to overtake its competitors (Aport, Rambler, etc.) • The new search engine Sputnik has not achieved success • New search engines are also under development (TextAppliance)
Information Search Services in Russia Today (2) • It is very difficult to enter the information search market today • A new solution has to offer advanced technologies and novel features to attract users • Big investment in infrastructure with a long payback period is also needed • Therefore, companies that have new technologies in this area prefer to create solutions not for end users but for other companies or the government (B2B or B2G solutions)
The Future of Information Search (1) • The significance of semantic technologies will increase: • Semantic role labeling, semantic relation extraction, concept mapping • Search engines will more deeply ‘understand’ user queries and the texts of indexed documents • Search engines will increase their question-answering abilities: • Understand more complex user questions • Synthesize aggregated answers
The Future of Information Search (2) • Multi-language search with more languages: • Use a query given in one language to search across documents in any language • Translate answers into the user’s native language • Machine learning techniques will be crucial for ranking algorithms: • Unsupervised learning • Distant semi-supervised learning • More functions, more intelligence, more services…
Ilya Tikhomirov Institute for Systems Analysis Federal Research Center “Computer Science and Control” of Russian Academy of Sciences 117312, Moscow, pr. 60-letiya Oktyabrya, 9 Tel/fax: +7 499 1350463 tih@isa.ru