The Infocious Web Search Engine: Improving Web Searching Through Linguistic Analysis

Alexandros Ntoulas (1,2), Gerald Chao (1), Junghoo Cho (2). (1) Infocious Inc., {ntoulas, gerald}@infocious.com; (2) University of California, Los Angeles, {ntoulas, cho}@cs.ucla.edu.

Presentation Transcript


  1. The Infocious Web Search Engine: Improving Web Searching Through Linguistic Analysis • Alexandros Ntoulas (1,2), Gerald Chao (1), Junghoo Cho (2) • (1) Infocious Inc., {ntoulas, gerald}@infocious.com • (2) University of California, Los Angeles, {ntoulas, cho}@cs.ucla.edu

  2. Motivation • Current Web search engines identify relevant pages based on keyword matching • Example query: jaguar • Top result: Jaguar Cars: Official worldwide web site of Jaguar Cars. www.jaguar.com/ (WWW 2005, Chiba, Japan)

  3. Motivation • Is keyword matching enough? • Natural languages are inherently ambiguous • Example: jaguar • The car brand? • Apple Mac OS X 10.2? • The animal? • Chemical software? • …

  4. The Infocious Web Search Engine • Uses Language Analysis techniques to: • Resolve ambiguities inside Web pages • Rank the Web pages based on the coherence (quality) of the text • Help users organize the results in intuitive ways through categorization • Provide suggestions for query refinement

  5. What is different about Infocious? • Search engines today do not apply Language Analysis to the extent that Infocious does • It is not simply a matter of applying existing algorithms: operating at Web scale requires optimizations • Some features are made possible only through Language Analysis • Infocious makes Language Analysis features intuitive (yet powerful) for the user

  6. Architecture

  7. Architecture • Crawler • Follows links to discover Web pages • Refreshes changed pages using sampling [VLDB'02] • Can download pages from the Hidden Web [JCDL'05]

  8. Architecture • Linguistic Processing • Resolves language ambiguities [COLING'02] • Annotates Web pages • Extracts concepts • Extracts named entities • Operates at crawl speed

  9. Linguistic Processing: Disambiguation • Part-of-speech (POS) tagging • Example: house plants • Done probabilistically: given a sentence S and the candidate tag sequences T, find T_best(S) = arg max_T P(T | S) • "... most house plants are hybrids of plant species ..." is tagged most/Adj house/Noun plants/Noun are/Verb hybrids/Noun of/Prep plant/Noun species/Noun • "... garden built to house our most valuable plants ..." is tagged garden/Noun built/VerbD to/Inf house/Verb our/PronP most/Adv valuable/Adj plants/Noun
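
The arg max on slide 9 can be illustrated with a small Viterbi search over a toy hidden Markov model. This is only a sketch of the general technique, not the tagger Infocious uses: the probabilities, the `UNSEEN` fallback, and the function name `best_tags` are all invented for illustration. The model scores the joint probability P(T, S), which has the same maximizer as P(T | S) for a fixed sentence S.

```python
from typing import Dict, List, Tuple

# Toy model: every number below is invented for illustration only.
EMIT: Dict[str, Dict[str, float]] = {  # P(word | tag)
    "most": {"Adj": 1.0},
    "house": {"Noun": 0.7, "Verb": 0.3},
    "plants": {"Noun": 0.8, "Verb": 0.2},
    "to": {"Inf": 1.0},
    "our": {"PronP": 1.0},
}
TRANS: Dict[Tuple[str, str], float] = {  # P(tag | previous tag)
    ("START", "Adj"): 0.5, ("START", "Inf"): 0.5,
    ("Adj", "Noun"): 0.9, ("Adj", "Verb"): 0.1,
    ("Noun", "Noun"): 0.4, ("Noun", "Verb"): 0.4,
    ("Inf", "Verb"): 0.95, ("Inf", "Noun"): 0.05,
    ("Verb", "PronP"): 0.5, ("Verb", "Noun"): 0.3,
    ("PronP", "Noun"): 0.9,
}
UNSEEN = 0.01  # hypothetical fallback for transitions not listed above


def best_tags(words: List[str]) -> List[str]:
    """Viterbi search for the tag sequence with the highest toy HMM score."""
    # best[tag] = (score of the best path ending in tag, that path)
    best: Dict[str, Tuple[float, List[str]]] = {"START": (1.0, [])}
    for word in words:
        step: Dict[str, Tuple[float, List[str]]] = {}
        for tag, p_emit in EMIT[word].items():
            score, path = max(
                (prev_score * TRANS.get((prev, tag), UNSEEN) * p_emit, prev_path)
                for prev, (prev_score, prev_path) in best.items()
            )
            step[tag] = (score, path + [tag])
        best = step
    return max(best.values())[1]


print(best_tags(["most", "house", "plants"]))
print(best_tags(["to", "house", "our", "plants"]))
```

With these toy numbers, "house" comes out as a noun after "most" and as a verb after "to", which is the distinction the slide draws.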

  10. Linguistic Processing: Disambiguation • POS information stored inside the index • User can manually specify POS at query time (or click on examples) • Query: N:house N:plants • Result: GreenPatio.Com – Tips for buying house plants. Why keep natural indoor plants. ... Tips for buying house plants. Care for indoor plants. ... www.greenpatio.com/tips.shtml • Result: Low Light Plants for the House. Is a common name for plants in the species Dieffenbachia. ... As with most house plants … www.plantsgalore.com/articles/houseplants/houseplants-low-light-plantfacts.htm

  11. Linguistic Processing: Disambiguation • POS information stored inside the index • User can manually specify POS at query time (or click on examples) • Query: V:house N:plants • Result: Over Wintering Bonsai … One method is to build a cold frame to house your plants in the winter. ... www.evergreengardenworks.com/overwint.htm • Result: Keeping Your Sunroom Cozy … And if you want to house a hot tub or plants, think about enclosing the … doityourself.com/sunroom/sunroomcozy.htm

  12. Linguistic Processing: Disambiguation • Word-sense disambiguation • Previous example: jaguar • Approach through Web page categorization • Use the categories of DMOZ (~600,000) • Given a set of categories C and a page d, find the category c ∈ C that maximizes P(c | d) • In Infocious a page may belong to multiple categories
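
A minimal sketch of choosing categories by maximizing P(c | d) follows, using a multinomial Naive Bayes model with add-one smoothing. The training documents, the `Science/Biology` label, and the `keep_ratio` rule for assigning several categories to one page are assumptions made for illustration; the slides do not say how Infocious decides that a page belongs to multiple categories.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

# Hypothetical toy training data: (category, words of one page).
TOY_DOCS: List[Tuple[str, List[str]]] = [
    ("Recreation/Autos", ["jaguar", "car", "engine", "dealer"]),
    ("Computers", ["jaguar", "mac", "os", "apple"]),
    ("Science/Biology", ["jaguar", "cat", "jungle", "prey"]),
]


def train(docs: List[Tuple[str, List[str]]]):
    """Collect per-category page counts, word counts, and the vocabulary."""
    doc_counts: Counter = Counter()
    word_counts: Dict[str, Counter] = defaultdict(Counter)
    for category, words in docs:
        doc_counts[category] += 1
        word_counts[category].update(words)
    vocab = {w for counts in word_counts.values() for w in counts}
    return doc_counts, word_counts, vocab


def posteriors(words: List[str], model) -> Dict[str, float]:
    """Return P(c | d) for every category c under add-one smoothing."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_scores: Dict[str, float] = {}
    for c in doc_counts:
        total_words = sum(word_counts[c].values())
        score = math.log(doc_counts[c] / total_docs)
        for w in words:
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[c][w] + 1) / (total_words + len(vocab)))
        log_scores[c] = score
    top = max(log_scores.values())  # subtract the maximum for numerical stability
    unnorm = {c: math.exp(s - top) for c, s in log_scores.items()}
    norm = sum(unnorm.values())
    return {c: v / norm for c, v in unnorm.items()}


def categorize(words: List[str], model, keep_ratio: float = 0.5) -> List[str]:
    """Keep every category whose posterior is within keep_ratio of the best one."""
    post = posteriors(words, model)
    best = max(post.values())
    kept = [c for c, p in post.items() if p >= keep_ratio * best]
    return sorted(kept, key=lambda c: -post[c])


MODEL = train(TOY_DOCS)
print(categorize(["jaguar", "mac", "os"], MODEL))
print(categorize(["jaguar"], MODEL))
```

In this toy setup a page mentioning only "jaguar" stays in all three categories, while adding "mac" and "os" narrows it to Computers.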

  13. Categorization • The category of a result is highlighted onMouseOver() • Allow users to restrict search within a category: jaguar cat:Computers • Can also be done by clicking on a category • Result: Jaguar Cars. Official worldwide site of jaguar cars. www.jaguar.com/ • Result: Apple Mac OS X. The Apple Mac OS Product page. www.apple.com/macosx/ • Category labels shown: Computers, Recreation/Autos
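
The query syntax shown on slides 10, 11, and 13 (`N:house`, `V:house`, `cat:Computers`) could be parsed along the following lines. This is a sketch under assumptions: only the `N`, `V`, and `cat` prefixes appear in the slides, and the `ParsedQuery` structure and `parse_query` function are hypothetical names, not part of Infocious.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Only the prefixes that appear on the slides; any others are unknown.
POS_PREFIXES = {"N": "Noun", "V": "Verb"}


@dataclass
class ParsedQuery:
    # Each term is (word, required POS tag or None when unspecified).
    terms: List[Tuple[str, Optional[str]]] = field(default_factory=list)
    category: Optional[str] = None


def parse_query(query: str) -> ParsedQuery:
    """Split a query into POS-constrained terms and an optional category filter."""
    parsed = ParsedQuery()
    for token in query.split():
        prefix, sep, rest = token.partition(":")
        if sep and rest and prefix == "cat":
            parsed.category = rest
        elif sep and rest and prefix in POS_PREFIXES:
            parsed.terms.append((rest.lower(), POS_PREFIXES[prefix]))
        else:
            parsed.terms.append((token.lower(), None))
    return parsed


print(parse_query("V:house N:plants"))
print(parse_query("jaguar cat:Computers"))
```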

  14. Linguistic Processing: Concept Extraction • More accurate phrase identification • Identify concepts through a set of rules (pre-specified or automatically learned) • Example rule: VerbPhrase-PrepPhrase-NounPhrase • lightly tossed with salad dressing • tossed with oil and vinegar dressing • tossed immediately with blue-cheese dressing • All three reduce to the concept: tossed with dressing • Example sentence: "In the profession of cooking oil is an important ingredient"
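
The reduction on slide 14 can be sketched as a rule over POS-tagged tokens: keep the verb, the preposition, and the last noun of the noun phrase, and drop adverbs and other modifiers. The tags below are hand-assigned for the example, and the rule in `reduce_to_concept` (a hypothetical name) is one simple reading of the VerbPhrase-PrepPhrase-NounPhrase pattern, not the rule set Infocious actually uses.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # (word, POS tag), tags assigned by hand here


def reduce_to_concept(tokens: List[Token]) -> Optional[str]:
    """Return 'verb prep head-noun' if the tokens match Verb ... Prep ... Noun."""
    verb = prep = head = None
    for word, tag in tokens:
        if tag == "Verb" and verb is None:
            verb = word
        elif tag == "Prep" and verb is not None and prep is None:
            prep = word
        elif tag == "Noun" and prep is not None:
            head = word  # overwritten on purpose: the last noun is taken as the head
    if verb and prep and head:
        return f"{verb} {prep} {head}"
    return None


PHRASES: List[List[Token]] = [
    [("lightly", "Adv"), ("tossed", "Verb"), ("with", "Prep"),
     ("salad", "Noun"), ("dressing", "Noun")],
    [("tossed", "Verb"), ("with", "Prep"), ("oil", "Noun"),
     ("and", "Conj"), ("vinegar", "Noun"), ("dressing", "Noun")],
    [("tossed", "Verb"), ("immediately", "Adv"), ("with", "Prep"),
     ("blue-cheese", "Adj"), ("dressing", "Noun")],
]

for phrase in PHRASES:
    print(reduce_to_concept(phrase))
```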

  15. Answering a query • Default is AND-semantics • Query disambiguation (e.g. in the query "train a pet", Infocious knows that "train" has to be a verb) • Ranking takes into account a variety of factors • Presence of keywords, proximity • Title, URL, formatting, font size, coloring, etc. • Popularity of a page measured by in/out links • TextQuality
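
AND-semantics over an index that stores POS information (slides 10 and 15) can be sketched as an intersection of posting sets keyed by (word, tag). The index layout, the document ids, and the function names are assumptions for illustration; the slides do not describe how the Infocious index is organized, and ranking is left out entirely.

```python
from typing import Dict, List, Optional, Set, Tuple

Index = Dict[Tuple[str, str], Set[int]]  # (word, POS tag) -> ids of matching pages

# Hypothetical tagged pages.
TOY_PAGES: Dict[int, List[Tuple[str, str]]] = {
    1: [("house", "Noun"), ("plants", "Noun")],
    2: [("house", "Verb"), ("plants", "Noun")],
    3: [("house", "Noun")],
}


def build_index(pages: Dict[int, List[Tuple[str, str]]]) -> Index:
    """Map every (word, tag) pair to the set of pages containing it."""
    index: Index = {}
    for page_id, tokens in pages.items():
        for word, tag in tokens:
            index.setdefault((word, tag), set()).add(page_id)
    return index


def postings(index: Index, word: str, tag: Optional[str]) -> Set[int]:
    """Pages containing word with the given tag, or with any tag if tag is None."""
    if tag is not None:
        return set(index.get((word, tag), set()))
    result: Set[int] = set()
    for (w, _), ids in index.items():
        if w == word:
            result |= ids
    return result


def search_and(index: Index, terms: List[Tuple[str, Optional[str]]]) -> Set[int]:
    """AND-semantics: a page must match every query term."""
    if not terms:
        return set()
    result = postings(index, *terms[0])
    for word, tag in terms[1:]:
        result &= postings(index, word, tag)
    return result


INDEX = build_index(TOY_PAGES)
print(search_and(INDEX, [("house", "Verb"), ("plants", "Noun")]))
print(search_and(INDEX, [("house", None), ("plants", None)]))
```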

  16. Architecture • TextQuality • Summarize probabilities from Linguistic Processing into one metric • Promote coherent text • Demote incoherent text
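
The slides do not say how the probabilities are summarized. One plausible way to fold per-token probabilities into a single score, shown purely as an assumption, is a geometric mean, so that a page's score does not shrink just because it is longer. The function name `text_quality` and the sample probabilities are hypothetical.

```python
import math
from typing import List


def text_quality(token_probs: List[float]) -> float:
    """Geometric mean of per-token probabilities; 0.0 for empty or impossible text."""
    if not token_probs or min(token_probs) <= 0.0:
        return 0.0
    mean_log = sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_log)


# Invented numbers: a coherent sentence tends to get confident tag assignments,
# while a keyword-stuffed one tends to get low-probability assignments.
coherent = [0.9, 0.8, 0.9]
keyword_stuffed = [0.2, 0.1, 0.3]
print(text_quality(coherent) > text_quality(keyword_stuffed))
```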

  17. TextQuality (disabled) • Promotes well-written pages (preferable from the user perspective) • Results with TextQuality DISABLED: • Britney Spears Pictures – britney spears pictures … picture of britney spears, hot pictures of britney spears … britney-spears-pictures.hotyoungstars.com/nude/ • Hot Britney Spears Pics - hot britney spears pics, ... britney spears, new hot pics of britney spears, ... hot-britney-spears-pics.hotyoungstars.com/nude/ • Britney Spears Photos – britney spears photos … spears, britney spears nude photos, nude photos of … britney-spears-photos.hotyoungstars.com/nude/

  18. TextQuality (enabled) • Promotes well-written pages (preferable from the user perspective) • Results with TextQuality ENABLED: • Is Britney Spears over the edge? Is Britney Spears over the edge? … Britney Spears is a singer … azwestern.edu/modern_lang/esl/cjones/mag/spring2004/britney.htm • IMPERSONATORS – BRITNEY SPEARS. Is Proud to Present! Contact: Gary Shortall Back … www.impersonators.com/brittany/brit.html • Britney Spears’ Coke Habit. Britney Spears’ Coke Habit Destroys Her … www.emptyv.org/britney_spears.htm

  19. Other Language Analysis-Enhanced Features • Key phrases: present a list of the salient concepts within the results • Related topics: concepts related to the present query • Hone your search: suggestions of more specific queries • Spell checking • Personalization: e.g. "I like Sports but not Politics"

  20. Evaluation of Categorization • Naïve Bayes (NB) classifiers are used for illustration: Language Analysis improves accuracy • Infocious actually employs an improved classification technique (76% accuracy) • We used four different flavors of NB on 100,000 Web pages: • C1: Words • C2: Words + POS tags • C3: Words + extracted concepts • C4: Words + POS + extracted concepts
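
The four feature sets C1 to C4 could be assembled as below before being handed to a Naive Bayes classifier. The slides do not specify the feature encoding, so the `word/tag` and `concept:` string formats and the `features` function are assumptions, as is the choice to add POS-tagged features alongside the plain words rather than replace them.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, POS tag)


def features(tokens: List[Token], concepts: List[str], flavor: str) -> List[str]:
    """Build the feature list for one page under flavor C1, C2, C3, or C4."""
    if flavor not in ("C1", "C2", "C3", "C4"):
        raise ValueError(f"unknown flavor: {flavor}")
    feats = [word for word, _ in tokens]  # C1: words only
    if flavor in ("C2", "C4"):  # add POS-tagged words
        feats += [f"{word}/{tag}" for word, tag in tokens]
    if flavor in ("C3", "C4"):  # add extracted concepts
        feats += [f"concept:{c}" for c in concepts]
    return feats


page_tokens = [("house", "Verb"), ("plants", "Noun")]
page_concepts = ["house plants"]
for flavor in ("C1", "C2", "C3", "C4"):
    print(flavor, features(page_tokens, page_concepts, flavor))
```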

  21. Evaluation of Categorization • Result: 3% accuracy increase (8% error reduction) • C1: Words only • C2: Words + POS tags • C3: Words + extracted concepts • C4: Words + POS + extracted concepts
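
The two figures on slide 21 are linked by simple arithmetic if the 3% is read as an absolute gain in accuracy and the 8% as a relative reduction in error rate. That reading is an assumption, since the chart is not part of the transcript. Under it, the relative reduction equals the accuracy gain divided by the baseline error, which would put the baseline accuracy at roughly 62.5%; that number is implied by the arithmetic, not reported in the slides.

```python
def relative_error_reduction(acc_before: float, acc_after: float) -> float:
    """Fraction of the original error removed by the accuracy gain."""
    error_before = 1.0 - acc_before
    if error_before <= 0.0:
        raise ValueError("baseline accuracy must be below 1.0")
    return (acc_after - acc_before) / error_before


def implied_baseline_accuracy(abs_gain: float, rel_reduction: float) -> float:
    """Baseline accuracy consistent with an absolute gain and a relative error reduction."""
    return 1.0 - abs_gain / rel_reduction


baseline = implied_baseline_accuracy(abs_gain=0.03, rel_reduction=0.08)
print(round(baseline, 3))
print(round(relative_error_reduction(baseline, baseline + 0.03), 3))
```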

  22. User Interface

  23. Conclusion • Infocious: uses language analysis to improve Web search • Resolves language ambiguities • Incorporates text coherence in the ranking • Provides query suggestions and refinements • Organizes information intuitively through categorization

  24. Related Work • Web search engines: Google, Yahoo!, MSNSearch, Ask/Teoma, Altavista, Looksmart, Vivisimo, … • Enterprise search: Autonomy, Inquira, Inxight, iPhrase, … • Answer engines: START@MIT, BrainBoost, …

  25. Ongoing work • Increase index size (currently ~1 billion pages) through surface and hidden Web crawls • Apply our Language Analysis algorithms to additional languages • Leverage our Language-annotated repository for additional features (e.g. summarization, machine translation, …) • Investigate how to use Language Analysis to improve relevance in advertisements

  26. Thank you! You can check out our search engine at: www.infocious.com
