The Infocious Web Search Engine: Improving Web Searching Through Linguistic Analysis
Alexandros Ntoulas (1,2), Gerald Chao (1), Junghoo Cho (2)
(1) Infocious Inc. {ntoulas, gerald}@infocious.com
(2) University of California Los Angeles {ntoulas, cho}@cs.ucla.edu
Motivation
• Current Web search engines identify relevant pages based on keyword matching
• Example: jaguar
  Jaguar Cars
  Official worldwide web site of Jaguar Cars.
  www.jaguar.com/
WWW 2005 Chiba Japan
Motivation
• Is keyword matching enough?
• Natural languages are inherently ambiguous
• Example: jaguar
  • The car brand?
  • Apple Mac OS X 10.2?
  • The animal?
  • Chemical software
  • …
The Infocious Web Search Engine
• Uses Language Analysis techniques to:
  • Resolve ambiguities inside Web pages
  • Rank the Web pages based on the coherence (quality) of the text
  • Help users organize the results in intuitive ways through categorization
  • Provide suggestions for query refinement
What is different about Infocious?
• Search engines today do not apply Language Analysis to the level Infocious does
• It is not simply a matter of applying existing algorithms: optimizations are needed for Web scale
• Features made possible only through language analysis
• Makes Language Analysis features intuitive (yet powerful) for the user
Architecture
Architecture
• Crawler
  • Follows links to discover Web pages
  • Refreshes changed pages using sampling [VLDB’02]
  • Can download pages from the Hidden Web [JCDL’05]
Architecture
• Linguistic Processing
  • Resolves language ambiguities [COLING’02]
  • Annotates Web pages
  • Extracts concepts
  • Extracts named entities
  • Operates at crawl speed
Linguistic Processing: Disambiguation
• Part-of-speech (POS) tagging
• Example: house plants
• Done probabilistically: given a sentence S and a set of tags T, find T_best(S) = arg max_T P(T | S)

... most house plants are hybrids of plant species ...
    most/Adj house/Noun plants/Noun are/Verb hybrids/Noun of/Prep plant/Noun species/Noun

... garden built to house our most valuable plants ...
    garden/Noun built/VerbD to/Inf house/Verb our/PronP most/Adv valuable/Adj plants/Noun
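The arg max over tag sequences on this slide can be sketched with a small hidden Markov model decoded by the Viterbi algorithm. This is an assumption made for illustration: the slides do not say which model Infocious uses, and the tag set, the probability tables and the `viterbi` function below are all invented toy values.

```python
import math

# Toy model: every probability here is invented for illustration only.
TAGS = ["Adj", "Noun", "Verb"]
START = {"Adj": 0.4, "Noun": 0.4, "Verb": 0.2}
TRANS = {
    "Adj":  {"Adj": 0.1, "Noun": 0.8, "Verb": 0.1},
    "Noun": {"Adj": 0.1, "Noun": 0.4, "Verb": 0.5},
    "Verb": {"Adj": 0.3, "Noun": 0.6, "Verb": 0.1},
}
EMIT = {
    "Adj":  {"most": 0.9},
    "Noun": {"house": 0.5, "plants": 0.5},
    "Verb": {"house": 0.3, "plants": 0.1, "grow": 0.6},
}
UNSEEN = 1e-6  # small probability for word/tag pairs missing from EMIT


def log_emit(tag, word):
    return math.log(EMIT[tag].get(word, UNSEEN))


def viterbi(words):
    """Return the tag sequence T maximizing P(T, S), hence also P(T | S)."""
    # scores[i][t]: log probability of the best path ending in tag t at word i
    scores = [{t: math.log(START[t]) + log_emit(t, words[0]) for t in TAGS}]
    back = []  # back[i][t]: best previous tag for tag t at word i + 1
    for word in words[1:]:
        prev = scores[-1]
        row, ptr = {}, {}
        for t in TAGS:
            best_prev = max(TAGS, key=lambda p: prev[p] + math.log(TRANS[p][t]))
            row[t] = prev[best_prev] + math.log(TRANS[best_prev][t]) + log_emit(t, word)
            ptr[t] = best_prev
        scores.append(row)
        back.append(ptr)
    # Follow the back-pointers from the best final tag.
    tag = max(TAGS, key=lambda t: scores[-1][t])
    path = [tag]
    for ptr in reversed(back):
        tag = ptr[tag]
        path.append(tag)
    return list(reversed(path))


print(viterbi(["most", "house", "plants"]))  # ['Adj', 'Noun', 'Noun']
```

Working in log space avoids numeric underflow on long sentences, and the dynamic program keeps the cost linear in sentence length, which matters for a tagger that has to keep up with a crawler.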
Linguistic Processing: Disambiguation
• POS information stored inside the index
• User can manually specify POS at query time (or click on examples)

Query: N:house N:plants

GreenPatio.Com – Tips for buying house plants.
Why keep natural indoor plants.... Tips for buying house plants. Care for indoor plants....
www.greenpatio.com/tips.shtml

Low Light Plants for the House
Is a common name for plants in the species Dieffenbachia.... As with most house plants …
www.plantsgalore.com/articles/houseplants/houseplants-low-light-plantfacts.htm
Linguistic Processing: Disambiguation
• POS information stored inside the index
• User can manually specify POS at query time (or click on examples)

Query: V:house N:plants

Over Wintering Bonsai
…One method is to build a cold frame to house your plants in the winter. ...
www.evergreengardenworks.com/overwint.htm

Keeping Your Sunroom Cozy
… And if you want to house a hot tub or plants, think about enclosing the …
doityourself.com/sunroom/sunroomcozy.htm
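The two queries above return different pages because the index keeps the POS tag next to each term. A minimal sketch of that idea follows. The index layout, the single-letter tags other than N and V, and the function names `build_index` and `search` are hypothetical; the slides do not describe the actual Infocious index format.

```python
from collections import defaultdict


def build_index(tagged_docs):
    """tagged_docs maps doc_id -> list of (word, tag) pairs, e.g. ('house', 'V')."""
    index = defaultdict(set)
    for doc_id, tokens in tagged_docs.items():
        for word, tag in tokens:
            index[(tag, word.lower())].add(doc_id)   # POS-qualified posting
            index[(None, word.lower())].add(doc_id)  # plain keyword posting
    return index


def search(index, query):
    """AND-semantics over all terms; 'N:house' restricts a term to one POS tag."""
    result = None
    for term in query.split():
        tag, sep, word = term.partition(":")
        key = (tag, word.lower()) if sep else (None, term.lower())
        postings = index.get(key, set())
        result = set(postings) if result is None else result & postings
    return result or set()


docs = {
    "greenpatio": [("tips", "N"), ("for", "P"), ("buying", "V"),
                   ("house", "N"), ("plants", "N")],
    "bonsai": [("build", "V"), ("a", "D"), ("cold", "A"), ("frame", "N"),
               ("to", "I"), ("house", "V"), ("your", "PR"), ("plants", "N")],
}
index = build_index(docs)
print(search(index, "N:house N:plants"))  # {'greenpatio'}
print(search(index, "V:house N:plants"))  # {'bonsai'}
```

Storing an untagged posting alongside the tagged one keeps ordinary keyword queries such as `house plants` working, at the cost of a larger index.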
Linguistic Processing: Disambiguation
• Word-sense disambiguation
• Previous example: jaguar
• Approach through Web page categorization
• Use the categories of DMOZ (~600,000)
• Given a set of categories C and a page d, find arg max_{c ∈ C} P(c | d)
• In Infocious a page may belong to multiple categories
Categorization
• The category of a result is highlighted onMouseOver()
• Allow users to restrict search within a category: jaguar cat:Computers
• Can also be done by clicking on a category

Jaguar Cars
Official worldwide site of jaguar cars
www.jaguar.com/

Apple Mac OS X
The Apple Mac OS Product page
www.apple.com/macosx/

Category labels shown: Computers; Recreation/Autos; Computers (Apple Mac OS X)
Linguistic Processing: Concept Extraction
• More accurate phrase identification:
• Identify concepts through a set of rules (pre-specified or automatically learned)
• Example: VerbPhrase-PrepPhrase-NounPhrase
  • lightly tossed with salad dressing
  • tossed with oil and vinegar dressing
  • tossed immediately with blue-cheese dressing
• Reduced to concept: tossed with dressing

Example sentence: In the profession of cooking oil is an important ingredient
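The reduction of the three "tossed with ... dressing" phrases to one concept can be sketched with a single hand-written rule. This sketch assumes the phrase is already POS-tagged and that the concept keeps the first verb, the first preposition after it, and the last noun after that. The rule and the function name `reduce_to_concept` are hypothetical, not the rule set Infocious uses.

```python
def reduce_to_concept(tagged_phrase):
    """Keep the head verb, the preposition, and the head (last) noun.

    tagged_phrase is a list of (word, tag) pairs; tags 'V', 'P' and 'N' are
    matched and all other tags are skipped. Returns None when the phrase
    does not fit the Verb-Prep-Noun pattern.
    """
    verb = prep = noun = None
    for word, tag in tagged_phrase:
        if tag == "V" and verb is None:
            verb = word
        elif tag == "P" and verb is not None and prep is None:
            prep = word
        elif tag == "N" and prep is not None:
            noun = word  # overwritten on purpose: the last noun is taken as head
    if None in (verb, prep, noun):
        return None
    return f"{verb} {prep} {noun}"


phrases = [
    [("lightly", "Adv"), ("tossed", "V"), ("with", "P"),
     ("salad", "N"), ("dressing", "N")],
    [("tossed", "V"), ("with", "P"), ("oil", "N"), ("and", "C"),
     ("vinegar", "N"), ("dressing", "N")],
    [("tossed", "V"), ("immediately", "Adv"), ("with", "P"),
     ("blue-cheese", "N"), ("dressing", "N")],
]
for phrase in phrases:
    print(reduce_to_concept(phrase))  # tossed with dressing
```

Treating the last noun as the head is a simplification that fits English noun compounds such as "salad dressing"; a learned rule set would replace this one pattern.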
Answering a query
• Default is AND-semantics
• Query disambiguation (e.g. in the query train a pet, Infocious knows train has to be a verb)
• Ranking takes into account a variety of factors
  • Presence of keywords, proximity
  • Title, URL, formatting, font size, coloring etc.
  • Popularity of a page measured by in/out links
  • TextQuality
Architecture
• TextQuality
  • Summarize probabilities from Linguistic Processing into one metric
  • Promote coherent text
  • Demote incoherent text
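One way to summarize many probabilities into a single metric is a length-normalized score such as the geometric mean. The slides do not give the actual TextQuality formula, so the function below is only an assumed stand-in; `text_quality`, its input format and the floor value are hypothetical.

```python
import math


def text_quality(token_probs):
    """Geometric mean of per-token probabilities; 0.0 for empty input.

    token_probs is assumed to hold one probability per token, for example the
    confidence the tagger assigned to that token's chosen tag.
    """
    if not token_probs:
        return 0.0
    floor = 1e-12  # avoids log(0) when a token is considered impossible
    avg_log = sum(math.log(max(p, floor)) for p in token_probs) / len(token_probs)
    return math.exp(avg_log)


coherent = [0.9, 0.8, 0.85, 0.9]    # invented values for well-formed text
keyword_list = [0.2, 0.1, 0.3, 0.15]  # invented values for a keyword-stuffed page
print(text_quality(coherent) > text_quality(keyword_list))  # True
```

Normalizing by length keeps long pages from being penalized simply for containing more tokens, so the score reflects how confidently the text was analyzed rather than how much of it there is.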
TextQuality (disabled)
• Promotes well-written pages (preferable from the user perspective)

Results with TextQuality DISABLED:

Britney Spears Pictures – britney spears pictures …
picture of britney spears, hot pictures of britney spears …
britney-spears-pictures.hotyoungstars.com/nude/

Hot Britney Spears Pics - hot britney spears pics,...
britney spears, new hot pics of britney spears,...
hot-britney-spears-pics.hotyoungstars.com/nude/

Britney Spears Photos – britney spears photos …
spears, britney spears nude photos, nude photos of …
britney-spears-photos.hotyoungstars.com/nude/
TextQuality (enabled)
• Promotes well-written pages (preferable from the user perspective)

Results with TextQuality ENABLED:

Is Britney Spears over the edge?
Is Britney Spears over the edge? … Britney Spears is a singer …
azwestern.edu/modern_lang/esl/cjones/mag/spring2004/britney.htm

IMPERSONATORS – BRITNEY SPEARS
Is Proud to Present! Contact: Gary Shortall Back…
www.impersonators.com/brittany/brit.html

Britney Spears’ Coke Habit
Britney Spears’ Coke Habit Destroys Her…
www.emptyv.org/britney_spears.htm
Other Language Analysis-Enhanced Features
• Key phrases: Present a list of the salient concepts within the results
• Related topics: Concepts related to the present query
• Hone your search: Suggestion of more specific queries
• Spell Checking
• Personalization: I like Sports but not Politics
Evaluation of Categorization
• Using Naïve Bayes classifiers for illustration: Language Analysis improves accuracy
• Infocious actually employs an improved classification technique (76% accuracy)
• We used four different flavors of NB on 100,000 Web pages:
  • C1: Words
  • C2: Words + POS tags
  • C3: Words + extracted concepts
  • C4: Words + POS + extracted concepts
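The difference between the C1 and C2 flavors is only in the features fed to the classifier. A minimal sketch under stated assumptions: a multinomial Naïve Bayes with add-one smoothing, word/POS pairs as the extra C2 features, and two invented training pages. The class `NaiveBayes`, the `features` helper and the data are hypothetical and do not reproduce the experiment on the slide.

```python
import math
from collections import Counter, defaultdict


def features(tagged_tokens, use_pos=False):
    """C1-style features are the words; C2-style adds word/POS pairs."""
    feats = [w.lower() for w, _ in tagged_tokens]
    if use_pos:
        feats += [f"{w.lower()}/{t}" for w, t in tagged_tokens]
    return feats


class NaiveBayes:
    def __init__(self):
        self.class_docs = Counter()               # pages seen per category
        self.feat_counts = defaultdict(Counter)   # feature counts per category
        self.vocab = set()

    def train(self, feats, label):
        self.class_docs[label] += 1
        self.feat_counts[label].update(feats)
        self.vocab.update(feats)

    def predict(self, feats):
        """Return arg max over categories c of log P(c) + sum of log P(f | c)."""
        total_docs = sum(self.class_docs.values())

        def log_posterior(label):
            counts = self.feat_counts[label]
            denom = sum(counts.values()) + len(self.vocab)  # add-one smoothing
            score = math.log(self.class_docs[label] / total_docs)
            for f in feats:
                score += math.log((counts[f] + 1) / denom)
            return score

        return max(self.class_docs, key=log_posterior)


# Invented toy training pages for the two categories named on the slides.
training = [
    ([("jaguar", "N"), ("cars", "N"), ("dealer", "N")], "Recreation/Autos"),
    ([("jaguar", "N"), ("mac", "N"), ("os", "N")], "Computers"),
]
page = [("mac", "N"), ("os", "N"), ("update", "N")]

for use_pos in (False, True):  # C1-style, then C2-style features
    clf = NaiveBayes()
    for tokens, label in training:
        clf.train(features(tokens, use_pos), label)
    print(clf.predict(features(page, use_pos)))  # Computers
```

Because the feature extractor is separate from the classifier, the C3 and C4 flavors would only need extracted concepts appended to the same feature list.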
Evaluation of Categorization
• 3% accuracy increase, 8% error reduction
• C1: Words only
• C2: Words + POS tags
• C3: Words + extracted concepts
• C4: Words + POS + extracted concepts
User Interface
Conclusion
• Infocious: uses language analysis to improve Web search
  • Resolves language ambiguities
  • Incorporates text coherence in the ranking
  • Provides query suggestions and refinements
  • Organizes information intuitively through categorization
Related Work
• Web Search Engines:
  • Google, Yahoo!, MSNSearch, Ask/Teoma, Altavista, Looksmart, Vivisimo, …
• Enterprise Search:
  • Autonomy, Inquira, Inxight, iPhrase, …
• Answer Engines:
  • START@MIT, BrainBoost, …
Ongoing work
• Increase index size (currently ~1 billion pages) through surface & hidden Web-crawls
• Apply our Language Analysis algorithms to additional languages
• Leverage our Language-annotated repository for additional features (e.g. summarization, machine translation, …)
• Investigate how to use Language Analysis to improve relevance in advertisements
Thank you!
You can check out our search engine at: www.infocious.com