ECMS210 Find it Faster with MOSS 2007 Search Marianne Sweeny Senior Search Specialist Ascentium partner
AGENDA
Introduction • Search 101
Web Search and Enterprise Search
MOSS 2007 Search • Working with MOSS 2007 Search • Leveraging MOSS 2007 Search • Gotchas
Compared to What? A Look at the Google Search Appliance • Features • Relevance • Gotchas
Introduction There is too much information to manage, so we create ways to manage it • There is no “silver bullet” solution for finding information • Customers don’t know what they don’t know • What is the “Google experience”? • 50% of Web searches are unsuccessful • Every information need is individualized • What works for another enterprise likely won’t work for yours • Enterprises have different lines of business and different information types
“My problem is not finding something,” says Danny Hillis, a MacArthur Foundation genius and computer scientist who now runs a consulting business, “my problem is understanding something.” … That, he continues, can only happen if search engines understand what a person is looking for, and then guide her toward understanding that thing … • John Battelle, “The Search: How Google and Its Rivals Rewrote the Rules of Business and Transformed Our Culture” [p. 16]
SEARCH and HAMBURGERS All search technologies use the same fundamental process that has been in use since the 1960s for searching electronic databases
HOW SEARCH TECHNOLOGY WORKS http://www.howstuffworks.com/search-engine.htm/printable
What the Spider “sees” We see this While the spider “sees” this
Web Search and Enterprise Search
Web Search • Anarchistic publishing model = “anyone, anywhere, any time” • Unlimited document set • No real standards or code, more like guidelines • No central authority • Spam • Commercialization • Technology is agnostic • Has to work the same for everyone worldwide • No shared understanding
Enterprise Search • Controlled corpus of documents • Standards and practices in place • No spam • Users and authors generally share contextual understanding • Maybe even a TAXONOMY, if you’re lucky • Can customize search technology to enterprise themes and concepts
Indexing Enhancements • Content index that holds text of pages • Property store that holds other document values • Shared service for easy central management • Crawl to small indexes that are then consolidated at scheduled times into a “master merge” • Continuous propagation • Incremental indexing • Indexes only those records that have changed • Change-notification based • Single-item add/removal without re-indexing entire corpus • Scans folders that have been added to, deleted from, or modified by user or system • Security-change-only crawl to catch changes in permissions • Broad and granular security permissions capability • 5 to 50 million document capacity in official testing
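The change-only behaviour on this slide can be sketched in a few lines. This is a minimal illustration, not how MOSS 2007 implements it: the `incremental_crawl` function, the dictionary used as an index, and the use of content hashes to detect changes are all assumptions made for the example.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Return a stable hash of a document's text, used to detect changes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def incremental_crawl(index: dict, corpus: dict) -> dict:
    """Update `index` in place so it mirrors `corpus`, touching only changed items.

    index  maps doc id -> fingerprint of the last indexed version (hypothetical store)
    corpus maps doc id -> current document text
    Returns counts of what was added, updated, and removed.
    """
    stats = {"added": 0, "updated": 0, "removed": 0}
    for doc_id, text in corpus.items():
        digest = fingerprint(text)
        if doc_id not in index:
            stats["added"] += 1
        elif index[doc_id] != digest:
            stats["updated"] += 1
        else:
            continue  # unchanged: skip re-indexing
        index[doc_id] = digest
    # Single-item removal: drop entries whose documents no longer exist.
    for doc_id in list(index):
        if doc_id not in corpus:
            del index[doc_id]
            stats["removed"] += 1
    return stats
```

Running the function twice against a corpus that changed between runs reports only the differences, which is the point of an incremental crawl: unchanged documents are never re-processed.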
Relevance • Dynamic ranking (relevance impacted by query term) • Frequency • Location in document • Appearance in link text • Appearance in URL • Static ranking (relevance independent of customer query) • URL Depth • Click Distance • Authority/Demoted site • Change property weights • Language of customer (browser setting) • Security-trimmed results dependent on broad/granular permissions setting • Folder/document • Search Alerts • User can subscribe to receive email when results change • File type filtering • Some file types are deemed more relevant (e.g., HTML, DOC) than others (XML, TXT) • Supports 220 file types from MS and non-MS applications
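To make the split between dynamic and static ranking concrete, here is a small sketch that scores documents on both. It is not the MOSS 2007 ranking formula: the `Doc` fields, the bonus values, and the way the two parts are added together are assumptions chosen only to show how query-dependent and query-independent signals can combine.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    url: str
    title: str
    body: str
    click_distance: int  # hops from an authoritative page (hypothetical field)


def dynamic_score(doc: Doc, term: str) -> float:
    """Query-dependent part: term frequency plus bonuses for title and URL matches."""
    term = term.lower()
    words = doc.body.lower().split()
    frequency = words.count(term) / len(words) if words else 0.0
    title_bonus = 1.0 if term in doc.title.lower() else 0.0  # assumed weight
    url_bonus = 0.5 if term in doc.url.lower() else 0.0      # assumed weight
    return frequency + title_bonus + url_bonus


def static_score(doc: Doc) -> float:
    """Query-independent part: shallow URLs and short click distances score higher."""
    url_depth = doc.url.rstrip("/").count("/") - 2  # ignore the two slashes in "http://"
    return 1.0 / (1 + max(url_depth, 0)) + 1.0 / (1 + doc.click_distance)


def rank(docs: list[Doc], term: str) -> list[Doc]:
    """Order documents by the sum of both parts, best first."""
    return sorted(docs, key=lambda d: dynamic_score(d, term) + static_score(d), reverse=True)
```

In this sketch a shallow page close to an authority can outrank a deeply buried page that repeats the query term more often, which is the effect the static signals are meant to have.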
Custom Results • Search Scopes • Allow users to refine search through filtering • Define content resources and map to business rules/key concepts • Focused content = shared understanding = more precise results • Duplicate results filtering • Collapsing duplicates from same directory or site to leave more room for other relevant results • Query time security trimming • Only results that user has permissions to see are presented • Based on permissions from LDAP or AD • Synonym mapping and editorialized results • Use search logs to detect popular searches, low click-through from results or 0 result queries • Manually map related terms or program results to keywords • And…Facets
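Query-time security trimming and duplicate collapsing can be pictured as two filters applied to a ranked result list. The sketch below is illustrative only: the result dictionaries, their keys, and the `duplicate_key` used to group duplicates are hypothetical, and a real deployment would resolve group membership from AD or LDAP rather than pass it in as a set.

```python
def trim_and_collapse(results, user_groups):
    """Drop results the user may not see, then keep one result per duplicate group.

    results is a ranked list of dicts with hypothetical keys:
      "url", "allowed_groups" (set of group names), "duplicate_key"
    user_groups is the set of groups the querying user belongs to.
    """
    # Security trimming: keep a result only if the user shares a group with it.
    visible = [r for r in results if r["allowed_groups"] & user_groups]

    # Duplicate collapsing: the first (highest-ranked) member of each group wins.
    seen = set()
    collapsed = []
    for r in visible:
        if r["duplicate_key"] in seen:
            continue
        seen.add(r["duplicate_key"])
        collapsed.append(r)
    return collapsed
```

Trimming before collapsing matters in this sketch: if the order were reversed, a hidden document could suppress a visible duplicate of itself.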
MOSS 2007 Faceted Search • Empowers customer to refine search • Filters results by predetermined categories • Contextual facet menu based on refined search criteria • Enables sub-facets
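A facet menu is, at its core, a count of result values per category, recomputed after each refinement. The following sketch shows that idea under simple assumptions: results are plain dictionaries, and the field names (`type`, `dept`) are invented for the example rather than taken from MOSS 2007.

```python
from collections import Counter


def facet_counts(results, facet_fields):
    """Count how many results carry each value of each facet field."""
    counts = {field: Counter() for field in facet_fields}
    for doc in results:
        for field in facet_fields:
            value = doc.get(field)
            if value is not None:
                counts[field][value] += 1
    return counts


def refine(results, field, value):
    """Keep only results whose facet `field` equals `value`."""
    return [doc for doc in results if doc.get(field) == value]
```

Calling `facet_counts` again on the output of `refine` gives the contextual menu described on the slide: the remaining facet values and counts reflect only the narrowed result set.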
Search for People or Expertise • Grouping by social distance and shared common interests • Separate tab on search UI • Indexes individual profile using • Email • IM contacts • Connections to outside organizations • Team sites • Distribution list memberships • User can choose what information to reveal and select from 5 levels of access • Everyone [outermost ring of visibility] • My Colleagues • My workgroup • My manager • Me [center of the visibility network]
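The five access levels behave like concentric rings, so a visibility check reduces to an ordered comparison. This sketch assumes each profile field is tagged with the widest ring allowed to see it; the `Visibility` enum and `visible_fields` function are hypothetical names, not part of the MOSS 2007 object model.

```python
from enum import IntEnum


class Visibility(IntEnum):
    """The five levels from the slide, ordered from the center ring outward."""
    ME = 0
    MY_MANAGER = 1
    MY_WORKGROUP = 2
    MY_COLLEAGUES = 3
    EVERYONE = 4


def visible_fields(profile, viewer):
    """Return the profile fields a viewer may see.

    profile maps field name -> (value, Visibility chosen by the profile owner)
    viewer  is the ring the viewer occupies relative to the owner
    A field is shown when the viewer's ring lies inside the field's ring.
    """
    return {name: value for name, (value, level) in profile.items() if viewer <= level}
```

A field marked `EVERYONE` is returned for any viewer, while a field marked `ME` is returned only when the owner views their own profile.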
Business Data Catalog: Search Across Business Applications • Extracts data from line-of-business, CRM, and other 3rd-party data stores • Caches for indexing by search service • Searches any data source accessible through ADO.NET or Web Services • Uses Live Communication Server for connectivity options • Aggregated into a single application • No more cutting and pasting needed • Available in MOSS 2007 Search Enterprise edition and both versions of MOSS 2007 Full Product
Customizing the Search Experience The simple web-based administration tool makes customizing the search experience to the client’s content or site visitor needs easy and intuitive
Designating Authority Sites • Hilltop Algorithm • Quality of links more important than quantity of links • Segmentation of corpus into broad topics • Selection of authority sources within these topic areas • Pre query calculations applied at query time • Topic Sensitive Page Rank • Consolidation of Hypertext Induced Topic Selection [HITS] and PageRank • Pre-query calculation of factors based on subset of corpus • Context of term use in document • Context of term use in history of queries • Context of term use by user submitting query • Creator now a Senior Engineer at Google
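Designating authority sites matters because click distance is measured outward from them. A breadth-first search over the link graph is one straightforward way to compute such distances; the sketch below assumes the link graph is available as a plain dictionary and is not a description of the internals of MOSS 2007, Hilltop, or Topic-Sensitive PageRank.

```python
from collections import deque


def click_distances(links, authorities):
    """Breadth-first search: fewest clicks from any authority page to each page.

    links maps a page URL to the list of URLs it links to.
    authorities is an iterable of pages designated as authoritative.
    Pages that cannot be reached from an authority are absent from the result.
    """
    distance = {page: 0 for page in authorities}
    queue = deque(distance)
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in distance:
                distance[target] = distance[page] + 1
                queue.append(target)
    return distance
```

With no authorities supplied the function returns an empty mapping, which mirrors the gotcha that click distance offers no benefit until authoritative sites are configured.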
Web Analytics • Export search logs to Excel • Query terms • Page views • Number of results returned • Volume trends • Query success: can define success for certain query terms • Report Center • Access to MOSS 2007 BI features • Filters data for permissions and relevance • Key Performance Indicators [KPI] • Create a KPI list and measures of success • Default KPIs within MOSS 2007 • KPI information drawn from MOSS 2007 data sources SharePoint lists Excel workbooks SQL Server 2005 Analysis Services Manually entered information
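Once search logs are exported, the measures listed above (query volume, zero-result queries, click-through) can be computed with a short script. The log format here is an assumption: each entry is a dictionary with the hypothetical keys `query`, `results`, and `clicked`, which will differ from the columns in an actual MOSS 2007 export.

```python
from collections import Counter


def summarize_queries(log):
    """Summarize a search log.

    log is a list of dicts with hypothetical keys:
      "query" (str), "results" (number of results returned), "clicked" (bool)
    Returns (volume, zero_results, click_through), all keyed by normalized query.
    """
    volume = Counter()
    zero_results = Counter()
    clicks = Counter()
    for entry in log:
        query = entry["query"].strip().lower()  # treat "Expense form" and "expense form" alike
        volume[query] += 1
        if entry["results"] == 0:
            zero_results[query] += 1
        if entry["clicked"]:
            clicks[query] += 1
    click_through = {query: clicks[query] / count for query, count in volume.items()}
    return volume, zero_results, click_through
```

Queries that appear in `zero_results`, or that have high volume but low click-through, are natural candidates for keyword mapping or editorial results.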
MOSS 2007 Search Gotchas! • Does not support wildcard searches • e.g., diet*, which would bring up dietary, dietician, dieting • It is only as good as you want it to be and are willing to work to make it • The Expertise search capacity is predicated on employee compliance in profiling • Without authoritative sites configured in the relevance settings, the benefits of click distance are missed • Use of special characters in the thesaurus can lead to highly irrelevant results and impact “did you mean” capabilities • Must ensure a schedule for the incremental crawl to catch additions to the document set • Custom applications using the SharePoint 2003 administrative object model must be rewritten to use the MOSS 2007 object model
Leverage MOSS 2007 Search • Get as much customer data as possible to find search pain points • Review search logs and customer feedback mechanisms • What are they trying to find • What terms are they using • Develop Best Bets for searches with 0 results • Define key enterprise themes in content • Map existing content to these themes • Create filters and scopes to map for themes • Leverage human-mediated relevance components • Assign relevance weighting that makes sense to the customer information set • Develop a structure that leverages the structural weighting elements • Authorities • Deep hierarchies • Develop editorial guidelines and tools that enforce strong metadata standards across the enterprise • Develop controlled vocabulary that best describes enterprise key concepts and themes • Use as a foundation for meaningful metadata • Engage in regular search effectiveness reviews of: • Search logs • Customer feedback
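Best Bets and synonym mapping are both manual lookups layered on top of the index. A minimal sketch of how the two can work together follows; the `synonyms` and `keyword_map` structures, along with the sample terms and URLs, are invented for illustration and do not reflect how MOSS 2007 stores keywords or its thesaurus.

```python
def expand_query(query, synonyms):
    """Return the normalized query term followed by its manually mapped related terms."""
    term = query.strip().lower()
    return [term] + sorted(synonyms.get(term, []))


def best_bets(query, keyword_map, synonyms):
    """Look up editorial results for the query or any of its synonyms.

    keyword_map maps a keyword to an ordered list of hand-picked URLs.
    Each URL is returned once, in the order it is first encountered.
    """
    hits = []
    for term in expand_query(query, synonyms):
        for url in keyword_map.get(term, []):
            if url not in hits:
                hits.append(url)
    return hits
```

A query that returns nothing from the index can still surface a hand-picked page this way, provided someone has mapped the term after spotting it in the search logs.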
Better Than Ever
SharePoint 2003 • Used Probabilistic Relevance Scoring • Collection frequency • Term frequency • Document length • Term position • Relevance based on internal information • What the system thinks you want to see, keyed on numeric values derived solely from document text • Different systems between Windows SharePoint Services and SharePoint Portal Server • Multiple indexes • Custom Content groups, Best Bets, scheduling configurations are portal-based • Scopes tied to content sources • Index propagated at completion of master crawl only
MOSS 2007 • Relevance customizable to the enterprise content • Click distance • Hyperlink anchor text • URL surf depth • URL text matching • Automated metadata extraction • Automatic language detection • File type relevancy biasing • Enhanced text analysis • Fully integrated admin experience between Windows SharePoint Services v3 and MOSS 2007 • Single search system and index per server farm • Custom Content groups, Best Bets, scheduling are now shared services • Improved control over indexing • Continuous index propagation for data integrity • Multiple start points • Customized crawling • Single item removal from index • Scopes can be tied to document properties
GSA Features • Hardware and software solution • Commodity hardware, open source software • Google experience • UI and PageRank • Searches desktop, email servers, internal and external websites • Results collapsing • User assigned search parameters • Numbers: sort by price • Date: assign a date range • OneBox access information from business applications • Requires business application partner’s module for access • Always appears at top of search results • Version price dependent on feature set and documents crawled • Mini: $2000 for 50,000 documents • GB-8008: $450,000 for an 8u server rack with secured system crawling, load balancing features, and capacity for up to 5 collections of 4 million documents each
Google Search Appliance Relevance • PageRank is minimal foundation for GSA relevance • Enterprise relevance based on deployed research • 1000 Google engineers researching data from 7000 current enterprise deployments for features and enhancements • Often-cited Raytheon case study revealed following order of influence for relevance ranking • Keyword to content ratio • Qterm in page title • Qterm in Description • Qterm in file name • NO ability to adjust relevance algorithm
GSA Administration • Plug and Play • Set up in 3-5 clicks • Point to content directory and crawl • Indexing • Incremental indexing • Change only indexing • Web-based Administration console • People/Expert search available • Security • View permissions check
GSA Benefits • Same search experience as with Google Web • Relevance based on Google PageRank and other features • Plug and play • Set it up and point it to the content source and start crawling • Google OneBox searches across structured data • Google People Search • Small set of features to adjust
GSA Gotchas! • Proprietary relevance functionality not open to customization • Not most suitable to unique/special enterprise content • Sort-by-date is “published date” only • KeyMatch [editorialized results] to achieve customization • Can prioritize/deprioritize content groups • Meta size limitation of 160 characters • Stemming file limited to 3 MB • Non-HTML files are converted to HTML for indexing • HTML clone counts towards document limit • HTML files crawled up to 2.5 MB with rest discarded • HTML code counts towards the 2.5 MB though not indexed • Fileshare hosted documents placed in public location • Does not index TIFFs from OCR text • Expertise search cannot be controlled by the individual • No indicators of existing relationship
Conclusion • Both products share many of the same features • MOSS 2007 lifts the hood a little further to show you how it works • MOSS 2007 Enterprise Search acknowledges the uniqueness of your search landscape and information needs and offers the opportunity to customize the search experience for your users, your information types, your enterprise • MOSS 2007 Relevance tuning by… • Algorithm adjustment • Site structure • Authority designation • Keyword development • MOSS 2007 people search feature with • Search by social distance • Is brokered by a server for privacy
Resources
• MOSS 2007 Product Guide: http://www.microsoft.com/office/preview/servers/sharepointserver/guide.mspx
• MOSS Developer Center on MSDN: http://msdn.microsoft.com/office/server/moss/default.aspx
• MOSS 2007 Software Developers Kit: http://msdn2.microsoft.com/en-us/library/ms550992.aspx
• MOSS 2007 on TechNet: http://technet2.microsoft.com/Office/en-us/library/3e3b8737-c6a3-4e2c-a35f-f0095d952b781033.mspx
• MOSS 2007 Administrator Documentation: http://jamorgan.wordpress.com/2006/09/07/administrator-documentation-for-moss-2007-wss-v3/
• Microsoft Enterprise Search website: http://www.microsoft.com/enterprisesearch/
• Me: marianne.sweeny@ascentium.com or 425.519.7700
SUBMIT AN EVALUATION For a chance to win an 8GB ZUNE! Submit evaluations on MySPC www.MicrosoftSharePointConference.com
© 2008 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.