SharePoint Search – Architecture and Topology

Steve Peschka, Sr. Principal Architect, Microsoft Corporation

Introduction • Search is new and different from all previous versions of SharePoint • The platform has been consolidated • Built on a combination of FAST Search and SharePoint Search components, as well as new development work; it is the same platform from Foundation to Server • Foundation Search is intended to replace WSS Search, Microsoft Search Server and Search Server Express from previous versions of SharePoint • There are new components, a new topology and new features • All will be covered on subsequent slides • Search is now used pervasively throughout the product, in many different features • eDiscovery, navigation, topic pages, Internet-facing business sites, etc. • It’s installed by default when you run the farm configuration wizard • In 2010, SharePoint Search was installed this way; FAST Search for SharePoint was not

Logical Architecture

Crawling the Content • The crawl role is responsible for crawling content sources. It delivers crawled items (both the actual content and the associated metadata) to the content processing component • Invokes connectors or protocol handlers to retrieve data from content sources • Does not do any document parsing (the Content Processing Component does that) • Information about content sources, schedules, etc. is synchronized to the registry on crawl servers from the search admin database • The Crawl Database is used by the crawl component to store information about crawled items and to track crawl history • Holds information such as the last crawl time, the last crawl ID and the type of update during the last crawl

Crawling Improvements • SharePoint 2013 Crawler Model • There is now one crawl role that communicates with all Crawl DBs • Each crawl role contains only one “crawl component” • The role loads items to crawl from the specified Crawl DB, processes them, and then commits • The same host can be distributed across crawl databases • Work is split among multiple crawlers • SharePoint host distribution happens through content database IDs rather than host URLs
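
The idea of distributing a host across crawl databases by content database ID, rather than by host URL, can be illustrated with a small sketch. The slides do not describe the actual assignment rule, so the hashing scheme, the function name `assign_crawl_db` and the database names below are assumptions for illustration only.

```python
import hashlib

def assign_crawl_db(content_db_id: str, crawl_dbs: list) -> str:
    """Pick a crawl database for a content database ID.

    Hypothetical rule: a stable hash of the ID modulo the number of
    crawl databases. The real product logic is not given in the source.
    """
    digest = hashlib.sha256(content_db_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(crawl_dbs)
    return crawl_dbs[index]

# Illustrative names only: one host backed by several content databases.
crawl_dbs = ["CrawlDB_1", "CrawlDB_2"]
content_dbs = ["contentdb-a", "contentdb-b", "contentdb-c", "contentdb-d"]

# Each content database maps to exactly one crawl database, so a single
# host can end up spread over more than one crawl database.
assignment = {db: assign_crawl_db(db, crawl_dbs) for db in content_dbs}
```

Because the key is the content database ID, two site collections on the same host but in different content databases may land in different crawl databases, which is the point the slide makes.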

Content Processing • Content Processing Component • Processes crawled items and feeds these items to the index component • Document parsing happens through new parser handlers • iFilters are supported through a generic iFilter handler • iFilters are still the extensibility platform for SharePoint 2013 • Transforms crawled items into artifacts that can be included in the search index by parsing documents and applying property mappings • Performs linguistic processing at index time (e.g. language detection and entity extraction) • Writes information about links and URLs directly to the Link Database • Generates phonetic name variations for people search

How the Crawler Feeds Content Processing • [Diagram: Crawler, CSS, CPC, Index Component] • A crawler sends data to the Content Submission Service (CSS) • The CSS is then responsible for distributing the load across Content Processing Components (CPCs) • Inside the CPC the content is transformed and prepared for indexing • After document parsing, word breaking, entity extraction, security, link info, and content from the web service callout have been processed, the CPC sends the metadata and document content to an index component • The CPC then tells the crawler whether the document was successfully indexed, so failures can be retried; failed documents and error codes are shown in the Crawl Log
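
The feeding flow above can be sketched as a toy pipeline. This is a minimal illustration, not the product's implementation: the class names, the round-robin distribution and the "empty content fails" rule are all assumptions made so the example runs.

```python
from itertools import cycle

class ContentProcessingComponent:
    """Toy CPC: transforms an item and reports success or failure."""
    def __init__(self, name):
        self.name = name

    def process(self, item):
        # Hypothetical failure rule: items without content cannot be indexed.
        if not item.get("content"):
            return False
        # Stand-in for parsing and word breaking.
        item["tokens"] = item["content"].lower().split()
        return True

class ContentSubmissionService:
    """Toy CSS: spreads submitted items over the available CPCs."""
    def __init__(self, cpcs):
        self._next_cpc = cycle(cpcs)  # round-robin is an assumption

    def submit(self, item):
        cpc = next(self._next_cpc)
        return cpc.name, cpc.process(item)

def crawl(items, css):
    """Submit each item and record the outcome, as a crawl log would."""
    crawl_log = []
    for item in items:
        cpc_name, ok = css.submit(item)
        crawl_log.append((item["url"], cpc_name, "indexed" if ok else "failed"))
    return crawl_log

css = ContentSubmissionService([ContentProcessingComponent("CPC1"),
                                ContentProcessingComponent("CPC2")])
log = crawl([{"url": "a", "content": "Hello World"},
             {"url": "b", "content": ""},
             {"url": "c", "content": "x"}], css)
```

The status returned per item is what lets the crawler retry failures, and the `crawl_log` list plays the role of the Crawl Log entries mentioned on the slide.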

Content Processing Customization • A content web service callout, which enriches data before an item is added to the index, is available as an extensibility capability • Works with managed properties that can be passed to the web service and returned from the web service • Web service calls are governed by “triggers” • A trigger is signaled whenever its condition is true • Trigger conditions use an expression language to refer to the values of managed properties
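
As a rough sketch of how a trigger gates the callout: the item is sent to the web service only when the trigger condition holds, only the configured input properties are sent, and the returned properties are merged back. The real trigger expression language is not shown in the slides, so a plain Python predicate stands in for it here; `enrich_item`, `fake_callout` and the property names are hypothetical.

```python
def enrich_item(item, trigger, input_props, callout):
    """Call the enrichment service only when the trigger condition is true."""
    if not trigger(item):
        return item  # condition false: the item is indexed unchanged
    # Send only the configured input managed properties.
    payload = {name: item.get(name) for name in input_props}
    returned = callout(payload)
    # Merge the managed properties returned by the service.
    return {**item, **returned}

# Hypothetical stand-in for the external web service.
def fake_callout(payload):
    return {"Department": "Legal"} if payload["Author"] == "alice" else {}

# Hypothetical trigger: only PDF items are sent to the service.
is_pdf = lambda item: item.get("FileType") == "pdf"

enriched = enrich_item({"FileType": "pdf", "Author": "alice"},
                       is_pdf, ["Author"], fake_callout)
skipped = enrich_item({"FileType": "docx", "Author": "alice"},
                      is_pdf, ["Author"], fake_callout)
```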

Analytics Processing Component • Search analytics analyzes crawled items and how users interact with search results • Usage analytics analyzes usage events, such as views, from the event store • When a user performs an action (e.g. views a page), the event is collected in usage files on the WFEs and regularly pushed to the event store, where events are kept until processed • The APC sends results to the Content Processing Component to be included in the search index • You can write code to add custom events • The Analytics Processing Component supports scaling out: • Add more APC roles to have analysis complete faster • Add more Link databases to increase capacity for links and search clicks • Add more reporting databases to store more reports as well as improve SQL throughput in retrieving reports

List of Sub Analyses • Search Analytics: Link and Anchor Text Analysis; Click Distance; Search Clicks; Deep Links; Social Tags; Social Distance; Search Reports • Usage Analytics: Recommendations; Usage Counts; Activity Ranking

Analytics Processing Databases • Link Database • Stores links extracted by the content processing component • Stores information about the number of times people click on a result from the search result page • Analytics Reporting Database • Stores the results of usage analysis as well as Search Reports • Reports stored in the DB • Item Reports: • Number of views for an item over time • Unique users viewing an item over time • Site Level Reports: Tenant; Site Collection; Web • Number of views over time • Unique users over time • Data aggregated to monthly views every 14 days • Search Index • A small portion of the data in the Analytics Reporting DB is replicated to the search index • Lifetime view count • View count for the last 14 days
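
The two view counts replicated to the search index (lifetime and last 14 days) can be illustrated with a short calculation. This is only a sketch: the function name, the input shape (one date per view event) and the exact window boundaries are assumptions, since the slides do not define them.

```python
from datetime import date, timedelta

def view_counts(view_dates, today, recent_days=14):
    """Return lifetime and recent view counts for one item.

    Assumption: "recent" means events dated after (today - recent_days)
    and up to and including today.
    """
    cutoff = today - timedelta(days=recent_days)
    lifetime = len(view_dates)
    recent = sum(1 for d in view_dates if cutoff < d <= today)
    return {"lifetime": lifetime, "recent": recent}

# Illustrative view events for a single item.
views = [date(2012, 12, 1), date(2013, 1, 20), date(2013, 1, 31)]
counts = view_counts(views, today=date(2013, 1, 31))
```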

Index Component • Feed / Query • Feeding: receives processed items from the content processing component and persists those items to index files • Query: receives queries from the query processing component and provides result sets in return • Replication • Replicates index content between replicas within the same index partition • Topology Changes • Responsible for applying index partition changes when there is a topology change

Index High Level Architecture • An index partition is a logical portion of the entire search index • Each partition is served by one or more index components (or “replicas”) • The primary (or “active”) replica maintains a persisted journal of new and updated items, which is copied to the other replicas within the partition • Selection of the primary replica is dynamic and not controlled by the admin • All replicas are there for fault tolerance and increased query throughput • The index can scale both horizontally (partitions) and vertically (replicas) • Partitions can be added but NOT removed
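
The primary replica's journal and its copy to the other replicas can be modeled in a few lines. This is a simplified sketch under stated assumptions: the class names are hypothetical, the first replica is simply treated as primary (the real selection is dynamic), and the journal is an in-memory list rather than a persisted file.

```python
class Replica:
    """One index component serving a partition."""
    def __init__(self, name):
        self.name = name
        self.items = {}

class IndexPartition:
    def __init__(self, replica_names):
        self.replicas = [Replica(n) for n in replica_names]
        # Assumption: the first replica acts as primary in this sketch.
        self.primary = self.replicas[0]
        self.journal = []  # ordered list of new and updated items

    def feed(self, doc_id, content):
        """The primary records the change in its journal and applies it."""
        self.journal.append((doc_id, content))
        self.primary.items[doc_id] = content

    def replicate(self):
        """Copy the journal to every other replica, replaying it in order."""
        for replica in self.replicas:
            if replica is not self.primary:
                for doc_id, content in self.journal:
                    replica.items[doc_id] = content

partition = IndexPartition(["IndexComponent1", "IndexComponent2"])
partition.feed("doc-1", "quarterly report")
partition.feed("doc-1", "quarterly report v2")  # an update to the same item
partition.replicate()
```

After `replicate()` every replica holds the same items, which is why any of them can answer queries and why adding replicas increases query throughput as well as fault tolerance.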

Search Administration • Search Admin Component • Is responsible for search provisioning and topology changes • Manages the lifecycle and monitors the state of the search components: Crawling, Content Processing, Query Processing, Analytics, and Indexing • Multiple Admin Components can be deployed for fault tolerance • Search Admin DB • Stores search configuration data: • Topology • Crawl rules • Query rules • Managed property mappings • Content sources • Crawl schedules • Stores Analytics settings • Does not store ACLs anymore

Query Processing Component • Performs linguistic processing at query time: • Word breaking, stemming, query spellchecking, thesaurus • It receives queries, then analyzes and processes them to optimize precision, recall and relevancy; the processed query is submitted to the index component(s) • As part of this it also decides which query rules are applicable, which index to send the query to, and whether to do any pre- or post-processing of the query • The index returns a result set to the query processing component, which processes it before sending it back
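
A heavily simplified sketch of the query-time steps follows. It assumes whitespace splitting as a stand-in for word breaking, a dictionary as the thesaurus, and query rules expressed as (name, condition) pairs; none of these shapes come from the product, and the function name `process_query` is hypothetical.

```python
def process_query(raw_query, thesaurus, query_rules):
    """Break the query into terms, expand synonyms, and pick applicable rules."""
    # Naive word breaking: lowercase and split on whitespace.
    terms = raw_query.lower().split()

    # Thesaurus expansion: keep each term and append its synonyms.
    expanded = []
    for term in terms:
        expanded.append(term)
        expanded.extend(thesaurus.get(term, []))

    # A query rule applies when its condition holds for the original terms.
    applied = [name for name, condition in query_rules if condition(terms)]
    return {"terms": expanded, "rules": applied}

processed = process_query(
    "Car Rental",
    thesaurus={"car": ["automobile"]},
    query_rules=[("PromoteRentals", lambda terms: "rental" in terms),
                 ("PromoteFlights", lambda terms: "flight" in terms)],
)
```

The processed query (expanded terms plus the selected rules) is what would then be submitted to the index component(s).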

Search Processes • Host Controller • A Windows Service that supervises the NodeRunner process(es) on a given box • To restart search components on a server, restart the Host Controller service • Manages the search dictionary repository • NodeRunner.exe • Is the process that hosts the search components • There might be several instances of this process on a single box • MSSearch.exe • Is the Windows Service that hosts the Crawl Component

Search Host Process • [Diagram: MSSearch.exe hosts the Crawl Component; separate NodeRunner.exe instances host the Content Processing, Index, Query Processing, Analytics Processing and Search Admin components] • Multiple NodeRunner instances can run on the same server • Each NodeRunner instance hosts one search component • E.g. if you have a Content Processing Component and an Index Component on one server, you will have two NodeRunner instances, one for each • On a default single server install there will be 5 instances of the NodeRunner.exe process
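
The process-per-component rule can be checked with a small helper that lists the host processes expected on a server for a given set of components. The function name and component labels are illustrative; the rule itself (one NodeRunner.exe per component, with the Crawl Component hosted by MSSearch.exe) follows the slides.

```python
def processes_for_server(components):
    """List the host processes expected for the components on one server."""
    processes = []
    for component in components:
        if component == "Crawl":
            # The Crawl Component is hosted by the MSSearch.exe service.
            processes.append("MSSearch.exe (Crawl)")
        else:
            # Every other search component gets its own NodeRunner instance.
            processes.append(f"NodeRunner.exe ({component})")
    return processes

# Default single server install: all six components on one box.
single_server = processes_for_server([
    "Crawl", "Content Processing", "Analytics Processing",
    "Index", "Query Processing", "Search Admin",
])
node_runner_count = sum(p.startswith("NodeRunner.exe") for p in single_server)
```

With all six components on one server, `node_runner_count` comes out as five, matching the figure on the slide.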

Physical Architecture

Small Example • Target: 10 Million Items • [Diagram: servers hosting the Query, Crawl, Content Processing, Analytics, Admin and Index components] • NOTE: This example includes all services, not just search, since it’s unlikely you would have a small farm dedicated only to search services.

Medium Example • Target: 40 Million Items • [Diagram: servers hosting the Query, Crawl, Content Processing, Analytics, Admin and Index components]

Large Example • Target: 100 Million Items • [Diagram: servers hosting the Query, Crawl, Content Processing, Analytics, Admin and Index components]

Scaling Guidelines

© 2012 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
