510 likes | 801 Views
SQL Server Full-Text Search SQL Server 2008. Plan. Qu’est-ce que la recherche Full-Text La méthode SQL server 2008 Comment faire Requête. Introduction.
E N D
Plan • Qu’est-ceque la recherche Full-Text • La méthode SQL server 2008 • Comment faire • Requête
Introduction • L'indexation plain texte, aussi appelée indexation de texte intégrale, propose de fournir un service de recherche rapide d'informations déstructurées basée sur des mots, parties de mots, expressions, formes fléchies, synonymes, etc. contenu dans une ligne d'une table. • Un service(Full textsearch) permet d’indexer les éléments contenue dans notre base de données dans le but de permettre une recherche efficace.
Type de recherche • La technologie full-text permet de créer des indexes sur base de mots non parasites et d'utiliser ces indexes pour des recherches (avec support linguistique) ainsi que des recherches de proximité. • Les mots parasites sont les mots de liaisons, les pronoms, … tels que " je ", " de ", " car " et bien d'autres. Ceci permet d'éviter d'avoir comme résultat des données faussées par ces mots. Ainsi, si l'on effectue une recherche avec comme critère la phrase " je veux une classe mfc qui permet de faire du gdi ", seuls les mots " classe ", " mfc " et " gdi " seront pris en compte.
Type de recherche • Le support linguistique permet de définir, en interne, toute une série de mots qui dérivent de ceux qui forment les critères. Ainsi, pour le mot " nager ", le système effectuera la recherche avec les mots " nager ", " nage ", " nagé(e) ". • La recherche de proximité permet de spécifier qu'un mot doit se trouver près d'un autre. Dans le cas de Sql Server, il s'agit d'un " rayon " de 50 mots. Ce nombre est fixé et n'est pas paramétrable.
On auraitpu…. • SELECT *FROM [Northwind].[dbo].[Employees]WHERE Notes LIKE '%grad%‘ • Cependant les performances ne sont pas aussi bonne et la possibilité… n’enparlonsmême pas
Full-Text Search Terminology 1/3 • Full-text index • Information à propos des mots et leur emplacement dans les colonnes • Utilisédans les requêtes full text • Full-text catalog • Groupelogiqued’index (Container) • Word breaker • Identifie les mots pour lesquelsilexiste des dérivations(en fonction de la langue)
Full-Text Search Terminology 2/3 • Token • Les mots identifiés par le word breaker • Stemmer • Génère des dérivation(inflectional) d’un mots (propre à la langue) • Filter • Extrait le texte d’un fichiersauvegarderdans un varbinary(max)ouuneimage • Population or Crawl • Créer et mettre à jour l’index full-text index.
Full-Text Search Terminology 3/3 • Stopwords/Stoplists/Noise Word • Mot sans importance • Par exemple: ‘le’, ‘la’, ‘les’, ‘à’ etc. • Accent insensitivity • café = cafe
Administrer la recherche Full-Text Troistâche: • Créer/Modifier/Supprimer des catalogs full-text • Créer/Modifier/Supprimer des indexesfull-text • Céduler et maintenir les informationsdansl’index.
Comment Créer un Full Text Catalog
Remplirl’index Full-Text • Full • Créel’index • Utilise beaucoup de ressources • Incremental • Met à jour automatiquement l’index en mettant à jour les informations qui ont été mises à jour depuis la dernière mise à jour.
Remplirl’index Full-Text • Update • Gardeune trace des modifications (inserts, updates, and deletes) • Propage les mises à jour selondeuxfaçons: • AUTO automatic • MANUAL basésurunecédulemanuel des changements
CommentRechercher Full-Text Les mots clés: • FREETEXT • FREETEXTTABLE • CONTAINS • CONTAINSTABLE
FREETEXT • Recherchefloue • Inflectional forms (Stemming) • Related words (Thesaurus) • SELECT * • FROM Artiste • WHERE Description like '%francais%'; • SELECT * • FROM Artiste • WHERE FreeText (*, 'francais)') • SELECT * • FROM Artiste • WHERE Contains (*, 'FORMSOF(INFLECTIONAL, francais)')
FREETEXTTABLE • La fonctionprécédenteretournetoutes les réponses qui correspondent à ;la requête. Cependant, cesréponses ne sont pas classé en ordre de pertinence. • Retourne un ensemble de lignes qui contientunecolonne rank • Valeur entre 1 et 1000 • Valeur relative, critère de pertinence par rapport à la requête. Requête: • SELECT Artiste.*, KEYTBL.[KEY], KEYTBL.RANK FROM Artiste • INNER JOIN FREETEXTTABLE(Artiste, *, 'FORMSOF(INFLECTIONAL, francais)') • AS KEYTBL ON No = KEYTBL.[KEY]
CONTAINS • Identique à freetextmaislaissel’utilisateurpréciser la recherchefloue SELECT ProductDescriptionID, Description FROM Production.ProductDescription WHERE CONTAINS(Description, 'bike'); SELECT ProductDescriptionID, Description FROM Production.ProductDescription WHERE CONTAINS(Description, ‘”bike*”'): INFLECTIONAL Considère les dérivations“ride“ “riding", “riden", .. THESAURUSConsidère les Synonyms"metal“ "gold", "aluminium"," steel", ..
SELECT ProductDescriptionID, Description FROM Production.ProductDescriptionWHERE CONTAINS(Description, ‘FORMSOF(INFLECTIONAL, ride) '); Proximité des mots NEAR ( ~ ) Combienprèssont les mots dans le texte SELECT ProductDescriptionID, Description FROM Production.ProductDescriptionWHERE CONTAINS(Description, 'mountain NEAR bike'); SELECT ProductDescriptionID, Description FROM Production.ProductDescriptionWHERE CONTAINS(Description, 'mountain ~ bike');
CONTAINTABLE CONTAINSTABLE (table , { column_name | (column_list ) | * } ,' < contains_search_condition > ' [ , LANGUAGE language_term] [ ,top_n_by_rank ] ) < contains_search_condition > ::= { < simple_term > | < prefix_term > | < generation_term > | < proximity_term > | < weighted_term > } | { ( < contains_search_condition > ) { { AND | & } | { AND NOT | &! } | { OR | | } } < contains_search_condition > [ ...n ] } < simple_term > ::= word | "phrase " < prefix term > ::= { "word *" | "phrase *" } < generation_term > ::= FORMSOF ( { INFLECTIONAL | THESAURUS } , < simple_term > [ ,...n ] ) < proximity_term > ::= { < simple_term > | < prefix_term > } { { NEAR | ~ } { < simple_term > | < prefix_term > } } [ ...n ] < weighted_term > ::= ISABOUT ( { { < simple_term > | < prefix_term > | < generation_term > | < proximity_term > } [ WEIGHT (weight_value ) ] } [ ,...n ] )
La recherche Full-text est plus puissanteque LIKE • Plus spécifique, des résultats plus pertinents • Meileure performance • Amèneune classification des résultats • Utilisation commune • Recherchedans les informations de la base de données • Recherchedans les fichierscontenusdans la base de données • Filtre les mots vide de sens • e.g. with exact phrases - "to be or not to be" (however needs care!)
Integrated backup, restore and recovery • Faster queries and index building • Data definition language (DDL) statements for creating and altering indexes • System stored procedures deprecated • Noise Insensitivity – noise words no longer break the query • Accent Insensitivity (optional) – e.g. café and cafe are the same • Multiple columns can be included in full-text searches • Pre-computed ranking optimizations when using FREETEXTTABLE • Improved ranking algorithm • Catalogs can be set to populate continuously track changes, or index when the CPU is idle
Generation terms • Inflectional • FORMSOF(INFLECTIONAL, "expression") • "drive“ "drove", "driven", .. (share the same stem) • When vague words such as "best" are used, doesn't match the exact word, only "good" • Thesaurus • FORMSOF(THESAURUS, "expression") • "metal“ "gold", "aluminium"," steel", .. • Both return variants of the specified word, but variants are determined differently
Thesaurus • Supposed to match synonyms of search terms • SQL_Server_install_path\Microsoft SQL Server\MSSQL.1\MSSQL\FTDATA\FTNoiseThesaurusBak http://technet.microsoft.com/en-us/library/cc721269.aspx#_Toc202506231
Proximity terms • SyntaxCONTAINS(Column, 'local NEAR winery')CONTAINS(Column, ' "local" NEAR "winery" ') • Important for ranking • Both words must be in column like AND • Terms on either side of NEAR must be either simple or proximity terms
Weighted terms • Each word can be given a rank • Can be combined with simple, prefix, generation and proximity terms • e.g. • CONTAINS(Column, 'ISABOUT( performance weight(.8), comfortable weight(.4))') • CONTAINS(Column, 'ISABOUT( FORMSOF(INFLECTIONAL, "performance") weight (.8), FORMSOF(INFLECTIONAL, "comfortable") weight (.4))')
Administering Full-Text Search Guidelines in administering full-text indexes (Continued) • Always select the smallest unique index available for your full-text unique key. (A 4-byte, integer-based index is optimal.) This reduces the resources required by Microsoft Search service in the file system significantly. If the primary key is large (over 100 bytes), consider choosing another unique index in the table (or creating another unique index) as the full-text unique key. Otherwise, if the full-text unique key size exceeds the maximum size allowed (900 bytes), full-text population will not be able to proceed. • If you are indexing a table that has millions of rows, assign the table to its own full-text catalog. • Consider the amount of change occurring in the tables being full-text indexed, as well as the number of table rows. If the total number of rows being changed, together with the numbers of rows in the table present during the last full-text population, represents millions of rows, assign the table to its own full-text catalog.
Administering Full-Text Search Guidelines in administering full-text indexes (Continued) • A full-text index is a special type of token-based functional index that is built and maintained by theMicrosoft Full-Text Engine for SQL Server (MSFTESQL) service. The process of building a full-text index is quite different from building other types of indexes. Instead of constructing a B-tree structure based on a value stored in a particular row, MSFTESQL builds an inverted, stacked, compressed index structure based on individual tokens from the text being indexed. • In SQL Server 2005, the size of a full-text index is limited only by the available memory resources of the computer on which the instance of SQL Server is running. • You can build full-text indexes on textual data stored in char, nchar, varchar, nvarchar, varchar(max), text, ntext, image, varbinary, and varbinary(max), and xml columns. However, the image, varbinary, and varbinary(max) columns require special handling if you want to use them for full-text processing. • You use protocol handlers and filters when you want to create a full-text index on a varbinary, varbinary(max), or image column. These services let you extract text from Word, Excel, and PowerPoint files as well as PDF and other files that are stored in a native format inside SQL Server. For the filters to work, you need to add a column to the table to contain a value that indicates the type of document stored in the column. The filter then loads up the binary stream stored in the column, strips all formatting information, and returns the text within the document to the word-breaker routine.
Administering Full-Text Search Guidelines in administering full-text indexes (Continued) • After the word breaker-routine has a list of valid words for a row within a column, the full-text engine calculates tokens to represent the words. • A token is simply a compressed form of the original word that saves space and ensures that full-text indexes can be created in as compact a form as possible. • The full text-text functionality then builds all tokens in a column into inverted, stacked, compressed structure within a file that is used for search operations. This unique structure allows ranking and scoring algorithms to efficiently satisfy possible queries.