TCN Spell Checker Team AZP: Mark Biddlecom, Joshua Correa, Jatinder Singh, Zianeh Kemeh-Gama, Eric Engquist
Team AZP • Team descendant of previous project groups • Primary roles by member: • Joshua Correa – Project Lead, TCN Liaison • Eric Engquist – Materials and Metrics Manager • Mark Biddlecom – Resource and Process Manager • Zianeh Kemeh-Gama – Schedule Manager • Jatinder Singh – Research Lead • Dr. Ludi – Faculty Advisor • Website: http://www.se.rit.edu/~teamazp/index.htm
TCN • Software development and staffing company based here in Rochester, NY http://www.tcnus.com • Developer of web-based search and knowledge management programs • KnowledgeTrac • Customizable multilingual web search tool • Standalone spider • TecTrac, AppTrac, AuditTrac, HelpTrac, TestTrac • Document and database search and management tools
Document Collaboration Tool • Online repository for management documents • Meeting minutes • Metrics • Research links • Presentations and diagrams • Task and issues for each team member • Email notifications of changes • Custom developed for this project
Spell Checker • Should compensate for mistyped search terms • Match misspelled words with correct spelling • “atourney” → attorney • Match misspelled words with correct results • “atourney” → legal services, lawyers • Meant to make searches more useful for average web search users • 1) Takes in search terms from user • 2) Checks spelling/matches with known search terms • 3) Returns suggestions to search engine
Spell Checker Requirements Functional Requirements: • Look up search terms in a dictionary • Suggest replacements for misspelled terms (closest match) • Add new terms to dictionary • Process phrases (as opposed to single words) • Support multiple dictionaries
Spell Checker Requirements Non-functional Requirements: • Object-oriented design to be implemented as a web service with VB.NET • Adaptability • Must support ability to work with different data stores • Must support the addition of new components • Performance • Analysis of a search string cannot take longer than one second.
Spell Check Process • Load configuration • Load dictionaries (from cache or rebuild) • Apply rules • Parse search string • Apply algorithm to each term • Short-circuit if enough results have been found • Return result set of suggestions
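The process above can be sketched as a small loop over terms and algorithms. This is an illustrative sketch only, not the team's VB.NET design: the names `suggest`, `exact_match`, `same_first_letter` and the `max_results` parameter are hypothetical, and `same_first_letter` is a stand-in for a real similarity algorithm.

```python
def exact_match(term, dictionary):
    """Return the term itself when it is a known word."""
    return [term] if term in dictionary else []


def same_first_letter(term, dictionary):
    """Placeholder for a real similarity algorithm (hypothetical rule)."""
    return sorted(w for w in dictionary if w[:1] == term[:1] and w != term)


def suggest(search, dictionary, algorithms, max_results=3):
    """Apply each algorithm to each term, stopping early once enough results exist."""
    results = {}
    for term in search.lower().split():            # parse the search string
        found = []
        for algorithm in algorithms:               # apply algorithms in order
            for word in algorithm(term, dictionary):
                if word not in found:
                    found.append(word)
            if len(found) >= max_results:          # short-circuit
                break
        results[term] = found[:max_results]
    return results


words = {"attorney", "atlas", "lawyer", "legal"}
print(suggest("atourney legal", words, [exact_match, same_first_letter], max_results=2))
```

Ordering the algorithms from cheapest to most expensive lets the short-circuit skip the costly ones whenever a cheap lookup already fills the result set.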
Configuration • Application configuration file • Provides application-level settings (e.g., maximum memory usage, maximum processor time for search) • Points to search configuration file • Search configuration file • Allows control over how memory is used vs. algorithm performance • Defines dictionaries and methodologies • Methodologies include rules
Loaders • Load a set of words for use in dictionaries • Used to create root dictionaries (<root> in the configuration file) • Word sets returned by loaders are not cached, but instead used to create algorithm dictionaries
Formatters • Provide a dictionary specialized for use with a specific algorithm • Created by <dictionary> tags in the configuration file • Dictionaries created by formatters are cached for use between application sessions
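The loader and formatter split can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the choice of indexing words by length, and the use of `pickle` for the cache are hypothetical and not taken from the project.

```python
import pickle
import tempfile
from pathlib import Path


def load_words(lines):
    """Loader: build a root word set. The set itself is never cached."""
    return {line.strip().lower() for line in lines if line.strip()}


def format_by_length(words):
    """Formatter: index words by length, one layout that could suit edit-distance lookups."""
    index = {}
    for word in sorted(words):
        index.setdefault(len(word), []).append(word)
    return index


def cached_dictionary(cache_file, loader_input, formatter):
    """Return the formatted dictionary, rebuilding it only when no cache file exists."""
    cache_file = Path(cache_file)
    if cache_file.exists():
        with cache_file.open("rb") as fh:
            return pickle.load(fh)
    dictionary = formatter(load_words(loader_input))
    with cache_file.open("wb") as fh:
        pickle.dump(dictionary, fh)
    return dictionary


# Usage: the first call builds and caches, the second call reads the cache.
cache_path = Path(tempfile.mkdtemp()) / "edit_distance.cache"
built = cached_dictionary(cache_path, ["Attorney", "lawyer", " legal "], format_by_length)
reloaded = cached_dictionary(cache_path, [], format_by_length)
```

Caching the formatted dictionary rather than the raw word set means the per-algorithm preprocessing is paid once, not on every application session.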
Parsers • Split a search string up into a number of terms • For a given rule, the algorithm is applied to each term supplied by the parser
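A parser in this sense might look like the sketch below. The two functions are hypothetical: `parse_terms` splits on anything that is not a letter, digit or apostrophe, and `parse_phrases` builds adjacent word pairs as one possible way to feed phrases to an algorithm.

```python
import re


def parse_terms(search):
    """Split a search string into lowercase single-word terms."""
    return re.findall(r"[a-z0-9']+", search.lower())


def parse_phrases(search, size=2):
    """Split a search string into overlapping phrases of `size` adjacent words."""
    terms = parse_terms(search)
    return [" ".join(terms[i:i + size]) for i in range(len(terms) - size + 1)]


print(parse_terms("Atourney, legal-services!"))
print(parse_phrases("Atourney, legal-services!"))
```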
Algorithms – String Similarity • Calculates number of operations to go from one word to another • Insertion, Deletion, Substitution • Few operations → good suggestion • Extra features • Swapping operation • Operation weighting
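As a rough illustration of the idea, here is a standard dynamic-programming edit distance with per-operation weights. It assumes that the "swapping operation" means transposing two adjacent letters (the optimal string alignment variant); the weight parameters and the `closest` helper are hypothetical names, not the project's API.

```python
def edit_distance(a, b, w_ins=1.0, w_del=1.0, w_sub=1.0, w_swap=1.0):
    """Weighted edit distance with insertion, deletion, substitution and adjacent swap."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i * w_del                  # delete every letter of a
    for j in range(1, cols):
        d[0][j] = j * w_ins                  # insert every letter of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0.0 if a[i - 1] == b[j - 1] else w_sub
            d[i][j] = min(d[i - 1][j] + w_del,
                          d[i][j - 1] + w_ins,
                          d[i - 1][j - 1] + cost)
            # Adjacent letters swapped, e.g. "teh" vs "the".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + w_swap)
    return d[-1][-1]


def closest(term, words):
    """Order candidate words from fewest to most operations."""
    return sorted(words, key=lambda w: edit_distance(term, w))


print(closest("atourney", ["lawyer", "attorney", "atlas"]))
```

Lowering `w_swap` relative to `w_sub` is one way to rank common typing slips, such as transposed letters, above less likely errors.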
Algorithms – String Similarity • Complexity of O(s1*s2) • s1, s2 are the lengths of the strings being compared • Can be improved to O(s1*k) • k is the edit distance
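One common way to reach the O(s1*k) bound is to evaluate only the diagonal band of the table where the row and column indices differ by at most k, and to give up once every cell in the band exceeds k. The sketch below shows that cutoff idea with unit costs; it is a simplified illustration, not a reproduction of the algorithm in the cited papers, and `bounded_distance` is a hypothetical name.

```python
def bounded_distance(a, b, k):
    """Return the Levenshtein distance if it is <= k, otherwise None.

    Only cells with |i - j| <= k are evaluated. For readability each row is
    still allocated at full width; a tuned version would store only the band.
    """
    if abs(len(a) - len(b)) > k:
        return None                          # length gap alone exceeds k
    inf = k + 1                              # stands for "more than k"
    prev = [j if j <= k else inf for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [inf] * (len(b) + 1)
        if i <= k:
            cur[0] = i
        lo, hi = max(1, i - k), min(len(b), i + k)
        for j in range(lo, hi + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost, inf)
        if min(cur[max(0, i - k):hi + 1]) > k:
            return None                      # no path within k remains
        prev = cur
    return prev[len(b)] if prev[len(b)] <= k else None


print(bounded_distance("atourney", "attorney", 2))
print(bounded_distance("atourney", "attorney", 1))
```

With a small k, most dictionary words are rejected after a few rows, which is what makes scanning a large dictionary feasible within a tight time limit.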
Algorithms - Phonetic • Several rules used to parse English words into a sequence of phonetic sounds • Example: Phonetic → pntk • Parse dictionary, parse search term • String similarity comparison
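A toy version of this approach is sketched below. The rule set is invented for illustration (drop vowels and h/w/y, map a few letters to a shared sound, collapse repeats); it happens to reproduce the slide's example but is not the team's actual rule list. The lookup compares keys for exact equality, where the slide's approach would apply string similarity to the keys instead.

```python
SOUND_ALIKE = {"c": "k", "q": "k", "z": "s"}   # hypothetical letter merges


def phonetic_key(word):
    """Reduce a word to a rough consonant skeleton."""
    key = []
    for ch in word.lower():
        if not ch.isalpha() or ch in "aeiouhwy":
            continue                          # skip vowels, h, w, y and non-letters
        ch = SOUND_ALIKE.get(ch, ch)
        if not key or key[-1] != ch:          # collapse repeats that end up adjacent
            key.append(ch)
    return "".join(key)


def build_phonetic_index(words):
    """Parse the dictionary once: map each phonetic key to its words."""
    index = {}
    for word in words:
        index.setdefault(phonetic_key(word), []).append(word)
    return index


def phonetic_lookup(term, index):
    """Parse the search term and return dictionary words sharing its key."""
    return index.get(phonetic_key(term), [])


index = build_phonetic_index(["attorney", "lawyer", "legal"])
print(phonetic_key("Phonetic"))
print(phonetic_lookup("atourney", index))
```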
Deliverable Schedule Iteration 1: February 1st 2005 • Complete system design for system iterations 1-3 • Instructions for installation and integration with TCN client software • Research • Analysis of historic search strings and business names from TCN • Dictionaries (common words) • Word search algorithms • Basic System Implementation • Database integration • Testing
Deliverable Schedule Iteration 2: February 18th 2005 • Suggest replacements for words not in the dictionary • Addition of a new search algorithm to provide more intelligent searches • Closest Match • Using multiple dictionaries • Unit Testing for all written code
Deliverable Schedule Iteration 3: March 21st 2005 • Phonetic Matching • Dynamically add words/phrases to the dictionary • Support phrase searching • Addition of further search algorithms • GUI Configuration tool • Algorithm Optimization
Metrics • Schedule/estimation accuracy • Estimation accuracy (hours per task) • Slippage percentages • Defect statistics and analysis • Severity and complexity of defects • Defect source tracking • Average age of defects
Research References • “Approximate String Matching” by Ricardo Baeza-Yates at University of Chile • “A Guided Tour to Approximate String Matching” by Gonzalo Navarro at University of Chile, 2001 • “An Extension of Ukkonen’s Enhanced Dynamic Programming ASM Algorithm” by Hal Berghel (U of Arkansas) and David Roach (Acxiom Corp.), 1996