Document Data Mining Design Review

November 18, 2010
Team Members: Dallas Stinger, Wenlong Huang, Aaron Phillips
Advisor: Gregory Donohoe, Ph.D.

The Problem
• The State Board collects meeting minutes and other documents recording the decisions it has made
• Board members want to retrieve text from old documents that relates to current issues
  • They may not recall when an issue was discussed
  • They may not know the exact keywords to search for

The Existing Solution
• Currently, all files exist on a large, unorganized shared network drive.
• Finding information recorded in documents requires knowing when it was recorded, and in which document.

Requirements / Design Decisions

Multiple File Types
• System is limited to the major file types
  • Word documents (.doc, .docx)
  • PDF files (.pdf)
  • Excel (.xls, .xlsx)
  • Text (.txt)
• Not supported
  • WordPerfect (.wpd)
  • PDF files that were scanned in
  • OpenOffice document types

Multi-User Access
• Web Based
  • Pros:
    • Information searchable from anywhere
    • Only one index required
    • Indexing can run on a regular basis without interruption
  • Cons:
    • File permissions
• Individual User Application
  • Pros:
    • Can be programmed to learn user behavior
    • Can give more weight to files the user has used before (looks at search history to aid new searches)
  • Cons:
    • Software package must be installed on each user's machine

Search Collection of Documents Efficiently
Results displayed in less than a second
• Real Time Searching
  • Pros:
    • Easy
    • No initial overhead
  • Cons:
    • Time consuming (> 100,000 words)
    • Unable to find non-exact search results
• Reverse Indexing
  • Pros:
    • Fast and efficient
    • Able to find useful information without exact search text known
  • Cons:
    • Large initial overhead (pre-analyze all documents)
    • Index file must be kept up to date
    • Storage space necessary
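The trade-off above can be made concrete with a minimal sketch of a reverse (inverted) index. This is an illustration under stated assumptions, not the team's implementation: the class name `ReverseIndexSketch`, the separator list, and the in-memory dictionary are all hypothetical. The up-front cost is the `Build` pass over every document; afterwards a lookup is a single dictionary access instead of a scan of all the text.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: maps each word to the set of documents that contain it.
public static class ReverseIndexSketch
{
    // Builds the index from a map of document name -> already-extracted text.
    public static Dictionary<string, HashSet<string>> Build(
        IDictionary<string, string> documents)
    {
        var index = new Dictionary<string, HashSet<string>>(
            StringComparer.OrdinalIgnoreCase);
        char[] separators = { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?' };

        foreach (var doc in documents)
        {
            string[] words = doc.Value.Split(
                separators, StringSplitOptions.RemoveEmptyEntries);
            foreach (string word in words)
            {
                if (!index.TryGetValue(word, out var postings))
                {
                    postings = new HashSet<string>();
                    index[word] = postings;
                }
                postings.Add(doc.Key);
            }
        }
        return index;
    }

    // Returns the documents containing the word, without rescanning any text.
    public static IReadOnlyCollection<string> Lookup(
        Dictionary<string, HashSet<string>> index, string word)
    {
        return index.TryGetValue(word, out var postings)
            ? (IReadOnlyCollection<string>)postings
            : Array.Empty<string>();
    }
}
```

The index must be rebuilt or updated whenever documents change, which is the maintenance cost listed under the cons.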

Find Useful Information Without Exact String Specification (A: Stemming)
• Create our own
  • Pros:
    • Pay attention to details that may be lacking in existing algorithms (aglet vs. readable)
    • More efficient
    • Define special cases
  • Cons:
    • Requires a lot of time
• Use existing algorithm
  • Pros:
    • Readily available
    • Spend more time on other important details
  • Cons:
    • Special cases incorrect
    • Some root words are truncated

Porter Stemming Algorithm
• A large set of rule-based steps, derived from English, that determine the root of a word
• Extensively used in programs
• Outdated: results are not always correct
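To show what stemming buys the search, here is a deliberately simplified suffix stripper. It is not the Porter algorithm, which applies many more rules and conditions; the suffix list, the minimum stem length, and the name `SimpleStemmer` are assumptions made for illustration only.

```csharp
using System;

// Hypothetical, much-simplified stemmer. NOT the Porter algorithm.
public static class SimpleStemmer
{
    private static readonly string[] Suffixes = { "ing", "ed", "es", "s" };
    private const int MinStemLength = 3;

    // Repeatedly strips the first matching suffix while the stem stays long enough.
    public static string Stem(string word)
    {
        string stem = word.ToLowerInvariant();
        bool stripped = true;
        while (stripped)
        {
            stripped = false;
            foreach (string suffix in Suffixes)
            {
                bool matches = stem.EndsWith(suffix, StringComparison.Ordinal);
                if (matches && stem.Length - suffix.Length >= MinStemLength)
                {
                    stem = stem.Substring(0, stem.Length - suffix.Length);
                    stripped = true;
                    break;
                }
            }
        }
        return stem;
    }
}
```

With these rules, "meetings", "meeting", and "meets" all reduce to "meet", so one index entry serves all three. The same rules reduce "discussed" to "discu", which illustrates the truncated-root problem noted above.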

Find Useful Information Without Exact String Specification (B: Thesaurus)
• Own Model
  • Pros:
    • Fine tune thesaurus to have only relevant terms (terms that exist inside our index file)
  • Cons:
    • Very time consuming and complex
• Using pre-built Thesaurus
  • Pros:
    • Quick and easy to use
    • Very extensive
  • Cons:
    • Has irrelevant search term results
    • Unnecessary terms for State Board
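One possible middle ground between the two options is to start from a pre-built thesaurus and keep only the synonyms that actually occur in the index. The sketch below assumes the thesaurus is available as a word-to-synonyms dictionary and the indexed terms as a set; `ThesaurusTrimmer` and both parameter shapes are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch: drops synonyms that never appear in the document index.
public static class ThesaurusTrimmer
{
    public static Dictionary<string, List<string>> Trim(
        IDictionary<string, List<string>> thesaurus,
        ISet<string> indexedTerms)
    {
        var trimmed = new Dictionary<string, List<string>>();
        foreach (var entry in thesaurus)
        {
            // Keep only synonyms that could ever produce a hit.
            List<string> relevant = entry.Value
                .Where(synonym => indexedTerms.Contains(synonym))
                .ToList();
            if (relevant.Count > 0)
            {
                trimmed[entry.Key] = relevant;
            }
        }
        return trimmed;
    }
}
```

Because the result depends on the index, the trimmed thesaurus would need to be regenerated whenever the index is rebuilt.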

Searching
• User types in search criteria
• Determine whether the user wants narrow or broad search results
  • A broad search may retrieve too many results
• Search algorithm converts each typed word into a list of possible stems and synonyms
• Tries all possible permutations of the words to find the closest match to the search
• Calculates the standard deviation of the distance between all of the words
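The slide does not say exactly how the distance statistic is computed, so the following is only one plausible reading: take the position in the document of each matched search word, measure the gaps between neighboring matches, and compute the population standard deviation of those gaps. The name `ProximitySketch` and the choice of population (rather than sample) deviation are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch: spread of the gaps between matched word positions.
public static class ProximitySketch
{
    // positions: word offsets in the document where each search word matched.
    public static double GapStandardDeviation(IEnumerable<int> positions)
    {
        List<int> sorted = positions.OrderBy(p => p).ToList();
        if (sorted.Count < 2)
        {
            return 0.0; // a single match has no gaps to measure
        }

        var gaps = new List<double>();
        for (int i = 1; i < sorted.Count; i++)
        {
            gaps.Add(sorted[i] - sorted[i - 1]);
        }

        double mean = gaps.Average();
        double variance = gaps.Sum(g => (g - mean) * (g - mean)) / gaps.Count;
        return Math.Sqrt(variance);
    }
}
```

Under this reading, matches at positions 10, 11, 12 give gaps of 1 and 1 and a deviation of 0, while matches at 3, 4, 40 give gaps of 1 and 36 and a deviation of 17.5, so tightly and evenly grouped matches score lower.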

Searching (cont.)
• Each file is ranked based on the matches it contains
  • Exact matches rank highest
  • Reordered exact matches rank next
  • Stems, synonyms, partial matches, and large spacing between searched words rank lowest
• All rank values found inside a file are summed
• Highest ranked files are considered most relevant
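The ranking rule above can be sketched as a weighted sum per file. The three match categories follow the slide, but the numeric weights (10, 5, 1) and the names `MatchKind` and `RankingSketch` are invented for illustration; the slides do not give the actual values.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Categories taken from the slide; "Loose" covers stems, synonyms,
// partial matches, and widely spaced words.
public enum MatchKind { Exact, Reordered, Loose }

public static class RankingSketch
{
    // Hypothetical weights: only their ordering comes from the slides.
    public static int Weight(MatchKind kind)
    {
        switch (kind)
        {
            case MatchKind.Exact: return 10;
            case MatchKind.Reordered: return 5;
            default: return 1;
        }
    }

    // Sums the weights of all matches in each file and sorts best-first.
    public static List<KeyValuePair<string, int>> Rank(
        IDictionary<string, List<MatchKind>> matchesByFile)
    {
        return matchesByFile
            .Select(file => new KeyValuePair<string, int>(
                file.Key, file.Value.Sum(kind => Weight(kind))))
            .OrderByDescending(pair => pair.Value)
            .ToList();
    }
}
```

Because scores are summed, a file with many weak matches can outrank a file with a single exact match; whether that is desirable depends on the weights chosen.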

Unit Testing

Unit Testing
• Benefits
• Goal
• Facilitates change
• Limitations
• Not omnipotent
• Low cost performance

Unit Testing
DocumentTest:

    /// Returns the document location
    public void getFileLocationTest()
    {
        convertPDF converpdf = new convertPDF("D:\\Class\\test.pdf");
        string expected = "D:\\Class\\test.pdf";
        string actual = converpdf.getFileLocation();
        // Expected value goes first so a failure message reads correctly
        Assert.AreEqual(expected, actual);
    }

Unit Testing

    /// Creates word counts in alphabetical order for all words located inside the PDF
    public void createDictionaryTest()
    {
        convertPDF converpdf = new convertPDF("D:\\Class\\test.pdf");
        string toDictionary = "this is test code code code";
        converpdf.createDictionary(toDictionary);
        int actual;
        // "code" appears three times in the input string
        converpdf.WordCounts.TryGetValue("code", out actual);
        Assert.AreEqual(3, actual);
    }

End of Semester Status
• Goals:
  • Working, tested prototype
  • Documentation for future teams
• Plenty of areas open for extension or improvement

Future Possibilities: File Types
• Currently supported file types
  • Microsoft Word
  • Microsoft Excel
  • PDF
• No optical character recognition
• Our system will allow for easy extension
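One common way to make new file types easy to add is to hide each format behind a small interface and look converters up by file extension. The sketch below is an assumption about how that could look, not a description of the team's design; `ITextExtractor`, `PlainTextExtractor`, and `ExtractorRegistry` are hypothetical names. Only the plain-text case is implemented, since the other formats would need format-specific libraries.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical extension point: one implementation per file format.
public interface ITextExtractor
{
    string ExtractText(string path);
}

// The simplest case: a .txt file is already plain text.
public sealed class PlainTextExtractor : ITextExtractor
{
    public string ExtractText(string path)
    {
        return File.ReadAllText(path);
    }
}

// Maps file extensions (including the leading dot) to extractors.
public sealed class ExtractorRegistry
{
    private readonly Dictionary<string, ITextExtractor> _byExtension =
        new Dictionary<string, ITextExtractor>(StringComparer.OrdinalIgnoreCase);

    public void Register(string extension, ITextExtractor extractor)
    {
        _byExtension[extension] = extractor;
    }

    // Returns false for file types that have no registered extractor.
    public bool TryGet(string path, out ITextExtractor extractor)
    {
        return _byExtension.TryGetValue(Path.GetExtension(path), out extractor);
    }
}
```

Under this design, supporting a new format such as WordPerfect would mean writing one new class and one `Register` call, with no change to the indexing code.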

Future Possibilities: Indexing
• We have a relatively simple indexing scheme
  • More complex indexing would lead to decreased search time
• Our indexing scheme is very general
  • Could be specific to the State Board
  • Could lead to more relevant results

Future Possibilities: Searching
• Search time increases quickly as search terms are added
• Thesaurus is broad
  • Large number of synonyms can slow search
  • Could be trimmed to fit domain
• Porter stemming algorithm could be replaced
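A rough way to see why search time grows so quickly: if every search word expands into several stems and synonyms, and every ordering of the words is tried, the number of candidate phrasings is the product of the variant counts multiplied by n! for n words. This counting model is an assumption based on the search description in these slides, and `SearchCostSketch` is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical cost model: (product of variants per word) * (number of orderings).
public static class SearchCostSketch
{
    public static long CandidateCount(IReadOnlyList<int> variantsPerWord)
    {
        long count = 1;
        for (int i = 0; i < variantsPerWord.Count; i++)
        {
            // Multiply by this word's variants, and by (i + 1) to accumulate n!.
            count = checked(count * variantsPerWord[i] * (i + 1));
        }
        return count;
    }
}
```

With 4 variants per word, three search words give 384 candidates and five search words give 122,880, which is why trimming the thesaurus has such a large effect.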

Future Possibilities: Correlation
• Related documents should be correlated
  • By date?
  • Using a tagging system?

Future Possibilities: Decision Database
• A client need that is not addressed by our software
• Many board decisions have been passed, with varying lifetimes
• A database could track all board decisions and their lifespans
• Possible connection to our search engine?

Future Possibilities: Web-Based Interface
• Software will be installed on each user’s computer
• GUI could be web based, with access restricted to State Board employees
  • Users could search from home or while on the road, not just in the office
  • Indexing would be simplified

Questions?
