Document Data Mining Design Review

November 18, 2010. Team Members: Dallas Stinger, Wenlong Huang, Aaron Phillips. Advisor: Gregory Donohoe, Ph.D.

Presentation Transcript


  1. Document Data Mining Design Review. November 18, 2010. Team Members: Dallas Stinger, Wenlong Huang, Aaron Phillips. Advisor: Gregory Donohoe, Ph.D.

  2. The Problem
     • State Board collects meeting minutes and other documents recording decisions made
     • Board members want to retrieve text from old documents that relate to current issues
       • May not recall when issue was discussed
       • May not know exact keywords to search for

  3. The Existing Solution
     • Currently, all files exist on a large, unorganized shared network drive.
     • Finding information recorded in documents requires knowing when it was recorded, and in which document.

  4. Requirements / Design Decisions

  5. Multiple File Types
     • System limited to the major file types
       • Word documents (.doc, .docx)
       • PDF files (.pdf)
       • Excel (.xls, .xlsx)
       • Text (.txt)
     • Lacking
       • WordPerfect (.wpd)
       • PDF files that were scanned in
       • Open Office document types
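As a rough illustration of the file type restriction on this slide, the sketch below filters paths by extension. The class name `SupportedFileTypes` and the method `IsSupported` are hypothetical and do not come from the presentation. The extension list is copied from the slide.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper (not from the slides): checks whether a file
// has one of the supported extensions listed on the slide.
public static class SupportedFileTypes
{
    // Extensions taken from the slide; the comparison ignores case.
    private static readonly HashSet<string> Extensions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            ".doc", ".docx", ".pdf", ".xls", ".xlsx", ".txt"
        };

    public static bool IsSupported(string path)
    {
        // Path.GetExtension returns the extension with its leading dot,
        // or an empty string when the file name has no extension.
        return Extensions.Contains(Path.GetExtension(path));
    }
}
```

A check like this only looks at the file name, so a scanned PDF would still pass the filter even though its text cannot be extracted without optical character recognition.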

  6. Multi-User Access
     • Web Based
       • Pros:
         • Information searchable anywhere
         • Only one index required
         • Index on a regular basis without interruption
       • Cons:
         • File permissions
     • Individual User Application
       • Pros:
         • Can be programmed to learn user behavior
         • Apply more emphasis to files he/she used before (looks at search history to aid in new searches)
       • Cons:
         • Software package installed on each user's machine

  7. Search Collection of Documents Efficiently
     Results displayed in less than a second
     • Real Time Searching
       • Pros:
         • Easy
         • No initial overhead
       • Cons:
         • Time consuming (> 100,000 words)
         • Unable to find non-exact search results
     • Reverse Indexing
       • Pros:
         • Fast and efficient
         • Able to find useful information without exact search text known
       • Cons:
         • Large initial overhead (pre-analyze all documents)
         • Index file must be kept up to date
         • Storage space necessary
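The trade-off between the two approaches can be seen in a minimal sketch of a reverse (inverted) index. This is an illustration under assumptions, not the team's implementation: the class `ReverseIndex`, its methods, and the simple separator-based tokenizer are all hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of a reverse index: each word maps to the set of
// documents that contain it.
public class ReverseIndex
{
    private readonly Dictionary<string, HashSet<string>> postings =
        new Dictionary<string, HashSet<string>>(StringComparer.OrdinalIgnoreCase);

    // Assumed tokenizer: split on whitespace and basic punctuation.
    private static readonly char[] Separators =
        { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?' };

    // Pre-analysis step (the "large initial overhead"): every word of a
    // document is recorded once, before any search takes place.
    public void AddDocument(string documentId, string text)
    {
        foreach (string word in text.Split(Separators, StringSplitOptions.RemoveEmptyEntries))
        {
            HashSet<string> documents;
            if (!postings.TryGetValue(word, out documents))
            {
                documents = new HashSet<string>();
                postings[word] = documents;
            }
            documents.Add(documentId);
        }
    }

    // Query step: one dictionary lookup instead of rescanning every file.
    public IReadOnlyCollection<string> Lookup(string word)
    {
        HashSet<string> documents;
        if (postings.TryGetValue(word, out documents))
        {
            return documents;
        }
        return new string[0];
    }
}
```

The dictionary is the storage cost named on the slide, and `AddDocument` has to be called again whenever a file changes, which is the upkeep cost.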

  8. Find Useful Information Without Exact String Specification (A: Stemming)
     • Create our own
       • Pros:
         • Pay attention to details that may be lacking in existing algorithms (aglet vs. readable)
         • More efficient
         • Define special cases
       • Cons:
         • Requires a lot of time
     • Use existing algorithm
       • Pros:
         • Readily available
         • Spend more time on other important details
       • Cons:
         • Special cases incorrect
         • Some root words are truncated

  9. Porter Stemming Algorithm
     • Large set of steps based on English natural language to determine the root of a word
     • Extensively used in programs
     • Outdated: results not always correct
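To give a feel for the kind of rule the Porter algorithm applies, the sketch below implements only its first step (step 1a, plural removal). The full algorithm has several further steps that are not shown, and the class name `PorterStep1a` is hypothetical.

```csharp
using System;

// Step 1a of the Porter stemming algorithm only (plural removal).
// The remaining steps of the algorithm are intentionally omitted.
public static class PorterStep1a
{
    public static string Apply(string word)
    {
        string w = word.ToLowerInvariant();

        // Rules are tried from the longest suffix to the shortest.
        if (w.EndsWith("sses", StringComparison.Ordinal))
            return w.Substring(0, w.Length - 2); // sses -> ss
        if (w.EndsWith("ies", StringComparison.Ordinal))
            return w.Substring(0, w.Length - 2); // ies -> i
        if (w.EndsWith("ss", StringComparison.Ordinal))
            return w;                            // ss -> ss (unchanged)
        if (w.EndsWith("s", StringComparison.Ordinal))
            return w.Substring(0, w.Length - 1); // s -> (removed)

        return w;
    }
}
```

With these rules, `PorterStep1a.Apply("caresses")` yields `caress`, while `ponies` becomes `poni`. The second result is an example of the truncated root words listed as a drawback of using an existing algorithm.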

  10. Find Useful Information Without Exact String Specification (B: Thesaurus)
     • Own Model
       • Pros:
         • Fine tune thesaurus to have only relevant terms (terms that exist inside our index file)
       • Cons:
         • Very time consuming and complex
     • Using pre-built Thesaurus
       • Pros:
         • Quick and easy to use
         • Very extensive
       • Cons:
         • Has irrelevant search term results
         • Unnecessary terms for State Board
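One possible middle ground between the two options is to start from a pre-built thesaurus and keep only the synonyms that actually occur in the index. The sketch below shows that idea under assumptions: the slides do not describe this approach, and the class `ThesaurusPruner` and the word-to-synonym-list data shape are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch: trim a pre-built thesaurus to terms found in the index.
public static class ThesaurusPruner
{
    // thesaurus:    word -> synonyms from a pre-built source (assumed shape).
    // indexedWords: every word that appears in the index file.
    public static Dictionary<string, List<string>> Prune(
        IDictionary<string, List<string>> thesaurus,
        ISet<string> indexedWords)
    {
        var pruned = new Dictionary<string, List<string>>();
        foreach (KeyValuePair<string, List<string>> entry in thesaurus)
        {
            // Keep only synonyms that could match something in the index.
            List<string> relevant = entry.Value
                .Where(synonym => indexedWords.Contains(synonym))
                .ToList();

            // Drop entries whose synonyms never occur in any document.
            if (relevant.Count > 0)
            {
                pruned[entry.Key] = relevant;
            }
        }
        return pruned;
    }
}
```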

  11. Searching
     • User types in search criteria
     • Determine whether they want Narrow Search results or Broad Search results
       • May retrieve too many results in Broad Search
     • Search algorithm converts each typed word into a list of possible stems and synonyms
     • Tries all possible permutations of words, trying to find the closest match to the search
     • Calculate standard deviation of the distance between all of the words
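The slide does not say exactly how the distance between words is measured. The sketch below shows one possible reading: take the word offsets at which the search terms were found in a document, compute the gaps between neighboring hits, and return the population standard deviation of those gaps. The class `WordSpacing` and this interpretation are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch of one reading of "standard deviation of the
// distance between all of the words".
public static class WordSpacing
{
    // positions: word offsets in a document where search terms were found.
    public static double GapStandardDeviation(IEnumerable<int> positions)
    {
        List<int> sorted = positions.OrderBy(p => p).ToList();
        if (sorted.Count < 2)
        {
            return 0.0; // fewer than two hits means there is no gap to measure
        }

        // Distances between neighboring hits.
        var gaps = new List<double>();
        for (int i = 1; i < sorted.Count; i++)
        {
            gaps.Add(sorted[i] - sorted[i - 1]);
        }

        // Population standard deviation of the gaps.
        double mean = gaps.Average();
        double variance = gaps.Sum(g => (g - mean) * (g - mean)) / gaps.Count;
        return Math.Sqrt(variance);
    }
}
```

For hits at offsets 0, 2 and 6 the gaps are 2 and 4, so the result is 1. A value of 0 only means the hits are evenly spaced, so a measure like this would probably need to be combined with the size of the gaps themselves.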

  12. Searching (cont.)
     • Each file is ranked based on the number of matches it contains
       • Exact matches rank highest
       • Reordered exact matches rank next
       • Stems, synonyms, partial matches, and large spacing between searched words rank lowest
     • All rank values found inside a file are summed
     • Highest ranked files are considered most relevant
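The ranking rule on this slide can be sketched as a weighted sum. The slides give only the ordering of match types, so the numeric weights below (10, 5, 1) are placeholders, and the names `MatchKind` and `FileRanker` are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Match categories in the order given on the slide.
// Loose covers stems, synonyms, partial matches, and widely spaced words.
public enum MatchKind { Exact, Reordered, Loose }

public static class FileRanker
{
    // Placeholder weights: only their relative order comes from the slide.
    public static int Weight(MatchKind kind)
    {
        switch (kind)
        {
            case MatchKind.Exact: return 10;
            case MatchKind.Reordered: return 5;
            default: return 1;
        }
    }

    // All rank values found inside one file are summed.
    public static int Score(IEnumerable<MatchKind> matches)
    {
        return matches.Sum(m => Weight(m));
    }

    // Files are returned with the highest score first.
    public static List<string> RankFiles(IDictionary<string, List<MatchKind>> matchesByFile)
    {
        return matchesByFile
            .OrderByDescending(entry => Score(entry.Value))
            .Select(entry => entry.Key)
            .ToList();
    }
}
```

Because scores are summed, a file with many weak matches can outrank a file with a single exact match if the weights are close together, so the choice of weights matters in practice.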

  13. Unit Testing

  14. Unit Testing
     • Benefits
     • Goal
     • Facilitates change
     • Limitations
     • Not omnipotent
     • Low cost performance

  15. Unit Testing: DocumentTest

          /// Returns the document location
          public void getFileLocationTest()
          {
              convertPDF converpdf = new convertPDF("D:\\Class\\test.pdf");
              string expected = "D:\\Class\\test.pdf";
              string actual = converpdf.getFileLocation();
              // Assert.AreEqual takes the expected value first, then the actual value.
              Assert.AreEqual(expected, actual);
          }

  16. Unit Testing

          /// Creates word count in alphabetical order for all words located inside PDF
          public void createDictionaryTest()
          {
              convertPDF converpdf = new convertPDF("D:\\Class\\test.pdf");
              string toDictionary = "this is test code code code";
              converpdf.createDictionary(toDictionary);
              int actual;
              // "code" appears three times in the input string.
              converpdf.WordCounts.TryGetValue("code", out actual);
              Assert.AreEqual(3, actual);
          }
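The test on this slide expects word counts kept in alphabetical order. The sketch below shows one way such counts could be built, using a `SortedDictionary`. It is not the team's `createDictionary` method, which the slides do not show. The names `WordCounter` and `CountWords` are hypothetical, and splitting on spaces only is an assumption.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: count words and keep the keys in alphabetical order.
public static class WordCounter
{
    public static SortedDictionary<string, int> CountWords(string text)
    {
        // SortedDictionary keeps its keys ordered by the given comparer.
        var counts = new SortedDictionary<string, int>(StringComparer.Ordinal);

        foreach (string word in text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
        {
            int current;
            counts.TryGetValue(word, out current); // current stays 0 for a new word
            counts[word] = current + 1;
        }
        return counts;
    }
}
```

For the input used in the test, `"this is test code code code"`, the count for `code` is 3 and the keys come out in the order `code`, `is`, `test`, `this`.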

  17. End of Semester Status
     • Goals:
       • Working, tested prototype
       • Documentation for future teams
     • Plenty of areas open for extension or improvement

  18. Future Possibilities: File Types
     • Currently supported file types
       • Microsoft Word
       • Microsoft Excel
       • PDF
     • No optical character recognition
     • Our system will allow for easy extension

  19. Future Possibilities: Indexing
     • We have a relatively simple indexing scheme
       • More complex indexing would lead to decreased search time
     • Our indexing scheme is very general
       • Could be specific to the State Board
       • Could lead to more relevant results

  20. Future Possibilities: Searching
     • Search time increases quickly as search terms are added
     • Thesaurus is broad
       • Large number of synonyms can slow search
       • Could be trimmed to fit domain
     • Porter stemming algorithm could be replaced
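The growth in search time follows from the search design described earlier, in which each typed word is expanded into stems and synonyms and all orderings of the words are tried. The sketch below estimates the number of candidate phrases under that reading. It is a rough cost model, not a measurement of the team's software, and the names `SearchCost` and `CandidateCount` are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical cost model: candidates = (product of forms per term) * n!
public static class SearchCost
{
    // alternativesPerTerm[i] is the number of forms tried for term i:
    // the word itself plus its stems and synonyms.
    public static long CandidateCount(IList<int> alternativesPerTerm)
    {
        long count = 1;
        for (int i = 0; i < alternativesPerTerm.Count; i++)
        {
            count *= alternativesPerTerm[i]; // choice of form for this term
            count *= i + 1;                  // accumulates n! for the orderings
        }
        return count;
    }
}
```

Under this model, three terms with five forms each give 5 × 5 × 5 × 3! = 750 candidates, and a fourth such term raises that to 15,000. This is why trimming the thesaurus reduces search time.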

  21. Future Possibilities: Correlation
     • Related documents should be correlated
       • By date?
       • Using a tagging system?

  22. Future Possibilities: Decision Database
     • A client need that is not addressed by our software
     • Many board decisions have been passed, with varying lifetimes
     • A database could track all board decisions and their lifespans
     • Possible connection to our search engine?

  23. Future Possibilities: Web-Based Interface
     • Software will be installed on each user’s computer
     • GUI could be web based, with access restricted to State Board employees
     • Users could search from home or while on the road, not just in the office
     • Indexing would be simplified

  24. Questions?
