
Data Process Methodology for Search (DPMS) Paul Nelson pnelson@searchtechnologies.com

Presentation Transcript


  1. Data Process Methodology for Search (DPMS) Paul Nelson pnelson@searchtechnologies.com

  2. DPMS: Example 1

  3. DPMS: Example 2

  4. DPMS: Example 3 (FY 2011)

  5. DPMS: Example 4 (FY 201?)

  6. DPMS: Example 5 CPA (sorry, I don’t have a screenshot)

  7. What is Aspire??? • ASPIRE: The Automated Assembly Line for Documents

  8. What is DPMS??? • A Methodology for Creating Good Assembly Lines For Search • Document Process Methodology for Search

  9. What is DPMS? (diagram) • Technologies: MS-Windows, Java, .NET, Linux, Search Engine, Aspire • Management Processes: ISO-9000, 6-Sigma, Agile, Waterfall, DPMS • Parties shown: The Customer + Search Technologies

  10. What is DPMS? DPMS is the 6-Sigma of Document Preparation for Search

  11. Are you DPMS Compliant? Certified? • Inputs: Identified & Documented • Validated • Virus Checked • Metadata: Identified & Documented • Fields named • Structure and arity known • Schema V • File Processing: Identified & Documented • File names & formats specified • Index Fields: Identified & Documented • Fields mapped from metadata • Search Fields: Identified & Documented • Fields mapped from search engine
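A checklist item such as "Fields named, structure and arity known" can be checked mechanically once the schema is written down. The following is a minimal sketch, assuming a hypothetical `FieldSpec` record and `validate_record` helper; neither name comes from the presentation or from Aspire.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    """One metadata field as a DMD might document it (hypothetical shape)."""
    name: str
    multi_valued: bool = False
    required: bool = True


def validate_record(record: dict, schema: list[FieldSpec]) -> list[str]:
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    known = {spec.name for spec in schema}
    for spec in schema:
        if spec.name not in record:
            if spec.required:
                problems.append(f"missing required field: {spec.name}")
            continue
        is_list = isinstance(record[spec.name], list)
        if spec.multi_valued and not is_list:
            problems.append(f"{spec.name}: expected multiple values")
        elif not spec.multi_valued and is_list:
            problems.append(f"{spec.name}: expected a single value")
    # Fields that appear in the data but not in the schema are undocumented.
    for name in record:
        if name not in known:
            problems.append(f"undocumented field: {name}")
    return problems


# Illustrative schema only; the field names are assumptions.
SCHEMA = [
    FieldSpec("title"),
    FieldSpec("authors", multi_valued=True),
    FieldSpec("summary", required=False),
]
```

Running `validate_record` over a sample of each collection's input is one way to turn "Identified & Documented" into something that can be verified.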

  12. DPMS and Aspire Work Together • DPMS: • A methodology for creating awesome assembly lines for documents • Is 100% software independent • Produces Design and Architecture Documents • Aspire: • A software framework and toolset that can be used to implement DPMS and enhance search • Search engine independent • Architected to handle systems with very many & very complex collections

  13. GPO: Before DPMS Everyone is working on their own function (no one is looking out for the data)

  14. GPO: After DPMS Data flow is documented through the system (this is done for each and every collection)

  15. The DMD Defines How Data Flows Through The System • How will parser data and input files be validated? • What renditions are available? • How will the MODS be created? • How will metadata be extracted and merged? • How will the HTML rendition be created? • What manual edits may be required? • How will the content and metadata be indexed? • How are PDF files processed? • What’s on the search form? • What are the navigators? • What do content URLs look like? • How are search results formatted?

  16. The Challenge at GPO

  17. DPMS High-Level View • 1. Assessment: expert assessment and recommendations (Search Technologies Architect and Business Analyst); output: Assessment Report • 2. Detailed Analysis: DPMS Analysis (Knowledge Engineer, Business Analyst, etc.) and Review (Architect, Domain Experts, Peers); output: DMDs • 3. Execution: Implementation (Developer) and Validation (Validate DMDs), using Aspire and the Search Engine

  18. Aspects of the DMD • Describes a “Horizontal Slice” through the application • One per collection of data • Documents all metadata mappings throughout • Parsing • Storage representation & fields • Data value representations & mappings • Documents all file processing throughout • Documents search methods and presentations

  19. The DMD = Data Model Design The DMD Drives the Whole Process • Introduction • Metadata Schema • Input Files • File Parsing → Metadata Extraction • File Processing → Renditions, formats • Metadata to Index Mapping • Index to Search & Browse Mapping • Metadata to Detail Display Mapping

  20. Aspire Pipeline • Inputs: Outside Sources, Outside DBs, Files or DBs • Stages: Packaging → Document Pre Processing → Parsing & Extraction → Enrichment → Document Post Processing → Transform and Load → Search Engine • A Quarantine is shown for each of the six processing stages
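The staged pipeline with a quarantine per stage can be sketched in a few lines. This is an illustration of the idea only, not Aspire's API: `run_pipeline`, `parse`, and `enrich` are hypothetical names, and the two stages stand in for the longer chain on the slide.

```python
from typing import Callable

Document = dict
Stage = Callable[[Document], Document]


def parse(doc: Document) -> Document:
    """Hypothetical parsing stage: rejects documents with no raw content."""
    if not doc.get("raw"):
        raise ValueError("empty input")
    return {**doc, "text": doc["raw"].strip()}


def enrich(doc: Document) -> Document:
    """Hypothetical enrichment stage: adds a simple derived field."""
    return {**doc, "word_count": len(doc["text"].split())}


def run_pipeline(docs: list[Document], stages: list[tuple[str, Stage]]):
    """Send each document through the stages in order.

    A document that fails a stage is placed in that stage's quarantine,
    together with the error message, and the run continues with the
    next document.
    """
    loaded = []
    quarantine = {name: [] for name, _ in stages}
    for doc in docs:
        current = doc
        for name, stage in stages:
            try:
                current = stage(current)
            except Exception as err:
                quarantine[name].append((doc, str(err)))
                break
        else:
            # Reached only when no stage failed for this document.
            loaded.append(current)
    return loaded, quarantine
```

Keeping a separate quarantine per stage records where a document failed, which is the information a data analyst needs to correct the input or the DMD.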

  21. Just a few things in the DMD • Field Names • Mapping formulas • 111 = 111th Congress (2009-2010) • Navigators (names, where from, how displayed) • Format Translations (.doc → .txt → .html → .pdf) • Data structure • Single value, multi-value, optional, grouped • Document Structure • Hierarchies, granules, sections, chapters
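The mapping formula on this slide (111 = 111th Congress (2009-2010)) is the kind of rule a DMD would spell out so that a developer can implement it directly. A minimal sketch, assuming each Congress is labeled with two calendar years and that the 1st Congress began in 1789; the function names are illustrative.

```python
def ordinal(n: int) -> str:
    """Return n with its English ordinal suffix, e.g. 111 -> '111th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def congress_label(number: int) -> str:
    """Map a Congress number to a display label.

    Assumption: the label covers two calendar years, starting from
    1789 for the 1st Congress.
    """
    start = 1789 + 2 * (number - 1)
    return f"{ordinal(number)} Congress ({start}-{start + 1})"


print(congress_label(111))  # 111th Congress (2009-2010)
```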

  22. DPMS: From Document → Object • Data inside organizations are very messy • Multiple databases / sources, data types, etc. • Fragmented or incomplete data • An “object” can be: project, person, customer, transaction, product item, etc. • Moving From Documents → Objects • Combining data from multiple databases into larger, “virtual” documents, OR • Tagging documents so they can be grouped by object ID • Decomposing large documents so sections can be retrieved as manageable units but re-assembled if needed

  23. From Documents to Objects - Merging • Sources: Skills, Certifications, Time Cards, Web Site, Résumés • Merge → Combined Document
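Merging amounts to grouping records from several sources under one object ID. The sketch below assumes every record carries a shared key, here a hypothetical `employee_id`; the source names and fields are illustrative and not taken from a real system.

```python
from collections import defaultdict


def merge_by_object(sources: dict, key: str = "employee_id") -> dict:
    """Combine records from several sources into one document per object.

    `sources` maps a source name to a list of records. Each combined
    document keeps the object ID plus, per source, the list of records
    that mention that object.
    """
    combined = defaultdict(dict)
    for source_name, records in sources.items():
        for record in records:
            object_id = record[key]
            fields = {k: v for k, v in record.items() if k != key}
            combined[object_id].setdefault(key, object_id)
            combined[object_id].setdefault(source_name, []).append(fields)
    return dict(combined)


sources = {
    "skills": [{"employee_id": 7, "skill": "Java"}],
    "time_cards": [
        {"employee_id": 7, "hours": 8},
        {"employee_id": 9, "hours": 4},
    ],
}
documents = merge_by_object(sources)
```

Each value in `documents` is one "virtual" document that can be indexed as a single unit.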

  24. From Documents to Objects - Splitting Federal Register Granules

  25. Lots of other examples of splitting • Zip Files • Spread Sheets • RDBMS Tables • XML Data Records • Newspapers • Blog Entries
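Splitting works in the opposite direction: one large input becomes many small units that can be retrieved on their own and re-assembled if needed. A minimal sketch, assuming the input has already been divided into sections; the granule field names are hypothetical.

```python
def split_into_granules(doc_id: str, sections: list[str]) -> list[dict]:
    """Turn each section into a granule tagged with its parent and position."""
    return [
        {
            "granule_id": f"{doc_id}-{i}",
            "parent_id": doc_id,
            "sequence": i,
            "text": text,
        }
        for i, text in enumerate(sections, start=1)
    ]


def reassemble(granules: list[dict]) -> str:
    """Rebuild the original text from granules, in sequence order."""
    ordered = sorted(granules, key=lambda g: g["sequence"])
    return "\n\n".join(g["text"] for g in ordered)
```

Storing `parent_id` and `sequence` on every granule is what allows search results to be grouped by the parent document and the whole document to be rebuilt from its parts.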

  26. Samples • [SHOW] DMD for GPO • [SHOW] DMD for OLRC • [SHOW] DMD Template • [SHOW] Mini-DMD

  27. Other Advantages to the DMD & DPMS • Scalable • Data Analyst ≠ Programmer • Two different jobs with two different skill sets • Much easier to fill these roles if they are separate • The programmer’s job is more enjoyable • Doesn’t have to worry about data issues • Can just implement what’s in the DMD

  28. The Problems Are… • DPMS is hard to sell • DPMS is hard to describe to customers • We are “inventing” a methodology from scratch • This is hard • Giving it a name is step 1 • Next steps: solidify methods, determine what “certified DPMS” means • Need case studies • Needs work to define and communicate

  29. Hosting Possibilities? • “DPMS Level 3 Compliant Hosting Center” • Take customer through the process as we load their data into our hosting center • Provide all of the documentation back to them • Certified DPMS Level Three Search Systems

  30. But the upside is enormous • Multi-million dollar customers • GPO, CPA • Ideal for customers for whom “data is their product” • We become mission-critical to these customers • We can more easily justify the expense • Customers will see bottom-line value • We become much more valuable to the customer • Customers will want low-risk “tried and true” methodologies for these very complex and difficult tasks

  31. Questions? Presenter presenter@searchtechnologies.com
