
Data Analysis and Parsing


Presentation Transcript


  1. Data Analysis and Parsing

  2. Data Analysis and Parsing Agenda: • Data Management Definition • Parsing • fdsys.xml

  3. Data Management Definition (DMD)

  4. Data Management Definition (DMD) • Purpose of the Data Management Definition (DMD) • Define collection-specific metadata elements • Specify roles for the granules, if applicable • Collection-specific schema definition for FDsys.xsd • Define mappings of metadata elements for Documentum and FAST • Define mappings to metadata standards • One DMD for each collection • PMO & dev team collaborative effort for CDM documentation development • Is both a document and a process

  5. The DMD Defines How Data Flows Through FDsys • How will parser data and input files be validated? • What renditions are available? • How will the MODS be created? • How will metadata be extracted and merged? • How will the HTML rendition be created? • What manual edits may be required? • How will the content and metadata be indexed? • How are PDF files processed? • What’s on the search form? • What are the navigators? • What do content URLs look like? • How are search results formatted?

  6. DMD – Table of Contents • General Description • fdsys.xml Schema Elements • Renditions, Plant Processing and Interactions • Parser Definition – Extraction patterns and algorithms • Content Management • Content Publishing and Index • Search and Browse • Search results, navigators, and collection browsing • Content Delivery • URLs, content-detail, Front page, actions • mods.xml mappings

  7. Metadata Flow Diagram

  8. Metadata Flow Diagram • fdsys.xml structure • parsing rules • search index field mapping • CMS metadata mapping • mods mapping • search results mapping • content-detail mapping • browse algorithm • search-form mapping

  9. Federal Register Granules • Each article is a granule • Each Part is a single granule • There are no higher-level granules • Sections are not preserved as independent granules

  10. Federal Register Example Metadata • agencies • title • action • summary • dates • contact • FR Doc Number • Billing Code

  11. Content Files [Diagram labels, in slide order: Input Files; Renditions; locator; locator; SGML; SGML; text; CDTP; extract granules; pdf-submitted; PDF; OCR embedded images; extract granules; pdf (public)] • Create “FrontMatter”, “ReaderAids”, and “Issue” PDF files

  12. Content Files – Creating the HTML Rendition [Diagram labels, in slide order: text; add HTML headers and header metadata; add URL and e-mail links; embed image tags; pdf-submitted; html; extract images as JPEG; images; html (public); longdesc text; OCR images]

  13. Extracting Metadata [Diagram labels, in slide order: SGML TOC (TOC headings); parse; SGML content; parse; Merged Metadata; CDTP; parse; overwrite; add] • Metadata is merged based on the FR Doc Number
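
A minimal sketch of merging on the FR Doc Number, assuming each parsed source yields field/value pairs per FR Doc Number. The class name `MetadataMerger`, its methods, and the sample values are hypothetical, and the slide does not say which source overwrites and which only adds, so both operations are shown without tying them to a source.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: metadata from several parses is merged per FR Doc Number.
final class MetadataMerger {
    // FR Doc Number -> (field name -> value)
    private final Map<String, Map<String, String>> merged = new LinkedHashMap<>();

    // "overwrite": incoming values replace any existing value for the same field.
    void overwrite(String frDocNumber, Map<String, String> fields) {
        merged.computeIfAbsent(frDocNumber, k -> new LinkedHashMap<>()).putAll(fields);
    }

    // "add": incoming values only fill fields that are not present yet.
    void add(String frDocNumber, Map<String, String> fields) {
        Map<String, String> target =
            merged.computeIfAbsent(frDocNumber, k -> new LinkedHashMap<>());
        fields.forEach(target::putIfAbsent);
    }

    // Returns the merged fields, or an empty map for an unknown FR Doc Number.
    Map<String, String> get(String frDocNumber) {
        return merged.getOrDefault(frDocNumber, Map.of());
    }

    public static void main(String[] args) {
        MetadataMerger m = new MetadataMerger();
        m.overwrite("E6-1423", Map.of("title", "title from one source"));
        m.add("E6-1423", Map.of("title", "title from another source", "action", "Notice"));
        System.out.println(m.get("E6-1423"));
    }
}
```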

  14. Search Results • Fields shown: action (first 20 chars), collection, firstpage, rin, volume, section, title, publishdate, teaser, link to content-detail • Example: 73 FR 22020 - Title I-Improving the Academic Achievement of the Disadvantaged [PDF 123 KB] Federal Register. Proposed Rules. Notice of proposed rulemaking. RIN 0324-AJ10. Wednesday, April 23, 2008. ...The Secretary proposes to amend the regulations governing programs administered under Part A of Title I of the Elementary and Secondary Education Act of 1965, as amended (ESEA)... More Information...

  15. FR Navigators • Section • Agency • CFRs • Hierarchical: + 15 CFR - Part 12 - Part 13 - Part 14 + 16 CFR - Part 412 - Part 413

  16. Collection Browsing • yearnav • monthnav • daynav • agencynav

  17. Advanced Search Form

  18. Package-Level URLs • Package Content Detail • http://www.gpo.gov/fdsys/pkg/FR-2006-01-01/content-detail.html • Package Metadata Standards • http://www.gpo.gov/fdsys/pkg/FR-2006-01-01/mods.xml • http://www.gpo.gov/fdsys/pkg/FR-2006-01-01/premis.xml • Package Table of Contents • http://www.gpo.gov/fdsys/pkg/FR-2006-01-01/toc.html • Today’s Table of Contents • http://www.gpo.gov/fdsys/html/FR/todays_toc.html

  19. Granule-Level URLs • HTML and PDF Files • http://www.gpo.gov/fdsys/pkg/FR-2006-01-01/html/E6-1423.html • http://www.gpo.gov/fdsys/pkg/FR-2006-01-01/pdf/E6-1423.pdf • Granule Content Detail • http://www.gpo.gov/fdsys/granule/FR-2006-01-01/E6-1423/content-detail.html • Granule Metadata Standards • http://www.gpo.gov/fdsys/granule/FR-2006-01-01/E6-1423/mods.xml
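
The package-level and granule-level URLs listed on these two slides follow a regular pattern, which can be sketched as a small builder. The class and method names below are hypothetical; only the URL shapes come from the slides.

```java
// Sketch of the URL patterns shown on the slides; names are illustrative only.
final class FdsysUrls {
    private static final String BASE = "http://www.gpo.gov/fdsys";

    // Package-level files such as content-detail.html, mods.xml, premis.xml, toc.html.
    static String packageFile(String packageId, String fileName) {
        return BASE + "/pkg/" + packageId + "/" + fileName;
    }

    // Granule renditions live under the package; rendition is "html" or "pdf".
    static String granuleRendition(String packageId, String granuleId, String rendition) {
        return BASE + "/pkg/" + packageId + "/" + rendition + "/" + granuleId + "." + rendition;
    }

    // Granule-level files such as content-detail.html and mods.xml.
    static String granuleFile(String packageId, String granuleId, String fileName) {
        return BASE + "/granule/" + packageId + "/" + granuleId + "/" + fileName;
    }

    public static void main(String[] args) {
        System.out.println(packageFile("FR-2006-01-01", "mods.xml"));
        System.out.println(granuleRendition("FR-2006-01-01", "E6-1423", "pdf"));
        System.out.println(granuleFile("FR-2006-01-01", "E6-1423", "content-detail.html"));
    }
}
```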

  20. Content Detail Sample UI

  21. Parsing

  22. Parsing Overview • Runs regular expressions to extract metadata • Regular expression: (Public Law|Pub. L.|PL|P. L.) (1[0-9][0-9])-([0-9]+) • Example: Pub. L. 109-130 • Produces: <law congress="109" number="130"/> • Written in Java • Called from Documentum when a package needs to be parsed • Produces an instance of fdsys.xml • Parsing has an internal XML format (called the “raw” XML) which is transformed to produce the fdsys.xml
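
The regular-expression step can be illustrated with a short sketch using `java.util.regex`. The class `LawCitationParser` is hypothetical and is not the FDsys parser; the pattern is the one from the slide, except that the literal dots are escaped here (an adjustment, since an unescaped dot matches any character).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: turns the first public-law citation in a text into a <law/> element.
final class LawCitationParser {
    // Slide pattern with literal dots escaped; group 2 = congress, group 3 = law number.
    private static final Pattern PUBLIC_LAW =
        Pattern.compile("(Public Law|Pub\\. L\\.|PL|P\\. L\\.) (1[0-9][0-9])-([0-9]+)");

    // Returns the element as a string, or null when the text contains no citation.
    static String extract(String text) {
        Matcher m = PUBLIC_LAW.matcher(text);
        if (!m.find()) {
            return null;
        }
        return "<law congress=\"" + m.group(2) + "\" number=\"" + m.group(3) + "\"/>";
    }

    public static void main(String[] args) {
        System.out.println(extract("Pub. L. 109-130"));
    }
}
```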

  23. Parser Foundation Classes PContainer PParser PPackage PRendition PFile PGranule USCODEParser USCODEPackage USCODERendition USCODEFile USCODEGranule FRRendition FRFile FRFile • Foundation classes handle 95% of parsing needs • Derived classes handle all special cases

  24. PContainer • Takes patterns and produces elements • Holds XML at each level of the parsing process • PPattern (used_by PContainer), e.g. "(Public Law|Pub. L.|P. L.) (1[0-9][0-9])-([0-9]+)" • PContainer produces an XML fragment, e.g. <publicLaw><congressNum>109</congressNum><lawNum>123</lawNum></publicLaw> • The XML fragment is stored_in the XML DOM
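
A rough sketch of the "pattern in, element out, stored in a DOM" idea, using the standard `javax.xml.parsers` and `org.w3c.dom` APIs. `ContainerSketch` is a hypothetical stand-in, not the real PContainer, whose interface the slides do not show; the element names follow the fragment on the slide.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical container: applies a pattern and keeps the produced elements in a DOM.
final class ContainerSketch {
    private final Document dom;
    private final Element root;

    ContainerSketch(String rootName) {
        try {
            dom = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("no XML parser available", e);
        }
        root = dom.createElement(rootName);
        dom.appendChild(root);
    }

    // Stores one <publicLaw> element per match; returns how many were produced.
    // Assumes group 2 is the congress number and group 3 the law number.
    int apply(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        int produced = 0;
        while (m.find()) {
            Element law = dom.createElement("publicLaw");
            Element congress = dom.createElement("congressNum");
            congress.setTextContent(m.group(2));
            Element number = dom.createElement("lawNum");
            number.setTextContent(m.group(3));
            law.appendChild(congress);
            law.appendChild(number);
            root.appendChild(law);
            produced++;
        }
        return produced;
    }

    Element root() {
        return root;
    }

    public static void main(String[] args) {
        ContainerSketch c = new ContainerSketch("granule");
        Pattern p = Pattern.compile("(Public Law|Pub\\. L\\.|P\\. L\\.) (1[0-9][0-9])-([0-9]+)");
        System.out.println(c.apply(p, "See Pub. L. 109-123.") + " element(s) stored");
    }
}
```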

  25. Parser Foundation Classes [Diagram labels: PContainer; PParser; PPackage; PRendition; PFile; PGranule; XML DOM; XML; XSLT; priority merge; append; xml]

  26. Parsing XML Documents [Diagram labels: PContainer; PParser; PPackage; PRendition; PFile; XML DOM; XML; XSLT; priority merge; append; fdsys.xml; bills.xml]

  27. Other Parsing Considerations • Heuristics testing is integrated into the parsing • PEHelper: Checks for heuristics and adds “quality=” attributes • Output can be automatically schema-validated • Schema validation is run on all fdsys.xml formats produced by the parser • Parser Validation Tool • Used by GPO to validate that parsers meet the 90% Service Level Agreement for accuracy • Randomly selects 100 documents or granules • Displays metadata & original text for manual review • Produces Validation Report
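
The sampling and the 90% check performed by the validation tool can be sketched as follows. This is an assumption-laden illustration: the class `ValidationSample` is hypothetical, and the slide does not say whether accuracy is counted per document, per granule, or per metadata field, so the sketch simply counts reviewed items marked correct.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustrative sketch of "randomly select 100, review manually, compare against 90%".
final class ValidationSample {
    static final int SAMPLE_SIZE = 100;        // from the slide
    static final int SLA_PERCENT = 90;         // from the slide

    // Randomly selects up to SAMPLE_SIZE ids without replacement.
    static List<String> select(List<String> ids, Random random) {
        List<String> shuffled = new ArrayList<>(ids);
        Collections.shuffle(shuffled, random);
        return new ArrayList<>(shuffled.subList(0, Math.min(SAMPLE_SIZE, shuffled.size())));
    }

    // True when the share of reviewed items marked correct is at least SLA_PERCENT.
    static boolean meetsSla(int correct, int reviewed) {
        if (reviewed <= 0 || correct < 0 || correct > reviewed) {
            throw new IllegalArgumentException("invalid review counts");
        }
        return (long) correct * 100 >= (long) reviewed * SLA_PERCENT;
    }

    public static void main(String[] args) {
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < 250; i++) {
            ids.add("doc-" + i);
        }
        System.out.println(select(ids, new Random()).size() + " items selected for review");
        System.out.println("92 of 100 correct meets SLA: " + meetsSla(92, 100));
    }
}
```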

  28. fdsys.xml

  29. FDsys.xml Purpose • Internal container of metadata related to package • Is a detailed representation/model of the data structure across all of FDsys • Reduces duplication of data across metadata formats • Reduces number of required transformations • Can be transformed into standard schemas including: METS, MODS, PREMIS

  30. FDsys.xml General Structure Header Content Metadata

  31. FDsys Publish and Search

  32. Publish and Search Agenda: • FDsys Publish • Search Engine Configuration • Search Engine Application Services

  33. FDsys Publish

  34. High-Level SW Components

  35. Content Publishing - Overview • Communicates from Documentum to Access • From: Documentum • Extract fdsys.xml & premis.xml • Extract renditions and content files • Uses Documentum native DFC calls • To: ACP Cache • Stores metadata and content files • To: FAST ESP Search Engine • Converts fdsys.xml to FAST.xml -> to indexer • Includes the mods.xml (indexed into ESP) • ESP pulls in content files automatically • Uses FAST ESP content_api & search_api calls

  36. Component Interfaces

  37. Component Interfaces

  38. Major Architectural Decisions • Pull from Documentum, not Push • Maintenance of Access Subsystem databases becomes the responsibility of the Access Subsystem • Data is pulled from Documentum only as needed • Avoids overflow/queuing problems • Allows multiple access systems to be fielded • Search for Deletes in FAST • Packages can contain many granules • When updating the FAST indexes, use search to find the list of all nested granules in the indexes • Guaranteed to avoid any “orphan” granule problems
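
The "search for deletes" decision can be illustrated with an in-memory stand-in for the index. Nothing below is the FAST ESP API; `PackageIndexSketch` and its methods are hypothetical and only show why searching the index for a package's existing granules before re-indexing leaves no orphans behind.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// In-memory stand-in for a search index; not the FAST ESP content or search API.
final class PackageIndexSketch {
    // indexed document id -> package id it belongs to
    private final Map<String, String> docs = new LinkedHashMap<>();

    void index(String packageId, String docId) {
        docs.put(docId, packageId);
    }

    // "Search" the index for every document currently stored for the package.
    List<String> searchByPackage(String packageId) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> e : docs.entrySet()) {
            if (e.getValue().equals(packageId)) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }

    // Update a package: delete whatever the index holds for it, then index the new granules.
    // Granules dropped from the new version are removed too, so none are orphaned.
    void republish(String packageId, List<String> newDocIds) {
        for (String stale : searchByPackage(packageId)) {
            docs.remove(stale);
        }
        for (String id : newDocIds) {
            index(packageId, id);
        }
    }

    int size() {
        return docs.size();
    }

    public static void main(String[] args) {
        PackageIndexSketch idx = new PackageIndexSketch();
        idx.index("FR-2006-01-01", "E6-1423");
        idx.index("FR-2006-01-01", "E6-1424");
        idx.republish("FR-2006-01-01", List.of("E6-1423"));
        System.out.println(idx.searchByPackage("FR-2006-01-01"));
    }
}
```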

  39. ACP Cache Directory Structure

  40. Metadata Flow Diagram

  41. Implementation Detail

  42. Search Engine Configuration Design

  43. Component Interfaces

  44. FAST System – Hardware & Network [Diagram labels: publish & admin; document processors; index & search (x5); search (x5); Web Application]

  45. FAST System – Indexing Flow [Diagram labels: publish & admin; document processors; index & search (x5); search (x5)]

  46. FAST System – Search [Diagram labels: index & search (x5); search (x5); QR server (x5); Web Application]

  47. Search Engine Sizing: Columns • Total Number of Documents • Estimated 10 million records • Each granule = 1 Search Engine document • Allow 2x expansion for estimation errors and growth • Estimated 20 million records • Sizing Recommendation: • FAST recommends: 5 million records per column • For public facing web sites • 5 columns: to account for the large number of navigators
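
The column arithmetic on this slide can be checked with a few lines of code. The class name is hypothetical; the figures (10 million records, 2x expansion, 5 million records per column) are the slide's. Note that the minimum by this rule is 4 columns, and the slide recommends 5 to account for the large number of navigators.

```java
// Worked check of the column sizing; names are illustrative.
final class ColumnSizing {
    // Smallest number of columns that keeps each column at or below perColumn documents.
    static long minColumns(long documents, long perColumn) {
        if (documents < 0 || perColumn <= 0) {
            throw new IllegalArgumentException("invalid sizing input");
        }
        return (documents + perColumn - 1) / perColumn; // ceiling division
    }

    public static void main(String[] args) {
        long estimated = 10_000_000L;      // estimated records
        long sized = estimated * 2;        // 2x allowance for estimation errors and growth
        long perColumn = 5_000_000L;       // vendor guidance quoted on the slide
        System.out.println("sized documents: " + sized);
        System.out.println("minimum columns: " + minColumns(sized, perColumn));
    }
}
```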

  48. Search Engine Sizing: Disk • Year 2006 FR – Index Sizing Test • Scale to 20 million documents • Fixml: ~150 GB • Index: ~420 GB • Total index space required: • 150 GB + (420 GB) x 2 = 990 GB, ~1 TB • Add 50% for estimation error, total = 1.5 TB
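
Written out, the disk estimate uses only the slide's figures; the slide does not state why the index size is doubled, so the factor of 2 is taken as given.

```latex
\[
150\ \text{GB} + 2 \times 420\ \text{GB} = 990\ \text{GB} \approx 1\ \text{TB},
\qquad
1\ \text{TB} \times 1.5 = 1.5\ \text{TB}
\]
```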

  49. Search Engine Sizing: QPS • Queries per second – Estimated from GPO Access • 0.8 QPS (across the whole day) • Estimated peak: 2.4 QPS (1/2 of queries in 4 hours) • Estimated Peak QPS for FDsys: • Factor for improved search interface: 3x • Factor for growth: 2x • Estimated: 2.4 x 2 x 3 = 14.4, or ~15 QPS • Correlates with other websites known to ST • Each row: 20-30 QPS • Therefore: 1 row for query performance • Recommend: 2 rows • 2nd row for redundancy, failover, and maintenance
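
The peak figure follows from the stated assumption that half of a day's queries arrive within 4 hours, which makes the peak rate three times the daily average. The derivation below uses only the slide's numbers.

```latex
\[
\text{peak} = 0.8\ \text{QPS} \times \frac{1/2}{4/24} = 0.8 \times 3 = 2.4\ \text{QPS}
\]
\[
\text{FDsys peak} = 2.4 \times 2 \times 3 = 14.4 \approx 15\ \text{QPS}
\]
```

Since 15 QPS is below the 20-30 QPS quoted per row, one row covers query load, and the second row is there for redundancy rather than throughput.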

  50. Metadata Flow Diagram • search index field mapping
