Data Analysis and Parsing Agenda: • Data Management Definition • Parsing • fdsys.xml
Data Management Definition (DMD) • Purpose of the Data Management Definition (DMD) • Define collection-specific metadata elements • Specify roles for the granules, if applicable • Collection-specific schema definition for FDsys.xsd • Define mappings of metadata elements for Documentum and FAST • Define mappings to metadata standards • One DMD for each collection • PMO & dev team collaborative effort for CDM documentation development • Is both a document and a process
The DMD Defines How Data Flows Through FDsys • How will parser data and input files be validated? • What renditions are available? • How will the MODS be created? • How will metadata be extracted and merged? • How will the HTML rendition be created? • What manual edits may be required? • How will the content and metadata be indexed? • How are PDF files processed? • What’s on the search form? • What are the navigators? • What do content URLs look like? • How are search results formatted?
DMD – Table of Contents • General Description • fdsys.xml Schema Elements • Renditions, Plant Processing and Interactions • Parser Definition – Extraction patterns and algorithms • Content Management • Content Publishing and Index • Search and Browse • Search results, navigators, and collection browsing • Content Delivery • URLs, content-detail, Front page, actions • mods.xml mappings
Metadata Flow Diagram
Metadata Flow Diagram • fdsys.xml structure • parsing rules • search index field mapping • CMS metadata mapping • mods mapping • search results mapping • content-detail mapping • browse algorithm • search-form mapping
Federal Register Granules • Each article is a granule • Each Part is a single granule • There are no higher-level granules • Sections are not preserved as independent granules
Federal Register Example Metadata • agencies • title • action • summary • dates • contact • FR Doc Number • Billing Code
Content Files (diagram) • Columns: Input Files, Renditions • File types shown: locator, SGML, text, CDTP, pdf-submitted, PDF, pdf (public) • Steps shown: extract granules, OCR embedded images • Create “FrontMatter”, “ReaderAids”, and “Issue” PDF files
Content Files – Creating the HTML Rendition (diagram) • Files shown: text, pdf-submitted, html, images, html (public) • Add HTML headers and header metadata • Add URL and e-mail links • Embed image tags • Extract images as JPEG • OCR images (longdesc text)
Extracting Metadata (diagram) • Sources parsed: SGML TOC (TOC headings), SGML content, CDTP • Parsed results are combined into MergedMetadata using “overwrite” and “add” steps • Metadata is merged based on the FR Doc Number
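The merge keyed on FR Doc Number can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class name, the field names, and the exact meaning of the “overwrite” and “add” steps are hypothetical (here “overwrite” replaces existing values and “add” only fills in missing ones); the slides do not define them.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: not the FDsys parser's actual merge code.
class MetadataMergeSketch {
    /** Merged records: FR Doc Number -> (field name -> value). */
    private final Map<String, Map<String, String>> merged = new LinkedHashMap<>();

    /** Assumed "overwrite" step: incoming values replace existing values for the same field. */
    void overwrite(String frDocNumber, Map<String, String> fields) {
        merged.computeIfAbsent(frDocNumber, k -> new LinkedHashMap<>()).putAll(fields);
    }

    /** Assumed "add" step: incoming values only fill in fields that are still missing. */
    void add(String frDocNumber, Map<String, String> fields) {
        Map<String, String> record =
            merged.computeIfAbsent(frDocNumber, k -> new LinkedHashMap<>());
        fields.forEach(record::putIfAbsent);
    }

    /** Returns the merged record, or an empty map if the FR Doc Number is unknown. */
    Map<String, String> get(String frDocNumber) {
        return merged.getOrDefault(frDocNumber, Map.of());
    }

    public static void main(String[] args) {
        MetadataMergeSketch merge = new MetadataMergeSketch();
        // Illustrative values only; "E6-1423" is the identifier used in the URL examples.
        merge.overwrite("E6-1423", Map.of("title", "Title from SGML content"));
        merge.add("E6-1423", Map.of("title", "Title from TOC", "agency", "Agency from TOC"));
        System.out.println(merge.get("E6-1423"));
    }
}
```

Because every source is keyed on the same FR Doc Number, the order in which the sources are applied decides which value wins for a shared field.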
Search Results • Fields annotated on the sample result: action (first 20 chars), collection, firstpage, rin, volume, section, title, publishdate, teaser, link to content-detail • Sample result: 73 FR 22020 - Title I-Improving the Academic Achievement of the Disadvantaged [PDF 123 KB] Federal Register. Proposed Rules. Notice of proposed rulemaking. RIN 0324-AJ10. Wednesday, April 23, 2008. ...The Secretary proposes to amend the regulations governing programs administered under Part A of Title I of the Elementary and Secondary Education Act of 1965, as amended (ESEA)... More Information...
FR Navigators • Section • Agency • CFRs • Hierarchical: + 15 CFR - Part 12 - Part 13 - Part 14 + 16 CFR - Part 412 - Part 413
Collection Browsing yearnav monthnav daynav agencynav
Package-Level URLs • Package Content Detail • http://www.gpo.gov/fdsys/pkg/FR-2006-01-01/content-detail.html • Package Metadata Standards • http://www.gpo.gov/fdsys/pkg/FR-2006-01-01/mods.xml • http://www.gpo.gov/fdsys/pkg/FR-2006-01-01/premis.xml • Package Table of Contents • http://www.gpo.gov/fdsys/pkg/FR-2006-01-01/toc.html • Today’s Table of Contents • http://www.gpo.gov/fdsys/html/FR/todays_toc.html
Granule-Level URLs • HTML and PDF Files • http://www.gpo.gov/fdsys/pkg/FR-2006-01-01/html/E6-1423.html • http://www.gpo.gov/fdsys/pkg/FR-2006-01-01/pdf/E6-1423.pdf • Granule Content Detail • http://www.gpo.gov/fdsys/granule/FR-2006-01-01/E6-1423/content-detail.html • Granule Metadata Standards • http://www.gpo.gov/fdsys/granule/FR-2006-01-01/E6-1423/mods.xml
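The package-level and granule-level URL patterns listed above follow a regular shape, which a small Java helper can make explicit. This is a sketch inferred only from the example URLs on the slides; the class and method names are hypothetical, and other collections may use different patterns.

```java
// Hypothetical helper that reproduces the URL shapes shown on the slides.
class FdsysUrlSketch {
    // Base path taken from the example URLs.
    static final String BASE = "http://www.gpo.gov/fdsys";

    /** Package-level file, e.g. content-detail.html, mods.xml, premis.xml, toc.html. */
    static String packageUrl(String packageId, String file) {
        return BASE + "/pkg/" + packageId + "/" + file;
    }

    /** Granule rendition; the format ("html" or "pdf") is both folder and file extension. */
    static String renditionUrl(String packageId, String granuleId, String format) {
        return BASE + "/pkg/" + packageId + "/" + format + "/" + granuleId + "." + format;
    }

    /** Granule-level file, e.g. content-detail.html or mods.xml. */
    static String granuleUrl(String packageId, String granuleId, String file) {
        return BASE + "/granule/" + packageId + "/" + granuleId + "/" + file;
    }

    public static void main(String[] args) {
        System.out.println(packageUrl("FR-2006-01-01", "mods.xml"));
        System.out.println(renditionUrl("FR-2006-01-01", "E6-1423", "pdf"));
        System.out.println(granuleUrl("FR-2006-01-01", "E6-1423", "content-detail.html"));
    }
}
```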
Content Detail Sample UI
Parsing Overview • Runs regular expressions to extract metadata Regular Expression: (Public Law|Pub. L.|PL|P. L.) (1[0-9][0-9])-([0-9]+) Example: Pub. L. 109-130 Produces: <law congress="109" number="130"/> • Written in Java • Called from Documentum when a package needs to be parsed • Produces an instance of fdsys.xml • Parsing has an internal XML format (called the “raw” XML) which is transformed to produce the fdsys.xml
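The regular-expression step can be illustrated with standard java.util.regex. This is a minimal sketch, not the FDsys parser itself: the class and method names are hypothetical, and the pattern is copied from the slide as written, so its unescaped dots match any character.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of pattern-based metadata extraction.
class LawCitationSketch {
    // Pattern copied verbatim from the slide; "." is unescaped there and matches any character.
    static final Pattern PUBLIC_LAW =
        Pattern.compile("(Public Law|Pub. L.|PL|P. L.) (1[0-9][0-9])-([0-9]+)");

    /** Returns a <law/> element for the first citation found, or empty if there is none. */
    static Optional<String> extractLaw(String text) {
        Matcher m = PUBLIC_LAW.matcher(text);
        if (!m.find()) {
            return Optional.empty();
        }
        // Group 2 is the Congress number, group 3 is the law number.
        return Optional.of(String.format(
            "<law congress=\"%s\" number=\"%s\"/>", m.group(2), m.group(3)));
    }

    public static void main(String[] args) {
        // Prints: <law congress="109" number="130"/>
        System.out.println(extractLaw("as amended by Pub. L. 109-130").orElse("no match"));
    }
}
```

A production pattern would probably escape the dots (`Pub\. L\.`) so that strings such as "PubX LY" are not accepted by accident.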
Parser Foundation Classes PContainer PParser PPackage PRendition PFile PGranule USCODEParser USCODEPackage USCODERendition USCODEFile USCODEGranule FRRendition FRFile FRFile • Foundation classes handle 95% of parsing needs • Derived classes handle all special cases
PContainer • Takes patterns and produces elements • Holds XML at each level of the parsing process • Diagram: a PPattern, e.g. "(Public Law|Pub. L.|P. L.) (1[0-9][0-9])-([0-9]+)", is used by the PContainer • The PContainer produces an XML fragment, e.g. <publicLaw><congressNum>109</congressNum><lawNum>123</lawNum></publicLaw> • The fragment is stored in an XML DOM, which the PContainer uses
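The pattern-to-element idea behind PContainer can be sketched with the JDK's built-in DOM API. This is an assumption-laden illustration: ContainerSketch and its methods are hypothetical names, not the real PContainer interface, and only the element names publicLaw, congressNum, and lawNum come from the slide.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical stand-in for a container that holds a DOM and fills it from patterns.
class ContainerSketch {
    // Pattern as shown on the PContainer slide (dots unescaped, as in the original).
    static final Pattern PUBLIC_LAW =
        Pattern.compile("(Public Law|Pub. L.|P. L.) (1[0-9][0-9])-([0-9]+)");

    private final Document dom;
    private final Element root;

    ContainerSketch(String rootName) {
        try {
            dom = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No DOM implementation available", e);
        }
        root = dom.createElement(rootName);
        dom.appendChild(root);
    }

    /** Appends one <publicLaw> element per match and returns the number of matches. */
    int applyPublicLawPattern(String text) {
        Matcher m = PUBLIC_LAW.matcher(text);
        int count = 0;
        while (m.find()) {
            Element law = dom.createElement("publicLaw");
            law.appendChild(textElement("congressNum", m.group(2)));
            law.appendChild(textElement("lawNum", m.group(3)));
            root.appendChild(law);
            count++;
        }
        return count;
    }

    /** Returns the text of the first element with the given tag, or null if absent. */
    String firstText(String tag) {
        return root.getElementsByTagName(tag).getLength() == 0
            ? null
            : root.getElementsByTagName(tag).item(0).getTextContent();
    }

    private Element textElement(String name, String value) {
        Element e = dom.createElement(name);
        e.setTextContent(value);
        return e;
    }

    public static void main(String[] args) {
        ContainerSketch granule = new ContainerSketch("granule"); // root name is illustrative
        int found = granule.applyPublicLawPattern("See Public Law 109-123 for details.");
        System.out.println(found + " match(es), congressNum=" + granule.firstText("congressNum"));
    }
}
```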
Parser Foundation Classes (diagram) • PContainer types shown: PParser, PPackage, PRendition, PFile, PGranule • Each container holds an XML DOM and produces XML • Operations shown for combining the XML: XSLT, priority merge, append
Parsing XML Documents (diagram) • PContainer types shown: PParser, PPackage, PRendition, PFile • Each container holds an XML DOM and produces XML • Operations shown: XSLT, priority merge, append • Files shown: fdsys.xml, bills.xml
Other Parsing Considerations • Heuristics testing is integrated into the parsing • PEHelper: Checks for heuristics and adds “quality=” attributes • Output can be automatically Schema-Validated • Schema-Validation is run on all fdsys.xml formats produced by the parser • Parser Validation Tool • Used by GPO to validate that parsers meet the 90% Service Level Agreement for accuracy • Randomly selects 100 documents or granules • Displays metadata & original text for manual review • Produces Validation Report
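The sampling and threshold check performed during parser validation can be sketched as follows. This is a hedged illustration: the class and method names are hypothetical, and how the real tool counts a document as "accurate" (per document, per field, or otherwise) is not stated on the slides, so the sketch simply compares correct reviews against the 90% figure.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of random sampling plus an accuracy threshold check.
class ValidationSampleSketch {
    static final int SAMPLE_SIZE = 100;           // from the slide
    static final double REQUIRED_ACCURACY = 0.90; // 90% Service Level Agreement

    /** Returns up to SAMPLE_SIZE distinct items chosen at random. */
    static <T> List<T> sample(List<T> items, Random rng) {
        List<T> copy = new ArrayList<>(items);
        Collections.shuffle(copy, rng);
        return new ArrayList<>(copy.subList(0, Math.min(SAMPLE_SIZE, copy.size())));
    }

    /** Assumed accuracy measure: share of manually reviewed items judged correct. */
    static boolean meetsSla(int correct, int reviewed) {
        if (reviewed <= 0 || correct < 0 || correct > reviewed) {
            throw new IllegalArgumentException("invalid review counts");
        }
        return (double) correct / reviewed >= REQUIRED_ACCURACY;
    }

    public static void main(String[] args) {
        List<String> granuleIds = new ArrayList<>();
        for (int i = 0; i < 250; i++) {
            granuleIds.add("granule-" + i); // illustrative identifiers
        }
        List<String> picked = sample(granuleIds, new Random());
        System.out.println("Selected " + picked.size() + " granules for manual review");
        System.out.println("93 of 100 correct meets SLA: " + meetsSla(93, 100));
    }
}
```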
FDsys.xml Purpose • Internal container of metadata related to package • Is a detailed representation/model of the data structure across all of FDsys • Reduces duplication of data across metadata formats • Reduces number of required transformations • Can be transformed into standard schemas including: • METS • MODS • PREMIS
FDsys.xml General Structure Header Content Metadata
Publish and Search Agenda: • FDsys Publish • Search Engine Configuration • Search Engine Application Services
Content Publishing - Overview • Communicates from Documentum to Access • From: Documentum • Extract fdsys.xml & premis.xml • Extract renditions and content files • Uses Documentum native DFC calls • To: ACP Cache • Stores metadata and content files • To: FAST ESP Search Engine • Converts fdsys.xml to FAST.xml -> to indexer • Includes the mods.xml (indexed into ESP) • ESP pulls in content files automatically • Uses FAST ESP content_api & search_api calls
Component Interfaces UPDATE THIS
Major Architectural Decisions • Pull from Documentum, not Push • Maintenance of Access Subsystem databases becomes the responsibility of the Access Subsystem • Data is pulled from Documentum only as needed • Avoids overflow/queuing problems • Allows multiple access systems to be fielded • Search for Deletes in FAST • Packages can contain many granules • When updating the FAST indexes, use search to find the list of all nested granules in the indexes • Guaranteed to avoid any “orphan” granule problems
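The "Search for Deletes" decision can be illustrated with an in-memory stand-in for the search index. Nothing here uses the FAST ESP API: the class, its methods, and the map-based index are hypothetical, chosen only to show why searching for a package's existing granules before re-indexing leaves no orphans.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory index: granule id -> owning package id.
class OrphanFreeUpdateSketch {
    private final Map<String, String> index = new HashMap<>();

    /** Stand-in for a search query: all indexed granules that belong to the package. */
    List<String> searchByPackage(String packageId) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> e : index.entrySet()) {
            if (e.getValue().equals(packageId)) {
                hits.add(e.getKey());
            }
        }
        Collections.sort(hits);
        return hits;
    }

    /** Deletes whatever the index currently holds for the package, then indexes the new set. */
    void updatePackage(String packageId, Collection<String> currentGranules) {
        for (String stale : searchByPackage(packageId)) {
            index.remove(stale);
        }
        for (String granuleId : currentGranules) {
            index.put(granuleId, packageId);
        }
    }

    public static void main(String[] args) {
        OrphanFreeUpdateSketch idx = new OrphanFreeUpdateSketch();
        idx.updatePackage("FR-2006-01-01", List.of("E6-1421", "E6-1422", "E6-1423"));
        // The republished package no longer contains E6-1422 (illustrative ids).
        idx.updatePackage("FR-2006-01-01", List.of("E6-1421", "E6-1423"));
        System.out.println(idx.searchByPackage("FR-2006-01-01"));
    }
}
```

The delete list comes from the index itself rather than from the publisher's record of what it sent earlier, so the two can never drift apart.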
Component Interfaces Update This
FAST System – Hardware & Network (diagram) • publish & admin • document processors • index & search (5 nodes) • search (5 nodes) • Web Application
FAST System – Indexing Flow (diagram) • publish & admin • document processors • index & search (5 nodes) • search (5 nodes)
FAST System – Search (diagram) • index & search (5 nodes) • search (5 nodes) • QR server (5 nodes) • Web Application
Search Engine Sizing: Columns • Total Number of Documents • Estimated 10 million records • Each granule = 1 Search Engine document • Allow 2x expansion for estimation errors and growth • Estimated 20 million records • Sizing Recommendation: • FAST recommends: 5 million records per column • For public facing web sites • 5 columns: to account for the large number of navigators
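The column arithmetic above can be checked with a short calculation. This sketch only reproduces the figures on the slide; the class and method names are hypothetical, and the step from the computed minimum to the recommended 5 columns is the slide's own allowance for navigators, not something the code derives.

```java
// Hypothetical worked example of the column sizing arithmetic.
class ColumnSizingSketch {
    /** Minimum columns = ceil(estimatedDocs * expansionFactor / docsPerColumn). */
    static long minimumColumns(long estimatedDocs, int expansionFactor, long docsPerColumn) {
        long sizedDocs = estimatedDocs * expansionFactor;
        return (sizedDocs + docsPerColumn - 1) / docsPerColumn; // ceiling division
    }

    public static void main(String[] args) {
        long min = minimumColumns(10_000_000L, 2, 5_000_000L);
        System.out.println("Minimum columns for 20 million records: " + min);
        System.out.println("Slide recommendation, allowing for navigators: 5");
    }
}
```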
Search Engine Sizing: Disk • Year 2006 FR – Index Sizing Test • Scale to 20 million documents • Fixml: ~150 GB • Index: ~420 GB • Total index space required: • 150 GB + (420 GB × 2) = 990 GB, or about 1 TB • Add 50% for estimation error, total = 1.5 TB
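The disk estimate above works out as follows when written as code. The figures are the slide's; the class and method names are hypothetical, and the doubling of the index size is taken from the slide's formula without assuming why it is applied.

```java
// Hypothetical worked example of the disk sizing arithmetic.
class DiskSizingSketch {
    /** Slide formula: fixml size plus twice the index size, in GB. */
    static double totalIndexGb(double fixmlGb, double indexGb) {
        return fixmlGb + 2 * indexGb;
    }

    /** Adds a safety margin, e.g. 0.5 for 50%. */
    static double withMargin(double gb, double marginFraction) {
        return gb * (1 + marginFraction);
    }

    public static void main(String[] args) {
        double total = totalIndexGb(150, 420);  // 990 GB, which the slide rounds to 1 TB
        double padded = withMargin(total, 0.5); // 1485 GB, roughly 1.5 TB
        System.out.printf("Total: %.0f GB, with 50%% margin: %.0f GB%n", total, padded);
    }
}
```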
Search Engine Sizing: QPS • Queries per second – Estimated from GPO Access • 0.8 QPS (across the whole day) • Estimated peak: 2.4 QPS (1/2 of queries in 4 hours) • Estimated Peak QPS for FDsys: • Factor for improved search interface: 3x • Factor for growth: 2x • Estimated: 2.4 x 2 x 3 = 14.4, rounded to ~15 QPS • Correlates with other websites known to ST • Each row: 20-30 QPS • Therefore: 1 row for query performance • Recommend: 2 rows • 2nd row for redundancy, failover, and maintenance
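The QPS estimate can be reproduced step by step. This sketch uses only the numbers on the slide; the class and method names are hypothetical, and the peak formula is one reading of "1/2 of queries in 4 hours" (half of a day's queries served within a 4-hour window), which does yield the slide's 2.4 QPS.

```java
// Hypothetical worked example of the QPS sizing arithmetic.
class QpsSizingSketch {
    /** Peak rate if the given share of a day's queries arrives within peakHours. */
    static double peakQps(double averageQps, double shareOfQueries, double peakHours) {
        return averageQps * 24.0 * shareOfQueries / peakHours;
    }

    /** Applies the improved-interface and growth factors to the peak rate. */
    static double projectedPeakQps(double peakQps, double interfaceFactor, double growthFactor) {
        return peakQps * interfaceFactor * growthFactor;
    }

    /** Rows needed for query load alone, before any row added for redundancy. */
    static int rowsForLoad(double qps, double qpsPerRow) {
        return (int) Math.ceil(qps / qpsPerRow);
    }

    public static void main(String[] args) {
        double peak = peakQps(0.8, 0.5, 4.0);                // about 2.4 QPS
        double projected = projectedPeakQps(peak, 3.0, 2.0); // about 14.4 QPS, rounded to ~15
        int rows = rowsForLoad(projected, 20.0);             // low end of 20-30 QPS per row
        System.out.printf("Peak %.1f QPS, projected %.1f QPS, rows for load: %d%n",
            peak, projected, rows);
        System.out.println("Rows with one redundant row: " + (rows + 1));
    }
}
```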
Metadata Flow Diagram • search index field mapping