Retrospective Conversion of Old Catalogues
A. Belaïd, LORIA-CNRS, Nancy, France

Outline
• Framework
• Bibliography Description
• Structure Modeling
• Automatic Recognition
• Experiments & Results
• Conclusion

The Framework: Electronic Record

Telematics Program: 1990-1994
• Availability and accessibility of modern library services
• Standardization of shared resources between libraries
• Harmonization and convergence of national policies
28 projects, 48 M ECU

The EU Library Program
• Create, enhance and harmonize bibliographies in electronic format; develop tools for the conversion of important collections
• Develop network connections between libraries to allow data access
• Define innovative library services using new technology

Three Projects: IMAGE/OCR
• BIBLIOTHECA (Spain, Italy, France): study of a pivot format (SGML) for representing different cards, and of a system for indexing and classifying different documents
• FACIT (Denmark, Greece, Italy): search for OCR packages suited to retroconversion with large character sets, and tools for fast, cheap mass conversion of catalogues
• MORE (Belgium, France): study of the role and use of dictionaries in structure modeling and in the recognition of catalogues by OCR techniques

MORE: three experiments
[Figure labels: Cards, Notices]

Bibliography Description: Normative Aspects
[Diagram: bibliographic descriptions, i.e. notices (catalogues) and references (bases)]
• Levels: physical, logical, semantic
• Elements: areas (position, number of digits, content), coded elements, interpreted elements
• Standards: ISBD, MARC
• Actors: libraries, cataloguing organisms

Role of the Separators: ISBD

Area: Title / Responsibility
  Field                      Punctuation
  Title proper               F
  Type of the document       [ ]
  Parallel title             =+
  Sub-title                  :+
  Responsibility mention     /
  Other mentions             ;*

Area: Edition
  Field                      Punctuation
  Edition mention            F
  Parallel edition mention   =+
  First mention              /
  ...

Example
Record: Les interrogations du psychanalyste : clinique, théorie et technique/ par Jean Bergeret,… --- Paris : Presses Universitaires de France, 1987. --- 193 p. : couv. ill. ; 22 cm. --- (Le fait psychanalytique, ISSN 0986-3524). Bibliogr., 5 p. --- ISBN 2-13-039-780-8 (br.) : 98 F
Pattern: Title proper : sub-title / author statement. --- Location : Editor name, Pub. date. --- Print mat. : Accomp. mater. ; Format. --- (Title pr. collec., ISSN). Note. --- ISBN : Price

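The role of the separators in the example above can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the projects: the function names `split_areas` and `split_title_area` are hypothetical, the area separator is assumed to be the "---" mark as printed in the example, and only the ":" and "/" marks of the title area are handled.

```python
import re
from typing import Dict, List

# Assumption: areas are delimited by "---" (two or three hyphens), as in the
# example card. Single hyphens inside ISBN/ISSN numbers are left untouched.
AREA_SEP = re.compile(r"\s*-{2,3}\s*")


def split_areas(record: str) -> List[str]:
    """Split an ISBD-style description into its areas."""
    parts = (part.strip(" .") for part in AREA_SEP.split(record))
    return [part for part in parts if part]


def split_title_area(area: str) -> Dict[str, str]:
    """Split 'title proper : sub-title / responsibility' on ':' and '/'."""
    title_part, _, responsibility = area.partition("/")
    title, _, sub_title = title_part.partition(":")
    return {
        "title_proper": title.strip(),
        "sub_title": sub_title.strip(),
        "responsibility": responsibility.strip(),
    }


RECORD = (
    "Les interrogations du psychanalyste : clinique, théorie et technique/ "
    "par Jean Bergeret,… --- Paris : Presses Universitaires de France, 1987. "
    "--- 193 p. : couv. ill. ; 22 cm. --- (Le fait psychanalytique, "
    "ISSN 0986-3524). Bibliogr., 5 p. --- ISBN 2-13-039-780-8 (br.) : 98 F"
)

areas = split_areas(RECORD)
print(len(areas), "areas")
print(split_title_area(areas[0]))
```

Old, pre-ISBD cards do not follow the punctuation this regularly, which is why the rest of the talk relies on a weighted structure model rather than on separators alone.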
Structure Modeling
Object ::= Constructor {subordinate objects [qualifier]}
• Constructors: sequence, aggregate, choice
• Qualifiers: required, optional, repetitive
• Separator: space, punctuation
• Attributes: physical (position), logical (lexicon), typographical (typeface…)
• Weights on attributes and sub-objects: Imp / Reco., Imp / Hyp., Ambig.

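One possible way to hold such a model in memory is sketched below. The class and field names (`ModelObject`, `Constructor`, `Qualifier`, `rule`) are illustrative assumptions, not the data structures of the original prototype; the single `weight` field stands in for the several weights named on the slide.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class Constructor(Enum):
    SEQUENCE = "sequence"
    AGGREGATE = "aggregate"
    CHOICE = "choice"


class Qualifier(Enum):
    REQUIRED = "required"
    OPTIONAL = "optional"
    REPETITIVE = "repetitive"


@dataclass
class ModelObject:
    name: str
    constructor: Optional[Constructor] = None  # None marks a terminal object
    qualifier: Qualifier = Qualifier.REQUIRED
    separator: str = ""                        # space or punctuation before the object
    attributes: Dict[str, str] = field(default_factory=dict)  # e.g. position, lexicon, typeface
    weight: float = 1.0                        # assumption: one weight per object
    children: List["ModelObject"] = field(default_factory=list)

    def rule(self) -> str:
        """Render the object in the 'Object ::= Constructor {...}' notation."""
        if self.constructor is None:
            return self.name
        subs = ", ".join(f"{c.name} [{c.qualifier.value}]" for c in self.children)
        return f"{self.name} ::= {self.constructor.value} {{{subs}}}"


title_area = ModelObject(
    "TitleArea",
    Constructor.SEQUENCE,
    children=[
        ModelObject("TitleProper", attributes={"typeface": "bold"}),
        ModelObject("SubTitle", qualifier=Qualifier.OPTIONAL, separator=":"),
        ModelObject("Responsibility", qualifier=Qualifier.OPTIONAL, separator="/"),
    ],
)
print(title_area.rule())
```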
Recognition Schema
[Diagram]
• Learning: acquisition, structure model
• Recognition: structure recognition (lexicons), format restitution (target format), control, electronic record

The French Library: without OCR
[Diagram labels: image, model, structural analysis, compilation, anchor points, indices, hypotheses management; bottom-up and top-down]

Indices Extraction
[Figure: correlation values 4.7%, 16.1%, 76.7%, 43.3%, 37.5%, 91.0%, 31.5%, 55.5%, 37.5%, 61%]

Syntactical Analysis: the approach
[Parse-tree figure: S … B → μ_A A μ'_A; A → λ_o o λ'_o; input string a_1 … a_{i-1} a_i … o … a_j a_{j+1} … a_n]
• Anchor point extraction (o)
• Bottom-up: choice of a rule A → λ_o o λ'_o
• Top-down: verification of the left context λ_o and of the right context λ'_o
• Add A to the anchor points

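The four steps of this approach can be sketched as a single pass over a token sequence. This is a simplified illustration under stated assumptions: tokens are already classified, each anchor selects exactly one rule, and contexts are fixed token sequences. The names `Rule` and `reduce_at_anchors` and the token classes are hypothetical.

```python
from typing import Dict, List, NamedTuple, Tuple


class Rule(NamedTuple):
    head: str               # the non-terminal A
    left: Tuple[str, ...]   # expected left context of the anchor
    right: Tuple[str, ...]  # expected right context of the anchor


def reduce_at_anchors(tokens: List[str], rules: Dict[str, Rule]) -> List[Tuple[str, int, int]]:
    """Return (A, start, end) for every anchor whose contexts are verified."""
    found = []
    for i, token in enumerate(tokens):
        rule = rules.get(token)          # bottom-up: the anchor selects a rule
        if rule is None:
            continue                     # this token is not an anchor point
        start = i - len(rule.left)
        end = i + 1 + len(rule.right)
        if start < 0 or end > len(tokens):
            continue
        # top-down: verify the left and right contexts around the anchor
        left_ok = tuple(tokens[start:i]) == rule.left
        right_ok = tuple(tokens[i + 1:end]) == rule.right
        if left_ok and right_ok:
            found.append((rule.head, start, end))  # A becomes a new anchor point
    return found


# Hypothetical token classes for "title : sub-title / name -- place : publisher , date"
TOKENS = ["TEXT", ":", "TEXT", "/", "NAME", "--", "PLACE", ":", "PUBLISHER", ",", "DATE"]
RULES = {
    "/": Rule("TitleResponsibility", ("TEXT",), ("NAME",)),
    ",": Rule("AddressDate", ("PUBLISHER",), ("DATE",)),
}
print(reduce_at_anchors(TOKENS, RULES))
```

In a fuller version the recognized spans would themselves be fed back as anchors for higher-level rules, until the start symbol S is reached or no rule applies.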
Results
• Group Vedette (heading)
• Area Title: principal title, crossing title, end of the title, crossing formulae, crossing title
• Area Address / Date: address, date
• Area Collection
• Group Cote (shelf mark)
200 references: 75%

The Belgian Library: Albert I
• Large number of abbreviated words
• Numerical information
• Large quantity of names
• Mixture of languages
• Accented characters
• Punctuation marks
• Similar characters

Agenda Analysis Schema
[Diagram labels: OCR flow, filtering, analysis; model, specific structure (hypotheses), specific instances; pre-conditions; hypotheses (a priori, a posteriori); local strategies; hypotheses evaluation; post-analysis actions]

OCR Flow: SGML
Legend: Line (<LIG>), Bold (<B>), Lexicon (<LEX>), Italic (<I>)
<LIG X= ...Y= … <B>Helvetius R=100%</B> <LEX L=GNL> <B>(Claude R=100%</B> Adrier R=75%).</B> <LEX L=GFR,GNL>De<LEX L=GGB,GFR,GNL> l’esprit. R=100% Présent-</LIG>
<LIG X=… Y = ... tation R=98%<LEX I=GFR,GNL> de R=100% François R=100% <LEX L=GFR>Châtelet. R=99% <I>(Verviers, R=100%<LEX L=GGB>Editions R=100%</I></LIG>...

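A flow of this kind can be read with a couple of regular expressions. The sketch below is an assumption-laden illustration: it treats the `R=…%` value after each word as an OCR confidence, drops all tags instead of interpreting them, and uses a simplified sample line rather than the full flow shown above. The function names and the 90% threshold are hypothetical.

```python
import re
from typing import List, Tuple

TAG_RE = re.compile(r"<[^>]*>")               # any SGML tag, opening or closing
WORD_RE = re.compile(r"(\S+)\s+R=(\d+)%")     # a word followed by its R=..% value


def read_words(flow: str) -> List[Tuple[str, int]]:
    """Return (word, confidence) pairs, ignoring typographic and lexicon tags."""
    text = TAG_RE.sub(" ", flow)
    return [(word.strip("().,"), int(rate)) for word, rate in WORD_RE.findall(text)]


def low_confidence(words: List[Tuple[str, int]], threshold: int = 90) -> List[str]:
    """List the words whose confidence falls below the threshold."""
    return [word for word, rate in words if rate < threshold]


# Simplified sample modeled on the first line of the flow above.
SAMPLE = "<LIG><B>Helvetius R=100%</B> <B>(Claude R=100%</B> Adrier R=75%).</LIG>"

words = read_words(SAMPLE)
print(words)
print(low_confidence(words))
```

Words flagged this way are natural candidates for the dictionary lookups and manual doubt examination described later in the talk.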
The Analysis: opportunistic mode
Reference: CHOICE (A, B, C)
• Frontiers of the father
• Inheritance of the father's score
• Put the subordinate objects in the agenda
Agenda: sorted by a priori score (attributes) and a posteriori score
Reference: SEQ (A, B, C)
• Find the search area, initials and finals
• Construct potential zones: combinations of initials (I) and finals (F)
• Put the combinations in the agenda

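The agenda mechanism can be illustrated with a priority queue. This is a minimal sketch, not the prototype's algorithm: the model is reduced to a parent-to-children map, score inheritance is assumed to be a simple product of the father's score and the child's a priori score, and the a posteriori score and zone construction are left out. The name `run_agenda` and the sample scores are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple


def run_agenda(
    model: Dict[str, List[str]],
    a_priori: Dict[str, float],
    root: str,
) -> List[Tuple[str, float]]:
    """Examine hypotheses best-first and return them in processing order."""
    # heapq is a min-heap, so scores are negated to pop the best hypothesis first.
    agenda = [(-a_priori.get(root, 1.0), root)]
    order = []
    while agenda:
        neg_score, name = heapq.heappop(agenda)
        score = -neg_score
        order.append((name, score))
        for child in model.get(name, []):
            # Assumption: a child inherits the father's score, scaled by its own
            # a priori score, before being put in the agenda.
            heapq.heappush(agenda, (-(score * a_priori.get(child, 1.0)), child))
    return order


MODEL = {"Reference": ["A", "B", "C"]}
A_PRIORI = {"A": 0.5, "B": 0.9, "C": 0.7}

for name, score in run_agenda(MODEL, A_PRIORI, "Reference"):
    print(name, score)
```

Because the agenda is re-sorted on every insertion, the analysis always continues with the most promising hypothesis wherever it sits in the model, which is what makes the strategy opportunistic rather than strictly top-down.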
The prototype
[Diagram labels: catalogues; dictionaries (general, specific); manual acquisition; automatic acquisition; OCR flow (SGML); manual structural acquisition; structure specification (UNIMARC); structure model; structure recognition; error file; error correction (with the library); verification and final formatting; MARC]

Structure Results
• References: 75.5%
• Recognized with ambiguities to resolve manually: 3%
• Recognized but with 'risk', to be re-examined: 8%
• Recognized with structure error: 1%
• Unrecognized, unknown cause: 3%
• Unrecognized, model: 2%

Manual Correction

Month          Total nb.   OCR      Structure  Country   Nb of doubts  Default       Refer.       Refer.
               of refer.   correc.  correct.   correct.  examined      solutions     number       number
January            427      1100        9         10        2882         2305          130           96
February           428      1056       15         21        2632         2105          127           84
March              408      1975       19         24        2344         1555           98           82
April              419      1132       14          9        3187         1912          200          107
May                386      1088        8          3        2178         1438          130           79
June               412      1493        7         11        2347         1488          169          112
July / August      372       930        3         13        2433         1699          102           85
September          302       963       12         12        1725         1170           79           43
October            397       925       13         21        3037         2209          120           89
November           387      1239       13         45        3217         1961          166           96
December           610      2272       17         20        4446         3230          173          141
Total 11 months   4548     14173      130        189       30428        21072 (69%)   1494 (33%)   1014 (22.3%)

Doubt categories: index of authors, subject index (French), subject index (Dutch), references

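The totals and percentages of this table can be cross-checked with a few lines of arithmetic. The sketch assumes, from the figures themselves, that 69% relates the 21072 default solutions to the 30428 doubts examined, and that 33% and 22.3% relate 1494 and 1014 to the 4548 references; the variable names are illustrative.

```python
# Monthly number of references, in the order of the 11 data rows of the table.
MONTHLY_REFERENCES = [427, 428, 408, 419, 386, 412, 372, 302, 397, 387, 610]

TOTAL_REFERENCES = 4548
DOUBTS_EXAMINED = 30428
DEFAULT_SOLUTIONS = 21072
REFER_NUMBER_1 = 1494
REFER_NUMBER_2 = 1014


def pct(part: int, whole: int) -> float:
    """Percentage of `part` in `whole`, rounded to one decimal."""
    return round(100 * part / whole, 1)


print("sum of monthly references:", sum(MONTHLY_REFERENCES))
print("default solutions / doubts examined:", pct(DEFAULT_SOLUTIONS, DOUBTS_EXAMINED), "%")
print("first refer. number / references:", pct(REFER_NUMBER_1, TOTAL_REFERENCES), "%")
print("second refer. number / references:", pct(REFER_NUMBER_2, TOTAL_REFERENCES), "%")
```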
Conclusion: comparing the two methods
• Importance of the model
• Tools to extract pertinent indices
• Several references remain unrecognized:
  - Task complexity
  - Model built from non-normalized (pre-ISBD) references
  - Knowledge is incomplete and uncertain
  - Large number of sub-classes
  - Model construction: difficulty of introducing a fine degree of specification (attributes, weights, etc.)

General Conclusion
• Enhanced understanding of the issues involved in retroconversion
• Advances in OCR and structure recognition and their solutions
• The prototypes developed constitute important results
  - Valuable syntheses
  - Broader basis for further work
• Cooperation between libraries
  - Gave valuable insights into practices of retroconversion of old catalogues
  - Contributed to a better understanding of the problem
• A number of problems remain to be tackled