ST22 revision proposal June-2006 WIPO-SDWG meeting Geneva
Agenda • Reasons for the revision of the ST22 • Age of current standard • Expected benefits • PCT International Bureau experience • Examples of pages difficult to OCR • Conclusion • Discussion / Questions
Age of current standard • Inadequate title: “Recommendation for the presentation of patent applications typed in optical character recognition (OCR) format” • Contains valid recommendations, but they are expressed in outdated terminology (ribbons, typewriter, …). Some recommendations need to be made more precise. • A few new recommendations should be added to take into account the progress in OCR technology over the last 10 years. • Not sufficiently followed by agents and applicants: some promotion is required
Expected benefits • Experience shows that if documents follow simple layout rules, automatic OCR procedures are effective enough to yield a satisfactory result for full-text search purposes (i.e. an average accuracy above 98.5%). • An updated standard ST22 would lead to: • Significant reductions in the cost of the OCR procedures performed by the regional/national IP offices and the IB • Better quality of the full-text published documents built from OCR procedures • More efficient and precise search procedures for the IP community
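The slides quote an average accuracy above 98.5% but do not say how it is measured. As a minimal sketch, assuming accuracy is computed at character level as one minus the edit distance between a reference text and the OCR output, divided by the reference length, a check against that threshold could look like this. The function names and the character-level definition are illustrative assumptions, not the IB's actual method.

```python
# Assumption: character-level accuracy based on Levenshtein edit distance.
# The slides give only the 98.5% figure, not the metric behind it.

ACCURACY_THRESHOLD = 0.985  # figure quoted in the slides


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_accuracy(reference: str, ocr_output: str) -> float:
    """Share of reference characters recovered correctly, clamped to [0, 1]."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))


def meets_threshold(reference: str, ocr_output: str) -> bool:
    """True if the page reaches the accuracy level quoted in the slides."""
    return character_accuracy(reference, ocr_output) >= ACCURACY_THRESHOLD
```

Under this definition, a single misread character in a 21-character line already drops that line to roughly 95%, which illustrates why the threshold is only realistic as an average over whole pages.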
PCT International Bureau Experience • An internal automatic OCR system and a Quality Checking system have been developed by the PCT • The system was tested for 6 months and then put into production. It has been in operation since January 1, 2006 and OCRs the pamphlets published weekly by the PCT.
Internal OCR: key points • Use an off-the-shelf commercial product and adapt it to the PCT's needs • Build a generic and scalable service so that the OCR function can be used from different applications (online or batch) and meet future PCT needs • Operate the service in-house to reduce costs and gain flexibility in the publication process (discontinue the outsourcing contract)
Internal OCR: key points • OCR the description and claims sections of the published PCT pamphlets each week (approximately 50,000 pages to OCR weekly) • Provide the results as ST36 XML files that are used to feed the indexing engine of the Patentscope site and the espacenet site (see http://www.wipo.int/pctdb/en/browse.jsp) • Enrich the PCT electronic products with the results of the OCR (searchable PDFs added to the Rule 87 DVD)
Internal OCR: some figures • With our hardware configuration, the OCR of a complete publication week takes around 16 hours (it runs over the weekend). • Five staff members perform part-time Quality Checking operations every Monday (around 3 to 4 person-days are spent each week on quality checking) in order to correct the worst cases.
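To put the figures above in perspective, the short sketch below derives the implied throughput from the two numbers given in the slides (roughly 50,000 pages in roughly 16 hours). Both inputs are approximate, so the results are rough orders of magnitude; the function names are illustrative.

```python
# Inputs are the approximate figures quoted in the slides;
# everything else is derived arithmetic.
PAGES_PER_WEEK = 50_000  # approximate weekly page volume
OCR_HOURS = 16           # approximate duration of one weekly OCR run


def pages_per_hour(pages: float, hours: float) -> float:
    """Average number of pages processed per hour of run time."""
    return pages / hours


def seconds_per_page(pages: float, hours: float) -> float:
    """Average wall-clock time spent on one page, in seconds."""
    return hours * 3600 / pages


rate = pages_per_hour(PAGES_PER_WEEK, OCR_HOURS)
per_page = seconds_per_page(PAGES_PER_WEEK, OCR_HOURS)
print(f"{rate:.0f} pages/hour, {per_page:.2f} s/page")
# prints "3125 pages/hour, 1.15 s/page"
```

At a little over one second per page, the weekly run fits comfortably within a weekend, which is consistent with the scheduling described in the slide.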
Some examples of difficult pages, submitted on paper or in image form, that the revised ST22 standard should discourage...
Conclusion • We invite the SDWG: • (a) to consider the proposal to revise WIPO Standard ST.22; and • (b) to consider establishing a task for the revision of WIPO Standard ST.22 and setting up a Task Force to handle that revision.
Agenda • Reasons for the review of the ST22 • Age of current standard • Expected benefits • PCT International Bureau experience • Examples of applications difficult to OCR • Conclusion • Discussion / Questions