Implementing Coding Tools for a New Classification Andrew Allen, UK Office for National Statistics
Operation 2007 - The players: • In the UK: the Standard Industrial Classification of Economic Activities (SIC) (current version SIC (2003)) • In Europe: NACE, the Nomenclature générale des activités économiques dans les Communautés européennes (current version NACE Rev 1.1) • In the UN: ISIC, the International Standard Industrial Classification of All Economic Activities (current version ISIC Rev 3.1)
The UK SIC • is a 5-digit classification system • is required, by EU legislation, to be identical to NACE down to and including the 4-digit Class level • contains a national 5th-digit level which does not exist in NACE
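The nesting of the UK SIC inside NACE can be illustrated with a short sketch. It assumes codes are held as 5-character strings (so that leading zeros such as in 01120 survive); the function names are hypothetical, and the level names other than Class follow the usual NACE hierarchy rather than anything stated on the slide.

```python
def split_sic(code):
    """Split a 5-digit UK SIC code into its nested levels."""
    if len(code) != 5 or not code.isdigit():
        raise ValueError(f"expected a 5-digit SIC code, got {code!r}")
    return {
        "division": code[:2],  # 2 digits
        "group": code[:3],     # 3 digits
        "class": code[:4],     # 4 digits: the level that must match NACE
        "subclass": code,      # 5 digits: UK national level only
    }


def same_nace_class(code_a, code_b):
    """True if two UK SIC codes differ at most in the national 5th digit."""
    return split_sic(code_a)["class"] == split_sic(code_b)["class"]


print(split_sic("51880")["class"])  # the 4-digit Class shared with NACE
```

Truncating to four digits is enough to move from the UK code to the NACE Class because the two classifications are identical down to that level.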
ACTR as an aid to coding • ACTR – Automatic Coding by Text Recognition • Developed by Statistics Canada • ONS standard tool for coding, initially industry and occupation • Replaces Precision Data Coder for industry coding • Determines a code from a text description • Extent of automation of process is controlled by parameters
Knowledge Bases – SIC2003 • ACTR relies heavily on indexes of standard descriptions: • Business descriptions from responses to the Business Register Survey • Published index for the SIC2003 • The short descriptions for each SIC2003 code • Standard descriptions for construction industry statistics • Trade code descriptions for PAYE (Pay As You Earn Tax) employers • Farm type descriptions • In total, more than 30,000 standard descriptions
How ACTR works • Each input description is converted to a standard form • This is compared with the standard forms of descriptions held in the knowledge base • The closeness is presented as a score between 0 and 10 • The system has rules to determine whether the score is sufficient to confirm a match: • Requires a score of more than 7.5 to code automatically (our setting which may differ for other data sets) • Lower scores are passed through interactive coding • Coding does not depend on the order in which the knowledge bases are checked
ACTR Process • Supplied text: Horticultural services • Standard form of supplied text: HORTICULTURAL SERVICE • Best-fit index entry: Sales and service of horticultural machinery • Standard form of index entry: HORTICULTURAL MACHINERY SALE SERVICE • Score: 6.911 (out of 10), below the 7.5 needed to code automatically • ACTR's preferred SIC 2003 code: 51880 (Wholesale of agricultural machinery and accessories)
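The standardise, compare and threshold steps can be sketched as follows. This is a toy illustration only: the stop-word list, the crude plural stripping and the word-overlap (Dice) score are assumptions made for the example and are not ACTR's actual parsing rules or weighting, so the score it produces will not equal the 6.911 reported by ACTR. Only the 7.5 threshold is taken from the slides.

```python
import re

AUTO_CODE_THRESHOLD = 7.5  # the ONS setting quoted in the slides
STOP_WORDS = {"AND", "OF", "THE", "FOR"}  # assumed list, for illustration only


def standardise(text):
    """Toy standard form: upper-case words, no stop words, crude singulars."""
    words = set()
    for word in re.findall(r"[A-Z]+", text.upper()):
        if word in STOP_WORDS:
            continue
        if len(word) > 3 and word.endswith("S"):
            word = word[:-1]  # SERVICES -> SERVICE, SALES -> SALE
        words.add(word)
    return frozenset(words)


def score(input_text, index_text):
    """Word-overlap (Dice) similarity scaled to 0-10. Not ACTR's real formula."""
    a, b = standardise(input_text), standardise(index_text)
    if not a or not b:
        return 0.0
    return 10.0 * 2 * len(a & b) / (len(a) + len(b))


def route(match_score):
    """Automatic coding needs a score strictly above the threshold."""
    return "automatic" if match_score > AUTO_CODE_THRESHOLD else "interactive"


supplied = "Horticultural services"
entry = "Sales and service of horticultural machinery"
print(sorted(standardise(supplied)))
print(sorted(standardise(entry)))
print(route(score(supplied, entry)))
```

Even with this much simpler scoring, the example description shares only two of the index entry's four standardised words, so it falls below the threshold and would be routed to interactive coding, as in the slide.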
Interactive coding • Scores of 7.5 or below are passed to clerical staff for interactive coding • The system presents options in descending order of score • If none of the choices is suitable, staff modify the description • Once a decision is made, the coder confirms the choice • The chosen index description is then held on the IDBR (Inter-Departmental Business Register).
Introducing the SIC2007 (NACE Rev 2) • New index files: • SIC2007 headings • SIC2007 index • Initially code forward from the SIC2003 using bridging codes – these are codes for each knowledge base entry that link the SIC2003 and SIC2007 • Later will change to code backwards from the SIC2007 • Eventually dual coding will cease
Impact of ACTR on IDBR at Micro Level • Existing SIC 2003 is 01120 (Growing of vegetables etc) • The preferred ACTR SIC 2003 is 51880 (Wholesale of agricultural machinery and accessories) • The SIC 2007 comes from the bridging code • SIC 2003: 51880 • Bridging code: MTOLR • SIC 2007: 46610 • SIC 2003 code will change but only when agreed
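The bridging-code lookup in this example can be sketched as two small tables. The table layout, the names `BRIDGES`, `KNOWLEDGE_BASE` and `dual_code`, and the use of the standardised description as the key are assumptions for illustration; only the values 51880, MTOLR and 46610 come from the slide.

```python
from typing import NamedTuple, Optional


class Bridge(NamedTuple):
    sic2003: str
    sic2007: str


# Hypothetical layout: one bridging code links a SIC 2003 code to a SIC 2007 code.
BRIDGES = {"MTOLR": Bridge(sic2003="51880", sic2007="46610")}

# Hypothetical layout: each knowledge base entry carries a bridging code.
KNOWLEDGE_BASE = {"HORTICULTURAL MACHINERY SALE SERVICE": "MTOLR"}


def dual_code(standard_form: str) -> Optional[Bridge]:
    """Return both classifications for a matched entry, or None if unmatched."""
    bridging_code = KNOWLEDGE_BASE.get(standard_form)
    if bridging_code is None:
        return None
    return BRIDGES[bridging_code]


result = dual_code("HORTICULTURAL MACHINERY SALE SERVICE")
print(result.sic2003, result.sic2007)
```

Because a single match yields both codes, one coding pass can populate the SIC 2003 and SIC 2007 fields at the same time.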
Conversion to SIC2007 • ACTR will deal with units that have a suitable business description • Conversion tables will deal with: • Units with descriptions that ACTR is unable to code (vague descriptions) • Units without a description • Units supplied through administrative sources (existing VAT traders, PAYE employers, Registered Companies)
Creation of Conversion Tables • Tables have been created to convert units from SIC2003 to SIC2007 by: • Using ACTR bridging codes • Coding existing data through ACTR • Producing a cross-tabulation of SIC2003 to SIC2007 • Allocating on a probability basis, with probabilities rounded to the nearest 5% • Validating relationships against the acceptable range of industries • Best-fit tables have also been produced for users who cannot accommodate probability-based conversion
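A minimal sketch of how a probability-based table and a best-fit table might be derived from dual-coded units is given below. The function names and the placeholder codes (`OLD_1`, `NEW_A`, `NEW_B`) are hypothetical, and the handling of shares that round to zero is an assumption; the slides do not say how ONS dealt with rounded shares that no longer sum to 100%.

```python
from collections import Counter, defaultdict


def round_to_nearest_5(percent):
    """Round half up to a multiple of 5 (avoids Python's banker's rounding)."""
    return int(percent / 5 + 0.5) * 5


def cross_tabulate(pairs):
    """Count units for every (old code, new code) combination."""
    table = defaultdict(Counter)
    for old, new in pairs:
        table[old][new] += 1
    return table


def probability_table(pairs):
    """Share of each new code within an old code, as rounded percentages."""
    result = {}
    for old, counts in cross_tabulate(pairs).items():
        total = sum(counts.values())
        shares = {new: round_to_nearest_5(100 * n / total) for new, n in counts.items()}
        # Assumption: destinations whose share rounds to 0% are dropped.
        result[old] = {new: pct for new, pct in shares.items() if pct > 0}
    return result


def best_fit_table(pairs):
    """Single most frequent new code for each old code."""
    return {old: counts.most_common(1)[0][0] for old, counts in cross_tabulate(pairs).items()}


# Placeholder data: 31 units moved OLD_1 -> NEW_A and 19 moved OLD_1 -> NEW_B.
pairs = [("OLD_1", "NEW_A")] * 31 + [("OLD_1", "NEW_B")] * 19
print(probability_table(pairs))
print(best_fit_table(pairs))
```

In this placeholder case the raw shares of 62% and 38% become 60% and 40%, and the best-fit table keeps only the larger destination.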
Impact on the IDBR at the Macro Level • SIC 2003 codes are affected only for reporting units whose local units have business descriptions that ACTR can code. • Local units that ACTR codes: 620,000 • Local units that ACTR does not code: 210,000 • No business description: 340,000 • Administrative data only: 1,660,000 • Total local units: 2,830,000 • SIC 2007 comes from the bridging codes only where ACTR codes; otherwise SIC 2007 comes from conversion from SIC 2003
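The counts above can be checked and turned into shares with a few lines of arithmetic. The grouping of "ACTR codes" and "ACTR does not code" as the units that have a description is an interpretation of the slide, not something it states outright.

```python
local_units = {
    "ACTR codes": 620_000,
    "ACTR does not code": 210_000,
    "No business description": 340_000,
    "Administrative data only": 1_660_000,
}

total = sum(local_units.values())
# Assumption: units with a description are those ACTR attempted to code.
with_description = local_units["ACTR codes"] + local_units["ACTR does not code"]

share_of_all = 100 * local_units["ACTR codes"] / total
share_of_described = 100 * local_units["ACTR codes"] / with_description

print(f"Total local units: {total:,}")                                  # 2,830,000
print(f"ACTR-coded share of all local units: {share_of_all:.1f}%")       # 21.9%
print(f"ACTR-coded share of described units: {share_of_described:.1f}%")  # 74.7%
```

On these provisional figures, roughly one local unit in five gets its SIC 2007 code through a bridging code, and the rest rely on the conversion tables.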
SIC 2003 sections • A AGRICULTURE, HUNTING AND FORESTRY • B FISHING • C MINING AND QUARRYING • D MANUFACTURING • E ELECTRICITY, GAS AND WATER SUPPLY • F CONSTRUCTION • G WHOLESALE AND RETAIL TRADE; REPAIR OF MOTOR VEHICLES • H HOTELS AND RESTAURANTS • I TRANSPORT, STORAGE AND COMMUNICATION • J FINANCIAL INTERMEDIATION • K REAL ESTATE, RENTING AND BUSINESS ACTIVITIES • L PUBLIC ADMINISTRATION AND DEFENCE; COMPULSORY SOCIAL SECURITY • M EDUCATION • N HEALTH AND SOCIAL WORK • O OTHER COMMUNITY, SOCIAL AND PERSONAL SERVICE ACTIVITIES • P PRIVATE HOUSEHOLDS EMPLOYING STAFF AND UNDIFFERENTIATED • Q EXTRA-TERRITORIAL ORGANISATIONS AND BODIES
Impact at SIC 2003 broad industry level (provisional counts)
SIC 2007 sections • A Agriculture, Forestry And Fishing • B Mining And Quarrying • C Manufacturing • D Electricity, Gas, Steam And Air Conditioning Supply • E Water Supply; Sewerage, Waste Management And Remediation Activities • F Construction • G Wholesale And Retail Trade; Repair Of Motor Vehicles And Motorcycles • H Transportation And Storage • I Accommodation And Food Service Activities • J Information And Communication • K Financial And Insurance Activities • L Real Estate Activities • M Professional, Scientific And Technical Activities • N Administrative And Support Service Activities • O Public Administration And Defence; Compulsory Social Security • P Education • Q Human Health And Social Work Activities • R Arts, Entertainment And Recreation • S Other Service Activities • T Activities Of Households • U Activities Of Extraterritorial Organisations And Bodies
Correspondence between SIC 2003 and SIC 2007 for local units coded by ACTR
Conclusions • The ACTR tool delivers considerable savings in cost and in burden on businesses compared with traditional survey approaches. • The knowledge base is portable (i.e. independent of the coding engine), so it can be shared with interested parties, e.g. administrative data suppliers, to increase the consistency of coding. • The use of bridging codes permits simultaneous coding to multiple classification systems, which is essential if periods of dual coding are required. • The knowledge base approach can help to inform the development of future versions of a classification by providing a reference frame of business activity descriptions.