How does Auto-Classification Work? The Science Behind It Dr. William E. Underwood Principal Research Scientist Georgia Tech Research Institute Auto-Classification: Taking a Closer Look ARMA NOVA Chapter Spring Seminar March 6, 2012
Overview • The Problem: Email needs to be categorized by business activity and retained according to the agency's records schedule. • What issues arise in trying to solve this problem? • Rule-based text categorization • Statistics-based text categorization • An experiment in e-mail categorization • Conclusions
BACKGROUND • NARA Bulletin 2011-03, Dec 22, 2010 • Subject: Guidance Concerning the use of E-mail Archiving Applications to Store E-mail • What are the requirements for managing e-mail messages as Federal records? • Provide for the grouping of related records into classifications according to the nature of the business purposes the records serve; • Permit easy and timely retrieval of both individual records and files or other groupings of related records; • Retain the records in a usable format for their required retention period as specified by a NARA approved records schedule; • …
Which e-mails are non-records? • Many received intra-office e-mail messages should not be saved as records, because they are for information only. If kept by the recipient, they are kept for reference. They are non-records. They do not need to be categorized as records, but may need to be categorized as non-records. • Some received intra-office mail should be saved as records, for example, requests for action and responses to requests for information. • E-mail messages that are related, e.g., a request for information and its response, should be linked.
TEXT CATEGORIZATION Text (or Document) Categorization is the problem of assigning a given document to one or more categories. There are two primary approaches to automated text categorization: rule-based and statistics-based machine learning. Rule-based filters are available in some e-mail clients for automatically classifying mail into folders. Machine learning techniques are the basis of most spam filters for email.
Rule-based Text Categorization with Shallow Text Processing • Human experts define template structures to be filled automatically by extracting information from the documents [Ciravegna et al 1999]. The partially filled templates are classified by hand-made rules. • This approach yields very high recall/precision or accuracy values. • It incurs high costs in analyzing and modeling the application domain, especially if one takes into account the problem of changing content in the categories.
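To make the rule-based approach concrete, here is a minimal Python sketch of hand-made categorization rules. The categories and patterns are invented for illustration; a system like FACILE uses richer shallow text processing and template filling than plain keyword matching.

```python
import re

# Hypothetical hand-made rules: each category is matched by a set of
# regular-expression patterns over the message text (a deliberately
# simplified stand-in for shallow text processing).
RULES = {
    "Correspondence, Administrative": [
        re.compile(r"\b(policy|procedure|regulation)\b", re.I),
        re.compile(r"\bannounc\w*", re.I),
    ],
    "Special Event Records": [
        re.compile(r"\b(workshop|seminar|registration)\b", re.I),
    ],
}

def rule_based_categorize(text):
    """Return every category with at least one matching rule."""
    return [cat for cat, patterns in RULES.items()
            if any(p.search(text) for p in patterns)]

print(rule_based_categorize("Announcing new procedures for contract services"))
# -> ['Correspondence, Administrative']
```

The high cost noted above is visible even at this scale: every category needs its own carefully engineered patterns, and the rules must be revised whenever the content of a category drifts.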
Supervised Machine Learning • Supervised machine learning (SML) trains classifiers on a set of documents that have been labeled with the correct class. • Given a sufficient sample for each category, machine learning models generally cost less to create than rule-based systems. • SML is also easier to scale up to large volumes of email. • SML promises low costs in both analyzing and modeling the application, at the expense of somewhat lower accuracy. • It is independent of domain-specific knowledge.
Supervised Machine Learning Applied to E-mail • The supervised machine learning methods that have been used for text categorization include Maximum Entropy classification, Naïve Bayes, and Support Vector Machines (SVMs). • The most effective SML method for text categorization has been the SVM [Joachims 1998; Sebastiani 2002] • SVMs scale up to high dimensionalities • Work well without term selection • Robust to over-fitting
Support Vector Machines • SVMs distinguish positive and negative examples for each class. During the learning phase, they construct a hyperplane supported by vectors of positive and negative examples. For each class, a categorizer is built by computing such a hyperplane. • During the categorization phase, each categorizer is applied to the new document vector, yielding the probabilities of the document belonging to a class. The probability increases with the distance of the vector from the hyperplane. A document is said to belong to the class with the highest probability.
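A minimal sketch of this one-categorizer-per-class scheme, using scikit-learn's LinearSVC (my choice of library; the talk does not name an implementation). LinearSVC exposes the signed distance to each hyperplane via decision_function; converting that distance to a calibrated probability would take an extra step such as Platt scaling, omitted here.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy training set: three messages, one per invented category.
docs = ["please review the new travel policy",
        "workshop registration closes friday",
        "monthly ledgers have been closed"]
labels = ["administrative", "event", "general"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

# Learning phase: one binary categorizer (one hyperplane) per class.
classifiers = {}
for cat in set(labels):
    y = np.array([1 if lab == cat else 0 for lab in labels])
    classifiers[cat] = LinearSVC().fit(X, y)

# Categorization phase: score a new document against every hyperplane
# and assign it to the highest-scoring class.
new_doc = vec.transform(["policy review meeting on friday"])
scores = {cat: clf.decision_function(new_doc)[0]
          for cat, clf in classifiers.items()}
print(max(scores, key=scores.get))
```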
Feature Vector Representation of Email • Each word token (or word stem) in a document corresponds to a feature (bag of words). • All words in a training set correspond to positions in a feature vector. • The value of each feature is its tf-idf weight: the document term frequency (tf) scaled by the inverse document frequency (idf).
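A short sketch of one common tf-idf variant (the talk does not specify the exact weighting used):

```python
import math
from collections import Counter

# Toy corpus: each document is already tokenized into a bag of words.
corpus = [["budget", "meeting", "minutes"],
          ["budget", "report"],
          ["workshop", "registration"]]

def tfidf(doc, corpus):
    """Weight each term in doc by tf * idf over the corpus."""
    tf = Counter(doc)
    n = len(corpus)
    return {t: tf[t] * math.log(n / sum(t in d for d in corpus))
            for t in tf}

print(tfidf(corpus[0], corpus))
# 'budget' occurs in 2 of 3 documents, so it gets a lower weight than
# 'meeting' or 'minutes', which occur in only 1.
```

The intuition: a term that is frequent in a document but rare in the rest of the corpus is a strong signal of what the document is about.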
Learning Support Vector Classifiers • Crosses & circles represent positive and negative training examples, respectively. • Lines represent decision surfaces. • The thicker line is the best decision surface. • Small boxes indicate support vectors.
Experiment: Automatic Categorization of GTRI Email using SVMs • Select samples of GTRI email that should be categorized and related to the University System of Georgia’s Records Retention Schedule. • Select 2/3 of samples in each of the six categories for training. • Preprocess each email sample in training set by converting it to text format and transforming it into a representation suitable to the learning method. • Train six binary SVM classifiers • Evaluate performance of the classifiers using 1/3 of samples from each category that were not used in the training.
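The pipeline might look like the sketch below, again using scikit-learn as a stand-in for whatever toolkit the experiment actually used. load_gtri_samples is a hypothetical placeholder, since the GTRI sample data is not public.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Hypothetical loader: message texts plus one of six retention-category
# labels per message.
emails, categories = load_gtri_samples()

# Hold out 1/3 of each category for evaluation; stratify preserves the
# per-category 2/3 : 1/3 split described above.
train_docs, test_docs, y_train, y_test = train_test_split(
    emails, categories, test_size=1/3, stratify=categories, random_state=0)

vec = TfidfVectorizer(stop_words="english")
X_train = vec.fit_transform(train_docs)
X_test = vec.transform(test_docs)

# LinearSVC fits one binary one-vs-rest classifier per category,
# matching the six binary SVMs of the experiment.
clf = LinearSVC().fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```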
RESULTS: Retention Categories • Category: Administration • (A4) Advisory Board Records • Explanation: This series documents the activities of boards and councils, which function in an advisory capacity. Boards and councils may have as their charge highly specific or broad areas of concern and include members from outside the institution. This series may include but is not limited to meeting minutes; agendas; reports; notes; working papers; audio recordings; transcriptions; and related documentation and correspondence. • Record Copy: Institutional Archives; Colleges & Units • Retention: Permanent for minutes, agendas, reports, and correspondence; 3 years for all other records • Citation or Reference: • (A13) Correspondence, Administrative • Explanation: Series documents communications received or sent which contain significant information about an institution's programs. Records include letters sent and received, memoranda, notes, enclosures, attachments and electronic messages. • Record Copy: Units • Retention: 5 years • Citation or Reference: O.C.G.A. 9-3-26
RESULTS: Retention Categories • (A15) Correspondence, Transitory • Explanation: Series documents communications received or sent which do not contain significant information about an institution's programs (Correspondence, Administrative), fiscal status (Correspondence, Fiscal), or routine agency operations (Correspondence, General). Records include, but are not limited to, advertising circulars, drafts and worksheets, desk notes, memoranda, electronic messages, and other records of a preliminary or informational nature. • Record Copy: Units • Retention: Until read • Citation or Reference: • (A16) Correspondence, General (Routine) • Explanation: Series documents communications received or sent which do not contain significant information about an institution's programs. Records include: letters sent and received; memoranda; notes; transmittals; acknowledgments; community affair notices; charity fund drive records; routine requests for information or publications; enclosures, attachments and electronic messages. • Record Copy: Units • Retention: 5 years • Citation or Reference: O.C.G.A. 9-3-26
RESULTS: Retention Categories • (A38) Special Event Records • Explanation: This series documents the efforts of a college or unit to provide informative sessions, short-courses, workshops, training programs, excursions, and celebratory events for members of the institution and the communities it serves. This series may include but is not limited to: materials on planning and arrangements; reports; promotional and publicity materials; press releases and news clippings; photographs; presentation materials and handouts; schedules of speakers and activities; registration and attendance lists; participant evaluations; and related documentation and correspondence. • Record Copy: Creating units • Retention: 7 years after end of event • Citation or Reference: O.C.G.A. 9-3-24 • Category: Information Management & Planning • (D1) Computer System Maintenance Records • Explanation: This series documents the maintenance of the institution's computer systems and is used to ensure compliance with any warranties or service contracts, schedule regular maintenance, diagnose system or component problems, and document system backups. Records may include: computer equipment inventories; hardware performance reports; component maintenance records (invoices, warranties, maintenance logs, correspondence, maintenance reports, and related records); system backup reports; and backup tape inventories. • Record Copy: Information Technology, Units • Retention: For life of system or component for records related to system or component repair or service; until superseded for records related to regular or vital records backups
Improved Description of Retention Category • Administrative Correspondence Email • Official communication by Institutional, Departmental, and Divisional Management pertaining to the formulation, planning, implementation, interpretation, or modification of an entity's programs, services, or projects and the policies and regulations that govern them. • Examples: • Email from the President announcing the development of a new campus. • Email from the President announcing the development of a new Research Center. • Email from the CIO announcing a new Service Center and explaining the planned benefits. • Email from the Chair of Pathology announcing the development of a new course curriculum. • Email from the Purchasing Department implementing new procedures for the procurement of contract services. • Email from the Library extending service delivery hours.
Improved Description of Retention Category • General Correspondence Email • Email communication that documents an entity’s activities (institution, department, division, etc.) arising from the routine operations of policies, programs, services, or projects. • Examples: • Faculty email notifications to students documenting course assignments and due dates. • Email notification from Accounting that the monthly Ledgers have been closed. • Email transmission of a Department’s monthly report. • An email transmittal submitting an official report, where documentation is needed to prove the report was submitted timely. • Human Resources notifications regarding changes in employee benefits. • Email notification from Grants Management notifying faculty of grant filing deadlines.
RESULTS – Text Categorization • Converted the sample emails to text files. For each email in the training sample: • Used GATE to tokenize the sample. • Preprocessed the sample to remove punctuation, digits, 524 stop words (pronouns, prepositions, determiners), and infrequent terms (e.g., terms that occur 4 times or fewer in the entire corpus). • Used SVM with the six training samples to construct six classifiers. The features consisted of the 3333 tokens (words) that were the union of all the tokens in the 379 emails in the training sample, less the words removed in preprocessing. • Feature Selection: The six classifiers have positive and negative weights associated with each feature. For each classifier, selected the 100 features with the highest positive weights and the 100 features with the highest negative weights. The union of these features resulted in 686 features. • Used SVM with the same six training samples to construct six classifiers based on the 686 features. • Used the six classifiers to categorize the 198 emails not in the training sample.
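The feature-selection step might look like the following sketch, assuming linear classifiers that expose their weight vectors (e.g., the coef_ attribute of scikit-learn's LinearSVC); classifiers stands for the per-category mapping built in the training step.

```python
import numpy as np

def top_weight_features(clf, k=100):
    """Indices of the k most positive and k most negative feature
    weights of a fitted linear classifier (coef_ shape: (1, n_features))."""
    w = clf.coef_.ravel()
    order = np.argsort(w)                # ascending by weight
    return set(order[:k]) | set(order[-k:])

# Union over all six binary classifiers; the slide reports 686 surviving
# features, on which the six SVMs were then retrained.
selected = set()
for clf in classifiers.values():
    selected |= top_weight_features(clf)
```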
Analysis of Experimental Results The auto-categorization errors involved just two of the 198 sample emails in the test set.
RESULTS • Creation of good, reliable category samples is essential to the use of machine learning to create accurate classifiers. • Support vector machines can be used to create highly accurate classifiers for email. • Descriptions of records categories in the records schedule need to be enhanced with examples. • Accuracy of automatic categorization is improved by training on filing categories that are subcategories of retention categories.
RESULTS • Auto-categorization is not the complete solution to the email categorization and retention problem. The following ideas need to be investigated: • At the time of creation, tag copies of intra-organizational email with the filing category. • Limit use of classifiers to those email categories specific to an office. • Determine how to associate specific filing categories with generic retention categories. • If a person routinely creates a record in a filing category, include the category id in a template or in a pull-down menu. • Use subject line tags to facilitate categorization. • Email that is a response to a message that has already been categorized should be placed in the same category as the original and linked to that email (see the sketch below).
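For the last idea, a reply can inherit its parent's category through standard RFC 5322 threading headers. A minimal sketch, assuming a category_index store that maps Message-ID values to filing categories (a hypothetical data structure):

```python
import email

def inherit_category(raw_message: bytes, category_index: dict):
    """Return the filing category of the message this one replies to,
    or None if it is not a reply or the parent is uncategorized."""
    msg = email.message_from_bytes(raw_message)
    parent_id = msg.get("In-Reply-To")   # Message-ID of the original
    return category_index.get(parent_id) if parent_id else None
```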
References • F. Ciravegna et al. (1999). FACILE: Classifying texts integrating pattern matching and information extraction. Proceedings of IJCAI'99, Stockholm, pp. 890-895. • T. Joachims (1998). Text Categorization with Support Vector Machines: Learning with Many Relevant Features. Proceedings of the European Conference on Machine Learning (ECML), Springer, pp. 137-142. • F. Sebastiani (2002). Machine Learning in Automated Text Categorization. ACM Computing Surveys, Vol. 34, No. 1, pp. 1-47. • Records Retention Manual. Board of Regents, University System of Georgia, March 30, 2010. www.usg.edu/records_management/schedules/A/