An eRulemaking Corpus: Identifying Substantive Issues in Public Comments
Claire Cardie (CS+IS), Cynthia Farina (Law), Matt Rawding (IS), Adil Aijaz (CS)
CeRI (Cornell eRulemaking Initiative), Cornell University
Plan for the Talk
• Background
  • E-rulemaking
• CeRI FTA Grant Circulars Corpus
• Text Categorization Experiments
Rulemaking / E-Rulemaking
• Rulemaking: one of the principal methods of making regulatory policy in the US
  • ~4,000+ per year
• "Notice and comment" rulemaking: formal public participation phase
  • 10 – 500,000 comments per rule
  • Comment length: 1 sentence – tens of pages
  • Agency legally bound to respond to all substantive issues
• E-rulemaking = e-notice and e-comment
Goals of Our Current Work
• Determine the degree to which automatic issue categorization can facilitate analysis of comments by identifying and categorizing "relevant issues".
• Framed as a text categorization task: given a comment set, the automated system should determine, for each sentence in each comment, which of a group of pre-defined issue categories it raises, if any.
• Builds on the work of Kwon & Hovy (2007) and Kwon et al. (2006).
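Purely as an illustration of this task framing, and not the system described in these slides, the sketch below shows a generic multi-label, sentence-level classifier built with scikit-learn; the issue labels, sentences, and feature choices are all invented for the example.

```python
# Illustrative sketch only: a generic multi-label sentence classifier for
# issue categorization, assuming scikit-learn. Labels and data are made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Hypothetical training data: each sentence may raise zero or more issues.
sentences = [
    "The reporting burden on small transit providers is excessive.",
    "Coordination with state agencies should be clarified.",
    "We support the proposed eligibility criteria for low income riders.",
]
labels = [["administrative_burden"], ["coordination"], ["eligibility"]]

binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform(labels)

# One binary classifier per issue category; a sentence can receive several
# labels, or none ("NONE" falls out when no classifier fires).
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),
    OneVsRestClassifier(LinearSVC()),
)
model.fit(sentences, Y)

predicted = model.predict(["Please reduce the paperwork required of rural providers."])
print(binarizer.inverse_transform(predicted))
```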
Plan for the Talk
• Background
• CeRI FTA Grant Circulars Corpus
  • Difficulties
  • Interannotator agreement results
• Text Categorization Experiments
FTA Grant Circulars Rule
• Topic: guidance to public and private transportation providers applying for federal aid for elderly, disabled, and low-income persons
• 267 comments
  • shortest: 1 sentence
  • longest: 1,420 sentences
  • 11,094 sentences total
FTA Grant Circulars Issue Set
• 17 top-level issues
• 39 fine-grained issues
Difficulties for Text Categorization
• Large, hierarchical issue set
• "NONE" category
• Skewed distribution across issues
  • 87% of the sentences come from 6 categories
  • 13% of the sentences come from 33 categories
• Potentially multiple issues per sentence
• Even long sentences contain few words
• Variation in comment quality, scope, vocabulary, and form
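The skew toward a handful of dominant issues is a standard class-imbalance problem. One generic mitigation, offered only as a sketch and not as what this project did, is per-class weighting, e.g. scikit-learn's class_weight="balanced", which up-weights rare issue categories during training; the labels and counts below are invented.

```python
# Illustrative sketch: counter a skewed issue distribution with balanced class
# weights. The issue labels and counts here are invented, not the FTA corpus.
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

sentences = [
    "Reporting requirements are burdensome.",
    "Paperwork volume is excessive for small agencies.",
    "Quarterly reports duplicate existing state filings.",
    "Vehicle eligibility rules are unclear.",   # the rare category
]
labels = ["burden", "burden", "burden", "eligibility"]

print(Counter(labels))  # shows the skew: Counter({'burden': 3, 'eligibility': 1})

# class_weight="balanced" weights each class inversely to its frequency,
# so the rare "eligibility" category is not drowned out by "burden".
clf = make_pipeline(TfidfVectorizer(), LinearSVC(class_weight="balanced"))
clf.fit(sentences, labels)
print(clf.predict(["Which vehicles are eligible under the program?"]))
```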
Interannotator Agreement
• 146 comments used for the study
• 6 annotators
• 2.66 annotators per comment
• 41.5 sentences per comment
• Overlap agreement measure
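The slides do not define the overlap agreement measure, so the sketch below encodes one plausible reading as an assumption: per-sentence intersection-over-union of two annotators' issue-label sets, averaged over a comment, with an empty pair of label sets counting as agreement on NONE.

```python
# One plausible reading of an "overlap agreement measure" (not defined in the
# slides): per-sentence intersection-over-union of two annotators' label sets.
def sentence_overlap(labels_a, labels_b):
    if not labels_a and not labels_b:          # both chose NONE: full agreement
        return 1.0
    return len(labels_a & labels_b) / len(labels_a | labels_b)

def comment_overlap(annotator_a, annotator_b):
    """Average per-sentence overlap across one comment labeled by two annotators."""
    scores = [sentence_overlap(a, b) for a, b in zip(annotator_a, annotator_b)]
    return sum(scores) / len(scores)

# Hypothetical 3-sentence comment labeled by two annotators.
a = [{"funding"}, {"eligibility", "coordination"}, set()]
b = [{"funding"}, {"eligibility"}, set()]
print(comment_overlap(a, b))  # (1.0 + 0.5 + 1.0) / 3 ≈ 0.83
```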
Plan for the Talk
• Background
  • E-rulemaking
  • Public comment analysis
• CeRI FTA Grant Circulars Corpus
  • Difficulties
  • Interannotator agreement results
• Text Categorization Experiments
Standard Text Categorization Algorithms
• Fine-grained issues (39)
• Coarse-grained issues (17)
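The hierarchical method itself is described in Cardie et al. (dg.o 2008) and is not spelled out in these slides, so the following is only a generic coarse-then-fine sketch under assumed labels: classify each sentence into a coarse issue first, then refine within that issue's fine-grained children. All label names and hierarchy entries are hypothetical.

```python
# Generic sketch of a coarse-then-fine (hierarchical) categorization strategy.
# Not the method of Cardie et al. (dg.o 2008); labels and hierarchy are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical hierarchy: coarse issue -> its fine-grained sub-issues.
HIERARCHY = {
    "funding": ["funding.match_requirements", "funding.allocation"],
    "eligibility": ["eligibility.riders", "eligibility.providers"],
}

def train(sentences, labels):
    """Train one flat classifier; reused for the coarse level and each fine level."""
    clf = make_pipeline(TfidfVectorizer(), LinearSVC())
    clf.fit(sentences, labels)
    return clf

def predict_hierarchical(sentence, coarse_clf, fine_clfs):
    """First pick a coarse issue, then refine within that issue's sub-labels."""
    coarse = coarse_clf.predict([sentence])[0]
    fine = fine_clfs[coarse].predict([sentence])[0]
    return coarse, fine

# Hypothetical training sentences with coarse and fine labels.
data = [
    ("The local match requirement is too high for rural systems.", "funding", "funding.match_requirements"),
    ("Formula allocation should account for ridership growth.", "funding", "funding.allocation"),
    ("Eligibility for low income riders needs a clearer definition.", "eligibility", "eligibility.riders"),
    ("Private operators should also be eligible applicants.", "eligibility", "eligibility.providers"),
]
sents = [s for s, c, f in data]
coarse_clf = train(sents, [c for s, c, f in data])
fine_clfs = {
    c: train([s for s, cc, f in data if cc == c], [f for s, cc, f in data if cc == c])
    for c in HIERARCHY
}
print(predict_hierarchical("Reduce the match requirement for small providers.", coarse_clf, fine_clfs))
```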
Gold Standard Data Set
• Simulate the agency comment analysis process
  • One analyst per rule
• Six data sets
  • One data set per annotator
Progress and Plans
• Promising initial results for rule-specific issue categorization of public comments
• Annotate comments for more rules
  • Expert (rulewriter) vs. law student annotation
• Integrate automatic text categorization into the annotation interface
  • Active learning (Purpura, Cardie & Simons, dg.o 2008)
  • Collaboration with HCI colleagues in InfoSci
The End
For more on:
• the hierarchical text categorization method: Cardie et al. (dg.o 2008)
• a new structural learning approach for hierarchical classification: Purpura et al. (in preparation)
• active learning methods for hierarchical text categorization: Purpura, Cardie & Simons (dg.o 2008)
Minimizing the Costliest Errors*
*Underinclusive errors (failing to flag a substantive issue that a comment raises) are the most costly, since the agency is legally bound to respond to all substantive issues.
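One generic way to bias a per-issue binary classifier away from underinclusive (false-negative) errors, shown only as an assumption-laden sketch rather than the approach taken in these slides, is to lower its decision threshold below the default of zero so that borderline sentences are still flagged for a human analyst.

```python
# Illustrative only: trade precision for recall by lowering the per-issue
# decision threshold, so fewer relevant sentences are missed (underinclusive errors).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical binary task: does a sentence raise the "funding" issue?
sentences = [
    "The local match requirement is a hardship for rural providers.",
    "Thank you for the opportunity to comment.",
    "Allocation formulas should reflect actual ridership.",
    "We appreciate the agency's outreach efforts.",
]
raises_funding = [1, 0, 1, 0]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(sentences, raises_funding)

# The default decision threshold is 0; a negative threshold flags more
# sentences, reducing false negatives at the cost of more false positives
# that a human analyst can then screen out.
THRESHOLD = -0.25
scores = clf.decision_function(["The match requirement should be reduced."])
flagged = scores > THRESHOLD
print(scores, flagged)
```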