Mapping Regulations to Industry-Specific Taxonomies
Chin Pang Cheng, Gloria T. Lau, Kincho H. Law
Engineering Informatics Group, Stanford University
June 5, 2007
Motivating Problem
To Legal Practitioners, regulations are:
• Hierarchical, well-structured
• Precise and concise
• Organized in systems they are familiar with
To Industry Practitioners, regulations are:
• Voluminous
• Difficult to read without legal training
• Less familiar than industry-specific terminology and classification structures
Mapping Regulations to Taxonomies
• Possible cases:
  • One-Taxonomy-One-Regulation
  • One-Taxonomy-N-Regulation
  • N-Taxonomy-One-Regulation
  • N-Taxonomy-N-Regulation
One-Taxonomy-One-Regulation
• Simple keyword latching task
• Stemming (e.g. piling → pile, disabled → disable)
• Word interval: the words of a concept may be separated by intervening words in the regulation text
  • Concept: “fire alarm system”
  • Regulation: “… fire alarm and detection system …”
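The latching step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `naive_stem` is a crude, hypothetical suffix stripper standing in for a real stemmer, and `max_gap` is a hypothetical name for the allowed word interval; neither comes from the slides.

```python
import re


def naive_stem(word: str) -> str:
    """Crude suffix stripper for illustration only (not a real stemmer)."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    if word.endswith("e") and len(word) > 3:
        return word[:-1]
    return word


def count_matches(concept: str, text: str, max_gap: int = 2) -> int:
    """Count occurrences of a concept in text, allowing up to max_gap
    intervening words between consecutive concept words (assumed rule)."""
    concept_stems = [naive_stem(w) for w in re.findall(r"[A-Za-z]+", concept)]
    text_stems = [naive_stem(w) for w in re.findall(r"[A-Za-z]+", text)]
    if not concept_stems:
        return 0
    count = 0
    for start, stem in enumerate(text_stems):
        if stem != concept_stems[0]:
            continue
        pos = start
        matched = True
        for target in concept_stems[1:]:
            # Look at the next word plus up to max_gap words beyond it.
            window = text_stems[pos + 1 : pos + 2 + max_gap]
            if target in window:
                pos = pos + 1 + window.index(target)
            else:
                matched = False
                break
        if matched:
            count += 1
    return count


section = "... fire alarm and detection system ..."
print(count_matches("fire alarm system", section))             # gap allowed
print(count_matches("fire alarm system", section, max_gap=0))  # exact phrase only
```

With a word interval of two, the concept latches onto the regulation text even though "and detection" sits between "alarm" and "system"; with no interval allowed it does not.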
Inverted Regulations
• Each taxonomy concept is hyperlinked
• “No Matched Sections” is shown for non-matched OmniClass concepts
• Other matched related concepts in that section can also be seen
One-Taxonomy-N-Regulation
[Figure: Alabama (AL) regulation; Arizona (AZ) regulation]
One Regulation as the Base
[Figure labels: (AL), (AZ)]
Similarity Comparison on Sections
Core from Lau, Law and Wiederhold (2005):
• Feature extraction (e.g. concepts, measurements)
• Comparison of shared features
• Consideration of hierarchical and referential information
[Figure: nodes A and U in comparison, drawn with their parent, child, sibling, and reference nodes; labels include psc(A), psc(U), ref(U), s-psc, s-ref, psc-psc; panels: AL regulation, AZ regulation]
G. Lau, K. Law and G. Wiederhold. “Legal Information Retrieval and Application to E-Rulemaking,” in Proceedings of the 10th International Conference on Artificial Intelligence and Law (ICAIL 2005), Bologna, Italy, pp. 146-154, June 6-11, 2005.
Inclusion of Regulation Hierarchy
• Terminological differences between regulations are revealed by including neighboring sections in the comparison
N-Taxonomy-One-Regulation
• Multiple taxonomies exist in a single industry
• Translation is unavoidable
• E.g. in the architectural, engineering and construction (AEC) industry:
  • Industry Foundation Classes (IFC)
  • CIMsteel Integration Standards (CIS/2)
  • Automating Equipment Information Exchange (AEX)
  • UniFormat™, MasterFormat™
  • etc.
• Possible solution: merging taxonomies
[Figure label: Unfamiliar taxonomy]
Proposed Methodology of Taxonomy Mapping
• Taxonomy mapping:
  • Mainly done manually nowadays
  • Usually term matching (e.g. fire → fire alarm)
[Figure: concepts from taxonomies T1 and T2 (sprinkler system, water flow, orifice, alarm, fire alarm system, fire) shown against the regulation section below]
[F] 903.4.2 Alarms. Approved audible devices shall be connected to every automatic sprinkler system. Such sprinkler water-flow alarm devices shall be activated by water flow equivalent to the flow of a single sprinkler of the smallest orifice size installed in the system. Alarm devices shall be provided on the exterior of the building in an approved location. Where a fire alarm system is installed, actuation of the automatic sprinkler system shall actuate the building fire alarm system.
Demonstration in Construction Industry
• Taxonomy 1: OmniClass
• Taxonomy 2: ifcXML
• Knowledge corpus: International Building Code (IBC)
• Corpus: carefully selected (in the same domain)
[Figure labels: IfcSlab, steel]
Relatedness Analysis on Concepts
Notation:
• A pool of m concepts for a taxonomy
• A corpus of N regulation sections
• The frequency vector $\vec{c}_i$ is an N-by-1 vector storing the occurrence frequencies of concept i among the N documents (sections)
• The frequency matrix C is an N-by-m matrix in which the i-th column vector is $\vec{c}_i$
Example (m = 4, N = 5): C is a 5-by-4 matrix; the entry $C_{4,3} = 3$ means that Concept 3 is matched to Section 4 three times.
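As a rough illustration of this notation, the sketch below builds a frequency matrix for m = 4 concepts and N = 5 sections. The concept list and section texts are made up for the example, and matching is reduced to a plain substring count, which is an assumption rather than the matching procedure used in the presentation.

```python
def frequency_matrix(concepts, sections):
    """Return C as N rows of m counts: C[k][i] is how many times
    concept i occurs in section k (plain substring count, assumed)."""
    return [
        [section.lower().count(concept.lower()) for concept in concepts]
        for section in sections
    ]


def frequency_vector(C, i):
    """Column i of C: occurrences of concept i across all N sections."""
    return [row[i] for row in C]


# Hypothetical data, chosen so that concept 3 matches section 4 three times.
concepts = ["sprinkler", "orifice", "fire alarm system", "alarm"]
sections = [
    "Every automatic sprinkler system shall have an alarm.",
    "The orifice size of the smallest sprinkler governs.",
    "Exit signs shall be illuminated.",
    "A fire alarm system is required; the fire alarm system shall be "
    "tested, and the fire alarm system shall be monitored.",
    "An alarm shall sound on every floor.",
]

C = frequency_matrix(concepts, sections)
for row in C:
    print(row)
print(frequency_vector(C, 2))  # frequency vector of concept 3 (0-based index 2)
```

Note that Python indices are 0-based, so the slide's "Concept 3, Section 4" corresponds to `C[3][2]` here.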
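Cosine Similarity Measure
• Common arithmetic measure of similarity for comparing documents in text mining
• Finds the angle between two frequency vectors in N dimensions, $\vec{c}_i$ and $\vec{c}_j$, from Taxonomy 1 and Taxonomy 2 respectively
• Similarity score lies in [0, 1]
• Expressed with the dot product and magnitudes, the similarity score is given by:
$$\mathrm{Sim}(i, j) = \cos\theta = \frac{\vec{c}_i \cdot \vec{c}_j}{\lVert \vec{c}_i \rVert \, \lVert \vec{c}_j \rVert}$$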
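A minimal sketch of the cosine score for two frequency vectors, using only the standard library. Returning 0.0 when a concept never occurs in the corpus (a zero vector) is an assumption made here to avoid division by zero; the slides do not say how that case is handled.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length frequency vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same number of sections")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention for a concept with no matches
    return dot / (norm_u * norm_v)


# Hypothetical frequency vectors over N = 5 sections.
c_i = [1, 0, 2, 0, 0]
c_j = [2, 0, 4, 0, 0]
c_k = [0, 3, 0, 0, 1]
print(cosine_similarity(c_i, c_j))  # same direction
print(cosine_similarity(c_i, c_k))  # no shared sections
```

Because frequencies are non-negative, the score cannot fall below zero, which is why the range is [0, 1] rather than [-1, 1].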
Jaccard Similarity Coefficient
• Statistical measure of the extent of overlap between two vectors in N dimensions, $\vec{c}_i$ and $\vec{c}_j$, from Taxonomy 1 and Taxonomy 2
• Defined as the size of the intersection divided by the size of the union of the vector dimension sets:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
• For concept relatedness analysis:
$$\mathrm{Sim}(i, j) = \frac{N_{11}}{N_{11} + N_{10} + N_{01}}$$
where
  • $N_{11}$ = number of sections both concepts i and j are matched to
  • $N_{10}$ = number of sections concept i is matched to but not concept j
  • $N_{01}$ = number of sections concept j is matched to but not concept i
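The Jaccard coefficient only asks whether a concept is matched to a section, not how often, so the sketch below first reduces each frequency to matched or not matched. Returning 0.0 when neither concept is matched anywhere is an assumed convention.

```python
def jaccard_similarity(u, v):
    """N11 / (N11 + N10 + N01) for two equal-length frequency vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same number of sections")
    n11 = sum(1 for a, b in zip(u, v) if a > 0 and b > 0)
    n10 = sum(1 for a, b in zip(u, v) if a > 0 and b == 0)
    n01 = sum(1 for a, b in zip(u, v) if a == 0 and b > 0)
    union = n11 + n10 + n01
    if union == 0:
        return 0.0  # assumed convention: neither concept matched anywhere
    return n11 / union


# Hypothetical frequency vectors over N = 5 sections.
c_i = [1, 0, 3, 0, 2]
c_j = [0, 0, 1, 4, 1]
print(jaccard_similarity(c_i, c_j))  # N11 = 2, N10 = 1, N01 = 1
```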
Market Basket Model
• Probabilistic measure for finding item-item correlation, used in data mining
• Two main elements: (1) a set of items; (2) a set of baskets
• The association rule $\{i_1, i_2, \ldots, i_k\} \rightarrow j$ means that a basket containing all the items $i_1, \ldots, i_k$ is very likely to contain item j
• Confidence of a rule = $P(j \mid i_1, i_2, \ldots, i_k)$
• Interest of a rule = $\lvert P(j \mid i_1, i_2, \ldots, i_k) - P(j) \rvert$
• Example:
  • Coca-Cola → Pepsi: low confidence but high interest
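Market Basket Model (cont’d)
• For concept relatedness analysis:
  • $N_{11}$ = number of sections both concepts i and j are matched to
  • $N_{01}$ = number of sections concept j is matched to but not concept i
  • $N_{10}$ = number of sections concept i is matched to but not concept j
  • $N_{00}$ = number of sections both concepts i and j are NOT matched to
• Probability of concept j is
$$P(j) = \frac{N_{11} + N_{01}}{N_{11} + N_{10} + N_{01} + N_{00}}$$
• Confidence of the association rule $i \rightarrow j$ is
$$\mathrm{Conf}(i \rightarrow j) = \frac{N_{11}}{N_{11} + N_{10}}$$
• Forward similarity of concepts i and j is the interest of the rule:
$$\mathrm{Sim}(i \rightarrow j) = \left| \frac{N_{11}}{N_{11} + N_{10}} - \frac{N_{11} + N_{01}}{N_{11} + N_{10} + N_{01} + N_{00}} \right|$$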
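Asymmetry of Market Basket Model
• The market basket model is asymmetric: swapping the roles of i and j changes the score
• Forward similarity:
$$\mathrm{Sim}(i \rightarrow j) = \left| \frac{N_{11}}{N_{11} + N_{10}} - \frac{N_{11} + N_{01}}{N_{11} + N_{10} + N_{01} + N_{00}} \right|$$
• Backward similarity:
$$\mathrm{Sim}(j \rightarrow i) = \left| \frac{N_{11}}{N_{11} + N_{01}} - \frac{N_{11} + N_{10}}{N_{11} + N_{10} + N_{01} + N_{00}} \right|$$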
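The asymmetry can be seen with a small sketch that treats regulation sections as baskets and concepts as items. It takes interest to be the absolute difference between the rule's confidence and the probability of the consequent, which is one common definition and an assumption here; returning 0.0 when the antecedent concept is never matched is also an assumed convention.

```python
def contingency(u, v):
    """Count sections by match pattern: returns (N11, N10, N01, N00)."""
    n11 = n10 = n01 = n00 = 0
    for a, b in zip(u, v):
        if a > 0 and b > 0:
            n11 += 1
        elif a > 0:
            n10 += 1
        elif b > 0:
            n01 += 1
        else:
            n00 += 1
    return n11, n10, n01, n00


def forward_similarity(u, v):
    """Interest of the rule i -> j, with u and v the frequency vectors
    of concepts i and j (assumed: |confidence - P(j)|)."""
    n11, n10, n01, n00 = contingency(u, v)
    total = n11 + n10 + n01 + n00
    if total == 0 or n11 + n10 == 0:
        return 0.0  # assumed convention: rule undefined, score zero
    confidence = n11 / (n11 + n10)
    prob_j = (n11 + n01) / total
    return abs(confidence - prob_j)


def backward_similarity(u, v):
    """Interest of the reverse rule j -> i."""
    return forward_similarity(v, u)


# Hypothetical vectors over N = 6 sections: concept j appears only where
# concept i does, but concept i also appears elsewhere.
c_i = [1, 1, 1, 0, 0, 0]
c_j = [1, 0, 0, 0, 0, 0]
print(forward_similarity(c_i, c_j))   # |1/3 - 1/6|
print(backward_similarity(c_i, c_j))  # |1 - 1/2|
```

In this example the backward score is three times the forward score, because every section that mentions concept j also mentions concept i, while the reverse does not hold.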
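Evaluation of Accuracy
• Root Mean Square Error (RMSE):
  • Difference between the true values and the predicted values
  • For Taxonomy 1 of m concepts and Taxonomy 2 of n concepts:
$$\mathrm{RMSE} = \sqrt{\frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} \left( \mathrm{true}_{ij} - \mathrm{predicted}_{ij} \right)^2}$$
• Precision:
  • Fraction of predictions that are correct
$$\mathrm{Precision} = \frac{\text{number of correct predictions}}{\text{number of predictions}}$$
• Recall:
  • Fraction of correct matches that are predicted
$$\mathrm{Recall} = \frac{\text{number of correct predictions}}{\text{number of correct matches}}$$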
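The three measures can be sketched as follows for an m-by-n table of true values and an m-by-n table of predicted similarity scores. Treating a pair as "predicted" when its score reaches a cutoff is an assumption for illustration, as are the `threshold` parameter and the sample numbers; the slides do not state how scores are turned into predictions.

```python
import math


def rmse(true, predicted):
    """Root mean square error over all m * n concept pairs."""
    m, n = len(true), len(true[0])
    total = sum(
        (true[i][j] - predicted[i][j]) ** 2 for i in range(m) for j in range(n)
    )
    return math.sqrt(total / (m * n))


def precision_recall(true, predicted, threshold=0.5):
    """Precision and recall, calling a pair 'predicted' when its score is
    at least threshold (hypothetical cutoff) and 'correct' when true is 1."""
    m, n = len(true), len(true[0])
    pairs = [(i, j) for i in range(m) for j in range(n)]
    predicted_pairs = {p for p in pairs if predicted[p[0]][p[1]] >= threshold}
    true_pairs = {p for p in pairs if true[p[0]][p[1]] == 1}
    correct = len(predicted_pairs & true_pairs)
    precision = correct / len(predicted_pairs) if predicted_pairs else 0.0
    recall = correct / len(true_pairs) if true_pairs else 0.0
    return precision, recall


# Hypothetical 2-by-2 example: rows are Taxonomy 1 concepts,
# columns are Taxonomy 2 concepts.
true_matches = [[1, 0], [0, 1]]
scores = [[0.8, 0.6], [0.4, 0.3]]
print(rmse(true_matches, scores))
print(precision_recall(true_matches, scores))
```

Here two pairs are predicted and one of them is right, and one of the two true matches is found, so precision and recall both come out at one half.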
Evaluation Results
• 20 concepts from OmniClass, 20 concepts from ifcXML
• Cosine Similarity:
  • Middle performer among the three metrics
• Jaccard Similarity:
  • NOT preferred (unacceptably low recall, though high precision)
• Market Basket Model:
  • Preferred (lowest RMSE, highest recall)
Conclusion
• Mapping industry-specific taxonomies to regulations allows industry practitioners to retrieve regulations faster
• Four cases:
  • 1-Taxonomy-1-Regulation: simple keyword latching
  • 1-Taxonomy-N-Regulation: hierarchy of regulation sections considered
  • N-Taxonomy-1-Regulation: three similarity analysis metrics introduced (cosine similarity, Jaccard similarity, market basket model)
  • N-Taxonomy-N-Regulation: future step