
REGNET

REGNET. A Comparative Analysis Framework For Semi-Structured Documents, With Applications To Government Regulations. Gloria Lau Engineering Informatics Group, Stanford University May 14th, 2004. ADAAG in HTML. UK DDA in HTML. IBC in PDF. Motivation. Multiple sources of regulations


Presentation Transcript


  1. REGNET A Comparative Analysis Framework For Semi-Structured Documents, With Applications To Government Regulations Gloria Lau Engineering Informatics Group, Stanford University May 14th, 2004

  2. Motivation • Multiple sources of regulations • Multiple jurisdictions: federal, state, local, etc. • Different formats, terminologies, contexts • Amending rules, conflicting ideas (Figures: ADAAG in HTML, UK DDA in HTML, IBC in PDF)

  3. Motivation • Multiple sources of regulations • Multiple jurisdictions: federal, state, local, etc. • Different formats, terminologies, contexts • Amending rules, conflicting ideas → Need for a repository • Locate relevant information • E.g., small business: penalty fees for violations → Need for analysis tool • Complexity of regulations • Multiple jurisdictions • Understanding of regulations & their relationships

  4. Example 1: Related Provisions ADAAG Appendix 4.6.3 … Such a curb ramp opening must be located within the access aisle boundaries, not within the parking space boundaries. CBC 1129B.4.3 … Ramps shall not encroach into any parking space. Exception: 1. Ramps located at the front of accessible parking spaces may encroach into the length of such spaces … • CBC allows curb ramps encroaching into accessible parking stall access aisles, while ADA disallows encroachment into any portion of the stall.

  5. Example 2: Related but Conflicting Provisions ADAAG 4.7.2 Slope. …Transitions from ramps to walks, gutters, or streets shall be flush and free of abrupt changes… CBC 1127B.5.5 Beveled lip. The lower end of each curb ramp shall have a ½ inch (13mm) lip beveled at 45 degrees as a detectable way-finding edge for persons with visual impairments. • ADAAG focuses on wheelchair traversal; CBC focuses on the visually impaired when using a cane.

  6. Scope • Repository development • Relatedness analysis • Performance evaluation, results and applications

  7. Repository development

  8. Sources of data • Accessibility standards • Americans with Disabilities Act Accessibility Guide (ADAAG) • Drafted chapter for rights-of-way access • Associated public comments • Uniform Federal Accessibility Standards (UFAS) • British Standard BS 8300 • Scottish Technical Standards, Part S • International Building Code (IBC), Chapter 11 • Drinking water standards • Code of Federal Regulations, Title 40 (40 CFR) • California Code of Regulations, Title 22 (22 CCR) • Fire code • International Building Code (IBC), Chapter 9

  9. Computational properties of regulations • Hierarchical tree structure • Referential structure • Discipline-centered, e.g., ADAAG for accessibility → Shallow parser to capture computational properties

  10. Digital publication of regulations • Current standard: HTML, PDF, plain text... • Our system standard: XML • Recreate regulatory structure • Unit of extraction: section/provision • Extract references • Extract features
      <regulation id="ibc" name="international building code" type="private">
        <regElement id="ibc.1107" name="special occupancies">
          …
          <regElement id="ibc.1107.2" name="assembly area seating">
            <reference id="ibc.1107.2.4.1" times="1" />
            <concept name="assembl area" times="1" />
            …
            <regText>Assembly areas with fixed seating shall comply … </regText>
            <regElement id="ibc.1107.2.1" name="services">...</regElement>
            <regElement id="ibc.1107.2.2" name="wheelchair …">...</regElement>
          </regElement>
        </regElement>
      </regulation>
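To make the tree and referential structure above concrete, here is a minimal Python sketch that walks such an XML file and lists each provision with its references and concept counts. The sample fragment is a simplified, hypothetical copy of the slide's example (elided parts are left out, not filled in), and the `walk` helper is an illustrative name, not part of REGNET.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment modeled on the slide's example.
SAMPLE = """
<regulation id="ibc" name="international building code" type="private">
  <regElement id="ibc.1107" name="special occupancies">
    <regElement id="ibc.1107.2" name="assembly area seating">
      <reference id="ibc.1107.2.4.1" times="1" />
      <concept name="assembl area" times="1" />
      <regText>Assembly areas with fixed seating shall comply</regText>
      <regElement id="ibc.1107.2.1" name="services" />
    </regElement>
  </regElement>
</regulation>
"""


def walk(element, depth=0):
    """Yield (depth, id, references, concepts) for each regElement, in document order."""
    for child in element.findall("regElement"):
        refs = [r.get("id") for r in child.findall("reference")]
        concepts = {c.get("name"): int(c.get("times")) for c in child.findall("concept")}
        yield depth, child.get("id"), refs, concepts
        # Recurse into nested provisions to recover the hierarchy.
        yield from walk(child, depth + 1)


provisions = list(walk(ET.fromstring(SAMPLE)))
for depth, section_id, refs, concepts in provisions:
    print("  " * depth + section_id, refs, concepts)
```

Nesting `regElement` inside `regElement` means the section hierarchy is the XML hierarchy itself, so the depth of a provision falls out of the recursion without any extra bookkeeping.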

  11. Shallow parser: feature extraction • Combination of handcrafted rules and software tools • Generic features • Concepts - noun phrases • Exceptions - negated provisions • Definitions - terminologies defined in regulations • Domain-specific features • Non-structural characteristics specific to a corpus • To aid user retrieval of relevant materials • For analysis purpose: domain knowledge • Glossary terms - definitions from reference guides • Author-prescribed indices - concepts from field handbooks • Measurements - e.g., 2 inches max, 4 ppm • Chemicals - list of drinking water contaminants from EPA • Effective dates - provision updates

  12. Example of indexTerm, concept, measurement & exception features
      Original Section 4.6.3 from the UFAS: 4.6.3* PARKING SPACES. Parking spaces for disabled people shall be at least 96 in (2440 mm) wide and shall have an adjacent access aisle 60 in (1525 mm) wide minimum (see Fig. 9). Parking access aisles shall be part of ... EXCEPTION: … an adjacent access aisle at least 96 in (2440 mm) wide complying with 4.5...
      Refined Section 4.6.3 in XML format:
      <regElement name="ufas.4.6.3" title="parking spaces" asterisk="1">
        <concept name="access aisl" num="3" />
        …
        <indexTerm name="park space" num="4" />
        <measurement unit="inch" magnitude="96" quantifier="min" />
        <ref name="ufas.4.5" num="1" />
        <regText> Parking spaces for disabled people shall ... </regText>
        <exception> If accessible parking spaces for ... </exception>
      </regElement>
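A rough Python sketch of how a measurement feature like the one above might be pulled from provision text with a handcrafted rule. The regular expression, the unit table and the quantifier cues are all assumptions made for illustration; the slides do not show the actual rules used by the shallow parser.

```python
import re

# Hypothetical handcrafted rule: optional leading cue, magnitude, unit,
# optional metric equivalent in parentheses, optional trailing cue.
PATTERN = re.compile(
    r"(?P<pre>at least |at most )?"
    r"(?P<magnitude>\d+(?:\.\d+)?) (?P<unit>in|mm|ft|ppm)\b"
    r"(?: \([^)]*\))?(?: wide)?"
    r"(?P<post> minimum| maximum| min| max)?"
)

UNIT_NAMES = {"in": "inch", "mm": "millimeter", "ft": "foot", "ppm": "ppm"}
QUANTIFIERS = {
    "at least": "min", "minimum": "min", "min": "min",
    "at most": "max", "maximum": "max", "max": "max",
}


def extract_measurements(text):
    """Return one dict per measurement, mirroring the <measurement> attributes."""
    results = []
    for match in PATTERN.finditer(text):
        cue = (match.group("pre") or match.group("post") or "").strip()
        results.append({
            "unit": UNIT_NAMES[match.group("unit")],
            "magnitude": match.group("magnitude"),
            "quantifier": QUANTIFIERS.get(cue),  # None when no cue is present
        })
    return results


TEXT = ("Parking spaces for disabled people shall be at least 96 in (2440 mm) wide "
        "and shall have an adjacent access aisle 60 in (1525 mm) wide minimum (see Fig. 9).")
measurements = extract_measurements(TEXT)
print(measurements)
```

The metric equivalent in parentheses is consumed by the same match, so "96 in (2440 mm)" yields one measurement rather than two.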

  13. Scope • Repository development • Relatedness analysis • Performance evaluation, results and applications

  14. Relatedness analysis ADAAG 4.1.6(3)(d) Doors (i) Where it is technically infeasible to comply with clear opening width requirements of 4.13.5, a projection ... UFAS 4.14.1 Minimum Number Entrances required to be accessible by 4.1 shall be part of an accessible route and shall comply with ... Related elements: door and entrance

  15. Relatedness analysis • To utilize the computational properties of regulations for a complete comparison • Measure • Degree of relatedness: similarity score $f(A, U) \in (0, 1)$ • Nodes A and U are provisions from two different regulation trees

  16. Base score $f_0$ computation • Linear combination of feature matching: $f_0(A, U) = \sum_{i=1}^{N} w_i \, F(A, U, i)$ • F(A,U,i) = similarity score between Sections (A,U) based on feature i • N = total number of features • $w_i$ = weighting coefficient of feature i • Feature matching • Based on the vector model using cosine similarity as the distance between feature vectors • Similarity between two documents M and N = $\frac{\vec{d}_M \cdot \vec{d}_N}{|\vec{d}_M| \, |\vec{d}_N|}$ • $\vec{d}_M$ and $\vec{d}_N$ are document vectors • i = concept feature • Concept vectors are formed per provision based on concept frequency in each provision • F(provision M, provision N, i=concept) = cosine between 2 concept vectors
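The base score described above can be sketched in a few lines of Python: one cosine per feature, then a weighted sum. The feature names, frequency counts and equal weights below are made-up illustration values, and `base_score` is a hypothetical helper name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse frequency vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0) for key, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a provision without this feature matches nothing
    return dot / (norm_u * norm_v)


def base_score(features_a, features_u, weights):
    """Linear combination of per-feature cosine scores."""
    return sum(
        weight * cosine(features_a.get(name, {}), features_u.get(name, {}))
        for name, weight in weights.items()
    )


# Illustrative (made-up) feature vectors for two provisions A and U.
section_a = {"concept": {"access aisl": 3, "park space": 4},
             "measurement": {"96 inch min": 1}}
section_u = {"concept": {"park space": 2},
             "measurement": {"96 inch min": 1}}
weights = {"concept": 0.5, "measurement": 0.5}  # assumed weighting coefficients

score = base_score(section_a, section_u, weights)
print(round(score, 3))
```

If the weights sum to 1 and the frequencies are non-negative, each cosine lies in [0, 1] and so does the combined score.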

  17. Axis dependency: non-Boolean matching • Vector model assumes mutual independence between axes • Domain experts do not necessarily agree • A measurement of “2 inches max” can be a 70% match to “2 inches” • Synonyms exist, e.g., ontology defined for chemicals • Limitation observed • Need flexibility to model domain knowledge, such as a 0, 50%, 75% and 100% measurement match:

  18. Proposed non-Boolean matching model • Define a feature matching matrix E • $E_{ij}$ = % match between features i and j • E.g., a 3-dimensional vector space using “2 ppm”, “2 ppm max” and “2 ft” as the first, second and third measurement axes (matrix E shown on slide) • Vector space transformation before cosine computation • Map feature vectors onto an alternate space to form consolidated frequency vectors • E.g., based on measurement features • Cosine similarity = $\frac{\vec{d}_M^{\,T} E \, \vec{d}_N}{\sqrt{\vec{d}_M^{\,T} E \, \vec{d}_M} \, \sqrt{\vec{d}_N^{\,T} E \, \vec{d}_N}}$
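A small Python sketch of the non-Boolean matching idea: the cosine is computed through a feature matching matrix E instead of assuming independent axes. The entries of E below are hypothetical (the slide's matrix is not in the transcript); a 75% match between "2 ppm" and "2 ppm max" is assumed purely for illustration.

```python
import math


def quad(x, matrix, y):
    """Return x^T * matrix * y for plain Python lists."""
    return sum(x[i] * matrix[i][j] * y[j]
               for i in range(len(x)) for j in range(len(y)))


def matched_cosine(m, n, matching):
    """Cosine between m and n after accounting for partial matches between axes."""
    denominator = math.sqrt(quad(m, matching, m) * quad(n, matching, n))
    return quad(m, matching, n) / denominator if denominator else 0.0


AXES = ["2 ppm", "2 ppm max", "2 ft"]
# Hypothetical matching matrix: axes 0 and 1 are a 75% match, "2 ft" matches neither.
E = [[1.0, 0.75, 0.0],
     [0.75, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
IDENTITY = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0]]

m = [1, 0, 0]  # provision mentioning "2 ppm" once
n = [0, 1, 0]  # provision mentioning "2 ppm max" once

boolean_score = matched_cosine(m, n, IDENTITY)  # plain vector model
partial_score = matched_cosine(m, n, E)         # non-Boolean matching
print(boolean_score, partial_score)
```

With the identity matrix this reduces to the ordinary cosine, so the Boolean vector model is the special case in which no two axes are related.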

  19. Score refinements based on regulation structure • Neighbor inclusion • Diffusion of similarity between clusters of nodes in the tree • Self vs. parent-sibling-child (psc), $f_{s\text{-}psc}$ • psc vs. psc, $f_{psc\text{-}psc}$

  20. Neighbor inclusion: psc vs. psc • Take a linear combination of neighboring pair scores • Formulate a neighbor structure matrix N • Define score matrix $\Pi$ • We have $\Pi_{psc\text{-}psc} = N_A \Pi_0 N_U^T$

  21. Neighbor inclusion: self vs. psc • Take a linear combination of neighbor vs. self scores • Formulate a neighbor structure matrix N • Define score matrix $\Pi$ • We have $\Pi_{s\text{-}psc} = \frac{1}{2} (\Pi_0 N_U^T + N_A \Pi_0)$
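The two neighbor-inclusion products can be tried out on a toy case. The Python sketch below uses plain nested lists and assumes two regulations of two sections each, where each section's only neighbor is the other section; all matrices are made-up illustration values.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def psc_psc(n_a, pi_0, n_u):
    """Neighbors vs. neighbors: N_A * Pi_0 * N_U^T."""
    return matmul(matmul(n_a, pi_0), transpose(n_u))


def s_psc(n_a, pi_0, n_u):
    """Self vs. neighbors: 1/2 * (Pi_0 * N_U^T + N_A * Pi_0)."""
    left = matmul(pi_0, transpose(n_u))
    right = matmul(n_a, pi_0)
    return [[0.5 * (l + r) for l, r in zip(l_row, r_row)]
            for l_row, r_row in zip(left, right)]


# Toy neighbor structure: in each regulation, sections 0 and 1 are each other's neighbor.
N_A = [[0.0, 1.0], [1.0, 0.0]]
N_U = [[0.0, 1.0], [1.0, 0.0]]
# Toy base scores: only the pair (A0, U0) matches.
PI_0 = [[1.0, 0.0], [0.0, 0.0]]

pi_psc_psc = psc_psc(N_A, PI_0, N_U)
pi_s_psc = s_psc(N_A, PI_0, N_U)
print(pi_psc_psc)
print(pi_s_psc)
```

In this toy case the pair (A1, U1) has a base score of 0 but a psc-psc score of 1, because its neighbors (A0, U0) match exactly; that is the diffusion effect the slides describe.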

  22. Score refinements based on regulation structure • Reference distribution • Diffusion of similarity between referencing nodes and referenced nodes in the tree • E.g., f(A5.3, U6.4(a)) updates f(A2.1, U3.3)

  23. Reference distribution: s-ref and ref-ref • Take a linear combination of reference vs. self and reference vs. reference scores • Formulate a reference structure matrix R • Define score matrix $\Pi$ • We have $\Pi_{ref\text{-}ref} = R_A \Pi_0 R_U^T$ and $\Pi_{s\text{-}ref} = \frac{1}{2} (\Pi_0 R_U^T + R_A \Pi_0)$

  24. Final score: linear combination of the $\Pi$'s • Weighted by a structural weighting coefficient

  25. Scope • Repository development • Relatedness analysis • Performance evaluation, results and applications

  26. Performance evaluation • Conduct a user survey of rankings of similarity • 10 randomly chosen sections from the ADAAG and UFAS • Ranks 1 to 100 in the order of relevance • Root mean square error (RMSE) between the user-generated ranking vector and the machine-predicted ranking vector
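As an illustration of the evaluation metric, the Python sketch below computes an RMSE between two ranking vectors. It assumes the standard definition (square root of the mean squared rank difference), since the slide's own formula is not in the transcript; the rankings are made-up values.

```python
import math


def rmse(user_ranks, machine_ranks):
    """Root mean square error between two equally long ranking vectors."""
    if len(user_ranks) != len(machine_ranks):
        raise ValueError("ranking vectors must have the same length")
    squared = [(u - m) ** 2 for u, m in zip(user_ranks, machine_ranks)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up example: the machine swaps the first two of four ranked pairs.
user = [1, 2, 3, 4]
machine = [2, 1, 3, 4]
error = rmse(user, machine)
print(round(error, 4))
```

A lower RMSE means the machine-predicted ranking sits closer to the survey ranking; identical rankings give 0.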

  27. Survey results - Tabulated RMSEs • Compared our analysis to Latent Semantic Indexing (LSI) • Parameters varied: structural weighting coefficient and feature weighting coefficient • Average RMSE smaller than LSI • Measurement feature performs best • No improvement in result observed for structural comparison

  28. Results of comparisons: ADAAG vs. UFAS • Related accessible elements: door and entrance • No ontological information • Neighbor inclusion reveals higher similarity • Content of neighbors implies similarity between Section 4.1.6(3)(d) in ADAAG and Section 4.14.1 in UFAS

  29. Results of comparisons: UFAS vs. BS8300 • Terminological differences - revealed through neighbor inclusion

  30. Results of comparisons: 40CFRdw vs. 22CCRdw • Top ranked: Almost identical provisions, change of enforcing agency

  31. Results of comparisons: 40CFRdw vs. 22CCRdw • Use of ontological information • 40 CFR uses chemical acronyms, e.g., TTHM • 22 CCR spells out “total trihalomethanes”

  32. Application: e-rulemaking • Application domain: e-rulemaking • Comparison between draft of rules and the associated public comments • ADAAG Chapter 11, rights-of-way draft • Less than 15 pages • Over 1400 public comments received within 4 months • Comments ~10 MB in size; most are several pages long → A new regulation draft can easily generate a huge amount of data that needs to be reviewed and analyzed • Parsing of the draft and comments • From HTML to XML • Recreate structure of the draft using our shallow parser • Extract features from the draft and comments • Treat individual comments as provisions

  33. E-rulemaking Drafted regulations compared with public comments

  34. Results from e-rulemaking application • Related section in draft and public comment

  35. Results from e-rulemaking application • No related provisions identified • Concern not addressed in the draft

  36. Contributions • A framework for regulatory repository • Structure of regulations recreated in XML • Feature extractions • Prototype for similarity comparisons • Contextual comparisons • Domain knowledge • Structural comparisons • Performance Evaluation, Results and Applications • User survey and comparisons with LSI • Observations of comparisons between Federal, State, non-profit organization mandated codes and European standards • Accessibility • Drinking water control • Application on e-rulemaking

  37. Future research directions • In the legal domain • Regulatory competition • Cross border data transfer laws • Especially in the polyglot countries in EU • Regulatory updates • Track changes in updates • Track cross references between regulations • Extension of application to other domains of semi-structured documents • Software specifications • User manuals • Similarity/relatedness is settled - how about differences and conflicts? • Drinking water example of almost identical provisions

  38. Acknowledgments • Committee members • Prof. Kincho Law • Prof. Gio Wiederhold • Prof. Hans Bjornsson • Prof. Cary Coglianese • Prof. Hector Garcia-Molina, defense chair • Family, friends and everyone in the Engineering Informatics Group • Especially REGNET/REGBASE project members • This research is sponsored by the National Science Foundation

  39. Thank You!

  40. Backup Slides

  41. Natural tree hierarchy rendered by SpaceTree

  42. Concept ontology

  43. Semantics of relatedness/similarity • Similar: having characteristics in common; strictly comparable; alike in substance or essentials; not differing in shape but only in size or position. • Related: connected by reason of an established or discoverable relation. → Similarity is not static; it can depend on one’s viewpoint and desired outcome. • “Related” provisions are more interesting, e.g., the conflicting cases • Traditionally, the measure is called a “similarity score”.

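  44. Cosine similarity • A document M is represented as an n-entry vector $\vec{d}_M = (w_{1,M}, w_{2,M}, \dots, w_{n,M})$, where n is the total number of index terms in the corpus. • Similarity between two documents M and N = $\cos(\vec{d}_M, \vec{d}_N) = \frac{\vec{d}_M \cdot \vec{d}_N}{|\vec{d}_M| \, |\vec{d}_N|} = \frac{\sum_{i=1}^{n} w_{i,M} \, w_{i,N}}{\sqrt{\sum_{i=1}^{n} w_{i,M}^2} \, \sqrt{\sum_{i=1}^{n} w_{i,N}^2}}$ • E.g., we take the frequency count of concept i as the concept weight $w_{i,M}$ in $\vec{d}_M$.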

  45. Example of feature vectors • Traditional term match • Each index term i is assigned a positive and non-binary weight $w_{i,M}$ in each document vector $\vec{d}_M$ • Weight selection • Frequency of term, or • tf × idf model • tf = term frequency; term density • idf = inverse document frequency = $\log(n/n_i)$, where n is the number of documents and $n_i$ the number of documents containing term i; term rarity • Excluding stopwords
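The tf × idf weighting above can be sketched in Python as follows. The tiny corpus and the stopword list are made up for illustration, and raw term counts are used as tf; other tf normalizations exist and the slides do not say which one was used.

```python
import math
from collections import Counter

STOPWORDS = {"the", "of", "a", "shall", "be"}  # hypothetical stopword list


def tf_idf_vectors(documents):
    """Return one {term: tf * log(n / n_i)} dict per tokenized document."""
    docs = [[term for term in doc if term not in STOPWORDS] for doc in documents]
    n = len(docs)
    # n_i: number of documents that contain term i at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return [
        {term: tf * math.log(n / doc_freq[term]) for term, tf in Counter(doc).items()}
        for doc in docs
    ]


# Made-up corpus of three tokenized provisions.
DOCS = [
    ["the", "ramp", "curb", "ramp"],
    ["curb", "door"],
    ["door", "entrance"],
]
vectors = tf_idf_vectors(DOCS)
print(vectors[0])
```

A term that appears in every document gets idf = log(1) = 0, so it contributes nothing to the cosine; this is the "term rarity" effect mentioned on the slide.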

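  46. Vector space transformation • Define D such that $E = D^T D$ is fulfilled • Cosine between the consolidated frequency vectors $D\vec{d}_M$ and $D\vec{d}_N$: $\frac{(D\vec{d}_M) \cdot (D\vec{d}_N)}{|D\vec{d}_M| \, |D\vec{d}_N|} = \frac{(D\vec{d}_M)^T (D\vec{d}_N)}{\sqrt{(D\vec{d}_M)^T (D\vec{d}_M)} \, \sqrt{(D\vec{d}_N)^T (D\vec{d}_N)}} = \frac{\vec{d}_M^{\,T} D^T D \, \vec{d}_N}{\sqrt{\vec{d}_M^{\,T} D^T D \, \vec{d}_M} \, \sqrt{\vec{d}_N^{\,T} D^T D \, \vec{d}_N}} = \frac{\vec{d}_M^{\,T} E \, \vec{d}_N}{\sqrt{\vec{d}_M^{\,T} E \, \vec{d}_M} \, \sqrt{\vec{d}_N^{\,T} E \, \vec{d}_N}}$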

  47. Boundary case: reduced space • Measurements i and j are synonyms • The following vectors should return the same answer

  48. Neighbor inclusion • Neighbor structure matrix formulation N • Each Section i corresponds to row i and column i of N • Entry $N_{ij}$ is 0 if $i \notin psc(j)$ • For $j \in psc(i)$, entry $N_{ij}$ is 1/k where k is the total number of neighbors of i • Example: (shown on slide)
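The matrix formulation above can be illustrated with a short Python sketch that builds N from a parent list. The four-section tree is a made-up example, `neighbor_matrix` is a hypothetical helper name, and the sketch assumes a section is not its own neighbor.

```python
def neighbor_matrix(parent):
    """Build N from a parent list: parent[i] is the parent index of section i (None for the root)."""
    size = len(parent)
    matrix = [[0.0] * size for _ in range(size)]
    for i in range(size):
        psc = set()
        if parent[i] is not None:
            psc.add(parent[i])                      # parent
        for j in range(size):
            if j == i:
                continue
            if parent[j] == i:
                psc.add(j)                          # child
            elif parent[j] is not None and parent[j] == parent[i]:
                psc.add(j)                          # sibling
        for j in psc:
            matrix[i][j] = 1.0 / len(psc)           # 1/k, k = number of neighbors of i
    return matrix


# Made-up tree: section 0 is the root, 1 and 2 are its children, 3 is a child of 1.
PARENT = [None, 0, 0, 1]
N = neighbor_matrix(PARENT)
for row in N:
    print([round(value, 2) for value in row])
```

Each non-empty row sums to 1, so multiplying a score matrix by N averages the scores over a section's parent, siblings and children.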

  49. Matrix representation • Take the average scores of the neighboring pairs • Define • $\Pi$ = similarity scores between two regulations M and N • $\Pi_{ij}$ = similarity score between Section i from regulation M and Section j from regulation N → We have $\Pi_{psc\text{-}psc} = N_A \Pi_0 N_U^T$ and $\Pi_{s\text{-}psc} = \frac{1}{2} (\Pi_0 N_U^T + N_A \Pi_0)$
