The CFR meets the Semantic Web • (with a little unnatural language processing thrown in)
Background: a two-part history of the Semantic Web • SW is a maze of confusing buzzwords • Can be thought of in two parts • Pre-2005 (the “top-down” period) • Post-2005 (the “bottom-up” period)
SW Pre-2005 • A fascination with inferencing & top-down analysis • Staked out a lot of theoretical territory • Built basic standards: • RDF (statement encoding): saying things about things • OWL (modeling and inferencing): describing relationships between things -- that is, creating ontologies
SW FROM 2005 to NOW • SW now seen as a big heap of statements • Became more practical • SKOS (inexpensive conversion method/standard for metadata) • Linked Data (altruistic, like named anchors ca. 1992) • Could be seen -- from a library point of view -- as a new set of techniques for metadata management better suited to the Web
The Semantic Web at the LII • Tying legal information to the real world, not just itself • Applications like: • Improvements to existing finding aids • Table of Popular Names, Tables I and III • Finer-grained, more expressive PTOA • Search enhancement via term substitution and expansion • Publication of “regulated nouns” and definitions as Linked Data • Research-driven engineering as a practice/culture
Why use the SW toolset? • Sometimes the whole thing looks like an illustration of the Two Fool Rule • Why RDF? • XML is more cumbersome and less expressive • RDF supports inferencing • RDF allows processing of partial information • Why SPARQL? • um, SPARQL is how you query RDF
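A minimal sketch of what "querying RDF with SPARQL" looks like in practice, using Python and rdflib (rdflib is not named on this slide, but Python tools appear later in the deck). The file name, namespace, and query are placeholders; the OPTIONAL clause is the point, showing how a query keeps working over partial information.

```python
# Minimal sketch: querying an RDF graph with SPARQL via rdflib.
# "cfr-terms.rdf" is a hypothetical RDF/XML file, not an LII resource.
from rdflib import Graph

g = Graph()
g.parse("cfr-terms.rdf", format="xml")

# OPTIONAL lets rows come back even when no definition exists for a concept,
# i.e. partial information is still usable.
results = g.query("""
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    SELECT ?concept ?label ?def
    WHERE {
        ?concept skos:prefLabel ?label .
        OPTIONAL { ?concept skos:definition ?def . }
    }
""")

for concept, label, definition in results:
    print(concept, label, definition)
```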
Why use SKOS? • it's a simple knowledge organization system • lightweight representation of things we need a lot: • thesauri • taxonomies • classification schemes • it might be a little too simple
SKOS: DRIVING INTO A DITCH

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="http://www.my.com/#canals">
    <skos:definition>A feature type category for places such as the Erie Canal</skos:definition>
    <skos:prefLabel>canals</skos:prefLabel>
    <skos:altLabel>canal bends</skos:altLabel>
    <skos:altLabel>canalized streams</skos:altLabel>
    <skos:altLabel>ditch mouths</skos:altLabel>
    <skos:altLabel>ditches</skos:altLabel>
    <skos:altLabel>drainage canals</skos:altLabel>
    <skos:altLabel>drainage ditches</skos:altLabel>
    <skos:broader rdf:resource="http://www.my.com/#hydrographic%20structures"/>
    <skos:related rdf:resource="http://www.my.com/#channels"/>
    <skos:related rdf:resource="http://www.my.com/#locks"/>
    <skos:related rdf:resource="http://www.my.com/#transportation%20features"/>
    <skos:related rdf:resource="http://www.my.com/#tunnels"/>
    <skos:scopeNote>Manmade waterway used by watercraft or for drainage, irrigation, mining, or water power</skos:scopeNote>
  </skos:Concept>
</rdf:RDF>
Data reuse: DrugBank • Acetaminophen vs. Tylenol: CFR regulates by generic name • DrugBank (http://www4.wiwiss.fu-berlin.de/drugbank/) • http://www.drugbank.ca/ • Offered as Linked Data by Freie Universität Berlin • DrugBank associates brand names with their components • We offer component names as suggested search terms in Title 21 [*]
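A hedged sketch of the idea of brand-name/generic-name lookup over a local copy of the DrugBank Linked Data. The property URIs (genericName, brandName) and the dump file name are assumptions about the FU Berlin vocabulary, not something stated on this slide.

```python
# Illustrative only: find brand names for a generic drug name in a local
# DrugBank Linked Data dump, to suggest as extra search terms.
from rdflib import Graph, Literal, Namespace

# assumed vocabulary namespace and property names
DRUGBANK = Namespace("http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/")

g = Graph()
g.parse("drugbank_dump.nt", format="nt")  # hypothetical local dump file

def brand_names(generic_name):
    """Return brand names recorded for the drug with this generic name."""
    names = set()
    for drug in g.subjects(DRUGBANK.genericName, Literal(generic_name)):
        for brand in g.objects(drug, DRUGBANK.brandName):
            names.add(str(brand))
    return names

# e.g. suggest "Tylenol" when a Title 21 search mentions "Acetaminophen"
print(brand_names("Acetaminophen"))
```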
Can't everything be done with recycled data? Um, no. • Some datasets suck, or don't exist yet • Conversion of existing resources is not painless • Many vocabularies rely on human interpretation • Many vocabularies are not rigorous enough for SKOS encoding (lotta bad SKOS out there)
CURATION ISSUES FOR Existing Datasets • Appropriateness, coverage, provenance • Same metadata quality issues as usual • Many systems of subject terms or identifiers not designed for wide exposure: the "on a horse" problem • We’re talking about curation of vocabularies and schemas as much as we are about curation of data.
extracted vocabularies • The big idea: enhance CFR search via term expansion, suggestion, etc. • Reuse existing thesauri • Make a CFR-specific vocabulary by discovering how the CFR talks about itself • Use that knowledge to suggest better search terms • This is not simple phrase or n-gram matching like Google Suggest. • Rather, we discover how words within the CFR relate to each other and we structure them into a hierarchy of terms (SKOS)
Where do vocabularies come from? • Input: text elements in the CFR XML • Extraction and patterns: • Anaphora resolution (JavaRAP) • Natural Language Parser (Stanford Parser) • Hearst patterns • Output: SKOS (Jena)
Anaphora resolution • John spent time in a Turkish prison. He is now the executive director of CALI. • Núria stole Sara’s chocolate and stuffed her face with it. (but whose face was it?) • When a sponsor conducting a nonclinical laboratory study intended to be submitted to or reviewed by the Food and Drug Administration utilizes the services of a consulting laboratory, contractor, or grantee to perform an analysis or other service, it shall notify the consulting laboratory, contractor, or grantee that the service is part of a nonclinical laboratory study that must be conducted in compliance with the provisions of this part.
Stanford Parser • Structured grammar trees & typed dependencies • Noun modifier: nn(product-10, chemical-9) • “product skos:narrower chemical_product” • Conjunctions: conj(doctor-7, practitioner-9) • “doctor skos:related practitioner”
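A minimal sketch of turning typed dependencies like the ones above into SKOS triples. The slides name Jena for SKOS output; rdflib is used here for brevity, the namespace is a placeholder, and the dependency tuples are assumed to have already been pulled out of the parser output.

```python
# Sketch: map Stanford typed dependencies onto SKOS narrower/related triples.
from rdflib import Graph, Namespace

SKOS = Namespace("http://www.w3.org/2004/02/skos/core#")
CFR = Namespace("http://example.org/cfr/term/")  # placeholder namespace

def term(word):
    # one URI per (possibly multi-word) term
    return CFR[word.replace(" ", "_")]

g = Graph()
g.bind("skos", SKOS)

# dependency tuples assumed to be extracted from the parser output already
dependencies = [
    ("nn", "product", "chemical"),       # noun modifier
    ("conj", "doctor", "practitioner"),  # conjunction
]

for dep, head, modifier in dependencies:
    if dep == "nn":
        # "chemical product" is a narrower concept than "product"
        g.add((term(head), SKOS.narrower, term(modifier + " " + head)))
    elif dep == "conj":
        # conjoined nouns are treated as related concepts
        g.add((term(head), SKOS.related, term(modifier)))

print(g.serialize(format="turtle"))
```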
Hearst Patterns • lexico-syntactic patterns that indicate hypernymic/hyponymic relations. • NP (,)? (such as | like) (NP ,)* (or | and) NP • Example: All vehicles like cars, trucks, and go-karts • PS: • hypernym == word for superset containing term • hyponym == more specific term
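A rough regex rendering of the pattern above, for illustration only: the real extraction works over parsed noun phrases rather than raw text, so this is a deliberate simplification.

```python
# Sketch: one Hearst pattern ("NP such as/like NP, NP, and NP") as a regex.
import re

HEARST = re.compile(
    r"(?P<hypernym>[\w-]+(?: [\w-]+)*),? (?:such as|like) (?P<hyponyms>[^.;]+)"
)

sentence = "All vehicles like cars, trucks, and go-karts."
m = HEARST.search(sentence)
if m:
    hypernym = m.group("hypernym").split()[-1]   # "vehicles"
    hyponyms = [h.strip()
                for h in re.split(r",?\s*\b(?:and|or)\b\s+|,\s*", m.group("hyponyms"))
                if h.strip()]
    print(hypernym, "->", hyponyms)              # vehicles -> ['cars', 'trucks', 'go-karts']
```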
“principal display panel” -- the parser understands “display” as a verb. Oops.
Why is this hard? • Legal text is structurally complicated • Parser dies on long sentences, leading to incorrect extractions • Named entities ("Food, Drug, and Cosmetic Act") confuse the parser • Should be separately extracted/tagged • Parser should think of them as a single token, but doesn't • May need authority files for entities and acronyms, etc. • Corpus is huge (CFR == 96.5 million words) • Strains memory limits and computational resources
Definitions: improving search and presentation • The big idea: find all terms defined by the reg or statute, and do cool stuff with them, for example • linking terms in text to their definitions • pushing definitions to the top of results when the term is searched for • altering presentation so that the (legally) naive user understands the importance of definitions for, e.g., compliance • Of course, that also means figuring out what the scope of definitions is.... :(
Where do the definitions come from? • Input: heading elements in the CFR XML with the term "definition". • Using regular expressions, we extract • Defined term and definition text • Location of the definition (section of the CFR) • Scoping information: "For the purposes of this part" • Output: SKOS/RDF • defined term --> SKOS Vocabulary
Definitions: TOOLS • Python Natural Language Toolkit (NLTK) • ElementTree, an XML parsing library • Snowball stemmer package • RDFlib, an RDF generation library
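Putting the previous two slides together, a simplified sketch of the extraction step: walk the CFR XML, keep sections whose headings mention "definition", pull "X means ..." paragraphs with a regex, and emit SKOS. The element names ("SECTION", "HD", "SECTNO", "P"), file name, and URI scheme are placeholders; the real pipeline handles many more patterns.

```python
# Sketch: regex-based definition extraction from CFR XML into SKOS/RDF.
import re
import xml.etree.ElementTree as ET
from rdflib import Graph, Literal, Namespace, RDF, URIRef

SKOS = Namespace("http://www.w3.org/2004/02/skos/core#")
MEANS = re.compile(r"^(?P<term>.+?)\s+means:?\s+(?P<definition>.+)$", re.S)

g = Graph()
g.bind("skos", SKOS)

tree = ET.parse("cfr-title21.xml")      # hypothetical input file
for section in tree.iter("SECTION"):    # element names are placeholders
    heading_el = section.find("HD")
    heading = "".join(heading_el.itertext()) if heading_el is not None else ""
    if "definition" not in heading.lower():
        continue
    sectno = section.findtext("SECTNO", default="")
    for para in section.iter("P"):
        text = "".join(para.itertext()).strip()
        m = MEANS.match(text)
        if not m:
            continue
        concept = URIRef("http://example.org/cfr/def/" + m.group("term").replace(" ", "_"))
        g.add((concept, RDF.type, SKOS.Concept))
        g.add((concept, SKOS.prefLabel, Literal(m.group("term"))))
        g.add((concept, SKOS.definition, Literal(m.group("definition"))))
        g.add((concept, SKOS.note, Literal("Defined in " + sectno)))

print(g.serialize(format="turtle"))
```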
Why This is Hard: FINDING DEFINITIONS • The structure of the text containing a definition can make it hard to extract: • Sponsor means: • (1) A person who initiates and supports, by provision of financial or other resources, a nonclinical laboratory study; • (2) A person who submits a nonclinical study to the Food and Drug Administration in support of an application for a research or marketing permit • Pattern identification/inconsistencies in sections that are not explicitly meant to be definitions (or, what does “means” mean?)
Why this is hard: Scoping Definitions • Scoping not stated in text, implicit in structure • Complex scoping statements: • "The definitions and interpretations contained in section 201 of the act apply to those terms when used in this part". • "Any term not defined in this part shall have the definition set forth in section 102 of the Act (21 U.S.C. 802), except that certain terms used in part 1316 of this chapter are defined at the beginning of each subpart of that part".
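An illustrative first pass at the easy end of the problem: catching explicit scope statements with a regex. Implicit structural scoping and cross-references like the examples quoted above need real handling of the CFR part/subpart hierarchy, not pattern matching; the pattern and sample text here are assumptions, not the LII code.

```python
# Sketch: spot explicit scope statements such as "As used in this part ...".
import re

SCOPE = re.compile(
    r"(?:As used in|For (?:the )?purposes? of) this (?P<unit>chapter|part|subpart|section)",
    re.IGNORECASE,
)

text = "As used in this part, the following terms shall have the meanings specified."
m = SCOPE.search(text)
if m:
    print("definition scoped to:", m.group("unit").lower())   # -> part
```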
Improvements • Vocabulary: better extraction and quality • Definitions: retrieval and completeness • Obligations: false positives, identification of parts • Product Codes: semantic matching
FUTURE WORK • RDF-ification, refinement, implementation: • Table III, PTOA, Popular Names • Agency structure • Data management and quality • Crowdsourcing
Resources: standards and primers • RDF: • Primer: http://www.w3.org/TR/rdf-primer/ • Advantages: http://www.w3.org/RDF/advantages.html • SKOS • http://www.w3.org/2004/02/skos/
More Resources • Linked Open Data: • General: http://linkeddata.org/ • Tutorial: http://www4.wiwiss.fu-berlin.de/bizer/pub/linkeddatatutorial/ • Government Data: http://logd.tw.rpi.edu/ • W3C Semantic Web resources: • http://www.w3.org/standards/semanticweb/
EVEN MORE Resources: rants and raves • VoxPop articles on the SW and Law: http://blog.law.cornell.edu/voxpop/category/semantic-web-and-law/ • Mangy dogs: http://liicr.nl/JPcAb2 • Legal Informatics blog: http://legalinformatics.wordpress.com/ • Books on law and the SW: http://liicr.nl/MGRbkA
Us • Núria • nuria.casellas@liicornell.org • @ncasellas • http://nuriacasellas.blogspot.com • Tom • tom@liicornell.org • @trbruce • http://blog.law.cornell.edu/(tbruce | metasausage)