Ph.D. Research Proposal RegLocator – A Regulation Management System Enhanced By Domain Knowledge Haoyi Wang March 16, 2005 Committee Members: Prof. Kincho Law Prof. Gio Wiederhold Prof. Eduardo Miranda
Topics • Problems Introduction • Project Objectives • Related Work • System Architecture • Expected Contributions Engineering Informatics Group
Introduction to U.S. Statutes
• Two major parts: federal and state.
• Three types of legal documents: constitutions, codes, and regulations.
• Regulations are rules made by government agencies.
• Federal regulations are organized into 50 titles, while state regulations have diverse structures.
Structures Inside Regulations
• Internal hierarchy (Title, Division, Chapter, Article, Section, etc.).
• Listed in subject order (Administration, Food, Business, etc.).
[Figure: regulation tree. A Regulation branches into Title 1, Title 2, …, Title N; a title branches into Division 1, Division 2, …, Division N; sample leaf text: "The following definitions shall apply to the regulations contained in this chapter …"]
Source: California Code of Regulations, http://ccr.oal.ca.gov/
Problems in Regulation Informatics
• Many parties need access to regulations.
• Federal and state regulations are complex:
  • Distributed resources;
  • Diverse document formats (PDF, Word, HTML, etc.);
  • Semi-structured information, which traditional information retrieval approaches do not take into account.
• Consequences:
  • Access to regulations is inefficient and ineffective;
  • Companies face a higher risk of failing to comply with regulations;
  • Public understanding of the government is hindered.
Distributed Regulation Resources – External
Distributed Regulation Resources – Internal
• A specific topic may span multiple regulation sections;
• The size of the regulation tree grows as O(n_c^d), where n_c is the average number of children per node and d is the depth of the tree;
• Manually searching for one topic becomes hard when d is large;
• Example: Environmental Compliance Assistance Platform (ENVCAP)
  • An application for environmental regulations;
  • Its environmental directory of related regulations is built manually.
Source: http://www.envcap.org
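To make the growth rate concrete, here is a minimal sketch that counts the nodes of a complete tree in which every node has n_c children. The function name `tree_size` and the branching factor of 5 are illustrative assumptions; real regulation trees are irregular, so this only indicates the order of magnitude.

```python
def tree_size(n_c: int, d: int) -> int:
    """Total nodes in a complete tree with n_c children per node.

    The root sits at depth 0, so level k holds n_c ** k nodes.
    """
    return sum(n_c ** level for level in range(d + 1))


# Hypothetical branching factor of 5 (not a measured value).
for depth in range(1, 6):
    print(f"depth={depth}: {tree_size(5, depth)} elements")
```

Each extra level multiplies the number of elements by roughly n_c, which is why browsing by hand stops scaling once the hierarchy is a few levels deep.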
Distributed Regulation Resources – Internal (Cont.)
• Limitations of ENVCAP
  • The regulations are distributed across the states’ web sites;
  • Building the directory is labor-intensive, and the resulting directory is shallow:
    • A subject, such as mercury, appears in multiple locations;
    • Some document formats, such as PDF, cannot be divided into parts.
[Figure: ENVCAP directory, annotated "General Subjects"]
Content and Structural Query
• A query can include structural restrictions:
[Figure: example query, annotated "Structural Restriction"]
Content and Structural Query (Cont.)
• Searching for Environment AND "waste water" in the CFR:
[Figure: search results, annotated "Irrelevant result" and "Unreadable size"]
Content Only Query
• The search considers only content, although regulations are categorized by subject.
[Figure: subject categories, with labels "Subjects", "Tax Payer", and "Waste Water"]
Research Objectives
• Develop a universal, centralized platform that handles distributed, diverse regulation files and explores their structural information;
• Define mechanisms that interpret a user’s query using the domain knowledge within regulations, such as hierarchies and categories;
• Improve the traditional relevance algorithm so that the ranking of search results accounts for the features of semi-structured files.
General Web Search Engine
• Behind the scenes, ranking uses more than the content of web pages:
  • Link analysis: the popularity of a web page is decided by the pages that link to it;
  • Metadata, which can be more useful than the content:
    • Domain and URL path;
    • HTML meta tags;
    • Titles, captions, etc.
• The ranking of results is also tweaked:
  • By geographic location, e.g., local search;
  • By commercial considerations.
Information Systems on Regulations
• NaviLex (E. Pietrosanti and B. GraziaDio, 1999)
  • Searches and navigates Italian banking legislation;
  • Considers the structural, conceptual, and functional dimensions of legal documents;
  • Supports queries like “which regulation part includes a concept C and C is the subject of an obligation relation?”
Information Systems on Legal Cases
• Several systems represent legal cases:
  • SMILE extracts factors from cases in trade secret law (Brüninghaus & Ashley, 2001);
  • EMBRACE is a framework for reasoning about refugee law (Yearwood & Stranieri, 1999);
  • SPIRE & INQUERY form a CBR+IR system on bankruptcy law (Daniels & Rissland, 1997).
• Common approach:
  • Experts define a set of features that represents one type of case;
  • Given a new case, the values of those features are identified in it;
  • Further processing of the cases can then use IR techniques or case-based reasoning.
Information Systems on XML Files
• XML is a popular standard to represent and exchange knowledge.
• XPath: a standard for describing the structure of XML files.
  • Example: in “book.xml”, /book/chapter or /book/chapter/section.
• XQuery: a standard query language for retrieving elements from XML files.
[Figure: tree of a sample XML document. Element nodes: book, title, author, chapter (two), heading (two), section (two). Text nodes: "This …", "John Smith", "XML Retrieval", "XML Query Language XQL", "We describe syntax of XQL", "Introduction"]
Source: Fuhr et al., 2001
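As a small illustration of the two path expressions, the sketch below evaluates them with Python's standard `xml.etree.ElementTree`, which supports a limited XPath subset. The document content is a hypothetical stand-in for “book.xml”, loosely modeled on the element names in the figure; it is not the original file.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for book.xml.
BOOK_XML = """
<book>
  <title>XML Retrieval</title>
  <author>John Smith</author>
  <chapter>
    <heading>Introduction</heading>
    <section><heading>Background</heading></section>
  </chapter>
  <chapter>
    <heading>XML Query Language</heading>
    <section><heading>Syntax</heading></section>
    <section><heading>Examples</heading></section>
  </chapter>
</book>
"""

root = ET.fromstring(BOOK_XML.strip())  # root is the <book> element

# ElementTree paths are relative to the element they are called on,
# so /book/chapter becomes ./chapter when evaluated from <book>.
chapters = root.findall("./chapter")
sections = root.findall("./chapter/section")

chapter_headings = [chapter.findtext("heading") for chapter in chapters]
print(chapter_headings, len(sections))
```

Both paths select elements purely by element type, which is the assumption that does not hold for regulations, where the hierarchy level is carried by the titles.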
Information Systems on XML Files (Cont.)
[Figure: taxonomy of XML systems, split into document-centric and data-centric systems and arranged along a "Complexity & Capability" axis. Approaches shown: Structural Mapping (O2SQL); Model Mapping (XRel, Tequyla-TX); DB Approach (IRQL, PowerDB, Timber, XQueryIR); IR Approach (TIX, HyRex, XPres, XXL)]
Distinguishing Characteristics of RegLocator
• Document-centric XML repository for regulations
• Challenges for RegLocator
  • The hierarchy is defined by titles, not by element types;
  • Users care only about the content and have no knowledge of the tree structure;
  • A query is a bag of words: how can structural information be found within a query?
  • The structural information in a query must be matched with the underlying content;
  • Other domain knowledge in regulations, such as categories and references, should be utilized.
Topics • Problems Introduction • Project Objectives • Related Work • System Architecture • Expected Contributions
Centralized Regulation Database
Mining Domain Knowledge
• A complete formal representation is hard to build
  • First-order logic (FOL), knowledge base, ontology, etc.
  • No automatic building process exists yet.
• Knowledge engineering is a reasonable approach.
  • Use available information extraction and text mining methods to identify partial domain knowledge at a shallow level (features):
    • Title hierarchy, concepts, references, etc.;
    • Relationships between concepts and categories.
• Example title hierarchy (CCR):
  TITLE 17. Public Health
  Division 3. Air Resources
  Chapter 1. Air Resources Board
  Subchapter 2.6 Air Pollution Control District Rules
Kernel
[Figure: kernel architecture. Labels: Reg. DB; Content DB; Feature DB; Hierarchy Identifier; Entity Extractor; Content Analyzer; Main Processing Pipeline; New; Old; Search Engine; Content Indexer; Feature Indexer]
Major Components of RegLocator
• Content Engine: 1. Web Crawler; 2. Shallow Parser; 3. Feature Extractor; 4. Content Analyzer
• Index Engine: 1. Content; 2. Domain features
• Search Engine: 1. Basic Score; 2. Feature Score; 3. Structural Score
• Render Engine: 1. Search Box; 2. Subject Directory; 3. Results Rendering
• Test Engine: 1. Functional Test; 2. Performance Test
Content Engine – Web Crawler
• Structure of the web: hyperlinked online documents.
• Crawling the web sites
  • General processor + configurations;
  • Start from a control center;
  • Find links in a downloaded page;
  • Follow the links to get more pages;
  • Avoid loops.
Sample configuration:
  outputDir = HI
  startTOC = http://www.hawaii.gov/dlnr/AdminRulesIdx.htm
  maxDepth=2
  linkPattern1 = ^Final.*Rules
  matchLink1 = false
  filePattern1 = .*/dlnr/.*\\.pdf
  indexPattern1 = .*
  linkPattern2 = .*
  filePattern2 = .*/dlnr/.*\\.pdf
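The crawling steps above can be sketched as a depth-limited breadth-first traversal. This is a minimal sketch under stated assumptions, not the RegLocator crawler: the in-memory `FAKE_WEB` link graph and its example.org URLs are hypothetical stand-ins for real HTTP fetching, and only `maxDepth` and a single `filePattern` from the sample configuration are modeled (`linkPattern`, `matchLink`, and `indexPattern` are left out because the slides do not define their semantics).

```python
import re
from collections import deque

# Hypothetical link graph: page URL -> URLs linked from that page.
FAKE_WEB = {
    "http://example.org/index.htm": [
        "http://example.org/dlnr/ch1.htm",
        "http://example.org/about.htm",
    ],
    "http://example.org/dlnr/ch1.htm": [
        "http://example.org/dlnr/rule1.pdf",
        "http://example.org/dlnr/sub.htm",
        "http://example.org/index.htm",  # link back to the start: a loop
    ],
    "http://example.org/about.htm": ["http://example.org/dlnr/rule2.pdf"],
    "http://example.org/dlnr/sub.htm": ["http://example.org/dlnr/deep.pdf"],
}


def crawl(start_url, max_depth, file_pattern, get_links):
    """Collect URLs matching file_pattern within max_depth links of start_url."""
    file_re = re.compile(file_pattern)
    seen = {start_url}                # visited set: this is what avoids loops
    queue = deque([(start_url, 0)])
    files = []
    while queue:
        url, depth = queue.popleft()
        if file_re.fullmatch(url):
            files.append(url)         # a regulation file: keep it, do not expand
            continue
        if depth >= max_depth:
            continue                  # do not follow links beyond the depth limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return files


files = crawl(
    "http://example.org/index.htm",
    max_depth=2,
    file_pattern=r".*/dlnr/.*\.pdf",
    get_links=lambda url: FAKE_WEB.get(url, []),
)
print(files)
```

With `max_depth=2`, `deep.pdf` is never reached because it sits three links from the start page, and the link back to `index.htm` is ignored because that URL is already in `seen`.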
Content Engine – Shallow Parser
[Figure: conversion pipeline. Labels: WORD, HTML, PDF, TEXT, XML, Text Converter, Structural Converter]
Sample patterns for the content filter:
  # Remove the title of the page
  s/TITLE 18. ENVIRONMENTAL QUALITY//
  # Remove the table of contents
  s/<center>Supp\.(.*?)ARTICLE 1\.(.*?)ARTICLE 1\./<p>ARTICLE 1\./s
Sample pattern for hierarchy recognition:
  0@^<p>(CHAPTER (\d+))\. (.*?)$@<p><a NAME="$1" LEVEL="2" TITLE="$3">$3<\/a>@0
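The hierarchy-recognition pattern can be tried out with Python's `re` module. This is a hedged translation of the middle two fields of the sample pattern only; the leading and trailing `0@ … @0` fields are not interpreted here, and the function name `tag_chapters` and the input text are assumptions made for illustration.

```python
import re

# Same expression as in the sample pattern; MULTILINE lets ^ and $ match
# at the start and end of every line rather than of the whole string.
CHAPTER_RE = re.compile(r"^<p>(CHAPTER (\d+))\. (.*?)$", re.MULTILINE)


def tag_chapters(html: str) -> str:
    """Wrap each chapter heading in an anchor that records its level and title."""
    return CHAPTER_RE.sub(
        r'<p><a NAME="\1" LEVEL="2" TITLE="\3">\3</a>', html
    )


page = "<p>CHAPTER 1. Air Resources Board\n<p>Some text."
tagged = tag_chapters(page)
print(tagged)
```

Lines that do not start with a chapter heading pass through unchanged, so one such pattern per level is enough to annotate a whole title hierarchy.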
Content Engine – Feature Extractor
• Concept and Reference
  All salt, table salt, iodized salt, or iodized table salt in packages intended for retail sale shipped in interstate commerce 18 months after the date of publication of this statement of policy in the FEDERAL REGISTER, shall be labeled as prescribed by this section; and if not so labeled, the Food and Drug Administration will regard them as misbranded within the meaning of sections 403 (a) and (f) of the Federal Food, Drug, and Cosmetic Act. (21.CFR.100.155)
Source: Stanford PCFG parser
Content Engine – Significant Concept
• Relationship between concepts and categories
• Corpora comparison:
  • Assume that a topic-related concept has a significant distribution in the corpus related to that topic;
  • Decide whether a concept is related to a topic by comparing its distribution in this corpus with its distributions in other corpora.
• Distribution of a concept in a corpus:
  • View the corpus as a list of words;
  • Divide this list into fixed-length regions;
  • A sample point is the number of occurrences of the concept within one region.
• Example: given a short corpus, build a sample for “food” when each region is 20 words long.
  • The sample for “food” in this corpus is (0,0,1,1).
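The sampling procedure can be sketched as follows. This is a minimal sketch assuming single-word concepts and whitespace tokenization; the function name `region_counts` and the 12-word toy corpus are made up for illustration (the corpus from the slide is not reproduced here), and a trailing region shorter than the fixed length is dropped.

```python
def region_counts(words, concept, region_len):
    """Count occurrences of `concept` in each full fixed-length region."""
    counts = []
    # Stop early enough that every region has exactly region_len words.
    for start in range(0, len(words) - region_len + 1, region_len):
        region = words[start:start + region_len]
        counts.append(region.count(concept))
    return counts


# Hypothetical toy corpus of 12 words, split into regions of 4 words.
corpus = "permit rules apply here the food code covers safe food and food".split()
sample = region_counts(corpus, "food", region_len=4)
print(sample)
```

Each entry of `sample` is one sample point, so a corpus with many regions gives a distribution that can be compared across corpora.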
Content Engine – Significant Concept (Cont.)
• Sample: compare 21CFR (Food and Drugs) with 40CFR (Environmental Protection)
  • Find the significant concepts in 21CFR;
  • List length: both corpora have more than 2M words;
  • Each region has 20K words, giving about 10 points in each sample.
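The slides do not say which statistic is used to compare the two distributions. As one possible choice, and purely as an assumption, the sketch below computes Welch's t-statistic between the per-region counts of a concept in two corpora. The counts are invented toy numbers, not measurements from 21CFR or 40CFR.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t-statistic for two samples with possibly unequal variances.

    Assumes each sample has at least two points and that the two
    sample variances are not both zero.
    """
    standard_error = sqrt(
        variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b)
    )
    return (mean(sample_a) - mean(sample_b)) / standard_error


# Invented per-region counts of one concept in two corpora.
counts_in_topic_corpus = [12, 15, 11, 14, 13]
counts_in_other_corpus = [1, 0, 2, 1, 1]

t = welch_t(counts_in_topic_corpus, counts_in_other_corpus)
print(round(t, 2))
```

A large positive value suggests the concept is much denser in the first corpus; where to put the threshold for "significant" is a tuning decision that this sketch does not settle.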
Content Engine – Regulation in XML Form
DTD:
    <!ELEMENT regulation (regElement+)>
    <!ATTLIST regulation
        id   ID    #REQUIRED
        name CDATA #REQUIRED
        type CDATA #REQUIRED>
    <!ELEMENT regElement (concept*, regText?, regElement*, reference*)>
    <!ATTLIST regElement
        id   ID    #REQUIRED
        name CDATA #REQUIRED>
    <!ELEMENT concept EMPTY>
    <!ATTLIST concept
        name  CDATA #REQUIRED
        times CDATA #REQUIRED>
    <!ELEMENT reference EMPTY>
    <!ATTLIST reference
        id    CDATA #REQUIRED
        times CDATA #REQUIRED>
    <!ELEMENT regText (#PCDATA | paragraph)*>
    <!ELEMENT paragraph (#PCDATA | pre | img)*>
    <!ELEMENT pre (#PCDATA)>
    <!ELEMENT img (#PCDATA)>
Sample (excerpt):
    <regulation id="40.cfr.1" name="STATEMENT OF ORGANIZATION AND GENERAL INFORMATION" type="federal">
      <regElement id="40.cfr.1.A" name="-- Introduction">
        <regElement id="40.cfr.1.1" name="Creation and authority.">
          <concept name="executive branch" times="1" />
          <concept name="environmental protection agency" times="1" />
          <regText>
            <paragraph>Reorganization Plan 3 of 1970, established the U.S. Environmental Protection Agency (EPA) in the Executive branch as an independent Agency, effective December 2, 1970.</paragraph>
          </regText>
        </regElement>
      </regElement>
    </regulation>
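To show how this XML form exposes the title hierarchy and the concepts to later stages, here is a minimal sketch that walks the nested `regElement` nodes with Python's standard `xml.etree.ElementTree`. The helper name `walk` is an assumption, and the sample is a shortened copy of the excerpt with the paragraph text elided.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<regulation id="40.cfr.1" name="STATEMENT OF ORGANIZATION AND GENERAL INFORMATION" type="federal">
  <regElement id="40.cfr.1.A" name="-- Introduction">
    <regElement id="40.cfr.1.1" name="Creation and authority.">
      <concept name="executive branch" times="1" />
      <concept name="environmental protection agency" times="1" />
      <regText><paragraph>...</paragraph></regText>
    </regElement>
  </regElement>
</regulation>
"""


def walk(element, ancestors=()):
    """Yield (id, title path, concept counts) for every regElement."""
    path = ancestors + (element.get("name"),)
    if element.tag == "regElement":
        concepts = {
            concept.get("name"): int(concept.get("times"))
            for concept in element.findall("concept")
        }
        yield element.get("id"), path, concepts
    for child in element.findall("regElement"):
        yield from walk(child, path)


rows = list(walk(ET.fromstring(SAMPLE.strip())))
for element_id, path, concepts in rows:
    print(element_id, " > ".join(path), concepts)
```

The title path of each element is exactly the information that element types alone do not carry, since every level uses the same `regElement` tag.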
Index Engine – Inverted Index
[Figure: inverted index, mapping terms to pointers]
Source: CS276a, Fall 2002
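A minimal in-memory sketch of an inverted index is shown below: each term points to the sorted list of elements that contain it. The function names, the whitespace tokenization, and the three short texts are assumptions for illustration; a real index would also store positions and frequencies.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in set(docs[doc_id].lower().split()):
            index[term].append(doc_id)
    return dict(index)


def search_and(index, terms):
    """Return ids of documents that contain every query term."""
    postings = [set(index.get(term, [])) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []


# Hypothetical element texts keyed by regulation element id.
docs = {
    "40.cfr.1.1": "creation and authority",
    "40.cfr.1.3": "purpose and functions",
    "21.cfr.100.155": "salt and iodized salt",
}
index = build_inverted_index(docs)
print(index["and"])
print(search_and(index, ["and", "authority"]))
```

Candidate elements for a query are found by intersecting (or merging) the lists of its terms, without scanning the documents themselves.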
Index Engine – Domain Knowledge
• Tables store the relationships among regulation elements and features:
  • Title hierarchies;
  • Significant concepts;
  • Tree structure.
• Example title hierarchy:
  TITLE 17. Public Health
  Division 3. Air Resources
  Chapter 1. Air Resources Board
  Subchapter 2.6 Air Pollution Control District Rules
Search Engine – Three Stages
• First stage:
  • Identify the candidate elements by the query words;
  • Rank the candidate elements by tf.idf.
• Second stage:
  • Re-rank the elements by mapping their domain features to the structural information in the user’s query.
• Third stage:
  • Tune the ranking by other internal structures.
Search Engine – First Stage
• Rank the candidate elements by tf.idf:
  • Represent both the candidate document and the query as vectors.
  • Each element in a vector is the weight of a term:
    • Term Frequency (tf) measures the term’s density in a document;
    • Inverse Document Frequency (idf) measures the term’s informativeness;
    • The weight for term i in document d: [formula shown as an image on the slide]
• Relevance score
  • Similarity is measured by the cosine of the angle between the two vectors: [formula shown as an image on the slide]
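Since the slide's own formulas are not available in this transcript, the sketch below uses one common textbook variant as an assumption: weight = tf × log(N / df), with cosine similarity between the query and document vectors. The function names and the three toy documents are illustrative; RegLocator's actual weighting may differ (the second-stage formulas, for instance, use square roots of frequencies).

```python
import math
from collections import Counter


def weigh(tokens, idf):
    """tf.idf vector for a token list; terms unknown to the collection are dropped."""
    return {term: tf * idf[term] for term, tf in Counter(tokens).items() if term in idf}


def build_tfidf(docs):
    """Return ({doc_id: vector}, idf) using weight = tf * log(N / df)."""
    tokens = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    df = Counter(term for toks in tokens.values() for term in set(toks))
    idf = {term: math.log(len(docs) / count) for term, count in df.items()}
    return {doc_id: weigh(toks, idf) for doc_id, toks in tokens.items()}, idf


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical toy collection.
DOCS = {
    "a": "waste water discharge permit",
    "b": "tax payer filing rules",
    "c": "drinking water quality",
}
vectors, idf = build_tfidf(DOCS)
query_vec = weigh("waste water".split(), idf)
ranking = sorted(DOCS, key=lambda d: cosine(query_vec, vectors[d]), reverse=True)
print(ranking)
```

The rarer term "waste" carries more weight than "water", so the element containing both terms ranks ahead of the one that only mentions water.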
Search Engine – Second Stage
• Map the domain features of candidate elements to the structural information in the user’s query:
  • Keywords appear in the title hierarchy of an element:
    • E.g., for the query “environment, waste water”, suppose an element has “waste water” in its content and “environment” in its titles;
    • This element is a better candidate for the query.
  • Keywords are significant concepts of a category:
    • The elements from this category are more relevant.
• The ranking scores of the candidates should be adjusted according to these relationships.
Search Engine – Second Stage (Cont.)
score_d_s = sum_l (w_l * sum_t (tf_q * tf_t) / norm_d_l)
where:
  score_d_s : score for document d on the title hierarchy
  sum_l     : sum over all title levels l
  w_l       : weight for level l
  sum_t     : sum over all terms t
  tf_q      : the square root of the frequency of t in the query
  tf_t      : the square root of the frequency of t in the title of d at level l
  norm_d_l  : normalization denominator for d at title level l

score_d_c = sum_c (w_c * tf_q * tf_c / norm_d_c)
where:
  score_d_c : score for document d on concept significance
  sum_c     : sum over all concepts c
  w_c       : weight for concept c
  tf_c      : the square root of the frequency of c in d
  norm_d_c  : normalization denominator for d on concepts

score_new = p_d * score_d + p_s * score_d_s + p_c * score_d_c
where:
  score_new     : updated ranking score from the second stage
  score_d       : ranking score for document d from the first stage
  p_d, p_s, p_c : tuning parameters
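The three formulas translate almost directly into code. In this minimal sketch the weights, normalization denominators, and tuning parameters are passed in as plain numbers because the slides do not define how they are computed; all function names and the example values are assumptions, and in the concept score tf_q is read as the square root of the frequency of the concept c in the query.

```python
import math


def title_score(query_freq, title_freq_by_level, level_weights, norms):
    """score_d_s = sum_l ( w_l * sum_t (tf_q * tf_t) / norm_d_l )"""
    total = 0.0
    for level, title_freq in title_freq_by_level.items():
        inner = sum(
            math.sqrt(q_freq) * math.sqrt(title_freq.get(term, 0))
            for term, q_freq in query_freq.items()
        )
        total += level_weights[level] * inner / norms[level]
    return total


def concept_score(query_freq, concept_freq, concept_weights, norm):
    """score_d_c = sum_c ( w_c * tf_q * tf_c / norm_d_c )"""
    return sum(
        concept_weights.get(concept, 0.0)
        * math.sqrt(query_freq.get(concept, 0))
        * math.sqrt(freq)
        / norm
        for concept, freq in concept_freq.items()
    )


def combined_score(score_d, score_d_s, score_d_c, p_d, p_s, p_c):
    """score_new = p_d * score_d + p_s * score_d_s + p_c * score_d_c"""
    return p_d * score_d + p_s * score_d_s + p_c * score_d_c


# Illustrative values only: "environment" appears in the level-1 title,
# and "waste water" is a significant concept occurring 4 times in d.
query = {"environment": 1, "waste water": 1}
s = title_score(query, {1: {"environment": 1}, 2: {}}, {1: 0.5, 2: 1.0}, {1: 1.0, 2: 1.0})
c = concept_score(query, {"waste water": 4}, {"waste water": 0.5}, norm=2.0)
new = combined_score(0.25, s, c, p_d=1.0, p_s=1.0, p_c=1.0)
print(s, c, new)
```

Setting p_s and p_c to zero recovers the first-stage ranking, which makes it easy to measure how much each domain feature changes the results.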
Search Engine – Third Stage
• Tune the relevance scoring by other internal structures:
  • Tree: rank elements with their granularity in mind.
  • References.
[Figure: tree of regulation elements labeled S1 to S10]
Render Engine
• Provides GUIs for a search box and/or a subject directory
• Browsing formats for the retrieved results
  • Traditional list interface;
  • Tree interface.
Source: HCIL SpaceTree
Test Engine
• Functionality test
  • E.g., query format, special index, relevance scoring.
• System performance
  • Speed and cost to accomplish a query.
• Result quality
  • Precision: the fraction of retrieved docs that are relevant.
  • Recall: the fraction of relevant docs that are retrieved.
  • Coverage: results are neither too specific nor too general.
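The two result-quality measures defined above can be computed as follows. The function name `precision_recall` and the element ids are illustrative assumptions; the relevance judgments would come from a manually labeled test set.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = |retrieved ∩ relevant| / |retrieved|
    recall    = |retrieved ∩ relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical element ids returned by a query and judged relevant.
retrieved = ["s1", "s2", "s3", "s4"]
relevant = ["s1", "s3", "s5"]
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

Here two of the four retrieved elements are relevant, and two of the three relevant elements were retrieved.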
Expected Applications of RegLocator
• Crawling different state web sites and collecting regulations from related resources;
• Identifying the structural information within regulations and storing it in an XML repository;
• Mapping the structural information from a query to the content of the regulations;
• Ranking the relevant regulation elements with consideration of domain features;
• Building a regulation subject directory automatically with an acceptable accuracy rate.
Expected Contributions of RegLocator
• Management kernel for the U.S. regulation system;
• Framework to manage semi-structured documents;
• Knowledge engineering and information retrieval approaches for regulatory information;
• Relevance algorithms for searching regulation documents.
Acknowledgements
• Committee Members:
  • Prof. Gio Wiederhold
  • Prof. Eduardo Miranda
  • Prof. Kincho Law
• Research Colleagues:
  • Gloria Lau, Shawn Kerrigan, Jun Peng, Chuck Han, Xianshan Pan, and Yang Wang.
• NSF Grant No.: EIA-9983368
Q & A
Thank you