JURISIN 2008 Second International Workshop on Juris-informatics. Bootstrapping-based Extraction of Dictionary Terms from Unsegmented Legal Text. Graduate School of Information Science, Nagoya University, Japan Masato HAGIWARA, Yasuhiro OGAWA, Katsuhiko TOYAMA. Background.
JURISIN 2008 Second International Workshop on Juris-informatics
Bootstrapping-based Extraction of Dictionary Terms from Unsegmented Legal Text
Graduate School of Information Science, Nagoya University, Japan
Masato HAGIWARA, Yasuhiro OGAWA, Katsuhiko TOYAMA
Background
• Growing demand for translation of Japanese statutes
  • Social and economic globalization
  • Promotion of international investment toward Japan
  • Technical assistance to developing and/or former socialist countries
• Japanese government effort
  • “Study Council for Promoting Translation of Japanese Laws and Regulations into Foreign Language”
Bilingual Dictionary
• Standard Japanese-English bilingual dictionary of legal terms (SBD)
  • Recommended to translators and lawyers
  • More than 250 major statutes to be translated, 120 already released based on SBD
• High compiling/maintenance cost
  • Should be technically supported
Dictionary Compilation Support
• Natural language processing technique
  • Automatic extraction of bilingual lexicons by word alignment technique [Toyama et al. 2006]
  • Japanese entries must be fixed before application
  • Appropriate terms are still selected by hand
Supported by automatic dictionary term selection from unsegmented legal text
Defined Terms
• What kind of terms should be selected?
Definition sentences
この法律において「商品取引所」とは、会員商品取引所及び株式会社商品取引所をいう。
(The term “Commodity Exchange” as used in this Act shall mean a Member Commodity Exchange and an Incorporated Commodity Exchange.) (Act No. 239, 1950)
この法律において、次の各号に掲げる用語の意義は、当該各号に定めるところによる。
一 著作物 思想又は感情を創作的に表現したものであつて、文芸、学術、美術又は音楽の範囲に属するものをいう。
二 著作者 著作物を創作する者をいう。
(In this Act, the meanings of the terms listed in the following items shall be as prescribed respectively in those items: (i) “work” means a production in which thoughts or sentiments are expressed in a creative way and which falls within the literary, scientific, artistic or musical domain; (ii) “author” means a person who creates the work;) (Act No. 48, 1970)
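The first definition form above (a bracketed term followed by とは) is regular enough to be matched with a regular expression. The following is a minimal sketch under that assumption only; the pattern and the name extract_defined_terms are illustrative and are not taken from the slides, and the itemized form (一, 二, …) is not handled.

```python
import re

# Assumed pattern (illustrative): a term in 「…」 immediately followed by "とは".
DEFINED_TERM = re.compile(r"「([^「」]+)」とは")


def extract_defined_terms(text: str) -> list[str]:
    """Return every bracketed term that is directly followed by 'とは'."""
    return DEFINED_TERM.findall(text)


sentence = "この法律において「商品取引所」とは、会員商品取引所及び株式会社商品取引所をいう。"
print(extract_defined_terms(sentence))
```

A pattern this narrow favors precision: it only fires on explicit definition sentences and ignores every other mention of the term.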
Pattern-based Term Extraction
“Important terms appear in similar contexts”
Commodity Exchange
… in accordance with the standards and methods specified by a Commodity Exchange …
… a market that a Commodity Exchange has opened for each single kind of …
… a Member, etc. of a Commodity Exchange or a Member, etc. of a facility equivalent …
… a Member, etc. of a facility equivalent to a Commodity Exchange in a foreign state …
Patterns
  specified by #
  # has opened
  Member, etc. of a #
  equivalent to a #
Instances
  Articles of incorporation
  Old Markets
  commodity futures association
  Commodity Market
… one-third or more has been specified by articles of incorporation, at least such …
… the locations where Old Markets have been opened and Listed Commodities …
… person is a member of a commodity futures association (hereinafter referred to…
… in a foreign state equivalent to a Commodity Market; hereinafter the same shall apply…
Bootstrapping-based Methods
• Espresso [Pantel and Pennacchiotti 2006]
  • Extraction of lexical relations (binary)
  • English news articles (segmented)
• Tchai [Komachi and Suzuki 2008]
  • Extraction of semantic categories (unary)
  • Japanese query logs (unsegmented but short)
• Long, unsegmented Japanese legal text
  • → Conventional analyzers/parsers are not applicable
Objectives
• A new algorithm, Monaka, is proposed
  • Based on the Tchai algorithm
  • Character n-gram based instance/pattern induction
  • Constraint to ensure proper segmentation
• Evaluation to confirm its effectiveness for dictionary term extraction
Espresso Algorithm [Pantel and Pennacchiotti 2006]
Instances: wheat :: crop, George Wendt :: star, nitrogen :: element, diborane :: substance, Picasso :: artist, tax :: charge, protein :: biopolymer, HCl :: strong acid
Patterns: “x is a y”, “y such as x”, “x and other y”
Bootstrapping: Corpus and Seed instances → Pattern Induction → Pattern Ranking → Instance Induction → Instance Ranking → Extracted instances
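The ranking steps in this loop score patterns and instances against each other. The slide does not give the formula; the sketch below follows the commonly cited Espresso formulation, in which a pattern's reliability is the average, over instances, of PMI normalized by the maximum PMI and weighted by instance reliability (and symmetrically for instances). The function names and the toy PMI values are assumptions made for illustration.

```python
def pattern_reliability(pattern, instance_rel, pmi, max_pmi):
    """Average over instances of (pmi / max_pmi) * instance reliability."""
    if not instance_rel or max_pmi == 0:
        return 0.0
    total = sum(
        pmi.get((inst, pattern), 0.0) / max_pmi * rel
        for inst, rel in instance_rel.items()
    )
    return total / len(instance_rel)


def instance_reliability(instance, pattern_rel, pmi, max_pmi):
    """Average over patterns of (pmi / max_pmi) * pattern reliability."""
    if not pattern_rel or max_pmi == 0:
        return 0.0
    total = sum(
        pmi.get((instance, pat), 0.0) / max_pmi * rel
        for pat, rel in pattern_rel.items()
    )
    return total / len(pattern_rel)


# Toy PMI table keyed by (instance, pattern); the values are made up.
pmi = {
    ("wheat :: crop", "x is a y"): 2.0,
    ("Picasso :: artist", "x is a y"): 1.0,
}
seeds = {"wheat :: crop": 1.0, "Picasso :: artist": 1.0}
print(pattern_reliability("x is a y", seeds, pmi, max_pmi=2.0))
```

Seed instances start with reliability 1.0, and each iteration alternates between the two functions so that scores propagate from seeds to patterns and on to new instances.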
Tchai Algorithm [Komachi and Suzuki 2008]
• Applied Espresso to semantic category extraction from Japanese web query logs
• Some improvements over Espresso
  • Query-based pattern induction
    seed: JAL, query: JAL_flight, pattern: #_flight
  • Local PMI Max
  • Ambiguous instance/pattern filtering
    • Ambiguous instance: 1.5x patterns of previous instances
    • Ambiguous pattern: 2.0x instances of previous patterns
• Improves the precision of the extracted instances
Monaka Algorithm – Pattern Induction
• Character n-gram based induction
  • Espresso → Segmented English text
  • Tchai → Short Japanese queries
この法律において「商品取引所」とは、会員商品取引所及び株式会社商品取引所をいう。
(The term “Commodity Exchange” as used in this Act shall mean a Member Commodity Exchange and an Incorporated Commodity Exchange.) (Act No. 239, 1950)
Instance: 商品取引所 (Commodity Exchange)
Patterns: て「#, いて「#, おいて「#, …, #」と, #」とは, #」とは、, …
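The character n-gram induction on this slide can be sketched as follows. This is an illustration, not the authors' implementation: the name induce_patterns and the length bounds min_n and max_n are assumptions, since the slide does not state which n-gram lengths are used. Each occurrence of the instance yields left-context patterns (context + #) and right-context patterns (# + context).

```python
def induce_patterns(text: str, instance: str, min_n: int = 1, max_n: int = 4) -> set[str]:
    """Collect character n-gram contexts around every occurrence of `instance`.

    '#' marks the slot where the instance appeared.
    """
    patterns = set()
    start = text.find(instance)
    while start != -1:
        end = start + len(instance)
        for n in range(min_n, max_n + 1):
            if start - n >= 0:
                patterns.add(text[start - n:start] + "#")   # preceding context
            if end + n <= len(text):
                patterns.add("#" + text[end:end + n])       # succeeding context
        start = text.find(instance, start + 1)
    return patterns


sentence = "この法律において「商品取引所」とは、会員商品取引所及び株式会社商品取引所をいう。"
for p in sorted(induce_patterns(sentence, "商品取引所")):
    print(p)
```

Because no morphological analysis is involved, the same sentence also produces contexts from the other two occurrences of 商品取引所 (after 会員 and after 株式会社).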
Monaka Algorithm – Instance Induction
この法律において「商品取引所」とは、会員商品取引所及び株式会社商品取引所をいう。
(The term “Commodity Exchange” as used in this Act shall mean a Member Commodity Exchange and an Incorporated Commodity Exchange.) (Act No. 239, 1950)
Pattern: 律において「# (# as used in this Act)
Instances: 商品, 商品取, 商品取引, 商品取引所, 商品取引所」, …
Incorrectly segmented instances are extracted as well
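Instance induction runs the same idea in reverse: every character n-gram adjacent to a pattern's context becomes a candidate. The sketch below is a minimal illustration under assumed names (induce_instances, max_len); it shows why over- and under-segmented candidates such as 商品取 or 商品取引所」 appear alongside the correct term.

```python
def induce_instances(text: str, pattern: str, max_len: int = 6) -> list[str]:
    """Return all character n-grams (length 1..max_len) filling the '#' slot.

    Supports one-sided patterns only: 'context#' or '#context'.
    """
    candidates = []
    if pattern.endswith("#"):            # left context, instance follows
        ctx = pattern[:-1]
        pos = text.find(ctx)
        while pos != -1:
            start = pos + len(ctx)
            for n in range(1, max_len + 1):
                if start + n <= len(text):
                    candidates.append(text[start:start + n])
            pos = text.find(ctx, pos + 1)
    elif pattern.startswith("#"):        # right context, instance precedes
        ctx = pattern[1:]
        pos = text.find(ctx)
        while pos != -1:
            for n in range(1, max_len + 1):
                if pos - n >= 0:
                    candidates.append(text[pos - n:pos])
            pos = text.find(ctx, pos + 1)
    return candidates


sentence = "この法律において「商品取引所」とは、会員商品取引所及び株式会社商品取引所をいう。"
print(induce_instances(sentence, "律において「#"))
```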
Bidirectional Adjacency Constraint (BAC)
• Constraint to ensure proper segmentation
Instance i: … この法律において「商品取引所」とは、会員商品取引所 …
• Instance reliability
  • Preceding instance reliability
  • Succeeding instance reliability
  • Combine as the generalized average
high / high / high: … 律において「商品取引所」とは、会員 …
high / low / low: … 律において「商品取引所」とは、会員 …
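One way to read "combine as the generalized average" is the power mean of the two directional reliabilities. The sketch below assumes that reading. The exponent m is not given on the slide; m = -1 (the harmonic mean) is chosen here purely for illustration because it lets one low side pull the combined score down, which matches the high / low / low example. The names generalized_mean and bac_reliability are hypothetical.

```python
import math


def generalized_mean(a: float, b: float, m: float) -> float:
    """Power mean of two non-negative values with exponent m."""
    if m == 0:
        return math.sqrt(a * b)          # limit case: geometric mean
    if m < 0 and (a == 0 or b == 0):
        return 0.0                       # avoid division by zero
    return ((a ** m + b ** m) / 2) ** (1 / m)


def bac_reliability(preceding: float, succeeding: float, m: float = -1.0) -> float:
    """Combine preceding-side and succeeding-side instance reliability."""
    return generalized_mean(preceding, succeeding, m)


print(bac_reliability(0.9, 0.9))   # both sides reliable
print(bac_reliability(0.9, 0.1))   # one side unreliable
```

With an arithmetic mean (m = 1) the second case would still score 0.5, so a candidate supported from only one side could survive; a smaller exponent penalizes it more strongly.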
Monaka Algorithm – Ambiguous Patterns and Instances
• Character n-gram based pattern/instance induction
  • Negative effect of generic instances/patterns is more serious, e.g. “て「#”, “#」と”
  • The number of extracted instances is unpredictable
• Ambiguous pattern filtering
  • Ambiguity = # of co-occurring instance types
  • Discard the 10 most ambiguous patterns after each induction
• Ambiguous instance filtering
  • Ambiguity = # of statutes in which the instance appears (DF)
  • Discard those that appear in more than 70% of the statutes
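The two filters on this slide can be sketched as below. The thresholds (10 patterns, 70% of statutes) come from the slide; the function names, the data layout (dicts of sets), and the toy data are assumptions made for illustration.

```python
def filter_ambiguous_patterns(pattern_instances: dict, n_discard: int = 10) -> dict:
    """Drop the n_discard patterns that co-occur with the most instance types."""
    ranked = sorted(
        pattern_instances,
        key=lambda p: len(pattern_instances[p]),
        reverse=True,
    )
    discarded = set(ranked[:n_discard])
    return {p: i for p, i in pattern_instances.items() if p not in discarded}


def filter_ambiguous_instances(instance_statutes: dict, n_statutes: int,
                               max_ratio: float = 0.7) -> set:
    """Keep instances whose document frequency is at most max_ratio of the statutes."""
    return {
        inst for inst, statutes in instance_statutes.items()
        if len(statutes) / n_statutes <= max_ratio
    }


# Toy data: pattern -> co-occurring instance types; instance -> statute ids.
patterns = {"て「#": {"商品", "著作", "会員"}, "#」とは、": {"商品取引所", "著作物"}, "律において「#": {"商品取引所"}}
instances = {"商品取引所": {1}, "者": {1, 2, 3}, "もの": {1, 2, 3, 4}}
print(filter_ambiguous_patterns(patterns, n_discard=1))
print(filter_ambiguous_instances(instances, n_statutes=4))
```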
Experimental Settings
• Corpus
  • 228 Japanese acts included in the translation project
  • Article, paragraph, and item numbers → head markers
• Seed instances
  • 100 defined terms randomly chosen out of 1,225 defined terms extracted by regular expression
• Bootstrapping
  • # of patterns: initially 100, incremented by 10
  • # of instances: start with 100 seeds, 100 new instances cumulatively learned in each iteration
  • A total of 10 iterations
Evaluation
1. Defined term reproducibility test
  • How well the rest of the defined terms are reproduced, without depending on the definition sentences
  • Gold standard: 1,225 defined terms
  • Closed test
2. SBD coverage test
  • How many of the SBD entries are covered
  • Gold standard: all the 3,510 SBD entries that appear at least once in the corpus
  • Open test
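Both tests compare the set of extracted instances with a gold-standard term set, which amounts to set-based precision and recall. The sketch below is one plausible way to compute them and is not the authors' evaluation script. Excluding the seeds from both sides is an assumption based on the phrase "the rest of the defined terms", and the function name is hypothetical.

```python
def precision_recall(extracted, gold, seeds=()):
    """Set-based precision and recall, ignoring seed instances on both sides."""
    seeds = set(seeds)
    extracted = set(extracted) - seeds
    gold = set(gold) - seeds
    hits = len(extracted & gold)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Toy example with made-up sets.
gold = {"商品取引所", "著作物", "著作者", "会員"}
extracted = {"商品取引所", "著作物", "商品取", "において"}
print(precision_recall(extracted, gold, seeds={"会員"}))
```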
Results – Defined Term Reproducibility
Extracted a quarter of the defined terms with a precision of 29.2%
Results – SBD Coverage
5% or more improvement
→ Supports the effectiveness of the constraint
Result – Extracted Patterns
• Mostly substrings of other patterns
• Most of the patterns are quite generic
  • A single pattern may induce too many incorrect instances
• Reliability measures are effective for ranking patterns/instances
• BAC is essential for extraction from unsegmented text
Conclusion
• Monaka algorithm was proposed
  • Bootstrapping-based lexical knowledge acquisition
  • Simple character n-gram based instance/pattern induction
  • Constraint (BAC) to ensure proper segmentation
  • Ambiguous pattern/instance filtering
• Evaluation results
  • Improved precision/recall in both defined term reproducibility and SBD coverage
  • BAC helped to extract many correctly segmented instances
• Future work
  • Application of Monaka to other domains
    • Highly “fixed” format of Japanese statutes
  • Investigation of the effect of “topic drift”
    • [Komachi et al. 2008] showed that bootstrapping tends to converge to generic instances