  1. Enhance legal retrieval applications with an automatically induced knowledge base Ka Kan Lo

  2. Contents • Introduction • Practice in legal retrieval • Generation of Background concepts • Combining concepts and contexts • Conclusion

  3. Introduction • Why do we need advanced legal retrieval and e-discovery? • Document collections • Legal requirements • Efficiency

  4. Introduction • What are the challenges? • Explosive growth in document volume • Extensive range of document sources • Expanding collection of document formats • Informal language

  5. Introduction • Opportunities: • Make use of background contexts • Search documents deeply for every possible piece of evidence • Example – TREC: the complaint as background information • More context information: the Web and its links

  6. Practice in Retrieval Process • TREC legal track practice: • Defendants devise queries • Plaintiffs respond in turn • Final queries for the production request • Documents retrieved

  7. Practice in Retrieval Process • What can be added to the process? • Exploit the background information – complaints • Merge with the larger background – Web and links • Proposal in this work – Use Wikipedia as an example

  8. Modeling

  9. Generation of Background concepts • Representation of Background concepts: • Entities & Relations • Ease the conversion from texts to concepts • Facilitate unsupervised operations

  10. Generation of Background concepts • Concept source – Wikipedia • Page: a document • Title: the central concept described by a document • Links: a set of concepts / terms pointing to other pages • Words: the set of words in a page
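
The page, title, links and words listed on this slide can be pictured as a small record type. The following is a minimal sketch, not the author's implementation: the `WikiPage` class, the `make_page` helper, the naive whitespace tokenizer and the toy page text are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WikiPage:
    """Hypothetical container for the parts of a page named on the slide."""
    title: str        # central concept described by the page
    links: frozenset  # titles of the pages this page links to
    words: frozenset  # set of lower-cased words in the page text


def make_page(title, text, links):
    """Build a WikiPage from raw text using a naive whitespace split."""
    tokens = (w.strip(".,;:()").lower() for w in text.split())
    return WikiPage(title=title,
                    links=frozenset(links),
                    words=frozenset(t for t in tokens if t))


# Toy page with made-up text, for illustration only.
page = make_page("Tobacco",
                 "Tobacco is a plant. Its leaves are cured.",
                 ["Plant", "Nicotine"])
```

Keeping links and words as sets rather than lists matches the slide's wording ("a set of concepts", "set of words") and makes later set operations over pages straightforward.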

  11. Generation of Background concepts • Facilitate lexical realization from texts to concepts: • Surface concepts: mentioned by a page • Hidden concepts: indexed by no page, but present within pages

  12. Generation of Background concepts • Entities: • Basic objects – named entities, locations, organizations, … • Definitions: • e ⊂ c, e ≠ r, e ∈ roles of relations

  13. Generation of Background concepts • Relations: • Relationships between concepts • r ⊂ c • r ≠ e • r = <role_1, role_2, role_3>, role_i = e
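
The definitions on the two slides above (entities and relations are both concepts, they are distinct from each other, and each role of a relation is filled by an entity) can be written down as types. This is only a sketch under those stated definitions; the class names and the three-role "sell" example are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Concept:
    name: str


@dataclass(frozen=True)
class Entity(Concept):
    """e ⊂ c: every entity is a concept."""


@dataclass(frozen=True)
class Relation(Concept):
    """r ⊂ c: every relation is a concept whose roles are filled by entities."""
    roles: Tuple[Entity, ...]

    def __post_init__(self):
        # role_i = e: reject any role filler that is not an Entity.
        if not all(isinstance(role, Entity) for role in self.roles):
            raise TypeError("every role must be filled by an Entity")


# Made-up relation with three roles, for illustration only.
sale = Relation("sell", (Entity("seller"), Entity("goods"), Entity("buyer")))
```

Making `Entity` and `Relation` separate subclasses of `Concept` encodes e ≠ r directly: an object is one or the other, never both.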

  14. Semantical Domain • Semantical Domain: • A group of inter-related concepts, as defined by Wikipedians • Groups can be configured and reconfigured, depending on the size and nature of the domains • Represents background information of different sizes, natures and structures

  15. Semantical Domain • Operations: • D = {page_i} where page_i ∈ E • Overlap • Subsumed • Join
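
Since a domain D is defined as a set of pages, the three operations can be read as ordinary set operations. That reading is an assumption, as the slide does not define them further: here overlap is taken as intersection, subsumed as the subset test, and join as union. The domain names and page titles are made up.

```python
def overlap(d1, d2):
    """Pages shared by two domains; an empty set means no overlap."""
    return d1 & d2


def subsumed(d1, d2):
    """True if every page of d1 also belongs to d2."""
    return d1 <= d2


def join(d1, d2):
    """Merge two domains into a larger one."""
    return d1 | d2


# Toy domains, each a set of page titles (illustrative only).
law = frozenset({"Contract", "Tort", "Complaint"})
litigation = frozenset({"Complaint", "Discovery"})
pleading = frozenset({"Complaint"})

shared = overlap(law, litigation)
merged = join(law, litigation)
```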

  16. Knowledge Extraction, Parsing • Parsing: • Conversion of syntactic parses into concept representations • Dependency parsing • Fill in the entities and relations automatically
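
To make the conversion step concrete, the sketch below turns dependency edges into a relation with entity roles. It does not call a parser: the edges are written by hand as (head, label, dependent) triples, and the rule "a verb with a subject and a direct object becomes a relation" is an assumed simplification, not the procedure used in this work. The labels `nsubj` and `dobj` follow the common Stanford-style dependency naming.

```python
from collections import defaultdict


def edges_to_relations(edges):
    """Convert (head, label, dependent) triples for one sentence into
    (relation, role_1, role_2) tuples."""
    args = defaultdict(dict)
    for head, label, dependent in edges:
        # Only subject and direct-object edges fill roles in this sketch.
        if label in ("nsubj", "dobj"):
            args[head][label] = dependent
    relations = []
    for verb, roles in args.items():
        if "nsubj" in roles and "dobj" in roles:
            relations.append((verb, roles["nsubj"], roles["dobj"]))
    return relations


# Hand-written edges for the made-up sentence "The plaintiff filed a complaint."
edges = [
    ("filed", "nsubj", "plaintiff"),
    ("filed", "dobj", "complaint"),
    ("plaintiff", "det", "the"),
    ("complaint", "det", "a"),
]
relations = edges_to_relations(edges)
```

Here `relations` holds a single tuple whose head is the verb and whose two roles are the subject and object nouns, which is the shape of r = <role_1, role_2, ...> with each role filled by an entity.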

  17. Entities & Relations • Highlights of the process: • Syntactic parsing of sentences • Conversion from linguistic representations to concept representations • Constrain the concept spaces to different sizes and scopes

  18. Combining the concepts and background contexts • Algorithm: • Filter the background text and the request text • Match the term set against Wikipedia • Build the network of concepts and relations • Combine into a single network and filter out unnecessary concepts • Extract terms and concepts and expand the query string • Submit the query for retrieval
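
The steps on this slide can be sketched end to end on toy data. Everything concrete here is an assumption: the stop-word list, the in-memory `wiki_links` dictionary standing in for Wikipedia, the example request and complaint text, and in particular the filter rule in step 4 (keep a linked concept only if the background text mentions it or a background concept links to it). Submitting the expanded query to a retrieval engine is left out.

```python
STOPWORDS = {"the", "a", "of", "to", "and", "in", "on", "for", "all"}


def filter_terms(text):
    """Step 1: lower-case, strip punctuation, drop stop words."""
    tokens = (w.strip(".,;:()").lower() for w in text.split())
    return {t for t in tokens if t and t not in STOPWORDS}


def neighbours(terms, wiki_links):
    """Steps 2-3: match terms to page titles, collect the linked concepts."""
    matched = {t for t in terms if t in wiki_links}
    return set().union(*(wiki_links[t] for t in matched))


def expansion_terms(request, background, wiki_links, max_new_terms=5):
    request_terms = filter_terms(request)
    background_terms = filter_terms(background)
    candidates = neighbours(request_terms, wiki_links)
    # Step 4 (assumed filter): keep a candidate only if the background text
    # mentions it or a background concept links to it.
    support = background_terms | neighbours(background_terms, wiki_links)
    kept = (candidates & support) - request_terms
    # Step 5: deterministic order, capped length.
    return sorted(kept)[:max_new_terms]


# Toy stand-in for Wikipedia: page title -> linked concepts (made up).
wiki_links = {
    "tobacco": {"nicotine", "cigarette", "plant"},
    "advertising": {"marketing", "cigarette"},
    "nicotine": {"addiction"},
}
request = "All documents on tobacco advertising"
background = "The complaint alleges marketing of nicotine products to minors."

new_terms = expansion_terms(request, background, wiki_links)
expanded_query = request + " " + " ".join(new_terms)
```

On this toy input the request concepts link to four candidates, and only the two that the complaint text also supports are appended to the query. A stricter or looser filter in step 4 trades precision against recall in the expanded query.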

  19. Conclusion

  20. Conclusion • Challenges in legal retrieval • Background contexts • Generation of background concepts • Project the context to concepts • Expand the queries for retrieval

  21. Conclusion • Current work: • Integration of language learning (not only parsing) with the concept generation process • Large-scale construction of networks from the full document sets in 3 languages on a Grid: • English: 1.7 million • Spanish: 300 thousand • Chinese: 200 thousand

  22. Conclusion • Current work: • Experiments running on a corpus of 20M web pages for expanded links • Generated language and concept spaces used in other Natural Language Technologies (NLT) • TREC-Legal: testing the integration of the knowledge base with the complaint text for queries • TREC-Legal: building a new matching mechanism (from KB induction) on a small, concise set of documents

  23. Thank you • Q&A

