Library of Chinese Academy of Sciences

Towards Constructing a Chinese Information Extraction System to Support Innovations in Library Services. World Library and Information Congress: 72nd IFLA General Conference and Council, 20-24 August 2006, Seoul, Korea. Zhang Zhixiong, Li Sa, Wu Zhengxin, Lin Ying.


Presentation Transcript


  1. Towards Constructing a Chinese Information Extraction System to Support Innovations in Library Services • World Library and Information Congress: 72nd IFLA General Conference and Council, 20-24 August 2006, Seoul, Korea • Zhang Zhixiong, Li Sa, Wu Zhengxin, Lin Ying • Library of Chinese Academy of Sciences

  2. outline • Introduction • What is IE (Information Extraction)? • Potential functions in Innovations of Library Services • Constructing a Chinese Information Extraction System • Tests and Evaluation

  3. 1. Introduction • Library of Chinese Academy of Sciences • Now changing its name to National Science Library of China • about 400 staff, HQ in Beijing, 3 branches in Lanzhou, Chengdu and Wuhan • serves 90 CAS research institutes across the country • in 2001, initiated the Chinese National Science Digital Library (CSDL) program

  4. 1. Introduction • CSDL (Chinese National Science Digital Library) • provided abundant digital information resources for users (e-journals: 6000 Western, 11000 Chinese; 15000 in one day) • developed information systems to support networked services

  5. Union Catalogs & Document Delivery

  6. Federated database search

  7. Digital reference

  8. Remote authentication

  9. 1. Introduction • CSDL (Chinese National Science Digital Library) • provided abundant digital information resources for users (e-journals: 6000 Western, 11000 Chinese; 15000 in one day) • developed information systems to support networked services • carried out many training and publicity programs

  10. 1. Introduction • CSDL has become one of the key research facilities for researchers and graduate students of CAS • However: • the information requirements of researchers and graduate students change rapidly • traditional information retrieval methods are not sufficient

  11. 1. Introduction • Users of CSDL want to: • get rid of information noise • efficiently get a comprehensive view of recent developments in a domain • discover significant relationships between pieces of information • Librarians of CSDL want to: • improve the service standard of CSDL • turn the digital library into a knowledge repository

  12. 1. Introduction • Information Extraction (IE) is an emerging technology that serves these needs

  13. outline • Introduction • What is IE (Information Extraction)? • Potential functions in Innovations of Library Services • Constructing a Chinese Information Extraction System • Tests and Evaluation

  14. 2. What is IE (Information Extraction)? • NLP Research Group, University of Sheffield • Information extraction (IE) is a term that has come to be applied to the activity of automatically extracting pre-specified sorts of information from natural language texts

  15. 2.What is IE (Information Extraction)? • Dr. Hamish Cunningham • IE is a process that takes texts (and sometimes speech) as input and produces fixed-format, unambiguous data as output • Input • unstructured • free text • Output • fixed-format • unambiguous

  16. 2. What is IE (Information Extraction)? • Output (structured information source) can be used for: • searching • analysis • generating summaries • constructing indices

  17. IE, an example • Source text: ##### ####### NHS TRUST - PATIENT CASE NOTE ########:######### ####### DOB: 1944 CLEF-RMH-Entry-Key: 52A4F6DB2B46E AB 1992 Seen in General Surgical This lady who has had a mastectomy and left open capsulotomy and removal of her prosthesis was seen by me in the clinic today on behalf of XXXXXXXXXXX. She has extensive bony lymphoedema in her left arm which does not seem to be getting any better although she is more or less reconciled to the problem. The original problem was that she complained of shooting pain in the direction of ulna nerve and although there does not seem to be any evidence of local, regional or distant recurrence the pain itself warrants management in a pain clinic. XXXXXXXXX could be seen in the pain clinic at the XXXXXXX but as this would involve a lot of travelling would like to be treated nearer her home. I wonder whether it would be possible for you to investigate if there is a pain clinic available at XXXXXXXXXXX as I am sure XXXXX could be treated and benefit from its management. I have otherwise arranged for her to be seen in the clinic again in a year's time. There are no signs of recurrence at this time. 5213A4F612F1 • Highlighted spans: General Surgical; left open capsulotomy; mastectomy; removal of her prosthesis; today; bony lymphoedema; left arm; shooting pain in the direction of ulna nerve; local, regional or distant; pain; recurrence; pain clinic; management; a year's time; clinic; no signs of recurrence; at this time • Annotation categories: Interventions, Problems, Problem Site, Locations, Time

  18. IE, an example • Extracted information could be collected… • Categories: Interventions, Problems, Problem Site, Locations, Time • (Figure: the spans highlighted in the case note on slide 17, grouped under these categories)

  19. 2.What is IE (Information Extraction)? • 5 kinds of Information Extraction tasks • Named Entity recognition (NE) • Coreference resolution (CO) • Template Element construction (TE) • Template Relation construction (TR) • Scenario Template production (ST)

  20. 2.What is IE (Information Extraction)? • NE is about finding entities • CO about which entities and references (such as pronouns) refer to the same thing • TE about what attributes entities have • TR about what relationships between entities there are • ST about events that the entities participate in.
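
The five task types above can be pictured as simple record types. The following is a minimal sketch and not part of the original slides: the class names and fields are hypothetical, and the sample values reuse facts stated on slide 3.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """NE: an entity found in the text. TE: its attributes."""
    text: str
    kind: str
    attributes: dict = field(default_factory=dict)


@dataclass
class Relation:
    """TR: a relationship between two entities."""
    kind: str
    source: Entity
    target: Entity


@dataclass
class Event:
    """ST: an event and the entities that participate in it."""
    kind: str
    participants: list


# NE + TE: two entities with attributes (values taken from slide 3).
library = Entity("Library of Chinese Academy of Sciences", "Organization",
                 {"headquarters": "Beijing"})
program = Entity("CSDL", "Program", {"started": "2001"})

# CO: a later reference such as "it" resolved to the same entity.
coreference = {"it": library}

# TR and ST built on top of the entities.
initiated = Relation("initiated", library, program)
event = Event("program_launch", [library, program])
```

Each layer builds on the one before it, which is why named entity recognition is usually the first task to get right.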

  21. 2.What is IE (Information Extraction)? • Information Extraction will: • play a very important role in coping with the huge collections of digital information • bring innovations in library services

  22. outline • Introduction • What is IE (Information Extraction)? • Potential functions in Innovations of Library Services • Constructing a Chinese Information Extraction System • Tests and Evaluation

  23. 3. Potential functions in Innovations of Library Services • Automatic annotation and metadata creation • automatic annotation of digital materials • automatic acquisition of metadata • For example, MnM, S-CREAM, AERODAML, SemTag, KIM, hTechsight • ontology-based IE techniques

  24. 3. Potential functions in Innovations of Library Services • Improving data mining in information analysis • Large-scale data analysis • Detection of many types of evidence • Get enough structured data for analysis

  25. 3. Potential functions in Innovations of Library Services • Developing knowledge base from free text • statistical and numeric databases • terminological database • fact sheets • SOBA (SmartWeb Ontology-Based Annotation)

  26. 3. Potential functions in Innovations of Library Services • Generating answers in digital reference systems • Most research libraries have established digital reference services • Can we get answers directly from information systems? • Natural language QA (Question Answering)

  27. So… • IE is very important • How can a Chinese IE system be built? • CSDL is trying to find an effective way

  28. outline • Introduction • What is IE (Information Extraction)? • Potential functions in Innovations of Library Services • Constructing a Chinese Information Extraction System • Tests and Evaluation

  29. 4. Constructing a Chinese Information Extraction System • A Chinese IE solution • which makes full use of GATE • developing a Chinese IE plug-in, based on the GATE framework, to process Chinese information resources

  30. 4. Constructing a Chinese Information Extraction System • GATE • (General Architecture for Text Engineering) • Open Source, Developed from 1995 • GATE, a framework • Language Resources (LRs) • Processing Resources (PRs) • Visual Resources (VRs) • ANNIE (A Nearly-New IE system) • tokeniser, sentence splitter, POS tagger, gazetteer, finite state transducer and orthomatcher

  31. ANNIE Pipeline
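
A pipeline of this kind can be sketched as a list of processing steps applied in order to one document. This is a simplified illustration in Python, not GATE's Java API: the function names are hypothetical, and only the first two ANNIE stages (tokeniser and sentence splitter) are imitated.

```python
import re


def tokenise(doc):
    """Split text into word tokens and single punctuation marks."""
    doc["tokens"] = re.findall(r"\w+|[^\w\s]", doc["text"])
    return doc


def split_sentences(doc):
    """Group tokens into sentences, closing one at each '.', '!' or '?'."""
    sentences, current = [], []
    for token in doc["tokens"]:
        current.append(token)
        if token in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:  # keep a trailing sentence that has no final punctuation
        sentences.append(current)
    doc["sentences"] = sentences
    return doc


def run_pipeline(text, stages):
    """Apply each stage in order; every stage adds annotations to the document."""
    doc = {"text": text}
    for stage in stages:
        doc = stage(doc)
    return doc


doc = run_pipeline("GATE is open source. ANNIE ships with it.",
                   [tokenise, split_sentences])
```

Because every later stage consumes the tokens produced by the first one, a tokeniser that relies on white space passes unsegmented Chinese text through as one long token, and the rest of the pipeline has nothing useful to work with.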

  32. GATE: good for English

  33. GATE: Not so good for Chinese

  34. 4.Constructing a Chinese Information Extraction System • Key difficulties for Chinese information extraction • Chinese tokenizing • Chinese gazetteers • Chinese named entity recognition

  35. Chinese tokenizing • English language • words are separated by white space and punctuation • Chinese language • words are written without any separation between them

  36. A simple sentence (I am a Chinese) can be segmented in several ways by a segmenter: • (I am a Chinese) • (I am China person) • (I am center country person)
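
The ambiguity can be reproduced with a toy dictionary-based segmenter. The sketch below is illustrative only: the Chinese sentence is assumed from the English glosses on the slide (the original characters are missing from this transcript), the vocabulary is hypothetical, and exhaustive enumeration is not how ICTCLAS works.

```python
def segmentations(text, vocab):
    """Return every way to split text into words that all appear in vocab."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in vocab:
            # Keep this word and segment the remainder recursively.
            for rest in segmentations(text[end:], vocab):
                results.append([word] + rest)
    return results


# Hypothetical vocabulary: I, am, center, country, person, China, Chinese person.
VOCAB = {"我", "是", "中", "国", "人", "中国", "中国人"}

# Assumed sentence behind the glosses "I am a Chinese".
candidates = segmentations("我是中国人", VOCAB)
for candidate in candidates:
    print(" / ".join(candidate))
```

All three candidates are valid according to the dictionary, so a practical segmenter needs extra evidence, such as word frequencies or context, to choose between them.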

  37. Chinese gazetteers • GATE gazetteer lists for English • very abundant • GATE gazetteer lists for Chinese processing • only simple and short gazetteers such as date, time, organization, location, money, province etc. • for a flexible language like Chinese, these lists are very limited
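
What a gazetteer does can be sketched as a longest-match lookup of known names over a segmented token list. This is a minimal illustration, not GATE's gazetteer implementation: the entries, type labels and the `lookup` function are hypothetical.

```python
# Hypothetical gazetteer entries mapped to an entity type.
GAZETTEER = {
    "北京": "location",
    "兰州": "location",
    "中国科学院": "organization",
}


def lookup(tokens, gazetteer, max_len=4):
    """Greedy longest-match lookup; returns (text, type, start, end) tuples."""
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest span first so that longer names win over shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = "".join(tokens[i:i + n])
            if candidate in gazetteer:
                matches.append((candidate, gazetteer[candidate], i, i + n))
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return matches


# Tokens as a segmenter might produce them: the organization name is split in two.
matches = lookup(["中国", "科学院", "位于", "北京"], GAZETTEER)
```

A lookup like this only finds names that are already listed, which is why short lists limit what the system can recognize.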

  38. Chinese named entity recognition • GATE system uses JAPE (a Java Annotation Patterns Engine) rules to recognize NE

  39. JAPE rules • grammar of Chinese is quite different from that of English • the JAPE rules provided by GATE are not suitable for Chinese texts • We need to rewrite JAPE rules to implement Chinese information extraction
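
The idea of a pattern rule over annotated tokens can be sketched as follows. This is Python rather than JAPE and is not one of the authors' rules: the suffix list, the function name and the "location followed by an organization suffix" pattern are assumptions chosen for illustration.

```python
# Hypothetical suffixes that often end Chinese organization names
# (university, library, research institute).
ORG_SUFFIXES = ("大学", "图书馆", "研究所")


def tag_organizations(tokens, locations):
    """Pattern: a known location token followed by a token with an organization suffix.

    Action: annotate the two tokens together as one Organization entity.
    """
    entities = []
    i = 0
    while i < len(tokens) - 1:
        if tokens[i] in locations and tokens[i + 1].endswith(ORG_SUFFIXES):
            entities.append(("Organization", tokens[i] + tokens[i + 1], i, i + 2))
            i += 2
        else:
            i += 1
    return entities


# "He works at Peking University", already segmented into words.
entities = tag_organizations(["他", "在", "北京", "大学", "工作"], {"北京"})
```

The pattern operates on segmented tokens and gazetteer matches, so rules written this way depend on the quality of both earlier steps.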

  40. Solutions to the problems

  41. three main tasks we have done • Integrating ICTCLAS to perform word segmentation

  42. three main tasks we have done • Developing Chinese gazetteers to enrich GATE language resources

  43. three main tasks we have done • Rewriting JAPE rules to recognize Chinese NE

  44. Chinese JAPE rule

  45. outline • Introduction • What is IE (Information Extraction)? • Potential functions in Innovations of Library Services • Constructing a Chinese Information Extraction System • Tests and Evaluation

  46. 5. Tests and Evaluation • after one year of work, we implemented the system • carried out experiments

  47. Same piece of article

  48. Our output

  49. Conclusions • put forward a solution for a Chinese information extraction system • carried out a valuable experiment • much work still needs to be done • laid a good foundation for our future work

  50. Thanks! 谢谢!
