
Language and Information

Language and Information. LIS 610 November 6, 2002 Nina Wacholder nina@scils.rutgers.edu. Agenda. Role of language in information science Current research: Human Computer Interaction with Electronic Indexes and Index Terms. Textual information.


Presentation Transcript


  1. Language and Information LIS 610 November 6, 2002 Nina Wacholder nina@scils.rutgers.edu

  2. Agenda • Role of language in information science • Current research: Human Computer Interaction with Electronic Indexes and Index Terms Language and Information 11/06/02 Nina Wacholder

  3. Textual information • Information conveyed by alphabets, digits and punctuation • Organized into meaningful units recognized by some group of people

  4. Other techniques for conveying information • Spoken language • Gesture • Facial expression • Sound • Images (drawings, photographs …)

  5. Language • Uniquely human • Learned • Conventional

  6. Understanding language is hard • Expresses complex concepts • Ambiguity – words, phrases and sentences have more than one meaning • Synonymy – different words, phrases and sentences have the same meaning

  7. Complex concepts • Pencil • Face • Directions to Alexander Library • Theory of relativity • U.S. election law

  8. Synonymy • child, kid, adolescent, baby • flammable, inflammable • I was walking up the street that day. • I was walking down the street that day. • Moxie wrote that report. That report was written by Moxie.

  9. Ambiguity – semantic • Bat • Make a bed • Moxie ate potatoes with a fork. • Moxie ate potatoes with fish.

  10. Ambiguity – structural (syntactic) • Red airplane terminal • [[red airplane] terminal] • [red [airplane terminal]] • Moxie saw Toxie in the park with a telescope • Moxie saw [Toxie in the park with a telescope] • Moxie [saw] Toxie in the park [with a telescope]
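
The two bracketings of "red airplane terminal" on this slide are exactly the binary groupings of a three-word sequence. As a rough illustration, the sketch below enumerates every binary bracketing of a word list. The function name `bracketings` is invented for this example, and the code only counts possible structures; it says nothing about which reading a speaker intends.

```python
def bracketings(words):
    """Return every binary bracketing of a list of words, as strings."""
    # A single word needs no brackets.
    if len(words) == 1:
        return [words[0]]
    results = []
    # Try every split point, and combine each bracketing of the left part
    # with each bracketing of the right part.
    for split in range(1, len(words)):
        for left in bracketings(words[:split]):
            for right in bracketings(words[split:]):
                results.append(f"[{left} {right}]")
    return results


if __name__ == "__main__":
    for reading in bracketings(["red", "airplane", "terminal"]):
        print(reading)
```

The number of bracketings grows quickly with phrase length (two for three words, five for four words), which is one reason structural ambiguity is hard to resolve automatically.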

  11. Natural language processing (NLP) • Natural language • Computer language

  12. The NLP controversy: rules vs. statistics

  13. NLP by rule • Lexicon (vocabulary) • Det: a • ProperName: Moxie • Noun: report • Verb: wrote • Syntactic rules • NounPhrase[a report] → Det[a] Noun[report] • NounPhrase[Moxie] → ProperName[Moxie] • VerbPhrase[wrote a report] → Verb[wrote] NounPhrase[a report] • Sentence[Moxie wrote a report] → NounPhrase[Moxie] VerbPhrase[wrote a report]
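
The lexicon and syntactic rules on this slide are small enough to run. The sketch below encodes them as data and reduces a sentence bottom-up by exhaustive search. The names `LEXICON`, `RULES`, `reduce_items` and `parse_sentence` are invented for this illustration, and the search strategy is an assumption chosen for brevity, not a method taken from the lecture.

```python
# Lexicon from the slide: word -> category.
LEXICON = {"a": "Det", "Moxie": "ProperName", "report": "Noun", "wrote": "Verb"}

# Syntactic rules from the slide: (parent category, child categories).
RULES = [
    ("NounPhrase", ("Det", "Noun")),
    ("NounPhrase", ("ProperName",)),
    ("VerbPhrase", ("Verb", "NounPhrase")),
    ("Sentence", ("NounPhrase", "VerbPhrase")),
]


def reduce_items(items):
    """Try every rule at every position; return a Sentence tree or None.

    Each item is a (category, bracketed_tree) pair.
    """
    if len(items) == 1 and items[0][0] == "Sentence":
        return items[0][1]
    for parent, children in RULES:
        width = len(children)
        for start in range(len(items) - width + 1):
            window = items[start:start + width]
            if tuple(category for category, _ in window) == children:
                tree = f"{parent}[{' '.join(text for _, text in window)}]"
                reduced = items[:start] + [(parent, tree)] + items[start + width:]
                result = reduce_items(reduced)
                if result is not None:
                    return result
    return None


def parse_sentence(sentence):
    """Return a bracketed parse, or None if the grammar rejects the input."""
    tokens = sentence.split()
    if any(token not in LEXICON for token in tokens):
        return None  # unknown word: the toy lexicon cannot categorize it
    return reduce_items([(LEXICON[t], f"{LEXICON[t]}[{t}]") for t in tokens])


if __name__ == "__main__":
    print(parse_sentence("Moxie wrote a report"))
    print(parse_sentence("wrote Moxie a report"))
```

Exhaustive search is only practical for a grammar this small; real rule-based parsers use chart-style algorithms to avoid repeating work.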

  14. NLP by statistics • Luhn (1958) • tf*idf (Salton and Buckley 1988) • Maximum entropy (Berger, Della Pietra and Della Pietra 1996)
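
To make the tf*idf entry concrete, here is a minimal sketch of one common variant: raw term frequency multiplied by log(N / document frequency). Salton and Buckley discuss several weighting schemes, so this particular formula is an assumption chosen for simplicity, and the function name `tf_idf` and the three toy documents are invented for the example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Weight = (count of term in document) * log(N / number of documents
    containing the term), with whitespace tokenization and lowercasing.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return weights


if __name__ == "__main__":
    # Hypothetical toy collection, not data from the lecture.
    docs = ["asbestos workers cancer", "cancer deaths", "cancer research"]
    for doc, doc_weights in zip(docs, tf_idf(docs)):
        print(doc, doc_weights)
```

A term that occurs in every document ("cancer" in the toy collection) gets weight zero, while a term confined to one document gets the highest weight, which is the intuition behind using tf*idf to pick distinguishing terms.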

  15. Information-access tasks with significant natural language component • Information retrieval • Information extraction • Automatic summarization • Question answering

  16. Sparck Jones (2001) • Task core vs. task context • Information retrieval: 30-40% accuracy for systems in natural environment • Information extraction: 50% for core systems • Automatic summarization: no sound basis for core evaluation

  17. Task: compare domain-independent, corpus-independent methods for automatic identification of terms to represent a document or collection of documents • Methods for term identification • Head-sorted NPs (HS) (Wacholder 1998) • Keywords (KW) • Technical Terms (TT) (Justeson and Katz 1995) • Evaluation of Head Sorting Mechanism: Wacholder, Klavans and Evans (2000)

  18. Examples of terms identified by indexing method (column headings: Keywords, Head-sorted NPs, Technical terms): asbestos/asbestosis workers cancer deaths worker/workers/worked asbestos workers lung cancer cancer 160 workers kent cigarette death cancer dr. talcott make lung cancer cigarette filter lorillard asbestos u.s. fiber cancer causing asbestos dr. lung cancer deaths … ...
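
As a rough illustration of what "head-sorted" means, the sketch below groups noun phrases under their head noun and lists the heads alphabetically. It assumes the head of an English noun phrase is its last word, which is a simplification made for this example and not a description of the procedure in Wacholder (1998). The function name `head_sort` is invented, and the sample phrases are taken from the example terms on this slide.

```python
from collections import defaultdict


def head_sort(noun_phrases):
    """Group noun phrases by head, assuming the head is the last word."""
    groups = defaultdict(list)
    for phrase in noun_phrases:
        words = phrase.split()
        if not words:
            continue  # skip empty strings
        groups[words[-1]].append(phrase)
    # Heads in alphabetical order, phrases sorted within each head.
    return {head: sorted(groups[head]) for head in sorted(groups)}


if __name__ == "__main__":
    phrases = ["asbestos workers", "lung cancer", "160 workers", "cancer", "workers"]
    for head, members in head_sort(phrases).items():
        print(f"{head}: {', '.join(members)}")
```

Grouping by head places "asbestos workers" and "160 workers" next to "workers", so a person browsing the index sees related phrases together.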

  19. Ranking of terms by cumulative percentage

  20. Ranking by cumulative number of terms (1 = best; 5 = worst)

  21. Summary of results • Head-sorted terms: mixed quality terms, good document coverage • Technical terms: high quality terms, poor document coverage • Keywords: low quality terms, good document coverage

  22. ISATC Pilot Project • Nina Wacholder, PI • PhD Students: Lu Liu, Mark Sharp, Peng Song, Xiaojun Yuan

  23. Research question • Null hypothesis: Properties of index terms do not affect information seeker’s selection of terms • What properties of index terms affect the selection of terms? • What effects do these properties have?

  24. Material • Text: Rice, McCreadie and Chang (2001) • Index terms: head-sorted terms (Wacholder 1998); technical terms (Justeson and Katz 1995); human index terms

  25. Experimental Searching and Browsing Interface (ESBI) http://www.scils.rutgers.edu/cgi-bin/indexer.cg

  26. Initial results

  27. Future work • Further analysis of experimental data • Compare subjects by type (e.g., undergraduate, MLIS) • Effectiveness of searches (i.e., did they get the right answer?) • Overlap of words in index terms with words in question • … • Evaluation of ESBI interface • Comparison of additional techniques for identifying terms • Use of different texts
