Language and Information LIS 610 November 6, 2002 Nina Wacholder nina@scils.rutgers.edu
Agenda • Role of language in information science • Current research: Human Computer Interaction with Electronic Indexes and Index Terms Language and Information 11/06/02 Nina Wacholder
Textual information • Information conveyed by alphabets, digits and punctuation • Organized into meaningful units recognized by some group of people
Other techniques for conveying information • Spoken language • Gesture • Facial expression • Sound • Images (drawings, photographs …)
Language • Uniquely human • Learned • Conventional
Understanding language is hard • Expresses complex concepts • Ambiguity – words, phrases and sentences have more than one meaning • Synonymy – different words, phrases and sentences have the same meaning
Complex concepts • Pencil • Face • Directions to Alexander Library • Theory of relativity • U.S. election law
Synonymy • child, kid, adolescent, baby • flammable, inflammable • I was walking up the street that day. • I was walking down the street that day. • Moxie wrote that report. That report was written by Moxie.
Ambiguity – semantic • Bat • Make a bed • Moxie ate potatoes with a fork. • Moxie ate potatoes with fish.
Ambiguity – structural (syntactic) • Red airplane terminal • [[red airplane] terminal] • [red [airplane terminal]] • Moxie saw Toxie in the park with a telescope • Moxie saw [Toxie in the park with a telescope] • Moxie [saw] Toxie in the park [with a telescope]
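The two readings of "red airplane terminal" are the two possible binary bracketings of a three-word phrase. As a minimal sketch (the function name `bracketings` is hypothetical and not from the slides), the following enumerates every binary bracketing of a word sequence, which shows how quickly structural ambiguity grows with phrase length:

```python
def bracketings(words):
    """Return every binary bracketing of a word sequence as a string."""
    if len(words) == 1:
        return [words[0]]
    results = []
    # Try each split point and combine every bracketing of the left part
    # with every bracketing of the right part.
    for i in range(1, len(words)):
        for left in bracketings(words[:i]):
            for right in bracketings(words[i:]):
                results.append(f"[{left} {right}]")
    return results


if __name__ == "__main__":
    for reading in bracketings("red airplane terminal".split()):
        print(reading)
```

Three words give two bracketings and four words give five; the enumeration says nothing about which reading a person would prefer.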
Natural language processing (NLP) • Natural language • Computer language
The NLP controversy: rules vs. statistics
NLP by rule • Lexicon (vocabulary) • Det: a • ProperName: Moxie • Noun: report • Verb: wrote • Syntactic rules • NounPhrase[a report] → Det[a] Noun[report] • NounPhrase[Moxie] → ProperName[Moxie] • VerbPhrase[wrote a report] → Verb[wrote] NounPhrase[a report] • Sentence[Moxie wrote a report] → NounPhrase[Moxie] VerbPhrase[wrote a report]
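The lexicon and syntactic rules on this slide are enough to drive a very small parser. The sketch below is one possible illustration, not the method used in the talk: the names `LEXICON`, `RULES`, `parse`, `expand` and `parse_sentence` are assumptions, and the top-down backtracking strategy is chosen only because it is short.

```python
# Lexicon and rules copied from the slide; everything else is illustrative.
LEXICON = {"a": "Det", "Moxie": "ProperName", "report": "Noun", "wrote": "Verb"}
RULES = {
    "Sentence": [["NounPhrase", "VerbPhrase"]],
    "NounPhrase": [["Det", "Noun"], ["ProperName"]],
    "VerbPhrase": [["Verb", "NounPhrase"]],
}


def parse(symbol, words, start):
    """Yield (tree, next_position) for each way symbol can cover words from start."""
    if symbol not in RULES:
        # Lexical category: it must match the word at the current position.
        if start < len(words) and LEXICON.get(words[start]) == symbol:
            yield f"{symbol}[{words[start]}]", start + 1
        return
    for rhs in RULES[symbol]:
        yield from expand(symbol, rhs, words, start, [])


def expand(symbol, rhs, words, pos, children):
    """Match the remaining right-hand-side symbols one after another."""
    if not rhs:
        yield f"{symbol}[{' '.join(children)}]", pos
        return
    for child, nxt in parse(rhs[0], words, pos):
        yield from expand(symbol, rhs[1:], words, nxt, children + [child])


def parse_sentence(text):
    """Return all complete parses of text as bracketed strings."""
    words = text.split()
    return [tree for tree, end in parse("Sentence", words, 0) if end == len(words)]


if __name__ == "__main__":
    print(parse_sentence("Moxie wrote a report"))
```

A string the rules do not cover, such as "Moxie wrote", returns an empty list, which illustrates the brittleness often cited on the rules side of the rules vs. statistics debate.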
NLP by statistics • Luhn (1958) • tf*idf (Salton and Buckley 1988) • Maximum entropy (Berger, Della Pietra and Della Pietra 1996)
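To make tf*idf concrete, here is a minimal sketch of one common variant: raw term frequency multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the term. The slides do not say which weighting variant is meant, so the formula, the whitespace tokenization and the function name `tf_idf` are all assumptions.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({term: tf[term] * math.log(n_docs / df[term]) for term in tf})
    return weights


if __name__ == "__main__":
    sample = ["asbestos workers lung cancer", "lung cancer deaths", "cigarette filter asbestos"]
    for doc_weights in tf_idf(sample):
        print(doc_weights)
```

A term that occurs in every document receives weight 0 under this variant, which is the intended effect: such a term does not help distinguish one document from another.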
Information-access tasks with significant natural language component • Information retrieval • Information extraction • Automatic summarization • Question answering
Sparck Jones (2001) • Task core vs. task context • Information retrieval: 30-40% accuracy for systems in natural environment • Information extraction: 50% for core systems • Automatic summarization: no sound basis for core evaluation
Evaluation of Head Sorting Mechanism: Wacholder, Klavans and Evans (2000) • Task: compare domain-independent, corpus-independent methods for automatic identification of terms to represent a document or collection of documents • Methods for term identification • Head-sorted NPs (HS) (Wacholder 1998) • Keywords (KW) • Technical Terms (TT) (Justeson and Katz 1995)
Examples of terms identified by indexing method Keywords Head-sorted NPs Technical terms asbestos/asbestosis workers cancer deaths worker/workers/worked asbestos workers lung cancer cancer 160 workers kent cigarette death cancer dr. talcott make lung cancer cigarette filter lorillard asbestos u.s. fiber cancer causing asbestos dr. lung cancer deaths … ...
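The head-sorted examples above ("workers", "asbestos workers", "160 workers") suggest noun phrases grouped under a shared head noun. The sketch below illustrates that idea only; it assumes the head is the last word of the phrase, which is a simplification introduced here and not a description of the procedure in Wacholder (1998). The function name `head_sort` is hypothetical.

```python
from collections import defaultdict


def head_sort(noun_phrases):
    """Group noun phrases by head, taken here to be the last word (an assumption)."""
    groups = defaultdict(list)
    for phrase in noun_phrases:
        head = phrase.split()[-1]
        groups[head].append(phrase)
    # Order heads alphabetically, and order the phrases under each head.
    return {head: sorted(groups[head]) for head in sorted(groups)}


if __name__ == "__main__":
    phrases = ["asbestos workers", "lung cancer", "workers",
               "160 workers", "cancer", "lung cancer deaths"]
    for head, members in head_sort(phrases).items():
        print(head, "->", members)
```

Grouping by head places related phrases next to each other in a browsable index, so a user who looks up "workers" also sees "asbestos workers".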
Ranking of terms by cumulative percentage
Ranking by cumulative number of terms (1 = best; 5 = worst)
Summary of results • Head-sorted terms • mixed quality terms • good document coverage • Technical terms • high quality terms • poor document coverage • Keywords • low quality terms • good document coverage
ISATC Pilot Project • Nina Wacholder, PI • PhD Students: Lu Liu, Mark Sharp, Peng Song, Xiaojun Yuan
Research question • Null hypothesis: Properties of index terms do not affect information seekers' selection of terms • What properties of index terms affect the selection of terms? • What effects do these properties have?
Material • Text • Rice, McCreadie and Chang (2001) • Index terms • Head-sorted terms (Wacholder 1998) • Technical terms (Justeson and Katz 1995) • Human index terms
Experimental Searching and Browsing Interface (ESBI) http://www.scils.rutgers.edu/cgi-bin/indexer.cg
Initial results
Future work • Further analysis of experimental data • Compare subjects by type (e.g., undergraduate, MLIS) • Effectiveness of searches (i.e., did they get the right answer?) • Overlap of words in index terms with words in question • … • Evaluation of ESBI interface • Comparison of additional techniques for identifying terms • Use of different texts