Representing the Meaning of Documents LBSC 796/CMSC 838o Session 2, February 2, 2004 Philip Resnik
Agenda • The structure of interactive IR systems • Character sets • Terms as units of meaning • Strings and segments • Tokens and words • Phrases and entities • Senses and concepts • A few words about the course
What Do We Mean by “Information?” • How is it different from “data”? • Information is data in context • Databases contain data and produce information • IR systems contain and provide information • How is it different from “knowledge”? • Knowledge is a basis for making decisions • Many “knowledge bases” contain decision rules
What Do We Mean by “Retrieval?” • Find something that you want • The information need may or may not be explicit • Known item search • Find the class home page • Answer seeking • Is Lexington or Louisville the capital of Kentucky? • Directed exploration • Who makes videoconferencing systems?
Global Internet User Population • [Charts for 2000 and 2005; labeled language segments include English and Chinese] • Source: Global Reach
Supporting the Search Process • [Diagram: Source Selection → Query Formulation → Query → Search (IR System) → Ranked List → Selection → Document → Examination → Document → Delivery • Feedback loops: Query Reformulation and Relevance Feedback; Source Reselection • Other labels: Predict, Nominate, Choose]
Supporting the Search Process • [Diagram: Source Selection → Query Formulation → Query → Search (IR System) → Ranked List → Selection → Document → Examination → Delivery • System side: Acquisition → Collection → Indexing → Index, which feeds Search]
Representing Electronic Texts • A character set specifies semantic units • Characters are the smallest units of meaning • Abstract entities, separate from their representation • A font specifies the printed representation • What each character will look like on the page • Different characters might be depicted identically • An encoding is the electronic representation • What each character will look like in a file • One character may have several representations • An input method is a keyboard representation
Agenda • The structure of interactive IR systems • Character sets • Terms as units of meaning • Strings and segments • Tokens and words • Phrases and entities • Senses and concepts • A few words about the course
The character ‘A’ • ASCII encoding: 7 bits used per character • 0 1 0 0 0 0 0 1 = 65 DEC (decimal), shown here in an 8-bit byte with a leading 0 • Number of representable characters: 2^7 = 128 distinct characters, including 0 (NUL) • Some character codes are used for non-visible characters, e.g., 7 = control-G = BEL
ASCII • Widely used in the U.S. • American Standard Code for Information Interchange • ANSI X3.4-1968
| 0 NUL | 32 SPACE | 64 @ | 96 ` |
| 1 SOH | 33 ! | 65 A | 97 a |
| 2 STX | 34 " | 66 B | 98 b |
| 3 ETX | 35 # | 67 C | 99 c |
| 4 EOT | 36 $ | 68 D | 100 d |
| 5 ENQ | 37 % | 69 E | 101 e |
| 6 ACK | 38 & | 70 F | 102 f |
| 7 BEL | 39 ' | 71 G | 103 g |
| 8 BS | 40 ( | 72 H | 104 h |
| 9 HT | 41 ) | 73 I | 105 i |
| 10 LF | 42 * | 74 J | 106 j |
| 11 VT | 43 + | 75 K | 107 k |
| 12 FF | 44 , | 76 L | 108 l |
| 13 CR | 45 - | 77 M | 109 m |
| 14 SO | 46 . | 78 N | 110 n |
| 15 SI | 47 / | 79 O | 111 o |
| 16 DLE | 48 0 | 80 P | 112 p |
| 17 DC1 | 49 1 | 81 Q | 113 q |
| 18 DC2 | 50 2 | 82 R | 114 r |
| 19 DC3 | 51 3 | 83 S | 115 s |
| 20 DC4 | 52 4 | 84 T | 116 t |
| 21 NAK | 53 5 | 85 U | 117 u |
| 22 SYN | 54 6 | 86 V | 118 v |
| 23 ETB | 55 7 | 87 W | 119 w |
| 24 CAN | 56 8 | 88 X | 120 x |
| 25 EM | 57 9 | 89 Y | 121 y |
| 26 SUB | 58 : | 90 Z | 122 z |
| 27 ESC | 59 ; | 91 [ | 123 { |
| 28 FS | 60 < | 92 \ | 124 | |
| 29 GS | 61 = | 93 ] | 125 } |
| 30 RS | 62 > | 94 ^ | 126 ~ |
| 31 US | 63 ? | 95 _ | 127 DEL |
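The code points in the ASCII table can be checked with Python's built-in `ord`, `chr`, and `format`. This is only an illustrative sketch and was not part of the original slides.

```python
# Look up ASCII code points with Python built-ins.
code = ord("A")               # character -> integer code point
bits = format(code, "07b")    # the same value as a 7-bit binary string
print(code, bits)             # 65 1000001
print(chr(63), chr(64))       # ? @
print(repr(chr(7)))           # '\x07' (BEL, a non-printing control character)

# All 2**7 = 128 code points decode as ASCII.
ascii_chars = bytes(range(128)).decode("ascii")
print(len(ascii_chars))       # 128
```

Trying `bytes([128]).decode("ascii")` raises `UnicodeDecodeError`, which is one way to see where 7-bit ASCII stops.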
Geeky Joke for the Day • Why do computer geeks confuse Halloween and Christmas? • Because 31 OCT = 25 DEC! • 031 OCT = 0*8^2 + 3*8^1 + 1*8^0 = 25 = 0*10^2 + 2*10^1 + 5*10^0 = 025 DEC
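The arithmetic behind the joke can be verified with Python's `int`, which parses a digit string in a given base. A small sketch:

```python
# Positional notation: the same value written in two different bases.
octal_31 = int("31", 8)       # 3*8**1 + 1*8**0
decimal_25 = int("25", 10)    # 2*10**1 + 5*10**0
print(octal_31, decimal_25, octal_31 == decimal_25)  # 25 25 True
print(oct(25))                # 0o31
```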
The Latin-1 Character Set • ISO 8859-1: 8-bit characters for Western Europe • French, Spanish, Catalan, Galician, Basque, Portuguese, Italian, Albanian, Afrikaans, Dutch, German, Danish, Swedish, Norwegian, Finnish, Faroese, Icelandic, Irish, Scottish, and English • [Code chart legend: printable characters, 7-bit ASCII; additional defined characters, ISO 8859-1]
Other ISO-8859 Character Sets • [Code charts for ISO 8859-2, -3, -4, -5, -6, -7, -8, and -9]
East Asian Character Sets • More than 256 characters are needed • Two-byte encoding schemes (e.g., EUC) are used • Several countries have unique character sets • GB in People’s Republic of China, BIG5 in Taiwan, JIS in Japan, KS in Korea, TCVN in Vietnam • Many characters appear in several languages • Research Libraries Group developed EACC • Unified “CJK” character set for USMARC records
Unicode • Goal is to unify the world’s character sets • ISO Standard 10646 • Character set and encoding scheme separated • Full “code space” is used by character codes • Extends Latin-1 • UTF-7 encoding will pass through email • Originally designed for 64 printable ASCII characters • UTF-8 encoding works with disk file systems
Limitations of Unicode • Produces much larger files than Latin-1 • Fonts are hard to obtain for many characters • Some characters have multiple representations • e.g., accents can be part of a character or separate • Some characters look identical when printed • But they come from unrelated languages • The sort order may not be appropriate
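Several of these points can be observed directly in Python, whose `str` type holds Unicode characters while `bytes` holds a particular encoding. The sample string and the choice of encodings below are assumptions made for illustration.

```python
import unicodedata

text = "caf\u00e9"                 # "café", written with an escape for the accented letter
latin1 = text.encode("latin-1")    # one byte per character
utf8 = text.encode("utf-8")        # the accented letter takes two bytes
utf16 = text.encode("utf-16-le")   # two bytes per character for this string
print(len(latin1), len(utf8), len(utf16))   # 4 5 8

# Multiple representations: precomposed U+00E9 versus "e" plus combining accent U+0301.
composed = "\u00e9"
decomposed = "e\u0301"
print(composed == decomposed)                                 # False
print(unicodedata.normalize("NFC", decomposed) == composed)   # True

# Look-alikes: Latin "A" and Greek capital alpha are different characters.
print("A" == "\u0391", unicodedata.name("\u0391"))
```

For retrieval, this suggests normalizing both documents and queries to the same form (for example NFC) before indexing, so that the two spellings of the accented letter match.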
Agenda • The structure of interactive IR systems • Character sets • Terms as units of meaning • Strings and segments • Tokens and words • Phrases and entities • Senses and concepts • A few words about the course
Strings and Segments • Retrieval is (often) a search for concepts • But what we index are character strings • What strings best represent concepts? • In English, words are often a good choice • But well chosen phrases can be even better • In German, compounds may need to be split • Otherwise queries using constituent words would fail • In Chinese, word boundaries are not marked • Thissegmentationproblemissimilartothatofspeech • This segmentation problem is similar to that of speech
Longest Substring Segmentation • A greedy segmentation algorithm • Based solely on lexical information • Start with a list of every possible term • Dictionaries are a handy source for term lists • For each unsegmented string • Remove the longest single substring in the list • Repeat until no substrings are found in the list • Can be extended to explore alternatives
Longest Substring Example • Possible German compound term: • washington • List of German words: • ach, hin, hing, sei, ton, was, wasch • Longest substring segmentation • was-hing-ton • A language model might see this as bad • Roughly translates to “What tone is attached?”
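The greedy procedure can be sketched in a few lines of Python. The function name, the alphabetical tie-breaking rule, and the handling of leftover characters are assumptions of this sketch, not something the slides specify.

```python
def longest_substring_segment(text, lexicon):
    """Greedily split `text` around the longest lexicon term it contains."""
    # Try terms from longest to shortest; ties are broken alphabetically.
    for term in sorted(lexicon, key=lambda t: (-len(t), t)):
        pos = text.find(term)
        if term and pos != -1:
            left, right = text[:pos], text[pos + len(term):]
            # Remove the match, then segment what remains on each side.
            return (longest_substring_segment(left, lexicon)
                    + [term]
                    + longest_substring_segment(right, lexicon))
    # No term matches: keep any leftover characters as one unsegmented piece.
    return [text] if text else []


german_words = ["ach", "hin", "hing", "sei", "ton", "was", "wasch"]
segments = longest_substring_segment("washington", german_words)
print("-".join(segments))   # was-hing-ton
```

Note that "wasch" is the longest word in the list but does not occur in "washington", so "hing" is removed first, leaving "was" and "ton".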
Probabilistic Segmentation • For an input word c1 c2 c3 … cn • Try all possible partitions into w1 w2 w3 … • c1c2 c3 … cn • c1 c2c3 … cn • c1 c2 c3 … cn, etc. • Choose the highest probability partition • E.g., compute Pr(w1 w2 w3) using a language model • Challenges: search, probability estimation
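One way to address the search challenge is dynamic programming over prefixes, which avoids enumerating every partition. The sketch below assumes a unigram language model, so Pr(w1 w2 … wk) is the product of the individual word probabilities; the probabilities themselves are hypothetical numbers chosen for illustration.

```python
import math


def best_segmentation(text, word_prob):
    """Return (log probability, words) for the most probable partition of `text`.

    Assumes a unigram model: Pr(w1 ... wk) = Pr(w1) * ... * Pr(wk).
    Returns (-inf, None) if no partition uses only known words.
    """
    # best[i] holds the best (log probability, words) found for the prefix text[:i].
    best = [(0.0, [])] + [(-math.inf, None)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(end):
            word = text[start:end]
            prev_logp, prev_words = best[start]
            if word in word_prob and prev_words is not None:
                logp = prev_logp + math.log(word_prob[word])
                if logp > best[end][0]:
                    best[end] = (logp, prev_words + [word])
    return best[-1]


# Hypothetical unigram probabilities.
probs = {"now": 0.02, "here": 0.01, "no": 0.03, "where": 0.005, "nowhere": 0.0001}
logp, words = best_segmentation("nowhere", probs)
print(words, round(math.exp(logp), 6))   # ['now', 'here'] 0.0002
```

With these numbers, "now here" (0.0002) beats "no where" (0.00015) and the unsplit "nowhere" (0.0001). Estimating such probabilities reliably is the second challenge named on the slide and is not addressed here.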
Non-Segmentation: N-gram Indexing • Consider a Chinese document c1 c2 c3 … cn • Don’t segment (you could be wrong!) • Instead, treat every character bigram as a term • c1c2, c2c3, c3c4, …, cn-1cn • Break up queries the same way
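A minimal sketch of bigram indexing follows, using a hypothetical two-document collection and a plain dictionary as the inverted index. The document texts and identifiers are invented for the example.

```python
from collections import defaultdict


def char_bigrams(text):
    """Return the overlapping character bigrams of `text`, in order."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


# Hypothetical toy collection: document id -> unsegmented text.
docs = {"d1": "信息检索系统", "d2": "检索信息"}

# Inverted index: bigram -> set of documents containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for gram in char_bigrams(text):
        index[gram].add(doc_id)

# Break the query up the same way and look up each bigram.
query = "信息检索"
hits = {gram: sorted(index.get(gram, ())) for gram in char_bigrams(query)}
print(hits)
```

Both documents match the bigrams 信息 and 检索, but only d1 contains the bigram 息检 that spans the two words, which gives a ranking function some evidence of adjacency.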
Tokens and Words • What is a word? • Kindergarten • Aux armes! • Doug’s running • Realistic review resubmit • Morphology: • How morphemes combine to make words • Morphemes are units of meaning • Remember antidisestablishmentarianism? • Anti (disestablishmentarian) ism
Morphemes and Roots • Inflectional morphology • Preserves part of speech • Destructions = Destruction+PLURAL • Destroyed = Destroy+PAST • Derivational morphology • Relates parts of speech • Destructor = AGENTIVE(destroy) • Can help IR performance, but expensive • Getting derivational morphology right is hard • {peninsula,insulate}:insula (Lat. “island”) ???
Stemming • Stem: in IR, a word equivalence class that preserves the main concept. • Often obtained by affix-stripping (Porter, 1980) • {destroy, destroyed, destruction}: destr • Inexpensive to compute • Usually helps IR performance • Can make mistakes! (over-/understemming) • {centennial,century,center}: cent • {acquire,acquiring,acquired}: acquir {acquisition}: acquis
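Affix stripping can be illustrated with a toy suffix stripper. This is not Porter's algorithm, which applies ordered rule sets with conditions on the remaining stem; the suffix list and the minimum stem length below are hypothetical choices that happen to reproduce the "acquir"/"acquis" example.

```python
from collections import defaultdict

# Hypothetical suffix list, checked in order (longest first).
SUFFIXES = ("ition", "ing", "ed", "es", "s", "e")


def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least `min_stem` characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word


# Group words into equivalence classes by stem.
classes = defaultdict(list)
for w in ["acquire", "acquiring", "acquired", "acquisition"]:
    classes[toy_stem(w)].append(w)
print(dict(classes))
# {'acquir': ['acquire', 'acquiring', 'acquired'], 'acquis': ['acquisition']}
```

The separate class for "acquisition" is an instance of understemming: related words end up under different stems.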
Roots and Stems: beyond English • Arabic: alselam • Stem: selam • Root: SLM (peace) • Semantic families: altaliban • Stem: taliban (student) • Root: TLB (question) • Current research on best level of analysis
Phrases and Entities • Multi-word combinations identify entities • The president, Dubya, George W. Bush • Can also identify relationships of interest • Derek Jones, CEO of SadAndBankrupt.com,… • Entity roles, filling slots in templates
Named Entity Identification • Major categories of named entities • Influenced by text genres of interest… mostly news • Person, organization, location, date, money, … • Decent algorithms based on finite automata • Best algorithms based on supervised learning • Annotate a corpus identifying entities and types • Train a probabilistic model • Apply the model to new text
Example: Predictive Annotation for Question Answering • Annotated text: “In reality, at the time of Edison’s [PERSON] 1879 [TIME] patent, the light bulb had been in existence for some five decades ….” • Who patented the light bulb? → patent light bulb PERSON • When was the light bulb patented? → patent light bulb TIME • In what year was the light bulb patented? → ??? • What did Thomas Edison patent?
General Phrase Identification • Two types of phrases • Compositional: meaning derived from parts • Noncompositional: idiomatic expressions • e.g., “kick the bucket” or “buy the farm” • Three sources of evidence • Dictionary lookup • Parsing • Co-occurrence
Known Phrases • Same idea as longest substring match • But look for word (not character) sequences • Compile a term list that includes phrases • Technical terminology can be very helpful • Index any phrase that occurs in the list • Most effective in a limited domain • Otherwise hard to capture most useful phrases
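A left-to-right longest-match over word sequences is one simple variant of this idea. The function name and the phrase list are hypothetical, and the sketch emits either a phrase or a single word at each position rather than both.

```python
def index_known_phrases(tokens, phrase_list):
    """Return index terms, preferring the longest listed phrase at each position."""
    phrases = {tuple(p.split()) for p in phrase_list}
    max_len = max((len(p) for p in phrases), default=1)
    terms, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first; a single word always succeeds.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + n])
            if n == 1 or candidate in phrases:
                terms.append(" ".join(candidate))
                i += n
                break
    return terms


# Hypothetical term list for a limited technical domain.
term_list = ["latent semantic indexing", "information retrieval"]
tokens = "latent semantic indexing improves information retrieval".split()
terms = index_known_phrases(tokens, term_list)
print(terms)   # ['latent semantic indexing', 'improves', 'information retrieval']
```

A system that indexes phrases and their constituent words together would add the individual tokens of each matched phrase as well.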
Syntactic Phrases • Automatically construct sentence diagrams • Fairly good parsers are available • Index the noun phrases • Assumes that queries will focus on objects • [Parse tree of “The quick brown fox jumped over the lazy dog’s back”: Sentence = Noun Phrase (Det Adj Adj Noun) + Verb + Prepositional Phrase (Prep + Noun Phrase: Det Adj Adj Noun)]
Syntactic Variations • The “paraphrase problem” • Prof. Douglas Oard studies information access patterns. • Doug studies patterns of user access to different kinds of information. • Transformational variants (Jacquemin) • Coordinations • lung and breast cancer → lung cancer • Substitutions • inflammatory sinonasal disease → inflammatory disease • Permutations • addition of calcium → calcium addition
Phrase Discovery: Collocations • Compute observed occurrence probability • For each single word and each word n-gram • “buy” 10 times in 1000 words yields 0.01 • “the” 100 times in 1000 words yields 0.10 • “farm” 5 times in 1000 words yields 0.005 • “buy the farm” 4 times in 1000 words yields 0.004 • Compute n-gram probability if truly independent • 0.01*0.10*0.005=0.000005 • Compare with observed probability • Record phrases that occur more often than expected
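The comparison can be worked through with the counts from the slide. The function name and the use of a simple observed-to-expected ratio as the score are assumptions of this sketch; other association measures are possible.

```python
def collocation_ratio(ngram_count, word_counts, total_words):
    """Compare an n-gram's observed probability with its independence estimate."""
    observed = ngram_count / total_words
    expected = 1.0
    for count in word_counts:
        expected *= count / total_words   # product of single-word probabilities
    return observed, expected, observed / expected


# "buy the farm": 4 occurrences; "buy" 10, "the" 100, "farm" 5; 1000 words in total.
observed, expected, ratio = collocation_ratio(4, [10, 100, 5], 1000)
print(f"{observed:.3f} {expected:.6f} {ratio:.0f}")   # 0.004 0.000005 800
```

The phrase occurs about 800 times more often than independence would predict, so it would be recorded as a collocation.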
Phrase Indexing Lessons • Poorly chosen phrases hurt effectiveness • And some techniques can be slow (e.g., parsing) • Better to index phrases and words • Want to find constituents of compositional phrases • Better weighting schemes → less benefit • Negligible improvement in some TREC systems • Very helpful for cross-language retrieval • Noncompositional translation, reduced ambiguity
Cross-Language IR and Phrases • Poser: quite ambiguous (Langenscheidt) • Place, put (a question, a motion) • Lay down (a principle) • Hang (curtains) • Set (a problem) • Poser une question: meaning is clear! • Ask a question • In this case, better to use the phrase • But is this really about phrases?
Senses and Concepts • What is a word sense? • Entry in a dictionary or thesaurus • Position or cluster in a semantic space • What is word sense disambiguation? • Identifying intended sense(s) from context • Goal for IR • Match on the intended concept, not just the words
Problems With Word Matching • Word matching suffers from two problems • Synonymy: paper vs. article • Homonymy: bank (river) vs. bank (financial) • Disambiguation in IR: seek to resolve homonymy • Index word senses rather than words • Synonymy usually addressed by • Thesaurus-based query expansion • Latent semantic indexing
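Thesaurus-based query expansion can be sketched as a dictionary lookup. The thesaurus contents and the function name are hypothetical; a real system would also weight the added terms and would need to avoid expanding with synonyms of the wrong sense.

```python
def expand_query(terms, thesaurus):
    """Add listed synonyms after each query term, skipping duplicates."""
    expanded = []
    for term in terms:
        for candidate in [term] + thesaurus.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


# Hypothetical one-entry thesaurus.
thesaurus = {"paper": ["article"]}
expanded = expand_query(["paper", "retrieval"], thesaurus)
print(expanded)   # ['paper', 'article', 'retrieval']
```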
Word Sense Disambiguation • Context provides clues to word meaning • “The doctor removed the appendix.” • For each occurrence, note surrounding words • Typically +/- 5 non-stopwords • Group similar contexts into clusters • Based on overlaps in the words that they contain • Separate clusters represent different senses
Disambiguation Example • Consider four example sentences • The doctor removed the appendix • The appendix was incomprehensible • The doctor examined the appendix • The appendix was removed • What clusters can you find? • Can you find enough word senses this way? • Might you find too many word senses?
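One way to explore the exercise is to cluster the four contexts by word overlap. The sketch below assumes a tiny stopword list and single-link clustering, where any shared context word joins two occurrences; both are choices made for illustration. The sentences are short enough that the whole sentence serves as the context window.

```python
STOPWORDS = {"the", "was"}   # hypothetical minimal stopword list


def context(sentence, target="appendix"):
    """Return the non-stopword context of `target` in a short sentence."""
    words = sentence.lower().rstrip(".").split()
    return {w for w in words if w != target and w not in STOPWORDS}


def cluster_by_overlap(contexts):
    """Single-link clustering: contexts that share any word end up together."""
    clusters = []   # list of (sentence indices, union of context words)
    for i, ctx in enumerate(contexts):
        merged_ids, merged_words = {i}, set(ctx)
        remaining = []
        for ids, words in clusters:
            if words & ctx:
                merged_ids |= ids
                merged_words |= words
            else:
                remaining.append((ids, words))
        clusters = remaining + [(merged_ids, merged_words)]
    return sorted(sorted(ids) for ids, _ in clusters)


sentences = [
    "The doctor removed the appendix",
    "The appendix was incomprehensible",
    "The doctor examined the appendix",
    "The appendix was removed",
]
clusters = cluster_by_overlap([context(s) for s in sentences])
print(clusters)   # [[0, 2, 3], [1]]
```

Sentences 1, 3, and 4 fall into one cluster and sentence 2 stands alone. The fourth sentence joins the first cluster only through the word "removed", which shows how fragile overlap-based grouping can be.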
Why Disambiguation Hurts • Bag-of-words techniques already disambiguate • When more words are present, documents rank higher • So a context for each term is established in the query • Formal disambiguation tries to improve precision • But incorrect sense assignments would hurt recall • Hard to distinguish homonymy from fine-grained polysemy • Average precision balances recall and precision • But the possible precision gains are small • And current techniques substantially hurt recall
Where Could Disambiguation Help? • Categorization of whole documents • Identifying location(s) in a topic hierarchy • Visualization • People are good at seeing signal amidst noise • Probabilistic models • Combining different sources of evidence • (Requires n-best rather than 1-best responses)
Summary • The goal is to index the right meaning units • Start by finding fundamental features • Characters or shape codes (for OCR) etc. • Combine them into easily recognized units • Words where possible, character n-grams otherwise • Consider alternatives to splitting or forming phrases • But stemming is generally a good idea • Usually best to match those units directly • Disambiguation strategies hurt more than they help
Agenda • The structure of interactive IR systems • Character sets • Terms as units of meaning • Strings and segments • Tokens and words • Phrases and entities • Senses and concepts • A few words about the course
Course Goals • Appreciate IR system capabilities and limitations • Understand IR system design & implementation • For a broad range of applications and media • Evaluate IR system performance • Identify current IR research problems
Course Design • Text/readings provide background and detail • At least one recommended reading is required • Class provides organization and direction • We will not cover every important detail • Assignments and project provide experience • The TA can help CLIS students with the project • Final exam helps focus your effort
Grading • Assignments (15%) • Mastery of concepts and experience using tools • 796: “homework,” 838o: “programming” • Term project (796: 50%, 838o: 30%) • Options are described on course Web page • Final exam (796: 35%, 838o: 55%) • Two different in-class exams
Handy Things to Know • Classes will be videotaped • Available in the CLIS library if you miss class • Office hours are by appointment • Send an email, or ask after class • Everything is on the Web • At http://www.glue.umd.edu/~oard/teaching.html • Doug is most easily reached by email • oard@umd.edu
Some Things to Do This Week • At least skim the readings before class • Don’t fall behind! • Look at assignment 1 • Due in 2 weeks! • Explore the Web site • Start thinking about the term project