Efficient Information Extraction Techniques for Semantic Retrieval

Robust Semantics, Information Extraction, and Information Retrieval CS 4705

Problems with Syntax-Driven Semantics
• Syntactic structures often don’t fit semantic structures very well
• Important semantic elements are often distributed very differently in trees for sentences that mean ‘the same’
  I like soup.
  Soup is what I like.
• Parse trees contain many structural elements not clearly important to making semantic distinctions
• Syntax-driven semantic representations are sometimes pretty verbose
  V --> serves

Alternatives?
• Semantic Grammars
• Information Extraction Techniques
• Information Retrieval --> Information Extraction

Semantic Grammars
• Alternative to modifying syntactic grammars to deal with semantics too
• Define grammars specifically in terms of the semantic information we want to extract
• Domain specific: rules correspond directly to entities and activities in the domain
  I want to go from Boston to Baltimore on Thursday, September 24th
• Greeting --> {Hello|Hi|Um…}
• TripRequest --> Need-spec travel-verb from City to City on Date
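
The TripRequest rule above can be approximated with a regular expression whose named groups act as the semantic slots. This is only a minimal sketch: the vocabularies for Need-spec, travel-verb, City, and Date are hypothetical stand-ins, and `parse_trip_request` is an illustrative name, not part of any real system.

```python
import re

# Hypothetical vocabularies; a real domain grammar would be far larger.
CITY = r"(?:Boston|Baltimore|New York)"
NEED_SPEC = r"(?:I want to|I need to|I would like to)"
TRAVEL_VERB = r"(?:go|fly|travel)"
DATE = (
    r"(?:(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day"
    r"(?:, \w+ \d{1,2}(?:st|nd|rd|th)?)?)"
)

# TripRequest --> Need-spec travel-verb from City to City on Date
TRIP_REQUEST = re.compile(
    rf"{NEED_SPEC} {TRAVEL_VERB} from (?P<from_city>{CITY}) "
    rf"to (?P<to_city>{CITY}) on (?P<date>{DATE})",
    re.IGNORECASE,
)


def parse_trip_request(utterance):
    """Return the filled slots as a dict, or None if the rule does not match."""
    match = TRIP_REQUEST.search(utterance)
    return match.groupdict() if match else None


slots = parse_trip_request(
    "I want to go from Boston to Baltimore on Thursday, September 24th"
)
outside_grammar = parse_trip_request("I want to leave from my house.")
```

Because the rule is written directly in terms of domain entities, the match itself is the semantic analysis. The second call returns `None`, since "my house" is not in the City vocabulary: input outside the grammar simply fails.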

Predicting User Input
• Semantic grammars rely upon knowledge of the task and (sometimes) constraints on what the user can do, when
• This allows them to handle very sophisticated phenomena
  I want to go to Boston on Thursday.
  I want to leave from there on Friday for Baltimore.
  TripRequest --> Need-spec travel-verb from City on Date for City
  Dialogue postulate maps filler for ‘from-city’ to pre-specified from-city

Drawbacks of Semantic Grammars
• Lack of generality
  • A new one for each application
  • Large cost in development time
• Can be very large, depending on how much coverage you want
• If users go outside the grammar, things may break disastrously
  I want to leave from my house.
  I want to talk to someone human.

Information Extraction
• Another ‘robust’ alternative
• Idea is to ‘extract’ particular types of information from arbitrary text or transcribed speech
• Examples:
  • Named entities: people, places, organizations, times, dates
  • Telephone numbers
  <Organization>MIPS</Organization> Vice President <Person>John Hime</Person>
• Domains: medical texts, broadcast news, voicemail, ...

Appropriate where Semantic Grammars and Syntactic Parsers are Not
• Appropriate where information needs are very specific
  • Question answering systems, gisting of news or mail…
  • Job ads, financial information, terrorist attacks
• Input too complex and far-ranging to build semantic grammars
• But full-blown syntactic parsers are impractical
  • Too much ambiguity for arbitrary text
  • 50 parses or none at all
  • Too slow for real-time applications

Information Extraction Techniques
• Often use a set of simple templates or frames with slots to be filled in from input text
• Ignore everything else
  • My number is 212-555-1212.
  • The inventor of the wiggleswort was Capt. John T. Hart.
  • The king died in March of 1932.
• Context (neighboring words, capitalization, punctuation) provides cues to help fill in the appropriate slots
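
A template with slots can be sketched as a dictionary of patterns, one per slot, each relying on local cues such as digit groups, titles, and capitalization. The slot names, the patterns, and `fill_template` below are illustrative assumptions, not a method given in the slides.

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)

# Hypothetical slot patterns: each uses surface context to find a filler.
SLOT_PATTERNS = {
    # Three digit groups separated by hyphens.
    "phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    # A title followed by capitalized names or single-letter initials.
    "person": re.compile(
        r"\b(?:Capt|Dr|Mr|Ms|Mrs)\.(?: [A-Z]\.| [A-Z][a-z]+)+"
    ),
    # A month name, optional "of", and a four-digit year.
    "date": re.compile(rf"\b(?:{MONTHS})(?: of)? \d{{4}}\b"),
}


def fill_template(text):
    """Fill each slot with its first match in text; everything else is ignored."""
    template = {}
    for slot, pattern in SLOT_PATTERNS.items():
        match = pattern.search(text)
        template[slot] = match.group(0) if match else None
    return template


phone_frame = fill_template("My number is 212-555-1212.")
person_frame = fill_template(
    "The inventor of the wiggleswort was Capt. John T. Hart."
)
date_frame = fill_template("The king died in March of 1932.")
```

Slots with no matching cue stay `None`, which mirrors the "ignore everything else" strategy: no full parse of the sentence is attempted.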

The IE Process
• Given a corpus and a target set of items to be extracted:
  • Clean up the corpus
  • Tokenize it
  • Do some hand labeling of target items
  • Extract some simple features
    • POS tags
    • Phrase chunks …
  • Do some machine learning to associate features with target items, or derive this association by intuition
  • Use e.g. FSTs, simple or cascaded, to iteratively annotate the input, eventually identifying the slot fillers

Some examples
• Semantic grammars
• Information extraction

Information Retrieval
• How related to NLP?
  • Operates on language (speech or text)
  • Does it use linguistic information?
    • Stemming
    • Bag-of-words approach
  • Does it make use of document formatting?
    • Headlines, punctuation, captions
• Collection: a set of documents
• Term: a word or phrase
• Query: a set of terms

But…what is a term?
• Stop list
• Stemming
• Homonymy, polysemy, synonymy

Vector Space Model
• Simple versions represent documents and queries as feature vectors, one binary feature for each term in the collection
• Is t in this document or query or not?
  D = (t1,t2,…,tn)
  Q = (t1,t2,…,tn)
• Similarity metric: how many terms does a query share with each candidate document?
• Weighted terms: term-by-document matrix
  D = (wt1,wt2,…,wtn)
  Q = (wt1,wt2,…,wtn)
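
The binary version of the model can be sketched in a few lines. The whitespace tokenizer, the toy collection, and the function names are assumptions made for illustration only.

```python
def binary_vector(text, vocabulary):
    """One binary feature per vocabulary term: is t in this text or not?"""
    terms = set(text.lower().split())
    return [1 if t in terms else 0 for t in vocabulary]


def shared_terms(d, q):
    """Dot product of two binary vectors, i.e. the number of shared terms."""
    return sum(dw * qw for dw, qw in zip(d, q))


# Toy collection (hypothetical), tokenized by whitespace.
collection = ["soup is what i like", "i like boston"]
vocabulary = sorted({t for doc in collection for t in doc.split()})

doc_vectors = [binary_vector(doc, vocabulary) for doc in collection]
query_vector = binary_vector("like soup", vocabulary)
scores = [shared_terms(d, query_vector) for d in doc_vectors]
```

Here the query shares two terms with the first document and one with the second, so the first ranks higher. Binary features ignore how often a term occurs, which is what the weighted version addresses.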

How do we compare the vectors?
• Normalize each term weight by the number of terms in the document: how important is each t in D?
• Compute dot product between vectors to see how similar they are
• Cosine of angle: 1 = identity; 0 = no common terms
• How do we get the weights?
  • Term frequency (tf): how often does t occur in D?
  • Inverse document frequency (idf): # docs / # docs term t occurs in
  • tf·idf weighting: weight of term i for doc j is the product of the frequency of i in j with the log of its idf in the collection
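
A minimal sketch of the weighting and comparison described above, assuming pre-tokenized documents. The weight of term t in document D is taken as (count of t in D / number of terms in D) × log(# docs / # docs containing t); the function names and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """docs: list of token lists. Returns (vocabulary, one weight vector per doc)."""
    n_docs = len(docs)
    vocabulary = sorted({t for doc in docs for t in doc})
    # Document frequency: in how many docs does each term occur?
    doc_freq = Counter(t for doc in docs for t in set(doc))
    log_idf = {t: math.log(n_docs / doc_freq[t]) for t in vocabulary}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        length = len(doc) or 1  # avoid dividing by zero for an empty doc
        vectors.append([counts[t] / length * log_idf[t] for t in vocabulary])
    return vocabulary, vectors


def cosine(u, v):
    """Cosine of the angle between two weight vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


docs = [
    ["soup", "soup", "like"],
    ["like", "boston"],
    ["boston", "baltimore"],
]
vocab, vecs = tf_idf_vectors(docs)
sim_self = cosine(vecs[0], vecs[0])
sim_overlap = cosine(vecs[0], vecs[1])
sim_disjoint = cosine(vecs[0], vecs[2])
```

A document compared with itself scores 1, documents with no common terms score 0, and a term that occurs in every document gets weight 0 because log(1) = 0.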

Evaluating IR Performance
• Precision: # rel docs returned / total # docs returned -- how often are you right when you say this document is relevant?
• Recall: # rel docs returned / # rel docs in collection -- how many of the relevant documents do you find?
• F-measure combines P and R
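
These measures are easy to compute from sets of document IDs. The sketch below uses the standard weighted F-measure, F = (1 + β²)PR / (β²P + R); the `beta` parameter and the example IDs are not from the slides, and β = 1 gives the usual balanced F1.

```python
def precision_recall_f(returned, relevant, beta=1.0):
    """Return (precision, recall, F) for sets of returned and relevant doc IDs."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)  # relevant docs that were returned
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f


# Hypothetical run: 4 docs returned, 3 relevant docs exist, 2 of them found.
p, r, f1 = precision_recall_f(
    returned={"d1", "d2", "d3", "d4"},
    relevant={"d1", "d2", "d5"},
)
```

In this run precision is 2/4 and recall is 2/3, so F1 falls between them at 4/7.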

Improving Queries
• Relevance feedback: users rate retrieved docs
• Query expansion: many techniques
  • e.g. add terms from top N docs retrieved to query
• Term clustering: cluster rows of terms to produce synonyms and add to query

IR Tasks
• Ad hoc retrieval: ‘normal’ IR
• Routing/categorization: assign new doc to one of a predefined set of categories
• Clustering: divide a collection into N clusters
• Segmentation: segment text into coherent chunks
• Summarization: compress a text by extracting summary items
• Question-answering: find a stretch of text containing the answer to a question

Summary
• Many approaches to ‘robust’ semantic analysis
  • Semantic grammars targeting particular domains
    Utterance --> Yes/No Reply
    Yes/No Reply --> Yes-Reply | No-Reply
    Yes-Reply --> {yes, yeah, right, ok, “you bet”, …}
  • Information extraction techniques targeting specific tasks
    • Extracting information about terrorist events from news
  • Information retrieval techniques --> more like NLP
