CC 437: Advanced Natural Language Engineering Week 6, Class Assignment 2 ANLE
Goal of this class
• Go over the assignment in more detail
• Software (SW) that may be used
The system you have to build
• Input: A string of words (possibly a complete sentence)
  LIST THE ESTATE AGENTS IN STRATFORD, LONDON.
  I AM LOOKING FOR A CAR MECHANIC IN WIVENHOE
• Minimum Output: a query for a Web search engine
  ("ESTATE AGENT" OR PROPERTY OR "REAL ESTATE") AND STRATFORD AND LONDON
• Possible extension (10%): Actually access a search engine
  E.g., GOOGLE: http://www.google.com/search?q=stratford+london+%22estate+agent%22+OR+%22real+estate%22+OR+property
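The step from query terms to the example Google URL can be sketched in a few lines of Python. This is only an illustration, not part of the assignment software: the function names `boolean_query` and `search_url` are made up here, and the sketch assumes, as the example URL does, that the search engine treats space-separated terms as an implicit AND.

```python
from urllib.parse import urlencode

def boolean_query(synonym_groups):
    """Join each group of synonyms with OR; quote multi-word terms.

    Groups are separated by spaces, which is assumed to mean AND
    for the target search engine (as in the slide's example URL).
    """
    parts = []
    for group in synonym_groups:
        quoted = ['"%s"' % term if " " in term else term for term in group]
        parts.append(" OR ".join(quoted))
    return " ".join(parts)

def search_url(query, base="http://www.google.com/search"):
    """Build the concrete URL; urlencode turns spaces into + and quotes into %22."""
    return base + "?" + urlencode({"q": query})

# Hypothetical input: one list of alternatives per search term.
groups = [["stratford"], ["london"], ["estate agent", "real estate", "property"]]
query = boolean_query(groups)
url = search_url(query)
print(url)
```

Fetching the page itself (the 10% extension) would be a separate step, for example with `urllib.request`, and is left out here.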
Reminder: the basic pipeline in IE systems
PREPROCESSING -> LEXICAL PROCESSING -> SYNTACTIC PROCESSING -> SEMANTIC PROCESSING -> DISCOURSE PROCESSING
Processing steps
Example input: List the estate agents in Stratford, London.
• PREPROCESSING: TOKENIZATION
• LEXICAL PROCESSING: POS TAGGING
• SYNTACTIC PROCESSING: TERM IDENTIFICATION, STOP WORDS
• SEMANTIC PROCESSING: SYNONYMS
• WEB ACCESS
Processing Steps, II
• Preprocessing:
  • Possibly: eliminate stop words
    LIST THE ESTATE AGENTS IN STRATFORD LONDON
  • Possibly: XML markup
Preprocessing, I: tokenizing
List the estate agents in Stratford, London
PARAGRAPH MARKUP; TOKENIZER
<W C='w'>List</W> <W C='w'>the</W> <W C='w'>estate</W> <W C='w'>agents</W> <W C='w'>in</W> <W C='w'>Stratford</W> <W C='w'>,</W> <W C='w'>London</W>
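A regular-expression tokenizer of the kind suggested for this step might look like the sketch below. It is written in Python rather than Perl purely for illustration; the pattern and the helper names `tokenize` and `to_markup` are assumptions, not the course's tokenizer.

```python
import re
from xml.sax.saxutils import escape

# A word is a run of letters/digits; any other non-space character
# (comma, full stop, ...) becomes a token of its own.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Split a sentence into word and punctuation tokens."""
    return TOKEN_RE.findall(text)

def to_markup(tokens, tag="w"):
    """Wrap each token in a <W> element, as on the slide."""
    return " ".join("<W C='%s'>%s</W>" % (tag, escape(tok)) for tok in tokens)

tokens = tokenize("List the estate agents in Stratford, London")
markup = to_markup(tokens)
print(markup)
```

Escaping the token text keeps the markup well-formed if the input contains characters such as `&` or `<`.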
Processing Steps, II
• LEXICAL PROCESSING:
  • POS TAGGING
    THE -> THE/DT; ESTATE -> ESTATE/NN
  • STEMMING / LEMMATIZATION
    AGENTS -> AGENT (or even: AGENT + N + PL)
Lexical Processing, I: POS tagging
<W C='VB'>List</W> <W C='DT'>the</W> <W C='NN'>estate</W> <W C='NNS'>agents</W> <W C='IN'>in</W> <W C='NNP'>Stratford</W> <W C='CM'>,</W> <W C='NNP'>London</W>
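To make the tagging step concrete, here is a toy stand-in for a real tagger. It is not the Brill tagger or the Xerox tools: it only looks words up in a tiny hand-written lexicon (an assumption made for this example) and falls back on two crude guesses. A real system would call one of the taggers listed under the available tools.

```python
# Tiny hand-made lexicon, just enough for the example sentence (hypothetical).
LEXICON = {"list": "VB", "the": "DT", "in": "IN", ",": "CM", "estate": "NN"}

def tag_token(token):
    """Return a POS tag: lexicon lookup first, then crude fallbacks."""
    known = LEXICON.get(token.lower())
    if known:
        return known
    if token[0].isupper():
        return "NNP"   # unknown capitalised word: guess proper noun
    if token.endswith("s"):
        return "NNS"   # unknown word ending in -s: guess plural noun
    return "NN"        # default guess: singular noun

def tag(tokens):
    """Pair every token with its tag."""
    return [(tok, tag_token(tok)) for tok in tokens]

tagged = tag(["List", "the", "estate", "agents", "in", "Stratford", ",", "London"])
print(" ".join("<W C='%s'>%s</W>" % (pos, tok) for tok, pos in tagged))
```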
Lexical Processing, II: lemmatizing / stemming
<W C='VB'>List</W> <W C='DT'>the</W> <W C='NN'>estate</W> <W C='NNS'>agent</W> <W C='IN'>in</W> <W C='NNP'>Stratford</W> <W C='CM'>,</W> <W C='NNP'>London</W>
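The AGENTS -> AGENT step could be approximated with naive suffix stripping, as in the sketch below. This is an assumption-laden illustration: it only handles regular plural nouns and will get irregular or ambiguous forms wrong (for example "buses"). A lemmatizer backed by a lexicon, such as the Xerox lemmatizer or WordNet's morphology, would be the more reliable choice.

```python
def lemmatize(token, pos):
    """Strip a regular plural ending from tokens tagged NNS.

    Tokens with any other tag are returned unchanged.
    """
    if pos != "NNS":
        return token
    lower = token.lower()
    if lower.endswith("ies") and len(lower) > 4:
        return token[:-3] + "y"          # agencies -> agency
    if lower.endswith(("xes", "ches", "shes", "sses")):
        return token[:-2]                # churches -> church
    if lower.endswith("s") and not lower.endswith("ss"):
        return token[:-1]                # agents -> agent
    return token

print(lemmatize("agents", "NNS"))
```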
Processing Steps, II
• SYNTACTIC PROCESSING:
  • Identify terms: "ESTATE AGENT"
  • Remove stopwords (e.g., words tagged as DT, IN, VB, …)
Practical (partial) parsing: identifying search terms, filtering
<SEARCHTERM> <W C='NN'>estate</W> <W C='NN'>agent</W> </SEARCHTERM>
<SEARCHTERM> <W C='NNP'>Stratford</W> </SEARCHTERM>
<BOOL> <W C='CM'>,</W> </BOOL>
<SEARCHTERM> <W C='NNP'>London</W> </SEARCHTERM>
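One simple way to identify search terms from tagged input is sketched below. It assumes a particular rule that the slides do not spell out: adjacent common nouns (NN, NNS) are merged into one multi-word term, each proper noun (NNP) is a term of its own, and everything else (DT, IN, VB, punctuation, ...) is dropped as a stop word. The function name `search_terms` is made up for this example.

```python
NOUN_TAGS = {"NN", "NNS"}

def search_terms(tagged):
    """Group (word, tag) pairs into search terms, discarding stop words."""
    terms, current = [], []
    for word, pos in tagged:
        if pos in NOUN_TAGS:
            current.append(word.lower())   # extend the current noun sequence
            continue
        if current:                        # any other tag ends the sequence
            terms.append(" ".join(current))
            current = []
        if pos == "NNP":
            terms.append(word)             # proper nouns stand alone
        # all remaining tags are treated as stop words and skipped
    if current:
        terms.append(" ".join(current))
    return terms

tagged = [("List", "VB"), ("the", "DT"), ("estate", "NN"), ("agent", "NNS"),
          ("in", "IN"), ("Stratford", "NNP"), (",", "CM"), ("London", "NNP")]
print(search_terms(tagged))
```

The same rule applied to the second example sentence from the assignment would keep "car mechanic" and "Wivenhoe", provided the tagger labels the other words with non-noun tags.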
Processing Steps, II
• SEMANTIC PROCESSING: "ESTATE AGENT" OR PROPERTY
• QUERY FORMATION:
  • Abstract query
  • Concrete query
Semantic processing: finding synonyms (or better keywords); interpreting stop words
<SEARCHTERM> <W C='NN'>estate</W> <W C='NN'>agent</W> </SEARCHTERM>
<BOOL TYPE='OR'></BOOL>
<SEARCHTERM> <W C='NN'>real</W> <W C='NN'>estate</W> </SEARCHTERM>
<BOOL TYPE='AND'></BOOL>
<SEARCHTERM> <W C='NNP'>Stratford</W> </SEARCHTERM>
<BOOL TYPE='AND'> <W C='CM'>,</W> </BOOL>
<SEARCHTERM> <W C='NNP'>London</W> </SEARCHTERM>
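Synonym expansion and abstract query formation can be sketched as follows. The `SYNONYMS` table is a hand-written placeholder standing in for a lexical resource such as WordNet; its contents and the function names are assumptions made only for this example.

```python
# Hypothetical synonym table; a real system would consult WordNet
# or a lexical resource written for the assignment.
SYNONYMS = {"estate agent": ["property", "real estate"]}

def quote(term):
    """Upper-case a term and put multi-word terms in double quotes."""
    term = term.upper()
    return '"%s"' % term if " " in term else term

def abstract_query(terms, synonyms=SYNONYMS):
    """OR together each term and its synonyms, then AND the clauses."""
    clauses = []
    for term in terms:
        alternatives = [term] + synonyms.get(term.lower(), [])
        clause = " OR ".join(quote(alt) for alt in alternatives)
        if len(alternatives) > 1:
            clause = "(" + clause + ")"
        clauses.append(clause)
    return " AND ".join(clauses)

query = abstract_query(["estate agent", "Stratford", "London"])
print(query)
```

Keeping the abstract Boolean form separate from any engine-specific syntax means only the last step (the concrete query) has to change if a different search engine is used.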
Available tools:
• LINUX:
  • Overall system control: Shell scripts, Perl, Java
  • Tokenizing: Perl + Regular Expressions
  • POS: Brill tagger
  • Lexical Expansion: WordNet (Java interface, command line)
• WINDOWS:
  • Overall system control: Java, Batch files, Perl
  • Tokenizing, POS tagging: Xerox (Tokenizer, POS + Lemmatizer)
  • WordNet: Use Java interface
Marking Scheme
Optionals
• Write a simple Web page interface to your search engine
• Write your own lexical resource (see following classes)
Deadline
• Friday, December 12th, 12:00