Named Entity Recognition in an Intranet Query Log
Richard Sutcliffe (1), Kieran White (1), Udo Kruschwitz (2)
(1) University of Limerick, Ireland
(2) University of Essex, UK
Outline
• Introduction
• The Log at Essex
• Manual Log Analysis
• Automatic SNE Recognition
• Using SNEs to Improve Retrieval
• Conclusions
Introduction
• Web log analysis has become an active area (Jansen et al., 2000)
• A search engine can be general or specific
• Our study is of an intranet (specific) log
• The work follows from Kruschwitz (2003) and Kruschwitz et al. (2009)
• Named entities (NEs) are very important in question answering (QA)
• The aim here was to link web log analysis and QA via NEs
Introduction
• QA: What color is the top stripe on the U.S. flag?
• Web logs: student union
• Named entities: LTB 3, Chaplaincy, SPSS
The Log at Essex
• Log of the UKSearch engine
• Period: 1st October 2006 to 30th September 2007
• 40,006 queries
• Interaction sequence
  • Iterative refinement of search terms
  • The engine suggests terms to augment or replace the query
• 35,463 interaction sequences
• A session comprises one or more interaction sequences
• The engine indexes web pages in the essex.ac.uk domain and any files in that domain linked from an indexed web page
The Log at Essex – Cont.
35527 95091B81DF16D8CFA6E7991A5D737741 Tue May 01 12:57:14 BST 2007 0 0 0 outside options outside options outside options
35528 95091B81DF16D8CFA6E7991A5D737741 Tue May 01 12:57:36 BST 2007 1 0 0 outside options art history outside options outside options art history outside options art history
35529 95091B81DF16D8CFA6E7991A5D737741 Tue May 01 12:57:57 BST 2007 2 0 0 history art outside options outside options art history history art history of art
Appearance of raw log
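A minimal sketch of how records like the three raw log lines above might be read and grouped by session. The field layout is an assumption inferred from the visible records (record number, session hash, six timestamp tokens, three integers, then the query text); the names `LogRecord`, `parse_record` and `group_by_session` are hypothetical. The meaning of the three integers is not stated in the slides, so they are kept as an uninterpreted tuple, and the query fields are left unsplit because their original separators were lost.

```python
from collections import defaultdict
from datetime import datetime
from typing import NamedTuple


class LogRecord(NamedTuple):
    record_id: int
    session_id: str
    timestamp: datetime
    counters: tuple   # three integers of unknown meaning (assumption: kept as-is)
    query_text: str   # remaining tokens, not split into fields


def parse_record(line: str) -> LogRecord:
    """Parse one whitespace-separated raw log line (assumed layout)."""
    tokens = line.split()
    # tokens[2:8] are weekday, month, day, clock time, time zone, year.
    weekday, month, day, clock, _zone, year = tokens[2:8]
    # The zone name (e.g. BST) is dropped; strptime does not parse it portably.
    timestamp = datetime.strptime(
        f"{weekday} {month} {day} {clock} {year}", "%a %b %d %H:%M:%S %Y"
    )
    return LogRecord(
        record_id=int(tokens[0]),
        session_id=tokens[1],
        timestamp=timestamp,
        counters=tuple(int(t) for t in tokens[8:11]),
        query_text=" ".join(tokens[11:]),
    )


def group_by_session(records):
    """Collect records under their session hash, preserving input order."""
    sessions = defaultdict(list)
    for record in records:
        sessions[record.session_id].append(record)
    return dict(sessions)
```

Applied to the three records shown, `group_by_session` would return a single session holding all three, since they share one session hash.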
The Log at Essex – Cont.
[Tue,May,1,12,57,14,BST,2007]
>>>
*T *Tue * outside options
*T *Tue *USA outside options art history
*T *Tue *USA history of art
<<<
Log shown as session with the first interaction sequence
Manual Log Analysis
• Subset of log: fourteen days
  • Seven during holidays
  • Seven during term
  • Each group of seven days comprised one Monday, one Tuesday, etc.
• 1,794 queries
  • 632 during holidays
  • 1,162 during term
Manual Log Analysis – Cont.
• Twenty mutually exclusive topics, plus “Other”
• Each query was assigned to one of these
Manual Log Analysis – Cont. Topics used in manual classification
Manual Log Analysis – Cont. Topic analysis of 14-day subset
Manual Log Analysis – Cont.
• Top six categories:
  • Academic or other use
  • Computer use
  • Administration of studies
  • Person name
  • Structure and regulations
  • Calendar / timetable
• These account for 62% of queries
Manual Log Analysis – Cont.
• Four non-exclusive features:
  • Acronym in lower case
  • Initial capitals
  • All capitals
  • Typographic or spelling error
• Between zero and four features are assigned to each query
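The classification described on this slide was done by hand. As an illustration only, the sketch below shows how the four features might be approximated automatically; it is not the authors' procedure. The function name `query_features`, the feature labels, and the two lookup arguments (`known_acronyms`, `vocabulary`) are all hypothetical, and treating any out-of-vocabulary word as a spelling error is a deliberately crude assumption.

```python
def query_features(query, known_acronyms=frozenset(), vocabulary=None):
    """Return the set of features (zero to four) that apply to a query.

    known_acronyms: lower-case acronyms, e.g. {"spss"} (assumed lookup list).
    vocabulary: lower-case known words; if None, spelling is not checked.
    """
    # Ignore tokens with no letters, such as room numbers.
    words = [w for w in query.split() if any(c.isalpha() for c in w)]
    features = set()

    if any(w.islower() and w in known_acronyms for w in words):
        features.add("acronym_lower_case")
    if any(len(w) > 1 and w[0].isupper() and w[1:].islower() for w in words):
        features.add("initial_capitals")
    if any(len(w) > 1 and w.isupper() for w in words):
        features.add("all_capitals")
    if vocabulary is not None and any(
        w.lower() not in vocabulary and w.lower() not in known_acronyms
        for w in words
    ):
        features.add("spelling_error")
    return features
```

Because the features are non-exclusive, a query such as `SPSS Help` would receive both `all_capitals` and `initial_capitals`.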
Manual Log Analysis – Cont. Features used in manual classification
Manual Log Analysis – Cont. Typo / Spelling analysis of 14-day subset
Automatic SNE Recognition - Training
• 1,035 distinct instances of SNEs were manually identified in queries
• Each was manually classified as one of 35 SNE types
• Presented each SNE to bing.com, restricted to essex.ac.uk
• Selected all snippets in the top ten documents
  • SNE plus five tokens on each side
• Presented each snippet to OpenNLP's MaxEnt-based name finder
  • Identifying the type of SNE in the snippet
  • Creating 35 name finder models
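The snippet-trimming step above ("SNE plus five tokens on each side") can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: tokens are split on whitespace, the SNE is matched case-insensitively, and the function names are hypothetical. The output line uses the `<START:type> ... <END>` markup that the OpenNLP name finder accepts as training data; the example snippet and the type label `service` are invented.

```python
def window_around(snippet, sne, width=5):
    """Find the SNE in a snippet and return (left, name, right) token lists.

    Whitespace tokenisation is assumed, so an SNE with punctuation attached
    (e.g. "Chaplaincy,") will not match. Returns None if the SNE is absent.
    """
    tokens = snippet.split()
    target = [t.lower() for t in sne.split()]
    if not target:
        return None
    lowered = [t.lower() for t in tokens]
    n = len(target)
    for i in range(len(tokens) - n + 1):
        if lowered[i:i + n] == target:
            left = tokens[max(0, i - width):i]
            right = tokens[i + n:i + n + width]
            return left, tokens[i:i + n], right
    return None


def to_opennlp_line(snippet, sne, sne_type, width=5):
    """Build one training line with the SNE wrapped in START/END markers."""
    found = window_around(snippet, sne, width)
    if found is None:
        return None
    left, name, right = found
    return " ".join(left + [f"<START:{sne_type}>"] + name + ["<END>"] + right)


# Invented snippet, for illustration only.
example = to_opennlp_line(
    "please contact the Chaplaincy for details of services held on campus each week",
    "Chaplaincy",
    "service",
)
```

Training one model per SNE type, as the slide describes, would then mean writing the lines for each type to a separate training file.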
Automatic SNE Recognition - Training Examples of 35 SNE types
Automatic SNE Recognition - Evaluation
• Selected 500 queries from the log
• Searched for these in the essex.ac.uk domain, using bing.com
• Recorded the first snippet in the top document returned
• 280 snippets were found
• Presented each snippet to the 35 OpenNLP models
  • Identifying one or more relevant SNE types
Automatic SNE Recognition - Evaluation Results. P=C/(C+F). R=C/(C+M).
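The two formulas above can be computed as in the short sketch below. The slides do not define the symbols, so it is assumed here that C is the number of correct instances, F the number of false positives and M the number of missed instances. The function name and the counts in the usage note are hypothetical, not figures from the evaluation.

```python
def precision_recall(correct, false_positives, missed):
    """Return (P, R) with P = C / (C + F) and R = C / (C + M).

    A value is None when its denominator is zero, i.e. when the model
    proposed nothing (P) or there was nothing to find (R).
    """
    proposed = correct + false_positives
    expected = correct + missed
    precision = correct / proposed if proposed else None
    recall = correct / expected if expected else None
    return precision, recall
```

For example, with hypothetical counts C = 15, F = 1 and M = 5, `precision_recall(15, 1, 5)` gives P = 0.9375 and R = 0.75.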
Automatic SNE Recognition - Evaluation
• A clearly defined SNE type with good training examples results in good performance
• P was 1.0 for buildings, campuses, forms, online services, person names, regulations and policies, research groups, room names and software
• P was 0.94 for departments / schools / units
• Most interesting: departments / schools / units, online services and room names, with 15, 41 and 11 correct instances respectively
Automatic SNE Recognition - Evaluation
• Generally the algorithm works very well
• Training examples were limited and their numbers varied widely
• Some NEs were well defined: online services, departments / schools / units
• Others were very poorly defined: documentation, equipment
• The algorithm is disinclined to give false positives
  • Thus P tends to be high
Using SNEs for QA
• Person names should match variants of themselves plus anaphors
  • Kruschwitz = Udo Kruschwitz = he
• Person names could match a post name
  • Kruschwitz = Director of Recruitment and Publicity
Using SNEs to Improve Retrieval
• SNEs are linked
  • Course code, course name, degree code, degree name
  • Department, research centre, research group, person
  • Room number, person, building, department
• Thus:
  • a search for C700 should match BSc Biochemistry
  • a group could match its department
  • a room number could return the name of the occupant, the building or the department
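One way to picture the linking idea above is a small symmetric lookup table used to expand a query with related SNEs. This is only a sketch of the proposal, which the slides present as future work. The class `SNELinks` and its methods are hypothetical, and the single link shown is the C700 example from the slide.

```python
from collections import defaultdict


class SNELinks:
    """Symmetric links between SNEs, looked up case-insensitively."""

    def __init__(self):
        self._links = defaultdict(set)

    def link(self, a, b):
        """Record that two SNEs refer to related things, in both directions."""
        self._links[a.lower()].add(b)
        self._links[b.lower()].add(a)

    def expand(self, query):
        """Return the query followed by its directly linked SNEs, sorted."""
        related = self._links.get(query.strip().lower(), ())
        return [query] + sorted(related)


links = SNELinks()
links.link("C700", "BSc Biochemistry")  # degree code and degree name, from the slide
expanded = links.expand("c700")
```

The same table could hold room, occupant, building and department links, so that a room-number query is expanded with the related entities before retrieval.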
Conclusions
• Categorised queries in an intranet log
• Thus identified important SNE types
• Extracted instances of these using a search engine
• Carried out an initial training experiment with MaxEnt
• Proposed methods of using SNEs for IR and QA
• Hence used a web log to improve future search