
Named Entity Recognition in an Intranet Query Log




  1. Named Entity Recognition in an Intranet Query Log Richard Sutcliffe (1), Kieran White (1), Udo Kruschwitz (2) 1 - University of Limerick, Ireland 2 - University of Essex, UK

  2. Outline • Introduction • The Log at Essex • Manual Log Analysis • Automatic SNE Recognition • Using SNEs to Improve Retrieval • Conclusions

  3. Introduction • Web log analysis has become an active area (Jansen et al., 2000) • A search engine can be general or specific • Our study is of an intranet (specific) log • Work follows from Kruschwitz (2003) and Kruschwitz et al. (2009) • NEs are very important in QA • Aim here was to link web log analysis and QA via NEs

  4. Introduction • QA: What color is the top stripe on the U.S. flag? • Web Logs: student union • Named Entities: LTB 3, Chaplaincy, SPSS

  5. The Log at Essex • Log of the UKSearch engine • Period 1st October 2006 - 30th September 2007 • 40,006 queries • Interaction sequence: iterative refinement of search terms; engine suggests terms to augment or replace the query • 35,463 interaction sequences • A session comprises one or more interaction sequences • Indexes web pages in the essex.ac.uk domain and any files in that domain linked from an indexed web page

  6. The Log at Essex - Cont. Appearance of raw log:
35527 95091B81DF16D8CFA6E7991A5D737741 Tue May 01 12:57:14 BST 2007 0 0 0 outside options outside options outside options
35528 95091B81DF16D8CFA6E7991A5D737741 Tue May 01 12:57:36 BST 2007 1 0 0 outside options art history outside options outside options art history outside options art history
35529 95091B81DF16D8CFA6E7991A5D737741 Tue May 01 12:57:57 BST 2007 2 0 0 history art outside options outside options art history history art history of art
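The raw log records above can be split into fields with a small parser. This is a minimal sketch: the record id, session hash and timestamp are evident from the sample, but the meaning of the three integer fields and of the repeated query columns is not stated on the slide, so they are kept as opaque strings, and the overall field layout is an assumption inferred from the sample.

```python
import re

# Assumed layout of a UKSearch raw-log record (inferred from the sample):
#   <record id> <32-hex session hash> <timestamp> <int> <int> <int> <query columns>
LOG_RECORD = re.compile(
    r"(?P<rec_id>\d+)\s+"
    r"(?P<session>[0-9A-F]{32})\s+"
    r"(?P<timestamp>\w{3} \w{3} \d{2} \d{2}:\d{2}:\d{2} \w+ \d{4})\s+"
    r"(?P<flags>\d+ \d+ \d+)\s+"
    r"(?P<queries>.*)"
)

def parse_record(line):
    # Return the named fields of one raw-log record, or None on no match
    m = LOG_RECORD.match(line)
    return m.groupdict() if m else None

rec = parse_record(
    "35527 95091B81DF16D8CFA6E7991A5D737741 "
    "Tue May 01 12:57:14 BST 2007 0 0 0 outside options"
)
```

Grouping records by the session hash would then recover the interaction sequences described on the previous slide.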

  7. The Log at Essex - Cont. Log shown as a session with the first interaction sequence:
[Tue,May,1,12,57,14,BST,2007]
>>>
*T *Tue * outside options
*T *Tue *USA outside options art history
*T *Tue *USA history of art
<<<

  8. Manual Log Analysis • Subset of log: fourteen days • Seven during holidays, seven during term • Each group of seven days comprised one Monday, one Tuesday, etc. • 1,794 queries: 632 during holidays, 1,162 during term

  9. Manual Log Analysis - Cont. • Twenty mutually exclusive topics, plus "Other" • Each query was assigned to one of these

  10. Manual Log Analysis – Cont. Topics used in manual classification

  11. Manual Log Analysis – Cont. Topics used in manual classification

  12. Manual Log Analysis – Cont. Topic analysis of 14-day subset

  13. Manual Log Analysis – Cont. Topic analysis of 14-day subset

  14. Manual Log Analysis - Cont. • Top six categories: Academic or other use; Computer use; Administration of studies; Person name; Structure and regulations; Calendar / timetable • These account for 62% of queries

  15. Manual Log Analysis - Cont. • Four non-exclusive features: acronym; lower case; initial capitals; all capitals; typographic or spelling error • 0-4 features are assigned to each query

  16. Manual Log Analysis – Cont. Features used in manual classification

  17. Manual Log Analysis – Cont. Typo / Spelling analysis of 14-day subset

  18. Automatic SNE Recognition - Training • 1,035 distinct instances of SNEs were manually identified in queries • Each manually classified as one of 35 SNE types • Presented each SNE to bing.com, restricted to essex.ac.uk • Selected all snippets in top ten documents: SNE plus five tokens on each side • Presented each snippet to OpenNLP's MaxEnt-based name finder, identifying the type of the SNE in the snippet • Creating 35 name finder models
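The snippet-windowing step above (keep the SNE plus five tokens on each side) can be sketched as follows. Whitespace tokenisation is an assumption; the slide does not say which tokeniser was used, and the example snippet text is invented for illustration.

```python
def sne_window(snippet, sne, width=5):
    """Return the SNE plus up to `width` tokens on each side, or None
    if the SNE does not occur in the snippet (whitespace tokens)."""
    tokens = snippet.split()
    sne_tokens = sne.split()
    n = len(sne_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == sne_tokens:
            start = max(0, i - width)
            return " ".join(tokens[start:i + n + width])
    return None

# Hypothetical snippet; "LTB 3" is one of the SNE examples in the paper
window = sne_window(
    "the new lecture theatre block LTB 3 is located on the Colchester campus",
    "LTB 3",
)
```

Windows produced this way would then be fed to the name finder trainer, one model per SNE type.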

  19. Automatic SNE Recognition - Training Examples of 35 SNE types

  20. Automatic SNE Recognition - Training Examples of 35 SNE types

  21. Automatic SNE Recognition - Training Examples of 35 SNE types

  22. Automatic SNE Recognition - Training Examples of 35 SNE types

  23. Automatic SNE Recognition - Evaluation • Selected 500 queries from log • Searched for these in the essex.ac.uk domain, using bing.com • Recorded first snippet in top document returned • 280 snippets were found • Presented each to the 35 OpenNLP models, identifying one or more of the relevant SNE types

  24. Automatic SNE Recognition - Evaluation Results. P=C/(C+F). R=C/(C+M).

  25. Automatic SNE Recognition - Evaluation Results. P=C/(C+F). R=C/(C+M).
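The P and R formulas above can be computed directly, assuming the usual reading of the abbreviations (the slide does not expand them): C counts correct recognitions, F false positives, and M missed instances.

```python
def precision(c, f):
    # P = C / (C + F)
    return c / (c + f)

def recall(c, m):
    # R = C / (C + M)
    return c / (c + m)

# Illustrative numbers only, not figures from the paper: 41 correct
# instances with no false positives gives P = 1.0, as reported for
# online services; the M = 9 used for recall here is invented.
p = precision(41, 0)
r = recall(41, 9)
```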

  26. Automatic SNE Recognition - Evaluation • An SNE that is clearly defined and has good training examples gives good performance • P was 1.0 for buildings, campuses, forms, online services, person names, regulations and policies, research groups, room names and software • P was 0.94 for departments / schools / units • Most interesting: departments / schools / units, online services and room names, where there were 15, 41 and 11 correct instances respectively

  27. Automatic SNE Recognition - Evaluation • Generally the algorithm works very well • Training examples were limited and their numbers varied widely • Some NEs were well defined: online services, departments / schools / units • Others were very poorly defined: documentation, equipment • The algorithm is disinclined to give false positives, so P tends to be high

  28. Using SNEs for QA • Person names should match variants of themselves plus anaphors: Kruschwitz = Udo Kruschwitz = he • Person names could match a post name: Kruschwitz = Director of Recruitment and Publicity
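The surname-variant matching above can be sketched as a tiny lookup: a bare surname matches a full name ending with that surname, and a table maps full names to post names. The name and post come from the slide; the matching rule and the POSTS table structure are illustrative assumptions, not the paper's algorithm (and anaphor resolution is not attempted here).

```python
# Hypothetical name-to-post table; the single entry is the slide's example
POSTS = {"Udo Kruschwitz": "Director of Recruitment and Publicity"}

def surname_matches(query_name, full_name):
    # "Kruschwitz" matches "Udo Kruschwitz": compare the final token
    return full_name.split()[-1].lower() == query_name.lower()

def post_for(query_name):
    # Resolve a surname to the post held by the matching person, if any
    for full_name, post in POSTS.items():
        if surname_matches(query_name, full_name):
            return post
    return None
```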

  29. Using SNEs to Improve Retrieval • SNEs are linked: course code, course name, degree code, degree name; department, research centre, research group, person; room number, person, building, department • Thus a search for C700 should match BSc Biochemistry; a group could match its department; a room number could return the name of the occupant, the building or the department
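The linking idea above can be sketched as query expansion over a table of linked SNEs: a query for one member of a linked group is expanded with the others. The C700 = BSc Biochemistry pair is from the slide; the table and the expand_query helper are illustrative assumptions.

```python
# Hypothetical table of linked SNEs (course code -> course name);
# in practice it could also link departments, groups, rooms, people
COURSE_LINKS = {"C700": "BSc Biochemistry"}

def expand_query(query):
    # Return the original query plus any SNE names linked to it
    terms = [query]
    if query in COURSE_LINKS:
        terms.append(COURSE_LINKS[query])
    return terms
```

A retrieval engine would then search for any of the expanded terms, so documents mentioning only "BSc Biochemistry" still match a query for "C700".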

  30. Conclusions • Categorised queries in an intranet log • Thus identified important SNE types • Extracted instances of these using a search engine • Carried out initial training experiment with MaxEnt • Proposed methods of using SNEs for IR and QA • Hence used a web log to improve future search
