Genre Analysis of Structured E-mails for Corpus Profiling. Workshop on Corpus Profiling for NLP/IR. Malcolm Clark. Supervisors: Professor Patrik O'Brian Holt, Dr Ian Ruthven.

Presentation Transcript


  1. Genre Analysis of Structured E-mails for Corpus Profiling. Workshop on Corpus Profiling for NLP/IR. Malcolm Clark. Supervisors: Professor Patrik O'Brian Holt, Dr Ian Ruthven.

  2. Presentation Outline
     • Introduction
     • The Problems
     • Information Retrieval (IR), Genre and Perception
     • Experiment – Research Questions, Setup, How do People use Textual Features?
     • Conclusions
     • Contributions and Implications
     • Future Work

  3. Introduction
     • Focus: IR and cognitive psychology.
     • Corpora contain ‘exemplar’ documents, called genres, which are useful for profiling corpora.
     • E-mail exchanges have socially constructed communicative behaviours, which exist to improve the efficiency of a community of practice and can be used for profiling corpora.
     • Aim: investigate these types of genres and how people use e-mails, in terms of genre and perception, for filtering.

  4. The Problems
     • Identifying genres for profiling a corpus.
     • Filtering the correct types of documents to the user by genre: e-mail filtering.
     • Understanding user tasks.
     • Can a text be understood rapidly without the need to parse the whole document?

  5. The Project Examines:
     • The value of structure.
     • How form or layout is perceived in structured texts.
     • Whether constructivist (recognition) or ecological (action afforded) approaches apply, or whether both are used.
     • If and how the objects of a community of practice (COP) can be comprehended and exploited.
     • How readers react to genre features in document collections.

  6. Information Retrieval
     • Division of IR into computer science lab experiments vs ‘user-orientated’ social studies (Järvelin, 2006).

  7. Genre – Background (diagram of a typical genre; Orlikowski and Yates, 1994)
     • Form (readily observable features): communication medium; structural features; language or symbol system (formality, specialised vocabulary).
     • Purpose (communicative purpose): discourse structure; arguments; themes; topics.

  8. Corpus – Genre Example from E-mail: Call for Papers
     • Header: title etc.
     • Abstract
     • Titles: topics (list)
     • Dates and submission

  9. Genre – What are Communities of Practice (COP)?
     • What? Social institutions/sites.
     • When? Human ‘agents’ draw on genre rules to engage in organizational communication.
     • How? Produced, reproduced, or modified.
     • But how are they perceived and used?

  10. Two prominent fields in perception research (diagram)
     • Human perceptual systems perceive.
     • Final goal? Recognition or action.

  11. Experiment Pilot – Research Questions:
     • How do human beings use genre features, and what do they perceive?
     • How can genre categorization be performed using current skimming methods?
     • How do genres evolve in communities of practice (e.g. e-mail)?
     • How are document genres and structural attributes used?

  12. Experiment Pilot – How do People Use Texts?
     By eye tracking, i.e. the position and movement of the eye:
     • Collect and analyse the empirical data produced by experiments in an e-mail community of practice.
     • Locate the strategies and features for profiling corpora, e.g. centred blocks of text, invariant cues.
     • Take into account: features, strategies etc.
     • How do humans view genre?

  13. Experiment Pilot

  14. Pilot – Setup
     • Method: 4 x 16 image blocks (4 genres in each two blocks).
     • Measurements:
       • Number of genres identified correctly (purpose)
       • Structured vs non-structured form (form)
       • Genre identification response time (form)
       • Strategies and distinguishing features (purpose and form)
     • Variables:
       • Purpose/type of genre
       • Form in 4 representations…

  15. CFP – Content AND Structure

  16. CFP – Structure and No Content

  17. CFP – Content No Structure

  18. CFP – No Content AND No Structure

  19. Setup – Task and procedure
     • Shown 64 images.
     • Vocally identify each image.
     • Eye tracker records features and strategies used.
     • Data recorded:
       • X/Y location, saccades and fixations
       • Features and strategies
       • Desktop video recording (Wink)
       • Timed and vocal responses
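
Note (illustrative, not from the slides): the slide says the eye tracker records X/Y locations, saccades and fixations, but not how the two are separated. A minimal sketch of one common approach, a velocity-threshold rule, is given below. The `GazeSample` type, the `label_movements` function and the threshold of 1000 pixels per second are all assumptions made for illustration; they are not the setup used in the pilot.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class GazeSample:
    """One hypothetical eye-tracker sample: time in seconds, position in pixels."""
    t: float
    x: float
    y: float


def label_movements(samples, speed_threshold=1000.0):
    """Label each step between consecutive samples as 'fixation' or 'saccade'.

    A step is a saccade when the gaze moves faster than `speed_threshold`
    (pixels per second, an assumed value); otherwise it is a fixation.
    """
    labels = []
    for prev, cur in zip(samples, samples[1:]):
        dt = cur.t - prev.t
        if dt <= 0:
            raise ValueError("samples must be in strictly increasing time order")
        speed = hypot(cur.x - prev.x, cur.y - prev.y) / dt
        labels.append("saccade" if speed > speed_threshold else "fixation")
    return labels


# Made-up samples: a small drift, a large jump, then another small drift.
demo = [
    GazeSample(0.00, 100, 100),
    GazeSample(0.01, 101, 100),
    GazeSample(0.02, 300, 100),
    GazeSample(0.03, 301, 101),
]
demo_labels = label_movements(demo)
print(demo_labels)
```

Runs of consecutive 'fixation' labels could then be merged and their X/Y centroids compared against the layout of each image.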

  20. Results after 5 Participants
     • Number of genres identified correctly (purpose): 11.5 per block out of 16.
     • Unstructured vs structured: 41.6% / 72.9%.
     • By representation: original (87.5%), original no content (77%), content no structure (68%), neither (27%).
     • Structure vs non-form, average response time (sec): 2.22 vs 2.72.
     • HOW WAS IT DONE? Clues to strategies:
       • Skimmed shape: left-aligned (sem) / centred (cfp), and blocks of text/numerics.
       • No structure / no structure or content: wide spirals of scanning behaviour, possibly looking for keywords?
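
Note (illustrative, not from the slides): the headline figures can be put on a common scale with a little arithmetic. The sketch below uses only the numbers quoted on this slide and assumes, from the order in which they are given, that 2.22 s is the structured and 2.72 s the unstructured response time. The variable names are hypothetical.

```python
# Figures quoted on the slide (results after 5 participants).
CORRECT_PER_BLOCK = 11.5   # mean number of genres identified correctly per block
IMAGES_PER_BLOCK = 16
RT_STRUCTURED = 2.22       # average response time in seconds (assumed mapping)
RT_UNSTRUCTURED = 2.72


def percent(part, whole):
    """Express `part` as a percentage of `whole`."""
    return 100.0 * part / whole


accuracy = percent(CORRECT_PER_BLOCK, IMAGES_PER_BLOCK)
rt_gap = RT_UNSTRUCTURED - RT_STRUCTURED
rt_saving = percent(rt_gap, RT_UNSTRUCTURED)

print(f"overall accuracy: {accuracy:.1f}%")
print(f"response-time gap: {rt_gap:.2f} s ({rt_saving:.1f}% of the unstructured time)")
```

On these figures, 11.5 out of 16 is roughly 72% correct overall, and structured images were identified about half a second faster.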

  21. Results – Distinguishing features

  22. Conclusions
     • Genre has been largely overlooked, but momentum is building.
     • Our approach is useful for filtering e-mails and identifying features for characterising datasets.
     • Purpose and form are very useful for using texts.
     • Clues to perception processes were found, but familiarity needs to be added to the mix.
     • Train a machine to emulate human behaviour and understand textual input without reading the whole text?

  23. Contributions and Implications
     • Development of a language/perception theory/framework of:
       • How people use different types of texts.
       • Modelling user tasks and behaviour in relation to genre and perception.
     • Extend the laboratory IR / user-orientated IR approach:
       • From: algorithms and machines.
       • To: a user-oriented and contextual level.

  24. Future Work
     • Focus on narrowing down my work domains.
     • Investigate domains:
       • Academic document collections: CSIRO Enterprise
       • Legal documents: Enron
       • Weblogs: TREC Blog
       • Web domains: Wikipedia
     • Consider multi-genres, e.g. course books, and large documents, e.g. social work reports.

  26. Motivation
     • Useful features for profiling corpora.
     • Adds another type of filtering to large data collections, taking advantage of genre, e.g. news, biographical etc.
     • Genre benefits organisations financially and administratively, e.g. through rapid retrieval of information.
     • Embrace genre and perception to understand and examine these structures!

  27. Evaluation System
     • Model the findings based on FERRET and McFRUMP’s Predictor and Substantiator.
     • Our system: Genre Retrieval and Understanding Memory Program, or GRUMP.
     • Similar features to Clark and Watt (2007)?

  28. Skimming & Categorisation
     • Skimming
       • Used to identify the main points in a text much more quickly than normal reading, without having to understand every word.
       • Normally used when a reader has a large amount of text to read within a limited time.
     • Categorisation
       • Texts are automatically labelled or classified.
       • No need for manual organisation, labelling or sorting.
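
Note (illustrative, not from the slides): one way a program might "skim" a plain-text e-mail by shape rather than by reading it is to measure how much of it is centred, since the results slide reports that participants used centred (cfp) versus left-aligned (sem) layout as a cue. The sketch below is a toy under stated assumptions: the function name `centred_fraction`, the 72-column width and the tolerance of 2 columns are all hypothetical, and it is not the categorisation method of the project.

```python
def centred_fraction(text, width=72, tolerance=2):
    """Return the fraction of non-blank lines that look centred.

    A line counts as centred when it is indented and its left margin and
    right margin (measured against an assumed `width`) differ by at most
    `tolerance` columns.
    """
    lines = [line.rstrip() for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    centred = 0
    for line in lines:
        left = len(line) - len(line.lstrip(" "))
        right = width - len(line)
        if left > 0 and abs(left - right) <= tolerance:
            centred += 1
    return centred / len(lines)


# Made-up message: two centred heading lines followed by a left-aligned line.
sample = "\n".join([
    "Call for Papers".center(72),
    "Workshop on X".center(72),
    "Dear all,",
])
share = centred_fraction(sample)
print(f"{share:.2f} of the non-blank lines are centred")
```

A real categoriser would combine several such layout features; a single threshold on this value would only be a first-pass filter.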

  29. Evaluation System – How it Works (figure taken from Mauldin, 1991)
     • Queries → Query Parser → case frame patterns
     • Texts → McFRUMP Parser → abstracts
     • Case frame patterns and abstracts → Case Frame Matcher → relevant texts
     • The McFRUMP parser contains the Predictor/Substantiator, scripts etc.

  30. Evaluation System – Script Example
     • Using Schank’s (1981, ch. 3) Conceptual Dependency theory of Scripts, Plans and Goals and DeJong’s (1982) FRUMP, make different genre scripts:
     • John Doe was arrested last Saturday morning after holding up the New Haven Savings Bank.
     • $ARREST SCRIPT
       • Police arrive at suspect location
       • Suspect apprehended
       • Taken to police station
       • Charged
       • Incarcerated or bailed
     • Using this type of script format to understand stories, genre rules/features can be specified in scripts to understand texts.
     • Modify script with genre rules.
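
Note (illustrative, not from the slides): the idea of a genre script can be sketched as an ordered list of expected events, with a matcher that checks how many of them appear, in order, in a text. The `$ARREST` events below are taken from this slide, and the `$CFP` slots from the call-for-papers example on slide 8. The `Script` type, the name `$CFP`, and the `match_in_order` and `coverage` functions are hypothetical, and this greedy matcher is not FRUMP's or GRUMP's actual mechanism.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Script:
    """A script: a name plus the events expected, in their usual order."""
    name: str
    events: tuple


ARREST = Script("$ARREST", (
    "police arrive at suspect location",
    "suspect apprehended",
    "taken to police station",
    "charged",
    "incarcerated or bailed",
))

# Hypothetical genre script built from the call-for-papers layout on slide 8.
CFP = Script("$CFP", ("header", "abstract", "topics", "dates and submission"))


def match_in_order(script, observed):
    """Greedily collect script events found in `observed`, left to right."""
    matched = []
    position = 0
    for event in script.events:
        try:
            index = observed.index(event, position)
        except ValueError:
            continue  # event not seen after the current position
        matched.append(event)
        position = index + 1
    return matched


def coverage(script, observed):
    """Fraction of the script's events matched in order."""
    return len(match_in_order(script, observed)) / len(script.events)


# Made-up observation: an e-mail with a header, a topic list and dates.
seen = ["header", "topics", "dates and submission"]
cfp_score = coverage(CFP, seen)
arrest_score = coverage(ARREST, seen)
print(f"{CFP.name}: {cfp_score:.2f}, {ARREST.name}: {arrest_score:.2f}")
```

Scoring a text against several scripts and keeping the best match is one simple way genre rules could be plugged into a script-based reader.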