Analysing Crime-Scene Reports Scene of Crime Information System Katerina Pastra and Horacio Saggion University of Sheffield
Outline • Project Overview • SOCIS Architecture • Corpus • Linguistic Analysis • Pointers
Project Overview 2000 - 2003
• Domain: Scene of Crime Investigation (SOC)
• Main features:
  1. Multimedia briefing
     · Summarisation of text and images
  2. Generation
     · Of formal reports and of photo albums
  3. Intelligent search
Project Overview (2)
• Other systems for Crime Investigation:
  · Academic R&D projects
  · Governmental agencies’ systems
  · Commercial systems
  BUT: SOCIS brings ‘intelligence’ to CI systems
• The ‘Digital Evidence in Court’ issue:
  · Authenticity has to be verified
  · Recently accepted in court
A view of SOCIS
[Diagram: Image processing + Text processing, Integrated Knowledge Base]
Text Processing
• Text corpus
• Information Extraction system
  >> Named Entity Recognition
  >> Co-reference Resolution
Need: linguistic analysis of the language used at the SOC
• Lexical information
• Morphosyntactic information
• Semantic information
The Corpus
4 days spent with a SOCO:
  · 12 scenes visited
  · 2 complete case files examined
  · Official documentation collected
• Official documentation:
  · SOC reports = 77
  · Photo indexes = 300
  · Witness statements = 14
• Reported SOC information:
  · Press Association = 792
  · Washington Post = 233
  · Crime Watch = 8
NEEDED: reports, photo indexes, witness statements and photographs
• For the same case
• For major crime
• Of significant quantity
SOC Language Characteristics
• General characteristics:
  · Telegraphic
  · Descriptive
  · Accurate
  · Objective
• Special text type: reports
Lexical Information
Characteristics:
  - Extensive use of abbreviations
  - Jargon
Creation of word lists (gazetteers):
  - Based on PITO’s CDM
  - Over 200 lists (domain + general)
Words of interest are assigned a semantic category
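The gazetteer idea on this slide, where listed words receive a semantic category, can be sketched roughly as below. This is only an illustration: the list names, entries and categories are invented, are not taken from the SOCIS lists or PITO’s CDM, and the whitespace tokenisation is a simplifying assumption.

```python
# Hypothetical gazetteers: semantic category -> word-list entries.
# All names and entries here are illustrative, not the SOCIS lists.
GAZETTEERS = {
    "weapon": ["knife", "shotgun"],
    "body_part": ["left hand", "head"],
    "evidence": ["footwear mark", "fingerprint"],
}


def build_index(gazetteers):
    """Map each lower-cased entry (as a tuple of tokens) to its category."""
    index = {}
    for category, entries in gazetteers.items():
        for entry in entries:
            index[tuple(entry.lower().split())] = category
    return index


def annotate(text, index):
    """Return (phrase, category) pairs, taking the longest match at each position."""
    tokens = text.lower().split()  # naive tokenisation: no punctuation handling
    max_len = max((len(key) for key in index), default=0)
    annotations = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so "footwear mark" beats "footwear".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in index:
                annotations.append((" ".join(candidate), index[candidate]))
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return annotations


INDEX = build_index(GAZETTEERS)
```

Calling `annotate("Footwear mark found near knife", INDEX)` pairs "footwear mark" with the hypothetical category `evidence` and "knife" with `weapon`. Longest-match-first is one common design choice for multi-word entries; the slides do not say how SOCIS resolves overlaps.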
Morphosyntactic Features
• Extensive ellipsis
• Simple temporal dimensions
• Limited co-ordination
• Sub-ordination avoided
• POS: NPs, PPs
• Adjuncts of place and time, qualifiers
To identify entities of interest automatically, we need to write specific rules using:
• The word lists + context information
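A rule that combines word lists with context information might look like the sketch below. It is an assumption-laden illustration rather than the SOCIS rule formalism: the word lists, the regular-expression approach and the pattern "evidence term + place preposition + up to four words" are all made up here, chosen because the telegraphic style favours short NP + PP sequences.

```python
import re

# Hypothetical word lists; the entries are illustrative only.
EVIDENCE_TERMS = ["footwear mark", "fingerprint", "blood stain"]
PLACE_PREPOSITIONS = ["on", "in", "near", "under", "beside"]


def _alternation(words):
    """Build a regex alternation, longest entries first."""
    ordered = sorted(words, key=len, reverse=True)
    return "|".join(re.escape(word) for word in ordered)


# Rule: <evidence term> <place preposition> <one to four following words>.
# The preposition and following words are the "context" that licenses
# reading the words after the term as a location.
RULE = re.compile(
    r"\b(?P<evidence>" + _alternation(EVIDENCE_TERMS) + r")\s+"
    r"(?P<prep>" + _alternation(PLACE_PREPOSITIONS) + r")\s+"
    r"(?P<place>\w+(?:\s+\w+){0,3})",
    re.IGNORECASE,
)


def find_evidence_locations(text):
    """Return (evidence, preposition, place) triples matched by the rule."""
    return [
        (
            match.group("evidence").lower(),
            match.group("prep").lower(),
            match.group("place").lower(),
        )
        for match in RULE.finditer(text)
    ]
```

For the telegraphic input `"Footwear mark on kitchen floor. Fingerprint found."`, the rule fires once, on the first sentence; "Fingerprint found" has no place preposition after the term and is skipped.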
Pointers • SOCIS Sheffield Web Page http://www.dcs.shef.ac.uk/nlp/socis • SOCIS Surrey Web Page http://www.computing.surrey.ac.uk/ai/socis • NLP Group http://www.dcs.shef.ac.uk/nlp