Michael P. Oakes University of Sunderland
Contents • Proposals for a Master’s programme in Natural Language Processing • Future research plans / link with Wolverhampton • Plans for publications • Plans for grant proposals • Other funding ideas
Proposals for a Master’s programme in Natural Language Processing • Some preliminaries: • Entry requirements: first or second class degree in a related discipline. Computer programming will be taught from scratch. • Funding: Erasmus, European Social Fund, ESRC Master’s training package scheme for programme development, work-based learning • Students must receive an accurate idea of the content of the programme beforehand • Induction week: meet the teaching team, familiarisation with the University, formal registration, etc. • Diploma, Certificate and Master’s awards. 8 taught modules (24 lectures, 18 hours’ practical, 58 hours’ directed reading, 50 hours’ self-directed research).
Project • Close links with industry established through 3-month industrial placements, based either with the company or at the University. • The sponsor will be from either industry or academia, and there will also be a staff member from Wolverhampton to act as supervisor. • Project management (TOR, reviews), poster, viva, dissertation (typically introduction, research, analysis, implementation, evaluation / experiments, reflective conclusions).
Administration • Programme board of studies: Institute Director or deputy, student representatives, one or more employers’ representatives, module leaders, programme leader, responsible for the management of the programme and the well-being of each module. • Board of assessment: to decide student progression. External Examiner, no student representatives • Internal (prior to hand-out) and External (sample work shown prior to programme assessments) moderation. • Other quality control: student and staff feedback, EE’s report, programme annual report. • Each student has a personal tutor and student handbook. • Timely, face-to-face assessment may improve student satisfaction.
Future Research Plans, • And how these might complement the research topics of the Research Group in Computational Linguistics.
Automatic Summarisation • CAST Project produced an automatic summarisation tool: “term-based summarisation” • Concept-Based Abstracting (Paice). • TRESTLE (Gaizauskas). • David Evans: evaluation of information extraction • Query-based summaries. Intrinsic (representativeness) vs. Extrinsic (judgeability) evaluation (Liang). • SumTrain: reached second round of EU evaluation. • Extraction of statistics-related phrases, e.g. “greater than”, “significant reduction in”, “was directly proportional to”, “did not affect”.
Concept-Based Abstracting Project • window length = 4 • STOP 6 "and foliar treatment AGEN" • 5 "foliar treatment AGEN +" • 5 "treatment AGEN + AGEN" • 4 "effect of mildew AGEN" • 3 "AGEN gave a significant" • 2 "AGEN was the most" • 2 "AGEN at different sowing" • 2 "AGEN increased fertile tillers" • LOW-FQ 1 "effect of AGEN sprays"
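The frequency list on this slide can be approximated with a short sketch: count every 4-word window that contains the placeholder token AGEN. This is an illustration only, not the project's own code; the function name `window_counts` and the three toy sentences are invented for the example.

```python
from collections import Counter


def window_counts(sentences, marker="AGEN", window=4):
    """Count every `window`-word sequence that contains `marker`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        # Slide a fixed-length window across the sentence.
        for i in range(len(tokens) - window + 1):
            gram = tokens[i:i + window]
            if marker in gram:
                counts[" ".join(gram)] += 1
    return counts


# Toy sentences (invented); AGEN stands in for any agent name.
sentences = [
    "effect of mildew AGEN on yield",
    "effect of mildew AGEN was small",
    "AGEN gave a significant increase",
]
counts = window_counts(sentences)
for phrase, n in counts.most_common():
    print(n, phrase)
```

Upper and lower frequency cut-offs, such as the STOP and LOW-FQ marks on the slide, could then be applied to `counts` to keep only mid-frequency patterns.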
Automatic Terminology Processing • Le An Ha looked at the concept of a terminology rather than individual terms. Knowledge patterns from glossaries: store of terms and relations between them. • David Evans. Identification of terms using TF.IDF and other statistical methods (see slide 20). • Shiyan Ou. Sentiment classification (see slide 20). • Constantin Orasan. Corpus of junk mail (spam filters, Farrow). • Constantin Orasan. Analysis of genre differences – project on “Language, Computation and Style” (authorship). • Englishes, Scrip newsfeeds, BELGA: “feature extraction for text classification”.
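As a reminder of how TF.IDF ranks candidate terms, here is a minimal sketch using the common tf × log(N / df) weighting. The exact variant used in the work cited above is not stated on the slide, so the formula, the function name `tf_idf` and the toy documents are all assumptions.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per document, using tf * log(N / df)."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenised:
        df.update(set(tokens))
    scores = []
    for tokens in tokenised:
        tf = Counter(tokens)
        scores.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return scores


# Toy documents (invented for illustration).
docs = [
    "the mildew treatment reduced mildew",
    "the spam filter blocked the message",
    "the corpus of junk mail",
]
scores = tf_idf(docs)
best_term = max(scores[0], key=scores[0].get)
print(best_term)
```

A word that occurs in every document, such as "the" here, scores zero, while a word concentrated in one document scores highest there.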
Annotation tools • Constantin Orasan: PALinkA, automatic annotation of anaphoric links. • Lewandowska, Oakes & Rayson: part-of-speech and semantic code tagging in English; alignment enables partial semantic tagging of L2.
Annotation: Aligned and Partially Tagged Polish text (Lewandowska, Oakes and Rayson) • Tak jest_A3+ mowi Polemarch_Z99 a do_Z5 tego jeszcze urzadza nocne nabozenstwo, ktore_Z8 warto zobaczyc • “_”_PUNC That_DD1_Z8 ’s_VBZ_A3+ the_AT_Z5 way_NN1_X4.2 of_IO_Z5 it_PPH1_Z8 ,_,_PUNC “_”_PUNC said_VVD_Q2.1 Polymarchus_NP1_Z99 ,_,_PUNC “_”_PUNC and_CC_Z5 ,_,_PUNC besides_RR_Z5 ,_,_PUNC there_EX_Z5 is_VBZ_A3+ to_TO_Z5 be_VBI_A3+ a_AT1_Z5 night_NNT1_T1.3 festival_NN1_K1/S1.1.3+ which_DDQ_Z8 will_VM_T1.1.3 be_VBI_A3+ worth_II_I1.3 seeing_VVG_X3.4 ._._PUNC
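The English tokens above follow a word_POS_SEM pattern, which can be unpacked with a few lines of code. This is only a sketch: `parse_tagged` is a hypothetical helper name, and it assumes every token carries exactly two tags, which holds for the English line but not for the partially tagged Polish line.

```python
def parse_tagged(token):
    """Split 'word_POS_SEM' into its three fields, working from the right."""
    # Splitting from the right keeps punctuation tokens such as ",_,_PUNC" intact.
    word, pos, sem = token.rsplit("_", 2)
    return word, pos, sem


line = "said_VVD_Q2.1 Polymarchus_NP1_Z99 night_NNT1_T1.3 festival_NN1_K1/S1.1.3+"
parsed = [parse_tagged(token) for token in line.split()]
for word, pos, sem in parsed:
    print(word, pos, sem)
```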
Mobile Devices • Laura Hasler and Dalila Mekhaldi: QALL-ME, Question-Answering for Digital Phones. • Chufeng Chen: Annotation of digital photographs taken with a GPS camera. A gazetteer “translated” latitude and longitude data into place name, geographical feature, e.g. Lat = 54.91, Long = -1.4, place = Sunderland, feature = harbour. Episodic memory.
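A gazetteer lookup of this kind can be sketched as a nearest-neighbour search over known coordinates, treating 54.91 as the latitude and -1.4 as the longitude of Sunderland. This is not the system described on the slide: the two-entry gazetteer, the second entry's approximate coordinates, and the names `haversine_km` and `lookup` are assumptions made for illustration.

```python
import math

# Hypothetical gazetteer entries: (latitude, longitude, place, feature).
GAZETTEER = [
    (54.91, -1.40, "Sunderland", "harbour"),
    (52.59, -2.13, "Wolverhampton", "city centre"),
]


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def lookup(lat, lon):
    """Return the place name and feature of the nearest gazetteer entry."""
    entry = min(GAZETTEER, key=lambda e: haversine_km(lat, lon, e[0], e[1]))
    return {"place": entry[2], "feature": entry[3]}


# A photograph taken slightly away from the stored harbour coordinates.
result = lookup(54.92, -1.38)
print(result)
```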
Other Related Work • Andrea Mulloni: Corpus Linguistics. • Empirical vs. Chomskyan • Own interest “Statistics for Corpus Linguistics”. • Driving the process rather than merely testing for statistical significance, e.g. Mutual Information to find collocations. • Irina Temnikova: Machine Translation • Alignment for example-based machine translation (Lewandowska & Oakes).
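To illustrate how Mutual Information can drive the search for collocations, the sketch below scores each adjacent word pair by pointwise mutual information, log2(P(x, y) / (P(x) P(y))). The restriction to adjacent pairs, the function name `pmi_scores` and the toy sentence are assumptions for the example.

```python
import math
from collections import Counter


def pmi_scores(tokens):
    """Pointwise mutual information (base 2) for each adjacent word pair."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = len(tokens) - 1
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_unigrams
        p_w2 = unigrams[w2] / n_unigrams
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Toy text (invented for illustration).
tokens = "strong tea and strong coffee but weak tea".split()
scores = pmi_scores(tokens)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```

Pairs of words that each occur only once receive the highest scores, so in practice a minimum frequency threshold is usually applied before ranking.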
Plans for Publications (1) • Book Chapters in press: • Processing Multilingual Corpora, Chapter 32 of Corpus Linguistics: An International Handbook, eds. Anke Lüdeling and Merja Kytö, Mouton de Gruyter. • Corpus Linguistics and Stylometry, Chapter 52, ibid. • Corpus Linguistics and Language Variation, in Contemporary Approaches to Corpus Linguistics, ed. Paul Baker, Continuum. • Javanese, in “Languages of the World”, ed. Bernard Comrie, Routledge. • J. Vilares, M. Oakes and M. Vilares: A Knowledge-Light Approach to Query Translation in CLIR. RANLP V, ed. N. Nicolov, Benjamins.
Plans for Publications (2) • Under second review: • S-W. Ke, C. Bowerman and M. Oakes, “Automatic classification of personal email with PERC and time-related strategies”, ACM Transactions on Information Systems. • W-C Lin, M. Oakes and J. Tait, “Improving image annotation via representative feature selection”, Cognitive Processing.
Plans for Publications (3) • Future plans: • VITALAS: Video and image Indexing and reTrievAl in the LArge Scale. • Update “Statistics for Corpus Linguistics” – sold over 1500 copies, but now 10 years old • Last chapter was “Literary Detective Work”, which could be a book in its own right: disputed authorship (compendium of techniques, Shakespeare, religious texts, still unsolved mysteries e.g. The Quiet Don, Marxism and the Philosophy of Language), unknown languages (Linear B, Voynich manuscript). JLLC, QL.
Plans for Grant Proposals (1) • Closing the Semantic Gap • Related to machine learning (boosting), caption analysis, gazetteers, alignment of low level image content features and high level semantic features (words) • Son of VITALAS?
Plans for Grant Proposals (2) • Which words are truly characteristic of a corpus? χ² etc. • Countable linguistic features. • Measures from IR e.g. PageRank (Łódź, Palomino). • AHRC (if theoretical, Englishes), ESRC (if applied, e.g. spam filters). • Sentiment analysis (Thijs Westerveld at Teezir): mining online opinions. Cheerful, chic, cheap, clean vs. chaos, cranky, cumbersome, damaged. • Interface between NLP and IR: sentence analysis e.g. adjectives, negatives; follow links to navigate websites. • IR relevant vs. irrelevant documents.
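One standard way to ask whether a word is characteristic of a corpus is a chi-squared test on a 2×2 table: the word versus all other words, in corpus A versus corpus B. The sketch below uses the usual closed form without a continuity correction; the function name `chi_squared` and the frequencies are made up for illustration.

```python
def chi_squared(freq_a, total_a, freq_b, total_b):
    """Chi-squared statistic for one word's frequency in two corpora.

    freq_a, freq_b: occurrences of the word in corpus A and corpus B.
    total_a, total_b: total number of word tokens in each corpus.
    """
    a, b = freq_a, freq_b
    c, d = total_a - freq_a, total_b - freq_b
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


# Invented counts: the word occurs 30 times per 100 tokens in A, 10 per 100 in B.
print(chi_squared(30, 100, 10, 100))
# The same relative frequency in both corpora gives a statistic of zero.
print(chi_squared(10, 100, 10, 100))
```

With one degree of freedom, a value above 3.84 is significant at the 5% level.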
Plans for Grant Proposals (3) • Temporal relations in query language modelling (Dawei Song). • Temporal similarity + semantic similarity = overall similarity. • The temporal similarity between texts (e.g. query and document) can be estimated by a) time stamp, b) temporal logic between the texts (Andrea Setzer).
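One possible reading of combining temporal and semantic similarity is sketched below, using option a) only (time stamps). Everything specific here is an assumption rather than part of the proposal: the exponential decay with a 30-day half-life, the bag-of-words cosine for semantic similarity, the weighted sum with weight `alpha`, and all function names.

```python
import math
from collections import Counter
from datetime import date


def semantic_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words vectors of the two texts."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def temporal_similarity(date_a, date_b, half_life_days=30.0):
    """Similarity that halves for every `half_life_days` between time stamps."""
    gap_days = abs((date_a - date_b).days)
    return 0.5 ** (gap_days / half_life_days)


def overall_similarity(text_a, date_a, text_b, date_b, alpha=0.5):
    """Weighted sum of temporal and semantic similarity (alpha is assumed)."""
    temporal = temporal_similarity(date_a, date_b)
    semantic = semantic_similarity(text_a, text_b)
    return alpha * temporal + (1 - alpha) * semantic


query = ("spam filter evaluation", date(2008, 1, 31))
document = ("evaluation of a spam filter", date(2008, 1, 1))
score = overall_similarity(query[0], query[1], document[0], document[1])
print(round(score, 3))
```

Option b), temporal logic between the texts, would need temporal expressions to be extracted and ordered first, and is not covered by this sketch.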
Plans for Grant Proposals (4) • Corpus Profiling Workshop on October 18th. • Aims: to explore how corpus characteristics affect the behaviour of techniques in IR and NLP, and to set out a roadmap for a shared research agenda. • Data set profile impacts on automatic classification, IR, anaphora resolution, automatic summarisation and word sense disambiguation.
Other Funding Ideas • IRSG-like “Industry Day” to foster industrial contacts (consultancy? Grant proposals?) • Organise conferences, e.g. bid for Corpus Linguistics, CLEF, ECIR. • Exploitation of Intellectual Property. • Is there an equivalent of CEDEC (Computing and Engineering Distance Education Centre) with whom we can discuss marketing programmes world-wide / part-time? Work-based learning?