Digital Italian: An overview of Italian corpora
A linguistic corpus:
• a body of texts / transcripts collected for linguistic purposes,
• computerized,
• representative of the variety studied,
• balanced,
• annotated.
Annotation
• Linguistic annotation can be useful or restrictive.
• Extra-linguistic annotation is useful for sociolinguistic research.
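The difference between the two kinds of annotation can be sketched as a small data structure. This is only an illustration: the class names, the field names (`lemma`, `pos`, `speaker`) and the sample utterance are invented here and are not taken from any of the corpora presented.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Token:
    """One word with its linguistic annotation (hypothetical fields)."""
    form: str   # the word as it appears in the text
    lemma: str  # dictionary form
    pos: str    # part-of-speech tag


@dataclass
class Utterance:
    """A transcribed utterance plus extra-linguistic annotation."""
    tokens: List[Token]
    # Extra-linguistic information about the speaker (hypothetical keys).
    speaker: Dict[str, str] = field(default_factory=dict)


def lemmas_by_pos(utterance: Utterance, pos: str) -> List[str]:
    """Return the lemmas of all tokens carrying the given POS tag."""
    return [t.lemma for t in utterance.tokens if t.pos == pos]


# Invented example, for illustration only.
utterance = Utterance(
    tokens=[
        Token("Il", "il", "DET"),
        Token("gatto", "gatto", "NOUN"),
        Token("dorme", "dormire", "VERB"),
    ],
    speaker={"region": "Toscana", "age_group": "30-40"},
)

verbs = lemmas_by_pos(utterance, "VERB")
```

The linguistic layer (lemma, part of speech) supports queries on the language itself, while the speaker fields are what a sociolinguistic study would filter on.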
Italian corpora
• General / Specialized
• Written / Spoken
• Diachronic / Synchronic
General corpora: Written Italian
• Corpus e lessico di frequenza dell’italiano scritto (COLFIS)
• Corpus di riferimento dell’italiano scritto / Corpus dinamico dell’italiano scritto (CORIS/CODIS)
COLFIS - structure
COLFIS (over three and a half million words)
• Newspapers: Il Corriere della Sera, La Repubblica, La Stampa. Economy, news of local interest, society, crime news, internal / external affairs, science, show biz and sports.
• Periodicals: other, arts, science and technology, cars and boats, children and youngsters, home and hobby, women’s magazines, photo love story, general information, society, radio and television, sport, travels and ecology.
• Books: other, arts, children, SF, detective and spy stories, hobby and travel, classics, modern narrative, romance, essays, natural and exact sciences, human and social sciences, theatre and poetry.
CORIS/CODIS – structure
CORIS / CODIS (one hundred million words)
• Press: newspaper, periodical, supplement. National, local / specialist, non-specialist / connotated, non-connotated.
• Fiction: novels, short stories. Italian, foreign, for adults, for children, crime, adventure, SF, women’s literature.
• Academic Prose: human sciences, natural sciences, physics, experimental sciences. Books, reviews, scientific, popular history, philosophy, arts, literary criticism, law, economy, biology, etc.
• Legal and Administrative Prose: legal, bureaucratic, administrative. Books, reviews.
• Miscellanea: books on religion, travel, cookery, hobbies, etc. Books, reviews.
• Ephemera: letters, leaflets, instruction. Private, public / printed form, electronic form.
General corpora: Spoken Italian
• Lessico di frequenza dell’italiano parlato (LIP) -> Bancadati dell’italiano parlato (BADIP)
• Archivio delle varietà dell’italiano parlato (AVIP)
• LABLITA
Spoken and written Italian: Corpora e lessici dell’italiano parlato e scritto (CLIPS)
CLIPS (the spoken corpus)
• Radio and television speech: entertainment, informative transmissions, cultural and educational transmissions, commercials.
• Field recordings: map task dialogues and spot the difference game.
• Readings: readings by the speakers themselves or by professional dubbing actors.
• Telephone speech: conversations between a fake tour-operator and three hundred people.
Specialized corpora
• Corpus di italiano televisivo (CIT)
• La Repubblica
CIT – structure
• Current affairs
• Entertainment (games, talk-show, varieties)
• Commercials
• Sports news
• Newscast
Commentaries. Play-by-play. Studio broadcast. On-field broadcast. Text. Text. Slogans. Studio broadcast. On-field broadcast. Text. Headlines. Studio broadcast. On-field broadcast.
La Repubblica – structure
• Year: 1985 - 2000
• Genre: News, Comment
• Topic: Religion, Culture, Economics, Education, News, Politics, Science, Society, Sport, Weather, Unclassified
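A corpus classified by year, genre and topic lends itself to building sub-corpora by filtering on those fields. The sketch below shows one possible way to do this. The `Article` class, the `select` function and the placeholder records are assumptions made for illustration; they do not describe the actual format or query interface of the La Repubblica corpus.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Article:
    """One corpus text with its metadata (hypothetical record layout)."""
    year: int
    genre: str  # e.g. "News" or "Comment"
    topic: str  # e.g. "Politics", "Sport", "Unclassified"
    text: str


def select(
    articles: Iterable[Article],
    genre: Optional[str] = None,
    topic: Optional[str] = None,
    years: Optional[Tuple[int, int]] = None,
) -> List[Article]:
    """Keep the articles matching every criterion that is not None.

    `years` is an inclusive (first, last) range.
    """
    selected = []
    for article in articles:
        if genre is not None and article.genre != genre:
            continue
        if topic is not None and article.topic != topic:
            continue
        if years is not None and not (years[0] <= article.year <= years[1]):
            continue
        selected.append(article)
    return selected


# Placeholder records, invented for illustration only.
sample = [
    Article(1985, "News", "Politics", "placeholder text 1"),
    Article(1992, "Comment", "Sport", "placeholder text 2"),
    Article(2000, "News", "Sport", "placeholder text 3"),
]

sport_news = select(sample, genre="News", topic="Sport")
nineties = select(sample, years=(1990, 1999))
```

Leaving a criterion unset keeps the filter open on that dimension, so the same function can select, say, all comment pieces from a given period regardless of topic.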
Thank you!
Anne-Marie OBRETIN
MRes in European Languages and Cultures
University of Exeter
ao231@exeter.ac.uk