Chapter Ten Language and the Computer

Chapter Ten: Language and the Computer. Corpus Linguistics: Definition; Criticisms and the revival of corpus linguistics; Concordance; Text encoding and annotation; The roles of corpus data.

dorcas

Presentation Transcript


  1. Chapter Ten: Language and the Computer

  2. Corpus Linguistics • Definition • Criticisms and the revival of corpus linguistics • Concordance • Text encoding and annotation • The roles of corpus data

  3. Corpus Linguistics • Corpus (plural corpora): a collection of linguistic data, compiled either as written texts or as a transcription of recorded speech. The main purpose of a corpus is to verify a hypothesis about language, for example to determine how the usage of a particular sound, word, or syntactic construction varies.

  4. 3.1 Corpus Linguistics • Corpus linguistics deals with the principles and practice of using corpora in language study. • A computer corpus is a large body of machine-readable texts.

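  5. Corpus (13th century, from the Latin word corpus, meaning "body"; the plural is usually corpora). (1) A collection of texts, especially one that is complete and self-contained, e.g. the corpus of Anglo-Saxon verse. (2) The plural may also be written corpuses. In linguistics and lexicography, a body of texts, utterances, or other specimens, usually stored as an electronic database. Generally speaking, computer corpora can store many millions of running words, whose features can be analyzed by means of tagging (adding identifying and classifying tags to words and other formations) and by using concordancing programs. Corpus linguistics: the study of the data in any such corpus.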

  6. Criticisms and the revival of corpus linguistics • Chomsky changed the direction of linguistics away from empiricism and towards rationalism. The main criticisms: 1. The corpus could never be a useful tool for the linguist, as the linguist must seek to model language competence rather than performance.

  7. 2. The only way to account for the grammar of a language is by description of its rules, rather than by enumeration of its sentences: the sentences of a language are infinite in number, whereas the syntactic rules are finite. 3. Even if language is a finite construct, corpus methodology is not the best method to study language.

  8. (a) * He shines Tony books. (b) He gives Tony books. (c) He lends Tony books. (d) He owes Tony books. How can ungrammatical utterances be distinguished from ones that haven’t occurred? If the corpus does not contain sentence (a), how do we conclude that it is ungrammatical while the rest of the sentences are grammatical?

  9. There are also problems of practicality with corpus linguistics. How can one imagine searching through an 11-million-word corpus using nothing more than one's eyes? Despite these criticisms, corpus linguistics continued to develop, especially once the computer gradually became its mainstay.

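  10. Concordance • A computer can search a text for a particular word, a sequence of words, or even a part of speech. It can also retrieve every instance of a word and count how many times the word occurs, which yields information about the word's frequency. The retrieved data can then be sorted in some way.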

  11. Concordance example: "poor" in A Tale of Two Cities, Book 1
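
The search, retrieval, and counting steps described for a concordance can be sketched in a few lines of Python. This is a minimal illustration and not the tool used to produce the slide: the function names `tokenize` and `concordance`, the `width` parameter, and the sample sentence are all assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def concordance(tokens, keyword, width=3):
    """Return keyword-in-context lines: up to `width` tokens on each side of a hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

# Hypothetical sample text, invented for this sketch.
sample = "The poor man gave the poor dog a bone. The dog was not poor for long."
tokens = tokenize(sample)

hits = concordance(tokens, "poor", width=2)  # every instance, with context
freq = Counter(tokens)                       # frequency of every word form

for line in hits:
    print(line)
print("frequency of 'poor':", freq["poor"])
```

Sorting the hits, for instance with `sorted(hits)`, would group lines that share the same left context; a real concordancer would typically offer sorting on either side of the keyword.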

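  12. Text encoding and annotation • The form "gives" carries implicit part-of-speech information: "third person singular present tense verb". In normal reading we can retrieve this information only by drawing on our existing knowledge of English grammar. In an annotated corpus, however, the form "gives" might appear as "gives-VVZ", where the code "VVZ" marks it as the third person singular present tense (Z) form of a lexical verb (VV). Annotation of this kind makes it quicker and easier to retrieve and analyze information about the language contained in the corpus.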
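
To show why a form such as "gives-VVZ" makes retrieval easier, here is a small Python sketch that reads tokens in that word-TAG form and pulls out every word carrying a given tag. Only the VVZ code comes from the slide; the other tags in the sample, the hyphen as separator, and the function names `parse_tagged` and `words_with_tag` are assumptions made for illustration.

```python
def parse_tagged(text, sep="-"):
    """Split 'word-TAG' tokens into (word, tag) pairs.

    rpartition splits on the LAST separator, so a hyphenated word such as
    'well-known-JJ' keeps its internal hyphen. Untagged tokens get tag None.
    """
    pairs = []
    for token in text.split():
        word, found, tag = token.rpartition(sep)
        if not found:  # no separator in the token: treat it as untagged
            word, tag = token, None
        pairs.append((word, tag))
    return pairs

def words_with_tag(pairs, tag):
    """Return every word annotated with `tag`, in text order."""
    return [word for word, t in pairs if t == tag]

# Illustrative annotated sentence; tags other than VVZ are assumed.
tagged = "He-PPHS1 gives-VVZ Tony-NP1 books-NN2"
pairs = parse_tagged(tagged)
print(words_with_tag(pairs, "VVZ"))
```

With the tag stored next to each word, the query "all third person singular lexical verbs" becomes a simple filter and needs no grammatical analysis at search time.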

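  13. Leech (1993) describes seven maxims that should apply to the annotation of text corpora. 1. It should be possible to remove the annotation from an annotated corpus in order to revert to the raw corpus. 2. It should be possible to extract the annotations by themselves from the text. 3. The annotation scheme should be based on guidelines that are available to the end user. 4. It should be made clear how, and by whom, the annotation was carried out. 5. The end user should be made aware that the corpus annotation is not infallible, but simply a potentially useful tool. 6. Annotation schemes should be based as far as possible on widely agreed and theory-neutral principles. 7. No annotation scheme has the a priori right to be considered the standard.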
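
The first two maxims (removing the annotation to recover the raw text, and extracting the annotation on its own) are easy to satisfy when tags are attached in a regular way. The Python sketch below assumes a word-TAG format in which every tag is a run of uppercase letters and digits at the end of a token; that format, the pattern `TAG_PATTERN`, and the function names are assumptions for this example, not part of Leech's text.

```python
import re

# Assumed format: each token is 'word-TAG', where TAG starts with an uppercase
# letter, contains only uppercase letters and digits, and ends the token.
TAG_PATTERN = re.compile(r"-([A-Z][A-Z0-9]*)(?=\s|$)")

def strip_annotation(annotated):
    """Maxim 1: recover the raw text by deleting every tag."""
    return TAG_PATTERN.sub("", annotated)

def extract_annotation(annotated):
    """Maxim 2: pull out the tags by themselves, in text order."""
    return TAG_PATTERN.findall(annotated)

# Illustrative annotated sentence; tags other than VVZ are assumed.
annotated = "He-PPHS1 gives-VVZ Tony-NP1 books-NN2"
print(strip_annotation(annotated))
print(extract_annotation(annotated))
```

One limitation of this sketch: a raw token that itself ends in a hyphen plus capitals would be mistaken for a tagged word, which is one reason real schemes tend to use a separator that cannot occur in the text.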

  14. The roles of corpus data • Speech research • Lexical studies • Semantics • Sociolinguistics • Psycholinguistics

  15. Speech research • A spoken corpus provides a broad sample of speech, extending over a wide selection of variables such as speaker gender, speaker age, speaker social class, genre, etc. This allows generalizations to be made about spoken language, since the corpus is as wide and as representative as possible. It also allows variation within a given spoken language to be studied, and it provides a sample of naturalistic speech rather than speech elicited under artificial conditions.

  16. Lexical studies • A linguist who has access to a corpus can call up all the examples of a word or phrase from many millions of words of text in a few seconds. Dictionaries can be produced and revised much more quickly than before, thus providing up-to-date information about language. Definitions can also be more complete and precise, since a large number of natural examples are examined.

  17. Semantics • Corpus linguistics contributes to semantics by helping to establish an objective approach. Semantic distinctions are associated in texts with characteristic observable contexts (syntactic, morphological, and prosodic), so by considering the environment of linguistic entities, an empirical, objective indicator for a particular semantic distinction can be arrived at. Another role of corpora in semantics has been to establish more firmly the notions of fuzzy categories and gradience. When natural language is examined empirically in corpora, clear-cut boundaries do not exist; instead there are gradients of membership, which are connected with frequency of inclusion.

  18. Sociolinguistics • Although sociolinguistics is an empirical field of research, its data are often not rigorously sampled, and sometimes they are elicited rather than naturalistic. A corpus can provide a representative sample of naturalistic data that can be quantified.

  19. Psycholinguistics • In psycholinguistics, sampled corpora can provide researchers with more concrete and reliable information about frequency, including the frequencies of the different senses and parts of speech of ambiguous words. Second, corpus data can be used to examine the occurrence of speech errors in natural conversation. A third role for corpora lies in the analysis of language pathologies, where an accurate picture of the abnormal data must be constructed before it is possible to hypothesize and test what may be wrong with the human language processing system.
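
As a rough illustration of the frequency information mentioned for psycholinguistics, the Python sketch below counts how often an ambiguous word form occurs with each part-of-speech tag. The tiny hand-tagged sample, its tags, and the function name `tag_frequencies` are hypothetical; a real study would draw the counts from a large sampled corpus.

```python
from collections import Counter, defaultdict

def tag_frequencies(pairs):
    """Map each lowercased word form to a Counter of the tags it occurs with."""
    table = defaultdict(Counter)
    for word, tag in pairs:
        table[word.lower()][tag] += 1
    return table

# Hypothetical hand-tagged sample; "books" is ambiguous between noun and verb.
sample = [
    ("She", "PPHS1"), ("books", "VVZ"), ("flights", "NN2"),
    ("and", "CC"), ("reads", "VVZ"), ("books", "NN2"),
    ("Books", "NN2"), ("help", "VV0"),
]

table = tag_frequencies(sample)
total = sum(table["books"].values())
for tag, count in table["books"].most_common():
    # Relative frequency of each part of speech for the form "books".
    print(tag, count, round(count / total, 2))
```

The same table, built over word senses instead of tags, would give the sense frequencies the slide refers to, provided the corpus carries sense annotation.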

  20. Computer Mediated Communication • Mail and News • PowerPoint • Blog • Chatroom • Emoticons or Smileys

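  21. Computer Mediated Communication (CMC) • Characteristics: it foregrounds language and language use in computer-networked environments, and it addresses that focus with the methods of discourse analysis. • Text-based forms of CMC: e-mail, discussion groups, real-time chat, virtual-reality role-playing games, etc.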

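  22. Mail and News • Mailboxes and web surfing are the two main ways in which people access the Internet. • Web surfing is information retrieval. • The mailbox is for receiving and sending mail or news.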

  23. PowerPoint • Slides are shown on an electronic projector; the slides a user creates combine written text, video, images, sound, and animation.

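  24. Three forms: 1. Presentation software as a tool: software for writing the key points on slides and creating the accompanying audio-visual material. 2. The presentation as text: the materials of various kinds that are widely produced on slides around a given topic. 3. The presentation as a genre: a recurrent activity or form of meaning-making.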

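  25. Blog • A web journal with various links and posts, arranged in reverse chronological order so that the most recent post appears at the top of the page (Dan Gilmore).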

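  26. Features: 1. Basic unit: the post 2. Reverse chronological order 3. Cumulative and open-ended 4. Brief content 5. What's new 6. Links 7. Personal and informal 8. A shared voice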

  27. Chatroom • Put simply, a group of people talking on the Internet. • A website where users can exchange messages in real time.

  28. Emoticons or Smileys • These are strings made up of characters that can be found on a computer keyboard. • afk away from keyboard • bbl be back later • bbiab be back in a bit • brb be right back • btw by the way • cya see ya • gmta great minds think alike

  29. j/k just kidding • irl in real life • lol laughing out loud • nick internet nickname • rotfl rolling on the floor laughing • ttfn ta ta for now • ttyl talk to you later • wb welcome back
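
When chat data is prepared for corpus analysis, abbreviations like those listed above are sometimes expanded so that they can be searched as ordinary words. The Python sketch below shows one simple way to do this with a lookup table. It covers only a subset of the listed abbreviations, and the function name `expand` and its handling of trailing punctuation are assumptions made for the example.

```python
# Subset of the abbreviations listed on the slides, mapped to their expansions.
ABBREVIATIONS = {
    "afk": "away from keyboard",
    "bbl": "be back later",
    "brb": "be right back",
    "btw": "by the way",
    "cya": "see ya",
    "j/k": "just kidding",
    "irl": "in real life",
    "lol": "laughing out loud",
    "ttyl": "talk to you later",
    "wb": "welcome back",
}

def expand(message):
    """Replace each known abbreviation with its expansion, keeping end punctuation."""
    out = []
    for token in message.split():
        core = token.rstrip(".,!?")      # the token without trailing punctuation
        trailing = token[len(core):]     # whatever punctuation was removed
        out.append(ABBREVIATIONS.get(core.lower(), core) + trailing)
    return " ".join(out)

print(expand("brb, ttyl!"))
```

Tokens that are not in the table pass through unchanged, so the function can be applied to a whole chat log without first separating abbreviations from ordinary words.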

  30. 5.3 Emoticons/smileys • :-) ha ha • |-) hee hee • |-D ho ho • :-> hey hey • :-( boo boo • :-| hmmm • :-O oops

  31. :-* oooops • :-o uh oh! • {} 'no comment' • :-o oh, no! • #:-o oh, no! • :-0 ohhhhhh! • |:-O big ohhhhhh! • :-))) reeeaaaaaallllly happy • >;-(' I am spitting mad • :'-( I am crying

  32. <3 I love you • :'-) I am so happy, I am crying • :-@ I am screaming • ((H))) a big hug • :-X a big wet kiss • :-D I am laughing (at you!) • |-O I am bored/yawning/snoring • :-o zz z z Z Z I am bored • :-S I am confused • :-e I am disappointed

  33. (:-... I am heart-broken • |-| I am going to sleep • ( @ @ ) You're kidding! • @*&$!% you know what that means.... • **-( I am very, very shocked • :^D great! I like it! • M:-) I salute you (respect) • :+( I am hurt by that remark • =-o I am surprised • <=-O I am frightened

  34. =-<> I am awe-struck • $-> I am happily excited • :-~(~~~ I am moved to tears • =^) I am open-minded • >w oh really! (ironic)
