Building Your Own Learner Corpus
Wang Lifei (王立非) philipw@126.com
Outline:
• Definition, types, and size of corpora
• Principles and design of corpus building
• Building a spoken corpus
• Building a written corpus
• Text header annotation
What is a corpus?
Bodies of natural language material (whole texts, samples from texts, or sometimes just unconnected sentences), which are stored in machine-readable form. Voice data is also a corpus!
Why use a corpus? Why use electronic text?
• To study knowledge of language through specimens of language use: naturally occurring data ...
• Accessibility
• Speed: electronic text can be analyzed more quickly
• Accuracy: for some tasks, processing e-text is more accurate than scanning by eye
Types of corpora:
• A. By medium: printed, electronic text, digitized speech, video (e.g. for ASL), mixed
• C. By language variables:
  • monolingual vs. multilingual (CHILDES database)
  • original vs. translations (parallel)
  • native speaker vs. learner (e.g. corpora of learner compositions)
Taxonomies of corpora
• D. By language state: synchronic vs. diachronic (e.g. Brown vs. Helsinki Diachronic corpus)
• E. Plain vs. annotated
Corpus size:
• 50,000-100,000 words (small)
• >1 million words (medium)
• 50 million words (large)
• >100 million words (very large)
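Corpus design should take into account (Granger 2003):
• Language variables: medium, style, topic, technicality, task setting
• Learner variables: age, sex, mother tongue, region, other foreign languages, proficiency level, learning context, practical experience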
Corpus design:
• Learner corpus
  • Spoken sub-corpus: tagged / untagged
  • Written sub-corpus: tagged / untagged
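Spoken corpus construction workflow (sampling, speech, text, entry into the corpus):
• Check that audio files and transcript files are correctly paired.
• First round of proofreading.
• Second round of proofreading.
• Spot-check audio and transcript samples.
• Convert the transcripts from Word (WORD) format to plain text (TEXT) format.
• Segment the plain-text files.
• Add header annotation to each segmented text.
• Run statistical analyses on the corpus.
• Release part of the data for trial use in research.
• Publish the corpus on CD-ROM.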
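The pairing check that opens the workflow can be sketched in a few lines. This is only an illustration: the function name `check_pairing` and the `.wav`/`.txt` extensions are assumptions, not something the slides specify. It matches audio and transcript files on their base names and reports anything left unpaired.

```python
from pathlib import PurePath


def check_pairing(audio_files, text_files):
    """Return (audio without transcript, transcripts without audio).

    Files are matched on their stem, i.e. the name without extension.
    """
    audio_stems = {PurePath(name).stem for name in audio_files}
    text_stems = {PurePath(name).stem for name in text_files}
    return sorted(audio_stems - text_stems), sorted(text_stems - audio_stems)


# Hypothetical file names, for illustration only.
missing_text, missing_audio = check_pairing(
    ["01-47-01.wav", "01-47-02.wav"],
    ["01-47-01.txt", "01-47-03.txt"],
)
print(missing_text)   # ['01-47-02']
print(missing_audio)  # ['01-47-03']
```

Both lists should be empty before proofreading starts.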
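Spoken corpus structure (SECCL):
[Tree diagram; the hierarchy is not recoverable from the extracted text. Node labels: ALL DATA, RAW DATA, TEXT DATA, TAGGED DATA, DATA BY TASK, TASK A, TASK B, TASK C, DATA BY YEAR, 1996, 1997, 1998, 1999, 2000, 2001, 2002, SMALL CORPUS, ARTICLE, PAST TENSE.]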
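Naming the samples:
• Principle: names should be simple and clear, with no duplicates (letters + digits).
• SECCL uses a three-level code: year-group-number.
• Example: 01-47-01 is the speech sample of candidate No. 1 in group 47 in 2001.
• Speech samples from the same group are stored in one folder, named by year and group number (e.g. 2001-47).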
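A minimal sketch of how the three-level code could be decoded and mapped to its folder. The names `SampleId`, `parse_sample_id`, and `folder_name` are hypothetical, and two points are assumptions rather than stated rules: that every field is exactly two digits, and that two-digit years of 90 or above belong to the 1900s.

```python
import re
from typing import NamedTuple


class SampleId(NamedTuple):
    year: int
    group: int
    candidate: int


# Assumption: each of the three fields is exactly two digits.
ID_PATTERN = re.compile(r"(\d{2})-(\d{2})-(\d{2})")


def parse_sample_id(sample_id: str) -> SampleId:
    """Decode a year-group-number code such as '01-47-01'."""
    match = ID_PATTERN.fullmatch(sample_id)
    if match is None:
        raise ValueError(f"not a year-group-number code: {sample_id!r}")
    yy, group, candidate = (int(part) for part in match.groups())
    # Assumption: 90-99 means 199x, anything else means 20xx.
    year = 1900 + yy if yy >= 90 else 2000 + yy
    return SampleId(year, group, candidate)


def folder_name(sample: SampleId) -> str:
    """Folder shared by all samples of one group, e.g. '2001-47'."""
    return f"{sample.year}-{sample.group:02d}"


sample = parse_sample_id("01-47-01")
print(sample)               # SampleId(year=2001, group=47, candidate=1)
print(folder_name(sample))  # 2001-47
```

Rejecting malformed codes at parse time is one way to enforce the "no duplicates, simple and clear" principle before files are stored.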
Three types of annotation:
• Header annotation
• Error annotation
• Spoken-language features
Header annotation:
1) <SPOKEN> = Spoken
2) <TEM4> = Test for English Majors, Band 4
3) <GRADE2> = Grade 2 (second-year students)
4) <YEAR02> = Year 2002 (sample from 2002)
5) <GROUP01> = Group 01
6) <TASKTYPE1> = Task Type 1 (oral test task type 1)
7) <SEX1F> = Sex 1 Female (sex 1: female student), <Sex20> = Sex 2 Absent (sex 2: no male student)
8) <RANK07> = Rank 07 (ranked 7th within the oral test group)
Header annotation (example):
<SPOKEN> <TEM 4> <GRADE 2> <YEAR00> <GROUP65> <TASKTYPE 1> <SEX 1 F> <Sex 2 0> <RANK 07>
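As a rough illustration, a header line like the one above could be turned into a dictionary of fields. The function name `parse_spoken_header` and the output keys are hypothetical. Because the sample header is spaced and capitalised differently from the tag definitions, this sketch assumes that whitespace and letter case inside a tag are not significant and normalises both.

```python
import re

HEADER = ("<SPOKEN> <TEM 4> <GRADE 2> <YEAR00> <GROUP65> "
          "<TASKTYPE 1> <SEX 1 F> <Sex 2 0> <RANK 07>")


def parse_spoken_header(header: str) -> dict:
    """Map each header tag to a field; values are kept as strings."""
    fields = {}
    for raw in re.findall(r"<([^<>]+)>", header):
        # Assumption: "<Sex 2 0>" and "<SEX20>" mean the same thing.
        tag = re.sub(r"\s+", "", raw).upper()
        if tag == "SPOKEN":
            fields["medium"] = "spoken"
        elif m := re.fullmatch(r"SEX(\d)([A-Z0-9])", tag):
            # SEX1F -> speaker slot 1 is female; SEX20 -> slot 2 is absent.
            fields[f"sex{m.group(1)}"] = m.group(2)
        elif m := re.fullmatch(r"([A-Z]+)(\d+)", tag):
            fields[m.group(1).lower()] = m.group(2)
    return fields


info = parse_spoken_header(HEADER)
print(info["group"], info["rank"], info["sex1"], info["sex2"])  # 65 07 F 0
```

Keeping the values as strings preserves leading zeros such as `00` and `07`.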
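Spoken-feature annotation:
• Speaker roles: recorded as A and B.
• a) Self-repetition/repair: transcribe exactly as many repetitions as were heard. E.g. if "think" is heard twice, transcribe it as think think.
• b) Long pause:
  • A natural pause within a sentence is marked with a comma <,>.
  • A pause between complete sentences is marked with a period <.>.
  • A non-fluent pause (0.3 seconds) is marked with an ellipsis <…>, e.g. I … think.
• c) Wrong pronunciation: transcribe the correct form, then spell out the mispronunciation as heard and put it in angle brackets < >. E.g. a mispronounced "very" is transcribed as very <weri>, "Loise" as noise <loise>, and "Sheep" as ship <sheep>.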
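Once transcripts follow these conventions, simple fluency counts can be read straight off the text. The sketch below is one possible approach, not part of the SECCL tooling: the function name `spoken_features` is hypothetical, it treats any immediately repeated word as a self-repetition, and it accepts three plain dots as an alternative to the ellipsis character.

```python
import re


def spoken_features(transcript: str) -> dict:
    """Count non-fluent pauses and immediate word repetitions."""
    # Drop <...> annotations so heard forms are not counted as words.
    plain = re.sub(r"<[^<>]*>", " ", transcript)
    # Assumption: "..." may have been typed instead of the ellipsis character.
    plain = plain.replace("...", "…")
    pauses = plain.count("…")
    words = re.findall(r"[A-Za-z']+", plain.lower())
    repetitions = sum(1 for prev, cur in zip(words, words[1:]) if prev == cur)
    return {"nonfluent_pauses": pauses, "repetitions": repetitions}


print(spoken_features("I … think think it is very <weri> good."))
# {'nonfluent_pauses': 1, 'repetitions': 1}
```

Legitimate doubled words (such as "that that") would also be counted, so the figures are best treated as a first pass to be checked by hand.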
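Grammatical error annotation:
Put the error in < > and keep the correct form in the running text. For example, if "runned" is heard, transcribe it as ran <runned>.
He likes <like> to stay in the hotel.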
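Because the correct form stays in the text and the learner's form sits in angle brackets right after it, both versions can be recovered mechanically. The following is a minimal sketch under the assumption that each annotation follows a single word; the names `error_pairs` and `as_spoken` are hypothetical.

```python
import re

# A corrected word followed (with or without a space) by the heard form.
PAIR = re.compile(r"([A-Za-z']+)\s*<([A-Za-z']+)>")


def error_pairs(transcript: str) -> list:
    """List (correct form, form actually produced) pairs."""
    return [(m.group(1), m.group(2)) for m in PAIR.finditer(transcript)]


def as_spoken(transcript: str) -> str:
    """Rebuild what the learner actually said."""
    return PAIR.sub(lambda m: m.group(2), transcript)


line = "He likes <like> to stay in the hotel."
print(error_pairs(line))  # [('likes', 'like')]
print(as_spoken(line))    # He like to stay in the hotel.
```

Stripping the bracketed forms instead would give the corrected version, which is the one a POS tagger is likely to handle better.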
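Building the written corpus
• Genre: argumentative essay
• Length: 300 words or more
• Time: 40 minutes for timed essays; no time limit for untimed essays
• Topic: assigned
• Learner type: English majors
• Age: 18-22
• Sex: both male and female students
• Level: years 1-4
• Participating institutions: 9 universities of different types and levels across the country
• Collection: 100-150 essays per year level
• Total: about 1,600 timed essays and 1,600 untimed essays
• Corpus size: 1 million words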
Annotation of corpora
Automatic tagging and manual tagging
• A. Header mark-up
• B. Part-of-speech tagging
• C. Syntactic annotation (parsed corpora)
• D. Pragmatic annotation
• E. Rhetorical information
• F. Discourse structure
Header annotation:
1) <WCOMP> = Written Composition
2) <ARG> = Argumentation, <NAR> = Narration, <EXP> = Exposition
3) <GRADE3> = Grade 3 (third-year students)
4) <YEAR02> = Year 2002 (2002 cohort)
5) <TIMED> = timed essay
6) <UNTIMED> = untimed essay
7) <FYSY> = institution code
8) <LENGTH300W> = essay length of 300 words
9) <SCORE> = score
10) <gr-> = grammar error
11) <sp-> = spelling error
12) <mis-> = missing
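To show how these written-corpus tags could be used for filtering, here is a small sketch that reads a header into a dictionary. The header string is a hypothetical combination of the tags listed above, and `parse_written_header` and its output keys are illustrative names only. Tags the sketch does not recognise (such as the institution code) are simply skipped.

```python
import re

GENRES = {"ARG": "argumentation", "NAR": "narration", "EXP": "exposition"}


def parse_written_header(header: str) -> dict:
    """Extract genre, grade, year, timing and length from header tags."""
    info = {}
    for tag in re.findall(r"<([^<>]+)>", header):
        if tag in GENRES:
            info["genre"] = GENRES[tag]
        elif tag in ("TIMED", "UNTIMED"):
            info["timed"] = tag == "TIMED"
        elif m := re.fullmatch(r"GRADE(\d)", tag):
            info["grade"] = int(m.group(1))
        elif m := re.fullmatch(r"YEAR(\d{2})", tag):
            info["year"] = m.group(1)  # kept as text to preserve "02"
        elif m := re.fullmatch(r"LENGTH(\d+)W", tag):
            info["length_words"] = int(m.group(1))
    return info


# Hypothetical header assembled from the tags defined above.
info = parse_written_header(
    "<WCOMP> <ARG> <GRADE3> <YEAR02> <TIMED> <FYSY> <LENGTH300W>"
)
print(info["genre"], info["grade"], info["timed"], info["length_words"])
# argumentation 3 True 300
```

With headers parsed this way, sub-corpora such as "timed argumentative essays by third-year students" can be selected without opening the essay text.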
Automatic part-of-speech (POS) tagging:
<WCOMP> <NULL> <ARG> <NULL> <GRADE1> <NULL> <YR03> <NULL> <TIMED> <NULL> <SCORE?> <NULL> <LENGTH243W> <NULL>
<s>Education <NN1> is <VBZ> a <AT1> life-long <JJ> process <NN1> . <.> </s>
<s>Maybe <RR> many <DA2> people <NN> do <VD0> n't <XX> realize <VVI> it <PPH1> , <,> but <CCB> it <PPH1> is <VBZ> excatly <JJ> ture <NN1> . <.> </s>
<s>Even <RR> when <CS> a <AT1> baby <NN1> is <VBZ> still <RR> in <II> his <APPGE> ( <(> her <APPGE> ) <)> mother <NN1> 's <GE> stomach <NN1> , <,> this <DD1> process <NN1> has <VHZ> begun <VVN> . <.> </s>
<s>The <AT> mother <NN1> talks <NN2> to <II> him <PPHO1> ( <(> her <APPGE> ) <)> , <,> touches <VVZ> her <APPGE> baby <NN1> upon <II> her <APPGE> stomach <NN1> , <,> and <CC> the <AT> baby <NN1> responds <VVZ> by <II> making <VVG> sounds <NN2> , <,> kicking <VVG> gentlely <RR> and <CC> etc. <RA> some <DD> experts <NN2> advise <VV0> that <DD1> music <NN1> is <VBZ> good <JJ> for <IF> these <DD2> babies <NN2> . <.> </s>
<s>That <DD1> is <VBZ> called <VVN> " <"> education <NN1> before <CS> born <VVN> . <.> " <"> </s>
<s>After <CS> a <AT1> baby <NN1> is <VBZ> born <VVN> , <,> he <PPHS1> ( <(> she <PPHS1> ) <)> usually <RR> will <VM> be <VBI> taught <VVN> to <TO> say <VVI> " <"> Mom <NN1> " <"> or <CC> " <"> Dad <NN1> . <.> " <"> </s>
<s>he <PPHS1> will <VM> also <RR> be <VBI> lift <NN1> upon <II> the <AT> floor <NN1> to <TO> learn <VVI> to <TO> walk <VVI> . <.> </s>
<s>These <DD2> seem <VV0> nothing <PN1> special <JJ> , <,> but <CCB> everyone <PN1> experince <VV0> the <AT> process <NN1> , <,> or <CC> how <RRQ> could <VM> all <DB> of <IO> us <PPIO2> speak <VV0> and <CC> walk <VV0> ? <?> </s>
<s>That <DD1> is <VBZ> a <AT1> kind <NN1> of <IO> education <NN1> , <,> too <RR> . <.> </s>
<s>This <DD1> kind <NN1> of <IO> education <NN1> was <VBDZ> not <XX> paid <VVN> enough <DD> attention <NN1> , <,> until <CS> recent <JJ> years <NNT2> . <.> </s>
<s>A <AT1> lot <NN1> of <IO> researches <NN2> have <VH0> been <VBN> done <VDN> , <,> and <CC> people <NN> find <VV0> that <CST> it <PPH1> 's <VBZ> very <RG> important <JJ> for <IF> a <AT1> young <JJ> baby <NN1> to <TO> receive <VVI> a <AT1> good <JJ> education <NN1> . <.> </s>
<s>At <II> the <AT> age <NN1> of <IO> 7 <MC> or <CC> 8 <MC> , <,> children <NN2> will <VM> be <VBI> sent <VVN> to <II> primary <JJ> schools <NN2> . <.> </s>
<s>It <PPH1> 's <VBZ> " <"> real <JJ> education <NN1> " <"> in <II> most <DAT> of <IO> people <NN> 's <GE> view <NN1> . <.> </s>
<s>In <II> my <APPGE> opinion <NN1> , <,> learning <VVG> at <II> school <NN1> is <VBZ> be <VBI> educated <VVN> , <,> and <CC> also <RR> a <AT1> kind <NN1> of <IO> education <NN1> . <.> </s>
<s>In <II> China <NP1> , <,> students <NN2> are <VBR> at <II> school <NN1> , <,> such <II21> as <II22> primary <JJ> school <NN1> , <,> junior <JJ> middle <JJ> school <NN1> , <,> senior <JJ> middle <JJ> school <NN1> and <CC> university <NN1> . <.> </s>
<s>That <DD1> 's <VBZ> to <TO> say <VVI> it <PPH1> will <VM> take <VVI> a <AT1> person <NN1> nearly <RR> sixteen <MC> years <NNT2> to <TO> study <VVI> at <II> school <NN1> . <.> </s>
<s>It <PPH1> 's <VBZ> a <AT1> long <JJ> time <NNT1> , <,> however <RR> , <,> it <PPH1> is <VBZ> not <XX> enough <RR> to <TO> learn <VVI> . <.> </s>
<s>Nowadays <RT> , <,> it <PPH1> 's <VBZ> very <RG> pouplar <JJ> for <IF> adults <NN2> at <II> work <NN1> to <TO> learn <VVI> . <.> </s>
<s>They <PPHS2> can <VM> learn <VVI> in <II> skilling <VVG> school <NN1> , <,> on <II> the <AT> internet <NN1> and <CC> soon <RR> . <.> </s>
<s>It <PPH1> proves <VVZ> that <DD1> education <NN1> on <II> adults <NN2> is <VBZ> still <RR> going <VVG> on <RP> . <.> </s>
<s>As <II21> for <II22> old <JJ> people <NN> , <,> I <PPIS1> do <VD0> n't <XX> know <VVI> much <RR> , <,> but <CCB> I <PPIS1> know <VV0> there <EX> 're <VBR> universities <NN2> for <IF> the <AT> old <JJ> , <,> where <CS> they <PPHS2> can <VM> learn <VVI> writing <NN1> and <CC> drawing <NN1> , <,> music <NN1> and <CC> a <AT1> lot <NN1> of <IO> things <NN2> . <.> </s>
<s>Many <DA2> old <JJ> people <NN> are <VBR> fond <JJ> of <IO> morning <NNT1> exercise <NN1> , <,> and <CC> it <PPH1> takes <VVZ> them <PPHO2> a <AT1> lot <NN1> of <IO> time <NNT1> to <TO> learn <VVI> . <.> </s>
<s>Do <VD0> n't <XX> you <PPY> believe <VVI> that <DD1> education <NN1> is <VBZ> a <AT1> life-long <JJ> process <NN1> ? <?> </s>
Please welcome Liang Maocheng (梁茂成) to speak!
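Tagged output in this word <TAG> layout is easy to load for counting. The sketch below assumes only what the sample shows: sentences wrapped in <s>...</s> and every token followed by one tag in angle brackets. The name `parse_tagged` is hypothetical, and the two sample sentences are copied from the essay above.

```python
import re
from collections import Counter

SENTENCE = re.compile(r"<s>(.*?)</s>", re.S)
# One token, one space, then its tag in angle brackets.
TOKEN = re.compile(r"(\S+) <([^<>\s]+)>")


def parse_tagged(text: str) -> list:
    """Return one list of (word, tag) pairs per <s>...</s> sentence."""
    return [TOKEN.findall(body) for body in SENTENCE.findall(text)]


sample = (
    "<s>Education <NN1> is <VBZ> a <AT1> life-long <JJ> process <NN1> . <.> </s>"
    "<s>That <DD1> is <VBZ> a <AT1> kind <NN1> of <IO> education <NN1> , <,> too <RR> . <.> </s>"
)

sentences = parse_tagged(sample)
print(len(sentences))   # 2
print(sentences[0][0])  # ('Education', 'NN1')

tag_counts = Counter(tag for sent in sentences for _, tag in sent)
print(tag_counts.most_common(1))  # [('NN1', 4)]
```

The header tags in front of the first <s> are ignored here because they fall outside any sentence element.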
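P. Nation and S. Granger will attend the 4th National International Symposium on English Writing Teaching and Research, September 23-24, University of International Business and Economics, Beijing. Everyone is welcome. Registration: ewconference@126.com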
Corpus is the mainstream! -- G. Leech
Thank You