Building Your Own Learner Corpus

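Building Your Own Learner Corpus. Wang Lifei, philipw@126.com. Outline: definition, types, and size of corpora; principles and design of corpus building; building a spoken corpus; building a written corpus; text header annotation. What is a corpus? Bodies of natural language material (whole texts, samples from texts, or sometimes just unconnected sentences), which are stored in machine-readable form. Voice data is also a corpus!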

Presentation Transcript


  1. Building Your Own Learner Corpus. Wang Lifei, philipw@126.com

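  2. Outline: • Definition, types, and size of corpora • Principles and design of corpus building • Building a spoken corpus • Building a written corpus • Text header annotation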

  3. What is a corpus? Bodies of natural language material (whole texts, samples from texts, or sometimes just unconnected sentences), which are stored in machine-readable form.

  4. Voice data is also a corpus!

  5. Why use a corpus? Why use electronic text? • To study knowledge of language through specimens of language use: naturally-occurring data ... • Accessibility • Speed: can be analyzed more quickly • Accuracy: for some tasks, processing e-text is more accurate than eye scan

  6. Types: • A. by medium: printed, electronic text, digitized speech, video (e.g. for ASL), mixed • C. language variables: • monolingual vs. multilingual (CHILDES database) • original vs. translations (parallel) • native speaker vs. learner (e.g. corpora of learner compositions)

  7. Taxonomies of corpora • D. language states: synchronic vs. diachronic (e.g. Brown vs. Helsinki Diachronic corpus) • E. Plain vs. annotated

  8. Corpus size: 50,000-100,000 words (small); more than 1 million words (medium); 50 million words (large); more than 100 million words (very large)

  9. Learner corpus design

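  10. Corpus design should consider (Granger 2003): • Language: medium, style, topic, technicality, task setting • Learner: age, sex, mother tongue, region, other foreign languages, proficiency level, learning context, practical experience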

  11. Corpus design: the learner corpus is divided into a spoken subcorpus (tagged and untagged) and a written subcorpus (tagged and untagged)

  12. Hierarchical structure of the corpus:

  13. Workflow for building a spoken corpus:

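  14. • Check that each audio file is paired with its transcript file. • First proofreading. • Second proofreading. • Spot-check audio and transcript samples. • Convert the transcripts from Word format to plain text. • Segment the plain-text files. • Add a text header to each segmented text. • Run statistical analyses on the corpus. • Release part of the corpus for trial use in research. • Publish the corpus on CD-ROM.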
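The first step in the list above, checking that audio and transcript files are paired, can be sketched as follows. This is a minimal illustration, not the project's actual tooling: the function name `find_unpaired` and the `.wav`/`.txt` extensions are assumptions, and files are treated as a pair when they share the same name without the extension.

```python
from pathlib import PurePath


def find_unpaired(audio_names, transcript_names):
    """Compare file stems and report files that have no partner.

    Returns (stems with audio but no transcript,
             stems with a transcript but no audio), both sorted.
    """
    audio_stems = {PurePath(name).stem for name in audio_names}
    text_stems = {PurePath(name).stem for name in transcript_names}
    return sorted(audio_stems - text_stems), sorted(text_stems - audio_stems)


# Hypothetical file names, used only for illustration.
audio = ["01-47-01.wav", "01-47-02.wav", "01-47-03.wav"]
texts = ["01-47-01.txt", "01-47-03.txt", "01-47-04.txt"]

missing_text, missing_audio = find_unpaired(audio, texts)
print(missing_text)   # ['01-47-02']
print(missing_audio)  # ['01-47-04']
```

Running such a check before proofreading means that a missing or misnamed file is caught before anyone spends time on it.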

  15. Spoken corpus construction workflow (diagram labels): text, adding to the corpus, sampling, speech

  16. Spoken corpus structure (diagram labels): SECCL; ALL DATA; RAW DATA; TEXT DATA; TAGGED DATA; DATA BY TASK (TASK A, TASK B, TASK C); DATA BY YEAR (1996, 1997, 1998, 1999, 2000, 2001, 2002); SMALL CORPUS; ARTICLE; PAST TENSE

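  17. Naming the samples: • Principle: names should be simple and clear, with no duplicates (letters + digits). • SECCL uses a three-level code: year-group-number. • For example, 01-47-01 is the speech sample of candidate No. 1 in group 47 in 2001. • Speech samples from the same group are stored in one folder, named by year and group number (e.g. 2001-47).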
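The year-group-number scheme above can be parsed mechanically. The sketch below is illustrative only: `SampleId`, `parse_sample_name`, and `folder_name` are hypothetical names, and the rule for expanding the two-digit year (90-99 becomes 19xx, everything else 20xx) is an assumption chosen to cover a corpus spanning the late 1990s and early 2000s.

```python
import re
from typing import NamedTuple


class SampleId(NamedTuple):
    year: int
    group: int
    number: int


# Three two-digit fields separated by hyphens, e.g. "01-47-01".
_NAME = re.compile(r"(\d{2})-(\d{2})-(\d{2})")


def parse_sample_name(name: str) -> SampleId:
    """Split a year-group-number code into its three parts."""
    match = _NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"not a year-group-number code: {name!r}")
    yy, group, number = (int(part) for part in match.groups())
    # Assumption: 90-99 means 19xx, anything else means 20xx.
    year = 1900 + yy if yy >= 90 else 2000 + yy
    return SampleId(year, group, number)


def folder_name(sample: SampleId) -> str:
    """Folder shared by all samples of one group, e.g. '2001-47'."""
    return f"{sample.year}-{sample.group:02d}"


sample = parse_sample_name("01-47-01")
print(sample)               # SampleId(year=2001, group=47, number=1)
print(folder_name(sample))  # 2001-47
```

Rejecting names that do not match the pattern is a cheap way to enforce the "no duplicates, simple and clear" principle at intake time.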

  18. Three types of annotation: • Text header annotation • Error annotation • Spoken-language features

  19. Text header annotation: 1) <SPOKEN> = Spoken 2) <TEM4> = Test for English Majors, Band 4 3) <GRADE2> = Grade 2 (second-year students) 4) <YEAR02> = Year 2002 (sample from 2002) 5) <GROUP01> = Group 01 6) <TASKTYPE1> = Task Type 1 (oral test task type 1) 7) <SEX1F> = Sex 1 Female (sex 1: female student), <Sex20> = Sex 2 Absent (sex 2: no male student) 8) <RANK07> = Rank 07 (ranked 7th within the oral test group)

  20. Text header annotation (example): <SPOKEN> <TEM 4> <GRADE 2> <YEAR00> <GROUP65> <TASKTYPE 1> <SEX 1 F> <Sex 2 0> <RANK 07>
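A header like the one above can be read into named fields. The following is a rough sketch under stated assumptions: the field names (`grade`, `task_type`, and so on) and the function `parse_header` are invented for illustration, and because the two slides write the tags with different spacing and case (`<TEM4>` vs. `<TEM 4>`, `<SEX1F>` vs. `<Sex 2 0>`), the sketch removes internal whitespace and upper-cases each tag before matching.

```python
import re

# Hypothetical field names mapped to the tag shapes listed in the slides.
FIELD_PATTERNS = {
    "grade": re.compile(r"GRADE(\d+)"),
    "year": re.compile(r"YEAR(\d{2})"),
    "group": re.compile(r"GROUP(\d+)"),
    "task_type": re.compile(r"TASKTYPE(\d+)"),
    "sex1": re.compile(r"SEX1([A-Z0-9])"),
    "sex2": re.compile(r"SEX2([A-Z0-9])"),
    "rank": re.compile(r"RANK(\d+)"),
}


def parse_header(header: str) -> dict:
    """Turn '<TAG> <TAG> ...' into a dict of field values.

    Tags that carry no value (e.g. SPOKEN, TEM4) are collected under 'flags'.
    Values are kept as strings so that leading zeros survive.
    """
    fields = {}
    flags = []
    for raw in re.findall(r"<([^<>]+)>", header):
        tag = re.sub(r"\s+", "", raw).upper()  # "<Sex 2 0>" -> "SEX20"
        for name, pattern in FIELD_PATTERNS.items():
            match = pattern.fullmatch(tag)
            if match:
                fields[name] = match.group(1)
                break
        else:
            flags.append(tag)
    fields["flags"] = flags
    return fields


header = ("<SPOKEN> <TEM 4> <GRADE 2> <YEAR00> <GROUP65> "
          "<TASKTYPE 1> <SEX 1 F> <Sex 2 0> <RANK 07>")
info = parse_header(header)
print(info["flags"])               # ['SPOKEN', 'TEM4']
print(info["group"], info["rank"])  # 65 07
```

With headers in this form, subsets of the corpus (for example, all samples from one task type) can be selected by filtering on the parsed fields.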

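  21. Annotating spoken-language features: • Speaker roles: recorded as A and B. • a) Self-repetition/repair: transcribe every repetition exactly as heard. For example, if "think" is heard twice, transcribe it as think think. • b) Long pause: • a natural pause within a sentence is marked with a comma <,>; • a pause between complete sentences is marked with a full stop <.>; • a disfluent pause (0.3 seconds) is marked with an ellipsis <…>, e.g. I … think. • c) Wrong pronunciation: write the correct form in the transcript, then spell out the mispronunciation as heard and put it in angle brackets < >. For example, a mispronounced "very" is recorded as very <weri>, "loise" is recorded as noise <loise>, and "sheep" is recorded as ship <sheep>.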

  22. Annotating grammatical errors: put the error in < > and keep the correct form in the running text. For example, if "runned" is heard, record it as ran <runned>. He likes <like> to stay in the hotel.
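Because the convention above places the correct form in the text and the form actually heard in angle brackets right after it, both versions of an utterance can be recovered from one transcript. The sketch below is illustrative only; the function names are hypothetical, and it assumes each annotation is a single word attached to the single word immediately before it.

```python
import re

# A word followed (with or without a space) by a one-word <...> annotation.
ANNOTATION = re.compile(r"(\S+)\s*<([^<>\s]+)>")


def extract_errors(transcript: str):
    """Return (correct form, form as heard) pairs in order of appearance."""
    return [(m.group(1), m.group(2)) for m in ANNOTATION.finditer(transcript)]


def intended_text(transcript: str) -> str:
    """Drop the annotations, leaving the corrected text."""
    return re.sub(r"\s*<[^<>\s]+>", "", transcript)


def heard_text(transcript: str) -> str:
    """Replace each corrected word with what the speaker actually said."""
    return ANNOTATION.sub(lambda m: m.group(2), transcript)


line = "He likes <like> to stay in the hotel."
print(extract_errors(line))  # [('likes', 'like')]
print(intended_text(line))   # He likes to stay in the hotel.
print(heard_text(line))      # He like to stay in the hotel.
```

Keeping the corrected form in the running text means ordinary word searches still find the intended word, while the bracketed form preserves the learner's original production for error analysis.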

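  23. Building a written corpus: • Genre: argumentative essays • Length: more than 300 words • Time: 40 minutes for timed essays; no limit for untimed essays • Topic: assigned • Learner type: English majors • Age: 18-22 • Sex: both male and female students • Level: years 1-4 • Participating institutions: 9 universities nationwide, of different types and levels • Collection size: 100-150 essays per year level • Total: about 1,600 timed essays and 1,600 untimed essays • Corpus size: 1 million words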

  24. Annotation of corpora Automatic tagging and manual tagging • A. Header mark-up • B. Part-of-speech tagging • C. Syntactic annotation (parsed corpora) • D. Pragmatic annotation • E. Rhetorical information • F. Discourse structure

  25. Text header annotation: 1) <WCOMP> = Written Composition 2) <ARG> = Argumentation, <NAR> = Narration, <EXP> = Exposition 3) <GRADE3> = Grade 3 (third-year students) 4) <YEAR02> = Year 2002 (2002 cohort) 5) <TIMED> = timed essay 6) <UNTIMED> = untimed essay 7) <FYSY> = institution code 8) <LENGTH300W> = essay length of 300 words 9) <SCORE> = score 10) <gr-> = grammar error 11) <sp-> = spelling error 12) <mis-> = missing (omission)

  26. Automatic part-of-speech (POS) tagging: <WCOMP> <NULL> <ARG> <NULL> <GRADE1> <NULL> <YR03> <NULL> <TIMED> <NULL> <SCORE?> <NULL> <LENGTH243W> <NULL> <s>Education <NN1> is <VBZ> a <AT1> life-long <JJ> process <NN1> . <.> </s><s>Maybe <RR> many <DA2> people <NN> do <VD0> n't <XX> realize <VVI> it <PPH1> , <,> but <CCB> it <PPH1> is <VBZ> excatly <JJ> ture <NN1> . <.> </s><s>Even <RR> when <CS> a <AT1> baby <NN1> is <VBZ> still <RR> in <II> his <APPGE> ( <(> her <APPGE> ) <)> mother <NN1> 's <GE> stomach <NN1> , <,> this <DD1> process <NN1> has <VHZ> begun <VVN> . <.> </s><s>The <AT> mother <NN1> talks <NN2> to <II> him <PPHO1> ( <(> her <APPGE> ) <)> , <,> touches <VVZ> her <APPGE> baby <NN1> upon <II> her <APPGE> stomach <NN1> , <,> and <CC> the <AT> baby <NN1> responds <VVZ> by <II> making <VVG> sounds <NN2> , <,> kicking <VVG> gentlely <RR> and <CC> etc. <RA> some <DD> experts <NN2> advise <VV0> that <DD1> music <NN1> is <VBZ> good <JJ> for <IF> these <DD2> babies <NN2> . <.> </s><s>That <DD1> is <VBZ> called <VVN> " <"> education <NN1> before <CS> born <VVN> . <.> " <"> </s><s>After <CS> a <AT1> baby <NN1> is <VBZ> born <VVN> , <,> he <PPHS1> ( <(> she <PPHS1> ) <)> usually <RR> will <VM> be <VBI> taught <VVN> to <TO> say <VVI> " <"> Mom <NN1> " <"> or <CC> " <"> Dad <NN1> . <.> " <"> </s><s>he <PPHS1> will <VM> also <RR> be <VBI> lift <NN1> upon <II> the <AT> floor <NN1> to <TO> learn <VVI> to <TO> walk <VVI> . <.> </s><s>These <DD2> seem <VV0> nothing <PN1> special <JJ> , <,> but <CCB> everyone <PN1> experince <VV0> the <AT> process <NN1> , <,> or <CC> how <RRQ> could <VM> all <DB> of <IO> us <PPIO2> speak <VV0> and <CC> walk <VV0> ? <?> </s><s>That <DD1> is <VBZ> a <AT1> kind <NN1> of <IO> education <NN1> , <,> too <RR> . <.> </s><s>This <DD1> kind <NN1> of <IO> education <NN1> was <VBDZ> not <XX> paid <VVN> enough <DD> attention <NN1> , <,> until <CS> recent <JJ> years <NNT2> . <.> </s><s>A <AT1> lot <NN1> of <IO> researches <NN2> have <VH0> been <VBN> done <VDN> , <,> and <CC> people <NN> find <VV0> that <CST> it <PPH1> 's <VBZ> very <RG> important <JJ> for <IF> a <AT1> young <JJ> baby <NN1> to <TO> receive <VVI> a <AT1> good <JJ> education <NN1> . <.> </s><s>At <II> the <AT> age <NN1> of <IO> 7 <MC> or <CC> 8 <MC> , <,> children <NN2> will <VM> be <VBI> sent <VVN> to <II> primary <JJ> schools <NN2> . <.> </s><s>It <PPH1> 's <VBZ> " <"> real <JJ> education <NN1> " <"> in <II> most <DAT> of <IO> people <NN> 's <GE> view <NN1> . <.> </s><s>In <II> my <APPGE> opinion <NN1> , <,> learning <VVG> at <II> school <NN1> is <VBZ> be <VBI> educated <VVN> , <,> and <CC> also <RR> a <AT1> kind <NN1> of <IO> education <NN1> . <.> </s><s>In <II> China <NP1> , <,> students <NN2> are <VBR> at <II> school <NN1> , <,> such <II21> as <II22> primary <JJ> school <NN1> , <,> junior <JJ> middle <JJ> school <NN1> , <,> senior <JJ> middle <JJ> school <NN1> and <CC> university <NN1> . <.> </s><s>That <DD1> 's <VBZ> to <TO> say <VVI> it <PPH1> will <VM> take <VVI> a <AT1> person <NN1> nearly <RR> sixteen <MC> years <NNT2> to <TO> study <VVI> at <II> school <NN1> . <.> </s><s>It <PPH1> 's <VBZ> a <AT1> long <JJ> time <NNT1> , <,> however <RR> , <,> it <PPH1> is <VBZ> not <XX> enough <RR> to <TO> learn <VVI> . <.> </s><s>Nowadays <RT> , <,> it <PPH1> 's <VBZ> very <RG> pouplar <JJ> for <IF> adults <NN2> at <II> work <NN1> to <TO> learn <VVI> . <.> </s><s>They <PPHS2> can <VM> learn <VVI> in <II> skilling <VVG> school <NN1> , <,> on <II> the <AT> internet <NN1> and <CC> soon <RR> . <.> </s><s>It <PPH1> proves <VVZ> that <DD1> education <NN1> on <II> adults <NN2> is <VBZ> still <RR> going <VVG> on <RP> . <.> </s><s>As <II21> for <II22> old <JJ> people <NN> , <,> I <PPIS1> do <VD0> n't <XX> know <VVI> much <RR> , <,> but <CCB> I <PPIS1> know <VV0> there <EX> 're <VBR> universities <NN2> for <IF> the <AT> old <JJ> , <,> where <CS> they <PPHS2> can <VM> learn <VVI> writing <NN1> and <CC> drawing <NN1> , <,> music <NN1> and <CC> a <AT1> lot <NN1> of <IO> things <NN2> . <.> </s><s>Many <DA2> old <JJ> people <NN> are <VBR> fond <JJ> of <IO> morning <NNT1> exercise <NN1> , <,> and <CC> it <PPH1> takes <VVZ> them <PPHO2> a <AT1> lot <NN1> of <IO> time <NNT1> to <TO> learn <VVI> . <.> </s><s>Do <VD0> n't <XX> you <PPY> believe <VVI> that <DD1> education <NN1> is <VBZ> a <AT1> life-long <JJ> process <NN1> ? <?> </s> Now over to Liang Maocheng!
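The tagged sample above follows a regular layout: each sentence sits between <s> and </s>, and every token is followed by its tag in angle brackets (the tags appear to follow the CLAWS C7 tagset, though the slides do not say so). A minimal sketch of reading that layout and counting tags is given below; `parse_tagged` is a hypothetical helper, and the short `sample` string is an excerpt of the essay above rather than the full text.

```python
import re
from collections import Counter

SENTENCE = re.compile(r"<s>(.*?)</s>", re.DOTALL)
# A token, whitespace, then its tag in angle brackets, e.g. "process <NN1>".
TOKEN = re.compile(r"(\S+)\s+<([^\s<>]+)>")


def parse_tagged(text: str):
    """Return one list of (token, tag) pairs per <s>...</s> sentence.

    Header tags outside the sentence markers are ignored.
    """
    return [TOKEN.findall(body) for body in SENTENCE.findall(text)]


sample = (
    "<WCOMP> <NULL> <ARG> <NULL> "
    "<s>Education <NN1> is <VBZ> a <AT1> life-long <JJ> process <NN1> . <.> </s>"
    "<s>Do <VD0> n't <XX> you <PPY> believe <VVI> that <DD1> education <NN1> "
    "is <VBZ> a <AT1> life-long <JJ> process <NN1> ? <?> </s>"
)

sentences = parse_tagged(sample)
tag_counts = Counter(tag for sentence in sentences for _, tag in sentence)

print(len(sentences))     # 2
print(sentences[0][:2])   # [('Education', 'NN1'), ('is', 'VBZ')]
print(tag_counts["NN1"])  # 4
```

Tag frequencies of this kind are the starting point for the statistical analyses mentioned in the workflow, for example comparing how often learners at different year levels use a given word class.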

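  27. P. Nation and S. Granger will attend the 4th International Symposium on the Teaching and Research of English Writing in China, September 23-24, University of International Business and Economics, Beijing. Everyone is welcome. To register: ewconference@126.com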

  28. Corpus is the mainstream! (G. Leech) Thank you
