
Lecture 5: Annotating Things & List Comprehensions




Presentation Transcript


  1. Lecture 5: Annotating Things & List Comprehensions. Methods in Computational Linguistics II, Queens College

  2. Linguistic Annotation Text only takes us so far. People are reliable judges of linguistic behavior. We can model with machines, but for “gold-standard” truth, we ask people to make judgments about linguistic qualities.

  3. Example Linguistic Annotations: sentence boundaries, part-of-speech tags, phonetic transcription, syntactic parse trees, speaker identity, semantic role, speech act, document topic, argument structure, word sense, and many more.

  4. We need techniques to process these. Every corpus has its own format for linguistic annotation, so we need to parse annotation formats.

  5. Three formats for the same annotation:
Slash tags: The/DET Dog/NN is/VB fast/JJ ./.
XML: <word ortho="The" pos="DET"></word> <word ortho="Dog" pos="NN"></word> <word ortho="is" pos="VB"></word> <word ortho="fast" pos="JJ"></word>
Character offsets: The dog is fast.  1,3,DET  5,7,NN  9,10,VB  12,15,JJ  16,16,.
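The slash-tag format above can be parsed with plain string methods. A minimal sketch, assuming whitespace-separated tokens:

```python
# Parse the slash-tag format from the slide into [word, tag] pairs.
# rsplit("/", 1) splits on the LAST slash, so the token "./." still
# yields the word "." with the tag ".".
tagged = "The/DET Dog/NN is/VB fast/JJ ./."

pairs = [token.rsplit("/", 1) for token in tagged.split()]
print(pairs)
# [['The', 'DET'], ['Dog', 'NN'], ['is', 'VB'], ['fast', 'JJ'], ['.', '.']]
```

The XML and character-offset formats would each need their own parser, which is exactly why annotation-format handling matters.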

  6. Constructing a linguistic corpus. Decisions that need to be made:
• Why are you doing this?
• What material will be collected?
• How will it be collected? Automatically? Manually? Found material vs. laboratory language?
• What meta-information will be stored?
• What manual annotations are required?
• How will each annotation be defined?
• How many annotators will be used?
• How will agreement be assessed?
• How will disagreements be resolved?
• How will the material be disseminated?
• Is this covered by your IRB if the material is the result of a human-subjects protocol?

  7. Part of Speech Tagging Task: Given a string of words, identify the parts of speech for each word.

  8. Part-of-speech tagging captures surface-level syntax. It is a primary operation underlying parsing, word sense disambiguation, semantic role labeling, and segmentation (discourse, topic, sentence).

  9. How is it done? Learn from data: annotated data vs. unlabeled data.

  10. Learn the association from Tag to Word
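The idea on slide 10 can be sketched as a unigram lookup: count how often each word carries each tag in a toy annotated sample (the sample below is hypothetical, not from the slides), then tag each word with its most frequent tag.

```python
from collections import Counter, defaultdict

# Hypothetical annotated data; a real corpus would be far larger.
annotated = [("The", "DET"), ("dog", "NN"), ("is", "VB"),
             ("fast", "JJ"), ("The", "DET"), ("cat", "NN")]

# Count how often each word appears with each tag.
counts = defaultdict(Counter)
for word, tag in annotated:
    counts[word][tag] += 1

def tag_word(word):
    """Return the word's most frequent tag in the annotated data."""
    if word in counts:
        return counts[word].most_common(1)[0][0]
    return "UNK"  # unseen token: one of the limitations on slide 11

print(tag_word("dog"))    # NN
print(tag_word("zebra"))  # UNK
```

Note how the unseen-token case surfaces immediately, which motivates the next slide's list of limitations.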

  11. Limitations: unseen tokens, uncommon interpretations, long-distance dependencies.

  12. Parsing Generate a parse tree.

  13. Parsing: generate a parse tree from the surface form (words) of the text and its part-of-speech tags.

  14. Parsing Styles

  15. Parsing styles

  16. Context-Free Grammars for Parsing
S → VP
S → NP VP
NP → Det Nom
Nom → Noun
Nom → Adj Nom
VP → Verb Nom
Det → "A", "The"
Noun → "I", "John", "Address"
Verb → "Gave"
Adj → "My", "Blue"
Adv → "Quickly"
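One way to sketch this grammar in Python (the dict representation and the `expand` helper are illustrative, not from the slides) is to map each nonterminal to its list of productions and expand deterministically, always taking the first production:

```python
# Slide 16's grammar as a dict: nonterminal -> list of productions.
# Symbols not in the dict are terminals (words).
grammar = {
    "S":    [["VP"], ["NP", "VP"]],
    "NP":   [["Det", "Nom"]],
    "Nom":  [["Noun"], ["Adj", "Nom"]],
    "VP":   [["Verb", "Nom"]],
    "Det":  [["A"], ["The"]],
    "Noun": [["I"], ["John"], ["Address"]],
    "Verb": [["Gave"]],
    "Adj":  [["My"], ["Blue"]],
}

def expand(symbol):
    """Expand a symbol by always choosing its first production."""
    if symbol not in grammar:        # terminal: emit the word itself
        return [symbol]
    words = []
    for sym in grammar[symbol][0]:   # first production only
        words.extend(expand(sym))
    return words

print(" ".join(expand("S")))  # Gave I
```

The output "Gave I" is grammatical under this toy grammar but nonsensical, which previews the limitations on the next slide.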

  17. Limitations The grammar must be built by hand. Can’t handle ungrammatical sentences. Can’t resolve ambiguity.

  18. Probabilistic Parsing
• Assign each transition a probability.
• Find the parse with the greatest "likelihood".
• Build a table and count: how many times does each transition happen?
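The "build a table and count" step can be sketched in a few lines, assuming a hypothetical list of productions observed in a treebank: count each rule, then divide by the count of its left-hand side.

```python
from collections import Counter

# Hypothetical observed productions from a toy treebank.
observed = ["S -> NP VP", "S -> NP VP", "S -> VP",
            "Nom -> Noun", "Nom -> Adj Nom", "Nom -> Noun"]

rule_counts = Counter(observed)
lhs_counts = Counter(rule.split(" -> ")[0] for rule in observed)

# P(rule) = count(rule) / count(left-hand side)
prob = {rule: count / lhs_counts[rule.split(" -> ")[0]]
        for rule, count in rule_counts.items()}

print(prob["S -> NP VP"])  # 2 of 3 S expansions, i.e. about 0.667
```

A probabilistic parser then scores each candidate tree by multiplying the probabilities of the rules it uses.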

  19. Segmentation
• Sentence segmentation
• Topic segmentation
• Speaker segmentation
• Phrase chunking: NP, VP, PP, subclause, etc.

  20. Split into words sent = "That isn't the problem, Bob." sent.split() vs. nltk.word_tokenize(sent)
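The difference is easy to see by running the `split()` half of the comparison (no NLTK needed): whitespace splitting leaves punctuation glued to neighboring words, which a real tokenizer such as nltk.word_tokenize would separate.

```python
# str.split() breaks only on whitespace, so the comma stays attached
# to "problem" and the period to "Bob" -- a tokenizer would split
# these (and the contraction "isn't") into separate tokens.
sent = "That isn't the problem, Bob."

print(sent.split())
# ['That', "isn't", 'the', 'problem,', 'Bob.']
```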

  21. List Comprehensions Compact way to process every item in a list. [x for x in array]

  22. Methods Methods can be applied to the iterating variable, x; their results are stored in the resulting list. [len(x) for x in array]

  23. Conditionals Elements from the original list can be omitted from the resulting list, using conditional statements [x for x in array if len(x) == 3]

  24. Building up These can be combined to build up complicated lists [x.upper() for x in array if len(x) > 3 and x.startswith('t')]
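Slides 21-24 can be run as one script, using a hypothetical word list for `array`:

```python
# A hypothetical word list to drive the comprehensions from the slides.
array = ["the", "dog", "toward", "ran", "tree"]

print([x for x in array])                 # a copy: same elements
print([len(x) for x in array])            # [3, 3, 6, 3, 4]
print([x for x in array if len(x) == 3])  # ['the', 'dog', 'ran']
print([x.upper() for x in array
       if len(x) > 3 and x.startswith("t")])  # ['TOWARD', 'TREE']
```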

  25. Lists Containing Lists Lists can contain lists: [["a", 1], ["b", 2], ["d", 4]] ...or tuples: [("a", 1), ("b", 2), ("d", 4)] [[d, d*d] for d in array if d < 4]

  26. Lists within lists are often called 2-d arrays. This is another way we store tables, similar to nested dictionaries. a = [[0, 1], [1, 0]] a[1][1] a[0][0]

  27. Using multiple lists Multiple lists can be processed simultaneously in a list comprehension [x*y for x in array1 for y in array2]
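Slide 27's pattern is worth running once to see the order of the results: the two for-clauses nest, so every x pairs with every y (a cross product), not an element-wise zip. The two lists below are hypothetical.

```python
array1 = [1, 2, 3]
array2 = [10, 100]

# The second for-clause is the inner loop: for each x, all ys.
print([x * y for x in array1 for y in array2])
# [10, 100, 20, 200, 30, 300]
```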

  28. Next Time
• Word similarity
• WordNet
• Data structures: 2-d arrays, trees, graphs
