Lecture 5: Annotating Things & List Comprehensions

Methods in Computational Linguistics II, Queens College

Linguistic Annotation
Text only takes us so far. People are reliable judges of linguistic behavior. We can model with machines, but for “gold-standard” truth, we ask people to make judgments about linguistic qualities.

Example Linguistic Annotations
• Sentence Boundaries
• Part of Speech Tags
• Phonetic Transcription
• Syntactic parse trees
• Speaker Identity
• Semantic Role
• Speech Act
• Document Topic
• Argument structure
• Word Sense
• many, many, many more

We need…
Techniques to process these. Every corpus has its own format for linguistic annotation, so we need to parse annotation formats.

The/DET Dog/NN is/VB fast/JJ ./.
<word ortho="The" pos="DET"></word>
<word ortho="Dog" pos="NN"></word>
<word ortho="is" pos="VB"></word>
<word ortho="fast" pos="JJ"></word>
The dog is fast.
1, 3, DET
5, 7, NN
9, 10, VB
12, 15, JJ
16, 16, .
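The slide above shows the same part-of-speech annotation in three layouts: inline word/TAG pairs, XML elements, and standoff character offsets. A minimal sketch of reading each one might look like the following. The function names are made up for illustration, the offsets are assumed to be 1-based and inclusive (which matches the numbers on the slide), and the XML fragment is wrapped in a `<sent>` root that is not in the original.

```python
import xml.etree.ElementTree as ET

def parse_slash(text):
    """Inline format: every token is written as word/TAG."""
    return [tuple(token.rsplit("/", 1)) for token in text.split()]

def parse_xml(fragment):
    """XML format: one <word> element per token.
    The <sent> wrapper is an assumption so the fragment has a single root."""
    root = ET.fromstring("<sent>" + fragment + "</sent>")
    return [(w.get("ortho"), w.get("pos")) for w in root.iter("word")]

def parse_standoff(text, spans):
    """Standoff format: (start, end, tag) with 1-based, inclusive offsets."""
    return [(text[start - 1:end], tag) for start, end, tag in spans]

slash_pairs = parse_slash("The/DET Dog/NN is/VB fast/JJ ./.")

xml_pairs = parse_xml(
    '<word ortho="The" pos="DET"></word>'
    '<word ortho="Dog" pos="NN"></word>'
    '<word ortho="is" pos="VB"></word>'
    '<word ortho="fast" pos="JJ"></word>'
)

standoff_pairs = parse_standoff(
    "The dog is fast.",
    [(1, 3, "DET"), (5, 7, "NN"), (9, 10, "VB"), (12, 15, "JJ"), (16, 16, ".")],
)

print(slash_pairs)
print(xml_pairs)
print(standoff_pairs)
```

All three functions return a list of (word, tag) pairs, so code further down the pipeline does not need to know which corpus format the data came from.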

Constructing a linguistic corpus
• Decisions that need to be made:
  • Why are you doing this?
  • What material will be collected?
  • How will it be collected?
    • Automatically?
    • Manually?
  • Found material vs. laboratory language?
  • What meta information will be stored?
  • What manual annotations are required?
    • How will each annotation be defined?
    • How many annotators will be used?
    • How will agreement be assessed?
    • How will disagreements be resolved?
  • How will the material be disseminated?
    • Is this covered by your IRB if the material is the result of a human subject protocol?

Part of Speech Tagging
Task: Given a string of words, identify the part of speech of each word.

Part of Speech tagging
Surface level syntax.
Primary operation for:
• Parsing
• Word Sense Disambiguation
• Semantic Role labeling
• Segmentation
  • Discourse, Topic, Sentence

How is it done?
Learn from Data.
• Annotated Data
• Unlabeled Data

Learn the association from Tag to Word
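The slides do not say which model is used to learn this association. One of the simplest options, shown here only as a baseline sketch, is to count how often each word appears with each tag in annotated data and then always pick the most frequent tag. The training string, the `tag_word` helper, and the `NN` fallback for unseen words are all assumptions made for this example.

```python
from collections import Counter, defaultdict

# Hypothetical toy training data in word/TAG form (not a real corpus).
tagged = "The/DET dog/NN is/VB fast/JJ ./. The/DET fast/NN is/VB fast/JJ ./."

# counts[word][tag] = number of times the word was seen with that tag
counts = defaultdict(Counter)
for token in tagged.split():
    word, tag = token.rsplit("/", 1)
    counts[word.lower()][tag] += 1

def tag_word(word, default="NN"):
    """Return the tag seen most often with this word, or a default if unseen."""
    seen = counts.get(word.lower())
    if not seen:
        return default  # unseen token: we can only guess
    return seen.most_common(1)[0][0]

print([(w, tag_word(w)) for w in "The dog is fast .".split()])
print(tag_word("cat"))  # never seen in training, falls back to the default
```

Even this tiny example runs into two of the limitations listed on the next slide: an unseen word only gets a guess, and "fast" is always tagged JJ, so its less common noun reading is never produced.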

Limitations
• Unseen tokens
• Uncommon interpretations
• Long-term dependencies

Parsing Generate a parse tree.

Parsing
Generate a Parse Tree from:
• The surface form (words) of the text
• Part of Speech Tokens

Parsing Styles


Context Free Grammars for Parsing
S → VP
S → NP VP
NP → Det Nom
Nom → Noun
Nom → Adj Nom
VP → Verb Nom
Det → “A”, “The”
Noun → “I”, “John”, “Address”
Verb → “Gave”
Adj → “My”, “Blue”
Adv → “Quickly”
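To see how a grammar like this drives a parser, here is a small top-down (recursive descent) sketch that encodes the rules above as Python dictionaries. It is one possible approach, not necessarily the one used in the course. The names `GRAMMAR`, `LEXICON`, `expand`, `expand_seq`, and `parse` are made up for this example, and words are lowercased so matching ignores case.

```python
# Phrase rules from the slide: each left-hand side maps to its alternatives.
GRAMMAR = {
    "S": [["VP"], ["NP", "VP"]],
    "NP": [["Det", "Nom"]],
    "Nom": [["Noun"], ["Adj", "Nom"]],
    "VP": [["Verb", "Nom"]],
}

# Word-level rules from the slide, lowercased.
LEXICON = {
    "Det": {"a", "the"},
    "Noun": {"i", "john", "address"},
    "Verb": {"gave"},
    "Adj": {"my", "blue"},
    "Adv": {"quickly"},
}

def expand(symbol, words, i):
    """Yield (tree, next_index) for each way `symbol` can start at words[i]."""
    if symbol in LEXICON:
        if i < len(words) and words[i] in LEXICON[symbol]:
            yield (symbol, words[i]), i + 1
        return
    for rhs in GRAMMAR[symbol]:
        yield from expand_seq(symbol, rhs, words, i, [])

def expand_seq(symbol, rhs, words, i, children):
    """Match the right-hand-side symbols one after another."""
    if not rhs:
        yield (symbol, *children), i
        return
    for child, j in expand(rhs[0], words, i):
        yield from expand_seq(symbol, rhs[1:], words, j, children + [child])

def parse(sentence):
    """Return every parse tree for S that covers the whole sentence."""
    words = sentence.lower().split()
    return [tree for tree, end in expand("S", words, 0) if end == len(words)]

print(parse("Gave my blue address"))
print(parse("The blue address gave John"))
print(parse("John quickly gave"))  # not covered by the grammar
```

Top-down expansion works here because none of the rules is left-recursive. A rule such as Nom → Nom Noun would make this sketch loop forever, and a chart parser would be the safer choice.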

Limitations
• The grammar must be built by hand.
• Can’t handle ungrammatical sentences.
• Can’t resolve ambiguity.

Probabilistic Parsing
• Assign each transition a probability
• Find the parse with the greatest “likelihood”
• Build a table and count
  • How many times does each transition happen
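The "build a table and count" step can be sketched as relative-frequency estimation: count each rule (transition), divide by the number of times its left-hand side was expanded, and score a parse by multiplying the probabilities of the rules it uses. The rule list below is invented for illustration and does not come from a real treebank. `parse_likelihood` is a hypothetical helper name.

```python
import math
from collections import Counter

# Hypothetical rule applications read off a tiny hand-made treebank.
observed_rules = [
    ("S", ("NP", "VP")), ("S", ("NP", "VP")), ("S", ("VP",)),
    ("Nom", ("Noun",)), ("Nom", ("Noun",)), ("Nom", ("Noun",)),
    ("Nom", ("Adj", "Nom")),
]

# The table: how many times does each transition happen?
rule_counts = Counter(observed_rules)

# How many times was each left-hand side expanded at all?
lhs_counts = Counter(lhs for lhs, _ in observed_rules)

# Relative frequency: P(lhs -> rhs) = count(lhs -> rhs) / count(lhs)
rule_prob = {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}

def parse_likelihood(rules_used):
    """Product of the rule probabilities in one parse; unseen rules score 0."""
    return math.prod(rule_prob.get(rule, 0.0) for rule in rules_used)

for rule, p in sorted(rule_prob.items()):
    print(rule, round(p, 3))

print(parse_likelihood([("S", ("NP", "VP")), ("Nom", ("Noun",))]))
```

When a sentence has several parses, the one with the highest product is chosen, which is how the probabilistic version addresses the ambiguity limitation of the hand-built grammar.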

Segmentation
• Sentence Segmentation
• Topic Segmentation
• Speaker Segmentation
• Phrase Chunking
  • NP, VP, PP, SubClause, etc.

Split into words
sent = "That isn't the problem, Bob."
sent.split() vs. nltk.word_tokenize(sent)
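The difference between the two calls is easiest to see on the example sentence. Because `nltk.word_tokenize` needs NLTK and its tokenizer data to be installed, the sketch below uses a simple regular expression as a stand-in. That pattern is an assumption for illustration, not what NLTK does internally: NLTK follows Penn Treebank conventions and would typically also split "isn't" into "is" and "n't".

```python
import re

sent = "That isn't the problem, Bob."

# Whitespace splitting leaves punctuation glued to the neighboring word.
by_space = sent.split()

# Rough stand-in for a tokenizer: keep word characters together
# (allowing one internal apostrophe) and make each punctuation mark
# a separate token.
by_regex = re.findall(r"\w+(?:'\w+)?|[^\w\s]", sent)

print(by_space)
print(by_regex)
```

With `split()`, "problem," and "Bob." keep their punctuation, so they would not match "problem" or "Bob" in a word list. A tokenizer separates those marks into their own tokens.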

List Comprehensions
Compact way to process every item in a list.
[x for x in array]

Methods
Functions and methods can be applied to the iterating variable, x. Their return values are stored in the resulting list.
[len(x) for x in array]
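As a quick illustration, the comprehension can be read as a shorthand for a loop that appends to a new list. The word list here is made up, and `array` is simply the placeholder name used on the slides.

```python
# Hypothetical sample data; `array` is the placeholder name from the slides.
array = ["the", "dog", "is", "fast"]

# Loop version: build the result one item at a time.
lengths_loop = []
for x in array:
    lengths_loop.append(len(x))

# Comprehension version: the same result in one expression.
lengths = [len(x) for x in array]

# A method call on x works the same way as a function call.
shouted = [x.upper() for x in array]

# With no transformation, the comprehension just builds a new list
# holding the same items.
same_items = [x for x in array]

print(lengths)
print(shouted)
print(same_items)
```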

Conditionals
Elements from the original list can be omitted from the resulting list using a conditional statement.
[x for x in array if len(x) == 3]

Building up
These can be combined to build up complicated lists.
[x.upper() for x in array if len(x) > 3 and x.startswith('t')]
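Run on a small made-up word list, the filtering and combined forms from the last two slides behave as follows. The sample words are an assumption chosen so that each condition keeps some items and drops others.

```python
# Hypothetical sample data for the filter examples.
array = ["the", "dog", "is", "fast", "today", "tag", "trees"]

# Keep only the items that pass the test.
three_letters = [x for x in array if len(x) == 3]

# Filter and transform together: the condition decides which items are
# kept, and the expression at the front decides what is stored.
long_t_words = [x.upper() for x in array if len(x) > 3 and x.startswith("t")]

print(three_letters)
print(long_t_words)
```

The condition is always evaluated on the original item, so `x.startswith("t")` tests the lowercase word even though the stored value is uppercased.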

Lists Containing Lists
Lists can contain lists
[['a', 1], ['b', 2], ['d', 4]]
...or tuples
[('a', 1), ('b', 2), ('d', 4)]
[[d, d*d] for d in array if d < 4]
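A short sketch of building and taking apart such nested lists, using a made-up list of numbers for `array`:

```python
# Hypothetical sample data.
array = [1, 2, 3, 4, 5]

# Each item of the result is itself a two-element list.
squares = [[d, d * d] for d in array if d < 4]

# The same idea with tuples instead of inner lists.
square_pairs = [(d, d * d) for d in array if d < 4]

# Inner items can be unpacked directly in the `for` clause.
pairs = [("a", 1), ("b", 2), ("d", 4)]
letters = [letter for letter, number in pairs]

print(squares)
print(square_pairs)
print(letters)
```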

Lists within lists are often called 2-d arrays
This is another way we store tables. Similar to nested dictionaries.
a = [[0, 1], [1, 0]]
a[1][1]
a[0][0]
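Indexing a 2-d array takes two steps: the first index picks a row and the second picks a cell within that row. The sketch below uses the table from the slide. The nested dictionary `d` is added here only to show the comparison the slide mentions.

```python
a = [[0, 1], [1, 0]]

# First index selects the row, second index selects the column.
bottom_right = a[1][1]
top_left = a[0][0]
top_right = a[0][1]

# A comprehension over the outer list processes one row at a time.
row_sums = [sum(row) for row in a]

# The same table as a nested dictionary is looked up the same way.
d = {0: {0: 0, 1: 1}, 1: {0: 1, 1: 0}}

print(bottom_right, top_left, top_right)
print(row_sums)
print(d[0][1] == a[0][1])
```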

Using multiple lists
Multiple lists can be processed in a single list comprehension. Each additional for clause acts as a nested loop, so every x is combined with every y.
[x*y for x in array1 for y in array2]
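Two `for` clauses in one comprehension behave like nested loops, which is different from walking through both lists in step. The sketch below contrasts the two, using made-up numbers. The `zip` version is an addition for comparison and is not on the slide.

```python
# Hypothetical sample data.
array1 = [1, 2, 3]
array2 = [10, 20]

# Two `for` clauses: every x is combined with every y (3 * 2 = 6 results).
all_products = [x * y for x in array1 for y in array2]

# Equivalent nested loops, with the first `for` as the outer loop.
all_products_loop = []
for x in array1:
    for y in array2:
        all_products_loop.append(x * y)

# To pair items position by position instead, zip the lists together.
# zip stops at the end of the shorter list.
pairwise = [x * y for x, y in zip(array1, array2)]

print(all_products)
print(pairwise)
```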

Next Time
• Word Similarity
• WordNet
• Data structures
  • 2-d arrays
  • Trees
  • Graphs
