An Introduction to Edison Vivek Srikumar 17th April 2012
Curator gives us easy access to several layers of annotation over text. What can we do with these?
Outline • What is Edison? • Installing Edison • Using Edison • Creating Edison objects • Accessing the Curator • Adding and using views
What is Edison? • A uniform representation of diverse NLP annotations • A library of NLP data structures • A Java client to the Curator
NLP Annotations
Sentence: John Smith bought the car.
Part-of-speech: NNP John, NNP Smith, VBD bought, DT the, NN car, . .
Shallow parse: NP John Smith, VP bought, NP the car
Parse tree: [figure: tree with nodes S, NP, VP, NNP, NNP, VBD, NP, DT, NN over "John Smith bought the car"]
Semantic roles: Predicate buy, A0 John Smith, A1 the car
Named entities: PER John Smith
And many others…
A uniform representation • Main ideas • All the annotations over text are graphs • Nodes: Labeled spans of text • Spans indexed by tokens in the text • Edges: Relations between the nodes • Edison terminology • TextAnnotation: A container of tokens and views • View: A graph that denotes a specific annotation • Constituent: A labeled span of text (nodes) • Relation: A labeled directed edge between Constituents
A uniform representation
TextAnnotation
Raw text: John Smith bought the car.
Tokens: {0:John, 1:Smith, 2:bought, 3:the, 4:car, 5:.}
Views:
    Name: SENTENCE, Constituents: {…}, Relations: {…}
    Name: POS, Constituents: {…}, Relations: {…}
    Name: PARSE_CHARNIAK, Constituents: {…}, Relations: {…}
    and other views…
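The container-of-graphs idea can be sketched in plain Java. This is only an illustration of the data model described on the slides: the class names (`Document`, `Layer`, `Node`, `Edge`) are hypothetical stand-ins and are not Edison's TextAnnotation, View, Constituent, or Relation classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration; these are NOT Edison's classes.
class AnnotationGraphSketch {

    /** A node: a label over the half-open token span [start, end). */
    static final class Node {
        final String label;
        final int start;
        final int end;

        Node(String label, int start, int end) {
            this.label = label;
            this.start = start;
            this.end = end;
        }
    }

    /** An edge: a named, directed relation between two nodes. */
    static final class Edge {
        final String name;
        final Node source;
        final Node target;

        Edge(String name, Node source, Node target) {
            this.name = name;
            this.source = source;
            this.target = target;
        }
    }

    /** One annotation layer: a graph made of nodes and edges. */
    static final class Layer {
        final List<Node> nodes = new ArrayList<>();
        final List<Edge> edges = new ArrayList<>();
    }

    /** The container: shared tokens plus layers indexed by name. */
    static final class Document {
        final String[] tokens;
        final Map<String, Layer> layers = new LinkedHashMap<>();

        Document(String... tokens) {
            this.tokens = tokens;
        }

        /** Returns the tokens covered by a node, joined by spaces. */
        String textOf(Node node) {
            return String.join(" ", Arrays.copyOfRange(tokens, node.start, node.end));
        }
    }

    /** Builds the running example with a named-entity layer and a tiny parse layer. */
    static Document example() {
        Document doc = new Document("John", "Smith", "bought", "the", "car", ".");

        Layer ner = new Layer();
        ner.nodes.add(new Node("PER", 0, 2));
        doc.layers.put("NER", ner);

        Layer parse = new Layer();
        Node np = new Node("NP", 0, 2);
        Node nnp = new Node("NNP", 0, 1);
        parse.nodes.add(np);
        parse.nodes.add(nnp);
        parse.edges.add(new Edge("ParentOf", np, nnp));
        doc.layers.put("PARSE", parse);
        return doc;
    }

    public static void main(String[] args) {
        Document doc = example();
        for (Map.Entry<String, Layer> entry : doc.layers.entrySet()) {
            for (Node node : entry.getValue().nodes) {
                System.out.println(entry.getKey() + "\t" + node.label + "\t" + doc.textOf(node));
            }
        }
    }
}
```

Because every layer indexes into the same token array, annotations from different sources can be compared by span alone.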
Getting started with Edison • Download the jar from http://cogcomp.cs.illinois.edu/page/software_view/Edison • Click the download link and follow instructions • Add the edison jar and its dependencies to your class path • Dependencies • Cogcomp core utilities • Apache commons libraries • Thrift (to communicate with the Curator) • Porter stemmer • LBJ Library • Java WordNet interface • Javadoc available under “User Guide”
Edison using Maven
• Add the following repository definition to your pom.xml file

    <repositories>
      <repository>
        <id>CogcompSoftware</id>
        <name>CogcompSoftware</name>
        <url>http://cogcomp.cs.illinois.edu/m2repo/</url>
      </repository>
    </repositories>

• Add Edison as a dependency

    <dependency>
      <groupId>edu.illinois.cs.cogcomp</groupId>
      <artifactId>edison</artifactId>
      <version>0.2.9</version>
      <type>jar</type>
      <scope>compile</scope>
    </dependency>
So far… • What is Edison? • Installing Edison • Creating a TextAnnotation • Adding views from the Curator • Using views • …?? • Profit!
Three ways to create TextAnnotations • When you don’t know the tokenization • Use this for raw text, if you don’t want to use the Curator • When you know the tokenization • Use this for pre-tokenized text • Using the Curator • Use this for raw text • If your text is pre-tokenized, you can still use the Curator for adding views
Creating TextAnnotations (1) • When to use this approach • If you don’t know the tokenization (i.e. the words) • If you want to use the LBJ tokenizer and sentence splitter • Note: Every TextAnnotation has a textId and a corpusId, which could be used in the future for book-keeping
Creating TextAnnotations (1)

    String corpus = "2001_ODYSSEY";
    String textId = "001";
    String text1 = "Good afternoon, gentlemen. I am a HAL-9000 computer.";

    TextAnnotation ta1 = new TextAnnotation(corpus, textId, text1);

    System.out.println(ta1.getText());
    System.out.println(ta1.getTokenizedText());

    // Print the sentences. The `Sentence` class has the same
    // methods as a `TextAnnotation`.
    List<Sentence> sentences = ta1.sentences();
    System.out.println(sentences.size() + " sentences found.");

    for (int i = 0; i < sentences.size(); i++) {
        Sentence sentence = sentences.get(i);
        System.out.println(sentence);
    }
Creating TextAnnotations (2) • When to use this approach • When you know the tokenization • That is, when some external source specifies the tokens of the text • After creating it, it can be used as before
Creating TextAnnotations (2)

    String corpus = "2001_ODYSSEY";
    String textId = "002";

    // Each string is one sentence whose tokens are separated by spaces.
    List<String> tokenizedSentences = Arrays.asList(
            "Good afternoon , gentlemen .",
            "I am a HAL-9000 computer .");

    TextAnnotation ta2 = new TextAnnotation(corpus, textId, tokenizedSentences);

    System.out.println(ta2.getText());
    System.out.println(ta2.getTokenizedText());

    // Print the sentences. The `Sentence` class has the same
    // methods as a `TextAnnotation`.
    List<Sentence> sentences = ta2.sentences();
    System.out.println(sentences.size() + " sentences found.");

    for (int i = 0; i < sentences.size(); i++) {
        Sentence sentence = sentences.get(i);
        System.out.println(sentence);
    }
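To make the pre-tokenized input concrete, here is a small plain-Java sketch of how a list of space-separated sentences maps onto one running token index. It assumes tokens are separated by whitespace, and the helper names (`tokens`, `sentenceEnds`) are hypothetical; this is not how Edison implements it internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of Edison.
class PreTokenizedSketch {

    /** Flattens whitespace-tokenized sentences into one token list. */
    static List<String> tokens(List<String> tokenizedSentences) {
        List<String> all = new ArrayList<>();
        for (String sentence : tokenizedSentences) {
            all.addAll(Arrays.asList(sentence.trim().split("\\s+")));
        }
        return all;
    }

    /** Returns, for each sentence, the token index one past its last token. */
    static List<Integer> sentenceEnds(List<String> tokenizedSentences) {
        List<Integer> ends = new ArrayList<>();
        int count = 0;
        for (String sentence : tokenizedSentences) {
            count += sentence.trim().split("\\s+").length;
            ends.add(count);
        }
        return ends;
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList(
                "Good afternoon , gentlemen .",
                "I am a HAL-9000 computer .");
        System.out.println(tokens(sentences));
        System.out.println(sentenceEnds(sentences));
    }
}
```

Token indices run across sentence boundaries, so the first token of the second sentence gets the index that the first sentence ended on.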
Connecting to the Curator (1)
If you don’t know anything about your text, the Curator can tokenize it for you.

    String text = "Good afternoon, gentlemen. I am a HAL-9000 "
            + "computer. I was born in Urbana, Il. in 1992";
    String corpus = "2001_ODYSSEY";
    String textId = "001";

    // Create a Curator client. We need to specify a host and a port
    // where the Curator server is running.
    String curatorHost = "my-curator-server.cs.uiuc.edu";
    int curatorPort = 9090;
    CuratorClient client = new CuratorClient(curatorHost, curatorPort);

    // Should the Curator's cache be forcibly updated?
    boolean forceUpdate = false;

    // Create a TextAnnotation: get the text annotation object from the
    // Curator, which splits the sentences and tokenizes the text.
    TextAnnotation ta = client.getTextAnnotation(corpus, textId, text, forceUpdate);
Connecting to the Curator (2)
If you know the tokenization and want all the Curator’s annotators to respect it:

    String corpus = "2001_ODYSSEY";
    String textId = "002";

    // Create your TextAnnotation as before.
    List<String> tokenizedSentences = Arrays.asList(
            "Good afternoon , gentlemen .",
            "I am a HAL-9000 computer .");
    TextAnnotation ta2 = new TextAnnotation(corpus, textId, tokenizedSentences);

    // We need to specify a host and a port where the Curator server is
    // running.
    String curatorHost = "my-curator-server.cs.uiuc.edu";
    int curatorPort = 9090;

    // The third argument (true) tells the Curator to respect the
    // existing tokenization.
    CuratorClient client = new CuratorClient(curatorHost, curatorPort, true);

Note: A CuratorClient in this mode cannot create TextAnnotations. Doing so will trigger an exception!
So far… • What is Edison? • Installing Edison • Creating a TextAnnotation • Adding views from the Curator • Using views • …?? • Profit!
Views • Views are graphs, Constituents are nodes and Relations are edges • Every TextAnnotation can be seen as a container for views, indexed by their name • View is a Java class that represents any graph over constituents • Specializations of the View class to deal with specific types • TokenLabelView, SpanLabelView, TreeView, PredicateArgumentView, CoreferenceView • You can create your own views or specializations too!
Example: Part-of-speech
John Smith bought the car.
Tokens: {0:John, 1:Smith, 2:bought, 3:the, 4:car, 5:.}
Part-of-speech: NNP John, NNP Smith, VBD bought, DT the, NN car, . .
Constituents: 0-1 NNP, 1-2 NNP, 2-3 VBD, 3-4 DT, 4-5 NN, 5-6 .
Each constituent is associated with a span. The convention is to denote a span using the first token and the (last + 1)th one.
No Relations!
This specialization of the View class is called a TokenLabelView, where each constituent assigns a label to a token and there are no relations. Use for part-of-speech, stem/lemma, etc.
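The span convention is a half-open interval: a constituent over tokens i through j is written i-(j+1), so a single token i has the span i-(i+1). A minimal plain-Java sketch of this numbering follows; the class and method names are hypothetical and unrelated to Edison's TokenLabelView.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Edison.
class TokenSpanSketch {

    /** Formats one line per token as "start-end<TAB>label<TAB>token". */
    static List<String> describe(String[] tokens, String[] labels) {
        if (tokens.length != labels.length) {
            throw new IllegalArgumentException("one label per token is required");
        }
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            // A single token i occupies the half-open span [i, i + 1).
            lines.add(i + "-" + (i + 1) + "\t" + labels[i] + "\t" + tokens[i]);
        }
        return lines;
    }

    /** The number of tokens in a half-open span is end minus start. */
    static int length(int start, int end) {
        return end - start;
    }

    public static void main(String[] args) {
        String[] tokens = {"John", "Smith", "bought", "the", "car", "."};
        String[] labels = {"NNP", "NNP", "VBD", "DT", "NN", "."};
        for (String line : describe(tokens, labels)) {
            System.out.println(line);
        }
    }
}
```

One benefit of the half-open form is that adjacent spans share a boundary index, so 0-2 followed by 2-3 leaves no gap and no overlap.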
Adding part-of-speech from the Curator

    // Suppose we have a CuratorClient called 'client' and a TextAnnotation
    // called 'ta'.

    // Should the Curator forcibly update the part-of-speech annotation?
    boolean forceUpdate = false;

    // Curator call: add the part-of-speech view from the Curator.
    client.addPOSView(ta, forceUpdate);

    // Get the part-of-speech view from the TextAnnotation. This view will
    // be filed under the name 'ViewNames.POS'. Also, we know that
    // this view will be a TokenLabelView.
    TokenLabelView posView = (TokenLabelView) ta.getView(ViewNames.POS);

    // Iterate through the text and get the POS label for each token.
    for (int tokenId = 0; tokenId < ta.size(); tokenId++) {
        String token = ta.getToken(tokenId);
        // getLabel(tokenId) is available for TokenLabelViews.
        String posLabel = posView.getLabel(tokenId);
        System.out.println(token + "\t" + posLabel);
    }
Example: Shallow parse
John Smith bought the car.
Tokens: {0:John, 1:Smith, 2:bought, 3:the, 4:car, 5:.}
Shallow parse: NP John Smith, VP bought, NP the car
Constituents: 0-2 NP, 2-3 VP, 3-5 NP
Each constituent is associated with a span. The convention is to denote a span using the first token and the (last + 1)th one.
No Relations!
This specialization of the View class is called a SpanLabelView, where each constituent assigns a label to a span of text and there are no relations. Use for named entities, shallow parse, Wikifier, etc.
Adding shallow parse from the Curator

    // Suppose we have a CuratorClient called 'client' and a TextAnnotation
    // called 'ta'.

    // Should the Curator forcibly update the shallow parse annotation?
    boolean forceUpdate = false;

    // Curator call: add the shallow parse/chunk view from the Curator.
    client.addChunkView(ta, forceUpdate);

    // Get the shallow parse view from the TextAnnotation. This view will
    // be filed under the name 'ViewNames.SHALLOW_PARSE'. Also, we know that
    // this view will be a SpanLabelView.
    SpanLabelView chunkView = (SpanLabelView) ta.getView(ViewNames.SHALLOW_PARSE);

    // Get all constituents whose span is contained in the span (0, 2).
    // getSpanLabels is available for SpanLabelViews.
    List<Constituent> constituents = chunkView.getSpanLabels(0, 2);

    // Iterate over them and print their labels.
    for (Constituent c : constituents) {
        String label = c.getLabel();
        System.out.println(label);
    }
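The "contained in the span" query can be pictured as a simple filter over half-open spans. The sketch below is plain Java under that assumption; `Span` and `containedIn` are hypothetical names, and it makes no claim about how Edison's SpanLabelView actually implements getSpanLabels.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical classes for illustration; not Edison's SpanLabelView.
class SpanQuerySketch {

    /** A label over the half-open token span [start, end). */
    static final class Span {
        final String label;
        final int start;
        final int end;

        Span(String label, int start, int end) {
            this.label = label;
            this.start = start;
            this.end = end;
        }
    }

    /** Returns the spans that lie entirely inside [start, end). */
    static List<Span> containedIn(List<Span> spans, int start, int end) {
        List<Span> result = new ArrayList<>();
        for (Span span : spans) {
            if (span.start >= start && span.end <= end) {
                result.add(span);
            }
        }
        return result;
    }

    /** Chunks for "John Smith bought the car ." */
    static List<Span> chunks() {
        return Arrays.asList(
                new Span("NP", 0, 2),   // John Smith
                new Span("VP", 2, 3),   // bought
                new Span("NP", 3, 5));  // the car
    }

    public static void main(String[] args) {
        for (Span span : containedIn(chunks(), 0, 2)) {
            System.out.println(span.label);
        }
    }
}
```

A query range narrower than every chunk, such as (0, 1), returns nothing, because containment requires the whole chunk to fit inside the range.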
Other SpanLabel views in the Curator • Shallow parse • ViewNames.SHALLOW_PARSE • Use ‘client.addChunkView(ta, forceUpdate)’ • Named entities • ViewNames.NER • Use ‘client.addNamedEntityView(ta, forceUpdate)’ • Wikifier • ViewNames.WIKIFIER • Use ‘client.addWikifierView(ta, forceUpdate)’ Note: For these function calls to work, the corresponding annotator should exist in your instance of the Curator. Otherwise, an exception will be triggered.
Example: Parse view
John Smith bought the car.
Tokens: {0:John, 1:Smith, 2:bought, 3:the, 4:car, 5:.}
Parse tree: [figure: tree with nodes S, NP, VP, NNP, NNP, VBD, NP, DT, NN over "John Smith bought the car"]
Constituents: 0-5 S, 0-2 NP, 2-5 VP, 0-1 NNP (rest of the tree not shown)
Relations: ParentOf edges from each node to its children, e.g. S to NP, S to VP, NP to NNP
This specialization of the View class is called a TreeView, where the graph represents a tree. Use for full parse and dependency trees.
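A tree stored as constituents plus ParentOf relations can be turned back into the familiar bracketed form by following the edges from the root. The sketch below is plain Java with hypothetical names (`Node`, `parentOf`, `bracket`); it assumes every leaf is a single-token part-of-speech node and omits the final period, and it is not Edison's TreeView.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical classes for illustration; not Edison's TreeView.
class ParentOfTreeSketch {

    /** A labeled span whose outgoing ParentOf edges are kept as a child list. */
    static final class Node {
        final String label;
        final int start;
        final int end;
        final List<Node> children = new ArrayList<>();

        Node(String label, int start, int end) {
            this.label = label;
            this.start = start;
            this.end = end;
        }

        /** Adds a ParentOf edge from this node to the child; returns this node. */
        Node parentOf(Node child) {
            children.add(child);
            return this;
        }
    }

    /** Renders a subtree in bracketed form; a leaf prints its single token. */
    static String bracket(Node node, String[] tokens) {
        if (node.children.isEmpty()) {
            return "(" + node.label + " " + tokens[node.start] + ")";
        }
        StringBuilder sb = new StringBuilder("(" + node.label);
        for (Node child : node.children) {
            sb.append(' ').append(bracket(child, tokens));
        }
        return sb.append(')').toString();
    }

    /** Builds the tree for "John Smith bought the car". */
    static Node example() {
        Node subject = new Node("NP", 0, 2)
                .parentOf(new Node("NNP", 0, 1))
                .parentOf(new Node("NNP", 1, 2));
        Node object = new Node("NP", 3, 5)
                .parentOf(new Node("DT", 3, 4))
                .parentOf(new Node("NN", 4, 5));
        Node vp = new Node("VP", 2, 5)
                .parentOf(new Node("VBD", 2, 3))
                .parentOf(object);
        return new Node("S", 0, 5).parentOf(subject).parentOf(vp);
    }

    public static void main(String[] args) {
        String[] tokens = {"John", "Smith", "bought", "the", "car", "."};
        System.out.println(bracket(example(), tokens));
    }
}
```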
Adding Charniak parse from the Curator

    // Suppose we have a CuratorClient called 'client' and a TextAnnotation
    // called 'ta'.

    // Should the Curator forcibly update the parse annotation?
    boolean forceUpdate = false;

    // Curator call: add the Charniak parse view from the Curator.
    client.addCharniakParse(ta, forceUpdate);

    // Get the Charniak parse view from the TextAnnotation. This view will
    // be filed under the name 'ViewNames.PARSE_CHARNIAK'. Also, we know
    // that this view will be a TreeView.
    TreeView parseView = (TreeView) ta.getView(ViewNames.PARSE_CHARNIAK);

    // Now do interesting things with the view.

    // Get all parse nodes.
    List<Constituent> treeNodes = parseView.getConstituents();

    // Get the tree structure for the first sentence (i.e. sentence #0).
    Tree<String> parseTree = parseView.getTree(0);

    // Get the path between two parse tree nodes (a common feature).
    String parsePath = PathFeatureHelper.getFullParsePathString(
            treeNodes.get(0), treeNodes.get(1), 400);
Tree views from the curator • Charniak parser • ViewNames.PARSE_CHARNIAK • client.addCharniakParse(ta, forceUpdate) • Easy-first dependency parser • ViewNames.DEPENDENCY • client.addEasyFirstDependencyView(ta, forceUpdate) • Stanford parser • ViewNames.PARSE_STANFORD • client.addStanfordParse(ta, forceUpdate) • Stanford dependency parser • ViewNames.DEPENDENCY_STANFORD • client.addStanfordDependencyView(ta, forceUpdate)
Other Curator calls • Verb semantic roles • View name: ViewNames.SRL • client.addSRLView(ta, forceUpdate) • Adds a view of type PredicateArgumentView, which is a subclass of the View class • Nominal semantic roles • View name: ViewNames.NOM • client.addNOMView(ta, forceUpdate) • Adds a view of type PredicateArgumentView • Coreference • View name: ViewNames.COREF • client.addCorefView(ta, forceUpdate) • Adds a view of type CoreferenceView, which is a subclass of the View class
So far… • What is Edison? • Installing Edison • Creating a TextAnnotation • Adding views from the Curator • Using views • …?? • Profit!
Using views • All views provide access to • Constituents: getConstituents, getConstituentsCoveringToken, getConstituentsCoveringSpan • Relations: getRelations • This allows us to combine several different views • E.g.: get the parse tree nodes that contain each named entity constituent whose label is “PER”:

    for (Constituent c : namedEntityView.getConstituents()) {
        if (c.getLabel().equals("PER")) {
            List<Constituent> parseConstituents =
                    parseView.getConstituentsCovering(c);
            // do something with these
        }
    }
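Since all views index the same tokens, combining two views reduces to comparing spans. The following plain-Java sketch illustrates the "covering" test for the PER example; `Span` and `covering` are hypothetical names and the parse spans are entered by hand, so this only mirrors the idea and is not Edison's View API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical classes for illustration; not Edison's View API.
class CoveringQuerySketch {

    /** A label over the half-open token span [start, end). */
    static final class Span {
        final String label;
        final int start;
        final int end;

        Span(String label, int start, int end) {
            this.label = label;
            this.start = start;
            this.end = end;
        }
    }

    /** Returns the candidates whose span fully covers the target's span. */
    static List<Span> covering(List<Span> candidates, Span target) {
        List<Span> result = new ArrayList<>();
        for (Span candidate : candidates) {
            if (candidate.start <= target.start && candidate.end >= target.end) {
                result.add(candidate);
            }
        }
        return result;
    }

    /** Labels of the parse nodes that cover each PER entity. */
    static List<String> parseLabelsCoveringPersons() {
        List<Span> entities = Arrays.asList(new Span("PER", 0, 2));
        List<Span> parse = Arrays.asList(
                new Span("S", 0, 5),
                new Span("NP", 0, 2),
                new Span("NNP", 0, 1),
                new Span("NNP", 1, 2),
                new Span("VP", 2, 5));

        List<String> labels = new ArrayList<>();
        for (Span entity : entities) {
            if (entity.label.equals("PER")) {
                for (Span node : covering(parse, entity)) {
                    labels.add(node.label);
                }
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        System.out.println(parseLabelsCoveringPersons());
    }
}
```

Only S and NP cover the whole entity; each NNP covers just one of its two tokens and is therefore excluded.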
Using constituents and relations • Each constituent belongs to a view • Constituents provide the following methods: • getLabel(): gets the label of the constituent • getSpan(): gets the span of the constituent • getIncomingRelations(): gets list of Relations that are incident to this constituent in this view • getOutgoingRelations(): gets list of Relations whose source is this constituent in this view • Relations provide the following accessors: • getRelationName(),getSource(),getTarget()
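The incoming and outgoing accessors can be understood as two adjacency lists kept on every node. Below is a plain-Java sketch using the semantic-role example (predicate buy, A0 John Smith, A1 the car); `Node`, `Edge`, and `connect` are hypothetical names, not Edison's Constituent and Relation classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical classes for illustration; not Edison's Constituent or Relation.
class RelationSketch {

    /** A labeled node that remembers the edges touching it. */
    static final class Node {
        final String label;
        final List<Edge> incoming = new ArrayList<>();
        final List<Edge> outgoing = new ArrayList<>();

        Node(String label) {
            this.label = label;
        }
    }

    /** A named, directed edge from source to target. */
    static final class Edge {
        final String name;
        final Node source;
        final Node target;

        private Edge(String name, Node source, Node target) {
            this.name = name;
            this.source = source;
            this.target = target;
        }
    }

    /** Creates an edge and registers it with both of its endpoints. */
    static Edge connect(String name, Node source, Node target) {
        Edge edge = new Edge(name, source, target);
        source.outgoing.add(edge);
        target.incoming.add(edge);
        return edge;
    }

    /** Builds the predicate "buy" with its A0 and A1 arguments. */
    static Node example() {
        Node predicate = new Node("buy");
        connect("A0", predicate, new Node("John Smith"));
        connect("A1", predicate, new Node("the car"));
        return predicate;
    }

    public static void main(String[] args) {
        Node predicate = example();
        for (Edge edge : predicate.outgoing) {
            System.out.println(predicate.label + " --" + edge.name + "--> " + edge.target.label);
        }
    }
}
```

Registering each edge on both endpoints is what lets an argument find its predicate through its incoming list without searching the whole graph.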
Other useful functionality • Supports • Top-K views • Custom views, for your application • Provides helper functions for common tasks • Look at the functions in classes in the package edu.illinois.cs.cogcomp.edison.features.helpers • Provides an interface to WordNet • WordNetManager • Collins’ head-finding rules • Several feature extraction utilities • Look at the classes in edu.illinois.cs.cogcomp.edison.features
So far… • What is Edison? • Installing Edison • Creating a TextAnnotation • Adding views from the Curator • Using views • …?? • Profit!
Links • Edison download http://cogcomp.cs.illinois.edu/page/software_view/Edison • Example code http://cogcomp.cs.illinois.edu/software/edison/ • API documentation http://cogcomp.cs.illinois.edu/software/edison/apidocs