An Introduction to Using Semgrex Chloé Kiddon
What is Semgrex?
[Figure: dependency graph for "Bob bought a red shirt": bought -nsubj-> Bob; bought -dobj-> shirt; shirt -det-> a; shirt -amod-> red]
• A Java utility (in javanlp) for identifying patterns in Stanford JavaNLP SemanticGraph structures
• Much like Tregex, which does the same for tree structures (Levy and Andrew 2006) and is based on tgrep2-style syntax and functionality. (These slides are adapted from the structure of theirs.)
• Applied the same way you use regular expressions to find patterns in strings
Ex. {tag:/VB.*/} >dobj ({} >amod {lemma:red})
Semgrex Overview
• SemgrexPatterns are composed of nodes, representing IndexedWords, and relations between them, representing edges in a SemanticGraph
• SemgrexMatchers can be used on a single SemanticGraph OR on two SemanticGraphs and an Alignment between them
  Ex. an RTE problem has the hypothesis graph, the text graph, and the alignment from the hypothesis graph’s IndexedFeatureLabels to the text graph’s IndexedFeatureLabels
• SemgrexPatterns return matches for IndexedFeatureLabels in a SemanticGraph
Syntax - Nodes
• Nodes are represented as {attr1:value1;attr2:value2;…}
• Attributes are regular strings; values can be strings or regular expressions delimited by "/"
  {lemma:run;pos:/VB.*/} => any verb form of the word "run"
• {} is any node in the graph
• {$} is any root in the graph
• {#} is the empty word (IndexedFeatureLabel.NO_WORD)
  Comes up when working with alignments
• Descriptions can be negated with !
  !{lemma:boy} => any word that isn’t "boy"
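The attribute/value rule above can be illustrated with a small toy model. This is only a sketch of the idea, not Semgrex's implementation: the class `NodeDescriptionSketch`, its methods, and the use of a plain `Map` to stand in for a word are all hypothetical, and it assumes that a regular-expression value must match the whole attribute value. It ignores `{$}`, `{#}`, negation, and regular expressions that contain ";".

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Toy model (hypothetical, not the Semgrex implementation): a word is a map of
// attribute names to values, and a description is a string like {lemma:run;pos:/VB.*/}.
class NodeDescriptionSketch {

    /** Parses the text between the braces into attribute -> expected value. */
    static Map<String, String> parse(String description) {
        String body = description.trim();
        body = body.substring(1, body.length() - 1).trim(); // drop the surrounding braces
        Map<String, String> attrs = new LinkedHashMap<>();
        if (body.isEmpty()) {
            return attrs; // {} places no constraints on the node
        }
        for (String pair : body.split(";")) {
            int colon = pair.indexOf(':');
            if (colon < 0) {
                throw new IllegalArgumentException("expected attr:value but got " + pair);
            }
            attrs.put(pair.substring(0, colon).trim(), pair.substring(colon + 1).trim());
        }
        return attrs;
    }

    /** True if every attribute in the description is satisfied by the word. */
    static boolean matches(String description, Map<String, String> word) {
        for (Map.Entry<String, String> e : parse(description).entrySet()) {
            String actual = word.get(e.getKey());
            if (actual == null) {
                return false;
            }
            String expected = e.getValue();
            boolean isRegex = expected.length() >= 2
                    && expected.startsWith("/") && expected.endsWith("/");
            boolean ok = isRegex
                    ? Pattern.matches(expected.substring(1, expected.length() - 1), actual)
                    : expected.equals(actual);
            if (!ok) {
                return false;
            }
        }
        return true;
    }

    /** Convenience builder: word("lemma", "run", "pos", "VBD"). */
    static Map<String, String> word(String... keyValues) {
        Map<String, String> w = new LinkedHashMap<>();
        for (int i = 0; i + 1 < keyValues.length; i += 2) {
            w.put(keyValues[i], keyValues[i + 1]);
        }
        return w;
    }

    public static void main(String[] args) {
        Map<String, String> ran = word("lemma", "run", "pos", "VBD");
        System.out.println(matches("{lemma:run;pos:/VB.*/}", ran));
        System.out.println(matches("{}", ran));
    }
}
```

All attributes inside one pair of braces must hold at once, which is why the OR case needs the bracket grouping described later.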
Syntax - Nodes
• Nodes are represented as {attr1:value1;attr2:value2;…}
  {$;pos:/VB.*/} => any root verb
• Regular expressions can also be used to produce negations, via a negative lookahead
  {lemma:/^(?!boy$).*$/} => any word whose lemma isn’t "boy"
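The regular-expression negation relies on a negative lookahead, written (?!...) in Java regex syntax. The sketch below uses only java.util.regex to show how such a lookahead behaves on plain strings. The class and method names are made up for illustration, and the second pattern is included to show that leaving out the inner $ also rejects every word that merely starts with "boy".

```java
import java.util.regex.Pattern;

// Illustration with plain java.util.regex; names here are hypothetical.
class NegationRegexSketch {
    // (?!boy$) fails only when the rest of the string is exactly "boy".
    static final Pattern NOT_BOY = Pattern.compile("^(?!boy$).*$");
    // (?!boy) fails whenever the string starts with "boy".
    static final Pattern NOT_BOY_PREFIX = Pattern.compile("^(?!boy).*$");

    static boolean isNotBoy(String lemma) {
        return NOT_BOY.matcher(lemma).matches();
    }

    static boolean doesNotStartWithBoy(String lemma) {
        return NOT_BOY_PREFIX.matcher(lemma).matches();
    }

    public static void main(String[] args) {
        System.out.println(isNotBoy("boy"));              // false
        System.out.println(isNotBoy("girl"));             // true
        System.out.println(isNotBoy("boyhood"));          // true
        System.out.println(doesNotStartWithBoy("boyhood")); // false
    }
}
```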
Grouping Nodes
• Perhaps you want a node that is either a word with an ner TIME tag or a word with the lemma "when"
• The node {ner:TIME;lemma:when} does not accomplish this OR operation, because it requires both attributes at once
• Use brackets and | to specify these groupings
  [ {lemma:locate} | {ner:LOCATION} ]
  A node that is either a word with the lemma "locate" or a word with a LOCATION ner tag
• Groups can also be negated by putting a ! in front
Grouping Nodes
• To get the combination of a positive and a negative trait, you can use () and the == operator
  ( {word:Paris} == !{ner:LOCATION} )
  A node that has the text "Paris" without being a LOCATION
Syntax - Relations
• Relationships between nodes can be specified
• Relations in Semgrex have two parts, the relation symbol and the relation type, e.g. <nsubj
• A <reln B : A is the dependent of a reln relation with B
• A >reln B : A is the governor of a reln relation with B
• A <<reln B : There is some node in a dep->gov chain from A that is the dependent of a reln relation with B
• A >>reln B : There is some node in a gov->dep chain from A that is the governor of a reln relation with B
• A @ B : A is aligned to B through an Alignment object
• Relation types can be regular strings or regular expressions enclosed in "/"
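To make the direction of each relation symbol concrete, here is a toy model of a dependency graph for "Bob bought a red shirt". It is a sketch under stated assumptions, not how Semgrex is implemented: the class `RelationSketch` and its methods are hypothetical, nodes are identified by their word text (so words must be unique), relation types are compared as plain strings, and the chain for >> is assumed to include A itself.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy dependency graph (hypothetical, for illustration only).
class RelationSketch {

    /** One labeled edge: gov --reln--> dep. */
    static final class Edge {
        final String gov, reln, dep;
        Edge(String gov, String reln, String dep) {
            this.gov = gov;
            this.reln = reln;
            this.dep = dep;
        }
    }

    private final List<Edge> edges = new ArrayList<>();

    RelationSketch add(String gov, String reln, String dep) {
        edges.add(new Edge(gov, reln, dep));
        return this;
    }

    /** A >reln B : A is the governor of a reln edge whose dependent is B. */
    boolean governs(String a, String reln, String b) {
        for (Edge e : edges) {
            if (e.gov.equals(a) && e.reln.equals(reln) && e.dep.equals(b)) {
                return true;
            }
        }
        return false;
    }

    /** A <reln B : A is the dependent of a reln edge whose governor is B. */
    boolean dependsOn(String a, String reln, String b) {
        return governs(b, reln, a);
    }

    /** A >>reln B : A, or some node below A along gov->dep edges, governs B via reln. */
    boolean governsTransitively(String a, String reln, String b) {
        Deque<String> todo = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        todo.add(a);
        while (!todo.isEmpty()) {
            String node = todo.remove();
            if (!seen.add(node)) {
                continue; // already visited
            }
            if (governs(node, reln, b)) {
                return true;
            }
            for (Edge e : edges) {
                if (e.gov.equals(node)) {
                    todo.add(e.dep);
                }
            }
        }
        return false;
    }

    /** The graph for "Bob bought a red shirt". */
    static RelationSketch example() {
        return new RelationSketch()
                .add("bought", "nsubj", "Bob")
                .add("bought", "dobj", "shirt")
                .add("shirt", "det", "a")
                .add("shirt", "amod", "red");
    }

    public static void main(String[] args) {
        RelationSketch g = example();
        System.out.println(g.governs("bought", "dobj", "shirt"));
        System.out.println(g.dependsOn("red", "amod", "shirt"));
        System.out.println(g.governsTransitively("bought", "amod", "red"));
    }
}
```

In this model "bought" does not directly govern "red", but it does so transitively because "shirt" lies on the gov->dep chain between them.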
Building complex expressions
• Relations can be strung together for "and"
• All relations are relative to the first node in the string
  {} >nsubj {} >dobj {}
  "A node that is the governor of both an nsubj relation and a dobj relation"
• The & symbol is optional: {} >nsubj {} & >dobj {}
• Nodes can be grouped with parentheses
  {pos:NN} @ ({} <nsubj {})
  "A noun that is aligned to a node that is the dependent of an nsubj relation"
• Not the same as {pos:NN} @ {} <nsubj {}, where both relations apply to the first node: "A noun that is aligned to some node and is itself the dependent of an nsubj relation"
Other Operators on Relations
• Relations can be combined via "or" with |
  Ex: {} <agent {} | <nsubj {}
  "A node that is either an agent or an nsubj in the graph"
• As with nodes, & takes precedence over |
  Ex: {} <agent {} | <nsubj {} & >amod {lemma:red}
  "An agent node OR a subject modified by the word 'red'"
• Operators of equal precedence are left-associative
• Any relation can be negated with the "!" prefix
  Ex: {tag:/VB.*/} !@ {tag:/VB.*/}
  "A verb that is not aligned to another verb"
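The precedence rule can be read off by analogy with Java's boolean operators, where && also binds tighter than ||. The sketch below is only that analogy: the class and the three boolean parameters are hypothetical stand-ins for "is an agent", "is an nsubj", and "is modified by red", and nothing here calls Semgrex.

```java
// Boolean analogy for "& takes precedence over |" (hypothetical names).
class PrecedenceSketch {

    /** Mirrors  <agent {} | <nsubj {} & >amod {lemma:red}  as written. */
    static boolean asWritten(boolean isAgent, boolean isSubject, boolean modifiedByRed) {
        return isAgent || isSubject && modifiedByRed; // parsed as isAgent || (isSubject && modifiedByRed)
    }

    /** The other reading, which would need explicit grouping. */
    static boolean regrouped(boolean isAgent, boolean isSubject, boolean modifiedByRed) {
        return (isAgent || isSubject) && modifiedByRed;
    }

    public static void main(String[] args) {
        // An agent that is not modified by "red": the two readings disagree.
        System.out.println(asWritten(true, false, false));  // true
        System.out.println(regrouped(true, false, false));  // false
    }
}
```

The two readings differ exactly for an agent that is not modified by "red", which is the case the slide's gloss ("An agent node OR a subject modified by the word 'red'") accepts.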
Other Operators on Relations
• When a pattern is matched on a pair of graphs and their alignment, the default search point is the graph that the alignments come from
• To override this, place a '@' at the beginning of the pattern
Ex: for a hypGraph, a txtGraph and an alignment hyp->txt
  {ner:LOCATION} @ {}
  Represents all LOCATION nodes in the hypGraph that are aligned to nodes in the txtGraph
  @ {ner:LOCATION} @ {}
  Represents all LOCATION nodes in the txtGraph that are aligned to nodes in the hypGraph
Grouping relations
• To specify operation order, use [ and ]
  Ex: {tag:nn} [ <prep_in {} | <prep_on {} ] @ {#}
  "A noun that is the dependent of either a prep_in or prep_on relation and is aligned to NO_WORD"
• Grouped relations can be negated: just put ! before the [
Named Relations
• Suppose we want to find two nodes connected by any relation that have a pair of nodes aligned to them connected by the same relation
• Name relations with =
• The first occurrence of a named relation in a pattern is the one that is stored as the relation
  ({} >/.*subj|agent/=reln {}) @ ({} >=reln {})
• We can later retrieve the string form of the relation found in the graph by using that name
Named Nodes
• We can name nodes as well as relations
• Name nodes with =; if the node matches, we can retrieve the node by that name
  Ex: {} <nsubj {}=verb
  The verb with a subject found by this pattern is stored under the name "verb"
• The first occurrence of a named node in the pattern is the one stored under that name. All others must be equal to that first one
  Ex. ({} >nsubj {}=subject) @ ({} >nsubj ({} @ {}=subject))
  Finds a node that is both the governor of an nsubj relation to a node called "subject" and aligned to a node that is the governor of an nsubj relation to a node aligned to the node labeled "subject"
Optional Relations to Nodes
• Sometimes we want to try to match a sub-expression to retrieve named nodes if they exist, but still match if the sub-expression fails
• Use the optional relation prefix '?'
  Ex: {} >/nsubj|agent/ {}=subject ?>/.*obj/ {}=object
  Matches nodes that are governors of nsubj or agent relations
  If the node is also the governor of some sort of object relation, we can retrieve the object using the key "object"
  If there is no object, the expression will still match
• Cannot be combined with negation
• Can be used in front of bracketed relations: ?[….]
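A loose analogy from java.util.regex may help here: an optional group containing a named group still lets the overall match succeed, and the name simply has no value when the optional part is absent. The sketch below shows only that regex behavior on plain strings. The class name, the pattern, and the sentences are made up for illustration, and how Semgrex itself reports a missing optional node is not shown.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Regex analogy for an optional sub-expression with a named capture (hypothetical example).
class OptionalGroupSketch {
    // The " a/an <object>" part is optional; "subject" is always required.
    static final Pattern P =
            Pattern.compile("(?<subject>\\w+) bought(?: an? (?<object>\\w+))?");

    /** Returns {subject, object}; object is null if the optional part is absent; null if no match. */
    static String[] parse(String sentence) {
        Matcher m = P.matcher(sentence);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group("subject"), m.group("object") };
    }

    public static void main(String[] args) {
        String[] withObject = parse("Bob bought a shirt");
        System.out.println(withObject[0] + " / " + withObject[1]);
        String[] withoutObject = parse("Bob bought");
        System.out.println(withoutObject[0] + " / " + withoutObject[1]);
    }
}
```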
Use of Semgrex classes
• Semgrex usage is like java.util.regex
• Two ways of calling the matcher: for a single SemanticGraph, or for two SemanticGraphs and an Alignment between them
  String s = "({} >nsubj {}=subject) @ ({} >nsubj ({} @ {}=subject))";
  SemgrexPattern p = SemgrexPattern.compile(s);
  // Either match against a single SemanticGraph:
  //   SemgrexMatcher m = p.matcher(graph);
  // or against two graphs and their alignment (needed here, since the pattern uses @):
  SemgrexMatcher m = p.matcher(hypGraph, alignment, txtGraph);
  while (m.find()) {
    System.out.println(m.getMatch().word());
  }
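For readers less familiar with java.util.regex, the compile / matcher / find loop that Semgrex mirrors looks like this on plain strings. This is a standard-library sketch only; the class name, the pattern, and the group name "word" are chosen for illustration and have nothing to do with the Semgrex classes themselves.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The java.util.regex workflow that the Semgrex API is modeled on (illustrative names).
class RegexAnalogySketch {

    /** Collects every word ending in "ed", using a named group. */
    static List<String> wordsEndingInEd(String text) {
        Pattern p = Pattern.compile("\\b(?<word>\\w+ed)\\b"); // compile once
        Matcher m = p.matcher(text);                            // bind to an input
        List<String> found = new ArrayList<>();
        while (m.find()) {                                      // step through the matches
            found.add(m.group("word"));                         // retrieve by name
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(wordsEndingInEd("Bob walked home and cooked dinner"));
    }
}
```

The correspondence is Pattern.compile to SemgrexPattern.compile, Pattern.matcher to SemgrexPattern.matcher, and Matcher.find to SemgrexMatcher.find, with named groups playing the role of named nodes.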
Use of Semgrex classes
• Named nodes are retrieved w/ getNode()
• Named relations are retrieved w/ getRelnString()
  IndexedFeatureLabel subj = m.getNode("subject");
  String subjReln = m.getRelnString("subjReln");
A Real Code Example - Before
private void checkCopula(Problem problem, SemanticGraph hypGraph, SemanticGraph txtGraph) {
  IndexedFeatureLabel root = hypGraph.getFirstRoot();
  IndexedFeatureLabel subj = hypGraph.getChildWithReln(root, "nsubj");
  if (subj != null) {
    IndexedFeatureLabel alignedRoot = problem.getTxtWord(root);
    if (alignedRoot != IndexedFeatureLabel.NO_WORD) {
      IndexedFeatureLabel appos = txtGraph.getChildWithReln(alignedRoot, "appos");
      List<IndexedFeatureLabel> appositionList;
      try {
        appositionList = txtGraph.getChildrenWithReln(problem.getTxtWord(subj), "nn");
      } catch (IllegalArgumentException e) {
        appositionList = new ArrayList<IndexedFeatureLabel>();
      }
      if (appos != null) {
        if (problem.getTxtWord(subj).equals(appos)) {
          problem.addFeature(this, Feature.APPOSITION_MATCH, "apposition in text between " + root.word() + " and " + subj.word());
        } else {
          problem.addFeature(this, Feature.APPOSITION_MISMATCH, "no apposition in text between " + root.word() + " and " + subj.word());
        }
      } else if (!appositionList.isEmpty()) {
        boolean appositionPositiveFiring = false;
        for (IndexedFeatureLabel apposition : appositionList) {
          // Compare against each nn child; the original compared against appos, which is null in this branch
          if (alignedRoot.equals(apposition)) {
            problem.addFeature(this, Feature.APPOSITION_MATCH, "apposition in text between " + root.word() + " and " + subj.word());
            appositionPositiveFiring = true;
            break;
          }
        }
        if (!appositionPositiveFiring) {
          problem.addFeature(this, Feature.APPOSITION_MISMATCH, "no apposition in text between " + root.word() + " and " + subj.word());
        }
      }
    }
  }
}
A Real Code Example - After
private void checkCopula(Problem problem, SemanticGraph hypGraph, SemanticGraph txtGraph) {
  IndexedFeatureLabel root = hypGraph.getFirstRoot();
  if (checkAttributiveStructure(hypGraph) && !checkAttributiveStructure(txtGraph)) {
    if (VERBOSE) System.err.println("in check copula");
    SemgrexPattern copulaPat = SemgrexPattern.compile("({}=subj <nsubj ({$}=root @ {}=alignedRoot)) @ ({} [ [>nn {}=alignedRoot] | [<appos {}=alignedRoot]])");
    SemgrexMatcher copulaMatcher = copulaPat.matcher(hypGraph, problem.getAlignment(), txtGraph);
    if (copulaMatcher.find()) {
      problem.addFeature(this, Feature.APPOSITION_MATCH, "apposition in text between " + copulaMatcher.getNode("root").word() + " and " + copulaMatcher.getNode("subj").word());
    } else {
      problem.addFeature(this, Feature.APPOSITION_MISMATCH, "no apposition in text between " + copulaMatcher.getNode("root").word() + " and " + copulaMatcher.getNode("subj").word());
    }
  }
}
For More Help…
• More information and links to other sources of documentation are available at nlp.stanford.edu/software/tregex.shtml
• If you find a bug (i.e. a pattern that should work but doesn’t) or need more help, email java-nlp-user@lists.stanford.edu