Grammar Extraction and Refinement from an HPSG Corpus. Kiril Simov BulTreeBank Project (www.BulTreeBank.org) Linguistic Modeling Laboratory, Bulgarian Academy of Sciences kivs@bultreebank.org ESSLLI'2002 Workshop on Machine Learning Approaches in Computational Linguistics
Grammar Extraction and Refinement from an HPSG Corpus Kiril Simov BulTreeBank Project (www.BulTreeBank.org) Linguistic Modeling Laboratory, Bulgarian Academy of Sciences kivs@bultreebank.org ESSLLI'2002 Workshop on Machine Learning Approaches in Computational Linguistics August 5 - 9, 2002
Plan of the Talk • DOP model • An HPSG Corpus: definition • Formalism for HPSG • Extraction of an HPSG Grammar from an HPSG Corpus • Refinement of an HPSG grammar • Conclusion
DOP Model [Bod 1998] • Grammar formalism for the target grammar • Procedure for the construction of sentence analyses in the chosen grammar formalism • Decomposition procedure, which extracts a grammar in the target grammar formalism from the structures in the corpus • A performance model guiding the analysis of new sentences with respect to some desirable conditions
DOP Model (2) Two additional unspoken assumptions: • The structures in the corpus are decomposable into the grammar formalism • The extracted grammar should neither overgenerate nor undergenerate with respect to the training corpus. This assumption refers to the quality of the corpus.
Corpus in a Grammar Formalism A corpus C in a given grammatical formalism G is a sequence of analyzed sentences, where each analyzed sentence is a member of the set of structures defined as the strong generative capacity (SGC) of a grammar Γ in this grammatical formalism: ∀S. S ∈ C → S ∈ SGC(Γ) and ∀S. S ∈ C → ∀S′. (S′ ∈ …(…(S)) → S′ ∈ C)
HPSG Corpus Strong Generative Capacity in HPSG is defined by (King 1999) and (Pollard 1999) in technically the same way. In our work we consider the elements of the Strong Generative Capacity in HPSG to be a special kind of feature graphs based on a logic for HPSG: King’s logic, SRL (Speciate Re-entrant Logic).
Feature Graphs (1) Let ⟨S, F, A⟩ be a finite SRL signature. G = ⟨N, V, ρ, T⟩ is a feature graph iff G is a directed, connected and rooted graph such that: • N is a set of nodes • V : N × F → N is a partial arc function • ρ is the root node • T : N → S is a total species assignment function
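The definition above can be made concrete with a small data structure. The following is a minimal sketch, not the talk's implementation: the class name `FeatureGraph`, the function `is_feature_graph`, and the toy species and feature names (`word`, `noun`, `HEAD`, `CASE`) are assumptions made for illustration. "Connected and rooted" is read here as "every node is reachable from the root", and the appropriateness function A of the signature is not checked.

```python
from dataclasses import dataclass


@dataclass
class FeatureGraph:
    nodes: set      # N: node identifiers
    arcs: dict      # V: partial function (node, feature) -> node
    root: str       # rho: the root node
    species: dict   # T: total function node -> species


def is_feature_graph(g: FeatureGraph) -> bool:
    """Check the structural conditions of the definition (assumed reading)."""
    # The root must be one of the nodes.
    if g.root not in g.nodes:
        return False
    # T must be total: exactly the nodes of the graph carry a species.
    if set(g.species) != g.nodes:
        return False
    # Arcs must start and end inside N. Storing V as a dict already
    # guarantees at most one target per (node, feature) pair.
    if any(src not in g.nodes or dst not in g.nodes
           for (src, _feature), dst in g.arcs.items()):
        return False
    # Rooted and connected: every node is reachable from the root.
    reached, stack = {g.root}, [g.root]
    while stack:
        current = stack.pop()
        for (src, _feature), dst in g.arcs.items():
            if src == current and dst not in reached:
                reached.add(dst)
                stack.append(dst)
    return reached == g.nodes


# Hypothetical toy graph: a word whose HEAD is a noun with nominative CASE.
toy = FeatureGraph(
    nodes={"n0", "n1", "n2"},
    arcs={("n0", "HEAD"): "n1", ("n1", "CASE"): "n2"},
    root="n0",
    species={"n0": "word", "n1": "noun", "n2": "nom"},
)

# A node that cannot be reached from the root violates the definition.
disconnected = FeatureGraph(
    nodes={"n0", "n1"},
    arcs={},
    root="n0",
    species={"n0": "word", "n1": "noun"},
)
```

Here `is_feature_graph(toy)` holds, while `is_feature_graph(disconnected)` fails because `n1` is not reachable from the root.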
Feature Graphs (2) Some notions: • Subsumption, based on isomorphism • Unification: there is no most general unifier • Complete feature graphs: all information from the signature is present • Paths • Subgraphs
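Subsumption can be sketched in code under one plausible reading of "based on isomorphism": a graph subsumes another if it is isomorphic to a subgraph of the other that starts at the other's root, with species and arcs preserved. This reading, the names `Graph` and `subsumes`, and the toy graphs are assumptions for illustration, not the talk's formal definition.

```python
from typing import NamedTuple


class Graph(NamedTuple):
    root: str
    species: dict   # node -> species
    arcs: dict      # (node, feature) -> node


def subsumes(general: Graph, specific: Graph) -> bool:
    """True if `general` maps one-to-one onto a root-anchored part of `specific`."""
    mapping = {general.root: specific.root}
    stack = [general.root]
    while stack:
        node = stack.pop()
        image = mapping[node]
        # Species must agree on corresponding nodes.
        if general.species[node] != specific.species[image]:
            return False
        for (src, feature), dst in general.arcs.items():
            if src != node:
                continue
            # Every arc of the general graph must exist in the specific one.
            target = specific.arcs.get((image, feature))
            if target is None:
                return False
            if dst in mapping:
                # A node reached along two paths must have a single image.
                if mapping[dst] != target:
                    return False
            else:
                mapping[dst] = target
                stack.append(dst)
    # Assumed isomorphism condition: distinct nodes have distinct images.
    return len(set(mapping.values())) == len(mapping)


# Hypothetical toy graphs.
specific = Graph(
    root="s0",
    species={"s0": "word", "s1": "noun", "s2": "nom"},
    arcs={("s0", "HEAD"): "s1", ("s1", "CASE"): "s2"},
)
general = Graph(
    root="g0",
    species={"g0": "word", "g1": "noun"},
    arcs={("g0", "HEAD"): "g1"},
)
mismatch = Graph(
    root="m0",
    species={"m0": "word", "m1": "verb"},
    arcs={("m0", "HEAD"): "m1"},
)
```

With these graphs, `subsumes(general, specific)` holds, `subsumes(specific, general)` fails because the CASE arc is missing in `general`, and `subsumes(mismatch, specific)` fails on the species of the HEAD value. Because the arc function is a function and the graph is rooted, the mapping is fully determined by following arcs from the root, so no search is needed.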
Feature Graphs (3) • Feature graphs can be interpreted via translation to SRL clauses • Exclusive matrices can be represented as a set of feature graphs (an exclusive set of graphs) • A finite SRL theory with respect to a finite SRL signature can be represented as a set of feature graphs • A sentence analysis can be represented as a complete feature graph
Feature Graphs (4) • Complete feature graphs are a good representation for an HPSG corpus • Feature graphs are a good representation for an HPSG grammar (an exclusive set of graphs) Important property: for each node in a graph in the corpus, there exists exactly one graph in the grammar that subsumes the subgraph rooted at that node
Corpus Grammar A grammar Γ such that the corpus C is a subset of its strong generative capacity is called a corpus grammar: C ⊆ SGC(Γ). In feature graph terms: for each complete graph in the corpus, the grammar contains a graph which subsumes it
Grammar Extraction (1) Grammar extraction from an HPSG corpus C is a graph fragmentation operation which produces a set of graphs from which a grammar can be constructed. The result is a set of graphs, GF. Each extracted fragment has to • contain all features for the root node, and • subsume at least one complete graph in the corpus
Grammar Extraction (2) The set GF is ordered by the subsumption relation. The complete graphs from the corpus are at the bottom. Any set of graphs G that contains only graphs from GF, and in which each complete graph M in GF is subsumed by at least one feature graph, is a corpus grammar of C
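The condition on G can be illustrated with a brute-force enumeration. This is a sketch under stated assumptions, not the extraction procedure from the talk: fragments are represented only by hypothetical names (`f_top`, `f_np`, `m1`, `m2`), the subsumption relation is given as an explicit set of pairs rather than computed from graphs, and the functions `is_corpus_grammar` and `corpus_grammars` are illustrative.

```python
from itertools import combinations


def is_corpus_grammar(candidate, complete_graphs, subsumes):
    """Every complete graph must be subsumed by some graph in the candidate."""
    return all(any(subsumes(g, m) for g in candidate) for m in complete_graphs)


def corpus_grammars(fragments, complete_graphs, subsumes):
    """Enumerate all non-empty subsets of GF that are corpus grammars.

    Exponential in the size of GF, so only usable on toy inputs.
    """
    found = []
    for size in range(1, len(fragments) + 1):
        for candidate in combinations(fragments, size):
            if is_corpus_grammar(candidate, complete_graphs, subsumes):
                found.append(frozenset(candidate))
    return found


# Hypothetical GF: two proper fragments plus the two complete graphs
# m1 and m2, which sit at the bottom of the subsumption order.
fragments = ["f_top", "f_np", "m1", "m2"]
complete_graphs = ["m1", "m2"]

# Assumed subsumption relation, given extensionally as (general, specific).
relation = {
    ("f_top", "m1"), ("f_top", "m2"),
    ("f_np", "m1"),
    ("m1", "m1"), ("m2", "m2"),
}


def toy_subsumes(general, specific):
    return (general, specific) in relation


grammars = corpus_grammars(fragments, complete_graphs, toy_subsumes)
```

In this toy setting `{f_top}` alone is a corpus grammar because it subsumes both complete graphs, whereas `{f_np}` is not because nothing in it subsumes `m2`. The set `{m1, m2}` of complete graphs is also a corpus grammar, the most specific one.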
Grammar Extraction (3) All corpus grammars extracted in this way can be ordered by set inclusion of their strong generative capacity. A grammar from this hierarchy can be chosen by specifying additional constraints over it, such as: • it is the most general one that doesn’t overgenerate or undergenerate over the corpus, or • it satisfies some external condition, such as the shortest inference over the corpus
The set GF as a Grammar This is the original idea behind the DOP model • GF contains all generalizations over the corpus • GF will overgenerate over the corpus • GF will accept ungrammatical sentences Thus a special inference mechanism is necessary in order to use GF as a grammar
Grammar Refinement During the creation of an HPSG corpus, the annotators use an HPSG grammar. This grammar can serve as a starting point for the extraction of a better grammar. This process is called Grammar Refinement. As the new grammar, we can choose among the most general grammars that refine the original grammar
Conclusions • We define an HPSG corpus as a set of complete graphs • We define an HPSG grammar as a set of graphs • We define a procedure for extraction of corpus grammars from the corpus • We define a refinement of a grammar on the basis of a corpus