Inferring Context-Free Grammars for Domain-Specific Languages
Matej Črepinšek, Marjan Mernik, University of Maribor, Slovenia
Barrett R. Bryant, Faizan Javed, Alan Sprague, The University of Alabama at Birmingham, USA
FACULTY OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE, UNIVERSITY OF MARIBOR
Outline of the Presentation
• Motivation
• Related Work
• Inferring CFG for DSLs
• Results
• Conclusion
Motivation
• Machine learning of grammars finds many applications in
  • syntactic pattern recognition,
  • computational biology,
  • computational linguistics, etc.
• Can grammatical inference also be useful in software engineering?
Motivation
• Software engineers would like to recover grammars from legacy systems in order to automatically generate various software analysis and modification tools.
• Currently, this cannot be done for a real GPL (e.g., COBOL) using grammatical inference.
• A grammar can be semi-automatically recovered from compilers and language manuals [R. Laemmel, C. Verhoef. Semi-automatic Grammar Recovery. SP&E, Vol. 31, No. 15, 2001].
Motivation
• What about grammar inference for DSLs (e.g., FDL, VHDL)?
• car: all( carBody, Transmission, Engine )
  Transmission: one-of( automatic, manual )
  Engine: more-of( electric, gasoline )
• entity HALFADDER is
    port( A, B: in bit; SUM, CARRY: out bit );
  end HALFADDER;
• Currently, experiments have been performed on theoretical sample languages only, such as L = { ww | w ∈ {a,b}+ } and L = { w | w = wR, w ∈ {a,b}+ }.
Motivation
• Grammars are found in many applications outside language definition and implementation.
• Grammar-based systems (GBSs) [M. Mernik, M. Črepinšek, T. Kosar, D. Rebernak, V. Žumer. Grammar-based Systems: Definition and Examples. Informatica, 28(3):245-254, 2004]
• In these cases, the grammar needs to be extracted solely from artifacts represented as sentences/programs written in some unknown language.
Motivation
• Metamodel
• Model – an instance of the Metamodel
Motivation
VideoStore ::= MOVIES CUSTOMERS
MOVIES ::= MOVIES MOVIE | MOVIE
MOVIE ::= title type
CUSTOMERS ::= CUSTOMERS CUSTOMER | CUSTOMER
CUSTOMER ::= name days RENTALS
RENTALS ::= RENTALS RENTAL | RENTAL
RENTAL ::= MOVIE
• TheRing reg Andy 3 TheRing reg
• TheRing reg Shrek2 child Ann 1 Shrek2 child
Motivation
• TheRing reg Andy 3 TheRing reg
• TheRing reg Shrek2 child Ann 1 Shrek2 child
NT15 ::= NT11 NT7 NT15 |
NT11 ::= NT10 NT6
NT10 ::= NT5 NT10 |
NT7 ::= NT5 NT7 |
NT6 ::= name days
NT5 ::= title type
[Figure: recovered metamodel with 1..* multiplicities over NT11, NT5 (title type) and NT6 (name days)]
Related Work
• Gold's theorem: it is impossible to identify any of the four classes of languages in the Chomsky hierarchy using only positive samples.
• Positive and negative samples are needed.
• So far, grammar inference has been mainly successful in inferring regular languages.
Related Work (Regular Grammars)
A number of algorithms (e.g., RPNI) first construct the maximal canonical automaton (MCA(S+)) or prefix tree acceptor (PTA(S+)) from positive samples, and then generalize the automaton using a state-merging process.
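As a rough illustration of this step (not code from the paper), the sketch below builds a prefix tree acceptor from tokenized positive samples; the dict-of-dicts representation and the function name build_pta are assumptions made for the example.

```python
# Hedged sketch: building a prefix tree acceptor (PTA) from positive samples.
# The transition-table representation is an illustrative choice, not the one
# used in the paper or in existing RPNI implementations.

def build_pta(positive_samples):
    """Return (transitions, accepting) for a prefix tree acceptor."""
    transitions = {0: {}}          # state -> {symbol: next_state}
    accepting = set()
    next_state = 1
    for sample in positive_samples:
        state = 0
        for symbol in sample:
            if symbol not in transitions[state]:
                transitions[state][symbol] = next_state
                transitions[next_state] = {}
                next_state += 1
            state = transitions[state][symbol]
        accepting.add(state)       # the end of each positive sample is accepting
    return transitions, accepting

# Example: two positive samples over {a, b}.
pta, finals = build_pta([["a", "b"], ["a", "a", "b"]])
# A state-merging algorithm such as RPNI would then try to merge PTA states,
# rejecting any merge that makes the automaton accept a negative sample.
```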
Related Work (Regular Grammars)
The following equation enumerates the search space:
Related Work (CF Grammars)
• Learning context-free grammars G = (V, T, P, S) is more difficult than learning regular grammars.
• Using representative positive samples (that is, positive samples which exercise every production rule in the grammar) along with negative samples did not result in the same level of success as with regular grammar inference.
Related Work (CF Grammars)
• Hence, some researchers resorted to using additional knowledge to assist in the induction process (e.g., skeleton derivation trees, i.e., unlabelled derivation trees).
[Figure: a skeleton (unlabelled) derivation tree for the sentence num + num + num]
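For illustration only (not taken from the slide), a skeleton derivation tree can be encoded as a nested structure whose interior nodes carry no non-terminal labels; the tuple encoding below is an assumed representation.

```python
# Hedged illustration: an unlabelled (skeleton) derivation tree for "num + num + num".
# Interior nodes are plain tuples; only the frontier carries terminal symbols.
skeleton = (("num", "+", "num"), "+", "num")

def frontier(tree):
    """Read the terminal string back off a skeleton tree."""
    if isinstance(tree, str):
        return [tree]
    return [t for child in tree for t in frontier(child)]

assert frontier(skeleton) == ["num", "+", "num", "+", "num"]
```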
Inferring CFG
• What is the search space in the case of CFG inference?
• If we limit ourselves to binary trees (CNF), then the number of all possible unlabelled derivation trees is given by a Catalan number:
  C_n = (1/(n+1)) (2n choose n), with n = l - 1 for a sentence of length l
Inferring CFG
• For example, there are 14 different full binary trees when l = 5 ...
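A quick way to check these counts, assuming the standard closed form for Catalan numbers (the formula itself is only referenced, not reproduced, in this transcript):

```python
from math import comb

def catalan(n):
    """n-th Catalan number: the number of full binary trees with n + 1 leaves."""
    return comb(2 * n, n) // (n + 1)

# A sentence of length l has catalan(l - 1) unlabelled full binary derivation trees.
assert catalan(5 - 1) == 14   # matches the 14 trees mentioned for l = 5
```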
Inferring CFG
• For full binary trees to be valid derivation trees, the interior nodes need to be labelled with non-terminals.
Inferring CFG Search space of context-free grammar inference
Inferring CFG
• For effective use of an evolutionary algorithm we have to choose a suitable representation of the problem, suitable parameters and genetic operators, and an evaluation function to determine the fitness of chromosomes.
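A minimal sketch of the kind of generational loop such an evolutionary algorithm runs; it is not the actual implementation, and the operators (evaluate, crossover, mutate) are passed in as placeholders because the paper's grammar-specific operators are only summarized on the following slides.

```python
import random

# Hedged sketch of a generational loop over candidate grammars.  The concrete
# operators are supplied by the caller; all names here are illustrative only.

def evolve(initial_grammars, evaluate, crossover, mutate,
           generations=50, p_mutate=0.1):
    population = list(initial_grammars)
    for _ in range(generations):
        # Rank the current population by fitness (higher is better).
        ranked = sorted(population, key=evaluate, reverse=True)
        parents = ranked[: max(2, len(ranked) // 2)]
        offspring = []
        while len(offspring) < len(population):
            a, b = random.sample(parents, 2)
            child = crossover(a, b)
            if random.random() < p_mutate:
                child = mutate(child)
            offspring.append(child)
        population = offspring
    return max(population, key=evaluate)
```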
Inferring CFG
[Figure: crossover and mutation applied to candidate grammars, shown as derivation trees over E, T, operator and int, with marked crossover and mutation points]
Inferring CFG
To enhance the search, the following heuristic operators have been proposed (a sketch of the option operator follows below):
• option operator,
• iteration* operator, and
• iteration+ operator.
[Figure: the option operator applied at an option point in a derivation tree over E, T, F, operator and int]
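As a hedged illustration of what an option-style operator could do, the sketch below adds an alternative in which one symbol occurrence is omitted; the grammar representation and the function apply_option are assumptions, not the paper's definition.

```python
# Hedged illustration of an "option"-style rewrite on a grammar stored as a
# dict from non-terminal to a list of alternatives (each a list of symbols).
# This is an interpretation of the operator's intent, not the paper's code.

def apply_option(grammar, lhs, alt_index, symbol_index):
    """Make one symbol occurrence optional by adding an alternative without it."""
    alternative = grammar[lhs][alt_index]
    shortened = alternative[:symbol_index] + alternative[symbol_index + 1:]
    if shortened not in grammar[lhs]:
        grammar[lhs] = grammar[lhs] + [shortened]
    return grammar

g = {"E": [["E", "operator", "T"]], "T": [["int"]]}
apply_option(g, "E", 0, 1)   # "operator" becomes optional in E ::= E operator T
# g is now {"E": [["E", "operator", "T"], ["E", "T"]], "T": [["int"]]}
```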
Inferring CFG
[Figure: the inference loop. For each grammar in the population, the LISA compiler generator produces a parser; the parser is run on the fitness cases (positive and negative samples); the successfulness of parsing yields a fitness value, which drives selection, crossover and mutation of the test grammars.]
Inferring CFG
For the given grammar[i], its fitness f_j(grammar[i]) on the j-th fitness case is defined as:
  f_j(grammar[i]) = length(successfully parsed program_j) / (length(program_j) * 2)
Finally, the total fitness f(grammar[i]) is defined as:
  f(grammar[i]) = ( Σ_{k=1}^{N} f_k(grammar[i]) ) / N
Inferring CFG
If a grammar correctly recognizes all positive samples, then it is also tested on negative samples. Its fitness value is then defined as:
  f(grammar[i]) = 1.0 - m / (M * 2)
where
  m = number of fully parsed negative samples
  M = number of all negative samples
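A sketch of the fitness computation just described, under the assumption that a hypothetical helper parse_prefix_length reports how many characters of a program the generated parser consumed (its full length when parsing succeeds):

```python
def grammar_fitness(parse_prefix_length, grammar, positives, negatives):
    """Hedged sketch of the fitness formulas above; parse_prefix_length is a
    hypothetical stand-in for running the generated parser on one program."""
    # f_j = length(successfully parsed program_j) / (length(program_j) * 2)
    per_case = [parse_prefix_length(grammar, p) / (2 * len(p)) for p in positives]
    f = sum(per_case) / len(per_case)
    if all(parse_prefix_length(grammar, p) == len(p) for p in positives):
        # All positive samples accepted: switch to the negative-sample fitness.
        m = sum(1 for n in negatives if parse_prefix_length(grammar, n) == len(n))
        f = 1.0 - m / (2 * len(negatives))
    return f
```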
Inferring CFG
• The initial population should not be completely randomly generated.
[Figure: derivation tree for the sample "#id := #int + #int" with interior nodes NT1..NT8]
NT8 -> NT7 NT1
NT7 -> NT5 NT6
NT6 -> NT1 NT3
NT5 -> NT4 NT2
NT4 -> #id
NT3 -> +
NT2 -> :=
NT1 -> #int
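One possible way to seed such an initial grammar from a tokenized sample, in the spirit of the NT1..NT8 productions above, is sketched below; the pairwise combination strategy and the naming are illustrative assumptions rather than the paper's exact procedure.

```python
# Hedged sketch of seeding a grammar from one tokenized sample: each terminal
# gets its own non-terminal, and adjacent non-terminals are combined pairwise
# up to a single start symbol.

def seed_grammar(tokens):
    productions = {}
    counter = 0
    layer = []
    for tok in tokens:                       # one non-terminal per terminal
        counter += 1
        nt = f"NT{counter}"
        productions[nt] = [[tok]]
        layer.append(nt)
    while len(layer) > 1:                    # combine pairwise, left to right
        nxt = []
        for i in range(0, len(layer) - 1, 2):
            counter += 1
            nt = f"NT{counter}"
            productions[nt] = [[layer[i], layer[i + 1]]]
            nxt.append(nt)
        if len(layer) % 2:                   # carry an odd leftover upwards
            nxt.append(layer[-1])
        layer = nxt
    return layer[0], productions             # (start symbol, productions)

start, prods = seed_grammar(["#id", ":=", "#int", "+", "#int"])
```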
Inferring CFG
• Identify sub-languages and construct derivation trees for sub-programs first. But this is as hard as the original problem.
• We can use an approximation: frequent sequences.
• A string of symbols is called a frequent sequence if it appears at least θ times, where θ is some preset threshold.
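A simple approximation of frequent-sequence detection, assuming the samples are already tokenized; the fixed-length n-gram counting below is an illustrative simplification, not the paper's exact procedure.

```python
from collections import Counter

def frequent_sequences(samples, length, threshold):
    """Return token subsequences of the given length that occur at least
    `threshold` times across the tokenized samples (hedged sketch only)."""
    counts = Counter()
    for tokens in samples:
        for i in range(len(tokens) - length + 1):
            counts[tuple(tokens[i:i + length])] += 1
    return [seq for seq, c in counts.items() if c >= threshold]

# Example: "TheRing reg" appears twice in the first video-store sample above.
samples = [["TheRing", "reg", "Andy", "3", "TheRing", "reg"]]
print(frequent_sequences(samples, 2, 2))   # [('TheRing', 'reg')]
```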
Inferring CFG • GIE-BF tool
Results
• Using the presented approach, we were able to infer grammars for small DSLs (Table 2 in the paper).
• An example of positive/negative samples and control parameters (Table 3 in the paper).
• Comparison of the inferred and original grammars (Tables 4 and 5 in the paper).
Conclusion
• Ongoing research work on context-free grammar inference was presented.
• So far, we have been able to infer grammars for DSLs which are bigger in size and more pragmatic than those in other research efforts.
• We are convinced that this approach, when enhanced with other data mining techniques and heuristics, is scalable and feasible for inferring grammars of more realistically sized languages.
Thank you! http://www.cis.uab.edu/softcom/GenParse