How to remove ambiguity

How to simplify all kinds of ambiguous grammars.

Prerequisites: the concepts of grammar, language and parse tree.

Note: The transformation of the grammar is the first (optional) step for the creation of a parser of every kind: LL or LR or LALR. This will solve the ambiguity for LR and LALR but for the LL parser you’ll still need to remove epsilon production, remove left recursion and remove left factorization.

Teaching tool for all kinds of parser: jsmachines.

In this post only the transformation will be treated but if you’re writing a parser this post has some insight.

As you all know a context free grammar ( CFG ) is ambiguous if there is a string in the language that can be generated by 2 or more, parse trees. A CFG is unambiguous if every string generated has exactly one parse tree.

The parser generators are able to handle only some type of ambiguity, in particular the ambiguity generated by the priority and associativity of the operators. An example of the associativity problem is:

Where for this simple grammar, the string INT + INT + INT generate 2 parse tree.

Even if this kind of ambiguity is handled by parser generator, I’m going to show you how to create an equivalent unambiguous grammar. Equivalent means that the new grammar will generate the same language as the old one. It’s often the case that a string in the new grammar, will have a different parse tree than the one in the old grammar.

The transformation is composed of 3 steps:

  • Find where the ambiguity is.
  • Decide which parse tree is the correct one.
  • Create an equivalent unambiguous grammar.

Example left associativity

  • Exp → Exp + INT — — for enforcing the left associativity.
  • Exp → INT — — for producing the string INT .

Example right associativity

The correct tree is the right one. The rewritten grammar is:

  • Exp → INT ^ Exp — — for enforcing the right associativity.
  • Exp → INT — — for producing the string INT .

Example no associativity

They are both incorrect. Since this is a toy example we could rewrite the grammar as Exp → INT< INT . In a normal case we need the application of the rule only once, so we need to introduce a new non terminal; for example:

  • Exp → Exp’ < Exp’
  • Exp’ → Exp’ + INT | INT

Example priority (precedence)

In this case we have to introduce a new non-terminal. It’s important to remember that the tree is evaluated bottom up, so the priority in the grammar is inverted, in the first rule there is the lowest priority.

  • Exp → Exp + Term | Term
  • Term → Term * INT | INT

Both the operators are left associative.

CFG for arithmetic

Exp → Exp * Exp | Exp / Exp | Exp + Exp | Exp - Exp |Exp ^ Exp | (Exp) | -Exp | INT

The order precedence from the lowest to the highest is : + - then * / then ^ then unary minus and (). Also the operator power has right associativity, the unary minus has no associativity, all the others have left associativity. So the unambiguous grammar is (the names are changed for readability):

  • A → A+ B| A - B | B
  • B → B * C | B / C | C
  • C → D ^ C | -D | D
  • D → ( A ) | INT

CFG for boolean

The normal precedence order is: then then ¬ then () . The operators (and, or) are left associative. So the unambiguous grammar is:

  • A → A ∨ B| B
  • B → B ∧ C | C
  • C → ¬ D | D
  • D → ( A ) | atom

Dangling else problem

  • stmt → if expr then stmt | if expr then stmt else stmt | other

The other represent the various statements of the grammar that are not relevant on this issue.

A string that generates a conflict is: if E₁ then if E₂ then S₁ else S₂ .

Where the Eᵢ represent a boolean expression and the Sᵢ represent a generic statement of the programming language. The two parse tree generated by the string:

In all programming languages, the first parse tree is the correct one. The idea behind it, is “Match each else with the closest unmatched then” . Following this rule we want to prevent the situation of the second tree: the statement appearing between a then and an else must be “matched”, expressed in an other way, the interior statement must not end with an unmatched (open) then. A matched statement is either an if-then-else containing matched statements or it’s any other kind of unconditional statement ( the not considered statement external to the grammar).

So in the new grammar we will have matched statements and unmatched statements. To clarify the situation, it’s possible to list all the allowed productions:

  • if expr then matched_stmt
  • if expr then open_stmt
  • if expr then matched_stmt else matched_stmt
  • if expr then matched_stmt else open_stmt

The productions not allowed are:

  • if expr then open_stmt else matched_stmt
  • if expr then open_stmt else open_stmt

So now we can create the rules:

  • matched_stmt → if expr then matched_stmt else matched_stmt | other
  • open_stmt → if expr then matched_stmt |if expr then open_stmt | if expr then matched_stmt else open_stmt

And the first rule of the grammar is the one that allow us to create both kind of statements:

  • stmt → open_stmt | matched_stmt

It’s possible to simplify the open_stmt rule observing that it’s possible to merge the first two production so the grammar becomes:

  • stmt → open_stmt | matched_stmt
  • matched_stmt → if expr then matched_stmt else matched_stmt | other
  • open_stmt → if expr then stmt | if expr then matched_stmt else open_stmt

That’s all, I hope this post has been useful.

Too lazy to open up a blog