# How to remove ambiguity

How to simplify all kinds of ambiguous grammars.

Prerequisites: the concepts of grammar, language and parse tree.

**Note: **The transformation of the grammar is the first (optional) step for the creation of a parser of every kind: LL or LR or LALR. This will solve the ambiguity for LR and LALR, but for the LL parser, you’ll still need to remove epsilon production, remove left recursion and remove left factorization.

Teaching tool for all kinds of parsers: jsmachines.

In this post, only the transformation will be treated, but if you’re writing a parser, this article should provide some insight.

## Table of content:

As you all know a context free grammar ( CFG ) is ambiguous if there is a string in the language that can be generated by two or more, parse trees. A CFG is unambiguous if every string generated has exactly one parse tree.

The parser generators can handle *only some* type of ambiguity, in particular the ambiguity generated by the priority and associativity of the operators. An example of the associativity problem is:

For this simple grammar, the string INT + INT + INT generates two parse trees.

Even if this kind of ambiguity is handled by a parser generator, I’m going to show you how to create an equivalent unambiguous grammar. Equivalent means that the new grammar will generate the same language as the old one. **It’s often the case that a string in the new grammar will have a different parse tree than the one in the old grammar.**

The transformation is composed of 3 steps:

- Find where the ambiguity is.
- Decide which parse tree is the correct one.
- Create an equivalent unambiguous grammar.

## Example left associativity

The grammar in the picture above is *Exp → Exp + Exp | INT . *We find where the ambiguity is by looking at the parse tree, and based on the left associativity rule, the right tree is the left one. This is because we want to sum the expression from left to right, as 1+2+3 = (1+2+3), and since the evaluation of the result from the tree is bottom-up, the left tree represents the calculation. We need to allow the first *Exp *in the production and block the second *Exp *with the introduction of a terminal. The rewritten grammar is:

*Exp → Exp + INT — —*for enforcing the left associativity.*Exp → INT — —*for producing the string*INT .*

## Example right associativity

Another example from algebra is the right associativity of the power operator: 1^2^3 = 1^(2^3) . The basic grammar is: *Exp → Exp ^ Exp | INT .*

The correct tree is the right one. The rewritten grammar is:

*Exp → INT ^ Exp — —*for enforcing the right associativity.*Exp → INT — —*for producing the string*INT .*

## Example no associativity

If we want to write a simple inequality for the *less than* the operator, there is no associativity since it’s allowed only a single operator. So a string like *INT < INT < INT* is not permitted. Below is the parse tree for the string in the grammar *Exp → Exp < Exp | INT *.

They are both incorrect. Since this is a toy example, we could rewrite the grammar as *Exp → INT< INT *. In a typical case, we need the application of the rule only once, so we need to introduce a new non-terminal; for example:

*Exp → Exp’ < Exp’**Exp’ → Exp’ + INT | INT*

## Example priority (precedence)

This is the case where some symbols have precedence over others, such as multiplication above sum. From the simple grammar: *Exp → Exp * Exp | Exp + Exp | INT *, it’s possible to find a string that produces two trees as: *INT + INT * INT *. But we all know that the product has precedence, so the correct tree is the right one.

In this case, we have to introduce a new non-terminal. It’s important to remember that the tree is evaluated bottom-up, so the priority in the grammar is inverted; in the first rule, there is the lowest priority.

*Exp → Exp + Term | Term**Term → Term * INT | INT*

Both the operators are left associative.

## CFG for arithmetic

The grammar for writing simple expressions is a classic in textbooks since it shows a mix of the cases above.

*Exp → Exp * Exp | Exp / Exp | Exp + Exp | Exp - Exp |Exp ^ Exp | (Exp) | -Exp | INT*

The order precedence from the lowest to the highest is:* + -* then ** / *then* *^ then unary minus and *()*. Also, the operator power has right associativity, while the unary minus has no associativity; all the others have left associativity. So the unambiguous grammar is (the names are changed for readability):

*A → A+ B| A - B | B**B → B * C | B / C | C**C → D ^ C | -D | D**D → ( A ) | INT*

## CFG for boolean

*F → F ∧ F | F ∨ F | ¬ F | ( F ) | atom*

The standard precedence order is: *∨ *then *∧ *then* ¬ *then* () . *The operators (and, or) are left associative. So the unambiguous grammar is:

*A → A ∨ B| B**B → B ∧ C | C*- C
*→ ¬ D | D* - D
*→ ( A ) | atom*

## Dangling else problem

This is the problem of determining, in a series of **if **terminating with an **else**, to which **if** the **else** is referring. The ambiguous grammar is:

*stmt → if expr then stmt | if expr then stmt else stmt | other*

The *other *represents the various statements of the grammar that are not relevant to this issue.

A string that generates a conflict is: *if E₁ then if E₂ then S₁ else S₂ .*

Where the *Eᵢ* represents a boolean expression and the *Sᵢ *represents a generic statement of the programming language. The two parse trees generated by the string:

In all programming languages, the first parse tree is the correct one. The idea behind it is “Match each else with the closest unmatched then”. Following this rule, we want to prevent the situation of the second tree: the statement appearing between a **then **and an **else **must be “matched”, expressed in another way, the interior statement must not end with an unmatched (open) **then**. A matched statement is either an **if-then-else** containing matched statements, or it’s any other kind of unconditional statement ( the not considered statement external to the grammar).

So in the new grammar, we will have matched and unmatched statements. To clarify the situation, it’s possible to list all the allowed productions:

*if expr then matched_stmt**if expr then open_stmt**if expr then matched_stmt else matched_stmt**if expr then matched_stmt else open_stmt*

The productions **not** allowed are:

*if expr then open_stmt else matched_stmt**if expr then open_stmt else open_stmt*

So now we can create the rules:

*matched_stmt → if expr then matched_stmt else matched_stmt | other**open_stmt → if expr then matched_stmt |if expr then open_stmt | if expr then matched_stmt else open_stmt*

And the first rule of the grammar is the one that allows us to create both kinds of statements:

*stmt → open_stmt | matched_stmt*

It’s possible to simplify the *open_stmt *rule by observing that it’s possible to merge the first two productions, so the grammar becomes:

*stmt → open_stmt | matched_stmt**matched_stmt → if expr then matched_stmt else matched_stmt | other**open_stmt → if expr then stmt | if expr then matched_stmt else open_stmt*

That’s all; I hope this post has been helpful.

by the author.