Context-Free Grammars

Quirl supports context-free grammars to define languages. You can write yours and let Quirl check whether texts belong to that language. The term language is quite broad here: it could refer to a programming language, chess notation or the language of valid email addresses among other things.

Formally, a context-free grammar consists of four components:

You only need to write production rules though. The other ingredients are identified implicitly in Quirl.

Production Rules

Let us look at a first example for a production rule:

BIT ~ "0" | "1" ;

What do we see here?

So the above rule says that BIT can be replaced by either "0" or "1". Further replacements are not possible, since these are terminal symbols.

Note on Names

A nonterminal symbol, a grammar and a language are technically three different things. However, a grammar is implicitly given by all terminal and nonterminal symbols and their production rules which can be reached from a nonterminal start symbol through rule-based replacements.

All texts that are recognized by (can be generated from) a grammar make up a language then. For practical purposes, there is no need to invent separate names for a grammar, its start symbol or the language generated by it. So we use the nonterminal symbol BIT to refer to the implied language as well.

A Bigger Language

Our BIT language is rather limited. It has only two valid texts, "0" and "1". Let us create a bigger language by writing one additional production rule:

BITSTRING ~ "" | BIT BITSTRING ;

Now the BITSTRING language contains infinitely many texts. All information stored on computers worldwide is part of this language. Wow! But how?

The following graph shows from top to bottom how the text "011011" can be derived from BITSTRING by replacing the nonterminal symbols BITSTRING and BIT by their respective alternatives.

BITSTRING
 ┌──┴────┐
BIT  BITSTRING
 │    ┌──┴────┐
 │   BIT  BITSTRING
 │    │    ┌──┴────┐
 │    │   BIT  BITSTRING
 │    │    │    ┌──┴────┐
 │    │    │   BIT  BITSTRING
 │    │    │    │    ┌──┴────┐
 │    │    │    │   BIT  BITSTRING
 │    │    │    │    │    ┌──┴────┐
 │    │    │    │    │   BIT  BITSTRING
 │    │    │    │    │    │       │
"0"  "1"  "1"  "0"  "1"  "1"     ""

Adding all the terminal symbols in the last line together yields the desired text "011011". You can use the in core function to verify that "011011" is a valid BITSTRING, if you have also entered the two production rules from above:

in("011011" BITSTRING) # returns T for true

It should be obvious that any other sequence of 0s and 1s can be generated similarly.

Codepoint Ranges

Typing both possible alternatives for a binary digit was easy. What if we need a production rule for a Unicode character though? Far more than one hundred thousand different characters have been defined already. More than a million are possible. It would be impractical to list them all.

Quirl supports Unicode code point ranges. U+0000..U+10FFFF covers all possible Unicode characters. Allowing all characters but one or a few exceptions can be done with multiple ranges. Here is a grammar that recognizes all characters except @ whose code point is U+0040:

NOT_AT ~ U+0000..U+0039 | U+0041..U+10FFFF;

The last example shows a production rule for any letter from standard German orthography spread over multiple lines:

GERMAN_LETTER
~ U+0041..U+005A  # uppercase letters A to Z
| U+0061..U+007A  # lowercase letters a to z
| "Ä" | "Ö" | "Ü" # uppercase umlauts
| "ä" | "ö" | "ü" # lowercase umlauts
| "ß" | U+1E9E    # lowercase and uppercase sharp s
;

It is possible to refer to a terminal character by a single Unicode code point like U+1E9E, which is equivalent to writing "ẞ" in Quirl. You can find code points and code point ranges for various scripts and other blocks at https://www.unicode.org/charts/.