Context-Free Grammars
Quirl supports context-free grammars to define languages. You can write yours and let Quirl check whether texts belong to that language. The term language is quite broad here: it could refer to a programming language, chess notation or the language of valid email addresses among other things.
Formally, a context-free grammar consists of four components:
- a non-empty set of nonterminal symbols, also called syntactic variables
- a set of terminal symbols, collectively called vocabulary or alphabet
- a set of production rules, sometimes simply called productions, that determine how a nonterminal symbol can be replaced by sequences of nonterminal and/or terminal symbols
- a designated start symbol from the nonterminal symbols
You only need to write production rules though. The other ingredients are identified implicitly in Quirl.
Production Rules
Let us look at a first example for a production rule:
BIT ~ "0" | "1" ;
What do we see here?
- The tilde
~divides the production rule into a left side and a right side. You may find an arrow, colon, equals sign or something else in literature and other software instead. Don't let that irritate you. - The left side of a production consists of a single nonterminal symbol, in this case
BIT. That is just a name without special meaning in Quirl. It receives its meaning from the production rule. - The right side shows all possible alternatives for replacing the nonterminal symbol that was mentioned on the left. Alternatives are separated by a vertical line
|. - Terminal symbols are text in Quirl and hence wrapped in quotation marks. A terminal symbol does not have to be a single character:
"bow"and"wow"would be fine, too. - The semicolon
;is not part of the production rule. We could do without it, if that rule was all there is in our program. It is necessary to separate the production rule from further statements though.
So the above rule says that BIT can be replaced by either "0" or "1". Further replacements are not possible, since these are terminal symbols.
Note on Names
A nonterminal symbol, a grammar and a language are technically three different things. However, a grammar is implicitly given by all terminal and nonterminal symbols and their production rules which can be reached from a nonterminal start symbol through rule-based replacements.
All texts that are recognized by (can be generated from) a grammar make up a language then. For practical purposes, there is no need to invent separate names for a grammar, its start symbol or the language generated by it. So we use the nonterminal symbol BIT to refer to the implied language as well.
A Bigger Language
Our BIT language is rather limited. It has only two valid texts, "0" and "1". Let us create a bigger language by writing one additional production rule:
BITSTRING ~ "" | BIT BITSTRING ;
Now the BITSTRING language contains infinitely many texts. All information stored on computers worldwide is part of this language. Wow! But how?
- The left side of the production introduces the nonterminal symbol
BITSTRING. It did not mean anything beforehand; it only gains its meaning from our production rules. - The right side says that
BITSTRINGcan be either replaced by the terminal empty text""or by the sequence of the nonterminal symbolsBITandBITSTRING. - The empty text
""contributes nothing to a text. We could leave it away. The empty alternative between~and|would still permit to replaceBITSTRINGby nothing then.""just makes this replace-by-nothing option visually explicit. - The second alternative for
BITSTRINGcontainsBITSTRINGitself, making the production recursive. This allows potentially infinitely many replacements and the generation of arbitrarily long texts.
The following graph shows from top to bottom how the text "011011" can be derived from BITSTRING by replacing the nonterminal symbols BITSTRING and BIT by their respective alternatives.
BITSTRING ┌──┴────┐ BIT BITSTRING │ ┌──┴────┐ │ BIT BITSTRING │ │ ┌──┴────┐ │ │ BIT BITSTRING │ │ │ ┌──┴────┐ │ │ │ BIT BITSTRING │ │ │ │ ┌──┴────┐ │ │ │ │ BIT BITSTRING │ │ │ │ │ ┌──┴────┐ │ │ │ │ │ BIT BITSTRING │ │ │ │ │ │ │ "0" "1" "1" "0" "1" "1" ""
Adding all the terminal symbols in the last line together yields the desired text "011011". You can use the in core function to verify that "011011" is a valid BITSTRING, if you have also entered the two production rules from above:
in("011011" BITSTRING) # returns T for true
It should be obvious that any other sequence of 0s and 1s can be generated similarly.
Codepoint Ranges
Typing both possible alternatives for a binary digit was easy. What if we need a production rule for a Unicode character though? Far more than one hundred thousand different characters have been defined already. More than a million are possible. It would be impractical to list them all.
Quirl supports Unicode code point ranges. U+0000..U+10FFFF covers all possible Unicode characters. Allowing all characters but one or a few exceptions can be done with multiple ranges. Here is a grammar that recognizes all characters except @ whose code point is U+0040:
NOT_AT ~ U+0000..U+0039 | U+0041..U+10FFFF;
The last example shows a production rule for any letter from standard German orthography spread over multiple lines:
GERMAN_LETTER
~ U+0041..U+005A # uppercase letters A to Z
| U+0061..U+007A # lowercase letters a to z
| "Ä" | "Ö" | "Ü" # uppercase umlauts
| "ä" | "ö" | "ü" # lowercase umlauts
| "ß" | U+1E9E # lowercase and uppercase sharp s
;
It is possible to refer to a terminal character by a single Unicode code point like U+1E9E, which is equivalent to writing "ẞ" in Quirl. You can find code points and code point ranges for various scripts and other blocks at https://www.unicode.org/charts/.