Pattern Matching vs Pattern Specification

Xah Lee, 2008-05-01

[this page is a personal notes, on a realization that pattern matching (regx and Mathematica), is not suitable for specifying grammar.]

In computing, there's pattern matching↗. e.g. those in Mathematica, for pattern matching symbolic expressions (See PatternMatching↗), and those in perl, for pattern matching strings. These pattern matching domain-specific languages are designed to match patterns, but i have realized, that they are rather unfit, when used as a mean to specify forms. In short, what i came to understand is that pattern matching facilities (may it be regex for strings or Mathematica's for symbolic expressions), although powerful, but is incapable of specifying formal grammar↗, even very simple ones. (this seems obvious now, when stated as above. But it was a realization.)

Often, when writing doc or in verbal communication, on computer programing or doing math with Mathematica, one is tempted to use a pattern to communicate a particular form a expression may have. For example, in computing, one would like to say that email address has the form xyz, where xyz is a perl regex. (e.g. “[A-z0-9]+@[A-z]+\.com”) Similarly, one might use regex to communicate the form of a url, domain name, file path, etc. In math (especially in Hilbert's Formal perspective or in computer algebra systems), i often want to say that certain function has the form xyz, where xyz is a Mathematica pattern that indicates the function's parameter (and arity), output, and their types. (typically, a computable version of traditional notation like “f:C^2→R^3”). However, in practice, using regex or Mathematica's patterns for the purpose of specifying a form, is often unsatisfactory. Typically, using patterns only gives a general idea, but isn't a correct description. For the purpose of giving a general idea, verbal English description is almost as good, almost.

For example, in the official doc for E-mail address↗ (RFC 2822), it does not use a regex to specify email address's syntax. Instead, it uses a variant of BNF↗ mixed with many pages of explanations, for human reading.

(There are published regex expressions that matches mail address's syntax as specified in RFC. The regex is one or several hundred lines long, as a perl program. The existence of such a regex partly a joke)

It is quite desirable to have a general grammar language, designed in a human-readible way, and concise. With such a language, we could use it to verify if a text is a valid form. We could use it for human communication. Conceivably, it could replace pattern matching.

This brings the question: to what degree, pattern matching language and grammar specification language are orthogonal, in both theoretical and practical considerations.

one grammar spec lang is BNF. However, as far as i've seen, it is a lang for human reading, not for machine to process. (in the same way Math notations is a language for human consumption) As a human-to-human lang, it is imprecise and filled with errors. Is there a general purpose, widely used, BNF-like language that is designed to run as executable program, but also has qualities for human reading?

Many grammar spec for various langs, actually takes a human form, and are extremely complex. Examples are HTML (as part of SGML), and even the much simpler XML's official specifications of its syntax, are wildly complex with dense and verbose descriptions, in tens of pages. (as far as i know, there is no computer language that specifies XML's grammar)

The grammar of XML is extremely simple. It seems something is lacking, by the fact that its grammar ( Lexical grammar↗ ) isn't specified in any computer language. Conceivable it would be within 50 lines of such a language, the bulk of which is probably just 10 lines.

The above issues are subjects in the field of parsing in computer science.

I don't have any experience nor have had any study on parsing. It appears that Yacc↗ and Flex lexical analyser↗ is something i'm looking for.

I'll be spending time studying parsing in the near future to fill my void in this important topic. As of now, my impression of yacc and lex is that they don't seems to be the system i'm looking for. For example, apparently, SGML, XML, do not use them to specify the grammar. Nor, most languages i know of. They seems just to be a very technical tool and not something general enough to be used for human communication as well in wide number of computer languages.


Addendum: 2008-05-09

I have found what i wanted!! See Parsing expression grammar↗.

Apparently, it's still not in wide use. There are one or two implementations in languages like Haskell and Perl6. I would like to see this in perl, python, php, javascript, and most of all, emacs lisp.

A Emacs Lisp implementation seems to be at: http://www.emacswiki.org/cgi-bin/wiki/ParserCompiler, (2008) by Mike Mattie.


Study Notes

Some Wikipedia learning and notes:

BNFs, CFL, Generative Grammar

The Backus–Naur form↗ has several variants used today. The original BNF discussed in 1959 is hardly used today. Major variants are, perhaps in order of popularity:

BNF and variants are so-called Metasyntax↗. Metasyntax means roughly a syntax that is used to describe the formal grammar of other langs. In particular, the BNF (and variants, henceforth BNFs) are used to describe a class of language called Context-free languages↗ (CFL).

Context-free langs are those can be described using Context-free grammar, and context-free grammar is roughly those that can be described by a set of certain transformation rules (often called production rules). BNFs are syntaxes for these production rules.

A grammar (formal grammar), is a high-level (typically human-level) spec that describes a language's possible strings (i.e. the source code). There are many classes of grammars. One classification is whether they are context-free. And, of the context-free ones, there are few ways to describe them. One way, by listing several transformation rules, is called Generative grammar↗. Another way, by mathematical qualifications, is called Analytic grammar↗. (So, for example, BNFs are used for generative grammar.)

BNF Examples

As far as i know, none of the following examples are written for machine to parse. Also, the BNF extention is non-standard.

parsing

parser flow

Parsing↗ can be broken down to few steps.

Lexical analysis↗, which breaks the text stream into tokens. Then, syntactic analysis tokens the stream of tokens into a parse tree.

In lexical analysis, it can be separated into 2 steps: scanner and tokenizer.

Algorithms for parsing can be grouped into types, based on their approach. In one classification, there's Top-down parsing↗ and Bottom-up parsing↗.

formal language

formal grammar↗ specifies the all possible string of a language.

Lexical grammar↗ specifies the syntax of tokens. For example, XML is specified in lexical grammar. (since here is not one specific language, but a lexical structure that can spawn off new langs)

Once a formal grammar is given, the reverse problem of determining whether a string belongs to that lang, is theoretically studied in a field named Automata theory↗.


Some links to check out:


PS what lead me to wrote this article, is this: i was study algebraic curves “elementary geometry of algebraic curves” by C G Gibson, and while reading the book, in conjunction of working on my Visual Dictionary of Special Plane Curves website, and also my long time interest in formal mathematics (see The Codification of Mathematics), i wanted to express the definition and theorems of algebraic curves as a sequence of symbols. (and hopefully show proofs as derivation of these symbol strings according to some given rules. In other words, my wish is to put the human element out)

In other words, i was foraging into formal systems. This is when i _explicitly_ realized a long experience, that pattern matching as by regex or Mathematica isn't sufficient to specify even simple forms (i.e. grammar). This got me started to study formal languages, grammar, BNF, CFL, PEG, parsers, automata, etc. I have actually never studied these subjects. (I recall, back in 1997, i was naively trying to device rules for the Mathematica language. I was, effectively, devising BNF for Mathematica's FullForm (which is just nested list like lisp).)

Another reason that i dived into the subject, is that for the past 2 or so years i've been wanting to have a lisp source code reformatter in emacs based on a lexical analysis. (See A Simple Lisp Code Formatter) Getting acquainted with parser theory will help me.

Also of interest: there are days or months worth of articles on Wikipedia on formal systems, computer algebra systems, automated theorem proving, theorem verification assistant, mathematical logic ... all of which i'm intensely interested. (e.g. start with Automated theorem proving↗). I have discovered: Coq↗, which is a formal system. I'll be spending time to study coq in the near future.

Page created: 2008-05.
© 2008 by Xah Lee.
Xah Signet