A Text Editor Feature: Syntax Tree Walk

Xah Lee, 2006-09-05

I want to write a emacs lisp function such that, when run, highlight a region between the nearest left and right delimiters. Delimiters are any of parentheses, square brackets, or double quotes. When the function is run again, it extends the selection to the outer enclosing delimiters, and so on.

So, in this way, a user can repeatedly press a keyboard shortcut and extend the selection.

This is a feature of the Mathematica editor. In Mathematica, pressing “Ctrl+.” will start a selection at the point the cursor is on to cover a token in the language. Pressing “Ctrl+.” again will extend the selection to the next level of the syntactical unit. Here is a example of Mathematica code with highlights showing its extend selection behavior, starting at the n inside the braces, extend outwards to cover higher level syntactical unit.

Table[n/(n + 1), {n, 1, 10, 1/2}]

What i wanted this in emacs for is mostly in editing HTML/XML, where one press can select the content, another press will include the enclosing tags, another press extends the selection to the next outer content, and another press include that tags too, and so on.

I think this would be a great feature for any language, where a keypress will highlight more syntactical units in any language's mode. For example, suppose in C-like language:

function f ( arg1, arg2) { 
line1; 
line2;
}

if the cursor is at arg1, then first press will highlight the word “arg1”, another press will highlight “arg1, arg2”, another press includes the parens “(arg1, arg2)”, another press will include the whole function definition: “function f (arg1, arg2) { line1; line2;}”. If the cursor is at line1, then it selects that word in the line, then the line, then the whole function def body, then including {}, then the whole function definition.

For a language with nested syntax, suppose we have this XML example:

<entry> 
  <title>Gulliver's Travels</title>
  <id>tag:xahlee.org,2006-08-21:030437</id> 
  <updated>2006-08-20T20:04:41-07:00</updated> 
  <summary>Annotated a chapter of Gulliver's Travels</summary> 
  <link rel="alternate" href="../p/Gullivers_Travels/gt3ch05.html"/>
</entry>

If the cursor is inside a tag's enclosing content, say, on the letter T in the string “Gulliver's Travels” inside the <title> tag, then the repeated extension is obvious. But however, suppose the cursor is at t in the “alternate” inside the “link” tag, then it would first select the whole “alternate” word, then expand to the double quotes “"alternate"”, then the whole property “rel="alternate"”, then the whole link tag, then the whole content of the entry tag, then including the “<entry>” tags itself.

In summary, this highlighting feature is a syntax-tree walker. Each invocation will go up on the syntax tree. And, this is built in in the powerful integrated editor in Mathematica↗.

Emacs and the Lisp Expression

For the lisp language, where the language syntax is just nested parentheses, this facility is almost trivial to implement. In fact, it is built-in in the emacs editor.

In Emacs, when editing lisp code, the following commands are available:

Shortcut Command name Purpose
Ctrl+Meta+← backward-sexp Move to previous sibling
(move to the (beginning of) previous sexp unit)
Ctrl+Meta+→ forward-sexp Move to next sibling
(move to the (end of) next sexp unit)
Ctrl+Meta+↑ backward-up-list Move to parent
(move to the (beginning of) outer paren pair)
Ctrl+Meta+↓ down-list Move to first children
(move into the (beginning of) first inner paren pair)
Ctrl+Meta+SPC mark-sexp same as forward-sexp but selecting the text

These are nice features but more or less only work for simple nested syntax such as in lisp code. A slightly more complex nested syntax, such as XML, the simple lexical scan based tree-walker breaks down. (And, of course, it doesn't work for any other languages that do not use regularly nested syntax. For this to work for languages with ad-hoc syntax such as C, Python, Java, or the syntax soup of Perl and unix imperative shells, the implementation essentially needs a parser to do the tree-walking.).

XML also has a fairly very regular recursive syntax. In Emac's XML mode, forward-sexp, backward-sexp, mark-sexp are available, but not the most important backward-up-list. And, the forward-sexp etc do NOT behave according to the syntax-tree, but merely jumping between some prominent characters such as the angle brakets and quote marks. After a few invocation, it will therefore result in highlighted region that is not a syntactical unit.

For some explanation about the syntax tree of a language, see Wikipedia: Parse tree↗.


Related essays:


Page created: 2006-09.
© 2006 by Xah Lee.
Xah Signet