How To Write An Emacs Major Mode For Syntax Coloring

Xah Lee, 2008-11, 2009-12-10

This page gives a practical example of writing an emacs major mode to do syntax coloring of your own language. You should have a few months of experience coding emacs lisp.

The Problem

Your company uses its own in-house language. You want to write a major mode for that language, so that the keywords of the language will be highlighted.

Solution

Suppose your language source code looks like this:

Sin[x]^2 + Cos[y]^2 == 1
Pi^2/6 == Sum[1/x^2,{x,1,Infinity}]

You want the words “Sin”, “Cos”, “Sum”, colored as functions, and “Pi” and “Infinity” colored as constants.

Here's how you define the mode:

(setq myKeywords
      '(("Sin\\|Cos\\|Sum" . font-lock-function-name-face)
        ("Pi\\|Infinity" . font-lock-constant-face)))

(define-derived-mode math-lang-mode fundamental-mode "math lang"
  "Major mode for a simple math language."
  (setq font-lock-defaults '(myKeywords)))

The string “"Sin\\|Cos\\|Sum"” is a regex. The “font-lock-function-name-face” is a pre-defined variable whose value is the default face used for function names.

The “define-derived-mode” form defines your mode, named math-lang-mode, based on fundamental-mode (which is the most basic mode). The line (setq font-lock-defaults '(myKeywords)) sets up the syntax highlighting for your mode.

The third argument, “"math lang"”, gives the mode a name to be displayed on the mode line, and the string after it is the mode's documentation. Put the name string right after the parent mode: define-derived-mode always treats its third argument as the mode name, so if you leave the string out, whatever form comes next is taken as the name, and the mode line can end up showing “*invalid*”.

That's all there is to it. Now, when you invoke “math-lang-mode”, emacs will syntax color the buffer's text. (You must have font-lock-mode on. If it is not, do “Alt+x font-lock-mode”.) Here's what it looks like:

Sin[x]^2 + Cos[y]^2 == 1
Pi^2/6 == Sum[1/x^2,{x,1,Infinity}]

O My GOD, Emacs is beautiful!
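
If you would rather check the result from lisp than by eye, here is a rough sketch that fontifies a temporary buffer and reports which face a word received. It is only an illustration. The names “my-demo-math-keywords”, “my-demo-math-mode” and “my-demo-face-of” are made up for this example, and it assumes an emacs recent enough to have “font-lock-ensure” (25 or later).

```elisp
;; Hypothetical names throughout; a sketch, not part of the mode above.
(defvar my-demo-math-keywords
  '(("Sin\\|Cos\\|Sum" . font-lock-function-name-face)
    ("Pi\\|Infinity" . font-lock-constant-face))
  "Font-lock keywords for the demo mode.")

(define-derived-mode my-demo-math-mode fundamental-mode "math demo"
  "Demo major mode that highlights a few math words."
  (setq font-lock-defaults '(my-demo-math-keywords)))

(defun my-demo-face-of (text word)
  "Return the face given to the first occurrence of WORD in TEXT."
  (with-temp-buffer
    (insert text)
    (my-demo-math-mode)
    (font-lock-ensure)                  ; fontify the whole buffer right now
    (goto-char (point-min))
    (search-forward word)
    (get-text-property (match-beginning 0) 'face)))

(my-demo-face-of "Sin[x]^2 + Pi" "Sin") ; => font-lock-function-name-face
(my-demo-face-of "Sin[x]^2 + Pi" "Pi")  ; => font-lock-constant-face
(my-demo-face-of "Sin[x]^2 + Pi" "x")   ; => nil
```

Evaluate the forms with “Alt+x eval-buffer” in a scratch file. A nil result means the text was left uncolored.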

Hundreds Of Keywords

Typically, a language may have hundreds of keywords. Elisp provides a way to generate a regex from your keywords.

Suppose you are writing a mode for the Linden Scripting Language, which has about 553 keywords. Here's an example of how to code it.

;; define several classes of keywords
(defvar mylsl-keywords
  '("break" "default" "do" "else" "for" "if" "return" "state" "while")
  "LSL keywords.")

(defvar mylsl-types
  '("float" "integer" "key" "list" "rotation" "string" "vector")
  "LSL types.")

(defvar mylsl-constants
  '("ACTIVE" "AGENT" "ALL_SIDES" "ATTACH_BACK")
  "LSL constants.")

(defvar mylsl-events
  '("at_rot_target" "at_target" "attach")
  "LSL events.")

(defvar mylsl-functions
  '("llAbs" "llAcos" "llAddToLandBanList" "llAddToLandPassList")
  "LSL functions.")

In the above, we first define several lists, each holding one class of keywords in the language. Note that the lists above are truncated. Each list can have hundreds of keywords.

;; create the regex string for each class of keywords
(defvar mylsl-keywords-regexp (regexp-opt mylsl-keywords 'words))
(defvar mylsl-type-regexp (regexp-opt mylsl-types 'words))
(defvar mylsl-constant-regexp (regexp-opt mylsl-constants 'words))
(defvar mylsl-event-regexp (regexp-opt mylsl-events 'words))
(defvar mylsl-functions-regexp (regexp-opt mylsl-functions 'words))

In the above, we generate the regex for each keyword class, using the built-in function “regexp-opt”. We gave regexp-opt a second optional argument “'words”. This creates a regex that matches only at word boundaries, so that a keyword contained inside a longer word will not be highlighted. (For example, “for” is usually a keyword for looping, but if you have a user created function named “inform”, you don't want part of the word colored as “for”.)

(info "(elisp) Regexp Functions")
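
To see the effect of “'words” for yourself, here is a small sketch. The helper name “my-demo-keyword-match” is made up for this example. Only “regexp-opt” and “string-match-p” are built in. Word boundaries depend on the syntax table in effect, so the helper runs in a fresh temporary buffer, which uses the standard one.

```elisp
;; Hypothetical helper, for illustration only.
(defun my-demo-keyword-match (keywords text)
  "Return the index in TEXT where one of KEYWORDS matches as a whole word.
Return nil if none of them does."
  (with-temp-buffer            ; fresh buffer, standard syntax table
    (string-match-p (regexp-opt keywords 'words) text)))

(my-demo-keyword-match '("for" "if" "while") "for (i = 0; ...") ; => 0
(my-demo-keyword-match '("for" "if" "while") "inform(x)")       ; => nil
```

The second call returns nil because the “for” inside “inform” does not start at a word boundary.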

;; clear memory
(setq mylsl-keywords nil)
(setq mylsl-types nil)
(setq mylsl-constants nil)
(setq mylsl-events nil)
(setq mylsl-functions nil)

In the above, we clear the lists to save memory, since we don't need them anymore.

;; create the list for font-lock.
;; each class of keyword is given a particular face
(setq mylsl-font-lock-keywords
  `(
    (,mylsl-type-regexp . font-lock-type-face)
    (,mylsl-constant-regexp . font-lock-constant-face)
    (,mylsl-event-regexp . font-lock-builtin-face)
    (,mylsl-functions-regexp . font-lock-function-name-face)
    (,mylsl-keywords-regexp . font-lock-keyword-face)
    ;; note: order above matters. “mylsl-keywords-regexp” goes last because
    ;; otherwise the keyword “state” in the event name “state_entry”
    ;; would be highlighted.
))

In the above, we create a list in preparation to feed it to “font-lock-defaults”.

Note that the highlighting mechanism of font-lock-defaults works on a first-come-first-served basis: once a piece of text gets its coloring, it won't be changed by a later entry. So, the order of your list is important. Make sure the class with the shortest words goes last. (This won't fix all cases where a keyword matches part of other keywords. If your language has a lot of such keywords, you need to use other forms to solve this problem. (info "(elisp) Search-based Fontification"))
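
One possible way to avoid many of these partial matches, assuming emacs 24 or later, is to give regexp-opt the argument “'symbols” instead of “'words”. The regex then matches only whole symbols, and in the standard syntax table the underscore counts as part of a symbol. The sketch below compares the two. The helper name “my-demo-match-index” is made up for this example, and the result depends on the syntax table, so treat it as an illustration and not as a rule for every mode.

```elisp
;; Hypothetical helper, for illustration only.
(defun my-demo-match-index (keywords paren text)
  "Return the index where the regex built from KEYWORDS matches TEXT.
PAREN is passed to `regexp-opt' as its second argument."
  (with-temp-buffer            ; fresh buffer, standard syntax table
    (string-match-p (regexp-opt keywords paren) text)))

(my-demo-match-index '("state") 'words   "state_entry") ; => 0
(my-demo-match-index '("state") 'symbols "state_entry") ; => nil
(my-demo-match-index '("state") 'symbols "state idle")  ; => 0
```

With “'words”, the keyword “state” matches inside “state_entry”, because the underscore is not a word character. With “'symbols”, it does not.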

The “`( ,a ,b ...)” is lisp's backquote syntax. It works like a quoted list, except that elements preceded by a “,” are evaluated and replaced by their values. Here, each regex variable is replaced by its string, while the face names stay as symbols.
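
Here is a small sketch of the backquote at work. The function name “my-demo-make-keywords” is made up for this example.

```elisp
;; Hypothetical helper, for illustration only.
(defun my-demo-make-keywords (regexp)
  "Return a one-entry font-lock keyword list pairing REGEXP with the type face."
  ;; ,regexp is replaced by its value; font-lock-type-face stays a symbol.
  `((,regexp . font-lock-type-face)))

(my-demo-make-keywords "float\\|integer")
;; => (("float\\|integer" . font-lock-type-face))

;; The same list built without backquote:
(list (cons "float\\|integer" 'font-lock-type-face))
```

Both forms give the same list. The backquote version is just easier to read when the list has many entries.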

Finally, we define our mode like this:

;; define the mode
(define-derived-mode mylsl-mode fundamental-mode
  "lsl mode"
  "Major mode for editing LSL (Linden Scripting Language)..."
  ;; ...

  ;; code for syntax highlighting
  (setq font-lock-defaults '((mylsl-font-lock-keywords)))

  ;; clear memory
  (setq mylsl-keywords-regexp nil)
  (setq mylsl-type-regexp nil)
  (setq mylsl-constant-regexp nil)
  (setq mylsl-event-regexp nil)
  (setq mylsl-functions-regexp nil)

  ;; ...
)

In the above, we based our mode on fundamental-mode, which is the most basic mode. If you are actually writing a mode for LSL, it makes sense to base it on c-mode, since the syntax is similar. Basing your mode on a similar language's mode will save you time in coding many features, such as handling of comments and indentation.

Syntax Coloring For Comments

For comment syntax coloring, you need to use a syntax table. For a command that does commenting and uncommenting, you can set “comment-start” and rely on the built-in “comment-dwim”, or write your own function. For detail, see: How To Add Comment Handling In Your Major Mode.
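
As a rough idea of what the syntax table part looks like, here is a minimal sketch for a language whose comments run from “//” to the end of the line. All the names starting with “my-demo-” are made up for this example, and a real mode will usually need more syntax entries than these two.

```elisp
;; Hypothetical names throughout; a sketch for "//" line comments only.
(defvar my-demo-comment-syntax-table
  (let ((table (make-syntax-table)))
    ;; "/" is punctuation that can be the 1st and the 2nd char of a comment starter.
    (modify-syntax-entry ?/ ". 12" table)
    ;; A newline ends the comment.
    (modify-syntax-entry ?\n ">" table)
    table)
  "Syntax table for `my-demo-comment-mode'.")

(define-derived-mode my-demo-comment-mode fundamental-mode "comment demo"
  "Demo major mode that treats // to end of line as a comment."
  :syntax-table my-demo-comment-syntax-table
  (setq-local comment-start "// ")
  (setq-local comment-end "")
  ;; No keywords: font-lock colors only what the syntax table describes.
  (setq font-lock-defaults '(nil)))

(defun my-demo-in-comment-p (text pos)
  "Return non-nil if position POS of TEXT is inside a comment in the demo mode."
  (with-temp-buffer
    (insert text)
    (my-demo-comment-mode)
    (nth 4 (syntax-ppss pos))))

(my-demo-in-comment-p "x = 1; // set x\ny = 2;" 12) ; => t
(my-demo-in-comment-p "x = 1; // set x\ny = 2;" 3)  ; => nil
```

Once the syntax table knows what a comment is, font-lock colors comments without any regex in your keyword list.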

How To Name Your Mode

There are several issues in naming of a major mode.

Value of Variable “major-mode”

“major-mode” is a buffer local variable. When in a major mode, the user can type “Ctrl+h v major-mode” to see its value. This is the most important, technical name of a mode. This value is also the command name the user types after “Alt+x” to activate the mode.

You do not need to set this variable if you use define-derived-mode: the first argument you give it becomes the value of “major-mode” automatically. If you define an independent mode using defun, for example “(defun my-html-mode ...)”, you need to set it yourself in the function body, with “(setq major-mode 'my-html-mode)”.
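
A quick way to convince yourself, as a sketch with made-up names (“my-demo-name-mode” and “my-demo-current-major-mode” are hypothetical):

```elisp
;; Hypothetical names, for illustration only.
(define-derived-mode my-demo-name-mode fundamental-mode "name demo"
  "Demo mode used only to inspect the variable `major-mode'.")

(defun my-demo-current-major-mode ()
  "Turn on the demo mode in a temporary buffer and return `major-mode' there."
  (with-temp-buffer
    (my-demo-name-mode)
    major-mode))

(my-demo-current-major-mode) ; => my-demo-name-mode
```

The symbol returned is the same one you gave as the first argument to define-derived-mode.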

When choosing a name, pick one that tells the user what the mode is for, yet is distinct from other modes for the same language. There is no absolute principle in choosing your mode name. Do not use a name like “john-mode” or “fancy-mode”, because users won't be able to tell what the mode does just from the name. Do not use a generic name like html-mode or perl-mode either, since many other lisp programmers may have written modes for those languages. Only modes distributed with GNU Emacs might want to use such a generic name.

Here are some example names from existing modes for consideration: for perl, there are modes named perl-mode and cperl-mode. For XML, there are modes named xml-mode and nxml-mode. For HTML, there are html-mode, html-helper-mode, nxhtml-mode. For JavaScript: javascript-mode, js-mode, js2-mode. IRC modes are variously named rcirc-mode, erc-mode. For email, there's rmail, vm, gnus.

Prefix Your Mode-Name To Your Symbol's Names

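Emacs lisp does not support namespaces. (Since emacs 24 it does offer lexical scope for local variables, through the “lexical-binding” file variable, but that does not change anything for names you define at top level.) Practically speaking, all global variables and functions in emacs live in one global name space.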

The symbols in your package should have unique names, so that two packages do not define the same name and override each other. The conventional practice is to prefix all the symbol names in your mode with your mode name, or an abbreviation of your mode name.

This is another important reason that you should choose a unique name for your mode.

For example, if you look at the source code of “cperl-mode”, all the names start with “cperl”. For “html-mode”, all names start with “sgml”. For “python-mode”, all names start with “python”, etc.

Value of Variable “mode-name”

“mode-name” is a buffer local variable. When in a major mode, the value of mode-name is displayed on the mode line. This value is for human reading. For example, in html-mode, the value is “HTML”. You can also type “Ctrl+h v mode-name” to see its value. With define-derived-mode, you give this value as the third argument. If you write your mode with “(defun my-html-mode ...)”, you need to set it in the function body, for example “(setq mode-name "LSL")”.
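
For comparison, here is a bare-bones sketch of a mode written with defun, where you set both names by hand. The names “my-demo-plain-mode” and “my-demo-plain-names” are made up for this example, and a real mode would do much more in its body.

```elisp
;; Hypothetical names, for illustration only.
(defun my-demo-plain-mode ()
  "Hand-written demo major mode, defined without `define-derived-mode'."
  (interactive)
  (kill-all-local-variables)
  (setq major-mode 'my-demo-plain-mode)  ; the technical name
  (setq mode-name "Demo Plain")          ; the name shown on the mode line
  (run-mode-hooks 'my-demo-plain-mode-hook))

(defun my-demo-plain-names ()
  "Turn on the demo mode in a temporary buffer and return both of its names."
  (with-temp-buffer
    (my-demo-plain-mode)
    (list major-mode mode-name)))

(my-demo-plain-names) ; => (my-demo-plain-mode "Demo Plain")
```

With define-derived-mode, these two setq lines are written for you.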

The Elisp File Name

Also, there is no technical relation between your mode's name and the file name. For example, you can have your mode named “mylsl-mode”, while the file name can be “lsl_mode_by_John.el” or “lsl_mode_v1.4.el”. This is because elisp does not enforce a relation between the package name and file name, unlike Java.

Normally, you just name your file the same as the value of the variable “major-mode” defined in your package. So, for example, if you have “mylsl-mode”, then the file can be “mylsl-mode.el”. This is how the majority of major mode packages name their files.

An elisp file's name can be a little flexible. A version number in the file name is common. If your package has more than one file, you can name them anything appropriate to each file's purpose.

Full Featured Language Mode

In this tutorial, we only covered syntax coloring of a set of fixed keywords. Besides syntax coloring, a full featured language mode should also handle comments, indentation, keyword completion, integrated documentation lookup, function template insertion, graphical menus, support for emacs's customize-group scheme, and any other features that may be useful for coding your language.

This tutorial only gives a basic template for writing a language mode with syntax coloring. There are a lot of details and conventions you need to know if you want to make your mode full featured.

(info "(elisp) Major Mode Conventions")

For example source code of full featured modes that are written from scratch and do not depend on other modes, see any of: xbbcode-mode, xlsl-mode, xahk-mode.el.

2008-11
© 2008 by Xah Lee.