Using Emacs To Syntax Color Source Code In HTML

Xah Lee, 2007-10, 2009-01

This page shows an example of writing an emacs lisp function that processes a block of text to syntax color it with HTML tags. If you don't know elisp, first take a look at Emacs Lisp Basics.

The Problem

Summary

I want to write an elisp function such that, when invoked, the tokens in the block of text the cursor is on get wrapped in HTML “<span class="tokenType">” tags. This is for the purpose of publishing programming language code in HTML on the web.

Detail

I write a lot of computer programming tutorials for several computer languages. For example: Perl and Python tutorial, Java tutorial, Emacs Lisp tutorial, Javascript tutorial. These tutorials often contain code snippets, and this code needs to be syntax colored in HTML.

For example, here's an elisp code snippet:

(if (< 3 2)  (message "yes") )

Here's what i actually want as raw HTML:

(<span class="keyword">if</span> (&lt; 3 2)  (message <span class="string">"yes"</span>) )

Which should look like this in a web browser:

(if (< 3 2)  (message "yes") )

There is an emacs package that turns syntax-colored text in emacs into HTML. This is extremely nice. The package is called htmlize.el and is written (1997,...,2006) by Hrvoje Niksic, available at http://fly.srk.fer.hr/~hniksic/emacs/htmlize.el.

This program provides you with a few new emacs commands. Primarily, it has htmlize-region, htmlize-buffer, htmlize-file. The region and buffer commands output HTML code in a new buffer, and htmlize-file takes an input file name and writes the output into a file.

When i need to include a code snippet in my tutorial, typically, i write the code in a separate file (e.g. “temp.java”, “temp.py”), run it to make sure the code is correct (compile, if necessary), then, copy the file content into my tutorial page, put it inside a “pre” tag.

In this scheme, the best way for me to use htmlize.el is to use the “htmlize-buffer” command on my “temp.java” file, then copy the htmlized output and paste that into my tutorial page inside a “pre” tag. Since many of my tutorials were written haphazardly over the years, before i saw the need for syntax coloration, most source code snippets already exist inside “pre” tags without a temp code file. So, in most cases, what i do is select the text inside the “pre” tag, paste it into a temp buffer, invoke the right mode for the language (so the text will be fontified correctly), do htmlize-buffer, copy the html output, then paste it back to replace the selected text.

This process is tedious. A page may have several code snippets. For each, i will need to select text, create a buffer, switch mode, do htmlize, select again, switch buffer, then paste. Each text-selection step involves multiple keystrokes with deliberate eye-balling or precision mousing. I have a few hundred pages for potential colorization.

It would be wonderful if i could place the cursor on a code block, press a button, and have emacs magically replace the code block with a htmlized version colorized for that language. We proceed to write this function.

Solution

For elisp experts who have worked with emacs fontification, the solution would be to write a function that maps a string's fontification info into html tags. This is exactly what htmlize.el does. Since it is already written, an elisp expert might find the essential code in htmlize.el. (The code is licensed under GPL.)

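Unfortunately, my lisp experience isn't so great. I spent maybe 30 minutes looking in htmlize.el in hope of finding a function, something like htmlize-str, that is the essence, but wasn't successful. I figured it would actually be faster to take the dumb and inefficient approach: write elisp code that extracts the output from the htmlize-buffer command. Here's the outline of the plan of my function: grab the code inside the “pre” tag, put it in a temp buffer, set the mode for the language, run htmlize-buffer, extract the htmlized code from its output, and put that in place of the original code.

To achieve the above, i decided on 2 steps: a function htmlize-string that takes a source code string and a mode name and returns the htmlized string, and a command htmlize-block that grabs the code block the cursor is in and replaces it with the htmlized version.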

Here's the code of my htmlize-string function:

(defun htmlize-string (sourceCodeStr langModeName)
  "Take SOURCECODESTR and return a htmlized version using LANGMODENAME.
This function requires the htmlize.el by Hrvoje Niksic, 2005"
  (require 'htmlize)
  (let (htmlizeOutputBuf x1 x2 resultS)

    ;; put code in a temp buffer, set the mode, fontify
    (with-temp-buffer
      (insert sourceCodeStr)
      (funcall (intern langModeName))
      (font-lock-fontify-buffer)
      (setq htmlizeOutputBuf (htmlize-buffer))
      )

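    ;; extract the fontified source code in htmlize output
    (with-current-buffer htmlizeOutputBuf
      ;; search from the top, wherever htmlize left point
      (goto-char (point-min))
      (setq x1 (search-forward "<pre>"))
      (setq x2 (search-forward "</pre>"))
      ;; skip the newline after <pre>; stop before the 6 chars of </pre>
      (setq resultS (buffer-substring-no-properties (+ x1 1) (- x2 6))))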

    (kill-buffer htmlizeOutputBuf)
    resultS
    ))

The major part in this code is knowing how to create a temp buffer, set a mode, kill buffer, and how to grab the text you want in a buffer.

The “(with-temp-buffer ‹body›)” will create a temporary buffer and switch to it as current buffer. Any commands in the ‹body› will assume the temp buffer as the current buffer. When the code in ‹body› finishes, the temp buffer disappears. The value of the last expression in ‹body› is returned. The “with-temp-buffer” is an extremely useful macro.
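As a small illustration of the return value behavior described above, here is a minimal sketch. The function name my-upcase-via-temp-buffer is made up for this example and is not part of the htmlize code.

```elisp
(defun my-upcase-via-temp-buffer (str)
  "Return STR upcased, doing the work in a temporary buffer."
  (with-temp-buffer
    (insert str)
    (upcase-region (point-min) (point-max))
    ;; the value of the last expression becomes the value of
    ;; with-temp-buffer; the temp buffer itself is gone afterwards
    (buffer-string)))

(my-upcase-via-temp-buffer "hello")  ; returns "HELLO"
```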

Inside the “with-temp-buffer”, we insert the source code string, then set the mode to the given major mode, then call htmlize-buffer.

The code for setting a major mode is this “(funcall (intern langModeName))”. The “funcall” invokes a function. The first argument of funcall must be a function; here we give it a lisp symbol that names one. Our variable “langModeName” evaluates to a string, then the “intern” function takes the string and returns the lisp symbol of that name.
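One thing “intern” does not do is check that the symbol actually names a function. The following is a sketch of a more defensive variant, under the assumption that you would rather get a clear error for a misspelled mode name. The name my-call-mode-by-name is hypothetical; intern-soft and fboundp are standard elisp functions.

```elisp
(defun my-call-mode-by-name (mode-name)
  "Call the major mode function named by the string MODE-NAME.
Signal an error when no function of that name is defined."
  ;; intern-soft returns nil when no symbol of that name exists yet,
  ;; so a typo does not create a new, useless symbol
  (let ((mode-fn (intern-soft mode-name)))
    (if (and mode-fn (fboundp mode-fn))
        (funcall mode-fn)
      (error "No mode function named %s" mode-name))))

(with-temp-buffer
  (my-call-mode-by-name "emacs-lisp-mode")
  major-mode)  ; returns the symbol emacs-lisp-mode
```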

When a buffer is set to a major mode, usually it is fontified automatically, but not always. We use “(font-lock-fontify-buffer)” to fontify it. Then, this expression “(setq htmlizeOutputBuf (htmlize-buffer))” will htmlize the buffer. The return value of htmlize-buffer is an object of buffer datatype. (When called interactively, you are automatically switched to this buffer. But when called in a program, there is no auto switch.) This return value is set to the variable htmlizeOutputBuf, so that we can use it later.
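To see that fontification really attaches face information to the text, here is a small sketch that fontifies a string in a temp buffer and reads back the face of one character. The function name my-face-of-first-word is made up for this example. It uses font-lock-ensure where available (Emacs 25 and later, which is an assumption about your emacs version) and falls back to font-lock-fontify-buffer otherwise.

```elisp
(defun my-face-of-first-word (code mode)
  "Fontify the string CODE under major mode function MODE.
Return the face property of the character at buffer position 2."
  (with-temp-buffer
    (insert code)
    (funcall mode)
    ;; make sure the whole buffer is fontified before we look at it
    (if (fboundp 'font-lock-ensure)
        (font-lock-ensure)
      (font-lock-fontify-buffer))
    ;; position 1 is the open paren, position 2 the first letter after it
    (get-text-property 2 'face)))

(my-face-of-first-word "(defun foo ())" 'emacs-lisp-mode)
```

The face symbol returned here is the kind of information htmlize turns into a “span” tag with a class.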

Now, we have a buffer object htmlizeOutputBuf, which contains the htmlized text. We want to grab part of the text that is the htmlized source code. (that is, excluding the usual html header and footer)

The “(with-current-buffer ‹buffer x› ‹body›)” will temporarily switch to ‹buffer x›, then run ‹body›, return its last value, and restore current buffer. So, we temporarily switch to the buffer htmlizeOutputBuf, then use search-forward to find the beginning and ending positions of the opening/closing “pre” tags, then we use buffer-substring-no-properties to grab the text between them.
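The same technique can be sketched as a small general helper. This is not the code used in htmlize-string (which adjusts the positions by fixed offsets instead); the name my-text-between is hypothetical, and it uses match-beginning to find where the closing tag starts.

```elisp
(defun my-text-between (buffer open close)
  "Return the text in BUFFER between the first OPEN and the next CLOSE."
  (with-current-buffer buffer
    (save-excursion
      (goto-char (point-min))
      ;; search-forward returns the position just after the match
      (let ((start (search-forward open)))
        (search-forward close)
        ;; match-beginning 0 is where the CLOSE match starts
        (buffer-substring-no-properties start (match-beginning 0))))))

(let ((buf (generate-new-buffer "*my-demo*")))
  (unwind-protect
      (progn
        (with-current-buffer buf
          (insert "<html><pre>abc</pre></html>"))
        (my-text-between buf "<pre>" "</pre>"))
    (kill-buffer buf)))
;; returns "abc"
```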

Emacs's buffer related functions can often take an argument that is either a buffer name (of type “string”) or a buffer object itself (of type “buffer”).

(info "(elisp) Buffers")
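A quick sketch of the name-or-object point, using with-current-buffer and kill-buffer, both of which accept either form. The function name and the buffer name “*my-demo*” are made up for this example.

```elisp
(defun my-buffer-name-or-object-demo ()
  "Insert into one buffer twice: via the buffer object, then via its name."
  (let* ((buf (generate-new-buffer "*my-demo*"))   ; a buffer object
         (name (buffer-name buf)))                 ; its name, a string
    (unwind-protect
        (progn
          (with-current-buffer buf (insert "by object"))
          (with-current-buffer name (insert ", by name"))
          (with-current-buffer buf (buffer-string)))
      ;; clean up, this time referring to the buffer by name
      (kill-buffer name))))

(my-buffer-name-or-object-demo)  ; returns "by object, by name"
```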

Emacs strings can carry information called text properties, which hold info such as font and coloring. To grab a string in a buffer, you can use “buffer-substring” (which keeps the properties) or “buffer-substring-no-properties” (which drops them). Most emacs functions that take a string as argument accept strings with or without properties.

(info "(elisp) Text Properties")
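The difference between the two functions can be seen in a minimal sketch. The function name my-substring-faces is made up for this example; propertize and get-text-property are standard elisp.

```elisp
(defun my-substring-faces ()
  "Return the face of a substring grabbed with and without properties."
  (with-temp-buffer
    ;; insert text that carries a face property
    (insert (propertize "hello" 'face 'bold))
    (list
     ;; buffer-substring keeps the property
     (get-text-property 0 'face (buffer-substring 1 6))
     ;; buffer-substring-no-properties drops it
     (get-text-property 0 'face (buffer-substring-no-properties 1 6)))))

(my-substring-faces)  ; returns (bold nil)
```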

Here's the code of my htmlize-block function:

(defun htmlize-block ()
  "Replace the region enclosed by <pre> tag to htmlized code.
For example, if the cursor somewhere inside the pre tags:

<pre class=\"code\">
mySourceCode...
</pre>

after calling, the “mySourceCode...” block of text will be htmlized.
That is, wrapped with many <span> tags.

The opening tag must be of the form <pre class=\"lang-str\">.
The “lang-str” determines what emacs mode is used to colorize
the code.
This function requires htmlize.el by Hrvoje Niksic."

  (interactive)
  (let (mycode tagBegin styclass codeBegin codeEnd tagEnd mymode)

    ;; [A-Za-z-] rather than [A-z-]: the range A-z also matches [ \ ] ^ _ `
    (setq tagBegin (re-search-backward "<pre class=\"\\([A-Za-z-]+\\)\""))
    (setq styclass (match-string 1))
    (setq codeBegin (search-forward ">"))
    (search-forward "</pre>")
    (setq codeEnd (search-backward "<"))
    (setq tagEnd (search-forward "</pre>"))
    (setq mycode (buffer-substring-no-properties codeBegin codeEnd))

    (cond
     ((string= styclass "haskell") (setq mymode "haskell-mode"))
     ((string= styclass "ocaml") (setq mymode "tuareg-mode"))
     ((string= styclass "elisp") (setq mymode "emacs-lisp-mode"))
     ((string= styclass "scheme") (setq mymode "scheme-mode"))
     ((string= styclass "javascript") (setq mymode "js2-mode"))
     ((string= styclass "python") (setq mymode "python-mode"))
     ((string= styclass "ruby") (setq mymode "ruby-mode"))
     ((string= styclass "perl") (setq mymode "cperl-mode"))
     ((string= styclass "php") (setq mymode "php-mode"))
     ((string= styclass "c") (setq mymode "c-mode"))
     ((string= styclass "java") (setq mymode "java-mode"))
     ((string= styclass "html") (setq mymode "html-mode"))
     ((string= styclass "xml") (setq mymode "xml-mode"))
     ((string= styclass "css") (setq mymode "css-mode"))
     ((string= styclass "povray") (setq mymode "pov-mode"))
     ((string= styclass "lsl") (setq mymode "xlsl-mode"))
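     ;; unknown class: stop here, before the buffer text is touched
     (t (error "htmlize-block: no mode known for class \"%s\"" styclass))
     )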

    (save-excursion
      (delete-region codeBegin codeEnd)
      (goto-char codeBegin)
      (insert (htmlize-string mycode mymode))
      )
    )
  )

The outline of this function is to grab the text inside the “pre” block, call htmlize-string, then insert the result in place of the original text.

Originally, i planned to determine the extent of the code block by matching for “<pre>...</pre>” tags, then use some heuristics on the text to determine what language it is (by a simple regex match for certain strings particular to the lang), then call htmlize-string with the mode name passed to it. However, my html pages already have the language information as the pre tag's attribute, “<pre class="perl">” (for CSS reasons). So now i search for text of that form, and use the “class” value to determine the mode.
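The class-to-mode mapping can also be kept as data instead of a chain of cond clauses. The following is a sketch of that alternative, not the code used in htmlize-block; the names my-class-to-mode-alist and my-mode-name-for-class are hypothetical, and the table lists only a few of the languages.

```elisp
(defvar my-class-to-mode-alist
  '(("elisp" . "emacs-lisp-mode")
    ("python" . "python-mode")
    ("perl" . "cperl-mode")
    ("java" . "java-mode"))
  "Hypothetical table mapping a pre tag's class value to a mode name.")

(defun my-mode-name-for-class (class)
  "Return the mode name for CLASS, or nil when CLASS is not in the table."
  ;; assoc compares keys with equal, so string keys work
  (cdr (assoc class my-class-to-mode-alist)))

(my-mode-name-for-class "perl")   ; returns "cperl-mode"
(my-mode-name-for-class "cobol")  ; returns nil
```

Adding a language then means adding one pair to the table, and a nil result is an easy way to detect a class with no known mode.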

Emacs is beautiful.

Note: quote from htmlize.el:

htmlize supports three types of HTML output, selected by setting “htmlize-output-type”: “css”, “inline-css”, and “font”. ... “css” mode is the default.

My htmlize-block and htmlize-string assume the css mode too. This means you'll have to do a one-time manual process of grabbing the CSS from the htmlized output and placing it in your own CSS file. You can use my CSS code for language here: http://xahlee.org/lang.css.

If your html is in unicode, you might add the following to your emacs init file:

(setq htmlize-convert-nonascii-to-entities nil)
(setq htmlize-html-charset "utf-8")

They will prevent htmlize from creating ugly html entities. For example, if you have a bullet char “•” (Unicode U+2022), you will see it as is instead of “&#x2022;”.


Piecemeal Process

Postscript:

The story given above is slightly simplified. For example, when i began my language notes and commentaries, they were not planned to be some systematic or sizable tutorial. As the pages grew, more polish was added in the editorial process. So, plain un-colored code inside “pre” started to have “language comment” strings colorized (e.g. “<span class="cmt">#...</span>”), by using simple elisp code that wraps a tag around them, with this function assigned to a shortcut key for easy execution. As pages and languages grew, i found that colorizing comments wasn't enough, so i started to look for a syntax-coloring html solution. There are solutions in Perl, Python, PHP, but I find the emacs solution best suits my needs, in particular because it is integrated with emacs's interactive nature, and my writing work is done in an accumulative, piecemeal, editorial process.

Once i found and decided to use htmlize.el, i used the commands htmlize-region and htmlize-buffer when writing new tutorial pages. Note that this is still a laborious process involving multiple deliberate copy-paste operations. Gradually i needed to colorize my existing tutorial pages. The problem was that many already contained my home cooked “span class="cmt"” tags, and strings common in computer languages such as “x < y” had already been transformed into the required html encoding “x &lt; y”. So, the elisp code first needed to “un-htmlize” these, and initially my htmlize-block code contained lines to un-htmlize those specific tags and entity strings. After many months, when all my existing code had been newly colorized, the part of the code that transforms strings for un-htmlizing was no longer necessary, so it was taken out of htmlize-block, which returned to a cleaner state.

Also, htmlize-block went thru many revisions over the year. At some point in the recent past, i had one code wrapper for each language. For example, i had htmlize-me-perl, htmlize-me-python, htmlize-me-java, etc. The thought of unifying them into a single coherent wrapper hadn't materialized yet. In general, it is my experience, in particular in writing elisp customization for emacs, that tweaking code periodically thru the year is practical, because it adapts to the constant changes of requirements, environment, work process. For example, eventually i might write my own htmlize.el, if i happen to need more flexibility, or if my elisp experience grows enough to make the job relatively easy.

DeHtmlize Text

2009-01-24

The raw html of htmlized language code is usually unreadable. Here's an illustration, the htmlized form of just 2 lines of ocaml code:

<span
class="tuareg-font-lock-governing">let</span> <span
class="function-name">myComposition</span><span
class="variable-name"> f g </span><span
class="tuareg-font-lock-operator">=</span> <span
class="tuareg-font-lock-operator">(</span><span
class="keyword">fun</span> <span
class="variable-name">x </span><span
class="tuareg-font-lock-operator">-></span> f <span
class="tuareg-font-lock-operator">(</span>g x<span
class="tuareg-font-lock-operator">)</span> <span
class="tuareg-font-lock-operator">);;</span>
myComposition <span
class="tuareg-font-lock-operator">(</span><span
class="keyword">fun</span> <span
class="variable-name">x </span><span
class="tuareg-font-lock-operator">-></span> x <span
class="tuareg-font-lock-operator">^</span> <span
class="string">"c"</span><span
class="tuareg-font-lock-operator">)</span> <span
class="tuareg-font-lock-operator">(</span><span
class="keyword">fun</span> <span
class="variable-name">x </span><span
class="tuareg-font-lock-operator">-></span> x <span
class="tuareg-font-lock-operator">^</span> <span
class="string">"b"</span><span
class="tuareg-font-lock-operator">)</span> <span
class="string">"a"</span><span
class="tuareg-font-lock-operator">;;</span>

The actual code is just this:

let myComposition f g = (fun x -> f (g x) );;
myComposition (fun x -> x ^ "c") (fun x -> x ^ "b") "a";;

Suppose you want to modify the ocaml code presented in html. Usually, you view it in a browser, copy the ocaml code, create a new buffer, and paste the code there to edit it. When done, you copy the newly edited text, close the temp buffer, delete the htmlized version in your html file, paste the new code in, then htmlize it again. This process is painful.

It would be nice if you could press a button to turn the htmlized source code in your html into plain text, so you can modify it, then press a button again to have it htmlized again.

Here are 2 elisp functions to dehtmlize. The dehtmlize-block will dehtmlize code inside a pre block of the form “<pre class="langName">”. The dehtmlize-span-region will dehtmlize a selected region.

(defun dehtmlize-block ()
  "Delete span tags inside a <pre> region.
For example, if the cursor somewhere inside the tag:

<pre class=\"code\">
codeXYZ...
</pre>

after calling, the “codeXYZ...” block of text's span tags will be removed.
dehtmlize-block is the reverse of htmlize-block."
  (interactive)
  (let (mycode tag-begin code-begin code-end tag-end mymode)
    (progn
      ;; [A-Za-z-] rather than [A-z-]: the range A-z also matches [ \ ] ^ _ `
      (setq tag-begin (re-search-backward "<pre class=\"\\([A-Za-z-]+\\)\""))
      (setq code-begin (re-search-forward ">"))
      (re-search-forward "</pre>")
      (setq code-end (re-search-backward "<"))
      (setq tag-end (re-search-forward "</pre>"))
      )

    (let (myStr)
      (setq myStr (buffer-substring code-begin code-end))
      (setq myStr (replace-regexp-in-string "<span class=\"[^\"]+\">" "" myStr))
      (setq myStr (replace-regexp-in-string "</span>" "" myStr))
      ;; decode &amp; last, so that "&amp;lt;" becomes "&lt;" instead of "<"
      (setq myStr (replace-regexp-in-string "&lt;" "<" myStr))
      (setq myStr (replace-regexp-in-string "&gt;" ">" myStr))
      (setq myStr (replace-regexp-in-string "&amp;" "&" myStr))
      (delete-region code-begin code-end)
      (goto-char code-begin)
      (insert myStr)
      )
    )
  )

(defun dehtmlize-span-region (p1 p2)
  "Delete HTML “span” tags on a region.
Note: only certain span tags are deleted."
  (interactive "r")

  (let (mystr)
    (setq mystr (buffer-substring p1 p2))

    (setq mystr
          (with-temp-buffer
            (insert mystr)
            
            (goto-char (point-min))
            (while (search-forward-regexp "<span class=\"[^\"]+\">" nil t) (replace-match ""))

            (goto-char (point-min))
            (while (search-forward "</span>" nil t) (replace-match ""))

            ;; decode &lt; and &gt; first and &amp; last, so that text
            ;; such as "&amp;lt;" becomes "&lt;" instead of "<"
            (goto-char (point-min))
            (while (search-forward "&lt;" nil t) (replace-match "<"))

            (goto-char (point-min))
            (while (search-forward "&gt;" nil t) (replace-match ">"))

            (goto-char (point-min))
            (while (search-forward "&amp;" nil t) (replace-match "&"))

            (buffer-string)
            ))
    (delete-region p1 p2)
    (insert mystr))
  )
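
For trying out the span-stripping and entity-decoding logic without touching any buffer, the same transformations can be sketched as a function on strings. The name my-dehtmlize-string is made up for this example; it is a simplified companion to the two commands above, not a replacement for them.

```elisp
(defun my-dehtmlize-string (html)
  "Return HTML with classed span tags removed and basic entities decoded."
  (let ((s html))
    ;; the two t arguments mean: keep case as is, treat the
    ;; replacement text literally
    (setq s (replace-regexp-in-string "<span class=\"[^\"]+\">" "" s t t))
    (setq s (replace-regexp-in-string "</span>" "" s t t))
    (setq s (replace-regexp-in-string "&lt;" "<" s t t))
    (setq s (replace-regexp-in-string "&gt;" ">" s t t))
    ;; &amp; is decoded last so that "&amp;lt;" yields "&lt;"
    (replace-regexp-in-string "&amp;" "&" s t t)))

(my-dehtmlize-string
 "(<span class=\"keyword\">if</span> (&lt; 3 2)  (message <span class=\"string\">\"yes\"</span>) )")
;; returns the string: (if (< 3 2)  (message "yes") )
```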

JavaScript Solution

2009-01-20

Google has an open source technology that uses javascript to color code in html on the fly instead of using the bulky markup. For details, see: Google-code-prettify Examples.


2007-10
© 2007 by Xah Lee.