Elisp Lesson: Text Processing HTML

Xah Lee, 2007-11

This page shows a real-world example of using Emacs Lisp to process an HTML file. If you don't know elisp, first take a look at Emacs Lisp Basics.

The Problem

Summary

I want to write an elisp program that processes an HTML file in a somewhat complex way. Specifically, certain strings must be replaced only if they appear inside a certain tag, and/or only if they are the first child.

Detail

I have many web pages that have a Questions And Answers format. The following is a sample screenshot.

[Image: website QA screenshot]

The following is an example of the raw HTML:

<p class="q">Q: Why ...</p>
<p class="a">A: Because ...</p>
<p class="a">You need to do ...</p>
...
<p class="q">Q: How ...</p>
<p class="a">A: Do this ...</p>
<p class="a">And that ...</p>
...

Basically, each Question section is a paragraph of class “q”, and each Answer section is several “<p>” tags with class “a”.

After a few years with this format, I started to use a better one. Specifically, an Answer section should just be wrapped with a single “<div class="a"></div>”. Also, the “Q: ” and “A: ” texts are removed from the content (because CSS can insert them automatically). Here's an example of the new format:

<p class="q">Why ...</p>
<div class="a">
<p>Because ...</p>
<p>You need to do ...</p>
...
</div>

The task I have now is to transform existing pages to this new format. Here's what needs to be done precisely:

For any consecutive blocks of “<p class="a">...</p>”, wrap them with a “<div class="a">” and “</div>”, then replace those “<p class="a">” by “<p>”. Also, remove those “Q: ” and “A: ”.

Although this is simple in principle, it is hard to code as described without an HTML parser. Using an HTML parser has its own problems: the HTML/DOM model would make the code much more complex, and the output would change the placement of whitespace. Unless we are doing XML transformation, an HTML/DOM parser is usually not what we want. A text-based search-and-replace algorithm to achieve the above is as follows:

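For each occurrence of “<p class="q">”, do the following:

Find the next “<p class="a">” after it, and insert “<div class="a">” in front of it.

Find the next “<p class="q">”, go back to the closest “</p>” before it, and insert “</div>” after that “</p>”. For the last question in the file, insert “</div>” after the “</p>” that closes the last “<p class="a">” paragraph instead.

Now do:

Remove the “Q: ” and “A: ” that follow the opening tags, and replace each “<p class="a">” by “<p>”.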

We proceed to write elisp code to solve this problem.

Solution

The algorithm described above is based on global text replacement. However, since emacs has a buffer representation of files with a pointer that can move back and forth, the algorithm is slightly simplified.

Suppose one of the files we want to work on is elisp_process_html_sample.html.zip.

First, we write a prototype that just works on a single file “elisp_process_html_sample.html”. Here's the code:

(defun xx ()
  "Temporary test function: convert the Q and A markup in the sample file."
  (interactive)
  (find-file "elisp_process_html_sample.html")
  ;; same effect as “beginning-of-buffer”, but meant for use in Lisp code
  (goto-char (point-min))

  ;; Add opening and closing tags for each answer section.
  ;; This is done by locating the opening question tag, then the first
  ;; answer tag after it, and inserting <div class="a"> in front of it.
  ;; Then, locate the next opening question tag but move backward to </p>,
  ;; then insert </div>.
  (while (search-forward "<p class=\"q\">" nil t)
    (search-forward "<p class=\"a\">")
    (replace-match "<div class=\"a\">\n<p class=\"a\">" t t)
    (when (search-forward "<p class=\"q\">" nil t)
      (search-backward "</p>")
      (forward-char 4)
      (insert "\n</div>")))

  ;; Add the last closing tag for the last answer section.
  ;; (goto-char (point-max)) has the same effect as “end-of-buffer”.
  (goto-char (point-max))
  (search-backward "<p class=\"a\">")
  (search-forward "</p>")
  (insert "\n</div>")

  ;; Take out the “Q: ” and “A: ”, then replace every remaining
  ;; “<p class="a">” by “<p>”. Each pass must restart from the top,
  ;; otherwise matches before the point are skipped.
  (goto-char (point-min))
  (while (search-forward "<p class=\"q\">Q: " nil t)
    (replace-match "<p class=\"q\">" t t))
  (goto-char (point-min))
  (while (search-forward "<p class=\"a\">A: " nil t)
    (replace-match "<p>" t t))
  (goto-char (point-min))
  (while (search-forward "<p class=\"a\">" nil t)
    (replace-match "<p>" t t)))

This is simple code. It uses the power of the emacs buffer data structure for files: it moves a pointer (the “point”) back and forth to a desired place, then searches and replaces or inserts text. With the ability to move the point to a particular string, we are able to locate the places where we want the tag insertion to happen, without explicitly going by the DOM model of parent-child relationship of tags.

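In the above code, the “search-forward” function moves the point to the end of the matched text. By default it signals an error if the text is not found. When its third argument is “t”, as in the “while” loops above, it returns “nil” instead and leaves the point where it was. “search-backward” works similarly, but puts the point at the beginning of the matched text.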
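Here's a small sketch of that behavior, run in a temporary buffer so no file is touched. The function name “my-search-demo” and the sample text are made up for illustration.

```elisp
(defun my-search-demo ()
  "Return (AFTER-MATCH BEFORE-MATCH MISSING POINT-AFTER-MISS) for a sample text.
Hypothetical helper, for illustration only."
  (with-temp-buffer
    (insert "<p class=\"q\">Q: Why</p>\n<p class=\"a\">A: Because</p>")
    (goto-char (point-min))
    (let* (;; Found: the return value is the new point, just after the match.
           (after-match (search-forward "<p class=\"a\">" nil t))
           ;; Backward search: the point lands on the start of the match.
           (before-match (search-backward "</p>" nil t))
           ;; Not found, and NOERROR is t: returns nil, the point stays put.
           (missing (search-forward "<table>" nil t))
           (point-after-miss (point)))
      (list after-match before-match missing point-after-miss))))
```

Evaluating “(my-search-demo)” gives a list of three positions and one “nil”. The last position is the same as the second one, because the failed search did not move the point.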

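The “replace-match” function replaces the text matched by the most recent search. The “end-of-buffer” command moves the point to the end of the buffer, and “beginning-of-buffer” moves it to the beginning. In Lisp code, “(goto-char (point-max))” and “(goto-char (point-min))” do the same without setting the mark, and are what the Emacs documentation recommends for programs.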
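A minimal sketch of the search-then-replace pattern, again in a temporary buffer. The name “my-strip-prefix-demo” and the sample text are hypothetical. The two extra “t” arguments to “replace-match” are my own choice here: they tell it to keep the replacement's letter case and to treat the replacement literally.

```elisp
(defun my-strip-prefix-demo ()
  "Return a sample string with the “Q: ” prefixes removed.
Hypothetical helper, for illustration only."
  (with-temp-buffer
    (insert "<p class=\"q\">Q: Why</p>\n<p class=\"q\">Q: How</p>")
    ;; Start from the top of the buffer.
    (goto-char (point-min))
    (while (search-forward "<p class=\"q\">Q: " nil t)
      ;; Replace the text matched by the search just above.
      (replace-match "<p class=\"q\">" t t))
    (buffer-string)))
```

The loop ends when “search-forward” returns “nil”, that is, when no more “Q: ” prefix is left after the point.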

Reference: Elisp Manual: String-Search.

Now, if we want to process many files, first we need to change the code to take a file path, and add code to save buffer and close buffer. Like this:

(defun my-process-html (fpath)
  "a better doc string here..."
  (let ((mybuffer (find-file fpath)))
    ;; code body here
    (save-buffer)
    (kill-buffer mybuffer)))
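For batch work, a possible alternative is to not visit the file at all: read it into a temporary buffer, edit, and write the buffer back. The following is only a sketch under assumptions. The name “my-process-html-quietly” is hypothetical, only one of the replacement steps is shown as the body, and it assumes the default coding system is fine for the files.

```elisp
(defun my-process-html-quietly (fpath)
  "Strip the “Q: ” prefixes in the file at FPATH without visiting it.
Hypothetical variant, for illustration only."
  (with-temp-buffer
    ;; Read the file into the temporary buffer; no major mode or hooks run.
    (insert-file-contents fpath)
    (goto-char (point-min))
    (while (search-forward "<p class=\"q\">Q: " nil t)
      (replace-match "<p class=\"q\">" t t))
    ;; Write the whole temporary buffer back over the file.
    (write-region (point-min) (point-max) fpath)))
```

The temporary buffer goes away by itself when “with-temp-buffer” finishes, so there is no “save-buffer” or “kill-buffer” step.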

To get the list of files containing the Q and A section, we can simply use Unix's “find” and “grep”, like this: “find . -name "*.html" -exec grep -l '<p class="q">' {} \;”

Then, place the file paths into a list and loop over the list, like this:

(mapcar 'my-process-html
        (list
"/Users/xah/web/emacs/emacs_adv_tips.html"
"/Users/xah/web/emacs/emacs_display_faq.html"
"/Users/xah/web/emacs/emacs_esoteric.html"
"/Users/xah/web/emacs/emacs_html.html"
"/Users/xah/web/emacs/emacs_n_unicode.html"
"/Users/xah/web/emacs/emacs_unix.html"
"/Users/xah/web/emacs/keyboard_shortcuts.html"
"/Users/xah/web/emacs/modernization.html"
"/Users/xah/web/img/imagemagic.html"
"/Users/xah/web/java-a-day/abstract_class.html"
"/Users/xah/web/sl/build_q.html"
"/Users/xah/web/sl/q.html"
"/Users/xah/web/UnixResource_dir/macosx.html"
"/Users/xah/web/UnixResource_dir/unix_tips.html"
"/Users/xah/web/UnixResource_dir/writ/mshatredfaq.html"
"/Users/xah/web/UnixResource_dir/writ/tabs_vs_spaces.html"
         )
)

“mapcar” is a Lisp idiom for looping through a list. Its first argument is a function, which is applied to every element of the list. The single quote in front of the function name is necessary: it passes the symbol itself to “mapcar”, instead of having Lisp evaluate it as a variable.
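The file list can also be collected from inside emacs, instead of with “find” and “grep”. Here's a sketch, assuming Emacs 25.1 or later (for “directory-files-recursively”). The name “my-files-with-questions” is hypothetical.

```elisp
(defun my-files-with-questions (dir)
  "Return the .html files under DIR that contain <p class=\"q\">.
Hypothetical helper, for illustration only."
  (let (result)
    (dolist (fpath (directory-files-recursively dir "\\.html\\'"))
      (with-temp-buffer
        (insert-file-contents fpath)
        (goto-char (point-min))
        ;; Keep the file only if it has at least one question paragraph.
        (when (search-forward "<p class=\"q\">" nil t)
          (push fpath result))))
    (nreverse result)))

;; Possible usage (left commented out, since it would modify files):
;; (mapc 'my-process-html (my-files-with-questions "~/web/"))
```

The usage line calls “mapc” instead of “mapcar”. It loops the same way, but does not collect the return values into a new list, which fits here because we only want the side effect on the files.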

Emacs is beautiful!


Page created: 2007-11.
© 2007 by Xah Lee.