Xah Lee, 2008-07
This page shows how to use elisp to create a sitemap. If you don't know elisp, first take a look at Emacs Lisp Basics.
I want to use elisp to create a sitemap. More specifically: generate a list of all files in a given directory, including its subdirectories; for each file, create a url string in a particular XML form; and put the whole result into a file with the proper header and footer text.
A sitemap is an XML file that lists the urls of all files in a website, for web crawlers to crawl. If you are not familiar with it, see Google Sitemaps and http://www.sitemaps.org/.
A sitemap file looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2005-01-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
The file can have many “<url>...</url>” tag pairs. Each “<url>” container represents one file, plus other info for the web crawler. There can be tens of thousands of “<url>” entries (at most 50k urls per sitemap file). The “<loc>” is the URL of the file. The “<lastmod>”, “<changefreq>”, and “<priority>” tags are optional.
The purpose of a sitemap file is to let web crawlers know all the files that exist on your site, without having to discover them by the haphazard process of extracting links from pages they happen to know. This makes crawling more efficient. Once a crawler knows all your files, it can decide which pages it actually wants to crawl for content.
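As a small illustration of the format, here is a sketch of a helper that builds one “<url>” entry as a string. The name “my-sitemap-url-entry” and its argument order are hypothetical, chosen for this example only; the script on this page does not use it. It does no XML escaping, so a url containing “&” would need to be escaped first.

```elisp
;; Sketch only: `my-sitemap-url-entry' is a hypothetical helper,
;; not part of the sitemap script on this page.
(defun my-sitemap-url-entry (loc &optional lastmod changefreq priority)
  "Return a <url> element for LOC as a string.
LASTMOD, CHANGEFREQ and PRIORITY are optional strings.
Any of them that is nil is left out of the result."
  (concat "<url><loc>" loc "</loc>"
          ;; `when' returns nil if the test fails, and `concat' ignores nil
          (when lastmod (concat "<lastmod>" lastmod "</lastmod>"))
          (when changefreq (concat "<changefreq>" changefreq "</changefreq>"))
          (when priority (concat "<priority>" priority "</priority>"))
          "</url>\n"))
```

For example, “(my-sitemap-url-entry "http://www.example.com/" nil "monthly")” gives a “<url>” entry with a “<loc>” and a “<changefreq>” but no “<lastmod>” or “<priority>”.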
My website xahlee.org has close to 4000 html files. I want to use elisp to generate a sitemap file.
Some of the files under my website's document dir are temp files not meant for public access. The names of these files or dirs start with “xx”. I don't want them included in the sitemap. Also, files whose content contains a particular string shouldn't be in the sitemap either. So, my elisp program needs to check the file's name, and the file's content too.
The general plan is very simple. Create a new file, insert XML header tags, then traverse my web dir. For each file, determine whether it should be listed in the sitemap. If so, generate the proper url tag and insert it into the new file. Then, insert ending XML tags. Save the file, then done.
First, I define some parameters for the program.
;; full path to web's doc root. Must end in a slash.
(setq webroot "/Users/xah/web/")

;; file name of sitemap file, relative to webroot, without “.xml” suffix.
(setq sitemapFileName "sitemap")

;; gzip it or not. t for true, nil for false.
(setq gzip-it-p t)
I plan to generate a fresh sitemap regularly, since my website gets a few new files each week. So, if a sitemap file already exists, I want to back it up and generate a new one. Here's the code:
; rename file to backup ~ if already exist
(let (f1 f2)
  (setq f1 (concat webroot sitemapFileName ".xml"))
  (setq f2 (concat f1 ".gz"))
  (when (file-exists-p f1)
    (rename-file f1 (concat f1 "~") t))
  (when (file-exists-p f2)
    (rename-file f2 (concat f2 "~") t)))
Note that the “rename-file” function takes an optional 3rd argument. If it is t, any existing file at the new name is simply overwritten.
The next step is to open a buffer sitemapBuf, insert the sitemap header tags, then, for each file in my web dir, insert its url into sitemapBuf, then add the ending tags and save. Here's the code:
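To see what the 3rd argument does, here is a minimal sketch that works on throwaway temp files. The function name “my-rename-demo” is made up for this example.

```elisp
;; Sketch only: `my-rename-demo' is a hypothetical name. It uses temp
;; files, so it does not touch any real sitemap.
(defun my-rename-demo ()
  "Rename a temp file onto an existing backup name.
Return a list (OLD-EXISTS NEW-EXISTS) describing the state afterward."
  (let* ((old (make-temp-file "sitemap-demo-")) ; the file to be renamed
         (new (concat old "~")))                 ; the backup name
    (write-region "stale backup" nil new)        ; make the target already exist
    ;; With nil as 3rd argument, this call would signal a
    ;; `file-already-exists' error. With t, NEW is replaced.
    (rename-file old new t)
    (prog1 (list (file-exists-p old) (file-exists-p new))
      (delete-file new))))                       ; clean up
```

Calling “(my-rename-demo)” should return “(nil t)”: the old name is gone and the backup name exists.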
;; filePath is the full path to the sitemap file
;; sitemapBuf is the buffer of the sitemap file
(let (filePath sitemapBuf)
  (setq filePath (concat webroot sitemapFileName ".xml"))

  ;; open file and save a handle to the buffer
  (setq sitemapBuf (find-file filePath))

  ;; insert header tags
  (insert "<?xml version=\"1.0\" encoding=\"UTF-8\"?>
<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">
")

  ;; for each file in my site, insert its url
  (require 'find-lisp)
  (mapc (lambda (x) (my-process-file x sitemapBuf))
        (find-lisp-find-files webroot "\\.html$"))

  ;; insert ending tag
  (insert "</urlset>")

  ;; some post processing to add some optional tags
  (goto-char 0)
  (search-forward "http://xahlee.org/Periodic_dosage_dir/pd.html</loc>")
  (insert "<changefreq>daily</changefreq>")

  (save-buffer)

  ;; gzip it
  (when gzip-it-p
    (shell-command (concat "gzip " filePath))))
In the above, first we generate the full path to the sitemap file to be created. The full path is saved as a string in “filePath”. Then we open the file, effectively creating a new buffer. The buffer instance is saved in the variable sitemapBuf. (Note: “buffer” is an elisp datatype, or an instance of that datatype. Normally when we say “buffer”, we actually mean the buffer's content.)
The interesting part in the above code is the section that traverses the directory. The “find-lisp-find-files” line returns a list of the full paths of all html files. The “mapc” applies a function to each element of the list. The lambda line is the function that will be applied to each full path.
So, for example, if an element is “/Users/xah/web/emacs/emacs.html”, then the lambda function will get that as its argument, and execute “(my-process-file "/Users/xah/web/emacs/emacs.html" sitemapBuf)”.
The my-process-file function takes a file's full path and a buffer. It opens the file and decides whether the file should be added to the sitemap. If so, it adds the file's url to the sitemapBuf buffer.
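The same “mapc” plus “find-lisp-find-files” pattern can be tried on any directory. Here is a small sketch that collects the names of html files instead of processing them. The function name “my-collect-html-names” is hypothetical.

```elisp
(require 'find-lisp)

;; Sketch only: `my-collect-html-names' is a hypothetical helper that
;; shows the traversal pattern, without opening any file.
(defun my-collect-html-names (dir)
  "Return a sorted list of all .html files under DIR, as paths relative to DIR."
  (let ((result '()))
    ;; `mapc' calls the lambda once per full path, for side effect only
    (mapc (lambda (x)
            (push (file-relative-name x dir) result))
          ;; the regex is matched against each file's name, recursively
          (find-lisp-find-files dir "\\.html$"))
    (sort result #'string<)))
```

For a directory that holds “a.html”, “c.txt”, and “sub/b.html”, this returns a list of “a.html” and “sub/b.html”.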
my-process-file is defined this way:
(defun my-process-file (fpath destBuff)
  "process the file at fullpath fpath. Write result to buffer destBuff."
  (let (fBuf)
    (message "%s" fpath) ; show to user what the program is currently doing
    (when (not (string-match "/xx" fpath)) ; skip dir/file starting with xx
      (setq fBuf (find-file fpath)) ; open file
      (goto-char (point-min))
      (when (not (search-forward "<meta http-equiv=\"refresh\"" nil "noerror"))
        (with-current-buffer destBuff ; insert url to sitemap buffer
          (insert "<url><loc>")
          (insert (concat "http://xahlee.org/" (substring fpath (length webroot))))
          (insert "</loc></url>\n")))
      (kill-buffer fBuf)))) ; close file
It takes 2 arguments. The “fpath” is the path to an html file, and “destBuff” is the buffer holding the sitemap file.
First it checks whether the file path contains “/xx”. On my website, files and dirs whose names start with “xx” are meant to be temp files. So, if any file or dir name in the path starts with “xx”, the file is skipped.
Otherwise, open the file and check whether it contains an html meta redirect tag. Google's webmaster guide says Google doesn't like a sitemap url that points to a file that redirects with an html meta tag. So, if the html file is a redirect, then don't generate a sitemap url for it.
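The name test, and the path-to-url conversion that my-process-file does with “substring”, are both plain string operations, so they can be tried in isolation. Here is a sketch. The function names “my-private-path-p” and “my-path-to-url” are hypothetical, and the root dir and base url are passed in as arguments here instead of being read from “webroot”.

```elisp
;; Sketch only: both function names are hypothetical helpers.
(defun my-private-path-p (fpath)
  "Return t if any file or dir name in FPATH starts with \"xx\"."
  (and (string-match-p "/xx" fpath) t))

(defun my-path-to-url (fpath root base-url)
  "Turn the file path FPATH under dir ROOT into a url under BASE-URL.
ROOT and BASE-URL are both assumed to end in a slash."
  ;; drop the ROOT prefix, then put BASE-URL in front of what is left
  (concat base-url (substring fpath (length root))))
```

For example, “(my-path-to-url "/Users/xah/web/emacs/emacs.html" "/Users/xah/web/" "http://xahlee.org/")” returns “http://xahlee.org/emacs/emacs.html”.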
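The content check can also be written without visiting the file. The following is a sketch of that variant, not what the script on this page does: it reads the file into a temporary buffer with “insert-file-contents” instead of calling “find-file”. The function name “my-meta-redirect-p” is hypothetical.

```elisp
;; Sketch only: `my-meta-redirect-p' is a hypothetical helper. It is a
;; variant of the check in my-process-file that avoids `find-file'.
(defun my-meta-redirect-p (fpath)
  "Return t if the file at FPATH contains an html meta refresh tag."
  (with-temp-buffer
    (insert-file-contents fpath)   ; raw file text, no major mode setup
    (goto-char (point-min))
    ;; a non-nil 3rd argument makes `search-forward' return nil
    ;; on failure instead of signaling an error
    (and (search-forward "<meta http-equiv=\"refresh\"" nil t) t)))
```

The temporary buffer is discarded automatically when “with-temp-buffer” finishes, so no “kill-buffer” call is needed.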
Finally, the code calls “(with-current-buffer destBuff ...)” to insert the proper url tag into the sitemap buffer.
The function “(with-current-buffer ‹buffer› ‹code›)” will temporarily make the “‹buffer›” the current buffer and execute “‹code›”. When the execution of “‹code›” is done, the current buffer returns to whatever it was.
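Here is a small sketch of that behavior. It inserts a url tag into a scratch buffer, then checks that the current buffer is the same as before. The function name and the buffer name “*sitemap-demo*” are made up for this example.

```elisp
;; Sketch only: `my-with-current-buffer-demo' is a hypothetical name.
(defun my-with-current-buffer-demo ()
  "Insert text into another buffer, then report what happened.
Return a list (SAME-BUFFER-P INSERTED-TEXT)."
  (let ((dest (generate-new-buffer "*sitemap-demo*"))
        (before (current-buffer)))
    ;; inside the form, DEST is the current buffer
    (with-current-buffer dest
      (insert "<url><loc>http://www.example.com/</loc></url>\n"))
    ;; outside the form, the current buffer is BEFORE again
    (prog1 (list (eq (current-buffer) before)
                 (with-current-buffer dest (buffer-string)))
      (kill-buffer dest))))
```

The first element of the returned list is t, showing that the current buffer was restored. The second element is the text that landed in the other buffer.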
Once we are done with inserting a url into the sitemap, we close the opened html file by the “kill-buffer” function.
The complete code, put together, is this:
;; 2008-07-02
;; sitemap_generator.el
;; this script generates the sitemap file for xahlee.org
;; see http://en.wikipedia.org/wiki/Site_map
;; http://www.sitemaps.org/

;;;--------------------------------------------------
;;; parameters

;; full path to web's doc root. Must end in a slash.
(setq webroot "/Users/xah/web/")

;; file name of sitemap file, relative to webroot, without “.xml” suffix.
(setq sitemapFileName "sitemap")

;; gzip it or not. t for true, nil for false.
(setq gzip-it-p t)

;;;--------------------------------------------------

(defun my-process-file (fpath destBuff)
  "process the file at fullpath fpath. Write result to buffer destBuff."
  (let (fBuf)
    (message "%s" fpath)
    (when (not (string-match "/xx" fpath)) ; dir/file starting with xx are not public
      (setq fBuf (find-file fpath))
      (goto-char (point-min))
      (when (not (search-forward "<meta http-equiv=\"refresh\"" nil "noerror"))
        (with-current-buffer destBuff
          (insert "<url><loc>")
          (insert (concat "http://xahlee.org/" (substring fpath (length webroot))))
          (insert "</loc></url>\n")))
      (kill-buffer fBuf))))

;;;--------------------------------------------------
;;; main

; rename file to backup ~ if already exist
(let (f1 f2)
  (setq f1 (concat webroot sitemapFileName ".xml"))
  (setq f2 (concat f1 ".gz"))
  (when (file-exists-p f1)
    (rename-file f1 (concat f1 "~") t))
  (when (file-exists-p f2)
    (rename-file f2 (concat f2 "~") t)))

(let (filePath sitemapBuf)
  (setq filePath (concat webroot sitemapFileName ".xml"))
  (setq sitemapBuf (find-file filePath))
  (erase-buffer)
  (insert "<?xml version=\"1.0\" encoding=\"UTF-8\"?>
<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">
")
  (require 'find-lisp)
  (mapc (lambda (x) (my-process-file x sitemapBuf))
        (find-lisp-find-files webroot "\\.html$"))
  (insert "</urlset>")
  (goto-char 0)
  (search-forward "http://xahlee.org/Periodic_dosage_dir/pd.html</loc>")
  (insert "<changefreq>daily</changefreq>")
  (save-buffer)
  (when gzip-it-p
    (shell-command (concat "gzip " filePath))))
You can run it either in a buffer by calling “eval-buffer”, or from the OS command line with “emacs --script generate_sitemap.el”.
Note: Google also offers a sitemap generating script written in Python. See: https://www.google.com/webmasters/tools/docs/en/sitemap-generator.html
Emacs is super!
