On Unix Filename Characters Problem

Xah Lee, 2008-08

On 2008-08-15, someone wrote (paraphrased):

Sometimes i save documents to disk from the web.

I wish to embed the article title and url in the saved filename.

e.g. if article titled “News for the next Century” at http://www.example.com/news/something.html

i want to save it under a filename such as “News for the next Century http://www.example.com/news/something.html”. But the special chars there cause problems.

is there some general char transformation scheme, so that special chars in url and title of article are replaced by other chars and can be used as a filename?

Hmmm. Maybe “---” for “/”? What about “:”? And what about “~”? Plus other chars I've not thought of?

P.S.: Oh, I forgot. tar shouldn't barf on the name.

What you want to do is pretty hopeless. Chars in a url are confusing enough, with percent-encoding (such as “%20” for space and “%7E” for “~”), and when the url is used in html as a link, there's another layer of encoding for CDATA (such as “&amp;” for “&”). Depending on the browser, or whatever tool you are using, the url you get may or may not have been processed to eliminate a variety of encodings. The encoding spec itself is not crystal clear, and in practice lots of uri are actually invalid anyway.
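To make the two layers concrete, here is a minimal sketch using Python's standard library. The url and the two helper names (from_html_attribute, to_html_attribute) are made up for illustration. It only shows that one address can have several spellings, not how any particular browser behaves.

```python
import html
from urllib.parse import quote, unquote

def from_html_attribute(href):
    """Undo the html layer first, then the percent-encoding layer."""
    return unquote(html.unescape(href))

def to_html_attribute(url):
    """Percent-encode unsafe chars, then escape the result for use in href="..."."""
    # The safe set here is an assumption chosen for this example only.
    return html.escape(quote(url, safe=":/?=&~"))

# A hypothetical link as it might appear inside an html page.
href = "http://www.example.com/%7Euser/my%20news.html?a=1&amp;b=2"

plain = from_html_attribute(href)   # "~", space and "&" are back
again = to_html_attribute(plain)    # space and "&" re-encoded, "~" left alone

print(plain)
print(again)
print(again == href)
```

The round trip does not give back the original string, because “~” may legally be written either as itself or as “%7E”. So even before the file system gets involved, there is no single spelling of a url to put in a filename.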

Chars in file names are also confusing. Different file systems allow different char sets with different special char meanings, and each generation of file system changes slightly. (e.g. Windows has “C:\” and “\”, and if you are using cygwin you also get “/”. Mac has “:” in OS 9 and “/” in OS X, with complex char transform magic underneath. Unix is the worst: it pretty much just allows alphanumerics and underscore “_”, and not even space. If you have anything like “= ( ) , ; ' " # $ & - ~” etc, you can expect most shell tools to erase your disk.)

The best thing to do is just to create a file and name it info.txt or readme.txt, then in that file put in the url, date, or keywords and annotation. That's what i do.
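A minimal sketch of that habit, assuming Python and a made-up helper name write_info. The field names (title, url, saved, note) are just one possible layout, not a fixed format.

```python
from datetime import date
from pathlib import Path

def write_info(folder, url, title, note=""):
    """Write info.txt next to the saved document; the folder and file names stay plain."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    lines = [
        f"title: {title}",
        f"url: {url}",
        f"saved: {date.today().isoformat()}",
    ]
    if note:
        lines.append(f"note: {note}")
    info = folder / "info.txt"
    # The url and title go inside the file, so no char in them can upset the file system.
    info.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return info
```

For example, write_info("news_next_century", "http://www.example.com/news/something.html", "News for the next Century") keeps the folder name to alphanumerics and underscore, and tar never sees the url.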

Nikolaj Schumacher wrote:

Actually unix systems allow pretty much every character except / and the null character.

To say that unix allows much wider chars in file names is like saying mud is the best medium for sculpture.

Unix file names, for much of their history up to perhaps the mid 2000s, effectively allowed just alphanumerics plus underscore “_” (hyphen “-” and space can occasionally be seen). As a contrast, Mac file names often contain punctuation such as “ , $ # ! * ( )” and space, and have also allowed non-ascii chars since the early 1990s or before. (to see the full range, see: How To Create Your Own Keybinding In Mac Os X)

Ascii punctuation chars and non-ascii chars are also allowed in filenames in Windows, since about Microsoft Windows NT in the late 1990s or earlier. Tools in MacOS (such as AppleScript) and Windows support and expect these chars in file names.

Sure, you can use many non-alphanumeric chars besides hyphen and underscore in unix, but the system is simply not designed for it. The majority of unix tools, including file name listing, will choke and break if your filename contains these chars. The choking doesn't actually give you a nice error message, but breaks silently, often resulting in unexpected and unpredictable behavior. In short, it's just not designed for it.
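One concrete way this goes wrong can be sketched in Python. This assumes a unix-like file system, since other systems reject the newline name. The function name listing_demo is made up, and the code only simulates a one-name-per-line listing rather than running any real tool.

```python
import os
import tempfile

def listing_demo():
    """Create oddly named files, then compare the real names with a line-based parse."""
    names = ["plain.txt", "two words.txt", "line\nbreak.txt", "-rf"]
    with tempfile.TemporaryDirectory() as d:
        for name in names:
            # The kernel accepts all of these; only "/" and the null char are refused.
            with open(os.path.join(d, name), "w"):
                pass
        actual = sorted(os.listdir(d))
    # A listing that prints one name per line looks like this ...
    listing = "\n".join(actual)
    # ... and a script that splits it on newlines sees a different set of files.
    naive = listing.split("\n")
    return actual, naive

actual, naive = listing_demo()
print(len(actual), "files on disk")
print(len(naive), "names seen by the line-based parse")
```

Nothing errors out here: the parse simply reports five names for four files, two of which do not exist. The name “-rf” is a separate hazard, since a tool handed that name as a bare argument may read it as options.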

Issues like these often perpetuate the myth that unix is “powerful”, but in fact it's just raw and no-design.


2008-08
© 2008 by Xah Lee.