Unicode in Perl and Python

Xah Lee, 2005-01.

Python

Python supports unicode in source files if you put a file encoding declaration on the first or second line.

#-*- coding: utf-8 -*-
print "look chinese chars: 请你不要哭"

Note, however, identifiers cannot use unicode chars. For example, variable names cannot contain unicode chars.

If you are going to do any processing with a unicode string, such as substring extraction or string pattern matching, then you need to put “u” in front of the string. For example,

#-*- coding: utf-8 -*-
myString = u"look Chinese chars: 请你不要哭"
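
To see why the “u” prefix matters, here is a small sketch comparing a unicode string with its UTF-8 bytes. The variable names are made up for illustration. It should run on Python 2, and also on Python 3.3 or later, where the “u” prefix is accepted but optional.

```python
# -*- coding: utf-8 -*-
# Hypothetical names; a sketch, not the only way to do this.
text = u"请你不要哭"              # unicode string: a sequence of characters
raw = text.encode("utf-8")        # the same text as UTF-8 bytes

char_count = len(text)            # counts characters
byte_count = len(raw)             # counts bytes (3 per character for this text)
first_two = text[:2]              # slicing the unicode string works per character

print(char_count)
print(byte_count)
```

Slicing the byte version instead could cut a character in the middle, which is one reason to do the processing on the unicode string.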

The “#-*- coding: utf-8 -*-” declaration in the first line is a convention adopted from the text editor Emacs. It tells any program reading the file that the file is encoded using a particular character set. It serves a purpose similar to HTML's “<META HTTP-EQUIV="Content-Type" CONTENT="text/html;charset=utf-8">”.
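
As a rough illustration of how a program might read such a declaration, the sketch below looks for the “coding” marker in the first two lines of some source text. The function name declared_encoding is made up for this example, and the regular expression is a simplified form of the pattern described in PEP 263. Python's own detection handles more cases, such as a byte order mark.

```python
import re

# Simplified from the pattern in PEP 263; an assumption, not Python's exact logic.
CODING_RE = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")

def declared_encoding(source_text):
    """Return the encoding named in the first two lines, or None (hypothetical helper)."""
    for line in source_text.splitlines()[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1)
    return None

sample = '#-*- coding: utf-8 -*-\nprint("hello")\n'
print(declared_encoding(sample))
```

Only the first two lines are checked, so a declaration placed further down the file is ignored.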

Sometimes when you print unicode strings, you may get an error like this:

# UnicodeEncodeError: 'ascii' codec can't encode character u'\u03b1' in position 16: ordinal not in range(128)

In such a case, you need to encode your string. Here's an example:

#-*- coding: utf-8 -*-
# python

alpha=u'α'

# Bad
print u'Unicode alpha: ', alpha

# Good
print u'Unicode alpha: ', (alpha).encode('utf-8')

Reference: Python Doc↗.

In Python, you'll often encounter this error message:

«UnicodeEncodeError: 'ascii' codec can't encode character u'\u2014' in position 37: ordinal not in range(128) »

The solution is usually to decode or encode the line explicitly. When you read a file line by line, each line is just a sequence of bytes to Python. Decode it to get a unicode string for processing, and encode it back to bytes before output. For example:

    myString=myString.decode("utf-8")   # bytes to unicode string
    # or
    myString=myString.encode("utf-8")   # unicode string to bytes
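
Here is a minimal sketch of that decode, process, encode round trip. The byte string stands in for a line read from a UTF-8 file, and the variable names are hypothetical. It is written to run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
# Bytes as they might come from a UTF-8 file; the middle bytes are U+2014.
raw_line = b"caf\xc3\xa9 \xe2\x80\x94 bar\n"

line = raw_line.decode("utf-8")      # bytes to unicode string
upper = line.strip().upper()         # process as text, not as bytes

# Encoding to ASCII fails, as in the error message above.
try:
    upper.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

out_bytes = upper.encode("utf-8")    # unicode string back to bytes for output
print(ascii_ok)
```

The point of the sketch is that the error goes away once the output encoding is chosen explicitly instead of being left to the 'ascii' default.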

Perl

use bytes; # Larry can take Unicode and shove it up his ass sideways. 
            # Perl 5.8.0 causes us to start getting incomprehensible 
            # errors about UTF-8 all over the place without this.

               —from the source code of WebCollage↗ (1998),
                by Jamie W Zawinski↗ (~b1971) 

In Perl, dealing with unicode is quite different from Python. Perl's Unicode support starts to be somewhat usable with Perl 5.8. Perl provides the “-C” option on the command line, which changes Perl's input and output behavior to work with UTF-8. It is unnecessarily complex, because it has been hacked up through the years since Perl 5.6.

In Perl 5.8 (2002-07), unicode chars can be used in variable names and function names. You need to say “use utf8;” in your code. Example:

# perl

use utf8; # necessary if you want to use unicode in function or var names

# processing unicode string
$s = 'I ★ you'; $s =~ s/★/♥/;
print $s;

# variable with unicode char
$愛=4;  print $愛;

# function with unicode char
sub f愛 { return 2;}  print f愛();

Because the above code outputs UTF-8 unicode strings, you need to run it with the -C option, for example: “perl -C7 myCode.pl”.

Reference: perldoc perluniintro↗.

Reference: perldoc perlunicode↗.


Page created: 2005-01.
© 2005 by Xah Lee.