Xah Lee, 2005-01.
Python supports unicode in source files if you put a file encoding declaration on the first line (the second line is also accepted).
#-*- coding: utf-8 -*-
print "look chinese chars: 请你不要哭"
Note, however, identifiers cannot use unicode chars. For example, variable names cannot contain unicode chars.
If you are going to do any processing with a unicode string, such as substring extraction or string pattern matching, then you need to put “u” in front of the string literal. For example,
#-*- coding: utf-8 -*-
myString = u"look Chinese chars: 请你不要哭"
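To see why the “u” prefix matters for substring work, here is a minimal sketch that compares a unicode string with its UTF-8 byte form. The names `text` and `raw` are illustrative only, and the code is written so that it should run under both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
# Sketch: character-based vs byte-based length and slicing.
# The names `text` and `raw` are illustrative only.

text = u"请你不要哭"          # unicode string: 5 characters
raw = text.encode("utf-8")   # byte string: each of these characters takes 3 bytes

# Length counts characters for the unicode string, bytes for the byte string.
print(len(text))   # 5
print(len(raw))    # 15

# Slicing the unicode string yields whole characters.
first_two = text[0:2]

# Slicing the byte string at the same indexes cuts a character in half,
# so the result is no longer valid UTF-8.
broken = raw[0:2]
try:
    broken.decode("utf-8")
    cut_is_valid = True
except UnicodeDecodeError:
    cut_is_valid = False
```

Here `first_two` holds the first two Chinese characters, while `cut_is_valid` ends up `False` because the two-byte slice is only part of one character.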
The “#-*- coding: utf-8 -*-” declaration in the first line is a convention adopted from the text editor Emacs. It tells any program reading the file that the file is encoded using a particular character set. For example, it serves a purpose similar to HTML's “<META HTTP-EQUIV="Content-Type" CONTENT="text/html;charset=utf-8">”.
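As an illustration of how a program can read such a declaration, the sketch below looks for it in the first two lines of a file's text, using a regular expression modeled on the pattern described in PEP 263. The function name `find_declared_encoding` is hypothetical and is not part of Python's standard library.

```python
import re

# Pattern modeled on the one given in PEP 263: a comment containing
# "coding:" or "coding=" followed by the encoding name.
CODING_RE = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")

def find_declared_encoding(source_text):
    """Return the encoding named in the first two lines, or None.

    Hypothetical helper, for illustration only.
    """
    for line in source_text.splitlines()[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1)
    return None

sample = "#-*- coding: utf-8 -*-\nx = 1\n"
print(find_declared_encoding(sample))            # utf-8
print(find_declared_encoding("x = 1\ny = 2\n"))  # None
```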
Sometimes when you print unicode strings, you may get an error like this:
# UnicodeEncodeError: 'ascii' codec can't encode character u'\u03b1' in position 16: ordinal not in range(128)
In such a case, you need to encode your string. Here's an example:
#-*- coding: utf-8 -*-
# python

alpha = u'α'

# Bad
print u'Unicode alpha: ', alpha

# Good
print u'Unicode alpha: ', (alpha).encode('utf-8')
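The error arises when Python tries to convert the unicode string to bytes using the ASCII codec. The following sketch reproduces that step explicitly with `encode`, so it should behave the same whether or not the terminal understands UTF-8. The variable names are illustrative.

```python
# -*- coding: utf-8 -*-
# Sketch: why encoding with 'ascii' fails while 'utf-8' succeeds.

alpha = u"\u03b1"   # GREEK SMALL LETTER ALPHA, the character from the error message

# The ASCII codec only covers code points 0 to 127, so this raises.
try:
    alpha.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError as err:
    ascii_ok = False
    reason = err.reason   # 'ordinal not in range(128)'

# UTF-8 can represent every Unicode character; alpha becomes two bytes.
utf8_bytes = alpha.encode("utf-8")
print(len(utf8_bytes))   # 2
```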
Reference: Python Doc↗.
In Python, you'll often encounter this error message:
«UnicodeEncodeError: 'ascii' codec can't encode character u'\u2014' in position 37: ordinal not in range(128) »
The solution is often to decode or encode your line using a particular encoding, because when a file is read as lines, each line is to Python just a sequence of bytes. For example:
myString=myString.decode("utf-8")
or
myString=myString.encode("utf-8")
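A minimal sketch of that decode and encode round trip follows. It assumes the file is UTF-8 encoded and opens it in binary mode, so that lines arrive as bytes under both Python 2 and Python 3. The temporary file and the variable names are made up for illustration.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Write a small UTF-8 file to work with (a temporary file, for illustration).
fd, path = tempfile.mkstemp()
os.close(fd)
with io.open(path, "wb") as f:
    f.write(u"caf\u00e9 \u2014 first line\nsecond line\n".encode("utf-8"))

# Opened in binary mode, each line comes back as a sequence of bytes.
with io.open(path, "rb") as f:
    byte_lines = f.readlines()

# Decode each byte line into a unicode string before processing it.
text_lines = [line.decode("utf-8") for line in byte_lines]

# Encode again when bytes are needed, for example to write the result out.
round_trip = [line.encode("utf-8") for line in text_lines]

os.remove(path)
print(len(text_lines))   # 2
```

Decoding as early as possible and encoding only at output keeps all string processing on unicode strings, which is where substring and pattern operations work per character.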
use bytes; # Larry can take Unicode and shove it up his ass sideways.
# Perl 5.8.0 causes us to start getting incomprehensible
# errors about UTF-8 all over the place without this.
—from the source code of WebCollage↗ (1998),
by Jamie W Zawinski↗ (~b1971)
In Perl, dealing with unicode is quite different from Python. Perl's Unicode support starts to be somewhat usable with Perl 5.8. Perl provides the “-C” option on the command line, which changes the input and output behavior of Perl to work with UTF-8. It is unnecessarily complex, because it has been hacked up thru the years since Perl 5.6.
Perl 5.8 (2002-07) allows unicode chars in variable names and function names. You need to say “use utf8;” in your code. Example:
# perl
use utf8; # necessary if you want to use unicode in function or var names

# processing unicode string
$s = 'I ★ you';
$s =~ s/★/♥/;
print $s;

# variable with unicode char
$愛=4;
print $愛;

# function with unicode char
sub f愛 { return 2;}
print f愛();
Because the above code outputs a UTF-8 encoded unicode string, you need to run it with the -C option, for example: “perl -C7 myCode.pl”.
Reference: perldoc perluniintro↗.
Reference: perldoc perlunicode↗.
Page created: 2005-01. © 2005 by Xah Lee.