Xah Lee
If your Python source code contains non-ASCII characters, you must declare the file's encoding on the first or second line. Like this:
#-*- coding: utf-8 -*-
print "look Chinese chars: 请你不要哭"
The #-*- coding: utf-8 -*- declaration in the first line is a convention adopted from the text editor Emacs. (See: Emacs and Unicode Tips.) It tells any program reading the file which character encoding the file uses. Its purpose is similar to HTML's
<meta http-equiv="content-type" content="text/html; charset=utf-8" />.
(See: Character Sets and Encoding in HTML; UNICODE Basics.)
If you are going to do any processing on a Unicode string, such as extracting substrings or matching string patterns, you need to put u in front of the string literal. For example,
#-*- coding: utf-8 -*-
myStr = u"look Chinese chars: 请你不要哭"
Note, however, that in Python 2 variable and function names cannot contain non-ASCII characters.
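To see why the u prefix matters for substring extraction, here is a minimal sketch that slices the same text both as a Unicode string and as UTF-8 bytes. The variable names are illustrative and not from the article; the code is written so that it should run under both Python 2.6+ and Python 3.3+.

```python
# -*- coding: utf-8 -*-
# Illustrative names only: text, raw, last_five_chars, last_five_bytes.

text = u"look Chinese chars: 请你不要哭"   # Unicode string: indexed by character
raw = text.encode("utf-8")                 # byte string: indexed by byte

# Slicing the Unicode string returns whole characters.
last_five_chars = text[-5:]

# Slicing the bytes cuts through the middle of a 3-byte character,
# so the result is no longer valid UTF-8.
last_five_bytes = raw[-5:]
try:
    last_five_bytes.decode("utf-8")
    bytes_slice_is_valid = True
except UnicodeDecodeError:
    bytes_slice_is_valid = False
```

The Unicode slice yields the five Chinese characters intact, while the byte slice begins partway through a character and cannot be decoded.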
Sometimes when you print Unicode strings, you may get an error like this:
# UnicodeEncodeError: 'ascii' codec can't encode character u'\u03b1' in position 16: ordinal not in range(128).
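The solution is to use the “.encode()” or “.decode()” method. When you read a file line by line, each line is just a sequence of bytes to Python until you decode it. Likewise, a Unicode string must be encoded into bytes before it is printed to an output that defaults to ASCII. For example:
myString = myString.decode("utf-8")
or
myString = myString.encode("utf-8")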
#-*- coding: utf-8 -*-
# python
myStr = u'α'
# Bad. This is an error when the output encoding is ASCII.
print 'Greek alpha: ', myStr
# Good
print 'Greek alpha: ', myStr.encode('utf-8')
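The decode-then-encode round trip described above can be sketched as follows. This is only an illustration: the bytes are written inline as a stand-in for a line read from a UTF-8 file, the names are hypothetical, and the code avoids print so that it should behave the same under Python 2.6+ and Python 3.

```python
# -*- coding: utf-8 -*-
# Illustrative names only: raw_line, text, out_bytes.

# Stand-in for one line read from a UTF-8 encoded file: Greek alpha plus newline.
raw_line = b'\xce\xb1\n'

# decode: bytes -> Unicode string, suitable for string processing.
text = raw_line.decode('utf-8')

# encode: Unicode string -> bytes, suitable for writing or printing.
out_bytes = text.encode('utf-8')

# Encoding to ASCII fails with the same kind of error shown above,
# because alpha has no ASCII representation.
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

Decoding on input and encoding on output keeps all the processing in between on Unicode strings.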
When using regex on Unicode strings, if you want the pattern characters {\w, \W, \b, \B} to depend on Unicode character properties, you need to add the Unicode flag re.U when calling regex functions.
# -*- coding: utf-8 -*-
# python
import re
result = re.search(r'\w+', u'真善美αβγ!', re.U)
if result:
    print result.group().encode('utf8')
else:
    print "no match"
# prints 「真善美αβγ」, but if re.U is not used, it prints 「no match」
# because the 「\w+」 pattern for “word” only considers ASCII alphanumerics
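As a further sketch of the re.U flag, the snippet below pulls every word out of a string that mixes scripts. The sample text and variable names are made up for illustration. It should run under both Python 2 and Python 3. In Python 3, str patterns already match by Unicode properties, so the flag is harmless there.

```python
# -*- coding: utf-8 -*-
import re

# Hypothetical sample text mixing Chinese, Greek, and ASCII words.
sample = u'真善美 αβγ, abc!'

# With re.U, \w follows Unicode character properties, so a run of
# letters in any script counts as one word.
words = re.findall(r'\w+', sample, re.U)
```

The space, comma, and exclamation mark are not word characters, so they act as separators between the three words.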
See: Python Regex Flags.
use bytes; # Larry can take Unicode and shove it up his ass sideways.
# Perl 5.8.0 causes us to start getting incomprehensible
# errors about UTF-8 all over the place without this.
—from the source code of WebCollage (1998),
by Jamie W Zawinski (~b1971)
In Perl, when running a script that processes Unicode, invoke it with the -C option on the command line.
If your Perl script contains Unicode, then add use utf8; to it. You can have Unicode in strings and also in variable names. Example:
# -*- coding: utf-8 -*-
# perl
use strict;
use utf8; # necessary if you want to use unicode in function or var names

# processing unicode string
my $s = 'I ★ you';
$s =~ s/★/♥/;
print "$s\n";

# variable with unicode char
my $愛 = 4;
print "$愛\n";

# function with unicode char
sub f愛 { return 2;}
print f愛();
2011-07-29. Three bleeding-edge perl articles on Unicode.
According to one of the articles above, at least one of them was inspired by this Stack Overflow question: “Why does modern Perl avoid UTF-8 by default?” (Source: stackoverflow.com)