Unicode in Perl & Python

Python

Unicode Encoded Source Code

If your source code contains Unicode characters, you must declare the file's encoding in the first or second line. Like this:

#-*- coding: utf-8 -*-
print "look Chinese chars: 请你不要哭"

The #-*- coding: utf-8 -*- declaration in the first line is a convention adopted from the text editor Emacs. It tells any program reading the file which character encoding the file uses. Its purpose is similar to HTML's <meta http-equiv="content-type" content="text/html; charset=utf-8" />. (See: Emacs and Unicode Tips; Character Sets and Encoding in HTML; UNICODE Basics.)

Text Processing with Unicode Strings

If you are going to do any processing on a Unicode string, such as extracting substrings or matching string patterns, you need to put u in front of the string literal. For example:

#-*- coding: utf-8 -*-
myStr = u"look Chinese chars: 请你不要哭"

Note, however, that in Python 2 variable and function names cannot contain non-ASCII characters. (Python 3 allows them.)
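
To see why the u prefix matters for substring work, here is a small sketch that compares a Unicode string with its UTF-8 byte form. It is only an illustration: the names text, raw, first_two and broken are made up for this example, and the code is written so that it should run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch; all variable names here are hypothetical.

text = u"请你不要哭"            # Unicode string: a sequence of characters
raw = text.encode("utf-8")      # byte string: the UTF-8 encoding of those characters

char_count = len(text)          # counts characters: 5
byte_count = len(raw)           # counts bytes: 15 (3 bytes per character here)

first_two = text[:2]            # slicing the Unicode string yields whole characters
broken = raw[:2]                # slicing the bytes cuts the first character short

print(char_count)
print(byte_count)
```

Slicing or pattern matching on the byte form works on bytes, not characters, so results such as broken above are no longer valid UTF-8.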

Sometimes when you print Unicode strings, you may get an error like this:

# UnicodeEncodeError: 'ascii' codec can't encode character u'\u03b1' in position 16: ordinal not in range(128).

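The solution is to use the “.encode()” or “.decode()” method. When you read a file as lines, each line is just a sequence of bytes to Python until you decode it. Going the other way, a Unicode string has to be encoded to bytes before it is printed or written out. For example: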

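myString = myString.decode("utf-8")  # bytes to Unicode string
myString = myString.encode("utf-8")  # Unicode string to bytes

#-*- coding: utf-8 -*-
# python

myStr = u'α'

# Bad. This can raise UnicodeEncodeError when the output encoding is ASCII.
print 'Greek alpha: ', myStr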

# Good
print 'Greek alpha: ', myStr.encode('utf-8')
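
The round trip between bytes and Unicode text can be sketched as follows. This is a minimal illustration, not the only way to do it: the variable names are hypothetical, the byte string stands in for a line read from a UTF-8 file, and the code is written so that it should run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch; all variable names here are hypothetical.

raw_line = b"Greek alpha: \xce\xb1"    # bytes, as they would arrive from a UTF-8 file
text = raw_line.decode("utf-8")         # bytes -> Unicode string
utf8_again = text.encode("utf-8")       # Unicode string -> bytes, ready to write out

# Encoding to ASCII fails, because alpha has no ASCII representation.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(len(raw_line))   # 15 bytes
print(len(text))       # 14 characters
```

The rule of thumb is to decode once when data comes in, work on Unicode strings, and encode once when data goes out.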

Unicode in Regex

When using regex on a Unicode string, if you want the pattern characters {\w, \W, \b, \B} to depend on Unicode character properties, you need to add the Unicode flag re.U when calling regex functions.

# -*- coding: utf-8 -*-
# python
import re
result = re.search(r'\w+', u'真善美αβγ!', re.U)
if result:
    print result.group().encode('utf8')
else:
    print "no match"
# prints 「真善美αβγ」, but if re.U is not used, it prints 「no match」 because the 「\w+」 pattern for “word” only considers ASCII characters

See: Python Regex Flags.
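
As a further sketch of the same flag, re.findall can split mixed-script text into words. The sample text and the names text and words are made up for this example. The code should run on both Python 2 and Python 3; on Python 3, string patterns are already Unicode-aware, so re.U is redundant there but harmless.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch; the sample text and variable names are hypothetical.
import re

text = u"真善美 αβγ, abc!"

# With re.U, \w matches letters from any script, so each run of
# Chinese, Greek, or Latin letters comes back as one word.
words = re.findall(r"\w+", text, re.U)

print(len(words))   # 3
```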

Perl

use bytes; # Larry can take Unicode and shove it up his ass sideways. 
            # Perl 5.8.0 causes us to start getting incomprehensible 
            # errors about UTF-8 all over the place without this.

               —from the source code of WebCollage (1998),
                by Jamie W Zawinski (~b1971) 

𝕲𝖔 𝕿𝖍𝖔𝖚 𝖆𝖓𝖉 𝕯𝖔 𝕷𝖎𝖐𝖊𝖜𝖎𝖘𝖊 🌞 🐪🐫🐪 🌴

In Perl, when running a script that processes Unicode, call it with the -C option on the command line.

If your Perl script itself contains Unicode characters, add “use utf8;”. You can then have Unicode in strings, and also in variable and function names. Example:

# -*- coding: utf-8 -*-
# perl

use strict;
use utf8; # necessary if you want to use unicode in function or var names

# processing unicode string
my $s = 'I ★ you';
$s =~ s/★/♥/; # swap the star for a heart (replacement chosen for illustration)
print "$s\n";

# variable with unicode char
my $愛 = 4;
print "$愛\n";

# function with unicode char
sub f愛 { return 2;}
print f愛();

2011-07-29. Three bleeding-edge perl articles on Unicode.

According to one of the articles above, at least one of them was inspired by this stackoverflow question: Why does modern Perl avoid UTF-8 by default? Source stackoverflow.com
