2005-03-14
Python has a unicodedata module. Here's an example using it:
#-*- coding: utf-8 -*-
# python
from unicodedata import *

# each unicode char has a unique name.
# one can use the “lookup” func to find it
mychar=lookup('greek cApital letter sIgma') # note letter case doesn't matter
print mychar.encode('utf-8')

m=lookup('CJK UNIFIED IDEOGRAPH-5929') # for some reason, case must be right here.
print m.encode('utf-8')

# to find a char's name, use the “name” function
print name(u'天')

# to get the ordinal of the unicode char, use the standard function ord:
print ord(u'天')
Basically, in Unicode, each char has a number of attributes (called properties) besides its name. These properties provide the info needed to form letters and words, and to do processing such as sorting and capitalization, for various human scripts. For example, the Latin alphabet has two forms of each letter, upper case and lower case. Korean letters are stacked together into syllable blocks. Many symbols correspond to numbers. There are also combining forms, used for example to put a bar over any letter or character. Also, some writing systems are directional (for example, written right to left). To display these symbols or process them in computing, this info is needed for each char.
The rest of the functions in unicodedata return these attributes.
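Here's a rough sketch of reading a few of these properties. It is written in Python 3 syntax (unlike the Python 2 example above), and the helper name `describe` and the choice of sample chars are my own, just for illustration. The functions it calls (`category`, `bidirectional`, `combining`, `numeric`, `decomposition`, `normalize`) are all part of unicodedata.

```python
import unicodedata

def describe(ch):
    """Return a few Unicode properties of one character as a dict.

    'describe' is a hypothetical helper name, not part of unicodedata.
    """
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),            # e.g. Lu = uppercase letter
        "bidirectional": unicodedata.bidirectional(ch),  # e.g. L, R, NSM
        "combining": unicodedata.combining(ch),          # 0 means not a combining char
        "numeric": unicodedata.numeric(ch, None),        # None if the char has no numeric value
        "decomposition": unicodedata.decomposition(ch),  # '' if there is no mapping
    }

# Latin capital A, combining macron (the bar over a letter),
# Hebrew alef (a right-to-left letter), Roman numeral eight.
for ch in ["A", "\u0304", "\u05d0", "\u2167"]:
    print("U+%04X" % ord(ch), describe(ch))

# A Korean syllable block splits into its stacked letters under NFD normalization.
jamo = unicodedata.normalize("NFD", "\uac00")
print(["U+%04X" % ord(c) for c in jamo])
```

Each property comes back as a short code or a number. What the codes mean (for example `Mn` for a non-spacing mark) is defined by the Unicode standard, not by Python.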
Reference: the Python documentation for the unicodedata module.
Official doc on unicode character properties: http://www.unicode.org/uni2book/ch04.pdf
* * *
Here's a snippet of code that prints a range of Unicode chars, along with each char's ordinal in hex and its name.
Chars without a name are skipped. (Some of these are undefined code points.)
On Microsoft Windows the encoding might need to be changed to utf-16.
Change the range to see different unicode chars.
# python
from unicodedata import *

# build the list of chars for the code point range
l=[]
for i in range(0x0000, 0x0fff):
    l.append(unichr(i))

# print each named char, its ordinal in hex, and its name
for x in l:
    if name(x,'-')!='-':
        print x.encode('utf-8'),'|', "%04x"%(ord(x)), '|', name(x,'-')
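For readers on Python 3, here's a minimal sketch of the same idea. The generator name `named_chars` is made up for this example. It assumes Python 3, where `chr` covers the whole Unicode range and `print` takes care of encoding the output.

```python
import unicodedata

def named_chars(start, stop):
    """Yield (char, hex code point, name) for each named code point in [start, stop).

    'named_chars' is a hypothetical helper, not part of unicodedata.
    """
    for cp in range(start, stop):
        ch = chr(cp)
        # name() returns the default (None here) when the char has no name
        char_name = unicodedata.name(ch, None)
        if char_name is not None:
            yield ch, "%04x" % cp, char_name

# Small demo range: 0x1e and 0x1f are control chars without a name, so they are skipped.
for ch, code, char_name in named_chars(0x1e, 0x24):
    print(ch, "|", code, "|", char_name)
```

Change the two arguments to look at other ranges, for example `named_chars(0x0391, 0x03aa)` for Greek capital letters. Whether the chars show up correctly still depends on the terminal's encoding and font.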
Page created: 2005-01. © 2005 by Xah Lee.