Unicodedata — Python Unicode database


Functions defined by the module:

  • unicodedata.lookup(name)
    This function looks up a character by name. If a character with the given name is found in the database, the corresponding character is returned; otherwise a KeyError is raised.

    Example:

    import unicodedata

    print(unicodedata.lookup('LEFT CURLY BRACKET'))
    print(unicodedata.lookup('RIGHT CURLY BRACKET'))
    print(unicodedata.lookup('ASTERISK'))

    # The call below raises KeyError,
    # because no character is named ASTER:
    # print(unicodedata.lookup('ASTER'))

    Output:

     {
     }
     *
  • unicodedata.name(chr[, default])
    This function returns the name assigned to the given character as a string. If no name is defined, default is returned, or, if default is not given, ValueError is raised.

    Example:

    import unicodedata

    print(unicodedata.name('/'))
    print(unicodedata.name('|'))
    print(unicodedata.name(':'))

    Output:

     SOLIDUS
     VERTICAL LINE
     COLON
  • unicodedata.decimal(chr[, default])
    This function returns the decimal value assigned to the given character as an integer. If no such value is defined, default is returned, or, if default is not given, ValueError is raised.

    Example:

    import unicodedata

    print(unicodedata.decimal('9'))
    print(unicodedata.decimal('a'))

    Output:

     9
     Traceback (most recent call last):
       File "7e736755dd176cd0169eeea6f5d32057.py", line 4, in <module>
         print(unicodedata.decimal('a'))
     ValueError: not a decimal
  • unicodedata.digit(chr[, default])
    This function returns the digit value assigned to the given character as an integer. If no such value is defined, default is returned, or, if default is not given, ValueError is raised.

    Example:

    import unicodedata

    print(unicodedata.digit('9'))
    # A string longer than one character raises TypeError
    print(unicodedata.digit('143'))

    Output (the TypeError wording varies between Python versions):

     9
     Traceback (most recent call last):
       File "ad47ae996380a777426cc1431ec4a8cd.py", line 4, in <module>
         print(unicodedata.digit('143'))
     TypeError: need a single Unicode character as parameter
  • unicodedata.numeric(chr[, default])
    This function returns the numeric value assigned to the given character as a float. If no such value is defined, default is returned, or, if default is not given, ValueError is raised.

    Example:

    import unicodedata

    print(unicodedata.numeric('9'))
    # A string longer than one character raises TypeError
    print(unicodedata.numeric('143'))

    Output (the TypeError wording varies between Python versions):

     9.0
     Traceback (most recent call last):
       File "ad47ae996380a777426cc1431ec4a8cd.py", line 4, in <module>
         print(unicodedata.numeric('143'))
     TypeError: need a single Unicode character as parameter
  • unicodedata.category(chr)
    This function returns the general category assigned to the given character as a string. For example, it returns "Lu" for an uppercase letter and "Ll" for a lowercase letter.

    Example:

    import unicodedata

    print(unicodedata.category('A'))
    print(unicodedata.category('b'))

    Output:

     Lu
     Ll
  • unicodedata.bidirectional(chr)
    This function returns the bidirectional class assigned to the given character as a string. For example, it returns "AL" for an Arabic letter and "AN" for an Arabic number. This function returns an empty string if no such value is defined.

    Example:

    import unicodedata

    # U+0660 is ARABIC-INDIC DIGIT ZERO
    print(unicodedata.bidirectional('\u0660'))

    Output:

     AN
  • unicodedata.normalize(form, unistr)
    This function returns the normal form form for the Unicode string unistr. Valid values for form are NFC, NFKC, NFD, and NFKD.

    Example:

    from unicodedata import normalize

    # ascii() shows non-ASCII characters as escapes
    print(ascii(normalize('NFD', '\u00C7')))
    print(ascii(normalize('NFC', 'C\u0327')))
    print(ascii(normalize('NFKD', '\u2460')))

    Output:

     'C\u0327'
     '\xc7'
     '1'
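
The functions above can be combined to inspect a character in one go. The following is only a minimal sketch: describe() is a hypothetical helper name, not part of the module, and it passes a default to every function that accepts one so that no exception is raised for characters without the property.

```python
import unicodedata


def describe(ch):
    """Collect a few Unicode properties of one character (hypothetical helper)."""
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),
        "bidirectional": unicodedata.bidirectional(ch),
        # With a default given, these return it instead of raising ValueError
        "decimal": unicodedata.decimal(ch, None),
        "digit": unicodedata.digit(ch, None),
        "numeric": unicodedata.numeric(ch, None),
    }


# "A", "9", ARABIC-INDIC DIGIT ZERO, VULGAR FRACTION ONE HALF
for ch in ("A", "9", "\u0660", "\u00bd"):
    print(ascii(ch), describe(ch))
```

The name returned by describe() can be passed back to unicodedata.lookup() to recover the original character.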

This article is courtesy of Aditi Gupta.





Unicodedata — Python Unicode database: StackOverflow Questions

Answer #1

Comparing strings in a case-insensitive way seems trivial, but it's not. I will be using Python 3, since Python 2 is underdeveloped here.

The first thing to note is that case-removing conversions in Unicode aren't trivial. There is text for which text.lower() != text.upper().lower(), such as "ß":

"ß".lower()
#>>> "ß"

"ß".upper().lower()
#>>> "ss"

But let's say you wanted to caselessly compare "BUSSE" and "Buße". Heck, you probably also want "BUSSE" and "BUẞE" to compare equal - that's the newer capital form. The recommended way is to use casefold:

str.casefold()

Return a casefolded copy of the string. Casefolded strings may be used for caseless matching.

Casefolding is similar to lowercasing but more aggressive because it is intended to remove all case distinctions in a string. [...]

Do not just use lower. If casefold is not available, doing .upper().lower() helps (but only somewhat).

Then you should consider accents. If your font renderer is good, you probably think "ê" and "e\u0302" (an "e" followed by a combining circumflex) are equal - but they aren't:

"ê" == "e\u0302"
#>>> False

This is because the accent on the latter is a combining character.

import unicodedata

[unicodedata.name(char) for char in "ê"]
#>>> ["LATIN SMALL LETTER E WITH CIRCUMFLEX"]

[unicodedata.name(char) for char in "e\u0302"]
#>>> ["LATIN SMALL LETTER E", "COMBINING CIRCUMFLEX ACCENT"]

The simplest way to deal with this is unicodedata.normalize. You probably want to use NFKD normalization, but feel free to check the documentation. Then one does

unicodedata.normalize("NFKD", "ê") == unicodedata.normalize("NFKD", "e\u0302")
#>>> True

To finish up, here is this expressed as functions:

import unicodedata

def normalize_caseless(text):
    return unicodedata.normalize("NFKD", text.casefold())

def caseless_equal(left, right):
    return normalize_caseless(left) == normalize_caseless(right)
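
As a quick check, the two functions can be tried on the strings discussed in this answer. This is only a sketch: the definitions are repeated so the snippet runs on its own, and the last line is an extra test string chosen to show that accents are normalized, not removed.

```python
import unicodedata


def normalize_caseless(text):
    return unicodedata.normalize("NFKD", text.casefold())


def caseless_equal(left, right):
    return normalize_caseless(left) == normalize_caseless(right)


print(caseless_equal("BUSSE", "Bu\u00dfe"))      # True: casefold maps "ß" to "ss"
print(caseless_equal("BUSSE", "BU\u1e9eE"))      # True: capital sharp s also folds to "ss"
print(caseless_equal("\u00ea", "e\u0302"))       # True: precomposed vs. combining circumflex
print(caseless_equal("resume", "r\u00e9sum\u00e9"))  # False: accents are kept
```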

Answer #2

There are many useful things in Python's unicodedata library. One of them is the .normalize() function.

Try:

new_str = unicodedata.normalize("NFKD", unicode_str)

Replace NFKD with one of the other normalization forms (NFC, NFKC, or NFD) if you don't get the results you're after.
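
To see how the forms differ, it can help to run all four on one short string. The sample below is an arbitrary choice made for this sketch: the "fi" ligature (U+FB01) followed by a precomposed "é" (U+00E9).

```python
import unicodedata

# U+FB01 LATIN SMALL LIGATURE FI, then U+00E9 LATIN SMALL LETTER E WITH ACUTE
sample = "\ufb01\u00e9"

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    result = unicodedata.normalize(form, sample)
    # ascii() makes the individual code points visible
    print(form, ascii(result), len(result))
```

The canonical forms (NFC, NFD) leave the ligature alone and only compose or decompose the accent; the compatibility forms (NFKC, NFKD) also replace the ligature with the two letters "f" and "i".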

Answer #3

It's mostly about Unicode classifications. Here are some examples to show discrepancies:

>>> def spam(s):
...     for attr in "isnumeric", "isdecimal", "isdigit":
...         print(attr, getattr(s, attr)())
...         
>>> spam("½")
isnumeric True
isdecimal False
isdigit False
>>> spam("³")
isnumeric True
isdecimal False
isdigit True

Specific behaviour is described in the official docs.

Script to find all of them:

import sys
import unicodedata
from collections import defaultdict

d = defaultdict(list)
for i in range(sys.maxunicode + 1):
    s = chr(i)
    t = s.isnumeric(), s.isdecimal(), s.isdigit()
    if len(set(t)) == 2:
        try:
            name = unicodedata.name(s)
        except ValueError:
            name = f"codepoint{i}"
        print(s, name)
        d[t].append(s)
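
The same distinction shows up in the unicodedata module itself, which exposes the underlying values through decimal(), digit() and numeric(). A small sketch for three sample characters, with None passed as the default so that a missing value does not raise ValueError:

```python
import unicodedata

# "5", SUPERSCRIPT THREE, VULGAR FRACTION ONE HALF
for ch in ("5", "\u00b3", "\u00bd"):
    print(
        ascii(ch),
        unicodedata.decimal(ch, None),
        unicodedata.digit(ch, None),
        unicodedata.numeric(ch, None),
    )
```

For these samples, "5" has all three values, the superscript three has a digit and a numeric value but no decimal value, and the fraction has only a numeric value, which matches the isdecimal/isdigit/isnumeric results shown above.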

Answer #4

See unicodedata.normalize

title = u"Klüft skräms inför på fédéral électoral große"
import unicodedata
unicodedata.normalize("NFKD", title).encode("ascii", "ignore")
"Kluft skrams infor pa federal electoral groe"

Answer #5

Use the str.isspace() method:

Return True if there are only whitespace characters in the string and there is at least one character, False otherwise.

A character is whitespace if in the Unicode character database (see unicodedata), either its general category is Zs (“Separator, space”), or its bidirectional class is one of WS, B, or S.

Combine that with a special case for handling the empty string.

Alternatively, you could use str.strip() and check if the result is empty.
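
Both variants fit in a line each. The helper names below are made up for this sketch:

```python
def is_blank(text):
    """Hypothetical helper: True for an empty or whitespace-only string."""
    # isspace() is False for "", so the empty string is handled separately
    return not text or text.isspace()


def is_blank_strip(text):
    """Same check, written with str.strip()."""
    return not text.strip()


for candidate in ("", " \t\n", "\u00a0\u2003", " a ", "\u200b"):
    print(ascii(candidate), is_blank(candidate), is_blank_strip(candidate))
```

Note that a zero-width space (U+200B) is a format character, not whitespace, so neither check treats it as blank.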

Answer #6

\xa0 is actually a non-breaking space in Latin1 (ISO 8859-1), also chr(160). You should replace it with a space.

string = string.replace(u"\xa0", u" ")

Calling .encode("utf-8") encodes the Unicode string as UTF-8, in which every code point is represented by 1 to 4 bytes. In this case, \xa0 is represented by the 2 bytes \xc2\xa0.

Read up on http://docs.python.org/howto/unicode.html.

Please note: this answer is from 2012. Python has moved on, and you should be able to use unicodedata.normalize now.
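
A short Python 3 sketch of both approaches, using a made-up sample string. NFKC is chosen here as one form that maps the non-breaking space to a plain space; keep in mind that it also rewrites other compatibility characters, which a plain replace() does not.

```python
import unicodedata

text = "price:\xa0100\xa0EUR"  # sample string with two non-breaking spaces

# In UTF-8, each non-breaking space becomes the two bytes C2 A0
print(ascii(text.encode("utf-8")))

replaced = text.replace("\xa0", " ")
normalized = unicodedata.normalize("NFKC", text)

print(replaced == normalized)
```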

Answer #7

How about this:

import unicodedata
def strip_accents(s):
   return "".join(c for c in unicodedata.normalize("NFD", s)
                  if unicodedata.category(c) != "Mn")

This works on Greek letters, too:

>>> strip_accents(u"A \u00c0 \u0394 \u038E")
u"A A \u0394 \u03a5"
>>> 

The character category "Mn" stands for Nonspacing_Mark, which is similar to unicodedata.combining in MiniQuark's answer (I didn't think of unicodedata.combining, but it is probably the better solution, because it's more explicit).

And keep in mind, these manipulations may significantly alter the meaning of the text. Accents, Umlauts etc. are not "decoration".

Answer #8

You can look at the Django framework for how they create a "slug" from arbitrary text. A slug is URL- and filename-friendly.

The Django text utils define a function, slugify(), that's probably the gold standard for this kind of thing. Essentially, their code is the following.

import unicodedata
import re

def slugify(value, allow_unicode=False):
    """
    Taken from https://github.com/django/django/blob/master/django/utils/text.py
    Convert to ASCII if 'allow_unicode' is False. Convert spaces or repeated
    dashes to single dashes. Remove characters that aren't alphanumerics,
    underscores, or hyphens. Convert to lowercase. Also strip leading and
    trailing whitespace, dashes, and underscores.
    """
    value = str(value)
    if allow_unicode:
        value = unicodedata.normalize("NFKC", value)
    else:
        value = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
    value = re.sub(r"[^\w\s-]", "", value.lower())
    return re.sub(r"[-\s]+", "-", value).strip("-_")

And the older version (Python 2):

def slugify(value):
    """
    Normalizes string, converts to lowercase, removes non-alpha characters,
    and converts spaces to hyphens.
    """
    import unicodedata
    value = unicodedata.normalize("NFKD", value).encode("ascii", "ignore")
    value = unicode(re.sub(r"[^\w\s-]", "", value).strip().lower())
    value = unicode(re.sub(r"[-\s]+", "-", value))
    # ...
    return value

There's more, but I left it out, since it doesn't address slugification, but escaping.

Answer #9

Likely, your problem is that you parsed it okay, and now you're trying to print the contents of the XML and you can't because there are some foreign Unicode characters. Try to encode your Unicode string as ASCII first:

unicodeData.encode("ascii", "ignore")

The "ignore" part will tell it to just skip those characters. From the Python docs:

>>> # Python 2: u = unichr(40960) + u"abcd" + unichr(1972)
>>> u = chr(40960) + u"abcd" + chr(1972)
>>> u.encode("utf-8")
b'\xea\x80\x80abcd\xde\xb4'
>>> u.encode("ascii")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in position 0: ordinal not in range(128)
>>> u.encode("ascii", "ignore")
b'abcd'
>>> u.encode("ascii", "replace")
b'?abcd?'
>>> u.encode("ascii", "xmlcharrefreplace")
b'&#40960;abcd&#1972;'

You might want to read this article: http://www.joelonsoftware.com/articles/Unicode.html, which I found very useful as a basic tutorial on what's going on. After the read, you'll stop feeling like you're just guessing what commands to use (or at least that happened to me).

Answer #10

I just found this answer on the Web:

import unicodedata

def remove_accents(input_str):
    nfkd_form = unicodedata.normalize("NFKD", input_str)
    only_ascii = nfkd_form.encode("ASCII", "ignore")
    return only_ascii

It works fine (for French, for example), but I think the second step (removing the accents) could be handled better than dropping the non-ASCII characters, because this will fail for some languages (Greek, for example). The best solution would probably be to explicitly remove the unicode characters that are tagged as being diacritics.

Edit: this does the trick:

import unicodedata

def remove_accents(input_str):
    nfkd_form = unicodedata.normalize("NFKD", input_str)
    return u"".join([c for c in nfkd_form if not unicodedata.combining(c)])

unicodedata.combining(c) returns the canonical combining class of c, which is nonzero (and therefore true) if the character c can be combined with the preceding character, that is, mainly if it's a diacritic.

Edit 2: remove_accents expects a unicode string, not a byte string. If you have a byte string, then you must decode it into a unicode string like this:

encoding = "utf-8" # or iso-8859-15, or cp1252, or whatever encoding you use
byte_string = b"caf\xc3\xa9"  # UTF-8 bytes for "café"; or simply "café" before Python 3.
unicode_string = byte_string.decode(encoding)
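
Putting the pieces together in Python 3, and comparing the result with the ASCII-only approach from the top of this answer. This is only a sketch; the sample bytes are an assumption chosen to include one French and one Greek accented letter.

```python
import unicodedata


def remove_accents(input_str):
    nfkd_form = unicodedata.normalize("NFKD", input_str)
    return "".join(c for c in nfkd_form if not unicodedata.combining(c))


# UTF-8 bytes for "café" plus GREEK CAPITAL LETTER UPSILON WITH TONOS
raw = b"caf\xc3\xa9 \xce\x8e"
text = raw.decode("utf-8")

# Only the combining marks are dropped; the Greek base letter survives
print(ascii(remove_accents(text)))

# Encoding to ASCII with "ignore" drops the Greek letter entirely
print(unicodedata.normalize("NFKD", text).encode("ascii", "ignore"))
```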