Unicode strings passed to C libraries



To illustrate the solution, the two C functions below operate on string data and print it in hexadecimal for debugging and experimentation.

Code # 1: Uses bytes, passed in the form char *, int

#include <stdio.h>

/* Print each byte of s as a hexadecimal value. */
void print_chars(char *s, int len)
{
    int n = 0;
    while (n < len)
    {
        printf("%2x ", (unsigned char) s[n]);
        n++;
    }
    printf("\n");
}

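Code # 2: Uses wide characters, passed in the form wchar_t *, int

#include <stdio.h>
#include <wchar.h>

/* Print each wide character of s as a hexadecimal value. */
void print_wchars(wchar_t *s, int len)
{
    int n = 0;
    while (n < len)
    {
        /* Cast so the argument matches the %x conversion. */
        printf("%x ", (unsigned int) s[n]);
        n++;
    }
    printf("\n");
}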

For the byte-oriented function print_chars(), Python strings must be converted to a suitable byte encoding such as UTF-8. The code below is a simple extension function that does this.

Code # 3:

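/* PY_SSIZE_T_CLEAN must be defined before including Python.h
   so that "#" format codes use Py_ssize_t lengths. */
#define PY_SSIZE_T_CLEAN
#include <Python.h>

static PyObject *py_print_chars(PyObject *self, PyObject *args)
{
    char *s;
    Py_ssize_t len;

    /* "s#" converts a str to UTF-8 and returns a pointer and a length. */
    if (!PyArg_ParseTuple(args, "s#", &s, &len))
    {
        return NULL;
    }
    print_chars(s, (int) len);
    Py_RETURN_NONE;
}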
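The bytes that the "s#" conversion hands to print_chars() can be approximated in pure Python. The sketch below is only an illustration: hex_bytes is a hypothetical helper name, not part of the extension, and it assumes UTF-8 as the target encoding.

```python
def hex_bytes(s, encoding="utf-8"):
    """Return the encoded bytes of s as space-separated hex values.

    hex_bytes is a hypothetical helper used only for illustration.
    """
    return " ".join(format(b, "x") for b in s.encode(encoding))


print(hex_bytes("Spicy Jalape\u00f1o"))
# 53 70 69 63 79 20 4a 61 6c 61 70 65 c3 b1 6f
```

The non-ASCII character U+00F1 becomes the two bytes c3 b1, so the byte count (15) is one larger than the character count (14).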

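For library functions that work with the machine-native wchar_t type, the C extension code can be written as shown below. Note that the "u#" format code relies on a legacy representation: it was deprecated in Python 3.3 and removed in Python 3.12.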

Code # 4:

static PyObject *py_print_wchars(PyObject *self, PyObject *args)
{
    wchar_t *s;
    Py_ssize_t len;

    /* "u#" converts a str to a wchar_t buffer and returns a pointer and a length. */
    if (!PyArg_ParseTuple(args, "u#", &s, &len))
    {
        return NULL;
    }
    print_wchars(s, (int) len);
    Py_RETURN_NONE;
}
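The values that print_wchars() receives correspond to Unicode code points, which can be approximated in pure Python with ord(). This is a minimal sketch: hex_code_points is a hypothetical helper name, and it assumes every character fits in a single wchar_t (on platforms with a 16-bit wchar_t, characters outside the Basic Multilingual Plane would instead appear as surrogate pairs).

```python
def hex_code_points(s):
    """Return the code points of s as space-separated hex values.

    hex_code_points is a hypothetical helper used only for illustration.
    """
    return " ".join(format(ord(ch), "x") for ch in s)


print(hex_code_points("Spicy Jalape\u00f1o"))
# 53 70 69 63 79 20 4a 61 6c 61 70 65 f1 6f
```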

The code below checks how the extension functions work.

Observe how the byte-oriented function print_chars() receives UTF-8 encoded data, while print_wchars() receives the Unicode code point values.

Code # 5:

# Assumes print_chars and print_wchars were imported from the compiled extension module.
s = 'Spicy Jalape\u00f1o'

print_chars(s)

print_wchars(s)


Output:

53 70 69 63 79 20 4a 61 6c 61 70 65 c3 b1 6f
53 70 69 63 79 20 4a 61 6c 61 70 65 f1 6f

Consider the nature of the C library being accessed. For many C libraries, it makes more sense to pass bytes instead of a string. The conversion code below does this.

Code # 6:

static PyObject *py_print_chars(PyObject *self, PyObject *args)
{
    char *s;
    Py_ssize_t len;

    /* "y#" accepts bytes, bytearray, or other bytes-like objects */
    if (!PyArg_ParseTuple(args, "y#", &s, &len))
    {
        return NULL;
    }
    print_chars(s, (int) len);
    Py_RETURN_NONE;
}
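With a bytes-only function like the one above, the caller chooses the encoding before the call. The following sketch shows one way the Python side might normalize its input; to_bytes is a hypothetical helper, and UTF-8 is assumed as the default encoding.

```python
def to_bytes(data, encoding="utf-8"):
    """Return data as a bytes object, encoding str input explicitly.

    to_bytes is a hypothetical helper used only for illustration.
    """
    if isinstance(data, str):
        return data.encode(encoding)
    if isinstance(data, (bytes, bytearray, memoryview)):
        return bytes(data)
    raise TypeError(
        f"expected str or bytes-like object, got {type(data).__name__}"
    )


payload = to_bytes("Spicy Jalape\u00f1o")
print(payload)
# b'Spicy Jalape\xc3\xb1o'
```

The result could then be passed to the bytes-accepting print_chars(), which keeps the encoding decision visible in the Python code.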

If you still want to pass strings, be aware that Python 3 uses an adaptable string representation that does not map directly onto C libraries using the standard char * or wchar_t * types. Thus, to present string data to C, some kind of conversion is almost always necessary. The format codes s# and u# for PyArg_ParseTuple() safely perform such conversions.
One potential downside is that the conversions make the original string object grow. Whenever a conversion is performed, a copy of the converted data is attached to the original string object so that it can be reused later, as shown in the code below.

Code # 7:

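import sys

# Assumes print_chars and print_wchars were imported from the compiled extension module.
s = 'Spicy Jalape\u00f1o'
print("Size:", sys.getsizeof(s))

print_chars(s)    # passing the string triggers a UTF-8 conversion
print("Size:", sys.getsizeof(s))

print_wchars(s)   # passing the string triggers a wchar_t conversion
print("Size:", sys.getsizeof(s))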

Output:

Size: 87
53 70 69 63 79 20 4a 61 6c 61 70 65 c3 b1 6f
Size: 103
53 70 69 63 79 20 4a 61 6c 61 70 65 f1 6f
Size: 163
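If this growth is a concern, one option is to encode on the Python side and pass the resulting bytes to a bytes-accepting function. The sketch below assumes CPython, where str.encode() is understood to return a new bytes object without attaching a cached copy to the string. That is an implementation detail rather than a documented guarantee, and the actual sizes vary by Python version, so none are shown.

```python
import sys

s = "Spicy Jalape\u00f1o"

size_before = sys.getsizeof(s)
data = s.encode("utf-8")        # separate bytes object holding the UTF-8 data
size_after = sys.getsizeof(s)   # assumed unchanged on CPython

print("Size before:", size_before)
print("Size after: ", size_after)
print("Encoded length:", len(data))
```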