Passing null-terminated strings to C libraries



The C function below (code # 1) is the one we will illustrate and test. It simply prints the hexadecimal representation of each character, so that the passed strings can be easily debugged.

Code # 1:

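#include <stdio.h>

/* Print each byte of a null-terminated string as hex digits */
void print_chars(char *s)
{
  while (*s)
  {
    printf("%2x ", (unsigned char) *s);
    s++;
  }
  printf("\n");
}

/* Example call (from inside a function such as main) */
print_chars("Hello");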

Output:

 48 65 6c 6c 6f 
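To make the behaviour of the loop concrete, here is a minimal Python sketch that mimics it. The name `hex_chars` is hypothetical and not part of the original code; it only models how a C function that walks a `char *` stops at the first null byte, which is why embedded nulls matter in the rest of this article.

```python
def hex_chars(data: bytes) -> str:
    """Mimic the C loop: format each byte as hex until the first null byte."""
    out = []
    for byte in data:
        if byte == 0:  # a C string ends at the first null byte
            break
        out.append(f"{byte:2x}")
    return " ".join(out)


print(hex_chars(b"Hello"))
# Everything after the embedded null byte is silently ignored
print(hex_chars(b"Hello\x00World"))
```

Both calls print the same five bytes, since the C-style loop never sees "World".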

There are several options for calling such a C function from Python. First, you can restrict it to work with bytes only, using the conversion code “y” with PyArg_ParseTuple(), as shown in the code below.

Code # 2:

static PyObject *py_print_chars(PyObject *self, PyObject *args)
{
  char *s;

  /* "y" accepts only bytes objects without embedded null bytes */
  if (!PyArg_ParseTuple(args, "y", &s))
  {
    return NULL;
  }
  print_chars(s);
  Py_RETURN_NONE;
}

Let's see how the resulting function works, and how bytes with embedded null bytes and Unicode strings are rejected.

Code # 3:

print_chars(b'Hello World')

# bytes with an embedded null byte are rejected
print_chars(b'Hello\x00World')

# str objects are rejected as well
print_chars('Hello World')

Output:

48 65 6c 6c 6f 20 57 6f 72 6c 64
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: must be bytes without null bytes, not bytes
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' does not support the buffer interface

If you want to pass Unicode strings instead, use format code “s” for PyArg_ParseTuple() as below.

Code # 4:

static PyObject *py_print_chars(PyObject *self, PyObject *args)
{
  char *s;

  /* "s" accepts only str objects and encodes them as UTF-8 */
  if (!PyArg_ParseTuple(args, "s", &s))
  {
    return NULL;
  }
  print_chars(s);
  Py_RETURN_NONE;
}

With the above code (code # 4), all strings are automatically converted to a null-terminated UTF-8 encoding, as shown in the code below.

Code # 5:

print_chars('Hello World')

# non-ASCII text is encoded as UTF-8
print_chars('Spicy Jalape\u00f1o')

# embedded null characters are rejected
print_chars('Hello\x00World')

# bytes objects are rejected
print_chars(b'Hello World')

Output:

48 65 6c 6c 6f 20 57 6f 72 6c 64
53 70 69 63 79 20 4a 61 6c 61 70 65 c3 b1 6f
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: must be str without null characters, not str
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: must be str, not bytes
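The UTF-8 bytes that the C function receives can be reproduced from pure Python. The sketch below only assumes that the "s" conversion produces the same bytes as `str.encode("utf-8")`, which matches the output shown above: the single character ñ becomes the two bytes c3 b1.

```python
text = "Spicy Jalape\u00f1o"
encoded = text.encode("utf-8")

# 14 characters become 15 bytes because the n-with-tilde needs two bytes
print(len(text), len(encoded))

# Same hex dump as the C function prints for this string
hex_dump = " ".join(f"{b:02x}" for b in encoded)
print(hex_dump)
```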

If you are working with a PyObject * and cannot use PyArg_ParseTuple(), the code below shows how to check for and extract a suitable char * reference from both a bytes object and a string object.

Code # 6: Converting from Bytes

// some Python object
PyObject *obj;

// Convert from bytes
{
  char *s;

  s = PyBytes_AsString(obj);
  if (!s)
  {
    /* TypeError has already been raised */
    return NULL;
  }
  print_chars(s);
}

Code # 7: Convert to UTF-8 bytes from string

{
  PyObject *bytes;
  char *s;

  if (!PyUnicode_Check(obj))
  {
    PyErr_SetString(PyExc_TypeError, "Expected string");
    return NULL;
  }
  bytes = PyUnicode_AsUTF8String(obj);
  if (!bytes)
  {
    /* Encoding failed; an exception has already been set */
    return NULL;
  }
  s = PyBytes_AsString(bytes);
  print_chars(s);
  Py_DECREF(bytes);
}

Both conversions guarantee null-terminated data, but neither checks for null bytes embedded elsewhere in the string. If that matters for your C function, you must check for it yourself.
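One way to picture the missing check is the following Python sketch. The helper name `to_c_string` is hypothetical and not part of the original code; it normalizes either a str or a bytes object to UTF-8 bytes and rejects embedded null bytes. In C, a comparable check could compare strlen(s) with the length reported by PyBytes_Size().

```python
def to_c_string(value):
    """Return UTF-8 bytes that are safe to treat as a C string, or raise."""
    if isinstance(value, str):
        data = value.encode("utf-8")
    elif isinstance(value, (bytes, bytearray)):
        data = bytes(value)
    else:
        raise TypeError("expected str or bytes")
    # A C function would stop reading at this byte, so refuse the input
    if b"\x00" in data:
        raise ValueError("embedded null byte")
    return data


print(to_c_string("Hello World"))
try:
    to_c_string(b"Hello\x00World")
except ValueError as exc:
    print("rejected:", exc)
```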

Note: The "s" format code for PyArg_ParseTuple() carries a hidden memory overhead that is easy to miss. When code uses this conversion, a UTF-8 copy of the string is created and attached to the original string object. If the string contains non-ASCII characters, this copy increases the object's memory footprint, and the extra memory is held until the string itself is garbage collected.

Code # 8:

import sys

s = 'Spicy Jalape\u00f1o'
print("Size:", sys.getsizeof(s))

# pass the string to the C function
print_chars(s)

# the size has increased
print("Size:", sys.getsizeof(s))

Output:

Size: 87
53 70 69 63 79 20 4a 61 6c 61 70 65 c3 b1 6f
Size: 103
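The 16-byte growth in the sizes above has a plausible explanation. The sketch below assumes that the attached UTF-8 copy is stored together with a terminating null byte, which is an assumption about CPython's internals rather than something the original states. Exact sys.getsizeof() values differ between Python versions, so only the difference is compared.

```python
s = "Spicy Jalape\u00f1o"

# UTF-8 length of the string: 15 bytes for 14 characters
utf8_len = len(s.encode("utf-8"))

# Assumed overhead: the UTF-8 bytes plus one terminating null byte
overhead = utf8_len + 1

# Sizes reported in the output above
observed_growth = 103 - 87

print(utf8_len, overhead, observed_growth)
```

Under this assumption the predicted overhead matches the observed growth exactly.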