The C function below (Code # 1) prints the hexadecimal representation of each character in a string, so that the strings passed to it can be inspected and debugged easily. The rest of this section uses it to illustrate and test the different conversions.
Code # 1:
48 65 6c 6c 6f
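As a cross-check, the output format shown above can be reproduced on the Python side. This is a minimal sketch, not the C function itself: the helper name `hex_chars` is hypothetical, and it assumes the C function prints each byte as two lowercase hex digits separated by spaces, as in the sample output.

```python
def hex_chars(data: bytes) -> str:
    """Return the bytes of data as space-separated two-digit hex values."""
    # Iterating over a bytes object yields integers in the range 0-255.
    return " ".join(f"{byte:02x}" for byte in data)


print(hex_chars(b"Hello"))
```

Comparing this string with what the C function prints is a quick way to confirm that the bytes arrived unchanged.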
There are several ways to call such a C function from Python. First, it can be restricted to operate on bytes only, by passing the conversion code "y" to PyArg_ParseTuple() as shown in the code below.
Code # 2:
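Let's see how the resulting function works, and how bytes with embedded NULL bytes and Unicode strings are rejected. (The exact exception depends on the Python version: since Python 3.5, an embedded null byte raises ValueError rather than TypeError.)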
Code # 3:
48 65 6c 6c 6f 20 57 6f 72 6c 64
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: must be bytes without null bytes, not bytes
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' does not support the buffer interface
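Embedded NULL bytes are rejected because C code that reads a NULL-terminated string stops at the first zero byte. The following sketch uses the standard ctypes module, which is not part of the original example, to illustrate the truncation.

```python
import ctypes

data = b"Hello\x00World"

# create_string_buffer copies the bytes into a C char array and appends a NUL.
buf = ctypes.create_string_buffer(data)

# .raw exposes the whole array; .value reads it the way C string functions
# would, stopping at the first NUL byte.
full_contents = buf.raw
as_c_string = buf.value

print(len(full_contents), full_contents)
print(len(as_c_string), as_c_string)
```

Everything after the first zero byte is invisible to the C side, so refusing such input is safer than silently truncating it.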
If you want to pass Unicode strings instead, use the format code "s" to PyArg_ParseTuple(), as shown below.
Code # 4:
With the code above (Code # 4), every string is automatically converted to a null-terminated UTF-8 encoding, as the session below shows.
Code # 5:
48 65 6c 6c 6f 20 57 6f 72 6c 64
53 70 69 63 79 20 4a 61 6c 61 70 65 c3 b1 6f
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: must be str without null characters, not str
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: must be str, not bytes
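The non-ASCII bytes in the output above can be verified from Python: encoding the string as UTF-8 turns the single character ñ into the two bytes c3 b1. This sketch assumes the test string was 'Spicy Jalapeño', which is what those bytes decode to. Note that bytes.hex() accepts a separator only in Python 3.8 and later.

```python
text = "Spicy Jalape\u00f1o"

# The "s" format code hands the C function the UTF-8 encoding of the string.
encoded = text.encode("utf-8")

print(len(text))          # number of characters
print(len(encoded))       # number of UTF-8 bytes (one more, because of ñ)
print(encoded.hex(" "))   # space-separated hex, same layout as the C output
```

The byte count exceeds the character count, which is why C code must never assume one byte per character when it receives UTF-8 data.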
If you are working with a PyObject * and cannot use PyArg_ParseTuple(), the code below shows how to check for, and extract, a suitable char * reference from both a bytes object and a string object.
Code # 6: Converting from Bytes
Code # 7: Convert to UTF-8 bytes from string
Both conversions guarantee null-terminated data, but neither checks for NULL bytes embedded elsewhere in the string. If that matters for your application, you need to check for it yourself.
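The same check-and-extract logic, together with the embedded NULL check that these conversions leave out, can be sketched in Python for validating data before it crosses into C. The function name `as_c_string` is hypothetical and is not part of the Python C API. It only illustrates the order of the checks.

```python
def as_c_string(obj) -> bytes:
    """Return NUL-terminated bytes suitable for a C char * parameter.

    Accepts bytes or str (encoded as UTF-8). Rejects any other type and
    any value that contains an embedded NUL byte.
    """
    if isinstance(obj, bytes):
        raw = obj
    elif isinstance(obj, str):
        raw = obj.encode("utf-8")
    else:
        raise TypeError(f"expected bytes or str, not {type(obj).__name__}")

    # The explicit check that the plain conversions do not perform.
    if b"\x00" in raw:
        raise ValueError("embedded null byte")

    return raw + b"\x00"


print(as_c_string("Hello"))
print(as_c_string(b"Hello"))
```

Raising an error rather than truncating keeps the behavior consistent with what PyArg_ParseTuple() does for the "y" and "s" format codes.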
Note: The "s" format code to PyArg_ParseTuple() carries a hidden memory overhead that is easy to miss. When this conversion is used, a UTF-8 copy of the string is created and permanently attached to the original string object. If the string contains non-ASCII characters, this increases the size of the string object until it is garbage collected.
Code # 8:
Size: 87
53 70 69 63 79 20 4a 61 6c 61 70 65 c3 b1 6f
Size: 103
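The growth can also be observed without compiling an extension. The sketch below is specific to CPython and rests on an assumption: that calling the C API function PyUnicode_AsUTF8() through ctypes triggers the same cached UTF-8 copy that the "s" format code creates. The helper name `utf8_cache_growth` is hypothetical, and the exact sizes reported depend on the Python version and platform.

```python
import ctypes
import sys

# CPython-only: bind the C API function that returns a str's cached UTF-8 form.
_as_utf8 = ctypes.pythonapi.PyUnicode_AsUTF8
_as_utf8.argtypes = [ctypes.py_object]
_as_utf8.restype = ctypes.c_char_p


def utf8_cache_growth(text: str):
    """Return (size_before, size_after) around a cached UTF-8 conversion."""
    size_before = sys.getsizeof(text)
    _as_utf8(text)  # creates the UTF-8 copy and attaches it to the str object
    size_after = sys.getsizeof(text)
    return size_before, size_after


# Build the string at run time so it starts without a cached UTF-8 copy.
sample = "".join(["Spicy Jalape", "\u00f1", "o"])
before, after = utf8_cache_growth(sample)
print("Size:", before)
print("Size:", after)
```

If the extra memory is a concern, the alternative is to encode the string to a separate bytes object and pass that instead, so that nothing stays attached to the original string.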