Python vs C++ speed comparison

There are a million reasons to like Python (especially if you're a data scientist). But how does its speed compare with lower-level languages like C and C++? In this article I'm going to do a speed comparison between Python and C++, using a very simple example.

We are going to generate every possible DNA k-mer for a fixed value of k. I'll explain what k-mers are a little later. I chose this example because many genome-related data processing and analysis tasks are resource intensive, which is why many data scientists working in bioinformatics are interested in C++ (in addition to Python).

Important note: the purpose of this article is not to compare C++ and Python at their most efficient. Both programs could be made much faster. The purpose is to compare the two languages using the same algorithm and near-identical code.

Introduction to DNA k-mers

DNA is a long chain of nucleotides. These nucleotides can be of four types: A, C, G and T. The species Homo sapiens has about 3 billion pairs of nucleotides. Here is a small piece of human DNA:

ACTAGGGATCATGAAGATAATGTTGGTGTTTGTATGGTTTTCAGACAATT

To get k-mers out of it, slide a window of length k along the string, one character at a time. For k = 4 this gives:

ACTA, CTAG, TAGG, AGGG, GGGA, etc.

Each of these four-character substrings is a k-mer of length four (a 4-mer).
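The sliding window described above can be sketched in a few lines of Python. The helper name kmers is only an illustration and is not part of the programs compared later in the article.

```python
def kmers(seq, k):
    """Return every substring of length k, moving one character at a time."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


dna = "ACTAGGGATCATGAAGATAATGTTGGTGTTTGTATGGTTTTCAGACAATT"
four_mers = kmers(dna, 4)

# The first five 4-mers match the ones listed above.
print(four_mers[:5])  # ['ACTA', 'CTAG', 'TAGG', 'AGGG', 'GGGA']
```

A sequence of length n contains n - k + 1 overlapping k-mers, which is why the loop stops at len(seq) - k + 1.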

The challenge

We will generate all possible 13-mers. Mathematically, this is a permutation-with-repetition problem: each of the 13 positions can independently hold any of the 4 nucleotides. Hence there are 4 to the power of 13 (67,108,864) possible 13-mers.
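As a quick sanity check of that count, here is a small sketch that computes 4 to the power of 13 directly and confirms the formula by brute force for a small k. It uses itertools.product from the standard library, which the programs below deliberately avoid; the variable names are illustrative.

```python
from itertools import product

alphabet = "ACGT"

# Number of possible 13-mers: 4 choices for each of 13 positions.
total_13_mers = len(alphabet) ** 13
print(total_13_mers)  # 67108864

# Brute-force check of the same formula for a small k.
small_k = 3
all_3_mers = ["".join(p) for p in product(alphabet, repeat=small_k)]
print(len(all_3_mers))  # 64
```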

Speed comparison between Python and C++

We will use the same algorithm for the two languages. The code in both languages is intentionally written similarly and simply. I haven't used complex data structures or third-party libraries. Here is the code of the Python program:

def convert(c):
    if (c == 'A'): return 'C'
    if (c == 'C'): return 'G'
    if (c == 'G'): return 'T'
    if (c == 'T'): return 'A'

print("Start")

opt = "ACGT"
s = ""
s_last = ""
len_str = 13

for i in range(len_str):
    s += opt[0]

for i in range(len_str):
    s_last += opt[-1]

pos = 0
counter = 1
while (s != s_last):
    counter += 1
    # print(s)
    change_next = True
    for i in range(len_str):
        if (change_next):
            if (s[i] == opt[-1]):
                s = s[:i] + convert(s[i]) + s[i+1:]
                change_next = True
            else:
                s = s[:i] + convert(s[i]) + s[i+1:]
                break

# print(s)
print("Number of generated k-mers: {}".format(counter))
print("Finish!")

This program takes 61.23 seconds to execute, during which it generates all 67 million 13-mers. To keep the running time down, I commented out the two print(s) calls that display the results. If you uncomment them and run the code, be aware that it will take a very long time; you can stop the program by pressing CTRL+C.
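The loop works like an odometer whose fastest-changing digit is the leftmost character: each step advances position 0, and whenever a position wraps from T back to A the change carries over to the next position. To see this on a small scale, here is a sketch that wraps the same stepping rule in a generator. The name generate_kmers and its parameters are my own for illustration; the timed program above does not use them.

```python
def convert(c):
    """Advance one nucleotide: A -> C -> G -> T -> A."""
    return {"A": "C", "C": "G", "G": "T", "T": "A"}[c]


def generate_kmers(k, opt="ACGT"):
    """Yield every k-mer in the order the program above visits them."""
    s = opt[0] * k
    s_last = opt[-1] * k
    yield s
    while s != s_last:
        for i in range(k):
            wrapped = s[i] == opt[-1]        # T wraps to A and carries over
            s = s[:i] + convert(s[i]) + s[i + 1:]
            if not wrapped:
                break                        # no carry, this step is done
        yield s


two_mers = list(generate_kmers(2))
print(two_mers[:5])   # ['AA', 'CA', 'GA', 'TA', 'AC']
print(len(two_mers))  # 16
```

For k = 2 the generator produces all 16 strings exactly once and stops at TT, which mirrors how the full program reaches 4 to the power of 13 for k = 13.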

Now let's see the same algorithm in C++:

#include <iostream>
#include <string>

using namespace std;

char convert(char c)
{
    if (c == 'A') return 'C';
    if (c == 'C') return 'G';
    if (c == 'G') return 'T';
    if (c == 'T') return 'A';
    return ' ';
}

int main()
{
    cout << "Start" << endl;

    string opt = "ACGT";
    string s = "";
    string s_last = "";
    int len_str = 13;
    bool change_next;

    for (int i=0; i<len_str; i++)
    {
        s += opt[0];
    }

    for (int i=0; i<len_str; i++)
    {
        s_last += opt.back();
    }

    int pos = 0;
    int counter = 1;
    while (s != s_last)
    {
        counter ++;
        // cout << s << endl;
        change_next = true;
        for (int i=0; i<len_str; i++)
        {
            if (change_next)
            {
                if (s[i] == opt.back())
                {
                    s[i] = convert(s[i]);
                    change_next = true;
                }
                else
                {
                    s[i] = convert(s[i]);
                    break;
                }
            }
        }
    }

    // cout << s << endl;
    cout << "Number of generated k-mers: " << counter << endl;
    cout << "Finish!" << endl;
    return 0;
}

After compiling, this code executes in 2.42 seconds, so Python takes about 25 times longer to do the same task. I repeated the experiment with 14-mers and 15-mers (the length is set by len_str, on line 12 of the Python program and line 22 of the C++ program). The performance of the two languages on the same task differs significantly.

I repeat: both programs are far from perfect and could be significantly optimized. For example, we did not use parallel computing on the CPU or GPU, which tasks like this call for in practice. We also don't store the results, even though memory management significantly affects performance in both Python and C++.
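One concrete memory-related difference between the two programs: Python strings are immutable, so every s[:i] + convert(s[i]) + s[i+1:] builds a new string, whereas the C++ version overwrites s[i] in place. The sketch below shows one way to do the same in-place update in Python with a list of characters. It is only an illustration of the idea; the function name count_kmers_in_place is hypothetical, and I have not timed it, so no speed-up is claimed here.

```python
def count_kmers_in_place(k, opt="ACGT"):
    """Count k-mers with the same stepping rule, mutating a list in place."""
    # Lookup table for the next nucleotide: A -> C -> G -> T -> A.
    nxt = {opt[j]: opt[(j + 1) % len(opt)] for j in range(len(opt))}
    s = [opt[0]] * k
    s_last = [opt[-1]] * k
    counter = 1
    while s != s_last:
        counter += 1
        for i in range(k):
            wrapped = s[i] == opt[-1]
            s[i] = nxt[s[i]]          # overwrite one element, no new string
            if not wrapped:
                break
    return counter


print(count_kmers_in_place(4))  # 256
```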

This example, and thousands of other tasks, confirm that data scientists should pay attention to C++ and similar languages when they need to work with large data sets or performance-hungry processes.
