Python | Flip the list


Python provides several ways to reverse a list. We'll look at a few common techniques for reversing a list in Python.
Examples:

Input:  list = [10, 11, 12, 13, 14, 15]
Output: [15, 14, 13, 12, 11, 10]

Input:  list = [4, 5, 6, 7, 8, 9]
Output: [9, 8, 7, 6, 5, 4]

Method 1: Using the built-in reversed() function.
This method does not reverse the list in place (it does not modify the original list), and reversed() itself does not copy the list; it returns a reverse iterator. In the example below we consume that iterator in a list comprehension to build a new, reversed list.

# Reverse the list with reversed()
def Reverse(lst):
    return [ele for ele in reversed(lst)]

# Driver code
lst = [10, 11, 12, 13, 14, 15]
print(Reverse(lst))

Output:

 [15, 14, 13, 12, 11, 10] 

Method 2: Using the built-in reverse() function.
With the reverse() method we can reverse the contents of the list object in place, i.e. we don't need to create a new list or copy the existing elements; the original list is simply rearranged in reverse order. This method directly modifies the original list.

# Reverse the list using reverse()
def Reverse(lst):
    lst.reverse()
    return lst

lst = [10, 11, 12, 13, 14, 15]
print(Reverse(lst))

Output:

 [15, 14, 13, 12, 11, 10] 

Method 3: Using the slicing technique.
This method creates a reversed copy of the list and does not reverse the list in place. Making the copy requires additional storage for all of the existing elements, so it uses more memory.

# Flip the list using the slicing technique
def Reverse(lst):
    new_lst = lst[::-1]
    return new_lst

lst = [10, 11, 12, 13, 14, 15]
print(Reverse(lst))

Output:

 [15, 14, 13, 12, 11, 10] 

For a better understanding of the slicing technique, refer to the article on list slicing in Python.





Python | Flip the list: StackOverflow Questions

Answer #1

Here's a one-line solution to remove columns based on duplicate column names:

df = df.loc[:,~df.columns.duplicated()]

How it works:

Suppose the columns of the data frame are ["alpha","beta","alpha"]

df.columns.duplicated() returns a boolean array: a True or False for each column. If it is False, the column name is unique up to that point; if it is True, the column name duplicates one that appeared earlier. For example, using the given example, the returned value would be [False, False, True].

Pandas allows one to index using boolean values, whereby it selects only the True values. Since we want to keep the unduplicated columns, we need the above boolean array to be flipped (i.e. [True, True, False] = ~[False, False, True]).

Finally, df.loc[:,[True,True,False]] selects only the non-duplicated columns using the aforementioned indexing capability.

Note: the above only checks column names, not column values.
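
As a quick illustration (a minimal sketch with a made-up frame), the pieces look like this:

import pandas as pd

# Hypothetical frame with a duplicated column name
df = pd.DataFrame([[1, 2, 3]], columns=["alpha", "beta", "alpha"])

print(df.columns.duplicated())               # [False False  True]
print(df.loc[:, ~df.columns.duplicated()])   # keeps the first "alpha" and "beta"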

Answer #2

NEVER grow a DataFrame!

Yes, people have already explained that you should NEVER grow a DataFrame, and that you should append your data to a list and convert it to a DataFrame once at the end. But do you understand why?

Here are the most important reasons, taken from my post here.

  1. It is always cheaper/faster to append to a list and create a DataFrame in one go.
  2. Lists take up less memory and are a much lighter data structure to work with, append to, and remove from.
  3. dtypes are automatically inferred for your data. On the flip side, creating an empty frame of NaNs will automatically make them object, which is bad.
  4. An index is automatically created for you, instead of you having to take care to assign the correct index to the row you are appending.

This is The Right Way™ to accumulate your data

data = []
for a, b, c in some_function_that_yields_data():
    data.append([a, b, c])

df = pd.DataFrame(data, columns=["A", "B", "C"])

These options are horrible

  1. append or concat inside a loop

    append and concat aren't inherently bad in isolation. The problem starts when you iteratively call them inside a loop - this results in quadratic memory usage.

    # Creates empty DataFrame and appends
    df = pd.DataFrame(columns=["A", "B", "C"])
    for a, b, c in some_function_that_yields_data():
        df = df.append({"A": a, "B": b, "C": c}, ignore_index=True)  
        # This is equally bad:
        # df = pd.concat(
        #       [df, pd.Series({"A": a, "B": b, "C": c})], 
        #       ignore_index=True)
    
  2. Empty DataFrame of NaNs

    Never create a DataFrame of NaNs as the columns are initialized with object (slow, un-vectorizable dtype).

    # Creates DataFrame of NaNs and overwrites values.
    df = pd.DataFrame(columns=["A", "B", "C"], index=range(5))
    for a, b, c in some_function_that_yields_data():
        df.loc[len(df)] = [a, b, c]
    

The Proof is in the Pudding

Timing these methods is the fastest way to see just how much they differ in terms of their memory and utility.

[Benchmark plot comparing list accumulation with iterative append/concat]

Benchmarking code for reference.


It's posts like this that remind me why I'm a part of this community. People understand the importance of teaching folks to get the right answer with the right code, not the right answer with wrong code. Now you might argue that it is not an issue to use loc or append if you're only adding a single row to your DataFrame. However, people often look to this question to add more than just one row - often the requirement is to iteratively add a row inside a loop using data that comes from a function (see related question). In that case it is important to understand that iteratively growing a DataFrame is not a good idea.

Answer #3

As mentioned above,

a[::-1]

really only creates a view, so it's a constant-time operation (and as such doesn't take longer as the array grows). If you need the array to be contiguous (for example because you're performing many vector operations with it), ascontiguousarray is about as fast as flipud/fliplr:

[perfplot benchmark comparing a[::-1], ascontiguousarray(a[::-1]) and fliplr]


Code to generate the plot:

import numpy
import perfplot


perfplot.show(
    setup=lambda n: numpy.random.randint(0, 1000, n),
    kernels=[
        lambda a: a[::-1],
        lambda a: numpy.ascontiguousarray(a[::-1]),
        lambda a: numpy.fliplr([a])[0],
    ],
    labels=["a[::-1]", "ascontiguousarray(a[::-1])", "fliplr"],
    n_range=[2 ** k for k in range(25)],
    xlabel="len(a)",
)
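
As a quick way to check the view/contiguity claims yourself (a small sketch, separate from the benchmark above):

import numpy as np

a = np.arange(10)
rev = a[::-1]

print(rev.base is a)                     # True: rev is a view onto a, no data copied
print(rev.flags["C_CONTIGUOUS"])         # False: the reversed view is not contiguous
print(np.ascontiguousarray(rev).flags["C_CONTIGUOUS"])  # True: this makes a contiguous copy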

Answer #4

Adding a quick snippet to have it ready to execute:

Source: myparser.py

import argparse
parser = argparse.ArgumentParser(description="Flip a switch by setting a flag")
parser.add_argument("-w", action="store_true")

args = parser.parse_args()
print(args.w)

Usage:

python myparser.py -w
>> True

Answer #5

As you have it, the argument -w expects a value after it on the command line. If you are just looking to flip a switch by setting a variable to True or False, have a look at the argparse documentation (specifically store_true and store_false).

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-w", action="store_true")

where action="store_true" implies default=False.

Conversely, you could have action="store_false", which implies default=True.
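
For example (a small sketch with hypothetical flag names), the two actions behave like this:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-w", action="store_true")        # default False; passing -w sets it to True
parser.add_argument("--no-color", dest="color",
                    action="store_false")             # default True; passing --no-color sets it to False

print(parser.parse_args([]))                     # Namespace(w=False, color=True)
print(parser.parse_args(["-w", "--no-color"]))   # Namespace(w=True, color=False)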

Answer #6

You can flip it around and list the dependencies in setup.py and have a single character — a dot . — in requirements.txt instead.
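
Concretely, that "flipped" layout might look like this (a minimal sketch; the package name and dependencies are made up):

# setup.py -- the dependencies live here
from setuptools import setup

setup(
    name="myproject",                              # hypothetical package name
    install_requires=["requests>=2.0", "click"],   # hypothetical dependencies
)

# requirements.txt then contains only a single dot:
#     .
# so "pip install -r requirements.txt" installs the current directory,
# pulling in everything listed under install_requires.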


Alternatively, even if not advised, it is still possible to parse the requirements.txt file (if it doesn't refer to any external requirements by URL) with the following hack (tested with pip 9.0.1):

from pip.req import parse_requirements  # available in pip < 10

install_reqs = parse_requirements("requirements.txt", session="hack")

This doesn't filter environment markers, though.


In old versions of pip, more specifically older than 6.0, there is a public API that can be used to achieve this. A requirement file can contain comments (#) and can include some other files (--requirement or -r). Thus, if you really want to parse a requirements.txt you can use the pip parser:

from pip.req import parse_requirements

# parse_requirements() returns generator of pip.req.InstallRequirement objects
install_reqs = parse_requirements(<requirements_path>)

# reqs is a list of requirement
# e.g. ["django==1.5.1", "mezzanine==1.4.6"]
reqs = [str(ir.req) for ir in install_reqs]

setup(
    ...
    install_requires=reqs
)

Answer #7

int(True) is 1.

1 is:

00000001

and ~1 is:

11111110

Which is -2 in two's complement¹

¹ Flip all the bits, add 1 to the resulting number, and interpret the result as the binary representation of the magnitude, with a negative sign (since the number begins with 1):

11111110 → 00000001 → 00000010 
         ↑          ↑ 
       Flip       Add 1

Which is 2, but the sign is negative since the MSB is 1.
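
You can verify this directly at the interpreter prompt:

>>> ~1
-2
>>> ~True   # True behaves as the integer 1, so this is also -2, not False
-2
>>> ~0
-1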


Worth mentioning:

Think about bool and you'll find that it's numeric in nature - it has two values, True and False, and they are just "customized" versions of the integers 1 and 0 that only print themselves differently. They are subclasses of the integer type int.

So they behave exactly as 1 and 0, except that bool redefines __str__ and __repr__ to display them differently.

>>> type(True)
<class 'bool'>
>>> isinstance(True, int)
True

>>> True == 1
True
>>> True is 1  # they're still different objects
False

Answer #8

The new magic super() behaviour was added to avoid violating the D.R.Y. (Don't Repeat Yourself) principle, see PEP 3135. Having to explicitly name the class by referencing it as a global is also prone to the same rebinding issues you discovered with super() itself:

class Foo(Bar):
    def baz(self):
        return super(Foo, self).baz() + 42

Spam = Foo
Foo = something_else()

Spam().baz()  # liable to blow up

The same applies to using class decorators where the decorator returns a new object, which rebinds the class name:

@class_decorator_returning_new_class
class Foo(Bar):
    def baz(self):
        # Now `Foo` is a *different class*
        return super(Foo, self).baz() + 42

The magic super() __class__ cell sidesteps these issues nicely by giving you access to the original class object.

The PEP was kicked off by Guido, who initially envisioned super becoming a keyword, and the idea of using a cell to look up the current class was also his. Certainly, the idea to make it a keyword was part of the first draft of the PEP.

However, it was in fact Guido himself who then stepped away from the keyword idea as "too magical", proposing the current implementation instead. He anticipated that using a different name for super() could be a problem:

My patch uses an intermediate solution: it assumes you need __class__ whenever you use a variable named "super". Thus, if you (globally) rename super to supper and use supper but not super, it won't work without arguments (but it will still work if you pass it either __class__ or the actual class object); if you have an unrelated variable named super, things will work but the method will use the slightly slower call path used for cell variables.

So, in the end, it was Guido himself that proclaimed that using a super keyword did not feel right, and that providing a magic __class__ cell was an acceptable compromise.

I agree that the magic, implicit behaviour of the implementation is somewhat surprising, but super() is one of the most mis-applied functions in the language. Just take a look at all the misapplied super(type(self), self) or super(self.__class__, self) invocations found on the Internet; if any of that code was ever called from a derived class you'd end up with an infinite recursion exception. At the very least the simplified super() call, without arguments, avoids that problem.

As for the renamed super_; just reference __class__ in your method as well and it'll work again. The cell is created if you reference either the super or __class__ names in your method:

>>> super_ = super
>>> class A(object):
...     def x(self):
...         print("No flipping")
... 
>>> class B(A):
...     def x(self):
...         __class__  # just referencing it is enough
...         super_().x()
... 
>>> B().x()
No flipping

Answer #9

HDF5 Advantages: Organization, flexibility, interoperability

Some of the main advantages of HDF5 are its hierarchical structure (similar to folders/files), optional arbitrary metadata stored with each item, and its flexibility (e.g. compression). This organizational structure and metadata storage may sound trivial, but it's very useful in practice.

Another advantage of HDF is that the datasets can be either fixed-size or flexibly sized. Therefore, it's easy to append data to a large dataset without having to create an entire new copy.
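
For instance, with h5py a dataset created with a maxshape can be grown in place (a small sketch; the file and dataset names are made up):

import numpy as np
import h5py

with h5py.File("growing.h5", "w") as f:
    # Unlimited along the first axis; resizable datasets must use chunked storage
    dset = f.create_dataset("samples", shape=(0, 3), maxshape=(None, 3),
                            dtype="f8", chunks=(100, 3))

    new_rows = np.random.random((10, 3))
    dset.resize(dset.shape[0] + len(new_rows), axis=0)   # grow without rewriting the file
    dset[-len(new_rows):] = new_rows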

Additionally, HDF5 is a standardized format with libraries available for almost any language, so sharing your on-disk data between, say, Matlab, Fortran, R, C, and Python is very easy with HDF. (To be fair, it's not too hard with a big binary array, too, as long as you're aware of the C vs. F ordering and know the shape, dtype, etc of the stored array.)

HDF advantages for a large array: Faster I/O of an arbitrary slice

Just as the TL/DR: For an ~8GB 3D array, reading a "full" slice along any axis took ~20 seconds with a chunked HDF5 dataset, and 0.3 seconds (best-case) to over three hours (worst case) for a memmapped array of the same data.

Beyond the things listed above, there's another big advantage to a "chunked"* on-disk data format such as HDF5: Reading an arbitrary slice (emphasis on arbitrary) will typically be much faster, as the on-disk data is more contiguous on average.

*(HDF5 doesn't have to be a chunked data format. It supports chunking, but doesn't require it. In fact, the default for creating a dataset in h5py is not to chunk, if I recall correctly.)

Basically, your best case disk-read speed and your worst case disk read speed for a given slice of your dataset will be fairly close with a chunked HDF dataset (assuming you chose a reasonable chunk size or let a library choose one for you). With a simple binary array, the best-case is faster, but the worst-case is much worse.

One caveat: if you have an SSD, you likely won't notice a huge difference in read/write speed. With a regular hard drive, though, sequential reads are much, much faster than random reads. (i.e. A regular hard drive has a long seek time.) HDF still has an advantage on an SSD, but it's more due to its other features (e.g. metadata, organization, etc) than due to raw speed.


First off, to clear up confusion, accessing an h5py dataset returns an object that behaves fairly similarly to a numpy array, but does not load the data into memory until it's sliced. (Similar to memmap, but not identical.) Have a look at the h5py introduction for more information.

Slicing the dataset will load a subset of the data into memory, but presumably you want to do something with it, at which point you"ll need it in memory anyway.

If you do want to do out-of-core computations, you can do so fairly easily for tabular data with pandas or pytables. It is possible with h5py (nicer for big N-D arrays), but you need to drop down to a touch lower level and handle the iteration yourself.

However, the future of numpy-like out-of-core computations is Blaze. Have a look at it if you really want to take that route.


The "unchunked" case

First off, consider a 3D C-ordered array written to disk (I'll simulate it by calling arr.ravel() and printing the result, to make things more visible):

In [1]: import numpy as np

In [2]: arr = np.arange(4*6*6).reshape(4,6,6)

In [3]: arr
Out[3]:
array([[[  0,   1,   2,   3,   4,   5],
        [  6,   7,   8,   9,  10,  11],
        [ 12,  13,  14,  15,  16,  17],
        [ 18,  19,  20,  21,  22,  23],
        [ 24,  25,  26,  27,  28,  29],
        [ 30,  31,  32,  33,  34,  35]],

       [[ 36,  37,  38,  39,  40,  41],
        [ 42,  43,  44,  45,  46,  47],
        [ 48,  49,  50,  51,  52,  53],
        [ 54,  55,  56,  57,  58,  59],
        [ 60,  61,  62,  63,  64,  65],
        [ 66,  67,  68,  69,  70,  71]],

       [[ 72,  73,  74,  75,  76,  77],
        [ 78,  79,  80,  81,  82,  83],
        [ 84,  85,  86,  87,  88,  89],
        [ 90,  91,  92,  93,  94,  95],
        [ 96,  97,  98,  99, 100, 101],
        [102, 103, 104, 105, 106, 107]],

       [[108, 109, 110, 111, 112, 113],
        [114, 115, 116, 117, 118, 119],
        [120, 121, 122, 123, 124, 125],
        [126, 127, 128, 129, 130, 131],
        [132, 133, 134, 135, 136, 137],
        [138, 139, 140, 141, 142, 143]]])

The values would be stored on-disk sequentially as shown on line 4 below. (Let's ignore filesystem details and fragmentation for the moment.)

In [4]: arr.ravel(order="C")
Out[4]:
array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
        13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
        26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
        39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
        52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,
        65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,
        78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,
        91,  92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103,
       104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
       117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129,
       130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143])

In the best-case scenario, let's take a slice along the first axis. Notice that these are just the first 36 values of the array. This will be a very fast read! (one seek, one read)

In [5]: arr[0,:,:]
Out[5]:
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34, 35]])

Similarly, the next slice along the first axis will just be the next 36 values. To read a complete slice along this axis, we only need one seek operation. If all we're going to be reading is various slices along this axis, then this is the perfect file structure.

However, let's consider the worst-case scenario: a slice along the last axis.

In [6]: arr[:,:,0]
Out[6]:
array([[  0,   6,  12,  18,  24,  30],
       [ 36,  42,  48,  54,  60,  66],
       [ 72,  78,  84,  90,  96, 102],
       [108, 114, 120, 126, 132, 138]])

To read this slice in, we need 36 seeks and 36 reads, as all of the values are separated on disk. None of them are adjacent!

This may seem pretty minor, but as we get to larger and larger arrays, the number and size of the seek operations grows rapidly. For a large-ish (~10Gb) 3D array stored in this way and read in via memmap, reading a full slice along the "worst" axis can easily take tens of minutes, even with modern hardware. At the same time, a slice along the best axis can take less than a second. For simplicity, I'm only showing "full" slices along a single axis, but the exact same thing happens with arbitrary slices of any subset of the data.

Incidentally, there are several file formats that take advantage of this and basically store three copies of huge 3D arrays on disk: one in C-order, one in F-order, and one in the intermediate between the two. (An example of this is Geoprobe's D3D format, though I'm not sure it's documented anywhere.) Who cares if the final file size is 4TB, storage is cheap! The crazy thing about that is that because the main use case is extracting a single sub-slice in each direction, the reads you want to make are very, very fast. It works very well!


The simple "chunked" case

Let's say we store 2x2x2 "chunks" of the 3D array as contiguous blocks on disk. In other words, something like:

nx, ny, nz = arr.shape
slices = []
for i in range(0, nx, 2):
    for j in range(0, ny, 2):
        for k in range(0, nz, 2):
            slices.append((slice(i, i+2), slice(j, j+2), slice(k, k+2)))

chunked = np.hstack([arr[chunk].ravel() for chunk in slices])

So the data on disk would look like chunked:

array([  0,   1,   6,   7,  36,  37,  42,  43,   2,   3,   8,   9,  38,
        39,  44,  45,   4,   5,  10,  11,  40,  41,  46,  47,  12,  13,
        18,  19,  48,  49,  54,  55,  14,  15,  20,  21,  50,  51,  56,
        57,  16,  17,  22,  23,  52,  53,  58,  59,  24,  25,  30,  31,
        60,  61,  66,  67,  26,  27,  32,  33,  62,  63,  68,  69,  28,
        29,  34,  35,  64,  65,  70,  71,  72,  73,  78,  79, 108, 109,
       114, 115,  74,  75,  80,  81, 110, 111, 116, 117,  76,  77,  82,
        83, 112, 113, 118, 119,  84,  85,  90,  91, 120, 121, 126, 127,
        86,  87,  92,  93, 122, 123, 128, 129,  88,  89,  94,  95, 124,
       125, 130, 131,  96,  97, 102, 103, 132, 133, 138, 139,  98,  99,
       104, 105, 134, 135, 140, 141, 100, 101, 106, 107, 136, 137, 142, 143])

And just to show that they're 2x2x2 blocks of arr, notice that these are the first 8 values of chunked:

In [9]: arr[:2, :2, :2]
Out[9]:
array([[[ 0,  1],
        [ 6,  7]],

       [[36, 37],
        [42, 43]]])

To read in any slice along an axis, we'd read in either 6 or 9 contiguous chunks (twice as much data as we need) and then only keep the portion we wanted. That's a worst-case maximum of 9 seeks vs a maximum of 36 seeks for the non-chunked version. (But the best case is still 6 seeks vs 1 for the memmapped array.) Because sequential reads are very fast compared to seeks, this significantly reduces the amount of time it takes to read an arbitrary subset into memory. Once again, this effect becomes larger with larger arrays.

HDF5 takes this a few steps farther. The chunks don't have to be stored contiguously, and they're indexed by a B-Tree. Furthermore, they don't have to be the same size on disk, so compression can be applied to each chunk.


Chunked arrays with h5py

By default, h5py doesn't create chunked HDF files on disk (I think pytables does, by contrast). If you specify chunks=True when creating the dataset, however, you'll get a chunked array on disk.

As a quick, minimal example:

import numpy as np
import h5py

data = np.random.random((100, 100, 100))

with h5py.File("test.hdf", "w") as outfile:
    dset = outfile.create_dataset("a_descriptive_name", data=data, chunks=True)
    dset.attrs["some key"] = "Did you want some metadata?"

Note that chunks=True tells h5py to automatically pick a chunk size for us. If you know more about your most common use-case, you can optimize the chunk size/shape by specifying a shape tuple (e.g. (2,2,2) in the simple example above). This allows you to make reads along a particular axis more efficient or optimize for reads/writes of a certain size.
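
For example, instead of chunks=True you can pass an explicit chunk shape, and optionally per-chunk compression (the shape below is only illustrative):

import numpy as np
import h5py

data = np.random.random((100, 100, 100))

with h5py.File("tuned.hdf", "w") as outfile:
    # One chunk per index along the first axis, so a full slice data[i, :, :]
    # maps onto exactly one chunk on disk; gzip compression is applied per chunk.
    dset = outfile.create_dataset("a_descriptive_name", data=data,
                                  chunks=(1, 100, 100), compression="gzip")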


I/O Performance comparison

Just to emphasize the point, let's compare reading in slices from a chunked HDF5 dataset and a large (~8GB), Fortran-ordered 3D array containing the same exact data.

I've cleared all OS caches between each run, so we're seeing the "cold" performance.

For each file type, we'll test reading in a "full" x-slice along the first axis and a "full" z-slice along the last axis. For the Fortran-ordered memmapped array, the "x" slice is the worst case, and the "z" slice is the best case.

The code used is in a gist (including creating the hdf file). I can't easily share the data used here, but you could simulate it by an array of zeros of the same shape (621, 4991, 2600) and type np.uint8.

The chunked_hdf.py looks like this:

import sys
import h5py

def main():
    data = read()

    if sys.argv[1] == "x":
        x_slice(data)
    elif sys.argv[1] == "z":
        z_slice(data)

def read():
    f = h5py.File("/tmp/test.hdf5", "r")
    return f["seismic_volume"]

def z_slice(data):
    return data[:,:,0]

def x_slice(data):
    return data[0,:,:]

main()

memmapped_array.py is similar, but has a touch more complexity to ensure the slices are actually loaded into memory (by default, another memmapped array would be returned, which wouldn't be an apples-to-apples comparison).

import numpy as np
import sys

def main():
    data = read()

    if sys.argv[1] == "x":
        x_slice(data)
    elif sys.argv[1] == "z":
        z_slice(data)

def read():
    big_binary_filename = "/data/nankai/data/Volumes/kumdep01_flipY.3dv.vol"
    shape = 621, 4991, 2600
    header_len = 3072

    data = np.memmap(filename=big_binary_filename, mode="r", offset=header_len,
                     order="F", shape=shape, dtype=np.uint8)
    return data

def z_slice(data):
    dat = np.empty(data.shape[:2], dtype=data.dtype)
    dat[:] = data[:,:,0]
    return dat

def x_slice(data):
    dat = np.empty(data.shape[1:], dtype=data.dtype)
    dat[:] = data[0,:,:]
    return dat

main()

Let's have a look at the HDF performance first:

jofer at cornbread in ~ 
$ sudo ./clear_cache.sh

jofer at cornbread in ~ 
$ time python chunked_hdf.py z
python chunked_hdf.py z  0.64s user 0.28s system 3% cpu 23.800 total

jofer at cornbread in ~ 
$ sudo ./clear_cache.sh

jofer at cornbread in ~ 
$ time python chunked_hdf.py x
python chunked_hdf.py x  0.12s user 0.30s system 1% cpu 21.856 total

A "full" x-slice and a "full" z-slice take about the same amount of time (~20sec). Considering this is an 8GB array, that"s not too bad. Most of the time

And if we compare this to the memmapped array times (it's Fortran-ordered: a "z-slice" is the best case and an "x-slice" is the worst case):

jofer at cornbread in ~ 
$ sudo ./clear_cache.sh

jofer at cornbread in ~ 
$ time python memmapped_array.py z
python memmapped_array.py z  0.07s user 0.04s system 28% cpu 0.385 total

jofer at cornbread in ~ 
$ sudo ./clear_cache.sh

jofer at cornbread in ~ 
$ time python memmapped_array.py x
python memmapped_array.py x  2.46s user 37.24s system 0% cpu 3:35:26.85 total

Yes, you read that right. 0.3 seconds for one slice direction and ~3.5 hours for the other.

The time to slice in the "x" direction is far longer than the amount of time it would take to load the entire 8GB array into memory and select the slice we wanted! (Again, this is a Fortran-ordered array. The opposite x/z slice timing would be the case for a C-ordered array.)

However, if we're always wanting to take a slice along the best-case direction, the big binary array on disk is very good. (~0.3 sec!)

With a memmapped array, you're stuck with this I/O discrepancy (or perhaps anisotropy is a better term). However, with a chunked HDF dataset, you can choose the chunksize such that access is either equal or is optimized for a particular use-case. It gives you a lot more flexibility.

In summary

Hopefully that helps clear up one part of your question, at any rate. HDF5 has many other advantages over "raw" memmaps, but I don't have room to expand on all of them here. Compression can speed some things up (the data I work with doesn't benefit much from compression, so I rarely use it), and OS-level caching often plays more nicely with HDF5 files than with "raw" memmaps. Beyond that, HDF5 is a really fantastic container format. It gives you a lot of flexibility in managing your data, and can be used from more or less any programming language.

Overall, try it and see if it works well for your use case. I think you might be surprised.
