Progress indicator during pandas operations

| | |

👻 Check our latest review to choose the best laptop for Machine Learning engineers and Deep learning tasks!

I regularly perform pandas operations on data frames in excess of 15 million or so rows and I"d love to have access to a progress indicator for particular operations.

Does a text based progress indicator for pandas split-apply-combine operations exist?

For example, in something like:

df_users.groupby(["userID", "requestDate"]).apply(feature_rollup)

where feature_rollup is a somewhat involved function that take many DF columns and creates new user columns through various methods. These operations can take a while for large data frames so I"d like to know if it is possible to have text based output in an iPython notebook that updates me on the progress.

So far, I"ve tried canonical loop progress indicators for Python but they don"t interact with pandas in any meaningful way.

I"m hoping there"s something I"ve overlooked in the pandas library/documentation that allows one to know the progress of a split-apply-combine. A simple implementation would maybe look at the total number of data frame subsets upon which the apply function is working and report progress as the completed fraction of those subsets.

Is this perhaps something that needs to be added to the library?

👻 Read also: what is the best laptop for engineering students?

Progress indicator during pandas operations mean: Questions

Meaning of @classmethod and @staticmethod for beginner?

5 answers

user1632861 By user1632861

Could someone explain to me the meaning of @classmethod and @staticmethod in python? I need to know the difference and the meaning.

As far as I understand, @classmethod tells a class that it"s a method which should be inherited into subclasses, or... something. However, what"s the point of that? Why not just define the class method without adding @classmethod or @staticmethod or any @ definitions?

tl;dr: when should I use them, why should I use them, and how should I use them?

1726

Answer #1

Though classmethod and staticmethod are quite similar, there"s a slight difference in usage for both entities: classmethod must have a reference to a class object as the first parameter, whereas staticmethod can have no parameters at all.

Example

class Date(object):

    def __init__(self, day=0, month=0, year=0):
        self.day = day
        self.month = month
        self.year = year

    @classmethod
    def from_string(cls, date_as_string):
        day, month, year = map(int, date_as_string.split("-"))
        date1 = cls(day, month, year)
        return date1

    @staticmethod
    def is_date_valid(date_as_string):
        day, month, year = map(int, date_as_string.split("-"))
        return day <= 31 and month <= 12 and year <= 3999

date2 = Date.from_string("11-09-2012")
is_date = Date.is_date_valid("11-09-2012")

Explanation

Let"s assume an example of a class, dealing with date information (this will be our boilerplate):

class Date(object):

    def __init__(self, day=0, month=0, year=0):
        self.day = day
        self.month = month
        self.year = year

This class obviously could be used to store information about certain dates (without timezone information; let"s assume all dates are presented in UTC).

Here we have __init__, a typical initializer of Python class instances, which receives arguments as a typical instancemethod, having the first non-optional argument (self) that holds a reference to a newly created instance.

Class Method

We have some tasks that can be nicely done using classmethods.

Let"s assume that we want to create a lot of Date class instances having date information coming from an outer source encoded as a string with format "dd-mm-yyyy". Suppose we have to do this in different places in the source code of our project.

So what we must do here is:

  1. Parse a string to receive day, month and year as three integer variables or a 3-item tuple consisting of that variable.
  2. Instantiate Date by passing those values to the initialization call.

This will look like:

day, month, year = map(int, string_date.split("-"))
date1 = Date(day, month, year)

For this purpose, C++ can implement such a feature with overloading, but Python lacks this overloading. Instead, we can use classmethod. Let"s create another "constructor".

    @classmethod
    def from_string(cls, date_as_string):
        day, month, year = map(int, date_as_string.split("-"))
        date1 = cls(day, month, year)
        return date1

date2 = Date.from_string("11-09-2012")

Let"s look more carefully at the above implementation, and review what advantages we have here:

  1. We"ve implemented date string parsing in one place and it"s reusable now.
  2. Encapsulation works fine here (if you think that you could implement string parsing as a single function elsewhere, this solution fits the OOP paradigm far better).
  3. cls is an object that holds the class itself, not an instance of the class. It"s pretty cool because if we inherit our Date class, all children will have from_string defined also.

Static method

What about staticmethod? It"s pretty similar to classmethod but doesn"t take any obligatory parameters (like a class method or instance method does).

Let"s look at the next use case.

We have a date string that we want to validate somehow. This task is also logically bound to the Date class we"ve used so far, but doesn"t require instantiation of it.

Here is where staticmethod can be useful. Let"s look at the next piece of code:

    @staticmethod
    def is_date_valid(date_as_string):
        day, month, year = map(int, date_as_string.split("-"))
        return day <= 31 and month <= 12 and year <= 3999

    # usage:
    is_date = Date.is_date_valid("11-09-2012")

So, as we can see from usage of staticmethod, we don"t have any access to what the class is---it"s basically just a function, called syntactically like a method, but without access to the object and its internals (fields and another methods), while classmethod does.

1726

Answer #2

Rostyslav Dzinko"s answer is very appropriate. I thought I could highlight one other reason you should choose @classmethod over @staticmethod when you are creating an additional constructor.

In the example above, Rostyslav used the @classmethod from_string as a Factory to create Date objects from otherwise unacceptable parameters. The same can be done with @staticmethod as is shown in the code below:

class Date:
  def __init__(self, month, day, year):
    self.month = month
    self.day   = day
    self.year  = year


  def display(self):
    return "{0}-{1}-{2}".format(self.month, self.day, self.year)


  @staticmethod
  def millenium(month, day):
    return Date(month, day, 2000)

new_year = Date(1, 1, 2013)               # Creates a new Date object
millenium_new_year = Date.millenium(1, 1) # also creates a Date object. 

# Proof:
new_year.display()           # "1-1-2013"
millenium_new_year.display() # "1-1-2000"

isinstance(new_year, Date) # True
isinstance(millenium_new_year, Date) # True

Thus both new_year and millenium_new_year are instances of the Date class.

But, if you observe closely, the Factory process is hard-coded to create Date objects no matter what. What this means is that even if the Date class is subclassed, the subclasses will still create plain Date objects (without any properties of the subclass). See that in the example below:

class DateTime(Date):
  def display(self):
      return "{0}-{1}-{2} - 00:00:00PM".format(self.month, self.day, self.year)


datetime1 = DateTime(10, 10, 1990)
datetime2 = DateTime.millenium(10, 10)

isinstance(datetime1, DateTime) # True
isinstance(datetime2, DateTime) # False

datetime1.display() # returns "10-10-1990 - 00:00:00PM"
datetime2.display() # returns "10-10-2000" because it"s not a DateTime object but a Date object. Check the implementation of the millenium method on the Date class for more details.

datetime2 is not an instance of DateTime? WTF? Well, that"s because of the @staticmethod decorator used.

In most cases, this is undesired. If what you want is a Factory method that is aware of the class that called it, then @classmethod is what you need.

Rewriting Date.millenium as (that"s the only part of the above code that changes):

@classmethod
def millenium(cls, month, day):
    return cls(month, day, 2000)

ensures that the class is not hard-coded but rather learnt. cls can be any subclass. The resulting object will rightly be an instance of cls.
Let"s test that out:

datetime1 = DateTime(10, 10, 1990)
datetime2 = DateTime.millenium(10, 10)

isinstance(datetime1, DateTime) # True
isinstance(datetime2, DateTime) # True


datetime1.display() # "10-10-1990 - 00:00:00PM"
datetime2.display() # "10-10-2000 - 00:00:00PM"

The reason is, as you know by now, that @classmethod was used instead of @staticmethod

1726

Answer #3

@classmethod means: when this method is called, we pass the class as the first argument instead of the instance of that class (as we normally do with methods). This means you can use the class and its properties inside that method rather than a particular instance.

@staticmethod means: when this method is called, we don"t pass an instance of the class to it (as we normally do with methods). This means you can put a function inside a class but you can"t access the instance of that class (this is useful when your method does not use the instance).

How do you split a list into evenly sized chunks?

5 answers

jespern By jespern

I have a list of arbitrary length, and I need to split it up into equal size chunks and operate on it. There are some obvious ways to do this, like keeping a counter and two lists, and when the second list fills up, add it to the first list and empty the second list for the next round of data, but this is potentially extremely expensive.

I was wondering if anyone had a good solution to this for lists of any length, e.g. using generators.

I was looking for something useful in itertools but I couldn"t find anything obviously useful. Might"ve missed it, though.

Related question: What is the most “pythonic” way to iterate over a list in chunks?

2632

Answer #1

Here"s a generator that yields the chunks you want:

def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

import pprint
pprint.pprint(list(chunks(range(10, 75), 10)))
[[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
 [20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
 [30, 31, 32, 33, 34, 35, 36, 37, 38, 39],
 [40, 41, 42, 43, 44, 45, 46, 47, 48, 49],
 [50, 51, 52, 53, 54, 55, 56, 57, 58, 59],
 [60, 61, 62, 63, 64, 65, 66, 67, 68, 69],
 [70, 71, 72, 73, 74]]

If you"re using Python 2, you should use xrange() instead of range():

def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in xrange(0, len(lst), n):
        yield lst[i:i + n]

Also you can simply use list comprehension instead of writing a function, though it"s a good idea to encapsulate operations like this in named functions so that your code is easier to understand. Python 3:

[lst[i:i + n] for i in range(0, len(lst), n)]

Python 2 version:

[lst[i:i + n] for i in xrange(0, len(lst), n)]

2632

Answer #2

If you want something super simple:

def chunks(l, n):
    n = max(1, n)
    return (l[i:i+n] for i in range(0, len(l), n))

Use xrange() instead of range() in the case of Python 2.x

2632

Answer #3

Directly from the (old) Python documentation (recipes for itertools):

from itertools import izip, chain, repeat

def grouper(n, iterable, padvalue=None):
    "grouper(3, "abcdefg", "x") --> ("a","b","c"), ("d","e","f"), ("g","x","x")"
    return izip(*[chain(iterable, repeat(padvalue, n-1))]*n)

The current version, as suggested by J.F.Sebastian:

#from itertools import izip_longest as zip_longest # for Python 2.x
from itertools import zip_longest # for Python 3.x
#from six.moves import zip_longest # for both (uses the six compat library)

def grouper(n, iterable, padvalue=None):
    "grouper(3, "abcdefg", "x") --> ("a","b","c"), ("d","e","f"), ("g","x","x")"
    return zip_longest(*[iter(iterable)]*n, fillvalue=padvalue)

I guess Guido"s time machine works—worked—will work—will have worked—was working again.

These solutions work because [iter(iterable)]*n (or the equivalent in the earlier version) creates one iterator, repeated n times in the list. izip_longest then effectively performs a round-robin of "each" iterator; because this is the same iterator, it is advanced by each such call, resulting in each such zip-roundrobin generating one tuple of n items.

We hope this article has helped you to resolve the problem. Apart from Progress indicator during pandas operations, check other mean-related topics.

Want to excel in Python? See our review of the best Python online courses 2022. If you are interested in Data Science, check also how to learn programming in R.

By the way, this material is also available in other languages:



Oliver Krasiko

Prague | 2022-11-30

Thanks for explaining! I was stuck with Progress indicator during pandas operations for some hours, finally got it done 🤗. Will use it in my bachelor thesis

Ken Robinson

Abu Dhabi | 2022-11-30

Maybe there are another answers? What Progress indicator during pandas operations exactly means?. Will use it in my bachelor thesis

Javier Williams

Moscow | 2022-11-30

Thanks for explaining! I was stuck with Progress indicator during pandas operations for some hours, finally got it done 🤗. I just hope that will not emerge anymore

Shop

Learn programming in R: courses

$

Best Python online courses for 2022

$

Best laptop for Fortnite

$

Best laptop for Excel

$

Best laptop for Solidworks

$

Best laptop for Roblox

$

Best computer for crypto mining

$

Best laptop for Sims 4

$

Latest questions

NUMPYNUMPY

Common xlabel/ylabel for matplotlib subplots

12 answers

NUMPYNUMPY

How to specify multiple return types using type-hints

12 answers

NUMPYNUMPY

Why do I get "Pickle - EOFError: Ran out of input" reading an empty file?

12 answers

NUMPYNUMPY

Flake8: Ignore specific warning for entire file

12 answers

NUMPYNUMPY

glob exclude pattern

12 answers

NUMPYNUMPY

How to avoid HTTP error 429 (Too Many Requests) python

12 answers

NUMPYNUMPY

Python CSV error: line contains NULL byte

12 answers

NUMPYNUMPY

csv.Error: iterator should return strings, not bytes

12 answers

News


Wiki

Python | How to copy data from one Excel sheet to another

Common xlabel/ylabel for matplotlib subplots

Check if one list is a subset of another in Python

sin

How to specify multiple return types using type-hints

exp

Printing words vertically in Python

exp

Python Extract words from a given string

Cyclic redundancy check in Python

Finding mean, median, mode in Python without libraries

cos

Python add suffix / add prefix to strings in a list

Why do I get "Pickle - EOFError: Ran out of input" reading an empty file?

Python - Move item to the end of the list

Python - Print list vertically