Find indices of elements equal to zero in a NumPy array

find | nonzero | StackOverflow

NumPy has the efficient function/method nonzero() to identify the indices of non-zero elements in an ndarray object. What is the most efficient way to obtain the indices of the elements that do have a value of zero?

Answer rating: 252

numpy.where() is my favorite.

>>> x = numpy.array([1,0,2,0,3,0,4,5,6,7,8])
>>> numpy.where(x == 0)[0]
array([1, 3, 5])




Find indices of elements equal to zero in a NumPy array: StackOverflow Questions

Finding the index of an item in a list

Given a list ["foo", "bar", "baz"] and an item in the list "bar", how do I get its index (1) in Python?

Find current directory and file"s directory

In Python, what commands can I use to find:

  1. the current directory (where I was in the terminal when I ran the Python script), and
  2. where the file I am executing is?

How to find if directory exists in Python

In the os module in Python, is there a way to find if a directory exists, something like:

>>> os.direxists(os.path.join(os.getcwd()), "new_folder")) # in pseudocode
True/False

How do I find the location of my Python site-packages directory?

Question by Daryl Spitzer

How do I find the location of my site-packages directory?

Find all files in a directory with extension .txt in Python

How can I find all the files in a directory having the extension .txt in python?

Find which version of package is installed with pip

Using pip, is it possible to figure out which version of a package is currently installed?

I know about pip install XYZ --upgrade but I am wondering if there is anything like pip info XYZ. If not what would be the best way to tell what version I am currently using.

error: Unable to find vcvarsall.bat

I tried to install the Python package dulwich:

pip install dulwich

But I get a cryptic error message:

error: Unable to find vcvarsall.bat

The same happens if I try installing the package manually:

> python setup.py install
running build_ext
building "dulwich._objects" extension
error: Unable to find vcvarsall.bat

How to use glob() to find files recursively?

This is what I have:

glob(os.path.join("src","*.c"))

but I want to search the subfolders of src. Something like this would work:

glob(os.path.join("src","*.c"))
glob(os.path.join("src","*","*.c"))
glob(os.path.join("src","*","*","*.c"))
glob(os.path.join("src","*","*","*","*.c"))

But this is obviously limited and clunky.

Python: Find in list

I have come across this:

item = someSortOfSelection()
if item in myList:
    doMySpecialFunction(item)

but sometimes it does not work with all my items, as if they weren"t recognized in the list (when it"s a list of string).

Is this the most "pythonic" way of finding an item in a list: if x in l:?

How to find out the number of CPUs using python

I want to know the number of CPUs on the local machine using Python. The result should be user/real as output by time(1) when called with an optimally scaling userspace-only program.

Answer #1

How to iterate over rows in a DataFrame in Pandas?

Answer: DON"T*!

Iteration in Pandas is an anti-pattern and is something you should only do when you have exhausted every other option. You should not use any function with "iter" in its name for more than a few thousand rows or you will have to get used to a lot of waiting.

Do you want to print a DataFrame? Use DataFrame.to_string().

Do you want to compute something? In that case, search for methods in this order (list modified from here):

  1. Vectorization
  2. Cython routines
  3. List Comprehensions (vanilla for loop)
  4. DataFrame.apply(): i)  Reductions that can be performed in Cython, ii) Iteration in Python space
  5. DataFrame.itertuples() and iteritems()
  6. DataFrame.iterrows()

iterrows and itertuples (both receiving many votes in answers to this question) should be used in very rare circumstances, such as generating row objects/nametuples for sequential processing, which is really the only thing these functions are useful for.

Appeal to Authority

The documentation page on iteration has a huge red warning box that says:

Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is not needed [...].

* It"s actually a little more complicated than "don"t". df.iterrows() is the correct answer to this question, but "vectorize your ops" is the better one. I will concede that there are circumstances where iteration cannot be avoided (for example, some operations where the result depends on the value computed for the previous row). However, it takes some familiarity with the library to know when. If you"re not sure whether you need an iterative solution, you probably don"t. PS: To know more about my rationale for writing this answer, skip to the very bottom.


Faster than Looping: Vectorization, Cython

A good number of basic operations and computations are "vectorised" by pandas (either through NumPy, or through Cythonized functions). This includes arithmetic, comparisons, (most) reductions, reshaping (such as pivoting), joins, and groupby operations. Look through the documentation on Essential Basic Functionality to find a suitable vectorised method for your problem.

If none exists, feel free to write your own using custom Cython extensions.


Next Best Thing: List Comprehensions*

List comprehensions should be your next port of call if 1) there is no vectorized solution available, 2) performance is important, but not important enough to go through the hassle of cythonizing your code, and 3) you"re trying to perform elementwise transformation on your code. There is a good amount of evidence to suggest that list comprehensions are sufficiently fast (and even sometimes faster) for many common Pandas tasks.

The formula is simple,

# Iterating over one column - `f` is some function that processes your data
result = [f(x) for x in df["col"]]
# Iterating over two columns, use `zip`
result = [f(x, y) for x, y in zip(df["col1"], df["col2"])]
# Iterating over multiple columns - same data type
result = [f(row[0], ..., row[n]) for row in df[["col1", ...,"coln"]].to_numpy()]
# Iterating over multiple columns - differing data type
result = [f(row[0], ..., row[n]) for row in zip(df["col1"], ..., df["coln"])]

If you can encapsulate your business logic into a function, you can use a list comprehension that calls it. You can make arbitrarily complex things work through the simplicity and speed of raw Python code.

Caveats

List comprehensions assume that your data is easy to work with - what that means is your data types are consistent and you don"t have NaNs, but this cannot always be guaranteed.

  1. The first one is more obvious, but when dealing with NaNs, prefer in-built pandas methods if they exist (because they have much better corner-case handling logic), or ensure your business logic includes appropriate NaN handling logic.
  2. When dealing with mixed data types you should iterate over zip(df["A"], df["B"], ...) instead of df[["A", "B"]].to_numpy() as the latter implicitly upcasts data to the most common type. As an example if A is numeric and B is string, to_numpy() will cast the entire array to string, which may not be what you want. Fortunately zipping your columns together is the most straightforward workaround to this.

*Your mileage may vary for the reasons outlined in the Caveats section above.


An Obvious Example

Let"s demonstrate the difference with a simple example of adding two pandas columns A + B. This is a vectorizable operaton, so it will be easy to contrast the performance of the methods discussed above.

Benchmarking code, for your reference. The line at the bottom measures a function written in numpandas, a style of Pandas that mixes heavily with NumPy to squeeze out maximum performance. Writing numpandas code should be avoided unless you know what you"re doing. Stick to the API where you can (i.e., prefer vec over vec_numpy).

I should mention, however, that it isn"t always this cut and dry. Sometimes the answer to "what is the best method for an operation" is "it depends on your data". My advice is to test out different approaches on your data before settling on one.


Further Reading

* Pandas string methods are "vectorized" in the sense that they are specified on the series but operate on each element. The underlying mechanisms are still iterative, because string operations are inherently hard to vectorize.


Why I Wrote this Answer

A common trend I notice from new users is to ask questions of the form "How can I iterate over my df to do X?". Showing code that calls iterrows() while doing something inside a for loop. Here is why. A new user to the library who has not been introduced to the concept of vectorization will likely envision the code that solves their problem as iterating over their data to do something. Not knowing how to iterate over a DataFrame, the first thing they do is Google it and end up here, at this question. They then see the accepted answer telling them how to, and they close their eyes and run this code without ever first questioning if iteration is not the right thing to do.

The aim of this answer is to help new users understand that iteration is not necessarily the solution to every problem, and that better, faster and more idiomatic solutions could exist, and that it is worth investing time in exploring them. I"m not trying to start a war of iteration vs. vectorization, but I want new users to be informed when developing solutions to their problems with this library.

Answer #2

In Python, what is the purpose of __slots__ and what are the cases one should avoid this?

TLDR:

The special attribute __slots__ allows you to explicitly state which instance attributes you expect your object instances to have, with the expected results:

  1. faster attribute access.
  2. space savings in memory.

The space savings is from

  1. Storing value references in slots instead of __dict__.
  2. Denying __dict__ and __weakref__ creation if parent classes deny them and you declare __slots__.

Quick Caveats

Small caveat, you should only declare a particular slot one time in an inheritance tree. For example:

class Base:
    __slots__ = "foo", "bar"

class Right(Base):
    __slots__ = "baz", 

class Wrong(Base):
    __slots__ = "foo", "bar", "baz"        # redundant foo and bar

Python doesn"t object when you get this wrong (it probably should), problems might not otherwise manifest, but your objects will take up more space than they otherwise should. Python 3.8:

>>> from sys import getsizeof
>>> getsizeof(Right()), getsizeof(Wrong())
(56, 72)

This is because the Base"s slot descriptor has a slot separate from the Wrong"s. This shouldn"t usually come up, but it could:

>>> w = Wrong()
>>> w.foo = "foo"
>>> Base.foo.__get__(w)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: foo
>>> Wrong.foo.__get__(w)
"foo"

The biggest caveat is for multiple inheritance - multiple "parent classes with nonempty slots" cannot be combined.

To accommodate this restriction, follow best practices: Factor out all but one or all parents" abstraction which their concrete class respectively and your new concrete class collectively will inherit from - giving the abstraction(s) empty slots (just like abstract base classes in the standard library).

See section on multiple inheritance below for an example.

Requirements:

  • To have attributes named in __slots__ to actually be stored in slots instead of a __dict__, a class must inherit from object (automatic in Python 3, but must be explicit in Python 2).

  • To prevent the creation of a __dict__, you must inherit from object and all classes in the inheritance must declare __slots__ and none of them can have a "__dict__" entry.

There are a lot of details if you wish to keep reading.

Why use __slots__: Faster attribute access.

The creator of Python, Guido van Rossum, states that he actually created __slots__ for faster attribute access.

It is trivial to demonstrate measurably significant faster access:

import timeit

class Foo(object): __slots__ = "foo",

class Bar(object): pass

slotted = Foo()
not_slotted = Bar()

def get_set_delete_fn(obj):
    def get_set_delete():
        obj.foo = "foo"
        obj.foo
        del obj.foo
    return get_set_delete

and

>>> min(timeit.repeat(get_set_delete_fn(slotted)))
0.2846834529991611
>>> min(timeit.repeat(get_set_delete_fn(not_slotted)))
0.3664822799983085

The slotted access is almost 30% faster in Python 3.5 on Ubuntu.

>>> 0.3664822799983085 / 0.2846834529991611
1.2873325658284342

In Python 2 on Windows I have measured it about 15% faster.

Why use __slots__: Memory Savings

Another purpose of __slots__ is to reduce the space in memory that each object instance takes up.

My own contribution to the documentation clearly states the reasons behind this:

The space saved over using __dict__ can be significant.

SQLAlchemy attributes a lot of memory savings to __slots__.

To verify this, using the Anaconda distribution of Python 2.7 on Ubuntu Linux, with guppy.hpy (aka heapy) and sys.getsizeof, the size of a class instance without __slots__ declared, and nothing else, is 64 bytes. That does not include the __dict__. Thank you Python for lazy evaluation again, the __dict__ is apparently not called into existence until it is referenced, but classes without data are usually useless. When called into existence, the __dict__ attribute is a minimum of 280 bytes additionally.

In contrast, a class instance with __slots__ declared to be () (no data) is only 16 bytes, and 56 total bytes with one item in slots, 64 with two.

For 64 bit Python, I illustrate the memory consumption in bytes in Python 2.7 and 3.6, for __slots__ and __dict__ (no slots defined) for each point where the dict grows in 3.6 (except for 0, 1, and 2 attributes):

       Python 2.7             Python 3.6
attrs  __slots__  __dict__*   __slots__  __dict__* | *(no slots defined)
none   16         56 + 272†   16         56 + 112† | †if __dict__ referenced
one    48         56 + 272    48         56 + 112
two    56         56 + 272    56         56 + 112
six    88         56 + 1040   88         56 + 152
11     128        56 + 1040   128        56 + 240
22     216        56 + 3344   216        56 + 408     
43     384        56 + 3344   384        56 + 752

So, in spite of smaller dicts in Python 3, we see how nicely __slots__ scale for instances to save us memory, and that is a major reason you would want to use __slots__.

Just for completeness of my notes, note that there is a one-time cost per slot in the class"s namespace of 64 bytes in Python 2, and 72 bytes in Python 3, because slots use data descriptors like properties, called "members".

>>> Foo.foo
<member "foo" of "Foo" objects>
>>> type(Foo.foo)
<class "member_descriptor">
>>> getsizeof(Foo.foo)
72

Demonstration of __slots__:

To deny the creation of a __dict__, you must subclass object. Everything subclasses object in Python 3, but in Python 2 you had to be explicit:

class Base(object): 
    __slots__ = ()

now:

>>> b = Base()
>>> b.a = "a"
Traceback (most recent call last):
  File "<pyshell#38>", line 1, in <module>
    b.a = "a"
AttributeError: "Base" object has no attribute "a"

Or subclass another class that defines __slots__

class Child(Base):
    __slots__ = ("a",)

and now:

c = Child()
c.a = "a"

but:

>>> c.b = "b"
Traceback (most recent call last):
  File "<pyshell#42>", line 1, in <module>
    c.b = "b"
AttributeError: "Child" object has no attribute "b"

To allow __dict__ creation while subclassing slotted objects, just add "__dict__" to the __slots__ (note that slots are ordered, and you shouldn"t repeat slots that are already in parent classes):

class SlottedWithDict(Child): 
    __slots__ = ("__dict__", "b")

swd = SlottedWithDict()
swd.a = "a"
swd.b = "b"
swd.c = "c"

and

>>> swd.__dict__
{"c": "c"}

Or you don"t even need to declare __slots__ in your subclass, and you will still use slots from the parents, but not restrict the creation of a __dict__:

class NoSlots(Child): pass
ns = NoSlots()
ns.a = "a"
ns.b = "b"

And:

>>> ns.__dict__
{"b": "b"}

However, __slots__ may cause problems for multiple inheritance:

class BaseA(object): 
    __slots__ = ("a",)

class BaseB(object): 
    __slots__ = ("b",)

Because creating a child class from parents with both non-empty slots fails:

>>> class Child(BaseA, BaseB): __slots__ = ()
Traceback (most recent call last):
  File "<pyshell#68>", line 1, in <module>
    class Child(BaseA, BaseB): __slots__ = ()
TypeError: Error when calling the metaclass bases
    multiple bases have instance lay-out conflict

If you run into this problem, You could just remove __slots__ from the parents, or if you have control of the parents, give them empty slots, or refactor to abstractions:

from abc import ABC

class AbstractA(ABC):
    __slots__ = ()

class BaseA(AbstractA): 
    __slots__ = ("a",)

class AbstractB(ABC):
    __slots__ = ()

class BaseB(AbstractB): 
    __slots__ = ("b",)

class Child(AbstractA, AbstractB): 
    __slots__ = ("a", "b")

c = Child() # no problem!

Add "__dict__" to __slots__ to get dynamic assignment:

class Foo(object):
    __slots__ = "bar", "baz", "__dict__"

and now:

>>> foo = Foo()
>>> foo.boink = "boink"

So with "__dict__" in slots we lose some of the size benefits with the upside of having dynamic assignment and still having slots for the names we do expect.

When you inherit from an object that isn"t slotted, you get the same sort of semantics when you use __slots__ - names that are in __slots__ point to slotted values, while any other values are put in the instance"s __dict__.

Avoiding __slots__ because you want to be able to add attributes on the fly is actually not a good reason - just add "__dict__" to your __slots__ if this is required.

You can similarly add __weakref__ to __slots__ explicitly if you need that feature.

Set to empty tuple when subclassing a namedtuple:

The namedtuple builtin make immutable instances that are very lightweight (essentially, the size of tuples) but to get the benefits, you need to do it yourself if you subclass them:

from collections import namedtuple
class MyNT(namedtuple("MyNT", "bar baz")):
    """MyNT is an immutable and lightweight object"""
    __slots__ = ()

usage:

>>> nt = MyNT("bar", "baz")
>>> nt.bar
"bar"
>>> nt.baz
"baz"

And trying to assign an unexpected attribute raises an AttributeError because we have prevented the creation of __dict__:

>>> nt.quux = "quux"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: "MyNT" object has no attribute "quux"

You can allow __dict__ creation by leaving off __slots__ = (), but you can"t use non-empty __slots__ with subtypes of tuple.

Biggest Caveat: Multiple inheritance

Even when non-empty slots are the same for multiple parents, they cannot be used together:

class Foo(object): 
    __slots__ = "foo", "bar"
class Bar(object):
    __slots__ = "foo", "bar" # alas, would work if empty, i.e. ()

>>> class Baz(Foo, Bar): pass
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Error when calling the metaclass bases
    multiple bases have instance lay-out conflict

Using an empty __slots__ in the parent seems to provide the most flexibility, allowing the child to choose to prevent or allow (by adding "__dict__" to get dynamic assignment, see section above) the creation of a __dict__:

class Foo(object): __slots__ = ()
class Bar(object): __slots__ = ()
class Baz(Foo, Bar): __slots__ = ("foo", "bar")
b = Baz()
b.foo, b.bar = "foo", "bar"

You don"t have to have slots - so if you add them, and remove them later, it shouldn"t cause any problems.

Going out on a limb here: If you"re composing mixins or using abstract base classes, which aren"t intended to be instantiated, an empty __slots__ in those parents seems to be the best way to go in terms of flexibility for subclassers.

To demonstrate, first, let"s create a class with code we"d like to use under multiple inheritance

class AbstractBase:
    __slots__ = ()
    def __init__(self, a, b):
        self.a = a
        self.b = b
    def __repr__(self):
        return f"{type(self).__name__}({repr(self.a)}, {repr(self.b)})"

We could use the above directly by inheriting and declaring the expected slots:

class Foo(AbstractBase):
    __slots__ = "a", "b"

But we don"t care about that, that"s trivial single inheritance, we need another class we might also inherit from, maybe with a noisy attribute:

class AbstractBaseC:
    __slots__ = ()
    @property
    def c(self):
        print("getting c!")
        return self._c
    @c.setter
    def c(self, arg):
        print("setting c!")
        self._c = arg

Now if both bases had nonempty slots, we couldn"t do the below. (In fact, if we wanted, we could have given AbstractBase nonempty slots a and b, and left them out of the below declaration - leaving them in would be wrong):

class Concretion(AbstractBase, AbstractBaseC):
    __slots__ = "a b _c".split()

And now we have functionality from both via multiple inheritance, and can still deny __dict__ and __weakref__ instantiation:

>>> c = Concretion("a", "b")
>>> c.c = c
setting c!
>>> c.c
getting c!
Concretion("a", "b")
>>> c.d = "d"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: "Concretion" object has no attribute "d"

Other cases to avoid slots:

  • Avoid them when you want to perform __class__ assignment with another class that doesn"t have them (and you can"t add them) unless the slot layouts are identical. (I am very interested in learning who is doing this and why.)
  • Avoid them if you want to subclass variable length builtins like long, tuple, or str, and you want to add attributes to them.
  • Avoid them if you insist on providing default values via class attributes for instance variables.

You may be able to tease out further caveats from the rest of the __slots__ documentation (the 3.7 dev docs are the most current), which I have made significant recent contributions to.

Critiques of other answers

The current top answers cite outdated information and are quite hand-wavy and miss the mark in some important ways.

Do not "only use __slots__ when instantiating lots of objects"

I quote:

"You would want to use __slots__ if you are going to instantiate a lot (hundreds, thousands) of objects of the same class."

Abstract Base Classes, for example, from the collections module, are not instantiated, yet __slots__ are declared for them.

Why?

If a user wishes to deny __dict__ or __weakref__ creation, those things must not be available in the parent classes.

__slots__ contributes to reusability when creating interfaces or mixins.

It is true that many Python users aren"t writing for reusability, but when you are, having the option to deny unnecessary space usage is valuable.

__slots__ doesn"t break pickling

When pickling a slotted object, you may find it complains with a misleading TypeError:

>>> pickle.loads(pickle.dumps(f))
TypeError: a class that defines __slots__ without defining __getstate__ cannot be pickled

This is actually incorrect. This message comes from the oldest protocol, which is the default. You can select the latest protocol with the -1 argument. In Python 2.7 this would be 2 (which was introduced in 2.3), and in 3.6 it is 4.

>>> pickle.loads(pickle.dumps(f, -1))
<__main__.Foo object at 0x1129C770>

in Python 2.7:

>>> pickle.loads(pickle.dumps(f, 2))
<__main__.Foo object at 0x1129C770>

in Python 3.6

>>> pickle.loads(pickle.dumps(f, 4))
<__main__.Foo object at 0x1129C770>

So I would keep this in mind, as it is a solved problem.

Critique of the (until Oct 2, 2016) accepted answer

The first paragraph is half short explanation, half predictive. Here"s the only part that actually answers the question

The proper use of __slots__ is to save space in objects. Instead of having a dynamic dict that allows adding attributes to objects at anytime, there is a static structure which does not allow additions after creation. This saves the overhead of one dict for every object that uses slots

The second half is wishful thinking, and off the mark:

While this is sometimes a useful optimization, it would be completely unnecessary if the Python interpreter was dynamic enough so that it would only require the dict when there actually were additions to the object.

Python actually does something similar to this, only creating the __dict__ when it is accessed, but creating lots of objects with no data is fairly ridiculous.

The second paragraph oversimplifies and misses actual reasons to avoid __slots__. The below is not a real reason to avoid slots (for actual reasons, see the rest of my answer above.):

They change the behavior of the objects that have slots in a way that can be abused by control freaks and static typing weenies.

It then goes on to discuss other ways of accomplishing that perverse goal with Python, not discussing anything to do with __slots__.

The third paragraph is more wishful thinking. Together it is mostly off-the-mark content that the answerer didn"t even author and contributes to ammunition for critics of the site.

Memory usage evidence

Create some normal objects and slotted objects:

>>> class Foo(object): pass
>>> class Bar(object): __slots__ = ()

Instantiate a million of them:

>>> foos = [Foo() for f in xrange(1000000)]
>>> bars = [Bar() for b in xrange(1000000)]

Inspect with guppy.hpy().heap():

>>> guppy.hpy().heap()
Partition of a set of 2028259 objects. Total size = 99763360 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0 1000000  49 64000000  64  64000000  64 __main__.Foo
     1     169   0 16281480  16  80281480  80 list
     2 1000000  49 16000000  16  96281480  97 __main__.Bar
     3   12284   1   987472   1  97268952  97 str
...

Access the regular objects and their __dict__ and inspect again:

>>> for f in foos:
...     f.__dict__
>>> guppy.hpy().heap()
Partition of a set of 3028258 objects. Total size = 379763480 bytes.
 Index  Count   %      Size    % Cumulative  % Kind (class / dict of class)
     0 1000000  33 280000000  74 280000000  74 dict of __main__.Foo
     1 1000000  33  64000000  17 344000000  91 __main__.Foo
     2     169   0  16281480   4 360281480  95 list
     3 1000000  33  16000000   4 376281480  99 __main__.Bar
     4   12284   0    987472   0 377268952  99 str
...

This is consistent with the history of Python, from Unifying types and classes in Python 2.2

If you subclass a built-in type, extra space is automatically added to the instances to accomodate __dict__ and __weakrefs__. (The __dict__ is not initialized until you use it though, so you shouldn"t worry about the space occupied by an empty dictionary for each instance you create.) If you don"t need this extra space, you can add the phrase "__slots__ = []" to your class.

Answer #3

os.listdir() - list in the current directory

With listdir in os module you get the files and the folders in the current dir

 import os
 arr = os.listdir()
 print(arr)
 
 >>> ["$RECYCLE.BIN", "work.txt", "3ebooks.txt", "documents"]

Looking in a directory

arr = os.listdir("c:\files")

glob from glob

with glob you can specify a type of file to list like this

import glob

txtfiles = []
for file in glob.glob("*.txt"):
    txtfiles.append(file)

glob in a list comprehension

mylist = [f for f in glob.glob("*.txt")]

get the full path of only files in the current directory

import os
from os import listdir
from os.path import isfile, join

cwd = os.getcwd()
onlyfiles = [os.path.join(cwd, f) for f in os.listdir(cwd) if 
os.path.isfile(os.path.join(cwd, f))]
print(onlyfiles) 

["G:\getfilesname\getfilesname.py", "G:\getfilesname\example.txt"]

Getting the full path name with os.path.abspath

You get the full path in return

 import os
 files_path = [os.path.abspath(x) for x in os.listdir()]
 print(files_path)
 
 ["F:\documentiapplications.txt", "F:\documenticollections.txt"]

Walk: going through sub directories

os.walk returns the root, the directories list and the files list, that is why I unpacked them in r, d, f in the for loop; it, then, looks for other files and directories in the subfolders of the root and so on until there are no subfolders.

import os

# Getting the current work directory (cwd)
thisdir = os.getcwd()

# r=root, d=directories, f = files
for r, d, f in os.walk(thisdir):
    for file in f:
        if file.endswith(".docx"):
            print(os.path.join(r, file))

os.listdir(): get files in the current directory (Python 2)

In Python 2, if you want the list of the files in the current directory, you have to give the argument as "." or os.getcwd() in the os.listdir method.

 import os
 arr = os.listdir(".")
 print(arr)
 
 >>> ["$RECYCLE.BIN", "work.txt", "3ebooks.txt", "documents"]

To go up in the directory tree

# Method 1
x = os.listdir("..")

# Method 2
x= os.listdir("/")

Get files: os.listdir() in a particular directory (Python 2 and 3)

 import os
 arr = os.listdir("F:\python")
 print(arr)
 
 >>> ["$RECYCLE.BIN", "work.txt", "3ebooks.txt", "documents"]

Get files of a particular subdirectory with os.listdir()

import os

x = os.listdir("./content")

os.walk(".") - current directory

 import os
 arr = next(os.walk("."))[2]
 print(arr)
 
 >>> ["5bs_Turismo1.pdf", "5bs_Turismo1.pptx", "esperienza.txt"]

next(os.walk(".")) and os.path.join("dir", "file")

 import os
 arr = []
 for d,r,f in next(os.walk("F:\_python")):
     for file in f:
         arr.append(os.path.join(r,file))

 for f in arr:
     print(files)

>>> F:\_python\dict_class.py
>>> F:\_python\programmi.txt

next(os.walk("F:\") - get the full path - list comprehension

 [os.path.join(r,file) for r,d,f in next(os.walk("F:\_python")) for file in f]
 
 >>> ["F:\_python\dict_class.py", "F:\_python\programmi.txt"]

os.walk - get full path - all files in sub dirs**

x = [os.path.join(r,file) for r,d,f in os.walk("F:\_python") for file in f]
print(x)

>>> ["F:\_python\dict.py", "F:\_python\progr.txt", "F:\_python\readl.py"]

os.listdir() - get only txt files

 arr_txt = [x for x in os.listdir() if x.endswith(".txt")]
 print(arr_txt)
 
 >>> ["work.txt", "3ebooks.txt"]

Using glob to get the full path of the files

If I should need the absolute path of the files:

from path import path
from glob import glob
x = [path(f).abspath() for f in glob("F:\*.txt")]
for f in x:
    print(f)

>>> F:acquistionline.txt
>>> F:acquisti_2018.txt
>>> F:ootstrap_jquery_ecc.txt

Using os.path.isfile to avoid directories in the list

import os.path
listOfFiles = [f for f in os.listdir() if os.path.isfile(f)]
print(listOfFiles)

>>> ["a simple game.py", "data.txt", "decorator.py"]

Using pathlib from Python 3.4

import pathlib

flist = []
for p in pathlib.Path(".").iterdir():
    if p.is_file():
        print(p)
        flist.append(p)

 >>> error.PNG
 >>> exemaker.bat
 >>> guiprova.mp3
 >>> setup.py
 >>> speak_gui2.py
 >>> thumb.PNG

With list comprehension:

flist = [p for p in pathlib.Path(".").iterdir() if p.is_file()]

Alternatively, use pathlib.Path() instead of pathlib.Path(".")

Use glob method in pathlib.Path()

import pathlib

py = pathlib.Path().glob("*.py")
for file in py:
    print(file)

>>> stack_overflow_list.py
>>> stack_overflow_list_tkinter.py

Get all and only files with os.walk

import os
x = [i[2] for i in os.walk(".")]
y=[]
for t in x:
    for f in t:
        y.append(f)
print(y)

>>> ["append_to_list.py", "data.txt", "data1.txt", "data2.txt", "data_180617", "os_walk.py", "READ2.py", "read_data.py", "somma_defaltdic.py", "substitute_words.py", "sum_data.py", "data.txt", "data1.txt", "data_180617"]

Get only files with next and walk in a directory

 import os
 x = next(os.walk("F://python"))[2]
 print(x)
 
 >>> ["calculator.bat","calculator.py"]

Get only directories with next and walk in a directory

 import os
 next(os.walk("F://python"))[1] # for the current dir use (".")
 
 >>> ["python3","others"]

Get all the subdir names with walk

for r,d,f in os.walk("F:\_python"):
    for dirs in d:
        print(dirs)

>>> .vscode
>>> pyexcel
>>> pyschool.py
>>> subtitles
>>> _metaprogramming
>>> .ipynb_checkpoints

os.scandir() from Python 3.5 and greater

import os
x = [f.name for f in os.scandir() if f.is_file()]
print(x)

>>> ["calculator.bat","calculator.py"]

# Another example with scandir (a little variation from docs.python.org)
# This one is more efficient than os.listdir.
# In this case, it shows the files only in the current directory
# where the script is executed.

import os
with os.scandir() as i:
    for entry in i:
        if entry.is_file():
            print(entry.name)

>>> ebookmaker.py
>>> error.PNG
>>> exemaker.bat
>>> guiprova.mp3
>>> setup.py
>>> speakgui4.py
>>> speak_gui2.py
>>> speak_gui3.py
>>> thumb.PNG

Examples:

Ex. 1: How many files are there in the subdirectories?

In this example, we look for the number of files that are included in all the directory and its subdirectories.

import os

def count(dir, counter=0):
    "returns number of files in dir and subdirs"
    for pack in os.walk(dir):
        for f in pack[2]:
            counter += 1
    return dir + " : " + str(counter) + "files"

print(count("F:\python"))

>>> "F:\python" : 12057 files"

Ex.2: How to copy all files from a directory to another?

A script to make order in your computer finding all files of a type (default: pptx) and copying them in a new folder.

import os
import shutil
from path import path

destination = "F:\file_copied"
# os.makedirs(destination)

def copyfile(dir, filetype="pptx", counter=0):
    "Searches for pptx (or other - pptx is the default) files and copies them"
    for pack in os.walk(dir):
        for f in pack[2]:
            if f.endswith(filetype):
                fullpath = pack[0] + "\" + f
                print(fullpath)
                shutil.copy(fullpath, destination)
                counter += 1
    if counter > 0:
        print("-" * 30)
        print("	==> Found in: `" + dir + "` : " + str(counter) + " files
")

for dir in os.listdir():
    "searches for folders that starts with `_`"
    if dir[0] == "_":
        # copyfile(dir, filetype="pdf")
        copyfile(dir, filetype="txt")


>>> _compiti18Compito Contabilità 1conti.txt
>>> _compiti18Compito Contabilità 1modula4.txt
>>> _compiti18Compito Contabilità 1moduloa4.txt
>>> ------------------------
>>> ==> Found in: `_compiti18` : 3 files

Ex. 3: How to get all the files in a txt file

In case you want to create a txt file with all the file names:

import os
mylist = ""
with open("filelist.txt", "w", encoding="utf-8") as file:
    for eachfile in os.listdir():
        mylist += eachfile + "
"
    file.write(mylist)

Example: txt with all the files of an hard drive

"""
We are going to save a txt file with all the files in your directory.
We will use the function walk()
"""

import os

# see all the methods of os
# print(*dir(os), sep=", ")
listafile = []
percorso = []
with open("lista_file.txt", "w", encoding="utf-8") as testo:
    for root, dirs, files in os.walk("D:\"):
        for file in files:
            listafile.append(file)
            percorso.append(root + "\" + file)
            testo.write(file + "
")
listafile.sort()
print("N. of files", len(listafile))
with open("lista_file_ordinata.txt", "w", encoding="utf-8") as testo_ordinato:
    for file in listafile:
        testo_ordinato.write(file + "
")

with open("percorso.txt", "w", encoding="utf-8") as file_percorso:
    for file in percorso:
        file_percorso.write(file + "
")

os.system("lista_file.txt")
os.system("lista_file_ordinata.txt")
os.system("percorso.txt")

All the file of C: in one text file

This is a shorter version of the previous code. Change the folder where to start finding the files if you need to start from another position. This code generate a 50 mb on text file on my computer with something less then 500.000 lines with files with the complete path.

import os

with open("file.txt", "w", encoding="utf-8") as filewrite:
    for r, d, f in os.walk("C:\"):
        for file in f:
            filewrite.write(f"{r + file}
")

How to write a file with all paths in a folder of a type

With this function you can create a txt file that will have the name of a type of file that you look for (ex. pngfile.txt) with all the full path of all the files of that type. It can be useful sometimes, I think.

import os

def searchfiles(extension=".ttf", folder="H:\"):
    "Create a txt file with all the file of a type"
    with open(extension[1:] + "file.txt", "w", encoding="utf-8") as filewrite:
        for r, d, f in os.walk(folder):
            for file in f:
                if file.endswith(extension):
                    filewrite.write(f"{r + file}
")

# looking for png file (fonts) in the hard disk H:
searchfiles(".png", "H:\")

>>> H:4bs_18Dolphins5.png
>>> H:4bs_18Dolphins6.png
>>> H:4bs_18Dolphins7.png
>>> H:5_18marketing htmlassetsimageslogo2.png
>>> H:7z001.png
>>> H:7z002.png

(New) Find all files and open them with tkinter GUI

I just wanted to add in this 2019 a little app to search for all files in a dir and be able to open them by doubleclicking on the name of the file in the list. enter image description here

import tkinter as tk
import os

def searchfiles(extension=".txt", folder="H:\"):
    "insert all files in the listbox"
    for r, d, f in os.walk(folder):
        for file in f:
            if file.endswith(extension):
                lb.insert(0, r + "\" + file)

def open_file():
    os.startfile(lb.get(lb.curselection()[0]))

root = tk.Tk()
root.geometry("400x400")
bt = tk.Button(root, text="Search", command=lambda:searchfiles(".png", "H:\"))
bt.pack()
lb = tk.Listbox(root)
lb.pack(fill="both", expand=1)
lb.bind("<Double-Button>", lambda x: open_file())
root.mainloop()

Answer #4

I just used the following which was quite simple. First open a console then cd to where you"ve downloaded your file like some-package.whl and use

pip install some-package.whl

Note: if pip.exe is not recognized, you may find it in the "Scripts" directory from where python has been installed. If pip is not installed, this page can help: How do I install pip on Windows?

Note: for clarification
If you copy the *.whl file to your local drive (ex. C:some-dirsome-file.whl) use the following command line parameters --

pip install C:/some-dir/some-file.whl

Answer #5

Quick Answer:

The simplest way to get row counts per group is by calling .size(), which returns a Series:

df.groupby(["col1","col2"]).size()


Usually you want this result as a DataFrame (instead of a Series) so you can do:

df.groupby(["col1", "col2"]).size().reset_index(name="counts")


If you want to find out how to calculate the row counts and other statistics for each group continue reading below.


Detailed example:

Consider the following example dataframe:

In [2]: df
Out[2]: 
  col1 col2  col3  col4  col5  col6
0    A    B  0.20 -0.61 -0.49  1.49
1    A    B -1.53 -1.01 -0.39  1.82
2    A    B -0.44  0.27  0.72  0.11
3    A    B  0.28 -1.32  0.38  0.18
4    C    D  0.12  0.59  0.81  0.66
5    C    D -0.13 -1.65 -1.64  0.50
6    C    D -1.42 -0.11 -0.18 -0.44
7    E    F -0.00  1.42 -0.26  1.17
8    E    F  0.91 -0.47  1.35 -0.34
9    G    H  1.48 -0.63 -1.14  0.17

First let"s use .size() to get the row counts:

In [3]: df.groupby(["col1", "col2"]).size()
Out[3]: 
col1  col2
A     B       4
C     D       3
E     F       2
G     H       1
dtype: int64

Then let"s use .size().reset_index(name="counts") to get the row counts:

In [4]: df.groupby(["col1", "col2"]).size().reset_index(name="counts")
Out[4]: 
  col1 col2  counts
0    A    B       4
1    C    D       3
2    E    F       2
3    G    H       1


Including results for more statistics

When you want to calculate statistics on grouped data, it usually looks like this:

In [5]: (df
   ...: .groupby(["col1", "col2"])
   ...: .agg({
   ...:     "col3": ["mean", "count"], 
   ...:     "col4": ["median", "min", "count"]
   ...: }))
Out[5]: 
            col4                  col3      
          median   min count      mean count
col1 col2                                   
A    B    -0.810 -1.32     4 -0.372500     4
C    D    -0.110 -1.65     3 -0.476667     3
E    F     0.475 -0.47     2  0.455000     2
G    H    -0.630 -0.63     1  1.480000     1

The result above is a little annoying to deal with because of the nested column labels, and also because row counts are on a per column basis.

To gain more control over the output I usually split the statistics into individual aggregations that I then combine using join. It looks like this:

In [6]: gb = df.groupby(["col1", "col2"])
   ...: counts = gb.size().to_frame(name="counts")
   ...: (counts
   ...:  .join(gb.agg({"col3": "mean"}).rename(columns={"col3": "col3_mean"}))
   ...:  .join(gb.agg({"col4": "median"}).rename(columns={"col4": "col4_median"}))
   ...:  .join(gb.agg({"col4": "min"}).rename(columns={"col4": "col4_min"}))
   ...:  .reset_index()
   ...: )
   ...: 
Out[6]: 
  col1 col2  counts  col3_mean  col4_median  col4_min
0    A    B       4  -0.372500       -0.810     -1.32
1    C    D       3  -0.476667       -0.110     -1.65
2    E    F       2   0.455000        0.475     -0.47
3    G    H       1   1.480000       -0.630     -0.63



Footnotes

The code used to generate the test data is shown below:

In [1]: import numpy as np
   ...: import pandas as pd 
   ...: 
   ...: keys = np.array([
   ...:         ["A", "B"],
   ...:         ["A", "B"],
   ...:         ["A", "B"],
   ...:         ["A", "B"],
   ...:         ["C", "D"],
   ...:         ["C", "D"],
   ...:         ["C", "D"],
   ...:         ["E", "F"],
   ...:         ["E", "F"],
   ...:         ["G", "H"] 
   ...:         ])
   ...: 
   ...: df = pd.DataFrame(
   ...:     np.hstack([keys,np.random.randn(10,4).round(2)]), 
   ...:     columns = ["col1", "col2", "col3", "col4", "col5", "col6"]
   ...: )
   ...: 
   ...: df[["col3", "col4", "col5", "col6"]] = 
   ...:     df[["col3", "col4", "col5", "col6"]].astype(float)
   ...: 


Disclaimer:

If some of the columns that you are aggregating have null values, then you really want to be looking at the group row counts as an independent aggregation for each column. Otherwise you may be misled as to how many records are actually being used to calculate things like the mean because pandas will drop NaN entries in the mean calculation without telling you about it.

Answer #6

Using a for loop, how do I access the loop index, from 1 to 5 in this case?

Use enumerate to get the index with the element as you iterate:

for index, item in enumerate(items):
    print(index, item)

And note that Python"s indexes start at zero, so you would get 0 to 4 with the above. If you want the count, 1 to 5, do this:

count = 0 # in case items is empty and you need it after the loop
for count, item in enumerate(items, start=1):
    print(count, item)

Unidiomatic control flow

What you are asking for is the Pythonic equivalent of the following, which is the algorithm most programmers of lower-level languages would use:

index = 0            # Python"s indexing starts at zero
for item in items:   # Python"s for loops are a "for each" loop 
    print(index, item)
    index += 1

Or in languages that do not have a for-each loop:

index = 0
while index < len(items):
    print(index, items[index])
    index += 1

or sometimes more commonly (but unidiomatically) found in Python:

for index in range(len(items)):
    print(index, items[index])

Use the Enumerate Function

Python"s enumerate function reduces the visual clutter by hiding the accounting for the indexes, and encapsulating the iterable into another iterable (an enumerate object) that yields a two-item tuple of the index and the item that the original iterable would provide. That looks like this:

for index, item in enumerate(items, start=0):   # default is zero
    print(index, item)

This code sample is fairly well the canonical example of the difference between code that is idiomatic of Python and code that is not. Idiomatic code is sophisticated (but not complicated) Python, written in the way that it was intended to be used. Idiomatic code is expected by the designers of the language, which means that usually this code is not just more readable, but also more efficient.

Getting a count

Even if you don"t need indexes as you go, but you need a count of the iterations (sometimes desirable) you can start with 1 and the final number will be your count.

count = 0 # in case items is empty
for count, item in enumerate(items, start=1):   # default is zero
    print(item)

print("there were {0} items printed".format(count))

The count seems to be more what you intend to ask for (as opposed to index) when you said you wanted from 1 to 5.


Breaking it down - a step by step explanation

To break these examples down, say we have a list of items that we want to iterate over with an index:

items = ["a", "b", "c", "d", "e"]

Now we pass this iterable to enumerate, creating an enumerate object:

enumerate_object = enumerate(items) # the enumerate object

We can pull the first item out of this iterable that we would get in a loop with the next function:

iteration = next(enumerate_object) # first iteration from enumerate
print(iteration)

And we see we get a tuple of 0, the first index, and "a", the first item:

(0, "a")

we can use what is referred to as "sequence unpacking" to extract the elements from this two-tuple:

index, item = iteration
#   0,  "a" = (0, "a") # essentially this.

and when we inspect index, we find it refers to the first index, 0, and item refers to the first item, "a".

>>> print(index)
0
>>> print(item)
a

Conclusion

  • Python indexes start at zero
  • To get these indexes from an iterable as you iterate over it, use the enumerate function
  • Using enumerate in the idiomatic way (along with tuple unpacking) creates code that is more readable and maintainable:

So do this:

for index, item in enumerate(items, start=0):   # Python indexes start at zero
    print(index, item)

Answer #7

Getting some sort of modification date in a cross-platform way is easy - just call os.path.getmtime(path) and you"ll get the Unix timestamp of when the file at path was last modified.

Getting file creation dates, on the other hand, is fiddly and platform-dependent, differing even between the three big OSes:

Putting this all together, cross-platform code should look something like this...

import os
import platform

def creation_date(path_to_file):
    """
    Try to get the date that a file was created, falling back to when it was
    last modified if that isn"t possible.
    See http://stackoverflow.com/a/39501288/1709587 for explanation.
    """
    if platform.system() == "Windows":
        return os.path.getctime(path_to_file)
    else:
        stat = os.stat(path_to_file)
        try:
            return stat.st_birthtime
        except AttributeError:
            # We"re probably on Linux. No easy way to get creation dates here,
            # so we"ll settle for when its content was last modified.
            return stat.st_mtime

Answer #8

I noticed that every now and then I need to Google fopen all over again, just to build a mental image of what the primary differences between the modes are. So, I thought a diagram will be faster to read next time. Maybe someone else will find that helpful too.

Answer #9

I would suggest using the duplicated method on the Pandas Index itself:

df3 = df3[~df3.index.duplicated(keep="first")]

While all the other methods work, .drop_duplicates is by far the least performant for the provided example. Furthermore, while the groupby method is only slightly less performant, I find the duplicated method to be more readable.

Using the sample data provided:

>>> %timeit df3.reset_index().drop_duplicates(subset="index", keep="first").set_index("index")
1000 loops, best of 3: 1.54 ms per loop

>>> %timeit df3.groupby(df3.index).first()
1000 loops, best of 3: 580 µs per loop

>>> %timeit df3[~df3.index.duplicated(keep="first")]
1000 loops, best of 3: 307 µs per loop

Note that you can keep the last element by changing the keep argument to "last".

It should also be noted that this method works with MultiIndex as well (using df1 as specified in Paul"s example):

>>> %timeit df1.groupby(level=df1.index.names).last()
1000 loops, best of 3: 771 µs per loop

>>> %timeit df1[~df1.index.duplicated(keep="last")]
1000 loops, best of 3: 365 µs per loop

Answer #10

Here"s a concise solution which avoids regular expressions and slow in-Python loops:

def principal_period(s):
    i = (s+s).find(s, 1, -1)
    return None if i == -1 else s[:i]

See the Community Wiki answer started by @davidism for benchmark results. In summary,

David Zhang"s solution is the clear winner, outperforming all others by at least 5x for the large example set.

(That answer"s words, not mine.)

This is based on the observation that a string is periodic if and only if it is equal to a nontrivial rotation of itself. Kudos to @AleksiTorhamo for realizing that we can then recover the principal period from the index of the first occurrence of s in (s+s)[1:-1], and for informing me of the optional start and end arguments of Python"s string.find.

Find indices of elements equal to zero in a NumPy array: StackOverflow Questions

Answer #1

There are several ways to select rows from a Pandas dataframe:

  1. Boolean indexing (df[df["col"] == value] )
  2. Positional indexing (df.iloc[...])
  3. Label indexing (df.xs(...))
  4. df.query(...) API

Below I show you examples of each, with advice when to use certain techniques. Assume our criterion is column "A" == "foo"

(Note on performance: For each base type, we can keep things simple by using the Pandas API or we can venture outside the API, usually into NumPy, and speed things up.)


Setup

The first thing we"ll need is to identify a condition that will act as our criterion for selecting rows. We"ll start with the OP"s case column_name == some_value, and include some other common use cases.

Borrowing from @unutbu:

import pandas as pd, numpy as np

df = pd.DataFrame({"A": "foo bar foo bar foo bar foo foo".split(),
                   "B": "one one two three two two one three".split(),
                   "C": np.arange(8), "D": np.arange(8) * 2})

1. Boolean indexing

... Boolean indexing requires finding the true value of each row"s "A" column being equal to "foo", then using those truth values to identify which rows to keep. Typically, we"d name this series, an array of truth values, mask. We"ll do so here as well.

mask = df["A"] == "foo"

We can then use this mask to slice or index the data frame

df[mask]

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

This is one of the simplest ways to accomplish this task and if performance or intuitiveness isn"t an issue, this should be your chosen method. However, if performance is a concern, then you might want to consider an alternative way of creating the mask.


2. Positional indexing

Positional indexing (df.iloc[...]) has its use cases, but this isn"t one of them. In order to identify where to slice, we first need to perform the same boolean analysis we did above. This leaves us performing one extra step to accomplish the same task.

mask = df["A"] == "foo"
pos = np.flatnonzero(mask)
df.iloc[pos]

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

3. Label indexing

Label indexing can be very handy, but in this case, we are again doing more work for no benefit

df.set_index("A", append=True, drop=False).xs("foo", level=1)

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

4. df.query() API

pd.DataFrame.query is a very elegant/intuitive way to perform this task, but is often slower. However, if you pay attention to the timings below, for large data, the query is very efficient. More so than the standard approach and of similar magnitude as my best suggestion.

df.query("A == "foo"")

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

My preference is to use the Boolean mask

Actual improvements can be made by modifying how we create our Boolean mask.

mask alternative 1 Use the underlying NumPy array and forgo the overhead of creating another pd.Series

mask = df["A"].values == "foo"

I"ll show more complete time tests at the end, but just take a look at the performance gains we get using the sample data frame. First, we look at the difference in creating the mask

%timeit mask = df["A"].values == "foo"
%timeit mask = df["A"] == "foo"

5.84 µs ± 195 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
166 µs ± 4.45 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Evaluating the mask with the NumPy array is ~ 30 times faster. This is partly due to NumPy evaluation often being faster. It is also partly due to the lack of overhead necessary to build an index and a corresponding pd.Series object.

Next, we"ll look at the timing for slicing with one mask versus the other.

mask = df["A"].values == "foo"
%timeit df[mask]
mask = df["A"] == "foo"
%timeit df[mask]

219 µs ± 12.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
239 µs ± 7.03 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The performance gains aren"t as pronounced. We"ll see if this holds up over more robust testing.


mask alternative 2 We could have reconstructed the data frame as well. There is a big caveat when reconstructing a dataframe—you must take care of the dtypes when doing so!

Instead of df[mask] we will do this

pd.DataFrame(df.values[mask], df.index[mask], df.columns).astype(df.dtypes)

If the data frame is of mixed type, which our example is, then when we get df.values the resulting array is of dtype object and consequently, all columns of the new data frame will be of dtype object. Thus requiring the astype(df.dtypes) and killing any potential performance gains.

%timeit df[m]
%timeit pd.DataFrame(df.values[mask], df.index[mask], df.columns).astype(df.dtypes)

216 µs ± 10.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.43 ms ± 39.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

However, if the data frame is not of mixed type, this is a very useful way to do it.

Given

np.random.seed([3,1415])
d1 = pd.DataFrame(np.random.randint(10, size=(10, 5)), columns=list("ABCDE"))

d1

   A  B  C  D  E
0  0  2  7  3  8
1  7  0  6  8  6
2  0  2  0  4  9
3  7  3  2  4  3
4  3  6  7  7  4
5  5  3  7  5  9
6  8  7  6  4  7
7  6  2  6  6  5
8  2  8  7  5  8
9  4  7  6  1  5

%%timeit
mask = d1["A"].values == 7
d1[mask]

179 µs ± 8.73 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Versus

%%timeit
mask = d1["A"].values == 7
pd.DataFrame(d1.values[mask], d1.index[mask], d1.columns)

87 µs ± 5.12 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

We cut the time in half.


mask alternative 3

@unutbu also shows us how to use pd.Series.isin to account for each element of df["A"] being in a set of values. This evaluates to the same thing if our set of values is a set of one value, namely "foo". But it also generalizes to include larger sets of values if needed. Turns out, this is still pretty fast even though it is a more general solution. The only real loss is in intuitiveness for those not familiar with the concept.

mask = df["A"].isin(["foo"])
df[mask]

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

However, as before, we can utilize NumPy to improve performance while sacrificing virtually nothing. We"ll use np.in1d

mask = np.in1d(df["A"].values, ["foo"])
df[mask]

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

Timing

I"ll include other concepts mentioned in other posts as well for reference.

Code Below

Each column in this table represents a different length data frame over which we test each function. Each column shows relative time taken, with the fastest function given a base index of 1.0.

res.div(res.min())

                         10        30        100       300       1000      3000      10000     30000
mask_standard         2.156872  1.850663  2.034149  2.166312  2.164541  3.090372  2.981326  3.131151
mask_standard_loc     1.879035  1.782366  1.988823  2.338112  2.361391  3.036131  2.998112  2.990103
mask_with_values      1.010166  1.000000  1.005113  1.026363  1.028698  1.293741  1.007824  1.016919
mask_with_values_loc  1.196843  1.300228  1.000000  1.000000  1.038989  1.219233  1.037020  1.000000
query                 4.997304  4.765554  5.934096  4.500559  2.997924  2.397013  1.680447  1.398190
xs_label              4.124597  4.272363  5.596152  4.295331  4.676591  5.710680  6.032809  8.950255
mask_with_isin        1.674055  1.679935  1.847972  1.724183  1.345111  1.405231  1.253554  1.264760
mask_with_in1d        1.000000  1.083807  1.220493  1.101929  1.000000  1.000000  1.000000  1.144175

You"ll notice that the fastest times seem to be shared between mask_with_values and mask_with_in1d.

res.T.plot(loglog=True)

Enter image description here

Functions

def mask_standard(df):
    mask = df["A"] == "foo"
    return df[mask]

def mask_standard_loc(df):
    mask = df["A"] == "foo"
    return df.loc[mask]

def mask_with_values(df):
    mask = df["A"].values == "foo"
    return df[mask]

def mask_with_values_loc(df):
    mask = df["A"].values == "foo"
    return df.loc[mask]

def query(df):
    return df.query("A == "foo"")

def xs_label(df):
    return df.set_index("A", append=True, drop=False).xs("foo", level=-1)

def mask_with_isin(df):
    mask = df["A"].isin(["foo"])
    return df[mask]

def mask_with_in1d(df):
    mask = np.in1d(df["A"].values, ["foo"])
    return df[mask]

Testing

res = pd.DataFrame(
    index=[
        "mask_standard", "mask_standard_loc", "mask_with_values", "mask_with_values_loc",
        "query", "xs_label", "mask_with_isin", "mask_with_in1d"
    ],
    columns=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
    dtype=float
)

for j in res.columns:
    d = pd.concat([df] * j, ignore_index=True)
    for i in res.index:a
        stmt = "{}(d)".format(i)
        setp = "from __main__ import d, {}".format(i)
        res.at[i, j] = timeit(stmt, setp, number=50)

Special Timing

Looking at the special case when we have a single non-object dtype for the entire data frame.

Code Below

spec.div(spec.min())

                     10        30        100       300       1000      3000      10000     30000
mask_with_values  1.009030  1.000000  1.194276  1.000000  1.236892  1.095343  1.000000  1.000000
mask_with_in1d    1.104638  1.094524  1.156930  1.072094  1.000000  1.000000  1.040043  1.027100
reconstruct       1.000000  1.142838  1.000000  1.355440  1.650270  2.222181  2.294913  3.406735

Turns out, reconstruction isn"t worth it past a few hundred rows.

spec.T.plot(loglog=True)

Enter image description here

Functions

np.random.seed([3,1415])
d1 = pd.DataFrame(np.random.randint(10, size=(10, 5)), columns=list("ABCDE"))

def mask_with_values(df):
    mask = df["A"].values == "foo"
    return df[mask]

def mask_with_in1d(df):
    mask = np.in1d(df["A"].values, ["foo"])
    return df[mask]

def reconstruct(df):
    v = df.values
    mask = np.in1d(df["A"].values, ["foo"])
    return pd.DataFrame(v[mask], df.index[mask], df.columns)

spec = pd.DataFrame(
    index=["mask_with_values", "mask_with_in1d", "reconstruct"],
    columns=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
    dtype=float
)

Testing

for j in spec.columns:
    d = pd.concat([df] * j, ignore_index=True)
    for i in spec.index:
        stmt = "{}(d)".format(i)
        setp = "from __main__ import d, {}".format(i)
        spec.at[i, j] = timeit(stmt, setp, number=50)

Answer #2

Your array a defines the columns of the nonzero elements in the output array. You need to also define the rows and then use fancy indexing:

>>> a = np.array([1, 0, 3])
>>> b = np.zeros((a.size, a.max()+1))
>>> b[np.arange(a.size),a] = 1
>>> b
array([[ 0.,  1.,  0.,  0.],
       [ 1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.]])

Answer #3

What about using numpy.count_nonzero, something like

>>> import numpy as np
>>> y = np.array([1, 2, 2, 2, 2, 0, 2, 3, 3, 3, 0, 0, 2, 2, 0])

>>> np.count_nonzero(y == 1)
1
>>> np.count_nonzero(y == 2)
7
>>> np.count_nonzero(y == 3)
3

Answer #4

To somewhat expand on the earlier answers here, there are a number of details which are commonly overlooked.

  • Prefer subprocess.run() over subprocess.check_call() and friends over subprocess.call() over subprocess.Popen() over os.system() over os.popen()
  • Understand and probably use text=True, aka universal_newlines=True.
  • Understand the meaning of shell=True or shell=False and how it changes quoting and the availability of shell conveniences.
  • Understand differences between sh and Bash
  • Understand how a subprocess is separate from its parent, and generally cannot change the parent.
  • Avoid running the Python interpreter as a subprocess of Python.

These topics are covered in some more detail below.

Prefer subprocess.run() or subprocess.check_call()

The subprocess.Popen() function is a low-level workhorse but it is tricky to use correctly and you end up copy/pasting multiple lines of code ... which conveniently already exist in the standard library as a set of higher-level wrapper functions for various purposes, which are presented in more detail in the following.

Here"s a paragraph from the documentation:

The recommended approach to invoking subprocesses is to use the run() function for all use cases it can handle. For more advanced use cases, the underlying Popen interface can be used directly.

Unfortunately, the availability of these wrapper functions differs between Python versions.

  • subprocess.run() was officially introduced in Python 3.5. It is meant to replace all of the following.
  • subprocess.check_output() was introduced in Python 2.7 / 3.1. It is basically equivalent to subprocess.run(..., check=True, stdout=subprocess.PIPE).stdout
  • subprocess.check_call() was introduced in Python 2.5. It is basically equivalent to subprocess.run(..., check=True)
  • subprocess.call() was introduced in Python 2.4 in the original subprocess module (PEP-324). It is basically equivalent to subprocess.run(...).returncode

High-level API vs subprocess.Popen()

The refactored and extended subprocess.run() is more logical and more versatile than the older legacy functions it replaces. It returns a CompletedProcess object which has various methods which allow you to retrieve the exit status, the standard output, and a few other results and status indicators from the finished subprocess.

subprocess.run() is the way to go if you simply need a program to run and return control to Python. For more involved scenarios (background processes, perhaps with interactive I/O with the Python parent program) you still need to use subprocess.Popen() and take care of all the plumbing yourself. This requires a fairly intricate understanding of all the moving parts and should not be undertaken lightly. The simpler Popen object represents the (possibly still-running) process which needs to be managed from your code for the remainder of the lifetime of the subprocess.

It should perhaps be emphasized that just subprocess.Popen() merely creates a process. If you leave it at that, you have a subprocess running concurrently alongside with Python, so a "background" process. If it doesn"t need to do input or output or otherwise coordinate with you, it can do useful work in parallel with your Python program.

Avoid os.system() and os.popen()

Since time eternal (well, since Python 2.5) the os module documentation has contained the recommendation to prefer subprocess over os.system():

The subprocess module provides more powerful facilities for spawning new processes and retrieving their results; using that module is preferable to using this function.

The problems with system() are that it"s obviously system-dependent and doesn"t offer ways to interact with the subprocess. It simply runs, with standard output and standard error outside of Python"s reach. The only information Python receives back is the exit status of the command (zero means success, though the meaning of non-zero values is also somewhat system-dependent).

PEP-324 (which was already mentioned above) contains a more detailed rationale for why os.system is problematic and how subprocess attempts to solve those issues.

os.popen() used to be even more strongly discouraged:

Deprecated since version 2.6: This function is obsolete. Use the subprocess module.

However, since sometime in Python 3, it has been reimplemented to simply use subprocess, and redirects to the subprocess.Popen() documentation for details.

Understand and usually use check=True

You"ll also notice that subprocess.call() has many of the same limitations as os.system(). In regular use, you should generally check whether the process finished successfully, which subprocess.check_call() and subprocess.check_output() do (where the latter also returns the standard output of the finished subprocess). Similarly, you should usually use check=True with subprocess.run() unless you specifically need to allow the subprocess to return an error status.

In practice, with check=True or subprocess.check_*, Python will throw a CalledProcessError exception if the subprocess returns a nonzero exit status.

A common error with subprocess.run() is to omit check=True and be surprised when downstream code fails if the subprocess failed.

On the other hand, a common problem with check_call() and check_output() was that users who blindly used these functions were surprised when the exception was raised e.g. when grep did not find a match. (You should probably replace grep with native Python code anyway, as outlined below.)

All things counted, you need to understand how shell commands return an exit code, and under what conditions they will return a non-zero (error) exit code, and make a conscious decision how exactly it should be handled.

Understand and probably use text=True aka universal_newlines=True

Since Python 3, strings internal to Python are Unicode strings. But there is no guarantee that a subprocess generates Unicode output, or strings at all.

(If the differences are not immediately obvious, Ned Batchelder"s Pragmatic Unicode is recommended, if not outright obligatory, reading. There is a 36-minute video presentation behind the link if you prefer, though reading the page yourself will probably take significantly less time.)

Deep down, Python has to fetch a bytes buffer and interpret it somehow. If it contains a blob of binary data, it shouldn"t be decoded into a Unicode string, because that"s error-prone and bug-inducing behavior - precisely the sort of pesky behavior which riddled many Python 2 scripts, before there was a way to properly distinguish between encoded text and binary data.

With text=True, you tell Python that you, in fact, expect back textual data in the system"s default encoding, and that it should be decoded into a Python (Unicode) string to the best of Python"s ability (usually UTF-8 on any moderately up to date system, except perhaps Windows?)

If that"s not what you request back, Python will just give you bytes strings in the stdout and stderr strings. Maybe at some later point you do know that they were text strings after all, and you know their encoding. Then, you can decode them.

normal = subprocess.run([external, arg],
    stdout=subprocess.PIPE, stderr=subprocess.PIPE,
    check=True,
    text=True)
print(normal.stdout)

convoluted = subprocess.run([external, arg],
    stdout=subprocess.PIPE, stderr=subprocess.PIPE,
    check=True)
# You have to know (or guess) the encoding
print(convoluted.stdout.decode("utf-8"))

Python 3.7 introduced the shorter and more descriptive and understandable alias text for the keyword argument which was previously somewhat misleadingly called universal_newlines.

Understand shell=True vs shell=False

With shell=True you pass a single string to your shell, and the shell takes it from there.

With shell=False you pass a list of arguments to the OS, bypassing the shell.

When you don"t have a shell, you save a process and get rid of a fairly substantial amount of hidden complexity, which may or may not harbor bugs or even security problems.

On the other hand, when you don"t have a shell, you don"t have redirection, wildcard expansion, job control, and a large number of other shell features.

A common mistake is to use shell=True and then still pass Python a list of tokens, or vice versa. This happens to work in some cases, but is really ill-defined and could break in interesting ways.

# XXX AVOID THIS BUG
buggy = subprocess.run("dig +short stackoverflow.com")

# XXX AVOID THIS BUG TOO
broken = subprocess.run(["dig", "+short", "stackoverflow.com"],
    shell=True)

# XXX DEFINITELY AVOID THIS
pathological = subprocess.run(["dig +short stackoverflow.com"],
    shell=True)

correct = subprocess.run(["dig", "+short", "stackoverflow.com"],
    # Probably don"t forget these, too
    check=True, text=True)

# XXX Probably better avoid shell=True
# but this is nominally correct
fixed_but_fugly = subprocess.run("dig +short stackoverflow.com",
    shell=True,
    # Probably don"t forget these, too
    check=True, text=True)

The common retort "but it works for me" is not a useful rebuttal unless you understand exactly under what circumstances it could stop working.

Refactoring Example

Very often, the features of the shell can be replaced with native Python code. Simple Awk or sed scripts should probably simply be translated to Python instead.

To partially illustrate this, here is a typical but slightly silly example which involves many shell features.

cmd = """while read -r x;
   do ping -c 3 "$x" | grep "round-trip min/avg/max"
   done <hosts.txt"""

# Trivial but horrible
results = subprocess.run(
    cmd, shell=True, universal_newlines=True, check=True)
print(results.stdout)

# Reimplement with shell=False
with open("hosts.txt") as hosts:
    for host in hosts:
        host = host.rstrip("
")  # drop newline
        ping = subprocess.run(
             ["ping", "-c", "3", host],
             text=True,
             stdout=subprocess.PIPE,
             check=True)
        for line in ping.stdout.split("
"):
             if "round-trip min/avg/max" in line:
                 print("{}: {}".format(host, line))

Some things to note here:

  • With shell=False you don"t need the quoting that the shell requires around strings. Putting quotes anyway is probably an error.
  • It often makes sense to run as little code as possible in a subprocess. This gives you more control over execution from within your Python code.
  • Having said that, complex shell pipelines are tedious and sometimes challenging to reimplement in Python.

The refactored code also illustrates just how much the shell really does for you with a very terse syntax -- for better or for worse. Python says explicit is better than implicit but the Python code is rather verbose and arguably looks more complex than this really is. On the other hand, it offers a number of points where you can grab control in the middle of something else, as trivially exemplified by the enhancement that we can easily include the host name along with the shell command output. (This is by no means challenging to do in the shell, either, but at the expense of yet another diversion and perhaps another process.)

Common Shell Constructs

For completeness, here are brief explanations of some of these shell features, and some notes on how they can perhaps be replaced with native Python facilities.

  • Globbing aka wildcard expansion can be replaced with glob.glob() or very often with simple Python string comparisons like for file in os.listdir("."): if not file.endswith(".png"): continue. Bash has various other expansion facilities like .{png,jpg} brace expansion and {1..100} as well as tilde expansion (~ expands to your home directory, and more generally ~account to the home directory of another user)
  • Shell variables like $SHELL or $my_exported_var can sometimes simply be replaced with Python variables. Exported shell variables are available as e.g. os.environ["SHELL"] (the meaning of export is to make the variable available to subprocesses -- a variable which is not available to subprocesses will obviously not be available to Python running as a subprocess of the shell, or vice versa. The env= keyword argument to subprocess methods allows you to define the environment of the subprocess as a dictionary, so that"s one way to make a Python variable visible to a subprocess). With shell=False you will need to understand how to remove any quotes; for example, cd "$HOME" is equivalent to os.chdir(os.environ["HOME"]) without quotes around the directory name. (Very often cd is not useful or necessary anyway, and many beginners omit the double quotes around the variable and get away with it until one day ...)
  • Redirection allows you to read from a file as your standard input, and write your standard output to a file. grep "foo" <inputfile >outputfile opens outputfile for writing and inputfile for reading, and passes its contents as standard input to grep, whose standard output then lands in outputfile. This is not generally hard to replace with native Python code.
  • Pipelines are a form of redirection. echo foo | nl runs two subprocesses, where the standard output of echo is the standard input of nl (on the OS level, in Unix-like systems, this is a single file handle). If you cannot replace one or both ends of the pipeline with native Python code, perhaps think about using a shell after all, especially if the pipeline has more than two or three processes (though look at the pipes module in the Python standard library or a number of more modern and versatile third-party competitors).
  • Job control lets you interrupt jobs, run them in the background, return them to the foreground, etc. The basic Unix signals to stop and continue a process are of course available from Python, too. But jobs are a higher-level abstraction in the shell which involve process groups etc which you have to understand if you want to do something like this from Python.
  • Quoting in the shell is potentially confusing until you understand that everything is basically a string. So ls -l / is equivalent to "ls" "-l" "/" but the quoting around literals is completely optional. Unquoted strings which contain shell metacharacters undergo parameter expansion, whitespace tokenization and wildcard expansion; double quotes prevent whitespace tokenization and wildcard expansion but allow parameter expansions (variable substitution, command substitution, and backslash processing). This is simple in theory but can get bewildering, especially when there are several layers of interpretation (a remote shell command, for example).

Understand differences between sh and Bash

subprocess runs your shell commands with /bin/sh unless you specifically request otherwise (except of course on Windows, where it uses the value of the COMSPEC variable). This means that various Bash-only features like arrays, [[ etc are not available.

If you need to use Bash-only syntax, you can pass in the path to the shell as executable="/bin/bash" (where of course if your Bash is installed somewhere else, you need to adjust the path).

subprocess.run("""
    # This for loop syntax is Bash only
    for((i=1;i<=$#;i++)); do
        # Arrays are Bash-only
        array[i]+=123
    done""",
    shell=True, check=True,
    executable="/bin/bash")

A subprocess is separate from its parent, and cannot change it

A somewhat common mistake is doing something like

subprocess.run("cd /tmp", shell=True)
subprocess.run("pwd", shell=True)  # Oops, doesn"t print /tmp

The same thing will happen if the first subprocess tries to set an environment variable, which of course will have disappeared when you run another subprocess, etc.

A child process runs completely separate from Python, and when it finishes, Python has no idea what it did (apart from the vague indicators that it can infer from the exit status and output from the child process). A child generally cannot change the parent"s environment; it cannot set a variable, change the working directory, or, in so many words, communicate with its parent without cooperation from the parent.

The immediate fix in this particular case is to run both commands in a single subprocess;

subprocess.run("cd /tmp; pwd", shell=True)

though obviously this particular use case isn"t very useful; instead, use the cwd keyword argument, or simply os.chdir() before running the subprocess. Similarly, for setting a variable, you can manipulate the environment of the current process (and thus also its children) via

os.environ["foo"] = "bar"

or pass an environment setting to a child process with

subprocess.run("echo "$foo"", shell=True, env={"foo": "bar"})

(not to mention the obvious refactoring subprocess.run(["echo", "bar"]); but echo is a poor example of something to run in a subprocess in the first place, of course).

Don"t run Python from Python

This is slightly dubious advice; there are certainly situations where it does make sense or is even an absolute requirement to run the Python interpreter as a subprocess from a Python script. But very frequently, the correct approach is simply to import the other Python module into your calling script and call its functions directly.

If the other Python script is under your control, and it isn"t a module, consider turning it into one. (This answer is too long already so I will not delve into details here.)

If you need parallelism, you can run Python functions in subprocesses with the multiprocessing module. There is also threading which runs multiple tasks in a single process (which is more lightweight and gives you more control, but also more constrained in that threads within a process are tightly coupled, and bound to a single GIL.)

Answer #5

Best way to check if a list is empty

For example, if passed the following:

a = []

How do I check to see if a is empty?

Short Answer:

Place the list in a boolean context (for example, with an if or while statement). It will test False if it is empty, and True otherwise. For example:

if not a:                           # do this!
    print("a is an empty list")

PEP 8

PEP 8, the official Python style guide for Python code in Python"s standard library, asserts:

For sequences, (strings, lists, tuples), use the fact that empty sequences are false.

Yes: if not seq:
     if seq:

No: if len(seq):
    if not len(seq):

We should expect that standard library code should be as performant and correct as possible. But why is that the case, and why do we need this guidance?

Explanation

I frequently see code like this from experienced programmers new to Python:

if len(a) == 0:                     # Don"t do this!
    print("a is an empty list")

And users of lazy languages may be tempted to do this:

if a == []:                         # Don"t do this!
    print("a is an empty list")

These are correct in their respective other languages. And this is even semantically correct in Python.

But we consider it un-Pythonic because Python supports these semantics directly in the list object"s interface via boolean coercion.

From the docs (and note specifically the inclusion of the empty list, []):

By default, an object is considered true unless its class defines either a __bool__() method that returns False or a __len__() method that returns zero, when called with the object. Here are most of the built-in objects considered false:

  • constants defined to be false: None and False.
  • zero of any numeric type: 0, 0.0, 0j, Decimal(0), Fraction(0, 1)
  • empty sequences and collections: "", (), [], {}, set(), range(0)

And the datamodel documentation:

object.__bool__(self)

Called to implement truth value testing and the built-in operation bool(); should return False or True. When this method is not defined, __len__() is called, if it is defined, and the object is considered true if its result is nonzero. If a class defines neither __len__() nor __bool__(), all its instances are considered true.

and

object.__len__(self)

Called to implement the built-in function len(). Should return the length of the object, an integer >= 0. Also, an object that doesn’t define a __bool__() method and whose __len__() method returns zero is considered to be false in a Boolean context.

So instead of this:

if len(a) == 0:                     # Don"t do this!
    print("a is an empty list")

or this:

if a == []:                     # Don"t do this!
    print("a is an empty list")

Do this:

if not a:
    print("a is an empty list")

Doing what"s Pythonic usually pays off in performance:

Does it pay off? (Note that less time to perform an equivalent operation is better:)

>>> import timeit
>>> min(timeit.repeat(lambda: len([]) == 0, repeat=100))
0.13775854044661884
>>> min(timeit.repeat(lambda: [] == [], repeat=100))
0.0984637276455409
>>> min(timeit.repeat(lambda: not [], repeat=100))
0.07878462291455435

For scale, here"s the cost of calling the function and constructing and returning an empty list, which you might subtract from the costs of the emptiness checks used above:

>>> min(timeit.repeat(lambda: [], repeat=100))
0.07074015751817342

We see that either checking for length with the builtin function len compared to 0 or checking against an empty list is much less performant than using the builtin syntax of the language as documented.

Why?

For the len(a) == 0 check:

First Python has to check the globals to see if len is shadowed.

Then it must call the function, load 0, and do the equality comparison in Python (instead of with C):

>>> import dis
>>> dis.dis(lambda: len([]) == 0)
  1           0 LOAD_GLOBAL              0 (len)
              2 BUILD_LIST               0
              4 CALL_FUNCTION            1
              6 LOAD_CONST               1 (0)
              8 COMPARE_OP               2 (==)
             10 RETURN_VALUE

And for the [] == [] it has to build an unnecessary list and then, again, do the comparison operation in Python"s virtual machine (as opposed to C)

>>> dis.dis(lambda: [] == [])
  1           0 BUILD_LIST               0
              2 BUILD_LIST               0
              4 COMPARE_OP               2 (==)
              6 RETURN_VALUE

The "Pythonic" way is a much simpler and faster check since the length of the list is cached in the object instance header:

>>> dis.dis(lambda: not [])
  1           0 BUILD_LIST               0
              2 UNARY_NOT
              4 RETURN_VALUE

Evidence from the C source and documentation

PyVarObject

This is an extension of PyObject that adds the ob_size field. This is only used for objects that have some notion of length. This type does not often appear in the Python/C API. It corresponds to the fields defined by the expansion of the PyObject_VAR_HEAD macro.

From the c source in Include/listobject.h:

typedef struct {
    PyObject_VAR_HEAD
    /* Vector of pointers to list elements.  list[0] is ob_item[0], etc. */
    PyObject **ob_item;

    /* ob_item contains space for "allocated" elements.  The number
     * currently in use is ob_size.
     * Invariants:
     *     0 <= ob_size <= allocated
     *     len(list) == ob_size

Response to comments:

I would point out that this is also true for the non-empty case though its pretty ugly as with l=[] then %timeit len(l) != 0 90.6 ns ± 8.3 ns, %timeit l != [] 55.6 ns ± 3.09, %timeit not not l 38.5 ns ± 0.372. But there is no way anyone is going to enjoy not not l despite triple the speed. It looks ridiculous. But the speed wins out
I suppose the problem is testing with timeit since just if l: is sufficient but surprisingly %timeit bool(l) yields 101 ns ± 2.64 ns. Interesting there is no way to coerce to bool without this penalty. %timeit l is useless since no conversion would occur.

IPython magic, %timeit, is not entirely useless here:

In [1]: l = []                                                                  

In [2]: %timeit l                                                               
20 ns ± 0.155 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)

In [3]: %timeit not l                                                           
24.4 ns ± 1.58 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [4]: %timeit not not l                                                       
30.1 ns ± 2.16 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

We can see there"s a bit of linear cost for each additional not here. We want to see the costs, ceteris paribus, that is, all else equal - where all else is minimized as far as possible:

In [5]: %timeit if l: pass                                                      
22.6 ns ± 0.963 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [6]: %timeit if not l: pass                                                  
24.4 ns ± 0.796 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [7]: %timeit if not not l: pass                                              
23.4 ns ± 0.793 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

Now let"s look at the case for an unempty list:

In [8]: l = [1]                                                                 

In [9]: %timeit if l: pass                                                      
23.7 ns ± 1.06 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [10]: %timeit if not l: pass                                                 
23.6 ns ± 1.64 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [11]: %timeit if not not l: pass                                             
26.3 ns ± 1 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

What we can see here is that it makes little difference whether you pass in an actual bool to the condition check or the list itself, and if anything, giving the list, as is, is faster.

Python is written in C; it uses its logic at the C level. Anything you write in Python will be slower. And it will likely be orders of magnitude slower unless you"re using the mechanisms built into Python directly.

Answer #6

Why is x**4.0 faster than x**4 in Python 3*?

Python 3 int objects are a full fledged object designed to support an arbitrary size; due to that fact, they are handled as such on the C level (see how all variables are declared as PyLongObject * type in long_pow). This also makes their exponentiation a lot more trickier and tedious since you need to play around with the ob_digit array it uses to represent its value to perform it. (Source for the brave. -- See: Understanding memory allocation for large integers in Python for more on PyLongObjects.)

Python float objects, on the contrary, can be transformed to a C double type (by using PyFloat_AsDouble) and operations can be performed using those native types. This is great because, after checking for relevant edge-cases, it allows Python to use the platforms" pow (C"s pow, that is) to handle the actual exponentiation:

/* Now iv and iw are finite, iw is nonzero, and iv is
 * positive and not equal to 1.0.  We finally allow
 * the platform pow to step in and do the rest.
 */
errno = 0;
PyFPE_START_PROTECT("pow", return NULL)
ix = pow(iv, iw); 

where iv and iw are our original PyFloatObjects as C doubles.

For what it"s worth: Python 2.7.13 for me is a factor 2~3 faster, and shows the inverse behaviour.

The previous fact also explains the discrepancy between Python 2 and 3 so, I thought I"d address this comment too because it is interesting.

In Python 2, you"re using the old int object that differs from the int object in Python 3 (all int objects in 3.x are of PyLongObject type). In Python 2, there"s a distinction that depends on the value of the object (or, if you use the suffix L/l):

# Python 2
type(30)  # <type "int">
type(30L) # <type "long">

The <type "int"> you see here does the same thing floats do, it gets safely converted into a C long when exponentiation is performed on it (The int_pow also hints the compiler to put "em in a register if it can do so, so that could make a difference):

static PyObject *
int_pow(PyIntObject *v, PyIntObject *w, PyIntObject *z)
{
    register long iv, iw, iz=0, ix, temp, prev;
/* Snipped for brevity */    

this allows for a good speed gain.

To see how sluggish <type "long">s are in comparison to <type "int">s, if you wrapped the x name in a long call in Python 2 (essentially forcing it to use long_pow as in Python 3), the speed gain disappears:

# <type "int">
(python2) ‚ûú python -m timeit "for x in range(1000):" " x**2"       
10000 loops, best of 3: 116 usec per loop
# <type "long"> 
(python2) ‚ûú python -m timeit "for x in range(1000):" " long(x)**2"
100 loops, best of 3: 2.12 msec per loop

Take note that, though the one snippet transforms the int to long while the other does not (as pointed out by @pydsinger), this cast is not the contributing force behind the slowdown. The implementation of long_pow is. (Time the statements solely with long(x) to see).

[...] it doesn"t happen outside of the loop. [...] Any idea about that?

This is CPython"s peephole optimizer folding the constants for you. You get the same exact timings either case since there"s no actual computation to find the result of the exponentiation, only loading of values:

dis.dis(compile("4 ** 4", "", "exec"))
  1           0 LOAD_CONST               2 (256)
              3 POP_TOP
              4 LOAD_CONST               1 (None)
              7 RETURN_VALUE

Identical byte-code is generated for "4 ** 4." with the only difference being that the LOAD_CONST loads the float 256.0 instead of the int 256:

dis.dis(compile("4 ** 4.", "", "exec"))
  1           0 LOAD_CONST               3 (256.0)
              2 POP_TOP
              4 LOAD_CONST               2 (None)
              6 RETURN_VALUE

So the times are identical.


*All of the above apply solely for CPython, the reference implementation of Python. Other implementations might perform differently.

Answer #7

Per https://docs.python.org/3/reference/lexical_analysis.html#integer-literals:

Integer literals are described by the following lexical definitions:

integer        ::=  decimalinteger | octinteger | hexinteger | bininteger
decimalinteger ::=  nonzerodigit digit* | "0"+
nonzerodigit   ::=  "1"..."9"
digit          ::=  "0"..."9"
octinteger     ::=  "0" ("o" | "O") octdigit+
hexinteger     ::=  "0" ("x" | "X") hexdigit+
bininteger     ::=  "0" ("b" | "B") bindigit+
octdigit       ::=  "0"..."7"
hexdigit       ::=  digit | "a"..."f" | "A"..."F"
bindigit       ::=  "0" | "1"

There is no limit for the length of integer literals apart from what can be stored in available memory.

Note that leading zeros in a non-zero decimal number are not allowed. This is for disambiguation with C-style octal literals, which Python used before version 3.0.

As noted here, leading zeros in a non-zero decimal number are not allowed. "0"+ is legal as a very special case, which wasn"t present in Python 2:

integer        ::=  decimalinteger | octinteger | hexinteger | bininteger
decimalinteger ::=  nonzerodigit digit* | "0"
octinteger     ::=  "0" ("o" | "O") octdigit+ | "0" octdigit+

SVN commit r55866 implemented PEP 3127 in the tokenizer, which forbids the old 0<octal> numbers. However, curiously, it also adds this note:

/* in any case, allow "0" as a literal */

with a special nonzero flag that only throws a SyntaxError if the following sequence of digits contains a nonzero digit.

This is odd because PEP 3127 does not allow this case:

This PEP proposes that the ability to specify an octal number by using a leading zero will be removed from the language in Python 3.0 (and the Python 3.0 preview mode of 2.6), and that a SyntaxError will be raised whenever a leading "0" is immediately followed by another digit.

(emphasis mine)

So, the fact that multiple zeros are allowed is technically violating the PEP, and was basically implemented as a special case by Georg Brandl. He made the corresponding documentation change to note that "0"+ was a valid case for decimalinteger (previously that had been covered under octinteger).

We"ll probably never know exactly why Georg chose to make "0"+ valid - it may forever remain an odd corner case in Python.


UPDATE [28 Jul 2015]: This question led to a lively discussion thread on python-ideas in which Georg chimed in:

Steven D"Aprano wrote:

Why was it defined that way? [...] Why would we write 0000 to get zero?

I could tell you, but then I"d have to kill you.

Georg

Later on, the thread spawned this bug report aiming to get rid of this special case. Here, Georg says:

I don"t recall the reason for this deliberate change (as seen from the docs change).

I"m unable to come up with a good reason for this change now [...]

and thus we have it: the precise reason behind this inconsistency is lost to time.

Finally, note that the bug report was rejected: leading zeros will continue to be accepted only on zero integers for the rest of Python 3.x.

Answer #8

The not operator (logical negation)

Probably the best way is using the operator not:

>>> value = True
>>> not value
False

>>> value = False
>>> not value
True

So instead of your code:

if bool == True:
    return False
else:
    return True

You could use:

return not bool

The logical negation as function

There are also two functions in the operator module operator.not_ and it"s alias operator.__not__ in case you need it as function instead of as operator:

>>> import operator
>>> operator.not_(False)
True
>>> operator.not_(True)
False

These can be useful if you want to use a function that requires a predicate-function or a callback.

For example map or filter:

>>> lst = [True, False, True, False]
>>> list(map(operator.not_, lst))
[False, True, False, True]

>>> lst = [True, False, True, False]
>>> list(filter(operator.not_, lst))
[False, False]

Of course the same could also be achieved with an equivalent lambda function:

>>> my_not_function = lambda item: not item

>>> list(map(my_not_function, lst))
[False, True, False, True]

Do not use the bitwise invert operator ~ on booleans

One might be tempted to use the bitwise invert operator ~ or the equivalent operator function operator.inv (or one of the other 3 aliases there). But because bool is a subclass of int the result could be unexpected because it doesn"t return the "inverse boolean", it returns the "inverse integer":

>>> ~True
-2
>>> ~False
-1

That"s because True is equivalent to 1 and False to 0 and bitwise inversion operates on the bitwise representation of the integers 1 and 0.

So these cannot be used to "negate" a bool.

Negation with NumPy arrays (and subclasses)

If you"re dealing with NumPy arrays (or subclasses like pandas.Series or pandas.DataFrame) containing booleans you can actually use the bitwise inverse operator (~) to negate all booleans in an array:

>>> import numpy as np
>>> arr = np.array([True, False, True, False])
>>> ~arr
array([False,  True, False,  True])

Or the equivalent NumPy function:

>>> np.bitwise_not(arr)
array([False,  True, False,  True])

You cannot use the not operator or the operator.not function on NumPy arrays because these require that these return a single bool (not an array of booleans), however NumPy also contains a logical not function that works element-wise:

>>> np.logical_not(arr)
array([False,  True, False,  True])

That can also be applied to non-boolean arrays:

>>> arr = np.array([0, 1, 2, 0])
>>> np.logical_not(arr)
array([ True, False, False,  True])

Customizing your own classes

not works by calling bool on the value and negate the result. In the simplest case the truth value will just call __bool__ on the object.

So by implementing __bool__ (or __nonzero__ in Python 2) you can customize the truth value and thus the result of not:

class Test(object):
    def __init__(self, value):
        self._value = value

    def __bool__(self):
        print("__bool__ called on {!r}".format(self))
        return bool(self._value)

    __nonzero__ = __bool__  # Python 2 compatibility

    def __repr__(self):
        return "{self.__class__.__name__}({self._value!r})".format(self=self)

I added a print statement so you can verify that it really calls the method:

>>> a = Test(10)
>>> not a
__bool__ called on Test(10)
False

Likewise you could implement the __invert__ method to implement the behavior when ~ is applied:

class Test(object):
    def __init__(self, value):
        self._value = value

    def __invert__(self):
        print("__invert__ called on {!r}".format(self))
        return not self._value

    def __repr__(self):
        return "{self.__class__.__name__}({self._value!r})".format(self=self)

Again with a print call to see that it is actually called:

>>> a = Test(True)
>>> ~a
__invert__ called on Test(True)
False

>>> a = Test(False)
>>> ~a
__invert__ called on Test(False)
True

However implementing __invert__ like that could be confusing because it"s behavior is different from "normal" Python behavior. If you ever do that clearly document it and make sure that it has a pretty good (and common) use-case.

Answer #9

I"m getting an error in the IF conditional. What am I doing wrong?

There reason that you get a SyntaxError is that there is no && operator in Python. Likewise || and ! are not valid Python operators.

Some of the operators you may know from other languages have a different name in Python. The logical operators && and || are actually called and and or. Likewise the logical negation operator ! is called not.

So you could just write:

if len(a) % 2 == 0 and len(b) % 2 == 0:

or even:

if not (len(a) % 2 or len(b) % 2):

Some additional information (that might come in handy):

I summarized the operator "equivalents" in this table:

+------------------------------+---------------------+
|  Operator (other languages)  |  Operator (Python)  |
+==============================+=====================+
|              &&              |         and         |
+------------------------------+---------------------+
|              ||              |         or          |
+------------------------------+---------------------+
|              !               |         not         |
+------------------------------+---------------------+

See also Python documentation: 6.11. Boolean operations.

Besides the logical operators Python also has bitwise/binary operators:

+--------------------+--------------------+
|  Logical operator  |  Bitwise operator  |
+====================+====================+
|        and         |         &          |
+--------------------+--------------------+
|         or         |         |          |
+--------------------+--------------------+

There is no bitwise negation in Python (just the bitwise inverse operator ~ - but that is not equivalent to not).

See also 6.6. Unary arithmetic and bitwise/binary operations and 6.7. Binary arithmetic operations.

The logical operators (like in many other languages) have the advantage that these are short-circuited. That means if the first operand already defines the result, then the second operator isn"t evaluated at all.

To show this I use a function that simply takes a value, prints it and returns it again. This is handy to see what is actually evaluated because of the print statements:

>>> def print_and_return(value):
...     print(value)
...     return value

>>> res = print_and_return(False) and print_and_return(True)
False

As you can see only one print statement is executed, so Python really didn"t even look at the right operand.

This is not the case for the binary operators. Those always evaluate both operands:

>>> res = print_and_return(False) & print_and_return(True);
False
True

But if the first operand isn"t enough then, of course, the second operator is evaluated:

>>> res = print_and_return(True) and print_and_return(False);
True
False

To summarize this here is another Table:

+-----------------+-------------------------+
|   Expression    |  Right side evaluated?  |
+=================+=========================+
| `True` and ...  |           Yes           |
+-----------------+-------------------------+
| `False` and ... |           No            |
+-----------------+-------------------------+
|  `True` or ...  |           No            |
+-----------------+-------------------------+
| `False` or ...  |           Yes           |
+-----------------+-------------------------+

The True and False represent what bool(left-hand-side) returns, they don"t have to be True or False, they just need to return True or False when bool is called on them (1).

So in Pseudo-Code(!) the and and or functions work like these:

def and(expr1, expr2):
    left = evaluate(expr1)
    if bool(left):
        return evaluate(expr2)
    else:
        return left

def or(expr1, expr2):
    left = evaluate(expr1)
    if bool(left):
        return left
    else:
        return evaluate(expr2)

Note that this is pseudo-code not Python code. In Python you cannot create functions called and or or because these are keywords. Also you should never use "evaluate" or if bool(...).

Customizing the behavior of your own classes

This implicit bool call can be used to customize how your classes behave with and, or and not.

To show how this can be customized I use this class which again prints something to track what is happening:

class Test(object):
    def __init__(self, value):
        self.value = value

    def __bool__(self):
        print("__bool__ called on {!r}".format(self))
        return bool(self.value)

    __nonzero__ = __bool__  # Python 2 compatibility

    def __repr__(self):
        return "{self.__class__.__name__}({self.value})".format(self=self)

So let"s see what happens with that class in combination with these operators:

>>> if Test(True) and Test(False):
...     pass
__bool__ called on Test(True)
__bool__ called on Test(False)

>>> if Test(False) or Test(False):
...     pass
__bool__ called on Test(False)
__bool__ called on Test(False)

>>> if not Test(True):
...     pass
__bool__ called on Test(True)

If you don"t have a __bool__ method then Python also checks if the object has a __len__ method and if it returns a value greater than zero. That might be useful to know in case you create a sequence container.

See also 4.1. Truth Value Testing.

NumPy arrays and subclasses

Probably a bit beyond the scope of the original question but in case you"re dealing with NumPy arrays or subclasses (like Pandas Series or DataFrames) then the implicit bool call will raise the dreaded ValueError:

>>> import numpy as np
>>> arr = np.array([1,2,3])
>>> bool(arr)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
>>> arr and arr
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

>>> import pandas as pd
>>> s = pd.Series([1,2,3])
>>> bool(s)
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
>>> s and s
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

In these cases you can use the logical and function from NumPy which performs an element-wise and (or or):

>>> np.logical_and(np.array([False,False,True,True]), np.array([True, False, True, False]))
array([False, False,  True, False])
>>> np.logical_or(np.array([False,False,True,True]), np.array([True, False, True, False]))
array([ True, False,  True,  True])

If you"re dealing just with boolean arrays you could also use the binary operators with NumPy, these do perform element-wise (but also binary) comparisons:

>>> np.array([False,False,True,True]) & np.array([True, False, True, False])
array([False, False,  True, False])
>>> np.array([False,False,True,True]) | np.array([True, False, True, False])
array([ True, False,  True,  True])

(1)

That the bool call on the operands has to return True or False isn"t completely correct. It"s just the first operand that needs to return a boolean in it"s __bool__ method:

class Test(object):
    def __init__(self, value):
        self.value = value

    def __bool__(self):
        return self.value

    __nonzero__ = __bool__  # Python 2 compatibility

    def __repr__(self):
        return "{self.__class__.__name__}({self.value})".format(self=self)

>>> x = Test(10) and Test(10)
TypeError: __bool__ should return bool, returned int
>>> x1 = Test(True) and Test(10)
>>> x2 = Test(False) and Test(10)

That"s because and actually returns the first operand if the first operand evaluates to False and if it evaluates to True then it returns the second operand:

>>> x1
Test(10)
>>> x2
Test(False)

Similarly for or but just the other way around:

>>> Test(True) or Test(10)
Test(True)
>>> Test(False) or Test(10)
Test(10)

However if you use them in an if statement the if will also implicitly call bool on the result. So these finer points may not be relevant for you.

Answer #10

THIS ANSWER: aims to provide a detailed, graph/hardware-level description of the issue - including TF2 vs. TF1 train loops, input data processors, and Eager vs. Graph mode executions. For an issue summary & resolution guidelines, see my other answer.


PERFORMANCE VERDICT: sometimes one is faster, sometimes the other, depending on configuration. As far as TF2 vs TF1 goes, they"re about on par on average, but significant config-based differences do exist, and TF1 trumps TF2 more often than vice versa. See "BENCHMARKING" below.


EAGER VS. GRAPH: the meat of this entire answer for some: TF2"s eager is slower than TF1"s, according to my testing. Details further down.

The fundamental difference between the two is: Graph sets up a computational network proactively, and executes when "told to" - whereas Eager executes everything upon creation. But the story only begins here:

  • Eager is NOT devoid of Graph, and may in fact be mostly Graph, contrary to expectation. What it largely is, is executed Graph - this includes model & optimizer weights, comprising a great portion of the graph.

  • Eager rebuilds part of own graph at execution; direct consequence of Graph not being fully built -- see profiler results. This has a computational overhead.

  • Eager is slower w/ Numpy inputs; per this Git comment & code, Numpy inputs in Eager include the overhead cost of copying tensors from CPU to GPU. Stepping through source code, data handling differences are clear; Eager directly passes Numpy, while Graph passes tensors which then evaluate to Numpy; uncertain of the exact process, but latter should involve GPU-level optimizations

  • TF2 Eager is slower than TF1 Eager - this is... unexpected. See benchmarking results below. Differences span from negligible to significant, but are consistent. Unsure why it"s the case - if a TF dev clarifies, will update answer.


TF2 vs. TF1: quoting relevant portions of a TF dev"s, Q. Scott Zhu"s, response - w/ bit of my emphasis & rewording:

In eager, the runtime needs to execute the ops and return the numerical value for every line of python code. The nature of single step execution causes it to be slow.

In TF2, Keras leverages tf.function to build its graph for training, eval and prediction. We call them "execution function" for the model. In TF1, the "execution function" was a FuncGraph, which shared some common component as TF function, but has a different implementation.

During the process, we somehow left an incorrect implementation for train_on_batch(), test_on_batch() and predict_on_batch(). They are still numerically correct, but the execution function for x_on_batch is a pure python function, rather than a tf.function wrapped python function. This will cause slowness

In TF2, we convert all input data into a tf.data.Dataset, by which we can unify our execution function to handle the single type of the inputs. There might be some overhead in the dataset conversion, and I think this is a one-time only overhead, rather than a per-batch cost

With the last sentence of last paragraph above, and last clause of below paragraph:

To overcome the slowness in eager mode, we have @tf.function, which will turn a python function into a graph. When feed numerical value like np array, the body of the tf.function is converted into static graph, being optimized, and return the final value, which is fast and should have similar performance as TF1 graph mode.

I disagree - per my profiling results, which show Eager"s input data processing to be substantially slower than Graph"s. Also, unsure about tf.data.Dataset in particular, but Eager does repeatedly call multiple of the same data conversion methods - see profiler.

Lastly, dev"s linked commit: Significant number of changes to support the Keras v2 loops.


Train Loops: depending on (1) Eager vs. Graph; (2) input data format, training in will proceed with a distinct train loop - in TF2, _select_training_loop(), training.py, one of:

training_v2.Loop()
training_distributed.DistributionMultiWorkerTrainingLoop(
              training_v2.Loop()) # multi-worker mode
# Case 1: distribution strategy
training_distributed.DistributionMultiWorkerTrainingLoop(
            training_distributed.DistributionSingleWorkerTrainingLoop())
# Case 2: generator-like. Input is Python generator, or Sequence object,
# or a non-distributed Dataset or iterator in eager execution.
training_generator.GeneratorOrSequenceTrainingLoop()
training_generator.EagerDatasetOrIteratorTrainingLoop()
# Case 3: Symbolic tensors or Numpy array-like. This includes Datasets and iterators 
# in graph mode (since they generate symbolic tensors).
training_generator.GeneratorLikeTrainingLoop() # Eager
training_arrays.ArrayLikeTrainingLoop() # Graph

Each handles resource allocation differently, and bears consequences on performance & capability.


Train Loops: fit vs train_on_batch, keras vs. tf.keras: each of the four uses different train loops, though perhaps not in every possible combination. keras" fit, for example, uses a form of fit_loop, e.g. training_arrays.fit_loop(), and its train_on_batch may use K.function(). tf.keras has a more sophisticated hierarchy described in part in previous section.


Train Loops: documentation -- relevant source docstring on some of the different execution methods:

Unlike other TensorFlow operations, we don"t convert python numerical inputs to tensors. Moreover, a new graph is generated for each distinct python numerical value

function instantiates a separate graph for every unique set of input shapes and datatypes.

A single tf.function object might need to map to multiple computation graphs under the hood. This should be visible only as performance (tracing graphs has a nonzero computational and memory cost)


Input data processors: similar to above, the processor is selected case-by-case, depending on internal flags set according to runtime configurations (execution mode, data format, distribution strategy). The simplest case"s with Eager, which works directly w/ Numpy arrays. For some specific examples, see this answer.


MODEL SIZE, DATA SIZE:

  • Is decisive; no single configuration crowned itself atop all model & data sizes.
  • Data size relative to model size is important; for small data & model, data transfer (e.g. CPU to GPU) overhead can dominate. Likewise, small overhead processors can run slower on large data per data conversion time dominating (see convert_to_tensor in "PROFILER")
  • Speed differs per train loops" and input data processors" differing means of handling resources.

BENCHMARKS: the grinded meat. -- Word Document -- Excel Spreadsheet


Terminology:

  • %-less numbers are all seconds
  • % computed as (1 - longer_time / shorter_time)*100; rationale: we"re interested by what factor one is faster than the other; shorter / longer is actually a non-linear relation, not useful for direct comparison
  • % sign determination:
    • TF2 vs TF1: + if TF2 is faster
    • GvE (Graph vs. Eager): + if Graph is faster
  • TF2 = TensorFlow 2.0.0 + Keras 2.3.1; TF1 = TensorFlow 1.14.0 + Keras 2.2.5

PROFILER:


PROFILER - Explanation: Spyder 3.3.6 IDE profiler.

  • Some functions are repeated in nests of others; hence, it"s hard to track down the exact separation between "data processing" and "training" functions, so there will be some overlap - as pronounced in the very last result.

  • % figures computed w.r.t. runtime minus build time

  • Build time computed by summing all (unique) runtimes which were called 1 or 2 times
  • Train time computed by summing all (unique) runtimes which were called the same # of times as the # of iterations, and some of their nests" runtimes
  • Functions are profiled according to their original names, unfortunately (i.e. _func = func will profile as func), which mixes in build time - hence the need to exclude it

TESTING ENVIRONMENT:

  • Executed code at bottom w/ minimal background tasks running
  • GPU was "warmed up" w/ a few iterations before timing iterations, as suggested in this post
  • CUDA 10.0.130, cuDNN 7.6.0, TensorFlow 1.14.0, & TensorFlow 2.0.0 built from source, plus Anaconda
  • Python 3.7.4, Spyder 3.3.6 IDE
  • GTX 1070, Windows 10, 24GB DDR4 2.4-MHz RAM, i7-7700HQ 2.8-GHz CPU

METHODOLOGY:

  • Benchmark "small", "medium", & "large" model & data sizes
  • Fix # of parameters for each model size, independent of input data size
  • "Larger" model has more parameters and layers
  • "Larger" data has a longer sequence, but same batch_size and num_channels
  • Models only use Conv1D, Dense "learnable" layers; RNNs avoided per TF-version implem. differences
  • Always ran one train fit outside of benchmarking loop, to omit model & optimizer graph building
  • Not using sparse data (e.g. layers.Embedding()) or sparse targets (e.g. SparseCategoricalCrossEntropy()

LIMITATIONS: a "complete" answer would explain every possible train loop & iterator, but that"s surely beyond my time ability, nonexistent paycheck, or general necessity. The results are only as good as the methodology - interpret with an open mind.


CODE:

import numpy as np
import tensorflow as tf
import random
from termcolor import cprint
from time import time

from tensorflow.keras.layers import Input, Dense, Conv1D
from tensorflow.keras.layers import Dropout, GlobalAveragePooling1D
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
import tensorflow.keras.backend as K
#from keras.layers import Input, Dense, Conv1D
#from keras.layers import Dropout, GlobalAveragePooling1D
#from keras.models import Model 
#from keras.optimizers import Adam
#import keras.backend as K

#tf.compat.v1.disable_eager_execution()
#tf.enable_eager_execution()

def reset_seeds(reset_graph_with_backend=None, verbose=1):
    if reset_graph_with_backend is not None:
        K = reset_graph_with_backend
        K.clear_session()
        tf.compat.v1.reset_default_graph()
        if verbose:
            print("KERAS AND TENSORFLOW GRAPHS RESET")

    np.random.seed(1)
    random.seed(2)
    if tf.__version__[0] == "2":
        tf.random.set_seed(3)
    else:
        tf.set_random_seed(3)
    if verbose:
        print("RANDOM SEEDS RESET")

print("TF version: {}".format(tf.__version__))
reset_seeds()

def timeit(func, iterations, *args, _verbose=0, **kwargs):
    t0 = time()
    for _ in range(iterations):
        func(*args, **kwargs)
        print(end="."*int(_verbose))
    print("Time/iter: %.4f sec" % ((time() - t0) / iterations))

def make_model_small(batch_shape):
    ipt   = Input(batch_shape=batch_shape)
    x     = Conv1D(128, 40, strides=4, padding="same")(ipt)
    x     = GlobalAveragePooling1D()(x)
    x     = Dropout(0.5)(x)
    x     = Dense(64, activation="relu")(x)
    out   = Dense(1,  activation="sigmoid")(x)
    model = Model(ipt, out)
    model.compile(Adam(lr=1e-4), "binary_crossentropy")
    return model

def make_model_medium(batch_shape):
    ipt = Input(batch_shape=batch_shape)
    x = ipt
    for filters in [64, 128, 256, 256, 128, 64]:
        x  = Conv1D(filters, 20, strides=1, padding="valid")(x)
    x     = GlobalAveragePooling1D()(x)
    x     = Dense(256, activation="relu")(x)
    x     = Dropout(0.5)(x)
    x     = Dense(128, activation="relu")(x)
    x     = Dense(64,  activation="relu")(x)
    out   = Dense(1,   activation="sigmoid")(x)
    model = Model(ipt, out)
    model.compile(Adam(lr=1e-4), "binary_crossentropy")
    return model

def make_model_large(batch_shape):
    ipt   = Input(batch_shape=batch_shape)
    x     = Conv1D(64,  400, strides=4, padding="valid")(ipt)
    x     = Conv1D(128, 200, strides=1, padding="valid")(x)
    for _ in range(40):
        x = Conv1D(256,  12, strides=1, padding="same")(x)
    x     = Conv1D(512,  20, strides=2, padding="valid")(x)
    x     = Conv1D(1028, 10, strides=2, padding="valid")(x)
    x     = Conv1D(256,   1, strides=1, padding="valid")(x)
    x     = GlobalAveragePooling1D()(x)
    x     = Dense(256, activation="relu")(x)
    x     = Dropout(0.5)(x)
    x     = Dense(128, activation="relu")(x)
    x     = Dense(64,  activation="relu")(x)    
    out   = Dense(1,   activation="sigmoid")(x)
    model = Model(ipt, out)
    model.compile(Adam(lr=1e-4), "binary_crossentropy")
    return model

def make_data(batch_shape):
    return np.random.randn(*batch_shape), 
           np.random.randint(0, 2, (batch_shape[0], 1))

def make_data_tf(batch_shape, n_batches, iters):
    data = np.random.randn(n_batches, *batch_shape),
    trgt = np.random.randint(0, 2, (n_batches, batch_shape[0], 1))
    return tf.data.Dataset.from_tensor_slices((data, trgt))#.repeat(iters)

batch_shape_small  = (32, 140,   30)
batch_shape_medium = (32, 1400,  30)
batch_shape_large  = (32, 14000, 30)

batch_shapes = batch_shape_small, batch_shape_medium, batch_shape_large
make_model_fns = make_model_small, make_model_medium, make_model_large
iterations = [200, 100, 50]
shape_names = ["Small data",  "Medium data",  "Large data"]
model_names = ["Small model", "Medium model", "Large model"]

def test_all(fit=False, tf_dataset=False):
    for model_fn, model_name, iters in zip(make_model_fns, model_names, iterations):
        for batch_shape, shape_name in zip(batch_shapes, shape_names):
            if (model_fn is make_model_large) and (batch_shape == batch_shape_small):
                continue
            reset_seeds(reset_graph_with_backend=K)
            if tf_dataset:
                data = make_data_tf(batch_shape, iters, iters)
            else:
                data = make_data(batch_shape)
            model = model_fn(batch_shape)

            if fit:
                if tf_dataset:
                    model.train_on_batch(data.take(1))
                    t0 = time()
                    model.fit(data, steps_per_epoch=iters)
                    print("Time/iter: %.4f sec" % ((time() - t0) / iters))
                else:
                    model.train_on_batch(*data)
                    timeit(model.fit, iters, *data, _verbose=1, verbose=0)
            else:
                model.train_on_batch(*data)
                timeit(model.train_on_batch, iters, *data, _verbose=1)
            cprint(">> {}, {} done <<
".format(model_name, shape_name), "blue")
            del model

test_all(fit=True, tf_dataset=False)

Tutorials