“Large data” workflows using pandas


I have tried to puzzle out an answer to this question for many months while learning pandas. I use SAS for my day-to-day work, and it is great for its out-of-core support. However, SAS is horrible as a piece of software for numerous other reasons.

One day I hope to replace my use of SAS with python and pandas, but I currently lack an out-of-core workflow for large datasets. I'm not talking about "big data" that requires a distributed network, but rather files too large to fit in memory but small enough to fit on a hard-drive.

My first thought is to use HDFStore to hold large datasets on disk and pull only the pieces I need into dataframes for analysis. Others have mentioned MongoDB as an easier to use alternative. My question is this:

What are some best-practice workflows for accomplishing the following:

  1. Loading flat files into a permanent, on-disk database structure
  2. Querying that database to retrieve data to feed into a pandas data structure
  3. Updating the database after manipulating pieces in pandas

Real-world examples would be much appreciated, especially from anyone who uses pandas on "large data".

Edit -- an example of how I would like this to work:

  1. Iteratively import a large flat-file and store it in a permanent, on-disk database structure. These files are typically too large to fit in memory.
  2. In order to use Pandas, I would like to read subsets of this data (usually just a few columns at a time) that can fit in memory.
  3. I would create new columns by performing various operations on the selected columns.
  4. I would then have to append these new columns into the database structure.

I am trying to find a best-practice way of performing these steps. Reading links about pandas and pytables it seems that appending a new column could be a problem.

Edit -- Responding to Jeff's questions specifically:

  1. I am building consumer credit risk models. The kinds of data include phone, SSN and address characteristics; property values; derogatory information like criminal records, bankruptcies, etc. The datasets I use every day have roughly 1,000 to 2,000 fields of mixed data types: continuous, nominal and ordinal variables of both numeric and character data. I rarely append rows, but I do perform many operations that create new columns.
  2. Typical operations involve combining several columns using conditional logic into a new, compound column. For example, if var1 > 2 then newvar = "A" elif var2 = 4 then newvar = "B" (a pandas sketch of this kind of operation appears after this list). The result of these operations is a new column for every record in my dataset.
  3. Finally, I would like to append these new columns into the on-disk data structure. I would repeat step 2, exploring the data with crosstabs and descriptive statistics trying to find interesting, intuitive relationships to model.
  4. A typical project file is usually about 1 GB. Files are organized such that each row is a record of consumer data. Every row has the same number of columns. This will always be the case.
  5. It"s pretty rare that I would subset by rows when creating a new column. However, it"s pretty common for me to subset on rows when creating reports or generating descriptive statistics. For example, I might want to create a simple frequency for a specific line of business, say Retail credit cards. To do this, I would select only those records where the line of business = retail in addition to whichever columns I want to report on. When creating new columns, however, I would pull all rows of data and only the columns I need for the operations.
  6. The modeling process requires that I analyze every column, look for interesting relationships with some outcome variable, and create new compound columns that describe those relationships. I usually explore the columns in small sets. For example, I will focus on a set of, say, 20 columns just dealing with property values and observe how they relate to defaulting on a loan. Once those are explored and new columns are created, I then move on to another group of columns, say college education, and repeat the process. What I'm doing is creating candidate variables that explain the relationship between my data and some outcome. At the very end of this process, I apply some learning techniques that create an equation out of those compound columns.
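
A minimal sketch of the kind of conditional column creation described in step 2, in pandas (the column names and the fallback value are made up for illustration; np.select is just one convenient way to express it):

import numpy as np
import pandas as pd

# toy frame standing in for the columns pulled off disk
df = pd.DataFrame({"var1": [1, 3, 0, 5], "var2": [4, 1, 4, 2]})

# "if var1 > 2 then newvar = 'A' elif var2 == 4 then newvar = 'B'"
conditions = [df["var1"] > 2, df["var2"] == 4]
choices = ["A", "B"]
df["newvar"] = np.select(conditions, choices, default="C")  # "C" = everything else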

It is rare that I would ever add rows to the dataset. I will nearly always be creating new columns (variables or features in statistics/machine learning parlance).

Answer rating: 686

I routinely use tens of gigabytes of data in just this fashion, e.g. I have tables on disk that I read via queries, create data from, and append back.

It"s worth reading the docs and late in this thread for several suggestions for how to store your data.

Give as much detail as you can, and I can help you develop a structure. Details that will affect how you store your data include:

  1. Size of data, # of rows, columns, types of columns; are you appending rows, or just columns?
  2. What will typical operations look like? E.g. do a query on columns to select a bunch of rows and specific columns, then do an operation (in memory), create new columns, and save these.
    (Giving a toy example could enable us to offer more specific recommendations.)
  3. After that processing, then what do you do? Is step 2 ad hoc, or repeatable?
  4. Input flat files: how many, and what rough total size in GB? How are these organized, e.g. by records? Does each one contain different fields, or do they have some records per file with all of the fields in each file?
  5. Do you ever select subsets of rows (records) based on criteria (e.g. select the rows with field A > 5)? and then do something, or do you just select fields A, B, C with all of the records (and then do something)?
  6. Do you "work on" all of your columns (in groups), or are there a good proportion that you may only use for reports (e.g. you want to keep the data around, but don"t need to pull in that column explicity until final results time)?



Solution

Ensure you have pandas at least 0.10.1 installed.

Read the docs on iterating over files chunk-by-chunk and on multiple-table queries.

Since pytables is optimized to operate row-wise (which is what you query on), we will create a table for each group of fields. This way it's easy to select a small group of fields (which would also work with one big table, but it's more efficient to do it this way... I think I may be able to fix this limitation in the future... this is more intuitive anyhow):
(The following is pseudocode.)

import numpy as np
import pandas as pd

# create a store
store = pd.HDFStore("mystore.h5")

# this is the key to your storage:
#    this maps your fields to a specific group, and defines 
#    what you want to have as data_columns.
#    you might want to create a nice class wrapping this
#    (as you will want to have this map and its inversion)  
group_map = dict(
    A = dict(fields = ["field_1","field_2",.....], dc = ["field_1",....,"field_5"]),
    B = dict(fields = ["field_10",......        ], dc = ["field_10"]),
    .....
    REPORTING_ONLY = dict(fields = ["field_1000","field_1001",...], dc = []),

)

group_map_inverted = dict()
for g, v in group_map.items():
    group_map_inverted.update(dict([ (f,g) for f in v["fields"] ]))

Reading in the files and creating the storage (essentially doing what append_to_multiple does):

for f in files:
    # read in the file; additional options may be necessary here
    # the chunksize is not strictly necessary, you may be able to slurp each
    # file into memory, in which case just eliminate this part of the loop
    # (you can also change chunksize if necessary)
    for chunk in pd.read_table(f, chunksize=50000):
        # we are going to append to each table by group
        # we are not going to create indexes at this time
        # but we *ARE* going to create (some) data_columns

        # figure out the field groupings
        for g, v in group_map.items():
            # create the frame for this group
            frame = chunk.reindex(columns=v["fields"], copy=False)

            # append it
            store.append(g, frame, index=False, data_columns=v["dc"])

Now you have all of the tables in the file (actually you could store them in separate files if you wish; you would probably have to add the filename to the group_map, but that probably isn't necessary).

This is how you get columns and create new ones:

frame = store.select(group_that_I_want)
# you can optionally specify:
# columns = a list of the columns IN THAT GROUP (if you wanted to
#     select only say 3 out of the 20 columns in this sub-table)
# and a where clause if you want a subset of the rows

# do calculations on this frame
new_frame = cool_function_on_frame(frame)

# to "add columns", create a new group (you probably want to
# limit the columns in this new_group to be only NEW ones
# (e.g. so you don"t overlap from the other tables)
# add this info to the group_map
store.append(new_group, new_frame.reindex(columns = new_columns_created, copy = False), data_columns = new_columns_created)

When you are ready for post_processing:

# This may be a bit tricky; and depends what you are actually doing.
# I may need to modify this function to be a bit more general:
report_data = store.select_as_multiple([group_1, group_2, .....],
                                       where=["field_1>0", "field_1000=foo"],
                                       selector=group_1)

About data_columns: you don't actually need to define ANY data_columns; they allow you to sub-select rows based on the column. E.g. something like:

store.select(group, where = ["field_1000=foo", "field_1001>0"])

They may be most interesting to you in the final report generation stage (essentially a data column is segregated from other columns, which might impact efficiency somewhat if you define a lot).

You also might want to:

  • create a function which takes a list of fields, looks up the groups in the group_map, then selects these and concatenates the results so you get the resulting frame (this is essentially what select_as_multiple does). This way the structure would be pretty transparent to you.
  • create indexes on certain data columns (this makes row-subsetting much faster).
  • enable compression (a short sketch of these last two points follows this list).
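
For example, a rough sketch of those last two points, reusing the group_map from above (the compression settings only apply to data written through a store opened with them; an existing uncompressed file would need to be repacked, e.g. with ptrepack):

# open the store with compression enabled for anything appended through it
store = pd.HDFStore("mystore.h5", complevel=9, complib="blosc")

# once loading is finished, index the data_columns you filter on most,
# so where= clauses on those fields are fast
for g, v in group_map.items():
    if v["dc"]:
        store.create_table_index(g, columns=v["dc"], optlevel=9, kind="full")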

Let me know when you have questions!

Answer rating: 162

I think the answers above are missing a simple approach that I've found very useful.

When I have a file that is too large to load in memory, I break up the file into multiple smaller files (either by rows or columns).

Example: in the case of 30 days' worth of trading data of ~30 GB, I break it into one file per day of ~1 GB. I subsequently process each file separately and aggregate the results at the end.

One of the biggest advantages is that it allows parallel processing of the files (with either multiple threads or processes), as sketched below.
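
A rough sketch of that idea with a process pool (the file pattern, column names and per-file aggregation are made up for illustration):

import glob
from multiprocessing import Pool

import pandas as pd

def process_one(path):
    # each worker handles one day's file independently
    df = pd.read_csv(path)
    return df.groupby("symbol")["volume"].sum()

if __name__ == "__main__":
    paths = sorted(glob.glob("trades_2013-*.csv"))
    with Pool(4) as pool:
        daily = pool.map(process_one, paths)
    # combine the per-file aggregates at the end
    result = pd.concat(daily).groupby(level=0).sum()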

The other advantage is that file manipulation (like adding/removing dates in the example) can be accomplished with regular shell commands, which is not possible with more advanced/complicated file formats.

This approach doesn"t cover all scenarios, but is very useful in a lot of them

Answer rating: 103

There is now, two years after the question, an "out-of-core" pandas equivalent: dask. It is excellent! Though it does not support all of pandas' functionality, you can get really far with it. Update: in the two years since, it has been consistently maintained and there is a substantial user community working with Dask.
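
For a flavour of the API, a tiny sketch (the file pattern and column names are made up):

import dask.dataframe as dd

df = dd.read_csv("trades_2013-*.csv")        # lazily points at many files on disk
out = df.groupby("symbol")["volume"].mean()  # builds a task graph, nothing runs yet
print(out.compute())                         # executes out-of-core, in parallel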

And now, four years after the question, there is another high-performance "out-of-core" pandas equivalent: Vaex. It "uses memory mapping, zero memory copy policy and lazy computations for best performance (no memory wasted)." It can handle data sets of billions of rows and does not store them in memory (making it even possible to do analysis on suboptimal hardware).





"Large data" workflows using pandas: StackOverflow Questions

Create list of single item repeated N times

I want to create a series of lists, all of varying lengths. Each list will contain the same element e, repeated n times (where n = length of the list).

How do I create the lists, without using a list comprehension [e for number in xrange(n)] for each list?
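
For the record, the usual idiom here is sequence multiplication, with the caveat that for a mutable e every entry refers to the same object:

e, n = 5, 4
lst = [e] * n              # [5, 5, 5, 5]

# caveat: with a mutable e, all n entries are the *same* object
row = [[]] * 3
row[0].append(1)           # row is now [[1], [1], [1]]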

What is the best way to repeatedly execute a function every x seconds?

Question by DavidM

I want to repeatedly execute a function in Python every 60 seconds forever (just like an NSTimer in Objective C). This code will run as a daemon and is effectively like calling the python script every minute using a cron, but without requiring that to be set up by the user.

In this question about a cron implemented in Python, the solution appears to effectively just sleep() for x seconds. I don't need such advanced functionality, so perhaps something like this would work:

import time

while True:
    # Code executed here
    time.sleep(60)

Are there any foreseeable problems with this code?
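
One foreseeable problem is drift: each pass takes the work time plus 60 seconds, so the schedule slowly slides. A sketch of a drift-compensating variant (do_work is a placeholder for the real job):

import time

def do_work():
    pass  # placeholder

start = time.monotonic()
while True:
    do_work()
    # sleep until the next 60-second boundary relative to start,
    # so the period stays fixed even if do_work takes a while
    time.sleep(60.0 - ((time.monotonic() - start) % 60.0))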

How can I tell if a string repeats itself in Python?

I"m looking for a way to test whether or not a given string repeats itself for the entire string or not.

Examples:

[
    "0045662100456621004566210045662100456621",             # "00456621"
    "0072992700729927007299270072992700729927",             # "00729927"
    "001443001443001443001443001443001443001443",           # "001443"
    "037037037037037037037037037037037037037037037",        # "037"
    "047619047619047619047619047619047619047619",           # "047619"
    "002457002457002457002457002457002457002457",           # "002457"
    "001221001221001221001221001221001221001221",           # "001221"
    "001230012300123001230012300123001230012300123",        # "00123"
    "0013947001394700139470013947001394700139470013947",    # "0013947"
    "001001001001001001001001001001001001001001001001001",  # "001"
    "001406469760900140646976090014064697609",              # "0014064697609"
]

are strings which repeat themselves, and

[
    "004608294930875576036866359447",
    "00469483568075117370892018779342723",
    "004739336492890995260663507109",
    "001508295625942684766214177978883861236802413273",
    "007518796992481203",
    "0071942446043165467625899280575539568345323741",
    "0434782608695652173913",
    "0344827586206896551724137931",
    "002481389578163771712158808933",
    "002932551319648093841642228739",
    "0035587188612099644128113879",
    "003484320557491289198606271777",
    "00115074798619102416570771",
]

are examples of ones that do not.

The repeating sections of the strings I'm given can be quite long, and the strings themselves can be 500 or more characters, so looping through each character trying to build a pattern and then checking the pattern against the rest of the string seems awfully slow. Multiply that by potentially hundreds of strings and I can't see any intuitive solution.

I"ve looked into regexes a bit and they seem good for when you know what you"re looking for, or at least the length of the pattern you"re looking for. Unfortunately, I know neither.

How can I tell if a string is repeating itself and if it is, what the shortest repeating subsequence is?
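
For reference, the trick most often cited for this is the doubled-string test: s consists of repetitions exactly when s occurs inside (s + s) at some position other than 0 and len(s), and the smallest such position is the length of the shortest repeating unit. A small sketch:

def principal_period(s):
    # smallest i >= 1 such that s is s[:i] repeated; None if s does not repeat
    i = (s + s).find(s, 1, -1)
    return None if i == -1 else s[:i]

print(principal_period("0045662100456621004566210045662100456621"))  # "00456621"
print(principal_period("004608294930875576036866359447"))            # None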

Repeat string to certain length

What is an efficient way to repeat a string to a certain length? E.g. repeat("abc", 7) -> "abcabca"

Here is my current code:

def repeat(string, length):
    cur, old = 1, string
    while len(string) < length:
        string += old[cur-1]
        cur = (cur+1)%len(old)
    return string

Is there a better (more pythonic) way to do this? Maybe using a list comprehension?
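
One common one-liner (not necessarily the only pythonic option) is to tile enough whole copies and slice down to size:

def repeat(s, length):
    # enough whole copies, then cut to the exact length
    return (s * (length // len(s) + 1))[:length]

print(repeat("abc", 7))   # abcabca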

tqdm in Jupyter Notebook prints new progress bars repeatedly

I am using tqdm to print progress in a script I'm running in a Jupyter notebook. I am printing all messages to the console via tqdm.write(). However, this still gives me a skewed output like so:

[screenshot: a new progress bar is printed on each line]

That is, each time a new line has to be printed, a new progress bar is printed on the next line. This does not happen when I run the script via terminal. How can I solve this?
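
One commonly suggested fix (assuming a reasonably recent tqdm) is the notebook flavour, which renders a widget that updates in place instead of printing a fresh text bar on every refresh:

from tqdm.notebook import tqdm

for i in tqdm(range(100)):
    tqdm.write("processing item %d" % i)  # log lines without breaking the bar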

How to repeat last command in python interpreter shell?

How do I repeat the last command? The usual keys: Up, Ctrl+Up, Alt-p don't work. They produce nonsensical characters.

(ve)[[email protected] ve]$ python
Python 2.6.6 (r266:84292, Nov 15 2010, 21:48:32) 
[GCC 4.4.4 20100630 (Red Hat 4.4.4-10)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> print "hello world"
hello world
>>> ^[[A
  File "<stdin>", line 1
    ^
SyntaxError: invalid syntax
>>> ^[[1;5A
  File "<stdin>", line 1
    [1;5A
    ^
SyntaxError: invalid syntax
>>> ^[p
  File "<stdin>", line 1
    p
    ^
SyntaxError: invalid syntax
>>> 

Given a string of a million numbers, return all repeating 3 digit numbers

I had an interview with a hedge fund company in New York a few months ago and unfortunately, I did not get the internship offer as a data/software engineer. (They also asked for the solution to be in Python.)

I pretty much screwed up on the first interview problem...

Question: Given a string of a million numbers (Pi for example), write a function/program that returns all repeating 3-digit numbers with a repetition count greater than 1.

For example: if the string was: 123412345123456 then the function/program would return:

123 - 3 times
234 - 3 times
345 - 2 times

They did not give me the solution after I failed the interview, but they did tell me that the time complexity of the solution was a constant of 1000, since all the possible outcomes are between:

000 --> 999

Now that I"m thinking about it, I don"t think it"s possible to come up with a constant time algorithm. Is it?

Python threading.timer - repeat function every "n" seconds

I want to fire off a function every 0.5 seconds and be able to start, stop and reset the timer. I'm not too knowledgeable about how Python threads work and am having difficulties with the Python timer.

However, I keep getting RuntimeError: threads can only be started once when I execute threading.timer.start() twice. Is there a workaround for this? I tried applying threading.timer.cancel() before each start.

Pseudo code:

t = threading.Timer(0.5, function)
while True:
    t.cancel()
    t.start()
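
Since a threading.Timer can only be started once, one sketch of a restartable repeating timer is to create a fresh Timer on every tick (the class and method names here are illustrative, not a definitive design):

import threading

class RepeatTimer:
    """Call function every interval seconds until stop() is called."""

    def __init__(self, interval, function):
        self.interval = interval
        self.function = function
        self._timer = None

    def _run(self):
        self.function()
        self.start()              # schedule the next tick with a new Timer

    def start(self):
        self._timer = threading.Timer(self.interval, self._run)
        self._timer.start()

    def stop(self):
        if self._timer is not None:
            self._timer.cancel()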

How to assign to repeated field?

I am using protocol buffers in Python and I have a Person message

repeated uint64 id

but when I try to assign a value to it like:

person.id = [1, 32, 43432]

I get an error: Assignment not allowed to repeated field "id" in protocol message object. How do I assign a value to a repeated field?
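
For reference, repeated scalar fields behave like Python lists but cannot be assigned to directly; a sketch of the usual approach with append/extend (and slice deletion to clear existing values first):

person.id.append(1)               # add one value
person.id.extend([32, 43432])     # add several values

del person.id[:]                  # clear the field
person.id.extend([1, 32, 43432])  # then fill it in one go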

Answer #1

Since this question was asked in 2010, there has been real simplification in how to do simple multithreading with Python with map and pool.

The code below comes from an article/blog post that you should definitely check out (no affiliation) - Parallelism in one line: A Better Model for Day to Day Threading Tasks. I'll summarize below - it ends up being just a few lines of code:

from multiprocessing.dummy import Pool as ThreadPool
pool = ThreadPool(4)
results = pool.map(my_function, my_array)

Which is the multithreaded version of:

results = []
for item in my_array:
    results.append(my_function(item))

Description

Map is a cool little function, and the key to easily injecting parallelism into your Python code. For those unfamiliar, map is something lifted from functional languages like Lisp. It is a function which maps another function over a sequence.

Map handles the iteration over the sequence for us, applies the function, and stores all of the results in a handy list at the end.



Implementation

Parallel versions of the map function are provided by two libraries: multiprocessing, and also its little-known, but equally fantastic, stepchild: multiprocessing.dummy.

multiprocessing.dummy is exactly the same as the multiprocessing module, but uses threads instead (an important distinction - use multiple processes for CPU-intensive tasks, and threads for (and during) I/O):

multiprocessing.dummy replicates the API of multiprocessing, but is no more than a wrapper around the threading module.

import urllib2
from multiprocessing.dummy import Pool as ThreadPool

urls = [
  "http://www.python.org",
  "http://www.python.org/about/",
  "http://www.onlamp.com/pub/a/python/2003/04/17/metaclasses.html",
  "http://www.python.org/doc/",
  "http://www.python.org/download/",
  "http://www.python.org/getit/",
  "http://www.python.org/community/",
  "https://wiki.python.org/moin/",
]

# Make the Pool of workers
pool = ThreadPool(4)

# Open the URLs in their own threads
# and return the results
results = pool.map(urllib2.urlopen, urls)

# Close the pool and wait for the work to finish
pool.close()
pool.join()

And the timing results:

Single thread:   14.4 seconds
       4 Pool:   3.1 seconds
       8 Pool:   1.4 seconds
      13 Pool:   1.3 seconds

Passing multiple arguments (works like this only in Python 3.3 and later):

To pass multiple arrays:

results = pool.starmap(function, zip(list_a, list_b))

Or to pass a constant and an array:

results = pool.starmap(function, zip(itertools.repeat(constant), list_a))

If you are using an earlier version of Python, you can pass multiple arguments via this workaround.

(Thanks to user136036 for the helpful comment.)

Answer #2

In Python, what is the purpose of __slots__, and what are the cases in which one should avoid it?

TLDR:

The special attribute __slots__ allows you to explicitly state which instance attributes you expect your object instances to have, with the expected results:

  1. faster attribute access.
  2. space savings in memory.

The space savings is from

  1. Storing value references in slots instead of __dict__.
  2. Denying __dict__ and __weakref__ creation if parent classes deny them and you declare __slots__.

Quick Caveats

Small caveat, you should only declare a particular slot one time in an inheritance tree. For example:

class Base:
    __slots__ = "foo", "bar"

class Right(Base):
    __slots__ = "baz", 

class Wrong(Base):
    __slots__ = "foo", "bar", "baz"        # redundant foo and bar

Python doesn"t object when you get this wrong (it probably should), problems might not otherwise manifest, but your objects will take up more space than they otherwise should. Python 3.8:

>>> from sys import getsizeof
>>> getsizeof(Right()), getsizeof(Wrong())
(56, 72)

This is because the Base"s slot descriptor has a slot separate from the Wrong"s. This shouldn"t usually come up, but it could:

>>> w = Wrong()
>>> w.foo = "foo"
>>> Base.foo.__get__(w)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: foo
>>> Wrong.foo.__get__(w)
"foo"

The biggest caveat is for multiple inheritance - multiple "parent classes with nonempty slots" cannot be combined.

To accommodate this restriction, follow best practices: factor out all but one (or all) of the parents' behavior into abstractions with empty slots (just like abstract base classes in the standard library); their concrete classes, and your new concrete class, then collectively inherit from those abstractions.

See section on multiple inheritance below for an example.

Requirements:

  • To have attributes named in __slots__ to actually be stored in slots instead of a __dict__, a class must inherit from object (automatic in Python 3, but must be explicit in Python 2).

  • To prevent the creation of a __dict__, you must inherit from object and all classes in the inheritance must declare __slots__ and none of them can have a "__dict__" entry.

There are a lot of details if you wish to keep reading.

Why use __slots__: Faster attribute access.

The creator of Python, Guido van Rossum, states that he actually created __slots__ for faster attribute access.

It is trivial to demonstrate measurably faster access:

import timeit

class Foo(object): __slots__ = "foo",

class Bar(object): pass

slotted = Foo()
not_slotted = Bar()

def get_set_delete_fn(obj):
    def get_set_delete():
        obj.foo = "foo"
        obj.foo
        del obj.foo
    return get_set_delete

and

>>> min(timeit.repeat(get_set_delete_fn(slotted)))
0.2846834529991611
>>> min(timeit.repeat(get_set_delete_fn(not_slotted)))
0.3664822799983085

The slotted access is almost 30% faster in Python 3.5 on Ubuntu.

>>> 0.3664822799983085 / 0.2846834529991611
1.2873325658284342

In Python 2 on Windows I have measured it about 15% faster.

Why use __slots__: Memory Savings

Another purpose of __slots__ is to reduce the space in memory that each object instance takes up.

My own contribution to the documentation clearly states the reasons behind this:

The space saved over using __dict__ can be significant.

SQLAlchemy attributes a lot of memory savings to __slots__.

To verify this, using the Anaconda distribution of Python 2.7 on Ubuntu Linux, with guppy.hpy (aka heapy) and sys.getsizeof, the size of a class instance without __slots__ declared, and nothing else, is 64 bytes. That does not include the __dict__. Thank you Python for lazy evaluation again, the __dict__ is apparently not called into existence until it is referenced, but classes without data are usually useless. When called into existence, the __dict__ attribute is a minimum of 280 bytes additionally.

In contrast, a class instance with __slots__ declared to be () (no data) is only 16 bytes, and 56 total bytes with one item in slots, 64 with two.

For 64 bit Python, I illustrate the memory consumption in bytes in Python 2.7 and 3.6, for __slots__ and __dict__ (no slots defined) for each point where the dict grows in 3.6 (except for 0, 1, and 2 attributes):

       Python 2.7             Python 3.6
attrs  __slots__  __dict__*   __slots__  __dict__* | *(no slots defined)
none   16         56 + 272†   16         56 + 112† | †if __dict__ referenced
one    48         56 + 272    48         56 + 112
two    56         56 + 272    56         56 + 112
six    88         56 + 1040   88         56 + 152
11     128        56 + 1040   128        56 + 240
22     216        56 + 3344   216        56 + 408     
43     384        56 + 3344   384        56 + 752

So, in spite of smaller dicts in Python 3, we see how nicely __slots__ scale for instances to save us memory, and that is a major reason you would want to use __slots__.

Just for completeness of my notes, note that there is a one-time cost per slot in the class's namespace of 64 bytes in Python 2, and 72 bytes in Python 3, because slots use data descriptors like properties, called "members".

>>> Foo.foo
<member "foo" of "Foo" objects>
>>> type(Foo.foo)
<class "member_descriptor">
>>> getsizeof(Foo.foo)
72

Demonstration of __slots__:

To deny the creation of a __dict__, you must subclass object. Everything subclasses object in Python 3, but in Python 2 you had to be explicit:

class Base(object): 
    __slots__ = ()

now:

>>> b = Base()
>>> b.a = "a"
Traceback (most recent call last):
  File "<pyshell#38>", line 1, in <module>
    b.a = "a"
AttributeError: "Base" object has no attribute "a"

Or subclass another class that defines __slots__

class Child(Base):
    __slots__ = ("a",)

and now:

c = Child()
c.a = "a"

but:

>>> c.b = "b"
Traceback (most recent call last):
  File "<pyshell#42>", line 1, in <module>
    c.b = "b"
AttributeError: "Child" object has no attribute "b"

To allow __dict__ creation while subclassing slotted objects, just add "__dict__" to the __slots__ (note that slots are ordered, and you shouldn't repeat slots that are already in parent classes):

class SlottedWithDict(Child): 
    __slots__ = ("__dict__", "b")

swd = SlottedWithDict()
swd.a = "a"
swd.b = "b"
swd.c = "c"

and

>>> swd.__dict__
{"c": "c"}

Or you don"t even need to declare __slots__ in your subclass, and you will still use slots from the parents, but not restrict the creation of a __dict__:

class NoSlots(Child): pass
ns = NoSlots()
ns.a = "a"
ns.b = "b"

And:

>>> ns.__dict__
{"b": "b"}

However, __slots__ may cause problems for multiple inheritance:

class BaseA(object): 
    __slots__ = ("a",)

class BaseB(object): 
    __slots__ = ("b",)

Because creating a child class from parents with both non-empty slots fails:

>>> class Child(BaseA, BaseB): __slots__ = ()
Traceback (most recent call last):
  File "<pyshell#68>", line 1, in <module>
    class Child(BaseA, BaseB): __slots__ = ()
TypeError: Error when calling the metaclass bases
    multiple bases have instance lay-out conflict

If you run into this problem, you could just remove __slots__ from the parents, or, if you have control of the parents, give them empty slots, or refactor to abstractions:

from abc import ABC

class AbstractA(ABC):
    __slots__ = ()

class BaseA(AbstractA): 
    __slots__ = ("a",)

class AbstractB(ABC):
    __slots__ = ()

class BaseB(AbstractB): 
    __slots__ = ("b",)

class Child(AbstractA, AbstractB): 
    __slots__ = ("a", "b")

c = Child() # no problem!

Add "__dict__" to __slots__ to get dynamic assignment:

class Foo(object):
    __slots__ = "bar", "baz", "__dict__"

and now:

>>> foo = Foo()
>>> foo.boink = "boink"

So with "__dict__" in slots we lose some of the size benefits with the upside of having dynamic assignment and still having slots for the names we do expect.

When you inherit from an object that isn't slotted, you get the same sort of semantics when you use __slots__ - names that are in __slots__ point to slotted values, while any other values are put in the instance's __dict__.

Avoiding __slots__ because you want to be able to add attributes on the fly is actually not a good reason - just add "__dict__" to your __slots__ if this is required.

You can similarly add __weakref__ to __slots__ explicitly if you need that feature.

Set to empty tuple when subclassing a namedtuple:

The namedtuple builtin makes immutable instances that are very lightweight (essentially, the size of tuples), but to get the benefits, you need to do it yourself if you subclass them:

from collections import namedtuple
class MyNT(namedtuple("MyNT", "bar baz")):
    """MyNT is an immutable and lightweight object"""
    __slots__ = ()

usage:

>>> nt = MyNT("bar", "baz")
>>> nt.bar
"bar"
>>> nt.baz
"baz"

And trying to assign an unexpected attribute raises an AttributeError because we have prevented the creation of __dict__:

>>> nt.quux = "quux"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: "MyNT" object has no attribute "quux"

You can allow __dict__ creation by leaving off __slots__ = (), but you can't use non-empty __slots__ with subtypes of tuple.

Biggest Caveat: Multiple inheritance

Even when non-empty slots are the same for multiple parents, they cannot be used together:

class Foo(object): 
    __slots__ = "foo", "bar"
class Bar(object):
    __slots__ = "foo", "bar" # alas, would work if empty, i.e. ()

>>> class Baz(Foo, Bar): pass
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Error when calling the metaclass bases
    multiple bases have instance lay-out conflict

Using an empty __slots__ in the parent seems to provide the most flexibility, allowing the child to choose to prevent or allow (by adding "__dict__" to get dynamic assignment, see section above) the creation of a __dict__:

class Foo(object): __slots__ = ()
class Bar(object): __slots__ = ()
class Baz(Foo, Bar): __slots__ = ("foo", "bar")
b = Baz()
b.foo, b.bar = "foo", "bar"

You don"t have to have slots - so if you add them, and remove them later, it shouldn"t cause any problems.

Going out on a limb here: If you're composing mixins or using abstract base classes, which aren't intended to be instantiated, an empty __slots__ in those parents seems to be the best way to go in terms of flexibility for subclassers.

To demonstrate, first, let's create a class with code we'd like to use under multiple inheritance:

class AbstractBase:
    __slots__ = ()
    def __init__(self, a, b):
        self.a = a
        self.b = b
    def __repr__(self):
        return f"{type(self).__name__}({repr(self.a)}, {repr(self.b)})"

We could use the above directly by inheriting and declaring the expected slots:

class Foo(AbstractBase):
    __slots__ = "a", "b"

But we don"t care about that, that"s trivial single inheritance, we need another class we might also inherit from, maybe with a noisy attribute:

class AbstractBaseC:
    __slots__ = ()
    @property
    def c(self):
        print("getting c!")
        return self._c
    @c.setter
    def c(self, arg):
        print("setting c!")
        self._c = arg

Now if both bases had nonempty slots, we couldn't do the below. (In fact, if we wanted, we could have given AbstractBase nonempty slots a and b, and left them out of the below declaration - leaving them in would be wrong):

class Concretion(AbstractBase, AbstractBaseC):
    __slots__ = "a b _c".split()

And now we have functionality from both via multiple inheritance, and can still deny __dict__ and __weakref__ instantiation:

>>> c = Concretion("a", "b")
>>> c.c = c
setting c!
>>> c.c
getting c!
Concretion("a", "b")
>>> c.d = "d"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: "Concretion" object has no attribute "d"

Other cases to avoid slots:

  • Avoid them when you want to perform __class__ assignment with another class that doesn't have them (and you can't add them) unless the slot layouts are identical. (I am very interested in learning who is doing this and why.)
  • Avoid them if you want to subclass variable length builtins like long, tuple, or str, and you want to add attributes to them.
  • Avoid them if you insist on providing default values via class attributes for instance variables.

You may be able to tease out further caveats from the rest of the __slots__ documentation (the 3.7 dev docs are the most current), which I have made significant recent contributions to.

Critiques of other answers

The current top answers cite outdated information and are quite hand-wavy and miss the mark in some important ways.

Do not "only use __slots__ when instantiating lots of objects"

I quote:

"You would want to use __slots__ if you are going to instantiate a lot (hundreds, thousands) of objects of the same class."

Abstract Base Classes, for example, from the collections module, are not instantiated, yet __slots__ are declared for them.

Why?

If a user wishes to deny __dict__ or __weakref__ creation, those things must not be available in the parent classes.

__slots__ contributes to reusability when creating interfaces or mixins.

It is true that many Python users aren't writing for reusability, but when you are, having the option to deny unnecessary space usage is valuable.

__slots__ doesn"t break pickling

When pickling a slotted object, you may find it complains with a misleading TypeError:

>>> pickle.loads(pickle.dumps(f))
TypeError: a class that defines __slots__ without defining __getstate__ cannot be pickled

This is actually incorrect. This message comes from the oldest protocol, which is the default. You can select the latest protocol with the -1 argument. In Python 2.7 this would be 2 (which was introduced in 2.3), and in 3.6 it is 4.

>>> pickle.loads(pickle.dumps(f, -1))
<__main__.Foo object at 0x1129C770>

in Python 2.7:

>>> pickle.loads(pickle.dumps(f, 2))
<__main__.Foo object at 0x1129C770>

in Python 3.6

>>> pickle.loads(pickle.dumps(f, 4))
<__main__.Foo object at 0x1129C770>

So I would keep this in mind, as it is a solved problem.

Critique of the (until Oct 2, 2016) accepted answer

The first paragraph is half short explanation, half predictive. Here's the only part that actually answers the question:

The proper use of __slots__ is to save space in objects. Instead of having a dynamic dict that allows adding attributes to objects at anytime, there is a static structure which does not allow additions after creation. This saves the overhead of one dict for every object that uses slots

The second half is wishful thinking, and off the mark:

While this is sometimes a useful optimization, it would be completely unnecessary if the Python interpreter was dynamic enough so that it would only require the dict when there actually were additions to the object.

Python actually does something similar to this, only creating the __dict__ when it is accessed, but creating lots of objects with no data is fairly ridiculous.

The second paragraph oversimplifies and misses actual reasons to avoid __slots__. The below is not a real reason to avoid slots (for actual reasons, see the rest of my answer above.):

They change the behavior of the objects that have slots in a way that can be abused by control freaks and static typing weenies.

It then goes on to discuss other ways of accomplishing that perverse goal with Python, not discussing anything to do with __slots__.

The third paragraph is more wishful thinking. Together it is mostly off-the-mark content that the answerer didn't even author, and it contributes ammunition for critics of the site.

Memory usage evidence

Create some normal objects and slotted objects:

>>> class Foo(object): pass
>>> class Bar(object): __slots__ = ()

Instantiate a million of them:

>>> foos = [Foo() for f in xrange(1000000)]
>>> bars = [Bar() for b in xrange(1000000)]

Inspect with guppy.hpy().heap():

>>> guppy.hpy().heap()
Partition of a set of 2028259 objects. Total size = 99763360 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0 1000000  49 64000000  64  64000000  64 __main__.Foo
     1     169   0 16281480  16  80281480  80 list
     2 1000000  49 16000000  16  96281480  97 __main__.Bar
     3   12284   1   987472   1  97268952  97 str
...

Access the regular objects and their __dict__ and inspect again:

>>> for f in foos:
...     f.__dict__
>>> guppy.hpy().heap()
Partition of a set of 3028258 objects. Total size = 379763480 bytes.
 Index  Count   %      Size    % Cumulative  % Kind (class / dict of class)
     0 1000000  33 280000000  74 280000000  74 dict of __main__.Foo
     1 1000000  33  64000000  17 344000000  91 __main__.Foo
     2     169   0  16281480   4 360281480  95 list
     3 1000000  33  16000000   4 376281480  99 __main__.Bar
     4   12284   0    987472   0 377268952  99 str
...

This is consistent with the history of Python, from Unifying types and classes in Python 2.2

If you subclass a built-in type, extra space is automatically added to the instances to accommodate __dict__ and __weakrefs__. (The __dict__ is not initialized until you use it though, so you shouldn't worry about the space occupied by an empty dictionary for each instance you create.) If you don't need this extra space, you can add the phrase "__slots__ = []" to your class.

Answer #3

The short answer, or TL;DR

Basically, eval is used to evaluate a single dynamically generated Python expression, and exec is used to execute dynamically generated Python code only for its side effects.

eval and exec have these two differences:

  1. eval accepts only a single expression, exec can take a code block that has Python statements: loops, try: except:, class and function/method definitions and so on.

    An expression in Python is whatever you can have as the value in a variable assignment:

    a_variable = (anything you can put within these parentheses is an expression)
    
  2. eval returns the value of the given expression, whereas exec ignores the return value from its code, and always returns None (in Python 2 it is a statement and cannot be used as an expression, so it really does not return anything).

In versions 1.0 - 2.7, exec was a statement, because CPython needed to produce a different kind of code object for functions that used exec for its side effects inside the function.

In Python 3, exec is a function; its use has no effect on the compiled bytecode of the function where it is used.


Thus basically:

>>> a = 5
>>> eval("37 + a")   # it is an expression
42
>>> exec("37 + a")   # it is an expression statement; value is ignored (None is returned)
>>> exec("a = 47")   # modify a global variable as a side effect
>>> a
47
>>> eval("a = 47")  # you cannot evaluate a statement
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 1
    a = 47
      ^
SyntaxError: invalid syntax

The compile in "exec" mode compiles any number of statements into a bytecode that implicitly always returns None, whereas in "eval" mode it compiles a single expression into bytecode that returns the value of that expression.

>>> eval(compile("42", "<string>", "exec"))  # code returns None
>>> eval(compile("42", "<string>", "eval"))  # code returns 42
42
>>> exec(compile("42", "<string>", "eval"))  # code returns 42,
>>>                                          # but ignored by exec

In the "eval" mode (and thus with the eval function if a string is passed in), the compile raises an exception if the source code contains statements or anything else beyond a single expression:

>>> compile("for i in range(3): print(i)", "<string>", "eval")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 1
    for i in range(3): print(i)
      ^
SyntaxError: invalid syntax

Actually the statement "eval accepts only a single expression" applies only when a string (which contains Python source code) is passed to eval. Then it is internally compiled to bytecode using compile(source, "<string>", "eval"). This is where the difference really comes from.

If a code object (which contains Python bytecode) is passed to exec or eval, they behave identically, except for the fact that exec ignores the return value, still always returning None. So it is possible to use eval to execute something that has statements, if you just compile it into bytecode beforehand instead of passing it as a string:

>>> eval(compile("if 1: print("Hello")", "<string>", "exec"))
Hello
>>>

works without problems, even though the compiled code contains statements. It still returns None, because that is the return value of the code object returned from compile.

In the "eval" mode (and thus with the eval function if a string is passed in), the compile raises an exception if the source code contains statements or anything else beyond a single expression:

>>> compile("for i in range(3): print(i)", "<string>". "eval")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 1
    for i in range(3): print(i)
      ^
SyntaxError: invalid syntax

The longer answer, a.k.a. the gory details

exec and eval

The exec function (which was a statement in Python 2) is used for executing a dynamically created statement or program:

>>> program = """
for i in range(3):
    print("Python is cool")
"""
>>> exec(program)
Python is cool
Python is cool
Python is cool
>>> 

The eval function does the same for a single expression, and returns the value of the expression:

>>> a = 2
>>> my_calculation = "42 * a"
>>> result = eval(my_calculation)
>>> result
84

exec and eval both accept the program/expression to be run either as a str, unicode or bytes object containing source code, or as a code object which contains Python bytecode.

If a str/unicode/bytes containing source code was passed to exec, it behaves equivalently to:

exec(compile(source, "<string>", "exec"))

and eval similarly behaves equivalent to:

eval(compile(source, "<string>", "eval"))

Since all expressions can be used as statements in Python (these are called the Expr nodes in the Python abstract grammar; the opposite is not true), you can always use exec if you do not need the return value. That is to say, you can use either eval("my_func(42)") or exec("my_func(42)"), the difference being that eval returns the value returned by my_func, and exec discards it:

>>> def my_func(arg):
...     print("Called with %d" % arg)
...     return arg * 2
... 
>>> exec("my_func(42)")
Called with 42
>>> eval("my_func(42)")
Called with 42
84
>>> 

Of the two, only exec accepts source code that contains statements, like def, for, while, import, or class, the assignment statement (a.k.a. a = 42), or entire programs:

>>> exec("for i in range(3): print(i)")
0
1
2
>>> eval("for i in range(3): print(i)")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 1
    for i in range(3): print(i)
      ^
SyntaxError: invalid syntax

Both exec and eval accept 2 additional positional arguments - globals and locals - which are the global and local variable scopes that the code sees. These default to the globals() and locals() within the scope that called exec or eval, but any dictionary can be used for globals and any mapping for locals (including dict of course). These can be used not only to restrict/modify the variables that the code sees, but are often also used for capturing the variables that the executed code creates:

>>> g = dict()
>>> l = dict()
>>> exec("global a; a, b = 123, 42", g, l)
>>> g["a"]
123
>>> l
{"b": 42}

(If you display the value of the entire g, it would be much longer, because exec and eval add the built-ins module as __builtins__ to the globals automatically if it is missing).

In Python 2, the official syntax for the exec statement is actually exec code in globals, locals, as in

>>> exec "global a; a, b = 123, 42" in g, l

However the alternate syntax exec(code, globals, locals) has always been accepted too (see below).

compile

The compile(source, filename, mode, flags=0, dont_inherit=False, optimize=-1) built-in can be used to speed up repeated invocations of the same code with exec or eval by compiling the source into a code object beforehand. The mode parameter controls the kind of code fragment the compile function accepts and the kind of bytecode it produces. The choices are "eval", "exec" and "single":

  • "eval" mode expects a single expression, and will produce bytecode that when run will return the value of that expression:

    >>> dis.dis(compile("a + b", "<string>", "eval"))
      1           0 LOAD_NAME                0 (a)
                  3 LOAD_NAME                1 (b)
                  6 BINARY_ADD
                  7 RETURN_VALUE
    
  • "exec" accepts any kinds of python constructs from single expressions to whole modules of code, and executes them as if they were module top-level statements. The code object returns None:

    >>> dis.dis(compile("a + b", "<string>", "exec"))
      1           0 LOAD_NAME                0 (a)
                  3 LOAD_NAME                1 (b)
                  6 BINARY_ADD
                  7 POP_TOP                             <- discard result
                  8 LOAD_CONST               0 (None)   <- load None on stack
                 11 RETURN_VALUE                        <- return top of stack
    
  • "single" is a limited form of "exec" which accepts a source code containing a single statement (or multiple statements separated by ;) if the last statement is an expression statement, the resulting bytecode also prints the repr of the value of that expression to the standard output(!).

    An if-elif-else chain, a loop with else, and try with its except, else and finally blocks are each considered a single statement.

    A source fragment containing 2 top-level statements is an error for "single", except that in Python 2 there is a bug that sometimes allows multiple top-level statements in the code; only the first is compiled, and the rest are ignored:

    In Python 2.7.8:

    >>> exec(compile("a = 5
    a = 6", "<string>", "single"))
    >>> a
    5
    

    And in Python 3.4.2:

    >>> exec(compile("a = 5
    a = 6", "<string>", "single"))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "<string>", line 1
        a = 5
            ^
    SyntaxError: multiple statements found while compiling a single statement
    

    This is very useful for making interactive Python shells. However, the value of the expression is not returned, even if you eval the resulting code.

Thus the greatest distinction between exec and eval actually comes from the compile function and its modes.


In addition to compiling source code to bytecode, compile supports compiling abstract syntax trees (parse trees of Python code) into code objects; and source code into abstract syntax trees (the ast.parse is written in Python and just calls compile(source, filename, mode, PyCF_ONLY_AST)); these are used for example for modifying source code on the fly, and also for dynamic code creation, as it is often easier to handle the code as a tree of nodes instead of lines of text in complex cases.


While eval only allows you to evaluate a string that contains a single expression, you can eval a whole statement, or even a whole module that has been compiled into bytecode; that is, with Python 2, print is a statement, and cannot be evalled directly:

>>> eval("for i in range(3): print("Python is cool")")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 1
    for i in range(3): print('Python is cool')
      ^
SyntaxError: invalid syntax

compile it with "exec" mode into a code object and you can eval it; the eval function will return None.

>>> code = compile("for i in range(3): print('Python is cool')",
                   "foo.py", "exec")
>>> eval(code)
Python is cool
Python is cool
Python is cool

If one looks into eval and exec source code in CPython 3, this is very evident; they both call PyEval_EvalCode with same arguments, the only difference being that exec explicitly returns None.

Syntax differences of exec between Python 2 and Python 3

One of the major differences in Python 2 is that exec is a statement and eval is a built-in function (both are built-in functions in Python 3). It is a well-known fact that the official syntax of exec in Python 2 is exec code [in globals[, locals]].

Unlike what the majority of the Python 2-to-3 porting guides seem to suggest, the exec statement in CPython 2 can also be used with syntax that looks exactly like the exec function invocation in Python 3. The reason is that Python 0.9.9 had the exec(code, globals, locals) built-in function! And that built-in function was replaced with the exec statement somewhere before the Python 1.0 release.

Since it was desirable to not break backwards compatibility with Python 0.9.9, Guido van Rossum added a compatibility hack in 1993: if the code was a tuple of length 2 or 3, and globals and locals were not passed into the exec statement otherwise, the code would be interpreted as if the 2nd and 3rd element of the tuple were the globals and locals respectively. The compatibility hack was not mentioned even in Python 1.4 documentation (the earliest available version online); and thus was not known to many writers of the porting guides and tools, until it was documented again in November 2012:

The first expression may also be a tuple of length 2 or 3. In this case, the optional parts must be omitted. The form exec(expr, globals) is equivalent to exec expr in globals, while the form exec(expr, globals, locals) is equivalent to exec expr in globals, locals. The tuple form of exec provides compatibility with Python 3, where exec is a function rather than a statement.

Yes, in CPython 2.7 it is handily referred to as a forward-compatibility option (why confuse people by suggesting there is a backward compatibility option at all), when it had actually been there for backward compatibility for two decades.

Thus while exec is a statement in Python 1 and Python 2, and a built-in function in Python 3 and Python 0.9.9,

>>> exec("print(a)", globals(), {"a": 42})
42

has had identical behaviour in possibly every widely released Python version ever; and works in Jython 2.5.2, PyPy 2.3.1 (Python 2.7.6) and IronPython 2.6.1 too (kudos to them following the undocumented behaviour of CPython closely).

What you cannot do in Pythons 1.0 - 2.7 with its compatibility hack, is to store the return value of exec into a variable:

Python 2.7.11+ (default, Apr 17 2016, 14:00:29) 
[GCC 5.3.1 20160413] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> a = exec("print(42)")
  File "<stdin>", line 1
    a = exec("print(42)")
           ^
SyntaxError: invalid syntax

(which wouldn"t be useful in Python 3 either, as exec always returns None), or pass a reference to exec:

>>> call_later(exec, "print(42)", delay=1000)
  File "<stdin>", line 1
    call_later(exec, "print(42)", delay=1000)
                  ^
SyntaxError: invalid syntax

Which is a pattern that someone might actually have used, though unlikely;

Or use it in a list comprehension:

>>> [exec(i) for i in ["print(42)", "print(foo)"]]
  File "<stdin>", line 1
    [exec(i) for i in ["print(42)", "print(foo)"]]
        ^
SyntaxError: invalid syntax

which is abuse of list comprehensions (use a for loop instead!).

Answer #4

What is the difference between the list methods append and extend?

  • append adds its argument as a single element to the end of a list. The length of the list itself will increase by one.
  • extend iterates over its argument adding each element to the list, extending the list. The length of the list will increase by however many elements were in the iterable argument.

append

The list.append method appends an object to the end of the list.

my_list.append(object) 

Whatever the object is, whether a number, a string, another list, or something else, it gets added onto the end of my_list as a single entry on the list.

>>> my_list
["foo", "bar"]
>>> my_list.append("baz")
>>> my_list
["foo", "bar", "baz"]

So keep in mind that a list is an object. If you append another list onto a list, the first list will be a single object at the end of the list (which may not be what you want):

>>> another_list = [1, 2, 3]
>>> my_list.append(another_list)
>>> my_list
["foo", "bar", "baz", [1, 2, 3]]
                     #^^^^^^^^^--- single item at the end of the list.

extend

The list.extend method extends a list by appending elements from an iterable:

my_list.extend(iterable)

So with extend, each element of the iterable gets appended onto the list. For example:

>>> my_list
["foo", "bar"]
>>> another_list = [1, 2, 3]
>>> my_list.extend(another_list)
>>> my_list
["foo", "bar", 1, 2, 3]

Keep in mind that a string is an iterable, so if you extend a list with a string, you'll append each character as you iterate over the string (which may not be what you want):

>>> my_list.extend("baz")
>>> my_list
["foo", "bar", 1, 2, 3, "b", "a", "z"]

Operator Overload, __add__ (+) and __iadd__ (+=)

Both + and += operators are defined for list. They are semantically similar to extend.

my_list + another_list creates a third list in memory, so you can return the result of it, but it requires that the second iterable be a list.

my_list += another_list modifies the list in-place (it is the in-place operator, and lists are mutable objects, as we've seen), so it does not create a new list. It also works like extend, in that the second iterable can be any kind of iterable.

Don"t get confused - my_list = my_list + another_list is not equivalent to += - it gives you a brand new list assigned to my_list.

Time Complexity

Append has (amortized) constant time complexity, O(1).

Extend has time complexity O(k), where k is the number of elements in the iterable.

Iterating through the multiple calls to append adds to the complexity, making it equivalent to that of extend, and since extend's iteration is implemented in C, it will always be faster if you intend to append successive items from an iterable onto a list.

Regarding "amortized" - from the list object implementation source:

    /* This over-allocates proportional to the list size, making room
     * for additional growth.  The over-allocation is mild, but is
     * enough to give linear-time amortized behavior over a long
     * sequence of appends() in the presence of a poorly-performing
     * system realloc().

This means that we get the benefits of a larger than needed memory reallocation up front, but we may pay for it on the next marginal reallocation with an even larger one. Total time for all appends is linear at O(n), and that time allocated per append, becomes O(1).

Performance

You may wonder what is more performant, since append can be used to achieve the same outcome as extend. The following functions do the same thing:

def append(alist, iterable):
    for item in iterable:
        alist.append(item)
        
def extend(alist, iterable):
    alist.extend(iterable)

So let"s time them:

import timeit

>>> min(timeit.repeat(lambda: append([], "abcdefghijklmnopqrstuvwxyz")))
2.867846965789795
>>> min(timeit.repeat(lambda: extend([], "abcdefghijklmnopqrstuvwxyz")))
0.8060121536254883

Addressing a comment on timings

A commenter said:

Perfect answer, I just miss the timing of comparing adding only one element

Do the semantically correct thing. If you want to append all elements in an iterable, use extend. If you're just adding one element, use append.

Ok, so let"s create an experiment to see how this works out in time:

def append_one(a_list, element):
    a_list.append(element)

def extend_one(a_list, element):
    """creating a new list is semantically the most direct
    way to create an iterable to give to extend"""
    a_list.extend([element])

import timeit

And we see that going out of our way to create an iterable just to use extend is a (minor) waste of time:

>>> min(timeit.repeat(lambda: append_one([], 0)))
0.2082819009956438
>>> min(timeit.repeat(lambda: extend_one([], 0)))
0.2397019260097295

We learn from this that there"s nothing gained from using extend when we have only one element to append.

Also, these timings are not that important. I am just showing them to make the point that, in Python, doing the semantically correct thing is doing things the Right Way‚Ñ¢.

It"s conceivable that you might test timings on two comparable operations and get an ambiguous or inverse result. Just focus on doing the semantically correct thing.

Conclusion

We see that extend is semantically clearer, and that it can run much faster than append, when you intend to append each element in an iterable to a list.

If you only have a single element (not in an iterable) to add to the list, use append.

Answer #5

(Note: this answer is based on a short blog post about einsum I wrote a while ago.)

What does einsum do?

Imagine that we have two multi-dimensional arrays, A and B. Now let"s suppose we want to...

  • multiply A with B in a particular way to create new array of products; and then maybe
  • sum this new array along particular axes; and then maybe
  • transpose the axes of the new array in a particular order.

There"s a good chance that einsum will help us do this faster and more memory-efficiently than combinations of the NumPy functions like multiply, sum and transpose will allow.

How does einsum work?

Here"s a simple (but not completely trivial) example. Take the following two arrays:

import numpy as np

A = np.array([0, 1, 2])

B = np.array([[ 0,  1,  2,  3],
              [ 4,  5,  6,  7],
              [ 8,  9, 10, 11]])

We will multiply A and B element-wise and then sum along the rows of the new array. In "normal" NumPy we"d write:

>>> (A[:, np.newaxis] * B).sum(axis=1)
array([ 0, 22, 76])

So here, the indexing operation on A lines up the first axes of the two arrays so that the multiplication can be broadcast. The rows of the array of products are then summed to return the answer.

Now if we wanted to use einsum instead, we could write:

>>> np.einsum("i,ij->i", A, B)
array([ 0, 22, 76])

The signature string "i,ij->i" is the key here and needs a little bit of explaining. You can think of it in two halves. On the left-hand side (left of the ->) we"ve labelled the two input arrays. To the right of ->, we"ve labelled the array we want to end up with.

Here is what happens next:

  • A has one axis; we"ve labelled it i. And B has two axes; we"ve labelled axis 0 as i and axis 1 as j.

  • By repeating the label i in both input arrays, we are telling einsum that these two axes should be multiplied together. In other words, we"re multiplying array A with each column of array B, just like A[:, np.newaxis] * B does.

  • Notice that j does not appear as a label in our desired output; we"ve just used i (we want to end up with a 1D array). By omitting the label, we"re telling einsum to sum along this axis. In other words, we"re summing the rows of the products, just like .sum(axis=1) does.

That"s basically all you need to know to use einsum. It helps to play about a little; if we leave both labels in the output, "i,ij->ij", we get back a 2D array of products (same as A[:, np.newaxis] * B). If we say no output labels, "i,ij->, we get back a single number (same as doing (A[:, np.newaxis] * B).sum()).

The great thing about einsum however, is that it does not build a temporary array of products first; it just sums the products as it goes. This can lead to big savings in memory use.

A slightly bigger example

To explain the dot product, here are two new arrays:

A = np.array([[1, 1, 1],
              [2, 2, 2],
              [5, 5, 5]])

B = np.array([[0, 1, 0],
              [1, 1, 0],
              [1, 1, 1]])

We will compute the dot product using np.einsum("ij,jk->ik", A, B). Here"s a picture showing the labelling of the A and B and the output array that we get from the function:

[figure: A and B labelled with i, j and k, alongside the resulting output array]

You can see that label j is repeated - this means we"re multiplying the rows of A with the columns of B. Furthermore, the label j is not included in the output - we"re summing these products. Labels i and k are kept for the output, so we get back a 2D array.

It might be even clearer to compare this result with the array where the label j is not summed. Below, on the left you can see the 3D array that results from writing np.einsum("ij,jk->ijk", A, B) (i.e. we"ve kept label j):

[figure: the 3D array of products from "ij,jk->ijk" (left) and the dot product after summing over j (right)]

Summing axis j gives the expected dot product, shown on the right.

Some exercises

To get more of a feel for einsum, it can be useful to implement familiar NumPy array operations using the subscript notation. Anything that involves combinations of multiplying and summing axes can be written using einsum.

Let A and B be two 1D arrays with the same length. For example, A = np.arange(10) and B = np.arange(5, 15).

  • The sum of A can be written:

    np.einsum("i->", A)
    
  • Element-wise multiplication, A * B, can be written:

    np.einsum("i,i->i", A, B)
    
  • The inner product or dot product, np.inner(A, B) or np.dot(A, B), can be written:

    np.einsum("i,i->", A, B) # or just use "i,i"
    
  • The outer product, np.outer(A, B), can be written:

    np.einsum("i,j->ij", A, B)
    

For 2D arrays, C and D, provided that the axes are compatible lengths (both the same length, or one of them has length 1), here are a few examples:

  • The trace of C (sum of main diagonal), np.trace(C), can be written:

    np.einsum("ii", C)
    
  • Element-wise multiplication of C and the transpose of D, C * D.T, can be written:

    np.einsum("ij,ji->ij", C, D)
    
  • Multiplying each element of C by the array D (to make a 4D array), C[:, :, None, None] * D, can be written:

    np.einsum("ij,kl->ijkl", C, D)    
    

Answer #6

root is the old (pre-conda 4.4) name for the main environment; after conda 4.4, it was renamed to be base. source

What 95% of people actually want

In most cases what you want to do when you say that you want to update Anaconda is to execute the command:

conda update --all

(But this should be preceded by conda update -n base conda so you have the latest conda version installed.)

This will update all packages in the current environment to the latest version -- with the small print being that it may use an older version of some packages in order to satisfy dependency constraints (often this won"t be necessary and when it is necessary the package plan solver will do its best to minimize the impact).

This needs to be executed from the command line, and the best way to get there is from Anaconda Navigator, then the "Environments" tab, then click on the triangle beside the base environment, selecting "Open Terminal":

Open terminal from Navigator

This operation will only update the one selected environment (in this case, the base environment). If you have other environments you"d like to update you can repeat the process above, but first click on the environment. When it is selected there is a triangular marker on the right (see image above, step 3). Or from the command line you can provide the environment name (-n envname) or path (-p /path/to/env), for example to update your dspyr environment from the screenshot above:

conda update -n dspyr --all

Update individual packages

If you are only interested in updating an individual package then simply click on the blue arrow or blue version number in Navigator, e.g. for astroid or astropy in the screenshot above, and this will tag those packages for an upgrade. When you are done you need to click the "Apply" button:

Apply to update individual packages

Or from the command line:

conda update astroid astropy

Updating just the packages in the standard Anaconda Distribution

If you don"t care about package versions and just want "the latest set of all packages in the standard Anaconda Distribution, so long as they work together", then you should take a look at this gist.

Why updating the Anaconda package is almost always a bad idea

In most cases updating the Anaconda package in the package list will have a surprising result: you may actually downgrade many packages (in fact, this is likely if it indicates the version as custom). The gist above provides details.

Leverage conda environments

Your base environment is probably not a good place to try and manage an exact set of packages: it is going to be a dynamic working space with new packages installed and packages randomly updated. If you need an exact set of packages, create a conda environment to hold them. Thanks to the conda package cache and the way file linking is used, doing this is typically i) fast and ii) very light on additional disk space. E.g.

conda create -n myspecialenv -c bioconda -c conda-forge python=3.5 pandas beautifulsoup seaborn nltk

The conda documentation has more details and examples.

pip, PyPI, and setuptools?

None of this is going to help with updating packages that have been installed from PyPI via pip or any packages installed using python setup.py install. conda list will give you some hints about the pip-based Python packages you have in an environment, but it won"t do anything special to update them.

Commercial use of Anaconda or Anaconda Enterprise

It is pretty much exactly the same story, with the exception that you may not be able to update the base environment if it was installed by someone else (say to /opt/anaconda/latest). If you"re not able to update the environments you are using you should be able to clone and then update:

conda create -n myenv --clone base
conda update -n myenv --all

Answer #7

First consider if you really need to iterate over rows in a DataFrame. See this answer for alternatives.

If you still need to iterate over rows, you can use the methods below. Note some important caveats which are not mentioned in any of the other answers.

itertuples() is supposed to be faster than iterrows()

But be aware, according to the docs (pandas 0.24.2 at the moment):

  • iterrows: dtype might not match from row to row

    Because iterrows returns a Series for each row, it does not preserve dtypes across the rows (dtypes are preserved across columns for DataFrames). To preserve dtypes while iterating over the rows, it is better to use itertuples() which returns namedtuples of the values and which is generally much faster than iterrows()

  • iterrows: Do not modify rows

    You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.

    Use DataFrame.apply() instead:

    new_df = df.apply(lambda x: x * 2)
    
  • itertuples:

    The column names will be renamed to positional names if they are invalid Python identifiers, repeated, or start with an underscore. With a large number of columns (>255), regular tuples are returned.

See pandas docs on iteration for more details.
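
A small sketch of the dtype caveat (the values shown are what a recent pandas version prints; details may differ across versions):

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})   # int64 and float64 columns

row = next(df.iterrows())[1]        # a Series: both values are upcast to float64
print(row["a"], type(row["a"]))     # 1.0 <class 'numpy.float64'>

tup = next(df.itertuples())         # a namedtuple: original dtypes survive
print(tup.a, type(tup.a))           # 1 <class 'numpy.int64'>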

Answer #8

Client and Resource are two different abstractions within the boto3 SDK for making AWS service requests. You would typically choose to use either the Client abstraction or the Resource abstraction. I"ve outlined the differences between Client and Resource below to help readers decide which to use.

Session is largely orthogonal to the concepts of Client and Resource (but is used by both).

Here"s some more detailed information on what Client, Resource, and Session are all about.

Client:

  • this is the original boto3 API abstraction
  • provides low-level AWS service access
  • all AWS service operations are supported by clients
  • exposes botocore client to the developer
  • typically maps 1:1 with the AWS service API
  • snake-cased method names (e.g. ListBuckets API => list_buckets method)
  • generated from AWS service description

Here"s an example of client-level access to an S3 bucket"s objects:

import boto3

client = boto3.client("s3")
response = client.list_objects_v2(Bucket="mybucket")
for content in response["Contents"]:
    obj_dict = client.get_object(Bucket="mybucket", Key=content["Key"])
    print(content["Key"], obj_dict["LastModified"])

Note: this client-level code is limited to listing at most 1000 objects. You would have to use a paginator, or implement your own loop, calling list_objects_v2() repeatedly with a continuation marker if there were more than 1000 objects.
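
For completeness, here is a hedged sketch of the paginator approach (the bucket name is just a placeholder):

import boto3

client = boto3.client("s3")
paginator = client.get_paginator("list_objects_v2")

# The paginator follows the continuation tokens for you,
# yielding one page (of up to 1000 keys) at a time.
for page in paginator.paginate(Bucket="mybucket"):
    for content in page.get("Contents", []):
        print(content["Key"], content["LastModified"])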

Resource:

  • this is the newer boto3 API abstraction
  • provides high-level, object-oriented API
  • does not provide 100% API coverage of AWS services
  • uses identifiers and attributes
  • has actions (operations on resources)
  • exposes subresources and collections of AWS resources
  • generated from resource description

Here"s the equivalent example using resource-level access to an S3 bucket"s objects (all):

import boto3

s3 = boto3.resource("s3")
bucket = s3.Bucket("mybucket")
for obj in bucket.objects.all():
    print(obj.key, obj.last_modified)

Note: in this case you do not have to make a second API call to get the objects; they"re available to you as a collection on the bucket. These collections of subresources are lazily-loaded.

You can see that the Resource version of the code is much simpler, more compact, and has more capability (it does pagination for you). The Client version of the code would actually be more complicated than shown above if you wanted to include pagination.

Session:

  • stores configuration information (primarily credentials and selected region)
  • allows you to create service clients and resources
  • boto3 creates a default session for you when needed
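
For example, a sketch of creating clients and resources from an explicit Session (the profile and region names here are made up):

import boto3

# Pin a profile and region once, then hand out clients and resources
# that share that configuration.
session = boto3.session.Session(profile_name="dev", region_name="us-east-1")
s3_client = session.client("s3")
s3_resource = session.resource("s3")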

A useful resource to learn more about these boto3 concepts is the introductory re:Invent video.

Answer #9

Best way to check if a list is empty

For example, if passed the following:

a = []

How do I check to see if a is empty?

Short Answer:

Place the list in a boolean context (for example, with an if or while statement). It will test False if it is empty, and True otherwise. For example:

if not a:                           # do this!
    print("a is an empty list")

PEP 8

PEP 8, the official style guide for Python code in Python's standard library, asserts:

For sequences, (strings, lists, tuples), use the fact that empty sequences are false.

Yes: if not seq:
     if seq:

No: if len(seq):
    if not len(seq):

We expect standard library code to be as performant and correct as possible. But why is that the case, and why do we need this guidance?

Explanation

I frequently see code like this from experienced programmers new to Python:

if len(a) == 0:                     # Don"t do this!
    print("a is an empty list")

And users of lazy languages may be tempted to do this:

if a == []:                         # Don"t do this!
    print("a is an empty list")

These checks are correct in the languages those programmers come from. And this is even semantically correct in Python.

But we consider it un-Pythonic because Python supports these semantics directly in the list object"s interface via boolean coercion.

From the docs (and note specifically the inclusion of the empty list, []):

By default, an object is considered true unless its class defines either a __bool__() method that returns False or a __len__() method that returns zero, when called with the object. Here are most of the built-in objects considered false:

  • constants defined to be false: None and False.
  • zero of any numeric type: 0, 0.0, 0j, Decimal(0), Fraction(0, 1)
  • empty sequences and collections: "", (), [], {}, set(), range(0)

And the datamodel documentation:

object.__bool__(self)

Called to implement truth value testing and the built-in operation bool(); should return False or True. When this method is not defined, __len__() is called, if it is defined, and the object is considered true if its result is nonzero. If a class defines neither __len__() nor __bool__(), all its instances are considered true.

and

object.__len__(self)

Called to implement the built-in function len(). Should return the length of the object, an integer >= 0. Also, an object that doesn’t define a __bool__() method and whose __len__() method returns zero is considered to be false in a Boolean context.
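
As a quick illustration of those rules, a minimal container that only defines __len__ already works with the idiomatic if not ... check (a sketch, not from the original answer):

class Bag:
    def __init__(self, items=()):
        self._items = list(items)

    def __len__(self):                 # no __bool__ needed; truth testing
        return len(self._items)        # falls back to __len__

if not Bag():
    print("empty Bag is falsey")       # printed
if Bag([1, 2, 3]):
    print("non-empty Bag is truthy")   # printed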

So instead of this:

if len(a) == 0:                     # Don"t do this!
    print("a is an empty list")

or this:

if a == []:                     # Don"t do this!
    print("a is an empty list")

Do this:

if not a:
    print("a is an empty list")

Doing what"s Pythonic usually pays off in performance:

Does it pay off? (Note that less time to perform an equivalent operation is better:)

>>> import timeit
>>> min(timeit.repeat(lambda: len([]) == 0, repeat=100))
0.13775854044661884
>>> min(timeit.repeat(lambda: [] == [], repeat=100))
0.0984637276455409
>>> min(timeit.repeat(lambda: not [], repeat=100))
0.07878462291455435

For scale, here"s the cost of calling the function and constructing and returning an empty list, which you might subtract from the costs of the emptiness checks used above:

>>> min(timeit.repeat(lambda: [], repeat=100))
0.07074015751817342

We see that either checking for length with the builtin function len compared to 0 or checking against an empty list is much less performant than using the builtin syntax of the language as documented.

Why?

For the len(a) == 0 check:

First Python has to check the globals to see if len is shadowed.

Then it must call the function, load 0, and do the equality comparison in Python (instead of with C):

>>> import dis
>>> dis.dis(lambda: len([]) == 0)
  1           0 LOAD_GLOBAL              0 (len)
              2 BUILD_LIST               0
              4 CALL_FUNCTION            1
              6 LOAD_CONST               1 (0)
              8 COMPARE_OP               2 (==)
             10 RETURN_VALUE

And for the [] == [] it has to build an unnecessary list and then, again, do the comparison operation in Python"s virtual machine (as opposed to C)

>>> dis.dis(lambda: [] == [])
  1           0 BUILD_LIST               0
              2 BUILD_LIST               0
              4 COMPARE_OP               2 (==)
              6 RETURN_VALUE

The "Pythonic" way is a much simpler and faster check since the length of the list is cached in the object instance header:

>>> dis.dis(lambda: not [])
  1           0 BUILD_LIST               0
              2 UNARY_NOT
              4 RETURN_VALUE

Evidence from the C source and documentation

PyVarObject

This is an extension of PyObject that adds the ob_size field. This is only used for objects that have some notion of length. This type does not often appear in the Python/C API. It corresponds to the fields defined by the expansion of the PyObject_VAR_HEAD macro.

From the c source in Include/listobject.h:

typedef struct {
    PyObject_VAR_HEAD
    /* Vector of pointers to list elements.  list[0] is ob_item[0], etc. */
    PyObject **ob_item;

    /* ob_item contains space for "allocated" elements.  The number
     * currently in use is ob_size.
     * Invariants:
     *     0 <= ob_size <= allocated
     *     len(list) == ob_size

Response to comments:

I would point out that this is also true for the non-empty case though its pretty ugly as with l=[] then %timeit len(l) != 0 90.6 ns ± 8.3 ns, %timeit l != [] 55.6 ns ± 3.09, %timeit not not l 38.5 ns ± 0.372. But there is no way anyone is going to enjoy not not l despite triple the speed. It looks ridiculous. But the speed wins out
I suppose the problem is testing with timeit since just if l: is sufficient but surprisingly %timeit bool(l) yields 101 ns ± 2.64 ns. Interesting there is no way to coerce to bool without this penalty. %timeit l is useless since no conversion would occur.

IPython magic, %timeit, is not entirely useless here:

In [1]: l = []                                                                  

In [2]: %timeit l                                                               
20 ns ± 0.155 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)

In [3]: %timeit not l                                                           
24.4 ns ± 1.58 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [4]: %timeit not not l                                                       
30.1 ns ± 2.16 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

We can see there"s a bit of linear cost for each additional not here. We want to see the costs, ceteris paribus, that is, all else equal - where all else is minimized as far as possible:

In [5]: %timeit if l: pass                                                      
22.6 ns ± 0.963 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [6]: %timeit if not l: pass                                                  
24.4 ns ± 0.796 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [7]: %timeit if not not l: pass                                              
23.4 ns ± 0.793 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

Now let"s look at the case for an unempty list:

In [8]: l = [1]                                                                 

In [9]: %timeit if l: pass                                                      
23.7 ns ± 1.06 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [10]: %timeit if not l: pass                                                 
23.6 ns ± 1.64 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [11]: %timeit if not not l: pass                                             
26.3 ns ± 1 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

What we can see here is that it makes little difference whether you pass in an actual bool to the condition check or the list itself, and if anything, giving the list, as is, is faster.

Python is written in C; it uses its logic at the C level. Anything you write in Python will be slower. And it will likely be orders of magnitude slower unless you"re using the mechanisms built into Python directly.

Answer #10

I know the object column type makes the data hard to convert with a pandas function. When I receive data like this, the first thing that comes to mind is to "flatten" or unnest the column.

I am using pandas and Python functions for this type of question. If you are worried about the speed of these solutions, check user3483203's answer, since it uses NumPy, and most of the time NumPy is faster. I recommend Cython and numba if speed matters.


Method 0 [pandas >= 0.25]
Starting from pandas 0.25, if you only need to explode one column, you can use the pandas.DataFrame.explode function:

df.explode("B")

       A  B
    0  1  1
    1  1  2
    0  2  1
    1  2  2

Given a dataframe with an empty list or a NaN in the column: an empty list will not cause an issue, but a NaN will need to be filled with a list first.

df = pd.DataFrame({"A": [1, 2, 3, 4],"B": [[1, 2], [1, 2], [], np.nan]})
df.B = df.B.fillna({i: [] for i in df.index})  # replace NaN with []
df.explode("B")

   A    B
0  1    1
0  1    2
1  2    1
1  2    2
2  3  NaN
3  4  NaN

Method 1
apply + pd.Series (easy to understand, but not recommended in terms of performance)

df.set_index("A").B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0:"B"})
Out[463]: 
   A  B
0  1  1
1  1  2
0  2  1
1  2  2

Method 2
Using repeat with the DataFrame constructor to re-create your dataframe (good performance, but awkward with multiple columns)

df=pd.DataFrame({"A":df.A.repeat(df.B.str.len()),"B":np.concatenate(df.B.values)})
df
Out[465]: 
   A  B
0  1  1
0  1  2
1  2  1
1  2  2

Method 2.1
For example, suppose that besides A we also have A.1, ..., A.n. If we still use Method 2 above, it is hard to re-create the columns one by one.

Solution: join or merge with the index after unnesting the single column.

s=pd.DataFrame({"B":np.concatenate(df.B.values)},index=df.index.repeat(df.B.str.len()))
s.join(df.drop("B",1),how="left")
Out[477]: 
   B  A
0  1  1
0  2  1
1  1  2
1  2  2

If you need the column order exactly the same as before, add reindex at the end.

s.join(df.drop("B",1),how="left").reindex(columns=df.columns)

Method 3
recreate the list

pd.DataFrame([[x] + [z] for x, y in df.values for z in y],columns=df.columns)
Out[488]: 
   A  B
0  1  1
1  1  2
2  2  1
3  2  2

If there are more than two columns, use

s=pd.DataFrame([[x] + [z] for x, y in zip(df.index,df.B) for z in y])
s.merge(df,left_on=0,right_index=True)
Out[491]: 
   0  1  A       B
0  0  1  1  [1, 2]
1  0  2  1  [1, 2]
2  1  1  2  [1, 2]
3  1  2  2  [1, 2]

Method 4
using reindex or loc

df.reindex(df.index.repeat(df.B.str.len())).assign(B=np.concatenate(df.B.values))
Out[554]: 
   A  B
0  1  1
0  1  2
1  2  1
1  2  2

#df.loc[df.index.repeat(df.B.str.len())].assign(B=np.concatenate(df.B.values))

Method 5
when the list only contains unique values:

df=pd.DataFrame({"A":[1,2],"B":[[1,2],[3,4]]})
from collections import ChainMap
d = dict(ChainMap(*map(dict.fromkeys, df["B"], df["A"])))
pd.DataFrame(list(d.items()),columns=df.columns[::-1])
Out[574]: 
   B  A
0  1  1
1  2  1
2  3  2
3  4  2

Method 6
using numpy for high performance:

newvalues=np.dstack((np.repeat(df.A.values,list(map(len,df.B.values))),np.concatenate(df.B.values)))
pd.DataFrame(data=newvalues[0],columns=df.columns)
   A  B
0  1  1
1  1  2
2  2  1
3  2  2

Method 7
using the itertools functions cycle and chain: a pure-Python solution, just for fun

from itertools import cycle,chain
l=df.values.tolist()
l1=[list(zip([x[0]], cycle(x[1])) if len([x[0]]) > len(x[1]) else list(zip(cycle([x[0]]), x[1]))) for x in l]
pd.DataFrame(list(chain.from_iterable(l1)),columns=df.columns)
   A  B
0  1  1
1  1  2
2  2  1
3  2  2

Generalizing to multiple columns

df=pd.DataFrame({"A":[1,2],"B":[[1,2],[3,4]],"C":[[1,2],[3,4]]})
df
Out[592]: 
   A       B       C
0  1  [1, 2]  [1, 2]
1  2  [3, 4]  [3, 4]

Self-defined function:

def unnesting(df, explode):
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([
        pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx

    return df1.join(df.drop(explode, 1), how="left")

        
unnesting(df,["B","C"])
Out[609]: 
   B  C  A
0  1  1  1
0  2  2  1
1  3  3  2
1  4  4  2

Column-wise Unnesting

All of the methods above deal with vertical unnesting (explode). If you need to expand a list horizontally, use the pd.DataFrame constructor:

df.join(pd.DataFrame(df.B.tolist(),index=df.index).add_prefix("B_"))
Out[33]: 
   A       B       C  B_0  B_1
0  1  [1, 2]  [1, 2]    1    2
1  2  [3, 4]  [3, 4]    3    4

Updated function

def unnesting(df, explode, axis):
    if axis==1:
        idx = df.index.repeat(df[explode[0]].str.len())
        df1 = pd.concat([
            pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
        df1.index = idx

        return df1.join(df.drop(explode, 1), how="left")
    else :
        df1 = pd.concat([
                         pd.DataFrame(df[x].tolist(), index=df.index).add_prefix(x) for x in explode], axis=1)
        return df1.join(df.drop(explode, 1), how="left")

Test Output

unnesting(df, ["B","C"], axis=0)
Out[36]: 
   B0  B1  C0  C1  A
0   1   2   1   2  1
1   3   4   3   4  2

Update 2021-02-17, using pandas' own explode function

def unnesting(df, explode, axis):
    if axis==1:
        df1 = pd.concat([df[x].explode() for x in explode], axis=1)
        return df1.join(df.drop(explode, 1), how="left")
    else :
        df1 = pd.concat([
                         pd.DataFrame(df[x].tolist(), index=df.index).add_prefix(x) for x in explode], axis=1)
        return df1.join(df.drop(explode, 1), how="left")

"Large data" workflows using pandas: StackOverflow Questions

How do I merge two dictionaries in a single expression (taking union of dictionaries)?

Question by Carl Meyer

I have two Python dictionaries, and I want to write a single expression that returns these two dictionaries, merged (i.e. taking the union). The update() method would be what I need, if it returned its result instead of modifying a dictionary in-place.

>>> x = {"a": 1, "b": 2}
>>> y = {"b": 10, "c": 11}
>>> z = x.update(y)
>>> print(z)
None
>>> x
{"a": 1, "b": 10, "c": 11}

How can I get that final merged dictionary in z, not x?

(To be extra-clear, the last-one-wins conflict-handling of dict.update() is what I"m looking for as well.)

Accessing the index in "for" loops?

Question by Joan Venge

How do I access the index in a for loop like the following?

ints = [8, 23, 45, 12, 78]
for i in ints:
    print("item #{} = {}".format(???, i))

I want to get this output:

item #1 = 8
item #2 = 23
item #3 = 45
item #4 = 12
item #5 = 78

When I loop through it using a for loop, how do I access the loop index, from 1 to 5 in this case?

Iterating over dictionaries using "for" loops

I am a bit puzzled by the following code:

d = {"x": 1, "y": 2, "z": 3} 
for key in d:
    print (key, "corresponds to", d[key])

What I don"t understand is the key portion. How does Python recognize that it needs only to read the key from the dictionary? Is key a special word in Python? Or is it simply a variable?

Using global variables in a function

How can I create or use a global variable in a function?

If I create a global variable in one function, how can I use that global variable in another function? Do I need to store the global variable in a local variable of the function which needs its access?

Manually raising (throwing) an exception in Python

How can I raise an exception in Python so that it can later be caught via an except block?

Calling a function of a module by using its name (a string)

What is the best way to go about calling a function given a string with the function"s name in a Python program. For example, let"s say that I have a module foo, and I have a string whose content is "bar". What is the best way to call foo.bar()?

I need to get the return value of the function, which is why I don"t just use eval. I figured out how to do it by using eval to define a temp function that returns the result of that function call, but I"m hoping that there is a more elegant way to do this.

What is the meaning of single and double underscore before an object name?

Can someone please explain the exact meaning of having single and double leading underscores before an object"s name in Python, and the difference between both?

Also, does that meaning stay the same regardless of whether the object in question is a variable, a function, a method, etc.?

Save plot to image file instead of displaying it using Matplotlib

I am writing a quick-and-dirty script to generate plots on the fly. I am using the code below (from Matplotlib documentation) as a starting point:

from pylab import figure, axes, pie, title, show

# Make a square figure and axes
figure(1, figsize=(6, 6))
ax = axes([0.1, 0.1, 0.8, 0.8])

labels = "Frogs", "Hogs", "Dogs", "Logs"
fracs = [15, 30, 45, 10]

explode = (0, 0.05, 0, 0)
pie(fracs, explode=explode, labels=labels, autopct="%1.1f%%", shadow=True)
title("Raining Hogs and Dogs", bbox={"facecolor": "0.8", "pad": 5})

show()  # Actually, don"t show, just save to foo.png

I don"t want to display the plot on a GUI, instead, I want to save the plot to a file (say foo.png), so that, for example, it can be used in batch scripts. How do I do that?

What are the differences between type() and isinstance()?

What are the differences between these two code fragments?

Using type():

import types

if type(a) is types.DictType:
    do_something()
if type(b) in types.StringTypes:
    do_something_else()

Using isinstance():

if isinstance(a, dict):
    do_something()
if isinstance(b, str) or isinstance(b, unicode):
    do_something_else()

How can I install packages using pip according to the requirements.txt file from a local directory?

Here is the problem:

I have a requirements.txt file that looks like:

BeautifulSoup==3.2.0
Django==1.3
Fabric==1.2.0
Jinja2==2.5.5
PyYAML==3.09
Pygments==1.4
SQLAlchemy==0.7.1
South==0.7.3
amqplib==0.6.1
anyjson==0.3
...

I have a local archive directory containing all the packages + others.

I have created a new virtualenv with

bin/virtualenv testing

Upon activating it, I tried to install the packages according to requirements.txt from the local archive directory.

source bin/activate
pip install -r /path/to/requirements.txt -f file:///path/to/archive/

I got some output that seems to indicate that the installation is fine:

Downloading/unpacking Fabric==1.2.0 (from -r ../testing/requirements.txt (line 3))
  Running setup.py egg_info for package Fabric
    warning: no previously-included files matching "*" found under directory "docs/_build"
    warning: no files found matching "fabfile.py"
Downloading/unpacking South==0.7.3 (from -r ../testing/requirements.txt (line 8))
  Running setup.py egg_info for package South
....

But a later check revealed none of the package is installed properly. I cannot import the package, and none is found in the site-packages directory of my virtualenv. So what went wrong?

Answer #1

The Python 3 range() object doesn"t produce numbers immediately; it is a smart sequence object that produces numbers on demand. All it contains is your start, stop and step values, then as you iterate over the object the next integer is calculated each iteration.

The object also implements the object.__contains__ hook, and calculates if your number is part of its range. Calculating is a (near) constant time operation *. There is never a need to scan through all possible integers in the range.

From the range() object documentation:

The advantage of the range type over a regular list or tuple is that a range object will always take the same (small) amount of memory, no matter the size of the range it represents (as it only stores the start, stop and step values, calculating individual items and subranges as needed).

So at a minimum, your range() object would do:

class my_range:
    def __init__(self, start, stop=None, step=1, /):
        if stop is None:
            start, stop = 0, start
        self.start, self.stop, self.step = start, stop, step
        if step < 0:
            lo, hi, step = stop, start, -step
        else:
            lo, hi = start, stop
        self.length = 0 if lo > hi else ((hi - lo - 1) // step) + 1

    def __iter__(self):
        current = self.start
        if self.step < 0:
            while current > self.stop:
                yield current
                current += self.step
        else:
            while current < self.stop:
                yield current
                current += self.step

    def __len__(self):
        return self.length

    def __getitem__(self, i):
        if i < 0:
            i += self.length
        if 0 <= i < self.length:
            return self.start + i * self.step
        raise IndexError("my_range object index out of range")

    def __contains__(self, num):
        if self.step < 0:
            if not (self.stop < num <= self.start):
                return False
        else:
            if not (self.start <= num < self.stop):
                return False
        return (num - self.start) % self.step == 0

This is still missing several things that a real range() supports (such as the .index() or .count() methods, hashing, equality testing, or slicing), but should give you an idea.
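
Assuming the my_range sketch above, a quick sanity check against the built-in range shows the same behaviour for length, membership and indexing, even for astronomically large ranges (instant on a 64-bit build):

r, builtin = my_range(0, 10**18, 7), range(0, 10**18, 7)

assert len(r) == len(builtin)
assert (7 * 10**15 in r) and (7 * 10**15 in builtin)
assert (7 * 10**15 + 1 not in r) and (7 * 10**15 + 1 not in builtin)
assert r[123] == builtin[123] == 861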

I also simplified the __contains__ implementation to only focus on integer tests; if you give a real range() object a non-integer value (including subclasses of int), a slow scan is initiated to see if there is a match, just as if you use a containment test against a list of all the contained values. This was done to continue to support other numeric types that just happen to support equality testing with integers but are not expected to support integer arithmetic as well. See the original Python issue that implemented the containment test.


* Near constant time because Python integers are unbounded and so math operations also grow in time as N grows, making this a O(log N) operation. Since it’s all executed in optimised C code and Python stores integer values in 30-bit chunks, you’d run out of memory before you saw any performance impact due to the size of the integers involved here.

Answer #2

Recommendation for beginners:

This is my personal recommendation for beginners: start by learning virtualenv and pip, tools which work with both Python 2 and 3 and in a variety of situations, and pick up other tools once you start needing them.

PyPI packages not in the standard library:

  • virtualenv is a very popular tool that creates isolated Python environments for Python libraries. If you"re not familiar with this tool, I highly recommend learning it, as it is a very useful tool, and I"ll be making comparisons to it for the rest of this answer.

It works by installing a bunch of files in a directory (eg: env/), and then modifying the PATH environment variable to prefix it with a custom bin directory (eg: env/bin/). An exact copy of the python or python3 binary is placed in this directory, but Python is programmed to look for libraries relative to its path first, in the environment directory. It"s not part of Python"s standard library, but is officially blessed by the PyPA (Python Packaging Authority). Once activated, you can install packages in the virtual environment using pip.

  • pyenv is used to isolate Python versions. For example, you may want to test your code against Python 2.7, 3.6, 3.7 and 3.8, so you"ll need a way to switch between them. Once activated, it prefixes the PATH environment variable with ~/.pyenv/shims, where there are special files matching the Python commands (python, pip). These are not copies of the Python-shipped commands; they are special scripts that decide on the fly which version of Python to run based on the PYENV_VERSION environment variable, or the .python-version file, or the ~/.pyenv/version file. pyenv also makes the process of downloading and installing multiple Python versions easier, using the command pyenv install.

  • pyenv-virtualenv is a plugin for pyenv by the same author as pyenv, to allow you to use pyenv and virtualenv at the same time conveniently. However, if you"re using Python 3.3 or later, pyenv-virtualenv will try to run python -m venv if it is available, instead of virtualenv. You can use virtualenv and pyenv together without pyenv-virtualenv, if you don"t want the convenience features.

  • virtualenvwrapper is a set of extensions to virtualenv (see docs). It gives you commands like mkvirtualenv, lssitepackages, and especially workon for switching between different virtualenv directories. This tool is especially useful if you want multiple virtualenv directories.

  • pyenv-virtualenvwrapper is a plugin for pyenv by the same author as pyenv, to conveniently integrate virtualenvwrapper into pyenv.

  • pipenv aims to combine Pipfile, pip and virtualenv into one command on the command-line. The virtualenv directory typically gets placed in ~/.local/share/virtualenvs/XXX, with XXX being a hash of the path of the project directory. This is different from virtualenv, where the directory is typically in the current working directory. pipenv is meant to be used when developing Python applications (as opposed to libraries). There are alternatives to pipenv, such as poetry, which I won"t list here since this question is only about the packages that are similarly named.

Standard library:

  • pyvenv (not to be confused with pyenv in the previous section) is a script shipped with Python 3 but deprecated in Python 3.6 as it had problems (not to mention the confusing name). In Python 3.6+, the exact equivalent is python3 -m venv.

  • venv is a package shipped with Python 3, which you can run using python3 -m venv (although for some reason some distros separate it out into a separate distro package, such as python3-venv on Ubuntu/Debian). It serves the same purpose as virtualenv, but only has a subset of its features (see a comparison here). virtualenv continues to be more popular than venv, especially since the former supports both Python 2 and 3.

Answer #3

You have four main options for converting types in pandas:

  1. to_numeric() - provides functionality to safely convert non-numeric types (e.g. strings) to a suitable numeric type. (See also to_datetime() and to_timedelta().)

  2. astype() - convert (almost) any type to (almost) any other type (even if it's not necessarily sensible to do so). Also allows you to convert to categorical types (very useful).

  3. infer_objects() - a utility method to convert object columns holding Python objects to a pandas type if possible.

  4. convert_dtypes() - convert DataFrame columns to the "best possible" dtype that supports pd.NA (pandas" object to indicate a missing value).

Read on for more detailed explanations and usage of each of these methods.


1. to_numeric()

The best way to convert one or more columns of a DataFrame to numeric values is to use pandas.to_numeric().

This function will try to change non-numeric objects (such as strings) into integers or floating point numbers as appropriate.

Basic usage

The input to to_numeric() is a Series or a single column of a DataFrame.

>>> s = pd.Series(["8", 6, "7.5", 3, "0.9"]) # mixed string and numeric values
>>> s
0      8
1      6
2    7.5
3      3
4    0.9
dtype: object

>>> pd.to_numeric(s) # convert everything to float values
0    8.0
1    6.0
2    7.5
3    3.0
4    0.9
dtype: float64

As you can see, a new Series is returned. Remember to assign this output to a variable or column name to continue using it:

# convert Series
my_series = pd.to_numeric(my_series)

# convert column "a" of a DataFrame
df["a"] = pd.to_numeric(df["a"])

You can also use it to convert multiple columns of a DataFrame via the apply() method:

# convert all columns of DataFrame
df = df.apply(pd.to_numeric)

# convert just columns "a" and "b"
df[["a", "b"]] = df[["a", "b"]].apply(pd.to_numeric)

As long as your values can all be converted, that"s probably all you need.

Error handling

But what if some values can"t be converted to a numeric type?

to_numeric() also takes an errors keyword argument that allows you to force non-numeric values to be NaN, or simply ignore columns containing these values.

Here"s an example using a Series of strings s which has the object dtype:

>>> s = pd.Series(["1", "2", "4.7", "pandas", "10"])
>>> s
0         1
1         2
2       4.7
3    pandas
4        10
dtype: object

The default behaviour is to raise if it can"t convert a value. In this case, it can"t cope with the string "pandas":

>>> pd.to_numeric(s) # or pd.to_numeric(s, errors="raise")
ValueError: Unable to parse string

Rather than fail, we might want "pandas" to be considered a missing/bad numeric value. We can coerce invalid values to NaN as follows using the errors keyword argument:

>>> pd.to_numeric(s, errors="coerce")
0     1.0
1     2.0
2     4.7
3     NaN
4    10.0
dtype: float64

The third option for errors is just to ignore the operation if an invalid value is encountered:

>>> pd.to_numeric(s, errors="ignore")
# the original Series is returned untouched

This last option is particularly useful when you want to convert your entire DataFrame but don't know which of your columns can be converted reliably to a numeric type. In that case, just write:

df.apply(pd.to_numeric, errors="ignore")

The function will be applied to each column of the DataFrame. Columns that can be converted to a numeric type will be converted, while columns that cannot (e.g. they contain non-digit strings or dates) will be left alone.

Downcasting

By default, conversion with to_numeric() will give you either an int64 or float64 dtype (or whatever integer width is native to your platform).

That"s usually what you want, but what if you wanted to save some memory and use a more compact dtype, like float32, or int8?

to_numeric() gives you the option to downcast to either "integer", "signed", "unsigned", "float". Here"s an example for a simple series s of integer type:

>>> s = pd.Series([1, 2, -7])
>>> s
0    1
1    2
2   -7
dtype: int64

Downcasting to "integer" uses the smallest possible integer that can hold the values:

>>> pd.to_numeric(s, downcast="integer")
0    1
1    2
2   -7
dtype: int8

Downcasting to "float" similarly picks a smaller than normal floating type:

>>> pd.to_numeric(s, downcast="float")
0    1.0
1    2.0
2   -7.0
dtype: float32

2. astype()

The astype() method enables you to be explicit about the dtype you want your DataFrame or Series to have. It's very versatile in that you can try to go from one type to any other.

Basic usage

Just pick a type: you can use a NumPy dtype (e.g. np.int16), some Python types (e.g. bool), or pandas-specific types (like the categorical dtype).

Call the method on the object you want to convert and astype() will try and convert it for you:

# convert all DataFrame columns to the int64 dtype
df = df.astype(int)

# convert column "a" to int64 dtype and "b" to complex type
df = df.astype({"a": int, "b": complex})

# convert Series to float16 type
s = s.astype(np.float16)

# convert Series to Python strings
s = s.astype(str)

# convert Series to categorical type - see docs for more details
s = s.astype("category")

Notice I said "try" - if astype() does not know how to convert a value in the Series or DataFrame, it will raise an error. For example if you have a NaN or inf value you"ll get an error trying to convert it to an integer.

As of pandas 0.20.0, this error can be suppressed by passing errors="ignore". Your original object will be returned untouched.
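
For example (a sketch; the exact exception message varies across pandas versions):

>>> s = pd.Series([1.0, 2.0, np.nan])
>>> s.astype(int)
ValueError: Cannot convert non-finite values (NA or inf) to integer
>>> s.astype(int, errors="ignore")     # original object returned untouched
0    1.0
1    2.0
2    NaN
dtype: float64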

Be careful

astype() is powerful, but it will sometimes convert values "incorrectly". For example:

>>> s = pd.Series([1, 2, -7])
>>> s
0    1
1    2
2   -7
dtype: int64

These are small integers, so how about converting to an unsigned 8-bit type to save memory?

>>> s.astype(np.uint8)
0      1
1      2
2    249
dtype: uint8

The conversion worked, but the -7 was wrapped around to become 249 (i.e. 2^8 - 7)!

Trying to downcast using pd.to_numeric(s, downcast="unsigned") instead could help prevent this error.


3. infer_objects()

Version 0.21.0 of pandas introduced the method infer_objects() for converting columns of a DataFrame that have an object datatype to a more specific type (soft conversions).

For example, here"s a DataFrame with two columns of object type. One holds actual integers and the other holds strings representing integers:

>>> df = pd.DataFrame({"a": [7, 1, 5], "b": ["3","2","1"]}, dtype="object")
>>> df.dtypes
a    object
b    object
dtype: object

Using infer_objects(), you can change the type of column "a" to int64:

>>> df = df.infer_objects()
>>> df.dtypes
a     int64
b    object
dtype: object

Column "b" has been left alone since its values were strings, not integers. If you wanted to try and force the conversion of both columns to an integer type, you could use df.astype(int) instead.


4. convert_dtypes()

Version 1.0 and above includes a method convert_dtypes() to convert Series and DataFrame columns to the best possible dtype that supports the pd.NA missing value.

Here "best possible" means the type most suited to hold the values. For example, this a pandas integer type if all of the values are integers (or missing values): an object column of Python integer objects is converted to Int64, a column of NumPy int32 values will become the pandas dtype Int32.

With our object DataFrame df, we get the following result:

>>> df.convert_dtypes().dtypes                                             
a     Int64
b    string
dtype: object

Since column "a" held integer values, it was converted to the Int64 type (which is capable of holding missing values, unlike int64).

Column "b" contained string objects, so was changed to pandas" string dtype.

By default, this method will infer the type from object values in each column. We can change this by passing infer_objects=False:

>>> df.convert_dtypes(infer_objects=False).dtypes                          
a    object
b    string
dtype: object

Now column "a" remained an object column: pandas knows it can be described as an "integer" column (internally it ran infer_dtype) but didn"t infer exactly what dtype of integer it should have so did not convert it. Column "b" was again converted to "string" dtype as it was recognised as holding "string" values.

Answer #4

Since this question was asked in 2010, there has been real simplification in how to do simple multithreading with Python with map and pool.

The code below comes from an article/blog post that you should definitely check out (no affiliation) - Parallelism in one line: A Better Model for Day to Day Threading Tasks. I"ll summarize below - it ends up being just a few lines of code:

from multiprocessing.dummy import Pool as ThreadPool
pool = ThreadPool(4)
results = pool.map(my_function, my_array)

Which is the multithreaded version of:

results = []
for item in my_array:
    results.append(my_function(item))

Description

Map is a cool little function, and the key to easily injecting parallelism into your Python code. For those unfamiliar, map is something lifted from functional languages like Lisp. It is a function which maps another function over a sequence.

Map handles the iteration over the sequence for us, applies the function, and stores all of the results in a handy list at the end.

[figure: map applies a function to each item of a sequence and collects the results in a list]


Implementation

Parallel versions of the map function are provided by two libraries: multiprocessing, and also its little-known but equally fantastic stepchild: multiprocessing.dummy.

multiprocessing.dummy is exactly the same as multiprocessing module, but uses threads instead (an important distinction - use multiple processes for CPU-intensive tasks; threads for (and during) I/O):

multiprocessing.dummy replicates the API of multiprocessing, but is no more than a wrapper around the threading module.

import urllib2
from multiprocessing.dummy import Pool as ThreadPool

urls = [
  "http://www.python.org",
  "http://www.python.org/about/",
  "http://www.onlamp.com/pub/a/python/2003/04/17/metaclasses.html",
  "http://www.python.org/doc/",
  "http://www.python.org/download/",
  "http://www.python.org/getit/",
  "http://www.python.org/community/",
  "https://wiki.python.org/moin/",
]

# Make the Pool of workers
pool = ThreadPool(4)

# Open the URLs in their own threads
# and return the results
results = pool.map(urllib2.urlopen, urls)

# Close the pool and wait for the work to finish
pool.close()
pool.join()

And the timing results:

Single thread:   14.4 seconds
       4 Pool:   3.1 seconds
       8 Pool:   1.4 seconds
      13 Pool:   1.3 seconds
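
The snippet above uses Python 2's urllib2; a rough Python 3 equivalent of the same pool pattern looks like this (a sketch, not from the original post):

from multiprocessing.dummy import Pool as ThreadPool
from urllib.request import urlopen

urls = [
    "http://www.python.org",
    "http://www.python.org/about/",
    "http://www.python.org/doc/",
]

# map blocks until every URL has been fetched; the with-block
# then tears the pool down for us.
with ThreadPool(4) as pool:
    results = pool.map(urlopen, urls)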

Passing multiple arguments (works like this only in Python 3.3 and later):

To pass multiple arrays:

results = pool.starmap(function, zip(list_a, list_b))

Or to pass a constant and an array:

results = pool.starmap(function, zip(itertools.repeat(constant), list_a))

If you are using an earlier version of Python, you can pass multiple arguments via this workaround.

(Thanks to user136036 for the helpful comment.)

Answer #5

How to iterate over rows in a DataFrame in Pandas?

Answer: DON"T*!

Iteration in Pandas is an anti-pattern and is something you should only do when you have exhausted every other option. You should not use any function with "iter" in its name for more than a few thousand rows or you will have to get used to a lot of waiting.

Do you want to print a DataFrame? Use DataFrame.to_string().

Do you want to compute something? In that case, search for methods in this order (list modified from here):

  1. Vectorization
  2. Cython routines
  3. List Comprehensions (vanilla for loop)
  4. DataFrame.apply(): i)  Reductions that can be performed in Cython, ii) Iteration in Python space
  5. DataFrame.itertuples() and iteritems()
  6. DataFrame.iterrows()

iterrows and itertuples (both receiving many votes in answers to this question) should be used in very rare circumstances, such as generating row objects/namedtuples for sequential processing, which is really the only thing these functions are useful for.

Appeal to Authority

The documentation page on iteration has a huge red warning box that says:

Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is not needed [...].

* It"s actually a little more complicated than "don"t". df.iterrows() is the correct answer to this question, but "vectorize your ops" is the better one. I will concede that there are circumstances where iteration cannot be avoided (for example, some operations where the result depends on the value computed for the previous row). However, it takes some familiarity with the library to know when. If you"re not sure whether you need an iterative solution, you probably don"t. PS: To know more about my rationale for writing this answer, skip to the very bottom.


Faster than Looping: Vectorization, Cython

A good number of basic operations and computations are "vectorised" by pandas (either through NumPy, or through Cythonized functions). This includes arithmetic, comparisons, (most) reductions, reshaping (such as pivoting), joins, and groupby operations. Look through the documentation on Essential Basic Functionality to find a suitable vectorised method for your problem.

If none exists, feel free to write your own using custom Cython extensions.


Next Best Thing: List Comprehensions*

List comprehensions should be your next port of call if 1) there is no vectorized solution available, 2) performance is important, but not important enough to go through the hassle of cythonizing your code, and 3) you're trying to perform an elementwise transformation on your data. There is a good amount of evidence to suggest that list comprehensions are sufficiently fast (and even sometimes faster) for many common Pandas tasks.

The formula is simple,

# Iterating over one column - `f` is some function that processes your data
result = [f(x) for x in df["col"]]
# Iterating over two columns, use `zip`
result = [f(x, y) for x, y in zip(df["col1"], df["col2"])]
# Iterating over multiple columns - same data type
result = [f(row[0], ..., row[n]) for row in df[["col1", ...,"coln"]].to_numpy()]
# Iterating over multiple columns - differing data type
result = [f(row[0], ..., row[n]) for row in zip(df["col1"], ..., df["coln"])]

If you can encapsulate your business logic into a function, you can use a list comprehension that calls it. You can make arbitrarily complex things work through the simplicity and speed of raw Python code.

Caveats

List comprehensions assume that your data is easy to work with - what that means is your data types are consistent and you don"t have NaNs, but this cannot always be guaranteed.

  1. The first one is more obvious, but when dealing with NaNs, prefer in-built pandas methods if they exist (because they have much better corner-case handling logic), or ensure your business logic includes appropriate NaN handling logic.
  2. When dealing with mixed data types you should iterate over zip(df["A"], df["B"], ...) instead of df[["A", "B"]].to_numpy() as the latter implicitly upcasts data to the most common type. As an example if A is numeric and B is string, to_numpy() will cast the entire array to string, which may not be what you want. Fortunately zipping your columns together is the most straightforward workaround to this.

*Your mileage may vary for the reasons outlined in the Caveats section above.


An Obvious Example

Let"s demonstrate the difference with a simple example of adding two pandas columns A + B. This is a vectorizable operaton, so it will be easy to contrast the performance of the methods discussed above.

Benchmarking code, for your reference. The line at the bottom measures a function written in numpandas, a style of Pandas that mixes heavily with NumPy to squeeze out maximum performance. Writing numpandas code should be avoided unless you know what you"re doing. Stick to the API where you can (i.e., prefer vec over vec_numpy).

I should mention, however, that it isn"t always this cut and dry. Sometimes the answer to "what is the best method for an operation" is "it depends on your data". My advice is to test out different approaches on your data before settling on one.
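
As a minimal sketch of what that comparison looks like (synthetic data; use timeit or %timeit on your own data for actual numbers):

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": np.random.rand(10_000), "B": np.random.rand(10_000)})

vec = df["A"] + df["B"]                                    # vectorised: runs in C
lc = pd.Series([a + b for a, b in zip(df["A"], df["B"])])  # Python loop, no per-row Series
ir = pd.Series([row["A"] + row["B"] for _, row in df.iterrows()])  # slowest by far

assert np.allclose(vec, lc) and np.allclose(vec, ir)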


Further Reading

* Pandas string methods are "vectorized" in the sense that they are specified on the series but operate on each element. The underlying mechanisms are still iterative, because string operations are inherently hard to vectorize.


Why I Wrote this Answer

A common trend I notice from new users is to ask questions of the form "How can I iterate over my df to do X?", showing code that calls iterrows() inside a for loop. Here is why. A new user to the library who has not been introduced to the concept of vectorization will likely envision the code that solves their problem as iterating over their data to do something. Not knowing how to iterate over a DataFrame, the first thing they do is Google it and end up here, at this question. They then see the accepted answer telling them how to, and they close their eyes and run this code without ever first questioning whether iteration is the right thing to do.

The aim of this answer is to help new users understand that iteration is not necessarily the solution to every problem, and that better, faster and more idiomatic solutions could exist, and that it is worth investing time in exploring them. I'm not trying to start a war of iteration vs. vectorization, but I want new users to be informed when developing solutions to their problems with this library.

Answer #6

In Python, what is the purpose of __slots__, and what are the cases in which one should avoid it?

TLDR:

The special attribute __slots__ allows you to explicitly state which instance attributes you expect your object instances to have, with the expected results:

  1. faster attribute access.
  2. space savings in memory.

The space savings is from

  1. Storing value references in slots instead of __dict__.
  2. Denying __dict__ and __weakref__ creation if parent classes deny them and you declare __slots__.

Quick Caveats

Small caveat, you should only declare a particular slot one time in an inheritance tree. For example:

class Base:
    __slots__ = "foo", "bar"

class Right(Base):
    __slots__ = "baz", 

class Wrong(Base):
    __slots__ = "foo", "bar", "baz"        # redundant foo and bar

Python doesn"t object when you get this wrong (it probably should), problems might not otherwise manifest, but your objects will take up more space than they otherwise should. Python 3.8:

>>> from sys import getsizeof
>>> getsizeof(Right()), getsizeof(Wrong())
(56, 72)

This is because Base's slot descriptor has a slot separate from Wrong's. This shouldn't usually come up, but it could:

>>> w = Wrong()
>>> w.foo = "foo"
>>> Base.foo.__get__(w)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: foo
>>> Wrong.foo.__get__(w)
"foo"

The biggest caveat is for multiple inheritance - multiple "parent classes with nonempty slots" cannot be combined.

To accommodate this restriction, follow best practice: factor out all (or all but one) of the parents' abstractions into classes with empty __slots__ (just like the abstract base classes in the standard library); their concrete classes, and your new concrete class, then collectively inherit from those abstractions.

See section on multiple inheritance below for an example.

Requirements:

  • For attributes named in __slots__ to actually be stored in slots instead of a __dict__, a class must inherit from object (automatic in Python 3, but must be explicit in Python 2).

  • To prevent the creation of a __dict__, you must inherit from object, all classes in the inheritance tree must declare __slots__, and none of them can have a "__dict__" entry.

There are a lot of details if you wish to keep reading.

Why use __slots__: Faster attribute access.

The creator of Python, Guido van Rossum, states that he actually created __slots__ for faster attribute access.

It is trivial to demonstrate measurably faster attribute access:

import timeit

class Foo(object): __slots__ = "foo",

class Bar(object): pass

slotted = Foo()
not_slotted = Bar()

def get_set_delete_fn(obj):
    def get_set_delete():
        obj.foo = "foo"
        obj.foo
        del obj.foo
    return get_set_delete

and

>>> min(timeit.repeat(get_set_delete_fn(slotted)))
0.2846834529991611
>>> min(timeit.repeat(get_set_delete_fn(not_slotted)))
0.3664822799983085

The slotted access is almost 30% faster in Python 3.5 on Ubuntu.

>>> 0.3664822799983085 / 0.2846834529991611
1.2873325658284342

In Python 2 on Windows I have measured it about 15% faster.

Why use __slots__: Memory Savings

Another purpose of __slots__ is to reduce the space in memory that each object instance takes up.

My own contribution to the documentation clearly states the reasons behind this:

The space saved over using __dict__ can be significant.

SQLAlchemy attributes a lot of memory savings to __slots__.

To verify this, using the Anaconda distribution of Python 2.7 on Ubuntu Linux, with guppy.hpy (aka heapy) and sys.getsizeof, the size of a class instance without __slots__ declared, and nothing else, is 64 bytes. That does not include the __dict__: thanks to lazy evaluation, the __dict__ is apparently not called into existence until it is referenced, but classes without data are usually useless. When called into existence, the __dict__ attribute is a minimum of 280 additional bytes.

In contrast, a class instance with __slots__ declared to be () (no data) is only 16 bytes, and 56 total bytes with one item in slots, 64 with two.

For 64 bit Python, I illustrate the memory consumption in bytes in Python 2.7 and 3.6, for __slots__ and __dict__ (no slots defined) for each point where the dict grows in 3.6 (except for 0, 1, and 2 attributes):

       Python 2.7             Python 3.6
attrs  __slots__  __dict__*   __slots__  __dict__* | *(no slots defined)
none   16         56 + 272†   16         56 + 112† | †if __dict__ referenced
one    48         56 + 272    48         56 + 112
two    56         56 + 272    56         56 + 112
six    88         56 + 1040   88         56 + 152
11     128        56 + 1040   128        56 + 240
22     216        56 + 3344   216        56 + 408     
43     384        56 + 3344   384        56 + 752

So, in spite of smaller dicts in Python 3, we see how nicely __slots__ scale for instances to save us memory, and that is a major reason you would want to use __slots__.
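
A quick way to reproduce the flavour of these numbers yourself (a minimal sketch; the exact byte counts depend on the Python version and build):

import sys

class WithDict:
    pass

class WithSlots:
    __slots__ = ("a", "b")

d, s = WithDict(), WithSlots()
d.a, d.b = 1, 2
s.a, s.b = 1, 2

print(sys.getsizeof(d), sys.getsizeof(d.__dict__))  # instance plus its separate __dict__
print(sys.getsizeof(s))                             # slotted instance only - no __dict__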

Just for completeness of my notes, note that there is a one-time cost per slot in the class's namespace of 64 bytes in Python 2, and 72 bytes in Python 3, because slots use data descriptors like properties, called "members".

>>> Foo.foo
<member "foo" of "Foo" objects>
>>> type(Foo.foo)
<class "member_descriptor">
>>> getsizeof(Foo.foo)
72

Demonstration of __slots__:

To deny the creation of a __dict__, you must subclass object. Everything subclasses object in Python 3, but in Python 2 you had to be explicit:

class Base(object): 
    __slots__ = ()

now:

>>> b = Base()
>>> b.a = "a"
Traceback (most recent call last):
  File "<pyshell#38>", line 1, in <module>
    b.a = "a"
AttributeError: "Base" object has no attribute "a"

Or subclass another class that defines __slots__

class Child(Base):
    __slots__ = ("a",)

and now:

c = Child()
c.a = "a"

but:

>>> c.b = "b"
Traceback (most recent call last):
  File "<pyshell#42>", line 1, in <module>
    c.b = "b"
AttributeError: "Child" object has no attribute "b"

To allow __dict__ creation while subclassing slotted objects, just add "__dict__" to the __slots__ (note that slots are ordered, and you shouldn't repeat slots that are already in parent classes):

class SlottedWithDict(Child): 
    __slots__ = ("__dict__", "b")

swd = SlottedWithDict()
swd.a = "a"
swd.b = "b"
swd.c = "c"

and

>>> swd.__dict__
{"c": "c"}

Or you don"t even need to declare __slots__ in your subclass, and you will still use slots from the parents, but not restrict the creation of a __dict__:

class NoSlots(Child): pass
ns = NoSlots()
ns.a = "a"
ns.b = "b"

And:

>>> ns.__dict__
{"b": "b"}

However, __slots__ may cause problems for multiple inheritance:

class BaseA(object): 
    __slots__ = ("a",)

class BaseB(object): 
    __slots__ = ("b",)

Because creating a child class from parents with both non-empty slots fails:

>>> class Child(BaseA, BaseB): __slots__ = ()
Traceback (most recent call last):
  File "<pyshell#68>", line 1, in <module>
    class Child(BaseA, BaseB): __slots__ = ()
TypeError: Error when calling the metaclass bases
    multiple bases have instance lay-out conflict

If you run into this problem, you could just remove __slots__ from the parents, or, if you have control of the parents, give them empty slots, or refactor to abstractions:

from abc import ABC

class AbstractA(ABC):
    __slots__ = ()

class BaseA(AbstractA): 
    __slots__ = ("a",)

class AbstractB(ABC):
    __slots__ = ()

class BaseB(AbstractB): 
    __slots__ = ("b",)

class Child(AbstractA, AbstractB): 
    __slots__ = ("a", "b")

c = Child() # no problem!

Add "__dict__" to __slots__ to get dynamic assignment:

class Foo(object):
    __slots__ = "bar", "baz", "__dict__"

and now:

>>> foo = Foo()
>>> foo.boink = "boink"

So with "__dict__" in slots we lose some of the size benefits with the upside of having dynamic assignment and still having slots for the names we do expect.

When you inherit from an object that isn"t slotted, you get the same sort of semantics when you use __slots__ - names that are in __slots__ point to slotted values, while any other values are put in the instance"s __dict__.

Avoiding __slots__ because you want to be able to add attributes on the fly is actually not a good reason - just add "__dict__" to your __slots__ if this is required.

You can similarly add __weakref__ to __slots__ explicitly if you need that feature.
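
For example (a minimal sketch), weak references require either a __dict__ or an explicit "__weakref__" slot:

import weakref

class NoWeak:
    __slots__ = ("x",)

class WithWeak:
    __slots__ = ("x", "__weakref__")

try:
    weakref.ref(NoWeak())
except TypeError as e:
    print(e)                    # cannot create weak reference to 'NoWeak' object
print(weakref.ref(WithWeak()))  # works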

Set to empty tuple when subclassing a namedtuple:

The namedtuple builtin makes immutable instances that are very lightweight (essentially, the size of tuples), but to get the benefits you need to declare empty __slots__ yourself if you subclass them:

from collections import namedtuple
class MyNT(namedtuple("MyNT", "bar baz")):
    """MyNT is an immutable and lightweight object"""
    __slots__ = ()

usage:

>>> nt = MyNT("bar", "baz")
>>> nt.bar
"bar"
>>> nt.baz
"baz"

And trying to assign an unexpected attribute raises an AttributeError because we have prevented the creation of __dict__:

>>> nt.quux = "quux"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: "MyNT" object has no attribute "quux"

You can allow __dict__ creation by leaving off __slots__ = (), but you can't use non-empty __slots__ with subtypes of tuple.
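
A minimal sketch of that restriction:

# Non-empty __slots__ on a subclass of a variable-length builtin like tuple
# fails at class-creation time.
try:
    class BadNT(tuple):
        __slots__ = ("extra",)
except TypeError as e:
    print(e)  # nonempty __slots__ not supported for subtype of 'tuple'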

Biggest Caveat: Multiple inheritance

Even when non-empty slots are the same for multiple parents, they cannot be used together:

class Foo(object): 
    __slots__ = "foo", "bar"
class Bar(object):
    __slots__ = "foo", "bar" # alas, would work if empty, i.e. ()

>>> class Baz(Foo, Bar): pass
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Error when calling the metaclass bases
    multiple bases have instance lay-out conflict

Using an empty __slots__ in the parent seems to provide the most flexibility, allowing the child to choose to prevent or allow (by adding "__dict__" to get dynamic assignment, see section above) the creation of a __dict__:

class Foo(object): __slots__ = ()
class Bar(object): __slots__ = ()
class Baz(Foo, Bar): __slots__ = ("foo", "bar")
b = Baz()
b.foo, b.bar = "foo", "bar"

You don"t have to have slots - so if you add them, and remove them later, it shouldn"t cause any problems.

Going out on a limb here: if you're composing mixins or using abstract base classes, which aren't intended to be instantiated, an empty __slots__ in those parents seems to be the best way to go in terms of flexibility for subclassers.

To demonstrate, first, let's create a class with code we'd like to use under multiple inheritance:

class AbstractBase:
    __slots__ = ()
    def __init__(self, a, b):
        self.a = a
        self.b = b
    def __repr__(self):
        return f"{type(self).__name__}({repr(self.a)}, {repr(self.b)})"

We could use the above directly by inheriting and declaring the expected slots:

class Foo(AbstractBase):
    __slots__ = "a", "b"

But we don"t care about that, that"s trivial single inheritance, we need another class we might also inherit from, maybe with a noisy attribute:

class AbstractBaseC:
    __slots__ = ()
    @property
    def c(self):
        print("getting c!")
        return self._c
    @c.setter
    def c(self, arg):
        print("setting c!")
        self._c = arg

Now if both bases had nonempty slots, we couldn't do the below. (In fact, if we wanted, we could have given AbstractBase nonempty slots a and b, and left them out of the below declaration - leaving them in would be wrong):

class Concretion(AbstractBase, AbstractBaseC):
    __slots__ = "a b _c".split()

And now we have functionality from both via multiple inheritance, and can still deny __dict__ and __weakref__ instantiation:

>>> c = Concretion("a", "b")
>>> c.c = c
setting c!
>>> c.c
getting c!
Concretion("a", "b")
>>> c.d = "d"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: "Concretion" object has no attribute "d"

Other cases to avoid slots:

  • Avoid them when you want to perform __class__ assignment with another class that doesn't have them (and you can't add them) unless the slot layouts are identical. (I am very interested in learning who is doing this and why.)
  • Avoid them if you want to subclass variable length builtins like long, tuple, or str, and you want to add attributes to them.
  • Avoid them if you insist on providing default values via class attributes for instance variables.

You may be able to tease out further caveats from the rest of the __slots__ documentation (the 3.7 dev docs are the most current), which I have made significant recent contributions to.

Critiques of other answers

The current top answers cite outdated information and are quite hand-wavy and miss the mark in some important ways.

Do not "only use __slots__ when instantiating lots of objects"

I quote:

"You would want to use __slots__ if you are going to instantiate a lot (hundreds, thousands) of objects of the same class."

Abstract Base Classes, for example, from the collections module, are not instantiated, yet __slots__ are declared for them.

Why?

If a user wishes to deny __dict__ or __weakref__ creation, those things must not be available in the parent classes.

__slots__ contributes to reusability when creating interfaces or mixins.

It is true that many Python users aren't writing for reusability, but when you are, having the option to deny unnecessary space usage is valuable.

__slots__ doesn"t break pickling

When pickling a slotted object, you may find it complains with a misleading TypeError:

>>> pickle.loads(pickle.dumps(f))
TypeError: a class that defines __slots__ without defining __getstate__ cannot be pickled

This is actually incorrect. This message comes from the oldest protocol, which is the default. You can select the latest protocol with the -1 argument. In Python 2.7 this would be 2 (which was introduced in 2.3), and in 3.6 it is 4.

>>> pickle.loads(pickle.dumps(f, -1))
<__main__.Foo object at 0x1129C770>

in Python 2.7:

>>> pickle.loads(pickle.dumps(f, 2))
<__main__.Foo object at 0x1129C770>

in Python 3.6

>>> pickle.loads(pickle.dumps(f, 4))
<__main__.Foo object at 0x1129C770>

So I would keep this in mind, as it is a solved problem.

Critique of the (until Oct 2, 2016) accepted answer

The first paragraph is half short explanation, half predictive. Here's the only part that actually answers the question:

The proper use of __slots__ is to save space in objects. Instead of having a dynamic dict that allows adding attributes to objects at anytime, there is a static structure which does not allow additions after creation. This saves the overhead of one dict for every object that uses slots

The second half is wishful thinking, and off the mark:

While this is sometimes a useful optimization, it would be completely unnecessary if the Python interpreter was dynamic enough so that it would only require the dict when there actually were additions to the object.

Python actually does something similar to this, only creating the __dict__ when it is accessed, but creating lots of objects with no data is fairly ridiculous.

The second paragraph oversimplifies and misses actual reasons to avoid __slots__. The below is not a real reason to avoid slots (for actual reasons, see the rest of my answer above.):

They change the behavior of the objects that have slots in a way that can be abused by control freaks and static typing weenies.

It then goes on to discuss other ways of accomplishing that perverse goal with Python, not discussing anything to do with __slots__.

The third paragraph is more wishful thinking. Together it is mostly off-the-mark content that the answerer didn't even author and contributes to ammunition for critics of the site.

Memory usage evidence

Create some normal objects and slotted objects:

>>> class Foo(object): pass
>>> class Bar(object): __slots__ = ()

Instantiate a million of them:

>>> foos = [Foo() for f in xrange(1000000)]
>>> bars = [Bar() for b in xrange(1000000)]

Inspect with guppy.hpy().heap():

>>> guppy.hpy().heap()
Partition of a set of 2028259 objects. Total size = 99763360 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0 1000000  49 64000000  64  64000000  64 __main__.Foo
     1     169   0 16281480  16  80281480  80 list
     2 1000000  49 16000000  16  96281480  97 __main__.Bar
     3   12284   1   987472   1  97268952  97 str
...

Access the regular objects and their __dict__ and inspect again:

>>> for f in foos:
...     f.__dict__
>>> guppy.hpy().heap()
Partition of a set of 3028258 objects. Total size = 379763480 bytes.
 Index  Count   %      Size    % Cumulative  % Kind (class / dict of class)
     0 1000000  33 280000000  74 280000000  74 dict of __main__.Foo
     1 1000000  33  64000000  17 344000000  91 __main__.Foo
     2     169   0  16281480   4 360281480  95 list
     3 1000000  33  16000000   4 376281480  99 __main__.Bar
     4   12284   0    987472   0 377268952  99 str
...

This is consistent with the history of Python, from Unifying types and classes in Python 2.2

If you subclass a built-in type, extra space is automatically added to the instances to accommodate __dict__ and __weakrefs__. (The __dict__ is not initialized until you use it though, so you shouldn't worry about the space occupied by an empty dictionary for each instance you create.) If you don't need this extra space, you can add the phrase "__slots__ = []" to your class.

Answer #7

os.listdir() - list in the current directory

With listdir in os module you get the files and the folders in the current dir

 import os
 arr = os.listdir()
 print(arr)
 
 >>> ["$RECYCLE.BIN", "work.txt", "3ebooks.txt", "documents"]

Looking in a directory

arr = os.listdir("c:\files")

glob from glob

with glob you can specify a type of file to list like this

import glob

txtfiles = []
for file in glob.glob("*.txt"):
    txtfiles.append(file)

glob in a list comprehension

mylist = [f for f in glob.glob("*.txt")]

get the full path of only files in the current directory

import os
from os import listdir
from os.path import isfile, join

cwd = os.getcwd()
onlyfiles = [os.path.join(cwd, f) for f in os.listdir(cwd) if 
os.path.isfile(os.path.join(cwd, f))]
print(onlyfiles) 

["G:\getfilesname\getfilesname.py", "G:\getfilesname\example.txt"]

Getting the full path name with os.path.abspath

You get the full path in return

 import os
 files_path = [os.path.abspath(x) for x in os.listdir()]
 print(files_path)
 
 ["F:\documentiapplications.txt", "F:\documenticollections.txt"]

Walk: going through sub directories

os.walk returns the root, the list of directories and the list of files; that is why I unpack them into r, d, f in the for loop. It then looks for other files and directories in the subfolders of the root, and so on, until there are no subfolders left.

import os

# Getting the current work directory (cwd)
thisdir = os.getcwd()

# r=root, d=directories, f = files
for r, d, f in os.walk(thisdir):
    for file in f:
        if file.endswith(".docx"):
            print(os.path.join(r, file))

os.listdir(): get files in the current directory (Python 2)

In Python 2, if you want the list of the files in the current directory, you have to give the argument as "." or os.getcwd() in the os.listdir method.

 import os
 arr = os.listdir(".")
 print(arr)
 
 >>> ["$RECYCLE.BIN", "work.txt", "3ebooks.txt", "documents"]

To go up in the directory tree

# Method 1
x = os.listdir("..")

# Method 2
x= os.listdir("/")

Get files: os.listdir() in a particular directory (Python 2 and 3)

 import os
 arr = os.listdir("F:\python")
 print(arr)
 
 >>> ["$RECYCLE.BIN", "work.txt", "3ebooks.txt", "documents"]

Get files of a particular subdirectory with os.listdir()

import os

x = os.listdir("./content")

os.walk(".") - current directory

 import os
 arr = next(os.walk("."))[2]
 print(arr)
 
 >>> ["5bs_Turismo1.pdf", "5bs_Turismo1.pptx", "esperienza.txt"]

next(os.walk(".")) and os.path.join("dir", "file")

 import os
 arr = []
 r, d, f = next(os.walk(r"F:\_python"))
 for file in f:
     arr.append(os.path.join(r, file))

 for f in arr:
     print(f)

>>> F:\_python\dict_class.py
>>> F:\_python\programmi.txt

next(os.walk("F:\") - get the full path - list comprehension

 [os.path.join(r,file) for r,d,f in next(os.walk("F:\_python")) for file in f]
 
 >>> ["F:\_python\dict_class.py", "F:\_python\programmi.txt"]

os.walk - get full path - all files in sub dirs**

x = [os.path.join(r, file) for r, d, f in os.walk(r"F:\_python") for file in f]
print(x)

>>> ["F:\_python\dict.py", "F:\_python\progr.txt", "F:\_python\readl.py"]

os.listdir() - get only txt files

 arr_txt = [x for x in os.listdir() if x.endswith(".txt")]
 print(arr_txt)
 
 >>> ["work.txt", "3ebooks.txt"]

Using glob to get the full path of the files

If I should need the absolute path of the files:

from path import path
from glob import glob
x = [path(f).abspath() for f in glob(r"F:\*.txt")]
for f in x:
    print(f)

>>> F:acquistionline.txt
>>> F:acquisti_2018.txt
>>> F:ootstrap_jquery_ecc.txt

Using os.path.isfile to avoid directories in the list

import os.path
listOfFiles = [f for f in os.listdir() if os.path.isfile(f)]
print(listOfFiles)

>>> ["a simple game.py", "data.txt", "decorator.py"]

Using pathlib from Python 3.4

import pathlib

flist = []
for p in pathlib.Path(".").iterdir():
    if p.is_file():
        print(p)
        flist.append(p)

 >>> error.PNG
 >>> exemaker.bat
 >>> guiprova.mp3
 >>> setup.py
 >>> speak_gui2.py
 >>> thumb.PNG

With list comprehension:

flist = [p for p in pathlib.Path(".").iterdir() if p.is_file()]

Alternatively, use pathlib.Path() instead of pathlib.Path(".")

Use glob method in pathlib.Path()

import pathlib

py = pathlib.Path().glob("*.py")
for file in py:
    print(file)

>>> stack_overflow_list.py
>>> stack_overflow_list_tkinter.py

Get all and only files with os.walk

import os
x = [i[2] for i in os.walk(".")]
y=[]
for t in x:
    for f in t:
        y.append(f)
print(y)

>>> ["append_to_list.py", "data.txt", "data1.txt", "data2.txt", "data_180617", "os_walk.py", "READ2.py", "read_data.py", "somma_defaltdic.py", "substitute_words.py", "sum_data.py", "data.txt", "data1.txt", "data_180617"]

Get only files with next and walk in a directory

 import os
 x = next(os.walk("F://python"))[2]
 print(x)
 
 >>> ["calculator.bat","calculator.py"]

Get only directories with next and walk in a directory

 import os
 next(os.walk("F://python"))[1] # for the current dir use (".")
 
 >>> ["python3","others"]

Get all the subdir names with walk

for r, d, f in os.walk(r"F:\_python"):
    for dirs in d:
        print(dirs)

>>> .vscode
>>> pyexcel
>>> pyschool.py
>>> subtitles
>>> _metaprogramming
>>> .ipynb_checkpoints

os.scandir() from Python 3.5 and greater

import os
x = [f.name for f in os.scandir() if f.is_file()]
print(x)

>>> ["calculator.bat","calculator.py"]

# Another example with scandir (a little variation from docs.python.org)
# This one is more efficient than os.listdir.
# In this case, it shows the files only in the current directory
# where the script is executed.

import os
with os.scandir() as i:
    for entry in i:
        if entry.is_file():
            print(entry.name)

>>> ebookmaker.py
>>> error.PNG
>>> exemaker.bat
>>> guiprova.mp3
>>> setup.py
>>> speakgui4.py
>>> speak_gui2.py
>>> speak_gui3.py
>>> thumb.PNG

Examples:

Ex. 1: How many files are there in the subdirectories?

In this example, we count the number of files included in a directory and all of its subdirectories.

import os

def count(dir, counter=0):
    "returns number of files in dir and subdirs"
    for pack in os.walk(dir):
        for f in pack[2]:
            counter += 1
    return dir + " : " + str(counter) + "files"

print(count("F:\python"))

>>> "F:\python" : 12057 files"

Ex.2: How to copy all files from a directory to another?

A script to make order in your computer finding all files of a type (default: pptx) and copying them in a new folder.

import os
import shutil
from path import path

destination = "F:\file_copied"
# os.makedirs(destination)

def copyfile(dir, filetype="pptx", counter=0):
    "Searches for pptx (or other - pptx is the default) files and copies them"
    for pack in os.walk(dir):
        for f in pack[2]:
            if f.endswith(filetype):
                fullpath = pack[0] + "\" + f
                print(fullpath)
                shutil.copy(fullpath, destination)
                counter += 1
    if counter > 0:
        print("-" * 30)
        print("	==> Found in: `" + dir + "` : " + str(counter) + " files
")

for dir in os.listdir():
    "searches for folders that starts with `_`"
    if dir[0] == "_":
        # copyfile(dir, filetype="pdf")
        copyfile(dir, filetype="txt")


>>> _compiti18Compito Contabilità 1conti.txt
>>> _compiti18Compito Contabilità 1modula4.txt
>>> _compiti18Compito Contabilità 1moduloa4.txt
>>> ------------------------
>>> ==> Found in: `_compiti18` : 3 files

Ex. 3: How to get all the files in a txt file

In case you want to create a txt file with all the file names:

import os
mylist = ""
with open("filelist.txt", "w", encoding="utf-8") as file:
    for eachfile in os.listdir():
        mylist += eachfile + "\n"
    file.write(mylist)

Example: txt with all the files of a hard drive

"""
We are going to save a txt file with all the files in your directory.
We will use the function walk()
"""

import os

# see all the methods of os
# print(*dir(os), sep=", ")
listafile = []
percorso = []
with open("lista_file.txt", "w", encoding="utf-8") as testo:
    for root, dirs, files in os.walk("D:\\"):
        for file in files:
            listafile.append(file)
            percorso.append(root + "\\" + file)
            testo.write(file + "\n")
listafile.sort()
print("N. of files", len(listafile))
with open("lista_file_ordinata.txt", "w", encoding="utf-8") as testo_ordinato:
    for file in listafile:
        testo_ordinato.write(file + "\n")

with open("percorso.txt", "w", encoding="utf-8") as file_percorso:
    for file in percorso:
        file_percorso.write(file + "\n")

os.system("lista_file.txt")
os.system("lista_file_ordinata.txt")
os.system("percorso.txt")

All the files of C: in one text file

This is a shorter version of the previous code. Change the starting folder if you need to begin from another position. On my computer this code generated a text file of about 50 MB with a little under 500,000 lines, each containing a file's complete path.

import os

with open("file.txt", "w", encoding="utf-8") as filewrite:
    for r, d, f in os.walk("C:\"):
        for file in f:
            filewrite.write(f"{r + file}
")

How to write a file with all paths in a folder of a type

With this function you can create a txt file that will have the name of a type of file that you look for (ex. pngfile.txt) with all the full path of all the files of that type. It can be useful sometimes, I think.

import os

def searchfiles(extension=".ttf", folder="H:\"):
    "Create a txt file with all the file of a type"
    with open(extension[1:] + "file.txt", "w", encoding="utf-8") as filewrite:
        for r, d, f in os.walk(folder):
            for file in f:
                if file.endswith(extension):
                    filewrite.write(f"{r + file}
")

# looking for png file (fonts) in the hard disk H:
searchfiles(".png", "H:\")

>>> H:4bs_18Dolphins5.png
>>> H:4bs_18Dolphins6.png
>>> H:4bs_18Dolphins7.png
>>> H:5_18marketing htmlassetsimageslogo2.png
>>> H:7z001.png
>>> H:7z002.png

(New) Find all files and open them with tkinter GUI

I just wanted to add, in 2019, a little app to search for all files in a dir and open them by double-clicking on the name of the file in the list.

import tkinter as tk
import os

def searchfiles(extension=".txt", folder="H:\"):
    "insert all files in the listbox"
    for r, d, f in os.walk(folder):
        for file in f:
            if file.endswith(extension):
                lb.insert(0, r + "\" + file)

def open_file():
    os.startfile(lb.get(lb.curselection()[0]))

root = tk.Tk()
root.geometry("400x400")
bt = tk.Button(root, text="Search", command=lambda: searchfiles(".png", "H:\\"))
bt.pack()
lb = tk.Listbox(root)
lb.pack(fill="both", expand=1)
lb.bind("<Double-Button>", lambda x: open_file())
root.mainloop()

Answer #8

This is the behaviour to adopt when the referenced object is deleted. It is not specific to Django; this is an SQL standard (although Django has its own implementation on top of SQL). (1)

There are seven possible actions to take when such event occurs:

  • CASCADE: When the referenced object is deleted, also delete the objects that have references to it (when you remove a blog post for instance, you might want to delete comments as well). SQL equivalent: CASCADE.
  • PROTECT: Forbid the deletion of the referenced object. To delete it you will have to delete all objects that reference it manually. SQL equivalent: RESTRICT.
  • RESTRICT: (introduced in Django 3.1) Similar behavior to PROTECT that matches SQL's RESTRICT more accurately. (See the Django documentation example.)
  • SET_NULL: Set the reference to NULL (requires the field to be nullable). For instance, when you delete a User, you might want to keep the comments he posted on blog posts, but say it was posted by an anonymous (or deleted) user. SQL equivalent: SET NULL.
  • SET_DEFAULT: Set the default value. SQL equivalent: SET DEFAULT.
  • SET(...): Set a given value. This one is not part of the SQL standard and is entirely handled by Django.
  • DO_NOTHING: Probably a very bad idea, since this would create integrity issues in your database (referencing an object that actually doesn't exist). SQL equivalent: NO ACTION. (2)

Source: Django documentation

See also the documentation of PostgreSQL for instance.

In most cases, CASCADE is the expected behaviour, but for every ForeignKey you should always ask yourself what the expected behaviour is in this situation. PROTECT and SET_NULL are often useful. Setting CASCADE where it should not be set can potentially delete your entire database in cascade, simply by deleting a single user.
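
A minimal sketch (the Article/Comment models are hypothetical) of how the choice is made per ForeignKey:

from django.conf import settings
from django.db import models

class Article(models.Model):
    title = models.CharField(max_length=200)

class Comment(models.Model):
    # Deleting the Article also deletes its Comments.
    article = models.ForeignKey(Article, on_delete=models.CASCADE)
    # Deleting the User keeps the Comment but marks it as anonymous.
    author = models.ForeignKey(
        settings.AUTH_USER_MODEL, null=True, blank=True, on_delete=models.SET_NULL
    )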


Additional note to clarify cascade direction

It"s funny to notice that the direction of the CASCADE action is not clear to many people. Actually, it"s funny to notice that only the CASCADE action is not clear. I understand the cascade behavior might be confusing, however you must think that it is the same direction as any other action. Thus, if you feel that CASCADE direction is not clear to you, it actually means that on_delete behavior is not clear to you.

In your database, a foreign key is basically represented by an integer field whose value is the primary key of the foreign object. Let's say you have an entry comment_A, which has a foreign key to an entry article_B. If you delete the entry comment_A, everything is fine; article_B used to live without comment_A and doesn't mind if it's deleted. However, if you delete article_B, then comment_A panics! It never lived without article_B and needs it - it's part of its attributes (article=article_B, but what is article_B???). This is where on_delete steps in, to determine how to resolve this integrity error, either by saying:

  • "No! Please! Don"t! I can"t live without you!" (which is said PROTECT or RESTRICT in Django/SQL)
  • "All right, if I"m not yours, then I"m nobody"s" (which is said SET_NULL)
  • "Good bye world, I can"t live without article_B" and commit suicide (this is the CASCADE behavior).
  • "It"s OK, I"ve got spare lover, and I"ll reference article_C from now" (SET_DEFAULT, or even SET(...)).
  • "I can"t face reality, and I"ll keep calling your name even if that"s the only thing left to me!" (DO_NOTHING)

I hope it makes cascade direction clearer. :)


Footnotes

(1) Django has its own implementation on top of SQL. And, as mentioned by @JoeMjr2 in the comments below, Django will not create the SQL constraints. If you want the constraints to be ensured by your database (for instance, if your database is used by another application, or if you work in the database console from time to time), you might want to set the related constraints manually yourself. There is an open ticket to add support for database-level on-delete constraints in Django.

(2) Actually, there is one case where DO_NOTHING can be useful: if you want to skip Django's implementation and implement the constraint yourself at the database level.

Answer #9

Label vs. Location

The main distinction between the two methods is:

  • loc gets rows (and/or columns) with particular labels.

  • iloc gets rows (and/or columns) at integer locations.

To demonstrate, consider a series s of characters with a non-monotonic integer index:

>>> s = pd.Series(list("abcdef"), index=[49, 48, 47, 0, 1, 2]) 
49    a
48    b
47    c
0     d
1     e
2     f

>>> s.loc[0]    # value at index label 0
"d"

>>> s.iloc[0]   # value at index location 0
"a"

>>> s.loc[0:1]  # rows at index labels between 0 and 1 (inclusive)
0    d
1    e

>>> s.iloc[0:1] # rows at index location between 0 and 1 (exclusive)
49    a

Here are some of the differences/similarities between s.loc and s.iloc when passed various objects:

<object> description s.loc[<object>] s.iloc[<object>]
0 single item Value at index label 0 (the string "d") Value at index location 0 (the string "a")
0:1 slice Two rows (labels 0 and 1) One row (first row at location 0)
1:47 slice with out-of-bounds end Zero rows (empty Series) Five rows (location 1 onwards)
1:47:-1 slice with negative step three rows (labels 1 back to 47) Zero rows (empty Series)
[2, 0] integer list Two rows with given labels Two rows with given locations
s > "e" Bool series (indicating which values have the property) One row (containing "f") NotImplementedError
(s>"e").values Bool array One row (containing "f") Same as loc
999 int object not in index KeyError IndexError (out of bounds)
-1 int object not in index KeyError Returns last value in s
lambda x: x.index[3] callable applied to series (here returning 3rd item in index) s.loc[s.index[3]] s.iloc[s.index[3]]
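
For instance, two rows of the table above, shown interactively:

>>> s.iloc[-1]   # -1 is a valid *location* (the last row)
'f'
>>> s.loc[-1]    # ...but it is not a *label* in the index
Traceback (most recent call last):
  ...
KeyError: -1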

loc"s label-querying capabilities extend well-beyond integer indexes and it"s worth highlighting a couple of additional examples.

Here"s a Series where the index contains string objects:

>>> s2 = pd.Series(s.index, index=s.values)
>>> s2
a    49
b    48
c    47
d     0
e     1
f     2

Since loc is label-based, it can fetch the first value in the Series using s2.loc["a"]. It can also slice with non-integer objects:

>>> s2.loc["c":"e"]  # all rows lying between "c" and "e" (inclusive)
c    47
d     0
e     1

For DateTime indexes, we don't need to pass the exact date/time to fetch by label. For example:

>>> s3 = pd.Series(list("abcde"), pd.date_range("now", periods=5, freq="M")) 
>>> s3
2021-01-31 16:41:31.879768    a
2021-02-28 16:41:31.879768    b
2021-03-31 16:41:31.879768    c
2021-04-30 16:41:31.879768    d
2021-05-31 16:41:31.879768    e

Then to fetch the row(s) for March/April 2021 we only need:

>>> s3.loc["2021-03":"2021-04"]
2021-03-31 17:04:30.742316    c
2021-04-30 17:04:30.742316    d

Rows and Columns

loc and iloc work the same way with DataFrames as they do with Series. It's useful to note that both methods can address columns and rows together.

When given a tuple, the first element is used to index the rows and, if it exists, the second element is used to index the columns.

Consider the DataFrame defined below:

>>> import numpy as np 
>>> df = pd.DataFrame(np.arange(25).reshape(5, 5),  
                      index=list("abcde"), 
                      columns=["x","y","z", 8, 9])
>>> df
    x   y   z   8   9
a   0   1   2   3   4
b   5   6   7   8   9
c  10  11  12  13  14
d  15  16  17  18  19
e  20  21  22  23  24

Then for example:

>>> df.loc["c": , :"z"]  # rows "c" and onwards AND columns up to "z"
    x   y   z
c  10  11  12
d  15  16  17
e  20  21  22

>>> df.iloc[:, 3]        # all rows, but only the column at index location 3
a     3
b     8
c    13
d    18
e    23

Sometimes we want to mix label and positional indexing methods for the rows and columns, somehow combining the capabilities of loc and iloc.

For example, consider the following DataFrame. How best to slice the rows up to and including "c" and take the first four columns?

>>> import numpy as np 
>>> df = pd.DataFrame(np.arange(25).reshape(5, 5),  
                      index=list("abcde"), 
                      columns=["x","y","z", 8, 9])
>>> df
    x   y   z   8   9
a   0   1   2   3   4
b   5   6   7   8   9
c  10  11  12  13  14
d  15  16  17  18  19
e  20  21  22  23  24

We can achieve this result using iloc and the help of another method:

>>> df.iloc[:df.index.get_loc("c") + 1, :4]
    x   y   z   8
a   0   1   2   3
b   5   6   7   8
c  10  11  12  13

get_loc() is an index method meaning "get the position of the label in this index". Note that since slicing with iloc is exclusive of its endpoint, we must add 1 to this value if we want row "c" as well.

Answer #10

Quick Answer:

The simplest way to get row counts per group is by calling .size(), which returns a Series:

df.groupby(["col1","col2"]).size()


Usually you want this result as a DataFrame (instead of a Series) so you can do:

df.groupby(["col1", "col2"]).size().reset_index(name="counts")


If you want to find out how to calculate the row counts and other statistics for each group continue reading below.


Detailed example:

Consider the following example dataframe:

In [2]: df
Out[2]: 
  col1 col2  col3  col4  col5  col6
0    A    B  0.20 -0.61 -0.49  1.49
1    A    B -1.53 -1.01 -0.39  1.82
2    A    B -0.44  0.27  0.72  0.11
3    A    B  0.28 -1.32  0.38  0.18
4    C    D  0.12  0.59  0.81  0.66
5    C    D -0.13 -1.65 -1.64  0.50
6    C    D -1.42 -0.11 -0.18 -0.44
7    E    F -0.00  1.42 -0.26  1.17
8    E    F  0.91 -0.47  1.35 -0.34
9    G    H  1.48 -0.63 -1.14  0.17

First let"s use .size() to get the row counts:

In [3]: df.groupby(["col1", "col2"]).size()
Out[3]: 
col1  col2
A     B       4
C     D       3
E     F       2
G     H       1
dtype: int64

Then let"s use .size().reset_index(name="counts") to get the row counts:

In [4]: df.groupby(["col1", "col2"]).size().reset_index(name="counts")
Out[4]: 
  col1 col2  counts
0    A    B       4
1    C    D       3
2    E    F       2
3    G    H       1


Including results for more statistics

When you want to calculate statistics on grouped data, it usually looks like this:

In [5]: (df
   ...: .groupby(["col1", "col2"])
   ...: .agg({
   ...:     "col3": ["mean", "count"], 
   ...:     "col4": ["median", "min", "count"]
   ...: }))
Out[5]: 
            col4                  col3      
          median   min count      mean count
col1 col2                                   
A    B    -0.810 -1.32     4 -0.372500     4
C    D    -0.110 -1.65     3 -0.476667     3
E    F     0.475 -0.47     2  0.455000     2
G    H    -0.630 -0.63     1  1.480000     1

The result above is a little annoying to deal with because of the nested column labels, and also because row counts are on a per column basis.

To gain more control over the output I usually split the statistics into individual aggregations that I then combine using join. It looks like this:

In [6]: gb = df.groupby(["col1", "col2"])
   ...: counts = gb.size().to_frame(name="counts")
   ...: (counts
   ...:  .join(gb.agg({"col3": "mean"}).rename(columns={"col3": "col3_mean"}))
   ...:  .join(gb.agg({"col4": "median"}).rename(columns={"col4": "col4_median"}))
   ...:  .join(gb.agg({"col4": "min"}).rename(columns={"col4": "col4_min"}))
   ...:  .reset_index()
   ...: )
   ...: 
Out[6]: 
  col1 col2  counts  col3_mean  col4_median  col4_min
0    A    B       4  -0.372500       -0.810     -1.32
1    C    D       3  -0.476667       -0.110     -1.65
2    E    F       2   0.455000        0.475     -0.47
3    G    H       1   1.480000       -0.630     -0.63
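
As a side note, if your pandas version supports named aggregation (0.25+), a similar flat result can be produced in a single call; a minimal sketch:

out = (df.groupby(["col1", "col2"])
         .agg(counts=("col3", "size"),
              col3_mean=("col3", "mean"),
              col4_median=("col4", "median"),
              col4_min=("col4", "min"))
         .reset_index())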



Footnotes

The code used to generate the test data is shown below:

In [1]: import numpy as np
   ...: import pandas as pd 
   ...: 
   ...: keys = np.array([
   ...:         ["A", "B"],
   ...:         ["A", "B"],
   ...:         ["A", "B"],
   ...:         ["A", "B"],
   ...:         ["C", "D"],
   ...:         ["C", "D"],
   ...:         ["C", "D"],
   ...:         ["E", "F"],
   ...:         ["E", "F"],
   ...:         ["G", "H"] 
   ...:         ])
   ...: 
   ...: df = pd.DataFrame(
   ...:     np.hstack([keys,np.random.randn(10,4).round(2)]), 
   ...:     columns = ["col1", "col2", "col3", "col4", "col5", "col6"]
   ...: )
   ...: 
   ...: df[["col3", "col4", "col5", "col6"]] = 
   ...:     df[["col3", "col4", "col5", "col6"]].astype(float)
   ...: 


Disclaimer:

If some of the columns that you are aggregating have null values, then you really want to be looking at the group row counts as an independent aggregation for each column. Otherwise you may be misled as to how many records are actually being used to calculate things like the mean because pandas will drop NaN entries in the mean calculation without telling you about it.
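
A minimal sketch of that pitfall (hypothetical single-group data): size counts rows, count only counts non-null values, and mean silently ignores the NaN:

In [7]: tmp = pd.DataFrame({"g": ["A", "A", "A"], "val": [1.0, np.nan, 3.0]})
   ...: tmp.groupby("g")["val"].agg(["size", "count", "mean"])
Out[7]: 
   size  count  mean
g                   
A     3      2   2.0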

"Large data" workflows using pandas: StackOverflow Questions

Separation of business logic and data access in django

I am writing a project in Django and I see that 80% of the code is in the file models.py. This code is confusing and, after a certain time, I cease to understand what is really happening.

Here is what bothers me:

  1. I find it ugly that my model level (which was supposed to be responsible only for the work with data from a database) is also sending email, walking on API to other services, etc.
  2. Also, I find it unacceptable to place business logic in the view, because this way it becomes difficult to control. For example, in my application there are at least three ways to create new instances of User, but technically it should create them uniformly.
  3. I do not always notice when the methods and properties of my models become non-deterministic and when they develop side effects.

Here is a simple example. At first, the User model was like this:

class User(db.Models):

    def get_present_name(self):
        return self.name or "Anonymous"

    def activate(self):
        self.status = "activated"
        self.save()

Over time, it turned into this:

class User(db.Models):

    def get_present_name(self): 
        # property became non-deterministic in terms of database
        # data is taken from another service by api
        return remote_api.request_user_name(self.uid) or "Anonymous" 

    def activate(self):
        # method now has a side effect (send message to user)
        self.status = "activated"
        self.save()
        send_mail("Your account is activated!", "…", [self.email])

What I want is to separate entities in my code:

  1. Entities of my database, persistence level: What data does my application keep?
  2. Entities of my application, business logic level: What does my application do?

What are the good practices to implement such an approach that can be applied in Django?

Cosine Similarity between 2 Number Lists

I want to calculate the cosine similarity between two lists, let's say for example list 1 which is dataSetI and list 2 which is dataSetII.

Let's say dataSetI is [3, 45, 7, 2] and dataSetII is [2, 54, 13, 15]. The length of the lists is always equal. I want to report cosine similarity as a number between 0 and 1.

dataSetI = [3, 45, 7, 2]
dataSetII = [2, 54, 13, 15]

def cosine_similarity(list1, list2):
  # How to?
  pass

print(cosine_similarity(dataSetI, dataSetII))

Haversine Formula in Python (Bearing and Distance between two GPS points)

Problem

I would like to know how to get the distance and bearing between 2 GPS points. I have researched on the haversine formula. Someone told me that I could also find the bearing using the same data.

Edit

Everything is working fine, but the bearing doesn't quite work right yet. The bearing output is negative but should be between 0 and 360 degrees. The set data should make the horizontal bearing 96.02166666666666 and is:

Start point: 53.32055555555556 , -1.7297222222222221   
Bearing:  96.02166666666666  
Distance: 2 km  
Destination point: 53.31861111111111, -1.6997222222222223  
Final bearing: 96.04555555555555

Here is my new code:

from math import *

Aaltitude = 2000
Oppsite  = 20000

lat1 = 53.32055555555556
lat2 = 53.31861111111111
lon1 = -1.7297222222222221
lon2 = -1.6997222222222223

lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

dlon = lon2 - lon1
dlat = lat2 - lat1
a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
c = 2 * atan2(sqrt(a), sqrt(1-a))
Base = 6371 * c


Bearing =atan2(cos(lat1)*sin(lat2)-sin(lat1)*cos(lat2)*cos(lon2-lon1), sin(lon2-lon1)*cos(lat2)) 

Bearing = degrees(Bearing)
print ""
print ""
print "--------------------"
print "Horizontal Distance:"
print Base
print "--------------------"
print "Bearing:"
print Bearing
print "--------------------"


Base2 = Base * 1000
distance = Base * 2 + Oppsite * 2 / 2
Caltitude = Oppsite - Aaltitude

a = Oppsite/Base
b = atan(a)
c = degrees(b)

distance = distance / 1000

print "The degree of vertical angle is:"
print c
print "--------------------"
print "The distance between the Balloon GPS and the Antenna GPS is:"
print distance
print "--------------------"

Answer #1

How to iterate over rows in a DataFrame in Pandas?

Answer: DON"T*!

Iteration in Pandas is an anti-pattern and is something you should only do when you have exhausted every other option. You should not use any function with "iter" in its name for more than a few thousand rows or you will have to get used to a lot of waiting.

Do you want to print a DataFrame? Use DataFrame.to_string().

Do you want to compute something? In that case, search for methods in this order (list modified from here):

  1. Vectorization
  2. Cython routines
  3. List Comprehensions (vanilla for loop)
  4. DataFrame.apply(): i)  Reductions that can be performed in Cython, ii) Iteration in Python space
  5. DataFrame.itertuples() and iteritems()
  6. DataFrame.iterrows()

iterrows and itertuples (both receiving many votes in answers to this question) should be used in very rare circumstances, such as generating row objects/namedtuples for sequential processing, which is really the only thing these functions are useful for.

Appeal to Authority

The documentation page on iteration has a huge red warning box that says:

Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is not needed [...].

* It"s actually a little more complicated than "don"t". df.iterrows() is the correct answer to this question, but "vectorize your ops" is the better one. I will concede that there are circumstances where iteration cannot be avoided (for example, some operations where the result depends on the value computed for the previous row). However, it takes some familiarity with the library to know when. If you"re not sure whether you need an iterative solution, you probably don"t. PS: To know more about my rationale for writing this answer, skip to the very bottom.


Faster than Looping: Vectorization, Cython

A good number of basic operations and computations are "vectorised" by pandas (either through NumPy, or through Cythonized functions). This includes arithmetic, comparisons, (most) reductions, reshaping (such as pivoting), joins, and groupby operations. Look through the documentation on Essential Basic Functionality to find a suitable vectorised method for your problem.

If none exists, feel free to write your own using custom Cython extensions.


Next Best Thing: List Comprehensions*

List comprehensions should be your next port of call if 1) there is no vectorized solution available, 2) performance is important, but not important enough to go through the hassle of cythonizing your code, and 3) you"re trying to perform elementwise transformation on your code. There is a good amount of evidence to suggest that list comprehensions are sufficiently fast (and even sometimes faster) for many common Pandas tasks.

The formula is simple,

# Iterating over one column - `f` is some function that processes your data
result = [f(x) for x in df["col"]]
# Iterating over two columns, use `zip`
result = [f(x, y) for x, y in zip(df["col1"], df["col2"])]
# Iterating over multiple columns - same data type
result = [f(row[0], ..., row[n]) for row in df[["col1", ...,"coln"]].to_numpy()]
# Iterating over multiple columns - differing data type
result = [f(row[0], ..., row[n]) for row in zip(df["col1"], ..., df["coln"])]

If you can encapsulate your business logic into a function, you can use a list comprehension that calls it. You can make arbitrarily complex things work through the simplicity and speed of raw Python code.

Caveats

List comprehensions assume that your data is easy to work with - what that means is your data types are consistent and you don"t have NaNs, but this cannot always be guaranteed.

  1. The first one is more obvious, but when dealing with NaNs, prefer in-built pandas methods if they exist (because they have much better corner-case handling logic), or ensure your business logic includes appropriate NaN handling logic.
  2. When dealing with mixed data types you should iterate over zip(df["A"], df["B"], ...) instead of df[["A", "B"]].to_numpy() as the latter implicitly upcasts data to the most common type. As an example if A is numeric and B is string, to_numpy() will cast the entire array to string, which may not be what you want. Fortunately zipping your columns together is the most straightforward workaround to this.

*Your mileage may vary for the reasons outlined in the Caveats section above.


An Obvious Example

Let"s demonstrate the difference with a simple example of adding two pandas columns A + B. This is a vectorizable operaton, so it will be easy to contrast the performance of the methods discussed above.

Benchmarking code, for your reference. The line at the bottom measures a function written in numpandas, a style of Pandas that mixes heavily with NumPy to squeeze out maximum performance. Writing numpandas code should be avoided unless you know what you"re doing. Stick to the API where you can (i.e., prefer vec over vec_numpy).

I should mention, however, that it isn"t always this cut and dry. Sometimes the answer to "what is the best method for an operation" is "it depends on your data". My advice is to test out different approaches on your data before settling on one.


Further Reading

* Pandas string methods are "vectorized" in the sense that they are specified on the series but operate on each element. The underlying mechanisms are still iterative, because string operations are inherently hard to vectorize.


Why I Wrote this Answer

A common trend I notice from new users is to ask questions of the form "How can I iterate over my df to do X?". Showing code that calls iterrows() while doing something inside a for loop. Here is why. A new user to the library who has not been introduced to the concept of vectorization will likely envision the code that solves their problem as iterating over their data to do something. Not knowing how to iterate over a DataFrame, the first thing they do is Google it and end up here, at this question. They then see the accepted answer telling them how to, and they close their eyes and run this code without ever first questioning if iteration is not the right thing to do.

The aim of this answer is to help new users understand that iteration is not necessarily the solution to every problem, and that better, faster and more idiomatic solutions could exist, and that it is worth investing time in exploring them. I"m not trying to start a war of iteration vs. vectorization, but I want new users to be informed when developing solutions to their problems with this library.

Answer #2

Update: 04/2018: Note that Vincenty distance is deprecated since GeoPy version 1.13 - you should use geopy.distance.distance() instead!


The answers above are based on the Haversine formula, which assumes the earth is a sphere, which results in errors of up to about 0.5% (according to help(geopy.distance)). Vincenty distance uses more accurate ellipsoidal models such as WGS-84, and is implemented in geopy. For example,

import geopy.distance

coords_1 = (52.2296756, 21.0122287)
coords_2 = (52.406374, 16.9251681)

print(geopy.distance.vincenty(coords_1, coords_2).km)

will print the distance of 279.352901604 kilometers using the default ellipsoid WGS-84. (You can also choose .miles or one of several other distance units).
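
With the current API mentioned in the update above, the equivalent (a minimal sketch) is:

import geopy.distance

coords_1 = (52.2296756, 21.0122287)
coords_2 = (52.406374, 16.9251681)

# geodesic (WGS-84) distance; vincenty() was removed in newer geopy releases
print(geopy.distance.distance(coords_1, coords_2).km)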

Answer #3

You can use idxmax with axis=1 to find the column with the greatest value on each row:

>>> df.idxmax(axis=1)
0    Communications
1          Business
2    Communications
3    Communications
4          Business
dtype: object

To create the new column "Max", use df["Max"] = df.idxmax(axis=1).

To find the row index at which the maximum value occurs in each column, use df.idxmax() (or equivalently df.idxmax(axis=0)).

Answer #4

You can find it called Offset Aliases:

A number of string aliases are given to useful common time series frequencies. We will refer to these aliases as offset aliases.

Alias    Description
B        business day frequency
C        custom business day frequency
D        calendar day frequency
W        weekly frequency
M        month end frequency
SM       semi-month end frequency (15th and end of month)
BM       business month end frequency
CBM      custom business month end frequency
MS       month start frequency
SMS      semi-month start frequency (1st and 15th)
BMS      business month start frequency
CBMS     custom business month start frequency
Q        quarter end frequency
BQ       business quarter end frequency
QS       quarter start frequency
BQS      business quarter start frequency
A, Y     year end frequency
BA, BY   business year end frequency
AS, YS   year start frequency
BAS, BYS business year start frequency
BH       business hour frequency
H        hourly frequency
T, min   minutely frequency
S        secondly frequency
L, ms    milliseconds
U, us    microseconds
N        nanoseconds
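
For example (a minimal sketch), an alias is passed as the freq argument of functions like pd.date_range:

import pandas as pd

# "M" = month-end frequency (see the table above)
print(pd.date_range("2021-01-31", periods=3, freq="M"))
# DatetimeIndex(['2021-01-31', '2021-02-28', '2021-03-31'], dtype='datetime64[ns]', freq='M')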

Answer #5

This function can split the entire text of Huckleberry Finn into sentences in about 0.1 seconds and handles many of the more painful edge cases that make sentence parsing non-trivial e.g. "Mr. John Johnson Jr. was born in the U.S.A but earned his Ph.D. in Israel before joining Nike Inc. as an engineer. He also worked at craigslist.org as a business analyst."

# -*- coding: utf-8 -*-
import re

alphabets = "([A-Za-z])"
prefixes = "(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = "(Inc|Ltd|Jr|Sr|Co)"
starters = "(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = "[.](com|net|org|io|gov)"

def split_into_sentences(text):
    # Pad the text and flatten newlines so the regexes below see clean spaces
    text = " " + text + "  "
    text = text.replace("\n", " ")
    # Protect periods that do not end a sentence by rewriting them as <prd>
    text = re.sub(prefixes, "\\1<prd>", text)
    text = re.sub(websites, "<prd>\\1", text)
    if "Ph.D" in text: text = text.replace("Ph.D.", "Ph<prd>D<prd>")
    text = re.sub("\s" + alphabets + "[.] ", " \\1<prd> ", text)
    text = re.sub(acronyms + " " + starters, "\\1<stop> \\2", text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]", "\\1<prd>\\2<prd>\\3<prd>", text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]", "\\1<prd>\\2<prd>", text)
    text = re.sub(" " + suffixes + "[.] " + starters, " \\1<stop> \\2", text)
    text = re.sub(" " + suffixes + "[.]", " \\1<prd>", text)
    text = re.sub(" " + alphabets + "[.]", " \\1<prd>", text)
    # Move sentence-ending punctuation outside closing quotes, then mark real stops
    if "”" in text: text = text.replace(".”", "”.")
    if "\"" in text: text = text.replace(".\"", "\".")
    if "!" in text: text = text.replace("!\"", "\"!")
    if "?" in text: text = text.replace("?\"", "\"?")
    text = text.replace(".", ".<stop>")
    text = text.replace("?", "?<stop>")
    text = text.replace("!", "!<stop>")
    # Restore the protected periods and split on the stop markers
    text = text.replace("<prd>", ".")
    sentences = text.split("<stop>")
    sentences = sentences[:-1]
    sentences = [s.strip() for s in sentences]
    return sentences
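
For example, running the function on the sentence from the introduction should leave the abbreviations intact and split only at the real sentence boundaries:

text = ("Mr. John Johnson Jr. was born in the U.S.A but earned his Ph.D. in Israel "
        "before joining Nike Inc. as an engineer. He also worked at craigslist.org "
        "as a business analyst.")
for s in split_into_sentences(text):
    print(s)
# Mr. John Johnson Jr. was born in the U.S.A but earned his Ph.D. in Israel before joining Nike Inc. as an engineer.
# He also worked at craigslist.org as a business analyst.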

Answer #6

For people (like me) coming here via a search engine and just looking for a solution that works out of the box, I recommend installing mpu. Install it via pip install mpu --user and use it like this to get the haversine distance:

import mpu

# Point one
lat1 = 52.2296756
lon1 = 21.0122287

# Point two
lat2 = 52.406374
lon2 = 16.9251681

# What you were looking for
dist = mpu.haversine_distance((lat1, lon1), (lat2, lon2))
print(dist)  # gives 278.45817507541943.

An alternative package is gpxpy.

If you don"t want dependencies, you can use:

import math


def distance(origin, destination):
    """
    Calculate the Haversine distance.

    Parameters
    ----------
    origin : tuple of float
        (lat, long)
    destination : tuple of float
        (lat, long)

    Returns
    -------
    distance_in_km : float

    Examples
    --------
    >>> origin = (48.1372, 11.5756)  # Munich
    >>> destination = (52.5186, 13.4083)  # Berlin
    >>> round(distance(origin, destination), 1)
    504.2
    """
    lat1, lon1 = origin
    lat2, lon2 = destination
    radius = 6371  # km

    dlat = math.radians(lat2 - lat1)
    dlon = math.radians(lon2 - lon1)
    a = (math.sin(dlat / 2) * math.sin(dlat / 2) +
         math.cos(math.radians(lat1)) * math.cos(math.radians(lat2)) *
         math.sin(dlon / 2) * math.sin(dlon / 2))
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    d = radius * c

    return d


if __name__ == "__main__":
    import doctest
    doctest.testmod()

Another alternative package is haversine:

from haversine import haversine, Unit

lyon = (45.7597, 4.8422) # (lat, lon)
paris = (48.8567, 2.3508)

haversine(lyon, paris)
>> 392.2172595594006  # in kilometers

haversine(lyon, paris, unit=Unit.MILES)
>> 243.71201856934454  # in miles

# you can also use the string abbreviation for units:
haversine(lyon, paris, unit="mi")
>> 243.71201856934454  # in miles

haversine(lyon, paris, unit=Unit.NAUTICAL_MILES)
>> 211.78037755311516  # in nautical miles

The package also claims optimized performance for computing the distances between all the points in two vectors:

from haversine import haversine_vector, Unit

lyon = (45.7597, 4.8422) # (lat, lon)
paris = (48.8567, 2.3508)
new_york = (40.7033962, -74.2351462)

haversine_vector([lyon, lyon], [paris, new_york], Unit.KILOMETERS)

>> array([ 392.21725956, 6163.43638211])

Answer #7

To add to Tim Peters' excellent answer, arrays implement the buffer protocol, while lists do not. This means that, if you are writing a C extension (or the moral equivalent, such as writing a Cython module), you can access and work with the elements of an array much faster than anything Python can do. This will give you considerable speed improvements, possibly well over an order of magnitude. However, it has a number of downsides:

  1. You are now in the business of writing C instead of Python. Cython is one way to ameliorate this, but it does not eliminate many fundamental differences between the languages; you need to be familiar with C semantics and understand what it is doing.
  2. PyPy's C API works to some extent, but isn't very fast. If you are targeting PyPy, you should probably just write simple code with regular lists, and then let the JITter optimize it for you.
  3. C extensions are harder to distribute than pure Python code because they need to be compiled. Compilation tends to be architecture and operating-system dependent, so you will need to ensure you are compiling for your target platform.

Going straight to C extensions may be using a sledgehammer to swat a fly, depending on your use case. You should first investigate NumPy and see if it is powerful enough to do whatever math you're trying to do. It will also be much faster than native Python, if used correctly.
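
As a minimal sketch of what the buffer protocol buys you, NumPy can stand in for a hand-written C extension here: np.frombuffer should give a zero-copy view of the array's memory, which is impossible for a plain list.

import array
import numpy as np

arr = array.array("d", [1.0, 2.0, 3.0])

# array.array exposes its memory via the buffer protocol, so NumPy can
# wrap the same buffer instead of copying individual Python objects
view = np.frombuffer(arr, dtype=np.float64)
view *= 10
print(arr)  # array('d', [10.0, 20.0, 30.0]) - same underlying memory

# A plain list has no buffer; NumPy has to unbox and copy every element
copied = np.array([1.0, 2.0, 3.0])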

Answer #8

It"s crucial to understand that the options in a Django model field definition serve (at least) two purposes: defining the database tables, and defining the default format and validation of model forms. (I say "default" because the values can always be overridden by providing a custom form.) Some options affect the database, some options affect forms, and some affect both.

When it comes to null and blank, other answers have already made clear that the former affects the database table definition and the latter affects model validation. I think the distinction can be made even clearer by looking at use cases for all four possible configurations:

  • null=False, blank=False: This is the default configuration and means that the value is required in all circumstances.

  • null=True, blank=True: This means that the field is optional in all circumstances. (As noted below, though, this is not the recommended way to make string-based fields optional.)

  • null=False, blank=True: This means that the form doesn't require a value but the database does. There are a number of use cases for this:

    • The most common use is for optional string-based fields. As noted in the documentation, the Django idiom is to use the empty string to indicate a missing value. If NULL were also allowed you would end up with two different ways to indicate a missing value.

    • Another common situation is that you want to calculate one field automatically based on the value of another (in your save() method, say). You don't want the user to provide the value in a form (hence blank=True), but you do want the database to enforce that a value is always provided (null=False).

    • Another use is when you want to indicate that a ManyToManyField is optional. Because this field is implemented as a separate table rather than a database column, null is meaningless. The value of blank will still affect forms, though, controlling whether or not validation will succeed when there are no relations.

  • null=True, blank=False: This means that the form requires a value but the database doesn't. This may be the most infrequently used configuration, but there are some use cases for it:

    • It's perfectly reasonable to require your users to always include a value even if it's not actually required by your business logic. After all, forms are only one way of adding and editing data. You may have code that is generating data which doesn't need the same stringent validation that you want to require of a human editor.

    • Another use case that I've seen is when you have a ForeignKey for which you don't wish to allow cascade deletion. That is, in normal use the relation should always be there (blank=False), but if the thing it points to happens to be deleted, you don't want this object to be deleted too. In that case you can use null=True and on_delete=models.SET_NULL to implement a simple kind of soft deletion.
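
Putting the four combinations side by side in a model makes the trade-offs concrete. This is only a sketch; the model and field names are invented:

from django.db import models

class Invoice(models.Model):
    # null=False, blank=False (the defaults): required everywhere
    number = models.CharField(max_length=20)

    # null=False, blank=True: optional in forms, missing values stored as ""
    notes = models.CharField(max_length=200, blank=True)

    # null=True, blank=True: optional everywhere (fine for non-string fields)
    paid_on = models.DateField(null=True, blank=True)

    # null=True, blank=False: forms demand a value, but the database allows
    # NULL so the row survives deletion of the related object
    customer = models.ForeignKey(
        "Customer", null=True, blank=False, on_delete=models.SET_NULL
    )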

Answer #9

tl;dr

The answer is NEVER! (unless you really know what you're doing)

Nine times out of ten, the problem can be solved with a proper understanding of encoding/decoding.

The remaining one time in ten, the user has an incorrectly defined locale or environment and needs to set:

PYTHONIOENCODING="UTF-8"  

in their environment to fix console printing problems.

What does it do?

sys.setdefaultencoding("utf-8") changes the default encoding/decoding used whenever Python 2.x needs to convert a unicode() to a str() (and vice versa) and the encoding is not given, e.g.:

str(u"\u20AC")
unicode("€")
"{}".format(u"\u20AC")

In Python 2.x, the default encoding is set to ASCII and the above examples will fail with:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

(My console is configured as UTF-8, so "€" = "\xe2\x82\xac", hence the exception on \xe2)

or

UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 0: ordinal not in range(128)

sys.setdefaultencoding("utf-8") will allow these to work for me, but won't necessarily work for people who don't use UTF-8. The default of ASCII ensures that assumptions about encoding are not baked into code.

Console

sys.setdefaultencoding("utf-8") also has the side effect of appearing to fix sys.stdout.encoding, which is used when printing characters to the console. Python uses the user's locale (Linux/OS X/Un*x) or codepage (Windows) to set this. Occasionally, a user's locale is broken and just requires PYTHONIOENCODING to fix the console encoding.

Example:

$ export LANG=en_GB.gibberish
$ python
>>> import sys
>>> sys.stdout.encoding
"ANSI_X3.4-1968"
>>> print u"u20AC"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: "ascii" codec can"t encode character u"u20ac" in position 0: ordinal not in range(128)
>>> exit()

$ PYTHONIOENCODING=UTF-8 python
>>> import sys
>>> sys.stdout.encoding
"UTF-8"
>>> print u"u20AC"
€

What"s so bad with sys.setdefaultencoding("utf-8")?

People have been developing against Python 2.x for 16 years on the understanding that the default encoding is ASCII. UnicodeError exception-handling code has been written to handle str-to-Unicode conversions for strings that turn out to contain non-ASCII bytes.

From https://anonbadger.wordpress.com/2015/06/16/why-sys-setdefaultencoding-will-break-code/

def welcome_message(byte_string):
    try:
        return u"%s runs your business" % byte_string
    except UnicodeError:
        return u"%s runs your business" % unicode(byte_string,
            encoding=detect_encoding(byte_string))

print(welcome_message(u"Angstrom (Å®)".encode("latin-1")))

Prior to setting defaultencoding, this code would be unable to decode the “Å” in the ascii encoding and would then enter the exception handler to guess the encoding and properly turn it into unicode, printing: Angstrom (Å®) runs your business. Once you’ve set the defaultencoding to utf-8, the code will find that the byte_string can be interpreted as utf-8, so it will mangle the data and return this instead: Angstrom (Ů) runs your business.

Changing what should be a constant will have dramatic effects on modules you depend upon. It's better to just fix the data coming into and out of your code.
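
"Fixing the data coming in and out" just means decoding bytes explicitly when they enter your program and encoding explicitly when they leave, so the default encoding is never consulted. A minimal sketch (the filenames are made up):

with open("names.txt", "rb") as f:
    raw = f.read()                 # bytes from the outside world
text = raw.decode("utf-8")         # explicit decode at the boundary
# ... work with unicode text internally ...
with open("out.txt", "wb") as f:
    f.write(text.encode("utf-8"))  # explicit encode on the way out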

Example problem

While setting defaultencoding to UTF-8 isn't the root cause in the following example, it shows how problems are masked and how, when the input encoding changes, the code breaks in an unobvious way: UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 3131: invalid start byte

Answer #10

To understand what yield does, you must understand what generators are. And before you can understand generators, you must understand iterables.

Iterables

When you create a list, you can read its items one by one. Reading its items one by one is called iteration:

>>> mylist = [1, 2, 3]
>>> for i in mylist:
...    print(i)
1
2
3

mylist is an iterable. When you use a list comprehension, you create a list, and so an iterable:

>>> mylist = [x*x for x in range(3)]
>>> for i in mylist:
...    print(i)
0
1
4

Everything you can use "for... in..." on is an iterable: lists, strings, files...

These iterables are handy because you can read them as much as you wish, but they store all the values in memory, and this is not always what you want when you have a lot of values.

Generators

Generators are iterators, a kind of iterable you can only iterate over once. Generators do not store all the values in memory, they generate the values on the fly:

>>> mygenerator = (x*x for x in range(3))
>>> for i in mygenerator:
...    print(i)
0
1
4

It is just the same except you used () instead of []. BUT, you cannot perform for i in mygenerator a second time, since generators can only be used once: they calculate 0, then forget about it and calculate 1, and finish by calculating 4, one by one.

Yield

yield is a keyword that is used like return, except the function will return a generator.

>>> def create_generator():
...    mylist = range(3)
...    for i in mylist:
...        yield i*i
...
>>> mygenerator = create_generator() # create a generator
>>> print(mygenerator) # mygenerator is an object!
<generator object create_generator at 0xb7555c34>
>>> for i in mygenerator:
...     print(i)
0
1
4

Here it"s a useless example, but it"s handy when you know your function will return a huge set of values that you will only need to read once.

To master yield, you must understand that when you call the function, the code you have written in the function body does not run. The function only returns the generator object; this is a bit tricky.

Then, each time for uses the generator, your code will continue from where it left off.

Now the hard part:

The first time the for calls the generator object created from your function, it will run the code in your function from the beginning until it hits yield, then it'll return the first value of the loop. Then, each subsequent call will run another iteration of the loop you have written in the function and return the next value. This will continue until the generator is considered empty, which happens when the function runs without hitting yield. That can be because the loop has come to an end, or because you no longer satisfy an "if/else".
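
You can watch this happen by driving a tiny (made-up) generator by hand with next():

>>> def count_up_to(n):
...     i = 1
...     while i <= n:
...         yield i          # execution pauses here each time
...         i += 1
...
>>> gen = count_up_to(2)     # nothing in the body has run yet
>>> next(gen)                # runs the body until the first yield
1
>>> next(gen)                # resumes after the yield, loops, yields again
2
>>> next(gen)                # the loop ends without hitting yield
Traceback (most recent call last):
  ...
StopIteration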


Your code explained

Generator:

# Here you create the method of the node object that will return the generator
def _get_child_candidates(self, distance, min_dist, max_dist):

    # Here is the code that will be called each time you use the generator object:

    # If there is still a child of the node object on its left
    # AND if the distance is ok, return the next child
    if self._leftchild and distance - max_dist < self._median:
        yield self._leftchild

    # If there is still a child of the node object on its right
    # AND if the distance is ok, return the next child
    if self._rightchild and distance + max_dist >= self._median:
        yield self._rightchild

    # If the function arrives here, the generator will be considered empty:
    # there are no more than two values, the left and the right children

Caller:

# Create an empty list and a list with the current object reference
result, candidates = list(), [self]

# Loop on candidates (they contain only one element at the beginning)
while candidates:

    # Get the last candidate and remove it from the list
    node = candidates.pop()

    # Get the distance between obj and the candidate
    distance = node._get_dist(obj)

    # If distance is ok, then you can fill the result
    if distance <= max_dist and distance >= min_dist:
        result.extend(node._values)

    # Add the children of the candidate to the candidates list
    # so the loop will keep running until it has looked
    # at all the children of the children of the children, etc. of the candidate
    candidates.extend(node._get_child_candidates(distance, min_dist, max_dist))

return result

This code contains several smart parts:

  • The loop iterates on a list, but the list expands while the loop is being iterated. It's a concise way to go through all these nested data, even if it's a bit dangerous since you can end up with an infinite loop. In this case, candidates.extend(node._get_child_candidates(distance, min_dist, max_dist)) exhausts all the values of the generator, but while keeps creating new generator objects which will produce different values from the previous ones since it's not applied to the same node.

  • The extend() method is a list object method that expects an iterable and adds its values to the list.

Usually we pass a list to it:

>>> a = [1, 2]
>>> b = [3, 4]
>>> a.extend(b)
>>> print(a)
[1, 2, 3, 4]

But in your code, it gets a generator, which is good because:

  1. You don"t need to read the values twice.
  2. You may have a lot of children and you don"t want them all stored in memory.

And it works because Python does not care if the argument of a method is a list or not. Python expects iterables so it will work with strings, lists, tuples, and generators! This is called duck typing and is one of the reasons why Python is so cool. But this is another story, for another question...
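
So passing a generator straight to extend() works just as well, and nothing gets materialized up front:

>>> a = [1, 2]
>>> a.extend(x*x for x in range(3))   # extend() just consumes the iterable
>>> print(a)
[1, 2, 0, 1, 4]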

You can stop here, or read a little bit to see an advanced use of a generator:

Controlling generator exhaustion

>>> class Bank(): # Let's create a bank, building ATMs
...    crisis = False
...    def create_atm(self):
...        while not self.crisis:
...            yield "$100"
>>> hsbc = Bank() # When everything's ok the ATM gives you as much as you want
>>> corner_street_atm = hsbc.create_atm()
>>> print(corner_street_atm.next())
$100
>>> print(corner_street_atm.next())
$100
>>> print([corner_street_atm.next() for cash in range(5)])
["$100", "$100", "$100", "$100", "$100"]
>>> hsbc.crisis = True # Crisis is coming, no more money!
>>> print(corner_street_atm.next())
<type "exceptions.StopIteration">
>>> wall_street_atm = hsbc.create_atm() # It"s even true for new ATMs
>>> print(wall_street_atm.next())
<type "exceptions.StopIteration">
>>> hsbc.crisis = False # The trouble is, even post-crisis the ATM remains empty
>>> print(corner_street_atm.next())
<type "exceptions.StopIteration">
>>> brand_new_atm = hsbc.create_atm() # Build a new one to get back in business
>>> for cash in brand_new_atm:
...    print cash
$100
$100
$100
$100
$100
$100
$100
$100
$100
...

Note: For Python 3, use print(corner_street_atm.__next__()) or print(next(corner_street_atm)).

It can be useful for various things like controlling access to a resource.

Itertools, your best friend

The itertools module contains special functions to manipulate iterables. Ever wish to duplicate a generator? Chain two generators? Group values in a nested list with a one-liner? Map / Zip without creating another list?

Then just import itertools.

An example? Let"s see the possible orders of arrival for a four-horse race:

>>> import itertools
>>> horses = [1, 2, 3, 4]
>>> races = itertools.permutations(horses)
>>> print(races)
<itertools.permutations object at 0xb754f1dc>
>>> print(list(itertools.permutations(horses)))
[(1, 2, 3, 4),
 (1, 2, 4, 3),
 (1, 3, 2, 4),
 (1, 3, 4, 2),
 (1, 4, 2, 3),
 (1, 4, 3, 2),
 (2, 1, 3, 4),
 (2, 1, 4, 3),
 (2, 3, 1, 4),
 (2, 3, 4, 1),
 (2, 4, 1, 3),
 (2, 4, 3, 1),
 (3, 1, 2, 4),
 (3, 1, 4, 2),
 (3, 2, 1, 4),
 (3, 2, 4, 1),
 (3, 4, 1, 2),
 (3, 4, 2, 1),
 (4, 1, 2, 3),
 (4, 1, 3, 2),
 (4, 2, 1, 3),
 (4, 2, 3, 1),
 (4, 3, 1, 2),
 (4, 3, 2, 1)]
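
And to answer the questions above about duplicating and chaining generators, a quick sketch with itertools.tee and itertools.chain:

>>> import itertools
>>> gen = (x*x for x in range(3))
>>> a, b = itertools.tee(gen)             # "duplicate" a generator into two iterators
>>> list(itertools.chain(a, [100, 200]))  # chain it with any other iterable
[0, 1, 4, 100, 200]
>>> list(b)                               # the second copy is still untouched
[0, 1, 4]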

Understanding the inner mechanisms of iteration

Iteration is a process involving iterables (objects implementing the __iter__() method) and iterators (objects implementing the __next__() method). Iterables are any objects you can get an iterator from. Iterators are objects that let you iterate over iterables.

There is more about it in this article about how for loops work.

Tutorials