Split models.py into several files

__del__ | _files | split | StackOverflow

I"m trying to split the models.py of my app into several files:

My first guess was do this:

myproject/
    settings.py
    manage.py
    urls.py
    __init__.py
    app1/
        views.py
        __init__.py
        models/
            __init__.py
            model1.py
            model2.py
    app2/
        views.py
        __init__.py
        models/
            __init__.py
            model3.py
            model4.py

This doesn"t work, then i found this, but in this solution i still have a problem, when i run python manage.py sqlall app1 I got something like:

BEGIN;
CREATE TABLE "product_product" (
    "id" serial NOT NULL PRIMARY KEY,
    "store_id" integer NOT NULL
)
;
-- The following references should be added but depend on non-existent tables:
-- ALTER TABLE "product_product" ADD CONSTRAINT "store_id_refs_id_3e117eef" FOREIGN KEY     ("store_id") REFERENCES "store_store" ("id") DEFERRABLE INITIALLY DEFERRED;
CREATE INDEX "product_product_store_id" ON "product_product" ("store_id");
COMMIT;

I"m not pretty sure about this, but i"m worried aboout the part The following references should be added but depend on non-existent tables:

This is my model1.py file:

from django.db import models

class Store(models.Model):
    class Meta:
        app_label = "store"

This is my model3.py file:

from django.db import models

from store.models import Store

class Product(models.Model):
    store = models.ForeignKey(Store)
    class Meta:
        app_label = "product"

And apparently works but i got the comment in alter table and if I try this, same thing happens:

class Product(models.Model):
    store = models.ForeignKey("store.Store")
    class Meta:
        app_label = "product"

So, should I run the alter for references manually? this may bring me problems with south?

Answer rating: 164

For anyone on Django 1.9, it is now supported by the framework without defining the class meta data.

https://docs.djangoproject.com/en/1.9/topics/db/models/#organizing-models-in-a-package

NOTE: For Django 2, it"s still the same

The manage.py startapp command creates an application structure that includes a models.py file. If you have many models, organizing them in separate files may be useful.

To do so, create a models package. Remove models.py and create a myapp/models/ directory with an __init__.py file and the files to store your models. You must import the models in the __init__.py file.

So, in your case, for a structure like

app1/
    views.py
    __init__.py
    models/
        __init__.py
        model1.py
        model2.py
app2/
    views.py
    __init__.py
    models/
        __init__.py
        model3.py
        model4.py

You only need to do

#myproject/app1/models/__init__.py:
from .model1 import Model1
from .model2 import Model2

#myproject/app2/models/__init__.py:
from .model3 import Model3
from .model4 import Model4

A note against importing all the classes:

Explicitly importing each model rather than using from .models import * has the advantages of not cluttering the namespace, making code more readable, and keeping code analysis tools useful.





Split models.py into several files: StackOverflow Questions

How can I make a time delay in Python?

I would like to know how to put a time delay in a Python script.

How to delete a file or folder in Python?

How do I delete a file or folder in Python?

Difference between del, remove, and pop on lists

>>> a=[1,2,3]
>>> a.remove(2)
>>> a
[1, 3]
>>> a=[1,2,3]
>>> del a[1]
>>> a
[1, 3]
>>> a= [1,2,3]
>>> a.pop(1)
2
>>> a
[1, 3]
>>> 

Is there any difference between the above three methods to remove an element from a list?

Is there a simple way to delete a list element by value?

I want to remove a value from a list if it exists in the list (which it may not).

a = [1, 2, 3, 4]
b = a.index(6)

del a[b]
print(a)

The above case (in which it does not exist) shows the following error:

Traceback (most recent call last):
  File "D:zjm_codea.py", line 6, in <module>
    b = a.index(6)
ValueError: list.index(x): x not in list

So I have to do this:

a = [1, 2, 3, 4]

try:
    b = a.index(6)
    del a[b]
except:
    pass

print(a)

But is there not a simpler way to do this?

How do I remove/delete a folder that is not empty?

Question by Amara

I am getting an "access is denied" error when I attempt to delete a folder that is not empty. I used the following command in my attempt: os.remove("/folder_name").

What is the most effective way of removing/deleting a folder/directory that is not empty?

Split Strings into words with multiple word boundary delimiters

I think what I want to do is a fairly common task but I"ve found no reference on the web. I have text with punctuation, and I want a list of the words.

"Hey, you - what are you doing here!?"

should be

["hey", "you", "what", "are", "you", "doing", "here"]

But Python"s str.split() only works with one argument, so I have all words with the punctuation after I split with whitespace. Any ideas?

Split string with multiple delimiters in Python

I found some answers online, but I have no experience with regular expressions, which I believe is what is needed here.

I have a string that needs to be split by either a ";" or ", " That is, it has to be either a semicolon or a comma followed by a space. Individual commas without trailing spaces should be left untouched

Example string:

"b-staged divinylsiloxane-bis-benzocyclobutene [124221-30-3], mesitylene [000108-67-8]; polymerized 1,2-dihydro-2,2,4- trimethyl quinoline [026780-96-1]"

should be split into a list containing the following:

("b-staged divinylsiloxane-bis-benzocyclobutene [124221-30-3]" , "mesitylene [000108-67-8]", "polymerized 1,2-dihydro-2,2,4- trimethyl quinoline [026780-96-1]") 

How to save/restore a model after training?

After you train a model in Tensorflow:

  1. How do you save the trained model?
  2. How do you later restore this saved model?

How to delete the contents of a folder?

Question by Unkwntech

How can I delete the contents of a local folder in Python?

The current project is for Windows, but I would like to see *nix also.

How do I remove/delete a virtualenv?

I created an environment with the following command: virtualenv venv --distribute

I cannot remove it with the following command: rmvirtualenv venv - This is part of virtualenvwrapper as mentioned in answer below for virtualenvwrapper

I do an lson my current directory and I still see venv

The only way I can remove it seems to be: sudo rm -rf venv

Note that the environment is not active. I"m running Ubuntu 11.10. Any ideas? I"ve tried rebooting my system to no avail.

Answer #1

The Python 3 range() object doesn"t produce numbers immediately; it is a smart sequence object that produces numbers on demand. All it contains is your start, stop and step values, then as you iterate over the object the next integer is calculated each iteration.

The object also implements the object.__contains__ hook, and calculates if your number is part of its range. Calculating is a (near) constant time operation *. There is never a need to scan through all possible integers in the range.

From the range() object documentation:

The advantage of the range type over a regular list or tuple is that a range object will always take the same (small) amount of memory, no matter the size of the range it represents (as it only stores the start, stop and step values, calculating individual items and subranges as needed).

So at a minimum, your range() object would do:

class my_range:
    def __init__(self, start, stop=None, step=1, /):
        if stop is None:
            start, stop = 0, start
        self.start, self.stop, self.step = start, stop, step
        if step < 0:
            lo, hi, step = stop, start, -step
        else:
            lo, hi = start, stop
        self.length = 0 if lo > hi else ((hi - lo - 1) // step) + 1

    def __iter__(self):
        current = self.start
        if self.step < 0:
            while current > self.stop:
                yield current
                current += self.step
        else:
            while current < self.stop:
                yield current
                current += self.step

    def __len__(self):
        return self.length

    def __getitem__(self, i):
        if i < 0:
            i += self.length
        if 0 <= i < self.length:
            return self.start + i * self.step
        raise IndexError("my_range object index out of range")

    def __contains__(self, num):
        if self.step < 0:
            if not (self.stop < num <= self.start):
                return False
        else:
            if not (self.start <= num < self.stop):
                return False
        return (num - self.start) % self.step == 0

This is still missing several things that a real range() supports (such as the .index() or .count() methods, hashing, equality testing, or slicing), but should give you an idea.

I also simplified the __contains__ implementation to only focus on integer tests; if you give a real range() object a non-integer value (including subclasses of int), a slow scan is initiated to see if there is a match, just as if you use a containment test against a list of all the contained values. This was done to continue to support other numeric types that just happen to support equality testing with integers but are not expected to support integer arithmetic as well. See the original Python issue that implemented the containment test.


* Near constant time because Python integers are unbounded and so math operations also grow in time as N grows, making this a O(log N) operation. Since it’s all executed in optimised C code and Python stores integer values in 30-bit chunks, you’d run out of memory before you saw any performance impact due to the size of the integers involved here.

Answer #2

You have four main options for converting types in pandas:

  1. to_numeric() - provides functionality to safely convert non-numeric types (e.g. strings) to a suitable numeric type. (See also to_datetime() and to_timedelta().)

  2. astype() - convert (almost) any type to (almost) any other type (even if it"s not necessarily sensible to do so). Also allows you to convert to categorial types (very useful).

  3. infer_objects() - a utility method to convert object columns holding Python objects to a pandas type if possible.

  4. convert_dtypes() - convert DataFrame columns to the "best possible" dtype that supports pd.NA (pandas" object to indicate a missing value).

Read on for more detailed explanations and usage of each of these methods.


1. to_numeric()

The best way to convert one or more columns of a DataFrame to numeric values is to use pandas.to_numeric().

This function will try to change non-numeric objects (such as strings) into integers or floating point numbers as appropriate.

Basic usage

The input to to_numeric() is a Series or a single column of a DataFrame.

>>> s = pd.Series(["8", 6, "7.5", 3, "0.9"]) # mixed string and numeric values
>>> s
0      8
1      6
2    7.5
3      3
4    0.9
dtype: object

>>> pd.to_numeric(s) # convert everything to float values
0    8.0
1    6.0
2    7.5
3    3.0
4    0.9
dtype: float64

As you can see, a new Series is returned. Remember to assign this output to a variable or column name to continue using it:

# convert Series
my_series = pd.to_numeric(my_series)

# convert column "a" of a DataFrame
df["a"] = pd.to_numeric(df["a"])

You can also use it to convert multiple columns of a DataFrame via the apply() method:

# convert all columns of DataFrame
df = df.apply(pd.to_numeric) # convert all columns of DataFrame

# convert just columns "a" and "b"
df[["a", "b"]] = df[["a", "b"]].apply(pd.to_numeric)

As long as your values can all be converted, that"s probably all you need.

Error handling

But what if some values can"t be converted to a numeric type?

to_numeric() also takes an errors keyword argument that allows you to force non-numeric values to be NaN, or simply ignore columns containing these values.

Here"s an example using a Series of strings s which has the object dtype:

>>> s = pd.Series(["1", "2", "4.7", "pandas", "10"])
>>> s
0         1
1         2
2       4.7
3    pandas
4        10
dtype: object

The default behaviour is to raise if it can"t convert a value. In this case, it can"t cope with the string "pandas":

>>> pd.to_numeric(s) # or pd.to_numeric(s, errors="raise")
ValueError: Unable to parse string

Rather than fail, we might want "pandas" to be considered a missing/bad numeric value. We can coerce invalid values to NaN as follows using the errors keyword argument:

>>> pd.to_numeric(s, errors="coerce")
0     1.0
1     2.0
2     4.7
3     NaN
4    10.0
dtype: float64

The third option for errors is just to ignore the operation if an invalid value is encountered:

>>> pd.to_numeric(s, errors="ignore")
# the original Series is returned untouched

This last option is particularly useful when you want to convert your entire DataFrame, but don"t not know which of our columns can be converted reliably to a numeric type. In that case just write:

df.apply(pd.to_numeric, errors="ignore")

The function will be applied to each column of the DataFrame. Columns that can be converted to a numeric type will be converted, while columns that cannot (e.g. they contain non-digit strings or dates) will be left alone.

Downcasting

By default, conversion with to_numeric() will give you either a int64 or float64 dtype (or whatever integer width is native to your platform).

That"s usually what you want, but what if you wanted to save some memory and use a more compact dtype, like float32, or int8?

to_numeric() gives you the option to downcast to either "integer", "signed", "unsigned", "float". Here"s an example for a simple series s of integer type:

>>> s = pd.Series([1, 2, -7])
>>> s
0    1
1    2
2   -7
dtype: int64

Downcasting to "integer" uses the smallest possible integer that can hold the values:

>>> pd.to_numeric(s, downcast="integer")
0    1
1    2
2   -7
dtype: int8

Downcasting to "float" similarly picks a smaller than normal floating type:

>>> pd.to_numeric(s, downcast="float")
0    1.0
1    2.0
2   -7.0
dtype: float32

2. astype()

The astype() method enables you to be explicit about the dtype you want your DataFrame or Series to have. It"s very versatile in that you can try and go from one type to the any other.

Basic usage

Just pick a type: you can use a NumPy dtype (e.g. np.int16), some Python types (e.g. bool), or pandas-specific types (like the categorical dtype).

Call the method on the object you want to convert and astype() will try and convert it for you:

# convert all DataFrame columns to the int64 dtype
df = df.astype(int)

# convert column "a" to int64 dtype and "b" to complex type
df = df.astype({"a": int, "b": complex})

# convert Series to float16 type
s = s.astype(np.float16)

# convert Series to Python strings
s = s.astype(str)

# convert Series to categorical type - see docs for more details
s = s.astype("category")

Notice I said "try" - if astype() does not know how to convert a value in the Series or DataFrame, it will raise an error. For example if you have a NaN or inf value you"ll get an error trying to convert it to an integer.

As of pandas 0.20.0, this error can be suppressed by passing errors="ignore". Your original object will be return untouched.

Be careful

astype() is powerful, but it will sometimes convert values "incorrectly". For example:

>>> s = pd.Series([1, 2, -7])
>>> s
0    1
1    2
2   -7
dtype: int64

These are small integers, so how about converting to an unsigned 8-bit type to save memory?

>>> s.astype(np.uint8)
0      1
1      2
2    249
dtype: uint8

The conversion worked, but the -7 was wrapped round to become 249 (i.e. 28 - 7)!

Trying to downcast using pd.to_numeric(s, downcast="unsigned") instead could help prevent this error.


3. infer_objects()

Version 0.21.0 of pandas introduced the method infer_objects() for converting columns of a DataFrame that have an object datatype to a more specific type (soft conversions).

For example, here"s a DataFrame with two columns of object type. One holds actual integers and the other holds strings representing integers:

>>> df = pd.DataFrame({"a": [7, 1, 5], "b": ["3","2","1"]}, dtype="object")
>>> df.dtypes
a    object
b    object
dtype: object

Using infer_objects(), you can change the type of column "a" to int64:

>>> df = df.infer_objects()
>>> df.dtypes
a     int64
b    object
dtype: object

Column "b" has been left alone since its values were strings, not integers. If you wanted to try and force the conversion of both columns to an integer type, you could use df.astype(int) instead.


4. convert_dtypes()

Version 1.0 and above includes a method convert_dtypes() to convert Series and DataFrame columns to the best possible dtype that supports the pd.NA missing value.

Here "best possible" means the type most suited to hold the values. For example, this a pandas integer type if all of the values are integers (or missing values): an object column of Python integer objects is converted to Int64, a column of NumPy int32 values will become the pandas dtype Int32.

With our object DataFrame df, we get the following result:

>>> df.convert_dtypes().dtypes                                             
a     Int64
b    string
dtype: object

Since column "a" held integer values, it was converted to the Int64 type (which is capable of holding missing values, unlike int64).

Column "b" contained string objects, so was changed to pandas" string dtype.

By default, this method will infer the type from object values in each column. We can change this by passing infer_objects=False:

>>> df.convert_dtypes(infer_objects=False).dtypes                          
a    object
b    string
dtype: object

Now column "a" remained an object column: pandas knows it can be described as an "integer" column (internally it ran infer_dtype) but didn"t infer exactly what dtype of integer it should have so did not convert it. Column "b" was again converted to "string" dtype as it was recognised as holding "string" values.

Answer #3

Since this question was asked in 2010, there has been real simplification in how to do simple multithreading with Python with map and pool.

The code below comes from an article/blog post that you should definitely check out (no affiliation) - Parallelism in one line: A Better Model for Day to Day Threading Tasks. I"ll summarize below - it ends up being just a few lines of code:

from multiprocessing.dummy import Pool as ThreadPool
pool = ThreadPool(4)
results = pool.map(my_function, my_array)

Which is the multithreaded version of:

results = []
for item in my_array:
    results.append(my_function(item))

Description

Map is a cool little function, and the key to easily injecting parallelism into your Python code. For those unfamiliar, map is something lifted from functional languages like Lisp. It is a function which maps another function over a sequence.

Map handles the iteration over the sequence for us, applies the function, and stores all of the results in a handy list at the end.

Enter image description here


Implementation

Parallel versions of the map function are provided by two libraries:multiprocessing, and also its little known, but equally fantastic step child:multiprocessing.dummy.

multiprocessing.dummy is exactly the same as multiprocessing module, but uses threads instead (an important distinction - use multiple processes for CPU-intensive tasks; threads for (and during) I/O):

multiprocessing.dummy replicates the API of multiprocessing, but is no more than a wrapper around the threading module.

import urllib2
from multiprocessing.dummy import Pool as ThreadPool

urls = [
  "http://www.python.org",
  "http://www.python.org/about/",
  "http://www.onlamp.com/pub/a/python/2003/04/17/metaclasses.html",
  "http://www.python.org/doc/",
  "http://www.python.org/download/",
  "http://www.python.org/getit/",
  "http://www.python.org/community/",
  "https://wiki.python.org/moin/",
]

# Make the Pool of workers
pool = ThreadPool(4)

# Open the URLs in their own threads
# and return the results
results = pool.map(urllib2.urlopen, urls)

# Close the pool and wait for the work to finish
pool.close()
pool.join()

And the timing results:

Single thread:   14.4 seconds
       4 Pool:   3.1 seconds
       8 Pool:   1.4 seconds
      13 Pool:   1.3 seconds

Passing multiple arguments (works like this only in Python 3.3 and later):

To pass multiple arrays:

results = pool.starmap(function, zip(list_a, list_b))

Or to pass a constant and an array:

results = pool.starmap(function, zip(itertools.repeat(constant), list_a))

If you are using an earlier version of Python, you can pass multiple arguments via this workaround).

(Thanks to user136036 for the helpful comment.)

Answer #4

In Python, what is the purpose of __slots__ and what are the cases one should avoid this?

TLDR:

The special attribute __slots__ allows you to explicitly state which instance attributes you expect your object instances to have, with the expected results:

  1. faster attribute access.
  2. space savings in memory.

The space savings is from

  1. Storing value references in slots instead of __dict__.
  2. Denying __dict__ and __weakref__ creation if parent classes deny them and you declare __slots__.

Quick Caveats

Small caveat, you should only declare a particular slot one time in an inheritance tree. For example:

class Base:
    __slots__ = "foo", "bar"

class Right(Base):
    __slots__ = "baz", 

class Wrong(Base):
    __slots__ = "foo", "bar", "baz"        # redundant foo and bar

Python doesn"t object when you get this wrong (it probably should), problems might not otherwise manifest, but your objects will take up more space than they otherwise should. Python 3.8:

>>> from sys import getsizeof
>>> getsizeof(Right()), getsizeof(Wrong())
(56, 72)

This is because the Base"s slot descriptor has a slot separate from the Wrong"s. This shouldn"t usually come up, but it could:

>>> w = Wrong()
>>> w.foo = "foo"
>>> Base.foo.__get__(w)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: foo
>>> Wrong.foo.__get__(w)
"foo"

The biggest caveat is for multiple inheritance - multiple "parent classes with nonempty slots" cannot be combined.

To accommodate this restriction, follow best practices: Factor out all but one or all parents" abstraction which their concrete class respectively and your new concrete class collectively will inherit from - giving the abstraction(s) empty slots (just like abstract base classes in the standard library).

See section on multiple inheritance below for an example.

Requirements:

  • To have attributes named in __slots__ to actually be stored in slots instead of a __dict__, a class must inherit from object (automatic in Python 3, but must be explicit in Python 2).

  • To prevent the creation of a __dict__, you must inherit from object and all classes in the inheritance must declare __slots__ and none of them can have a "__dict__" entry.

There are a lot of details if you wish to keep reading.

Why use __slots__: Faster attribute access.

The creator of Python, Guido van Rossum, states that he actually created __slots__ for faster attribute access.

It is trivial to demonstrate measurably significant faster access:

import timeit

class Foo(object): __slots__ = "foo",

class Bar(object): pass

slotted = Foo()
not_slotted = Bar()

def get_set_delete_fn(obj):
    def get_set_delete():
        obj.foo = "foo"
        obj.foo
        del obj.foo
    return get_set_delete

and

>>> min(timeit.repeat(get_set_delete_fn(slotted)))
0.2846834529991611
>>> min(timeit.repeat(get_set_delete_fn(not_slotted)))
0.3664822799983085

The slotted access is almost 30% faster in Python 3.5 on Ubuntu.

>>> 0.3664822799983085 / 0.2846834529991611
1.2873325658284342

In Python 2 on Windows I have measured it about 15% faster.

Why use __slots__: Memory Savings

Another purpose of __slots__ is to reduce the space in memory that each object instance takes up.

My own contribution to the documentation clearly states the reasons behind this:

The space saved over using __dict__ can be significant.

SQLAlchemy attributes a lot of memory savings to __slots__.

To verify this, using the Anaconda distribution of Python 2.7 on Ubuntu Linux, with guppy.hpy (aka heapy) and sys.getsizeof, the size of a class instance without __slots__ declared, and nothing else, is 64 bytes. That does not include the __dict__. Thank you Python for lazy evaluation again, the __dict__ is apparently not called into existence until it is referenced, but classes without data are usually useless. When called into existence, the __dict__ attribute is a minimum of 280 bytes additionally.

In contrast, a class instance with __slots__ declared to be () (no data) is only 16 bytes, and 56 total bytes with one item in slots, 64 with two.

For 64 bit Python, I illustrate the memory consumption in bytes in Python 2.7 and 3.6, for __slots__ and __dict__ (no slots defined) for each point where the dict grows in 3.6 (except for 0, 1, and 2 attributes):

       Python 2.7             Python 3.6
attrs  __slots__  __dict__*   __slots__  __dict__* | *(no slots defined)
none   16         56 + 272†   16         56 + 112† | †if __dict__ referenced
one    48         56 + 272    48         56 + 112
two    56         56 + 272    56         56 + 112
six    88         56 + 1040   88         56 + 152
11     128        56 + 1040   128        56 + 240
22     216        56 + 3344   216        56 + 408     
43     384        56 + 3344   384        56 + 752

So, in spite of smaller dicts in Python 3, we see how nicely __slots__ scale for instances to save us memory, and that is a major reason you would want to use __slots__.

Just for completeness of my notes, note that there is a one-time cost per slot in the class"s namespace of 64 bytes in Python 2, and 72 bytes in Python 3, because slots use data descriptors like properties, called "members".

>>> Foo.foo
<member "foo" of "Foo" objects>
>>> type(Foo.foo)
<class "member_descriptor">
>>> getsizeof(Foo.foo)
72

Demonstration of __slots__:

To deny the creation of a __dict__, you must subclass object. Everything subclasses object in Python 3, but in Python 2 you had to be explicit:

class Base(object): 
    __slots__ = ()

now:

>>> b = Base()
>>> b.a = "a"
Traceback (most recent call last):
  File "<pyshell#38>", line 1, in <module>
    b.a = "a"
AttributeError: "Base" object has no attribute "a"

Or subclass another class that defines __slots__

class Child(Base):
    __slots__ = ("a",)

and now:

c = Child()
c.a = "a"

but:

>>> c.b = "b"
Traceback (most recent call last):
  File "<pyshell#42>", line 1, in <module>
    c.b = "b"
AttributeError: "Child" object has no attribute "b"

To allow __dict__ creation while subclassing slotted objects, just add "__dict__" to the __slots__ (note that slots are ordered, and you shouldn"t repeat slots that are already in parent classes):

class SlottedWithDict(Child): 
    __slots__ = ("__dict__", "b")

swd = SlottedWithDict()
swd.a = "a"
swd.b = "b"
swd.c = "c"

and

>>> swd.__dict__
{"c": "c"}

Or you don"t even need to declare __slots__ in your subclass, and you will still use slots from the parents, but not restrict the creation of a __dict__:

class NoSlots(Child): pass
ns = NoSlots()
ns.a = "a"
ns.b = "b"

And:

>>> ns.__dict__
{"b": "b"}

However, __slots__ may cause problems for multiple inheritance:

class BaseA(object): 
    __slots__ = ("a",)

class BaseB(object): 
    __slots__ = ("b",)

Because creating a child class from parents with both non-empty slots fails:

>>> class Child(BaseA, BaseB): __slots__ = ()
Traceback (most recent call last):
  File "<pyshell#68>", line 1, in <module>
    class Child(BaseA, BaseB): __slots__ = ()
TypeError: Error when calling the metaclass bases
    multiple bases have instance lay-out conflict

If you run into this problem, You could just remove __slots__ from the parents, or if you have control of the parents, give them empty slots, or refactor to abstractions:

from abc import ABC

class AbstractA(ABC):
    __slots__ = ()

class BaseA(AbstractA): 
    __slots__ = ("a",)

class AbstractB(ABC):
    __slots__ = ()

class BaseB(AbstractB): 
    __slots__ = ("b",)

class Child(AbstractA, AbstractB): 
    __slots__ = ("a", "b")

c = Child() # no problem!

Add "__dict__" to __slots__ to get dynamic assignment:

class Foo(object):
    __slots__ = "bar", "baz", "__dict__"

and now:

>>> foo = Foo()
>>> foo.boink = "boink"

So with "__dict__" in slots we lose some of the size benefits with the upside of having dynamic assignment and still having slots for the names we do expect.

When you inherit from an object that isn"t slotted, you get the same sort of semantics when you use __slots__ - names that are in __slots__ point to slotted values, while any other values are put in the instance"s __dict__.

Avoiding __slots__ because you want to be able to add attributes on the fly is actually not a good reason - just add "__dict__" to your __slots__ if this is required.

You can similarly add __weakref__ to __slots__ explicitly if you need that feature.

Set to empty tuple when subclassing a namedtuple:

The namedtuple builtin make immutable instances that are very lightweight (essentially, the size of tuples) but to get the benefits, you need to do it yourself if you subclass them:

from collections import namedtuple
class MyNT(namedtuple("MyNT", "bar baz")):
    """MyNT is an immutable and lightweight object"""
    __slots__ = ()

usage:

>>> nt = MyNT("bar", "baz")
>>> nt.bar
"bar"
>>> nt.baz
"baz"

And trying to assign an unexpected attribute raises an AttributeError because we have prevented the creation of __dict__:

>>> nt.quux = "quux"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: "MyNT" object has no attribute "quux"

You can allow __dict__ creation by leaving off __slots__ = (), but you can"t use non-empty __slots__ with subtypes of tuple.

Biggest Caveat: Multiple inheritance

Even when non-empty slots are the same for multiple parents, they cannot be used together:

class Foo(object): 
    __slots__ = "foo", "bar"
class Bar(object):
    __slots__ = "foo", "bar" # alas, would work if empty, i.e. ()

>>> class Baz(Foo, Bar): pass
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Error when calling the metaclass bases
    multiple bases have instance lay-out conflict

Using an empty __slots__ in the parent seems to provide the most flexibility, allowing the child to choose to prevent or allow (by adding "__dict__" to get dynamic assignment, see section above) the creation of a __dict__:

class Foo(object): __slots__ = ()
class Bar(object): __slots__ = ()
class Baz(Foo, Bar): __slots__ = ("foo", "bar")
b = Baz()
b.foo, b.bar = "foo", "bar"

You don"t have to have slots - so if you add them, and remove them later, it shouldn"t cause any problems.

Going out on a limb here: If you"re composing mixins or using abstract base classes, which aren"t intended to be instantiated, an empty __slots__ in those parents seems to be the best way to go in terms of flexibility for subclassers.

To demonstrate, first, let"s create a class with code we"d like to use under multiple inheritance

class AbstractBase:
    __slots__ = ()
    def __init__(self, a, b):
        self.a = a
        self.b = b
    def __repr__(self):
        return f"{type(self).__name__}({repr(self.a)}, {repr(self.b)})"

We could use the above directly by inheriting and declaring the expected slots:

class Foo(AbstractBase):
    __slots__ = "a", "b"

But we don"t care about that, that"s trivial single inheritance, we need another class we might also inherit from, maybe with a noisy attribute:

class AbstractBaseC:
    __slots__ = ()
    @property
    def c(self):
        print("getting c!")
        return self._c
    @c.setter
    def c(self, arg):
        print("setting c!")
        self._c = arg

Now if both bases had nonempty slots, we couldn"t do the below. (In fact, if we wanted, we could have given AbstractBase nonempty slots a and b, and left them out of the below declaration - leaving them in would be wrong):

class Concretion(AbstractBase, AbstractBaseC):
    __slots__ = "a b _c".split()

And now we have functionality from both via multiple inheritance, and can still deny __dict__ and __weakref__ instantiation:

>>> c = Concretion("a", "b")
>>> c.c = c
setting c!
>>> c.c
getting c!
Concretion("a", "b")
>>> c.d = "d"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: "Concretion" object has no attribute "d"

Other cases to avoid slots:

  • Avoid them when you want to perform __class__ assignment with another class that doesn"t have them (and you can"t add them) unless the slot layouts are identical. (I am very interested in learning who is doing this and why.)
  • Avoid them if you want to subclass variable length builtins like long, tuple, or str, and you want to add attributes to them.
  • Avoid them if you insist on providing default values via class attributes for instance variables.

You may be able to tease out further caveats from the rest of the __slots__ documentation (the 3.7 dev docs are the most current), which I have made significant recent contributions to.

Critiques of other answers

The current top answers cite outdated information and are quite hand-wavy and miss the mark in some important ways.

Do not "only use __slots__ when instantiating lots of objects"

I quote:

"You would want to use __slots__ if you are going to instantiate a lot (hundreds, thousands) of objects of the same class."

Abstract Base Classes, for example, from the collections module, are not instantiated, yet __slots__ are declared for them.

Why?

If a user wishes to deny __dict__ or __weakref__ creation, those things must not be available in the parent classes.

__slots__ contributes to reusability when creating interfaces or mixins.

It is true that many Python users aren"t writing for reusability, but when you are, having the option to deny unnecessary space usage is valuable.

__slots__ doesn"t break pickling

When pickling a slotted object, you may find it complains with a misleading TypeError:

>>> pickle.loads(pickle.dumps(f))
TypeError: a class that defines __slots__ without defining __getstate__ cannot be pickled

This is actually incorrect. This message comes from the oldest protocol, which is the default. You can select the latest protocol with the -1 argument. In Python 2.7 this would be 2 (which was introduced in 2.3), and in 3.6 it is 4.

>>> pickle.loads(pickle.dumps(f, -1))
<__main__.Foo object at 0x1129C770>

in Python 2.7:

>>> pickle.loads(pickle.dumps(f, 2))
<__main__.Foo object at 0x1129C770>

in Python 3.6

>>> pickle.loads(pickle.dumps(f, 4))
<__main__.Foo object at 0x1129C770>

So I would keep this in mind, as it is a solved problem.

Critique of the (until Oct 2, 2016) accepted answer

The first paragraph is half short explanation, half predictive. Here"s the only part that actually answers the question

The proper use of __slots__ is to save space in objects. Instead of having a dynamic dict that allows adding attributes to objects at anytime, there is a static structure which does not allow additions after creation. This saves the overhead of one dict for every object that uses slots

The second half is wishful thinking, and off the mark:

While this is sometimes a useful optimization, it would be completely unnecessary if the Python interpreter was dynamic enough so that it would only require the dict when there actually were additions to the object.

Python actually does something similar to this, only creating the __dict__ when it is accessed, but creating lots of objects with no data is fairly ridiculous.

The second paragraph oversimplifies and misses actual reasons to avoid __slots__. The below is not a real reason to avoid slots (for actual reasons, see the rest of my answer above.):

They change the behavior of the objects that have slots in a way that can be abused by control freaks and static typing weenies.

It then goes on to discuss other ways of accomplishing that perverse goal with Python, not discussing anything to do with __slots__.

The third paragraph is more wishful thinking. Together it is mostly off-the-mark content that the answerer didn"t even author and contributes to ammunition for critics of the site.

Memory usage evidence

Create some normal objects and slotted objects:

>>> class Foo(object): pass
>>> class Bar(object): __slots__ = ()

Instantiate a million of them:

>>> foos = [Foo() for f in xrange(1000000)]
>>> bars = [Bar() for b in xrange(1000000)]

Inspect with guppy.hpy().heap():

>>> guppy.hpy().heap()
Partition of a set of 2028259 objects. Total size = 99763360 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0 1000000  49 64000000  64  64000000  64 __main__.Foo
     1     169   0 16281480  16  80281480  80 list
     2 1000000  49 16000000  16  96281480  97 __main__.Bar
     3   12284   1   987472   1  97268952  97 str
...

Access the regular objects and their __dict__ and inspect again:

>>> for f in foos:
...     f.__dict__
>>> guppy.hpy().heap()
Partition of a set of 3028258 objects. Total size = 379763480 bytes.
 Index  Count   %      Size    % Cumulative  % Kind (class / dict of class)
     0 1000000  33 280000000  74 280000000  74 dict of __main__.Foo
     1 1000000  33  64000000  17 344000000  91 __main__.Foo
     2     169   0  16281480   4 360281480  95 list
     3 1000000  33  16000000   4 376281480  99 __main__.Bar
     4   12284   0    987472   0 377268952  99 str
...

This is consistent with the history of Python, from Unifying types and classes in Python 2.2

If you subclass a built-in type, extra space is automatically added to the instances to accomodate __dict__ and __weakrefs__. (The __dict__ is not initialized until you use it though, so you shouldn"t worry about the space occupied by an empty dictionary for each instance you create.) If you don"t need this extra space, you can add the phrase "__slots__ = []" to your class.

Answer #5

This is the behaviour to adopt when the referenced object is deleted. It is not specific to Django; this is an SQL standard. Although Django has its own implementation on top of SQL. (1)

There are seven possible actions to take when such event occurs:

  • CASCADE: When the referenced object is deleted, also delete the objects that have references to it (when you remove a blog post for instance, you might want to delete comments as well). SQL equivalent: CASCADE.
  • PROTECT: Forbid the deletion of the referenced object. To delete it you will have to delete all objects that reference it manually. SQL equivalent: RESTRICT.
  • RESTRICT: (introduced in Django 3.1) Similar behavior as PROTECT that matches SQL"s RESTRICT more accurately. (See django documentation example)
  • SET_NULL: Set the reference to NULL (requires the field to be nullable). For instance, when you delete a User, you might want to keep the comments he posted on blog posts, but say it was posted by an anonymous (or deleted) user. SQL equivalent: SET NULL.
  • SET_DEFAULT: Set the default value. SQL equivalent: SET DEFAULT.
  • SET(...): Set a given value. This one is not part of the SQL standard and is entirely handled by Django.
  • DO_NOTHING: Probably a very bad idea since this would create integrity issues in your database (referencing an object that actually doesn"t exist). SQL equivalent: NO ACTION. (2)

Source: Django documentation

See also the documentation of PostgreSQL for instance.

In most cases, CASCADE is the expected behaviour, but for every ForeignKey, you should always ask yourself what is the expected behaviour in this situation. PROTECT and SET_NULL are often useful. Setting CASCADE where it should not, can potentially delete all of your database in cascade, by simply deleting a single user.


Additional note to clarify cascade direction

It"s funny to notice that the direction of the CASCADE action is not clear to many people. Actually, it"s funny to notice that only the CASCADE action is not clear. I understand the cascade behavior might be confusing, however you must think that it is the same direction as any other action. Thus, if you feel that CASCADE direction is not clear to you, it actually means that on_delete behavior is not clear to you.

In your database, a foreign key is basically represented by an integer field which value is the primary key of the foreign object. Let"s say you have an entry comment_A, which has a foreign key to an entry article_B. If you delete the entry comment_A, everything is fine. article_B used to live without comment_A and don"t bother if it"s deleted. However, if you delete article_B, then comment_A panics! It never lived without article_B and needs it, and it"s part of its attributes (article=article_B, but what is article_B???). This is where on_delete steps in, to determine how to resolve this integrity error, either by saying:

  • "No! Please! Don"t! I can"t live without you!" (which is said PROTECT or RESTRICT in Django/SQL)
  • "All right, if I"m not yours, then I"m nobody"s" (which is said SET_NULL)
  • "Good bye world, I can"t live without article_B" and commit suicide (this is the CASCADE behavior).
  • "It"s OK, I"ve got spare lover, and I"ll reference article_C from now" (SET_DEFAULT, or even SET(...)).
  • "I can"t face reality, and I"ll keep calling your name even if that"s the only thing left to me!" (DO_NOTHING)

I hope it makes cascade direction clearer. :)


Footnotes

(1) Django has its own implementation on top of SQL. And, as mentioned by @JoeMjr2 in the comments below, Django will not create the SQL constraints. If you want the constraints to be ensured by your database (for instance, if your database is used by another application, or if you hang in the database console from time to time), you might want to set the related constraints manually yourself. There is an open ticket to add support for database-level on delete constrains in Django.

(2) Actually, there is one case where DO_NOTHING can be useful: If you want to skip Django"s implementation and implement the constraint yourself at the database-level.

Answer #6

TL;DR: If you are using Python 3.10 or later, it just works. As of today (2019), in 3.7+ you must turn this feature on using a future statement (from __future__ import annotations). In Python 3.6 or below, use a string.

I guess you got this exception:

NameError: name "Position" is not defined

This is because Position must be defined before you can use it in an annotation unless you are using Python 3.10 or later.

Python 3.7+: from __future__ import annotations

Python 3.7 introduces PEP 563: postponed evaluation of annotations. A module that uses the future statement from __future__ import annotations will store annotations as strings automatically:

from __future__ import annotations

class Position:
    def __add__(self, other: Position) -> Position:
        ...

This is scheduled to become the default in Python 3.10. Since Python still is a dynamically typed language so no type checking is done at runtime, typing annotations should have no performance impact, right? Wrong! Before python 3.7 the typing module used to be one of the slowest python modules in core so if you import typing you will see up to 7 times increase in performance when you upgrade to 3.7.

Python <3.7: use a string

According to PEP 484, you should use a string instead of the class itself:

class Position:
    ...
    def __add__(self, other: "Position") -> "Position":
       ...

If you use the Django framework this may be familiar as Django models also use strings for forward references (foreign key definitions where the foreign model is self or is not declared yet). This should work with Pycharm and other tools.

Sources

The relevant parts of PEP 484 and PEP 563, to spare you the trip:

Forward references

When a type hint contains names that have not been defined yet, that definition may be expressed as a string literal, to be resolved later.

A situation where this occurs commonly is the definition of a container class, where the class being defined occurs in the signature of some of the methods. For example, the following code (the start of a simple binary tree implementation) does not work:

class Tree:
    def __init__(self, left: Tree, right: Tree):
        self.left = left
        self.right = right

To address this, we write:

class Tree:
    def __init__(self, left: "Tree", right: "Tree"):
        self.left = left
        self.right = right

The string literal should contain a valid Python expression (i.e., compile(lit, "", "eval") should be a valid code object) and it should evaluate without errors once the module has been fully loaded. The local and global namespace in which it is evaluated should be the same namespaces in which default arguments to the same function would be evaluated.

and PEP 563:

Implementation

In Python 3.10, function and variable annotations will no longer be evaluated at definition time. Instead, a string form will be preserved in the respective __annotations__ dictionary. Static type checkers will see no difference in behavior, whereas tools using annotations at runtime will have to perform postponed evaluation.

...

Enabling the future behavior in Python 3.7

The functionality described above can be enabled starting from Python 3.7 using the following special import:

from __future__ import annotations

Things that you may be tempted to do instead

A. Define a dummy Position

Before the class definition, place a dummy definition:

class Position(object):
    pass


class Position(object):
    ...

This will get rid of the NameError and may even look OK:

>>> Position.__add__.__annotations__
{"other": __main__.Position, "return": __main__.Position}

But is it?

>>> for k, v in Position.__add__.__annotations__.items():
...     print(k, "is Position:", v is Position)                                                                                                                                                                                                                  
return is Position: False
other is Position: False

B. Monkey-patch in order to add the annotations:

You may want to try some Python meta programming magic and write a decorator to monkey-patch the class definition in order to add annotations:

class Position:
    ...
    def __add__(self, other):
        return self.__class__(self.x + other.x, self.y + other.y)

The decorator should be responsible for the equivalent of this:

Position.__add__.__annotations__["return"] = Position
Position.__add__.__annotations__["other"] = Position

At least it seems right:

>>> for k, v in Position.__add__.__annotations__.items():
...     print(k, "is Position:", v is Position)                                                                                                                                                                                                                  
return is Position: True
other is Position: True

Probably too much trouble.

Answer #7

Is there any reason for a class declaration to inherit from object?

In Python 3, apart from compatibility between Python 2 and 3, no reason. In Python 2, many reasons.


Python 2.x story:

In Python 2.x (from 2.2 onwards) there"s two styles of classes depending on the presence or absence of object as a base-class:

  1. "classic" style classes: they don"t have object as a base class:

    >>> class ClassicSpam:      # no base class
    ...     pass
    >>> ClassicSpam.__bases__
    ()
    
  2. "new" style classes: they have, directly or indirectly (e.g inherit from a built-in type), object as a base class:

    >>> class NewSpam(object):           # directly inherit from object
    ...    pass
    >>> NewSpam.__bases__
    (<type "object">,)
    >>> class IntSpam(int):              # indirectly inherit from object...
    ...    pass
    >>> IntSpam.__bases__
    (<type "int">,) 
    >>> IntSpam.__bases__[0].__bases__   # ... because int inherits from object  
    (<type "object">,)
    

Without a doubt, when writing a class you"ll always want to go for new-style classes. The perks of doing so are numerous, to list some of them:

  • Support for descriptors. Specifically, the following constructs are made possible with descriptors:

    1. classmethod: A method that receives the class as an implicit argument instead of the instance.
    2. staticmethod: A method that does not receive the implicit argument self as a first argument.
    3. properties with property: Create functions for managing the getting, setting and deleting of an attribute.
    4. __slots__: Saves memory consumptions of a class and also results in faster attribute access. Of course, it does impose limitations.
  • The __new__ static method: lets you customize how new class instances are created.

  • Method resolution order (MRO): in what order the base classes of a class will be searched when trying to resolve which method to call.

  • Related to MRO, super calls. Also see, super() considered super.

If you don"t inherit from object, forget these. A more exhaustive description of the previous bullet points along with other perks of "new" style classes can be found here.

One of the downsides of new-style classes is that the class itself is more memory demanding. Unless you"re creating many class objects, though, I doubt this would be an issue and it"s a negative sinking in a sea of positives.


Python 3.x story:

In Python 3, things are simplified. Only new-style classes exist (referred to plainly as classes) so, the only difference in adding object is requiring you to type in 8 more characters. This:

class ClassicSpam:
    pass

is completely equivalent (apart from their name :-) to this:

class NewSpam(object):
     pass

and to this:

class Spam():
    pass

All have object in their __bases__.

>>> [object in cls.__bases__ for cls in {Spam, NewSpam, ClassicSpam}]
[True, True, True]

So, what should you do?

In Python 2: always inherit from object explicitly. Get the perks.

In Python 3: inherit from object if you are writing code that tries to be Python agnostic, that is, it needs to work both in Python 2 and in Python 3. Otherwise don"t, it really makes no difference since Python inserts it for you behind the scenes.

Answer #8

It must relate to the renaming and deprecation of cross_validation sub-module to model_selection. Try substituting cross_validation to model_selection

Answer #9

There are many ways to convert an instance to a dictionary, with varying degrees of corner case handling and closeness to the desired result.


1. instance.__dict__

instance.__dict__

which returns

{"_foreign_key_cache": <OtherModel: OtherModel object>,
 "_state": <django.db.models.base.ModelState at 0x7ff0993f6908>,
 "auto_now_add": datetime.datetime(2018, 12, 20, 21, 34, 29, 494827, tzinfo=<UTC>),
 "foreign_key_id": 2,
 "id": 1,
 "normal_value": 1,
 "readonly_value": 2}

This is by far the simplest, but is missing many_to_many, foreign_key is misnamed, and it has two unwanted extra things in it.


2. model_to_dict

from django.forms.models import model_to_dict
model_to_dict(instance)

which returns

{"foreign_key": 2,
 "id": 1,
 "many_to_many": [<OtherModel: OtherModel object>],
 "normal_value": 1}

This is the only one with many_to_many, but is missing the uneditable fields.


3. model_to_dict(..., fields=...)

from django.forms.models import model_to_dict
model_to_dict(instance, fields=[field.name for field in instance._meta.fields])

which returns

{"foreign_key": 2, "id": 1, "normal_value": 1}

This is strictly worse than the standard model_to_dict invocation.


4. query_set.values()

SomeModel.objects.filter(id=instance.id).values()[0]

which returns

{"auto_now_add": datetime.datetime(2018, 12, 20, 21, 34, 29, 494827, tzinfo=<UTC>),
 "foreign_key_id": 2,
 "id": 1,
 "normal_value": 1,
 "readonly_value": 2}

This is the same output as instance.__dict__ but without the extra fields. foreign_key_id is still wrong and many_to_many is still missing.


5. Custom Function

The code for django"s model_to_dict had most of the answer. It explicitly removed non-editable fields, so removing that check and getting the ids of foreign keys for many to many fields results in the following code which behaves as desired:

from itertools import chain

def to_dict(instance):
    opts = instance._meta
    data = {}
    for f in chain(opts.concrete_fields, opts.private_fields):
        data[f.name] = f.value_from_object(instance)
    for f in opts.many_to_many:
        data[f.name] = [i.id for i in f.value_from_object(instance)]
    return data

While this is the most complicated option, calling to_dict(instance) gives us exactly the desired result:

{"auto_now_add": datetime.datetime(2018, 12, 20, 21, 34, 29, 494827, tzinfo=<UTC>),
 "foreign_key": 2,
 "id": 1,
 "many_to_many": [2],
 "normal_value": 1,
 "readonly_value": 2}

6. Use Serializers

Django Rest Framework"s ModelSerialzer allows you to build a serializer automatically from a model.

from rest_framework import serializers
class SomeModelSerializer(serializers.ModelSerializer):
    class Meta:
        model = SomeModel
        fields = "__all__"

SomeModelSerializer(instance).data

returns

{"auto_now_add": "2018-12-20T21:34:29.494827Z",
 "foreign_key": 2,
 "id": 1,
 "many_to_many": [2],
 "normal_value": 1,
 "readonly_value": 2}

This is almost as good as the custom function, but auto_now_add is a string instead of a datetime object.


Bonus Round: better model printing

If you want a django model that has a better python command-line display, have your models child-class the following:

from django.db import models
from itertools import chain

class PrintableModel(models.Model):
    def __repr__(self):
        return str(self.to_dict())

    def to_dict(instance):
        opts = instance._meta
        data = {}
        for f in chain(opts.concrete_fields, opts.private_fields):
            data[f.name] = f.value_from_object(instance)
        for f in opts.many_to_many:
            data[f.name] = [i.id for i in f.value_from_object(instance)]
        return data

    class Meta:
        abstract = True

So, for example, if we define our models as such:

class OtherModel(PrintableModel): pass

class SomeModel(PrintableModel):
    normal_value = models.IntegerField()
    readonly_value = models.IntegerField(editable=False)
    auto_now_add = models.DateTimeField(auto_now_add=True)
    foreign_key = models.ForeignKey(OtherModel, related_name="ref1")
    many_to_many = models.ManyToManyField(OtherModel, related_name="ref2")

Calling SomeModel.objects.first() now gives output like this:

{"auto_now_add": datetime.datetime(2018, 12, 20, 21, 34, 29, 494827, tzinfo=<UTC>),
 "foreign_key": 2,
 "id": 1,
 "many_to_many": [2],
 "normal_value": 1,
 "readonly_value": 2}

Answer #10

The short answer, or TL;DR

Basically, eval is used to evaluate a single dynamically generated Python expression, and exec is used to execute dynamically generated Python code only for its side effects.

eval and exec have these two differences:

  1. eval accepts only a single expression, exec can take a code block that has Python statements: loops, try: except:, class and function/method definitions and so on.

    An expression in Python is whatever you can have as the value in a variable assignment:

    a_variable = (anything you can put within these parentheses is an expression)
    
  2. eval returns the value of the given expression, whereas exec ignores the return value from its code, and always returns None (in Python 2 it is a statement and cannot be used as an expression, so it really does not return anything).

In versions 1.0 - 2.7, exec was a statement, because CPython needed to produce a different kind of code object for functions that used exec for its side effects inside the function.

In Python 3, exec is a function; its use has no effect on the compiled bytecode of the function where it is used.


Thus basically:

>>> a = 5
>>> eval("37 + a")   # it is an expression
42
>>> exec("37 + a")   # it is an expression statement; value is ignored (None is returned)
>>> exec("a = 47")   # modify a global variable as a side effect
>>> a
47
>>> eval("a = 47")  # you cannot evaluate a statement
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 1
    a = 47
      ^
SyntaxError: invalid syntax

The compile in "exec" mode compiles any number of statements into a bytecode that implicitly always returns None, whereas in "eval" mode it compiles a single expression into bytecode that returns the value of that expression.

>>> eval(compile("42", "<string>", "exec"))  # code returns None
>>> eval(compile("42", "<string>", "eval"))  # code returns 42
42
>>> exec(compile("42", "<string>", "eval"))  # code returns 42,
>>>                                          # but ignored by exec

In the "eval" mode (and thus with the eval function if a string is passed in), the compile raises an exception if the source code contains statements or anything else beyond a single expression:

>>> compile("for i in range(3): print(i)", "<string>", "eval")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 1
    for i in range(3): print(i)
      ^
SyntaxError: invalid syntax

Actually the statement "eval accepts only a single expression" applies only when a string (which contains Python source code) is passed to eval. Then it is internally compiled to bytecode using compile(source, "<string>", "eval") This is where the difference really comes from.

If a code object (which contains Python bytecode) is passed to exec or eval, they behave identically, excepting for the fact that exec ignores the return value, still returning None always. So it is possible use eval to execute something that has statements, if you just compiled it into bytecode before instead of passing it as a string:

>>> eval(compile("if 1: print("Hello")", "<string>", "exec"))
Hello
>>>

works without problems, even though the compiled code contains statements. It still returns None, because that is the return value of the code object returned from compile.

In the "eval" mode (and thus with the eval function if a string is passed in), the compile raises an exception if the source code contains statements or anything else beyond a single expression:

>>> compile("for i in range(3): print(i)", "<string>". "eval")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 1
    for i in range(3): print(i)
      ^
SyntaxError: invalid syntax

The longer answer, a.k.a the gory details

exec and eval

The exec function (which was a statement in Python 2) is used for executing a dynamically created statement or program:

>>> program = """
for i in range(3):
    print("Python is cool")
"""
>>> exec(program)
Python is cool
Python is cool
Python is cool
>>> 

The eval function does the same for a single expression, and returns the value of the expression:

>>> a = 2
>>> my_calculation = "42 * a"
>>> result = eval(my_calculation)
>>> result
84

exec and eval both accept the program/expression to be run either as a str, unicode or bytes object containing source code, or as a code object which contains Python bytecode.

If a str/unicode/bytes containing source code was passed to exec, it behaves equivalently to:

exec(compile(source, "<string>", "exec"))

and eval similarly behaves equivalent to:

eval(compile(source, "<string>", "eval"))

Since all expressions can be used as statements in Python (these are called the Expr nodes in the Python abstract grammar; the opposite is not true), you can always use exec if you do not need the return value. That is to say, you can use either eval("my_func(42)") or exec("my_func(42)"), the difference being that eval returns the value returned by my_func, and exec discards it:

>>> def my_func(arg):
...     print("Called with %d" % arg)
...     return arg * 2
... 
>>> exec("my_func(42)")
Called with 42
>>> eval("my_func(42)")
Called with 42
84
>>> 

Of the 2, only exec accepts source code that contains statements, like def, for, while, import, or class, the assignment statement (a.k.a a = 42), or entire programs:

>>> exec("for i in range(3): print(i)")
0
1
2
>>> eval("for i in range(3): print(i)")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 1
    for i in range(3): print(i)
      ^
SyntaxError: invalid syntax

Both exec and eval accept 2 additional positional arguments - globals and locals - which are the global and local variable scopes that the code sees. These default to the globals() and locals() within the scope that called exec or eval, but any dictionary can be used for globals and any mapping for locals (including dict of course). These can be used not only to restrict/modify the variables that the code sees, but are often also used for capturing the variables that the executed code creates:

>>> g = dict()
>>> l = dict()
>>> exec("global a; a, b = 123, 42", g, l)
>>> g["a"]
123
>>> l
{"b": 42}

(If you display the value of the entire g, it would be much longer, because exec and eval add the built-ins module as __builtins__ to the globals automatically if it is missing).

In Python 2, the official syntax for the exec statement is actually exec code in globals, locals, as in

>>> exec "global a; a, b = 123, 42" in g, l

However the alternate syntax exec(code, globals, locals) has always been accepted too (see below).

compile

The compile(source, filename, mode, flags=0, dont_inherit=False, optimize=-1) built-in can be used to speed up repeated invocations of the same code with exec or eval by compiling the source into a code object beforehand. The mode parameter controls the kind of code fragment the compile function accepts and the kind of bytecode it produces. The choices are "eval", "exec" and "single":

  • "eval" mode expects a single expression, and will produce bytecode that when run will return the value of that expression:

    >>> dis.dis(compile("a + b", "<string>", "eval"))
      1           0 LOAD_NAME                0 (a)
                  3 LOAD_NAME                1 (b)
                  6 BINARY_ADD
                  7 RETURN_VALUE
    
  • "exec" accepts any kinds of python constructs from single expressions to whole modules of code, and executes them as if they were module top-level statements. The code object returns None:

    >>> dis.dis(compile("a + b", "<string>", "exec"))
      1           0 LOAD_NAME                0 (a)
                  3 LOAD_NAME                1 (b)
                  6 BINARY_ADD
                  7 POP_TOP                             <- discard result
                  8 LOAD_CONST               0 (None)   <- load None on stack
                 11 RETURN_VALUE                        <- return top of stack
    
  • "single" is a limited form of "exec" which accepts a source code containing a single statement (or multiple statements separated by ;) if the last statement is an expression statement, the resulting bytecode also prints the repr of the value of that expression to the standard output(!).

    An if-elif-else chain, a loop with else, and try with its except, else and finally blocks is considered a single statement.

    A source fragment containing 2 top-level statements is an error for the "single", except in Python 2 there is a bug that sometimes allows multiple toplevel statements in the code; only the first is compiled; the rest are ignored:

    In Python 2.7.8:

    >>> exec(compile("a = 5
    a = 6", "<string>", "single"))
    >>> a
    5
    

    And in Python 3.4.2:

    >>> exec(compile("a = 5
    a = 6", "<string>", "single"))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "<string>", line 1
        a = 5
            ^
    SyntaxError: multiple statements found while compiling a single statement
    

    This is very useful for making interactive Python shells. However, the value of the expression is not returned, even if you eval the resulting code.

Thus greatest distinction of exec and eval actually comes from the compile function and its modes.


In addition to compiling source code to bytecode, compile supports compiling abstract syntax trees (parse trees of Python code) into code objects; and source code into abstract syntax trees (the ast.parse is written in Python and just calls compile(source, filename, mode, PyCF_ONLY_AST)); these are used for example for modifying source code on the fly, and also for dynamic code creation, as it is often easier to handle the code as a tree of nodes instead of lines of text in complex cases.


While eval only allows you to evaluate a string that contains a single expression, you can eval a whole statement, or even a whole module that has been compiled into bytecode; that is, with Python 2, print is a statement, and cannot be evalled directly:

>>> eval("for i in range(3): print("Python is cool")")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 1
    for i in range(3): print("Python is cool")
      ^
SyntaxError: invalid syntax

compile it with "exec" mode into a code object and you can eval it; the eval function will return None.

>>> code = compile("for i in range(3): print("Python is cool")",
                   "foo.py", "exec")
>>> eval(code)
Python is cool
Python is cool
Python is cool

If one looks into eval and exec source code in CPython 3, this is very evident; they both call PyEval_EvalCode with same arguments, the only difference being that exec explicitly returns None.

Syntax differences of exec between Python 2 and Python 3

One of the major differences in Python 2 is that exec is a statement and eval is a built-in function (both are built-in functions in Python 3). It is a well-known fact that the official syntax of exec in Python 2 is exec code [in globals[, locals]].

Unlike majority of the Python 2-to-3 porting guides seem to suggest, the exec statement in CPython 2 can be also used with syntax that looks exactly like the exec function invocation in Python 3. The reason is that Python 0.9.9 had the exec(code, globals, locals) built-in function! And that built-in function was replaced with exec statement somewhere before Python 1.0 release.

Since it was desirable to not break backwards compatibility with Python 0.9.9, Guido van Rossum added a compatibility hack in 1993: if the code was a tuple of length 2 or 3, and globals and locals were not passed into the exec statement otherwise, the code would be interpreted as if the 2nd and 3rd element of the tuple were the globals and locals respectively. The compatibility hack was not mentioned even in Python 1.4 documentation (the earliest available version online); and thus was not known to many writers of the porting guides and tools, until it was documented again in November 2012:

The first expression may also be a tuple of length 2 or 3. In this case, the optional parts must be omitted. The form exec(expr, globals) is equivalent to exec expr in globals, while the form exec(expr, globals, locals) is equivalent to exec expr in globals, locals. The tuple form of exec provides compatibility with Python 3, where exec is a function rather than a statement.

Yes, in CPython 2.7 that it is handily referred to as being a forward-compatibility option (why confuse people over that there is a backward compatibility option at all), when it actually had been there for backward-compatibility for two decades.

Thus while exec is a statement in Python 1 and Python 2, and a built-in function in Python 3 and Python 0.9.9,

>>> exec("print(a)", globals(), {"a": 42})
42

has had identical behaviour in possibly every widely released Python version ever; and works in Jython 2.5.2, PyPy 2.3.1 (Python 2.7.6) and IronPython 2.6.1 too (kudos to them following the undocumented behaviour of CPython closely).

What you cannot do in Pythons 1.0 - 2.7 with its compatibility hack, is to store the return value of exec into a variable:

Python 2.7.11+ (default, Apr 17 2016, 14:00:29) 
[GCC 5.3.1 20160413] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> a = exec("print(42)")
  File "<stdin>", line 1
    a = exec("print(42)")
           ^
SyntaxError: invalid syntax

(which wouldn"t be useful in Python 3 either, as exec always returns None), or pass a reference to exec:

>>> call_later(exec, "print(42)", delay=1000)
  File "<stdin>", line 1
    call_later(exec, "print(42)", delay=1000)
                  ^
SyntaxError: invalid syntax

Which a pattern that someone might actually have used, though unlikely;

Or use it in a list comprehension:

>>> [exec(i) for i in ["print(42)", "print(foo)"]
  File "<stdin>", line 1
    [exec(i) for i in ["print(42)", "print(foo)"]
        ^
SyntaxError: invalid syntax

which is abuse of list comprehensions (use a for loop instead!).

Split models.py into several files: StackOverflow Questions

How do I list all files of a directory?

How can I list all files of a directory in Python and add them to a list?

Importing files from different folder

I have the following folder structure.

application
├── app
│   └── folder
│       └── file.py
└── app2
    └── some_folder
        └── some_file.py

I want to import some functions from file.py in some_file.py.

I"ve tried

from application.app.folder.file import func_name

and some other various attempts but so far I couldn"t manage to import properly. How can I do this?

If Python is interpreted, what are .pyc files?

I"ve been given to understand that Python is an interpreted language...
However, when I look at my Python source code I see .pyc files, which Windows identifies as "Compiled Python Files".

Where do these come in?

Find all files in a directory with extension .txt in Python

How can I find all the files in a directory having the extension .txt in python?

How to import other Python files?

How do I import other files in Python?

  1. How exactly can I import a specific python file like import file.py?
  2. How can I import a folder instead of a specific file?
  3. I want to load a Python file dynamically at runtime, based on user input.
  4. I want to know how to load just one specific part from the file.

For example, in main.py I have:

from extra import * 

Although this gives me all the definitions in extra.py, when maybe all I want is a single definition:

def gap():
    print
    print

What do I add to the import statement to just get gap from extra.py?

How to use glob() to find files recursively?

This is what I have:

glob(os.path.join("src","*.c"))

but I want to search the subfolders of src. Something like this would work:

glob(os.path.join("src","*.c"))
glob(os.path.join("src","*","*.c"))
glob(os.path.join("src","*","*","*.c"))
glob(os.path.join("src","*","*","*","*.c"))

But this is obviously limited and clunky.

How can I open multiple files using "with open" in Python?

I want to change a couple of files at one time, iff I can write to all of them. I"m wondering if I somehow can combine the multiple open calls with the with statement:

try:
  with open("a", "w") as a and open("b", "w") as b:
    do_something()
except IOError as e:
  print "Operation failed: %s" % e.strerror

If that"s not possible, what would an elegant solution to this problem look like?

How can I iterate over files in a given directory?

I need to iterate through all .asm files inside a given directory and do some actions on them.

How can this be done in a efficient way?

How to serve static files in Flask

So this is embarrassing. I"ve got an application that I threw together in Flask and for now it is just serving up a single static HTML page with some links to CSS and JS. And I can"t find where in the documentation Flask describes returning static files. Yes, I could use render_template but I know the data is not templatized. I"d have thought send_file or url_for was the right thing, but I could not get those to work. In the meantime, I am opening the files, reading content, and rigging up a Response with appropriate mimetype:

import os.path

from flask import Flask, Response


app = Flask(__name__)
app.config.from_object(__name__)


def root_dir():  # pragma: no cover
    return os.path.abspath(os.path.dirname(__file__))


def get_file(filename):  # pragma: no cover
    try:
        src = os.path.join(root_dir(), filename)
        # Figure out how flask returns static files
        # Tried:
        # - render_template
        # - send_file
        # This should not be so non-obvious
        return open(src).read()
    except IOError as exc:
        return str(exc)


@app.route("/", methods=["GET"])
def metrics():  # pragma: no cover
    content = get_file("jenkins_analytics.html")
    return Response(content, mimetype="text/html")


@app.route("/", defaults={"path": ""})
@app.route("/<path:path>")
def get_resource(path):  # pragma: no cover
    mimetypes = {
        ".css": "text/css",
        ".html": "text/html",
        ".js": "application/javascript",
    }
    complete_path = os.path.join(root_dir(), path)
    ext = os.path.splitext(path)[1]
    mimetype = mimetypes.get(ext, "text/html")
    content = get_file(complete_path)
    return Response(content, mimetype=mimetype)


if __name__ == "__main__":  # pragma: no cover
    app.run(port=80)

Someone want to give a code sample or url for this? I know this is going to be dead simple.

Unzipping files in Python

I read through the zipfile documentation, but couldn"t understand how to unzip a file, only how to zip a file. How do I unzip all the contents of a zip file into the same directory?

Answer #1

Recommendation for beginners:

This is my personal recommendation for beginners: start by learning virtualenv and pip, tools which work with both Python 2 and 3 and in a variety of situations, and pick up other tools once you start needing them.

PyPI packages not in the standard library:

  • virtualenv is a very popular tool that creates isolated Python environments for Python libraries. If you"re not familiar with this tool, I highly recommend learning it, as it is a very useful tool, and I"ll be making comparisons to it for the rest of this answer.

It works by installing a bunch of files in a directory (eg: env/), and then modifying the PATH environment variable to prefix it with a custom bin directory (eg: env/bin/). An exact copy of the python or python3 binary is placed in this directory, but Python is programmed to look for libraries relative to its path first, in the environment directory. It"s not part of Python"s standard library, but is officially blessed by the PyPA (Python Packaging Authority). Once activated, you can install packages in the virtual environment using pip.

  • pyenv is used to isolate Python versions. For example, you may want to test your code against Python 2.7, 3.6, 3.7 and 3.8, so you"ll need a way to switch between them. Once activated, it prefixes the PATH environment variable with ~/.pyenv/shims, where there are special files matching the Python commands (python, pip). These are not copies of the Python-shipped commands; they are special scripts that decide on the fly which version of Python to run based on the PYENV_VERSION environment variable, or the .python-version file, or the ~/.pyenv/version file. pyenv also makes the process of downloading and installing multiple Python versions easier, using the command pyenv install.

  • pyenv-virtualenv is a plugin for pyenv by the same author as pyenv, to allow you to use pyenv and virtualenv at the same time conveniently. However, if you"re using Python 3.3 or later, pyenv-virtualenv will try to run python -m venv if it is available, instead of virtualenv. You can use virtualenv and pyenv together without pyenv-virtualenv, if you don"t want the convenience features.

  • virtualenvwrapper is a set of extensions to virtualenv (see docs). It gives you commands like mkvirtualenv, lssitepackages, and especially workon for switching between different virtualenv directories. This tool is especially useful if you want multiple virtualenv directories.

  • pyenv-virtualenvwrapper is a plugin for pyenv by the same author as pyenv, to conveniently integrate virtualenvwrapper into pyenv.

  • pipenv aims to combine Pipfile, pip and virtualenv into one command on the command-line. The virtualenv directory typically gets placed in ~/.local/share/virtualenvs/XXX, with XXX being a hash of the path of the project directory. This is different from virtualenv, where the directory is typically in the current working directory. pipenv is meant to be used when developing Python applications (as opposed to libraries). There are alternatives to pipenv, such as poetry, which I won"t list here since this question is only about the packages that are similarly named.

Standard library:

  • pyvenv (not to be confused with pyenv in the previous section) is a script shipped with Python 3 but deprecated in Python 3.6 as it had problems (not to mention the confusing name). In Python 3.6+, the exact equivalent is python3 -m venv.

  • venv is a package shipped with Python 3, which you can run using python3 -m venv (although for some reason some distros separate it out into a separate distro package, such as python3-venv on Ubuntu/Debian). It serves the same purpose as virtualenv, but only has a subset of its features (see a comparison here). virtualenv continues to be more popular than venv, especially since the former supports both Python 2 and 3.

Answer #2

os.listdir() - list in the current directory

With listdir in os module you get the files and the folders in the current dir

 import os
 arr = os.listdir()
 print(arr)
 
 >>> ["$RECYCLE.BIN", "work.txt", "3ebooks.txt", "documents"]

Looking in a directory

arr = os.listdir("c:\files")

glob from glob

with glob you can specify a type of file to list like this

import glob

txtfiles = []
for file in glob.glob("*.txt"):
    txtfiles.append(file)

glob in a list comprehension

mylist = [f for f in glob.glob("*.txt")]

get the full path of only files in the current directory

import os
from os import listdir
from os.path import isfile, join

cwd = os.getcwd()
onlyfiles = [os.path.join(cwd, f) for f in os.listdir(cwd) if 
os.path.isfile(os.path.join(cwd, f))]
print(onlyfiles) 

["G:\getfilesname\getfilesname.py", "G:\getfilesname\example.txt"]

Getting the full path name with os.path.abspath

You get the full path in return

 import os
 files_path = [os.path.abspath(x) for x in os.listdir()]
 print(files_path)
 
 ["F:\documentiapplications.txt", "F:\documenticollections.txt"]

Walk: going through sub directories

os.walk returns the root, the directories list and the files list, that is why I unpacked them in r, d, f in the for loop; it, then, looks for other files and directories in the subfolders of the root and so on until there are no subfolders.

import os

# Getting the current work directory (cwd)
thisdir = os.getcwd()

# r=root, d=directories, f = files
for r, d, f in os.walk(thisdir):
    for file in f:
        if file.endswith(".docx"):
            print(os.path.join(r, file))

os.listdir(): get files in the current directory (Python 2)

In Python 2, if you want the list of the files in the current directory, you have to give the argument as "." or os.getcwd() in the os.listdir method.

 import os
 arr = os.listdir(".")
 print(arr)
 
 >>> ["$RECYCLE.BIN", "work.txt", "3ebooks.txt", "documents"]

To go up in the directory tree

# Method 1
x = os.listdir("..")

# Method 2
x= os.listdir("/")

Get files: os.listdir() in a particular directory (Python 2 and 3)

 import os
 arr = os.listdir("F:\python")
 print(arr)
 
 >>> ["$RECYCLE.BIN", "work.txt", "3ebooks.txt", "documents"]

Get files of a particular subdirectory with os.listdir()

import os

x = os.listdir("./content")

os.walk(".") - current directory

 import os
 arr = next(os.walk("."))[2]
 print(arr)
 
 >>> ["5bs_Turismo1.pdf", "5bs_Turismo1.pptx", "esperienza.txt"]

next(os.walk(".")) and os.path.join("dir", "file")

 import os
 arr = []
 for d,r,f in next(os.walk("F:\_python")):
     for file in f:
         arr.append(os.path.join(r,file))

 for f in arr:
     print(files)

>>> F:\_python\dict_class.py
>>> F:\_python\programmi.txt

next(os.walk("F:\") - get the full path - list comprehension

 [os.path.join(r,file) for r,d,f in next(os.walk("F:\_python")) for file in f]
 
 >>> ["F:\_python\dict_class.py", "F:\_python\programmi.txt"]

os.walk - get full path - all files in sub dirs**

x = [os.path.join(r,file) for r,d,f in os.walk("F:\_python") for file in f]
print(x)

>>> ["F:\_python\dict.py", "F:\_python\progr.txt", "F:\_python\readl.py"]

os.listdir() - get only txt files

 arr_txt = [x for x in os.listdir() if x.endswith(".txt")]
 print(arr_txt)
 
 >>> ["work.txt", "3ebooks.txt"]

Using glob to get the full path of the files

If I should need the absolute path of the files:

from path import path
from glob import glob
x = [path(f).abspath() for f in glob("F:\*.txt")]
for f in x:
    print(f)

>>> F:acquistionline.txt
>>> F:acquisti_2018.txt
>>> F:ootstrap_jquery_ecc.txt

Using os.path.isfile to avoid directories in the list

import os.path
listOfFiles = [f for f in os.listdir() if os.path.isfile(f)]
print(listOfFiles)

>>> ["a simple game.py", "data.txt", "decorator.py"]

Using pathlib from Python 3.4

import pathlib

flist = []
for p in pathlib.Path(".").iterdir():
    if p.is_file():
        print(p)
        flist.append(p)

 >>> error.PNG
 >>> exemaker.bat
 >>> guiprova.mp3
 >>> setup.py
 >>> speak_gui2.py
 >>> thumb.PNG

With list comprehension:

flist = [p for p in pathlib.Path(".").iterdir() if p.is_file()]

Alternatively, use pathlib.Path() instead of pathlib.Path(".")

Use glob method in pathlib.Path()

import pathlib

py = pathlib.Path().glob("*.py")
for file in py:
    print(file)

>>> stack_overflow_list.py
>>> stack_overflow_list_tkinter.py

Get all and only files with os.walk

import os
x = [i[2] for i in os.walk(".")]
y=[]
for t in x:
    for f in t:
        y.append(f)
print(y)

>>> ["append_to_list.py", "data.txt", "data1.txt", "data2.txt", "data_180617", "os_walk.py", "READ2.py", "read_data.py", "somma_defaltdic.py", "substitute_words.py", "sum_data.py", "data.txt", "data1.txt", "data_180617"]

Get only files with next and walk in a directory

 import os
 x = next(os.walk("F://python"))[2]
 print(x)
 
 >>> ["calculator.bat","calculator.py"]

Get only directories with next and walk in a directory

 import os
 next(os.walk("F://python"))[1] # for the current dir use (".")
 
 >>> ["python3","others"]

Get all the subdir names with walk

for r,d,f in os.walk("F:\_python"):
    for dirs in d:
        print(dirs)

>>> .vscode
>>> pyexcel
>>> pyschool.py
>>> subtitles
>>> _metaprogramming
>>> .ipynb_checkpoints

os.scandir() from Python 3.5 and greater

import os
x = [f.name for f in os.scandir() if f.is_file()]
print(x)

>>> ["calculator.bat","calculator.py"]

# Another example with scandir (a little variation from docs.python.org)
# This one is more efficient than os.listdir.
# In this case, it shows the files only in the current directory
# where the script is executed.

import os
with os.scandir() as i:
    for entry in i:
        if entry.is_file():
            print(entry.name)

>>> ebookmaker.py
>>> error.PNG
>>> exemaker.bat
>>> guiprova.mp3
>>> setup.py
>>> speakgui4.py
>>> speak_gui2.py
>>> speak_gui3.py
>>> thumb.PNG

Examples:

Ex. 1: How many files are there in the subdirectories?

In this example, we look for the number of files that are included in all the directory and its subdirectories.

import os

def count(dir, counter=0):
    "returns number of files in dir and subdirs"
    for pack in os.walk(dir):
        for f in pack[2]:
            counter += 1
    return dir + " : " + str(counter) + "files"

print(count("F:\python"))

>>> "F:\python" : 12057 files"

Ex.2: How to copy all files from a directory to another?

A script to make order in your computer finding all files of a type (default: pptx) and copying them in a new folder.

import os
import shutil
from path import path

destination = "F:\file_copied"
# os.makedirs(destination)

def copyfile(dir, filetype="pptx", counter=0):
    "Searches for pptx (or other - pptx is the default) files and copies them"
    for pack in os.walk(dir):
        for f in pack[2]:
            if f.endswith(filetype):
                fullpath = pack[0] + "\" + f
                print(fullpath)
                shutil.copy(fullpath, destination)
                counter += 1
    if counter > 0:
        print("-" * 30)
        print("	==> Found in: `" + dir + "` : " + str(counter) + " files
")

for dir in os.listdir():
    "searches for folders that starts with `_`"
    if dir[0] == "_":
        # copyfile(dir, filetype="pdf")
        copyfile(dir, filetype="txt")


>>> _compiti18Compito Contabilità 1conti.txt
>>> _compiti18Compito Contabilità 1modula4.txt
>>> _compiti18Compito Contabilità 1moduloa4.txt
>>> ------------------------
>>> ==> Found in: `_compiti18` : 3 files

Ex. 3: How to get all the files in a txt file

In case you want to create a txt file with all the file names:

import os
mylist = ""
with open("filelist.txt", "w", encoding="utf-8") as file:
    for eachfile in os.listdir():
        mylist += eachfile + "
"
    file.write(mylist)

Example: txt with all the files of an hard drive

"""
We are going to save a txt file with all the files in your directory.
We will use the function walk()
"""

import os

# see all the methods of os
# print(*dir(os), sep=", ")
listafile = []
percorso = []
with open("lista_file.txt", "w", encoding="utf-8") as testo:
    for root, dirs, files in os.walk("D:\"):
        for file in files:
            listafile.append(file)
            percorso.append(root + "\" + file)
            testo.write(file + "
")
listafile.sort()
print("N. of files", len(listafile))
with open("lista_file_ordinata.txt", "w", encoding="utf-8") as testo_ordinato:
    for file in listafile:
        testo_ordinato.write(file + "
")

with open("percorso.txt", "w", encoding="utf-8") as file_percorso:
    for file in percorso:
        file_percorso.write(file + "
")

os.system("lista_file.txt")
os.system("lista_file_ordinata.txt")
os.system("percorso.txt")

All the file of C: in one text file

This is a shorter version of the previous code. Change the folder where to start finding the files if you need to start from another position. This code generate a 50 mb on text file on my computer with something less then 500.000 lines with files with the complete path.

import os

with open("file.txt", "w", encoding="utf-8") as filewrite:
    for r, d, f in os.walk("C:\"):
        for file in f:
            filewrite.write(f"{r + file}
")

How to write a file with all paths in a folder of a type

With this function you can create a txt file that will have the name of a type of file that you look for (ex. pngfile.txt) with all the full path of all the files of that type. It can be useful sometimes, I think.

import os

def searchfiles(extension=".ttf", folder="H:\"):
    "Create a txt file with all the file of a type"
    with open(extension[1:] + "file.txt", "w", encoding="utf-8") as filewrite:
        for r, d, f in os.walk(folder):
            for file in f:
                if file.endswith(extension):
                    filewrite.write(f"{r + file}
")

# looking for png file (fonts) in the hard disk H:
searchfiles(".png", "H:\")

>>> H:4bs_18Dolphins5.png
>>> H:4bs_18Dolphins6.png
>>> H:4bs_18Dolphins7.png
>>> H:5_18marketing htmlassetsimageslogo2.png
>>> H:7z001.png
>>> H:7z002.png

(New) Find all files and open them with tkinter GUI

I just wanted to add in this 2019 a little app to search for all files in a dir and be able to open them by doubleclicking on the name of the file in the list. enter image description here

import tkinter as tk
import os

def searchfiles(extension=".txt", folder="H:\"):
    "insert all files in the listbox"
    for r, d, f in os.walk(folder):
        for file in f:
            if file.endswith(extension):
                lb.insert(0, r + "\" + file)

def open_file():
    os.startfile(lb.get(lb.curselection()[0]))

root = tk.Tk()
root.geometry("400x400")
bt = tk.Button(root, text="Search", command=lambda:searchfiles(".png", "H:\"))
bt.pack()
lb = tk.Listbox(root)
lb.pack(fill="both", expand=1)
lb.bind("<Double-Button>", lambda x: open_file())
root.mainloop()

Answer #3

-----> pip install gensim config --global http.sslVerify false

Just install any package with the "config --global http.sslVerify false" statement

You can ignore SSL errors by setting pypi.org and files.pythonhosted.org as trusted hosts.

$ pip install --trusted-host pypi.org --trusted-host files.pythonhosted.org <package_name>

Note: Sometime during April 2018, the Python Package Index was migrated from pypi.python.org to pypi.org. This means "trusted-host" commands using the old domain no longer work.

Permanent Fix

Since the release of pip 10.0, you should be able to fix this permanently just by upgrading pip itself:

$ pip install --trusted-host pypi.org --trusted-host files.pythonhosted.org pip setuptools

Or by just reinstalling it to get the latest version:

$ curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py

(… and then running get-pip.py with the relevant Python interpreter).

pip install <otherpackage> should just work after this. If not, then you will need to do more, as explained below.


You may want to add the trusted hosts and proxy to your config file.

pip.ini (Windows) or pip.conf (unix)

[global]
trusted-host = pypi.python.org
               pypi.org
               files.pythonhosted.org

Alternate Solutions (Less secure)

Most of the answers could pose a security issue.

Two of the workarounds that help in installing most of the python packages with ease would be:

  • Using easy_install: if you are really lazy and don"t want to waste much time, use easy_install <package_name>. Note that some packages won"t be found or will give small errors.
  • Using Wheel: download the Wheel of the python package and use the pip command pip install wheel_package_name.whl to install the package.

Answer #4

It helps to install a python package foo on your machine (can also be in virtualenv) so that you can import the package foo from other projects and also from [I]Python prompts.

It does the similar job of pip, easy_install etc.,


Using setup.py

Let"s start with some definitions:

Package - A folder/directory that contains __init__.py file.
Module - A valid python file with .py extension.
Distribution - How one package relates to other packages and modules.

Let"s say you want to install a package named foo. Then you do,

$ git clone https://github.com/user/foo  
$ cd foo
$ python setup.py install

Instead, if you don"t want to actually install it but still would like to use it. Then do,

$ python setup.py develop  

This command will create symlinks to the source directory within site-packages instead of copying things. Because of this, it is quite fast (particularly for large packages).


Creating setup.py

If you have your package tree like,

foo
├── foo
│   ├── data_struct.py
│   ├── __init__.py
│   └── internals.py
├── README
├── requirements.txt
└── setup.py

Then, you do the following in your setup.py script so that it can be installed on some machine:

from setuptools import setup

setup(
   name="foo",
   version="1.0",
   description="A useful module",
   author="Man Foo",
   author_email="[email protected]",
   packages=["foo"],  #same as name
   install_requires=["wheel", "bar", "greek"], #external packages as dependencies
)

Instead, if your package tree is more complex like the one below:

foo
├── foo
│   ├── data_struct.py
│   ├── __init__.py
│   └── internals.py
├── README
├── requirements.txt
├── scripts
│   ├── cool
│   └── skype
└── setup.py

Then, your setup.py in this case would be like:

from setuptools import setup

setup(
   name="foo",
   version="1.0",
   description="A useful module",
   author="Man Foo",
   author_email="[email protected]",
   packages=["foo"],  #same as name
   install_requires=["wheel", "bar", "greek"], #external packages as dependencies
   scripts=[
            "scripts/cool",
            "scripts/skype",
           ]
)

Add more stuff to (setup.py) & make it decent:

from setuptools import setup

with open("README", "r") as f:
    long_description = f.read()

setup(
   name="foo",
   version="1.0",
   description="A useful module",
   license="MIT",
   long_description=long_description,
   author="Man Foo",
   author_email="[email protected]",
   url="http://www.foopackage.com/",
   packages=["foo"],  #same as name
   install_requires=["wheel", "bar", "greek"], #external packages as dependencies
   scripts=[
            "scripts/cool",
            "scripts/skype",
           ]
)

The long_description is used in pypi.org as the README description of your package.


And finally, you"re now ready to upload your package to PyPi.org so that others can install your package using pip install yourpackage.

At this point there are two options.

  • publish in the temporary test.pypi.org server to make oneself familiarize with the procedure, and then publish it on the permanent pypi.org server for the public to use your package.
  • publish straight away on the permanent pypi.org server, if you are already familiar with the procedure and have your user credentials (e.g., username, password, package name)

Once your package name is registered in pypi.org, nobody can claim or use it. Python packaging suggests the twine package for uploading purposes (of your package to PyPi). Thus,

(1) the first step is to locally build the distributions using:

# prereq: wheel (pip install wheel)  
$ python setup.py sdist bdist_wheel   

(2) then using twine for uploading either to test.pypi.org or pypi.org:

$ twine upload --repository testpypi dist/*  
username: ***  
password: ***  

It will take few minutes for the package to appear on test.pypi.org. Once you"re satisfied with it, you can then upload your package to the real & permanent index of pypi.org simply with:

$ twine upload dist/*  

Optionally, you can also sign the files in your package with a GPG by:

$ twine upload dist/* --sign 

Bonus Reading:

Answer #5

tl;dr / quick fix

  • Don"t decode/encode willy nilly
  • Don"t assume your strings are UTF-8 encoded
  • Try to convert strings to Unicode strings as soon as possible in your code
  • Fix your locale: How to solve UnicodeDecodeError in Python 3.6?
  • Don"t be tempted to use quick reload hacks

Unicode Zen in Python 2.x - The Long Version

Without seeing the source it"s difficult to know the root cause, so I"ll have to speak generally.

UnicodeDecodeError: "ascii" codec can"t decode byte generally happens when you try to convert a Python 2.x str that contains non-ASCII to a Unicode string without specifying the encoding of the original string.

In brief, Unicode strings are an entirely separate type of Python string that does not contain any encoding. They only hold Unicode point codes and therefore can hold any Unicode point from across the entire spectrum. Strings contain encoded text, beit UTF-8, UTF-16, ISO-8895-1, GBK, Big5 etc. Strings are decoded to Unicode and Unicodes are encoded to strings. Files and text data are always transferred in encoded strings.

The Markdown module authors probably use unicode() (where the exception is thrown) as a quality gate to the rest of the code - it will convert ASCII or re-wrap existing Unicodes strings to a new Unicode string. The Markdown authors can"t know the encoding of the incoming string so will rely on you to decode strings to Unicode strings before passing to Markdown.

Unicode strings can be declared in your code using the u prefix to strings. E.g.

>>> my_u = u"my ünicôdé strįng"
>>> type(my_u)
<type "unicode">

Unicode strings may also come from file, databases and network modules. When this happens, you don"t need to worry about the encoding.

Gotchas

Conversion from str to Unicode can happen even when you don"t explicitly call unicode().

The following scenarios cause UnicodeDecodeError exceptions:

# Explicit conversion without encoding
unicode("€")

# New style format string into Unicode string
# Python will try to convert value string to Unicode first
u"The currency is: {}".format("€")

# Old style format string into Unicode string
# Python will try to convert value string to Unicode first
u"The currency is: %s" % "€"

# Append string to Unicode
# Python will try to convert string to Unicode first
u"The currency is: " + "€"         

Examples

In the following diagram, you can see how the word café has been encoded in either "UTF-8" or "Cp1252" encoding depending on the terminal type. In both examples, caf is just regular ascii. In UTF-8, é is encoded using two bytes. In "Cp1252", é is 0xE9 (which is also happens to be the Unicode point value (it"s no coincidence)). The correct decode() is invoked and conversion to a Python Unicode is successfull: Diagram of a string being converted to a Python Unicode string

In this diagram, decode() is called with ascii (which is the same as calling unicode() without an encoding given). As ASCII can"t contain bytes greater than 0x7F, this will throw a UnicodeDecodeError exception:

Diagram of a string being converted to a Python Unicode string with the wrong encoding

The Unicode Sandwich

It"s good practice to form a Unicode sandwich in your code, where you decode all incoming data to Unicode strings, work with Unicodes, then encode to strs on the way out. This saves you from worrying about the encoding of strings in the middle of your code.

Input / Decode

Source code

If you need to bake non-ASCII into your source code, just create Unicode strings by prefixing the string with a u. E.g.

u"Zürich"

To allow Python to decode your source code, you will need to add an encoding header to match the actual encoding of your file. For example, if your file was encoded as "UTF-8", you would use:

# encoding: utf-8

This is only necessary when you have non-ASCII in your source code.

Files

Usually non-ASCII data is received from a file. The io module provides a TextWrapper that decodes your file on the fly, using a given encoding. You must use the correct encoding for the file - it can"t be easily guessed. For example, for a UTF-8 file:

import io
with io.open("my_utf8_file.txt", "r", encoding="utf-8") as my_file:
     my_unicode_string = my_file.read() 

my_unicode_string would then be suitable for passing to Markdown. If a UnicodeDecodeError from the read() line, then you"ve probably used the wrong encoding value.

CSV Files

The Python 2.7 CSV module does not support non-ASCII characters üò©. Help is at hand, however, with https://pypi.python.org/pypi/backports.csv.

Use it like above but pass the opened file to it:

from backports import csv
import io
with io.open("my_utf8_file.txt", "r", encoding="utf-8") as my_file:
    for row in csv.reader(my_file):
        yield row

Databases

Most Python database drivers can return data in Unicode, but usually require a little configuration. Always use Unicode strings for SQL queries.

MySQL

In the connection string add:

charset="utf8",
use_unicode=True

E.g.

>>> db = MySQLdb.connect(host="localhost", user="root", passwd="passwd", db="sandbox", use_unicode=True, charset="utf8")
PostgreSQL

Add:

psycopg2.extensions.register_type(psycopg2.extensions.UNICODE)
psycopg2.extensions.register_type(psycopg2.extensions.UNICODEARRAY)

HTTP

Web pages can be encoded in just about any encoding. The Content-type header should contain a charset field to hint at the encoding. The content can then be decoded manually against this value. Alternatively, Python-Requests returns Unicodes in response.text.

Manually

If you must decode strings manually, you can simply do my_string.decode(encoding), where encoding is the appropriate encoding. Python 2.x supported codecs are given here: Standard Encodings. Again, if you get UnicodeDecodeError then you"ve probably got the wrong encoding.

The meat of the sandwich

Work with Unicodes as you would normal strs.

Output

stdout / printing

print writes through the stdout stream. Python tries to configure an encoder on stdout so that Unicodes are encoded to the console"s encoding. For example, if a Linux shell"s locale is en_GB.UTF-8, the output will be encoded to UTF-8. On Windows, you will be limited to an 8bit code page.

An incorrectly configured console, such as corrupt locale, can lead to unexpected print errors. PYTHONIOENCODING environment variable can force the encoding for stdout.

Files

Just like input, io.open can be used to transparently convert Unicodes to encoded byte strings.

Database

The same configuration for reading will allow Unicodes to be written directly.

Python 3

Python 3 is no more Unicode capable than Python 2.x is, however it is slightly less confused on the topic. E.g the regular str is now a Unicode string and the old str is now bytes.

The default encoding is UTF-8, so if you .decode() a byte string without giving an encoding, Python 3 uses UTF-8 encoding. This probably fixes 50% of people"s Unicode problems.

Further, open() operates in text mode by default, so returns decoded str (Unicode ones). The encoding is derived from your locale, which tends to be UTF-8 on Un*x systems or an 8-bit code page, such as windows-1251, on Windows boxes.

Why you shouldn"t use sys.setdefaultencoding("utf8")

It"s a nasty hack (there"s a reason you have to use reload) that will only mask problems and hinder your migration to Python 3.x. Understand the problem, fix the root cause and enjoy Unicode zen. See Why should we NOT use sys.setdefaultencoding("utf-8") in a py script? for further details

Answer #6

You can"t.

One workaround is to create clone a new environment and then remove the original one.

First, remember to deactivate your current environment. You can do this with the commands:

  • deactivate on Windows or
  • source deactivate on macOS/Linux.

Then:

conda create --name new_name --clone old_name
conda remove --name old_name --all # or its alias: `conda env remove --name old_name`

Notice there are several drawbacks of this method:

  1. It redownloads packages (you can use --offline flag to disable it)
  2. Time consumed on copying environment"s files
  3. Temporary double disk usage

There is an open issue requesting this feature.

Answer #7

Explanation

From PEP 328

Relative imports use a module"s __name__ attribute to determine that module"s position in the package hierarchy. If the module"s name does not contain any package information (e.g. it is set to "__main__") then relative imports are resolved as if the module were a top level module, regardless of where the module is actually located on the file system.

At some point PEP 338 conflicted with PEP 328:

... relative imports rely on __name__ to determine the current module"s position in the package hierarchy. In a main module, the value of __name__ is always "__main__", so explicit relative imports will always fail (as they only work for a module inside a package)

and to address the issue, PEP 366 introduced the top level variable __package__:

By adding a new module level attribute, this PEP allows relative imports to work automatically if the module is executed using the -m switch. A small amount of boilerplate in the module itself will allow the relative imports to work when the file is executed by name. [...] When it [the attribute] is present, relative imports will be based on this attribute rather than the module __name__ attribute. [...] When the main module is specified by its filename, then the __package__ attribute will be set to None. [...] When the import system encounters an explicit relative import in a module without __package__ set (or with it set to None), it will calculate and store the correct value (__name__.rpartition(".")[0] for normal modules and __name__ for package initialisation modules)

(emphasis mine)

If the __name__ is "__main__", __name__.rpartition(".")[0] returns empty string. This is why there"s empty string literal in the error description:

SystemError: Parent module "" not loaded, cannot perform relative import

The relevant part of the CPython"s PyImport_ImportModuleLevelObject function:

if (PyDict_GetItem(interp->modules, package) == NULL) {
    PyErr_Format(PyExc_SystemError,
            "Parent module %R not loaded, cannot perform relative "
            "import", package);
    goto error;
}

CPython raises this exception if it was unable to find package (the name of the package) in interp->modules (accessible as sys.modules). Since sys.modules is "a dictionary that maps module names to modules which have already been loaded", it"s now clear that the parent module must be explicitly absolute-imported before performing relative import.

Note: The patch from the issue 18018 has added another if block, which will be executed before the code above:

if (PyUnicode_CompareWithASCIIString(package, "") == 0) {
    PyErr_SetString(PyExc_ImportError,
            "attempted relative import with no known parent package");
    goto error;
} /* else if (PyDict_GetItem(interp->modules, package) == NULL) {
    ...
*/

If package (same as above) is empty string, the error message will be

ImportError: attempted relative import with no known parent package

However, you will only see this in Python 3.6 or newer.

Solution #1: Run your script using -m

Consider a directory (which is a Python package):

.
├── package
│   ├── __init__.py
│   ├── module.py
│   └── standalone.py

All of the files in package begin with the same 2 lines of code:

from pathlib import Path
print("Running" if __name__ == "__main__" else "Importing", Path(__file__).resolve())

I"m including these two lines only to make the order of operations obvious. We can ignore them completely, since they don"t affect the execution.

__init__.py and module.py contain only those two lines (i.e., they are effectively empty).

standalone.py additionally attempts to import module.py via relative import:

from . import module  # explicit relative import

We"re well aware that /path/to/python/interpreter package/standalone.py will fail. However, we can run the module with the -m command line option that will "search sys.path for the named module and execute its contents as the __main__ module":

[email protected]:~$ python3 -i -m package.standalone
Importing /home/vaultah/package/__init__.py
Running /home/vaultah/package/standalone.py
Importing /home/vaultah/package/module.py
>>> __file__
"/home/vaultah/package/standalone.py"
>>> __package__
"package"
>>> # The __package__ has been correctly set and module.py has been imported.
... # What"s inside sys.modules?
... import sys
>>> sys.modules["__main__"]
<module "package.standalone" from "/home/vaultah/package/standalone.py">
>>> sys.modules["package.module"]
<module "package.module" from "/home/vaultah/package/module.py">
>>> sys.modules["package"]
<module "package" from "/home/vaultah/package/__init__.py">

-m does all the importing stuff for you and automatically sets __package__, but you can do that yourself in the

Solution #2: Set __package__ manually

Please treat it as a proof of concept rather than an actual solution. It isn"t well-suited for use in real-world code.

PEP 366 has a workaround to this problem, however, it"s incomplete, because setting __package__ alone is not enough. You"re going to need to import at least N preceding packages in the module hierarchy, where N is the number of parent directories (relative to the directory of the script) that will be searched for the module being imported.

Thus,

  1. Add the parent directory of the Nth predecessor of the current module to sys.path

  2. Remove the current file"s directory from sys.path

  3. Import the parent module of the current module using its fully-qualified name

  4. Set __package__ to the fully-qualified name from 2

  5. Perform the relative import

I"ll borrow files from the Solution #1 and add some more subpackages:

package
├── __init__.py
├── module.py
└── subpackage
    ├── __init__.py
    └── subsubpackage
        ├── __init__.py
        └── standalone.py

This time standalone.py will import module.py from the package package using the following relative import

from ... import module  # N = 3

We"ll need to precede that line with the boilerplate code, to make it work.

import sys
from pathlib import Path

if __name__ == "__main__" and __package__ is None:
    file = Path(__file__).resolve()
    parent, top = file.parent, file.parents[3]

    sys.path.append(str(top))
    try:
        sys.path.remove(str(parent))
    except ValueError: # Already removed
        pass

    import package.subpackage.subsubpackage
    __package__ = "package.subpackage.subsubpackage"

from ... import module # N = 3

It allows us to execute standalone.py by filename:

[email protected]:~$ python3 package/subpackage/subsubpackage/standalone.py
Running /home/vaultah/package/subpackage/subsubpackage/standalone.py
Importing /home/vaultah/package/__init__.py
Importing /home/vaultah/package/subpackage/__init__.py
Importing /home/vaultah/package/subpackage/subsubpackage/__init__.py
Importing /home/vaultah/package/module.py

A more general solution wrapped in a function can be found here. Example usage:

if __name__ == "__main__" and __package__ is None:
    import_parents(level=3) # N = 3

from ... import module
from ...module.submodule import thing

Solution #3: Use absolute imports and setuptools

The steps are -

  1. Replace explicit relative imports with equivalent absolute imports

  2. Install package to make it importable

For instance, the directory structure may be as follows

.
├── project
│   ├── package
│   │   ├── __init__.py
│   │   ├── module.py
│   │   └── standalone.py
│   └── setup.py

where setup.py is

from setuptools import setup, find_packages
setup(
    name = "your_package_name",
    packages = find_packages(),
)

The rest of the files were borrowed from the Solution #1.

Installation will allow you to import the package regardless of your working directory (assuming there"ll be no naming issues).

We can modify standalone.py to use this advantage (step 1):

from package import module  # absolute import

Change your working directory to project and run /path/to/python/interpreter setup.py install --user (--user installs the package in your site-packages directory) (step 2):

[email protected]:~$ cd project
[email protected]:~/project$ python3 setup.py install --user

Let"s verify that it"s now possible to run standalone.py as a script:

[email protected]:~/project$ python3 -i package/standalone.py
Running /home/vaultah/project/package/standalone.py
Importing /home/vaultah/.local/lib/python3.6/site-packages/your_package_name-0.0.0-py3.6.egg/package/__init__.py
Importing /home/vaultah/.local/lib/python3.6/site-packages/your_package_name-0.0.0-py3.6.egg/package/module.py
>>> module
<module "package.module" from "/home/vaultah/.local/lib/python3.6/site-packages/your_package_name-0.0.0-py3.6.egg/package/module.py">
>>> import sys
>>> sys.modules["package"]
<module "package" from "/home/vaultah/.local/lib/python3.6/site-packages/your_package_name-0.0.0-py3.6.egg/package/__init__.py">
>>> sys.modules["package.module"]
<module "package.module" from "/home/vaultah/.local/lib/python3.6/site-packages/your_package_name-0.0.0-py3.6.egg/package/module.py">

Note: If you decide to go down this route, you"d be better off using virtual environments to install packages in isolation.

Solution #4: Use absolute imports and some boilerplate code

Frankly, the installation is not necessary - you could add some boilerplate code to your script to make absolute imports work.

I"m going to borrow files from Solution #1 and change standalone.py:

  1. Add the parent directory of package to sys.path before attempting to import anything from package using absolute imports:

    import sys
    from pathlib import Path # if you haven"t already done so
    file = Path(__file__).resolve()
    parent, root = file.parent, file.parents[1]
    sys.path.append(str(root))
    
    # Additionally remove the current file"s directory from sys.path
    try:
        sys.path.remove(str(parent))
    except ValueError: # Already removed
        pass
    
  2. Replace the relative import by the absolute import:

    from package import module  # absolute import
    

standalone.py runs without problems:

[email protected]:~$ python3 -i package/standalone.py
Running /home/vaultah/package/standalone.py
Importing /home/vaultah/package/__init__.py
Importing /home/vaultah/package/module.py
>>> module
<module "package.module" from "/home/vaultah/package/module.py">
>>> import sys
>>> sys.modules["package"]
<module "package" from "/home/vaultah/package/__init__.py">
>>> sys.modules["package.module"]
<module "package.module" from "/home/vaultah/package/module.py">

I feel that I should warn you: try not to do this, especially if your project has a complex structure.


As a side note, PEP 8 recommends the use of absolute imports, but states that in some scenarios explicit relative imports are acceptable:

Absolute imports are recommended, as they are usually more readable and tend to be better behaved (or at least give better error messages). [...] However, explicit relative imports are an acceptable alternative to absolute imports, especially when dealing with complex package layouts where using absolute imports would be unnecessarily verbose.

Answer #8

Is this the correct use of conftest.py?

Yes it is. Fixtures are a potential and common use of conftest.py. The fixtures that you will define will be shared among all tests in your test suite. However, defining fixtures in the root conftest.py might be useless and it would slow down testing if such fixtures are not used by all tests.

Does it have other uses?

Yes it does.

  • Fixtures: Define fixtures for static data used by tests. This data can be accessed by all tests in the suite unless specified otherwise. This could be data as well as helpers of modules which will be passed to all tests.

  • External plugin loading: conftest.py is used to import external plugins or modules. By defining the following global variable, pytest will load the module and make it available for its test. Plugins are generally files defined in your project or other modules which might be needed in your tests. You can also load a set of predefined plugins as explained here.

    pytest_plugins = "someapp.someplugin"

  • Hooks: You can specify hooks such as setup and teardown methods and much more to improve your tests. For a set of available hooks, read Hooks link. Example:

      def pytest_runtest_setup(item):
           """ called before ``pytest_runtest_call(item). """
           #do some stuff`
    
  • Test root path: This is a bit of a hidden feature. By defining conftest.py in your root path, you will have pytest recognizing your application modules without specifying PYTHONPATH. In the background, py.test modifies your sys.path by including all submodules which are found from the root path.

Can I have more than one conftest.py file?

Yes you can and it is strongly recommended if your test structure is somewhat complex. conftest.py files have directory scope. Therefore, creating targeted fixtures and helpers is good practice.

When would I want to do that? Examples will be appreciated.

Several cases could fit:

Creating a set of tools or hooks for a particular group of tests.

root/mod/conftest.py

def pytest_runtest_setup(item):
    print("I am mod")
    #do some stuff


test root/mod2/test.py will NOT produce "I am mod"

Loading a set of fixtures for some tests but not for others.

root/mod/conftest.py

@pytest.fixture()
def fixture():
    return "some stuff"

root/mod2/conftest.py

@pytest.fixture()
def fixture():
    return "some other stuff"

root/mod2/test.py

def test(fixture):
    print(fixture)

Will print "some other stuff".

Overriding hooks inherited from the root conftest.py.

root/mod/conftest.py

def pytest_runtest_setup(item):
    print("I am mod")
    #do some stuff

root/conftest.py

def pytest_runtest_setup(item):
    print("I am root")
    #do some stuff

By running any test inside root/mod, only "I am mod" is printed.

You can read more about conftest.py here.

EDIT:

What if I need plain-old helper functions to be called from a number of tests in different modules - will they be available to me if I put them in a conftest.py? Or should I simply put them in a helpers.py module and import and use it in my test modules?

You can use conftest.py to define your helpers. However, you should follow common practice. Helpers can be used as fixtures at least in pytest. For example in my tests I have a mock redis helper which I inject into my tests this way.

root/helper/redis/redis.py

@pytest.fixture
def mock_redis():
    return MockRedis()

root/tests/stuff/conftest.py

pytest_plugin="helper.redis.redis"

root/tests/stuff/test.py

def test(mock_redis):
    print(mock_redis.get("stuff"))

This will be a test module that you can freely import in your tests. NOTE that you could potentially name redis.py as conftest.py if your module redis contains more tests. However, that practice is discouraged because of ambiguity.

If you want to use conftest.py, you can simply put that helper in your root conftest.py and inject it when needed.

root/tests/conftest.py

@pytest.fixture
def mock_redis():
    return MockRedis()

root/tests/stuff/test.py

def test(mock_redis):
    print(mock_redis.get(stuff))

Another thing you can do is to write an installable plugin. In that case your helper can be written anywhere but it needs to define an entry point to be installed in your and other potential test frameworks. See this.

If you don"t want to use fixtures, you could of course define a simple helper and just use the plain old import wherever it is needed.

root/tests/helper/redis.py

class MockRedis():
    # stuff

root/tests/stuff/test.py

from helper.redis import MockRedis

def test():
    print(MockRedis().get(stuff))

However, here you might have problems with the path since the module is not in a child folder of the test. You should be able to overcome this (not tested) by adding an __init__.py to your helper

root/tests/helper/init.py

from .redis import MockRedis

Or simply adding the helper module to your PYTHONPATH.

Answer #9

I would suggest reading PEP 483 and PEP 484 and watching this presentation by Guido on type hinting.

In a nutshell: Type hinting is literally what the words mean. You hint the type of the object(s) you"re using.

Due to the dynamic nature of Python, inferring or checking the type of an object being used is especially hard. This fact makes it hard for developers to understand what exactly is going on in code they haven"t written and, most importantly, for type checking tools found in many IDEs (PyCharm and PyDev come to mind) that are limited due to the fact that they don"t have any indicator of what type the objects are. As a result they resort to trying to infer the type with (as mentioned in the presentation) around 50% success rate.


To take two important slides from the type hinting presentation:

Why type hints?

  1. Helps type checkers: By hinting at what type you want the object to be the type checker can easily detect if, for instance, you"re passing an object with a type that isn"t expected.
  2. Helps with documentation: A third person viewing your code will know what is expected where, ergo, how to use it without getting them TypeErrors.
  3. Helps IDEs develop more accurate and robust tools: Development Environments will be better suited at suggesting appropriate methods when know what type your object is. You have probably experienced this with some IDE at some point, hitting the . and having methods/attributes pop up which aren"t defined for an object.

Why use static type checkers?

  • Find bugs sooner: This is self-evident, I believe.
  • The larger your project the more you need it: Again, makes sense. Static languages offer a robustness and control that dynamic languages lack. The bigger and more complex your application becomes the more control and predictability (from a behavioral aspect) you require.
  • Large teams are already running static analysis: I"m guessing this verifies the first two points.

As a closing note for this small introduction: This is an optional feature and, from what I understand, it has been introduced in order to reap some of the benefits of static typing.

You generally do not need to worry about it and definitely don"t need to use it (especially in cases where you use Python as an auxiliary scripting language). It should be helpful when developing large projects as it offers much needed robustness, control and additional debugging capabilities.


Type hinting with mypy:

In order to make this answer more complete, I think a little demonstration would be suitable. I"ll be using mypy, the library which inspired Type Hints as they are presented in the PEP. This is mainly written for anybody bumping into this question and wondering where to begin.

Before I do that let me reiterate the following: PEP 484 doesn"t enforce anything; it is simply setting a direction for function annotations and proposing guidelines for how type checking can/should be performed. You can annotate your functions and hint as many things as you want; your scripts will still run regardless of the presence of annotations because Python itself doesn"t use them.

Anyways, as noted in the PEP, hinting types should generally take three forms:

  • Function annotations (PEP 3107).
  • Stub files for built-in/user modules.
  • Special # type: type comments that complement the first two forms. (See: What are variable annotations? for a Python 3.6 update for # type: type comments)

Additionally, you"ll want to use type hints in conjunction with the new typing module introduced in Py3.5. In it, many (additional) ABCs (abstract base classes) are defined along with helper functions and decorators for use in static checking. Most ABCs in collections.abc are included, but in a generic form in order to allow subscription (by defining a __getitem__() method).

For anyone interested in a more in-depth explanation of these, the mypy documentation is written very nicely and has a lot of code samples demonstrating/describing the functionality of their checker; it is definitely worth a read.

Function annotations and special comments:

First, it"s interesting to observe some of the behavior we can get when using special comments. Special # type: type comments can be added during variable assignments to indicate the type of an object if one cannot be directly inferred. Simple assignments are generally easily inferred but others, like lists (with regard to their contents), cannot.

Note: If we want to use any derivative of containers and need to specify the contents for that container we must use the generic types from the typing module. These support indexing.

# Generic List, supports indexing.
from typing import List

# In this case, the type is easily inferred as type: int.
i = 0

# Even though the type can be inferred as of type list
# there is no way to know the contents of this list.
# By using type: List[str] we indicate we want to use a list of strings.
a = []  # type: List[str]

# Appending an int to our list
# is statically not correct.
a.append(i)

# Appending a string is fine.
a.append("i")

print(a)  # [0, "i"]

If we add these commands to a file and execute them with our interpreter, everything works just fine and print(a) just prints the contents of list a. The # type comments have been discarded, treated as plain comments which have no additional semantic meaning.

By running this with mypy, on the other hand, we get the following response:

(Python3)[email protected]: mypy typeHintsCode.py
typesInline.py:14: error: Argument 1 to "append" of "list" has incompatible type "int"; expected "str"

Indicating that a list of str objects cannot contain an int, which, statically speaking, is sound. This can be fixed by either abiding to the type of a and only appending str objects or by changing the type of the contents of a to indicate that any value is acceptable (Intuitively performed with List[Any] after Any has been imported from typing).

Function annotations are added in the form param_name : type after each parameter in your function signature and a return type is specified using the -> type notation before the ending function colon; all annotations are stored in the __annotations__ attribute for that function in a handy dictionary form. Using a trivial example (which doesn"t require extra types from the typing module):

def annotated(x: int, y: str) -> bool:
    return x < y

The annotated.__annotations__ attribute now has the following values:

{"y": <class "str">, "return": <class "bool">, "x": <class "int">}

If we"re a complete newbie, or we are familiar with Python 2.7 concepts and are consequently unaware of the TypeError lurking in the comparison of annotated, we can perform another static check, catch the error and save us some trouble:

(Python3)[email protected]: mypy typeHintsCode.py
typeFunction.py: note: In function "annotated":
typeFunction.py:2: error: Unsupported operand types for > ("str" and "int")

Among other things, calling the function with invalid arguments will also get caught:

annotated(20, 20)

# mypy complains:
typeHintsCode.py:4: error: Argument 2 to "annotated" has incompatible type "int"; expected "str"

These can be extended to basically any use case and the errors caught extend further than basic calls and operations. The types you can check for are really flexible and I have merely given a small sneak peak of its potential. A look in the typing module, the PEPs or the mypy documentation will give you a more comprehensive idea of the capabilities offered.

Stub files:

Stub files can be used in two different non mutually exclusive cases:

  • You need to type check a module for which you do not want to directly alter the function signatures
  • You want to write modules and have type-checking but additionally want to separate annotations from content.

What stub files (with an extension of .pyi) are is an annotated interface of the module you are making/want to use. They contain the signatures of the functions you want to type-check with the body of the functions discarded. To get a feel of this, given a set of three random functions in a module named randfunc.py:

def message(s):
    print(s)

def alterContents(myIterable):
    return [i for i in myIterable if i % 2 == 0]

def combine(messageFunc, itFunc):
    messageFunc("Printing the Iterable")
    a = alterContents(range(1, 20))
    return set(a)

We can create a stub file randfunc.pyi, in which we can place some restrictions if we wish to do so. The downside is that somebody viewing the source without the stub won"t really get that annotation assistance when trying to understand what is supposed to be passed where.

Anyway, the structure of a stub file is pretty simplistic: Add all function definitions with empty bodies (pass filled) and supply the annotations based on your requirements. Here, let"s assume we only want to work with int types for our Containers.

# Stub for randfucn.py
from typing import Iterable, List, Set, Callable

def message(s: str) -> None: pass

def alterContents(myIterable: Iterable[int])-> List[int]: pass

def combine(
    messageFunc: Callable[[str], Any],
    itFunc: Callable[[Iterable[int]], List[int]]
)-> Set[int]: pass

The combine function gives an indication of why you might want to use annotations in a different file, they some times clutter up the code and reduce readability (big no-no for Python). You could of course use type aliases but that sometime confuses more than it helps (so use them wisely).


This should get you familiarized with the basic concepts of type hints in Python. Even though the type checker used has been mypy you should gradually start to see more of them pop-up, some internally in IDEs (PyCharm,) and others as standard Python modules.

I"ll try and add additional checkers/related packages in the following list when and if I find them (or if suggested).

Checkers I know of:

  • Mypy: as described here.
  • PyType: By Google, uses different notation from what I gather, probably worth a look.

Related Packages/Projects:

  • typeshed: Official Python repository housing an assortment of stub files for the standard library.

The typeshed project is actually one of the best places you can look to see how type hinting might be used in a project of your own. Let"s take as an example the __init__ dunders of the Counter class in the corresponding .pyi file:

class Counter(Dict[_T, int], Generic[_T]):
        @overload
        def __init__(self) -> None: ...
        @overload
        def __init__(self, Mapping: Mapping[_T, int]) -> None: ...
        @overload
        def __init__(self, iterable: Iterable[_T]) -> None: ...

Where _T = TypeVar("_T") is used to define generic classes. For the Counter class we can see that it can either take no arguments in its initializer, get a single Mapping from any type to an int or take an Iterable of any type.


Notice: One thing I forgot to mention was that the typing module has been introduced on a provisional basis. From PEP 411:

A provisional package may have its API modified prior to "graduating" into a "stable" state. On one hand, this state provides the package with the benefits of being formally part of the Python distribution. On the other hand, the core development team explicitly states that no promises are made with regards to the the stability of the package"s API, which may change for the next release. While it is considered an unlikely outcome, such packages may even be removed from the standard library without a deprecation period if the concerns regarding their API or maintenance prove well-founded.

So take things here with a pinch of salt; I"m doubtful it will be removed or altered in significant ways, but one can never know.


** Another topic altogether, but valid in the scope of type-hints: PEP 526: Syntax for Variable Annotations is an effort to replace # type comments by introducing new syntax which allows users to annotate the type of variables in simple varname: type statements.

See What are variable annotations?, as previously mentioned, for a small introduction to these.

Answer #10

Whatever is assigned to the files variable is incorrect. Use the following code.

import glob
import os

list_of_files = glob.glob("/path/to/folder/*") # * means all if need specific format then *.csv
latest_file = max(list_of_files, key=os.path.getctime)
print(latest_file)

Split models.py into several files: StackOverflow Questions

How do you split a list into evenly sized chunks?

Question by jespern

I have a list of arbitrary length, and I need to split it up into equal size chunks and operate on it. There are some obvious ways to do this, like keeping a counter and two lists, and when the second list fills up, add it to the first list and empty the second list for the next round of data, but this is potentially extremely expensive.

I was wondering if anyone had a good solution to this for lists of any length, e.g. using generators.

I was looking for something useful in itertools but I couldn"t find anything obviously useful. Might"ve missed it, though.

Related question: What is the most “pythonic” way to iterate over a list in chunks?

Split Strings into words with multiple word boundary delimiters

I think what I want to do is a fairly common task but I"ve found no reference on the web. I have text with punctuation, and I want a list of the words.

"Hey, you - what are you doing here!?"

should be

["hey", "you", "what", "are", "you", "doing", "here"]

But Python"s str.split() only works with one argument, so I have all words with the punctuation after I split with whitespace. Any ideas?

Split string with multiple delimiters in Python

I found some answers online, but I have no experience with regular expressions, which I believe is what is needed here.

I have a string that needs to be split by either a ";" or ", " That is, it has to be either a semicolon or a comma followed by a space. Individual commas without trailing spaces should be left untouched

Example string:

"b-staged divinylsiloxane-bis-benzocyclobutene [124221-30-3], mesitylene [000108-67-8]; polymerized 1,2-dihydro-2,2,4- trimethyl quinoline [026780-96-1]"

should be split into a list containing the following:

("b-staged divinylsiloxane-bis-benzocyclobutene [124221-30-3]" , "mesitylene [000108-67-8]", "polymerized 1,2-dihydro-2,2,4- trimethyl quinoline [026780-96-1]") 

How to split a string into a list?

I want my Python function to split a sentence (input) and store each word in a list. My current code splits the sentence, but does not store the words as a list. How do I do that?

def split_line(text):

    # split the text
    words = text.split()

    # for each word in the line:
    for word in words:

        # print the word
        print(words)

Split string on whitespace in Python

I"m looking for the Python equivalent of

String str = "many   fancy word 
hello    	hi";
String whiteSpaceRegex = "\s";
String[] words = str.split(whiteSpaceRegex);

["many", "fancy", "word", "hello", "hi"]

How to split a string into a list of characters in Python?

I"ve tried to look around the web for answers to splitting a string into a list of characters but I can"t seem to find a simple method.

str.split(//) does not seem to work like Ruby does. Is there a simple way of doing this without looping?

Split string every nth character?

Is it possible to split a string every nth character?

For example, suppose I have a string containing the following:

"1234567890"

How can I get it to look like this:

["12","34","56","78","90"]

Split by comma and strip whitespace in Python

I have some python code that splits on comma, but doesn"t strip the whitespace:

>>> string = "blah, lots  ,  of ,  spaces, here "
>>> mylist = string.split(",")
>>> print mylist
["blah", " lots  ", "  of ", "  spaces", " here "]

I would rather end up with whitespace removed like this:

["blah", "lots", "of", "spaces", "here"]

I am aware that I could loop through the list and strip() each item but, as this is Python, I"m guessing there"s a quicker, easier and more elegant way of doing it.

Splitting on first occurrence

What would be the best way to split a string on the first occurrence of a delimiter?

For example:

"123mango abcd mango kiwi peach"

splitting on the first mango to get:

"abcd mango kiwi peach"

Split a list based on a condition?

What"s the best way, both aesthetically and from a performance perspective, to split a list of items into multiple lists based on a conditional? The equivalent of:

good = [x for x in mylist if x in goodvals]
bad  = [x for x in mylist if x not in goodvals]

is there a more elegant way to do this?

Update: here"s the actual use case, to better explain what I"m trying to do:

# files looks like: [ ("file1.jpg", 33L, ".jpg"), ("file2.avi", 999L, ".avi"), ... ]
IMAGE_TYPES = (".jpg",".jpeg",".gif",".bmp",".png")
images = [f for f in files if f[2].lower() in IMAGE_TYPES]
anims  = [f for f in files if f[2].lower() not in IMAGE_TYPES]

Answer #1

In Python, what is the purpose of __slots__ and what are the cases one should avoid this?

TLDR:

The special attribute __slots__ allows you to explicitly state which instance attributes you expect your object instances to have, with the expected results:

  1. faster attribute access.
  2. space savings in memory.

The space savings is from

  1. Storing value references in slots instead of __dict__.
  2. Denying __dict__ and __weakref__ creation if parent classes deny them and you declare __slots__.

Quick Caveats

Small caveat, you should only declare a particular slot one time in an inheritance tree. For example:

class Base:
    __slots__ = "foo", "bar"

class Right(Base):
    __slots__ = "baz", 

class Wrong(Base):
    __slots__ = "foo", "bar", "baz"        # redundant foo and bar

Python doesn"t object when you get this wrong (it probably should), problems might not otherwise manifest, but your objects will take up more space than they otherwise should. Python 3.8:

>>> from sys import getsizeof
>>> getsizeof(Right()), getsizeof(Wrong())
(56, 72)

This is because the Base"s slot descriptor has a slot separate from the Wrong"s. This shouldn"t usually come up, but it could:

>>> w = Wrong()
>>> w.foo = "foo"
>>> Base.foo.__get__(w)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: foo
>>> Wrong.foo.__get__(w)
"foo"

The biggest caveat is for multiple inheritance - multiple "parent classes with nonempty slots" cannot be combined.

To accommodate this restriction, follow best practices: Factor out all but one or all parents" abstraction which their concrete class respectively and your new concrete class collectively will inherit from - giving the abstraction(s) empty slots (just like abstract base classes in the standard library).

See section on multiple inheritance below for an example.

Requirements:

  • To have attributes named in __slots__ to actually be stored in slots instead of a __dict__, a class must inherit from object (automatic in Python 3, but must be explicit in Python 2).

  • To prevent the creation of a __dict__, you must inherit from object and all classes in the inheritance must declare __slots__ and none of them can have a "__dict__" entry.

There are a lot of details if you wish to keep reading.

Why use __slots__: Faster attribute access.

The creator of Python, Guido van Rossum, states that he actually created __slots__ for faster attribute access.

It is trivial to demonstrate measurably significant faster access:

import timeit

class Foo(object): __slots__ = "foo",

class Bar(object): pass

slotted = Foo()
not_slotted = Bar()

def get_set_delete_fn(obj):
    def get_set_delete():
        obj.foo = "foo"
        obj.foo
        del obj.foo
    return get_set_delete

and

>>> min(timeit.repeat(get_set_delete_fn(slotted)))
0.2846834529991611
>>> min(timeit.repeat(get_set_delete_fn(not_slotted)))
0.3664822799983085

The slotted access is almost 30% faster in Python 3.5 on Ubuntu.

>>> 0.3664822799983085 / 0.2846834529991611
1.2873325658284342

In Python 2 on Windows I have measured it about 15% faster.

Why use __slots__: Memory Savings

Another purpose of __slots__ is to reduce the space in memory that each object instance takes up.

My own contribution to the documentation clearly states the reasons behind this:

The space saved over using __dict__ can be significant.

SQLAlchemy attributes a lot of memory savings to __slots__.

To verify this, using the Anaconda distribution of Python 2.7 on Ubuntu Linux, with guppy.hpy (aka heapy) and sys.getsizeof, the size of a class instance without __slots__ declared, and nothing else, is 64 bytes. That does not include the __dict__. Thank you Python for lazy evaluation again, the __dict__ is apparently not called into existence until it is referenced, but classes without data are usually useless. When called into existence, the __dict__ attribute is a minimum of 280 bytes additionally.

In contrast, a class instance with __slots__ declared to be () (no data) is only 16 bytes, and 56 total bytes with one item in slots, 64 with two.

For 64 bit Python, I illustrate the memory consumption in bytes in Python 2.7 and 3.6, for __slots__ and __dict__ (no slots defined) for each point where the dict grows in 3.6 (except for 0, 1, and 2 attributes):

       Python 2.7             Python 3.6
attrs  __slots__  __dict__*   __slots__  __dict__* | *(no slots defined)
none   16         56 + 272†   16         56 + 112† | †if __dict__ referenced
one    48         56 + 272    48         56 + 112
two    56         56 + 272    56         56 + 112
six    88         56 + 1040   88         56 + 152
11     128        56 + 1040   128        56 + 240
22     216        56 + 3344   216        56 + 408     
43     384        56 + 3344   384        56 + 752

So, in spite of smaller dicts in Python 3, we see how nicely __slots__ scale for instances to save us memory, and that is a major reason you would want to use __slots__.

Just for completeness of my notes, note that there is a one-time cost per slot in the class"s namespace of 64 bytes in Python 2, and 72 bytes in Python 3, because slots use data descriptors like properties, called "members".

>>> Foo.foo
<member "foo" of "Foo" objects>
>>> type(Foo.foo)
<class "member_descriptor">
>>> getsizeof(Foo.foo)
72

Demonstration of __slots__:

To deny the creation of a __dict__, you must subclass object. Everything subclasses object in Python 3, but in Python 2 you had to be explicit:

class Base(object): 
    __slots__ = ()

now:

>>> b = Base()
>>> b.a = "a"
Traceback (most recent call last):
  File "<pyshell#38>", line 1, in <module>
    b.a = "a"
AttributeError: "Base" object has no attribute "a"

Or subclass another class that defines __slots__

class Child(Base):
    __slots__ = ("a",)

and now:

c = Child()
c.a = "a"

but:

>>> c.b = "b"
Traceback (most recent call last):
  File "<pyshell#42>", line 1, in <module>
    c.b = "b"
AttributeError: "Child" object has no attribute "b"

To allow __dict__ creation while subclassing slotted objects, just add "__dict__" to the __slots__ (note that slots are ordered, and you shouldn"t repeat slots that are already in parent classes):

class SlottedWithDict(Child): 
    __slots__ = ("__dict__", "b")

swd = SlottedWithDict()
swd.a = "a"
swd.b = "b"
swd.c = "c"

and

>>> swd.__dict__
{"c": "c"}

Or you don"t even need to declare __slots__ in your subclass, and you will still use slots from the parents, but not restrict the creation of a __dict__:

class NoSlots(Child): pass
ns = NoSlots()
ns.a = "a"
ns.b = "b"

And:

>>> ns.__dict__
{"b": "b"}

However, __slots__ may cause problems for multiple inheritance:

class BaseA(object): 
    __slots__ = ("a",)

class BaseB(object): 
    __slots__ = ("b",)

Because creating a child class from parents with both non-empty slots fails:

>>> class Child(BaseA, BaseB): __slots__ = ()
Traceback (most recent call last):
  File "<pyshell#68>", line 1, in <module>
    class Child(BaseA, BaseB): __slots__ = ()
TypeError: Error when calling the metaclass bases
    multiple bases have instance lay-out conflict

If you run into this problem, You could just remove __slots__ from the parents, or if you have control of the parents, give them empty slots, or refactor to abstractions:

from abc import ABC

class AbstractA(ABC):
    __slots__ = ()

class BaseA(AbstractA): 
    __slots__ = ("a",)

class AbstractB(ABC):
    __slots__ = ()

class BaseB(AbstractB): 
    __slots__ = ("b",)

class Child(AbstractA, AbstractB): 
    __slots__ = ("a", "b")

c = Child() # no problem!

Add "__dict__" to __slots__ to get dynamic assignment:

class Foo(object):
    __slots__ = "bar", "baz", "__dict__"

and now:

>>> foo = Foo()
>>> foo.boink = "boink"

So with "__dict__" in slots we lose some of the size benefits with the upside of having dynamic assignment and still having slots for the names we do expect.

When you inherit from an object that isn"t slotted, you get the same sort of semantics when you use __slots__ - names that are in __slots__ point to slotted values, while any other values are put in the instance"s __dict__.

Avoiding __slots__ because you want to be able to add attributes on the fly is actually not a good reason - just add "__dict__" to your __slots__ if this is required.

You can similarly add __weakref__ to __slots__ explicitly if you need that feature.

Set to empty tuple when subclassing a namedtuple:

The namedtuple builtin make immutable instances that are very lightweight (essentially, the size of tuples) but to get the benefits, you need to do it yourself if you subclass them:

from collections import namedtuple
class MyNT(namedtuple("MyNT", "bar baz")):
    """MyNT is an immutable and lightweight object"""
    __slots__ = ()

usage:

>>> nt = MyNT("bar", "baz")
>>> nt.bar
"bar"
>>> nt.baz
"baz"

And trying to assign an unexpected attribute raises an AttributeError because we have prevented the creation of __dict__:

>>> nt.quux = "quux"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: "MyNT" object has no attribute "quux"

You can allow __dict__ creation by leaving off __slots__ = (), but you can"t use non-empty __slots__ with subtypes of tuple.

Biggest Caveat: Multiple inheritance

Even when non-empty slots are the same for multiple parents, they cannot be used together:

class Foo(object): 
    __slots__ = "foo", "bar"
class Bar(object):
    __slots__ = "foo", "bar" # alas, would work if empty, i.e. ()

>>> class Baz(Foo, Bar): pass
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Error when calling the metaclass bases
    multiple bases have instance lay-out conflict

Using an empty __slots__ in the parent seems to provide the most flexibility, allowing the child to choose to prevent or allow (by adding "__dict__" to get dynamic assignment, see section above) the creation of a __dict__:

class Foo(object): __slots__ = ()
class Bar(object): __slots__ = ()
class Baz(Foo, Bar): __slots__ = ("foo", "bar")
b = Baz()
b.foo, b.bar = "foo", "bar"

You don"t have to have slots - so if you add them, and remove them later, it shouldn"t cause any problems.

Going out on a limb here: If you"re composing mixins or using abstract base classes, which aren"t intended to be instantiated, an empty __slots__ in those parents seems to be the best way to go in terms of flexibility for subclassers.

To demonstrate, first, let"s create a class with code we"d like to use under multiple inheritance

class AbstractBase:
    __slots__ = ()
    def __init__(self, a, b):
        self.a = a
        self.b = b
    def __repr__(self):
        return f"{type(self).__name__}({repr(self.a)}, {repr(self.b)})"

We could use the above directly by inheriting and declaring the expected slots:

class Foo(AbstractBase):
    __slots__ = "a", "b"

But we don"t care about that, that"s trivial single inheritance, we need another class we might also inherit from, maybe with a noisy attribute:

class AbstractBaseC:
    __slots__ = ()
    @property
    def c(self):
        print("getting c!")
        return self._c
    @c.setter
    def c(self, arg):
        print("setting c!")
        self._c = arg

Now if both bases had nonempty slots, we couldn"t do the below. (In fact, if we wanted, we could have given AbstractBase nonempty slots a and b, and left them out of the below declaration - leaving them in would be wrong):

class Concretion(AbstractBase, AbstractBaseC):
    __slots__ = "a b _c".split()

And now we have functionality from both via multiple inheritance, and can still deny __dict__ and __weakref__ instantiation:

>>> c = Concretion("a", "b")
>>> c.c = c
setting c!
>>> c.c
getting c!
Concretion("a", "b")
>>> c.d = "d"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: "Concretion" object has no attribute "d"

Other cases to avoid slots:

  • Avoid them when you want to perform __class__ assignment with another class that doesn"t have them (and you can"t add them) unless the slot layouts are identical. (I am very interested in learning who is doing this and why.)
  • Avoid them if you want to subclass variable length builtins like long, tuple, or str, and you want to add attributes to them.
  • Avoid them if you insist on providing default values via class attributes for instance variables.

You may be able to tease out further caveats from the rest of the __slots__ documentation (the 3.7 dev docs are the most current), which I have made significant recent contributions to.

Critiques of other answers

The current top answers cite outdated information and are quite hand-wavy and miss the mark in some important ways.

Do not "only use __slots__ when instantiating lots of objects"

I quote:

"You would want to use __slots__ if you are going to instantiate a lot (hundreds, thousands) of objects of the same class."

Abstract Base Classes, for example, from the collections module, are not instantiated, yet __slots__ are declared for them.

Why?

If a user wishes to deny __dict__ or __weakref__ creation, those things must not be available in the parent classes.

__slots__ contributes to reusability when creating interfaces or mixins.

It is true that many Python users aren"t writing for reusability, but when you are, having the option to deny unnecessary space usage is valuable.

__slots__ doesn"t break pickling

When pickling a slotted object, you may find it complains with a misleading TypeError:

>>> pickle.loads(pickle.dumps(f))
TypeError: a class that defines __slots__ without defining __getstate__ cannot be pickled

This is actually incorrect. This message comes from the oldest protocol, which is the default. You can select the latest protocol with the -1 argument. In Python 2.7 this would be 2 (which was introduced in 2.3), and in 3.6 it is 4.

>>> pickle.loads(pickle.dumps(f, -1))
<__main__.Foo object at 0x1129C770>

in Python 2.7:

>>> pickle.loads(pickle.dumps(f, 2))
<__main__.Foo object at 0x1129C770>

in Python 3.6

>>> pickle.loads(pickle.dumps(f, 4))
<__main__.Foo object at 0x1129C770>

So I would keep this in mind, as it is a solved problem.

Critique of the (until Oct 2, 2016) accepted answer

The first paragraph is half short explanation, half predictive. Here"s the only part that actually answers the question

The proper use of __slots__ is to save space in objects. Instead of having a dynamic dict that allows adding attributes to objects at anytime, there is a static structure which does not allow additions after creation. This saves the overhead of one dict for every object that uses slots

The second half is wishful thinking, and off the mark:

While this is sometimes a useful optimization, it would be completely unnecessary if the Python interpreter was dynamic enough so that it would only require the dict when there actually were additions to the object.

Python actually does something similar to this, only creating the __dict__ when it is accessed, but creating lots of objects with no data is fairly ridiculous.

The second paragraph oversimplifies and misses actual reasons to avoid __slots__. The below is not a real reason to avoid slots (for actual reasons, see the rest of my answer above.):

They change the behavior of the objects that have slots in a way that can be abused by control freaks and static typing weenies.

It then goes on to discuss other ways of accomplishing that perverse goal with Python, not discussing anything to do with __slots__.

The third paragraph is more wishful thinking. Together it is mostly off-the-mark content that the answerer didn"t even author and contributes to ammunition for critics of the site.

Memory usage evidence

Create some normal objects and slotted objects:

>>> class Foo(object): pass
>>> class Bar(object): __slots__ = ()

Instantiate a million of them:

>>> foos = [Foo() for f in xrange(1000000)]
>>> bars = [Bar() for b in xrange(1000000)]

Inspect with guppy.hpy().heap():

>>> guppy.hpy().heap()
Partition of a set of 2028259 objects. Total size = 99763360 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0 1000000  49 64000000  64  64000000  64 __main__.Foo
     1     169   0 16281480  16  80281480  80 list
     2 1000000  49 16000000  16  96281480  97 __main__.Bar
     3   12284   1   987472   1  97268952  97 str
...

Access the regular objects and their __dict__ and inspect again:

>>> for f in foos:
...     f.__dict__
>>> guppy.hpy().heap()
Partition of a set of 3028258 objects. Total size = 379763480 bytes.
 Index  Count   %      Size    % Cumulative  % Kind (class / dict of class)
     0 1000000  33 280000000  74 280000000  74 dict of __main__.Foo
     1 1000000  33  64000000  17 344000000  91 __main__.Foo
     2     169   0  16281480   4 360281480  95 list
     3 1000000  33  16000000   4 376281480  99 __main__.Bar
     4   12284   0    987472   0 377268952  99 str
...

This is consistent with the history of Python, from Unifying types and classes in Python 2.2

If you subclass a built-in type, extra space is automatically added to the instances to accomodate __dict__ and __weakrefs__. (The __dict__ is not initialized until you use it though, so you shouldn"t worry about the space occupied by an empty dictionary for each instance you create.) If you don"t need this extra space, you can add the phrase "__slots__ = []" to your class.

Answer #2

Quick Answer:

The simplest way to get row counts per group is by calling .size(), which returns a Series:

df.groupby(["col1","col2"]).size()


Usually you want this result as a DataFrame (instead of a Series) so you can do:

df.groupby(["col1", "col2"]).size().reset_index(name="counts")


If you want to find out how to calculate the row counts and other statistics for each group continue reading below.


Detailed example:

Consider the following example dataframe:

In [2]: df
Out[2]: 
  col1 col2  col3  col4  col5  col6
0    A    B  0.20 -0.61 -0.49  1.49
1    A    B -1.53 -1.01 -0.39  1.82
2    A    B -0.44  0.27  0.72  0.11
3    A    B  0.28 -1.32  0.38  0.18
4    C    D  0.12  0.59  0.81  0.66
5    C    D -0.13 -1.65 -1.64  0.50
6    C    D -1.42 -0.11 -0.18 -0.44
7    E    F -0.00  1.42 -0.26  1.17
8    E    F  0.91 -0.47  1.35 -0.34
9    G    H  1.48 -0.63 -1.14  0.17

First let"s use .size() to get the row counts:

In [3]: df.groupby(["col1", "col2"]).size()
Out[3]: 
col1  col2
A     B       4
C     D       3
E     F       2
G     H       1
dtype: int64

Then let"s use .size().reset_index(name="counts") to get the row counts:

In [4]: df.groupby(["col1", "col2"]).size().reset_index(name="counts")
Out[4]: 
  col1 col2  counts
0    A    B       4
1    C    D       3
2    E    F       2
3    G    H       1


Including results for more statistics

When you want to calculate statistics on grouped data, it usually looks like this:

In [5]: (df
   ...: .groupby(["col1", "col2"])
   ...: .agg({
   ...:     "col3": ["mean", "count"], 
   ...:     "col4": ["median", "min", "count"]
   ...: }))
Out[5]: 
            col4                  col3      
          median   min count      mean count
col1 col2                                   
A    B    -0.810 -1.32     4 -0.372500     4
C    D    -0.110 -1.65     3 -0.476667     3
E    F     0.475 -0.47     2  0.455000     2
G    H    -0.630 -0.63     1  1.480000     1

The result above is a little annoying to deal with because of the nested column labels, and also because row counts are on a per column basis.

To gain more control over the output I usually split the statistics into individual aggregations that I then combine using join. It looks like this:

In [6]: gb = df.groupby(["col1", "col2"])
   ...: counts = gb.size().to_frame(name="counts")
   ...: (counts
   ...:  .join(gb.agg({"col3": "mean"}).rename(columns={"col3": "col3_mean"}))
   ...:  .join(gb.agg({"col4": "median"}).rename(columns={"col4": "col4_median"}))
   ...:  .join(gb.agg({"col4": "min"}).rename(columns={"col4": "col4_min"}))
   ...:  .reset_index()
   ...: )
   ...: 
Out[6]: 
  col1 col2  counts  col3_mean  col4_median  col4_min
0    A    B       4  -0.372500       -0.810     -1.32
1    C    D       3  -0.476667       -0.110     -1.65
2    E    F       2   0.455000        0.475     -0.47
3    G    H       1   1.480000       -0.630     -0.63



Footnotes

The code used to generate the test data is shown below:

In [1]: import numpy as np
   ...: import pandas as pd 
   ...: 
   ...: keys = np.array([
   ...:         ["A", "B"],
   ...:         ["A", "B"],
   ...:         ["A", "B"],
   ...:         ["A", "B"],
   ...:         ["C", "D"],
   ...:         ["C", "D"],
   ...:         ["C", "D"],
   ...:         ["E", "F"],
   ...:         ["E", "F"],
   ...:         ["G", "H"] 
   ...:         ])
   ...: 
   ...: df = pd.DataFrame(
   ...:     np.hstack([keys,np.random.randn(10,4).round(2)]), 
   ...:     columns = ["col1", "col2", "col3", "col4", "col5", "col6"]
   ...: )
   ...: 
   ...: df[["col3", "col4", "col5", "col6"]] = 
   ...:     df[["col3", "col4", "col5", "col6"]].astype(float)
   ...: 


Disclaimer:

If some of the columns that you are aggregating have null values, then you really want to be looking at the group row counts as an independent aggregation for each column. Otherwise you may be misled as to how many records are actually being used to calculate things like the mean because pandas will drop NaN entries in the mean calculation without telling you about it.

Answer #3

TL;DR version:

For the simple case of:

  • I have a text column with a delimiter and I want two columns

The simplest solution is:

df[["A", "B"]] = df["AB"].str.split(" ", 1, expand=True)

You must use expand=True if your strings have a non-uniform number of splits and you want None to replace the missing values.

Notice how, in either case, the .tolist() method is not necessary. Neither is zip().

In detail:

Andy Hayden"s solution is most excellent in demonstrating the power of the str.extract() method.

But for a simple split over a known separator (like, splitting by dashes, or splitting by whitespace), the .str.split() method is enough1. It operates on a column (Series) of strings, and returns a column (Series) of lists:

>>> import pandas as pd
>>> df = pd.DataFrame({"AB": ["A1-B1", "A2-B2"]})
>>> df

      AB
0  A1-B1
1  A2-B2
>>> df["AB_split"] = df["AB"].str.split("-")
>>> df

      AB  AB_split
0  A1-B1  [A1, B1]
1  A2-B2  [A2, B2]

1: If you"re unsure what the first two parameters of .str.split() do, I recommend the docs for the plain Python version of the method.

But how do you go from:

  • a column containing two-element lists

to:

  • two columns, each containing the respective element of the lists?

Well, we need to take a closer look at the .str attribute of a column.

It"s a magical object that is used to collect methods that treat each element in a column as a string, and then apply the respective method in each element as efficient as possible:

>>> upper_lower_df = pd.DataFrame({"U": ["A", "B", "C"]})
>>> upper_lower_df

   U
0  A
1  B
2  C
>>> upper_lower_df["L"] = upper_lower_df["U"].str.lower()
>>> upper_lower_df

   U  L
0  A  a
1  B  b
2  C  c

But it also has an "indexing" interface for getting each element of a string by its index:

>>> df["AB"].str[0]

0    A
1    A
Name: AB, dtype: object

>>> df["AB"].str[1]

0    1
1    2
Name: AB, dtype: object

Of course, this indexing interface of .str doesn"t really care if each element it"s indexing is actually a string, as long as it can be indexed, so:

>>> df["AB"].str.split("-", 1).str[0]

0    A1
1    A2
Name: AB, dtype: object

>>> df["AB"].str.split("-", 1).str[1]

0    B1
1    B2
Name: AB, dtype: object

Then, it"s a simple matter of taking advantage of the Python tuple unpacking of iterables to do

>>> df["A"], df["B"] = df["AB"].str.split("-", 1).str
>>> df

      AB  AB_split   A   B
0  A1-B1  [A1, B1]  A1  B1
1  A2-B2  [A2, B2]  A2  B2

Of course, getting a DataFrame out of splitting a column of strings is so useful that the .str.split() method can do it for you with the expand=True parameter:

>>> df["AB"].str.split("-", 1, expand=True)

    0   1
0  A1  B1
1  A2  B2

So, another way of accomplishing what we wanted is to do:

>>> df = df[["AB"]]
>>> df

      AB
0  A1-B1
1  A2-B2

>>> df.join(df["AB"].str.split("-", 1, expand=True).rename(columns={0:"A", 1:"B"}))

      AB   A   B
0  A1-B1  A1  B1
1  A2-B2  A2  B2

The expand=True version, although longer, has a distinct advantage over the tuple unpacking method. Tuple unpacking doesn"t deal well with splits of different lengths:

>>> df = pd.DataFrame({"AB": ["A1-B1", "A2-B2", "A3-B3-C3"]})
>>> df
         AB
0     A1-B1
1     A2-B2
2  A3-B3-C3
>>> df["A"], df["B"], df["C"] = df["AB"].str.split("-")
Traceback (most recent call last):
  [...]    
ValueError: Length of values does not match length of index
>>> 

But expand=True handles it nicely by placing None in the columns for which there aren"t enough "splits":

>>> df.join(
...     df["AB"].str.split("-", expand=True).rename(
...         columns={0:"A", 1:"B", 2:"C"}
...     )
... )
         AB   A   B     C
0     A1-B1  A1  B1  None
1     A2-B2  A2  B2  None
2  A3-B3-C3  A3  B3    C3

Answer #4

There are several ways to select rows from a Pandas dataframe:

  1. Boolean indexing (df[df["col"] == value] )
  2. Positional indexing (df.iloc[...])
  3. Label indexing (df.xs(...))
  4. df.query(...) API

Below I show you examples of each, with advice when to use certain techniques. Assume our criterion is column "A" == "foo"

(Note on performance: For each base type, we can keep things simple by using the Pandas API or we can venture outside the API, usually into NumPy, and speed things up.)


Setup

The first thing we"ll need is to identify a condition that will act as our criterion for selecting rows. We"ll start with the OP"s case column_name == some_value, and include some other common use cases.

Borrowing from @unutbu:

import pandas as pd, numpy as np

df = pd.DataFrame({"A": "foo bar foo bar foo bar foo foo".split(),
                   "B": "one one two three two two one three".split(),
                   "C": np.arange(8), "D": np.arange(8) * 2})

1. Boolean indexing

... Boolean indexing requires finding the true value of each row"s "A" column being equal to "foo", then using those truth values to identify which rows to keep. Typically, we"d name this series, an array of truth values, mask. We"ll do so here as well.

mask = df["A"] == "foo"

We can then use this mask to slice or index the data frame

df[mask]

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

This is one of the simplest ways to accomplish this task and if performance or intuitiveness isn"t an issue, this should be your chosen method. However, if performance is a concern, then you might want to consider an alternative way of creating the mask.


2. Positional indexing

Positional indexing (df.iloc[...]) has its use cases, but this isn"t one of them. In order to identify where to slice, we first need to perform the same boolean analysis we did above. This leaves us performing one extra step to accomplish the same task.

mask = df["A"] == "foo"
pos = np.flatnonzero(mask)
df.iloc[pos]

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

3. Label indexing

Label indexing can be very handy, but in this case, we are again doing more work for no benefit

df.set_index("A", append=True, drop=False).xs("foo", level=1)

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

4. df.query() API

pd.DataFrame.query is a very elegant/intuitive way to perform this task, but is often slower. However, if you pay attention to the timings below, for large data, the query is very efficient. More so than the standard approach and of similar magnitude as my best suggestion.

df.query("A == "foo"")

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

My preference is to use the Boolean mask

Actual improvements can be made by modifying how we create our Boolean mask.

mask alternative 1 Use the underlying NumPy array and forgo the overhead of creating another pd.Series

mask = df["A"].values == "foo"

I"ll show more complete time tests at the end, but just take a look at the performance gains we get using the sample data frame. First, we look at the difference in creating the mask

%timeit mask = df["A"].values == "foo"
%timeit mask = df["A"] == "foo"

5.84 µs ± 195 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
166 µs ± 4.45 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Evaluating the mask with the NumPy array is ~ 30 times faster. This is partly due to NumPy evaluation often being faster. It is also partly due to the lack of overhead necessary to build an index and a corresponding pd.Series object.

Next, we"ll look at the timing for slicing with one mask versus the other.

mask = df["A"].values == "foo"
%timeit df[mask]
mask = df["A"] == "foo"
%timeit df[mask]

219 µs ± 12.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
239 µs ± 7.03 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The performance gains aren"t as pronounced. We"ll see if this holds up over more robust testing.


mask alternative 2 We could have reconstructed the data frame as well. There is a big caveat when reconstructing a dataframe—you must take care of the dtypes when doing so!

Instead of df[mask] we will do this

pd.DataFrame(df.values[mask], df.index[mask], df.columns).astype(df.dtypes)

If the data frame is of mixed type, which our example is, then when we get df.values the resulting array is of dtype object and consequently, all columns of the new data frame will be of dtype object. Thus requiring the astype(df.dtypes) and killing any potential performance gains.

%timeit df[m]
%timeit pd.DataFrame(df.values[mask], df.index[mask], df.columns).astype(df.dtypes)

216 µs ± 10.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.43 ms ± 39.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

However, if the data frame is not of mixed type, this is a very useful way to do it.

Given

np.random.seed([3,1415])
d1 = pd.DataFrame(np.random.randint(10, size=(10, 5)), columns=list("ABCDE"))

d1

   A  B  C  D  E
0  0  2  7  3  8
1  7  0  6  8  6
2  0  2  0  4  9
3  7  3  2  4  3
4  3  6  7  7  4
5  5  3  7  5  9
6  8  7  6  4  7
7  6  2  6  6  5
8  2  8  7  5  8
9  4  7  6  1  5

%%timeit
mask = d1["A"].values == 7
d1[mask]

179 µs ± 8.73 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Versus

%%timeit
mask = d1["A"].values == 7
pd.DataFrame(d1.values[mask], d1.index[mask], d1.columns)

87 µs ± 5.12 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

We cut the time in half.


mask alternative 3

@unutbu also shows us how to use pd.Series.isin to account for each element of df["A"] being in a set of values. This evaluates to the same thing if our set of values is a set of one value, namely "foo". But it also generalizes to include larger sets of values if needed. Turns out, this is still pretty fast even though it is a more general solution. The only real loss is in intuitiveness for those not familiar with the concept.

mask = df["A"].isin(["foo"])
df[mask]

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

However, as before, we can utilize NumPy to improve performance while sacrificing virtually nothing. We"ll use np.in1d

mask = np.in1d(df["A"].values, ["foo"])
df[mask]

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

Timing

I"ll include other concepts mentioned in other posts as well for reference.

Code Below

Each column in this table represents a different length data frame over which we test each function. Each column shows relative time taken, with the fastest function given a base index of 1.0.

res.div(res.min())

                         10        30        100       300       1000      3000      10000     30000
mask_standard         2.156872  1.850663  2.034149  2.166312  2.164541  3.090372  2.981326  3.131151
mask_standard_loc     1.879035  1.782366  1.988823  2.338112  2.361391  3.036131  2.998112  2.990103
mask_with_values      1.010166  1.000000  1.005113  1.026363  1.028698  1.293741  1.007824  1.016919
mask_with_values_loc  1.196843  1.300228  1.000000  1.000000  1.038989  1.219233  1.037020  1.000000
query                 4.997304  4.765554  5.934096  4.500559  2.997924  2.397013  1.680447  1.398190
xs_label              4.124597  4.272363  5.596152  4.295331  4.676591  5.710680  6.032809  8.950255
mask_with_isin        1.674055  1.679935  1.847972  1.724183  1.345111  1.405231  1.253554  1.264760
mask_with_in1d        1.000000  1.083807  1.220493  1.101929  1.000000  1.000000  1.000000  1.144175

You"ll notice that the fastest times seem to be shared between mask_with_values and mask_with_in1d.

res.T.plot(loglog=True)

Enter image description here

Functions

def mask_standard(df):
    mask = df["A"] == "foo"
    return df[mask]

def mask_standard_loc(df):
    mask = df["A"] == "foo"
    return df.loc[mask]

def mask_with_values(df):
    mask = df["A"].values == "foo"
    return df[mask]

def mask_with_values_loc(df):
    mask = df["A"].values == "foo"
    return df.loc[mask]

def query(df):
    return df.query("A == "foo"")

def xs_label(df):
    return df.set_index("A", append=True, drop=False).xs("foo", level=-1)

def mask_with_isin(df):
    mask = df["A"].isin(["foo"])
    return df[mask]

def mask_with_in1d(df):
    mask = np.in1d(df["A"].values, ["foo"])
    return df[mask]

Testing

res = pd.DataFrame(
    index=[
        "mask_standard", "mask_standard_loc", "mask_with_values", "mask_with_values_loc",
        "query", "xs_label", "mask_with_isin", "mask_with_in1d"
    ],
    columns=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
    dtype=float
)

for j in res.columns:
    d = pd.concat([df] * j, ignore_index=True)
    for i in res.index:a
        stmt = "{}(d)".format(i)
        setp = "from __main__ import d, {}".format(i)
        res.at[i, j] = timeit(stmt, setp, number=50)

Special Timing

Looking at the special case when we have a single non-object dtype for the entire data frame.

Code Below

spec.div(spec.min())

                     10        30        100       300       1000      3000      10000     30000
mask_with_values  1.009030  1.000000  1.194276  1.000000  1.236892  1.095343  1.000000  1.000000
mask_with_in1d    1.104638  1.094524  1.156930  1.072094  1.000000  1.000000  1.040043  1.027100
reconstruct       1.000000  1.142838  1.000000  1.355440  1.650270  2.222181  2.294913  3.406735

Turns out, reconstruction isn"t worth it past a few hundred rows.

spec.T.plot(loglog=True)

Enter image description here

Functions

np.random.seed([3,1415])
d1 = pd.DataFrame(np.random.randint(10, size=(10, 5)), columns=list("ABCDE"))

def mask_with_values(df):
    mask = df["A"].values == "foo"
    return df[mask]

def mask_with_in1d(df):
    mask = np.in1d(df["A"].values, ["foo"])
    return df[mask]

def reconstruct(df):
    v = df.values
    mask = np.in1d(df["A"].values, ["foo"])
    return pd.DataFrame(v[mask], df.index[mask], df.columns)

spec = pd.DataFrame(
    index=["mask_with_values", "mask_with_in1d", "reconstruct"],
    columns=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
    dtype=float
)

Testing

for j in spec.columns:
    d = pd.concat([df] * j, ignore_index=True)
    for i in spec.index:
        stmt = "{}(d)".format(i)
        setp = "from __main__ import d, {}".format(i)
        spec.at[i, j] = timeit(stmt, setp, number=50)

Answer #5

It"s much easier if you use Response.raw and shutil.copyfileobj():

import requests
import shutil

def download_file(url):
    local_filename = url.split("/")[-1]
    with requests.get(url, stream=True) as r:
        with open(local_filename, "wb") as f:
            shutil.copyfileobj(r.raw, f)

    return local_filename

This streams the file to disk without using excessive memory, and the code is simple.

Answer #6

Explain all in Python?

I keep seeing the variable __all__ set in different __init__.py files.

What does this do?

What does __all__ do?

It declares the semantically "public" names from a module. If there is a name in __all__, users are expected to use it, and they can have the expectation that it will not change.

It also will have programmatic effects:

import *

__all__ in a module, e.g. module.py:

__all__ = ["foo", "Bar"]

means that when you import * from the module, only those names in the __all__ are imported:

from module import *               # imports foo and Bar

Documentation tools

Documentation and code autocompletion tools may (in fact, should) also inspect the __all__ to determine what names to show as available from a module.

__init__.py makes a directory a Python package

From the docs:

The __init__.py files are required to make Python treat the directories as containing packages; this is done to prevent directories with a common name, such as string, from unintentionally hiding valid modules that occur later on the module search path.

In the simplest case, __init__.py can just be an empty file, but it can also execute initialization code for the package or set the __all__ variable.

So the __init__.py can declare the __all__ for a package.

Managing an API:

A package is typically made up of modules that may import one another, but that are necessarily tied together with an __init__.py file. That file is what makes the directory an actual Python package. For example, say you have the following files in a package:

package
├── __init__.py
├── module_1.py
└── module_2.py

Let"s create these files with Python so you can follow along - you could paste the following into a Python 3 shell:

from pathlib import Path

package = Path("package")
package.mkdir()

(package / "__init__.py").write_text("""
from .module_1 import *
from .module_2 import *
""")

package_module_1 = package / "module_1.py"
package_module_1.write_text("""
__all__ = ["foo"]
imp_detail1 = imp_detail2 = imp_detail3 = None
def foo(): pass
""")

package_module_2 = package / "module_2.py"
package_module_2.write_text("""
__all__ = ["Bar"]
imp_detail1 = imp_detail2 = imp_detail3 = None
class Bar: pass
""")

And now you have presented a complete api that someone else can use when they import your package, like so:

import package
package.foo()
package.Bar()

And the package won"t have all the other implementation details you used when creating your modules cluttering up the package namespace.

__all__ in __init__.py

After more work, maybe you"ve decided that the modules are too big (like many thousands of lines?) and need to be split up. So you do the following:

package
├── __init__.py
├── module_1
│   ├── foo_implementation.py
│   └── __init__.py
└── module_2
    ├── Bar_implementation.py
    └── __init__.py

First make the subpackage directories with the same names as the modules:

subpackage_1 = package / "module_1"
subpackage_1.mkdir()
subpackage_2 = package / "module_2"
subpackage_2.mkdir()

Move the implementations:

package_module_1.rename(subpackage_1 / "foo_implementation.py")
package_module_2.rename(subpackage_2 / "Bar_implementation.py")

create __init__.pys for the subpackages that declare the __all__ for each:

(subpackage_1 / "__init__.py").write_text("""
from .foo_implementation import *
__all__ = ["foo"]
""")
(subpackage_2 / "__init__.py").write_text("""
from .Bar_implementation import *
__all__ = ["Bar"]
""")

And now you still have the api provisioned at the package level:

>>> import package
>>> package.foo()
>>> package.Bar()
<package.module_2.Bar_implementation.Bar object at 0x7f0c2349d210>

And you can easily add things to your API that you can manage at the subpackage level instead of the subpackage"s module level. If you want to add a new name to the API, you simply update the __init__.py, e.g. in module_2:

from .Bar_implementation import *
from .Baz_implementation import *
__all__ = ["Bar", "Baz"]

And if you"re not ready to publish Baz in the top level API, in your top level __init__.py you could have:

from .module_1 import *       # also constrained by __all__"s
from .module_2 import *       # in the __init__.py"s
__all__ = ["foo", "Bar"]     # further constraining the names advertised

and if your users are aware of the availability of Baz, they can use it:

import package
package.Baz()

but if they don"t know about it, other tools (like pydoc) won"t inform them.

You can later change that when Baz is ready for prime time:

from .module_1 import *
from .module_2 import *
__all__ = ["foo", "Bar", "Baz"]

Prefixing _ versus __all__:

By default, Python will export all names that do not start with an _. You certainly could rely on this mechanism. Some packages in the Python standard library, in fact, do rely on this, but to do so, they alias their imports, for example, in ctypes/__init__.py:

import os as _os, sys as _sys

Using the _ convention can be more elegant because it removes the redundancy of naming the names again. But it adds the redundancy for imports (if you have a lot of them) and it is easy to forget to do this consistently - and the last thing you want is to have to indefinitely support something you intended to only be an implementation detail, just because you forgot to prefix an _ when naming a function.

I personally write an __all__ early in my development lifecycle for modules so that others who might use my code know what they should use and not use.

Most packages in the standard library also use __all__.

When avoiding __all__ makes sense

It makes sense to stick to the _ prefix convention in lieu of __all__ when:

  • You"re still in early development mode and have no users, and are constantly tweaking your API.
  • Maybe you do have users, but you have unittests that cover the API, and you"re still actively adding to the API and tweaking in development.

An export decorator

The downside of using __all__ is that you have to write the names of functions and classes being exported twice - and the information is kept separate from the definitions. We could use a decorator to solve this problem.

I got the idea for such an export decorator from David Beazley"s talk on packaging. This implementation seems to work well in CPython"s traditional importer. If you have a special import hook or system, I do not guarantee it, but if you adopt it, it is fairly trivial to back out - you"ll just need to manually add the names back into the __all__

So in, for example, a utility library, you would define the decorator:

import sys

def export(fn):
    mod = sys.modules[fn.__module__]
    if hasattr(mod, "__all__"):
        mod.__all__.append(fn.__name__)
    else:
        mod.__all__ = [fn.__name__]
    return fn

and then, where you would define an __all__, you do this:

$ cat > main.py
from lib import export
__all__ = [] # optional - we create a list if __all__ is not there.

@export
def foo(): pass

@export
def bar():
    "bar"

def main():
    print("main")

if __name__ == "__main__":
    main()

And this works fine whether run as main or imported by another function.

$ cat > run.py
import main
main.main()

$ python run.py
main

And API provisioning with import * will work too:

$ cat > run.py
from main import *
foo()
bar()
main() # expected to error here, not exported

$ python run.py
Traceback (most recent call last):
  File "run.py", line 4, in <module>
    main() # expected to error here, not exported
NameError: name "main" is not defined

Answer #7

A comment in the Python source code for float objects acknowledges that:

Comparison is pretty much a nightmare

This is especially true when comparing a float to an integer, because, unlike floats, integers in Python can be arbitrarily large and are always exact. Trying to cast the integer to a float might lose precision and make the comparison inaccurate. Trying to cast the float to an integer is not going to work either because any fractional part will be lost.

To get around this problem, Python performs a series of checks, returning the result if one of the checks succeeds. It compares the signs of the two values, then whether the integer is "too big" to be a float, then compares the exponent of the float to the length of the integer. If all of these checks fail, it is necessary to construct two new Python objects to compare in order to obtain the result.

When comparing a float v to an integer/long w, the worst case is that:

  • v and w have the same sign (both positive or both negative),
  • the integer w has few enough bits that it can be held in the size_t type (typically 32 or 64 bits),
  • the integer w has at least 49 bits,
  • the exponent of the float v is the same as the number of bits in w.

And this is exactly what we have for the values in the question:

>>> import math
>>> math.frexp(562949953420000.7) # gives the float"s (significand, exponent) pair
(0.9999999999976706, 49)
>>> (562949953421000).bit_length()
49

We see that 49 is both the exponent of the float and the number of bits in the integer. Both numbers are positive and so the four criteria above are met.

Choosing one of the values to be larger (or smaller) can change the number of bits of the integer, or the value of the exponent, and so Python is able to determine the result of the comparison without performing the expensive final check.

This is specific to the CPython implementation of the language.


The comparison in more detail

The float_richcompare function handles the comparison between two values v and w.

Below is a step-by-step description of the checks that the function performs. The comments in the Python source are actually very helpful when trying to understand what the function does, so I"ve left them in where relevant. I"ve also summarised these checks in a list at the foot of the answer.

The main idea is to map the Python objects v and w to two appropriate C doubles, i and j, which can then be easily compared to give the correct result. Both Python 2 and Python 3 use the same ideas to do this (the former just handles int and long types separately).

The first thing to do is check that v is definitely a Python float and map it to a C double i. Next the function looks at whether w is also a float and maps it to a C double j. This is the best case scenario for the function as all the other checks can be skipped. The function also checks to see whether v is inf or nan:

static PyObject*
float_richcompare(PyObject *v, PyObject *w, int op)
{
    double i, j;
    int r = 0;
    assert(PyFloat_Check(v));       
    i = PyFloat_AS_DOUBLE(v);       

    if (PyFloat_Check(w))           
        j = PyFloat_AS_DOUBLE(w);   

    else if (!Py_IS_FINITE(i)) {
        if (PyLong_Check(w))
            j = 0.0;
        else
            goto Unimplemented;
    }

Now we know that if w failed these checks, it is not a Python float. Now the function checks if it"s a Python integer. If this is the case, the easiest test is to extract the sign of v and the sign of w (return 0 if zero, -1 if negative, 1 if positive). If the signs are different, this is all the information needed to return the result of the comparison:

    else if (PyLong_Check(w)) {
        int vsign = i == 0.0 ? 0 : i < 0.0 ? -1 : 1;
        int wsign = _PyLong_Sign(w);
        size_t nbits;
        int exponent;

        if (vsign != wsign) {
            /* Magnitudes are irrelevant -- the signs alone
             * determine the outcome.
             */
            i = (double)vsign;
            j = (double)wsign;
            goto Compare;
        }
    }   

If this check failed, then v and w have the same sign.

The next check counts the number of bits in the integer w. If it has too many bits then it can"t possibly be held as a float and so must be larger in magnitude than the float v:

    nbits = _PyLong_NumBits(w);
    if (nbits == (size_t)-1 && PyErr_Occurred()) {
        /* This long is so large that size_t isn"t big enough
         * to hold the # of bits.  Replace with little doubles
         * that give the same outcome -- w is so large that
         * its magnitude must exceed the magnitude of any
         * finite float.
         */
        PyErr_Clear();
        i = (double)vsign;
        assert(wsign != 0);
        j = wsign * 2.0;
        goto Compare;
    }

On the other hand, if the integer w has 48 or fewer bits, it can safely turned in a C double j and compared:

    if (nbits <= 48) {
        j = PyLong_AsDouble(w);
        /* It"s impossible that <= 48 bits overflowed. */
        assert(j != -1.0 || ! PyErr_Occurred());
        goto Compare;
    }

From this point onwards, we know that w has 49 or more bits. It will be convenient to treat w as a positive integer, so change the sign and the comparison operator as necessary:

    if (nbits <= 48) {
        /* "Multiply both sides" by -1; this also swaps the
         * comparator.
         */
        i = -i;
        op = _Py_SwappedOp[op];
    }

Now the function looks at the exponent of the float. Recall that a float can be written (ignoring sign) as significand * 2exponent and that the significand represents a number between 0.5 and 1:

    (void) frexp(i, &exponent);
    if (exponent < 0 || (size_t)exponent < nbits) {
        i = 1.0;
        j = 2.0;
        goto Compare;
    }

This checks two things. If the exponent is less than 0 then the float is smaller than 1 (and so smaller in magnitude than any integer). Or, if the exponent is less than the number of bits in w then we have that v < |w| since significand * 2exponent is less than 2nbits.

Failing these two checks, the function looks to see whether the exponent is greater than the number of bit in w. This shows that significand * 2exponent is greater than 2nbits and so v > |w|:

    if ((size_t)exponent > nbits) {
        i = 2.0;
        j = 1.0;
        goto Compare;
    }

If this check did not succeed we know that the exponent of the float v is the same as the number of bits in the integer w.

The only way that the two values can be compared now is to construct two new Python integers from v and w. The idea is to discard the fractional part of v, double the integer part, and then add one. w is also doubled and these two new Python objects can be compared to give the correct return value. Using an example with small values, 4.65 < 4 would be determined by the comparison (2*4)+1 == 9 < 8 == (2*4) (returning false).

    {
        double fracpart;
        double intpart;
        PyObject *result = NULL;
        PyObject *one = NULL;
        PyObject *vv = NULL;
        PyObject *ww = w;

        // snip

        fracpart = modf(i, &intpart); // split i (the double that v mapped to)
        vv = PyLong_FromDouble(intpart);

        // snip

        if (fracpart != 0.0) {
            /* Shift left, and or a 1 bit into vv
             * to represent the lost fraction.
             */
            PyObject *temp;

            one = PyLong_FromLong(1);

            temp = PyNumber_Lshift(ww, one); // left-shift doubles an integer
            ww = temp;

            temp = PyNumber_Lshift(vv, one);
            vv = temp;

            temp = PyNumber_Or(vv, one); // a doubled integer is even, so this adds 1
            vv = temp;
        }
        // snip
    }
}

For brevity I"ve left out the additional error-checking and garbage-tracking Python has to do when it creates these new objects. Needless to say, this adds additional overhead and explains why the values highlighted in the question are significantly slower to compare than others.


Here is a summary of the checks that are performed by the comparison function.

Let v be a float and cast it as a C double. Now, if w is also a float:

  • Check whether w is nan or inf. If so, handle this special case separately depending on the type of w.

  • If not, compare v and w directly by their representations as C doubles.

If w is an integer:

  • Extract the signs of v and w. If they are different then we know v and w are different and which is the greater value.

  • (The signs are the same.) Check whether w has too many bits to be a float (more than size_t). If so, w has greater magnitude than v.

  • Check if w has 48 or fewer bits. If so, it can be safely cast to a C double without losing its precision and compared with v.

  • (w has more than 48 bits. We will now treat w as a positive integer having changed the compare op as appropriate.)

  • Consider the exponent of the float v. If the exponent is negative, then v is less than 1 and therefore less than any positive integer. Else, if the exponent is less than the number of bits in w then it must be less than w.

  • If the exponent of v is greater than the number of bits in w then v is greater than w.

  • (The exponent is the same as the number of bits in w.)

  • The final check. Split v into its integer and fractional parts. Double the integer part and add 1 to compensate for the fractional part. Now double the integer w. Compare these two new integers instead to get the result.

Answer #8

To somewhat expand on the earlier answers here, there are a number of details which are commonly overlooked.

  • Prefer subprocess.run() over subprocess.check_call() and friends over subprocess.call() over subprocess.Popen() over os.system() over os.popen()
  • Understand and probably use text=True, aka universal_newlines=True.
  • Understand the meaning of shell=True or shell=False and how it changes quoting and the availability of shell conveniences.
  • Understand differences between sh and Bash
  • Understand how a subprocess is separate from its parent, and generally cannot change the parent.
  • Avoid running the Python interpreter as a subprocess of Python.

These topics are covered in some more detail below.

Prefer subprocess.run() or subprocess.check_call()

The subprocess.Popen() function is a low-level workhorse but it is tricky to use correctly and you end up copy/pasting multiple lines of code ... which conveniently already exist in the standard library as a set of higher-level wrapper functions for various purposes, which are presented in more detail in the following.

Here"s a paragraph from the documentation:

The recommended approach to invoking subprocesses is to use the run() function for all use cases it can handle. For more advanced use cases, the underlying Popen interface can be used directly.

Unfortunately, the availability of these wrapper functions differs between Python versions.

  • subprocess.run() was officially introduced in Python 3.5. It is meant to replace all of the following.
  • subprocess.check_output() was introduced in Python 2.7 / 3.1. It is basically equivalent to subprocess.run(..., check=True, stdout=subprocess.PIPE).stdout
  • subprocess.check_call() was introduced in Python 2.5. It is basically equivalent to subprocess.run(..., check=True)
  • subprocess.call() was introduced in Python 2.4 in the original subprocess module (PEP-324). It is basically equivalent to subprocess.run(...).returncode

High-level API vs subprocess.Popen()

The refactored and extended subprocess.run() is more logical and more versatile than the older legacy functions it replaces. It returns a CompletedProcess object which has various methods which allow you to retrieve the exit status, the standard output, and a few other results and status indicators from the finished subprocess.

subprocess.run() is the way to go if you simply need a program to run and return control to Python. For more involved scenarios (background processes, perhaps with interactive I/O with the Python parent program) you still need to use subprocess.Popen() and take care of all the plumbing yourself. This requires a fairly intricate understanding of all the moving parts and should not be undertaken lightly. The simpler Popen object represents the (possibly still-running) process which needs to be managed from your code for the remainder of the lifetime of the subprocess.

It should perhaps be emphasized that just subprocess.Popen() merely creates a process. If you leave it at that, you have a subprocess running concurrently alongside with Python, so a "background" process. If it doesn"t need to do input or output or otherwise coordinate with you, it can do useful work in parallel with your Python program.

Avoid os.system() and os.popen()

Since time eternal (well, since Python 2.5) the os module documentation has contained the recommendation to prefer subprocess over os.system():

The subprocess module provides more powerful facilities for spawning new processes and retrieving their results; using that module is preferable to using this function.

The problems with system() are that it"s obviously system-dependent and doesn"t offer ways to interact with the subprocess. It simply runs, with standard output and standard error outside of Python"s reach. The only information Python receives back is the exit status of the command (zero means success, though the meaning of non-zero values is also somewhat system-dependent).

PEP-324 (which was already mentioned above) contains a more detailed rationale for why os.system is problematic and how subprocess attempts to solve those issues.

os.popen() used to be even more strongly discouraged:

Deprecated since version 2.6: This function is obsolete. Use the subprocess module.

However, since sometime in Python 3, it has been reimplemented to simply use subprocess, and redirects to the subprocess.Popen() documentation for details.

Understand and usually use check=True

You"ll also notice that subprocess.call() has many of the same limitations as os.system(). In regular use, you should generally check whether the process finished successfully, which subprocess.check_call() and subprocess.check_output() do (where the latter also returns the standard output of the finished subprocess). Similarly, you should usually use check=True with subprocess.run() unless you specifically need to allow the subprocess to return an error status.

In practice, with check=True or subprocess.check_*, Python will throw a CalledProcessError exception if the subprocess returns a nonzero exit status.

A common error with subprocess.run() is to omit check=True and be surprised when downstream code fails if the subprocess failed.

On the other hand, a common problem with check_call() and check_output() was that users who blindly used these functions were surprised when the exception was raised e.g. when grep did not find a match. (You should probably replace grep with native Python code anyway, as outlined below.)

All things counted, you need to understand how shell commands return an exit code, and under what conditions they will return a non-zero (error) exit code, and make a conscious decision how exactly it should be handled.

Understand and probably use text=True aka universal_newlines=True

Since Python 3, strings internal to Python are Unicode strings. But there is no guarantee that a subprocess generates Unicode output, or strings at all.

(If the differences are not immediately obvious, Ned Batchelder"s Pragmatic Unicode is recommended, if not outright obligatory, reading. There is a 36-minute video presentation behind the link if you prefer, though reading the page yourself will probably take significantly less time.)

Deep down, Python has to fetch a bytes buffer and interpret it somehow. If it contains a blob of binary data, it shouldn"t be decoded into a Unicode string, because that"s error-prone and bug-inducing behavior - precisely the sort of pesky behavior which riddled many Python 2 scripts, before there was a way to properly distinguish between encoded text and binary data.

With text=True, you tell Python that you, in fact, expect back textual data in the system"s default encoding, and that it should be decoded into a Python (Unicode) string to the best of Python"s ability (usually UTF-8 on any moderately up to date system, except perhaps Windows?)

If that"s not what you request back, Python will just give you bytes strings in the stdout and stderr strings. Maybe at some later point you do know that they were text strings after all, and you know their encoding. Then, you can decode them.

normal = subprocess.run([external, arg],
    stdout=subprocess.PIPE, stderr=subprocess.PIPE,
    check=True,
    text=True)
print(normal.stdout)

convoluted = subprocess.run([external, arg],
    stdout=subprocess.PIPE, stderr=subprocess.PIPE,
    check=True)
# You have to know (or guess) the encoding
print(convoluted.stdout.decode("utf-8"))

Python 3.7 introduced the shorter and more descriptive and understandable alias text for the keyword argument which was previously somewhat misleadingly called universal_newlines.

Understand shell=True vs shell=False

With shell=True you pass a single string to your shell, and the shell takes it from there.

With shell=False you pass a list of arguments to the OS, bypassing the shell.

When you don"t have a shell, you save a process and get rid of a fairly substantial amount of hidden complexity, which may or may not harbor bugs or even security problems.

On the other hand, when you don"t have a shell, you don"t have redirection, wildcard expansion, job control, and a large number of other shell features.

A common mistake is to use shell=True and then still pass Python a list of tokens, or vice versa. This happens to work in some cases, but is really ill-defined and could break in interesting ways.

# XXX AVOID THIS BUG
buggy = subprocess.run("dig +short stackoverflow.com")

# XXX AVOID THIS BUG TOO
broken = subprocess.run(["dig", "+short", "stackoverflow.com"],
    shell=True)

# XXX DEFINITELY AVOID THIS
pathological = subprocess.run(["dig +short stackoverflow.com"],
    shell=True)

correct = subprocess.run(["dig", "+short", "stackoverflow.com"],
    # Probably don"t forget these, too
    check=True, text=True)

# XXX Probably better avoid shell=True
# but this is nominally correct
fixed_but_fugly = subprocess.run("dig +short stackoverflow.com",
    shell=True,
    # Probably don"t forget these, too
    check=True, text=True)

The common retort "but it works for me" is not a useful rebuttal unless you understand exactly under what circumstances it could stop working.

Refactoring Example

Very often, the features of the shell can be replaced with native Python code. Simple Awk or sed scripts should probably simply be translated to Python instead.

To partially illustrate this, here is a typical but slightly silly example which involves many shell features.

cmd = """while read -r x;
   do ping -c 3 "$x" | grep "round-trip min/avg/max"
   done <hosts.txt"""

# Trivial but horrible
results = subprocess.run(
    cmd, shell=True, universal_newlines=True, check=True)
print(results.stdout)

# Reimplement with shell=False
with open("hosts.txt") as hosts:
    for host in hosts:
        host = host.rstrip("
")  # drop newline
        ping = subprocess.run(
             ["ping", "-c", "3", host],
             text=True,
             stdout=subprocess.PIPE,
             check=True)
        for line in ping.stdout.split("
"):
             if "round-trip min/avg/max" in line:
                 print("{}: {}".format(host, line))

Some things to note here:

  • With shell=False you don"t need the quoting that the shell requires around strings. Putting quotes anyway is probably an error.
  • It often makes sense to run as little code as possible in a subprocess. This gives you more control over execution from within your Python code.
  • Having said that, complex shell pipelines are tedious and sometimes challenging to reimplement in Python.

The refactored code also illustrates just how much the shell really does for you with a very terse syntax -- for better or for worse. Python says explicit is better than implicit but the Python code is rather verbose and arguably looks more complex than this really is. On the other hand, it offers a number of points where you can grab control in the middle of something else, as trivially exemplified by the enhancement that we can easily include the host name along with the shell command output. (This is by no means challenging to do in the shell, either, but at the expense of yet another diversion and perhaps another process.)

Common Shell Constructs

For completeness, here are brief explanations of some of these shell features, and some notes on how they can perhaps be replaced with native Python facilities.

  • Globbing aka wildcard expansion can be replaced with glob.glob() or very often with simple Python string comparisons like for file in os.listdir("."): if not file.endswith(".png"): continue. Bash has various other expansion facilities like .{png,jpg} brace expansion and {1..100} as well as tilde expansion (~ expands to your home directory, and more generally ~account to the home directory of another user)
  • Shell variables like $SHELL or $my_exported_var can sometimes simply be replaced with Python variables. Exported shell variables are available as e.g. os.environ["SHELL"] (the meaning of export is to make the variable available to subprocesses -- a variable which is not available to subprocesses will obviously not be available to Python running as a subprocess of the shell, or vice versa. The env= keyword argument to subprocess methods allows you to define the environment of the subprocess as a dictionary, so that"s one way to make a Python variable visible to a subprocess). With shell=False you will need to understand how to remove any quotes; for example, cd "$HOME" is equivalent to os.chdir(os.environ["HOME"]) without quotes around the directory name. (Very often cd is not useful or necessary anyway, and many beginners omit the double quotes around the variable and get away with it until one day ...)
  • Redirection allows you to read from a file as your standard input, and write your standard output to a file. grep "foo" <inputfile >outputfile opens outputfile for writing and inputfile for reading, and passes its contents as standard input to grep, whose standard output then lands in outputfile. This is not generally hard to replace with native Python code.
  • Pipelines are a form of redirection. echo foo | nl runs two subprocesses, where the standard output of echo is the standard input of nl (on the OS level, in Unix-like systems, this is a single file handle). If you cannot replace one or both ends of the pipeline with native Python code, perhaps think about using a shell after all, especially if the pipeline has more than two or three processes (though look at the pipes module in the Python standard library or a number of more modern and versatile third-party competitors).
  • Job control lets you interrupt jobs, run them in the background, return them to the foreground, etc. The basic Unix signals to stop and continue a process are of course available from Python, too. But jobs are a higher-level abstraction in the shell which involve process groups etc which you have to understand if you want to do something like this from Python.
  • Quoting in the shell is potentially confusing until you understand that everything is basically a string. So ls -l / is equivalent to "ls" "-l" "/" but the quoting around literals is completely optional. Unquoted strings which contain shell metacharacters undergo parameter expansion, whitespace tokenization and wildcard expansion; double quotes prevent whitespace tokenization and wildcard expansion but allow parameter expansions (variable substitution, command substitution, and backslash processing). This is simple in theory but can get bewildering, especially when there are several layers of interpretation (a remote shell command, for example).

Understand differences between sh and Bash

subprocess runs your shell commands with /bin/sh unless you specifically request otherwise (except of course on Windows, where it uses the value of the COMSPEC variable). This means that various Bash-only features like arrays, [[ etc are not available.

If you need to use Bash-only syntax, you can pass in the path to the shell as executable="/bin/bash" (where of course if your Bash is installed somewhere else, you need to adjust the path).

subprocess.run("""
    # This for loop syntax is Bash only
    for((i=1;i<=$#;i++)); do
        # Arrays are Bash-only
        array[i]+=123
    done""",
    shell=True, check=True,
    executable="/bin/bash")

A subprocess is separate from its parent, and cannot change it

A somewhat common mistake is doing something like

subprocess.run("cd /tmp", shell=True)
subprocess.run("pwd", shell=True)  # Oops, doesn"t print /tmp

The same thing will happen if the first subprocess tries to set an environment variable, which of course will have disappeared when you run another subprocess, etc.

A child process runs completely separate from Python, and when it finishes, Python has no idea what it did (apart from the vague indicators that it can infer from the exit status and output from the child process). A child generally cannot change the parent"s environment; it cannot set a variable, change the working directory, or, in so many words, communicate with its parent without cooperation from the parent.

The immediate fix in this particular case is to run both commands in a single subprocess;

subprocess.run("cd /tmp; pwd", shell=True)

though obviously this particular use case isn"t very useful; instead, use the cwd keyword argument, or simply os.chdir() before running the subprocess. Similarly, for setting a variable, you can manipulate the environment of the current process (and thus also its children) via

os.environ["foo"] = "bar"

or pass an environment setting to a child process with

subprocess.run("echo "$foo"", shell=True, env={"foo": "bar"})

(not to mention the obvious refactoring subprocess.run(["echo", "bar"]); but echo is a poor example of something to run in a subprocess in the first place, of course).

Don"t run Python from Python

This is slightly dubious advice; there are certainly situations where it does make sense or is even an absolute requirement to run the Python interpreter as a subprocess from a Python script. But very frequently, the correct approach is simply to import the other Python module into your calling script and call its functions directly.

If the other Python script is under your control, and it isn"t a module, consider turning it into one. (This answer is too long already so I will not delve into details here.)

If you need parallelism, you can run Python functions in subprocesses with the multiprocessing module. There is also threading which runs multiple tasks in a single process (which is more lightweight and gives you more control, but also more constrained in that threads within a process are tightly coupled, and bound to a single GIL.)

Answer #9

urllib has been split up in Python 3.

The urllib.urlencode() function is now urllib.parse.urlencode(),

the urllib.urlopen() function is now urllib.request.urlopen().

Answer #10

Distribution Fitting with Sum of Square Error (SSE)

This is an update and modification to Saullo"s answer, that uses the full list of the current scipy.stats distributions and returns the distribution with the least SSE between the distribution"s histogram and the data"s histogram.

Example Fitting

Using the El Niño dataset from statsmodels, the distributions are fit and error is determined. The distribution with the least error is returned.

All Distributions

All Fitted Distributions

Best Fit Distribution

Best Fit Distribution

Example Code

%matplotlib inline

import warnings
import numpy as np
import pandas as pd
import scipy.stats as st
import statsmodels.api as sm
from scipy.stats._continuous_distns import _distn_names
import matplotlib
import matplotlib.pyplot as plt

matplotlib.rcParams["figure.figsize"] = (16.0, 12.0)
matplotlib.style.use("ggplot")

# Create models from data
def best_fit_distribution(data, bins=200, ax=None):
    """Model data by finding best fit distribution to data"""
    # Get histogram of original data
    y, x = np.histogram(data, bins=bins, density=True)
    x = (x + np.roll(x, -1))[:-1] / 2.0

    # Best holders
    best_distributions = []

    # Estimate distribution parameters from data
    for ii, distribution in enumerate([d for d in _distn_names if not d in ["levy_stable", "studentized_range"]]):

        print("{:>3} / {:<3}: {}".format( ii+1, len(_distn_names), distribution ))

        distribution = getattr(st, distribution)

        # Try to fit the distribution
        try:
            # Ignore warnings from data that can"t be fit
            with warnings.catch_warnings():
                warnings.filterwarnings("ignore")
                
                # fit dist to data
                params = distribution.fit(data)

                # Separate parts of parameters
                arg = params[:-2]
                loc = params[-2]
                scale = params[-1]
                
                # Calculate fitted PDF and error with fit in distribution
                pdf = distribution.pdf(x, loc=loc, scale=scale, *arg)
                sse = np.sum(np.power(y - pdf, 2.0))
                
                # if axis pass in add to plot
                try:
                    if ax:
                        pd.Series(pdf, x).plot(ax=ax)
                    end
                except Exception:
                    pass

                # identify if this distribution is better
                best_distributions.append((distribution, params, sse))
        
        except Exception:
            pass

    
    return sorted(best_distributions, key=lambda x:x[2])

def make_pdf(dist, params, size=10000):
    """Generate distributions"s Probability Distribution Function """

    # Separate parts of parameters
    arg = params[:-2]
    loc = params[-2]
    scale = params[-1]

    # Get sane start and end points of distribution
    start = dist.ppf(0.01, *arg, loc=loc, scale=scale) if arg else dist.ppf(0.01, loc=loc, scale=scale)
    end = dist.ppf(0.99, *arg, loc=loc, scale=scale) if arg else dist.ppf(0.99, loc=loc, scale=scale)

    # Build PDF and turn into pandas Series
    x = np.linspace(start, end, size)
    y = dist.pdf(x, loc=loc, scale=scale, *arg)
    pdf = pd.Series(y, x)

    return pdf

# Load data from statsmodels datasets
data = pd.Series(sm.datasets.elnino.load_pandas().data.set_index("YEAR").values.ravel())

# Plot for comparison
plt.figure(figsize=(12,8))
ax = data.plot(kind="hist", bins=50, density=True, alpha=0.5, color=list(matplotlib.rcParams["axes.prop_cycle"])[1]["color"])

# Save plot limits
dataYLim = ax.get_ylim()

# Find best fit distribution
best_distibutions = best_fit_distribution(data, 200, ax)
best_dist = best_distibutions[0]

# Update plots
ax.set_ylim(dataYLim)
ax.set_title(u"El Niño sea temp.
 All Fitted Distributions")
ax.set_xlabel(u"Temp (°C)")
ax.set_ylabel("Frequency")

# Make PDF with best params 
pdf = make_pdf(best_dist[0], best_dist[1])

# Display
plt.figure(figsize=(12,8))
ax = pdf.plot(lw=2, label="PDF", legend=True)
data.plot(kind="hist", bins=50, density=True, alpha=0.5, label="Data", legend=True, ax=ax)

param_names = (best_dist[0].shapes + ", loc, scale").split(", ") if best_dist[0].shapes else ["loc", "scale"]
param_str = ", ".join(["{}={:0.2f}".format(k,v) for k,v in zip(param_names, best_dist[1])])
dist_str = "{}({})".format(best_dist[0].name, param_str)

ax.set_title(u"El Niño sea temp. with best fit distribution 
" + dist_str)
ax.set_xlabel(u"Temp. (°C)")
ax.set_ylabel("Frequency")

Tutorials