Understanding a Python List | Splitting an array three-way around a given range: StackOverflow Questions

Removing white space around a saved image in matplotlib

I need to take an image and save it after some processing. The figure looks fine when I display it, but after saving it I get some white space around the image. I have tried the "tight" option for the savefig method; that did not work either. The code:

  import matplotlib.image as mpimg
  import matplotlib.pyplot as plt

  fig = plt.figure(1)
  img = mpimg.imread(path)
  plt.imshow(img)
  ax=fig.add_subplot(1,1,1)

  extent = ax.get_window_extent().transformed(fig.dpi_scale_trans.inverted())
  plt.savefig("1.png", bbox_inches=extent)

  plt.axis("off") 
  plt.show()

I am trying to draw a basic graph with NetworkX on a figure and save it. I realized that it works without a graph, but when I add a graph I get white space around the saved image:

import matplotlib.image as mpimg
import matplotlib.pyplot as plt
import networkx as nx

G = nx.Graph()
G.add_node(1)
G.add_node(2)
G.add_node(3)
G.add_edge(1,3)
G.add_edge(1,2)
pos = {1:[100,120], 2:[200,300], 3:[50,75]}

fig = plt.figure(1)
img = mpimg.imread("image.jpg")
plt.imshow(img)
ax=fig.add_subplot(1,1,1)

nx.draw(G, pos=pos)

extent = ax.get_window_extent().transformed(fig.dpi_scale_trans.inverted())
plt.savefig("1.png", bbox_inches = extent)

plt.axis("off") 
plt.show()
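
A commonly suggested variant (a hedged sketch, not taken from the post above) is to draw the image and the graph on a single axes, switch the axis off before saving, and let savefig crop the figure tightly:

import matplotlib.image as mpimg
import matplotlib.pyplot as plt
import networkx as nx

G = nx.Graph()
G.add_edges_from([(1, 3), (1, 2)])
pos = {1: [100, 120], 2: [200, 300], 3: [50, 75]}

fig, ax = plt.subplots()
ax.imshow(mpimg.imread("image.jpg"))
nx.draw(G, pos=pos, ax=ax)
ax.set_axis_off()

# Crop to the drawn content and drop the surrounding padding when saving
fig.savefig("1.png", bbox_inches="tight", pad_inches=0)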

PEP 8, why no spaces around "=" in keyword argument or a default parameter value?

Why does PEP 8 recommend not having spaces around = in a keyword argument or a default parameter value?

Is this inconsistent with recommending spaces around every other occurrence of = in Python code?

How is:

func(1, 2, very_long_variable_name=another_very_long_variable_name)

better than:

func(1, 2, very_long_variable_name = another_very_long_variable_name)

Any links to discussion/explanation by Python's BDFL will be appreciated.

Mind, this question is more about kwargs than default values; I just used the phrasing from PEP 8.

I'm not soliciting opinions. I'm asking for reasons behind this decision. It's more like asking why I would use { on the same line as an if statement in a C program, not whether I should do it or not.

Answer #1

Since this question was asked in 2010, there has been real simplification in how to do simple multithreading with Python with map and pool.

The code below comes from an article/blog post that you should definitely check out (no affiliation) - Parallelism in one line: A Better Model for Day to Day Threading Tasks. I'll summarize below - it ends up being just a few lines of code:

from multiprocessing.dummy import Pool as ThreadPool
pool = ThreadPool(4)
results = pool.map(my_function, my_array)

Which is the multithreaded version of:

results = []
for item in my_array:
    results.append(my_function(item))

Description

Map is a cool little function, and the key to easily injecting parallelism into your Python code. For those unfamiliar, map is something lifted from functional languages like Lisp. It is a function which maps another function over a sequence.

Map handles the iteration over the sequence for us, applies the function, and stores all of the results in a handy list at the end.
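
For instance, a tiny illustration of the built-in map (plain Python, nothing parallel yet):

def square(x):
    return x * x

results = list(map(square, [1, 2, 3, 4]))
# results == [1, 4, 9, 16]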


Implementation

Parallel versions of the map function are provided by two libraries: multiprocessing, and its little-known but equally fantastic stepchild, multiprocessing.dummy.

multiprocessing.dummy is exactly the same as the multiprocessing module, but uses threads instead (an important distinction - use multiple processes for CPU-intensive tasks, and threads for (and during) I/O):

multiprocessing.dummy replicates the API of multiprocessing, but is no more than a wrapper around the threading module.

import urllib2  # Python 2; in Python 3, use urllib.request.urlopen instead
from multiprocessing.dummy import Pool as ThreadPool

urls = [
  "http://www.python.org",
  "http://www.python.org/about/",
  "http://www.onlamp.com/pub/a/python/2003/04/17/metaclasses.html",
  "http://www.python.org/doc/",
  "http://www.python.org/download/",
  "http://www.python.org/getit/",
  "http://www.python.org/community/",
  "https://wiki.python.org/moin/",
]

# Make the Pool of workers
pool = ThreadPool(4)

# Open the URLs in their own threads
# and return the results
results = pool.map(urllib2.urlopen, urls)

# Close the pool and wait for the work to finish
pool.close()
pool.join()

And the timing results:

Single thread:   14.4 seconds
       4 Pool:   3.1 seconds
       8 Pool:   1.4 seconds
      13 Pool:   1.3 seconds

Passing multiple arguments (works like this only in Python 3.3 and later):

To pass multiple arrays:

results = pool.starmap(function, zip(list_a, list_b))

Or to pass a constant and an array:

results = pool.starmap(function, zip(itertools.repeat(constant), list_a))

If you are using an earlier version of Python, you can pass multiple arguments via a small workaround; one possible sketch follows.
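
One possible sketch of such a workaround (the helper name call_with_args is made up here) is a small wrapper that unpacks an argument tuple so that plain pool.map can be used:

import itertools

def call_with_args(args):
    # Unpack a (constant, item) tuple and forward it to the real function
    return function(*args)

results = pool.map(call_with_args, zip(itertools.repeat(constant), list_a))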

(Thanks to user136036 for the helpful comment.)

Answer #2

How to iterate over rows in a DataFrame in Pandas?

Answer: DON"T*!

Iteration in Pandas is an anti-pattern and is something you should only do when you have exhausted every other option. You should not use any function with "iter" in its name for more than a few thousand rows or you will have to get used to a lot of waiting.

Do you want to print a DataFrame? Use DataFrame.to_string().

Do you want to compute something? In that case, search for methods in this order (list modified from here):

  1. Vectorization
  2. Cython routines
  3. List Comprehensions (vanilla for loop)
  4. DataFrame.apply(): i)  Reductions that can be performed in Cython, ii) Iteration in Python space
  5. DataFrame.itertuples() and iteritems()
  6. DataFrame.iterrows()

iterrows and itertuples (both receiving many votes in answers to this question) should be used in very rare circumstances, such as generating row objects/namedtuples for sequential processing, which is really the only thing these functions are useful for.

Appeal to Authority

The documentation page on iteration has a huge red warning box that says:

Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is not needed [...].

* It"s actually a little more complicated than "don"t". df.iterrows() is the correct answer to this question, but "vectorize your ops" is the better one. I will concede that there are circumstances where iteration cannot be avoided (for example, some operations where the result depends on the value computed for the previous row). However, it takes some familiarity with the library to know when. If you"re not sure whether you need an iterative solution, you probably don"t. PS: To know more about my rationale for writing this answer, skip to the very bottom.


Faster than Looping: Vectorization, Cython

A good number of basic operations and computations are "vectorised" by pandas (either through NumPy, or through Cythonized functions). This includes arithmetic, comparisons, (most) reductions, reshaping (such as pivoting), joins, and groupby operations. Look through the documentation on Essential Basic Functionality to find a suitable vectorised method for your problem.

If none exists, feel free to write your own using custom Cython extensions.


Next Best Thing: List Comprehensions*

List comprehensions should be your next port of call if 1) there is no vectorized solution available, 2) performance is important, but not important enough to go through the hassle of cythonizing your code, and 3) you're trying to perform elementwise transformation on your code. There is a good amount of evidence to suggest that list comprehensions are sufficiently fast (and even sometimes faster) for many common Pandas tasks.

The formula is simple,

# Iterating over one column - `f` is some function that processes your data
result = [f(x) for x in df["col"]]
# Iterating over two columns, use `zip`
result = [f(x, y) for x, y in zip(df["col1"], df["col2"])]
# Iterating over multiple columns - same data type
result = [f(row[0], ..., row[n]) for row in df[["col1", ...,"coln"]].to_numpy()]
# Iterating over multiple columns - differing data type
result = [f(row[0], ..., row[n]) for row in zip(df["col1"], ..., df["coln"])]

If you can encapsulate your business logic into a function, you can use a list comprehension that calls it. You can make arbitrarily complex things work through the simplicity and speed of raw Python code.

Caveats

List comprehensions assume that your data is easy to work with - what that means is your data types are consistent and you don't have NaNs, but this cannot always be guaranteed.

  1. The first one is more obvious, but when dealing with NaNs, prefer in-built pandas methods if they exist (because they have much better corner-case handling logic), or ensure your business logic includes appropriate NaN handling logic.
  2. When dealing with mixed data types you should iterate over zip(df["A"], df["B"], ...) instead of df[["A", "B"]].to_numpy(), as the latter implicitly upcasts data to the most common type. As an example, if A is numeric and B is a string, to_numpy() will cast the entire array to string, which may not be what you want. Fortunately, zipping your columns together is the most straightforward workaround to this; a short sketch follows this list.
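
A small sketch of the upcasting behaviour described in point 2 (column names are illustrative):

import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": ["x", "y"]})

df[["A", "B"]].to_numpy()
# array([[1, 'x'],
#        [2, 'y']], dtype=object)   # everything upcast to a single common dtype

list(zip(df["A"], df["B"]))
# [(1, 'x'), (2, 'y')]              # each value keeps its own type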

*Your mileage may vary for the reasons outlined in the Caveats section above.


An Obvious Example

Let"s demonstrate the difference with a simple example of adding two pandas columns A + B. This is a vectorizable operaton, so it will be easy to contrast the performance of the methods discussed above.

Benchmarking code, for your reference. The line at the bottom measures a function written in numpandas, a style of Pandas that mixes heavily with NumPy to squeeze out maximum performance. Writing numpandas code should be avoided unless you know what you're doing. Stick to the API where you can (i.e., prefer vec over vec_numpy).

I should mention, however, that it isn't always this cut and dried. Sometimes the answer to "what is the best method for an operation" is "it depends on your data". My advice is to test out different approaches on your data before settling on one.
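
To make the contrast concrete, here is a minimal, hedged sketch of the three approaches on the A + B example (timings will vary with your data and pandas version):

import pandas as pd

df = pd.DataFrame({"A": range(10_000), "B": range(10_000)})

# 1. Vectorized - usually the fastest and most idiomatic
s1 = df["A"] + df["B"]

# 2. List comprehension over the raw columns
s2 = pd.Series([a + b for a, b in zip(df["A"], df["B"])])

# 3. iterrows - typically the slowest of the three
s3 = pd.Series([row["A"] + row["B"] for _, row in df.iterrows()])

assert s1.equals(s2) and s1.equals(s3)   # same result, very different speeds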


Further Reading

* Pandas string methods are "vectorized" in the sense that they are specified on the series but operate on each element. The underlying mechanisms are still iterative, because string operations are inherently hard to vectorize.


Why I Wrote this Answer

A common trend I notice from new users is to ask questions of the form "How can I iterate over my df to do X?", showing code that calls iterrows() while doing something inside a for loop. Here is why. A new user to the library who has not been introduced to the concept of vectorization will likely envision the code that solves their problem as iterating over their data to do something. Not knowing how to iterate over a DataFrame, the first thing they do is Google it and end up here, at this question. They then see the accepted answer telling them how to, and they close their eyes and run this code without ever first questioning whether iteration is the right thing to do.

The aim of this answer is to help new users understand that iteration is not necessarily the solution to every problem, and that better, faster and more idiomatic solutions could exist, and that it is worth investing time in exploring them. I'm not trying to start a war of iteration vs. vectorization, but I want new users to be informed when developing solutions to their problems with this library.

Answer #3

pip install gensim config --global http.sslVerify false

Just install any package with the "config --global http.sslVerify false" statement appended, as above.

You can ignore SSL errors by setting pypi.org and files.pythonhosted.org as trusted hosts.

$ pip install --trusted-host pypi.org --trusted-host files.pythonhosted.org <package_name>

Note: Sometime during April 2018, the Python Package Index was migrated from pypi.python.org to pypi.org. This means "trusted-host" commands using the old domain no longer work.

Permanent Fix

Since the release of pip 10.0, you should be able to fix this permanently just by upgrading pip itself:

$ pip install --trusted-host pypi.org --trusted-host files.pythonhosted.org pip setuptools

Or by just reinstalling it to get the latest version:

$ curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py

(… and then running get-pip.py with the relevant Python interpreter).

pip install <otherpackage> should just work after this. If not, then you will need to do more, as explained below.


You may want to add the trusted hosts and proxy to your config file.

pip.ini (Windows) or pip.conf (unix)

[global]
trusted-host = pypi.python.org
               pypi.org
               files.pythonhosted.org
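
If you prefer not to edit the file by hand, newer pip releases (10 and later) also ship a pip config command; a hedged sketch, assuming your pip is recent enough:

$ pip config set global.trusted-host "pypi.org files.pythonhosted.org"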

Alternate Solutions (Less secure)

Most of the answers could pose a security issue.

Two workarounds that help in installing most Python packages with ease would be:

  • Using easy_install: if you are really lazy and don't want to waste much time, use easy_install <package_name>. Note that some packages won't be found or will give small errors.
  • Using Wheel: download the wheel of the Python package and use the pip command pip install wheel_package_name.whl to install the package.

Answer #4

I tested most suggested solutions with perfplot (a pet project of mine, essentially a wrapper around timeit), and found

import functools
import operator
functools.reduce(operator.iconcat, a, [])

to be the fastest solution, both when many small lists and few long lists are concatenated. (operator.iadd is equally fast.)

(Benchmark plots omitted: one for concatenating many small lists, one for a few long lists.)


Code to reproduce the plot:

import functools
import itertools
import numpy
import operator
import perfplot


def forfor(a):
    return [item for sublist in a for item in sublist]


def sum_brackets(a):
    return sum(a, [])


def functools_reduce(a):
    return functools.reduce(operator.concat, a)


def functools_reduce_iconcat(a):
    return functools.reduce(operator.iconcat, a, [])


def itertools_chain(a):
    return list(itertools.chain.from_iterable(a))


def numpy_flat(a):
    return list(numpy.array(a).flat)


def numpy_concatenate(a):
    return list(numpy.concatenate(a))


perfplot.show(
    setup=lambda n: [list(range(10))] * n,
    # setup=lambda n: [list(range(n))] * 10,
    kernels=[
        forfor,
        sum_brackets,
        functools_reduce,
        functools_reduce_iconcat,
        itertools_chain,
        numpy_flat,
        numpy_concatenate,
    ],
    n_range=[2 ** k for k in range(16)],
    xlabel="num lists (of length 10)",
    # xlabel="len lists (10 lists total)"
)

Answer #5

You can"t.

One workaround is to clone the environment under a new name and then remove the original one.

First, remember to deactivate your current environment. You can do this with the commands:

  • deactivate on Windows or
  • source deactivate on macOS/Linux.

Then:

conda create --name new_name --clone old_name
conda remove --name old_name --all # or its alias: `conda env remove --name old_name`

Note that this method has several drawbacks:

  1. It redownloads packages (you can use the --offline flag to disable it)
  2. Time consumed on copying the environment's files
  3. Temporary double disk usage

There is an open issue requesting this feature.

Answer #6

Watch out for the parentheses. As has been pointed out above, in Python 3 assert is still a statement, so, by analogy with print(..), one may extrapolate the same to assert(..) or raise(..), but you shouldn't.

This is wrong:

assert(2 + 2 == 5, "Houston we've got a problem")

This is correct:

assert 2 + 2 == 5, "Houston we've got a problem"

The reason the first one will not work is that bool( (False, "Houston we've got a problem") ) evaluates to True.

In the statement assert(False), these are just redundant parentheses around False, which evaluate to their contents. But with assert(False,) the parentheses are now a tuple, and a non-empty tuple evaluates to True in a boolean context.
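
A quick way to see why, in the REPL (recent Pythons also emit a SyntaxWarning about the parenthesised form):

>>> bool((False, "Houston we've got a problem"))
True
>>> assert (False, "Houston we've got a problem")   # the assertion never fires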

Answer #7

Using plt.rcParams

There is also this workaround in case you want to change the size without using the figure environment. So if you are using plt.plot(), for example, you can set a tuple with width and height.

import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (20,3)

This is very useful when you plot inline (e.g., with IPython Notebook). As asmaier noticed, it is preferable not to put this statement in the same cell as the import statements.

To reset the global figure size back to default for subsequent plots:

plt.rcParams["figure.figsize"] = plt.rcParamsDefault["figure.figsize"]

Conversion to cm

The figsize tuple accepts inches, so if you want to set it in centimetres you have to divide them by 2.54. Have a look at this question.
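
For example, a small sketch of the centimetre conversion:

import matplotlib.pyplot as plt

cm = 1 / 2.54  # centimetres to inches
plt.rcParams["figure.figsize"] = (20 * cm, 3 * cm)  # 20 cm wide, 3 cm tall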

Answer #8

df.to_numpy() is better than df.values, here's why.*

It's time to deprecate your usage of values and as_matrix().

pandas v0.24.0 introduced two new methods for obtaining NumPy arrays from pandas objects:

  1. to_numpy(), which is defined on Index, Series, and DataFrame objects, and
  2. array, which is defined on Index and Series objects only.

If you visit the v0.24 docs for .values, you will see a big red warning that says:

Warning: We recommend using DataFrame.to_numpy() instead.

See this section of the v0.24.0 release notes, and this answer for more information.

* - to_numpy() is my recommended method for any production code that needs to run reliably for many versions into the future. However, if you're just making a scratchpad in Jupyter or the terminal, using .values to save a few milliseconds of typing is a permissible exception. You can always add the fit and finish later.



Towards Better Consistency: to_numpy()

In the spirit of better consistency throughout the API, a new method to_numpy has been introduced to extract the underlying NumPy array from DataFrames.

# Setup
import numpy as np
import pandas as pd

df = pd.DataFrame(data={"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]},
                  index=["a", "b", "c"])

# Convert the entire DataFrame
df.to_numpy()
# array([[1, 4, 7],
#        [2, 5, 8],
#        [3, 6, 9]])

# Convert specific columns
df[["A", "C"]].to_numpy()
# array([[1, 7],
#        [2, 8],
#        [3, 9]])

As mentioned above, this method is also defined on Index and Series objects (see here).

df.index.to_numpy()
# array(["a", "b", "c"], dtype=object)

df["A"].to_numpy()
#  array([1, 2, 3])
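
For completeness, the array attribute mentioned above returns the backing ExtensionArray rather than a NumPy array (the exact repr and class name vary across pandas versions):

df["A"].array
# <PandasArray>
# [1, 2, 3]
# Length: 3, dtype: int64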

By default, a view is returned, so any modifications made will affect the original.

v = df.to_numpy()
v[0, 0] = -1
 
df
   A  B  C
a -1  4  7
b  2  5  8
c  3  6  9

If you need a copy instead, use to_numpy(copy=True).


pandas >= 1.0 update for ExtensionTypes

If you"re using pandas 1.x, chances are you"ll be dealing with extension types a lot more. You"ll have to be a little more careful that these extension types are correctly converted.

a = pd.array([1, 2, None], dtype="Int64")                                  
a                                                                          

<IntegerArray>
[1, 2, <NA>]
Length: 3, dtype: Int64 

# Wrong
a.to_numpy()                                                               
# array([1, 2, <NA>], dtype=object)  # yuck, objects

# Correct
a.to_numpy(dtype="float", na_value=np.nan)                                 
# array([ 1.,  2., nan])

# Also correct
a.to_numpy(dtype="int", na_value=-1)
# array([ 1,  2, -1])

This is called out in the docs.


If you need the dtypes in the result...

As shown in another answer, DataFrame.to_records is a good way to do this.

df.to_records()
# rec.array([("a", 1, 4, 7), ("b", 2, 5, 8), ("c", 3, 6, 9)],
#           dtype=[("index", "O"), ("A", "<i8"), ("B", "<i8"), ("C", "<i8")])

This cannot be done with to_numpy, unfortunately. However, as an alternative, you can use np.rec.fromrecords:

v = df.reset_index()
np.rec.fromrecords(v, names=v.columns.tolist())
# rec.array([("a", 1, 4, 7), ("b", 2, 5, 8), ("c", 3, 6, 9)],
#           dtype=[("index", "<U1"), ("A", "<i8"), ("B", "<i8"), ("C", "<i8")])

Performance wise, it"s nearly the same (actually, using rec.fromrecords is a bit faster).

df2 = pd.concat([df] * 10000)

%timeit df2.to_records()
%%timeit
v = df2.reset_index()
np.rec.fromrecords(v, names=v.columns.tolist())

12.9 ms ± 511 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
9.56 ms ± 291 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Rationale for Adding a New Method

to_numpy() (in addition to array) was added as a result of discussions under two GitHub issues GH19954 and GH23623.

Specifically, the docs mention the rationale:

[...] with .values it was unclear whether the returned value would be the actual array, some transformation of it, or one of pandas custom arrays (like Categorical). For example, with PeriodIndex, .values generates a new ndarray of period objects each time. [...]

to_numpy aims to improve the consistency of the API, which is a major step in the right direction. .values will not be deprecated in the current version, but I expect this may happen at some point in the future, so I would urge users to migrate towards the newer API, as soon as you can.



Critique of Other Solutions

DataFrame.values has inconsistent behaviour, as already noted.

DataFrame.get_values() is simply a wrapper around DataFrame.values, so everything said above applies.

DataFrame.as_matrix() is deprecated now, do NOT use!

Answer #9

Explanation

From PEP 328

Relative imports use a module"s __name__ attribute to determine that module"s position in the package hierarchy. If the module"s name does not contain any package information (e.g. it is set to "__main__") then relative imports are resolved as if the module were a top level module, regardless of where the module is actually located on the file system.

At some point PEP 338 conflicted with PEP 328:

... relative imports rely on __name__ to determine the current module's position in the package hierarchy. In a main module, the value of __name__ is always "__main__", so explicit relative imports will always fail (as they only work for a module inside a package)

and to address the issue, PEP 366 introduced the top level variable __package__:

By adding a new module level attribute, this PEP allows relative imports to work automatically if the module is executed using the -m switch. A small amount of boilerplate in the module itself will allow the relative imports to work when the file is executed by name. [...] When it [the attribute] is present, relative imports will be based on this attribute rather than the module __name__ attribute. [...] When the main module is specified by its filename, then the __package__ attribute will be set to None. [...] When the import system encounters an explicit relative import in a module without __package__ set (or with it set to None), it will calculate and store the correct value (__name__.rpartition(".")[0] for normal modules and __name__ for package initialisation modules)

(emphasis mine)

If the __name__ is "__main__", __name__.rpartition(".")[0] returns an empty string. This is why there's an empty string literal in the error description:

SystemError: Parent module "" not loaded, cannot perform relative import

The relevant part of the CPython"s PyImport_ImportModuleLevelObject function:

if (PyDict_GetItem(interp->modules, package) == NULL) {
    PyErr_Format(PyExc_SystemError,
            "Parent module %R not loaded, cannot perform relative "
            "import", package);
    goto error;
}

CPython raises this exception if it was unable to find package (the name of the package) in interp->modules (accessible as sys.modules). Since sys.modules is "a dictionary that maps module names to modules which have already been loaded", it's now clear that the parent module must be explicitly absolute-imported before performing a relative import.

Note: The patch from the issue 18018 has added another if block, which will be executed before the code above:

if (PyUnicode_CompareWithASCIIString(package, "") == 0) {
    PyErr_SetString(PyExc_ImportError,
            "attempted relative import with no known parent package");
    goto error;
} /* else if (PyDict_GetItem(interp->modules, package) == NULL) {
    ...
*/

If package (same as above) is empty string, the error message will be

ImportError: attempted relative import with no known parent package

However, you will only see this in Python 3.6 or newer.

Solution #1: Run your script using -m

Consider a directory (which is a Python package):

.
├── package
│   ├── __init__.py
│   ├── module.py
│   └── standalone.py

All of the files in package begin with the same 2 lines of code:

from pathlib import Path
print("Running" if __name__ == "__main__" else "Importing", Path(__file__).resolve())

I"m including these two lines only to make the order of operations obvious. We can ignore them completely, since they don"t affect the execution.

__init__.py and module.py contain only those two lines (i.e., they are effectively empty).

standalone.py additionally attempts to import module.py via relative import:

from . import module  # explicit relative import

We"re well aware that /path/to/python/interpreter package/standalone.py will fail. However, we can run the module with the -m command line option that will "search sys.path for the named module and execute its contents as the __main__ module":

$ python3 -i -m package.standalone
Importing /home/vaultah/package/__init__.py
Running /home/vaultah/package/standalone.py
Importing /home/vaultah/package/module.py
>>> __file__
"/home/vaultah/package/standalone.py"
>>> __package__
"package"
>>> # The __package__ has been correctly set and module.py has been imported.
... # What"s inside sys.modules?
... import sys
>>> sys.modules["__main__"]
<module "package.standalone" from "/home/vaultah/package/standalone.py">
>>> sys.modules["package.module"]
<module "package.module" from "/home/vaultah/package/module.py">
>>> sys.modules["package"]
<module "package" from "/home/vaultah/package/__init__.py">

-m does all the importing stuff for you and automatically sets __package__, but you can also do that yourself, as shown in the next solution.

Solution #2: Set __package__ manually

Please treat it as a proof of concept rather than an actual solution. It isn't well-suited for use in real-world code.

PEP 366 has a workaround to this problem, however, it's incomplete, because setting __package__ alone is not enough. You're going to need to import at least N preceding packages in the module hierarchy, where N is the number of parent directories (relative to the directory of the script) that will be searched for the module being imported.

Thus,

  1. Add the parent directory of the Nth predecessor of the current module to sys.path

  2. Remove the current file"s directory from sys.path

  3. Import the parent module of the current module using its fully-qualified name

  4. Set __package__ to the fully-qualified name from step 3

  5. Perform the relative import

I"ll borrow files from the Solution #1 and add some more subpackages:

package
├── __init__.py
├── module.py
└── subpackage
    ├── __init__.py
    └── subsubpackage
        ├── __init__.py
        └── standalone.py

This time standalone.py will import module.py from the package package using the following relative import

from ... import module  # N = 3

We"ll need to precede that line with the boilerplate code, to make it work.

import sys
from pathlib import Path

if __name__ == "__main__" and __package__ is None:
    file = Path(__file__).resolve()
    parent, top = file.parent, file.parents[3]

    sys.path.append(str(top))
    try:
        sys.path.remove(str(parent))
    except ValueError: # Already removed
        pass

    import package.subpackage.subsubpackage
    __package__ = "package.subpackage.subsubpackage"

from ... import module # N = 3

It allows us to execute standalone.py by filename:

$ python3 package/subpackage/subsubpackage/standalone.py
Running /home/vaultah/package/subpackage/subsubpackage/standalone.py
Importing /home/vaultah/package/__init__.py
Importing /home/vaultah/package/subpackage/__init__.py
Importing /home/vaultah/package/subpackage/subsubpackage/__init__.py
Importing /home/vaultah/package/module.py

A more general solution wrapped in a function can be found here. Example usage:

if __name__ == "__main__" and __package__ is None:
    import_parents(level=3) # N = 3

from ... import module
from ...module.submodule import thing

Solution #3: Use absolute imports and setuptools

The steps are -

  1. Replace explicit relative imports with equivalent absolute imports

  2. Install package to make it importable

For instance, the directory structure may be as follows

.
├── project
│   ├── package
│   │   ├── __init__.py
│   │   ├── module.py
│   │   └── standalone.py
│   └── setup.py

where setup.py is

from setuptools import setup, find_packages
setup(
    name = "your_package_name",
    packages = find_packages(),
)

The rest of the files were borrowed from Solution #1.

Installation will allow you to import the package regardless of your working directory (assuming there'll be no naming issues).

We can modify standalone.py to use this advantage (step 1):

from package import module  # absolute import

Change your working directory to project and run /path/to/python/interpreter setup.py install --user (--user installs the package in your site-packages directory) (step 2):

$ cd project
$ python3 setup.py install --user

Let"s verify that it"s now possible to run standalone.py as a script:

$ python3 -i package/standalone.py
Running /home/vaultah/project/package/standalone.py
Importing /home/vaultah/.local/lib/python3.6/site-packages/your_package_name-0.0.0-py3.6.egg/package/__init__.py
Importing /home/vaultah/.local/lib/python3.6/site-packages/your_package_name-0.0.0-py3.6.egg/package/module.py
>>> module
<module "package.module" from "/home/vaultah/.local/lib/python3.6/site-packages/your_package_name-0.0.0-py3.6.egg/package/module.py">
>>> import sys
>>> sys.modules["package"]
<module "package" from "/home/vaultah/.local/lib/python3.6/site-packages/your_package_name-0.0.0-py3.6.egg/package/__init__.py">
>>> sys.modules["package.module"]
<module "package.module" from "/home/vaultah/.local/lib/python3.6/site-packages/your_package_name-0.0.0-py3.6.egg/package/module.py">

Note: If you decide to go down this route, you'd be better off using virtual environments to install packages in isolation.
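
For instance, a hedged sketch of that workflow using the standard venv module and an editable install:

$ python3 -m venv .venv
$ source .venv/bin/activate
$ pip install -e .    # editable install from the directory containing setup.py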

Solution #4: Use absolute imports and some boilerplate code

Frankly, the installation is not necessary - you could add some boilerplate code to your script to make absolute imports work.

I"m going to borrow files from Solution #1 and change standalone.py:

  1. Add the parent directory of package to sys.path before attempting to import anything from package using absolute imports:

    import sys
    from pathlib import Path # if you haven"t already done so
    file = Path(__file__).resolve()
    parent, root = file.parent, file.parents[1]
    sys.path.append(str(root))
    
    # Additionally remove the current file"s directory from sys.path
    try:
        sys.path.remove(str(parent))
    except ValueError: # Already removed
        pass
    
  2. Replace the relative import by the absolute import:

    from package import module  # absolute import
    

standalone.py runs without problems:

$ python3 -i package/standalone.py
Running /home/vaultah/package/standalone.py
Importing /home/vaultah/package/__init__.py
Importing /home/vaultah/package/module.py
>>> module
<module "package.module" from "/home/vaultah/package/module.py">
>>> import sys
>>> sys.modules["package"]
<module "package" from "/home/vaultah/package/__init__.py">
>>> sys.modules["package.module"]
<module "package.module" from "/home/vaultah/package/module.py">

I feel that I should warn you: try not to do this, especially if your project has a complex structure.


As a side note, PEP 8 recommends the use of absolute imports, but states that in some scenarios explicit relative imports are acceptable:

Absolute imports are recommended, as they are usually more readable and tend to be better behaved (or at least give better error messages). [...] However, explicit relative imports are an acceptable alternative to absolute imports, especially when dealing with complex package layouts where using absolute imports would be unnecessarily verbose.

Answer #10

What"s the pythonic way to use getters and setters?

The "Pythonic" way is not to use "getters" and "setters", but to use plain attributes, like the question demonstrates, and del for deleting (but the names are changed to protect the innocent... builtins):

value = "something"

obj.attribute = value  
value = obj.attribute
del obj.attribute

If later, you want to modify the setting and getting, you can do so without having to alter user code, by using the property decorator:

class Obj:
    """property demo"""
    #
    @property            # first decorate the getter method
    def attribute(self): # This getter method name is *the* name
        return self._attribute
    #
    @attribute.setter    # the property decorates with `.setter` now
    def attribute(self, value):   # name, e.g. "attribute", is the same
        self._attribute = value   # the "value" name isn"t special
    #
    @attribute.deleter     # decorate with `.deleter`
    def attribute(self):   # again, the method name is the same
        del self._attribute

(Each decorator usage copies and updates the prior property object, so note that you should use the same name for each set, get, and delete function/method.)

After defining the above, the original setting, getting, and deleting code is the same:

obj = Obj()
obj.attribute = value  
the_value = obj.attribute
del obj.attribute

You should avoid this:

def set_property(property,value):  
def get_property(property):  

Firstly, the above doesn"t work, because you don"t provide an argument for the instance that the property would be set to (usually self), which would be:

class Obj:

    def set_property(self, property, value): # don"t do this
        ...
    def get_property(self, property):        # don"t do this either
        ...

Secondly, this duplicates the purpose of two special methods, __setattr__ and __getattr__.

Thirdly, we also have the setattr and getattr builtin functions.

setattr(object, "property_name", value)
getattr(object, "property_name", default_value)  # default is optional

The @property decorator is for creating getters and setters.

For example, we could modify the setting behavior to place restrictions on the value being set:

class Protective(object):

    @property
    def protected_value(self):
        return self._protected_value

    @protected_value.setter
    def protected_value(self, value):
        if acceptable(value): # e.g. type or range check
            self._protected_value = value

In general, we want to avoid using property and just use direct attributes.

This is what is expected by users of Python. Following the rule of least-surprise, you should try to give your users what they expect unless you have a very compelling reason to the contrary.

Demonstration

For example, say we needed our object"s protected attribute to be an integer between 0 and 100 inclusive, and prevent its deletion, with appropriate messages to inform the user of its proper usage:

class Protective(object):
    """protected property demo"""
    #
    def __init__(self, start_protected_value=0):
        self.protected_value = start_protected_value
    # 
    @property
    def protected_value(self):
        return self._protected_value
    #
    @protected_value.setter
    def protected_value(self, value):
        if value != int(value):
            raise TypeError("protected_value must be an integer")
        if 0 <= value <= 100:
            self._protected_value = int(value)
        else:
            raise ValueError("protected_value must be " +
                             "between 0 and 100 inclusive")
    #
    @protected_value.deleter
    def protected_value(self):
        raise AttributeError("do not delete, protected_value can be set to 0")

(Note that __init__ refers to self.protected_value but the property methods refer to self._protected_value. This is so that __init__ uses the property through the public API, ensuring it is "protected".)

And usage:

>>> p1 = Protective(3)
>>> p1.protected_value
3
>>> p1 = Protective(5.0)
>>> p1.protected_value
5
>>> p2 = Protective(-5)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 3, in __init__
  File "<stdin>", line 15, in protected_value
ValueError: protected_value must be between 0 and 100 inclusive
>>> p1.protected_value = 7.3
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 17, in protected_value
TypeError: protected_value must be an integer
>>> p1.protected_value = 101
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 15, in protected_value
ValueError: protected_value must be between 0 and 100 inclusive
>>> del p1.protected_value
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 18, in protected_value
AttributeError: do not delete, protected_value can be set to 0

Do the names matter?

Yes they do. .setter and .deleter make copies of the original property. This allows subclasses to properly modify behavior without altering the behavior in the parent.

class Obj:
    """property demo"""
    #
    @property
    def get_only(self):
        return self._attribute
    #
    @get_only.setter
    def get_or_set(self, value):
        self._attribute = value
    #
    @get_or_set.deleter
    def get_set_or_delete(self):
        del self._attribute

Now for this to work, you have to use the respective names:

obj = Obj()
# obj.get_only = "value" # would error
obj.get_or_set = "value"  
obj.get_set_or_delete = "new value"
the_value = obj.get_only
del obj.get_set_or_delete
# del obj.get_or_set # would error

I"m not sure where this would be useful, but the use-case is if you want a get, set, and/or delete-only property. Probably best to stick to semantically same property having the same name.

Conclusion

Start with simple attributes.

If you later need functionality around the setting, getting, and deleting, you can add it with the property decorator.

Avoid functions named set_... and get_... - that"s what properties are for.

Understanding a Python List | Splitting an array three-way around a given range: StackOverflow Questions

How do you split a list into evenly sized chunks?

Question by jespern

I have a list of arbitrary length, and I need to split it up into equal size chunks and operate on it. There are some obvious ways to do this, like keeping a counter and two lists, and when the second list fills up, add it to the first list and empty the second list for the next round of data, but this is potentially extremely expensive.

I was wondering if anyone had a good solution to this for lists of any length, e.g. using generators.

I was looking for something useful in itertools but I couldn't find anything obviously useful. Might've missed it, though.

Related question: What is the most “pythonic” way to iterate over a list in chunks?

Split Strings into words with multiple word boundary delimiters

I think what I want to do is a fairly common task but I've found no reference on the web. I have text with punctuation, and I want a list of the words.

"Hey, you - what are you doing here!?"

should be

["hey", "you", "what", "are", "you", "doing", "here"]

But Python"s str.split() only works with one argument, so I have all words with the punctuation after I split with whitespace. Any ideas?

Split string with multiple delimiters in Python

I found some answers online, but I have no experience with regular expressions, which I believe is what is needed here.

I have a string that needs to be split by either a ";" or a ", ". That is, it has to be either a semicolon or a comma followed by a space. Individual commas without trailing spaces should be left untouched.

Example string:

"b-staged divinylsiloxane-bis-benzocyclobutene [124221-30-3], mesitylene [000108-67-8]; polymerized 1,2-dihydro-2,2,4- trimethyl quinoline [026780-96-1]"

should be split into a list containing the following:

("b-staged divinylsiloxane-bis-benzocyclobutene [124221-30-3]" , "mesitylene [000108-67-8]", "polymerized 1,2-dihydro-2,2,4- trimethyl quinoline [026780-96-1]") 

How to split a string into a list?

I want my Python function to split a sentence (input) and store each word in a list. My current code splits the sentence, but does not store the words as a list. How do I do that?

def split_line(text):

    # split the text
    words = text.split()

    # for each word in the line:
    for word in words:

        # print the word
        print(words)

Split string on whitespace in Python

I"m looking for the Python equivalent of

String str = "many   fancy word 
hello    	hi";
String whiteSpaceRegex = "\s";
String[] words = str.split(whiteSpaceRegex);

["many", "fancy", "word", "hello", "hi"]

How to split a string into a list of characters in Python?

I"ve tried to look around the web for answers to splitting a string into a list of characters but I can"t seem to find a simple method.

str.split(//) does not seem to work like Ruby does. Is there a simple way of doing this without looping?

Split string every nth character?

Is it possible to split a string every nth character?

For example, suppose I have a string containing the following:

"1234567890"

How can I get it to look like this:

["12","34","56","78","90"]

Split by comma and strip whitespace in Python

I have some Python code that splits on comma, but doesn't strip the whitespace:

>>> string = "blah, lots  ,  of ,  spaces, here "
>>> mylist = string.split(",")
>>> print mylist
["blah", " lots  ", "  of ", "  spaces", " here "]

I would rather end up with whitespace removed like this:

["blah", "lots", "of", "spaces", "here"]

I am aware that I could loop through the list and strip() each item but, as this is Python, I'm guessing there's a quicker, easier and more elegant way of doing it.

Splitting on first occurrence

What would be the best way to split a string on the first occurrence of a delimiter?

For example:

"123mango abcd mango kiwi peach"

splitting on the first mango to get:

"abcd mango kiwi peach"

Split a list based on a condition?

What"s the best way, both aesthetically and from a performance perspective, to split a list of items into multiple lists based on a conditional? The equivalent of:

good = [x for x in mylist if x in goodvals]
bad  = [x for x in mylist if x not in goodvals]

is there a more elegant way to do this?

Update: here"s the actual use case, to better explain what I"m trying to do:

# files looks like: [ ("file1.jpg", 33L, ".jpg"), ("file2.avi", 999L, ".avi"), ... ]
IMAGE_TYPES = (".jpg",".jpeg",".gif",".bmp",".png")
images = [f for f in files if f[2].lower() in IMAGE_TYPES]
anims  = [f for f in files if f[2].lower() not in IMAGE_TYPES]

Answer #1

In Python, what is the purpose of __slots__ and what are the cases one should avoid this?

TLDR:

The special attribute __slots__ allows you to explicitly state which instance attributes you expect your object instances to have, with the expected results:

  1. faster attribute access.
  2. space savings in memory.

The space savings is from

  1. Storing value references in slots instead of __dict__.
  2. Denying __dict__ and __weakref__ creation if parent classes deny them and you declare __slots__.

Quick Caveats

Small caveat, you should only declare a particular slot one time in an inheritance tree. For example:

class Base:
    __slots__ = "foo", "bar"

class Right(Base):
    __slots__ = "baz", 

class Wrong(Base):
    __slots__ = "foo", "bar", "baz"        # redundant foo and bar

Python doesn"t object when you get this wrong (it probably should), problems might not otherwise manifest, but your objects will take up more space than they otherwise should. Python 3.8:

>>> from sys import getsizeof
>>> getsizeof(Right()), getsizeof(Wrong())
(56, 72)

This is because the Base"s slot descriptor has a slot separate from the Wrong"s. This shouldn"t usually come up, but it could:

>>> w = Wrong()
>>> w.foo = "foo"
>>> Base.foo.__get__(w)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: foo
>>> Wrong.foo.__get__(w)
"foo"

The biggest caveat is for multiple inheritance - multiple "parent classes with nonempty slots" cannot be combined.

To accommodate this restriction, follow best practices: factor out all but one (or all) of the parents' functionality into abstractions with empty __slots__; have each concrete parent inherit from its abstraction, and have your new concrete class inherit from those abstractions collectively (just like abstract base classes in the standard library).

See section on multiple inheritance below for an example.

Requirements:

  • To have attributes named in __slots__ to actually be stored in slots instead of a __dict__, a class must inherit from object (automatic in Python 3, but must be explicit in Python 2).

  • To prevent the creation of a __dict__, you must inherit from object and all classes in the inheritance must declare __slots__ and none of them can have a "__dict__" entry.

There are a lot of details if you wish to keep reading.

Why use __slots__: Faster attribute access.

The creator of Python, Guido van Rossum, states that he actually created __slots__ for faster attribute access.

It is trivial to demonstrate measurably significant faster access:

import timeit

class Foo(object): __slots__ = "foo",

class Bar(object): pass

slotted = Foo()
not_slotted = Bar()

def get_set_delete_fn(obj):
    def get_set_delete():
        obj.foo = "foo"
        obj.foo
        del obj.foo
    return get_set_delete

and

>>> min(timeit.repeat(get_set_delete_fn(slotted)))
0.2846834529991611
>>> min(timeit.repeat(get_set_delete_fn(not_slotted)))
0.3664822799983085

The slotted access is almost 30% faster in Python 3.5 on Ubuntu.

>>> 0.3664822799983085 / 0.2846834529991611
1.2873325658284342

In Python 2 on Windows I have measured it about 15% faster.

Why use __slots__: Memory Savings

Another purpose of __slots__ is to reduce the space in memory that each object instance takes up.

My own contribution to the documentation clearly states the reasons behind this:

The space saved over using __dict__ can be significant.

SQLAlchemy attributes a lot of memory savings to __slots__.

To verify this, using the Anaconda distribution of Python 2.7 on Ubuntu Linux, with guppy.hpy (aka heapy) and sys.getsizeof, the size of a class instance without __slots__ declared, and nothing else, is 64 bytes. That does not include the __dict__. Thanks to Python's lazy evaluation, the __dict__ is apparently not called into existence until it is referenced, but classes without data are usually useless. When called into existence, the __dict__ attribute takes a minimum of 280 bytes additionally.

In contrast, a class instance with __slots__ declared to be () (no data) is only 16 bytes, and 56 total bytes with one item in slots, 64 with two.

For 64 bit Python, I illustrate the memory consumption in bytes in Python 2.7 and 3.6, for __slots__ and __dict__ (no slots defined) for each point where the dict grows in 3.6 (except for 0, 1, and 2 attributes):

       Python 2.7             Python 3.6
attrs  __slots__  __dict__*   __slots__  __dict__* | *(no slots defined)
none   16         56 + 272†   16         56 + 112† | †if __dict__ referenced
one    48         56 + 272    48         56 + 112
two    56         56 + 272    56         56 + 112
six    88         56 + 1040   88         56 + 152
11     128        56 + 1040   128        56 + 240
22     216        56 + 3344   216        56 + 408     
43     384        56 + 3344   384        56 + 752

So, in spite of smaller dicts in Python 3, we see how nicely __slots__ scale for instances to save us memory, and that is a major reason you would want to use __slots__.
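
You can reproduce the flavour of these numbers yourself with sys.getsizeof (exact values depend on the Python version and build):

import sys

class WithSlots:
    __slots__ = ("a", "b")

class WithDict:
    pass

s, d = WithSlots(), WithDict()
print(sys.getsizeof(s))                               # slotted instance
print(sys.getsizeof(d) + sys.getsizeof(d.__dict__))   # instance plus its (lazily created) dict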

Just for completeness of my notes, note that there is a one-time cost per slot in the class's namespace of 64 bytes in Python 2, and 72 bytes in Python 3, because slots use data descriptors like properties, called "members".

>>> Foo.foo
<member "foo" of "Foo" objects>
>>> type(Foo.foo)
<class "member_descriptor">
>>> getsizeof(Foo.foo)
72

Demonstration of __slots__:

To deny the creation of a __dict__, you must subclass object. Everything subclasses object in Python 3, but in Python 2 you had to be explicit:

class Base(object): 
    __slots__ = ()

now:

>>> b = Base()
>>> b.a = "a"
Traceback (most recent call last):
  File "<pyshell#38>", line 1, in <module>
    b.a = "a"
AttributeError: "Base" object has no attribute "a"

Or subclass another class that defines __slots__

class Child(Base):
    __slots__ = ("a",)

and now:

c = Child()
c.a = "a"

but:

>>> c.b = "b"
Traceback (most recent call last):
  File "<pyshell#42>", line 1, in <module>
    c.b = "b"
AttributeError: "Child" object has no attribute "b"

To allow __dict__ creation while subclassing slotted objects, just add "__dict__" to the __slots__ (note that slots are ordered, and you shouldn't repeat slots that are already in parent classes):

class SlottedWithDict(Child): 
    __slots__ = ("__dict__", "b")

swd = SlottedWithDict()
swd.a = "a"
swd.b = "b"
swd.c = "c"

and

>>> swd.__dict__
{"c": "c"}

Or you don"t even need to declare __slots__ in your subclass, and you will still use slots from the parents, but not restrict the creation of a __dict__:

class NoSlots(Child): pass
ns = NoSlots()
ns.a = "a"
ns.b = "b"

And:

>>> ns.__dict__
{"b": "b"}

However, __slots__ may cause problems for multiple inheritance:

class BaseA(object): 
    __slots__ = ("a",)

class BaseB(object): 
    __slots__ = ("b",)

Because creating a child class from parents with both non-empty slots fails:

>>> class Child(BaseA, BaseB): __slots__ = ()
Traceback (most recent call last):
  File "<pyshell#68>", line 1, in <module>
    class Child(BaseA, BaseB): __slots__ = ()
TypeError: Error when calling the metaclass bases
    multiple bases have instance lay-out conflict

If you run into this problem, you could just remove __slots__ from the parents, or if you have control of the parents, give them empty slots, or refactor to abstractions:

from abc import ABC

class AbstractA(ABC):
    __slots__ = ()

class BaseA(AbstractA): 
    __slots__ = ("a",)

class AbstractB(ABC):
    __slots__ = ()

class BaseB(AbstractB): 
    __slots__ = ("b",)

class Child(AbstractA, AbstractB): 
    __slots__ = ("a", "b")

c = Child() # no problem!

Add "__dict__" to __slots__ to get dynamic assignment:

class Foo(object):
    __slots__ = "bar", "baz", "__dict__"

and now:

>>> foo = Foo()
>>> foo.boink = "boink"

So with "__dict__" in slots we lose some of the size benefits with the upside of having dynamic assignment and still having slots for the names we do expect.

When you inherit from an object that isn"t slotted, you get the same sort of semantics when you use __slots__ - names that are in __slots__ point to slotted values, while any other values are put in the instance"s __dict__.

Avoiding __slots__ because you want to be able to add attributes on the fly is actually not a good reason - just add "__dict__" to your __slots__ if this is required.

You can similarly add __weakref__ to __slots__ explicitly if you need that feature.
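
For example, a minimal sketch of opting back in to weak references on a slotted class:

import weakref

class Slotted:
    __slots__ = ("bar", "__weakref__")   # "__weakref__" re-enables weak references

r = weakref.ref(Slotted())   # without "__weakref__" in __slots__ this raises TypeError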

Set to empty tuple when subclassing a namedtuple:

The namedtuple builtin makes immutable instances that are very lightweight (essentially, the size of tuples), but to get the benefits, you need to do it yourself if you subclass them:

from collections import namedtuple
class MyNT(namedtuple("MyNT", "bar baz")):
    """MyNT is an immutable and lightweight object"""
    __slots__ = ()

usage:

>>> nt = MyNT("bar", "baz")
>>> nt.bar
"bar"
>>> nt.baz
"baz"

And trying to assign an unexpected attribute raises an AttributeError because we have prevented the creation of __dict__:

>>> nt.quux = "quux"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: "MyNT" object has no attribute "quux"

You can allow __dict__ creation by leaving off __slots__ = (), but you can't use non-empty __slots__ with subtypes of tuple.

Biggest Caveat: Multiple inheritance

Even when non-empty slots are the same for multiple parents, they cannot be used together:

class Foo(object): 
    __slots__ = "foo", "bar"
class Bar(object):
    __slots__ = "foo", "bar" # alas, would work if empty, i.e. ()

>>> class Baz(Foo, Bar): pass
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Error when calling the metaclass bases
    multiple bases have instance lay-out conflict

Using an empty __slots__ in the parent seems to provide the most flexibility, allowing the child to choose to prevent or allow (by adding "__dict__" to get dynamic assignment, see section above) the creation of a __dict__:

class Foo(object): __slots__ = ()
class Bar(object): __slots__ = ()
class Baz(Foo, Bar): __slots__ = ("foo", "bar")
b = Baz()
b.foo, b.bar = "foo", "bar"

You don"t have to have slots - so if you add them, and remove them later, it shouldn"t cause any problems.

Going out on a limb here: If you're composing mixins or using abstract base classes, which aren't intended to be instantiated, an empty __slots__ in those parents seems to be the best way to go in terms of flexibility for subclassers.

To demonstrate, first, let"s create a class with code we"d like to use under multiple inheritance

class AbstractBase:
    __slots__ = ()
    def __init__(self, a, b):
        self.a = a
        self.b = b
    def __repr__(self):
        return f"{type(self).__name__}({repr(self.a)}, {repr(self.b)})"

We could use the above directly by inheriting and declaring the expected slots:

class Foo(AbstractBase):
    __slots__ = "a", "b"

But we don"t care about that, that"s trivial single inheritance, we need another class we might also inherit from, maybe with a noisy attribute:

class AbstractBaseC:
    __slots__ = ()
    @property
    def c(self):
        print("getting c!")
        return self._c
    @c.setter
    def c(self, arg):
        print("setting c!")
        self._c = arg

Now if both bases had nonempty slots, we couldn't do the below. (In fact, if we wanted, we could have given AbstractBase nonempty slots a and b, and left them out of the below declaration - leaving them in would be wrong):

class Concretion(AbstractBase, AbstractBaseC):
    __slots__ = "a b _c".split()

And now we have functionality from both via multiple inheritance, and can still deny __dict__ and __weakref__ instantiation:

>>> c = Concretion("a", "b")
>>> c.c = c
setting c!
>>> c.c
getting c!
Concretion("a", "b")
>>> c.d = "d"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: "Concretion" object has no attribute "d"

Other cases to avoid slots:

  • Avoid them when you want to perform __class__ assignment with another class that doesn"t have them (and you can"t add them) unless the slot layouts are identical. (I am very interested in learning who is doing this and why.)
  • Avoid them if you want to subclass variable length builtins like long, tuple, or str, and you want to add attributes to them.
  • Avoid them if you insist on providing default values via class attributes for instance variables.
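
The last point can be demonstrated directly; CPython rejects a class that lists a name in __slots__ and also assigns it as a class attribute:

>>> class Foo:
...     __slots__ = ("x",)
...     x = 5
...
Traceback (most recent call last):
  ...
ValueError: 'x' in __slots__ conflicts with class variable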

You may be able to tease out further caveats from the rest of the __slots__ documentation (the 3.7 dev docs are the most current), which I have made significant recent contributions to.

Critiques of other answers

The current top answers cite outdated information and are quite hand-wavy and miss the mark in some important ways.

Do not "only use __slots__ when instantiating lots of objects"

I quote:

"You would want to use __slots__ if you are going to instantiate a lot (hundreds, thousands) of objects of the same class."

Abstract Base Classes, for example, from the collections module, are not instantiated, yet __slots__ are declared for them.

Why?

If a user wishes to deny __dict__ or __weakref__ creation, those things must not be available in the parent classes.

__slots__ contributes to reusability when creating interfaces or mixins.

It is true that many Python users aren"t writing for reusability, but when you are, having the option to deny unnecessary space usage is valuable.

__slots__ doesn"t break pickling

When pickling a slotted object, you may find it complains with a misleading TypeError:

>>> pickle.loads(pickle.dumps(f))
TypeError: a class that defines __slots__ without defining __getstate__ cannot be pickled

This is actually incorrect. This message comes from the oldest protocol, which is the default. You can select the latest protocol with the -1 argument. In Python 2.7 this would be 2 (which was introduced in 2.3), and in 3.6 it is 4.

>>> pickle.loads(pickle.dumps(f, -1))
<__main__.Foo object at 0x1129C770>

in Python 2.7:

>>> pickle.loads(pickle.dumps(f, 2))
<__main__.Foo object at 0x1129C770>

in Python 3.6

>>> pickle.loads(pickle.dumps(f, 4))
<__main__.Foo object at 0x1129C770>

So I would keep this in mind, as it is a solved problem.
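
For completeness, a runnable sketch of the fix (Foo here is a hypothetical stand-in for the slotted class in the session fragments above):

import pickle

class Foo:
    __slots__ = ("bar",)

f = Foo()
f.bar = "baz"

# -1 selects the highest protocol available; the failure above is a
# protocol 0/1 limitation, and on modern Python 3 the default protocol
# is already high enough to handle __slots__.
clone = pickle.loads(pickle.dumps(f, -1))
print(clone.bar)  # baz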

Critique of the (until Oct 2, 2016) accepted answer

The first paragraph is half short explanation, half predictive. Here's the only part that actually answers the question:

The proper use of __slots__ is to save space in objects. Instead of having a dynamic dict that allows adding attributes to objects at anytime, there is a static structure which does not allow additions after creation. This saves the overhead of one dict for every object that uses slots.

The second half is wishful thinking, and off the mark:

While this is sometimes a useful optimization, it would be completely unnecessary if the Python interpreter was dynamic enough so that it would only require the dict when there actually were additions to the object.

Python actually does something similar to this, only creating the __dict__ when it is accessed, but creating lots of objects with no data is fairly ridiculous.

The second paragraph oversimplifies and misses actual reasons to avoid __slots__. The below is not a real reason to avoid slots (for actual reasons, see the rest of my answer above):

They change the behavior of the objects that have slots in a way that can be abused by control freaks and static typing weenies.

It then goes on to discuss other ways of accomplishing that perverse goal with Python, not discussing anything to do with __slots__.

The third paragraph is more wishful thinking. Together it is mostly off-the-mark content that the answerer didn't even author and contributes to ammunition for critics of the site.

Memory usage evidence

Create some normal objects and slotted objects:

>>> class Foo(object): pass
>>> class Bar(object): __slots__ = ()

Instantiate a million of them:

>>> foos = [Foo() for f in xrange(1000000)]
>>> bars = [Bar() for b in xrange(1000000)]

Inspect with guppy.hpy().heap():

>>> guppy.hpy().heap()
Partition of a set of 2028259 objects. Total size = 99763360 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0 1000000  49 64000000  64  64000000  64 __main__.Foo
     1     169   0 16281480  16  80281480  80 list
     2 1000000  49 16000000  16  96281480  97 __main__.Bar
     3   12284   1   987472   1  97268952  97 str
...

Access the regular objects and their __dict__ and inspect again:

>>> for f in foos:
...     f.__dict__
>>> guppy.hpy().heap()
Partition of a set of 3028258 objects. Total size = 379763480 bytes.
 Index  Count   %      Size    % Cumulative  % Kind (class / dict of class)
     0 1000000  33 280000000  74 280000000  74 dict of __main__.Foo
     1 1000000  33  64000000  17 344000000  91 __main__.Foo
     2     169   0  16281480   4 360281480  95 list
     3 1000000  33  16000000   4 376281480  99 __main__.Bar
     4   12284   0    987472   0 377268952  99 str
...

This is consistent with the history of Python, from Unifying types and classes in Python 2.2:

If you subclass a built-in type, extra space is automatically added to the instances to accomodate __dict__ and __weakrefs__. (The __dict__ is not initialized until you use it though, so you shouldn't worry about the space occupied by an empty dictionary for each instance you create.) If you don't need this extra space, you can add the phrase "__slots__ = []" to your class.
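
As a rough, hedged Python 3 spot-check of the same effect, tracemalloc can be used instead of guppy (absolute numbers vary between Python versions and platforms):

import tracemalloc

class Foo:
    pass              # regular instances: room reserved for __dict__/__weakref__

class Bar:
    __slots__ = ()    # attribute-less, slotted instances

for cls in (Foo, Bar):
    tracemalloc.start()
    objs = [cls() for _ in range(100_000)]
    size, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(cls.__name__, size)  # Foo consistently costs more per instance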

Answer #2

Quick Answer:

The simplest way to get row counts per group is by calling .size(), which returns a Series:

df.groupby(["col1","col2"]).size()


Usually you want this result as a DataFrame (instead of a Series) so you can do:

df.groupby(["col1", "col2"]).size().reset_index(name="counts")


If you want to find out how to calculate the row counts and other statistics for each group continue reading below.


Detailed example:

Consider the following example dataframe:

In [2]: df
Out[2]: 
  col1 col2  col3  col4  col5  col6
0    A    B  0.20 -0.61 -0.49  1.49
1    A    B -1.53 -1.01 -0.39  1.82
2    A    B -0.44  0.27  0.72  0.11
3    A    B  0.28 -1.32  0.38  0.18
4    C    D  0.12  0.59  0.81  0.66
5    C    D -0.13 -1.65 -1.64  0.50
6    C    D -1.42 -0.11 -0.18 -0.44
7    E    F -0.00  1.42 -0.26  1.17
8    E    F  0.91 -0.47  1.35 -0.34
9    G    H  1.48 -0.63 -1.14  0.17

First let's use .size() to get the row counts:

In [3]: df.groupby(["col1", "col2"]).size()
Out[3]: 
col1  col2
A     B       4
C     D       3
E     F       2
G     H       1
dtype: int64

Then let's use .size().reset_index(name="counts") to get the row counts:

In [4]: df.groupby(["col1", "col2"]).size().reset_index(name="counts")
Out[4]: 
  col1 col2  counts
0    A    B       4
1    C    D       3
2    E    F       2
3    G    H       1


Including results for more statistics

When you want to calculate statistics on grouped data, it usually looks like this:

In [5]: (df
   ...: .groupby(["col1", "col2"])
   ...: .agg({
   ...:     "col3": ["mean", "count"], 
   ...:     "col4": ["median", "min", "count"]
   ...: }))
Out[5]: 
            col4                  col3      
          median   min count      mean count
col1 col2                                   
A    B    -0.810 -1.32     4 -0.372500     4
C    D    -0.110 -1.65     3 -0.476667     3
E    F     0.475 -0.47     2  0.455000     2
G    H    -0.630 -0.63     1  1.480000     1

The result above is a little annoying to deal with because of the nested column labels, and also because row counts are on a per column basis.

To gain more control over the output I usually split the statistics into individual aggregations that I then combine using join. It looks like this:

In [6]: gb = df.groupby(["col1", "col2"])
   ...: counts = gb.size().to_frame(name="counts")
   ...: (counts
   ...:  .join(gb.agg({"col3": "mean"}).rename(columns={"col3": "col3_mean"}))
   ...:  .join(gb.agg({"col4": "median"}).rename(columns={"col4": "col4_median"}))
   ...:  .join(gb.agg({"col4": "min"}).rename(columns={"col4": "col4_min"}))
   ...:  .reset_index()
   ...: )
   ...: 
Out[6]: 
  col1 col2  counts  col3_mean  col4_median  col4_min
0    A    B       4  -0.372500       -0.810     -1.32
1    C    D       3  -0.476667       -0.110     -1.65
2    E    F       2   0.455000        0.475     -0.47
3    G    H       1   1.480000       -0.630     -0.63



Footnotes

The code used to generate the test data is shown below:

In [1]: import numpy as np
   ...: import pandas as pd 
   ...: 
   ...: keys = np.array([
   ...:         ["A", "B"],
   ...:         ["A", "B"],
   ...:         ["A", "B"],
   ...:         ["A", "B"],
   ...:         ["C", "D"],
   ...:         ["C", "D"],
   ...:         ["C", "D"],
   ...:         ["E", "F"],
   ...:         ["E", "F"],
   ...:         ["G", "H"] 
   ...:         ])
   ...: 
   ...: df = pd.DataFrame(
   ...:     np.hstack([keys,np.random.randn(10,4).round(2)]), 
   ...:     columns = ["col1", "col2", "col3", "col4", "col5", "col6"]
   ...: )
   ...: 
   ...: df[["col3", "col4", "col5", "col6"]] = \
   ...:     df[["col3", "col4", "col5", "col6"]].astype(float)
   ...: 


Disclaimer:

If some of the columns that you are aggregating have null values, then you really want to be looking at the group row counts as an independent aggregation for each column. Otherwise you may be misled as to how many records are actually being used to calculate things like the mean because pandas will drop NaN entries in the mean calculation without telling you about it.
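
A tiny hedged illustration of that caveat (made-up data): .size() counts rows, while a per-column count ignores NaN, so the two can disagree for the same group.

import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": ["A", "A", "A"],
                   "col3": [1.0, np.nan, 3.0]})

gb = df.groupby("col1")
print(gb.size())           # group A has 3 rows
print(gb["col3"].count())  # but only 2 non-null values feed e.g. the mean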

Answer #3

TL;DR version:

For the simple case of:

  • I have a text column with a delimiter and I want two columns

The simplest solution is:

df[["A", "B"]] = df["AB"].str.split(" ", 1, expand=True)

You must use expand=True if your strings have a non-uniform number of splits and you want None to replace the missing values.

Notice how, in either case, the .tolist() method is not necessary. Neither is zip().

In detail:

Andy Hayden's solution is most excellent in demonstrating the power of the str.extract() method.

But for a simple split over a known separator (like splitting by dashes, or splitting by whitespace), the .str.split() method is enough¹. It operates on a column (Series) of strings, and returns a column (Series) of lists:

>>> import pandas as pd
>>> df = pd.DataFrame({"AB": ["A1-B1", "A2-B2"]})
>>> df

      AB
0  A1-B1
1  A2-B2
>>> df["AB_split"] = df["AB"].str.split("-")
>>> df

      AB  AB_split
0  A1-B1  [A1, B1]
1  A2-B2  [A2, B2]

¹ If you're unsure what the first two parameters of .str.split() do, I recommend the docs for the plain Python version of the method.
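
For reference, a hedged sketch of those two parameters in keyword form (pat and n), on a throwaway Series:

import pandas as pd

s = pd.Series(["A1-B1", "A2-B2-C2"])
print(s.str.split("-", n=1, expand=True))  # split at most once, into columns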

But how do you go from:

  • a column containing two-element lists

to:

  • two columns, each containing the respective element of the lists?

Well, we need to take a closer look at the .str attribute of a column.

It's a magical object that is used to collect methods that treat each element in a column as a string, and then apply the respective method to each element as efficiently as possible:

>>> upper_lower_df = pd.DataFrame({"U": ["A", "B", "C"]})
>>> upper_lower_df

   U
0  A
1  B
2  C
>>> upper_lower_df["L"] = upper_lower_df["U"].str.lower()
>>> upper_lower_df

   U  L
0  A  a
1  B  b
2  C  c

But it also has an "indexing" interface for getting each element of a string by its index:

>>> df["AB"].str[0]

0    A
1    A
Name: AB, dtype: object

>>> df["AB"].str[1]

0    1
1    2
Name: AB, dtype: object

Of course, this indexing interface of .str doesn't really care if each element it's indexing is actually a string, as long as it can be indexed, so:

>>> df["AB"].str.split("-", 1).str[0]

0    A1
1    A2
Name: AB, dtype: object

>>> df["AB"].str.split("-", 1).str[1]

0    B1
1    B2
Name: AB, dtype: object

Then, it's a simple matter of taking advantage of the Python tuple unpacking of iterables to do:

>>> df["A"], df["B"] = df["AB"].str.split("-", 1).str
>>> df

      AB  AB_split   A   B
0  A1-B1  [A1, B1]  A1  B1
1  A2-B2  [A2, B2]  A2  B2

Of course, getting a DataFrame out of splitting a column of strings is so useful that the .str.split() method can do it for you with the expand=True parameter:

>>> df["AB"].str.split("-", 1, expand=True)

    0   1
0  A1  B1
1  A2  B2

So, another way of accomplishing what we wanted is to do:

>>> df = df[["AB"]]
>>> df

      AB
0  A1-B1
1  A2-B2

>>> df.join(df["AB"].str.split("-", 1, expand=True).rename(columns={0:"A", 1:"B"}))

      AB   A   B
0  A1-B1  A1  B1
1  A2-B2  A2  B2

The expand=True version, although longer, has a distinct advantage over the tuple unpacking method. Tuple unpacking doesn't deal well with splits of different lengths:

>>> df = pd.DataFrame({"AB": ["A1-B1", "A2-B2", "A3-B3-C3"]})
>>> df
         AB
0     A1-B1
1     A2-B2
2  A3-B3-C3
>>> df["A"], df["B"], df["C"] = df["AB"].str.split("-")
Traceback (most recent call last):
  [...]    
ValueError: Length of values does not match length of index
>>> 

But expand=True handles it nicely by placing None in the columns for which there aren't enough "splits":

>>> df.join(
...     df["AB"].str.split("-", expand=True).rename(
...         columns={0:"A", 1:"B", 2:"C"}
...     )
... )
         AB   A   B     C
0     A1-B1  A1  B1  None
1     A2-B2  A2  B2  None
2  A3-B3-C3  A3  B3    C3

Answer #4

There are several ways to select rows from a Pandas dataframe:

  1. Boolean indexing (df[df["col"] == value])
  2. Positional indexing (df.iloc[...])
  3. Label indexing (df.xs(...))
  4. df.query(...) API

Below I show you examples of each, with advice on when to use certain techniques. Assume our criterion is column "A" == "foo".

(Note on performance: For each base type, we can keep things simple by using the Pandas API or we can venture outside the API, usually into NumPy, and speed things up.)


Setup

The first thing we'll need is to identify a condition that will act as our criterion for selecting rows. We'll start with the OP's case column_name == some_value, and include some other common use cases.

Borrowing from @unutbu:

import pandas as pd, numpy as np

df = pd.DataFrame({"A": "foo bar foo bar foo bar foo foo".split(),
                   "B": "one one two three two two one three".split(),
                   "C": np.arange(8), "D": np.arange(8) * 2})

1. Boolean indexing

... Boolean indexing requires finding the true value of each row's "A" column being equal to "foo", then using those truth values to identify which rows to keep. Typically, we'd name this series (an array of truth values) mask. We'll do so here as well.

mask = df["A"] == "foo"

We can then use this mask to slice or index the data frame

df[mask]

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

This is one of the simplest ways to accomplish this task and if performance or intuitiveness isn't an issue, this should be your chosen method. However, if performance is a concern, then you might want to consider an alternative way of creating the mask.


2. Positional indexing

Positional indexing (df.iloc[...]) has its use cases, but this isn't one of them. In order to identify where to slice, we first need to perform the same boolean analysis we did above. This leaves us performing one extra step to accomplish the same task.

mask = df["A"] == "foo"
pos = np.flatnonzero(mask)
df.iloc[pos]

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

3. Label indexing

Label indexing can be very handy, but in this case, we are again doing more work for no benefit

df.set_index("A", append=True, drop=False).xs("foo", level=1)

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

4. df.query() API

pd.DataFrame.query is a very elegant/intuitive way to perform this task, but is often slower. However, if you pay attention to the timings below, for large data the query is very efficient: more so than the standard approach, and of a similar magnitude to my best suggestion.

df.query('A == "foo"')

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

My preference is to use the Boolean mask

Actual improvements can be made by modifying how we create our Boolean mask.

mask alternative 1 Use the underlying NumPy array and forgo the overhead of creating another pd.Series

mask = df["A"].values == "foo"

I'll show more complete time tests at the end, but just take a look at the performance gains we get using the sample data frame. First, we look at the difference in creating the mask:

%timeit mask = df["A"].values == "foo"
%timeit mask = df["A"] == "foo"

5.84 µs ± 195 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
166 µs ± 4.45 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Evaluating the mask with the NumPy array is ~ 30 times faster. This is partly due to NumPy evaluation often being faster. It is also partly due to the lack of overhead necessary to build an index and a corresponding pd.Series object.

Next, we'll look at the timing for slicing with one mask versus the other.

mask = df["A"].values == "foo"
%timeit df[mask]
mask = df["A"] == "foo"
%timeit df[mask]

219 µs ± 12.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
239 µs ± 7.03 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The performance gains aren"t as pronounced. We"ll see if this holds up over more robust testing.


mask alternative 2 We could have reconstructed the data frame as well. There is a big caveat when reconstructing a dataframe—you must take care of the dtypes when doing so!

Instead of df[mask] we will do this

pd.DataFrame(df.values[mask], df.index[mask], df.columns).astype(df.dtypes)

If the data frame is of mixed type, which our example is, then when we get df.values the resulting array is of dtype object and consequently, all columns of the new data frame will be of dtype object. Thus requiring the astype(df.dtypes) and killing any potential performance gains.

%timeit df[mask]
%timeit pd.DataFrame(df.values[mask], df.index[mask], df.columns).astype(df.dtypes)

216 µs ± 10.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.43 ms ± 39.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

However, if the data frame is not of mixed type, this is a very useful way to do it.

Given

np.random.seed([3,1415])
d1 = pd.DataFrame(np.random.randint(10, size=(10, 5)), columns=list("ABCDE"))

d1

   A  B  C  D  E
0  0  2  7  3  8
1  7  0  6  8  6
2  0  2  0  4  9
3  7  3  2  4  3
4  3  6  7  7  4
5  5  3  7  5  9
6  8  7  6  4  7
7  6  2  6  6  5
8  2  8  7  5  8
9  4  7  6  1  5

%%timeit
mask = d1["A"].values == 7
d1[mask]

179 µs ± 8.73 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Versus

%%timeit
mask = d1["A"].values == 7
pd.DataFrame(d1.values[mask], d1.index[mask], d1.columns)

87 µs ± 5.12 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

We cut the time in half.


mask alternative 3

@unutbu also shows us how to use pd.Series.isin to account for each element of df["A"] being in a set of values. This evaluates to the same thing if our set of values is a set of one value, namely "foo". But it also generalizes to include larger sets of values if needed. Turns out, this is still pretty fast even though it is a more general solution. The only real loss is in intuitiveness for those not familiar with the concept.

mask = df["A"].isin(["foo"])
df[mask]

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

However, as before, we can utilize NumPy to improve performance while sacrificing virtually nothing. We'll use np.in1d

mask = np.in1d(df["A"].values, ["foo"])
df[mask]

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

Timing

I'll include other concepts mentioned in other posts as well for reference.

Code Below

Each column in this table represents a different length data frame over which we test each function. Each column shows relative time taken, with the fastest function given a base index of 1.0.

res.div(res.min())

                         10        30        100       300       1000      3000      10000     30000
mask_standard         2.156872  1.850663  2.034149  2.166312  2.164541  3.090372  2.981326  3.131151
mask_standard_loc     1.879035  1.782366  1.988823  2.338112  2.361391  3.036131  2.998112  2.990103
mask_with_values      1.010166  1.000000  1.005113  1.026363  1.028698  1.293741  1.007824  1.016919
mask_with_values_loc  1.196843  1.300228  1.000000  1.000000  1.038989  1.219233  1.037020  1.000000
query                 4.997304  4.765554  5.934096  4.500559  2.997924  2.397013  1.680447  1.398190
xs_label              4.124597  4.272363  5.596152  4.295331  4.676591  5.710680  6.032809  8.950255
mask_with_isin        1.674055  1.679935  1.847972  1.724183  1.345111  1.405231  1.253554  1.264760
mask_with_in1d        1.000000  1.083807  1.220493  1.101929  1.000000  1.000000  1.000000  1.144175

You'll notice that the fastest times seem to be shared between mask_with_values and mask_with_in1d.

res.T.plot(loglog=True)

[figure: log-log plot of the relative timings]

Functions

def mask_standard(df):
    mask = df["A"] == "foo"
    return df[mask]

def mask_standard_loc(df):
    mask = df["A"] == "foo"
    return df.loc[mask]

def mask_with_values(df):
    mask = df["A"].values == "foo"
    return df[mask]

def mask_with_values_loc(df):
    mask = df["A"].values == "foo"
    return df.loc[mask]

def query(df):
    return df.query('A == "foo"')

def xs_label(df):
    return df.set_index("A", append=True, drop=False).xs("foo", level=-1)

def mask_with_isin(df):
    mask = df["A"].isin(["foo"])
    return df[mask]

def mask_with_in1d(df):
    mask = np.in1d(df["A"].values, ["foo"])
    return df[mask]

Testing

res = pd.DataFrame(
    index=[
        "mask_standard", "mask_standard_loc", "mask_with_values", "mask_with_values_loc",
        "query", "xs_label", "mask_with_isin", "mask_with_in1d"
    ],
    columns=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
    dtype=float
)

from timeit import timeit

for j in res.columns:
    d = pd.concat([df] * j, ignore_index=True)
    for i in res.index:
        stmt = "{}(d)".format(i)
        setp = "from __main__ import d, {}".format(i)
        res.at[i, j] = timeit(stmt, setp, number=50)

Special Timing

Looking at the special case when we have a single non-object dtype for the entire data frame.

Code Below

spec.div(spec.min())

                     10        30        100       300       1000      3000      10000     30000
mask_with_values  1.009030  1.000000  1.194276  1.000000  1.236892  1.095343  1.000000  1.000000
mask_with_in1d    1.104638  1.094524  1.156930  1.072094  1.000000  1.000000  1.040043  1.027100
reconstruct       1.000000  1.142838  1.000000  1.355440  1.650270  2.222181  2.294913  3.406735

Turns out, reconstruction isn't worth it past a few hundred rows.

spec.T.plot(loglog=True)

[figure: log-log plot of the special-case timings]

Functions

np.random.seed([3,1415])
d1 = pd.DataFrame(np.random.randint(10, size=(10, 5)), columns=list("ABCDE"))

def mask_with_values(df):
    mask = df["A"].values == "foo"
    return df[mask]

def mask_with_in1d(df):
    mask = np.in1d(df["A"].values, ["foo"])
    return df[mask]

def reconstruct(df):
    v = df.values
    mask = np.in1d(df["A"].values, ["foo"])
    return pd.DataFrame(v[mask], df.index[mask], df.columns)

spec = pd.DataFrame(
    index=["mask_with_values", "mask_with_in1d", "reconstruct"],
    columns=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
    dtype=float
)

Testing

for j in spec.columns:
    d = pd.concat([df] * j, ignore_index=True)
    for i in spec.index:
        stmt = "{}(d)".format(i)
        setp = "from __main__ import d, {}".format(i)
        spec.at[i, j] = timeit(stmt, setp, number=50)

Answer #5

It's much easier if you use Response.raw and shutil.copyfileobj():

import requests
import shutil

def download_file(url):
    local_filename = url.split("/")[-1]
    with requests.get(url, stream=True) as r:
        with open(local_filename, "wb") as f:
            shutil.copyfileobj(r.raw, f)

    return local_filename

This streams the file to disk without using excessive memory, and the code is simple.
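
If you need the transfer encoding (gzip/deflate) decoded for you, a hedged variant using Response.iter_content() instead of Response.raw looks like this (the function name is made up):

import requests

def download_file_chunked(url, local_filename):
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(local_filename, "wb") as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
    return local_filename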

Answer #6

Explain all in Python?

I keep seeing the variable __all__ set in different __init__.py files.

What does this do?

What does __all__ do?

It declares the semantically "public" names from a module. If there is a name in __all__, users are expected to use it, and they can have the expectation that it will not change.

It also will have programmatic effects:

import *

__all__ in a module, e.g. module.py:

__all__ = ["foo", "Bar"]

means that when you import * from the module, only those names in the __all__ are imported:

from module import *               # imports foo and Bar

Documentation tools

Documentation and code autocompletion tools may (in fact, should) also inspect the __all__ to determine what names to show as available from a module.

__init__.py makes a directory a Python package

From the docs:

The __init__.py files are required to make Python treat the directories as containing packages; this is done to prevent directories with a common name, such as string, from unintentionally hiding valid modules that occur later on the module search path.

In the simplest case, __init__.py can just be an empty file, but it can also execute initialization code for the package or set the __all__ variable.

So the __init__.py can declare the __all__ for a package.

Managing an API:

A package is typically made up of modules that may import one another, but that are necessarily tied together with an __init__.py file. That file is what makes the directory an actual Python package. For example, say you have the following files in a package:

package
├── __init__.py
├── module_1.py
└── module_2.py

Let's create these files with Python so you can follow along - you could paste the following into a Python 3 shell:

from pathlib import Path

package = Path("package")
package.mkdir()

(package / "__init__.py").write_text("""
from .module_1 import *
from .module_2 import *
""")

package_module_1 = package / "module_1.py"
package_module_1.write_text("""
__all__ = ["foo"]
imp_detail1 = imp_detail2 = imp_detail3 = None
def foo(): pass
""")

package_module_2 = package / "module_2.py"
package_module_2.write_text("""
__all__ = ["Bar"]
imp_detail1 = imp_detail2 = imp_detail3 = None
class Bar: pass
""")

And now you have presented a complete api that someone else can use when they import your package, like so:

import package
package.foo()
package.Bar()

And the package won"t have all the other implementation details you used when creating your modules cluttering up the package namespace.

__all__ in __init__.py

After more work, maybe you've decided that the modules are too big (like many thousands of lines?) and need to be split up. So you do the following:

package
├── __init__.py
├── module_1
│   ├── foo_implementation.py
│   └── __init__.py
└── module_2
    ├── Bar_implementation.py
    └── __init__.py

First make the subpackage directories with the same names as the modules:

subpackage_1 = package / "module_1"
subpackage_1.mkdir()
subpackage_2 = package / "module_2"
subpackage_2.mkdir()

Move the implementations:

package_module_1.rename(subpackage_1 / "foo_implementation.py")
package_module_2.rename(subpackage_2 / "Bar_implementation.py")

Create __init__.py files for the subpackages, each declaring its own __all__:

(subpackage_1 / "__init__.py").write_text("""
from .foo_implementation import *
__all__ = ["foo"]
""")
(subpackage_2 / "__init__.py").write_text("""
from .Bar_implementation import *
__all__ = ["Bar"]
""")

And now you still have the api provisioned at the package level:

>>> import package
>>> package.foo()
>>> package.Bar()
<package.module_2.Bar_implementation.Bar object at 0x7f0c2349d210>

And you can easily add things to your API that you can manage at the subpackage level instead of the subpackage"s module level. If you want to add a new name to the API, you simply update the __init__.py, e.g. in module_2:

from .Bar_implementation import *
from .Baz_implementation import *
__all__ = ["Bar", "Baz"]

And if you're not ready to publish Baz in the top level API, in your top level __init__.py you could have:

from .module_1 import *       # also constrained by __all__'s
from .module_2 import *       # in the __init__.py's
__all__ = ["foo", "Bar"]     # further constraining the names advertised

and if your users are aware of the availability of Baz, they can use it:

import package
package.Baz()

but if they don't know about it, other tools (like pydoc) won't inform them.

You can later change that when Baz is ready for prime time:

from .module_1 import *
from .module_2 import *
__all__ = ["foo", "Bar", "Baz"]

Prefixing _ versus __all__:

By default, Python will export all names that do not start with an _. You certainly could rely on this mechanism. Some packages in the Python standard library, in fact, do rely on this, but to do so, they alias their imports, for example, in ctypes/__init__.py:

import os as _os, sys as _sys

Using the _ convention can be more elegant because it removes the redundancy of naming the names again. But it adds the redundancy for imports (if you have a lot of them) and it is easy to forget to do this consistently - and the last thing you want is to have to indefinitely support something you intended to only be an implementation detail, just because you forgot to prefix an _ when naming a function.
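
A hedged demonstration of that default rule, written in the same write-the-file style as above (helpers.py is a hypothetical module): without an __all__, a star import skips names that start with an underscore, which is why the aliased imports stay out of the namespace.

import sys
from pathlib import Path

Path("helpers.py").write_text(
    "import os as _os            # aliased so the star import skips it\n"
    "def public(): return 'ok'\n"
    "def _private(): return 'hidden'\n"
)
sys.path.insert(0, ".")  # make sure the current directory is importable

from helpers import *    # binds public, but not _private or _os
print(public())          # ok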

I personally write an __all__ early in my development lifecycle for modules so that others who might use my code know what they should use and not use.

Most packages in the standard library also use __all__.

When avoiding __all__ makes sense

It makes sense to stick to the _ prefix convention in lieu of __all__ when:

  • You're still in early development mode and have no users, and are constantly tweaking your API.
  • Maybe you do have users, but you have unittests that cover the API, and you're still actively adding to the API and tweaking in development.

An export decorator

The downside of using __all__ is that you have to write the names of functions and classes being exported twice - and the information is kept separate from the definitions. We could use a decorator to solve this problem.

I got the idea for such an export decorator from David Beazley's talk on packaging. This implementation seems to work well in CPython's traditional importer. If you have a special import hook or system, I do not guarantee it, but if you adopt it, it is fairly trivial to back out - you'll just need to manually add the names back into the __all__.

So in, for example, a utility library, you would define the decorator:

import sys

def export(fn):
    mod = sys.modules[fn.__module__]
    if hasattr(mod, "__all__"):
        mod.__all__.append(fn.__name__)
    else:
        mod.__all__ = [fn.__name__]
    return fn

and then, where you would define an __all__, you do this:

$ cat > main.py
from lib import export
__all__ = [] # optional - we create a list if __all__ is not there.

@export
def foo(): pass

@export
def bar():
    "bar"

def main():
    print("main")

if __name__ == "__main__":
    main()

And this works fine whether run as main or imported by another function.

$ cat > run.py
import main
main.main()

$ python run.py
main

And API provisioning with import * will work too:

$ cat > run.py
from main import *
foo()
bar()
main() # expected to error here, not exported

$ python run.py
Traceback (most recent call last):
  File "run.py", line 4, in <module>
    main() # expected to error here, not exported
NameError: name "main" is not defined

Answer #7

A comment in the Python source code for float objects acknowledges that:

Comparison is pretty much a nightmare

This is especially true when comparing a float to an integer, because, unlike floats, integers in Python can be arbitrarily large and are always exact. Trying to cast the integer to a float might lose precision and make the comparison inaccurate. Trying to cast the float to an integer is not going to work either because any fractional part will be lost.

To get around this problem, Python performs a series of checks, returning the result if one of the checks succeeds. It compares the signs of the two values, then whether the integer is "too big" to be a float, then compares the exponent of the float to the length of the integer. If all of these checks fail, it is necessary to construct two new Python objects to compare in order to obtain the result.

When comparing a float v to an integer/long w, the worst case is that:

  • v and w have the same sign (both positive or both negative),
  • the integer w has few enough bits that it can be held in the size_t type (typically 32 or 64 bits),
  • the integer w has at least 49 bits,
  • the exponent of the float v is the same as the number of bits in w.

And this is exactly what we have for the values in the question:

>>> import math
>>> math.frexp(562949953420000.7) # gives the float's (significand, exponent) pair
(0.9999999999976706, 49)
>>> (562949953421000).bit_length()
49

We see that 49 is both the exponent of the float and the number of bits in the integer. Both numbers are positive and so the four criteria above are met.

Choosing one of the values to be larger (or smaller) can change the number of bits of the integer, or the value of the exponent, and so Python is able to determine the result of the comparison without performing the expensive final check.

This is specific to the CPython implementation of the language.
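
A rough, hedged way to observe this from Python itself (timings vary by machine and CPython version; the constants follow the values discussed above, and 2**60 is an arbitrary integer with more bits than the float's exponent):

import timeit

setup = "f = 562949953420000.7; w_slow = 562949953421000; w_fast = 2**60"
print(timeit.timeit("f < w_slow", setup=setup, number=10**7))  # hits the final, expensive check
print(timeit.timeit("f < w_fast", setup=setup, number=10**7))  # decided by an earlier check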


The comparison in more detail

The float_richcompare function handles the comparison between two values v and w.

Below is a step-by-step description of the checks that the function performs. The comments in the Python source are actually very helpful when trying to understand what the function does, so I've left them in where relevant. I've also summarised these checks in a list at the foot of the answer.

The main idea is to map the Python objects v and w to two appropriate C doubles, i and j, which can then be easily compared to give the correct result. Both Python 2 and Python 3 use the same ideas to do this (the former just handles int and long types separately).

The first thing to do is check that v is definitely a Python float and map it to a C double i. Next the function looks at whether w is also a float and maps it to a C double j. This is the best case scenario for the function as all the other checks can be skipped. The function also checks to see whether v is inf or nan:

static PyObject*
float_richcompare(PyObject *v, PyObject *w, int op)
{
    double i, j;
    int r = 0;
    assert(PyFloat_Check(v));       
    i = PyFloat_AS_DOUBLE(v);       

    if (PyFloat_Check(w))           
        j = PyFloat_AS_DOUBLE(w);   

    else if (!Py_IS_FINITE(i)) {
        if (PyLong_Check(w))
            j = 0.0;
        else
            goto Unimplemented;
    }

Now we know that if w failed these checks, it is not a Python float. Now the function checks if it's a Python integer. If this is the case, the easiest test is to extract the sign of v and the sign of w (return 0 if zero, -1 if negative, 1 if positive). If the signs are different, this is all the information needed to return the result of the comparison:

    else if (PyLong_Check(w)) {
        int vsign = i == 0.0 ? 0 : i < 0.0 ? -1 : 1;
        int wsign = _PyLong_Sign(w);
        size_t nbits;
        int exponent;

        if (vsign != wsign) {
            /* Magnitudes are irrelevant -- the signs alone
             * determine the outcome.
             */
            i = (double)vsign;
            j = (double)wsign;
            goto Compare;
        }
    }   

If this check failed, then v and w have the same sign.

The next check counts the number of bits in the integer w. If it has too many bits then it can't possibly be held as a float and so must be larger in magnitude than the float v:

    nbits = _PyLong_NumBits(w);
    if (nbits == (size_t)-1 && PyErr_Occurred()) {
        /* This long is so large that size_t isn't big enough
         * to hold the # of bits.  Replace with little doubles
         * that give the same outcome -- w is so large that
         * its magnitude must exceed the magnitude of any
         * finite float.
         */
        PyErr_Clear();
        i = (double)vsign;
        assert(wsign != 0);
        j = wsign * 2.0;
        goto Compare;
    }

On the other hand, if the integer w has 48 or fewer bits, it can safely be turned into a C double j and compared:

    if (nbits <= 48) {
        j = PyLong_AsDouble(w);
        /* It's impossible that <= 48 bits overflowed. */
        assert(j != -1.0 || ! PyErr_Occurred());
        goto Compare;
    }

From this point onwards, we know that w has 49 or more bits. It will be convenient to treat w as a positive integer, so change the sign and the comparison operator as necessary:

    if (vsign < 0) {
        /* "Multiply both sides" by -1; this also swaps the
         * comparator.
         */
        i = -i;
        op = _Py_SwappedOp[op];
    }

Now the function looks at the exponent of the float. Recall that a float can be written (ignoring sign) as significand * 2^exponent and that the significand represents a number between 0.5 and 1:

    (void) frexp(i, &exponent);
    if (exponent < 0 || (size_t)exponent < nbits) {
        i = 1.0;
        j = 2.0;
        goto Compare;
    }

This checks two things. If the exponent is less than 0 then the float is smaller than 1 (and so smaller in magnitude than any integer). Or, if the exponent is less than the number of bits in w then we have that v < |w| since significand * 2^exponent is less than 2^nbits.

Failing these two checks, the function looks to see whether the exponent is greater than the number of bits in w. This shows that significand * 2^exponent is greater than 2^nbits and so v > |w|:

    if ((size_t)exponent > nbits) {
        i = 2.0;
        j = 1.0;
        goto Compare;
    }

If this check did not succeed we know that the exponent of the float v is the same as the number of bits in the integer w.

The only way that the two values can be compared now is to construct two new Python integers from v and w. The idea is to discard the fractional part of v, double the integer part, and then add one. w is also doubled and these two new Python objects can be compared to give the correct return value. Using an example with small values, 4.65 < 4 would be determined by the comparison (2*4)+1 == 9 < 8 == (2*4) (returning false).

    {
        double fracpart;
        double intpart;
        PyObject *result = NULL;
        PyObject *one = NULL;
        PyObject *vv = NULL;
        PyObject *ww = w;

        // snip

        fracpart = modf(i, &intpart); // split i (the double that v mapped to)
        vv = PyLong_FromDouble(intpart);

        // snip

        if (fracpart != 0.0) {
            /* Shift left, and or a 1 bit into vv
             * to represent the lost fraction.
             */
            PyObject *temp;

            one = PyLong_FromLong(1);

            temp = PyNumber_Lshift(ww, one); // left-shift doubles an integer
            ww = temp;

            temp = PyNumber_Lshift(vv, one);
            vv = temp;

            temp = PyNumber_Or(vv, one); // a doubled integer is even, so this adds 1
            vv = temp;
        }
        // snip
    }
}

For brevity I've left out the additional error-checking and garbage-tracking Python has to do when it creates these new objects. Needless to say, this adds additional overhead and explains why the values highlighted in the question are significantly slower to compare than others.


Here is a summary of the checks that are performed by the comparison function.

Let v be a float and cast it as a C double. Now, if w is also a float:

  • Check whether w is nan or inf. If so, handle this special case separately depending on the type of w.

  • If not, compare v and w directly by their representations as C doubles.

If w is an integer:

  • Extract the signs of v and w. If they are different then we know v and w are different and which is the greater value.

  • (The signs are the same.) Check whether w has too many bits to be a float (more than size_t). If so, w has greater magnitude than v.

  • Check if w has 48 or fewer bits. If so, it can be safely cast to a C double without losing its precision and compared with v.

  • (w has more than 48 bits. We will now treat w as a positive integer having changed the compare op as appropriate.)

  • Consider the exponent of the float v. If the exponent is negative, then v is less than 1 and therefore less than any positive integer. Else, if the exponent is less than the number of bits in w then it must be less than w.

  • If the exponent of v is greater than the number of bits in w then v is greater than w.

  • (The exponent is the same as the number of bits in w.)

  • The final check. Split v into its integer and fractional parts. Double the integer part and add 1 to compensate for the fractional part. Now double the integer w. Compare these two new integers instead to get the result.

Answer #8

To somewhat expand on the earlier answers here, there are a number of details which are commonly overlooked.

  • Prefer subprocess.run() over subprocess.check_call() and friends over subprocess.call() over subprocess.Popen() over os.system() over os.popen()
  • Understand and probably use text=True, aka universal_newlines=True.
  • Understand the meaning of shell=True or shell=False and how it changes quoting and the availability of shell conveniences.
  • Understand differences between sh and Bash
  • Understand how a subprocess is separate from its parent, and generally cannot change the parent.
  • Avoid running the Python interpreter as a subprocess of Python.

These topics are covered in some more detail below.

Prefer subprocess.run() or subprocess.check_call()

The subprocess.Popen() function is a low-level workhorse but it is tricky to use correctly and you end up copy/pasting multiple lines of code ... which conveniently already exist in the standard library as a set of higher-level wrapper functions for various purposes, which are presented in more detail in the following.

Here's a paragraph from the documentation:

The recommended approach to invoking subprocesses is to use the run() function for all use cases it can handle. For more advanced use cases, the underlying Popen interface can be used directly.

Unfortunately, the availability of these wrapper functions differs between Python versions.

  • subprocess.run() was officially introduced in Python 3.5. It is meant to replace all of the following.
  • subprocess.check_output() was introduced in Python 2.7 / 3.1. It is basically equivalent to subprocess.run(..., check=True, stdout=subprocess.PIPE).stdout
  • subprocess.check_call() was introduced in Python 2.5. It is basically equivalent to subprocess.run(..., check=True)
  • subprocess.call() was introduced in Python 2.4 in the original subprocess module (PEP-324). It is basically equivalent to subprocess.run(...).returncode
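
A minimal, hedged example of the modern interface (POSIX; "echo" is just a stand-in command):

import subprocess

result = subprocess.run(
    ["echo", "hello"],    # argument list, no shell involved
    check=True,           # raise CalledProcessError on a nonzero exit status
    capture_output=True,  # Python 3.7+: collect stdout and stderr
    text=True)            # decode bytes to str
print(result.returncode, result.stdout.strip())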

High-level API vs subprocess.Popen()

The refactored and extended subprocess.run() is more logical and more versatile than the legacy functions it replaces. It returns a CompletedProcess object which exposes the exit status, the standard output, and a few other results and status indicators from the finished subprocess.

subprocess.run() is the way to go if you simply need a program to run and return control to Python. For more involved scenarios (background processes, perhaps with interactive I/O with the Python parent program) you still need to use subprocess.Popen() and take care of all the plumbing yourself. This requires a fairly intricate understanding of all the moving parts and should not be undertaken lightly. The simpler Popen object represents the (possibly still-running) process which needs to be managed from your code for the remainder of the lifetime of the subprocess.

It should perhaps be emphasized that subprocess.Popen() merely creates a process. If you leave it at that, you have a subprocess running concurrently alongside Python, so a "background" process. If it doesn't need to do input or output or otherwise coordinate with you, it can do useful work in parallel with your Python program.
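
A hedged sketch of that "background process" case (POSIX "sleep" standing in for a long-running command):

import subprocess

proc = subprocess.Popen(["sleep", "2"])  # returns immediately; the child runs on its own
# ... do other useful work in Python here while the child runs ...
proc.wait()                              # reap the child when you are done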

Avoid os.system() and os.popen()

Since time eternal (well, since Python 2.5) the os module documentation has contained the recommendation to prefer subprocess over os.system():

The subprocess module provides more powerful facilities for spawning new processes and retrieving their results; using that module is preferable to using this function.

The problems with system() are that it's obviously system-dependent and doesn't offer ways to interact with the subprocess. It simply runs, with standard output and standard error outside of Python's reach. The only information Python receives back is the exit status of the command (zero means success, though the meaning of non-zero values is also somewhat system-dependent).

PEP-324 (which was already mentioned above) contains a more detailed rationale for why os.system is problematic and how subprocess attempts to solve those issues.

os.popen() used to be even more strongly discouraged:

Deprecated since version 2.6: This function is obsolete. Use the subprocess module.

However, since sometime in Python 3, it has been reimplemented to simply use subprocess, and redirects to the subprocess.Popen() documentation for details.

Understand and usually use check=True

You'll also notice that subprocess.call() has many of the same limitations as os.system(). In regular use, you should generally check whether the process finished successfully, which subprocess.check_call() and subprocess.check_output() do (where the latter also returns the standard output of the finished subprocess). Similarly, you should usually use check=True with subprocess.run() unless you specifically need to allow the subprocess to return an error status.

In practice, with check=True or subprocess.check_*, Python will throw a CalledProcessError exception if the subprocess returns a nonzero exit status.

A common error with subprocess.run() is to omit check=True and be surprised when downstream code fails if the subprocess failed.

On the other hand, a common problem with check_call() and check_output() was that users who blindly used these functions were surprised when the exception was raised e.g. when grep did not find a match. (You should probably replace grep with native Python code anyway, as outlined below.)

All things counted, you need to understand how shell commands return an exit code, and under what conditions they will return a non-zero (error) exit code, and make a conscious decision how exactly it should be handled.
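
A small hedged illustration of the difference ("false" is a POSIX command that always exits with a nonzero status):

import subprocess

print(subprocess.run(["false"]).returncode)   # 1, and no exception raised

try:
    subprocess.run(["false"], check=True)
except subprocess.CalledProcessError as exc:
    print("command failed with exit status", exc.returncode)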

Understand and probably use text=True aka universal_newlines=True

Since Python 3, strings internal to Python are Unicode strings. But there is no guarantee that a subprocess generates Unicode output, or strings at all.

(If the differences are not immediately obvious, Ned Batchelder's Pragmatic Unicode is recommended, if not outright obligatory, reading. There is a 36-minute video presentation behind the link if you prefer, though reading the page yourself will probably take significantly less time.)

Deep down, Python has to fetch a bytes buffer and interpret it somehow. If it contains a blob of binary data, it shouldn't be decoded into a Unicode string, because that's error-prone and bug-inducing behavior - precisely the sort of pesky behavior which riddled many Python 2 scripts, before there was a way to properly distinguish between encoded text and binary data.

With text=True, you tell Python that you, in fact, expect back textual data in the system"s default encoding, and that it should be decoded into a Python (Unicode) string to the best of Python"s ability (usually UTF-8 on any moderately up to date system, except perhaps Windows?)

If that's not what you request back, Python will just give you bytes strings in the stdout and stderr strings. Maybe at some later point you do know that they were text strings after all, and you know their encoding. Then, you can decode them.

normal = subprocess.run([external, arg],
    stdout=subprocess.PIPE, stderr=subprocess.PIPE,
    check=True,
    text=True)
print(normal.stdout)

convoluted = subprocess.run([external, arg],
    stdout=subprocess.PIPE, stderr=subprocess.PIPE,
    check=True)
# You have to know (or guess) the encoding
print(convoluted.stdout.decode("utf-8"))

Python 3.7 introduced the shorter and more descriptive and understandable alias text for the keyword argument which was previously somewhat misleadingly called universal_newlines.

Understand shell=True vs shell=False

With shell=True you pass a single string to your shell, and the shell takes it from there.

With shell=False you pass a list of arguments to the OS, bypassing the shell.

When you don't have a shell, you save a process and get rid of a fairly substantial amount of hidden complexity, which may or may not harbor bugs or even security problems.

On the other hand, when you don't have a shell, you don't have redirection, wildcard expansion, job control, and a large number of other shell features.

A common mistake is to use shell=True and then still pass Python a list of tokens, or vice versa. This happens to work in some cases, but is really ill-defined and could break in interesting ways.

# XXX AVOID THIS BUG
buggy = subprocess.run("dig +short stackoverflow.com")

# XXX AVOID THIS BUG TOO
broken = subprocess.run(["dig", "+short", "stackoverflow.com"],
    shell=True)

# XXX DEFINITELY AVOID THIS
pathological = subprocess.run(["dig +short stackoverflow.com"],
    shell=True)

correct = subprocess.run(["dig", "+short", "stackoverflow.com"],
    # Probably don't forget these, too
    check=True, text=True)

# XXX Probably better avoid shell=True
# but this is nominally correct
fixed_but_fugly = subprocess.run("dig +short stackoverflow.com",
    shell=True,
    # Probably don't forget these, too
    check=True, text=True)

The common retort "but it works for me" is not a useful rebuttal unless you understand exactly under what circumstances it could stop working.

Refactoring Example

Very often, the features of the shell can be replaced with native Python code. Simple Awk or sed scripts should probably simply be translated to Python instead.

To partially illustrate this, here is a typical but slightly silly example which involves many shell features.

cmd = """while read -r x;
   do ping -c 3 "$x" | grep "round-trip min/avg/max"
   done <hosts.txt"""

# Trivial but horrible
results = subprocess.run(
    cmd, shell=True, universal_newlines=True, check=True)
print(results.stdout)

# Reimplement with shell=False
with open("hosts.txt") as hosts:
    for host in hosts:
        host = host.rstrip("\n")  # drop newline
        ping = subprocess.run(
             ["ping", "-c", "3", host],
             text=True,
             stdout=subprocess.PIPE,
             check=True)
        for line in ping.stdout.split("\n"):
             if "round-trip min/avg/max" in line:
                 print("{}: {}".format(host, line))

Some things to note here:

  • With shell=False you don't need the quoting that the shell requires around strings. Putting quotes anyway is probably an error.
  • It often makes sense to run as little code as possible in a subprocess. This gives you more control over execution from within your Python code.
  • Having said that, complex shell pipelines are tedious and sometimes challenging to reimplement in Python.

The refactored code also illustrates just how much the shell really does for you with a very terse syntax -- for better or for worse. Python says explicit is better than implicit but the Python code is rather verbose and arguably looks more complex than it really is. On the other hand, it offers a number of points where you can grab control in the middle of something else, as trivially exemplified by the enhancement that we can easily include the host name along with the shell command output. (This is by no means challenging to do in the shell, either, but at the expense of yet another diversion and perhaps another process.)

Common Shell Constructs

For completeness, here are brief explanations of some of these shell features, and some notes on how they can perhaps be replaced with native Python facilities.

  • Globbing aka wildcard expansion can be replaced with glob.glob() or very often with simple Python string comparisons like for file in os.listdir("."): if not file.endswith(".png"): continue. Bash has various other expansion facilities like .{png,jpg} brace expansion and {1..100} as well as tilde expansion (~ expands to your home directory, and more generally ~account to the home directory of another user)
  • Shell variables like $SHELL or $my_exported_var can sometimes simply be replaced with Python variables. Exported shell variables are available as e.g. os.environ["SHELL"] (the meaning of export is to make the variable available to subprocesses -- a variable which is not available to subprocesses will obviously not be available to Python running as a subprocess of the shell, or vice versa. The env= keyword argument to subprocess methods allows you to define the environment of the subprocess as a dictionary, so that"s one way to make a Python variable visible to a subprocess). With shell=False you will need to understand how to remove any quotes; for example, cd "$HOME" is equivalent to os.chdir(os.environ["HOME"]) without quotes around the directory name. (Very often cd is not useful or necessary anyway, and many beginners omit the double quotes around the variable and get away with it until one day ...)
  • Redirection allows you to read from a file as your standard input, and write your standard output to a file. grep "foo" <inputfile >outputfile opens outputfile for writing and inputfile for reading, and passes its contents as standard input to grep, whose standard output then lands in outputfile. This is not generally hard to replace with native Python code.
  • Pipelines are a form of redirection. echo foo | nl runs two subprocesses, where the standard output of echo is the standard input of nl (on the OS level, in Unix-like systems, this is a single file handle). If you cannot replace one or both ends of the pipeline with native Python code, perhaps think about using a shell after all, especially if the pipeline has more than two or three processes (though look at the pipes module in the Python standard library or a number of more modern and versatile third-party competitors). A minimal Python wiring of the echo | nl example is sketched after this list.
  • Job control lets you interrupt jobs, run them in the background, return them to the foreground, etc. The basic Unix signals to stop and continue a process are of course available from Python, too. But jobs are a higher-level abstraction in the shell which involve process groups etc which you have to understand if you want to do something like this from Python.
  • Quoting in the shell is potentially confusing until you understand that everything is basically a string. So ls -l / is equivalent to "ls" "-l" "/" but the quoting around literals is completely optional. Unquoted strings which contain shell metacharacters undergo parameter expansion, whitespace tokenization and wildcard expansion; double quotes prevent whitespace tokenization and wildcard expansion but allow parameter expansions (variable substitution, command substitution, and backslash processing). This is simple in theory but can get bewildering, especially when there are several layers of interpretation (a remote shell command, for example).
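
As promised in the Pipelines bullet, here is a hedged sketch of wiring the echo foo | nl example without a shell (POSIX commands; error handling omitted):

import subprocess

p1 = subprocess.Popen(["echo", "foo"], stdout=subprocess.PIPE)
p2 = subprocess.Popen(["nl"], stdin=p1.stdout, stdout=subprocess.PIPE, text=True)
p1.stdout.close()   # let p1 receive SIGPIPE if p2 exits first
out, _ = p2.communicate()
print(out, end="")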

Understand differences between sh and Bash

subprocess runs your shell commands with /bin/sh unless you specifically request otherwise (except of course on Windows, where it uses the value of the COMSPEC variable). This means that various Bash-only features like arrays, [[ etc are not available.

If you need to use Bash-only syntax, you can pass in the path to the shell as executable="/bin/bash" (where of course if your Bash is installed somewhere else, you need to adjust the path).

subprocess.run("""
    # This for loop syntax is Bash only
    for((i=1;i<=$#;i++)); do
        # Arrays are Bash-only
        array[i]+=123
    done""",
    shell=True, check=True,
    executable="/bin/bash")

A subprocess is separate from its parent, and cannot change it

A somewhat common mistake is doing something like

subprocess.run("cd /tmp", shell=True)
subprocess.run("pwd", shell=True)  # Oops, doesn"t print /tmp

The same thing will happen if the first subprocess tries to set an environment variable, which of course will have disappeared when you run another subprocess, etc.

A child process runs completely separate from Python, and when it finishes, Python has no idea what it did (apart from the vague indicators that it can infer from the exit status and output from the child process). A child generally cannot change the parent"s environment; it cannot set a variable, change the working directory, or, in so many words, communicate with its parent without cooperation from the parent.

The immediate fix in this particular case is to run both commands in a single subprocess;

subprocess.run("cd /tmp; pwd", shell=True)

though obviously this particular use case isn"t very useful; instead, use the cwd keyword argument, or simply os.chdir() before running the subprocess. Similarly, for setting a variable, you can manipulate the environment of the current process (and thus also its children) via

os.environ["foo"] = "bar"

or pass an environment setting to a child process with

subprocess.run("echo "$foo"", shell=True, env={"foo": "bar"})

(not to mention the obvious refactoring subprocess.run(["echo", "bar"]); but echo is a poor example of something to run in a subprocess in the first place, of course).

Don"t run Python from Python

This is slightly dubious advice; there are certainly situations where it does make sense or is even an absolute requirement to run the Python interpreter as a subprocess from a Python script. But very frequently, the correct approach is simply to import the other Python module into your calling script and call its functions directly.

If the other Python script is under your control, and it isn't a module, consider turning it into one. (This answer is too long already so I will not delve into details here.)
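
As a sketch, assuming a hypothetical module named other_tool that exposes a main() function, the difference looks like this:

import sys
import subprocess

import other_tool  # hypothetical module, used only for illustration

# Running the interpreter as a subprocess: everything has to go through
# command-line arguments and captured text output
subprocess.run([sys.executable, "other_tool.py", "--input", "data.csv"], check=True)

# Importing and calling directly: real Python objects in, real objects out
result = other_tool.main(input_path="data.csv")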

If you need parallelism, you can run Python functions in subprocesses with the multiprocessing module. There is also threading, which runs multiple tasks in a single process; it is more lightweight and gives you more control, but it is also more constrained, in that threads within a process are tightly coupled and bound to a single GIL.
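
Here is a minimal sketch of the multiprocessing route (the worker function is just a placeholder); note that with the spawn start method, which is the default on Windows and recent macOS, the pool should be created under the __main__ guard:

import multiprocessing

def work(n):
    # Placeholder for a CPU-intensive function
    return n * n

if __name__ == "__main__":
    with multiprocessing.Pool(processes=4) as pool:
        results = pool.map(work, range(10))
    print(results)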

Answer #9

urllib has been split up in Python 3.

  • The urllib.urlencode() function is now urllib.parse.urlencode().
  • The urllib.urlopen() function is now urllib.request.urlopen().
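
For example, code that used the old Python 2 names can be updated along these lines (the URL is just a placeholder):

from urllib.parse import urlencode
from urllib.request import urlopen

query = urlencode({"q": "python", "page": 1})
with urlopen("https://example.com/search?" + query) as response:
    body = response.read()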

Answer #10

Distribution Fitting with Sum of Square Error (SSE)

This is an update and modification to Saullo's answer. It uses the full list of the current scipy.stats distributions and returns the distribution with the least SSE between the fitted distribution's PDF and the data's histogram.

Example Fitting

Using the El Niño dataset from statsmodels, the distributions are fit and error is determined. The distribution with the least error is returned.

All Fitted Distributions

[figure: data histogram with all fitted distributions overlaid]

Best Fit Distribution

[figure: data histogram with the best-fit distribution's PDF]

Example Code

%matplotlib inline

import warnings
import numpy as np
import pandas as pd
import scipy.stats as st
import statsmodels.api as sm
from scipy.stats._continuous_distns import _distn_names
import matplotlib
import matplotlib.pyplot as plt

matplotlib.rcParams["figure.figsize"] = (16.0, 12.0)
matplotlib.style.use("ggplot")

# Create models from data
def best_fit_distribution(data, bins=200, ax=None):
    """Model data by finding best fit distribution to data"""
    # Get histogram of original data
    y, x = np.histogram(data, bins=bins, density=True)
    x = (x + np.roll(x, -1))[:-1] / 2.0

    # Best holders
    best_distributions = []

    # Estimate distribution parameters from data
    for ii, distribution in enumerate([d for d in _distn_names if d not in ["levy_stable", "studentized_range"]]):

        print("{:>3} / {:<3}: {}".format( ii+1, len(_distn_names), distribution ))

        distribution = getattr(st, distribution)

        # Try to fit the distribution
        try:
            # Ignore warnings from data that can't be fit
            with warnings.catch_warnings():
                warnings.filterwarnings("ignore")
                
                # fit dist to data
                params = distribution.fit(data)

                # Separate parts of parameters
                arg = params[:-2]
                loc = params[-2]
                scale = params[-1]
                
                # Calculate fitted PDF and error with fit in distribution
                pdf = distribution.pdf(x, loc=loc, scale=scale, *arg)
                sse = np.sum(np.power(y - pdf, 2.0))
                
                # If an axis was passed in, add this fitted PDF to the plot
                try:
                    if ax:
                        pd.Series(pdf, x).plot(ax=ax)
                except Exception:
                    pass

                # store the fitted distribution, its parameters, and its SSE
                best_distributions.append((distribution, params, sse))
        
        except Exception:
            pass

    
    return sorted(best_distributions, key=lambda x:x[2])

def make_pdf(dist, params, size=10000):
    """Generate distributions"s Probability Distribution Function """

    # Separate parts of parameters
    arg = params[:-2]
    loc = params[-2]
    scale = params[-1]

    # Get sane start and end points of distribution
    start = dist.ppf(0.01, *arg, loc=loc, scale=scale) if arg else dist.ppf(0.01, loc=loc, scale=scale)
    end = dist.ppf(0.99, *arg, loc=loc, scale=scale) if arg else dist.ppf(0.99, loc=loc, scale=scale)

    # Build PDF and turn into pandas Series
    x = np.linspace(start, end, size)
    y = dist.pdf(x, loc=loc, scale=scale, *arg)
    pdf = pd.Series(y, x)

    return pdf

# Load data from statsmodels datasets
data = pd.Series(sm.datasets.elnino.load_pandas().data.set_index("YEAR").values.ravel())

# Plot for comparison
plt.figure(figsize=(12,8))
ax = data.plot(kind="hist", bins=50, density=True, alpha=0.5, color=list(matplotlib.rcParams["axes.prop_cycle"])[1]["color"])

# Save plot limits
dataYLim = ax.get_ylim()

# Find best fit distribution
best_distributions = best_fit_distribution(data, 200, ax)
best_dist = best_distributions[0]

# Update plots
ax.set_ylim(dataYLim)
ax.set_title(u"El Niño sea temp.
 All Fitted Distributions")
ax.set_xlabel(u"Temp (°C)")
ax.set_ylabel("Frequency")

# Make PDF with best params 
pdf = make_pdf(best_dist[0], best_dist[1])

# Display
plt.figure(figsize=(12,8))
ax = pdf.plot(lw=2, label="PDF", legend=True)
data.plot(kind="hist", bins=50, density=True, alpha=0.5, label="Data", legend=True, ax=ax)

param_names = (best_dist[0].shapes + ", loc, scale").split(", ") if best_dist[0].shapes else ["loc", "scale"]
param_str = ", ".join(["{}={:0.2f}".format(k,v) for k,v in zip(param_names, best_dist[1])])
dist_str = "{}({})".format(best_dist[0].name, param_str)

ax.set_title(u"El Niño sea temp. with best fit distribution 
" + dist_str)
ax.set_xlabel(u"Temp. (°C)")
ax.set_ylabel("Frequency")
