Splitting a large file into separate modules in C/C++, Java and Python


Writing an entire large program as a single file scales poorly and usually ends with a rewrite from scratch.

So, to handle this scenario, we can divide the problem into several sub-problems and then solve them one by one.

This not only makes our task easier, but also lets us maintain and reuse each piece independently.

Now the big question is how to perform this "split" not just theoretically, but IN SOFTWARE.

We will see how such splitting into modules works in popular languages such as C/C++, Python & Java.

C/C++

For illustration, let's assume we have all the basic linked-list insertion routines in one program. Since there are many methods (functions), we cannot clutter the program by writing all the method definitions above the required main function. But even if we did, there could be an ordering problem, where one method must appear before another, and so on.

So, to solve this problem, we can declare all the prototypes at the beginning of the program, then the main method, and below that we can define the methods in any order:

Program:

FullLinkedList.c

// A complete linked list with all insertions in one file

#include <stdio.h>
#include <stdlib.h>

// --------------------------------
// Declarations - START:
// --------------------------------

struct Node;

struct Node *create_node(int data);
void b_insert(struct Node **head, int data);
void n_insert(struct Node **head, int data, int pos);
void e_insert(struct Node **head, int data);
void display(struct Node *temp);

// --------------------------------
// Declarations - END
// --------------------------------

int main()
{
    struct Node *head = NULL;

    int ch, data, pos;

    printf("Linked List:\n");
    while (1) {
        printf("1.Insert at Beginning\n");
        printf("2.Insert at Nth Position\n");
        printf("3.Insert At Ending\n");
        printf("4.Display\n");
        printf("0.Exit\n");
        printf("Enter your choice: ");
        scanf("%d", &ch);

        switch (ch) {
        case 1:
            printf("Enter the data: ");
            scanf("%d", &data);
            b_insert(&head, data);
            break;

        case 2:
            printf("Enter the data: ");
            scanf("%d", &data);

            printf("Enter the Position: ");
            scanf("%d", &pos);
            n_insert(&head, data, pos);
            break;

        case 3:
            printf("Enter the data: ");
            scanf("%d", &data);
            e_insert(&head, data);
            break;

        case 4:
            display(head);
            break;

        case 0:
            return 0;

        default:
            printf("Wrong Choice\n");
        }
    }
}

// --------------------------------
// Definitions - START:
// --------------------------------

struct Node {
    int data;
    struct Node *next;
};

struct Node *create_node(int data)
{
    struct Node *temp = (struct Node *)malloc(sizeof(struct Node));
    temp->data = data;
    temp->next = NULL;

    return temp;
}

void b_insert(struct Node **head, int data)
{
    struct Node *new_node = create_node(data);

    new_node->next = *head;
    *head = new_node;
}

void n_insert(struct Node **head, int data, int pos)
{
    if (*head == NULL) {
        b_insert(head, data);
        return;
    }

    struct Node *new_node = create_node(data);

    struct Node *temp = *head;

    for (int i = 0; i < pos - 2; ++i)
        temp = temp->next;

    new_node->next = temp->next;
    temp->next = new_node;
}

void e_insert(struct Node **head, int data)
{
    if (*head == NULL) {
        b_insert(head, data);
        return;
    }

    struct Node *temp = *head;

    while (temp->next != NULL)
        temp = temp->next;

    struct Node *new_node = create_node(data);
    temp->next = new_node;
}

void display(struct Node *temp)
{
    printf("The elements are: ");
    while (temp != NULL) {
        printf("%d ", temp->data);
        temp = temp->next;
    }
    printf("\n");
}

// --------------------------------
// Definitions - END
// --------------------------------

Compiling the code: we can compile the above program with:

 gcc FullLinkedList.c -o linkedlist 

And it works!

The main problems in the above code:
We can already see the main problem with this program: the code is not easy to work with, either individually or in a group.

If someone wants to work on the above program, these are some of the many problems that person faces:

  1. One has to go through the full source file to improve or extend any functionality.
  2. The program cannot easily be reused as a basis for other projects.
  3. The code is very cluttered and not at all attractive, which makes it very difficult to navigate.

In the case of a group project or a large program, the above approach is guaranteed to increase the overall cost, effort, and failure rate.

Correct approach:

Every C/C++ program begins with lines that start with "#include".
Such a line means: include all the functions declared in the "library" header (.h file), whose definitions usually live in corresponding library .c/.cpp files.

These lines are handled by the preprocessor at compile time.

We can manually create such a library for our own purposes.

Important things to remember:

  1. The ".h" files contain only prototype declarations (of functions, structures) and global variables.
  2. The ".c/.cpp" files contain the actual implementation (the definitions of what the header files declare).
  3. When compiling all the source files together, make sure that multiple definitions of the same function, variable, etc. do not exist in the same project. (VERY IMPORTANT)
  4. Use static functions to restrict them to the file in which they are defined.
  5. Use the extern keyword to reference variables defined in external files. (A small sketch of points 4 and 5 follows this list.)
  6. When using C++, be careful with namespaces; always qualify calls as namespace_name::function() to avoid collisions.
  7. Split the program into smaller pieces:
    After examining the above program, we can see how this large program can be divided into suitable small parts and then handled easily.

    The above program essentially does two things:
    1) Creating, inserting, and storing data in nodes.
    2) Displaying nodes.

    So we can split the program so that:
    1) The main file is the program driver, a nice wrapper around the insertion modules, and the place where we pull in the additional files.
    2) Insertion: the real implementation lies here.

    With the points above in mind, the program is divided into:

    linkedlist.c -> contains the driver program
    insert.c -> contains the code for insertion

    linkedlist.h -> contains the necessary Node declarations
    insert.h -> contains the necessary node-insertion declarations

    Each header file starts with:

     #ifndef FILENAME_H
     #define FILENAME_H

     Declarations ...

     #endif

    The reason we write our declarations between #ifndef, #define, and #endif is to prevent multiple declarations of identifiers (data types, variables, etc.) when the same header file is included in a new file belonging to the same project.

    For this sample program:

    insert.h -> contains the declarations of the insertion functions, as well as a forward declaration of the Node structure itself.

    It is very important to remember that the compiler can see the declarations in a header file, while the definitions may live elsewhere; the compiler compiles each .c file individually, and the pieces are only stitched together at the link step.

    linkedlist.h -> a helper file that contains the Node and display declarations; it must be included by the files that use them.

    insert.c -> includes the Node declaration via #include "linkedlist.h", and contains the definitions of all the methods declared in insert.h.

    linkedlist.c -> a simple wrapper containing an infinite loop that prompts the user to insert integer data at the required position, plus the method that displays the list.

    The last thing to keep in mind is that careless inclusion of files into one another can lead to multiple redefinitions and will result in an error.

    Taking the above into account, you should carefully divide the program into suitable modules.
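
    Before looking at the split files themselves, here is a minimal sketch of points 4 and 5 above; the file names util.c and main.c are hypothetical and not part of the project being split:

    // util.c (hypothetical)
    #include <stdio.h>

    int shared_counter = 0; // definition, external linkage

    // static: this helper is visible only inside util.c
    static void log_counter(void)
    {
        printf("counter = %d\n", shared_counter);
    }

    void bump(void) // ordinary external linkage, callable from other files
    {
        ++shared_counter;
        log_counter();
    }

    // main.c (hypothetical)
    extern int shared_counter; // declaration only: defined in util.c
    void bump(void);

    int main(void)
    {
        bump();
        return shared_counter;
    }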

    linkedlist.h

    // linkedlist.h

    #ifndef LINKED_LIST_H
    #define LINKED_LIST_H

    struct Node {
        int data;
        struct Node *next;
    };

    void display(struct Node *temp);

    #endif

    insert.h

    // insert.h

    #ifndef INSERT_H
    #define INSERT_H

    struct Node;

    struct Node *create_node(int data);
    void b_insert(struct Node **head, int data);
    void n_insert(struct Node **head, int data, int pos);
    void e_insert(struct Node **head, int data);

    #endif

    insert.c

    // insert.c

    #include "linkedlist.h"
    // "" quotes make the preprocessor look in
    // the current directory first, and in the
    // standard library paths afterwards.

    #include <stdlib.h>

    struct Node *create_node(int data)
    {
        struct Node *temp = (struct Node *)malloc(sizeof(struct Node));
        temp->data = data;
        temp->next = NULL;

        return temp;
    }

    void b_insert(struct Node **head, int data)
    {
        struct Node *new_node = create_node(data);

        new_node->next = *head;
        *head = new_node;
    }

    void n_insert(struct Node **head, int data, int pos)
    {
        if (*head == NULL) {
            b_insert(head, data);
            return;
        }

        struct Node *new_node = create_node(data);

        struct Node *temp = *head;

        for (int i = 0; i < pos - 2; ++i)
            temp = temp->next;

        new_node->next = temp->next;
        temp->next = new_node;
    }

    void e_insert(struct Node **head, int data)
    {
        if (*head == NULL) {
            b_insert(head, data);
            return;
        }

        struct Node *temp = *head;

        while (temp->next != NULL)
            temp = temp->next;

        struct Node *new_node = create_node(data);
        temp->next = new_node;
    }
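
    The original listing stops before showing the driver file. A sketch of linkedlist.c, assembled from the monolithic main() and display() shown earlier:

    linkedlist.c

    // linkedlist.c

    #include <stdio.h>

    #include "linkedlist.h"
    #include "insert.h"

    void display(struct Node *temp)
    {
        printf("The elements are: ");
        while (temp != NULL) {
            printf("%d ", temp->data);
            temp = temp->next;
        }
        printf("\n");
    }

    int main()
    {
        struct Node *head = NULL;
        int ch, data, pos;

        printf("Linked List:\n");
        while (1) {
            printf("1.Insert at Beginning\n");
            printf("2.Insert at Nth Position\n");
            printf("3.Insert At Ending\n");
            printf("4.Display\n");
            printf("0.Exit\n");
            printf("Enter your choice: ");
            scanf("%d", &ch);

            switch (ch) {
            case 1:
                printf("Enter the data: ");
                scanf("%d", &data);
                b_insert(&head, data);
                break;
            case 2:
                printf("Enter the data: ");
                scanf("%d", &data);
                printf("Enter the Position: ");
                scanf("%d", &pos);
                n_insert(&head, data, pos);
                break;
            case 3:
                printf("Enter the data: ");
                scanf("%d", &data);
                e_insert(&head, data);
                break;
            case 4:
                display(head);
                break;
            case 0:
                return 0;
            default:
                printf("Wrong Choice\n");
            }
        }
    }

    The split project is then compiled by passing both source files together:

     gcc linkedlist.c insert.c -o linkedlist 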



    Splitting a large file into separate modules in C/C++, Java and Python: StackOverflow Questions

    How to print number with commas as thousands separators?

    I am trying to print an integer in Python 2.6.1 with commas as thousands separators. For example, I want to show the number 1234567 as 1,234,567. How would I go about doing this? I have seen many examples on Google, but I am looking for the simplest practical way.

    It does not need to be locale-specific to decide between periods and commas. I would prefer something as simple as reasonably possible.
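
    A short sketch of the usual approach (note that the "," format specifier needs Python 2.7 / 3.1 or newer, slightly newer than the 2.6.1 mentioned above):

    >>> "{:,}".format(1234567)
    '1,234,567'
    >>> f"{1234567:,}"  # f-strings: Python 3.6+
    '1,234,567'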

    How would you make a comma-separated string from a list of strings?


    What would be your preferred way to concatenate strings from a sequence such that between every two consecutive pairs a comma is added. That is, how do you map, for instance, ["a", "b", "c"] to "a,b,c"? (The cases ["s"] and [] should be mapped to "s" and "", respectively.)

    I usually end up using something like "".join(map(lambda x: x+",",l))[:-1], but also feeling somewhat unsatisfied.
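
    For reference, str.join handles all three cases directly:

    >>> ",".join(["a", "b", "c"])
    'a,b,c'
    >>> ",".join(["s"])
    's'
    >>> ",".join([])
    ''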

    Separation of business logic and data access in django

    I am writing a project in Django and I see that 80% of the code is in the file models.py. This code is confusing and, after a certain time, I cease to understand what is really happening.

    Here is what bothers me:

    1. I find it ugly that my model level (which was supposed to be responsible only for working with data from a database) is also sending email, calling the APIs of other services, etc.
    2. Also, I find it unacceptable to place business logic in the view, because this way it becomes difficult to control. For example, in my application there are at least three ways to create new instances of User, but technically it should create them uniformly.
    3. I do not always notice when the methods and properties of my models become non-deterministic and when they develop side effects.

    Here is a simple example. At first, the User model was like this:

    class User(db.Models):
    
        def get_present_name(self):
            return self.name or "Anonymous"
    
        def activate(self):
            self.status = "activated"
            self.save()
    

    Over time, it turned into this:

    class User(db.Models):
    
        def get_present_name(self): 
            # property became non-deterministic in terms of database
            # data is taken from another service by api
            return remote_api.request_user_name(self.uid) or "Anonymous" 
    
        def activate(self):
            # method now has a side effect (send message to user)
            self.status = "activated"
            self.save()
            send_mail("Your account is activated!", "…", [self.email])
    

    What I want is to separate entities in my code:

    1. Entities of my database, persistence level: What data does my application keep?
    2. Entities of my application, business logic level: What does my application do?

    What are the good practices to implement such an approach that can be applied in Django?
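
    One common practice (a minimal sketch; the services.py module name is a convention assumed here, not something Django mandates) is to keep the model persistence-only and move the side effects into a service-layer function:

    # services.py (hypothetical module)
    from django.core.mail import send_mail

    def activate_user(user):
        # persistence stays on the model...
        user.status = "activated"
        user.save()
        # ...while side effects live at the business-logic level
        send_mail("Your account is activated!", "...",
                  "noreply@example.com", [user.email])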

    Extracting just Month and Year separately from Pandas Datetime column

    I have a Dataframe, df, with the following column:

    df["ArrivalDate"] =
    ...
    936   2012-12-31
    938   2012-12-29
    965   2012-12-31
    966   2012-12-31
    967   2012-12-31
    968   2012-12-31
    969   2012-12-31
    970   2012-12-29
    971   2012-12-31
    972   2012-12-29
    973   2012-12-29
    ...
    

    The elements of the column are pandas.tslib.Timestamp.

    I want to just include the year and month. I thought there would be a simple way to do it, but I can't figure it out.

    Here's what I've tried:

    df["ArrivalDate"].resample("M", how = "mean")
    

    I got the following error:

    Only valid with DatetimeIndex or PeriodIndex 
    

    Then I tried:

    df["ArrivalDate"].apply(lambda(x):x[:-2])
    

    I got the following error:

    "Timestamp" object has no attribute "__getitem__" 
    

    Any suggestions?

    Edit: I sort of figured it out.

    df.index = df["ArrivalDate"]
    

    Then, I can resample another column using the index.

    But I'd still like a method for reconfiguring the entire column. Any ideas?
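
    On current pandas the .dt accessor does this directly (assuming the column has a datetime64 dtype):

    df["year"] = df["ArrivalDate"].dt.year
    df["month"] = df["ArrivalDate"].dt.month
    # or a single monthly period column:
    df["year_month"] = df["ArrivalDate"].dt.to_period("M")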

    In Python, how do I split a string and keep the separators?

    Here's the simplest way to explain this. Here's what I'm using:

    re.split("\W", "foo/bar spam\neggs")
    -> ["foo", "bar", "spam", "eggs"]
    

    Here's what I want:

    someMethod("\W", "foo/bar spam\neggs")
    -> ["foo", "/", "bar", " ", "spam", "\n", "eggs"]
    
    

    The reason is that I want to split a string into tokens, manipulate it, then put it back together again.
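
    For reference, wrapping the pattern in a capturing group makes re.split keep the separators:

    >>> import re
    >>> re.split(r"(\W)", "foo/bar spam\neggs")
    ['foo', '/', 'bar', ' ', 'spam', '\n', 'eggs']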

    Split (explode) pandas dataframe string entry to separate rows

    I have a pandas dataframe in which one column of text strings contains comma-separated values. I want to split each CSV field and create a new row per entry (assume that CSV are clean and need only be split on ","). For example, a should become b:

    In [7]: a
    Out[7]: 
        var1  var2
    0  a,b,c     1
    1  d,e,f     2
    
    In [8]: b
    Out[8]: 
      var1  var2
    0    a     1
    1    b     1
    2    c     1
    3    d     2
    4    e     2
    5    f     2
    

    So far, I have tried various simple functions, but the .apply method seems to only accept one row as return value when it is used on an axis, and I can't get .transform to work. Any suggestions would be much appreciated!

    Example data:

    from pandas import DataFrame
    import numpy as np
    a = DataFrame([{"var1": "a,b,c", "var2": 1},
                   {"var1": "d,e,f", "var2": 2}])
    b = DataFrame([{"var1": "a", "var2": 1},
                   {"var1": "b", "var2": 1},
                   {"var1": "c", "var2": 1},
                   {"var1": "d", "var2": 2},
                   {"var1": "e", "var2": 2},
                   {"var1": "f", "var2": 2}])
    

    I know this won't work because we lose DataFrame meta-data by going through numpy, but it should give you a sense of what I tried to do:

    def fun(row):
        letters = row["var1"]
        letters = letters.split(",")
        out = np.array([row] * len(letters))
        out["var1"] = letters
    a["idx"] = range(a.shape[0])
    z = a.groupby("idx")
    z.transform(fun)
    
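
    On pandas 0.25 or newer, Series.explode does exactly this:

    b = (a.assign(var1=a["var1"].str.split(","))
           .explode("var1")
           .reset_index(drop=True))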

    Simpler way to create dictionary of separate variables?

    I would like to be able to get the name of a variable as a string, but I don't know if Python has that much introspection capability. Something like:

    >>> print(my_var.__name__)
    "my_var"
    

    I want to do that because I have a bunch of variables I'd like to turn into a dictionary like:

    bar = True
    foo = False
    >>> my_dict = dict(bar=bar, foo=foo)
    >>> print my_dict 
    {"foo": False, "bar": True}
    

    But I'd like something more automatic than that.

    Python has locals() and vars(), so I guess there is a way.
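
    A sketch using vars() (the names still have to be listed once, since an object does not know which variables point at it):

    bar = True
    foo = False
    scope = vars()  # the module-level namespace
    my_dict = {name: scope[name] for name in ("bar", "foo")}
    print(my_dict)  # {'bar': True, 'foo': False}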

    Split / Explode a column of dictionaries into separate columns with pandas

    I have data saved in a postgreSQL database. I am querying this data using Python2.7 and turning it into a Pandas DataFrame. However, the last column of this dataframe has a dictionary of values inside it. The DataFrame df looks like this:

    Station ID     Pollutants
    8809           {"a": "46", "b": "3", "c": "12"}
    8810           {"a": "36", "b": "5", "c": "8"}
    8811           {"b": "2", "c": "7"}
    8812           {"c": "11"}
    8813           {"a": "82", "c": "15"}
    

    I need to split this column into separate columns, so that the DataFrame df2 looks like this:

    Station ID     a      b       c
    8809           46     3       12
    8810           36     5       8
    8811           NaN    2       7
    8812           NaN    NaN     11
    8813           82     NaN     15
    

    The major issue I'm having is that the lists are not the same length. But all of the lists only contain up to the same 3 values: "a", "b", and "c". And they always appear in the same order ("a" first, "b" second, "c" third).

    The following code USED to work and return exactly what I wanted (df2).

    objs = [df, pandas.DataFrame(df["Pollutant Levels"].tolist()).iloc[:, :3]]
    df2 = pandas.concat(objs, axis=1).drop("Pollutant Levels", axis=1)
    print(df2)
    

    I was running this code just last week and it was working fine. But now my code is broken and I get this error from line [4]:

    IndexError: out-of-bounds on slice (end) 
    

    I made no changes to the code but am now getting the error. I feel this is due to my method not being robust or proper.

    Any suggestions or guidance on how to split this column of lists into separate columns would be super appreciated!

    EDIT: I think the .tolist() and .apply methods are not working on my code because it is one Unicode string, i.e.:

    #My data format 
    u{"a": "1", "b": "2", "c": "3"}
    
    #and not
    {u"a": "1", u"b": "2", u"c": "3"}
    

    The data is imported from the postgreSQL database in this format. Any help or ideas with this issue? Is there a way to convert the Unicode?
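
    For reference, one approach that handles both problems (assuming the column is named "Pollutants" as in the sample): parse the strings with ast.literal_eval, then expand each dict into its own columns:

    import ast
    import pandas as pd

    # only needed if the column holds strings rather than dicts:
    df["Pollutants"] = df["Pollutants"].apply(ast.literal_eval)

    df2 = df.drop(columns="Pollutants").join(df["Pollutants"].apply(pd.Series))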

    How to use "/" (directory separator) in both Linux and Windows in Python?

    I have written code in Python which uses / to build the path to a particular file in a folder. If I want to use the code on Windows, it will not work. Is there a way I can use the same code on both Windows and Linux?

    In python I am using this code:

    pathfile=os.path.dirname(templateFile)
    rootTree.write(""+pathfile+"/output/log.txt")
    

    When I will use my code in suppose windows machine my code will not work.

    How do I use "/" (directory separator) in both Linux and Windows?
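
    For reference, os.path.join picks the right separator for the current OS (reusing templateFile and rootTree from the question):

    import os

    pathfile = os.path.dirname(templateFile)
    rootTree.write(os.path.join(pathfile, "output", "log.txt"))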

    How can I plot separate Pandas DataFrames as subplots?

    I have a few Pandas DataFrames sharing the same value scale, but having different columns and indices. When invoking df.plot(), I get separate plot images. What I really want is to have them all in the same plot as subplots, but I'm unfortunately failing to come up with a solution for how to do this and would highly appreciate some help.
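
    A minimal sketch with matplotlib (df1 and df2 stand for your DataFrames):

    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(nrows=1, ncols=2, sharey=True)
    df1.plot(ax=axes[0])
    df2.plot(ax=axes[1])
    plt.show()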

    Answer #1

    Recommendation for beginners:

    This is my personal recommendation for beginners: start by learning virtualenv and pip, tools which work with both Python 2 and 3 and in a variety of situations, and pick up other tools once you start needing them.

    PyPI packages not in the standard library:

    • virtualenv is a very popular tool that creates isolated Python environments for Python libraries. If you're not familiar with this tool, I highly recommend learning it, as it is a very useful tool, and I'll be making comparisons to it for the rest of this answer.

    It works by installing a bunch of files in a directory (eg: env/), and then modifying the PATH environment variable to prefix it with a custom bin directory (eg: env/bin/). An exact copy of the python or python3 binary is placed in this directory, but Python is programmed to look for libraries relative to its path first, in the environment directory. It's not part of Python's standard library, but is officially blessed by the PyPA (Python Packaging Authority). Once activated, you can install packages in the virtual environment using pip.

    • pyenv is used to isolate Python versions. For example, you may want to test your code against Python 2.7, 3.6, 3.7 and 3.8, so you'll need a way to switch between them. Once activated, it prefixes the PATH environment variable with ~/.pyenv/shims, where there are special files matching the Python commands (python, pip). These are not copies of the Python-shipped commands; they are special scripts that decide on the fly which version of Python to run based on the PYENV_VERSION environment variable, or the .python-version file, or the ~/.pyenv/version file. pyenv also makes the process of downloading and installing multiple Python versions easier, using the command pyenv install.

    • pyenv-virtualenv is a plugin for pyenv by the same author as pyenv, to allow you to use pyenv and virtualenv at the same time conveniently. However, if you're using Python 3.3 or later, pyenv-virtualenv will try to run python -m venv if it is available, instead of virtualenv. You can use virtualenv and pyenv together without pyenv-virtualenv, if you don't want the convenience features.

    • virtualenvwrapper is a set of extensions to virtualenv (see docs). It gives you commands like mkvirtualenv, lssitepackages, and especially workon for switching between different virtualenv directories. This tool is especially useful if you want multiple virtualenv directories.

    • pyenv-virtualenvwrapper is a plugin for pyenv by the same author as pyenv, to conveniently integrate virtualenvwrapper into pyenv.

    • pipenv aims to combine Pipfile, pip and virtualenv into one command on the command-line. The virtualenv directory typically gets placed in ~/.local/share/virtualenvs/XXX, with XXX being a hash of the path of the project directory. This is different from virtualenv, where the directory is typically in the current working directory. pipenv is meant to be used when developing Python applications (as opposed to libraries). There are alternatives to pipenv, such as poetry, which I won't list here since this question is only about the packages that are similarly named.

    Standard library:

    • pyvenv (not to be confused with pyenv in the previous section) is a script shipped with Python 3 but deprecated in Python 3.6 as it had problems (not to mention the confusing name). In Python 3.6+, the exact equivalent is python3 -m venv.

    • venv is a package shipped with Python 3, which you can run using python3 -m venv (although for some reason some distros separate it out into a separate distro package, such as python3-venv on Ubuntu/Debian). It serves the same purpose as virtualenv, but only has a subset of its features (see a comparison here). virtualenv continues to be more popular than venv, especially since the former supports both Python 2 and 3.

    Answer #2

    In Python, what is the purpose of __slots__, and in what cases should one avoid it?

    TLDR:

    The special attribute __slots__ allows you to explicitly state which instance attributes you expect your object instances to have, with the expected results:

    1. faster attribute access.
    2. space savings in memory.

    The space savings is from

    1. Storing value references in slots instead of __dict__.
    2. Denying __dict__ and __weakref__ creation if parent classes deny them and you declare __slots__.

    Quick Caveats

    Small caveat, you should only declare a particular slot one time in an inheritance tree. For example:

    class Base:
        __slots__ = "foo", "bar"
    
    class Right(Base):
        __slots__ = "baz", 
    
    class Wrong(Base):
        __slots__ = "foo", "bar", "baz"        # redundant foo and bar
    

    Python doesn't object when you get this wrong (it probably should), problems might not otherwise manifest, but your objects will take up more space than they otherwise should. Python 3.8:

    >>> from sys import getsizeof
    >>> getsizeof(Right()), getsizeof(Wrong())
    (56, 72)
    

    This is because the Base's slot descriptor has a slot separate from the Wrong's. This shouldn't usually come up, but it could:

    >>> w = Wrong()
    >>> w.foo = "foo"
    >>> Base.foo.__get__(w)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: foo
    >>> Wrong.foo.__get__(w)
    "foo"
    

    The biggest caveat is for multiple inheritance - multiple "parent classes with nonempty slots" cannot be combined.

    To accommodate this restriction, follow best practices: factor out all but one (or all) of the parents' abstractions, which their concrete classes respectively, and your new concrete class collectively, will inherit from - giving the abstraction(s) empty slots (just like abstract base classes in the standard library).

    See section on multiple inheritance below for an example.

    Requirements:

    • To have attributes named in __slots__ to actually be stored in slots instead of a __dict__, a class must inherit from object (automatic in Python 3, but must be explicit in Python 2).

    • To prevent the creation of a __dict__, you must inherit from object and all classes in the inheritance must declare __slots__ and none of them can have a "__dict__" entry.

    There are a lot of details if you wish to keep reading.

    Why use __slots__: Faster attribute access.

    The creator of Python, Guido van Rossum, states that he actually created __slots__ for faster attribute access.

    It is trivial to demonstrate measurably significant faster access:

    import timeit
    
    class Foo(object): __slots__ = "foo",
    
    class Bar(object): pass
    
    slotted = Foo()
    not_slotted = Bar()
    
    def get_set_delete_fn(obj):
        def get_set_delete():
            obj.foo = "foo"
            obj.foo
            del obj.foo
        return get_set_delete
    

    and

    >>> min(timeit.repeat(get_set_delete_fn(slotted)))
    0.2846834529991611
    >>> min(timeit.repeat(get_set_delete_fn(not_slotted)))
    0.3664822799983085
    

    The slotted access is almost 30% faster in Python 3.5 on Ubuntu.

    >>> 0.3664822799983085 / 0.2846834529991611
    1.2873325658284342
    

    In Python 2 on Windows I have measured it about 15% faster.

    Why use __slots__: Memory Savings

    Another purpose of __slots__ is to reduce the space in memory that each object instance takes up.

    My own contribution to the documentation clearly states the reasons behind this:

    The space saved over using __dict__ can be significant.

    SQLAlchemy attributes a lot of memory savings to __slots__.

    To verify this, using the Anaconda distribution of Python 2.7 on Ubuntu Linux, with guppy.hpy (aka heapy) and sys.getsizeof, the size of a class instance without __slots__ declared, and nothing else, is 64 bytes. That does not include the __dict__. Thank you Python for lazy evaluation again, the __dict__ is apparently not called into existence until it is referenced, but classes without data are usually useless. When called into existence, the __dict__ attribute is a minimum of 280 bytes additionally.

    In contrast, a class instance with __slots__ declared to be () (no data) is only 16 bytes, and 56 total bytes with one item in slots, 64 with two.

    For 64 bit Python, I illustrate the memory consumption in bytes in Python 2.7 and 3.6, for __slots__ and __dict__ (no slots defined) for each point where the dict grows in 3.6 (except for 0, 1, and 2 attributes):

           Python 2.7             Python 3.6
    attrs  __slots__  __dict__*   __slots__  __dict__* | *(no slots defined)
    none   16         56 + 272†   16         56 + 112† | †if __dict__ referenced
    one    48         56 + 272    48         56 + 112
    two    56         56 + 272    56         56 + 112
    six    88         56 + 1040   88         56 + 152
    11     128        56 + 1040   128        56 + 240
    22     216        56 + 3344   216        56 + 408     
    43     384        56 + 3344   384        56 + 752
    

    So, in spite of smaller dicts in Python 3, we see how nicely __slots__ scale for instances to save us memory, and that is a major reason you would want to use __slots__.

    Just for completeness of my notes, note that there is a one-time cost per slot in the class's namespace of 64 bytes in Python 2, and 72 bytes in Python 3, because slots use data descriptors like properties, called "members".

    >>> Foo.foo
    <member "foo" of "Foo" objects>
    >>> type(Foo.foo)
    <class "member_descriptor">
    >>> getsizeof(Foo.foo)
    72
    

    Demonstration of __slots__:

    To deny the creation of a __dict__, you must subclass object. Everything subclasses object in Python 3, but in Python 2 you had to be explicit:

    class Base(object): 
        __slots__ = ()
    

    now:

    >>> b = Base()
    >>> b.a = "a"
    Traceback (most recent call last):
      File "<pyshell#38>", line 1, in <module>
        b.a = "a"
    AttributeError: "Base" object has no attribute "a"
    

    Or subclass another class that defines __slots__

    class Child(Base):
        __slots__ = ("a",)
    

    and now:

    c = Child()
    c.a = "a"
    

    but:

    >>> c.b = "b"
    Traceback (most recent call last):
      File "<pyshell#42>", line 1, in <module>
        c.b = "b"
    AttributeError: "Child" object has no attribute "b"
    

    To allow __dict__ creation while subclassing slotted objects, just add "__dict__" to the __slots__ (note that slots are ordered, and you shouldn't repeat slots that are already in parent classes):

    class SlottedWithDict(Child): 
        __slots__ = ("__dict__", "b")
    
    swd = SlottedWithDict()
    swd.a = "a"
    swd.b = "b"
    swd.c = "c"
    

    and

    >>> swd.__dict__
    {"c": "c"}
    

    Or you don't even need to declare __slots__ in your subclass, and you will still use slots from the parents, but not restrict the creation of a __dict__:

    class NoSlots(Child): pass
    ns = NoSlots()
    ns.a = "a"
    ns.b = "b"
    

    And:

    >>> ns.__dict__
    {"b": "b"}
    

    However, __slots__ may cause problems for multiple inheritance:

    class BaseA(object): 
        __slots__ = ("a",)
    
    class BaseB(object): 
        __slots__ = ("b",)
    

    Because creating a child class from parents with both non-empty slots fails:

    >>> class Child(BaseA, BaseB): __slots__ = ()
    Traceback (most recent call last):
      File "<pyshell#68>", line 1, in <module>
        class Child(BaseA, BaseB): __slots__ = ()
    TypeError: Error when calling the metaclass bases
        multiple bases have instance lay-out conflict
    

    If you run into this problem, you could just remove __slots__ from the parents, or, if you have control of the parents, give them empty slots, or refactor to abstractions:

    from abc import ABC
    
    class AbstractA(ABC):
        __slots__ = ()
    
    class BaseA(AbstractA): 
        __slots__ = ("a",)
    
    class AbstractB(ABC):
        __slots__ = ()
    
    class BaseB(AbstractB): 
        __slots__ = ("b",)
    
    class Child(AbstractA, AbstractB): 
        __slots__ = ("a", "b")
    
    c = Child() # no problem!
    

    Add "__dict__" to __slots__ to get dynamic assignment:

    class Foo(object):
        __slots__ = "bar", "baz", "__dict__"
    

    and now:

    >>> foo = Foo()
    >>> foo.boink = "boink"
    

    So with "__dict__" in slots we lose some of the size benefits with the upside of having dynamic assignment and still having slots for the names we do expect.

    When you inherit from an object that isn't slotted, you get the same sort of semantics when you use __slots__ - names that are in __slots__ point to slotted values, while any other values are put in the instance's __dict__.

    Avoiding __slots__ because you want to be able to add attributes on the fly is actually not a good reason - just add "__dict__" to your __slots__ if this is required.

    You can similarly add __weakref__ to __slots__ explicitly if you need that feature.

    Set to empty tuple when subclassing a namedtuple:

    The namedtuple builtin makes immutable instances that are very lightweight (essentially, the size of tuples), but to get the benefits, you need to do it yourself if you subclass them:

    from collections import namedtuple
    class MyNT(namedtuple("MyNT", "bar baz")):
        """MyNT is an immutable and lightweight object"""
        __slots__ = ()
    

    usage:

    >>> nt = MyNT("bar", "baz")
    >>> nt.bar
    "bar"
    >>> nt.baz
    "baz"
    

    And trying to assign an unexpected attribute raises an AttributeError because we have prevented the creation of __dict__:

    >>> nt.quux = "quux"
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: "MyNT" object has no attribute "quux"
    

    You can allow __dict__ creation by leaving off __slots__ = (), but you can't use non-empty __slots__ with subtypes of tuple.

    Biggest Caveat: Multiple inheritance

    Even when non-empty slots are the same for multiple parents, they cannot be used together:

    class Foo(object): 
        __slots__ = "foo", "bar"
    class Bar(object):
        __slots__ = "foo", "bar" # alas, would work if empty, i.e. ()
    
    >>> class Baz(Foo, Bar): pass
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: Error when calling the metaclass bases
        multiple bases have instance lay-out conflict
    

    Using an empty __slots__ in the parent seems to provide the most flexibility, allowing the child to choose to prevent or allow (by adding "__dict__" to get dynamic assignment, see section above) the creation of a __dict__:

    class Foo(object): __slots__ = ()
    class Bar(object): __slots__ = ()
    class Baz(Foo, Bar): __slots__ = ("foo", "bar")
    b = Baz()
    b.foo, b.bar = "foo", "bar"
    

    You don't have to have slots - so if you add them, and remove them later, it shouldn't cause any problems.

    Going out on a limb here: If you're composing mixins or using abstract base classes, which aren't intended to be instantiated, an empty __slots__ in those parents seems to be the best way to go in terms of flexibility for subclassers.

    To demonstrate, first, let's create a class with code we'd like to use under multiple inheritance:

    class AbstractBase:
        __slots__ = ()
        def __init__(self, a, b):
            self.a = a
            self.b = b
        def __repr__(self):
            return f"{type(self).__name__}({repr(self.a)}, {repr(self.b)})"
    

    We could use the above directly by inheriting and declaring the expected slots:

    class Foo(AbstractBase):
        __slots__ = "a", "b"
    

    But we don't care about that, that's trivial single inheritance; we need another class we might also inherit from, maybe with a noisy attribute:

    class AbstractBaseC:
        __slots__ = ()
        @property
        def c(self):
            print("getting c!")
            return self._c
        @c.setter
        def c(self, arg):
            print("setting c!")
            self._c = arg
    

    Now if both bases had nonempty slots, we couldn't do the below. (In fact, if we wanted, we could have given AbstractBase nonempty slots a and b, and left them out of the below declaration - leaving them in would be wrong):

    class Concretion(AbstractBase, AbstractBaseC):
        __slots__ = "a b _c".split()
    

    And now we have functionality from both via multiple inheritance, and can still deny __dict__ and __weakref__ instantiation:

    >>> c = Concretion("a", "b")
    >>> c.c = c
    setting c!
    >>> c.c
    getting c!
    Concretion("a", "b")
    >>> c.d = "d"
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: "Concretion" object has no attribute "d"
    

    Other cases to avoid slots:

    • Avoid them when you want to perform __class__ assignment with another class that doesn't have them (and you can't add them) unless the slot layouts are identical. (I am very interested in learning who is doing this and why.)
    • Avoid them if you want to subclass variable length builtins like long, tuple, or str, and you want to add attributes to them.
    • Avoid them if you insist on providing default values via class attributes for instance variables.

    You may be able to tease out further caveats from the rest of the __slots__ documentation (the 3.7 dev docs are the most current), which I have made significant recent contributions to.

    Critiques of other answers

    The current top answers cite outdated information and are quite hand-wavy and miss the mark in some important ways.

    Do not "only use __slots__ when instantiating lots of objects"

    I quote:

    "You would want to use __slots__ if you are going to instantiate a lot (hundreds, thousands) of objects of the same class."

    Abstract Base Classes, for example, from the collections module, are not instantiated, yet __slots__ are declared for them.

    Why?

    If a user wishes to deny __dict__ or __weakref__ creation, those things must not be available in the parent classes.

    __slots__ contributes to reusability when creating interfaces or mixins.

    It is true that many Python users aren't writing for reusability, but when you are, having the option to deny unnecessary space usage is valuable.

    __slots__ doesn't break pickling

    When pickling a slotted object, you may find it complains with a misleading TypeError:

    >>> pickle.loads(pickle.dumps(f))
    TypeError: a class that defines __slots__ without defining __getstate__ cannot be pickled
    

    This is actually incorrect. This message comes from the oldest protocol, which is the default. You can select the latest protocol with the -1 argument. In Python 2.7 this would be 2 (which was introduced in 2.3), and in 3.6 it is 4.

    >>> pickle.loads(pickle.dumps(f, -1))
    <__main__.Foo object at 0x1129C770>
    

    in Python 2.7:

    >>> pickle.loads(pickle.dumps(f, 2))
    <__main__.Foo object at 0x1129C770>
    

    in Python 3.6

    >>> pickle.loads(pickle.dumps(f, 4))
    <__main__.Foo object at 0x1129C770>
    

    So I would keep this in mind, as it is a solved problem.

    Critique of the (until Oct 2, 2016) accepted answer

    The first paragraph is half short explanation, half predictive. Here's the only part that actually answers the question:

    The proper use of __slots__ is to save space in objects. Instead of having a dynamic dict that allows adding attributes to objects at anytime, there is a static structure which does not allow additions after creation. This saves the overhead of one dict for every object that uses slots

    The second half is wishful thinking, and off the mark:

    While this is sometimes a useful optimization, it would be completely unnecessary if the Python interpreter was dynamic enough so that it would only require the dict when there actually were additions to the object.

    Python actually does something similar to this, only creating the __dict__ when it is accessed, but creating lots of objects with no data is fairly ridiculous.

    The second paragraph oversimplifies and misses actual reasons to avoid __slots__. The below is not a real reason to avoid slots (for actual reasons, see the rest of my answer above.):

    They change the behavior of the objects that have slots in a way that can be abused by control freaks and static typing weenies.

    It then goes on to discuss other ways of accomplishing that perverse goal with Python, not discussing anything to do with __slots__.

    The third paragraph is more wishful thinking. Together it is mostly off-the-mark content that the answerer didn't even author, and it contributes ammunition for critics of the site.

    Memory usage evidence

    Create some normal objects and slotted objects:

    >>> class Foo(object): pass
    >>> class Bar(object): __slots__ = ()
    

    Instantiate a million of them:

    >>> foos = [Foo() for f in xrange(1000000)]
    >>> bars = [Bar() for b in xrange(1000000)]
    

    Inspect with guppy.hpy().heap():

    >>> guppy.hpy().heap()
    Partition of a set of 2028259 objects. Total size = 99763360 bytes.
     Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
         0 1000000  49 64000000  64  64000000  64 __main__.Foo
         1     169   0 16281480  16  80281480  80 list
         2 1000000  49 16000000  16  96281480  97 __main__.Bar
         3   12284   1   987472   1  97268952  97 str
    ...
    

    Access the regular objects and their __dict__ and inspect again:

    >>> for f in foos:
    ...     f.__dict__
    >>> guppy.hpy().heap()
    Partition of a set of 3028258 objects. Total size = 379763480 bytes.
     Index  Count   %      Size    % Cumulative  % Kind (class / dict of class)
         0 1000000  33 280000000  74 280000000  74 dict of __main__.Foo
         1 1000000  33  64000000  17 344000000  91 __main__.Foo
         2     169   0  16281480   4 360281480  95 list
         3 1000000  33  16000000   4 376281480  99 __main__.Bar
         4   12284   0    987472   0 377268952  99 str
    ...
    

    This is consistent with the history of Python, from Unifying types and classes in Python 2.2

    If you subclass a built-in type, extra space is automatically added to the instances to accommodate __dict__ and __weakrefs__. (The __dict__ is not initialized until you use it though, so you shouldn't worry about the space occupied by an empty dictionary for each instance you create.) If you don't need this extra space, you can add the phrase "__slots__ = []" to your class.

    Answer #3

    os.listdir() - list in the current directory

    With listdir in the os module, you get the files and the folders in the current dir

     import os
     arr = os.listdir()
     print(arr)
     
     >>> ["$RECYCLE.BIN", "work.txt", "3ebooks.txt", "documents"]
    

    Looking in a directory

    arr = os.listdir("c:\\files")
    

    glob from glob

    with glob you can specify a type of file to list like this

    import glob
    
    txtfiles = []
    for file in glob.glob("*.txt"):
        txtfiles.append(file)
    

    glob in a list comprehension

    mylist = [f for f in glob.glob("*.txt")]
    

    get the full path of only files in the current directory

    import os
    from os import listdir
    from os.path import isfile, join
    
    cwd = os.getcwd()
    onlyfiles = [os.path.join(cwd, f) for f in os.listdir(cwd) if 
    os.path.isfile(os.path.join(cwd, f))]
    print(onlyfiles) 
    
    ["G:\getfilesname\getfilesname.py", "G:\getfilesname\example.txt"]
    

    Getting the full path name with os.path.abspath

    You get the full path in return

     import os
     files_path = [os.path.abspath(x) for x in os.listdir()]
     print(files_path)
     
     ["F:\documenti\applications.txt", "F:\documenti\collections.txt"]
    

    Walk: going through sub directories

    os.walk returns the root, the directories list, and the files list; that is why I unpacked them into r, d, f in the for loop. It then looks for other files and directories in the subfolders of the root, and so on, until there are no subfolders left.

    import os
    
    # Getting the current work directory (cwd)
    thisdir = os.getcwd()
    
    # r=root, d=directories, f = files
    for r, d, f in os.walk(thisdir):
        for file in f:
            if file.endswith(".docx"):
                print(os.path.join(r, file))
    

    os.listdir(): get files in the current directory (Python 2)

    In Python 2, if you want the list of the files in the current directory, you have to give the argument as "." or os.getcwd() in the os.listdir method.

     import os
     arr = os.listdir(".")
     print(arr)
     
     >>> ["$RECYCLE.BIN", "work.txt", "3ebooks.txt", "documents"]
    

    To go up in the directory tree

    # Method 1
    x = os.listdir("..")
    
    # Method 2
    x= os.listdir("/")
    

    Get files: os.listdir() in a particular directory (Python 2 and 3)

     import os
     arr = os.listdir("F:\python")
     print(arr)
     
     >>> ["$RECYCLE.BIN", "work.txt", "3ebooks.txt", "documents"]
    

    Get files of a particular subdirectory with os.listdir()

    import os
    
    x = os.listdir("./content")
    

    os.walk(".") - current directory

     import os
     arr = next(os.walk("."))[2]
     print(arr)
     
     >>> ["5bs_Turismo1.pdf", "5bs_Turismo1.pptx", "esperienza.txt"]
    

    next(os.walk(".")) and os.path.join("dir", "file")

     import os
     r, d, f = next(os.walk("F:\\_python"))
     arr = []
     for file in f:
         arr.append(os.path.join(r, file))
    
     for file in arr:
         print(file)
    
    >>> F:\_python\dict_class.py
    >>> F:\_python\programmi.txt
    

    next(os.walk("F:\\_python")) - get the full path - list comprehension

     r, d, f = next(os.walk("F:\\_python"))
     [os.path.join(r, file) for file in f]
     
     >>> ["F:\_python\dict_class.py", "F:\_python\programmi.txt"]
    

    os.walk - get full path - all files in sub dirs

    x = [os.path.join(r,file) for r,d,f in os.walk("F:\_python") for file in f]
    print(x)
    
    >>> ["F:\_python\dict.py", "F:\_python\progr.txt", "F:\_python\readl.py"]
    

    os.listdir() - get only txt files

     arr_txt = [x for x in os.listdir() if x.endswith(".txt")]
     print(arr_txt)
     
     >>> ["work.txt", "3ebooks.txt"]
    

    Using glob to get the full path of the files

    If I should need the absolute path of the files:

    from path import path
    from glob import glob
    x = [path(f).abspath() for f in glob("F:\*.txt")]
    for f in x:
        print(f)
    
    >>> F:\acquistionline.txt
    >>> F:\acquisti_2018.txt
    >>> F:\bootstrap_jquery_ecc.txt
    

    Using os.path.isfile to avoid directories in the list

    import os.path
    listOfFiles = [f for f in os.listdir() if os.path.isfile(f)]
    print(listOfFiles)
    
    >>> ["a simple game.py", "data.txt", "decorator.py"]
    

    Using pathlib from Python 3.4

    import pathlib
    
    flist = []
    for p in pathlib.Path(".").iterdir():
        if p.is_file():
            print(p)
            flist.append(p)
    
     >>> error.PNG
     >>> exemaker.bat
     >>> guiprova.mp3
     >>> setup.py
     >>> speak_gui2.py
     >>> thumb.PNG
    

    With list comprehension:

    flist = [p for p in pathlib.Path(".").iterdir() if p.is_file()]
    

    Alternatively, use pathlib.Path() instead of pathlib.Path(".")

    Use glob method in pathlib.Path()

    import pathlib
    
    py = pathlib.Path().glob("*.py")
    for file in py:
        print(file)
    
    >>> stack_overflow_list.py
    >>> stack_overflow_list_tkinter.py
    

    Get all and only files with os.walk

    import os
    x = [i[2] for i in os.walk(".")]
    y=[]
    for t in x:
        for f in t:
            y.append(f)
    print(y)
    
    >>> ["append_to_list.py", "data.txt", "data1.txt", "data2.txt", "data_180617", "os_walk.py", "READ2.py", "read_data.py", "somma_defaltdic.py", "substitute_words.py", "sum_data.py", "data.txt", "data1.txt", "data_180617"]
    

    Get only files with next and walk in a directory

     import os
     x = next(os.walk("F://python"))[2]
     print(x)
     
     >>> ["calculator.bat","calculator.py"]
    

    Get only directories with next and walk in a directory

     import os
     next(os.walk("F://python"))[1] # for the current dir use (".")
     
     >>> ["python3","others"]
    

    Get all the subdir names with walk

    for r,d,f in os.walk("F:\_python"):
        for dirs in d:
            print(dirs)
    
    >>> .vscode
    >>> pyexcel
    >>> pyschool.py
    >>> subtitles
    >>> _metaprogramming
    >>> .ipynb_checkpoints
    

    os.scandir() from Python 3.5 and greater

    import os
    x = [f.name for f in os.scandir() if f.is_file()]
    print(x)
    
    >>> ["calculator.bat","calculator.py"]
    
    # Another example with scandir (a little variation from docs.python.org)
    # This one is more efficient than os.listdir.
    # In this case, it shows the files only in the current directory
    # where the script is executed.
    
    import os
    with os.scandir() as i:
        for entry in i:
            if entry.is_file():
                print(entry.name)
    
    >>> ebookmaker.py
    >>> error.PNG
    >>> exemaker.bat
    >>> guiprova.mp3
    >>> setup.py
    >>> speakgui4.py
    >>> speak_gui2.py
    >>> speak_gui3.py
    >>> thumb.PNG
    

    Examples:

    Ex. 1: How many files are there in the subdirectories?

    In this example, we look for the number of files that are included in a directory and all of its subdirectories.

    import os
    
    def count(dir, counter=0):
        "returns number of files in dir and subdirs"
        for pack in os.walk(dir):
            for f in pack[2]:
                counter += 1
        return dir + " : " + str(counter) + " files"
    
    print(count("F:\python"))
    
    >>> F:\python : 12057 files
    

    Ex.2: How to copy all files from a directory to another?

    A script to tidy up your computer by finding all files of a type (default: pptx) and copying them into a new folder.

    import os
    import shutil
    from path import path
    
    destination = "F:\\file_copied"
    # os.makedirs(destination)
    
    def copyfile(dir, filetype="pptx", counter=0):
        "Searches for pptx (or other - pptx is the default) files and copies them"
        for pack in os.walk(dir):
            for f in pack[2]:
                if f.endswith(filetype):
                    fullpath = pack[0] + "\\" + f
                    print(fullpath)
                    shutil.copy(fullpath, destination)
                    counter += 1
        if counter > 0:
            print("-" * 30)
            print("\t==> Found in: `" + dir + "` : " + str(counter) + " files\n")
    
    for dir in os.listdir():
        "searches for folders that starts with `_`"
        if dir[0] == "_":
            # copyfile(dir, filetype="pdf")
            copyfile(dir, filetype="txt")
    
    
    >>> _compiti18\Compito Contabilità 1\conti.txt
    >>> _compiti18\Compito Contabilità 1\modula4.txt
    >>> _compiti18\Compito Contabilità 1\moduloa4.txt
    >>> ------------------------
    >>> ==> Found in: `_compiti18` : 3 files
    

    Ex. 3: How to get all the files in a txt file

    In case you want to create a txt file with all the file names:

    import os
    mylist = ""
    with open("filelist.txt", "w", encoding="utf-8") as file:
        for eachfile in os.listdir():
            mylist += eachfile + "\n"
        file.write(mylist)
    
    

    Example: txt with all the files of an hard drive

    """
    We are going to save a txt file with all the files in your directory.
    We will use the function walk()
    """
    
    import os
    
    # see all the methods of os
    # print(*dir(os), sep=", ")
    listafile = []
    percorso = []
    with open("lista_file.txt", "w", encoding="utf-8") as testo:
        for root, dirs, files in os.walk("D:\\"):
            for file in files:
                listafile.append(file)
                percorso.append(root + "\\" + file)
                testo.write(file + "\n")
    listafile.sort()
    print("N. of files", len(listafile))
    with open("lista_file_ordinata.txt", "w", encoding="utf-8") as testo_ordinato:
        for file in listafile:
            testo_ordinato.write(file + "\n")
    
    with open("percorso.txt", "w", encoding="utf-8") as file_percorso:
        for file in percorso:
            file_percorso.write(file + "\n")
    
    os.system("lista_file.txt")
    os.system("lista_file_ordinata.txt")
    os.system("percorso.txt")
    

    All the file of C: in one text file

    This is a shorter version of the previous code. Change the folder where the search starts if you need to begin from another position. On my computer this code generates a roughly 50 MB text file with a bit fewer than 500,000 lines, listing files with their complete paths.

    import os
    
    with open("file.txt", "w", encoding="utf-8") as filewrite:
        for r, d, f in os.walk("C:\\"):
            for file in f:
                filewrite.write(f"{os.path.join(r, file)}\n")
    

    How to write a file with all paths in a folder of a type

    With this function you can create a txt file that will have the name of a type of file that you look for (ex. pngfile.txt) with all the full path of all the files of that type. It can be useful sometimes, I think.

    import os
    
    def searchfiles(extension=".ttf", folder="H:\\"):
        "Create a txt file with all the file of a type"
        with open(extension[1:] + "file.txt", "w", encoding="utf-8") as filewrite:
            for r, d, f in os.walk(folder):
                for file in f:
                    if file.endswith(extension):
                        filewrite.write(f"{os.path.join(r, file)}\n")
    
    # looking for png files in the hard disk H:
    searchfiles(".png", "H:\\")
    
    >>> H:\4bs_18Dolphins5.png
    >>> H:\4bs_18Dolphins6.png
    >>> H:\4bs_18Dolphins7.png
    >>> H:\5_18marketing html\assets\images\logo2.png
    >>> H:\7z001.png
    >>> H:\7z002.png
    

    (New) Find all files and open them with tkinter GUI

    I just wanted to add, in 2019, a little app to search for all files in a dir and open them by double-clicking on the name of the file in the list.

    import tkinter as tk
    import os
    
    def searchfiles(extension=".txt", folder="H:\\"):
        "insert all files in the listbox"
        for r, d, f in os.walk(folder):
            for file in f:
                if file.endswith(extension):
                    lb.insert(0, r + "\\" + file)
    
    def open_file():
        os.startfile(lb.get(lb.curselection()[0]))
    
    root = tk.Tk()
    root.geometry("400x400")
    bt = tk.Button(root, text="Search", command=lambda: searchfiles(".png", "H:\\"))
    bt.pack()
    lb = tk.Listbox(root)
    lb.pack(fill="both", expand=1)
    lb.bind("<Double-Button>", lambda x: open_file())
    root.mainloop()
    

    Answer #4

    You can disable any Python warnings via the PYTHONWARNINGS environment variable. In this case, you want:

    export PYTHONWARNINGS="ignore:Unverified HTTPS request"
    

    To disable using Python code (requests >= 2.16.0):

    import urllib3
    urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
    

    For requests < 2.16.0, see original answer below.

    Original answer

    The reason doing urllib3.disable_warnings() didn't work for you is that it looks like you're using a separate instance of urllib3 vendored inside of requests.

    I gather this based on the path here: /usr/lib/python2.6/site-packages/requests/packages/urllib3/connectionpool.py

    To disable warnings in requests' vendored urllib3, you'll need to import that specific instance of the module:

    import requests
    from requests.packages.urllib3.exceptions import InsecureRequestWarning
    
    requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
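
    If you would rather silence the warning only around specific calls instead of globally, a sketch using the standard warnings machinery should work too (assuming requests >= 2.16.0, where urllib3 is a regular top-level dependency; the URL is a placeholder):

    import warnings
    
    import requests
    import urllib3
    
    with warnings.catch_warnings():
        # suppressed only inside this block
        warnings.simplefilter("ignore", urllib3.exceptions.InsecureRequestWarning)
        response = requests.get("https://self-signed.example.com", verify=False)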
    

    Answer #5

    Because [] and {} are literal syntax. Python can create bytecode just to create the list or dictionary objects:

    >>> import dis
    >>> dis.dis(compile("[]", "", "eval"))
      1           0 BUILD_LIST               0
                  3 RETURN_VALUE        
    >>> dis.dis(compile("{}", "", "eval"))
      1           0 BUILD_MAP                0
                  3 RETURN_VALUE        
    

    list() and dict() are separate objects. Their names need to be resolved, the stack has to be involved to push the arguments, the frame has to be stored to retrieve later, and a call has to be made. That all takes more time.

    For the empty case, that means you have at the very least a LOAD_NAME (which has to search through the global namespace as well as the builtins module) followed by a CALL_FUNCTION, which has to preserve the current frame:

    >>> dis.dis(compile("list()", "", "eval"))
      1           0 LOAD_NAME                0 (list)
                  3 CALL_FUNCTION            0
                  6 RETURN_VALUE        
    >>> dis.dis(compile("dict()", "", "eval"))
      1           0 LOAD_NAME                0 (dict)
                  3 CALL_FUNCTION            0
                  6 RETURN_VALUE        
    

    You can time the name lookup separately with timeit:

    >>> import timeit
    >>> timeit.timeit("list", number=10**7)
    0.30749011039733887
    >>> timeit.timeit("dict", number=10**7)
    0.4215109348297119
    

    The time discrepancy there is probably a dictionary hash collision. Subtract those times from the times for calling those objects, and compare the result against the times for using literals:

    >>> timeit.timeit("[]", number=10**7)
    0.30478692054748535
    >>> timeit.timeit("{}", number=10**7)
    0.31482696533203125
    >>> timeit.timeit("list()", number=10**7)
    0.9991960525512695
    >>> timeit.timeit("dict()", number=10**7)
    1.0200958251953125
    

    So having to call the object takes an additional 1.00 - 0.31 - 0.30 == 0.39 seconds per 10 million calls.

    You can avoid the global lookup cost by aliasing the global names as locals (using a timeit setup, everything you bind to a name is a local):

    >>> timeit.timeit("_list", "_list = list", number=10**7)
    0.1866450309753418
    >>> timeit.timeit("_dict", "_dict = dict", number=10**7)
    0.19016098976135254
    >>> timeit.timeit("_list()", "_list = list", number=10**7)
    0.841480016708374
    >>> timeit.timeit("_dict()", "_dict = dict", number=10**7)
    0.7233691215515137
    

    but you never can overcome that CALL_FUNCTION cost.

    Answer #6

    The datetime module is your friend:

    import datetime
    now = datetime.datetime.now()
    print(now.year, now.month, now.day, now.hour, now.minute, now.second)
    # 2015 5 6 8 53 40
    

    You don"t need separate variables, the attributes on the returned datetime object have all you need.

    Answer #7

    Are dictionaries ordered in Python 3.6+?

    They are insertion ordered[1]. As of Python 3.6, for the CPython implementation of Python, dictionaries remember the order of items inserted. This is considered an implementation detail in Python 3.6; you need to use OrderedDict if you want insertion ordering that's guaranteed across other implementations of Python (and other ordered behavior[1]).

    As of Python 3.7, this is no longer an implementation detail and instead becomes a language feature. From a python-dev message by GvR:

    Make it so. "Dict keeps insertion order" is the ruling. Thanks!

    This simply means that you can depend on it. Other implementations of Python must also offer an insertion ordered dictionary if they wish to be a conforming implementation of Python 3.7.
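
    A quick illustration of both the guarantee and the order-sensitive behavior that remains exclusive to OrderedDict:

    >>> from collections import OrderedDict
    >>> d1 = {"a": 1, "b": 2}
    >>> d2 = {"b": 2, "a": 1}
    >>> list(d1)           # insertion order is kept
    ['a', 'b']
    >>> d1 == d2           # plain dict equality ignores order
    True
    >>> OrderedDict(d1) == OrderedDict(d2)   # OrderedDict equality does not
    False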


    How does the Python 3.6 dictionary implementation perform better[2] than the older one while preserving element order?

    Essentially, by keeping two arrays.

    • The first array, dk_entries, holds the entries (of type PyDictKeyEntry) for the dictionary in the order that they were inserted. Preserving order is achieved by this being an append-only array where new items are always inserted at the end (insertion order).

    • The second, dk_indices, holds the indices for the dk_entries array (that is, values that indicate the position of the corresponding entry in dk_entries). This array acts as the hash table. When a key is hashed it leads to one of the indices stored in dk_indices and the corresponding entry is fetched by indexing dk_entries. Since only indices are kept, the type of this array depends on the overall size of the dictionary (ranging from int8_t (1 byte) to int32_t/int64_t (4/8 bytes) on 32/64 bit builds).

    In the previous implementation, a sparse array of type PyDictKeyEntry and size dk_size had to be allocated; unfortunately, it also resulted in a lot of empty space since that array was not allowed to be more than 2/3 * dk_size full for performance reasons. (and the empty space still had PyDictKeyEntry size!).

    This is not the case now since only the required entries are stored (those that have been inserted) and a sparse array of type intX_t (X depending on dict size), kept 2/3 * dk_size full. The empty space changed from type PyDictKeyEntry to intX_t.

    So, obviously, creating a sparse array of type PyDictKeyEntry is much more memory demanding than a sparse array for storing ints.

    You can see the full conversation on Python-Dev regarding this feature if interested; it is a good read.


    In the original proposal made by Raymond Hettinger, a visualization of the data structures used can be seen which captures the gist of the idea.

    For example, the dictionary:

    d = {"timmy": "red", "barry": "green", "guido": "blue"}
    

    is currently stored as [keyhash, key, value]:

    entries = [["--", "--", "--"],
               [-8522787127447073495, "barry", "green"],
               ["--", "--", "--"],
               ["--", "--", "--"],
               ["--", "--", "--"],
               [-9092791511155847987, "timmy", "red"],
               ["--", "--", "--"],
               [-6480567542315338377, "guido", "blue"]]
    

    Instead, the data should be organized as follows:

    indices =  [None, 1, None, None, None, 0, None, 2]
    entries =  [[-9092791511155847987, "timmy", "red"],
                [-8522787127447073495, "barry", "green"],
                [-6480567542315338377, "guido", "blue"]]
    

    As you can now see visually, in the original proposal, a lot of space is essentially empty to reduce collisions and make look-ups faster. With the new approach, you reduce the memory required by moving the sparseness where it's really required, in the indices.


    [1]: I say "insertion ordered" and not "ordered" since, with the existence of OrderedDict, "ordered" suggests further behavior that the `dict` object *doesn"t provide*. OrderedDicts are reversible, provide order sensitive methods and, mainly, provide an order-sensive equality tests (`==`, `!=`). `dict`s currently don"t offer any of those behaviors/methods.
    [2]: The new dictionary implementations performs better **memory wise** by being designed more compactly; that"s the main benefit here. Speed wise, the difference isn"t so drastic, there"s places where the new dict might introduce slight regressions ([key-lookups, for example][10]) while in others (iteration and resizing come to mind) a performance boost should be present. Overall, the performance of the dictionary, especially in real-life situations, improves due to the compactness introduced.

    Answer #8

    tl;dr / quick fix

    • Don"t decode/encode willy nilly
    • Don"t assume your strings are UTF-8 encoded
    • Try to convert strings to Unicode strings as soon as possible in your code
    • Fix your locale: How to solve UnicodeDecodeError in Python 3.6?
    • Don"t be tempted to use quick reload hacks

    Unicode Zen in Python 2.x - The Long Version

    Without seeing the source it's difficult to know the root cause, so I'll have to speak generally.

    UnicodeDecodeError: "ascii" codec can't decode byte generally happens when you try to convert a Python 2.x str that contains non-ASCII to a Unicode string without specifying the encoding of the original string.

    In brief, Unicode strings are an entirely separate type of Python string that does not contain any encoding. They only hold Unicode code points and therefore can hold any Unicode point from across the entire spectrum. Strings contain encoded text, be it UTF-8, UTF-16, ISO-8859-1, GBK, Big5 etc. Strings are decoded to Unicode and Unicodes are encoded to strings. Files and text data are always transferred in encoded strings.

    The Markdown module authors probably use unicode() (where the exception is thrown) as a quality gate to the rest of the code - it will convert ASCII or re-wrap existing Unicode strings to a new Unicode string. The Markdown authors can't know the encoding of the incoming string so will rely on you to decode strings to Unicode strings before passing to Markdown.

    Unicode strings can be declared in your code using the u prefix to strings. E.g.

    >>> my_u = u"my ünicôdé strįng"
    >>> type(my_u)
    <type "unicode">
    

    Unicode strings may also come from files, databases and network modules. When this happens, you don't need to worry about the encoding.

    Gotchas

    Conversion from str to Unicode can happen even when you don't explicitly call unicode().

    The following scenarios cause UnicodeDecodeError exceptions:

    # Explicit conversion without encoding
    unicode("€")
    
    # New style format string into Unicode string
    # Python will try to convert value string to Unicode first
    u"The currency is: {}".format("€")
    
    # Old style format string into Unicode string
    # Python will try to convert value string to Unicode first
    u"The currency is: %s" % "€"
    
    # Append string to Unicode
    # Python will try to convert string to Unicode first
    u"The currency is: " + "€"         
    

    Examples

    In the following diagram, you can see how the word café has been encoded in either "UTF-8" or "Cp1252" encoding depending on the terminal type. In both examples, caf is just regular ASCII. In UTF-8, é is encoded using two bytes. In "Cp1252", é is 0xE9 (which also happens to be the Unicode point value - it's no coincidence). The correct decode() is invoked and conversion to a Python Unicode is successful:

    [Diagram: a string being converted to a Python Unicode string]

    In this diagram, decode() is called with ascii (which is the same as calling unicode() without an encoding given). As ASCII can't contain bytes greater than 0x7F, this will throw a UnicodeDecodeError exception:

    [Diagram: a string being converted to a Python Unicode string with the wrong encoding]

    The Unicode Sandwich

    It"s good practice to form a Unicode sandwich in your code, where you decode all incoming data to Unicode strings, work with Unicodes, then encode to strs on the way out. This saves you from worrying about the encoding of strings in the middle of your code.

    Input / Decode

    Source code

    If you need to bake non-ASCII into your source code, just create Unicode strings by prefixing the string with a u. E.g.

    u"Zürich"
    

    To allow Python to decode your source code, you will need to add an encoding header to match the actual encoding of your file. For example, if your file was encoded as "UTF-8", you would use:

    # encoding: utf-8
    

    This is only necessary when you have non-ASCII in your source code.

    Files

    Usually non-ASCII data is received from a file. The io module provides a TextIOWrapper that decodes your file on the fly, using a given encoding. You must use the correct encoding for the file - it can't be easily guessed. For example, for a UTF-8 file:

    import io
    with io.open("my_utf8_file.txt", "r", encoding="utf-8") as my_file:
         my_unicode_string = my_file.read() 
    

    my_unicode_string would then be suitable for passing to Markdown. If you get a UnicodeDecodeError from the read() line, then you've probably used the wrong encoding value.

    CSV Files

    The Python 2.7 CSV module does not support non-ASCII characters. Help is at hand, however, with https://pypi.python.org/pypi/backports.csv.

    Use it like above but pass the opened file to it:

    from backports import csv
    import io
    with io.open("my_utf8_file.txt", "r", encoding="utf-8") as my_file:
        for row in csv.reader(my_file):
            yield row
    

    Databases

    Most Python database drivers can return data in Unicode, but usually require a little configuration. Always use Unicode strings for SQL queries.

    MySQL

    In the connection string add:

    charset="utf8",
    use_unicode=True
    

    E.g.

    >>> db = MySQLdb.connect(host="localhost", user="root", passwd="passwd", db="sandbox", use_unicode=True, charset="utf8")
    
    PostgreSQL

    Add:

    psycopg2.extensions.register_type(psycopg2.extensions.UNICODE)
    psycopg2.extensions.register_type(psycopg2.extensions.UNICODEARRAY)
    

    HTTP

    Web pages can be encoded in just about any encoding. The Content-type header should contain a charset field to hint at the encoding. The content can then be decoded manually against this value. Alternatively, Python-Requests returns Unicodes in response.text.

    Manually

    If you must decode strings manually, you can simply do my_string.decode(encoding), where encoding is the appropriate encoding. Python 2.x supported codecs are given here: Standard Encodings. Again, if you get UnicodeDecodeError then you've probably got the wrong encoding.

    The meat of the sandwich

    Work with Unicodes as you would normal strs.

    Output

    stdout / printing

    print writes through the stdout stream. Python tries to configure an encoder on stdout so that Unicodes are encoded to the console's encoding. For example, if a Linux shell's locale is en_GB.UTF-8, the output will be encoded to UTF-8. On Windows, you will be limited to an 8-bit code page.

    An incorrectly configured console, such as a corrupt locale, can lead to unexpected print errors. The PYTHONIOENCODING environment variable can force the encoding for stdout.

    Files

    Just like input, io.open can be used to transparently convert Unicodes to encoded byte strings.

    Database

    The same configuration for reading will allow Unicodes to be written directly.

    Python 3

    Python 3 is no more Unicode capable than Python 2.x is, however it is slightly less confused on the topic. E.g. the regular str is now a Unicode string and the old str is now bytes.

    The default encoding is UTF-8, so if you .decode() a byte string without giving an encoding, Python 3 uses UTF-8 encoding. This probably fixes 50% of people's Unicode problems.

    Further, open() operates in text mode by default, so returns decoded str (Unicode ones). The encoding is derived from your locale, which tends to be UTF-8 on Un*x systems or an 8-bit code page, such as windows-1251, on Windows boxes.

    Why you shouldn"t use sys.setdefaultencoding("utf8")

    It"s a nasty hack (there"s a reason you have to use reload) that will only mask problems and hinder your migration to Python 3.x. Understand the problem, fix the root cause and enjoy Unicode zen. See Why should we NOT use sys.setdefaultencoding("utf-8") in a py script? for further details

    Answer #9

    The short answer, or TL;DR

    Basically, eval is used to evaluate a single dynamically generated Python expression, and exec is used to execute dynamically generated Python code only for its side effects.

    eval and exec have these two differences:

    1. eval accepts only a single expression, exec can take a code block that has Python statements: loops, try: except:, class and function/method definitions and so on.

      An expression in Python is whatever you can have as the value in a variable assignment:

      a_variable = (anything you can put within these parentheses is an expression)
      
    2. eval returns the value of the given expression, whereas exec ignores the return value from its code, and always returns None (in Python 2 it is a statement and cannot be used as an expression, so it really does not return anything).

    In versions 1.0 - 2.7, exec was a statement, because CPython needed to produce a different kind of code object for functions that used exec for its side effects inside the function.

    In Python 3, exec is a function; its use has no effect on the compiled bytecode of the function where it is used.


    Thus basically:

    >>> a = 5
    >>> eval("37 + a")   # it is an expression
    42
    >>> exec("37 + a")   # it is an expression statement; value is ignored (None is returned)
    >>> exec("a = 47")   # modify a global variable as a side effect
    >>> a
    47
    >>> eval("a = 47")  # you cannot evaluate a statement
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "<string>", line 1
        a = 47
          ^
    SyntaxError: invalid syntax
    

    The compile in "exec" mode compiles any number of statements into a bytecode that implicitly always returns None, whereas in "eval" mode it compiles a single expression into bytecode that returns the value of that expression.

    >>> eval(compile("42", "<string>", "exec"))  # code returns None
    >>> eval(compile("42", "<string>", "eval"))  # code returns 42
    42
    >>> exec(compile("42", "<string>", "eval"))  # code returns 42,
    >>>                                          # but ignored by exec
    

    In the "eval" mode (and thus with the eval function if a string is passed in), the compile raises an exception if the source code contains statements or anything else beyond a single expression:

    >>> compile("for i in range(3): print(i)", "<string>", "eval")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "<string>", line 1
        for i in range(3): print(i)
          ^
    SyntaxError: invalid syntax
    

    Actually the statement "eval accepts only a single expression" applies only when a string (which contains Python source code) is passed to eval. Then it is internally compiled to bytecode using compile(source, "<string>", "eval") This is where the difference really comes from.

    If a code object (which contains Python bytecode) is passed to exec or eval, they behave identically, except for the fact that exec ignores the return value, still always returning None. So it is possible to use eval to execute something that has statements, if you just compiled it into bytecode beforehand instead of passing it as a string:

    >>> eval(compile("if 1: print("Hello")", "<string>", "exec"))
    Hello
    >>>
    

    works without problems, even though the compiled code contains statements. It still returns None, because that is the return value of the code object returned from compile.

    In the "eval" mode (and thus with the eval function if a string is passed in), the compile raises an exception if the source code contains statements or anything else beyond a single expression:

    >>> compile("for i in range(3): print(i)", "<string>". "eval")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "<string>", line 1
        for i in range(3): print(i)
          ^
    SyntaxError: invalid syntax
    

    The longer answer, a.k.a. the gory details

    exec and eval

    The exec function (which was a statement in Python 2) is used for executing a dynamically created statement or program:

    >>> program = """
    for i in range(3):
        print("Python is cool")
    """
    >>> exec(program)
    Python is cool
    Python is cool
    Python is cool
    >>> 
    

    The eval function does the same for a single expression, and returns the value of the expression:

    >>> a = 2
    >>> my_calculation = "42 * a"
    >>> result = eval(my_calculation)
    >>> result
    84
    

    exec and eval both accept the program/expression to be run either as a str, unicode or bytes object containing source code, or as a code object which contains Python bytecode.

    If a str/unicode/bytes containing source code was passed to exec, it behaves equivalently to:

    exec(compile(source, "<string>", "exec"))
    

    and eval similarly behaves equivalent to:

    eval(compile(source, "<string>", "eval"))
    

    Since all expressions can be used as statements in Python (these are called the Expr nodes in the Python abstract grammar; the opposite is not true), you can always use exec if you do not need the return value. That is to say, you can use either eval("my_func(42)") or exec("my_func(42)"), the difference being that eval returns the value returned by my_func, and exec discards it:

    >>> def my_func(arg):
    ...     print("Called with %d" % arg)
    ...     return arg * 2
    ... 
    >>> exec("my_func(42)")
    Called with 42
    >>> eval("my_func(42)")
    Called with 42
    84
    >>> 
    

    Of the two, only exec accepts source code that contains statements, like def, for, while, import, or class, the assignment statement (a.k.a. a = 42), or entire programs:

    >>> exec("for i in range(3): print(i)")
    0
    1
    2
    >>> eval("for i in range(3): print(i)")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "<string>", line 1
        for i in range(3): print(i)
          ^
    SyntaxError: invalid syntax
    

    Both exec and eval accept 2 additional positional arguments - globals and locals - which are the global and local variable scopes that the code sees. These default to the globals() and locals() within the scope that called exec or eval, but any dictionary can be used for globals and any mapping for locals (including dict of course). These can be used not only to restrict/modify the variables that the code sees, but are often also used for capturing the variables that the executed code creates:

    >>> g = dict()
    >>> l = dict()
    >>> exec("global a; a, b = 123, 42", g, l)
    >>> g["a"]
    123
    >>> l
    {"b": 42}
    

    (If you display the value of the entire g, it would be much longer, because exec and eval add the built-ins module as __builtins__ to the globals automatically if it is missing).
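
    That automatic __builtins__ injection also means you can supply your own (possibly empty) entry to limit what the executed code can reach - though note this is a convenience, not a real security sandbox:

    >>> env = {"__builtins__": {}}
    >>> exec("x = 1 + 1", env)
    >>> env["x"]
    2
    >>> exec("open('secret.txt')", env)
    Traceback (most recent call last):
      ...
    NameError: name 'open' is not defined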

    In Python 2, the official syntax for the exec statement is actually exec code in globals, locals, as in

    >>> exec "global a; a, b = 123, 42" in g, l
    

    However the alternate syntax exec(code, globals, locals) has always been accepted too (see below).

    compile

    The compile(source, filename, mode, flags=0, dont_inherit=False, optimize=-1) built-in can be used to speed up repeated invocations of the same code with exec or eval by compiling the source into a code object beforehand. The mode parameter controls the kind of code fragment the compile function accepts and the kind of bytecode it produces. The choices are "eval", "exec" and "single":

    • "eval" mode expects a single expression, and will produce bytecode that when run will return the value of that expression:

      >>> dis.dis(compile("a + b", "<string>", "eval"))
        1           0 LOAD_NAME                0 (a)
                    3 LOAD_NAME                1 (b)
                    6 BINARY_ADD
                    7 RETURN_VALUE
      
    • "exec" accepts any kinds of python constructs from single expressions to whole modules of code, and executes them as if they were module top-level statements. The code object returns None:

      >>> dis.dis(compile("a + b", "<string>", "exec"))
        1           0 LOAD_NAME                0 (a)
                    3 LOAD_NAME                1 (b)
                    6 BINARY_ADD
                    7 POP_TOP                             <- discard result
                    8 LOAD_CONST               0 (None)   <- load None on stack
                   11 RETURN_VALUE                        <- return top of stack
      
    • "single" is a limited form of "exec" which accepts a source code containing a single statement (or multiple statements separated by ;) if the last statement is an expression statement, the resulting bytecode also prints the repr of the value of that expression to the standard output(!).

      An if-elif-else chain, a loop with else, and try with its except, else and finally blocks are each considered a single statement.

      A source fragment containing two top-level statements is an error for "single", except that in Python 2 there is a bug that sometimes allows multiple top-level statements in the code; only the first is compiled, and the rest are ignored:

      In Python 2.7.8:

      >>> exec(compile("a = 5
      a = 6", "<string>", "single"))
      >>> a
      5
      

      And in Python 3.4.2:

      >>> exec(compile("a = 5
      a = 6", "<string>", "single"))
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "<string>", line 1
          a = 5
              ^
      SyntaxError: multiple statements found while compiling a single statement
      

      This is very useful for making interactive Python shells. However, the value of the expression is not returned, even if you eval the resulting code.

    Thus the greatest distinction between exec and eval actually comes from the compile function and its modes.


    In addition to compiling source code to bytecode, compile supports compiling abstract syntax trees (parse trees of Python code) into code objects; and source code into abstract syntax trees (the ast.parse is written in Python and just calls compile(source, filename, mode, PyCF_ONLY_AST)); these are used for example for modifying source code on the fly, and also for dynamic code creation, as it is often easier to handle the code as a tree of nodes instead of lines of text in complex cases.
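
    A small sketch of that round trip through an AST (which is exactly where ast.parse and PyCF_ONLY_AST fit in):

    import ast
    
    tree = ast.parse("a + b", mode="eval")     # source -> AST
    code = compile(tree, "<string>", "eval")   # AST -> code object
    print(eval(code, {"a": 40, "b": 2}))       # 42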


    While eval only allows you to evaluate a string that contains a single expression, you can eval a whole statement, or even a whole module that has been compiled into bytecode; that is, with Python 2, print is a statement, and cannot be evalled directly:

    >>> eval("for i in range(3): print("Python is cool")")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "<string>", line 1
        for i in range(3): print("Python is cool")
          ^
    SyntaxError: invalid syntax
    

    compile it with "exec" mode into a code object and you can eval it; the eval function will return None.

    >>> code = compile("for i in range(3): print("Python is cool")",
                       "foo.py", "exec")
    >>> eval(code)
    Python is cool
    Python is cool
    Python is cool
    

    If one looks into the eval and exec source code in CPython 3, this is very evident; they both call PyEval_EvalCode with the same arguments, the only difference being that exec explicitly returns None.

    Syntax differences of exec between Python 2 and Python 3

    One of the major differences in Python 2 is that exec is a statement and eval is a built-in function (both are built-in functions in Python 3). It is a well-known fact that the official syntax of exec in Python 2 is exec code [in globals[, locals]].

    Unlike what the majority of Python 2-to-3 porting guides seem to suggest, the exec statement in CPython 2 can also be used with syntax that looks exactly like the exec function invocation in Python 3. The reason is that Python 0.9.9 had the exec(code, globals, locals) built-in function! And that built-in function was replaced with the exec statement somewhere before the Python 1.0 release.

    Since it was desirable to not break backwards compatibility with Python 0.9.9, Guido van Rossum added a compatibility hack in 1993: if the code was a tuple of length 2 or 3, and globals and locals were not passed into the exec statement otherwise, the code would be interpreted as if the 2nd and 3rd element of the tuple were the globals and locals respectively. The compatibility hack was not mentioned even in Python 1.4 documentation (the earliest available version online); and thus was not known to many writers of the porting guides and tools, until it was documented again in November 2012:

    The first expression may also be a tuple of length 2 or 3. In this case, the optional parts must be omitted. The form exec(expr, globals) is equivalent to exec expr in globals, while the form exec(expr, globals, locals) is equivalent to exec expr in globals, locals. The tuple form of exec provides compatibility with Python 3, where exec is a function rather than a statement.

    Yes, in CPython 2.7 it is handily referred to as being a forward-compatibility option (why confuse people with there being a backward-compatibility option at all), when it actually had been there for backward compatibility for two decades.

    Thus while exec is a statement in Python 1 and Python 2, and a built-in function in Python 3 and Python 0.9.9,

    >>> exec("print(a)", globals(), {"a": 42})
    42
    

    has had identical behaviour in possibly every widely released Python version ever; and works in Jython 2.5.2, PyPy 2.3.1 (Python 2.7.6) and IronPython 2.6.1 too (kudos to them following the undocumented behaviour of CPython closely).

    What you cannot do in Pythons 1.0 - 2.7 with its compatibility hack, is to store the return value of exec into a variable:

    Python 2.7.11+ (default, Apr 17 2016, 14:00:29) 
    [GCC 5.3.1 20160413] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> a = exec("print(42)")
      File "<stdin>", line 1
        a = exec("print(42)")
               ^
    SyntaxError: invalid syntax
    

    (which wouldn"t be useful in Python 3 either, as exec always returns None), or pass a reference to exec:

    >>> call_later(exec, "print(42)", delay=1000)
      File "<stdin>", line 1
        call_later(exec, "print(42)", delay=1000)
                      ^
    SyntaxError: invalid syntax
    

    which is a pattern that someone might actually have used, though it seems unlikely;

    Or use it in a list comprehension:

    >>> [exec(i) for i in ["print(42)", "print(foo)"]]
      File "<stdin>", line 1
        [exec(i) for i in ["print(42)", "print(foo)"]]
            ^
    SyntaxError: invalid syntax
    

    which is abuse of list comprehensions (use a for loop instead!).

    Answer #10

    TL;DR version:

    For the simple case of:

    • I have a text column with a delimiter and I want two columns

    The simplest solution is:

    df[["A", "B"]] = df["AB"].str.split(" ", 1, expand=True)
    

    You must use expand=True if your strings have a non-uniform number of splits and you want None to replace the missing values.

    Notice how, in either case, the .tolist() method is not necessary. Neither is zip().

    In detail:

    Andy Hayden"s solution is most excellent in demonstrating the power of the str.extract() method.

    But for a simple split over a known separator (like splitting by dashes, or splitting by whitespace), the .str.split() method is enough [1]. It operates on a column (Series) of strings, and returns a column (Series) of lists:

    >>> import pandas as pd
    >>> df = pd.DataFrame({"AB": ["A1-B1", "A2-B2"]})
    >>> df
    
          AB
    0  A1-B1
    1  A2-B2
    >>> df["AB_split"] = df["AB"].str.split("-")
    >>> df
    
          AB  AB_split
    0  A1-B1  [A1, B1]
    1  A2-B2  [A2, B2]
    

    1: If you"re unsure what the first two parameters of .str.split() do, I recommend the docs for the plain Python version of the method.

    But how do you go from:

    • a column containing two-element lists

    to:

    • two columns, each containing the respective element of the lists?

    Well, we need to take a closer look at the .str attribute of a column.

    It"s a magical object that is used to collect methods that treat each element in a column as a string, and then apply the respective method in each element as efficient as possible:

    >>> upper_lower_df = pd.DataFrame({"U": ["A", "B", "C"]})
    >>> upper_lower_df
    
       U
    0  A
    1  B
    2  C
    >>> upper_lower_df["L"] = upper_lower_df["U"].str.lower()
    >>> upper_lower_df
    
       U  L
    0  A  a
    1  B  b
    2  C  c
    

    But it also has an "indexing" interface for getting each element of a string by its index:

    >>> df["AB"].str[0]
    
    0    A
    1    A
    Name: AB, dtype: object
    
    >>> df["AB"].str[1]
    
    0    1
    1    2
    Name: AB, dtype: object
    

    Of course, this indexing interface of .str doesn't really care if each element it's indexing is actually a string, as long as it can be indexed, so:

    >>> df["AB"].str.split("-", 1).str[0]
    
    0    A1
    1    A2
    Name: AB, dtype: object
    
    >>> df["AB"].str.split("-", 1).str[1]
    
    0    B1
    1    B2
    Name: AB, dtype: object
    

    Then, it"s a simple matter of taking advantage of the Python tuple unpacking of iterables to do

    >>> df["A"], df["B"] = df["AB"].str.split("-", 1).str
    >>> df
    
          AB  AB_split   A   B
    0  A1-B1  [A1, B1]  A1  B1
    1  A2-B2  [A2, B2]  A2  B2
    

    Of course, getting a DataFrame out of splitting a column of strings is so useful that the .str.split() method can do it for you with the expand=True parameter:

    >>> df["AB"].str.split("-", 1, expand=True)
    
        0   1
    0  A1  B1
    1  A2  B2
    

    So, another way of accomplishing what we wanted is to do:

    >>> df = df[["AB"]]
    >>> df
    
          AB
    0  A1-B1
    1  A2-B2
    
    >>> df.join(df["AB"].str.split("-", 1, expand=True).rename(columns={0:"A", 1:"B"}))
    
          AB   A   B
    0  A1-B1  A1  B1
    1  A2-B2  A2  B2
    

    The expand=True version, although longer, has a distinct advantage over the tuple unpacking method. Tuple unpacking doesn't deal well with splits of different lengths:

    >>> df = pd.DataFrame({"AB": ["A1-B1", "A2-B2", "A3-B3-C3"]})
    >>> df
             AB
    0     A1-B1
    1     A2-B2
    2  A3-B3-C3
    >>> df["A"], df["B"], df["C"] = df["AB"].str.split("-")
    Traceback (most recent call last):
      [...]    
    ValueError: Length of values does not match length of index
    >>> 
    

    But expand=True handles it nicely by placing None in the columns for which there aren't enough "splits":

    >>> df.join(
    ...     df["AB"].str.split("-", expand=True).rename(
    ...         columns={0:"A", 1:"B", 2:"C"}
    ...     )
    ... )
             AB   A   B     C
    0     A1-B1  A1  B1  None
    1     A2-B2  A2  B2  None
    2  A3-B3-C3  A3  B3    C3
    

    Splitting a large file into separate modules in C / C++, Java and Python: StackOverflow Questions

    How do you split a list into evenly sized chunks?

    Question by jespern

    I have a list of arbitrary length, and I need to split it up into equal size chunks and operate on it. There are some obvious ways to do this, like keeping a counter and two lists, and when the second list fills up, add it to the first list and empty the second list for the next round of data, but this is potentially extremely expensive.

    I was wondering if anyone had a good solution to this for lists of any length, e.g. using generators.

    I was looking for something useful in itertools but I couldn't find anything obviously useful. Might've missed it, though.

    Related question: What is the most “pythonic” way to iterate over a list in chunks?

    Split Strings into words with multiple word boundary delimiters

    I think what I want to do is a fairly common task but I've found no reference on the web. I have text with punctuation, and I want a list of the words.

    "Hey, you - what are you doing here!?"
    

    should be

    ["hey", "you", "what", "are", "you", "doing", "here"]
    

    But Python"s str.split() only works with one argument, so I have all words with the punctuation after I split with whitespace. Any ideas?

    Split string with multiple delimiters in Python

    I found some answers online, but I have no experience with regular expressions, which I believe is what is needed here.

    I have a string that needs to be split by either a ";" or ", " - that is, it has to be either a semicolon or a comma followed by a space. Individual commas without trailing spaces should be left untouched.

    Example string:

    "b-staged divinylsiloxane-bis-benzocyclobutene [124221-30-3], mesitylene [000108-67-8]; polymerized 1,2-dihydro-2,2,4- trimethyl quinoline [026780-96-1]"
    

    should be split into a list containing the following:

    ("b-staged divinylsiloxane-bis-benzocyclobutene [124221-30-3]" , "mesitylene [000108-67-8]", "polymerized 1,2-dihydro-2,2,4- trimethyl quinoline [026780-96-1]") 
    

    How to split a string into a list?

    I want my Python function to split a sentence (input) and store each word in a list. My current code splits the sentence, but does not store the words as a list. How do I do that?

    def split_line(text):
    
        # split the text
        words = text.split()
    
        # for each word in the line:
        for word in words:
    
            # print the word
            print(words)
    

    Split string on whitespace in Python

    I"m looking for the Python equivalent of

    String str = "many   fancy word 
    hello    	hi";
    String whiteSpaceRegex = "\s";
    String[] words = str.split(whiteSpaceRegex);
    
    ["many", "fancy", "word", "hello", "hi"]
    

    How to split a string into a list of characters in Python?

    I"ve tried to look around the web for answers to splitting a string into a list of characters but I can"t seem to find a simple method.

    str.split(//) does not seem to work like Ruby does. Is there a simple way of doing this without looping?

    Split string every nth character?

    Is it possible to split a string every nth character?

    For example, suppose I have a string containing the following:

    "1234567890"
    

    How can I get it to look like this:

    ["12","34","56","78","90"]
    

    Split by comma and strip whitespace in Python

    I have some python code that splits on comma, but doesn't strip the whitespace:

    >>> string = "blah, lots  ,  of ,  spaces, here "
    >>> mylist = string.split(",")
    >>> print mylist
    ["blah", " lots  ", "  of ", "  spaces", " here "]
    

    I would rather end up with whitespace removed like this:

    ["blah", "lots", "of", "spaces", "here"]
    

    I am aware that I could loop through the list and strip() each item but, as this is Python, I'm guessing there's a quicker, easier and more elegant way of doing it.

    Splitting on first occurrence

    What would be the best way to split a string on the first occurrence of a delimiter?

    For example:

    "123mango abcd mango kiwi peach"
    

    splitting on the first mango to get:

    "abcd mango kiwi peach"
    

    Split a list based on a condition?

    What"s the best way, both aesthetically and from a performance perspective, to split a list of items into multiple lists based on a conditional? The equivalent of:

    good = [x for x in mylist if x in goodvals]
    bad  = [x for x in mylist if x not in goodvals]
    

    is there a more elegant way to do this?

    Update: here"s the actual use case, to better explain what I"m trying to do:

    # files looks like: [ ("file1.jpg", 33L, ".jpg"), ("file2.avi", 999L, ".avi"), ... ]
    IMAGE_TYPES = (".jpg",".jpeg",".gif",".bmp",".png")
    images = [f for f in files if f[2].lower() in IMAGE_TYPES]
    anims  = [f for f in files if f[2].lower() not in IMAGE_TYPES]
    

    Answer #1

    In Python, what is the purpose of __slots__ and what are the cases one should avoid this?

    TLDR:

    The special attribute __slots__ allows you to explicitly state which instance attributes you expect your object instances to have, with the expected results:

    1. faster attribute access.
    2. space savings in memory.

    The space savings come from

    1. Storing value references in slots instead of __dict__.
    2. Denying __dict__ and __weakref__ creation if parent classes deny them and you declare __slots__.
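
    Before the caveats, a minimal sketch of the basic usage (class and attribute names are illustrative):

    class Point:
        __slots__ = ("x", "y")    # only these instance attributes are allowed
    
        def __init__(self, x, y):
            self.x = x
            self.y = y
    
    p = Point(1, 2)
    p.x = 10     # fine: "x" is a slot
    # p.z = 3 would raise AttributeError: 'Point' object has no attribute 'z'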

    Quick Caveats

    Small caveat, you should only declare a particular slot one time in an inheritance tree. For example:

    class Base:
        __slots__ = "foo", "bar"
    
    class Right(Base):
        __slots__ = "baz", 
    
    class Wrong(Base):
        __slots__ = "foo", "bar", "baz"        # redundant foo and bar
    

    Python doesn"t object when you get this wrong (it probably should), problems might not otherwise manifest, but your objects will take up more space than they otherwise should. Python 3.8:

    >>> from sys import getsizeof
    >>> getsizeof(Right()), getsizeof(Wrong())
    (56, 72)
    

    This is because the Base"s slot descriptor has a slot separate from the Wrong"s. This shouldn"t usually come up, but it could:

    >>> w = Wrong()
    >>> w.foo = "foo"
    >>> Base.foo.__get__(w)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: foo
    >>> Wrong.foo.__get__(w)
    "foo"
    

    The biggest caveat is for multiple inheritance - multiple "parent classes with nonempty slots" cannot be combined.

    To accommodate this restriction, follow best practices: factor the shared behavior of all (or all but one) of the parents into abstractions with empty slots (just like the abstract base classes in the standard library); the concrete parent classes, and your new concrete class, then inherit from those abstractions.

    See section on multiple inheritance below for an example.

    Requirements:

    • To have attributes named in __slots__ to actually be stored in slots instead of a __dict__, a class must inherit from object (automatic in Python 3, but must be explicit in Python 2).

    • To prevent the creation of a __dict__, you must inherit from object and all classes in the inheritance must declare __slots__ and none of them can have a "__dict__" entry.

    There are a lot of details if you wish to keep reading.

    Why use __slots__: Faster attribute access.

    The creator of Python, Guido van Rossum, states that he actually created __slots__ for faster attribute access.

    It is trivial to demonstrate measurably significant faster access:

    import timeit
    
    class Foo(object): __slots__ = "foo",
    
    class Bar(object): pass
    
    slotted = Foo()
    not_slotted = Bar()
    
    def get_set_delete_fn(obj):
        def get_set_delete():
            obj.foo = "foo"
            obj.foo
            del obj.foo
        return get_set_delete
    

    and

    >>> min(timeit.repeat(get_set_delete_fn(slotted)))
    0.2846834529991611
    >>> min(timeit.repeat(get_set_delete_fn(not_slotted)))
    0.3664822799983085
    

    The slotted access is almost 30% faster in Python 3.5 on Ubuntu.

    >>> 0.3664822799983085 / 0.2846834529991611
    1.2873325658284342
    

    In Python 2 on Windows I have measured it about 15% faster.

    Why use __slots__: Memory Savings

    Another purpose of __slots__ is to reduce the space in memory that each object instance takes up.

    My own contribution to the documentation clearly states the reasons behind this:

    The space saved over using __dict__ can be significant.

    SQLAlchemy attributes a lot of memory savings to __slots__.

    To verify this, using the Anaconda distribution of Python 2.7 on Ubuntu Linux, with guppy.hpy (aka heapy) and sys.getsizeof, the size of a class instance without __slots__ declared, and nothing else, is 64 bytes. That does not include the __dict__. Thank you Python for lazy evaluation again, the __dict__ is apparently not called into existence until it is referenced, but classes without data are usually useless. When called into existence, the __dict__ attribute is a minimum of 280 bytes additionally.

    In contrast, a class instance with __slots__ declared to be () (no data) is only 16 bytes, and 56 total bytes with one item in slots, 64 with two.

    For 64 bit Python, I illustrate the memory consumption in bytes in Python 2.7 and 3.6, for __slots__ and __dict__ (no slots defined) for each point where the dict grows in 3.6 (except for 0, 1, and 2 attributes):

           Python 2.7             Python 3.6
    attrs  __slots__  __dict__*   __slots__  __dict__* | *(no slots defined)
    none   16         56 + 272†   16         56 + 112† | †if __dict__ referenced
    one    48         56 + 272    48         56 + 112
    two    56         56 + 272    56         56 + 112
    six    88         56 + 1040   88         56 + 152
    11     128        56 + 1040   128        56 + 240
    22     216        56 + 3344   216        56 + 408     
    43     384        56 + 3344   384        56 + 752
    

    So, in spite of smaller dicts in Python 3, we see how nicely __slots__ scale for instances to save us memory, and that is a major reason you would want to use __slots__.

    Just for completeness of my notes, note that there is a one-time cost per slot in the class's namespace of 64 bytes in Python 2, and 72 bytes in Python 3, because slots use data descriptors like properties, called "members".

    >>> Foo.foo
    <member "foo" of "Foo" objects>
    >>> type(Foo.foo)
    <class "member_descriptor">
    >>> getsizeof(Foo.foo)
    72
    

    Demonstration of __slots__:

    To deny the creation of a __dict__, you must subclass object. Everything subclasses object in Python 3, but in Python 2 you had to be explicit:

    class Base(object): 
        __slots__ = ()
    

    now:

    >>> b = Base()
    >>> b.a = "a"
    Traceback (most recent call last):
      File "<pyshell#38>", line 1, in <module>
        b.a = "a"
    AttributeError: "Base" object has no attribute "a"
    

    Or subclass another class that defines __slots__

    class Child(Base):
        __slots__ = ("a",)
    

    and now:

    c = Child()
    c.a = "a"
    

    but:

    >>> c.b = "b"
    Traceback (most recent call last):
      File "<pyshell#42>", line 1, in <module>
        c.b = "b"
    AttributeError: "Child" object has no attribute "b"
    

    To allow __dict__ creation while subclassing slotted objects, just add "__dict__" to the __slots__ (note that slots are ordered, and you shouldn't repeat slots that are already in parent classes):

    class SlottedWithDict(Child): 
        __slots__ = ("__dict__", "b")
    
    swd = SlottedWithDict()
    swd.a = "a"
    swd.b = "b"
    swd.c = "c"
    

    and

    >>> swd.__dict__
    {"c": "c"}
    

    Or you don"t even need to declare __slots__ in your subclass, and you will still use slots from the parents, but not restrict the creation of a __dict__:

    class NoSlots(Child): pass
    ns = NoSlots()
    ns.a = "a"
    ns.b = "b"
    

    And:

    >>> ns.__dict__
    {"b": "b"}
    

    However, __slots__ may cause problems for multiple inheritance:

    class BaseA(object): 
        __slots__ = ("a",)
    
    class BaseB(object): 
        __slots__ = ("b",)
    

    Because creating a child class from parents with both non-empty slots fails:

    >>> class Child(BaseA, BaseB): __slots__ = ()
    Traceback (most recent call last):
      File "<pyshell#68>", line 1, in <module>
        class Child(BaseA, BaseB): __slots__ = ()
    TypeError: Error when calling the metaclass bases
        multiple bases have instance lay-out conflict
    

    If you run into this problem, you could just remove __slots__ from the parents, or if you have control of the parents, give them empty slots, or refactor to abstractions:

    from abc import ABC
    
    class AbstractA(ABC):
        __slots__ = ()
    
    class BaseA(AbstractA): 
        __slots__ = ("a",)
    
    class AbstractB(ABC):
        __slots__ = ()
    
    class BaseB(AbstractB): 
        __slots__ = ("b",)
    
    class Child(AbstractA, AbstractB): 
        __slots__ = ("a", "b")
    
    c = Child() # no problem!
    

    Add "__dict__" to __slots__ to get dynamic assignment:

    class Foo(object):
        __slots__ = "bar", "baz", "__dict__"
    

    and now:

    >>> foo = Foo()
    >>> foo.boink = "boink"
    

    So with "__dict__" in slots we lose some of the size benefits with the upside of having dynamic assignment and still having slots for the names we do expect.

    When you inherit from an object that isn"t slotted, you get the same sort of semantics when you use __slots__ - names that are in __slots__ point to slotted values, while any other values are put in the instance"s __dict__.

    Avoiding __slots__ because you want to be able to add attributes on the fly is actually not a good reason - just add "__dict__" to your __slots__ if this is required.

    You can similarly add __weakref__ to __slots__ explicitly if you need that feature.

    Set to empty tuple when subclassing a namedtuple:

    The namedtuple builtin makes immutable instances that are very lightweight (essentially, the size of tuples) but to get the benefits, you need to do it yourself if you subclass them:

    from collections import namedtuple
    class MyNT(namedtuple("MyNT", "bar baz")):
        """MyNT is an immutable and lightweight object"""
        __slots__ = ()
    

    usage:

    >>> nt = MyNT("bar", "baz")
    >>> nt.bar
    "bar"
    >>> nt.baz
    "baz"
    

    And trying to assign an unexpected attribute raises an AttributeError because we have prevented the creation of __dict__:

    >>> nt.quux = "quux"
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: "MyNT" object has no attribute "quux"
    

    You can allow __dict__ creation by leaving off __slots__ = (), but you can't use non-empty __slots__ with subtypes of tuple.
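
    That tuple restriction is easy to trip over; a quick demonstration (the exact wording of the message may vary by version):

    >>> class Pair(tuple):
    ...     __slots__ = ("cached",)
    ...
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: nonempty __slots__ not supported for subtype of 'tuple'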

    Biggest Caveat: Multiple inheritance

    Even when non-empty slots are the same for multiple parents, they cannot be used together:

    class Foo(object): 
        __slots__ = "foo", "bar"
    class Bar(object):
        __slots__ = "foo", "bar" # alas, would work if empty, i.e. ()
    
    >>> class Baz(Foo, Bar): pass
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: Error when calling the metaclass bases
        multiple bases have instance lay-out conflict
    

    Using an empty __slots__ in the parent seems to provide the most flexibility, allowing the child to choose to prevent or allow (by adding "__dict__" to get dynamic assignment, see section above) the creation of a __dict__:

    class Foo(object): __slots__ = ()
    class Bar(object): __slots__ = ()
    class Baz(Foo, Bar): __slots__ = ("foo", "bar")
    b = Baz()
    b.foo, b.bar = "foo", "bar"
    

    You don"t have to have slots - so if you add them, and remove them later, it shouldn"t cause any problems.

    Going out on a limb here: If you're composing mixins or using abstract base classes, which aren't intended to be instantiated, an empty __slots__ in those parents seems to be the best way to go in terms of flexibility for subclassers.

    To demonstrate, first, let"s create a class with code we"d like to use under multiple inheritance

    class AbstractBase:
        __slots__ = ()
        def __init__(self, a, b):
            self.a = a
            self.b = b
        def __repr__(self):
            return f"{type(self).__name__}({repr(self.a)}, {repr(self.b)})"
    

    We could use the above directly by inheriting and declaring the expected slots:

    class Foo(AbstractBase):
        __slots__ = "a", "b"
    

    But we don"t care about that, that"s trivial single inheritance, we need another class we might also inherit from, maybe with a noisy attribute:

    class AbstractBaseC:
        __slots__ = ()
        @property
        def c(self):
            print("getting c!")
            return self._c
        @c.setter
        def c(self, arg):
            print("setting c!")
            self._c = arg
    

    Now if both bases had nonempty slots, we couldn't do the below. (In fact, if we wanted, we could have given AbstractBase nonempty slots a and b, and left them out of the below declaration - leaving them in would be wrong):

    class Concretion(AbstractBase, AbstractBaseC):
        __slots__ = "a b _c".split()
    

    And now we have functionality from both via multiple inheritance, and can still deny __dict__ and __weakref__ instantiation:

    >>> c = Concretion("a", "b")
    >>> c.c = c
    setting c!
    >>> c.c
    getting c!
    Concretion("a", "b")
    >>> c.d = "d"
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: "Concretion" object has no attribute "d"
    

    Other cases to avoid slots:

    • Avoid them when you want to perform __class__ assignment with another class that doesn't have them (and you can't add them) unless the slot layouts are identical. (I am very interested in learning who is doing this and why.)
    • Avoid them if you want to subclass variable length builtins like long, tuple, or str, and you want to add attributes to them.
    • Avoid them if you insist on providing default values via class attributes for instance variables.

    You may be able to tease out further caveats from the rest of the __slots__ documentation (the 3.7 dev docs are the most current), which I have made significant recent contributions to.

    Critiques of other answers

    The current top answers cite outdated information and are quite hand-wavy and miss the mark in some important ways.

    Do not "only use __slots__ when instantiating lots of objects"

    I quote:

    "You would want to use __slots__ if you are going to instantiate a lot (hundreds, thousands) of objects of the same class."

    Abstract Base Classes, for example, from the collections module, are not instantiated, yet __slots__ are declared for them.

    Why?

    If a user wishes to deny __dict__ or __weakref__ creation, those things must not be available in the parent classes.

    __slots__ contributes to reusability when creating interfaces or mixins.

    It is true that many Python users aren't writing for reusability, but when you are, having the option to deny unnecessary space usage is valuable.

    __slots__ doesn"t break pickling

    When pickling a slotted object, you may find it complains with a misleading TypeError:

    >>> pickle.loads(pickle.dumps(f))
    TypeError: a class that defines __slots__ without defining __getstate__ cannot be pickled
    

    This is actually incorrect. This message comes from the oldest protocol, which is the default. You can select the latest protocol with the -1 argument. In Python 2.7 this would be 2 (which was introduced in 2.3), and in 3.6 it is 4.

    >>> pickle.loads(pickle.dumps(f, -1))
    <__main__.Foo object at 0x1129C770>
    

    in Python 2.7:

    >>> pickle.loads(pickle.dumps(f, 2))
    <__main__.Foo object at 0x1129C770>
    

    in Python 3.6

    >>> pickle.loads(pickle.dumps(f, 4))
    <__main__.Foo object at 0x1129C770>
    

    So I would keep this in mind, as it is a solved problem.

    Critique of the (until Oct 2, 2016) accepted answer

    The first paragraph is half short explanation, half predictive. Here's the only part that actually answers the question:

    The proper use of __slots__ is to save space in objects. Instead of having a dynamic dict that allows adding attributes to objects at anytime, there is a static structure which does not allow additions after creation. This saves the overhead of one dict for every object that uses slots

    The second half is wishful thinking, and off the mark:

    While this is sometimes a useful optimization, it would be completely unnecessary if the Python interpreter was dynamic enough so that it would only require the dict when there actually were additions to the object.

    Python actually does something similar to this, only creating the __dict__ when it is accessed, but creating lots of objects with no data is fairly ridiculous.

    The second paragraph oversimplifies and misses actual reasons to avoid __slots__. The below is not a real reason to avoid slots (for actual reasons, see the rest of my answer above.):

    They change the behavior of the objects that have slots in a way that can be abused by control freaks and static typing weenies.

    It then goes on to discuss other ways of accomplishing that perverse goal with Python, not discussing anything to do with __slots__.

    The third paragraph is more wishful thinking. Together it is mostly off-the-mark content that the answerer didn't even author and contributes to ammunition for critics of the site.

    Memory usage evidence

    Create some normal objects and slotted objects:

    >>> class Foo(object): pass
    >>> class Bar(object): __slots__ = ()
    

    Instantiate a million of them:

    >>> foos = [Foo() for f in xrange(1000000)]
    >>> bars = [Bar() for b in xrange(1000000)]
    

    Inspect with guppy.hpy().heap():

    >>> guppy.hpy().heap()
    Partition of a set of 2028259 objects. Total size = 99763360 bytes.
     Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
         0 1000000  49 64000000  64  64000000  64 __main__.Foo
         1     169   0 16281480  16  80281480  80 list
         2 1000000  49 16000000  16  96281480  97 __main__.Bar
         3   12284   1   987472   1  97268952  97 str
    ...
    

    Access the regular objects and their __dict__ and inspect again:

    >>> for f in foos:
    ...     f.__dict__
    >>> guppy.hpy().heap()
    Partition of a set of 3028258 objects. Total size = 379763480 bytes.
     Index  Count   %      Size    % Cumulative  % Kind (class / dict of class)
         0 1000000  33 280000000  74 280000000  74 dict of __main__.Foo
         1 1000000  33  64000000  17 344000000  91 __main__.Foo
         2     169   0  16281480   4 360281480  95 list
         3 1000000  33  16000000   4 376281480  99 __main__.Bar
         4   12284   0    987472   0 377268952  99 str
    ...
    

    This is consistent with the history of Python, from Unifying types and classes in Python 2.2

    If you subclass a built-in type, extra space is automatically added to the instances to accommodate __dict__ and __weakrefs__. (The __dict__ is not initialized until you use it though, so you shouldn't worry about the space occupied by an empty dictionary for each instance you create.) If you don't need this extra space, you can add the phrase "__slots__ = []" to your class.
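
    If guppy isn't available, a rough Python 3 sketch of the same measurement can be made with the standard library's tracemalloc (absolute numbers will vary by version and platform):

    import tracemalloc

    class Foo(object): pass
    class Bar(object): __slots__ = ()

    def allocated_by(cls, n=1000000):
        # measure the memory attributed to creating n instances
        tracemalloc.start()
        objects = [cls() for _ in range(n)]
        current, _peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        return current

    print("Foo:", allocated_by(Foo))   # plain instances
    print("Bar:", allocated_by(Bar))   # slotted instances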

    Answer #2

    Quick Answer:

    The simplest way to get row counts per group is by calling .size(), which returns a Series:

    df.groupby(["col1","col2"]).size()
    


    Usually you want this result as a DataFrame (instead of a Series) so you can do:

    df.groupby(["col1", "col2"]).size().reset_index(name="counts")
    


    If you want to find out how to calculate the row counts and other statistics for each group continue reading below.


    Detailed example:

    Consider the following example dataframe:

    In [2]: df
    Out[2]: 
      col1 col2  col3  col4  col5  col6
    0    A    B  0.20 -0.61 -0.49  1.49
    1    A    B -1.53 -1.01 -0.39  1.82
    2    A    B -0.44  0.27  0.72  0.11
    3    A    B  0.28 -1.32  0.38  0.18
    4    C    D  0.12  0.59  0.81  0.66
    5    C    D -0.13 -1.65 -1.64  0.50
    6    C    D -1.42 -0.11 -0.18 -0.44
    7    E    F -0.00  1.42 -0.26  1.17
    8    E    F  0.91 -0.47  1.35 -0.34
    9    G    H  1.48 -0.63 -1.14  0.17
    

    First let's use .size() to get the row counts:

    In [3]: df.groupby(["col1", "col2"]).size()
    Out[3]: 
    col1  col2
    A     B       4
    C     D       3
    E     F       2
    G     H       1
    dtype: int64
    

    Then let's use .size().reset_index(name="counts") to get the row counts:

    In [4]: df.groupby(["col1", "col2"]).size().reset_index(name="counts")
    Out[4]: 
      col1 col2  counts
    0    A    B       4
    1    C    D       3
    2    E    F       2
    3    G    H       1
    


    Including results for more statistics

    When you want to calculate statistics on grouped data, it usually looks like this:

    In [5]: (df
       ...: .groupby(["col1", "col2"])
       ...: .agg({
       ...:     "col3": ["mean", "count"], 
       ...:     "col4": ["median", "min", "count"]
       ...: }))
    Out[5]: 
                col4                  col3      
              median   min count      mean count
    col1 col2                                   
    A    B    -0.810 -1.32     4 -0.372500     4
    C    D    -0.110 -1.65     3 -0.476667     3
    E    F     0.475 -0.47     2  0.455000     2
    G    H    -0.630 -0.63     1  1.480000     1
    

    The result above is a little annoying to deal with because of the nested column labels, and also because row counts are on a per column basis.

    To gain more control over the output I usually split the statistics into individual aggregations that I then combine using join. It looks like this:

    In [6]: gb = df.groupby(["col1", "col2"])
       ...: counts = gb.size().to_frame(name="counts")
       ...: (counts
       ...:  .join(gb.agg({"col3": "mean"}).rename(columns={"col3": "col3_mean"}))
       ...:  .join(gb.agg({"col4": "median"}).rename(columns={"col4": "col4_median"}))
       ...:  .join(gb.agg({"col4": "min"}).rename(columns={"col4": "col4_min"}))
       ...:  .reset_index()
       ...: )
       ...: 
    Out[6]: 
      col1 col2  counts  col3_mean  col4_median  col4_min
    0    A    B       4  -0.372500       -0.810     -1.32
    1    C    D       3  -0.476667       -0.110     -1.65
    2    E    F       2   0.455000        0.475     -0.47
    3    G    H       1   1.480000       -0.630     -0.63
    



    Footnotes

    The code used to generate the test data is shown below:

    In [1]: import numpy as np
       ...: import pandas as pd 
       ...: 
       ...: keys = np.array([
       ...:         ["A", "B"],
       ...:         ["A", "B"],
       ...:         ["A", "B"],
       ...:         ["A", "B"],
       ...:         ["C", "D"],
       ...:         ["C", "D"],
       ...:         ["C", "D"],
       ...:         ["E", "F"],
       ...:         ["E", "F"],
       ...:         ["G", "H"] 
       ...:         ])
       ...: 
       ...: df = pd.DataFrame(
       ...:     np.hstack([keys,np.random.randn(10,4).round(2)]), 
       ...:     columns = ["col1", "col2", "col3", "col4", "col5", "col6"]
       ...: )
       ...: 
       ...: df[["col3", "col4", "col5", "col6"]] = \
       ...:     df[["col3", "col4", "col5", "col6"]].astype(float)
       ...: 
    


    Disclaimer:

    If some of the columns that you are aggregating have null values, then you really want to be looking at the group row counts as an independent aggregation for each column. Otherwise you may be misled as to how many records are actually being used to calculate things like the mean because pandas will drop NaN entries in the mean calculation without telling you about it.
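
    A minimal sketch of that pitfall (column names invented for the example): .size() counts rows, while .count() counts only non-null values, so the two diverge as soon as NaNs appear.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"g": ["a", "a", "b"], "x": [1.0, np.nan, 3.0]})

    print(df.groupby("g").size())        # rows per group:      a -> 2, b -> 1
    print(df.groupby("g")["x"].count())  # non-null x per group: a -> 1, b -> 1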

    Answer #3

    TL;DR version:

    For the simple case of:

    • I have a text column with a delimiter and I want two columns

    The simplest solution is:

    df[["A", "B"]] = df["AB"].str.split(" ", 1, expand=True)
    

    You must use expand=True if your strings have a non-uniform number of splits and you want None to replace the missing values.

    Notice how, in either case, the .tolist() method is not necessary. Neither is zip().

    In detail:

    Andy Hayden's solution is most excellent in demonstrating the power of the str.extract() method.

    But for a simple split over a known separator (like splitting by dashes, or splitting by whitespace), the .str.split() method is enough [1]. It operates on a column (Series) of strings, and returns a column (Series) of lists:

    >>> import pandas as pd
    >>> df = pd.DataFrame({"AB": ["A1-B1", "A2-B2"]})
    >>> df
    
          AB
    0  A1-B1
    1  A2-B2
    >>> df["AB_split"] = df["AB"].str.split("-")
    >>> df
    
          AB  AB_split
    0  A1-B1  [A1, B1]
    1  A2-B2  [A2, B2]
    

    [1]: If you're unsure what the first two parameters of .str.split() do, I recommend the docs for the plain Python version of the method.

    But how do you go from:

    • a column containing two-element lists

    to:

    • two columns, each containing the respective element of the lists?

    Well, we need to take a closer look at the .str attribute of a column.

    It's a magical object that is used to collect methods that treat each element in a column as a string, and then apply the respective method to each element as efficiently as possible:

    >>> upper_lower_df = pd.DataFrame({"U": ["A", "B", "C"]})
    >>> upper_lower_df
    
       U
    0  A
    1  B
    2  C
    >>> upper_lower_df["L"] = upper_lower_df["U"].str.lower()
    >>> upper_lower_df
    
       U  L
    0  A  a
    1  B  b
    2  C  c
    

    But it also has an "indexing" interface for getting each element of a string by its index:

    >>> df["AB"].str[0]
    
    0    A
    1    A
    Name: AB, dtype: object
    
    >>> df["AB"].str[1]
    
    0    1
    1    2
    Name: AB, dtype: object
    

    Of course, this indexing interface of .str doesn't really care if each element it's indexing is actually a string, as long as it can be indexed, so:

    >>> df["AB"].str.split("-", 1).str[0]
    
    0    A1
    1    A2
    Name: AB, dtype: object
    
    >>> df["AB"].str.split("-", 1).str[1]
    
    0    B1
    1    B2
    Name: AB, dtype: object
    

    Then it's a simple matter of taking advantage of Python tuple unpacking of iterables to do:

    >>> df["A"], df["B"] = df["AB"].str.split("-", 1).str
    >>> df
    
          AB  AB_split   A   B
    0  A1-B1  [A1, B1]  A1  B1
    1  A2-B2  [A2, B2]  A2  B2
    

    Of course, getting a DataFrame out of splitting a column of strings is so useful that the .str.split() method can do it for you with the expand=True parameter:

    >>> df["AB"].str.split("-", 1, expand=True)
    
        0   1
    0  A1  B1
    1  A2  B2
    

    So, another way of accomplishing what we wanted is to do:

    >>> df = df[["AB"]]
    >>> df
    
          AB
    0  A1-B1
    1  A2-B2
    
    >>> df.join(df["AB"].str.split("-", 1, expand=True).rename(columns={0:"A", 1:"B"}))
    
          AB   A   B
    0  A1-B1  A1  B1
    1  A2-B2  A2  B2
    

    The expand=True version, although longer, has a distinct advantage over the tuple unpacking method. Tuple unpacking doesn't deal well with splits of different lengths:

    >>> df = pd.DataFrame({"AB": ["A1-B1", "A2-B2", "A3-B3-C3"]})
    >>> df
             AB
    0     A1-B1
    1     A2-B2
    2  A3-B3-C3
    >>> df["A"], df["B"], df["C"] = df["AB"].str.split("-")
    Traceback (most recent call last):
      [...]    
    ValueError: Length of values does not match length of index
    >>> 
    

    But expand=True handles it nicely by placing None in the columns for which there aren't enough "splits":

    >>> df.join(
    ...     df["AB"].str.split("-", expand=True).rename(
    ...         columns={0:"A", 1:"B", 2:"C"}
    ...     )
    ... )
             AB   A   B     C
    0     A1-B1  A1  B1  None
    1     A2-B2  A2  B2  None
    2  A3-B3-C3  A3  B3    C3
    

    Answer #4

    There are several ways to select rows from a Pandas dataframe:

    1. Boolean indexing (df[df["col"] == value])
    2. Positional indexing (df.iloc[...])
    3. Label indexing (df.xs(...))
    4. df.query(...) API

    Below I show you examples of each, with advice on when to use certain techniques. Assume our criterion is column "A" == "foo".

    (Note on performance: For each base type, we can keep things simple by using the Pandas API or we can venture outside the API, usually into NumPy, and speed things up.)


    Setup

    The first thing we'll need is to identify a condition that will act as our criterion for selecting rows. We'll start with the OP's case column_name == some_value, and include some other common use cases.

    Borrowing from @unutbu:

    import pandas as pd, numpy as np
    
    df = pd.DataFrame({"A": "foo bar foo bar foo bar foo foo".split(),
                       "B": "one one two three two two one three".split(),
                       "C": np.arange(8), "D": np.arange(8) * 2})
    

    1. Boolean indexing

    Boolean indexing requires finding the true value of each row's "A" column being equal to "foo", then using those truth values to identify which rows to keep. Typically, we'd name this series, an array of truth values, mask. We'll do so here as well.

    mask = df["A"] == "foo"
    

    We can then use this mask to slice or index the data frame

    df[mask]
    
         A      B  C   D
    0  foo    one  0   0
    2  foo    two  2   4
    4  foo    two  4   8
    6  foo    one  6  12
    7  foo  three  7  14
    

    This is one of the simplest ways to accomplish this task and if performance or intuitiveness isn't an issue, this should be your chosen method. However, if performance is a concern, then you might want to consider an alternative way of creating the mask.


    2. Positional indexing

    Positional indexing (df.iloc[...]) has its use cases, but this isn't one of them. In order to identify where to slice, we first need to perform the same boolean analysis we did above. This leaves us performing one extra step to accomplish the same task.

    mask = df["A"] == "foo"
    pos = np.flatnonzero(mask)
    df.iloc[pos]
    
         A      B  C   D
    0  foo    one  0   0
    2  foo    two  2   4
    4  foo    two  4   8
    6  foo    one  6  12
    7  foo  three  7  14
    

    3. Label indexing

    Label indexing can be very handy, but in this case, we are again doing more work for no benefit

    df.set_index("A", append=True, drop=False).xs("foo", level=1)
    
         A      B  C   D
    0  foo    one  0   0
    2  foo    two  2   4
    4  foo    two  4   8
    6  foo    one  6  12
    7  foo  three  7  14
    

    4. df.query() API

    pd.DataFrame.query is a very elegant/intuitive way to perform this task, but is often slower. However, if you pay attention to the timings below, for large data the query is very efficient: more so than the standard approach, and of similar magnitude to my best suggestion.

    df.query('A == "foo"')
    
         A      B  C   D
    0  foo    one  0   0
    2  foo    two  2   4
    4  foo    two  4   8
    6  foo    one  6  12
    7  foo  three  7  14
    

    My preference is to use the Boolean mask

    Actual improvements can be made by modifying how we create our Boolean mask.

    mask alternative 1: Use the underlying NumPy array and forgo the overhead of creating another pd.Series.

    mask = df["A"].values == "foo"
    

    I'll show more complete time tests at the end, but just take a look at the performance gains we get using the sample data frame. First, we look at the difference in creating the mask

    %timeit mask = df["A"].values == "foo"
    %timeit mask = df["A"] == "foo"
    
    5.84 µs ± 195 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
    166 µs ± 4.45 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    

    Evaluating the mask with the NumPy array is ~ 30 times faster. This is partly due to NumPy evaluation often being faster. It is also partly due to the lack of overhead necessary to build an index and a corresponding pd.Series object.

    Next, we'll look at the timing for slicing with one mask versus the other.

    mask = df["A"].values == "foo"
    %timeit df[mask]
    mask = df["A"] == "foo"
    %timeit df[mask]
    
    219 µs ± 12.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    239 µs ± 7.03 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    

    The performance gains aren't as pronounced. We'll see if this holds up over more robust testing.


    mask alternative 2: We could have reconstructed the data frame as well. There is a big caveat when reconstructing a dataframe: you must take care of the dtypes when doing so!

    Instead of df[mask] we will do this

    pd.DataFrame(df.values[mask], df.index[mask], df.columns).astype(df.dtypes)
    

    If the data frame is of mixed type, which our example is, then when we get df.values the resulting array is of dtype object and consequently, all columns of the new data frame will be of dtype object as well, requiring the astype(df.dtypes) call and killing any potential performance gains.

    %timeit df[mask]
    %timeit pd.DataFrame(df.values[mask], df.index[mask], df.columns).astype(df.dtypes)
    
    216 µs ± 10.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    1.43 ms ± 39.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    

    However, if the data frame is not of mixed type, this is a very useful way to do it.

    Given

    np.random.seed([3,1415])
    d1 = pd.DataFrame(np.random.randint(10, size=(10, 5)), columns=list("ABCDE"))
    
    d1
    
       A  B  C  D  E
    0  0  2  7  3  8
    1  7  0  6  8  6
    2  0  2  0  4  9
    3  7  3  2  4  3
    4  3  6  7  7  4
    5  5  3  7  5  9
    6  8  7  6  4  7
    7  6  2  6  6  5
    8  2  8  7  5  8
    9  4  7  6  1  5
    

    %%timeit
    mask = d1["A"].values == 7
    d1[mask]
    
    179 µs ± 8.73 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    

    Versus

    %%timeit
    mask = d1["A"].values == 7
    pd.DataFrame(d1.values[mask], d1.index[mask], d1.columns)
    
    87 µs ± 5.12 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    

    We cut the time in half.


    mask alternative 3

    @unutbu also shows us how to use pd.Series.isin to account for each element of df["A"] being in a set of values. This evaluates to the same thing if our set of values is a set of one value, namely "foo". But it also generalizes to include larger sets of values if needed. Turns out, this is still pretty fast even though it is a more general solution. The only real loss is in intuitiveness for those not familiar with the concept.

    mask = df["A"].isin(["foo"])
    df[mask]
    
         A      B  C   D
    0  foo    one  0   0
    2  foo    two  2   4
    4  foo    two  4   8
    6  foo    one  6  12
    7  foo  three  7  14
    
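    For instance, the same mask generalizes to several values with no structural change (the extra value "bar" here is purely illustrative):

    mask = df["A"].isin(["foo", "bar"])   # True where A is any of the listed values
    df[mask]
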

    However, as before, we can utilize NumPy to improve performance while sacrificing virtually nothing. We'll use np.in1d.

    mask = np.in1d(df["A"].values, ["foo"])
    df[mask]
    
         A      B  C   D
    0  foo    one  0   0
    2  foo    two  2   4
    4  foo    two  4   8
    6  foo    one  6  12
    7  foo  three  7  14
    

    Timing

    I'll include other concepts mentioned in other posts as well for reference.

    Code Below

    Each column in this table represents a different length data frame over which we test each function. Each column shows relative time taken, with the fastest function given a base index of 1.0.

    res.div(res.min())
    
                             10        30        100       300       1000      3000      10000     30000
    mask_standard         2.156872  1.850663  2.034149  2.166312  2.164541  3.090372  2.981326  3.131151
    mask_standard_loc     1.879035  1.782366  1.988823  2.338112  2.361391  3.036131  2.998112  2.990103
    mask_with_values      1.010166  1.000000  1.005113  1.026363  1.028698  1.293741  1.007824  1.016919
    mask_with_values_loc  1.196843  1.300228  1.000000  1.000000  1.038989  1.219233  1.037020  1.000000
    query                 4.997304  4.765554  5.934096  4.500559  2.997924  2.397013  1.680447  1.398190
    xs_label              4.124597  4.272363  5.596152  4.295331  4.676591  5.710680  6.032809  8.950255
    mask_with_isin        1.674055  1.679935  1.847972  1.724183  1.345111  1.405231  1.253554  1.264760
    mask_with_in1d        1.000000  1.083807  1.220493  1.101929  1.000000  1.000000  1.000000  1.144175
    

    You'll notice that the fastest times seem to be shared between mask_with_values and mask_with_in1d.

    res.T.plot(loglog=True)
    


    Functions

    def mask_standard(df):
        mask = df["A"] == "foo"
        return df[mask]
    
    def mask_standard_loc(df):
        mask = df["A"] == "foo"
        return df.loc[mask]
    
    def mask_with_values(df):
        mask = df["A"].values == "foo"
        return df[mask]
    
    def mask_with_values_loc(df):
        mask = df["A"].values == "foo"
        return df.loc[mask]
    
    def query(df):
        return df.query('A == "foo"')
    
    def xs_label(df):
        return df.set_index("A", append=True, drop=False).xs("foo", level=-1)
    
    def mask_with_isin(df):
        mask = df["A"].isin(["foo"])
        return df[mask]
    
    def mask_with_in1d(df):
        mask = np.in1d(df["A"].values, ["foo"])
        return df[mask]
    

    Testing

    from timeit import timeit

    res = pd.DataFrame(
        index=[
            "mask_standard", "mask_standard_loc", "mask_with_values", "mask_with_values_loc",
            "query", "xs_label", "mask_with_isin", "mask_with_in1d"
        ],
        columns=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
        dtype=float
    )
    
    for j in res.columns:
        d = pd.concat([df] * j, ignore_index=True)
        for i in res.index:
            stmt = "{}(d)".format(i)
            setp = "from __main__ import d, {}".format(i)
            res.at[i, j] = timeit(stmt, setp, number=50)
    

    Special Timing

    Looking at the special case when we have a single non-object dtype for the entire data frame.

    Code Below

    spec.div(spec.min())
    
                         10        30        100       300       1000      3000      10000     30000
    mask_with_values  1.009030  1.000000  1.194276  1.000000  1.236892  1.095343  1.000000  1.000000
    mask_with_in1d    1.104638  1.094524  1.156930  1.072094  1.000000  1.000000  1.040043  1.027100
    reconstruct       1.000000  1.142838  1.000000  1.355440  1.650270  2.222181  2.294913  3.406735
    

    Turns out, reconstruction isn't worth it past a few hundred rows.

    spec.T.plot(loglog=True)
    


    Functions

    np.random.seed([3,1415])
    d1 = pd.DataFrame(np.random.randint(10, size=(10, 5)), columns=list("ABCDE"))
    
    def mask_with_values(df):
        mask = df["A"].values == 7
        return df[mask]

    def mask_with_in1d(df):
        mask = np.in1d(df["A"].values, [7])
        return df[mask]

    def reconstruct(df):
        v = df.values
        mask = np.in1d(df["A"].values, [7])
        return pd.DataFrame(v[mask], df.index[mask], df.columns)
    
    spec = pd.DataFrame(
        index=["mask_with_values", "mask_with_in1d", "reconstruct"],
        columns=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
        dtype=float
    )
    

    Testing

    for j in spec.columns:
        d = pd.concat([d1] * j, ignore_index=True)
        for i in spec.index:
            stmt = "{}(d)".format(i)
            setp = "from __main__ import d, {}".format(i)
            spec.at[i, j] = timeit(stmt, setp, number=50)
    

    Answer #5

    It's much easier if you use Response.raw and shutil.copyfileobj():

    import requests
    import shutil
    
    def download_file(url):
        local_filename = url.split("/")[-1]
        with requests.get(url, stream=True) as r:
            with open(local_filename, "wb") as f:
                shutil.copyfileobj(r.raw, f)
    
        return local_filename
    

    This streams the file to disk without using excessive memory, and the code is simple.
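
    If you want to harden it slightly: r.raw bypasses requests' automatic content decoding, so a sketch that also handles HTTP errors and compressed responses might look like this (decode_content and raise_for_status() are standard requests/urllib3 features):

    import requests
    import shutil

    def download_file(url):
        local_filename = url.split("/")[-1]
        with requests.get(url, stream=True) as r:
            r.raise_for_status()           # fail loudly instead of saving an error page
            r.raw.decode_content = True    # decompress gzip/deflate if the server compressed
            with open(local_filename, "wb") as f:
                shutil.copyfileobj(r.raw, f)
        return local_filename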

    Answer #6

    Explain __all__ in Python?

    I keep seeing the variable __all__ set in different __init__.py files.

    What does this do?

    What does __all__ do?

    It declares the semantically "public" names from a module. If there is a name in __all__, users are expected to use it, and they can have the expectation that it will not change.

    It also will have programmatic effects:

    import *

    __all__ in a module, e.g. module.py:

    __all__ = ["foo", "Bar"]
    

    means that when you import * from the module, only those names in the __all__ are imported:

    from module import *               # imports foo and Bar
    

    Documentation tools

    Documentation and code autocompletion tools may (in fact, should) also inspect the __all__ to determine what names to show as available from a module.

    __init__.py makes a directory a Python package

    From the docs:

    The __init__.py files are required to make Python treat the directories as containing packages; this is done to prevent directories with a common name, such as string, from unintentionally hiding valid modules that occur later on the module search path.

    In the simplest case, __init__.py can just be an empty file, but it can also execute initialization code for the package or set the __all__ variable.

    So the __init__.py can declare the __all__ for a package.

    Managing an API:

    A package is typically made up of modules that may import one another, but that are necessarily tied together with an __init__.py file. That file is what makes the directory an actual Python package. For example, say you have the following files in a package:

    package
    ├── __init__.py
    ├── module_1.py
    └── module_2.py
    

    Let's create these files with Python so you can follow along - you could paste the following into a Python 3 shell:

    from pathlib import Path
    
    package = Path("package")
    package.mkdir()
    
    (package / "__init__.py").write_text("""
    from .module_1 import *
    from .module_2 import *
    """)
    
    package_module_1 = package / "module_1.py"
    package_module_1.write_text("""
    __all__ = ["foo"]
    imp_detail1 = imp_detail2 = imp_detail3 = None
    def foo(): pass
    """)
    
    package_module_2 = package / "module_2.py"
    package_module_2.write_text("""
    __all__ = ["Bar"]
    imp_detail1 = imp_detail2 = imp_detail3 = None
    class Bar: pass
    """)
    

    And now you have presented a complete API that someone else can use when they import your package, like so:

    import package
    package.foo()
    package.Bar()
    

    And the package won't have all the other implementation details you used when creating your modules cluttering up the package namespace.

    __all__ in __init__.py

    After more work, maybe you've decided that the modules are too big (like many thousands of lines?) and need to be split up. So you do the following:

    package
    ├── __init__.py
    ├── module_1
    │   ├── foo_implementation.py
    │   └── __init__.py
    └── module_2
        ├── Bar_implementation.py
        └── __init__.py
    

    First make the subpackage directories with the same names as the modules:

    subpackage_1 = package / "module_1"
    subpackage_1.mkdir()
    subpackage_2 = package / "module_2"
    subpackage_2.mkdir()
    

    Move the implementations:

    package_module_1.rename(subpackage_1 / "foo_implementation.py")
    package_module_2.rename(subpackage_2 / "Bar_implementation.py")
    

    Create __init__.py files for the subpackages, each declaring its own __all__:

    (subpackage_1 / "__init__.py").write_text("""
    from .foo_implementation import *
    __all__ = ["foo"]
    """)
    (subpackage_2 / "__init__.py").write_text("""
    from .Bar_implementation import *
    __all__ = ["Bar"]
    """)
    

    And now you still have the API provisioned at the package level:

    >>> import package
    >>> package.foo()
    >>> package.Bar()
    <package.module_2.Bar_implementation.Bar object at 0x7f0c2349d210>
    

    And you can easily add things to your API that you can manage at the subpackage level instead of the subpackage's module level. If you want to add a new name to the API, you simply update the __init__.py, e.g. in module_2:

    from .Bar_implementation import *
    from .Baz_implementation import *
    __all__ = ["Bar", "Baz"]
    

    And if you're not ready to publish Baz in the top level API, in your top level __init__.py you could have:

    from .module_1 import *       # also constrained by __all__'s
    from .module_2 import *       # in the __init__.py's
    __all__ = ["foo", "Bar"]     # further constraining the names advertised
    

    and if your users are aware of the availability of Baz, they can use it:

    import package
    package.Baz()
    

    but if they don't know about it, other tools (like pydoc) won't inform them.

    You can later change that when Baz is ready for prime time:

    from .module_1 import *
    from .module_2 import *
    __all__ = ["foo", "Bar", "Baz"]
    

    Prefixing _ versus __all__:

    By default, Python will export all names that do not start with an _. You certainly could rely on this mechanism. Some packages in the Python standard library, in fact, do rely on this, but to do so, they alias their imports, for example, in ctypes/__init__.py:

    import os as _os, sys as _sys
    

    Using the _ convention can be more elegant because it removes the redundancy of naming the names again. But it adds the redundancy for imports (if you have a lot of them) and it is easy to forget to do this consistently - and the last thing you want is to have to indefinitely support something you intended to only be an implementation detail, just because you forgot to prefix an _ when naming a function.

    I personally write an __all__ early in my development lifecycle for modules so that others who might use my code know what they should use and not use.

    Most packages in the standard library also use __all__.

    When avoiding __all__ makes sense

    It makes sense to stick to the _ prefix convention in lieu of __all__ when:

    • You're still in early development mode and have no users, and are constantly tweaking your API.
    • Maybe you do have users, but you have unittests that cover the API, and you're still actively adding to the API and tweaking in development.

    An export decorator

    The downside of using __all__ is that you have to write the names of functions and classes being exported twice - and the information is kept separate from the definitions. We could use a decorator to solve this problem.

    I got the idea for such an export decorator from David Beazley's talk on packaging. This implementation seems to work well in CPython's traditional importer. If you have a special import hook or system, I do not guarantee it, but if you adopt it, it is fairly trivial to back out - you'll just need to manually add the names back into the __all__.

    So in, for example, a utility library, you would define the decorator:

    import sys
    
    def export(fn):
        mod = sys.modules[fn.__module__]
        if hasattr(mod, "__all__"):
            mod.__all__.append(fn.__name__)
        else:
            mod.__all__ = [fn.__name__]
        return fn
    

    and then, where you would define an __all__, you do this:

    $ cat > main.py
    from lib import export
    __all__ = [] # optional - we create a list if __all__ is not there.
    
    @export
    def foo(): pass
    
    @export
    def bar():
        "bar"
    
    def main():
        print("main")
    
    if __name__ == "__main__":
        main()
    

    And this works fine whether run as main or imported by another function.

    $ cat > run.py
    import main
    main.main()
    
    $ python run.py
    main
    

    And API provisioning with import * will work too:

    $ cat > run.py
    from main import *
    foo()
    bar()
    main() # expected to error here, not exported
    
    $ python run.py
    Traceback (most recent call last):
      File "run.py", line 4, in <module>
        main() # expected to error here, not exported
    NameError: name 'main' is not defined
    

    Answer #7

    A comment in the Python source code for float objects acknowledges that:

    Comparison is pretty much a nightmare

    This is especially true when comparing a float to an integer, because, unlike floats, integers in Python can be arbitrarily large and are always exact. Trying to cast the integer to a float might lose precision and make the comparison inaccurate. Trying to cast the float to an integer is not going to work either because any fractional part will be lost.

    To get around this problem, Python performs a series of checks, returning the result if one of the checks succeeds. It compares the signs of the two values, then whether the integer is "too big" to be a float, then compares the exponent of the float to the length of the integer. If all of these checks fail, it is necessary to construct two new Python objects to compare in order to obtain the result.

    When comparing a float v to an integer/long w, the worst case is that:

    • v and w have the same sign (both positive or both negative),
    • the integer w has few enough bits that it can be held in the size_t type (typically 32 or 64 bits),
    • the integer w has at least 49 bits,
    • the exponent of the float v is the same as the number of bits in w.

    And this is exactly what we have for the values in the question:

    >>> import math
    >>> math.frexp(562949953420000.7) # gives the float's (significand, exponent) pair
    (0.9999999999976706, 49)
    >>> (562949953421000).bit_length()
    49
    

    We see that 49 is both the exponent of the float and the number of bits in the integer. Both numbers are positive and so the four criteria above are met.

    Choosing one of the values to be larger (or smaller) can change the number of bits of the integer, or the value of the exponent, and so Python is able to determine the result of the comparison without performing the expensive final check.
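
    A quick, hedged way to observe this yourself (absolute timings will vary by machine and CPython version):

    import timeit

    # frexp exponent of the float is 49 and the integer has 49 bits,
    # so this comparison falls through to the expensive final check.
    slow = timeit.timeit("562949953420000.7 < 562949953421000", number=10**6)

    # 2**60 has 61 bits, so the bit-length check settles the result early.
    fast = timeit.timeit("562949953420000.7 < 2**60", number=10**6)

    print(slow, fast)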

    This is specific to the CPython implementation of the language.


    The comparison in more detail

    The float_richcompare function handles the comparison between two values v and w.

    Below is a step-by-step description of the checks that the function performs. The comments in the Python source are actually very helpful when trying to understand what the function does, so I've left them in where relevant. I've also summarised these checks in a list at the foot of the answer.

    The main idea is to map the Python objects v and w to two appropriate C doubles, i and j, which can then be easily compared to give the correct result. Both Python 2 and Python 3 use the same ideas to do this (the former just handles int and long types separately).

    The first thing to do is check that v is definitely a Python float and map it to a C double i. Next the function looks at whether w is also a float and maps it to a C double j. This is the best case scenario for the function as all the other checks can be skipped. The function also checks to see whether v is inf or nan:

    static PyObject*
    float_richcompare(PyObject *v, PyObject *w, int op)
    {
        double i, j;
        int r = 0;
        assert(PyFloat_Check(v));       
        i = PyFloat_AS_DOUBLE(v);       
    
        if (PyFloat_Check(w))           
            j = PyFloat_AS_DOUBLE(w);   
    
        else if (!Py_IS_FINITE(i)) {
            if (PyLong_Check(w))
                j = 0.0;
            else
                goto Unimplemented;
        }
    

    Now we know that if w failed these checks, it is not a Python float. Now the function checks if it's a Python integer. If this is the case, the easiest test is to extract the sign of v and the sign of w (return 0 if zero, -1 if negative, 1 if positive). If the signs are different, this is all the information needed to return the result of the comparison:

        else if (PyLong_Check(w)) {
            int vsign = i == 0.0 ? 0 : i < 0.0 ? -1 : 1;
            int wsign = _PyLong_Sign(w);
            size_t nbits;
            int exponent;
    
            if (vsign != wsign) {
                /* Magnitudes are irrelevant -- the signs alone
                 * determine the outcome.
                 */
                i = (double)vsign;
                j = (double)wsign;
                goto Compare;
            }
        }   
    

    If this check failed, then v and w have the same sign.

    The next check counts the number of bits in the integer w. If it has too many bits then it can't possibly be held as a float and so must be larger in magnitude than the float v:

        nbits = _PyLong_NumBits(w);
        if (nbits == (size_t)-1 && PyErr_Occurred()) {
            /* This long is so large that size_t isn't big enough
             * to hold the # of bits.  Replace with little doubles
             * that give the same outcome -- w is so large that
             * its magnitude must exceed the magnitude of any
             * finite float.
             */
            PyErr_Clear();
            i = (double)vsign;
            assert(wsign != 0);
            j = wsign * 2.0;
            goto Compare;
        }
    

    On the other hand, if the integer w has 48 or fewer bits, it can be safely turned into a C double j and compared:

        if (nbits <= 48) {
            j = PyLong_AsDouble(w);
            /* It's impossible that <= 48 bits overflowed. */
            assert(j != -1.0 || ! PyErr_Occurred());
            goto Compare;
        }
    

    From this point onwards, we know that w has 49 or more bits. It will be convenient to treat w as a positive integer, so change the sign and the comparison operator as necessary:

        if (vsign < 0) {
            /* "Multiply both sides" by -1; this also swaps the
             * comparator.
             */
            i = -i;
            op = _Py_SwappedOp[op];
        }
    

    Now the function looks at the exponent of the float. Recall that a float can be written (ignoring sign) as significand * 2^exponent and that the significand represents a number between 0.5 and 1:

        (void) frexp(i, &exponent);
        if (exponent < 0 || (size_t)exponent < nbits) {
            i = 1.0;
            j = 2.0;
            goto Compare;
        }
    

    This checks two things. If the exponent is less than 0 then the float is smaller than 1 (and so smaller in magnitude than any integer). Or, if the exponent is less than the number of bits in w then we have that v < |w| since significand * 2^exponent is less than 2^nbits.

    Failing these two checks, the function looks to see whether the exponent is greater than the number of bits in w. This shows that significand * 2^exponent is greater than 2^nbits and so v > |w|:

        if ((size_t)exponent > nbits) {
            i = 2.0;
            j = 1.0;
            goto Compare;
        }
    

    If this check did not succeed we know that the exponent of the float v is the same as the number of bits in the integer w.

    The only way that the two values can be compared now is to construct two new Python integers from v and w. The idea is to discard the fractional part of v, double the integer part, and then add one. w is also doubled and these two new Python objects can be compared to give the correct return value. Using an example with small values, 4.65 < 4 would be determined by the comparison (2*4)+1 == 9 < 8 == (2*4) (returning false).

        {
            double fracpart;
            double intpart;
            PyObject *result = NULL;
            PyObject *one = NULL;
            PyObject *vv = NULL;
            PyObject *ww = w;
    
            // snip
    
            fracpart = modf(i, &intpart); // split i (the double that v mapped to)
            vv = PyLong_FromDouble(intpart);
    
            // snip
    
            if (fracpart != 0.0) {
                /* Shift left, and or a 1 bit into vv
                 * to represent the lost fraction.
                 */
                PyObject *temp;
    
                one = PyLong_FromLong(1);
    
                temp = PyNumber_Lshift(ww, one); // left-shift doubles an integer
                ww = temp;
    
                temp = PyNumber_Lshift(vv, one);
                vv = temp;
    
                temp = PyNumber_Or(vv, one); // a doubled integer is even, so this adds 1
                vv = temp;
            }
            // snip
        }
    }
    

    For brevity I've left out the additional error-checking and garbage-tracking Python has to do when it creates these new objects. Needless to say, this adds additional overhead and explains why the values highlighted in the question are significantly slower to compare than others.


    Here is a summary of the checks that are performed by the comparison function.

    Let v be a float and cast it as a C double. Now, if w is also a float:

    • Check whether w is nan or inf. If so, handle this special case separately depending on the type of w.

    • If not, compare v and w directly by their representations as C doubles.

    If w is an integer:

    • Extract the signs of v and w. If they are different then we know v and w are different and which is the greater value.

    • (The signs are the same.) Check whether w has too many bits to be a float (more than size_t). If so, w has greater magnitude than v.

    • Check if w has 48 or fewer bits. If so, it can be safely cast to a C double without losing its precision and compared with v.

    • (w has more than 48 bits. We will now treat w as a positive integer having changed the compare op as appropriate.)

    • Consider the exponent of the float v. If the exponent is negative, then v is less than 1 and therefore less than any positive integer. Else, if the exponent is less than the number of bits in w then it must be less than w.

    • If the exponent of v is greater than the number of bits in w then v is greater than w.

    • (The exponent is the same as the number of bits in w.)

    • The final check. Split v into its integer and fractional parts. Double the integer part and add 1 to compensate for the fractional part. Now double the integer w. Compare these two new integers instead to get the result.

    Answer #8

    To somewhat expand on the earlier answers here, there are a number of details which are commonly overlooked.

    • Prefer subprocess.run() over subprocess.check_call() and friends over subprocess.call() over subprocess.Popen() over os.system() over os.popen()
    • Understand and probably use text=True, aka universal_newlines=True.
    • Understand the meaning of shell=True or shell=False and how it changes quoting and the availability of shell conveniences.
    • Understand differences between sh and Bash
    • Understand how a subprocess is separate from its parent, and generally cannot change the parent.
    • Avoid running the Python interpreter as a subprocess of Python.

    These topics are covered in some more detail below.

    Prefer subprocess.run() or subprocess.check_call()

    The subprocess.Popen() function is a low-level workhorse but it is tricky to use correctly and you end up copy/pasting multiple lines of code ... which conveniently already exist in the standard library as a set of higher-level wrapper functions for various purposes, which are presented in more detail in the following.

    Here's a paragraph from the documentation:

    The recommended approach to invoking subprocesses is to use the run() function for all use cases it can handle. For more advanced use cases, the underlying Popen interface can be used directly.

    Unfortunately, the availability of these wrapper functions differs between Python versions.

    • subprocess.run() was officially introduced in Python 3.5. It is meant to replace all of the following.
    • subprocess.check_output() was introduced in Python 2.7 / 3.1. It is basically equivalent to subprocess.run(..., check=True, stdout=subprocess.PIPE).stdout
    • subprocess.check_call() was introduced in Python 2.5. It is basically equivalent to subprocess.run(..., check=True)
    • subprocess.call() was introduced in Python 2.4 in the original subprocess module (PEP-324). It is basically equivalent to subprocess.run(...).returncode

    High-level API vs subprocess.Popen()

    The refactored and extended subprocess.run() is more logical and more versatile than the older legacy functions it replaces. It returns a CompletedProcess object with various attributes from which you can retrieve the exit status, the standard output, and a few other results and status indicators from the finished subprocess.

    subprocess.run() is the way to go if you simply need a program to run and return control to Python. For more involved scenarios (background processes, perhaps with interactive I/O with the Python parent program) you still need to use subprocess.Popen() and take care of all the plumbing yourself. This requires a fairly intricate understanding of all the moving parts and should not be undertaken lightly. The simpler Popen object represents the (possibly still-running) process which needs to be managed from your code for the remainder of the lifetime of the subprocess.

    It should perhaps be emphasized that subprocess.Popen() merely creates a process. If you leave it at that, you have a subprocess running concurrently alongside Python, so a "background" process. If it doesn't need to do input or output or otherwise coordinate with you, it can do useful work in parallel with your Python program.
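
    A minimal sketch of that "background process" pattern (the Unix sleep command stands in for any long-running job):

    import subprocess

    proc = subprocess.Popen(["sleep", "5"])   # starts the child and returns immediately
    # ... Python keeps doing useful work here while the child runs ...
    proc.wait()                               # eventually collect the exit status
    print(proc.returncode)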

    Avoid os.system() and os.popen()

    Since time eternal (well, since Python 2.5) the os module documentation has contained the recommendation to prefer subprocess over os.system():

    The subprocess module provides more powerful facilities for spawning new processes and retrieving their results; using that module is preferable to using this function.

    The problems with system() are that it's obviously system-dependent and doesn't offer ways to interact with the subprocess. It simply runs, with standard output and standard error outside of Python's reach. The only information Python receives back is the exit status of the command (zero means success, though the meaning of non-zero values is also somewhat system-dependent).

    PEP-324 (which was already mentioned above) contains a more detailed rationale for why os.system is problematic and how subprocess attempts to solve those issues.

    os.popen() used to be even more strongly discouraged:

    Deprecated since version 2.6: This function is obsolete. Use the subprocess module.

    However, since sometime in Python 3, it has been reimplemented to simply use subprocess, and redirects to the subprocess.Popen() documentation for details.

    Understand and usually use check=True

    You'll also notice that subprocess.call() has many of the same limitations as os.system(). In regular use, you should generally check whether the process finished successfully, which subprocess.check_call() and subprocess.check_output() do (where the latter also returns the standard output of the finished subprocess). Similarly, you should usually use check=True with subprocess.run() unless you specifically need to allow the subprocess to return an error status.

    In practice, with check=True or subprocess.check_*, Python will raise a CalledProcessError exception if the subprocess returns a nonzero exit status.

    A common error with subprocess.run() is to omit check=True and be surprised when downstream code fails if the subprocess failed.

    On the other hand, a common problem with check_call() and check_output() was that users who blindly used these functions were surprised when the exception was raised e.g. when grep did not find a match. (You should probably replace grep with native Python code anyway, as outlined below.)

    All things counted, you need to understand how shell commands return an exit code, and under what conditions they will return a non-zero (error) exit code, and make a conscious decision how exactly it should be handled.
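
    As a sketch, that conscious decision often looks like this (the file name is invented for the example; grep exits 1 for "no match" and 2 for genuine errors):

    import subprocess

    try:
        subprocess.run(["grep", "needle", "haystack.txt"], check=True)
    except subprocess.CalledProcessError as exc:
        if exc.returncode != 1:   # accept "no match", re-raise real failures
            raise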

    Understand and probably use text=True aka universal_newlines=True

    Since Python 3, strings internal to Python are Unicode strings. But there is no guarantee that a subprocess generates Unicode output, or strings at all.

    (If the differences are not immediately obvious, Ned Batchelder's Pragmatic Unicode is recommended, if not outright obligatory, reading. There is a 36-minute video presentation behind the link if you prefer, though reading the page yourself will probably take significantly less time.)

    Deep down, Python has to fetch a bytes buffer and interpret it somehow. If it contains a blob of binary data, it shouldn't be decoded into a Unicode string, because that's error-prone and bug-inducing behavior - precisely the sort of pesky behavior which riddled many Python 2 scripts, before there was a way to properly distinguish between encoded text and binary data.

    With text=True, you tell Python that you, in fact, expect back textual data in the system's default encoding, and that it should be decoded into a Python (Unicode) string to the best of Python's ability (usually UTF-8 on any moderately up to date system, except perhaps Windows?)

    If that's not what you request back, Python will just give you bytes strings in the stdout and stderr strings. Maybe at some later point you do know that they were text strings after all, and you know their encoding. Then, you can decode them.

    normal = subprocess.run([external, arg],
        stdout=subprocess.PIPE, stderr=subprocess.PIPE,
        check=True,
        text=True)
    print(normal.stdout)
    
    convoluted = subprocess.run([external, arg],
        stdout=subprocess.PIPE, stderr=subprocess.PIPE,
        check=True)
    # You have to know (or guess) the encoding
    print(convoluted.stdout.decode("utf-8"))
    

    Python 3.7 introduced the shorter and more descriptive and understandable alias text for the keyword argument which was previously somewhat misleadingly called universal_newlines.

    Understand shell=True vs shell=False

    With shell=True you pass a single string to your shell, and the shell takes it from there.

    With shell=False you pass a list of arguments to the OS, bypassing the shell.

    When you don't have a shell, you save a process and get rid of a fairly substantial amount of hidden complexity, which may or may not harbor bugs or even security problems.

    On the other hand, when you don't have a shell, you don't have redirection, wildcard expansion, job control, and a large number of other shell features.

    A common mistake is to use shell=True and then still pass Python a list of tokens, or vice versa. This happens to work in some cases, but is really ill-defined and could break in interesting ways.

    # XXX AVOID THIS BUG
    buggy = subprocess.run("dig +short stackoverflow.com")
    
    # XXX AVOID THIS BUG TOO
    broken = subprocess.run(["dig", "+short", "stackoverflow.com"],
        shell=True)
    
    # XXX DEFINITELY AVOID THIS
    pathological = subprocess.run(["dig +short stackoverflow.com"],
        shell=True)
    
    correct = subprocess.run(["dig", "+short", "stackoverflow.com"],
        # Probably don't forget these, too
        check=True, text=True)
    
    # XXX Probably better avoid shell=True
    # but this is nominally correct
    fixed_but_fugly = subprocess.run("dig +short stackoverflow.com",
        shell=True,
        # Probably don't forget these, too
        check=True, text=True)
    

    The common retort "but it works for me" is not a useful rebuttal unless you understand exactly under what circumstances it could stop working.

    Refactoring Example

    Very often, the features of the shell can be replaced with native Python code. Simple Awk or sed scripts should probably simply be translated to Python instead.

    To partially illustrate this, here is a typical but slightly silly example which involves many shell features.

    cmd = """while read -r x;
       do ping -c 3 "$x" | grep "round-trip min/avg/max"
       done <hosts.txt"""
    
    # Trivial but horrible
    results = subprocess.run(
        cmd, shell=True, universal_newlines=True, check=True,
        stdout=subprocess.PIPE)
    print(results.stdout)
    
    # Reimplement with shell=False
    with open("hosts.txt") as hosts:
        for host in hosts:
            host = host.rstrip("\n")  # drop newline
            ping = subprocess.run(
                 ["ping", "-c", "3", host],
                 text=True,
                 stdout=subprocess.PIPE,
                 check=True)
            for line in ping.stdout.split("\n"):
                 if "round-trip min/avg/max" in line:
                     print("{}: {}".format(host, line))
    

    Some things to note here:

    • With shell=False you don't need the quoting that the shell requires around strings. Putting quotes anyway is probably an error.
    • It often makes sense to run as little code as possible in a subprocess. This gives you more control over execution from within your Python code.
    • Having said that, complex shell pipelines are tedious and sometimes challenging to reimplement in Python.

    The refactored code also illustrates just how much the shell really does for you with a very terse syntax -- for better or for worse. Python says explicit is better than implicit, but the Python code is rather verbose and arguably looks more complex than it really is. On the other hand, it offers a number of points where you can grab control in the middle of something else, as trivially exemplified by the enhancement that we can easily include the host name along with the shell command output. (This is by no means challenging to do in the shell, either, but at the expense of yet another diversion and perhaps another process.)

    Common Shell Constructs

    For completeness, here are brief explanations of some of these shell features, and some notes on how they can perhaps be replaced with native Python facilities.

    • Globbing aka wildcard expansion can be replaced with glob.glob() or very often with simple Python string comparisons like for file in os.listdir("."): if not file.endswith(".png"): continue. Bash has various other expansion facilities like .{png,jpg} brace expansion and {1..100} as well as tilde expansion (~ expands to your home directory, and more generally ~account to the home directory of another user)
    • Shell variables like $SHELL or $my_exported_var can sometimes simply be replaced with Python variables. Exported shell variables are available as e.g. os.environ["SHELL"] (the meaning of export is to make the variable available to subprocesses -- a variable which is not available to subprocesses will obviously not be available to Python running as a subprocess of the shell, or vice versa. The env= keyword argument to subprocess methods allows you to define the environment of the subprocess as a dictionary, so that's one way to make a Python variable visible to a subprocess). With shell=False you will need to understand how to remove any quotes; for example, cd "$HOME" is equivalent to os.chdir(os.environ["HOME"]) without quotes around the directory name. (Very often cd is not useful or necessary anyway, and many beginners omit the double quotes around the variable and get away with it until one day ...) A small sketch of this and the globbing replacement appears after this list.
    • Redirection allows you to read from a file as your standard input, and write your standard output to a file. grep "foo" <inputfile >outputfile opens outputfile for writing and inputfile for reading, and passes its contents as standard input to grep, whose standard output then lands in outputfile. This is not generally hard to replace with native Python code.
    • Pipelines are a form of redirection. echo foo | nl runs two subprocesses, where the standard output of echo is the standard input of nl (on the OS level, in Unix-like systems, this is a single file handle). If you cannot replace one or both ends of the pipeline with native Python code, perhaps think about using a shell after all, especially if the pipeline has more than two or three processes (though look at the pipes module in the Python standard library or a number of more modern and versatile third-party competitors).
    • Job control lets you interrupt jobs, run them in the background, return them to the foreground, etc. The basic Unix signals to stop and continue a process are of course available from Python, too. But jobs are a higher-level abstraction in the shell which involve process groups etc which you have to understand if you want to do something like this from Python.
    • Quoting in the shell is potentially confusing until you understand that everything is basically a string. So ls -l / is equivalent to "ls" "-l" "/" but the quoting around literals is completely optional. Unquoted strings which contain shell metacharacters undergo parameter expansion, whitespace tokenization and wildcard expansion; double quotes prevent whitespace tokenization and wildcard expansion but allow parameter expansions (variable substitution, command substitution, and backslash processing). This is simple in theory but can get bewildering, especially when there are several layers of interpretation (a remote shell command, for example).
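
    To make a couple of the items above concrete, here is a small sketch of the globbing and environment-variable replacements (patterns and paths invented for the example):

    import glob
    import os

    # Shell: for f in *.png; do ...; done
    for f in glob.glob("*.png"):
        print(f)

    # Shell: echo "$HOME"  and  cd "$HOME"
    print(os.environ["HOME"])
    os.chdir(os.environ["HOME"])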

    Understand differences between sh and Bash

    subprocess runs your shell commands with /bin/sh unless you specifically request otherwise (except of course on Windows, where it uses the value of the COMSPEC variable). This means that various Bash-only features like arrays, [[ etc are not available.

    If you need to use Bash-only syntax, you can pass in the path to the shell as executable="/bin/bash" (where of course if your Bash is installed somewhere else, you need to adjust the path).

    subprocess.run("""
        # This for loop syntax is Bash only
        for((i=1;i<=$#;i++)); do
            # Arrays are Bash-only
            array[i]+=123
        done""",
        shell=True, check=True,
        executable="/bin/bash")
    

    A subprocess is separate from its parent, and cannot change it

    A somewhat common mistake is doing something like

    subprocess.run("cd /tmp", shell=True)
    subprocess.run("pwd", shell=True)  # Oops, doesn"t print /tmp
    

    The same thing will happen if the first subprocess tries to set an environment variable, which of course will have disappeared when you run another subprocess, etc.

    A child process runs completely separate from Python, and when it finishes, Python has no idea what it did (apart from the vague indicators that it can infer from the exit status and output from the child process). A child generally cannot change the parent's environment; it cannot set a variable, change the working directory, or, in so many words, communicate with its parent without cooperation from the parent.

    The immediate fix in this particular case is to run both commands in a single subprocess:

    subprocess.run("cd /tmp; pwd", shell=True)
    

    though obviously this particular use case isn't very useful; instead, use the cwd keyword argument (see the sketch at the end of this section), or simply os.chdir() before running the subprocess. Similarly, for setting a variable, you can manipulate the environment of the current process (and thus also its children) via

    os.environ["foo"] = "bar"
    

    or pass an environment setting to a child process with

    subprocess.run("echo "$foo"", shell=True, env={"foo": "bar"})
    

    (not to mention the obvious refactoring subprocess.run(["echo", "bar"]); but echo is a poor example of something to run in a subprocess in the first place, of course).
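
    As a concrete illustration of the cwd keyword argument mentioned above, here is a minimal sketch (assuming a Unix-like system where /tmp and the pwd command exist):

    import subprocess

    # The child process runs with /tmp as its working directory;
    # the working directory of the Python process itself is unchanged.
    result = subprocess.run(["pwd"], cwd="/tmp", capture_output=True, text=True)
    print(result.stdout)  # typically prints /tmp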

    Don"t run Python from Python

    This is slightly dubious advice; there are certainly situations where it does make sense or is even an absolute requirement to run the Python interpreter as a subprocess from a Python script. But very frequently, the correct approach is simply to import the other Python module into your calling script and call its functions directly.
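
    For example, rather than spawning a new interpreter with subprocess.run([sys.executable, "otherscript.py"]), the calling script can often just import it. A minimal sketch, where otherscript and its main function are hypothetical names:

    # otherscript.py (hypothetical module under your control)
    def main():
        print("doing the real work")

    if __name__ == "__main__":
        main()

    # calling script: import the module and call its function directly
    import otherscript
    otherscript.main()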

    If the other Python script is under your control, and it isn't a module, consider turning it into one. (This answer is too long already, so I will not delve into details here.)

    If you need parallelism, you can run Python functions in subprocesses with the multiprocessing module. There is also threading, which runs multiple tasks in a single process (which is more lightweight and gives you more control, but is also more constrained in that threads within a process are tightly coupled, and bound to a single GIL).
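
    A minimal sketch of the multiprocessing approach (the square worker function is only an illustration):

    import multiprocessing

    def square(n):
        return n * n

    if __name__ == "__main__":
        # Run the function across a pool of four worker processes
        with multiprocessing.Pool(4) as pool:
            print(pool.map(square, range(10)))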

    Answer #9

    urllib has been split up in Python 3.

    The urllib.urlencode() function is now urllib.parse.urlencode(), and the urllib.urlopen() function is now urllib.request.urlopen().
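
    So code which used those functions on Python 2 would be written like this on Python 3 (the URL is just a placeholder):

    from urllib.parse import urlencode
    from urllib.request import urlopen

    params = urlencode({"q": "python"})
    with urlopen("https://example.com/search?" + params) as response:
        body = response.read()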

    Answer #10

    Distribution Fitting with Sum of Square Error (SSE)

    This is an update and modification to Saullo's answer, which uses the full list of the current scipy.stats distributions and returns the distribution with the least SSE between the distribution's histogram and the data's histogram.

    Example Fitting

    Using the El Niño dataset from statsmodels, the distributions are fitted and the error is determined. The distribution with the least error is returned.

    [Figure: All Fitted Distributions]

    [Figure: Best Fit Distribution]

    Example Code

    %matplotlib inline
    
    import warnings
    import numpy as np
    import pandas as pd
    import scipy.stats as st
    import statsmodels.api as sm
    from scipy.stats._continuous_distns import _distn_names
    import matplotlib
    import matplotlib.pyplot as plt
    
    matplotlib.rcParams["figure.figsize"] = (16.0, 12.0)
    matplotlib.style.use("ggplot")
    
    # Create models from data
    def best_fit_distribution(data, bins=200, ax=None):
        """Model data by finding best fit distribution to data"""
        # Get histogram of original data
        y, x = np.histogram(data, bins=bins, density=True)
        x = (x + np.roll(x, -1))[:-1] / 2.0
    
        # Best holders
        best_distributions = []
    
        # Estimate distribution parameters from data
        for ii, distribution in enumerate([d for d in _distn_names if d not in ["levy_stable", "studentized_range"]]):
    
            print("{:>3} / {:<3}: {}".format( ii+1, len(_distn_names), distribution ))
    
            distribution = getattr(st, distribution)
    
            # Try to fit the distribution
            try:
                # Ignore warnings from data that can't be fit
                with warnings.catch_warnings():
                    warnings.filterwarnings("ignore")
                    
                    # fit dist to data
                    params = distribution.fit(data)
    
                    # Separate parts of parameters
                    arg = params[:-2]
                    loc = params[-2]
                    scale = params[-1]
                    
                    # Calculate fitted PDF and error with fit in distribution
                    pdf = distribution.pdf(x, loc=loc, scale=scale, *arg)
                    sse = np.sum(np.power(y - pdf, 2.0))
                    
                    # if an axis was passed in, add this fitted PDF to the plot
                    try:
                        if ax:
                            pd.Series(pdf, x).plot(ax=ax)
                    except Exception:
                        pass
    
                    # record this distribution and its SSE
                    best_distributions.append((distribution, params, sse))
            
            except Exception:
                pass
    
        
        return sorted(best_distributions, key=lambda x:x[2])
    
    def make_pdf(dist, params, size=10000):
        """Generate distributions"s Probability Distribution Function """
    
        # Separate parts of parameters
        arg = params[:-2]
        loc = params[-2]
        scale = params[-1]
    
        # Get sane start and end points of distribution
        start = dist.ppf(0.01, *arg, loc=loc, scale=scale) if arg else dist.ppf(0.01, loc=loc, scale=scale)
        end = dist.ppf(0.99, *arg, loc=loc, scale=scale) if arg else dist.ppf(0.99, loc=loc, scale=scale)
    
        # Build PDF and turn into pandas Series
        x = np.linspace(start, end, size)
        y = dist.pdf(x, loc=loc, scale=scale, *arg)
        pdf = pd.Series(y, x)
    
        return pdf
    
    # Load data from statsmodels datasets
    data = pd.Series(sm.datasets.elnino.load_pandas().data.set_index("YEAR").values.ravel())
    
    # Plot for comparison
    plt.figure(figsize=(12,8))
    ax = data.plot(kind="hist", bins=50, density=True, alpha=0.5, color=list(matplotlib.rcParams["axes.prop_cycle"])[1]["color"])
    
    # Save plot limits
    dataYLim = ax.get_ylim()
    
    # Find best fit distribution
    best_distributions = best_fit_distribution(data, 200, ax)
    best_dist = best_distributions[0]
    
    # Update plots
    ax.set_ylim(dataYLim)
    ax.set_title(u"El Niño sea temp.
     All Fitted Distributions")
    ax.set_xlabel(u"Temp (°C)")
    ax.set_ylabel("Frequency")
    
    # Make PDF with best params 
    pdf = make_pdf(best_dist[0], best_dist[1])
    
    # Display
    plt.figure(figsize=(12,8))
    ax = pdf.plot(lw=2, label="PDF", legend=True)
    data.plot(kind="hist", bins=50, density=True, alpha=0.5, label="Data", legend=True, ax=ax)
    
    param_names = (best_dist[0].shapes + ", loc, scale").split(", ") if best_dist[0].shapes else ["loc", "scale"]
    param_str = ", ".join(["{}={:0.2f}".format(k,v) for k,v in zip(param_names, best_dist[1])])
    dist_str = "{}({})".format(best_dist[0].name, param_str)
    
    ax.set_title(u"El Niño sea temp. with best fit distribution 
    " + dist_str)
    ax.set_xlabel(u"Temp. (°C)")
    ax.set_ylabel("Frequency")