Python map function to find string with maximum number of ones

find | ones | Python Methods and Functions | String Variables

Examples:

Example Input: matrix = [[0, 1, 1, 1], [0, 0, 1, 1], [1, 1, 1, 1], [0 , 0, 0, 0]] Output: 2

We have a solution for this problem, please refer to Find the line with the maximum number 1 . We can quickly fix this problem in python using the

# Function to find a line with a maximum of 1

def maxOnes ( input ):

# function to sum the map in each line

# given matrix

# will return a list of the sums of all

# in each line, then find in cc max element

result = list ( map ( sum , input ))

print ( result.index ( max (result)))

# Driver program

if __ name__ = = "__ main __" :

input = [[ 0 , 1 , 1 , 1 ], [ 0 , 0 , 1 , 1 ], [ 1 , 1 , 1 , 1 ], [ 0 , 0 , 0 , 0 ]]

maxOnes ( input )

Difficulty: O (M * N)

Exit:

2

Finding the index of an item in a list

Given a list ["foo", "bar", "baz"] and an item in the list "bar", how do I get its index (1) in Python?

Find current directory and file"s directory

In Python, what commands can I use to find:

1. the current directory (where I was in the terminal when I ran the Python script), and
2. where the file I am executing is?

How to find if directory exists in Python

In the os module in Python, is there a way to find if a directory exists, something like:

>>> os.direxists(os.path.join(os.getcwd()), "new_folder")) # in pseudocode
True/False

How do I find the location of my Python site-packages directory?

Question by Daryl Spitzer

How do I find the location of my site-packages directory?

Find all files in a directory with extension .txt in Python

How can I find all the files in a directory having the extension .txt in python?

Find which version of package is installed with pip

Using pip, is it possible to figure out which version of a package is currently installed?

I know about pip install XYZ --upgrade but I am wondering if there is anything like pip info XYZ. If not what would be the best way to tell what version I am currently using.

error: Unable to find vcvarsall.bat

pip install dulwich

But I get a cryptic error message:

error: Unable to find vcvarsall.bat

The same happens if I try installing the package manually:

> python setup.py install
running build_ext
building "dulwich._objects" extension
error: Unable to find vcvarsall.bat

How to use glob() to find files recursively?

This is what I have:

glob(os.path.join("src","*.c"))

but I want to search the subfolders of src. Something like this would work:

glob(os.path.join("src","*.c"))
glob(os.path.join("src","*","*.c"))
glob(os.path.join("src","*","*","*.c"))
glob(os.path.join("src","*","*","*","*.c"))

But this is obviously limited and clunky.

Python: Find in list

I have come across this:

item = someSortOfSelection()
if item in myList:
doMySpecialFunction(item)

but sometimes it does not work with all my items, as if they weren"t recognized in the list (when it"s a list of string).

Is this the most "pythonic" way of finding an item in a list: if x in l:?

How to find out the number of CPUs using python

I want to know the number of CPUs on the local machine using Python. The result should be user/real as output by time(1) when called with an optimally scaling userspace-only program.

How to iterate over rows in a DataFrame in Pandas?

Iteration in Pandas is an anti-pattern and is something you should only do when you have exhausted every other option. You should not use any function with "iter" in its name for more than a few thousand rows or you will have to get used to a lot of waiting.

Do you want to print a DataFrame? Use DataFrame.to_string().

Do you want to compute something? In that case, search for methods in this order (list modified from here):

1. Vectorization
2. Cython routines
3. List Comprehensions (vanilla for loop)
4. DataFrame.apply(): i) ¬†Reductions that can be performed in Cython, ii) Iteration in Python space
5. DataFrame.itertuples() and iteritems()
6. DataFrame.iterrows()

iterrows and itertuples (both receiving many votes in answers to this question) should be used in very rare circumstances, such as generating row objects/nametuples for sequential processing, which is really the only thing these functions are useful for.

Appeal to Authority

The documentation page on iteration has a huge red warning box that says:

Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is not needed [...].

* It"s actually a little more complicated than "don"t". df.iterrows() is the correct answer to this question, but "vectorize your ops" is the better one. I will concede that there are circumstances where iteration cannot be avoided (for example, some operations where the result depends on the value computed for the previous row). However, it takes some familiarity with the library to know when. If you"re not sure whether you need an iterative solution, you probably don"t. PS: To know more about my rationale for writing this answer, skip to the very bottom.

Faster than Looping: Vectorization, Cython

A good number of basic operations and computations are "vectorised" by pandas (either through NumPy, or through Cythonized functions). This includes arithmetic, comparisons, (most) reductions, reshaping (such as pivoting), joins, and groupby operations. Look through the documentation on Essential Basic Functionality to find a suitable vectorised method for your problem.

If none exists, feel free to write your own using custom Cython extensions.

Next Best Thing: List Comprehensions*

List comprehensions should be your next port of call if 1) there is no vectorized solution available, 2) performance is important, but not important enough to go through the hassle of cythonizing your code, and 3) you"re trying to perform elementwise transformation on your code. There is a good amount of evidence to suggest that list comprehensions are sufficiently fast (and even sometimes faster) for many common Pandas tasks.

The formula is simple,

# Iterating over one column - `f` is some function that processes your data
result = [f(x) for x in df["col"]]
# Iterating over two columns, use `zip`
result = [f(x, y) for x, y in zip(df["col1"], df["col2"])]
# Iterating over multiple columns - same data type
result = [f(row[0], ..., row[n]) for row in df[["col1", ...,"coln"]].to_numpy()]
# Iterating over multiple columns - differing data type
result = [f(row[0], ..., row[n]) for row in zip(df["col1"], ..., df["coln"])]

If you can encapsulate your business logic into a function, you can use a list comprehension that calls it. You can make arbitrarily complex things work through the simplicity and speed of raw Python code.

Caveats

List comprehensions assume that your data is easy to work with - what that means is your data types are consistent and you don"t have NaNs, but this cannot always be guaranteed.

1. The first one is more obvious, but when dealing with NaNs, prefer in-built pandas methods if they exist (because they have much better corner-case handling logic), or ensure your business logic includes appropriate NaN handling logic.
2. When dealing with mixed data types you should iterate over zip(df["A"], df["B"], ...) instead of df[["A", "B"]].to_numpy() as the latter implicitly upcasts data to the most common type. As an example if A is numeric and B is string, to_numpy() will cast the entire array to string, which may not be what you want. Fortunately zipping your columns together is the most straightforward workaround to this.

*Your mileage may vary for the reasons outlined in the Caveats section above.

An Obvious Example

Let"s demonstrate the difference with a simple example of adding two pandas columns A + B. This is a vectorizable operaton, so it will be easy to contrast the performance of the methods discussed above.

Benchmarking code, for your reference. The line at the bottom measures a function written in numpandas, a style of Pandas that mixes heavily with NumPy to squeeze out maximum performance. Writing numpandas code should be avoided unless you know what you"re doing. Stick to the API where you can (i.e., prefer vec over vec_numpy).

I should mention, however, that it isn"t always this cut and dry. Sometimes the answer to "what is the best method for an operation" is "it depends on your data". My advice is to test out different approaches on your data before settling on one.

* Pandas string methods are "vectorized" in the sense that they are specified on the series but operate on each element. The underlying mechanisms are still iterative, because string operations are inherently hard to vectorize.

A common trend I notice from new users is to ask questions of the form "How can I iterate over my df to do X?". Showing code that calls iterrows() while doing something inside a for loop. Here is why. A new user to the library who has not been introduced to the concept of vectorization will likely envision the code that solves their problem as iterating over their data to do something. Not knowing how to iterate over a DataFrame, the first thing they do is Google it and end up here, at this question. They then see the accepted answer telling them how to, and they close their eyes and run this code without ever first questioning if iteration is not the right thing to do.

The aim of this answer is to help new users understand that iteration is not necessarily the solution to every problem, and that better, faster and more idiomatic solutions could exist, and that it is worth investing time in exploring them. I"m not trying to start a war of iteration vs. vectorization, but I want new users to be informed when developing solutions to their problems with this library.

In Python, what is the purpose of __slots__ and what are the cases one should avoid this?

TLDR:

The special attribute __slots__ allows you to explicitly state which instance attributes you expect your object instances to have, with the expected results:

1. faster attribute access.
2. space savings in memory.

The space savings is from

1. Storing value references in slots instead of __dict__.
2. Denying __dict__ and __weakref__ creation if parent classes deny them and you declare __slots__.

Quick Caveats

Small caveat, you should only declare a particular slot one time in an inheritance tree. For example:

class Base:
__slots__ = "foo", "bar"

class Right(Base):
__slots__ = "baz",

class Wrong(Base):
__slots__ = "foo", "bar", "baz"        # redundant foo and bar

Python doesn"t object when you get this wrong (it probably should), problems might not otherwise manifest, but your objects will take up more space than they otherwise should. Python 3.8:

>>> from sys import getsizeof
>>> getsizeof(Right()), getsizeof(Wrong())
(56, 72)

This is because the Base"s slot descriptor has a slot separate from the Wrong"s. This shouldn"t usually come up, but it could:

>>> w = Wrong()
>>> w.foo = "foo"
>>> Base.foo.__get__(w)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: foo
>>> Wrong.foo.__get__(w)
"foo"

The biggest caveat is for multiple inheritance - multiple "parent classes with nonempty slots" cannot be combined.

To accommodate this restriction, follow best practices: Factor out all but one or all parents" abstraction which their concrete class respectively and your new concrete class collectively will inherit from - giving the abstraction(s) empty slots (just like abstract base classes in the standard library).

See section on multiple inheritance below for an example.

Requirements:

• To have attributes named in __slots__ to actually be stored in slots instead of a __dict__, a class must inherit from object (automatic in Python 3, but must be explicit in Python 2).

• To prevent the creation of a __dict__, you must inherit from object and all classes in the inheritance must declare __slots__ and none of them can have a "__dict__" entry.

There are a lot of details if you wish to keep reading.

Why use __slots__: Faster attribute access.

The creator of Python, Guido van Rossum, states that he actually created __slots__ for faster attribute access.

It is trivial to demonstrate measurably significant faster access:

import timeit

class Foo(object): __slots__ = "foo",

class Bar(object): pass

slotted = Foo()
not_slotted = Bar()

def get_set_delete_fn(obj):
def get_set_delete():
obj.foo = "foo"
obj.foo
del obj.foo
return get_set_delete

and

>>> min(timeit.repeat(get_set_delete_fn(slotted)))
0.2846834529991611
>>> min(timeit.repeat(get_set_delete_fn(not_slotted)))
0.3664822799983085

The slotted access is almost 30% faster in Python 3.5 on Ubuntu.

>>> 0.3664822799983085 / 0.2846834529991611
1.2873325658284342

In Python 2 on Windows I have measured it about 15% faster.

Why use __slots__: Memory Savings

Another purpose of __slots__ is to reduce the space in memory that each object instance takes up.

The space saved over using __dict__ can be significant.

SQLAlchemy attributes a lot of memory savings to __slots__.

To verify this, using the Anaconda distribution of Python 2.7 on Ubuntu Linux, with guppy.hpy (aka heapy) and sys.getsizeof, the size of a class instance without __slots__ declared, and nothing else, is 64 bytes. That does not include the __dict__. Thank you Python for lazy evaluation again, the __dict__ is apparently not called into existence until it is referenced, but classes without data are usually useless. When called into existence, the __dict__ attribute is a minimum of 280 bytes additionally.

In contrast, a class instance with __slots__ declared to be () (no data) is only 16 bytes, and 56 total bytes with one item in slots, 64 with two.

For 64 bit Python, I illustrate the memory consumption in bytes in Python 2.7 and 3.6, for __slots__ and __dict__ (no slots defined) for each point where the dict grows in 3.6 (except for 0, 1, and 2 attributes):

Python 2.7             Python 3.6
attrs  __slots__  __dict__*   __slots__  __dict__* | *(no slots defined)
none   16         56 + 272‚Ä†   16         56 + 112‚Ä† | ‚Ä†if __dict__ referenced
one    48         56 + 272    48         56 + 112
two    56         56 + 272    56         56 + 112
six    88         56 + 1040   88         56 + 152
11     128        56 + 1040   128        56 + 240
22     216        56 + 3344   216        56 + 408
43     384        56 + 3344   384        56 + 752

So, in spite of smaller dicts in Python 3, we see how nicely __slots__ scale for instances to save us memory, and that is a major reason you would want to use __slots__.

Just for completeness of my notes, note that there is a one-time cost per slot in the class"s namespace of 64 bytes in Python 2, and 72 bytes in Python 3, because slots use data descriptors like properties, called "members".

>>> Foo.foo
<member "foo" of "Foo" objects>
>>> type(Foo.foo)
<class "member_descriptor">
>>> getsizeof(Foo.foo)
72

Demonstration of __slots__:

To deny the creation of a __dict__, you must subclass object. Everything subclasses object in Python 3, but in Python 2 you had to be explicit:

class Base(object):
__slots__ = ()

now:

>>> b = Base()
>>> b.a = "a"
Traceback (most recent call last):
File "<pyshell#38>", line 1, in <module>
b.a = "a"
AttributeError: "Base" object has no attribute "a"

Or subclass another class that defines __slots__

class Child(Base):
__slots__ = ("a",)

and now:

c = Child()
c.a = "a"

but:

>>> c.b = "b"
Traceback (most recent call last):
File "<pyshell#42>", line 1, in <module>
c.b = "b"
AttributeError: "Child" object has no attribute "b"

To allow __dict__ creation while subclassing slotted objects, just add "__dict__" to the __slots__ (note that slots are ordered, and you shouldn"t repeat slots that are already in parent classes):

class SlottedWithDict(Child):
__slots__ = ("__dict__", "b")

swd = SlottedWithDict()
swd.a = "a"
swd.b = "b"
swd.c = "c"

and

>>> swd.__dict__
{"c": "c"}

Or you don"t even need to declare __slots__ in your subclass, and you will still use slots from the parents, but not restrict the creation of a __dict__:

class NoSlots(Child): pass
ns = NoSlots()
ns.a = "a"
ns.b = "b"

And:

>>> ns.__dict__
{"b": "b"}

However, __slots__ may cause problems for multiple inheritance:

class BaseA(object):
__slots__ = ("a",)

class BaseB(object):
__slots__ = ("b",)

Because creating a child class from parents with both non-empty slots fails:

>>> class Child(BaseA, BaseB): __slots__ = ()
Traceback (most recent call last):
File "<pyshell#68>", line 1, in <module>
class Child(BaseA, BaseB): __slots__ = ()
TypeError: Error when calling the metaclass bases
multiple bases have instance lay-out conflict

If you run into this problem, You could just remove __slots__ from the parents, or if you have control of the parents, give them empty slots, or refactor to abstractions:

from abc import ABC

class AbstractA(ABC):
__slots__ = ()

class BaseA(AbstractA):
__slots__ = ("a",)

class AbstractB(ABC):
__slots__ = ()

class BaseB(AbstractB):
__slots__ = ("b",)

class Child(AbstractA, AbstractB):
__slots__ = ("a", "b")

c = Child() # no problem!

Add "__dict__" to __slots__ to get dynamic assignment:

class Foo(object):
__slots__ = "bar", "baz", "__dict__"

and now:

>>> foo = Foo()
>>> foo.boink = "boink"

So with "__dict__" in slots we lose some of the size benefits with the upside of having dynamic assignment and still having slots for the names we do expect.

When you inherit from an object that isn"t slotted, you get the same sort of semantics when you use __slots__ - names that are in __slots__ point to slotted values, while any other values are put in the instance"s __dict__.

Avoiding __slots__ because you want to be able to add attributes on the fly is actually not a good reason - just add "__dict__" to your __slots__ if this is required.

You can similarly add __weakref__ to __slots__ explicitly if you need that feature.

Set to empty tuple when subclassing a namedtuple:

The namedtuple builtin make immutable instances that are very lightweight (essentially, the size of tuples) but to get the benefits, you need to do it yourself if you subclass them:

from collections import namedtuple
class MyNT(namedtuple("MyNT", "bar baz")):
"""MyNT is an immutable and lightweight object"""
__slots__ = ()

usage:

>>> nt = MyNT("bar", "baz")
>>> nt.bar
"bar"
>>> nt.baz
"baz"

And trying to assign an unexpected attribute raises an AttributeError because we have prevented the creation of __dict__:

>>> nt.quux = "quux"
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: "MyNT" object has no attribute "quux"

You can allow __dict__ creation by leaving off __slots__ = (), but you can"t use non-empty __slots__ with subtypes of tuple.

Biggest Caveat: Multiple inheritance

Even when non-empty slots are the same for multiple parents, they cannot be used together:

class Foo(object):
__slots__ = "foo", "bar"
class Bar(object):
__slots__ = "foo", "bar" # alas, would work if empty, i.e. ()

>>> class Baz(Foo, Bar): pass
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: Error when calling the metaclass bases
multiple bases have instance lay-out conflict

Using an empty __slots__ in the parent seems to provide the most flexibility, allowing the child to choose to prevent or allow (by adding "__dict__" to get dynamic assignment, see section above) the creation of a __dict__:

class Foo(object): __slots__ = ()
class Bar(object): __slots__ = ()
class Baz(Foo, Bar): __slots__ = ("foo", "bar")
b = Baz()
b.foo, b.bar = "foo", "bar"

You don"t have to have slots - so if you add them, and remove them later, it shouldn"t cause any problems.

Going out on a limb here: If you"re composing mixins or using abstract base classes, which aren"t intended to be instantiated, an empty __slots__ in those parents seems to be the best way to go in terms of flexibility for subclassers.

To demonstrate, first, let"s create a class with code we"d like to use under multiple inheritance

class AbstractBase:
__slots__ = ()
def __init__(self, a, b):
self.a = a
self.b = b
def __repr__(self):
return f"{type(self).__name__}({repr(self.a)}, {repr(self.b)})"

We could use the above directly by inheriting and declaring the expected slots:

class Foo(AbstractBase):
__slots__ = "a", "b"

But we don"t care about that, that"s trivial single inheritance, we need another class we might also inherit from, maybe with a noisy attribute:

class AbstractBaseC:
__slots__ = ()
@property
def c(self):
print("getting c!")
return self._c
@c.setter
def c(self, arg):
print("setting c!")
self._c = arg

Now if both bases had nonempty slots, we couldn"t do the below. (In fact, if we wanted, we could have given AbstractBase nonempty slots a and b, and left them out of the below declaration - leaving them in would be wrong):

class Concretion(AbstractBase, AbstractBaseC):
__slots__ = "a b _c".split()

And now we have functionality from both via multiple inheritance, and can still deny __dict__ and __weakref__ instantiation:

>>> c = Concretion("a", "b")
>>> c.c = c
setting c!
>>> c.c
getting c!
Concretion("a", "b")
>>> c.d = "d"
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: "Concretion" object has no attribute "d"

Other cases to avoid slots:

• Avoid them when you want to perform __class__ assignment with another class that doesn"t have them (and you can"t add them) unless the slot layouts are identical. (I am very interested in learning who is doing this and why.)
• Avoid them if you want to subclass variable length builtins like long, tuple, or str, and you want to add attributes to them.
• Avoid them if you insist on providing default values via class attributes for instance variables.

You may be able to tease out further caveats from the rest of the __slots__ documentation (the 3.7 dev docs are the most current), which I have made significant recent contributions to.

The current top answers cite outdated information and are quite hand-wavy and miss the mark in some important ways.

Do not "only use __slots__ when instantiating lots of objects"

I quote:

"You would want to use __slots__ if you are going to instantiate a lot (hundreds, thousands) of objects of the same class."

Abstract Base Classes, for example, from the collections module, are not instantiated, yet __slots__ are declared for them.

Why?

If a user wishes to deny __dict__ or __weakref__ creation, those things must not be available in the parent classes.

__slots__ contributes to reusability when creating interfaces or mixins.

It is true that many Python users aren"t writing for reusability, but when you are, having the option to deny unnecessary space usage is valuable.

__slots__ doesn"t break pickling

When pickling a slotted object, you may find it complains with a misleading TypeError:

TypeError: a class that defines __slots__ without defining __getstate__ cannot be pickled

This is actually incorrect. This message comes from the oldest protocol, which is the default. You can select the latest protocol with the -1 argument. In Python 2.7 this would be 2 (which was introduced in 2.3), and in 3.6 it is 4.

<__main__.Foo object at 0x1129C770>

in Python 2.7:

<__main__.Foo object at 0x1129C770>

in Python 3.6

<__main__.Foo object at 0x1129C770>

So I would keep this in mind, as it is a solved problem.

Critique of the (until Oct 2, 2016) accepted answer

The first paragraph is half short explanation, half predictive. Here"s the only part that actually answers the question

The proper use of __slots__ is to save space in objects. Instead of having a dynamic dict that allows adding attributes to objects at anytime, there is a static structure which does not allow additions after creation. This saves the overhead of one dict for every object that uses slots

The second half is wishful thinking, and off the mark:

While this is sometimes a useful optimization, it would be completely unnecessary if the Python interpreter was dynamic enough so that it would only require the dict when there actually were additions to the object.

Python actually does something similar to this, only creating the __dict__ when it is accessed, but creating lots of objects with no data is fairly ridiculous.

The second paragraph oversimplifies and misses actual reasons to avoid __slots__. The below is not a real reason to avoid slots (for actual reasons, see the rest of my answer above.):

They change the behavior of the objects that have slots in a way that can be abused by control freaks and static typing weenies.

It then goes on to discuss other ways of accomplishing that perverse goal with Python, not discussing anything to do with __slots__.

The third paragraph is more wishful thinking. Together it is mostly off-the-mark content that the answerer didn"t even author and contributes to ammunition for critics of the site.

Memory usage evidence

Create some normal objects and slotted objects:

>>> class Foo(object): pass
>>> class Bar(object): __slots__ = ()

Instantiate a million of them:

>>> foos = [Foo() for f in xrange(1000000)]
>>> bars = [Bar() for b in xrange(1000000)]

Inspect with guppy.hpy().heap():

>>> guppy.hpy().heap()
Partition of a set of 2028259 objects. Total size = 99763360 bytes.
Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
0 1000000  49 64000000  64  64000000  64 __main__.Foo
1     169   0 16281480  16  80281480  80 list
2 1000000  49 16000000  16  96281480  97 __main__.Bar
3   12284   1   987472   1  97268952  97 str
...

Access the regular objects and their __dict__ and inspect again:

>>> for f in foos:
...     f.__dict__
>>> guppy.hpy().heap()
Partition of a set of 3028258 objects. Total size = 379763480 bytes.
Index  Count   %      Size    % Cumulative  % Kind (class / dict of class)
0 1000000  33 280000000  74 280000000  74 dict of __main__.Foo
1 1000000  33  64000000  17 344000000  91 __main__.Foo
2     169   0  16281480   4 360281480  95 list
3 1000000  33  16000000   4 376281480  99 __main__.Bar
4   12284   0    987472   0 377268952  99 str
...

This is consistent with the history of Python, from Unifying types and classes in Python 2.2

If you subclass a built-in type, extra space is automatically added to the instances to accomodate __dict__ and __weakrefs__. (The __dict__ is not initialized until you use it though, so you shouldn"t worry about the space occupied by an empty dictionary for each instance you create.) If you don"t need this extra space, you can add the phrase "__slots__ = []" to your class.

os.listdir() - list in the current directory

With listdir in os module you get the files and the folders in the current dir

import os
arr = os.listdir()
print(arr)

>>> ["\$RECYCLE.BIN", "work.txt", "3ebooks.txt", "documents"]

Looking in a directory

arr = os.listdir("c:\files")

glob from glob

with glob you can specify a type of file to list like this

import glob

txtfiles = []
for file in glob.glob("*.txt"):
txtfiles.append(file)

glob in a list comprehension

mylist = [f for f in glob.glob("*.txt")]

get the full path of only files in the current directory

import os
from os import listdir
from os.path import isfile, join

cwd = os.getcwd()
onlyfiles = [os.path.join(cwd, f) for f in os.listdir(cwd) if
os.path.isfile(os.path.join(cwd, f))]
print(onlyfiles)

["G:\getfilesname\getfilesname.py", "G:\getfilesname\example.txt"]

Getting the full path name with os.path.abspath

You get the full path in return

import os
files_path = [os.path.abspath(x) for x in os.listdir()]
print(files_path)

["F:\documentiapplications.txt", "F:\documenticollections.txt"]

Walk: going through sub directories

os.walk returns the root, the directories list and the files list, that is why I unpacked them in r, d, f in the for loop; it, then, looks for other files and directories in the subfolders of the root and so on until there are no subfolders.

import os

# Getting the current work directory (cwd)
thisdir = os.getcwd()

# r=root, d=directories, f = files
for r, d, f in os.walk(thisdir):
for file in f:
if file.endswith(".docx"):
print(os.path.join(r, file))

os.listdir(): get files in the current directory (Python 2)

In Python 2, if you want the list of the files in the current directory, you have to give the argument as "." or os.getcwd() in the os.listdir method.

import os
arr = os.listdir(".")
print(arr)

>>> ["\$RECYCLE.BIN", "work.txt", "3ebooks.txt", "documents"]

To go up in the directory tree

# Method 1
x = os.listdir("..")

# Method 2
x= os.listdir("/")

Get files: os.listdir() in a particular directory (Python 2 and 3)

import os
arr = os.listdir("F:\python")
print(arr)

>>> ["\$RECYCLE.BIN", "work.txt", "3ebooks.txt", "documents"]

Get files of a particular subdirectory with os.listdir()

import os

x = os.listdir("./content")

os.walk(".") - current directory

import os
arr = next(os.walk("."))[2]
print(arr)

>>> ["5bs_Turismo1.pdf", "5bs_Turismo1.pptx", "esperienza.txt"]

next(os.walk(".")) and os.path.join("dir", "file")

import os
arr = []
for d,r,f in next(os.walk("F:\_python")):
for file in f:
arr.append(os.path.join(r,file))

for f in arr:
print(files)

>>> F:\_python\dict_class.py
>>> F:\_python\programmi.txt

next(os.walk("F:\") - get the full path - list comprehension

[os.path.join(r,file) for r,d,f in next(os.walk("F:\_python")) for file in f]

>>> ["F:\_python\dict_class.py", "F:\_python\programmi.txt"]

os.walk - get full path - all files in sub dirs**

x = [os.path.join(r,file) for r,d,f in os.walk("F:\_python") for file in f]
print(x)

os.listdir() - get only txt files

arr_txt = [x for x in os.listdir() if x.endswith(".txt")]
print(arr_txt)

>>> ["work.txt", "3ebooks.txt"]

Using glob to get the full path of the files

If I should need the absolute path of the files:

from path import path
from glob import glob
x = [path(f).abspath() for f in glob("F:\*.txt")]
for f in x:
print(f)

>>> F:acquistionline.txt
>>> F:acquisti_2018.txt
>>> F:ootstrap_jquery_ecc.txt

Using os.path.isfile to avoid directories in the list

import os.path
listOfFiles = [f for f in os.listdir() if os.path.isfile(f)]
print(listOfFiles)

>>> ["a simple game.py", "data.txt", "decorator.py"]

Using pathlib from Python 3.4

import pathlib

flist = []
for p in pathlib.Path(".").iterdir():
if p.is_file():
print(p)
flist.append(p)

>>> error.PNG
>>> exemaker.bat
>>> guiprova.mp3
>>> setup.py
>>> speak_gui2.py
>>> thumb.PNG

With list comprehension:

flist = [p for p in pathlib.Path(".").iterdir() if p.is_file()]

Alternatively, use pathlib.Path() instead of pathlib.Path(".")

Use glob method in pathlib.Path()

import pathlib

py = pathlib.Path().glob("*.py")
for file in py:
print(file)

>>> stack_overflow_list.py
>>> stack_overflow_list_tkinter.py

Get all and only files with os.walk

import os
x = [i[2] for i in os.walk(".")]
y=[]
for t in x:
for f in t:
y.append(f)
print(y)

>>> ["append_to_list.py", "data.txt", "data1.txt", "data2.txt", "data_180617", "os_walk.py", "READ2.py", "read_data.py", "somma_defaltdic.py", "substitute_words.py", "sum_data.py", "data.txt", "data1.txt", "data_180617"]

Get only files with next and walk in a directory

import os
x = next(os.walk("F://python"))[2]
print(x)

>>> ["calculator.bat","calculator.py"]

Get only directories with next and walk in a directory

import os
next(os.walk("F://python"))[1] # for the current dir use (".")

>>> ["python3","others"]

Get all the subdir names with walk

for r,d,f in os.walk("F:\_python"):
for dirs in d:
print(dirs)

>>> .vscode
>>> pyexcel
>>> pyschool.py
>>> subtitles
>>> _metaprogramming
>>> .ipynb_checkpoints

os.scandir() from Python 3.5 and greater

import os
x = [f.name for f in os.scandir() if f.is_file()]
print(x)

>>> ["calculator.bat","calculator.py"]

# Another example with scandir (a little variation from docs.python.org)
# This one is more efficient than os.listdir.
# In this case, it shows the files only in the current directory
# where the script is executed.

import os
with os.scandir() as i:
for entry in i:
if entry.is_file():
print(entry.name)

>>> ebookmaker.py
>>> error.PNG
>>> exemaker.bat
>>> guiprova.mp3
>>> setup.py
>>> speakgui4.py
>>> speak_gui2.py
>>> speak_gui3.py
>>> thumb.PNG

Examples:

Ex. 1: How many files are there in the subdirectories?

In this example, we look for the number of files that are included in all the directory and its subdirectories.

import os

def count(dir, counter=0):
"returns number of files in dir and subdirs"
for pack in os.walk(dir):
for f in pack[2]:
counter += 1
return dir + " : " + str(counter) + "files"

print(count("F:\python"))

>>> "F:\python" : 12057 files"

Ex.2: How to copy all files from a directory to another?

A script to make order in your computer finding all files of a type (default: pptx) and copying them in a new folder.

import os
import shutil
from path import path

destination = "F:\file_copied"
# os.makedirs(destination)

def copyfile(dir, filetype="pptx", counter=0):
"Searches for pptx (or other - pptx is the default) files and copies them"
for pack in os.walk(dir):
for f in pack[2]:
if f.endswith(filetype):
fullpath = pack[0] + "\" + f
print(fullpath)
shutil.copy(fullpath, destination)
counter += 1
if counter > 0:
print("-" * 30)
print("	==> Found in: `" + dir + "` : " + str(counter) + " files
")

for dir in os.listdir():
"searches for folders that starts with `_`"
if dir[0] == "_":
# copyfile(dir, filetype="pdf")
copyfile(dir, filetype="txt")

>>> _compiti18Compito Contabilit√† 1conti.txt
>>> _compiti18Compito Contabilit√† 1modula4.txt
>>> _compiti18Compito Contabilit√† 1moduloa4.txt
>>> ------------------------
>>> ==> Found in: `_compiti18` : 3 files

Ex. 3: How to get all the files in a txt file

In case you want to create a txt file with all the file names:

import os
mylist = ""
with open("filelist.txt", "w", encoding="utf-8") as file:
for eachfile in os.listdir():
mylist += eachfile + "
"
file.write(mylist)

Example: txt with all the files of an hard drive

"""
We are going to save a txt file with all the files in your directory.
We will use the function walk()
"""

import os

# see all the methods of os
# print(*dir(os), sep=", ")
listafile = []
percorso = []
with open("lista_file.txt", "w", encoding="utf-8") as testo:
for root, dirs, files in os.walk("D:\"):
for file in files:
listafile.append(file)
percorso.append(root + "\" + file)
testo.write(file + "
")
listafile.sort()
print("N. of files", len(listafile))
with open("lista_file_ordinata.txt", "w", encoding="utf-8") as testo_ordinato:
for file in listafile:
testo_ordinato.write(file + "
")

with open("percorso.txt", "w", encoding="utf-8") as file_percorso:
for file in percorso:
file_percorso.write(file + "
")

os.system("lista_file.txt")
os.system("lista_file_ordinata.txt")
os.system("percorso.txt")

All the file of C: in one text file

This is a shorter version of the previous code. Change the folder where to start finding the files if you need to start from another position. This code generate a 50 mb on text file on my computer with something less then 500.000 lines with files with the complete path.

import os

with open("file.txt", "w", encoding="utf-8") as filewrite:
for r, d, f in os.walk("C:\"):
for file in f:
filewrite.write(f"{r + file}
")

How to write a file with all paths in a folder of a type

With this function you can create a txt file that will have the name of a type of file that you look for (ex. pngfile.txt) with all the full path of all the files of that type. It can be useful sometimes, I think.

import os

def searchfiles(extension=".ttf", folder="H:\"):
"Create a txt file with all the file of a type"
with open(extension[1:] + "file.txt", "w", encoding="utf-8") as filewrite:
for r, d, f in os.walk(folder):
for file in f:
if file.endswith(extension):
filewrite.write(f"{r + file}
")

# looking for png file (fonts) in the hard disk H:
searchfiles(".png", "H:\")

>>> H:4bs_18Dolphins5.png
>>> H:4bs_18Dolphins6.png
>>> H:4bs_18Dolphins7.png
>>> H:5_18marketing htmlassetsimageslogo2.png
>>> H:7z001.png
>>> H:7z002.png

(New) Find all files and open them with tkinter GUI

I just wanted to add in this 2019 a little app to search for all files in a dir and be able to open them by doubleclicking on the name of the file in the list.

import tkinter as tk
import os

def searchfiles(extension=".txt", folder="H:\"):
"insert all files in the listbox"
for r, d, f in os.walk(folder):
for file in f:
if file.endswith(extension):
lb.insert(0, r + "\" + file)

def open_file():
os.startfile(lb.get(lb.curselection()[0]))

root = tk.Tk()
root.geometry("400x400")
bt = tk.Button(root, text="Search", command=lambda:searchfiles(".png", "H:\"))
bt.pack()
lb = tk.Listbox(root)
lb.pack(fill="both", expand=1)
lb.bind("<Double-Button>", lambda x: open_file())
root.mainloop()

I just used the following which was quite simple. First open a console then cd to where you"ve downloaded your file like some-package.whl and use

pip install some-package.whl

Note: if pip.exe is not recognized, you may find it in the "Scripts" directory from where python has been installed. If pip is not installed, this page can help: How do I install pip on Windows?

Note: for clarification
If you copy the *.whl file to your local drive (ex. C:some-dirsome-file.whl) use the following command line parameters --

pip install C:/some-dir/some-file.whl

The simplest way to get row counts per group is by calling .size(), which returns a Series:

df.groupby(["col1","col2"]).size()

Usually you want this result as a DataFrame (instead of a Series) so you can do:

df.groupby(["col1", "col2"]).size().reset_index(name="counts")

If you want to find out how to calculate the row counts and other statistics for each group continue reading below.

Detailed example:

Consider the following example dataframe:

In [2]: df
Out[2]:
col1 col2  col3  col4  col5  col6
0    A    B  0.20 -0.61 -0.49  1.49
1    A    B -1.53 -1.01 -0.39  1.82
2    A    B -0.44  0.27  0.72  0.11
3    A    B  0.28 -1.32  0.38  0.18
4    C    D  0.12  0.59  0.81  0.66
5    C    D -0.13 -1.65 -1.64  0.50
6    C    D -1.42 -0.11 -0.18 -0.44
7    E    F -0.00  1.42 -0.26  1.17
8    E    F  0.91 -0.47  1.35 -0.34
9    G    H  1.48 -0.63 -1.14  0.17

First let"s use .size() to get the row counts:

In [3]: df.groupby(["col1", "col2"]).size()
Out[3]:
col1  col2
A     B       4
C     D       3
E     F       2
G     H       1
dtype: int64

Then let"s use .size().reset_index(name="counts") to get the row counts:

In [4]: df.groupby(["col1", "col2"]).size().reset_index(name="counts")
Out[4]:
col1 col2  counts
0    A    B       4
1    C    D       3
2    E    F       2
3    G    H       1

Including results for more statistics

When you want to calculate statistics on grouped data, it usually looks like this:

In [5]: (df
...: .groupby(["col1", "col2"])
...: .agg({
...:     "col3": ["mean", "count"],
...:     "col4": ["median", "min", "count"]
...: }))
Out[5]:
col4                  col3
median   min count      mean count
col1 col2
A    B    -0.810 -1.32     4 -0.372500     4
C    D    -0.110 -1.65     3 -0.476667     3
E    F     0.475 -0.47     2  0.455000     2
G    H    -0.630 -0.63     1  1.480000     1

The result above is a little annoying to deal with because of the nested column labels, and also because row counts are on a per column basis.

To gain more control over the output I usually split the statistics into individual aggregations that I then combine using join. It looks like this:

In [6]: gb = df.groupby(["col1", "col2"])
...: counts = gb.size().to_frame(name="counts")
...: (counts
...:  .join(gb.agg({"col3": "mean"}).rename(columns={"col3": "col3_mean"}))
...:  .join(gb.agg({"col4": "median"}).rename(columns={"col4": "col4_median"}))
...:  .join(gb.agg({"col4": "min"}).rename(columns={"col4": "col4_min"}))
...:  .reset_index()
...: )
...:
Out[6]:
col1 col2  counts  col3_mean  col4_median  col4_min
0    A    B       4  -0.372500       -0.810     -1.32
1    C    D       3  -0.476667       -0.110     -1.65
2    E    F       2   0.455000        0.475     -0.47
3    G    H       1   1.480000       -0.630     -0.63

Footnotes

The code used to generate the test data is shown below:

In [1]: import numpy as np
...: import pandas as pd
...:
...: keys = np.array([
...:         ["A", "B"],
...:         ["A", "B"],
...:         ["A", "B"],
...:         ["A", "B"],
...:         ["C", "D"],
...:         ["C", "D"],
...:         ["C", "D"],
...:         ["E", "F"],
...:         ["E", "F"],
...:         ["G", "H"]
...:         ])
...:
...: df = pd.DataFrame(
...:     np.hstack([keys,np.random.randn(10,4).round(2)]),
...:     columns = ["col1", "col2", "col3", "col4", "col5", "col6"]
...: )
...:
...: df[["col3", "col4", "col5", "col6"]] =
...:     df[["col3", "col4", "col5", "col6"]].astype(float)
...:

Disclaimer:

If some of the columns that you are aggregating have null values, then you really want to be looking at the group row counts as an independent aggregation for each column. Otherwise you may be misled as to how many records are actually being used to calculate things like the mean because pandas will drop NaN entries in the mean calculation without telling you about it.

Using a for loop, how do I access the loop index, from 1 to 5 in this case?

Use enumerate to get the index with the element as you iterate:

for index, item in enumerate(items):
print(index, item)

And note that Python"s indexes start at zero, so you would get 0 to 4 with the above. If you want the count, 1 to 5, do this:

count = 0 # in case items is empty and you need it after the loop
for count, item in enumerate(items, start=1):
print(count, item)

Unidiomatic control flow

What you are asking for is the Pythonic equivalent of the following, which is the algorithm most programmers of lower-level languages would use:

index = 0            # Python"s indexing starts at zero
for item in items:   # Python"s for loops are a "for each" loop
print(index, item)
index += 1

Or in languages that do not have a for-each loop:

index = 0
while index < len(items):
print(index, items[index])
index += 1

or sometimes more commonly (but unidiomatically) found in Python:

for index in range(len(items)):
print(index, items[index])

Use the Enumerate Function

Python"s enumerate function reduces the visual clutter by hiding the accounting for the indexes, and encapsulating the iterable into another iterable (an enumerate object) that yields a two-item tuple of the index and the item that the original iterable would provide. That looks like this:

for index, item in enumerate(items, start=0):   # default is zero
print(index, item)

This code sample is fairly well the canonical example of the difference between code that is idiomatic of Python and code that is not. Idiomatic code is sophisticated (but not complicated) Python, written in the way that it was intended to be used. Idiomatic code is expected by the designers of the language, which means that usually this code is not just more readable, but also more efficient.

Getting a count

Even if you don"t need indexes as you go, but you need a count of the iterations (sometimes desirable) you can start with 1 and the final number will be your count.

count = 0 # in case items is empty
for count, item in enumerate(items, start=1):   # default is zero
print(item)

print("there were {0} items printed".format(count))

The count seems to be more what you intend to ask for (as opposed to index) when you said you wanted from 1 to 5.

Breaking it down - a step by step explanation

To break these examples down, say we have a list of items that we want to iterate over with an index:

items = ["a", "b", "c", "d", "e"]

Now we pass this iterable to enumerate, creating an enumerate object:

enumerate_object = enumerate(items) # the enumerate object

We can pull the first item out of this iterable that we would get in a loop with the next function:

iteration = next(enumerate_object) # first iteration from enumerate
print(iteration)

And we see we get a tuple of 0, the first index, and "a", the first item:

(0, "a")

we can use what is referred to as "sequence unpacking" to extract the elements from this two-tuple:

index, item = iteration
#   0,  "a" = (0, "a") # essentially this.

and when we inspect index, we find it refers to the first index, 0, and item refers to the first item, "a".

>>> print(index)
0
>>> print(item)
a

Conclusion

• Python indexes start at zero
• To get these indexes from an iterable as you iterate over it, use the enumerate function
• Using enumerate in the idiomatic way (along with tuple unpacking) creates code that is more readable and maintainable:

So do this:

for index, item in enumerate(items, start=0):   # Python indexes start at zero
print(index, item)

Getting some sort of modification date in a cross-platform way is easy - just call os.path.getmtime(path) and you"ll get the Unix timestamp of when the file at path was last modified.

Getting file creation dates, on the other hand, is fiddly and platform-dependent, differing even between the three big OSes:

Putting this all together, cross-platform code should look something like this...

import os
import platform

def creation_date(path_to_file):
"""
Try to get the date that a file was created, falling back to when it was
See http://stackoverflow.com/a/39501288/1709587 for explanation.
"""
if platform.system() == "Windows":
return os.path.getctime(path_to_file)
else:
stat = os.stat(path_to_file)
try:
return stat.st_birthtime
except AttributeError:
# We"re probably on Linux. No easy way to get creation dates here,
return stat.st_mtime

I noticed that every now and then I need to Google fopen all over again, just to build a mental image of what the primary differences between the modes are. So, I thought a diagram will be faster to read next time. Maybe someone else will find that helpful too.

I would suggest using the duplicated method on the Pandas Index itself:

df3 = df3[~df3.index.duplicated(keep="first")]

While all the other methods work, .drop_duplicates is by far the least performant for the provided example. Furthermore, while the groupby method is only slightly less performant, I find the duplicated method to be more readable.

Using the sample data provided:

>>> %timeit df3.reset_index().drop_duplicates(subset="index", keep="first").set_index("index")
1000 loops, best of 3: 1.54 ms per loop

>>> %timeit df3.groupby(df3.index).first()
1000 loops, best of 3: 580 ¬µs per loop

>>> %timeit df3[~df3.index.duplicated(keep="first")]
1000 loops, best of 3: 307 ¬µs per loop

Note that you can keep the last element by changing the keep argument to "last".

It should also be noted that this method works with MultiIndex as well (using df1 as specified in Paul"s example):

>>> %timeit df1.groupby(level=df1.index.names).last()
1000 loops, best of 3: 771 ¬µs per loop

>>> %timeit df1[~df1.index.duplicated(keep="last")]
1000 loops, best of 3: 365 ¬µs per loop

Here"s a concise solution which avoids regular expressions and slow in-Python loops:

def principal_period(s):
i = (s+s).find(s, 1, -1)
return None if i == -1 else s[:i]

See the Community Wiki answer started by @davidism for benchmark results. In summary,

David Zhang"s solution is the clear winner, outperforming all others by at least 5x for the large example set.

This is based on the observation that a string is periodic if and only if it is equal to a nontrivial rotation of itself. Kudos to @AleksiTorhamo for realizing that we can then recover the principal period from the index of the first occurrence of s in (s+s)[1:-1], and for informing me of the optional start and end arguments of Python"s string.find.

Is there a list of Pytz Timezones?

I would like to know what are all the possible values for the timezone argument in the Python library pytz. How to do it?

Python strptime() and timezones?

I have a CSV dumpfile from a Blackberry IPD backup, created using IPDDump. The date/time strings in here look something like this (where EST is an Australian time-zone):

Tue Jun 22 07:46:22 EST 2010

I need to be able to parse this date in Python. At first, I tried to use the strptime() function from datettime.

>>> datetime.datetime.strptime("Tue Jun 22 12:10:20 2010 EST", "%a %b %d %H:%M:%S %Y %Z")

However, for some reason, the datetime object that comes back doesn"t seem to have any tzinfo associated with it.

I did read on this page that apparently datetime.strptime silently discards tzinfo, however, I checked the documentation, and I can"t find anything to that effect documented here.

I have been able to get the date parsed using a third-party Python library, dateutil, however I"m still curious as to how I was using the in-built strptime() incorrectly? Is there any way to get strptime() to play nicely with timezones?

Fitting empirical distribution to theoretical ones with Scipy (Python)?

INTRODUCTION: I have a list of more than 30,000 integer values ranging from 0 to 47, inclusive, e.g.[0,0,0,0,..,1,1,1,1,...,2,2,2,2,...,47,47,47,...] sampled from some continuous distribution. The values in the list are not necessarily in order, but order doesn"t matter for this problem.

PROBLEM: Based on my distribution I would like to calculate p-value (the probability of seeing greater values) for any given value. For example, as you can see p-value for 0 would be approaching 1 and p-value for higher numbers would be tending to 0.

I don"t know if I am right, but to determine probabilities I think I need to fit my data to a theoretical distribution that is the most suitable to describe my data. I assume that some kind of goodness of fit test is needed to determine the best model.

Is there a way to implement such an analysis in Python (Scipy or Numpy)? Could you present any examples?

Thank you!

Python"s many ways of string formatting ‚Äî are the older ones (going to be) deprecated?

Python has at least six ways of formatting a string:

In [1]: world = "Earth"

# method 1a
In [2]: "Hello, %s" % world
Out[2]: "Hello, Earth"

# method 1b
In [3]: "Hello, %(planet)s" % {"planet": world}
Out[3]: "Hello, Earth"

# method 2a
In [4]: "Hello, {0}".format(world)
Out[4]: "Hello, Earth"

# method 2b
In [5]: "Hello, {planet}".format(planet=world)
Out[5]: "Hello, Earth"

# method 2c
In [6]: f"Hello, {world}"
Out[6]: "Hello, Earth"

In [7]: from string import Template

# method 3
In [8]: Template("Hello, \$planet").substitute(planet=world)
Out[8]: "Hello, Earth"

A brief history of the different methods:

• printf-style formatting has been around since Pythons infancy
• The Template class was introduced in Python 2.4
• The format method was introduced in Python 2.6
• f-strings were introduced in Python 3.6

My questions are:

• Is printf-style formatting deprecated or going to be deprecated?
• In the Template class, is the substitute method deprecated or going to be deprecated? (I"m not talking about safe_substitute, which as I understand it offers unique capabilities)

Similar questions and why I think they"re not duplicates:

• Python string formatting: % vs. .format ‚Äî treats only methods 1 and 2, and asks which one is better; my question is explicitly about deprecation in the light of the Zen of Python

• String formatting options: pros and cons ‚Äî treats only methods 1a and 1b in the question, 1 and 2 in the answer, and also nothing about deprecation

• advanced string formatting vs template strings ‚Äî mostly about methods 1 and 3, and doesn"t address deprecation

• String formatting expressions (Python) ‚Äî answer mentions that the original "%" approach is planned to be deprecated. But what"s the difference between planned to be deprecated, pending deprecation and actual deprecation? And the printf-style method doesn"t raise even a PendingDeprecationWarning, so is this really going to be deprecated? This post is also quite old, so the information may be outdated.

How to iterate over rows in a DataFrame in Pandas?

Iteration in Pandas is an anti-pattern and is something you should only do when you have exhausted every other option. You should not use any function with "iter" in its name for more than a few thousand rows or you will have to get used to a lot of waiting.

Do you want to print a DataFrame? Use DataFrame.to_string().

Do you want to compute something? In that case, search for methods in this order (list modified from here):

1. Vectorization
2. Cython routines
3. List Comprehensions (vanilla for loop)
4. DataFrame.apply(): i) ¬†Reductions that can be performed in Cython, ii) Iteration in Python space
5. DataFrame.itertuples() and iteritems()
6. DataFrame.iterrows()

iterrows and itertuples (both receiving many votes in answers to this question) should be used in very rare circumstances, such as generating row objects/nametuples for sequential processing, which is really the only thing these functions are useful for.

Appeal to Authority

The documentation page on iteration has a huge red warning box that says:

Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is not needed [...].

* It"s actually a little more complicated than "don"t". df.iterrows() is the correct answer to this question, but "vectorize your ops" is the better one. I will concede that there are circumstances where iteration cannot be avoided (for example, some operations where the result depends on the value computed for the previous row). However, it takes some familiarity with the library to know when. If you"re not sure whether you need an iterative solution, you probably don"t. PS: To know more about my rationale for writing this answer, skip to the very bottom.

Faster than Looping: Vectorization, Cython

A good number of basic operations and computations are "vectorised" by pandas (either through NumPy, or through Cythonized functions). This includes arithmetic, comparisons, (most) reductions, reshaping (such as pivoting), joins, and groupby operations. Look through the documentation on Essential Basic Functionality to find a suitable vectorised method for your problem.

If none exists, feel free to write your own using custom Cython extensions.

Next Best Thing: List Comprehensions*

List comprehensions should be your next port of call if 1) there is no vectorized solution available, 2) performance is important, but not important enough to go through the hassle of cythonizing your code, and 3) you"re trying to perform elementwise transformation on your code. There is a good amount of evidence to suggest that list comprehensions are sufficiently fast (and even sometimes faster) for many common Pandas tasks.

The formula is simple,

# Iterating over one column - `f` is some function that processes your data
result = [f(x) for x in df["col"]]
# Iterating over two columns, use `zip`
result = [f(x, y) for x, y in zip(df["col1"], df["col2"])]
# Iterating over multiple columns - same data type
result = [f(row[0], ..., row[n]) for row in df[["col1", ...,"coln"]].to_numpy()]
# Iterating over multiple columns - differing data type
result = [f(row[0], ..., row[n]) for row in zip(df["col1"], ..., df["coln"])]

If you can encapsulate your business logic into a function, you can use a list comprehension that calls it. You can make arbitrarily complex things work through the simplicity and speed of raw Python code.

Caveats

List comprehensions assume that your data is easy to work with - what that means is your data types are consistent and you don"t have NaNs, but this cannot always be guaranteed.

1. The first one is more obvious, but when dealing with NaNs, prefer in-built pandas methods if they exist (because they have much better corner-case handling logic), or ensure your business logic includes appropriate NaN handling logic.
2. When dealing with mixed data types you should iterate over zip(df["A"], df["B"], ...) instead of df[["A", "B"]].to_numpy() as the latter implicitly upcasts data to the most common type. As an example if A is numeric and B is string, to_numpy() will cast the entire array to string, which may not be what you want. Fortunately zipping your columns together is the most straightforward workaround to this.

*Your mileage may vary for the reasons outlined in the Caveats section above.

An Obvious Example

Let"s demonstrate the difference with a simple example of adding two pandas columns A + B. This is a vectorizable operaton, so it will be easy to contrast the performance of the methods discussed above.

Benchmarking code, for your reference. The line at the bottom measures a function written in numpandas, a style of Pandas that mixes heavily with NumPy to squeeze out maximum performance. Writing numpandas code should be avoided unless you know what you"re doing. Stick to the API where you can (i.e., prefer vec over vec_numpy).

I should mention, however, that it isn"t always this cut and dry. Sometimes the answer to "what is the best method for an operation" is "it depends on your data". My advice is to test out different approaches on your data before settling on one.

* Pandas string methods are "vectorized" in the sense that they are specified on the series but operate on each element. The underlying mechanisms are still iterative, because string operations are inherently hard to vectorize.

A common trend I notice from new users is to ask questions of the form "How can I iterate over my df to do X?". Showing code that calls iterrows() while doing something inside a for loop. Here is why. A new user to the library who has not been introduced to the concept of vectorization will likely envision the code that solves their problem as iterating over their data to do something. Not knowing how to iterate over a DataFrame, the first thing they do is Google it and end up here, at this question. They then see the accepted answer telling them how to, and they close their eyes and run this code without ever first questioning if iteration is not the right thing to do.

The aim of this answer is to help new users understand that iteration is not necessarily the solution to every problem, and that better, faster and more idiomatic solutions could exist, and that it is worth investing time in exploring them. I"m not trying to start a war of iteration vs. vectorization, but I want new users to be informed when developing solutions to their problems with this library.

Placing the legend (bbox_to_anchor)

A legend is positioned inside the bounding box of the axes using the loc argument to plt.legend.
E.g. loc="upper right" places the legend in the upper right corner of the bounding box, which by default extents from (0,0) to (1,1) in axes coordinates (or in bounding box notation (x0,y0, width, height)=(0,0,1,1)).

To place the legend outside of the axes bounding box, one may specify a tuple (x0,y0) of axes coordinates of the lower left corner of the legend.

plt.legend(loc=(1.04,0))

A more versatile approach is to manually specify the bounding box into which the legend should be placed, using the bbox_to_anchor argument. One can restrict oneself to supply only the (x0, y0) part of the bbox. This creates a zero span box, out of which the legend will expand in the direction given by the loc argument. E.g.

plt.legend(bbox_to_anchor=(1.04,1), loc="upper left")

places the legend outside the axes, such that the upper left corner of the legend is at position (1.04,1) in axes coordinates.

Further examples are given below, where additionally the interplay between different arguments like mode and ncols are shown.

l2 = plt.legend(bbox_to_anchor=(1.04,0), loc="lower left", borderaxespad=0)
l3 = plt.legend(bbox_to_anchor=(1.04,0.5), loc="center left", borderaxespad=0)
l4 = plt.legend(bbox_to_anchor=(0,1.02,1,0.2), loc="lower left",
l5 = plt.legend(bbox_to_anchor=(1,0), loc="lower right",
bbox_transform=fig.transFigure, ncol=3)
l6 = plt.legend(bbox_to_anchor=(0.4,0.8), loc="upper right")

Details about how to interpret the 4-tuple argument to bbox_to_anchor, as in l4, can be found in this question. The mode="expand" expands the legend horizontally inside the bounding box given by the 4-tuple. For a vertically expanded legend, see this question.

Sometimes it may be useful to specify the bounding box in figure coordinates instead of axes coordinates. This is shown in the example l5 from above, where the bbox_transform argument is used to put the legend in the lower left corner of the figure.

Postprocessing

Having placed the legend outside the axes often leads to the undesired situation that it is completely or partially outside the figure canvas.

Solutions to this problem are:

One can adjust the subplot parameters such, that the axes take less space inside the figure (and thereby leave more space to the legend) by using plt.subplots_adjust. E.g.

leaves 30% space on the right-hand side of the figure, where one could place the legend.

• Tight layout
Using plt.tight_layout Allows to automatically adjust the subplot parameters such that the elements in the figure sit tight against the figure edges. Unfortunately, the legend is not taken into account in this automatism, but we can supply a rectangle box that the whole subplots area (including labels) will fit into.

plt.tight_layout(rect=[0,0,0.75,1])

• Saving the figure with bbox_inches = "tight"
The argument bbox_inches = "tight" to plt.savefig can be used to save the figure such that all artist on the canvas (including the legend) are fit into the saved area. If needed, the figure size is automatically adjusted.

plt.savefig("output.png", bbox_inches="tight")

• automatically adjusting the subplot params
A way to automatically adjust the subplot position such that the legend fits inside the canvas without changing the figure size can be found in this answer: Creating figure with exact size and no padding (and legend outside the axes)

Comparison between the cases discussed above:

Alternatives

A figure legend

One may use a legend to the figure instead of the axes, matplotlib.figure.Figure.legend. This has become especially useful for matplotlib version >=2.1, where no special arguments are needed

fig.legend(loc=7)

to create a legend for all artists in the different axes of the figure. The legend is placed using the loc argument, similar to how it is placed inside an axes, but in reference to the whole figure - hence it will be outside the axes somewhat automatically. What remains is to adjust the subplots such that there is no overlap between the legend and the axes. Here the point "Adjust the subplot parameters" from above will be helpful. An example:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0,2*np.pi)
colors=["#7aa0c4","#ca82e1" ,"#8bcd50","#e18882"]
fig, axes = plt.subplots(ncols=2)
for i in range(4):
axes[i//2].plot(x,np.sin(x+i), color=colors[i],label="y=sin(x+{})".format(i))

fig.legend(loc=7)
fig.tight_layout()
plt.show()

Legend inside dedicated subplot axes

An alternative to using bbox_to_anchor would be to place the legend in its dedicated subplot axes (lax). Since the legend subplot should be smaller than the plot, we may use gridspec_kw={"width_ratios":[4,1]} at axes creation. We can hide the axes lax.axis("off") but still put a legend in. The legend handles and labels need to obtained from the real plot via h,l = ax.get_legend_handles_labels(), and can then be supplied to the legend in the lax subplot, lax.legend(h,l). A complete example is below.

import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = 6,2

fig, (ax,lax) = plt.subplots(ncols=2, gridspec_kw={"width_ratios":[4,1]})
ax.plot(x,y, label="y=sin(x)")
....

h,l = ax.get_legend_handles_labels()
lax.axis("off")

plt.tight_layout()
plt.show()

This produces a plot, which is visually pretty similar to the plot from above:

We could also use the first axes to place the legend, but use the bbox_transform of the legend axes,

ax.legend(bbox_to_anchor=(0,0,1,1), bbox_transform=lax.transAxes)
lax.axis("off")

In this approach, we do not need to obtain the legend handles externally, but we need to specify the bbox_to_anchor argument.

• Consider the matplotlib legend guide with some examples of other stuff you want to do with legends.
• Some example code for placing legends for pie charts may directly be found in answer to this question: Python - Legend overlaps with the pie chart
• The loc argument can take numbers instead of strings, which make calls shorter, however, they are not very intuitively mapped to each other. Here is the mapping for reference:

It helps to install a python package foo on your machine (can also be in virtualenv) so that you can import the package foo from other projects and also from [I]Python prompts.

It does the similar job of pip, easy_install etc.,

Using setup.py

Package - A folder/directory that contains __init__.py file.
Module - A valid python file with .py extension.
Distribution - How one package relates to other packages and modules.

Let"s say you want to install a package named foo. Then you do,

\$ git clone https://github.com/user/foo
\$ cd foo
\$ python setup.py install

Instead, if you don"t want to actually install it but still would like to use it. Then do,

\$ python setup.py develop

This command will create symlinks to the source directory within site-packages instead of copying things. Because of this, it is quite fast (particularly for large packages).

Creating setup.py

If you have your package tree like,

foo
‚îú‚îÄ‚îÄ foo
‚îÇ¬†¬† ‚îú‚îÄ‚îÄ data_struct.py
‚îÇ¬†¬† ‚îú‚îÄ‚îÄ __init__.py
‚îÇ¬†¬† ‚îî‚îÄ‚îÄ internals.py
‚îú‚îÄ‚îÄ requirements.txt
‚îî‚îÄ‚îÄ setup.py

Then, you do the following in your setup.py script so that it can be installed on some machine:

from setuptools import setup

setup(
name="foo",
version="1.0",
description="A useful module",
author="Man Foo",
author_email="[email protected]",
packages=["foo"],  #same as name
install_requires=["wheel", "bar", "greek"], #external packages as dependencies
)

Instead, if your package tree is more complex like the one below:

foo
‚îú‚îÄ‚îÄ foo
‚îÇ¬†¬† ‚îú‚îÄ‚îÄ data_struct.py
‚îÇ¬†¬† ‚îú‚îÄ‚îÄ __init__.py
‚îÇ¬†¬† ‚îî‚îÄ‚îÄ internals.py
‚îú‚îÄ‚îÄ requirements.txt
‚îú‚îÄ‚îÄ scripts
‚îÇ¬†¬† ‚îú‚îÄ‚îÄ cool
‚îÇ¬†¬† ‚îî‚îÄ‚îÄ skype
‚îî‚îÄ‚îÄ setup.py

Then, your setup.py in this case would be like:

from setuptools import setup

setup(
name="foo",
version="1.0",
description="A useful module",
author="Man Foo",
author_email="[email protected]",
packages=["foo"],  #same as name
install_requires=["wheel", "bar", "greek"], #external packages as dependencies
scripts=[
"scripts/cool",
"scripts/skype",
]
)

Add more stuff to (setup.py) & make it decent:

from setuptools import setup

setup(
name="foo",
version="1.0",
description="A useful module",
long_description=long_description,
author="Man Foo",
author_email="[email protected]",
url="http://www.foopackage.com/",
packages=["foo"],  #same as name
install_requires=["wheel", "bar", "greek"], #external packages as dependencies
scripts=[
"scripts/cool",
"scripts/skype",
]
)

The long_description is used in pypi.org as the README description of your package.

At this point there are two options.

• publish in the temporary test.pypi.org server to make oneself familiarize with the procedure, and then publish it on the permanent pypi.org server for the public to use your package.
• publish straight away on the permanent pypi.org server, if you are already familiar with the procedure and have your user credentials (e.g., username, password, package name)

Once your package name is registered in pypi.org, nobody can claim or use it. Python packaging suggests the twine package for uploading purposes (of your package to PyPi). Thus,

(1) the first step is to locally build the distributions using:

# prereq: wheel (pip install wheel)
\$ python setup.py sdist bdist_wheel

\$ twine upload --repository testpypi dist/*

It will take few minutes for the package to appear on test.pypi.org. Once you"re satisfied with it, you can then upload your package to the real & permanent index of pypi.org simply with:

Optionally, you can also sign the files in your package with a GPG by:

tl;dr / quick fix

• Don"t decode/encode willy nilly
• Don"t assume your strings are UTF-8 encoded
• Try to convert strings to Unicode strings as soon as possible in your code
• Fix your locale: How to solve UnicodeDecodeError in Python 3.6?
• Don"t be tempted to use quick reload hacks

Unicode Zen in Python 2.x - The Long Version

Without seeing the source it"s difficult to know the root cause, so I"ll have to speak generally.

UnicodeDecodeError: "ascii" codec can"t decode byte generally happens when you try to convert a Python 2.x str that contains non-ASCII to a Unicode string without specifying the encoding of the original string.

In brief, Unicode strings are an entirely separate type of Python string that does not contain any encoding. They only hold Unicode point codes and therefore can hold any Unicode point from across the entire spectrum. Strings contain encoded text, beit UTF-8, UTF-16, ISO-8895-1, GBK, Big5 etc. Strings are decoded to Unicode and Unicodes are encoded to strings. Files and text data are always transferred in encoded strings.

The Markdown module authors probably use unicode() (where the exception is thrown) as a quality gate to the rest of the code - it will convert ASCII or re-wrap existing Unicodes strings to a new Unicode string. The Markdown authors can"t know the encoding of the incoming string so will rely on you to decode strings to Unicode strings before passing to Markdown.

Unicode strings can be declared in your code using the u prefix to strings. E.g.

>>> my_u = u"my √ºnic√¥d√© strƒØng"
>>> type(my_u)
<type "unicode">

Unicode strings may also come from file, databases and network modules. When this happens, you don"t need to worry about the encoding.

Gotchas

Conversion from str to Unicode can happen even when you don"t explicitly call unicode().

The following scenarios cause UnicodeDecodeError exceptions:

# Explicit conversion without encoding
unicode("‚Ç¨")

# New style format string into Unicode string
# Python will try to convert value string to Unicode first
u"The currency is: {}".format("‚Ç¨")

# Old style format string into Unicode string
# Python will try to convert value string to Unicode first
u"The currency is: %s" % "‚Ç¨"

# Append string to Unicode
# Python will try to convert string to Unicode first
u"The currency is: " + "‚Ç¨"

Examples

In the following diagram, you can see how the word caf√© has been encoded in either "UTF-8" or "Cp1252" encoding depending on the terminal type. In both examples, caf is just regular ascii. In UTF-8, √© is encoded using two bytes. In "Cp1252", √© is 0xE9 (which is also happens to be the Unicode point value (it"s no coincidence)). The correct decode() is invoked and conversion to a Python Unicode is successfull:

In this diagram, decode() is called with ascii (which is the same as calling unicode() without an encoding given). As ASCII can"t contain bytes greater than 0x7F, this will throw a UnicodeDecodeError exception:

The Unicode Sandwich

It"s good practice to form a Unicode sandwich in your code, where you decode all incoming data to Unicode strings, work with Unicodes, then encode to strs on the way out. This saves you from worrying about the encoding of strings in the middle of your code.

Input / Decode

Source code

If you need to bake non-ASCII into your source code, just create Unicode strings by prefixing the string with a u. E.g.

u"Z√ºrich"

To allow Python to decode your source code, you will need to add an encoding header to match the actual encoding of your file. For example, if your file was encoded as "UTF-8", you would use:

# encoding: utf-8

This is only necessary when you have non-ASCII in your source code.

Files

Usually non-ASCII data is received from a file. The io module provides a TextWrapper that decodes your file on the fly, using a given encoding. You must use the correct encoding for the file - it can"t be easily guessed. For example, for a UTF-8 file:

import io
with io.open("my_utf8_file.txt", "r", encoding="utf-8") as my_file:

my_unicode_string would then be suitable for passing to Markdown. If a UnicodeDecodeError from the read() line, then you"ve probably used the wrong encoding value.

CSV Files

The Python 2.7 CSV module does not support non-ASCII characters üò©. Help is at hand, however, with https://pypi.python.org/pypi/backports.csv.

Use it like above but pass the opened file to it:

from backports import csv
import io
with io.open("my_utf8_file.txt", "r", encoding="utf-8") as my_file:
yield row

Databases

Most Python database drivers can return data in Unicode, but usually require a little configuration. Always use Unicode strings for SQL queries.

MySQL

charset="utf8",
use_unicode=True

E.g.

>>> db = MySQLdb.connect(host="localhost", user="root", passwd="passwd", db="sandbox", use_unicode=True, charset="utf8")
PostgreSQL

psycopg2.extensions.register_type(psycopg2.extensions.UNICODE)
psycopg2.extensions.register_type(psycopg2.extensions.UNICODEARRAY)

HTTP

Web pages can be encoded in just about any encoding. The Content-type header should contain a charset field to hint at the encoding. The content can then be decoded manually against this value. Alternatively, Python-Requests returns Unicodes in response.text.

Manually

If you must decode strings manually, you can simply do my_string.decode(encoding), where encoding is the appropriate encoding. Python 2.x supported codecs are given here: Standard Encodings. Again, if you get UnicodeDecodeError then you"ve probably got the wrong encoding.

The meat of the sandwich

Work with Unicodes as you would normal strs.

Output

stdout / printing

print writes through the stdout stream. Python tries to configure an encoder on stdout so that Unicodes are encoded to the console"s encoding. For example, if a Linux shell"s locale is en_GB.UTF-8, the output will be encoded to UTF-8. On Windows, you will be limited to an 8bit code page.

An incorrectly configured console, such as corrupt locale, can lead to unexpected print errors. PYTHONIOENCODING environment variable can force the encoding for stdout.

Files

Just like input, io.open can be used to transparently convert Unicodes to encoded byte strings.

Database

The same configuration for reading will allow Unicodes to be written directly.

Python 3

Python 3 is no more Unicode capable than Python 2.x is, however it is slightly less confused on the topic. E.g the regular str is now a Unicode string and the old str is now bytes.

The default encoding is UTF-8, so if you .decode() a byte string without giving an encoding, Python 3 uses UTF-8 encoding. This probably fixes 50% of people"s Unicode problems.

Further, open() operates in text mode by default, so returns decoded str (Unicode ones). The encoding is derived from your locale, which tends to be UTF-8 on Un*x systems or an 8-bit code page, such as windows-1251, on Windows boxes.

Why you shouldn"t use sys.setdefaultencoding("utf8")

It"s a nasty hack (there"s a reason you have to use reload) that will only mask problems and hinder your migration to Python 3.x. Understand the problem, fix the root cause and enjoy Unicode zen. See Why should we NOT use sys.setdefaultencoding("utf-8") in a py script? for further details

You can either Drop the columns you do not need OR Select the ones you need

# Using DataFrame.drop
df.drop(df.columns[[1, 2]], axis=1, inplace=True)

# drop by Name
df1 = df1.drop(["B", "C"], axis=1)

# Select the ones you want
df1 = df[["a","d"]]

Simply put, numpy.newaxis is used to increase the dimension of the existing array by one more dimension, when used once. Thus,

• 1D array will become 2D array

• 2D array will become 3D array

• 3D array will become 4D array

• 4D array will become 5D array

and so on..

Here is a visual illustration which depicts promotion of 1D array to 2D arrays.

Scenario-1: np.newaxis might come in handy when you want to explicitly convert a 1D array to either a row vector or a column vector, as depicted in the above picture.

Example:

# 1D array
In [7]: arr = np.arange(4)
In [8]: arr.shape
Out[8]: (4,)

# make it as row vector by inserting an axis along first dimension
In [9]: row_vec = arr[np.newaxis, :]     # arr[None, :]
In [10]: row_vec.shape
Out[10]: (1, 4)

# make it as column vector by inserting an axis along second dimension
In [11]: col_vec = arr[:, np.newaxis]     # arr[:, None]
In [12]: col_vec.shape
Out[12]: (4, 1)

Scenario-2: When we want to make use of numpy broadcasting as part of some operation, for instance while doing addition of some arrays.

Example:

Let"s say you want to add the following two arrays:

x1 = np.array([1, 2, 3, 4, 5])
x2 = np.array([5, 4, 3])

If you try to add these just like that, NumPy will raise the following ValueError :

ValueError: operands could not be broadcast together with shapes (5,) (3,)

In this situation, you can use np.newaxis to increase the dimension of one of the arrays so that NumPy can broadcast.

In [2]: x1_new = x1[:, np.newaxis]    # x1[:, None]
# now, the shape of x1_new is (5, 1)
# array([[1],
#        [2],
#        [3],
#        [4],
#        [5]])

In [3]: x1_new + x2
Out[3]:
array([[ 6,  5,  4],
[ 7,  6,  5],
[ 8,  7,  6],
[ 9,  8,  7],
[10,  9,  8]])

Alternatively, you can also add new axis to the array x2:

In [6]: x2_new = x2[:, np.newaxis]    # x2[:, None]
In [7]: x2_new     # shape is (3, 1)
Out[7]:
array([[5],
[4],
[3]])

In [8]: x1 + x2_new
Out[8]:
array([[ 6,  7,  8,  9, 10],
[ 5,  6,  7,  8,  9],
[ 4,  5,  6,  7,  8]])

Note: Observe that we get the same result in both cases (but one being the transpose of the other).

Scenario-3: This is similar to scenario-1. But, you can use np.newaxis more than once to promote the array to higher dimensions. Such an operation is sometimes needed for higher order arrays (i.e. Tensors).

Example:

In [124]: arr = np.arange(5*5).reshape(5,5)

In [125]: arr.shape
Out[125]: (5, 5)

# promoting 2D array to a 5D array
In [126]: arr_5D = arr[np.newaxis, ..., np.newaxis, np.newaxis]    # arr[None, ..., None, None]

In [127]: arr_5D.shape
Out[127]: (1, 5, 5, 1, 1)

As an alternative, you can use numpy.expand_dims that has an intuitive axis kwarg.

# adding new axes at 1st, 4th, and last dimension of the resulting array
In [131]: newaxes = (0, 3, -1)
In [132]: arr_5D = np.expand_dims(arr, axis=newaxes)
In [133]: arr_5D.shape
Out[133]: (1, 5, 5, 1, 1)

More background on np.newaxis vs np.reshape

newaxis is also called as a pseudo-index that allows the temporary addition of an axis into a multiarray.

np.newaxis uses the slicing operator to recreate the array while numpy.reshape reshapes the array to the desired layout (assuming that the dimensions match; And this is must for a reshape to happen).

Example

In [13]: A = np.ones((3,4,5,6))
In [14]: B = np.ones((4,6))
In [15]: (A + B[:, np.newaxis, :]).shape     # B[:, None, :]
Out[15]: (3, 4, 5, 6)

In the above example, we inserted a temporary axis between the first and second axes of B (to use broadcasting). A missing axis is filled-in here using np.newaxis to make the broadcasting operation work.

General Tip: You can also use None in place of np.newaxis; These are in fact the same objects.

In [13]: np.newaxis is None
Out[13]: True

P.S. Also see this great answer: newaxis vs reshape to add dimensions

How do I determine the size of an object in Python?

That answer does work for builtin objects directly, but it does not account for what those objects may contain, specifically, what types, such as custom objects, tuples, lists, dicts, and sets contain. They can contain instances each other, as well as numbers, strings and other objects.

Using 64-bit Python 3.6 from the Anaconda distribution, with sys.getsizeof, I have determined the minimum size of the following objects, and note that sets and dicts preallocate space so empty ones don"t grow again until after a set amount (which may vary by implementation of the language):

Python 3:

Empty
Bytes  type        scaling notes
28     int         +4 bytes about every 30 powers of 2
37     bytes       +1 byte per additional byte
49     str         +1-4 per additional character (depending on max width)
48     tuple       +8 per additional item
64     list        +8 for each additional
224    set         5th increases to 736; 21nd, 2272; 85th, 8416; 341, 32992
240    dict        6th increases to 368; 22nd, 1184; 43rd, 2280; 86th, 4704; 171st, 9320
136    func def    does not include default args and other attrs
1056   class def   no slots
56     class inst  has a __dict__ attr, same scaling as dict above
888    class def   with slots
16     __slots__   seems to store in mutable tuple-like structure
first slot grows to 48, and so on.

How do you interpret this? Well say you have a set with 10 items in it. If each item is 100 bytes each, how big is the whole data structure? The set is 736 itself because it has sized up one time to 736 bytes. Then you add the size of the items, so that"s 1736 bytes in total

Some caveats for function and class definitions:

Note each class definition has a proxy __dict__ (48 bytes) structure for class attrs. Each slot has a descriptor (like a property) in the class definition.

Slotted instances start out with 48 bytes on their first element, and increase by 8 each additional. Only empty slotted objects have 16 bytes, and an instance with no data makes very little sense.

Also, each function definition has code objects, maybe docstrings, and other possible attributes, even a __dict__.

Also note that we use sys.getsizeof() because we care about the marginal space usage, which includes the garbage collection overhead for the object, from the docs:

getsizeof() calls the object‚Äôs __sizeof__ method and adds an additional garbage collector overhead if the object is managed by the garbage collector.

Also note that resizing lists (e.g. repetitively appending to them) causes them to preallocate space, similarly to sets and dicts. From the listobj.c source code:

/* This over-allocates proportional to the list size, making room
* for additional growth.  The over-allocation is mild, but is
* enough to give linear-time amortized behavior over a long
* sequence of appends() in the presence of a poorly-performing
* system realloc().
* The growth pattern is:  0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ...
* Note: new_allocated won"t overflow because the largest possible value
*       is PY_SSIZE_T_MAX * (9 / 8) + 6 which always fits in a size_t.
*/
new_allocated = (size_t)newsize + (newsize >> 3) + (newsize < 9 ? 3 : 6);

Historical data

Python 2.7 analysis, confirmed with guppy.hpy and sys.getsizeof:

Bytes  type        empty + scaling notes
24     int         NA
28     long        NA
37     str         + 1 byte per additional character
52     unicode     + 4 bytes per additional character
56     tuple       + 8 bytes per additional item
72     list        + 32 for first, 8 for each additional
232    set         sixth item increases to 744; 22nd, 2280; 86th, 8424
280    dict        sixth item increases to 1048; 22nd, 3352; 86th, 12568 *
120    func def    does not include default args and other attrs
64     class inst  has a __dict__ attr, same scaling as dict above
16     __slots__   class with slots has no dict, seems to store in
mutable tuple-like structure.
904    class def   has a proxy __dict__ structure for class attrs
104    old class   makes sense, less stuff, has real dict though.

Note that dictionaries (but not sets) got a more compact representation in Python 3.6

I think 8 bytes per additional item to reference makes a lot of sense on a 64 bit machine. Those 8 bytes point to the place in memory the contained item is at. The 4 bytes are fixed width for unicode in Python 2, if I recall correctly, but in Python 3, str becomes a unicode of width equal to the max width of the characters.

And for more on slots, see this answer.

A More Complete Function

We want a function that searches the elements in lists, tuples, sets, dicts, obj.__dict__"s, and obj.__slots__, as well as other things we may not have yet thought of.

We want to rely on gc.get_referents to do this search because it works at the C level (making it very fast). The downside is that get_referents can return redundant members, so we need to ensure we don"t double count.

Classes, modules, and functions are singletons - they exist one time in memory. We"re not so interested in their size, as there"s not much we can do about them - they"re a part of the program. So we"ll avoid counting them if they happen to be referenced.

We"re going to use a blacklist of types so we don"t include the entire program in our size count.

import sys
from types import ModuleType, FunctionType
from gc import get_referents

# Custom objects know their class.
# Function objects seem to know way too much, including modules.
# Exclude modules as well.
BLACKLIST = type, ModuleType, FunctionType

def getsize(obj):
"""sum size of object & members."""
if isinstance(obj, BLACKLIST):
raise TypeError("getsize() does not take argument of type: "+ str(type(obj)))
seen_ids = set()
size = 0
objects = [obj]
while objects:
need_referents = []
for obj in objects:
if not isinstance(obj, BLACKLIST) and id(obj) not in seen_ids:
size += sys.getsizeof(obj)
need_referents.append(obj)
objects = get_referents(*need_referents)
return size

To contrast this with the following whitelisted function, most objects know how to traverse themselves for the purposes of garbage collection (which is approximately what we"re looking for when we want to know how expensive in memory certain objects are. This functionality is used by gc.get_referents.) However, this measure is going to be much more expansive in scope than we intended if we are not careful.

For example, functions know quite a lot about the modules they are created in.

Another point of contrast is that strings that are keys in dictionaries are usually interned so they are not duplicated. Checking for id(key) will also allow us to avoid counting duplicates, which we do in the next section. The blacklist solution skips counting keys that are strings altogether.

Whitelisted Types, Recursive visitor

To cover most of these types myself, instead of relying on the gc module, I wrote this recursive function to try to estimate the size of most Python objects, including most builtins, types in the collections module, and custom types (slotted and otherwise).

This sort of function gives much more fine-grained control over the types we"re going to count for memory usage, but has the danger of leaving important types out:

import sys
from numbers import Number
from collections import deque
from collections.abc import Set, Mapping

ZERO_DEPTH_BASES = (str, bytes, Number, range, bytearray)

def getsize(obj_0):
"""Recursively iterate to sum size of object & members."""
_seen_ids = set()
def inner(obj):
obj_id = id(obj)
if obj_id in _seen_ids:
return 0
size = sys.getsizeof(obj)
if isinstance(obj, ZERO_DEPTH_BASES):
pass # bypass remaining control flow and return
elif isinstance(obj, (tuple, list, Set, deque)):
size += sum(inner(i) for i in obj)
elif isinstance(obj, Mapping) or hasattr(obj, "items"):
size += sum(inner(k) + inner(v) for k, v in getattr(obj, "items")())
# Check for custom object instances - may subclass above too
if hasattr(obj, "__dict__"):
size += inner(vars(obj))
if hasattr(obj, "__slots__"): # can have __slots__ with __dict__
size += sum(inner(getattr(obj, s)) for s in obj.__slots__ if hasattr(obj, s))
return size
return inner(obj_0)

And I tested it rather casually (I should unittest it):

>>> getsize(["a", tuple("bcd"), Foo()])
344
>>> getsize(Foo())
16
>>> getsize(tuple("bcd"))
194
>>> getsize(["a", tuple("bcd"), Foo(), {"foo": "bar", "baz": "bar"}])
752
>>> getsize({"foo": "bar", "baz": "bar"})
400
>>> getsize({})
280
>>> getsize({"foo":"bar"})
360
>>> getsize("foo")
40
>>> class Bar():
...     def baz():
...         pass
>>> getsize(Bar())
352
>>> getsize(Bar().__dict__)
280
>>> sys.getsizeof(Bar())
72
>>> getsize(Bar.__dict__)
872
>>> sys.getsizeof(Bar.__dict__)
280

This implementation breaks down on class definitions and function definitions because we don"t go after all of their attributes, but since they should only exist once in memory for the process, their size really doesn"t matter too much.

How do I get the current time in Python?

The time module

The time module provides functions that tell us the time in "seconds since the epoch" as well as other utilities.

import time

Unix Epoch Time

This is the format you should get timestamps in for saving in databases. It is a simple floating-point number that can be converted to an integer. It is also good for arithmetic in seconds, as it represents the number of seconds since Jan 1, 1970, 00:00:00, and it is memory light relative to the other representations of time we"ll be looking at next:

>>> time.time()
1424233311.771502

This timestamp does not account for leap-seconds, so it"s not linear - leap seconds are ignored. So while it is not equivalent to the international UTC standard, it is close, and therefore quite good for most cases of record-keeping.

This is not ideal for human scheduling, however. If you have a future event you wish to take place at a certain point in time, you"ll want to store that time with a string that can be parsed into a datetime object or a serialized datetime object (these will be described later).

time.ctime

You can also represent the current time in the way preferred by your operating system (which means it can change when you change your system preferences, so don"t rely on this to be standard across all systems, as I"ve seen others expect). This is typically user friendly, but doesn"t typically result in strings one can sort chronologically:

>>> time.ctime()
"Tue Feb 17 23:21:56 2015"

You can hydrate timestamps into human readable form with ctime as well:

>>> time.ctime(1424233311.771502)
"Tue Feb 17 23:21:51 2015"

This conversion is also not good for record-keeping (except in text that will only be parsed by humans - and with improved Optical Character Recognition and Artificial Intelligence, I think the number of these cases will diminish).

datetime module

The datetime module is also quite useful here:

>>> import datetime

datetime.datetime.now

The datetime.now is a class method that returns the current time. It uses the time.localtime without the timezone info (if not given, otherwise see timezone aware below). It has a representation (which would allow you to recreate an equivalent object) echoed on the shell, but when printed (or coerced to a str), it is in human readable (and nearly ISO) format, and the lexicographic sort is equivalent to the chronological sort:

>>> datetime.datetime.now()
datetime.datetime(2015, 2, 17, 23, 43, 49, 94252)
>>> print(datetime.datetime.now())
2015-02-17 23:43:51.782461

datetime"s utcnow

You can get a datetime object in UTC time, a global standard, by doing this:

>>> datetime.datetime.utcnow()
datetime.datetime(2015, 2, 18, 4, 53, 28, 394163)
>>> print(datetime.datetime.utcnow())
2015-02-18 04:53:31.783988

UTC is a time standard that is nearly equivalent to the GMT timezone. (While GMT and UTC do not change for Daylight Savings Time, their users may switch to other timezones, like British Summer Time, during the Summer.)

datetime timezone aware

However, none of the datetime objects we"ve created so far can be easily converted to various timezones. We can solve that problem with the pytz module:

>>> import pytz
>>> then = datetime.datetime.now(pytz.utc)
>>> then
datetime.datetime(2015, 2, 18, 4, 55, 58, 753949, tzinfo=<UTC>)

Equivalently, in Python 3 we have the timezone class with a utc timezone instance attached, which also makes the object timezone aware (but to convert to another timezone without the handy pytz module is left as an exercise to the reader):

>>> datetime.datetime.now(datetime.timezone.utc)
datetime.datetime(2015, 2, 18, 22, 31, 56, 564191, tzinfo=datetime.timezone.utc)

And we see we can easily convert to timezones from the original UTC object.

>>> print(then)
2015-02-18 04:55:58.753949+00:00
>>> print(then.astimezone(pytz.timezone("US/Eastern")))
2015-02-17 23:55:58.753949-05:00

You can also make a naive datetime object aware with the pytz timezone localize method, or by replacing the tzinfo attribute (with replace, this is done blindly), but these are more last resorts than best practices:

>>> pytz.utc.localize(datetime.datetime.utcnow())
datetime.datetime(2015, 2, 18, 6, 6, 29, 32285, tzinfo=<UTC>)
>>> datetime.datetime.utcnow().replace(tzinfo=pytz.utc)
datetime.datetime(2015, 2, 18, 6, 9, 30, 728550, tzinfo=<UTC>)

The pytz module allows us to make our datetime objects timezone aware and convert the times to the hundreds of timezones available in the pytz module.

One could ostensibly serialize this object for UTC time and store that in a database, but it would require far more memory and be more prone to error than simply storing the Unix Epoch time, which I demonstrated first.

The other ways of viewing times are much more error-prone, especially when dealing with data that may come from different time zones. You want there to be no confusion as to which timezone a string or serialized datetime object was intended for.

If you"re displaying the time with Python for the user, ctime works nicely, not in a table (it doesn"t typically sort well), but perhaps in a clock. However, I personally recommend, when dealing with time in Python, either using Unix time, or a timezone aware UTC datetime object.

Python >= 3.5 alternative: unpack into a list literal [*newdict]

New unpacking generalizations (PEP 448) were introduced with Python 3.5 allowing you to now easily do:

>>> newdict = {1:0, 2:0, 3:0}
>>> [*newdict]
[1, 2, 3]

Unpacking with * works with any object that is iterable and, since dictionaries return their keys when iterated through, you can easily create a list by using it within a list literal.

Adding .keys() i.e [*newdict.keys()] might help in making your intent a bit more explicit though it will cost you a function look-up and invocation. (which, in all honesty, isn"t something you should really be worried about).

The *iterable syntax is similar to doing list(iterable) and its behaviour was initially documented in the Calls section of the Python Reference manual. With PEP 448 the restriction on where *iterable could appear was loosened allowing it to also be placed in list, set and tuple literals, the reference manual on Expression lists was also updated to state this.

Though equivalent to list(newdict) with the difference that it"s faster (at least for small dictionaries) because no function call is actually performed:

%timeit [*newdict]
1000000 loops, best of 3: 249 ns per loop

%timeit list(newdict)
1000000 loops, best of 3: 508 ns per loop

%timeit [k for k in newdict]
1000000 loops, best of 3: 574 ns per loop

with larger dictionaries the speed is pretty much the same (the overhead of iterating through a large collection trumps the small cost of a function call).

In a similar fashion, you can create tuples and sets of dictionary keys:

>>> *newdict,
(1, 2, 3)
>>> {*newdict}
{1, 2, 3}

beware of the trailing comma in the tuple case!

You can use

driver.execute_script("window.scrollTo(0, Y)")

where Y is the height (on a fullhd monitor it"s 1080). (Thanks to @lukeis)

You can also use

driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

to scroll to the bottom of the page.

If you want to scroll to a page with infinite loading, like social network ones, facebook etc. (thanks to @Cuong Tran)

SCROLL_PAUSE_TIME = 0.5

# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
# Scroll down to bottom
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

time.sleep(SCROLL_PAUSE_TIME)

# Calculate new scroll height and compare with last scroll height
new_height = driver.execute_script("return document.body.scrollHeight")
if new_height == last_height:
break
last_height = new_height

another method (thanks to Juanse) is, select an object and

label.sendKeys(Keys.PAGE_DOWN);