# ML | Mathematical explanation of standard deviation and R-squared error

exp | iat | Python Methods and Functions | square

RMSE: root mean square error — it is a measure of how well the regression line fits the data points. RMSE can also be interpreted as the standard deviation of the residual.
Consider these points: (1, 1), (2, 2), (2, 3), (3, 6).
We split the above data points into the 1st list.

`  x =  [1, 2, 2, 3]  y =  [1, 2, 3, 6] `

Code: Regression Plot

 ` import ` ` matplotlib.pyplot as plt ` ` import ` ` math `   ` # dots ` ` plt. plot (x, y) `   ` # x-axis title ` ` plt.xlabel (` ` 'x - axis' ` `) `   ` # Y-axis name ` ` plt.ylabel (` ` 'y - axis' ` `) `   ` # giving a title to my graphic ` ` plt.title (` ` 'Regression Graph' ` `) ` ` `  ` # plot show function ` ` plt.show () `

Code: Average

` `

` # in the next step we will find the equation of the line of best fit # we will use the slope of a linear algebra point to find the equation of the regression line # the shape of the slope is represented by y = mx + c # where m means slope (change in y) / (change in x) # c is a constant, it represents where the line will cross the Y-axis # Slope m can be formulated like this: "" "   N m =? (xi - Xmean) (yi - Ymean) /? (xi - Xmean) ^ 2 i = 1 "" " # calculate Xmean and Ymean ct = len (x) sum_x = 0 sum_y = 0   for i in x:   sum_x = sum_x + i x_mean = sum_x / ct print ( ' Value of X mean' , x_mean)   for i in y: sum_y = sum_y + i y_mean = sum_y / ct print ( 'value of Y mean' , y_mean)   # we have x mean and y_mean `

` `

Output:

` Value of X mean 2.0 value of Y mean 3.0 `

Code: linear equation

 ` # below is the process of finding a linear equation in mathematical terms ` ` # our line's slope is 2.5 ` ` # evaluate c to find the equation `   ` m ` ` = ` ` 2.5 ` ` c ` ` = ` ` y_mean ` ` - ` ` m ` ` * ` ` x_mean ` ` print ` ` ( ` ` 'Intercept' ` `, c) `

Output:

` Intercept -2.0 `

Code: Medium square error

 ` Our regression line equation looks like this: ` ` # y_pred = 2.5x-2.0 ` ` # we name the string y_pred ` ` # insert a regression line plot ` ` from ` ` sklearn.metrics ` ` import ` ` mean_squared_error ` ` # y_pred for our exhaustive data points as shown below `   ` y ` ` = ` ` [` ` 1 ` `, ` ` 2 ` `, ` ` 3 ` `, ` ` 6 ` `] ` ` y_pred ` ` = ` ` [` ` 0.5 ` `, ` ` 3 ` `, ` ` 3 ` `, ` ` 5.5 ` `] `

 ` # sklearn root mean square ` ` mse ` ` = ` ` math.sqrt (mean_squared_error (y, y_pred)) ` ` print ` ` (` ` 'Root mean square error' ` `, mse) `

Output:

` Root mean square error 0.6123724356957945 `

Code: RMSE Calculation

 ` # let's see how the mean square is calculated mathematically ` ` # let's introduce a term called residuals ` ` # remainder - it is basically the distance from the data point to the regression line ` ` # the residuals are indicated by the red line in the graph below ` ` # RMS and residuals are calculated as shown below ` ` # we have 4 data points ` ` "" "` ` r = 1, ri = yi-y_pred ` ` y_pred is mx + c ` ` ri = yi- (mx + c) ` ` eg x = 1, we have the y value as 1 ` ` we want to evaluate exactly what our model predicted for x = 1 ` ` (1, 1) r1 = 1, x = 2 ` ` "" "` ` # y_pred1 = 1- (2.5 * 1-2.0 ) = 0.5 ` ` r1 ` ` = ` ` 1 ` ` - ` ` (` ` 2.5 ` ` * ` ` 1 ` ` - ` ` 2.0 ` `) `   ` # (2, 2) r2 = 2, x = 2 ` ` # y_pred2 = 2- (2.5 * 2-2.0) = - 1 ` ` r2 ` ` = ` ` 2 ` ` - ` ` (` ` 2.5 ` ` * ` ` 2 ` ` - ` ` 2.0 ` `) `   ` # (2, 3) r3 = 3, x = 2 ` ` # y_pred3 = 3- (2.5 * 2-2.0) = 0 ` ` r3 ` ` = ` ` 3 ` ` - ` ` (` ` 2.5 ` ` * ` ` 2 ` ` - ` ` 2.0 ` `) ` ` `  ` # (3, 6) r4 = 4, x = 3 ` ` # y_pred4 = 6- (2.5 * 3-2.0) =. 5 ` ` r4 ` ` = ` ` 6 ` ` - ` ` (` ` 2.5 ` ` * ` ` 3 ` ` - ` ` 2.0 ` ` ) `   ` # on top of the calculation we have residual values ​​` ` residuals ` ` = ` ` [` ` 0.5 ` `, ` ` - ` ` 1 ` `, ` ` 0 ` `,. ` ` 5 ` `] `   ` # now calculate the root mean square error ` ` # N = 4 tons data points ` ` N ` ` = ` ` 4 ` ` rmse ` ` = ` ` math.sqrt ((r1 ` ` * ` ` * ` ` 2 ` ` + ` ` r2 ` ` * ` ` * ` ` 2 ` ` + ` ` r3 ` ` * ` ` * ` ` 2 ` ` + ` ` r4 ` ` * ` ` * ` ` 2 ` `) ` ` / ` ` N) ` ` print ` ` (` ` 'Root Mean square error using maths' ` `, rmse) `   ` # RMS value actually calculated using math ` ` # both RMSEs are calculated the same `

Output:

` Root Mean square error using maths 0.6123724356957945 `

R-squared error or coefficient of determination
Error R2 answers the following question.
How much y changes with change in x. Basically, the percentage change in y when changing from x

## How do I merge two dictionaries in a single expression (taking union of dictionaries)?

### Question by Carl Meyer

I have two Python dictionaries, and I want to write a single expression that returns these two dictionaries, merged (i.e. taking the union). The `update()` method would be what I need, if it returned its result instead of modifying a dictionary in-place.

``````>>> x = {"a": 1, "b": 2}
>>> y = {"b": 10, "c": 11}
>>> z = x.update(y)
>>> print(z)
None
>>> x
{"a": 1, "b": 10, "c": 11}
``````

How can I get that final merged dictionary in `z`, not `x`?

(To be extra-clear, the last-one-wins conflict-handling of `dict.update()` is what I"m looking for as well.)

## List changes unexpectedly after assignment. Why is this and how can I prevent it?

While using `new_list = my_list`, any modifications to `new_list` changes `my_list` every time. Why is this, and how can I clone or copy the list to prevent it?

## Can someone explain __all__ in Python?

### Question by varikin

I have been using Python more and more, and I keep seeing the variable `__all__` set in different `__init__.py` files. Can someone explain what this does?

## How do I expand the output display to see more columns of a Pandas DataFrame?

Is there a way to widen the display of output in either interactive or script-execution mode?

Specifically, I am using the `describe()` function on a Pandas `DataFrame`. When the `DataFrame` is five columns (labels) wide, I get the descriptive statistics that I want. However, if the `DataFrame` has any more columns, the statistics are suppressed and something like this is returned:

``````>> Index: 8 entries, count to max
>> Data columns:
>> x1          8  non-null values
>> x2          8  non-null values
>> x3          8  non-null values
>> x4          8  non-null values
>> x5          8  non-null values
>> x6          8  non-null values
>> x7          8  non-null values
``````

The "8" value is given whether there are 6 or 7 columns. What does the "8" refer to?

I have already tried dragging the IDLE window larger, as well as increasing the "Configure IDLE" width options, to no avail.

My purpose in using Pandas and `describe()` is to avoid using a second program like Stata to do basic data manipulation and investigation.

## List of lists changes reflected across sublists unexpectedly

### Question by Charles Anderson

I needed to create a list of lists in Python, so I typed the following:

``````my_list = [[1] * 4] * 3
``````

The list looked like this:

``````[[1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]]
``````

Then I changed one of the innermost values:

``````my_list[0][0] = 5
``````

Now my list looks like this:

``````[[5, 1, 1, 1], [5, 1, 1, 1], [5, 1, 1, 1]]
``````

which is not what I wanted or expected. Can someone please explain what"s going on, and how to get around it?

## What does "super" do in Python? - difference between super().__init__() and explicit superclass __init__()

What"s the difference between:

``````class Child(SomeBaseClass):
def __init__(self):
super(Child, self).__init__()
``````

and:

``````class Child(SomeBaseClass):
def __init__(self):
SomeBaseClass.__init__(self)
``````

I"ve seen `super` being used quite a lot in classes with only single inheritance. I can see why you"d use it in multiple inheritance but am unclear as to what the advantages are of using it in this kind of situation.

## "is" operator behaves unexpectedly with integers

### Question by Greg Hewgill

Why does the following behave unexpectedly in Python?

``````>>> a = 256
>>> b = 256
>>> a is b
True           # This is an expected result
>>> a = 257
>>> b = 257
>>> a is b
False          # What happened here? Why is this False?
>>> 257 is 257
True           # Yet the literal numbers compare properly
``````

I am using Python 2.5.2. Trying some different versions of Python, it appears that Python 2.3.3 shows the above behaviour between 99 and 100.

Based on the above, I can hypothesize that Python is internally implemented such that "small" integers are stored in a different way than larger integers and the `is` operator can tell the difference. Why the leaky abstraction? What is a better way of comparing two arbitrary objects to see whether they are the same when I don"t know in advance whether they are numbers or not?

## How can I explicitly free memory in Python?

I wrote a Python program that acts on a large input file to create a few million objects representing triangles. The algorithm is:

2. process the file and create a list of triangles, represented by their vertices
3. output the vertices in the OFF format: a list of vertices followed by a list of triangles. The triangles are represented by indices into the list of vertices

The requirement of OFF that I print out the complete list of vertices before I print out the triangles means that I have to hold the list of triangles in memory before I write the output to file. In the meanwhile I"m getting memory errors because of the sizes of the lists.

What is the best way to tell Python that I no longer need some of the data, and it can be freed?

## Python string.replace regular expression

I have a parameter file of the form:

``````parameter-name parameter-value
``````

Where the parameters may be in any order but there is only one parameter per line. I want to replace one parameter"s `parameter-value` with a new value.

I am using a line replace function posted previously to replace the line which uses Python"s `string.replace(pattern, sub)`. The regular expression that I"m using works for instance in vim but doesn"t appear to work in `string.replace()`.

Here is the regular expression that I"m using:

``````line.replace("^.*interfaceOpDataFile.*\$/i", "interfaceOpDataFile %s" % (fileIn))
``````

Where `"interfaceOpDataFile"` is the parameter name that I"m replacing (/i for case-insensitive) and the new parameter value is the contents of the `fileIn` variable.

Is there a way to get Python to recognize this regular expression or else is there another way to accomplish this task?

## Expanding tuples into arguments

Is there a way to expand a Python tuple into a function - as actual parameters?

For example, here `expand()` does the magic:

``````some_tuple = (1, "foo", "bar")

def myfun(number, str1, str2):
return (number * 2, str1 + str2, str2 + str1)

myfun(expand(some_tuple)) # (2, "foobar", "barfoo")
``````

I know one could define `myfun` as `myfun((a, b, c))`, but of course there may be legacy code. Thanks

The Python 3 `range()` object doesn"t produce numbers immediately; it is a smart sequence object that produces numbers on demand. All it contains is your start, stop and step values, then as you iterate over the object the next integer is calculated each iteration.

The object also implements the `object.__contains__` hook, and calculates if your number is part of its range. Calculating is a (near) constant time operation *. There is never a need to scan through all possible integers in the range.

The advantage of the `range` type over a regular `list` or `tuple` is that a range object will always take the same (small) amount of memory, no matter the size of the range it represents (as it only stores the `start`, `stop` and `step` values, calculating individual items and subranges as needed).

So at a minimum, your `range()` object would do:

``````class my_range:
def __init__(self, start, stop=None, step=1, /):
if stop is None:
start, stop = 0, start
self.start, self.stop, self.step = start, stop, step
if step < 0:
lo, hi, step = stop, start, -step
else:
lo, hi = start, stop
self.length = 0 if lo > hi else ((hi - lo - 1) // step) + 1

def __iter__(self):
current = self.start
if self.step < 0:
while current > self.stop:
yield current
current += self.step
else:
while current < self.stop:
yield current
current += self.step

def __len__(self):
return self.length

def __getitem__(self, i):
if i < 0:
i += self.length
if 0 <= i < self.length:
return self.start + i * self.step
raise IndexError("my_range object index out of range")

def __contains__(self, num):
if self.step < 0:
if not (self.stop < num <= self.start):
return False
else:
if not (self.start <= num < self.stop):
return False
return (num - self.start) % self.step == 0
``````

This is still missing several things that a real `range()` supports (such as the `.index()` or `.count()` methods, hashing, equality testing, or slicing), but should give you an idea.

I also simplified the `__contains__` implementation to only focus on integer tests; if you give a real `range()` object a non-integer value (including subclasses of `int`), a slow scan is initiated to see if there is a match, just as if you use a containment test against a list of all the contained values. This was done to continue to support other numeric types that just happen to support equality testing with integers but are not expected to support integer arithmetic as well. See the original Python issue that implemented the containment test.

* Near constant time because Python integers are unbounded and so math operations also grow in time as N grows, making this a O(log N) operation. Since it‚Äôs all executed in optimised C code and Python stores integer values in 30-bit chunks, you‚Äôd run out of memory before you saw any performance impact due to the size of the integers involved here.

You have four main options for converting types in pandas:

1. `to_numeric()` - provides functionality to safely convert non-numeric types (e.g. strings) to a suitable numeric type. (See also `to_datetime()` and `to_timedelta()`.)

2. `astype()` - convert (almost) any type to (almost) any other type (even if it"s not necessarily sensible to do so). Also allows you to convert to categorial types (very useful).

3. `infer_objects()` - a utility method to convert object columns holding Python objects to a pandas type if possible.

4. `convert_dtypes()` - convert DataFrame columns to the "best possible" dtype that supports `pd.NA` (pandas" object to indicate a missing value).

Read on for more detailed explanations and usage of each of these methods.

# 1. `to_numeric()`

The best way to convert one or more columns of a DataFrame to numeric values is to use `pandas.to_numeric()`.

This function will try to change non-numeric objects (such as strings) into integers or floating point numbers as appropriate.

## Basic usage

The input to `to_numeric()` is a Series or a single column of a DataFrame.

``````>>> s = pd.Series(["8", 6, "7.5", 3, "0.9"]) # mixed string and numeric values
>>> s
0      8
1      6
2    7.5
3      3
4    0.9
dtype: object

>>> pd.to_numeric(s) # convert everything to float values
0    8.0
1    6.0
2    7.5
3    3.0
4    0.9
dtype: float64
``````

As you can see, a new Series is returned. Remember to assign this output to a variable or column name to continue using it:

``````# convert Series
my_series = pd.to_numeric(my_series)

# convert column "a" of a DataFrame
df["a"] = pd.to_numeric(df["a"])
``````

You can also use it to convert multiple columns of a DataFrame via the `apply()` method:

``````# convert all columns of DataFrame
df = df.apply(pd.to_numeric) # convert all columns of DataFrame

# convert just columns "a" and "b"
df[["a", "b"]] = df[["a", "b"]].apply(pd.to_numeric)
``````

As long as your values can all be converted, that"s probably all you need.

## Error handling

But what if some values can"t be converted to a numeric type?

`to_numeric()` also takes an `errors` keyword argument that allows you to force non-numeric values to be `NaN`, or simply ignore columns containing these values.

Here"s an example using a Series of strings `s` which has the object dtype:

``````>>> s = pd.Series(["1", "2", "4.7", "pandas", "10"])
>>> s
0         1
1         2
2       4.7
3    pandas
4        10
dtype: object
``````

The default behaviour is to raise if it can"t convert a value. In this case, it can"t cope with the string "pandas":

``````>>> pd.to_numeric(s) # or pd.to_numeric(s, errors="raise")
ValueError: Unable to parse string
``````

Rather than fail, we might want "pandas" to be considered a missing/bad numeric value. We can coerce invalid values to `NaN` as follows using the `errors` keyword argument:

``````>>> pd.to_numeric(s, errors="coerce")
0     1.0
1     2.0
2     4.7
3     NaN
4    10.0
dtype: float64
``````

The third option for `errors` is just to ignore the operation if an invalid value is encountered:

``````>>> pd.to_numeric(s, errors="ignore")
# the original Series is returned untouched
``````

This last option is particularly useful when you want to convert your entire DataFrame, but don"t not know which of our columns can be converted reliably to a numeric type. In that case just write:

``````df.apply(pd.to_numeric, errors="ignore")
``````

The function will be applied to each column of the DataFrame. Columns that can be converted to a numeric type will be converted, while columns that cannot (e.g. they contain non-digit strings or dates) will be left alone.

## Downcasting

By default, conversion with `to_numeric()` will give you either a `int64` or `float64` dtype (or whatever integer width is native to your platform).

That"s usually what you want, but what if you wanted to save some memory and use a more compact dtype, like `float32`, or `int8`?

`to_numeric()` gives you the option to downcast to either "integer", "signed", "unsigned", "float". Here"s an example for a simple series `s` of integer type:

``````>>> s = pd.Series([1, 2, -7])
>>> s
0    1
1    2
2   -7
dtype: int64
``````

Downcasting to "integer" uses the smallest possible integer that can hold the values:

``````>>> pd.to_numeric(s, downcast="integer")
0    1
1    2
2   -7
dtype: int8
``````

Downcasting to "float" similarly picks a smaller than normal floating type:

``````>>> pd.to_numeric(s, downcast="float")
0    1.0
1    2.0
2   -7.0
dtype: float32
``````

# 2. `astype()`

The `astype()` method enables you to be explicit about the dtype you want your DataFrame or Series to have. It"s very versatile in that you can try and go from one type to the any other.

## Basic usage

Just pick a type: you can use a NumPy dtype (e.g. `np.int16`), some Python types (e.g. bool), or pandas-specific types (like the categorical dtype).

Call the method on the object you want to convert and `astype()` will try and convert it for you:

``````# convert all DataFrame columns to the int64 dtype
df = df.astype(int)

# convert column "a" to int64 dtype and "b" to complex type
df = df.astype({"a": int, "b": complex})

# convert Series to float16 type
s = s.astype(np.float16)

# convert Series to Python strings
s = s.astype(str)

# convert Series to categorical type - see docs for more details
s = s.astype("category")
``````

Notice I said "try" - if `astype()` does not know how to convert a value in the Series or DataFrame, it will raise an error. For example if you have a `NaN` or `inf` value you"ll get an error trying to convert it to an integer.

As of pandas 0.20.0, this error can be suppressed by passing `errors="ignore"`. Your original object will be return untouched.

## Be careful

`astype()` is powerful, but it will sometimes convert values "incorrectly". For example:

``````>>> s = pd.Series([1, 2, -7])
>>> s
0    1
1    2
2   -7
dtype: int64
``````

These are small integers, so how about converting to an unsigned 8-bit type to save memory?

``````>>> s.astype(np.uint8)
0      1
1      2
2    249
dtype: uint8
``````

The conversion worked, but the -7 was wrapped round to become 249 (i.e. 28 - 7)!

Trying to downcast using `pd.to_numeric(s, downcast="unsigned")` instead could help prevent this error.

# 3. `infer_objects()`

Version 0.21.0 of pandas introduced the method `infer_objects()` for converting columns of a DataFrame that have an object datatype to a more specific type (soft conversions).

For example, here"s a DataFrame with two columns of object type. One holds actual integers and the other holds strings representing integers:

``````>>> df = pd.DataFrame({"a": [7, 1, 5], "b": ["3","2","1"]}, dtype="object")
>>> df.dtypes
a    object
b    object
dtype: object
``````

Using `infer_objects()`, you can change the type of column "a" to int64:

``````>>> df = df.infer_objects()
>>> df.dtypes
a     int64
b    object
dtype: object
``````

Column "b" has been left alone since its values were strings, not integers. If you wanted to try and force the conversion of both columns to an integer type, you could use `df.astype(int)` instead.

# 4. `convert_dtypes()`

Version 1.0 and above includes a method `convert_dtypes()` to convert Series and DataFrame columns to the best possible dtype that supports the `pd.NA` missing value.

Here "best possible" means the type most suited to hold the values. For example, this a pandas integer type if all of the values are integers (or missing values): an object column of Python integer objects is converted to `Int64`, a column of NumPy `int32` values will become the pandas dtype `Int32`.

With our `object` DataFrame `df`, we get the following result:

``````>>> df.convert_dtypes().dtypes
a     Int64
b    string
dtype: object
``````

Since column "a" held integer values, it was converted to the `Int64` type (which is capable of holding missing values, unlike `int64`).

Column "b" contained string objects, so was changed to pandas" `string` dtype.

By default, this method will infer the type from object values in each column. We can change this by passing `infer_objects=False`:

``````>>> df.convert_dtypes(infer_objects=False).dtypes
a    object
b    string
dtype: object
``````

Now column "a" remained an object column: pandas knows it can be described as an "integer" column (internally it ran `infer_dtype`) but didn"t infer exactly what dtype of integer it should have so did not convert it. Column "b" was again converted to "string" dtype as it was recognised as holding "string" values.

## How to iterate over rows in a DataFrame in Pandas?

Iteration in Pandas is an anti-pattern and is something you should only do when you have exhausted every other option. You should not use any function with "`iter`" in its name for more than a few thousand rows or you will have to get used to a lot of waiting.

Do you want to print a DataFrame? Use `DataFrame.to_string()`.

Do you want to compute something? In that case, search for methods in this order (list modified from here):

1. Vectorization
2. Cython routines
3. List Comprehensions (vanilla `for` loop)
4. `DataFrame.apply()`: i) ¬†Reductions that can be performed in Cython, ii) Iteration in Python space
5. `DataFrame.itertuples()` and `iteritems()`
6. `DataFrame.iterrows()`

`iterrows` and `itertuples` (both receiving many votes in answers to this question) should be used in very rare circumstances, such as generating row objects/nametuples for sequential processing, which is really the only thing these functions are useful for.

Appeal to Authority

The documentation page on iteration has a huge red warning box that says:

Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is not needed [...].

* It"s actually a little more complicated than "don"t". `df.iterrows()` is the correct answer to this question, but "vectorize your ops" is the better one. I will concede that there are circumstances where iteration cannot be avoided (for example, some operations where the result depends on the value computed for the previous row). However, it takes some familiarity with the library to know when. If you"re not sure whether you need an iterative solution, you probably don"t. PS: To know more about my rationale for writing this answer, skip to the very bottom.

## Faster than Looping: Vectorization, Cython

A good number of basic operations and computations are "vectorised" by pandas (either through NumPy, or through Cythonized functions). This includes arithmetic, comparisons, (most) reductions, reshaping (such as pivoting), joins, and groupby operations. Look through the documentation on Essential Basic Functionality to find a suitable vectorised method for your problem.

If none exists, feel free to write your own using custom Cython extensions.

## Next Best Thing: List Comprehensions*

List comprehensions should be your next port of call if 1) there is no vectorized solution available, 2) performance is important, but not important enough to go through the hassle of cythonizing your code, and 3) you"re trying to perform elementwise transformation on your code. There is a good amount of evidence to suggest that list comprehensions are sufficiently fast (and even sometimes faster) for many common Pandas tasks.

The formula is simple,

``````# Iterating over one column - `f` is some function that processes your data
result = [f(x) for x in df["col"]]
# Iterating over two columns, use `zip`
result = [f(x, y) for x, y in zip(df["col1"], df["col2"])]
# Iterating over multiple columns - same data type
result = [f(row[0], ..., row[n]) for row in df[["col1", ...,"coln"]].to_numpy()]
# Iterating over multiple columns - differing data type
result = [f(row[0], ..., row[n]) for row in zip(df["col1"], ..., df["coln"])]
``````

If you can encapsulate your business logic into a function, you can use a list comprehension that calls it. You can make arbitrarily complex things work through the simplicity and speed of raw Python code.

Caveats

List comprehensions assume that your data is easy to work with - what that means is your data types are consistent and you don"t have NaNs, but this cannot always be guaranteed.

1. The first one is more obvious, but when dealing with NaNs, prefer in-built pandas methods if they exist (because they have much better corner-case handling logic), or ensure your business logic includes appropriate NaN handling logic.
2. When dealing with mixed data types you should iterate over `zip(df["A"], df["B"], ...)` instead of `df[["A", "B"]].to_numpy()` as the latter implicitly upcasts data to the most common type. As an example if A is numeric and B is string, `to_numpy()` will cast the entire array to string, which may not be what you want. Fortunately `zip`ping your columns together is the most straightforward workaround to this.

*Your mileage may vary for the reasons outlined in the Caveats section above.

## An Obvious Example

Let"s demonstrate the difference with a simple example of adding two pandas columns `A + B`. This is a vectorizable operaton, so it will be easy to contrast the performance of the methods discussed above.

Benchmarking code, for your reference. The line at the bottom measures a function written in numpandas, a style of Pandas that mixes heavily with NumPy to squeeze out maximum performance. Writing numpandas code should be avoided unless you know what you"re doing. Stick to the API where you can (i.e., prefer `vec` over `vec_numpy`).

I should mention, however, that it isn"t always this cut and dry. Sometimes the answer to "what is the best method for an operation" is "it depends on your data". My advice is to test out different approaches on your data before settling on one.

* Pandas string methods are "vectorized" in the sense that they are specified on the series but operate on each element. The underlying mechanisms are still iterative, because string operations are inherently hard to vectorize.

## Why I Wrote this Answer

A common trend I notice from new users is to ask questions of the form "How can I iterate over my df to do X?". Showing code that calls `iterrows()` while doing something inside a `for` loop. Here is why. A new user to the library who has not been introduced to the concept of vectorization will likely envision the code that solves their problem as iterating over their data to do something. Not knowing how to iterate over a DataFrame, the first thing they do is Google it and end up here, at this question. They then see the accepted answer telling them how to, and they close their eyes and run this code without ever first questioning if iteration is not the right thing to do.

The aim of this answer is to help new users understand that iteration is not necessarily the solution to every problem, and that better, faster and more idiomatic solutions could exist, and that it is worth investing time in exploring them. I"m not trying to start a war of iteration vs. vectorization, but I want new users to be informed when developing solutions to their problems with this library.

# In Python, what is the purpose of `__slots__` and what are the cases one should avoid this?

## TLDR:

The special attribute `__slots__` allows you to explicitly state which instance attributes you expect your object instances to have, with the expected results:

1. faster attribute access.
2. space savings in memory.

The space savings is from

1. Storing value references in slots instead of `__dict__`.
2. Denying `__dict__` and `__weakref__` creation if parent classes deny them and you declare `__slots__`.

### Quick Caveats

Small caveat, you should only declare a particular slot one time in an inheritance tree. For example:

``````class Base:
__slots__ = "foo", "bar"

class Right(Base):
__slots__ = "baz",

class Wrong(Base):
__slots__ = "foo", "bar", "baz"        # redundant foo and bar
``````

Python doesn"t object when you get this wrong (it probably should), problems might not otherwise manifest, but your objects will take up more space than they otherwise should. Python 3.8:

``````>>> from sys import getsizeof
>>> getsizeof(Right()), getsizeof(Wrong())
(56, 72)
``````

This is because the Base"s slot descriptor has a slot separate from the Wrong"s. This shouldn"t usually come up, but it could:

``````>>> w = Wrong()
>>> w.foo = "foo"
>>> Base.foo.__get__(w)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: foo
>>> Wrong.foo.__get__(w)
"foo"
``````

The biggest caveat is for multiple inheritance - multiple "parent classes with nonempty slots" cannot be combined.

To accommodate this restriction, follow best practices: Factor out all but one or all parents" abstraction which their concrete class respectively and your new concrete class collectively will inherit from - giving the abstraction(s) empty slots (just like abstract base classes in the standard library).

See section on multiple inheritance below for an example.

### Requirements:

• To have attributes named in `__slots__` to actually be stored in slots instead of a `__dict__`, a class must inherit from `object` (automatic in Python 3, but must be explicit in Python 2).

• To prevent the creation of a `__dict__`, you must inherit from `object` and all classes in the inheritance must declare `__slots__` and none of them can have a `"__dict__"` entry.

There are a lot of details if you wish to keep reading.

## Why use `__slots__`: Faster attribute access.

The creator of Python, Guido van Rossum, states that he actually created `__slots__` for faster attribute access.

It is trivial to demonstrate measurably significant faster access:

``````import timeit

class Foo(object): __slots__ = "foo",

class Bar(object): pass

slotted = Foo()
not_slotted = Bar()

def get_set_delete_fn(obj):
def get_set_delete():
obj.foo = "foo"
obj.foo
del obj.foo
return get_set_delete
``````

and

``````>>> min(timeit.repeat(get_set_delete_fn(slotted)))
0.2846834529991611
>>> min(timeit.repeat(get_set_delete_fn(not_slotted)))
0.3664822799983085
``````

The slotted access is almost 30% faster in Python 3.5 on Ubuntu.

``````>>> 0.3664822799983085 / 0.2846834529991611
1.2873325658284342
``````

In Python 2 on Windows I have measured it about 15% faster.

## Why use `__slots__`: Memory Savings

Another purpose of `__slots__` is to reduce the space in memory that each object instance takes up.

The space saved over using `__dict__` can be significant.

SQLAlchemy attributes a lot of memory savings to `__slots__`.

To verify this, using the Anaconda distribution of Python 2.7 on Ubuntu Linux, with `guppy.hpy` (aka heapy) and `sys.getsizeof`, the size of a class instance without `__slots__` declared, and nothing else, is 64 bytes. That does not include the `__dict__`. Thank you Python for lazy evaluation again, the `__dict__` is apparently not called into existence until it is referenced, but classes without data are usually useless. When called into existence, the `__dict__` attribute is a minimum of 280 bytes additionally.

In contrast, a class instance with `__slots__` declared to be `()` (no data) is only 16 bytes, and 56 total bytes with one item in slots, 64 with two.

For 64 bit Python, I illustrate the memory consumption in bytes in Python 2.7 and 3.6, for `__slots__` and `__dict__` (no slots defined) for each point where the dict grows in 3.6 (except for 0, 1, and 2 attributes):

``````       Python 2.7             Python 3.6
attrs  __slots__  __dict__*   __slots__  __dict__* | *(no slots defined)
none   16         56 + 272‚Ä†   16         56 + 112‚Ä† | ‚Ä†if __dict__ referenced
one    48         56 + 272    48         56 + 112
two    56         56 + 272    56         56 + 112
six    88         56 + 1040   88         56 + 152
11     128        56 + 1040   128        56 + 240
22     216        56 + 3344   216        56 + 408
43     384        56 + 3344   384        56 + 752
``````

So, in spite of smaller dicts in Python 3, we see how nicely `__slots__` scale for instances to save us memory, and that is a major reason you would want to use `__slots__`.

Just for completeness of my notes, note that there is a one-time cost per slot in the class"s namespace of 64 bytes in Python 2, and 72 bytes in Python 3, because slots use data descriptors like properties, called "members".

``````>>> Foo.foo
<member "foo" of "Foo" objects>
>>> type(Foo.foo)
<class "member_descriptor">
>>> getsizeof(Foo.foo)
72
``````

## Demonstration of `__slots__`:

To deny the creation of a `__dict__`, you must subclass `object`. Everything subclasses `object` in Python 3, but in Python 2 you had to be explicit:

``````class Base(object):
__slots__ = ()
``````

now:

``````>>> b = Base()
>>> b.a = "a"
Traceback (most recent call last):
File "<pyshell#38>", line 1, in <module>
b.a = "a"
AttributeError: "Base" object has no attribute "a"
``````

Or subclass another class that defines `__slots__`

``````class Child(Base):
__slots__ = ("a",)
``````

and now:

``````c = Child()
c.a = "a"
``````

but:

``````>>> c.b = "b"
Traceback (most recent call last):
File "<pyshell#42>", line 1, in <module>
c.b = "b"
AttributeError: "Child" object has no attribute "b"
``````

To allow `__dict__` creation while subclassing slotted objects, just add `"__dict__"` to the `__slots__` (note that slots are ordered, and you shouldn"t repeat slots that are already in parent classes):

``````class SlottedWithDict(Child):
__slots__ = ("__dict__", "b")

swd = SlottedWithDict()
swd.a = "a"
swd.b = "b"
swd.c = "c"
``````

and

``````>>> swd.__dict__
{"c": "c"}
``````

Or you don"t even need to declare `__slots__` in your subclass, and you will still use slots from the parents, but not restrict the creation of a `__dict__`:

``````class NoSlots(Child): pass
ns = NoSlots()
ns.a = "a"
ns.b = "b"
``````

And:

``````>>> ns.__dict__
{"b": "b"}
``````

However, `__slots__` may cause problems for multiple inheritance:

``````class BaseA(object):
__slots__ = ("a",)

class BaseB(object):
__slots__ = ("b",)
``````

Because creating a child class from parents with both non-empty slots fails:

``````>>> class Child(BaseA, BaseB): __slots__ = ()
Traceback (most recent call last):
File "<pyshell#68>", line 1, in <module>
class Child(BaseA, BaseB): __slots__ = ()
TypeError: Error when calling the metaclass bases
multiple bases have instance lay-out conflict
``````

If you run into this problem, You could just remove `__slots__` from the parents, or if you have control of the parents, give them empty slots, or refactor to abstractions:

``````from abc import ABC

class AbstractA(ABC):
__slots__ = ()

class BaseA(AbstractA):
__slots__ = ("a",)

class AbstractB(ABC):
__slots__ = ()

class BaseB(AbstractB):
__slots__ = ("b",)

class Child(AbstractA, AbstractB):
__slots__ = ("a", "b")

c = Child() # no problem!
``````

### Add `"__dict__"` to `__slots__` to get dynamic assignment:

``````class Foo(object):
__slots__ = "bar", "baz", "__dict__"
``````

and now:

``````>>> foo = Foo()
>>> foo.boink = "boink"
``````

So with `"__dict__"` in slots we lose some of the size benefits with the upside of having dynamic assignment and still having slots for the names we do expect.

When you inherit from an object that isn"t slotted, you get the same sort of semantics when you use `__slots__` - names that are in `__slots__` point to slotted values, while any other values are put in the instance"s `__dict__`.

Avoiding `__slots__` because you want to be able to add attributes on the fly is actually not a good reason - just add `"__dict__"` to your `__slots__` if this is required.

You can similarly add `__weakref__` to `__slots__` explicitly if you need that feature.

### Set to empty tuple when subclassing a namedtuple:

The namedtuple builtin make immutable instances that are very lightweight (essentially, the size of tuples) but to get the benefits, you need to do it yourself if you subclass them:

``````from collections import namedtuple
class MyNT(namedtuple("MyNT", "bar baz")):
"""MyNT is an immutable and lightweight object"""
__slots__ = ()
``````

usage:

``````>>> nt = MyNT("bar", "baz")
>>> nt.bar
"bar"
>>> nt.baz
"baz"
``````

And trying to assign an unexpected attribute raises an `AttributeError` because we have prevented the creation of `__dict__`:

``````>>> nt.quux = "quux"
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: "MyNT" object has no attribute "quux"
``````

You can allow `__dict__` creation by leaving off `__slots__ = ()`, but you can"t use non-empty `__slots__` with subtypes of tuple.

## Biggest Caveat: Multiple inheritance

Even when non-empty slots are the same for multiple parents, they cannot be used together:

``````class Foo(object):
__slots__ = "foo", "bar"
class Bar(object):
__slots__ = "foo", "bar" # alas, would work if empty, i.e. ()

>>> class Baz(Foo, Bar): pass
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: Error when calling the metaclass bases
multiple bases have instance lay-out conflict
``````

Using an empty `__slots__` in the parent seems to provide the most flexibility, allowing the child to choose to prevent or allow (by adding `"__dict__"` to get dynamic assignment, see section above) the creation of a `__dict__`:

``````class Foo(object): __slots__ = ()
class Bar(object): __slots__ = ()
class Baz(Foo, Bar): __slots__ = ("foo", "bar")
b = Baz()
b.foo, b.bar = "foo", "bar"
``````

You don"t have to have slots - so if you add them, and remove them later, it shouldn"t cause any problems.

Going out on a limb here: If you"re composing mixins or using abstract base classes, which aren"t intended to be instantiated, an empty `__slots__` in those parents seems to be the best way to go in terms of flexibility for subclassers.

To demonstrate, first, let"s create a class with code we"d like to use under multiple inheritance

``````class AbstractBase:
__slots__ = ()
def __init__(self, a, b):
self.a = a
self.b = b
def __repr__(self):
return f"{type(self).__name__}({repr(self.a)}, {repr(self.b)})"
``````

We could use the above directly by inheriting and declaring the expected slots:

``````class Foo(AbstractBase):
__slots__ = "a", "b"
``````

But we don"t care about that, that"s trivial single inheritance, we need another class we might also inherit from, maybe with a noisy attribute:

``````class AbstractBaseC:
__slots__ = ()
@property
def c(self):
print("getting c!")
return self._c
@c.setter
def c(self, arg):
print("setting c!")
self._c = arg
``````

Now if both bases had nonempty slots, we couldn"t do the below. (In fact, if we wanted, we could have given `AbstractBase` nonempty slots a and b, and left them out of the below declaration - leaving them in would be wrong):

``````class Concretion(AbstractBase, AbstractBaseC):
__slots__ = "a b _c".split()
``````

And now we have functionality from both via multiple inheritance, and can still deny `__dict__` and `__weakref__` instantiation:

``````>>> c = Concretion("a", "b")
>>> c.c = c
setting c!
>>> c.c
getting c!
Concretion("a", "b")
>>> c.d = "d"
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: "Concretion" object has no attribute "d"
``````

## Other cases to avoid slots:

• Avoid them when you want to perform `__class__` assignment with another class that doesn"t have them (and you can"t add them) unless the slot layouts are identical. (I am very interested in learning who is doing this and why.)
• Avoid them if you want to subclass variable length builtins like long, tuple, or str, and you want to add attributes to them.
• Avoid them if you insist on providing default values via class attributes for instance variables.

You may be able to tease out further caveats from the rest of the `__slots__` documentation (the 3.7 dev docs are the most current), which I have made significant recent contributions to.

The current top answers cite outdated information and are quite hand-wavy and miss the mark in some important ways.

### Do not "only use `__slots__` when instantiating lots of objects"

I quote:

"You would want to use `__slots__` if you are going to instantiate a lot (hundreds, thousands) of objects of the same class."

Abstract Base Classes, for example, from the `collections` module, are not instantiated, yet `__slots__` are declared for them.

Why?

If a user wishes to deny `__dict__` or `__weakref__` creation, those things must not be available in the parent classes.

`__slots__` contributes to reusability when creating interfaces or mixins.

It is true that many Python users aren"t writing for reusability, but when you are, having the option to deny unnecessary space usage is valuable.

### `__slots__` doesn"t break pickling

When pickling a slotted object, you may find it complains with a misleading `TypeError`:

``````>>> pickle.loads(pickle.dumps(f))
TypeError: a class that defines __slots__ without defining __getstate__ cannot be pickled
``````

This is actually incorrect. This message comes from the oldest protocol, which is the default. You can select the latest protocol with the `-1` argument. In Python 2.7 this would be `2` (which was introduced in 2.3), and in 3.6 it is `4`.

``````>>> pickle.loads(pickle.dumps(f, -1))
<__main__.Foo object at 0x1129C770>
``````

in Python 2.7:

``````>>> pickle.loads(pickle.dumps(f, 2))
<__main__.Foo object at 0x1129C770>
``````

in Python 3.6

``````>>> pickle.loads(pickle.dumps(f, 4))
<__main__.Foo object at 0x1129C770>
``````

So I would keep this in mind, as it is a solved problem.

## Critique of the (until Oct 2, 2016) accepted answer

The first paragraph is half short explanation, half predictive. Here"s the only part that actually answers the question

The proper use of `__slots__` is to save space in objects. Instead of having a dynamic dict that allows adding attributes to objects at anytime, there is a static structure which does not allow additions after creation. This saves the overhead of one dict for every object that uses slots

The second half is wishful thinking, and off the mark:

While this is sometimes a useful optimization, it would be completely unnecessary if the Python interpreter was dynamic enough so that it would only require the dict when there actually were additions to the object.

Python actually does something similar to this, only creating the `__dict__` when it is accessed, but creating lots of objects with no data is fairly ridiculous.

The second paragraph oversimplifies and misses actual reasons to avoid `__slots__`. The below is not a real reason to avoid slots (for actual reasons, see the rest of my answer above.):

They change the behavior of the objects that have slots in a way that can be abused by control freaks and static typing weenies.

It then goes on to discuss other ways of accomplishing that perverse goal with Python, not discussing anything to do with `__slots__`.

The third paragraph is more wishful thinking. Together it is mostly off-the-mark content that the answerer didn"t even author and contributes to ammunition for critics of the site.

# Memory usage evidence

Create some normal objects and slotted objects:

``````>>> class Foo(object): pass
>>> class Bar(object): __slots__ = ()
``````

Instantiate a million of them:

``````>>> foos = [Foo() for f in xrange(1000000)]
>>> bars = [Bar() for b in xrange(1000000)]
``````

Inspect with `guppy.hpy().heap()`:

``````>>> guppy.hpy().heap()
Partition of a set of 2028259 objects. Total size = 99763360 bytes.
Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
0 1000000  49 64000000  64  64000000  64 __main__.Foo
1     169   0 16281480  16  80281480  80 list
2 1000000  49 16000000  16  96281480  97 __main__.Bar
3   12284   1   987472   1  97268952  97 str
...
``````

Access the regular objects and their `__dict__` and inspect again:

``````>>> for f in foos:
...     f.__dict__
>>> guppy.hpy().heap()
Partition of a set of 3028258 objects. Total size = 379763480 bytes.
Index  Count   %      Size    % Cumulative  % Kind (class / dict of class)
0 1000000  33 280000000  74 280000000  74 dict of __main__.Foo
1 1000000  33  64000000  17 344000000  91 __main__.Foo
2     169   0  16281480   4 360281480  95 list
3 1000000  33  16000000   4 376281480  99 __main__.Bar
4   12284   0    987472   0 377268952  99 str
...
``````

This is consistent with the history of Python, from Unifying types and classes in Python 2.2

If you subclass a built-in type, extra space is automatically added to the instances to accomodate `__dict__` and `__weakrefs__`. (The `__dict__` is not initialized until you use it though, so you shouldn"t worry about the space occupied by an empty dictionary for each instance you create.) If you don"t need this extra space, you can add the phrase "`__slots__ = []`" to your class.

# `os.listdir()` - list in the current directory

With listdir in os module you get the files and the folders in the current dir

`````` import os
arr = os.listdir()
print(arr)

>>> ["\$RECYCLE.BIN", "work.txt", "3ebooks.txt", "documents"]
``````

## Looking in a directory

``````arr = os.listdir("c:\files")
``````

# `glob` from glob

with glob you can specify a type of file to list like this

``````import glob

txtfiles = []
for file in glob.glob("*.txt"):
txtfiles.append(file)
``````

## `glob` in a list comprehension

``````mylist = [f for f in glob.glob("*.txt")]
``````

## get the full path of only files in the current directory

``````import os
from os import listdir
from os.path import isfile, join

cwd = os.getcwd()
onlyfiles = [os.path.join(cwd, f) for f in os.listdir(cwd) if
os.path.isfile(os.path.join(cwd, f))]
print(onlyfiles)

["G:\getfilesname\getfilesname.py", "G:\getfilesname\example.txt"]
``````

## Getting the full path name with `os.path.abspath`

You get the full path in return

`````` import os
files_path = [os.path.abspath(x) for x in os.listdir()]
print(files_path)

["F:\documentiapplications.txt", "F:\documenticollections.txt"]
``````

## Walk: going through sub directories

os.walk returns the root, the directories list and the files list, that is why I unpacked them in r, d, f in the for loop; it, then, looks for other files and directories in the subfolders of the root and so on until there are no subfolders.

``````import os

# Getting the current work directory (cwd)
thisdir = os.getcwd()

# r=root, d=directories, f = files
for r, d, f in os.walk(thisdir):
for file in f:
if file.endswith(".docx"):
print(os.path.join(r, file))
``````

### `os.listdir()`: get files in the current directory (Python 2)

In Python 2, if you want the list of the files in the current directory, you have to give the argument as "." or os.getcwd() in the os.listdir method.

`````` import os
arr = os.listdir(".")
print(arr)

>>> ["\$RECYCLE.BIN", "work.txt", "3ebooks.txt", "documents"]
``````

### To go up in the directory tree

``````# Method 1
x = os.listdir("..")

# Method 2
x= os.listdir("/")
``````

### Get files: `os.listdir()` in a particular directory (Python 2 and 3)

`````` import os
arr = os.listdir("F:\python")
print(arr)

>>> ["\$RECYCLE.BIN", "work.txt", "3ebooks.txt", "documents"]
``````

### Get files of a particular subdirectory with `os.listdir()`

``````import os

x = os.listdir("./content")
``````

### `os.walk(".")` - current directory

`````` import os
arr = next(os.walk("."))[2]
print(arr)

>>> ["5bs_Turismo1.pdf", "5bs_Turismo1.pptx", "esperienza.txt"]
``````

### `next(os.walk("."))` and `os.path.join("dir", "file")`

`````` import os
arr = []
for d,r,f in next(os.walk("F:\_python")):
for file in f:
arr.append(os.path.join(r,file))

for f in arr:
print(files)

>>> F:\_python\dict_class.py
>>> F:\_python\programmi.txt
``````

### `next(os.walk("F:\")` - get the full path - list comprehension

`````` [os.path.join(r,file) for r,d,f in next(os.walk("F:\_python")) for file in f]

>>> ["F:\_python\dict_class.py", "F:\_python\programmi.txt"]
``````

### `os.walk` - get full path - all files in sub dirs**

``````x = [os.path.join(r,file) for r,d,f in os.walk("F:\_python") for file in f]
print(x)

``````

### `os.listdir()` - get only txt files

`````` arr_txt = [x for x in os.listdir() if x.endswith(".txt")]
print(arr_txt)

>>> ["work.txt", "3ebooks.txt"]
``````

## Using `glob` to get the full path of the files

If I should need the absolute path of the files:

``````from path import path
from glob import glob
x = [path(f).abspath() for f in glob("F:\*.txt")]
for f in x:
print(f)

>>> F:acquistionline.txt
>>> F:acquisti_2018.txt
>>> F:ootstrap_jquery_ecc.txt
``````

## Using `os.path.isfile` to avoid directories in the list

``````import os.path
listOfFiles = [f for f in os.listdir() if os.path.isfile(f)]
print(listOfFiles)

>>> ["a simple game.py", "data.txt", "decorator.py"]
``````

## Using `pathlib` from Python 3.4

``````import pathlib

flist = []
for p in pathlib.Path(".").iterdir():
if p.is_file():
print(p)
flist.append(p)

>>> error.PNG
>>> exemaker.bat
>>> guiprova.mp3
>>> setup.py
>>> speak_gui2.py
>>> thumb.PNG
``````

With `list comprehension`:

``````flist = [p for p in pathlib.Path(".").iterdir() if p.is_file()]
``````

Alternatively, use `pathlib.Path()` instead of `pathlib.Path(".")`

## Use glob method in pathlib.Path()

``````import pathlib

py = pathlib.Path().glob("*.py")
for file in py:
print(file)

>>> stack_overflow_list.py
>>> stack_overflow_list_tkinter.py
``````

## Get all and only files with os.walk

``````import os
x = [i[2] for i in os.walk(".")]
y=[]
for t in x:
for f in t:
y.append(f)
print(y)

>>> ["append_to_list.py", "data.txt", "data1.txt", "data2.txt", "data_180617", "os_walk.py", "READ2.py", "read_data.py", "somma_defaltdic.py", "substitute_words.py", "sum_data.py", "data.txt", "data1.txt", "data_180617"]
``````

## Get only files with next and walk in a directory

`````` import os
x = next(os.walk("F://python"))[2]
print(x)

>>> ["calculator.bat","calculator.py"]
``````

## Get only directories with next and walk in a directory

`````` import os
next(os.walk("F://python"))[1] # for the current dir use (".")

>>> ["python3","others"]
``````

## Get all the subdir names with `walk`

``````for r,d,f in os.walk("F:\_python"):
for dirs in d:
print(dirs)

>>> .vscode
>>> pyexcel
>>> pyschool.py
>>> subtitles
>>> _metaprogramming
>>> .ipynb_checkpoints
``````

## `os.scandir()` from Python 3.5 and greater

``````import os
x = [f.name for f in os.scandir() if f.is_file()]
print(x)

>>> ["calculator.bat","calculator.py"]

# Another example with scandir (a little variation from docs.python.org)
# This one is more efficient than os.listdir.
# In this case, it shows the files only in the current directory
# where the script is executed.

import os
with os.scandir() as i:
for entry in i:
if entry.is_file():
print(entry.name)

>>> ebookmaker.py
>>> error.PNG
>>> exemaker.bat
>>> guiprova.mp3
>>> setup.py
>>> speakgui4.py
>>> speak_gui2.py
>>> speak_gui3.py
>>> thumb.PNG
``````

# Examples:

## Ex. 1: How many files are there in the subdirectories?

In this example, we look for the number of files that are included in all the directory and its subdirectories.

``````import os

def count(dir, counter=0):
"returns number of files in dir and subdirs"
for pack in os.walk(dir):
for f in pack[2]:
counter += 1
return dir + " : " + str(counter) + "files"

print(count("F:\python"))

>>> "F:\python" : 12057 files"
``````

## Ex.2: How to copy all files from a directory to another?

A script to make order in your computer finding all files of a type (default: pptx) and copying them in a new folder.

``````import os
import shutil
from path import path

destination = "F:\file_copied"
# os.makedirs(destination)

def copyfile(dir, filetype="pptx", counter=0):
"Searches for pptx (or other - pptx is the default) files and copies them"
for pack in os.walk(dir):
for f in pack[2]:
if f.endswith(filetype):
fullpath = pack[0] + "\" + f
print(fullpath)
shutil.copy(fullpath, destination)
counter += 1
if counter > 0:
print("-" * 30)
print("	==> Found in: `" + dir + "` : " + str(counter) + " files
")

for dir in os.listdir():
"searches for folders that starts with `_`"
if dir[0] == "_":
# copyfile(dir, filetype="pdf")
copyfile(dir, filetype="txt")

>>> _compiti18Compito Contabilit√† 1conti.txt
>>> _compiti18Compito Contabilit√† 1modula4.txt
>>> _compiti18Compito Contabilit√† 1moduloa4.txt
>>> ------------------------
>>> ==> Found in: `_compiti18` : 3 files
``````

## Ex. 3: How to get all the files in a txt file

In case you want to create a txt file with all the file names:

``````import os
mylist = ""
with open("filelist.txt", "w", encoding="utf-8") as file:
for eachfile in os.listdir():
mylist += eachfile + "
"
file.write(mylist)
``````

## Example: txt with all the files of an hard drive

``````"""
We are going to save a txt file with all the files in your directory.
We will use the function walk()
"""

import os

# see all the methods of os
# print(*dir(os), sep=", ")
listafile = []
percorso = []
with open("lista_file.txt", "w", encoding="utf-8") as testo:
for root, dirs, files in os.walk("D:\"):
for file in files:
listafile.append(file)
percorso.append(root + "\" + file)
testo.write(file + "
")
listafile.sort()
print("N. of files", len(listafile))
with open("lista_file_ordinata.txt", "w", encoding="utf-8") as testo_ordinato:
for file in listafile:
testo_ordinato.write(file + "
")

with open("percorso.txt", "w", encoding="utf-8") as file_percorso:
for file in percorso:
file_percorso.write(file + "
")

os.system("lista_file.txt")
os.system("lista_file_ordinata.txt")
os.system("percorso.txt")
``````

## All the file of C: in one text file

This is a shorter version of the previous code. Change the folder where to start finding the files if you need to start from another position. This code generate a 50 mb on text file on my computer with something less then 500.000 lines with files with the complete path.

``````import os

with open("file.txt", "w", encoding="utf-8") as filewrite:
for r, d, f in os.walk("C:\"):
for file in f:
filewrite.write(f"{r + file}
")
``````

## How to write a file with all paths in a folder of a type

With this function you can create a txt file that will have the name of a type of file that you look for (ex. pngfile.txt) with all the full path of all the files of that type. It can be useful sometimes, I think.

``````import os

def searchfiles(extension=".ttf", folder="H:\"):
"Create a txt file with all the file of a type"
with open(extension[1:] + "file.txt", "w", encoding="utf-8") as filewrite:
for r, d, f in os.walk(folder):
for file in f:
if file.endswith(extension):
filewrite.write(f"{r + file}
")

# looking for png file (fonts) in the hard disk H:
searchfiles(".png", "H:\")

>>> H:4bs_18Dolphins5.png
>>> H:4bs_18Dolphins6.png
>>> H:4bs_18Dolphins7.png
>>> H:5_18marketing htmlassetsimageslogo2.png
>>> H:7z001.png
>>> H:7z002.png
``````

## (New) Find all files and open them with tkinter GUI

I just wanted to add in this 2019 a little app to search for all files in a dir and be able to open them by doubleclicking on the name of the file in the list.

``````import tkinter as tk
import os

def searchfiles(extension=".txt", folder="H:\"):
"insert all files in the listbox"
for r, d, f in os.walk(folder):
for file in f:
if file.endswith(extension):
lb.insert(0, r + "\" + file)

def open_file():
os.startfile(lb.get(lb.curselection()[0]))

root = tk.Tk()
root.geometry("400x400")
bt = tk.Button(root, text="Search", command=lambda:searchfiles(".png", "H:\"))
bt.pack()
lb = tk.Listbox(root)
lb.pack(fill="both", expand=1)
lb.bind("<Double-Button>", lambda x: open_file())
root.mainloop()
``````

This is the behaviour to adopt when the referenced object is deleted. It is not specific to Django; this is an SQL standard. Although Django has its own implementation on top of SQL. (1)

There are seven possible actions to take when such event occurs:

• `CASCADE`: When the referenced object is deleted, also delete the objects that have references to it (when you remove a blog post for instance, you might want to delete comments as well). SQL equivalent: `CASCADE`.
• `PROTECT`: Forbid the deletion of the referenced object. To delete it you will have to delete all objects that reference it manually. SQL equivalent: `RESTRICT`.
• `RESTRICT`: (introduced in Django 3.1) Similar behavior as `PROTECT` that matches SQL"s `RESTRICT` more accurately. (See django documentation example)
• `SET_NULL`: Set the reference to NULL (requires the field to be nullable). For instance, when you delete a User, you might want to keep the comments he posted on blog posts, but say it was posted by an anonymous (or deleted) user. SQL equivalent: `SET NULL`.
• `SET_DEFAULT`: Set the default value. SQL equivalent: `SET DEFAULT`.
• `SET(...)`: Set a given value. This one is not part of the SQL standard and is entirely handled by Django.
• `DO_NOTHING`: Probably a very bad idea since this would create integrity issues in your database (referencing an object that actually doesn"t exist). SQL equivalent: `NO ACTION`. (2)

Source: Django documentation

In most cases, `CASCADE` is the expected behaviour, but for every ForeignKey, you should always ask yourself what is the expected behaviour in this situation. `PROTECT` and `SET_NULL` are often useful. Setting `CASCADE` where it should not, can potentially delete all of your database in cascade, by simply deleting a single user.

It"s funny to notice that the direction of the `CASCADE` action is not clear to many people. Actually, it"s funny to notice that only the `CASCADE` action is not clear. I understand the cascade behavior might be confusing, however you must think that it is the same direction as any other action. Thus, if you feel that `CASCADE` direction is not clear to you, it actually means that `on_delete` behavior is not clear to you.

In your database, a foreign key is basically represented by an integer field which value is the primary key of the foreign object. Let"s say you have an entry comment_A, which has a foreign key to an entry article_B. If you delete the entry comment_A, everything is fine. article_B used to live without comment_A and don"t bother if it"s deleted. However, if you delete article_B, then comment_A panics! It never lived without article_B and needs it, and it"s part of its attributes (`article=article_B`, but what is article_B???). This is where `on_delete` steps in, to determine how to resolve this integrity error, either by saying:

• "No! Please! Don"t! I can"t live without you!" (which is said `PROTECT` or `RESTRICT` in Django/SQL)
• "All right, if I"m not yours, then I"m nobody"s" (which is said `SET_NULL`)
• "Good bye world, I can"t live without article_B" and commit suicide (this is the `CASCADE` behavior).
• "It"s OK, I"ve got spare lover, and I"ll reference article_C from now" (`SET_DEFAULT`, or even `SET(...)`).
• "I can"t face reality, and I"ll keep calling your name even if that"s the only thing left to me!" (`DO_NOTHING`)

I hope it makes cascade direction clearer. :)

Footnotes

(1) Django has its own implementation on top of SQL. And, as mentioned by @JoeMjr2 in the comments below, Django will not create the SQL constraints. If you want the constraints to be ensured by your database (for instance, if your database is used by another application, or if you hang in the database console from time to time), you might want to set the related constraints manually yourself. There is an open ticket to add support for database-level on delete constrains in Django.

(2) Actually, there is one case where `DO_NOTHING` can be useful: If you want to skip Django"s implementation and implement the constraint yourself at the database-level.

TL;DR: If you are using Python 3.10 or later, it just works. As of today (2019), in 3.7+ you must turn this feature on using a future statement (`from __future__ import annotations`). In Python 3.6 or below, use a string.

I guess you got this exception:

``````NameError: name "Position" is not defined
``````

This is because `Position` must be defined before you can use it in an annotation unless you are using Python 3.10 or later.

## Python 3.7+: `from __future__ import annotations`

Python 3.7 introduces PEP 563: postponed evaluation of annotations. A module that uses the future statement `from __future__ import annotations` will store annotations as strings automatically:

``````from __future__ import annotations

class Position:
def __add__(self, other: Position) -> Position:
...
``````

This is scheduled to become the default in Python 3.10. Since Python still is a dynamically typed language so no type checking is done at runtime, typing annotations should have no performance impact, right? Wrong! Before python 3.7 the typing module used to be one of the slowest python modules in core so if you `import typing` you will see up to 7 times increase in performance when you upgrade to 3.7.

## Python <3.7: use a string

According to PEP 484, you should use a string instead of the class itself:

``````class Position:
...
def __add__(self, other: "Position") -> "Position":
...
``````

If you use the Django framework this may be familiar as Django models also use strings for forward references (foreign key definitions where the foreign model is `self` or is not declared yet). This should work with Pycharm and other tools.

## Sources

The relevant parts of PEP 484 and PEP 563, to spare you the trip:

# Forward references

When a type hint contains names that have not been defined yet, that definition may be expressed as a string literal, to be resolved later.

A situation where this occurs commonly is the definition of a container class, where the class being defined occurs in the signature of some of the methods. For example, the following code (the start of a simple binary tree implementation) does not work:

``````class Tree:
def __init__(self, left: Tree, right: Tree):
self.left = left
self.right = right
``````

``````class Tree:
def __init__(self, left: "Tree", right: "Tree"):
self.left = left
self.right = right
``````

The string literal should contain a valid Python expression (i.e., compile(lit, "", "eval") should be a valid code object) and it should evaluate without errors once the module has been fully loaded. The local and global namespace in which it is evaluated should be the same namespaces in which default arguments to the same function would be evaluated.

and PEP 563:

# Implementation

In Python 3.10, function and variable annotations will no longer be evaluated at definition time. Instead, a string form will be preserved in the respective¬†`__annotations__`¬†dictionary. Static type checkers will see no difference in behavior, whereas tools using annotations at runtime will have to perform postponed evaluation.

...

## Enabling the future behavior in Python 3.7

The functionality described above can be enabled starting from Python 3.7 using the following special import:

``````from __future__ import annotations
``````

## Things that you may be tempted to do instead

### A. Define a dummy `Position`

Before the class definition, place a dummy definition:

``````class Position(object):
pass

class Position(object):
...
``````

This will get rid of the `NameError` and may even look OK:

``````>>> Position.__add__.__annotations__
{"other": __main__.Position, "return": __main__.Position}
``````

But is it?

``````>>> for k, v in Position.__add__.__annotations__.items():
...     print(k, "is Position:", v is Position)
return is Position: False
other is Position: False
``````

### B. Monkey-patch in order to add the annotations:

You may want to try some Python meta programming magic and write a decorator to monkey-patch the class definition in order to add annotations:

``````class Position:
...
return self.__class__(self.x + other.x, self.y + other.y)
``````

The decorator should be responsible for the equivalent of this:

``````Position.__add__.__annotations__["return"] = Position
``````

At least it seems right:

``````>>> for k, v in Position.__add__.__annotations__.items():
...     print(k, "is Position:", v is Position)
return is Position: True
other is Position: True
``````

Probably too much trouble.

## Placing the legend (`bbox_to_anchor`)

A legend is positioned inside the bounding box of the axes using the `loc` argument to `plt.legend`.
E.g. `loc="upper right"` places the legend in the upper right corner of the bounding box, which by default extents from `(0,0)` to `(1,1)` in axes coordinates (or in bounding box notation `(x0,y0, width, height)=(0,0,1,1)`).

To place the legend outside of the axes bounding box, one may specify a tuple `(x0,y0)` of axes coordinates of the lower left corner of the legend.

``````plt.legend(loc=(1.04,0))
``````

A more versatile approach is to manually specify the bounding box into which the legend should be placed, using the `bbox_to_anchor` argument. One can restrict oneself to supply only the `(x0, y0)` part of the bbox. This creates a zero span box, out of which the legend will expand in the direction given by the `loc` argument. E.g.

`plt.legend(bbox_to_anchor=(1.04,1), loc="upper left")`

places the legend outside the axes, such that the upper left corner of the legend is at position `(1.04,1)` in axes coordinates.

Further examples are given below, where additionally the interplay between different arguments like `mode` and `ncols` are shown.

``````l1 = plt.legend(bbox_to_anchor=(1.04,1), borderaxespad=0)
l2 = plt.legend(bbox_to_anchor=(1.04,0), loc="lower left", borderaxespad=0)
l3 = plt.legend(bbox_to_anchor=(1.04,0.5), loc="center left", borderaxespad=0)
l4 = plt.legend(bbox_to_anchor=(0,1.02,1,0.2), loc="lower left",
l5 = plt.legend(bbox_to_anchor=(1,0), loc="lower right",
bbox_transform=fig.transFigure, ncol=3)
l6 = plt.legend(bbox_to_anchor=(0.4,0.8), loc="upper right")
``````

Details about how to interpret the 4-tuple argument to `bbox_to_anchor`, as in `l4`, can be found in this question. The `mode="expand"` expands the legend horizontally inside the bounding box given by the 4-tuple. For a vertically expanded legend, see this question.

Sometimes it may be useful to specify the bounding box in figure coordinates instead of axes coordinates. This is shown in the example `l5` from above, where the `bbox_transform` argument is used to put the legend in the lower left corner of the figure.

### Postprocessing

Having placed the legend outside the axes often leads to the undesired situation that it is completely or partially outside the figure canvas.

Solutions to this problem are:

One can adjust the subplot parameters such, that the axes take less space inside the figure (and thereby leave more space to the legend) by using `plt.subplots_adjust`. E.g.

``````  plt.subplots_adjust(right=0.7)
``````

leaves 30% space on the right-hand side of the figure, where one could place the legend.

• Tight layout
Using `plt.tight_layout` Allows to automatically adjust the subplot parameters such that the elements in the figure sit tight against the figure edges. Unfortunately, the legend is not taken into account in this automatism, but we can supply a rectangle box that the whole subplots area (including labels) will fit into.

``````  plt.tight_layout(rect=[0,0,0.75,1])
``````
• Saving the figure with `bbox_inches = "tight"`
The argument `bbox_inches = "tight"` to `plt.savefig` can be used to save the figure such that all artist on the canvas (including the legend) are fit into the saved area. If needed, the figure size is automatically adjusted.

``````  plt.savefig("output.png", bbox_inches="tight")
``````
• automatically adjusting the subplot params
A way to automatically adjust the subplot position such that the legend fits inside the canvas without changing the figure size can be found in this answer: Creating figure with exact size and no padding (and legend outside the axes)

Comparison between the cases discussed above:

## Alternatives

A figure legend

One may use a legend to the figure instead of the axes, `matplotlib.figure.Figure.legend`. This has become especially useful for matplotlib version >=2.1, where no special arguments are needed

``````fig.legend(loc=7)
``````

to create a legend for all artists in the different axes of the figure. The legend is placed using the `loc` argument, similar to how it is placed inside an axes, but in reference to the whole figure - hence it will be outside the axes somewhat automatically. What remains is to adjust the subplots such that there is no overlap between the legend and the axes. Here the point "Adjust the subplot parameters" from above will be helpful. An example:

``````import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0,2*np.pi)
colors=["#7aa0c4","#ca82e1" ,"#8bcd50","#e18882"]
fig, axes = plt.subplots(ncols=2)
for i in range(4):
axes[i//2].plot(x,np.sin(x+i), color=colors[i],label="y=sin(x+{})".format(i))

fig.legend(loc=7)
fig.tight_layout()
plt.show()
``````

Legend inside dedicated subplot axes

An alternative to using `bbox_to_anchor` would be to place the legend in its dedicated subplot axes (`lax`). Since the legend subplot should be smaller than the plot, we may use `gridspec_kw={"width_ratios":[4,1]}` at axes creation. We can hide the axes `lax.axis("off")` but still put a legend in. The legend handles and labels need to obtained from the real plot via `h,l = ax.get_legend_handles_labels()`, and can then be supplied to the legend in the `lax` subplot, `lax.legend(h,l)`. A complete example is below.

``````import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = 6,2

fig, (ax,lax) = plt.subplots(ncols=2, gridspec_kw={"width_ratios":[4,1]})
ax.plot(x,y, label="y=sin(x)")
....

h,l = ax.get_legend_handles_labels()
lax.axis("off")

plt.tight_layout()
plt.show()
``````

This produces a plot, which is visually pretty similar to the plot from above:

We could also use the first axes to place the legend, but use the `bbox_transform` of the legend axes,

``````ax.legend(bbox_to_anchor=(0,0,1,1), bbox_transform=lax.transAxes)
lax.axis("off")
``````

In this approach, we do not need to obtain the legend handles externally, but we need to specify the `bbox_to_anchor` argument.

• Consider the matplotlib legend guide with some examples of other stuff you want to do with legends.
• Some example code for placing legends for pie charts may directly be found in answer to this question: Python - Legend overlaps with the pie chart
• The `loc` argument can take numbers instead of strings, which make calls shorter, however, they are not very intuitively mapped to each other. Here is the mapping for reference:

You can disable any Python warnings via the `PYTHONWARNINGS` environment variable. In this case, you want:

``````export PYTHONWARNINGS="ignore:Unverified HTTPS request"
``````

To disable using Python code (`requests >= 2.16.0`):

``````import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
``````

For `requests < 2.16.0`, see original answer below.

The reason doing `urllib3.disable_warnings()` didn"t work for you is because it looks like you"re using a separate instance of urllib3 vendored inside of requests.

I gather this based on the path here: `/usr/lib/python2.6/site-packages/requests/packages/urllib3/connectionpool.py`

To disable warnings in requests" vendored urllib3, you"ll need to import that specific instance of the module:

``````import requests
from requests.packages.urllib3.exceptions import InsecureRequestWarning

requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
``````

### Is there any reason for a class declaration to inherit from `object`?

In Python 3, apart from compatibility between Python 2 and 3, no reason. In Python 2, many reasons.

### Python 2.x story:

In Python 2.x (from 2.2 onwards) there"s two styles of classes depending on the presence or absence of `object` as a base-class:

1. "classic" style classes: they don"t have `object` as a base class:

``````>>> class ClassicSpam:      # no base class
...     pass
>>> ClassicSpam.__bases__
()
``````
2. "new" style classes: they have, directly or indirectly (e.g inherit from a built-in type), `object` as a base class:

``````>>> class NewSpam(object):           # directly inherit from object
...    pass
>>> NewSpam.__bases__
(<type "object">,)
>>> class IntSpam(int):              # indirectly inherit from object...
...    pass
>>> IntSpam.__bases__
(<type "int">,)
>>> IntSpam.__bases__[0].__bases__   # ... because int inherits from object
(<type "object">,)
``````

Without a doubt, when writing a class you"ll always want to go for new-style classes. The perks of doing so are numerous, to list some of them:

If you don"t inherit from `object`, forget these. A more exhaustive description of the previous bullet points along with other perks of "new" style classes can be found here.

One of the downsides of new-style classes is that the class itself is more memory demanding. Unless you"re creating many class objects, though, I doubt this would be an issue and it"s a negative sinking in a sea of positives.

### Python 3.x story:

In Python 3, things are simplified. Only new-style classes exist (referred to plainly as classes) so, the only difference in adding `object` is requiring you to type in 8 more characters. This:

``````class ClassicSpam:
pass
``````

is completely equivalent (apart from their name :-) to this:

``````class NewSpam(object):
pass
``````

and to this:

``````class Spam():
pass
``````

All have `object` in their `__bases__`.

``````>>> [object in cls.__bases__ for cls in {Spam, NewSpam, ClassicSpam}]
[True, True, True]
``````

## So, what should you do?

In Python 2: always inherit from `object` explicitly. Get the perks.

In Python 3: inherit from `object` if you are writing code that tries to be Python agnostic, that is, it needs to work both in Python 2 and in Python 3. Otherwise don"t, it really makes no difference since Python inserts it for you behind the scenes.

## InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately

Tried to perform REST GET through python requests with the following code and I got error.

Code snip:

``````import requests
url = az_base_url + az_subscription_id + "/resourcegroups/Default-Networking/resources?" + az_api_version
``````

Error:

``````/usr/local/lib/python2.7/dist-packages/requests/packages/urllib3/util/ssl_.py:79:
InsecurePlatformWarning: A true SSLContext object is not available.
This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail.
InsecurePlatformWarning
``````

My python version is 2.7.3. I tried to install urllib3 and requests[security] as some other thread suggests, I still got the same error.

Wonder if anyone can provide some tips?

## Dynamic instantiation from string name of a class in dynamically imported module?

In python, I have to instantiate certain class, knowing its name in a string, but this class "lives" in a dynamically imported module. An example follows:

``````import sys
def __init__(self, module_name, class_name): # both args are strings
try:
__import__(module_name)
modul = sys.modules[module_name]
instance = modul.class_name() # obviously this doesn"t works, here is my main problem!
except ImportError:
# manage import error
``````

``````class myName:
# etc...
``````

I use this arrangement to make any dynamically-loaded-module to be used by the loader-class following certain predefined behaviours in the dyn-loaded-modules...

## pandas loc vs. iloc vs. at vs. iat?

Recently began branching out from my safe place (R) into Python and and am a bit confused by the cell localization/selection in `Pandas`. I"ve read the documentation but I"m struggling to understand the practical implications of the various localization/selection options.

Is there a reason why I should ever use `.loc` or `.iloc` over `at`, and `iat` or vice versa? In what situations should I use which method?

Note: future readers be aware that this question is old and was written before pandas v0.20 when there used to exist a function called `.ix`. This method was later split into two - `loc` and `iloc` - to make the explicit distinction between positional and label based indexing. Please beware that `ix` was discontinued due to inconsistent behavior and being hard to grok, and no longer exists in current versions of pandas (>= 1.0).

## How to get all of the immediate subdirectories in Python

I"m trying to write a simple Python script that will copy a index.tpl to index.html in all of the subdirectories (with a few exceptions).

I"m getting bogged down by trying to get the list of subdirectories.

## Standard deviation of a list

I want to find mean and standard deviation of 1st, 2nd,... digits of several (Z) lists. For example, I have

``````A_rank=[0.8,0.4,1.2,3.7,2.6,5.8]
B_rank=[0.1,2.8,3.7,2.6,5,3.4]
C_Rank=[1.2,3.4,0.5,0.1,2.5,6.1]
# etc (up to Z_rank )...
``````

Now I want to take the mean and std of `*_Rank[0]`, the mean and std of `*_Rank[1]`, etc.
(ie: mean and std of the 1st digit from all the (A..Z)_rank lists;
the mean and std of the 2nd digit from all the (A..Z)_rank lists;
the mean and std of the 3rd digit...; etc).

## sort eigenvalues and associated eigenvectors after using numpy.linalg.eig in python

I"m using numpy.linalg.eig to obtain a list of eigenvalues and eigenvectors:

``````A = someMatrixArray
from numpy.linalg import eig as eigenValuesAndVectors

solution = eigenValuesAndVectors(A)

eigenValues = solution[0]
eigenVectors = solution[1]
``````

I would like to sort my eigenvalues (e.g. from lowest to highest), in a way I know what is the associated eigenvector after the sorting.

I"m not finding any way of doing that with python functions. Is there any simple way or do I have to code my sort version?

## Associativity of "in" in Python?

I"m making a Python parser, and this is really confusing me:

``````>>> 1 in [] in "a"
False

>>> (1 in []) in "a"
TypeError: "in <string>" requires string as left operand, not bool

>>> 1 in ([] in "a")
TypeError: "in <string>" requires string as left operand, not list
``````

How exactly does `in` work in Python, with regards to associativity, etc.?

Why do no two of these expressions behave the same way?

## How to calculate probability in a normal distribution given mean & standard deviation?

How to calculate probability in normal distribution given mean, std in Python? I can always explicitly code my own function according to the definition like the OP in this question did: Calculating Probability of a Random Variable in a Distribution in Python

Just wondering if there is a library function call will allow you to do this. In my imagine it would like this:

``````nd = NormalDistribution(mu=100, std=12)
p = nd.prob(98)
``````

There is a similar question in Perl: How can I compute the probability at a point given a normal distribution in Perl?. But I didn"t see one in Python.

`Numpy` has a `random.normal` function, but it"s like sampling, not exactly what I want.

The Python 3 `range()` object doesn"t produce numbers immediately; it is a smart sequence object that produces numbers on demand. All it contains is your start, stop and step values, then as you iterate over the object the next integer is calculated each iteration.

The object also implements the `object.__contains__` hook, and calculates if your number is part of its range. Calculating is a (near) constant time operation *. There is never a need to scan through all possible integers in the range.

The advantage of the `range` type over a regular `list` or `tuple` is that a range object will always take the same (small) amount of memory, no matter the size of the range it represents (as it only stores the `start`, `stop` and `step` values, calculating individual items and subranges as needed).

So at a minimum, your `range()` object would do:

``````class my_range:
def __init__(self, start, stop=None, step=1, /):
if stop is None:
start, stop = 0, start
self.start, self.stop, self.step = start, stop, step
if step < 0:
lo, hi, step = stop, start, -step
else:
lo, hi = start, stop
self.length = 0 if lo > hi else ((hi - lo - 1) // step) + 1

def __iter__(self):
current = self.start
if self.step < 0:
while current > self.stop:
yield current
current += self.step
else:
while current < self.stop:
yield current
current += self.step

def __len__(self):
return self.length

def __getitem__(self, i):
if i < 0:
i += self.length
if 0 <= i < self.length:
return self.start + i * self.step
raise IndexError("my_range object index out of range")

def __contains__(self, num):
if self.step < 0:
if not (self.stop < num <= self.start):
return False
else:
if not (self.start <= num < self.stop):
return False
return (num - self.start) % self.step == 0
``````

This is still missing several things that a real `range()` supports (such as the `.index()` or `.count()` methods, hashing, equality testing, or slicing), but should give you an idea.

I also simplified the `__contains__` implementation to only focus on integer tests; if you give a real `range()` object a non-integer value (including subclasses of `int`), a slow scan is initiated to see if there is a match, just as if you use a containment test against a list of all the contained values. This was done to continue to support other numeric types that just happen to support equality testing with integers but are not expected to support integer arithmetic as well. See the original Python issue that implemented the containment test.

* Near constant time because Python integers are unbounded and so math operations also grow in time as N grows, making this a O(log N) operation. Since it‚Äôs all executed in optimised C code and Python stores integer values in 30-bit chunks, you‚Äôd run out of memory before you saw any performance impact due to the size of the integers involved here.

You have four main options for converting types in pandas:

1. `to_numeric()` - provides functionality to safely convert non-numeric types (e.g. strings) to a suitable numeric type. (See also `to_datetime()` and `to_timedelta()`.)

2. `astype()` - convert (almost) any type to (almost) any other type (even if it"s not necessarily sensible to do so). Also allows you to convert to categorial types (very useful).

3. `infer_objects()` - a utility method to convert object columns holding Python objects to a pandas type if possible.

4. `convert_dtypes()` - convert DataFrame columns to the "best possible" dtype that supports `pd.NA` (pandas" object to indicate a missing value).

Read on for more detailed explanations and usage of each of these methods.

# 1. `to_numeric()`

The best way to convert one or more columns of a DataFrame to numeric values is to use `pandas.to_numeric()`.

This function will try to change non-numeric objects (such as strings) into integers or floating point numbers as appropriate.

## Basic usage

The input to `to_numeric()` is a Series or a single column of a DataFrame.

``````>>> s = pd.Series(["8", 6, "7.5", 3, "0.9"]) # mixed string and numeric values
>>> s
0      8
1      6
2    7.5
3      3
4    0.9
dtype: object

>>> pd.to_numeric(s) # convert everything to float values
0    8.0
1    6.0
2    7.5
3    3.0
4    0.9
dtype: float64
``````

As you can see, a new Series is returned. Remember to assign this output to a variable or column name to continue using it:

``````# convert Series
my_series = pd.to_numeric(my_series)

# convert column "a" of a DataFrame
df["a"] = pd.to_numeric(df["a"])
``````

You can also use it to convert multiple columns of a DataFrame via the `apply()` method:

``````# convert all columns of DataFrame
df = df.apply(pd.to_numeric) # convert all columns of DataFrame

# convert just columns "a" and "b"
df[["a", "b"]] = df[["a", "b"]].apply(pd.to_numeric)
``````

As long as your values can all be converted, that"s probably all you need.

## Error handling

But what if some values can"t be converted to a numeric type?

`to_numeric()` also takes an `errors` keyword argument that allows you to force non-numeric values to be `NaN`, or simply ignore columns containing these values.

Here"s an example using a Series of strings `s` which has the object dtype:

``````>>> s = pd.Series(["1", "2", "4.7", "pandas", "10"])
>>> s
0         1
1         2
2       4.7
3    pandas
4        10
dtype: object
``````

The default behaviour is to raise if it can"t convert a value. In this case, it can"t cope with the string "pandas":

``````>>> pd.to_numeric(s) # or pd.to_numeric(s, errors="raise")
ValueError: Unable to parse string
``````

Rather than fail, we might want "pandas" to be considered a missing/bad numeric value. We can coerce invalid values to `NaN` as follows using the `errors` keyword argument:

``````>>> pd.to_numeric(s, errors="coerce")
0     1.0
1     2.0
2     4.7
3     NaN
4    10.0
dtype: float64
``````

The third option for `errors` is just to ignore the operation if an invalid value is encountered:

``````>>> pd.to_numeric(s, errors="ignore")
# the original Series is returned untouched
``````

This last option is particularly useful when you want to convert your entire DataFrame, but don"t not know which of our columns can be converted reliably to a numeric type. In that case just write:

``````df.apply(pd.to_numeric, errors="ignore")
``````

The function will be applied to each column of the DataFrame. Columns that can be converted to a numeric type will be converted, while columns that cannot (e.g. they contain non-digit strings or dates) will be left alone.

## Downcasting

By default, conversion with `to_numeric()` will give you either a `int64` or `float64` dtype (or whatever integer width is native to your platform).

That"s usually what you want, but what if you wanted to save some memory and use a more compact dtype, like `float32`, or `int8`?

`to_numeric()` gives you the option to downcast to either "integer", "signed", "unsigned", "float". Here"s an example for a simple series `s` of integer type:

``````>>> s = pd.Series([1, 2, -7])
>>> s
0    1
1    2
2   -7
dtype: int64
``````

Downcasting to "integer" uses the smallest possible integer that can hold the values:

``````>>> pd.to_numeric(s, downcast="integer")
0    1
1    2
2   -7
dtype: int8
``````

Downcasting to "float" similarly picks a smaller than normal floating type:

``````>>> pd.to_numeric(s, downcast="float")
0    1.0
1    2.0
2   -7.0
dtype: float32
``````

# 2. `astype()`

The `astype()` method enables you to be explicit about the dtype you want your DataFrame or Series to have. It"s very versatile in that you can try and go from one type to the any other.

## Basic usage

Just pick a type: you can use a NumPy dtype (e.g. `np.int16`), some Python types (e.g. bool), or pandas-specific types (like the categorical dtype).

Call the method on the object you want to convert and `astype()` will try and convert it for you:

``````# convert all DataFrame columns to the int64 dtype
df = df.astype(int)

# convert column "a" to int64 dtype and "b" to complex type
df = df.astype({"a": int, "b": complex})

# convert Series to float16 type
s = s.astype(np.float16)

# convert Series to Python strings
s = s.astype(str)

# convert Series to categorical type - see docs for more details
s = s.astype("category")
``````

Notice I said "try" - if `astype()` does not know how to convert a value in the Series or DataFrame, it will raise an error. For example if you have a `NaN` or `inf` value you"ll get an error trying to convert it to an integer.

As of pandas 0.20.0, this error can be suppressed by passing `errors="ignore"`. Your original object will be return untouched.

## Be careful

`astype()` is powerful, but it will sometimes convert values "incorrectly". For example:

``````>>> s = pd.Series([1, 2, -7])
>>> s
0    1
1    2
2   -7
dtype: int64
``````

These are small integers, so how about converting to an unsigned 8-bit type to save memory?

``````>>> s.astype(np.uint8)
0      1
1      2
2    249
dtype: uint8
``````

The conversion worked, but the -7 was wrapped round to become 249 (i.e. 28 - 7)!

Trying to downcast using `pd.to_numeric(s, downcast="unsigned")` instead could help prevent this error.

# 3. `infer_objects()`

Version 0.21.0 of pandas introduced the method `infer_objects()` for converting columns of a DataFrame that have an object datatype to a more specific type (soft conversions).

For example, here"s a DataFrame with two columns of object type. One holds actual integers and the other holds strings representing integers:

``````>>> df = pd.DataFrame({"a": [7, 1, 5], "b": ["3","2","1"]}, dtype="object")
>>> df.dtypes
a    object
b    object
dtype: object
``````

Using `infer_objects()`, you can change the type of column "a" to int64:

``````>>> df = df.infer_objects()
>>> df.dtypes
a     int64
b    object
dtype: object
``````

Column "b" has been left alone since its values were strings, not integers. If you wanted to try and force the conversion of both columns to an integer type, you could use `df.astype(int)` instead.

# 4. `convert_dtypes()`

Version 1.0 and above includes a method `convert_dtypes()` to convert Series and DataFrame columns to the best possible dtype that supports the `pd.NA` missing value.

Here "best possible" means the type most suited to hold the values. For example, this a pandas integer type if all of the values are integers (or missing values): an object column of Python integer objects is converted to `Int64`, a column of NumPy `int32` values will become the pandas dtype `Int32`.

With our `object` DataFrame `df`, we get the following result:

``````>>> df.convert_dtypes().dtypes
a     Int64
b    string
dtype: object
``````

Since column "a" held integer values, it was converted to the `Int64` type (which is capable of holding missing values, unlike `int64`).

Column "b" contained string objects, so was changed to pandas" `string` dtype.

By default, this method will infer the type from object values in each column. We can change this by passing `infer_objects=False`:

``````>>> df.convert_dtypes(infer_objects=False).dtypes
a    object
b    string
dtype: object
``````

Now column "a" remained an object column: pandas knows it can be described as an "integer" column (internally it ran `infer_dtype`) but didn"t infer exactly what dtype of integer it should have so did not convert it. Column "b" was again converted to "string" dtype as it was recognised as holding "string" values.

Since this question was asked in 2010, there has been real simplification in how to do simple multithreading with Python with map and pool.

The code below comes from an article/blog post that you should definitely check out (no affiliation) - Parallelism in one line: A Better Model for Day to Day Threading Tasks. I"ll summarize below - it ends up being just a few lines of code:

``````from multiprocessing.dummy import Pool as ThreadPool
results = pool.map(my_function, my_array)
``````

Which is the multithreaded version of:

``````results = []
for item in my_array:
results.append(my_function(item))
``````

Description

Map is a cool little function, and the key to easily injecting parallelism into your Python code. For those unfamiliar, map is something lifted from functional languages like Lisp. It is a function which maps another function over a sequence.

Map handles the iteration over the sequence for us, applies the function, and stores all of the results in a handy list at the end.

Implementation

Parallel versions of the map function are provided by two libraries:multiprocessing, and also its little known, but equally fantastic step child:multiprocessing.dummy.

`multiprocessing.dummy` is exactly the same as multiprocessing module, but uses threads instead (an important distinction - use multiple processes for CPU-intensive tasks; threads for (and during) I/O):

multiprocessing.dummy replicates the API of multiprocessing, but is no more than a wrapper around the threading module.

``````import urllib2
from multiprocessing.dummy import Pool as ThreadPool

urls = [
"http://www.python.org",
"http://www.onlamp.com/pub/a/python/2003/04/17/metaclasses.html",
"http://www.python.org/doc/",
"http://www.python.org/getit/",
"http://www.python.org/community/",
"https://wiki.python.org/moin/",
]

# Make the Pool of workers

# Open the URLs in their own threads
# and return the results
results = pool.map(urllib2.urlopen, urls)

# Close the pool and wait for the work to finish
pool.close()
pool.join()
``````

And the timing results:

``````Single thread:   14.4 seconds
4 Pool:   3.1 seconds
8 Pool:   1.4 seconds
13 Pool:   1.3 seconds
``````

Passing multiple arguments (works like this only in Python 3.3 and later):

To pass multiple arrays:

``````results = pool.starmap(function, zip(list_a, list_b))
``````

Or to pass a constant and an array:

``````results = pool.starmap(function, zip(itertools.repeat(constant), list_a))
``````

If you are using an earlier version of Python, you can pass multiple arguments via this workaround).

(Thanks to user136036 for the helpful comment.)

## How to iterate over rows in a DataFrame in Pandas?

Iteration in Pandas is an anti-pattern and is something you should only do when you have exhausted every other option. You should not use any function with "`iter`" in its name for more than a few thousand rows or you will have to get used to a lot of waiting.

Do you want to print a DataFrame? Use `DataFrame.to_string()`.

Do you want to compute something? In that case, search for methods in this order (list modified from here):

1. Vectorization
2. Cython routines
3. List Comprehensions (vanilla `for` loop)
4. `DataFrame.apply()`: i) ¬†Reductions that can be performed in Cython, ii) Iteration in Python space
5. `DataFrame.itertuples()` and `iteritems()`
6. `DataFrame.iterrows()`

`iterrows` and `itertuples` (both receiving many votes in answers to this question) should be used in very rare circumstances, such as generating row objects/nametuples for sequential processing, which is really the only thing these functions are useful for.

Appeal to Authority

The documentation page on iteration has a huge red warning box that says:

Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is not needed [...].

* It"s actually a little more complicated than "don"t". `df.iterrows()` is the correct answer to this question, but "vectorize your ops" is the better one. I will concede that there are circumstances where iteration cannot be avoided (for example, some operations where the result depends on the value computed for the previous row). However, it takes some familiarity with the library to know when. If you"re not sure whether you need an iterative solution, you probably don"t. PS: To know more about my rationale for writing this answer, skip to the very bottom.

## Faster than Looping: Vectorization, Cython

A good number of basic operations and computations are "vectorised" by pandas (either through NumPy, or through Cythonized functions). This includes arithmetic, comparisons, (most) reductions, reshaping (such as pivoting), joins, and groupby operations. Look through the documentation on Essential Basic Functionality to find a suitable vectorised method for your problem.

If none exists, feel free to write your own using custom Cython extensions.

## Next Best Thing: List Comprehensions*

List comprehensions should be your next port of call if 1) there is no vectorized solution available, 2) performance is important, but not important enough to go through the hassle of cythonizing your code, and 3) you"re trying to perform elementwise transformation on your code. There is a good amount of evidence to suggest that list comprehensions are sufficiently fast (and even sometimes faster) for many common Pandas tasks.

The formula is simple,

``````# Iterating over one column - `f` is some function that processes your data
result = [f(x) for x in df["col"]]
# Iterating over two columns, use `zip`
result = [f(x, y) for x, y in zip(df["col1"], df["col2"])]
# Iterating over multiple columns - same data type
result = [f(row[0], ..., row[n]) for row in df[["col1", ...,"coln"]].to_numpy()]
# Iterating over multiple columns - differing data type
result = [f(row[0], ..., row[n]) for row in zip(df["col1"], ..., df["coln"])]
``````

If you can encapsulate your business logic into a function, you can use a list comprehension that calls it. You can make arbitrarily complex things work through the simplicity and speed of raw Python code.

Caveats

List comprehensions assume that your data is easy to work with - what that means is your data types are consistent and you don"t have NaNs, but this cannot always be guaranteed.

1. The first one is more obvious, but when dealing with NaNs, prefer in-built pandas methods if they exist (because they have much better corner-case handling logic), or ensure your business logic includes appropriate NaN handling logic.
2. When dealing with mixed data types you should iterate over `zip(df["A"], df["B"], ...)` instead of `df[["A", "B"]].to_numpy()` as the latter implicitly upcasts data to the most common type. As an example if A is numeric and B is string, `to_numpy()` will cast the entire array to string, which may not be what you want. Fortunately `zip`ping your columns together is the most straightforward workaround to this.

*Your mileage may vary for the reasons outlined in the Caveats section above.

## An Obvious Example

Let"s demonstrate the difference with a simple example of adding two pandas columns `A + B`. This is a vectorizable operaton, so it will be easy to contrast the performance of the methods discussed above.

Benchmarking code, for your reference. The line at the bottom measures a function written in numpandas, a style of Pandas that mixes heavily with NumPy to squeeze out maximum performance. Writing numpandas code should be avoided unless you know what you"re doing. Stick to the API where you can (i.e., prefer `vec` over `vec_numpy`).

I should mention, however, that it isn"t always this cut and dry. Sometimes the answer to "what is the best method for an operation" is "it depends on your data". My advice is to test out different approaches on your data before settling on one.

* Pandas string methods are "vectorized" in the sense that they are specified on the series but operate on each element. The underlying mechanisms are still iterative, because string operations are inherently hard to vectorize.

## Why I Wrote this Answer

A common trend I notice from new users is to ask questions of the form "How can I iterate over my df to do X?". Showing code that calls `iterrows()` while doing something inside a `for` loop. Here is why. A new user to the library who has not been introduced to the concept of vectorization will likely envision the code that solves their problem as iterating over their data to do something. Not knowing how to iterate over a DataFrame, the first thing they do is Google it and end up here, at this question. They then see the accepted answer telling them how to, and they close their eyes and run this code without ever first questioning if iteration is not the right thing to do.

The aim of this answer is to help new users understand that iteration is not necessarily the solution to every problem, and that better, faster and more idiomatic solutions could exist, and that it is worth investing time in exploring them. I"m not trying to start a war of iteration vs. vectorization, but I want new users to be informed when developing solutions to their problems with this library.

# In Python, what is the purpose of `__slots__` and what are the cases one should avoid this?

## TLDR:

The special attribute `__slots__` allows you to explicitly state which instance attributes you expect your object instances to have, with the expected results:

1. faster attribute access.
2. space savings in memory.

The space savings is from

1. Storing value references in slots instead of `__dict__`.
2. Denying `__dict__` and `__weakref__` creation if parent classes deny them and you declare `__slots__`.

### Quick Caveats

Small caveat, you should only declare a particular slot one time in an inheritance tree. For example:

``````class Base:
__slots__ = "foo", "bar"

class Right(Base):
__slots__ = "baz",

class Wrong(Base):
__slots__ = "foo", "bar", "baz"        # redundant foo and bar
``````

Python doesn"t object when you get this wrong (it probably should), problems might not otherwise manifest, but your objects will take up more space than they otherwise should. Python 3.8:

``````>>> from sys import getsizeof
>>> getsizeof(Right()), getsizeof(Wrong())
(56, 72)
``````

This is because the Base"s slot descriptor has a slot separate from the Wrong"s. This shouldn"t usually come up, but it could:

``````>>> w = Wrong()
>>> w.foo = "foo"
>>> Base.foo.__get__(w)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: foo
>>> Wrong.foo.__get__(w)
"foo"
``````

The biggest caveat is for multiple inheritance - multiple "parent classes with nonempty slots" cannot be combined.

To accommodate this restriction, follow best practices: Factor out all but one or all parents" abstraction which their concrete class respectively and your new concrete class collectively will inherit from - giving the abstraction(s) empty slots (just like abstract base classes in the standard library).

See section on multiple inheritance below for an example.

### Requirements:

• To have attributes named in `__slots__` to actually be stored in slots instead of a `__dict__`, a class must inherit from `object` (automatic in Python 3, but must be explicit in Python 2).

• To prevent the creation of a `__dict__`, you must inherit from `object` and all classes in the inheritance must declare `__slots__` and none of them can have a `"__dict__"` entry.

There are a lot of details if you wish to keep reading.

## Why use `__slots__`: Faster attribute access.

The creator of Python, Guido van Rossum, states that he actually created `__slots__` for faster attribute access.

It is trivial to demonstrate measurably significant faster access:

``````import timeit

class Foo(object): __slots__ = "foo",

class Bar(object): pass

slotted = Foo()
not_slotted = Bar()

def get_set_delete_fn(obj):
def get_set_delete():
obj.foo = "foo"
obj.foo
del obj.foo
return get_set_delete
``````

and

``````>>> min(timeit.repeat(get_set_delete_fn(slotted)))
0.2846834529991611
>>> min(timeit.repeat(get_set_delete_fn(not_slotted)))
0.3664822799983085
``````

The slotted access is almost 30% faster in Python 3.5 on Ubuntu.

``````>>> 0.3664822799983085 / 0.2846834529991611
1.2873325658284342
``````

In Python 2 on Windows I have measured it about 15% faster.

## Why use `__slots__`: Memory Savings

Another purpose of `__slots__` is to reduce the space in memory that each object instance takes up.

The space saved over using `__dict__` can be significant.

SQLAlchemy attributes a lot of memory savings to `__slots__`.

To verify this, using the Anaconda distribution of Python 2.7 on Ubuntu Linux, with `guppy.hpy` (aka heapy) and `sys.getsizeof`, the size of a class instance without `__slots__` declared, and nothing else, is 64 bytes. That does not include the `__dict__`. Thank you Python for lazy evaluation again, the `__dict__` is apparently not called into existence until it is referenced, but classes without data are usually useless. When called into existence, the `__dict__` attribute is a minimum of 280 bytes additionally.

In contrast, a class instance with `__slots__` declared to be `()` (no data) is only 16 bytes, and 56 total bytes with one item in slots, 64 with two.

For 64 bit Python, I illustrate the memory consumption in bytes in Python 2.7 and 3.6, for `__slots__` and `__dict__` (no slots defined) for each point where the dict grows in 3.6 (except for 0, 1, and 2 attributes):

``````       Python 2.7             Python 3.6
attrs  __slots__  __dict__*   __slots__  __dict__* | *(no slots defined)
none   16         56 + 272‚Ä†   16         56 + 112‚Ä† | ‚Ä†if __dict__ referenced
one    48         56 + 272    48         56 + 112
two    56         56 + 272    56         56 + 112
six    88         56 + 1040   88         56 + 152
11     128        56 + 1040   128        56 + 240
22     216        56 + 3344   216        56 + 408
43     384        56 + 3344   384        56 + 752
``````

So, in spite of smaller dicts in Python 3, we see how nicely `__slots__` scale for instances to save us memory, and that is a major reason you would want to use `__slots__`.

Just for completeness of my notes, note that there is a one-time cost per slot in the class"s namespace of 64 bytes in Python 2, and 72 bytes in Python 3, because slots use data descriptors like properties, called "members".

``````>>> Foo.foo
<member "foo" of "Foo" objects>
>>> type(Foo.foo)
<class "member_descriptor">
>>> getsizeof(Foo.foo)
72
``````

## Demonstration of `__slots__`:

To deny the creation of a `__dict__`, you must subclass `object`. Everything subclasses `object` in Python 3, but in Python 2 you had to be explicit:

``````class Base(object):
__slots__ = ()
``````

now:

``````>>> b = Base()
>>> b.a = "a"
Traceback (most recent call last):
File "<pyshell#38>", line 1, in <module>
b.a = "a"
AttributeError: "Base" object has no attribute "a"
``````

Or subclass another class that defines `__slots__`

``````class Child(Base):
__slots__ = ("a",)
``````

and now:

``````c = Child()
c.a = "a"
``````

but:

``````>>> c.b = "b"
Traceback (most recent call last):
File "<pyshell#42>", line 1, in <module>
c.b = "b"
AttributeError: "Child" object has no attribute "b"
``````

To allow `__dict__` creation while subclassing slotted objects, just add `"__dict__"` to the `__slots__` (note that slots are ordered, and you shouldn"t repeat slots that are already in parent classes):

``````class SlottedWithDict(Child):
__slots__ = ("__dict__", "b")

swd = SlottedWithDict()
swd.a = "a"
swd.b = "b"
swd.c = "c"
``````

and

``````>>> swd.__dict__
{"c": "c"}
``````

Or you don"t even need to declare `__slots__` in your subclass, and you will still use slots from the parents, but not restrict the creation of a `__dict__`:

``````class NoSlots(Child): pass
ns = NoSlots()
ns.a = "a"
ns.b = "b"
``````

And:

``````>>> ns.__dict__
{"b": "b"}
``````

However, `__slots__` may cause problems for multiple inheritance:

``````class BaseA(object):
__slots__ = ("a",)

class BaseB(object):
__slots__ = ("b",)
``````

Because creating a child class from parents with both non-empty slots fails:

``````>>> class Child(BaseA, BaseB): __slots__ = ()
Traceback (most recent call last):
File "<pyshell#68>", line 1, in <module>
class Child(BaseA, BaseB): __slots__ = ()
TypeError: Error when calling the metaclass bases
multiple bases have instance lay-out conflict
``````

If you run into this problem, You could just remove `__slots__` from the parents, or if you have control of the parents, give them empty slots, or refactor to abstractions:

``````from abc import ABC

class AbstractA(ABC):
__slots__ = ()

class BaseA(AbstractA):
__slots__ = ("a",)

class AbstractB(ABC):
__slots__ = ()

class BaseB(AbstractB):
__slots__ = ("b",)

class Child(AbstractA, AbstractB):
__slots__ = ("a", "b")

c = Child() # no problem!
``````

### Add `"__dict__"` to `__slots__` to get dynamic assignment:

``````class Foo(object):
__slots__ = "bar", "baz", "__dict__"
``````

and now:

``````>>> foo = Foo()
>>> foo.boink = "boink"
``````

So with `"__dict__"` in slots we lose some of the size benefits with the upside of having dynamic assignment and still having slots for the names we do expect.

When you inherit from an object that isn"t slotted, you get the same sort of semantics when you use `__slots__` - names that are in `__slots__` point to slotted values, while any other values are put in the instance"s `__dict__`.

Avoiding `__slots__` because you want to be able to add attributes on the fly is actually not a good reason - just add `"__dict__"` to your `__slots__` if this is required.

You can similarly add `__weakref__` to `__slots__` explicitly if you need that feature.

### Set to empty tuple when subclassing a namedtuple:

The namedtuple builtin make immutable instances that are very lightweight (essentially, the size of tuples) but to get the benefits, you need to do it yourself if you subclass them:

``````from collections import namedtuple
class MyNT(namedtuple("MyNT", "bar baz")):
"""MyNT is an immutable and lightweight object"""
__slots__ = ()
``````

usage:

``````>>> nt = MyNT("bar", "baz")
>>> nt.bar
"bar"
>>> nt.baz
"baz"
``````

And trying to assign an unexpected attribute raises an `AttributeError` because we have prevented the creation of `__dict__`:

``````>>> nt.quux = "quux"
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: "MyNT" object has no attribute "quux"
``````

You can allow `__dict__` creation by leaving off `__slots__ = ()`, but you can"t use non-empty `__slots__` with subtypes of tuple.

## Biggest Caveat: Multiple inheritance

Even when non-empty slots are the same for multiple parents, they cannot be used together:

``````class Foo(object):
__slots__ = "foo", "bar"
class Bar(object):
__slots__ = "foo", "bar" # alas, would work if empty, i.e. ()

>>> class Baz(Foo, Bar): pass
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: Error when calling the metaclass bases
multiple bases have instance lay-out conflict
``````

Using an empty `__slots__` in the parent seems to provide the most flexibility, allowing the child to choose to prevent or allow (by adding `"__dict__"` to get dynamic assignment, see section above) the creation of a `__dict__`:

``````class Foo(object): __slots__ = ()
class Bar(object): __slots__ = ()
class Baz(Foo, Bar): __slots__ = ("foo", "bar")
b = Baz()
b.foo, b.bar = "foo", "bar"
``````

You don"t have to have slots - so if you add them, and remove them later, it shouldn"t cause any problems.

Going out on a limb here: If you"re composing mixins or using abstract base classes, which aren"t intended to be instantiated, an empty `__slots__` in those parents seems to be the best way to go in terms of flexibility for subclassers.

To demonstrate, first, let"s create a class with code we"d like to use under multiple inheritance

``````class AbstractBase:
__slots__ = ()
def __init__(self, a, b):
self.a = a
self.b = b
def __repr__(self):
return f"{type(self).__name__}({repr(self.a)}, {repr(self.b)})"
``````

We could use the above directly by inheriting and declaring the expected slots:

``````class Foo(AbstractBase):
__slots__ = "a", "b"
``````

But we don"t care about that, that"s trivial single inheritance, we need another class we might also inherit from, maybe with a noisy attribute:

``````class AbstractBaseC:
__slots__ = ()
@property
def c(self):
print("getting c!")
return self._c
@c.setter
def c(self, arg):
print("setting c!")
self._c = arg
``````

Now if both bases had nonempty slots, we couldn"t do the below. (In fact, if we wanted, we could have given `AbstractBase` nonempty slots a and b, and left them out of the below declaration - leaving them in would be wrong):

``````class Concretion(AbstractBase, AbstractBaseC):
__slots__ = "a b _c".split()
``````

And now we have functionality from both via multiple inheritance, and can still deny `__dict__` and `__weakref__` instantiation:

``````>>> c = Concretion("a", "b")
>>> c.c = c
setting c!
>>> c.c
getting c!
Concretion("a", "b")
>>> c.d = "d"
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: "Concretion" object has no attribute "d"
``````

## Other cases to avoid slots:

• Avoid them when you want to perform `__class__` assignment with another class that doesn"t have them (and you can"t add them) unless the slot layouts are identical. (I am very interested in learning who is doing this and why.)
• Avoid them if you want to subclass variable length builtins like long, tuple, or str, and you want to add attributes to them.
• Avoid them if you insist on providing default values via class attributes for instance variables.

You may be able to tease out further caveats from the rest of the `__slots__` documentation (the 3.7 dev docs are the most current), which I have made significant recent contributions to.

The current top answers cite outdated information and are quite hand-wavy and miss the mark in some important ways.

### Do not "only use `__slots__` when instantiating lots of objects"

I quote:

"You would want to use `__slots__` if you are going to instantiate a lot (hundreds, thousands) of objects of the same class."

Abstract Base Classes, for example, from the `collections` module, are not instantiated, yet `__slots__` are declared for them.

Why?

If a user wishes to deny `__dict__` or `__weakref__` creation, those things must not be available in the parent classes.

`__slots__` contributes to reusability when creating interfaces or mixins.

It is true that many Python users aren"t writing for reusability, but when you are, having the option to deny unnecessary space usage is valuable.

### `__slots__` doesn"t break pickling

When pickling a slotted object, you may find it complains with a misleading `TypeError`:

``````>>> pickle.loads(pickle.dumps(f))
TypeError: a class that defines __slots__ without defining __getstate__ cannot be pickled
``````

This is actually incorrect. This message comes from the oldest protocol, which is the default. You can select the latest protocol with the `-1` argument. In Python 2.7 this would be `2` (which was introduced in 2.3), and in 3.6 it is `4`.

``````>>> pickle.loads(pickle.dumps(f, -1))
<__main__.Foo object at 0x1129C770>
``````

in Python 2.7:

``````>>> pickle.loads(pickle.dumps(f, 2))
<__main__.Foo object at 0x1129C770>
``````

in Python 3.6

``````>>> pickle.loads(pickle.dumps(f, 4))
<__main__.Foo object at 0x1129C770>
``````

So I would keep this in mind, as it is a solved problem.

## Critique of the (until Oct 2, 2016) accepted answer

The first paragraph is half short explanation, half predictive. Here"s the only part that actually answers the question

The proper use of `__slots__` is to save space in objects. Instead of having a dynamic dict that allows adding attributes to objects at anytime, there is a static structure which does not allow additions after creation. This saves the overhead of one dict for every object that uses slots

The second half is wishful thinking, and off the mark:

While this is sometimes a useful optimization, it would be completely unnecessary if the Python interpreter was dynamic enough so that it would only require the dict when there actually were additions to the object.

Python actually does something similar to this, only creating the `__dict__` when it is accessed, but creating lots of objects with no data is fairly ridiculous.

The second paragraph oversimplifies and misses actual reasons to avoid `__slots__`. The below is not a real reason to avoid slots (for actual reasons, see the rest of my answer above.):

They change the behavior of the objects that have slots in a way that can be abused by control freaks and static typing weenies.

It then goes on to discuss other ways of accomplishing that perverse goal with Python, not discussing anything to do with `__slots__`.

The third paragraph is more wishful thinking. Together it is mostly off-the-mark content that the answerer didn"t even author and contributes to ammunition for critics of the site.

# Memory usage evidence

Create some normal objects and slotted objects:

``````>>> class Foo(object): pass
>>> class Bar(object): __slots__ = ()
``````

Instantiate a million of them:

``````>>> foos = [Foo() for f in xrange(1000000)]
>>> bars = [Bar() for b in xrange(1000000)]
``````

Inspect with `guppy.hpy().heap()`:

``````>>> guppy.hpy().heap()
Partition of a set of 2028259 objects. Total size = 99763360 bytes.
Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
0 1000000  49 64000000  64  64000000  64 __main__.Foo
1     169   0 16281480  16  80281480  80 list
2 1000000  49 16000000  16  96281480  97 __main__.Bar
3   12284   1   987472   1  97268952  97 str
...
``````

Access the regular objects and their `__dict__` and inspect again:

``````>>> for f in foos:
...     f.__dict__
>>> guppy.hpy().heap()
Partition of a set of 3028258 objects. Total size = 379763480 bytes.
Index  Count   %      Size    % Cumulative  % Kind (class / dict of class)
0 1000000  33 280000000  74 280000000  74 dict of __main__.Foo
1 1000000  33  64000000  17 344000000  91 __main__.Foo
2     169   0  16281480   4 360281480  95 list
3 1000000  33  16000000   4 376281480  99 __main__.Bar
4   12284   0    987472   0 377268952  99 str
...
``````

This is consistent with the history of Python, from Unifying types and classes in Python 2.2

If you subclass a built-in type, extra space is automatically added to the instances to accomodate `__dict__` and `__weakrefs__`. (The `__dict__` is not initialized until you use it though, so you shouldn"t worry about the space occupied by an empty dictionary for each instance you create.) If you don"t need this extra space, you can add the phrase "`__slots__ = []`" to your class.

# `os.listdir()` - list in the current directory

With listdir in os module you get the files and the folders in the current dir

`````` import os
arr = os.listdir()
print(arr)

>>> ["\$RECYCLE.BIN", "work.txt", "3ebooks.txt", "documents"]
``````

## Looking in a directory

``````arr = os.listdir("c:\files")
``````

# `glob` from glob

with glob you can specify a type of file to list like this

``````import glob

txtfiles = []
for file in glob.glob("*.txt"):
txtfiles.append(file)
``````

## `glob` in a list comprehension

``````mylist = [f for f in glob.glob("*.txt")]
``````

## get the full path of only files in the current directory

``````import os
from os import listdir
from os.path import isfile, join

cwd = os.getcwd()
onlyfiles = [os.path.join(cwd, f) for f in os.listdir(cwd) if
os.path.isfile(os.path.join(cwd, f))]
print(onlyfiles)

["G:\getfilesname\getfilesname.py", "G:\getfilesname\example.txt"]
``````

## Getting the full path name with `os.path.abspath`

You get the full path in return

`````` import os
files_path = [os.path.abspath(x) for x in os.listdir()]
print(files_path)

["F:\documentiapplications.txt", "F:\documenticollections.txt"]
``````

## Walk: going through sub directories

os.walk returns the root, the directories list and the files list, that is why I unpacked them in r, d, f in the for loop; it, then, looks for other files and directories in the subfolders of the root and so on until there are no subfolders.

``````import os

# Getting the current work directory (cwd)
thisdir = os.getcwd()

# r=root, d=directories, f = files
for r, d, f in os.walk(thisdir):
for file in f:
if file.endswith(".docx"):
print(os.path.join(r, file))
``````

### `os.listdir()`: get files in the current directory (Python 2)

In Python 2, if you want the list of the files in the current directory, you have to give the argument as "." or os.getcwd() in the os.listdir method.

`````` import os
arr = os.listdir(".")
print(arr)

>>> ["\$RECYCLE.BIN", "work.txt", "3ebooks.txt", "documents"]
``````

### To go up in the directory tree

``````# Method 1
x = os.listdir("..")

# Method 2
x= os.listdir("/")
``````

### Get files: `os.listdir()` in a particular directory (Python 2 and 3)

`````` import os
arr = os.listdir("F:\python")
print(arr)

>>> ["\$RECYCLE.BIN", "work.txt", "3ebooks.txt", "documents"]
``````

### Get files of a particular subdirectory with `os.listdir()`

``````import os

x = os.listdir("./content")
``````

### `os.walk(".")` - current directory

`````` import os
arr = next(os.walk("."))[2]
print(arr)

>>> ["5bs_Turismo1.pdf", "5bs_Turismo1.pptx", "esperienza.txt"]
``````

### `next(os.walk("."))` and `os.path.join("dir", "file")`

`````` import os
arr = []
for d,r,f in next(os.walk("F:\_python")):
for file in f:
arr.append(os.path.join(r,file))

for f in arr:
print(files)

>>> F:\_python\dict_class.py
>>> F:\_python\programmi.txt
``````

### `next(os.walk("F:\")` - get the full path - list comprehension

`````` [os.path.join(r,file) for r,d,f in next(os.walk("F:\_python")) for file in f]

>>> ["F:\_python\dict_class.py", "F:\_python\programmi.txt"]
``````

### `os.walk` - get full path - all files in sub dirs**

``````x = [os.path.join(r,file) for r,d,f in os.walk("F:\_python") for file in f]
print(x)

``````

### `os.listdir()` - get only txt files

`````` arr_txt = [x for x in os.listdir() if x.endswith(".txt")]
print(arr_txt)

>>> ["work.txt", "3ebooks.txt"]
``````

## Using `glob` to get the full path of the files

If I should need the absolute path of the files:

``````from path import path
from glob import glob
x = [path(f).abspath() for f in glob("F:\*.txt")]
for f in x:
print(f)

>>> F:acquistionline.txt
>>> F:acquisti_2018.txt
>>> F:ootstrap_jquery_ecc.txt
``````

## Using `os.path.isfile` to avoid directories in the list

``````import os.path
listOfFiles = [f for f in os.listdir() if os.path.isfile(f)]
print(listOfFiles)

>>> ["a simple game.py", "data.txt", "decorator.py"]
``````

## Using `pathlib` from Python 3.4

``````import pathlib

flist = []
for p in pathlib.Path(".").iterdir():
if p.is_file():
print(p)
flist.append(p)

>>> error.PNG
>>> exemaker.bat
>>> guiprova.mp3
>>> setup.py
>>> speak_gui2.py
>>> thumb.PNG
``````

With `list comprehension`:

``````flist = [p for p in pathlib.Path(".").iterdir() if p.is_file()]
``````

Alternatively, use `pathlib.Path()` instead of `pathlib.Path(".")`

## Use glob method in pathlib.Path()

``````import pathlib

py = pathlib.Path().glob("*.py")
for file in py:
print(file)

>>> stack_overflow_list.py
>>> stack_overflow_list_tkinter.py
``````

## Get all and only files with os.walk

``````import os
x = [i[2] for i in os.walk(".")]
y=[]
for t in x:
for f in t:
y.append(f)
print(y)

>>> ["append_to_list.py", "data.txt", "data1.txt", "data2.txt", "data_180617", "os_walk.py", "READ2.py", "read_data.py", "somma_defaltdic.py", "substitute_words.py", "sum_data.py", "data.txt", "data1.txt", "data_180617"]
``````

## Get only files with next and walk in a directory

`````` import os
x = next(os.walk("F://python"))[2]
print(x)

>>> ["calculator.bat","calculator.py"]
``````

## Get only directories with next and walk in a directory

`````` import os
next(os.walk("F://python"))[1] # for the current dir use (".")

>>> ["python3","others"]
``````

## Get all the subdir names with `walk`

``````for r,d,f in os.walk("F:\_python"):
for dirs in d:
print(dirs)

>>> .vscode
>>> pyexcel
>>> pyschool.py
>>> subtitles
>>> _metaprogramming
>>> .ipynb_checkpoints
``````

## `os.scandir()` from Python 3.5 and greater

``````import os
x = [f.name for f in os.scandir() if f.is_file()]
print(x)

>>> ["calculator.bat","calculator.py"]

# Another example with scandir (a little variation from docs.python.org)
# This one is more efficient than os.listdir.
# In this case, it shows the files only in the current directory
# where the script is executed.

import os
with os.scandir() as i:
for entry in i:
if entry.is_file():
print(entry.name)

>>> ebookmaker.py
>>> error.PNG
>>> exemaker.bat
>>> guiprova.mp3
>>> setup.py
>>> speakgui4.py
>>> speak_gui2.py
>>> speak_gui3.py
>>> thumb.PNG
``````

# Examples:

## Ex. 1: How many files are there in the subdirectories?

In this example, we look for the number of files that are included in all the directory and its subdirectories.

``````import os

def count(dir, counter=0):
"returns number of files in dir and subdirs"
for pack in os.walk(dir):
for f in pack[2]:
counter += 1
return dir + " : " + str(counter) + "files"

print(count("F:\python"))

>>> "F:\python" : 12057 files"
``````

## Ex.2: How to copy all files from a directory to another?

A script to make order in your computer finding all files of a type (default: pptx) and copying them in a new folder.

``````import os
import shutil
from path import path

destination = "F:\file_copied"
# os.makedirs(destination)

def copyfile(dir, filetype="pptx", counter=0):
"Searches for pptx (or other - pptx is the default) files and copies them"
for pack in os.walk(dir):
for f in pack[2]:
if f.endswith(filetype):
fullpath = pack[0] + "\" + f
print(fullpath)
shutil.copy(fullpath, destination)
counter += 1
if counter > 0:
print("-" * 30)
print("	==> Found in: `" + dir + "` : " + str(counter) + " files
")

for dir in os.listdir():
"searches for folders that starts with `_`"
if dir[0] == "_":
# copyfile(dir, filetype="pdf")
copyfile(dir, filetype="txt")

>>> _compiti18Compito Contabilit√† 1conti.txt
>>> _compiti18Compito Contabilit√† 1modula4.txt
>>> _compiti18Compito Contabilit√† 1moduloa4.txt
>>> ------------------------
>>> ==> Found in: `_compiti18` : 3 files
``````

## Ex. 3: How to get all the files in a txt file

In case you want to create a txt file with all the file names:

``````import os
mylist = ""
with open("filelist.txt", "w", encoding="utf-8") as file:
for eachfile in os.listdir():
mylist += eachfile + "
"
file.write(mylist)
``````

## Example: txt with all the files of an hard drive

``````"""
We are going to save a txt file with all the files in your directory.
We will use the function walk()
"""

import os

# see all the methods of os
# print(*dir(os), sep=", ")
listafile = []
percorso = []
with open("lista_file.txt", "w", encoding="utf-8") as testo:
for root, dirs, files in os.walk("D:\"):
for file in files:
listafile.append(file)
percorso.append(root + "\" + file)
testo.write(file + "
")
listafile.sort()
print("N. of files", len(listafile))
with open("lista_file_ordinata.txt", "w", encoding="utf-8") as testo_ordinato:
for file in listafile:
testo_ordinato.write(file + "
")

with open("percorso.txt", "w", encoding="utf-8") as file_percorso:
for file in percorso:
file_percorso.write(file + "
")

os.system("lista_file.txt")
os.system("lista_file_ordinata.txt")
os.system("percorso.txt")
``````

## All the file of C: in one text file

This is a shorter version of the previous code. Change the folder where to start finding the files if you need to start from another position. This code generate a 50 mb on text file on my computer with something less then 500.000 lines with files with the complete path.

``````import os

with open("file.txt", "w", encoding="utf-8") as filewrite:
for r, d, f in os.walk("C:\"):
for file in f:
filewrite.write(f"{r + file}
")
``````

## How to write a file with all paths in a folder of a type

With this function you can create a txt file that will have the name of a type of file that you look for (ex. pngfile.txt) with all the full path of all the files of that type. It can be useful sometimes, I think.

``````import os

def searchfiles(extension=".ttf", folder="H:\"):
"Create a txt file with all the file of a type"
with open(extension[1:] + "file.txt", "w", encoding="utf-8") as filewrite:
for r, d, f in os.walk(folder):
for file in f:
if file.endswith(extension):
filewrite.write(f"{r + file}
")

# looking for png file (fonts) in the hard disk H:
searchfiles(".png", "H:\")

>>> H:4bs_18Dolphins5.png
>>> H:4bs_18Dolphins6.png
>>> H:4bs_18Dolphins7.png
>>> H:5_18marketing htmlassetsimageslogo2.png
>>> H:7z001.png
>>> H:7z002.png
``````

## (New) Find all files and open them with tkinter GUI

I just wanted to add in this 2019 a little app to search for all files in a dir and be able to open them by doubleclicking on the name of the file in the list.

``````import tkinter as tk
import os

def searchfiles(extension=".txt", folder="H:\"):
"insert all files in the listbox"
for r, d, f in os.walk(folder):
for file in f:
if file.endswith(extension):
lb.insert(0, r + "\" + file)

def open_file():
os.startfile(lb.get(lb.curselection()[0]))

root = tk.Tk()
root.geometry("400x400")
bt = tk.Button(root, text="Search", command=lambda:searchfiles(".png", "H:\"))
bt.pack()
lb = tk.Listbox(root)
lb.pack(fill="both", expand=1)
lb.bind("<Double-Button>", lambda x: open_file())
root.mainloop()
``````

This post aims to give readers a primer on SQL-flavored merging with Pandas, how to use it, and when not to use it.

In particular, here"s what this post will go through:

• The basics - types of joins (LEFT, RIGHT, OUTER, INNER)

• merging with different column names
• merging with multiple columns
• avoiding duplicate merge key column in output

What this post (and other posts by me on this thread) will not go through:

• Performance-related discussions and timings (for now). Mostly notable mentions of better alternatives, wherever appropriate.
• Handling suffixes, removing extra columns, renaming outputs, and other specific use cases. There are other (read: better) posts that deal with that, so figure it out!

Note Most examples default to INNER JOIN operations while demonstrating various features, unless otherwise specified.

Furthermore, all the DataFrames here can be copied and replicated so you can play with them. Also, see this post on how to read DataFrames from your clipboard.

Lastly, all visual representation of JOIN operations have been hand-drawn using Google Drawings. Inspiration from here.

# Enough talk - just show me how to use `merge`!

### Setup & Basics

``````np.random.seed(0)
left = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": np.random.randn(4)})
right = pd.DataFrame({"key": ["B", "D", "E", "F"], "value": np.random.randn(4)})

left

key     value
0   A  1.764052
1   B  0.400157
2   C  0.978738
3   D  2.240893

right

key     value
0   B  1.867558
1   D -0.977278
2   E  0.950088
3   F -0.151357
``````

For the sake of simplicity, the key column has the same name (for now).

An INNER JOIN is represented by

Note This, along with the forthcoming figures all follow this convention:

• blue indicates rows that are present in the merge result
• red indicates rows that are excluded from the result (i.e., removed)
• green indicates missing values that are replaced with `NaN`s in the result

To perform an INNER JOIN, call `merge` on the left DataFrame, specifying the right DataFrame and the join key (at the very least) as arguments.

``````left.merge(right, on="key")
# Or, if you want to be explicit
# left.merge(right, on="key", how="inner")

key   value_x   value_y
0   B  0.400157  1.867558
1   D  2.240893 -0.977278
``````

This returns only rows from `left` and `right` which share a common key (in this example, "B" and "D).

A LEFT OUTER JOIN, or LEFT JOIN is represented by

This can be performed by specifying `how="left"`.

``````left.merge(right, on="key", how="left")

key   value_x   value_y
0   A  1.764052       NaN
1   B  0.400157  1.867558
2   C  0.978738       NaN
3   D  2.240893 -0.977278
``````

Carefully note the placement of NaNs here. If you specify `how="left"`, then only keys from `left` are used, and missing data from `right` is replaced by NaN.

And similarly, for a RIGHT OUTER JOIN, or RIGHT JOIN which is...

...specify `how="right"`:

``````left.merge(right, on="key", how="right")

key   value_x   value_y
0   B  0.400157  1.867558
1   D  2.240893 -0.977278
2   E       NaN  0.950088
3   F       NaN -0.151357
``````

Here, keys from `right` are used, and missing data from `left` is replaced by NaN.

Finally, for the FULL OUTER JOIN, given by

specify `how="outer"`.

``````left.merge(right, on="key", how="outer")

key   value_x   value_y
0   A  1.764052       NaN
1   B  0.400157  1.867558
2   C  0.978738       NaN
3   D  2.240893 -0.977278
4   E       NaN  0.950088
5   F       NaN -0.151357
``````

This uses the keys from both frames, and NaNs are inserted for missing rows in both.

The documentation summarizes these various merges nicely:

### Other JOINs - LEFT-Excluding, RIGHT-Excluding, and FULL-Excluding/ANTI JOINs

If you need LEFT-Excluding JOINs and RIGHT-Excluding JOINs in two steps.

For LEFT-Excluding JOIN, represented as

Start by performing a LEFT OUTER JOIN and then filtering (excluding!) rows coming from `left` only,

``````(left.merge(right, on="key", how="left", indicator=True)
.query("_merge == "left_only"")
.drop("_merge", 1))

key   value_x  value_y
0   A  1.764052      NaN
2   C  0.978738      NaN
``````

Where,

``````left.merge(right, on="key", how="left", indicator=True)

key   value_x   value_y     _merge
0   A  1.764052       NaN  left_only
1   B  0.400157  1.867558       both
2   C  0.978738       NaN  left_only
3   D  2.240893 -0.977278       both``````

And similarly, for a RIGHT-Excluding JOIN,

``````(left.merge(right, on="key", how="right", indicator=True)
.query("_merge == "right_only"")
.drop("_merge", 1))

key  value_x   value_y
2   E      NaN  0.950088
3   F      NaN -0.151357``````

Lastly, if you are required to do a merge that only retains keys from the left or right, but not both (IOW, performing an ANTI-JOIN),

You can do this in similar fashion‚Äî

``````(left.merge(right, on="key", how="outer", indicator=True)
.query("_merge != "both"")
.drop("_merge", 1))

key   value_x   value_y
0   A  1.764052       NaN
2   C  0.978738       NaN
4   E       NaN  0.950088
5   F       NaN -0.151357
``````

### Different names for key columns

If the key columns are named differently‚Äîfor example, `left` has `keyLeft`, and `right` has `keyRight` instead of `key`‚Äîthen you will have to specify `left_on` and `right_on` as arguments instead of `on`:

``````left2 = left.rename({"key":"keyLeft"}, axis=1)
right2 = right.rename({"key":"keyRight"}, axis=1)

left2

keyLeft     value
0       A  1.764052
1       B  0.400157
2       C  0.978738
3       D  2.240893

right2

keyRight     value
0        B  1.867558
1        D -0.977278
2        E  0.950088
3        F -0.151357
``````
``````left2.merge(right2, left_on="keyLeft", right_on="keyRight", how="inner")

keyLeft   value_x keyRight   value_y
0       B  0.400157        B  1.867558
1       D  2.240893        D -0.977278
``````

### Avoiding duplicate key column in output

When merging on `keyLeft` from `left` and `keyRight` from `right`, if you only want either of the `keyLeft` or `keyRight` (but not both) in the output, you can start by setting the index as a preliminary step.

``````left3 = left2.set_index("keyLeft")
left3.merge(right2, left_index=True, right_on="keyRight")

value_x keyRight   value_y
0  0.400157        B  1.867558
1  2.240893        D -0.977278
``````

Contrast this with the output of the command just before (that is, the output of `left2.merge(right2, left_on="keyLeft", right_on="keyRight", how="inner")`), you"ll notice `keyLeft` is missing. You can figure out what column to keep based on which frame"s index is set as the key. This may matter when, say, performing some OUTER JOIN operation.

### Merging only a single column from one of the `DataFrames`

For example, consider

``````right3 = right.assign(newcol=np.arange(len(right)))
right3
key     value  newcol
0   B  1.867558       0
1   D -0.977278       1
2   E  0.950088       2
3   F -0.151357       3
``````

If you are required to merge only "new_val" (without any of the other columns), you can usually just subset columns before merging:

``````left.merge(right3[["key", "newcol"]], on="key")

key     value  newcol
0   B  0.400157       0
1   D  2.240893       1
``````

If you"re doing a LEFT OUTER JOIN, a more performant solution would involve `map`:

``````# left["newcol"] = left["key"].map(right3.set_index("key")["newcol"]))
left.assign(newcol=left["key"].map(right3.set_index("key")["newcol"]))

key     value  newcol
0   A  1.764052     NaN
1   B  0.400157     0.0
2   C  0.978738     NaN
3   D  2.240893     1.0
``````

As mentioned, this is similar to, but faster than

``````left.merge(right3[["key", "newcol"]], on="key", how="left")

key     value  newcol
0   A  1.764052     NaN
1   B  0.400157     0.0
2   C  0.978738     NaN
3   D  2.240893     1.0
``````

### Merging on multiple columns

To join on more than one column, specify a list for `on` (or `left_on` and `right_on`, as appropriate).

``````left.merge(right, on=["key1", "key2"] ...)
``````

Or, in the event the names are different,

``````left.merge(right, left_on=["lkey1", "lkey2"], right_on=["rkey1", "rkey2"])
``````

### Other useful `merge*` operations and functions

This section only covers the very basics, and is designed to only whet your appetite. For more examples and cases, see the documentation on `merge`, `join`, and `concat` as well as the links to the function specifications.

*You are here.

# tl;dr / quick fix

• Don"t decode/encode willy nilly
• Don"t assume your strings are UTF-8 encoded
• Try to convert strings to Unicode strings as soon as possible in your code
• Fix your locale: How to solve UnicodeDecodeError in Python 3.6?
• Don"t be tempted to use quick `reload` hacks

# Unicode Zen in Python 2.x - The Long Version

Without seeing the source it"s difficult to know the root cause, so I"ll have to speak generally.

`UnicodeDecodeError: "ascii" codec can"t decode byte` generally happens when you try to convert a Python 2.x `str` that contains non-ASCII to a Unicode string without specifying the encoding of the original string.

In brief, Unicode strings are an entirely separate type of Python string that does not contain any encoding. They only hold Unicode point codes and therefore can hold any Unicode point from across the entire spectrum. Strings contain encoded text, beit UTF-8, UTF-16, ISO-8895-1, GBK, Big5 etc. Strings are decoded to Unicode and Unicodes are encoded to strings. Files and text data are always transferred in encoded strings.

The Markdown module authors probably use `unicode()` (where the exception is thrown) as a quality gate to the rest of the code - it will convert ASCII or re-wrap existing Unicodes strings to a new Unicode string. The Markdown authors can"t know the encoding of the incoming string so will rely on you to decode strings to Unicode strings before passing to Markdown.

Unicode strings can be declared in your code using the `u` prefix to strings. E.g.

``````>>> my_u = u"my √ºnic√¥d√© strƒØng"
>>> type(my_u)
<type "unicode">
``````

Unicode strings may also come from file, databases and network modules. When this happens, you don"t need to worry about the encoding.

# Gotchas

Conversion from `str` to Unicode can happen even when you don"t explicitly call `unicode()`.

The following scenarios cause `UnicodeDecodeError` exceptions:

``````# Explicit conversion without encoding
unicode("‚Ç¨")

# New style format string into Unicode string
# Python will try to convert value string to Unicode first
u"The currency is: {}".format("‚Ç¨")

# Old style format string into Unicode string
# Python will try to convert value string to Unicode first
u"The currency is: %s" % "‚Ç¨"

# Append string to Unicode
# Python will try to convert string to Unicode first
u"The currency is: " + "‚Ç¨"
``````

## Examples

In the following diagram, you can see how the word `caf√©` has been encoded in either "UTF-8" or "Cp1252" encoding depending on the terminal type. In both examples, `caf` is just regular ascii. In UTF-8, `√©` is encoded using two bytes. In "Cp1252", √© is 0xE9 (which is also happens to be the Unicode point value (it"s no coincidence)). The correct `decode()` is invoked and conversion to a Python Unicode is successfull:

In this diagram, `decode()` is called with `ascii` (which is the same as calling `unicode()` without an encoding given). As ASCII can"t contain bytes greater than `0x7F`, this will throw a `UnicodeDecodeError` exception:

# The Unicode Sandwich

It"s good practice to form a Unicode sandwich in your code, where you decode all incoming data to Unicode strings, work with Unicodes, then encode to `str`s on the way out. This saves you from worrying about the encoding of strings in the middle of your code.

## Input / Decode

### Source code

If you need to bake non-ASCII into your source code, just create Unicode strings by prefixing the string with a `u`. E.g.

``````u"Z√ºrich"
``````

To allow Python to decode your source code, you will need to add an encoding header to match the actual encoding of your file. For example, if your file was encoded as "UTF-8", you would use:

``````# encoding: utf-8
``````

This is only necessary when you have non-ASCII in your source code.

### Files

Usually non-ASCII data is received from a file. The `io` module provides a TextWrapper that decodes your file on the fly, using a given `encoding`. You must use the correct encoding for the file - it can"t be easily guessed. For example, for a UTF-8 file:

``````import io
with io.open("my_utf8_file.txt", "r", encoding="utf-8") as my_file:
``````

`my_unicode_string` would then be suitable for passing to Markdown. If a `UnicodeDecodeError` from the `read()` line, then you"ve probably used the wrong encoding value.

### CSV Files

The Python 2.7 CSV module does not support non-ASCII characters üò©. Help is at hand, however, with https://pypi.python.org/pypi/backports.csv.

Use it like above but pass the opened file to it:

``````from backports import csv
import io
with io.open("my_utf8_file.txt", "r", encoding="utf-8") as my_file:
yield row
``````

### Databases

Most Python database drivers can return data in Unicode, but usually require a little configuration. Always use Unicode strings for SQL queries.

MySQL

``````charset="utf8",
use_unicode=True
``````

E.g.

``````>>> db = MySQLdb.connect(host="localhost", user="root", passwd="passwd", db="sandbox", use_unicode=True, charset="utf8")
``````
PostgreSQL

``````psycopg2.extensions.register_type(psycopg2.extensions.UNICODE)
psycopg2.extensions.register_type(psycopg2.extensions.UNICODEARRAY)
``````

### HTTP

Web pages can be encoded in just about any encoding. The `Content-type` header should contain a `charset` field to hint at the encoding. The content can then be decoded manually against this value. Alternatively, Python-Requests returns Unicodes in `response.text`.

### Manually

If you must decode strings manually, you can simply do `my_string.decode(encoding)`, where `encoding` is the appropriate encoding. Python 2.x supported codecs are given here: Standard Encodings. Again, if you get `UnicodeDecodeError` then you"ve probably got the wrong encoding.

## The meat of the sandwich

Work with Unicodes as you would normal strs.

## Output

### stdout / printing

`print` writes through the stdout stream. Python tries to configure an encoder on stdout so that Unicodes are encoded to the console"s encoding. For example, if a Linux shell"s `locale` is `en_GB.UTF-8`, the output will be encoded to `UTF-8`. On Windows, you will be limited to an 8bit code page.

An incorrectly configured console, such as corrupt locale, can lead to unexpected print errors. `PYTHONIOENCODING` environment variable can force the encoding for stdout.

### Files

Just like input, `io.open` can be used to transparently convert Unicodes to encoded byte strings.

### Database

The same configuration for reading will allow Unicodes to be written directly.

# Python 3

Python 3 is no more Unicode capable than Python 2.x is, however it is slightly less confused on the topic. E.g the regular `str` is now a Unicode string and the old `str` is now `bytes`.

The default encoding is UTF-8, so if you `.decode()` a byte string without giving an encoding, Python 3 uses UTF-8 encoding. This probably fixes 50% of people"s Unicode problems.

Further, `open()` operates in text mode by default, so returns decoded `str` (Unicode ones). The encoding is derived from your locale, which tends to be UTF-8 on Un*x systems or an 8-bit code page, such as windows-1251, on Windows boxes.

# Why you shouldn"t use `sys.setdefaultencoding("utf8")`

It"s a nasty hack (there"s a reason you have to use `reload`) that will only mask problems and hinder your migration to Python 3.x. Understand the problem, fix the root cause and enjoy Unicode zen. See Why should we NOT use sys.setdefaultencoding("utf-8") in a py script? for further details

Clear the cache directory where appropriate for your system

Linux and Unix

``````~/.cache/pip  # and it respects the XDG_CACHE_HOME directory.
``````

OS X

``````~/Library/Caches/pip
``````

Windows

``````%LocalAppData%pipCache
``````

### UPDATE

With pip `20.1` or later, you can find the full path for your operating system easily by typing this in the command line:

``````pip cache dir
``````

Example output on my Ubuntu installation:

``````‚ûú pip3 cache dir
/home/tawanda/.cache/pip
``````

For sanity, you probably want to have all `datetimes` calculated by your DB server, rather than the application server. Calculating the timestamp in the application can lead to problems because network latency is variable, clients experience slightly different clock drift, and different programming languages occasionally calculate time slightly differently.

SQLAlchemy allows you to do this by passing `func.now()` or `func.current_timestamp()` (they are aliases of each other) which tells the DB to calculate the timestamp itself.

### Use SQLALchemy"s `server_default`

Additionally, for a default where you"re already telling the DB to calculate the value, it"s generally better to use `server_default` instead of `default`. This tells SQLAlchemy to pass the default value as part of the `CREATE TABLE` statement.

For example, if you write an ad hoc script against this table, using `server_default` means you won"t need to worry about manually adding a timestamp call to your script--the database will set it automatically.

### Understanding SQLAlchemy"s `onupdate`/`server_onupdate`

SQLAlchemy also supports `onupdate` so that anytime the row is updated it inserts a new timestamp. Again, best to tell the DB to calculate the timestamp itself:

``````from sqlalchemy.sql import func

time_created = Column(DateTime(timezone=True), server_default=func.now())
time_updated = Column(DateTime(timezone=True), onupdate=func.now())
``````

There is a `server_onupdate` parameter, but unlike `server_default`, it doesn"t actually set anything serverside. It just tells SQLalchemy that your database will change the column when an update happens (perhaps you created a trigger on the column ), so SQLAlchemy will ask for the return value so it can update the corresponding object.

### One other potential gotcha:

You might be surprised to notice that if you make a bunch of changes within a single transaction, they all have the same timestamp. That"s because the SQL standard specifies that `CURRENT_TIMESTAMP` returns values based on the start of the transaction.

PostgreSQL provides the non-SQL-standard `statement_timestamp()` and `clock_timestamp()` which do change within a transaction. Docs here: https://www.postgresql.org/docs/current/static/functions-datetime.html#FUNCTIONS-DATETIME-CURRENT

### UTC timestamp

If you want to use UTC timestamps, a stub of implementation for `func.utcnow()` is provided in SQLAlchemy documentation. You need to provide appropriate driver-specific functions on your own though.

## How to remove convexity defects in a Sudoku square?

I was doing a fun project: Solving a Sudoku from an input image using OpenCV (as in Google goggles etc). And I have completed the task, but at the end I found a little problem for which I came here.

I did the programming using Python API of OpenCV 2.3.1.

Below is what I did :

2. Find the contours
3. Select the one with maximum area, ( and also somewhat equivalent to square).
4. Find the corner points.

e.g. given below:

(Notice here that the green line correctly coincides with the true boundary of the Sudoku, so the Sudoku can be correctly warped. Check next image)

5. warp the image to a perfect square

eg image:

6. Perform OCR ( for which I used the method I have given in Simple Digit Recognition OCR in OpenCV-Python )

And the method worked well.

Problem:

Check out this image.

Performing the step 4 on this image gives the result below:

The red line drawn is the original contour which is the true outline of sudoku boundary.

The green line drawn is approximated contour which will be the outline of warped image.

Which of course, there is difference between green line and red line at the top edge of sudoku. So while warping, I am not getting the original boundary of the Sudoku.

My Question :

How can I warp the image on the correct boundary of the Sudoku, i.e. the red line OR how can I remove the difference between red line and green line? Is there any method for this in OpenCV?

## Is there a library function for Root mean square error (RMSE) in python?

I know I could implement a root mean squared error function like this:

``````def rmse(predictions, targets):
return np.sqrt(((predictions - targets) ** 2).mean())
``````

What I"m looking for if this rmse function is implemented in a library somewhere, perhaps in scipy or scikit-learn?

## What"s the difference between lists enclosed by square brackets and parentheses in Python?

``````>>> x=[1,2]
>>> x[1]
2
>>> x=(1,2)
>>> x[1]
2
``````

Are they both valid? Is one preferred for some reason?

## How do I calculate square root in Python?

Why does Python give the "wrong" answer?

``````x = 16

sqrt = x**(.5)  #returns 4
sqrt = x**(1/2) #returns 1
``````

Yes, I know `import math` and use `sqrt`. But I"m looking for an answer to the above.

## What do square brackets mean in pip install?

I see more and more commands like this:

``````\$ pip install "splinter[django]"
``````

What do these square brackets do?

## How do I calculate r-squared using Python and Numpy?

I"m using Python and Numpy to calculate a best fit polynomial of arbitrary degree. I pass a list of x values, y values, and the degree of the polynomial I want to fit (linear, quadratic, etc.).

This much works, but I also want to calculate r (coefficient of correlation) and r-squared(coefficient of determination). I am comparing my results with Excel"s best-fit trendline capability, and the r-squared value it calculates. Using this, I know I am calculating r-squared correctly for linear best-fit (degree equals 1). However, my function does not work for polynomials with degree greater than 1.

Excel is able to do this. How do I calculate r-squared for higher-order polynomials using Numpy?

Here"s my function:

``````import numpy

# Polynomial Regression
def polyfit(x, y, degree):
results = {}

coeffs = numpy.polyfit(x, y, degree)
# Polynomial Coefficients
results["polynomial"] = coeffs.tolist()

correlation = numpy.corrcoef(x, y)[0,1]

# r
results["correlation"] = correlation
# r-squared
results["determination"] = correlation**2

return results
``````

## What is the difference between using loc and using just square brackets to filter for columns in Pandas/Python?

I"ve noticed three methods of selecting a column in a Pandas DataFrame:

First method of selecting a column using loc:

``````df_new = df.loc[:, "col1"]
``````

Second method - seems simpler and faster:

``````df_new = df["col1"]
``````

Third method - most convenient:

``````df_new = df.col1
``````

Is there a difference between these three methods? I don"t think so, in which case I"d rather use the third method.

I"m mostly curious as to why there appear to be three methods for doing the same thing.

I think you"re almost there, try removing the extra square brackets around the `lst`"s (Also you don"t need to specify the column names when you"re creating a dataframe from a dict like this):

``````import pandas as pd
lst1 = range(100)
lst2 = range(100)
lst3 = range(100)
percentile_list = pd.DataFrame(
{"lst1Title": lst1,
"lst2Title": lst2,
"lst3Title": lst3
})

percentile_list
lst1Title  lst2Title  lst3Title
0          0         0         0
1          1         1         1
2          2         2         2
3          3         3         3
4          4         4         4
5          5         5         5
6          6         6         6
...
``````

If you need a more performant solution you can use `np.column_stack` rather than `zip` as in your first attempt, this has around a 2x speedup on the example here, however comes at bit of a cost of readability in my opinion:

``````import numpy as np
percentile_list = pd.DataFrame(np.column_stack([lst1, lst2, lst3]),
columns=["lst1Title", "lst2Title", "lst3Title"])
``````

There is a clean, one-line way of doing this in Pandas:

``````df["col_3"] = df.apply(lambda x: f(x.col_1, x.col_2), axis=1)
``````

This allows `f` to be a user-defined function with multiple input values, and uses (safe) column names rather than (unsafe) numeric indices to access the columns.

Example with data (based on original question):

``````import pandas as pd

df = pd.DataFrame({"ID":["1", "2", "3"], "col_1": [0, 2, 3], "col_2":[1, 4, 5]})
mylist = ["a", "b", "c", "d", "e", "f"]

def get_sublist(sta,end):
return mylist[sta:end+1]

df["col_3"] = df.apply(lambda x: get_sublist(x.col_1, x.col_2), axis=1)
``````

Output of `print(df)`:

``````  ID  col_1  col_2      col_3
0  1      0      1     [a, b]
1  2      2      4  [c, d, e]
2  3      3      5  [d, e, f]
``````

If your column names contain spaces or share a name with an existing dataframe attribute, you can index with square brackets:

``````df["col_3"] = df.apply(lambda x: f(x["col 1"], x["col 2"]), axis=1)
``````

# Distribution Fitting with Sum of Square Error (SSE)

This is an update and modification to Saullo"s answer, that uses the full list of the current `scipy.stats` distributions and returns the distribution with the least SSE between the distribution"s histogram and the data"s histogram.

## Example Fitting

Using the El Ni√±o dataset from `statsmodels`, the distributions are fit and error is determined. The distribution with the least error is returned.

### Example Code

``````%matplotlib inline

import warnings
import numpy as np
import pandas as pd
import scipy.stats as st
import statsmodels.api as sm
from scipy.stats._continuous_distns import _distn_names
import matplotlib
import matplotlib.pyplot as plt

matplotlib.rcParams["figure.figsize"] = (16.0, 12.0)
matplotlib.style.use("ggplot")

# Create models from data
def best_fit_distribution(data, bins=200, ax=None):
"""Model data by finding best fit distribution to data"""
# Get histogram of original data
y, x = np.histogram(data, bins=bins, density=True)
x = (x + np.roll(x, -1))[:-1] / 2.0

# Best holders
best_distributions = []

# Estimate distribution parameters from data
for ii, distribution in enumerate([d for d in _distn_names if not d in ["levy_stable", "studentized_range"]]):

print("{:>3} / {:<3}: {}".format( ii+1, len(_distn_names), distribution ))

distribution = getattr(st, distribution)

# Try to fit the distribution
try:
# Ignore warnings from data that can"t be fit
with warnings.catch_warnings():
warnings.filterwarnings("ignore")

# fit dist to data
params = distribution.fit(data)

# Separate parts of parameters
arg = params[:-2]
loc = params[-2]
scale = params[-1]

# Calculate fitted PDF and error with fit in distribution
pdf = distribution.pdf(x, loc=loc, scale=scale, *arg)
sse = np.sum(np.power(y - pdf, 2.0))

# if axis pass in add to plot
try:
if ax:
pd.Series(pdf, x).plot(ax=ax)
end
except Exception:
pass

# identify if this distribution is better
best_distributions.append((distribution, params, sse))

except Exception:
pass

return sorted(best_distributions, key=lambda x:x[2])

def make_pdf(dist, params, size=10000):
"""Generate distributions"s Probability Distribution Function """

# Separate parts of parameters
arg = params[:-2]
loc = params[-2]
scale = params[-1]

# Get sane start and end points of distribution
start = dist.ppf(0.01, *arg, loc=loc, scale=scale) if arg else dist.ppf(0.01, loc=loc, scale=scale)
end = dist.ppf(0.99, *arg, loc=loc, scale=scale) if arg else dist.ppf(0.99, loc=loc, scale=scale)

# Build PDF and turn into pandas Series
x = np.linspace(start, end, size)
y = dist.pdf(x, loc=loc, scale=scale, *arg)
pdf = pd.Series(y, x)

return pdf

# Load data from statsmodels datasets

# Plot for comparison
plt.figure(figsize=(12,8))
ax = data.plot(kind="hist", bins=50, density=True, alpha=0.5, color=list(matplotlib.rcParams["axes.prop_cycle"])[1]["color"])

# Save plot limits
dataYLim = ax.get_ylim()

# Find best fit distribution
best_distibutions = best_fit_distribution(data, 200, ax)
best_dist = best_distibutions[0]

# Update plots
ax.set_ylim(dataYLim)
ax.set_title(u"El Ni√±o sea temp.
All Fitted Distributions")
ax.set_xlabel(u"Temp (¬∞C)")
ax.set_ylabel("Frequency")

# Make PDF with best params
pdf = make_pdf(best_dist[0], best_dist[1])

# Display
plt.figure(figsize=(12,8))
ax = pdf.plot(lw=2, label="PDF", legend=True)
data.plot(kind="hist", bins=50, density=True, alpha=0.5, label="Data", legend=True, ax=ax)

param_names = (best_dist[0].shapes + ", loc, scale").split(", ") if best_dist[0].shapes else ["loc", "scale"]
param_str = ", ".join(["{}={:0.2f}".format(k,v) for k,v in zip(param_names, best_dist[1])])
dist_str = "{}({})".format(best_dist[0].name, param_str)

ax.set_title(u"El Ni√±o sea temp. with best fit distribution
" + dist_str)
ax.set_xlabel(u"Temp. (¬∞C)")
ax.set_ylabel("Frequency")
``````

TL;DR

``````def square_list(n):
the_list = []                         # Replace
for x in range(n):
y = x * x
the_list.append(y)                # these
return the_list                       # lines
``````

# do this:

``````def square_yield(n):
for x in range(n):
y = x * x
yield y                           # with this one.
``````

Whenever you find yourself building a list from scratch, `yield` each piece instead.

This was my first "aha" moment with yield.

`yield` is a sugary way to say

build a series of stuff

Same behavior:

``````>>> for square in square_list(4):
...     print(square)
...
0
1
4
9
>>> for square in square_yield(4):
...     print(square)
...
0
1
4
9
``````

Different behavior:

Yield is single-pass: you can only iterate through once. When a function has a yield in it we call it a generator function. And an iterator is what it returns. Those terms are revealing. We lose the convenience of a container, but gain the power of a series that"s computed as needed, and arbitrarily long.

Yield is lazy, it puts off computation. A function with a yield in it doesn"t actually execute at all when you call it. It returns an iterator object that remembers where it left off. Each time you call `next()` on the iterator (this happens in a for-loop) execution inches forward to the next yield. `return` raises StopIteration and ends the series (this is the natural end of a for-loop).

Yield is versatile. Data doesn"t have to be stored all together, it can be made available one at a time. It can be infinite.

``````>>> def squares_all_of_them():
...     x = 0
...     while True:
...         yield x * x
...         x += 1
...
>>> squares = squares_all_of_them()
>>> for _ in range(4):
...     print(next(squares))
...
0
1
4
9
``````

If you need multiple passes and the series isn"t too long, just call `list()` on it:

``````>>> list(square_yield(4))
[0, 1, 4, 9]
``````

Brilliant choice of the word `yield` because both meanings apply:

yield — produce or provide (as in agriculture)

...provide the next data in the series.

yield — give way or relinquish (as in political power)

...relinquish CPU execution until the iterator advances.

This is kind of overkill but let"s give it a go. First lets use statsmodel to find out what the p-values should be

``````import pandas as pd
import numpy as np
from sklearn import datasets, linear_model
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from scipy import stats

X = diabetes.data
y = diabetes.target

est = sm.OLS(y, X2)
est2 = est.fit()
print(est2.summary())
``````

and we get

``````                         OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.518
Method:                 Least Squares   F-statistic:                     46.27
Date:                Wed, 08 Mar 2017   Prob (F-statistic):           3.83e-62
Time:                        10:08:24   Log-Likelihood:                -2386.0
No. Observations:                 442   AIC:                             4794.
Df Residuals:                     431   BIC:                             4839.
Df Model:                          10
Covariance Type:            nonrobust
==============================================================================
coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        152.1335      2.576     59.061      0.000     147.071     157.196
x1           -10.0122     59.749     -0.168      0.867    -127.448     107.424
x2          -239.8191     61.222     -3.917      0.000    -360.151    -119.488
x3           519.8398     66.534      7.813      0.000     389.069     650.610
x4           324.3904     65.422      4.958      0.000     195.805     452.976
x5          -792.1842    416.684     -1.901      0.058   -1611.169      26.801
x6           476.7458    339.035      1.406      0.160    -189.621    1143.113
x7           101.0446    212.533      0.475      0.635    -316.685     518.774
x8           177.0642    161.476      1.097      0.273    -140.313     494.442
x9           751.2793    171.902      4.370      0.000     413.409    1089.150
x10           67.6254     65.984      1.025      0.306     -62.065     197.316
==============================================================================
Omnibus:                        1.506   Durbin-Watson:                   2.029
Prob(Omnibus):                  0.471   Jarque-Bera (JB):                1.404
Skew:                           0.017   Prob(JB):                        0.496
Kurtosis:                       2.726   Cond. No.                         227.
==============================================================================
``````

Ok, let"s reproduce this. It is kind of overkill as we are almost reproducing a linear regression analysis using Matrix Algebra. But what the heck.

``````lm = LinearRegression()
lm.fit(X,y)
params = np.append(lm.intercept_,lm.coef_)
predictions = lm.predict(X)

newX = pd.DataFrame({"Constant":np.ones(len(X))}).join(pd.DataFrame(X))
MSE = (sum((y-predictions)**2))/(len(newX)-len(newX.columns))

# Note if you don"t want to use a DataFrame replace the two lines above with
# newX = np.append(np.ones((len(X),1)), X, axis=1)
# MSE = (sum((y-predictions)**2))/(len(newX)-len(newX[0]))

var_b = MSE*(np.linalg.inv(np.dot(newX.T,newX)).diagonal())
sd_b = np.sqrt(var_b)
ts_b = params/ sd_b

p_values =[2*(1-stats.t.cdf(np.abs(i),(len(newX)-len(newX[0])))) for i in ts_b]

sd_b = np.round(sd_b,3)
ts_b = np.round(ts_b,3)
p_values = np.round(p_values,3)
params = np.round(params,4)

myDF3 = pd.DataFrame()
myDF3["Coefficients"],myDF3["Standard Errors"],myDF3["t values"],myDF3["Probabilities"] = [params,sd_b,ts_b,p_values]
print(myDF3)
``````

And this gives us.

``````    Coefficients  Standard Errors  t values  Probabilities
0       152.1335            2.576    59.061         0.000
1       -10.0122           59.749    -0.168         0.867
2      -239.8191           61.222    -3.917         0.000
3       519.8398           66.534     7.813         0.000
4       324.3904           65.422     4.958         0.000
5      -792.1842          416.684    -1.901         0.058
6       476.7458          339.035     1.406         0.160
7       101.0446          212.533     0.475         0.635
8       177.0642          161.476     1.097         0.273
9       751.2793          171.902     4.370         0.000
10       67.6254           65.984     1.025         0.306
``````

So we can reproduce the values from statsmodel.

The problem is the use of `aspect="equal"`, which prevents the subplots from stretching to an arbitrary aspect ratio and filling up all the empty space.

Normally, this would work:

``````import matplotlib.pyplot as plt

ax = [plt.subplot(2,2,i+1) for i in range(4)]

for a in ax:
a.set_xticklabels([])
a.set_yticklabels([])

``````

The result is this:

However, with `aspect="equal"`, as in the following code:

``````import matplotlib.pyplot as plt

ax = [plt.subplot(2,2,i+1) for i in range(4)]

for a in ax:
a.set_xticklabels([])
a.set_yticklabels([])
a.set_aspect("equal")

``````

This is what we get:

The difference in this second case is that you"ve forced the x- and y-axes to have the same number of units/pixel. Since the axes go from 0 to 1 by default (i.e., before you plot anything), using `aspect="equal"` forces each axis to be a square. Since the figure is not a square, pyplot adds in extra spacing between the axes horizontally.

To get around this problem, you can set your figure to have the correct aspect ratio. We"re going to use the object-oriented pyplot interface here, which I consider to be superior in general:

``````import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8,8)) # Notice the equal aspect ratio
ax = [fig.add_subplot(2,2,i+1) for i in range(4)]

for a in ax:
a.set_xticklabels([])
a.set_yticklabels([])
a.set_aspect("equal")

``````

Here"s the result:

How about using `numpy.vectorize`.

``````import numpy as np
x = np.array([1, 2, 3, 4, 5])
squarer = lambda t: t ** 2
vfunc = np.vectorize(squarer)
vfunc(x)
# Output : array([ 1,  4,  9, 16, 25])
``````

Use imap instead of map, which returns an iterator of processed values.

``````from multiprocessing import Pool
import tqdm
import time

def _foo(my_number):
square = my_number * my_number
time.sleep(1)
return square

if __name__ == "__main__":
with Pool(2) as p:
r = list(tqdm.tqdm(p.imap(_foo, range(30)), total=30))
``````

I"d like to shed a little bit more light on the interplay of `iter`, `__iter__` and `__getitem__` and what happens behind the curtains. Armed with that knowledge, you will be able to understand why the best you can do is

``````try:
iter(maybe_iterable)
print("iteration will probably work")
except TypeError:
print("not iterable")
``````

I will list the facts first and then follow up with a quick reminder of what happens when you employ a `for` loop in python, followed by a discussion to illustrate the facts.

# Facts

1. You can get an iterator from any object `o` by calling `iter(o)` if at least one of the following conditions holds true:

a) `o` has an `__iter__` method which returns an iterator object. An iterator is any object with an `__iter__` and a `__next__` (Python 2: `next`) method.

b) `o` has a `__getitem__` method.

2. Checking for an instance of `Iterable` or `Sequence`, or checking for the attribute `__iter__` is not enough.

3. If an object `o` implements only `__getitem__`, but not `__iter__`, `iter(o)` will construct an iterator that tries to fetch items from `o` by integer index, starting at index 0. The iterator will catch any `IndexError` (but no other errors) that is raised and then raises `StopIteration` itself.

4. In the most general sense, there"s no way to check whether the iterator returned by `iter` is sane other than to try it out.

5. If an object `o` implements `__iter__`, the `iter` function will make sure that the object returned by `__iter__` is an iterator. There is no sanity check if an object only implements `__getitem__`.

6. `__iter__` wins. If an object `o` implements both `__iter__` and `__getitem__`, `iter(o)` will call `__iter__`.

7. If you want to make your own objects iterable, always implement the `__iter__` method.

# `for` loops

In order to follow along, you need an understanding of what happens when you employ a `for` loop in Python. Feel free to skip right to the next section if you already know.

When you use `for item in o` for some iterable object `o`, Python calls `iter(o)` and expects an iterator object as the return value. An iterator is any object which implements a `__next__` (or `next` in Python 2) method and an `__iter__` method.

By convention, the `__iter__` method of an iterator should return the object itself (i.e. `return self`). Python then calls `next` on the iterator until `StopIteration` is raised. All of this happens implicitly, but the following demonstration makes it visible:

``````import random

class DemoIterable(object):
def __iter__(self):
print("__iter__ called")
return DemoIterator()

class DemoIterator(object):
def __iter__(self):
return self

def __next__(self):
print("__next__ called")
r = random.randint(1, 10)
if r == 5:
print("raising StopIteration")
raise StopIteration
return r
``````

Iteration over a `DemoIterable`:

``````>>> di = DemoIterable()
>>> for x in di:
...     print(x)
...
__iter__ called
__next__ called
9
__next__ called
8
__next__ called
10
__next__ called
3
__next__ called
10
__next__ called
raising StopIteration
``````

# Discussion and illustrations

On point 1 and 2: getting an iterator and unreliable checks

Consider the following class:

``````class BasicIterable(object):
def __getitem__(self, item):
if item == 3:
raise IndexError
return item
``````

Calling `iter` with an instance of `BasicIterable` will return an iterator without any problems because `BasicIterable` implements `__getitem__`.

``````>>> b = BasicIterable()
>>> iter(b)
<iterator object at 0x7f1ab216e320>
``````

However, it is important to note that `b` does not have the `__iter__` attribute and is not considered an instance of `Iterable` or `Sequence`:

``````>>> from collections import Iterable, Sequence
>>> hasattr(b, "__iter__")
False
>>> isinstance(b, Iterable)
False
>>> isinstance(b, Sequence)
False
``````

This is why Fluent Python by Luciano Ramalho recommends calling `iter` and handling the potential `TypeError` as the most accurate way to check whether an object is iterable. Quoting directly from the book:

As of Python 3.4, the most accurate way to check whether an object `x` is iterable is to call `iter(x)` and handle a `TypeError` exception if it isn‚Äôt. This is more accurate than using `isinstance(x, abc.Iterable)` , because `iter(x)` also considers the legacy `__getitem__` method, while the `Iterable` ABC does not.

On point 3: Iterating over objects which only provide `__getitem__`, but not `__iter__`

Iterating over an instance of `BasicIterable` works as expected: Python constructs an iterator that tries to fetch items by index, starting at zero, until an `IndexError` is raised. The demo object"s `__getitem__` method simply returns the `item` which was supplied as the argument to `__getitem__(self, item)` by the iterator returned by `iter`.

``````>>> b = BasicIterable()
>>> it = iter(b)
>>> next(it)
0
>>> next(it)
1
>>> next(it)
2
>>> next(it)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
StopIteration
``````

Note that the iterator raises `StopIteration` when it cannot return the next item and that the `IndexError` which is raised for `item == 3` is handled internally. This is why looping over a `BasicIterable` with a `for` loop works as expected:

``````>>> for x in b:
...     print(x)
...
0
1
2
``````

Here"s another example in order to drive home the concept of how the iterator returned by `iter` tries to access items by index. `WrappedDict` does not inherit from `dict`, which means instances won"t have an `__iter__` method.

``````class WrappedDict(object): # note: no inheritance from dict!
def __init__(self, dic):
self._dict = dic

def __getitem__(self, item):
try:
return self._dict[item] # delegate to dict.__getitem__
except KeyError:
raise IndexError
``````

Note that calls to `__getitem__` are delegated to `dict.__getitem__` for which the square bracket notation is simply a shorthand.

``````>>> w = WrappedDict({-1: "not printed",
...                   0: "hi", 1: "StackOverflow", 2: "!",
...                   4: "not printed",
...                   "x": "not printed"})
>>> for x in w:
...     print(x)
...
hi
StackOverflow
!
``````

On point 4 and 5: `iter` checks for an iterator when it calls `__iter__`:

When `iter(o)` is called for an object `o`, `iter` will make sure that the return value of `__iter__`, if the method is present, is an iterator. This means that the returned object must implement `__next__` (or `next` in Python 2) and `__iter__`. `iter` cannot perform any sanity checks for objects which only provide `__getitem__`, because it has no way to check whether the items of the object are accessible by integer index.

``````class FailIterIterable(object):
def __iter__(self):
return object() # not an iterator

class FailGetitemIterable(object):
def __getitem__(self, item):
raise Exception
``````

Note that constructing an iterator from `FailIterIterable` instances fails immediately, while constructing an iterator from `FailGetItemIterable` succeeds, but will throw an Exception on the first call to `__next__`.

``````>>> fii = FailIterIterable()
>>> iter(fii)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: iter() returned non-iterator of type "object"
>>>
>>> fgi = FailGetitemIterable()
>>> it = iter(fgi)
>>> next(it)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/path/iterdemo.py", line 42, in __getitem__
raise Exception
Exception
``````

On point 6: `__iter__` wins

This one is straightforward. If an object implements `__iter__` and `__getitem__`, `iter` will call `__iter__`. Consider the following class

``````class IterWinsDemo(object):
def __iter__(self):
return iter(["__iter__", "wins"])

def __getitem__(self, item):
return ["__getitem__", "wins"][item]
``````

and the output when looping over an instance:

``````>>> iwd = IterWinsDemo()
>>> for x in iwd:
...     print(x)
...
__iter__
wins
``````

On point 7: your iterable classes should implement `__iter__`

You might ask yourself why most builtin sequences like `list` implement an `__iter__` method when `__getitem__` would be sufficient.

``````class WrappedList(object): # note: no inheritance from list!
def __init__(self, lst):
self._list = lst

def __getitem__(self, item):
return self._list[item]
``````

After all, iteration over instances of the class above, which delegates calls to `__getitem__` to `list.__getitem__` (using the square bracket notation), will work fine:

``````>>> wl = WrappedList(["A", "B", "C"])
>>> for x in wl:
...     print(x)
...
A
B
C
``````

The reasons your custom iterables should implement `__iter__` are as follows:

1. If you implement `__iter__`, instances will be considered iterables, and `isinstance(o, collections.abc.Iterable)` will return `True`.
2. If the object returned by `__iter__` is not an iterator, `iter` will fail immediately and raise a `TypeError`.
3. The special handling of `__getitem__` exists for backwards compatibility reasons. Quoting again from Fluent Python:

That is why any Python sequence is iterable: they all implement `__getitem__` . In fact, the standard sequences also implement `__iter__`, and yours should too, because the special handling of `__getitem__` exists for backward compatibility reasons and may be gone in the future (although it is not deprecated as I write this).

There are lots of things I have seen make a model diverge.

1. Too high of a learning rate. You can often tell if this is the case if the loss begins to increase and then diverges to infinity.

2. I am not to familiar with the DNNClassifier but I am guessing it uses the categorical cross entropy cost function. This involves taking the log of the prediction which diverges as the prediction approaches zero. That is why people usually add a small epsilon value to the prediction to prevent this divergence. I am guessing the DNNClassifier probably does this or uses the tensorflow opp for it. Probably not the issue.

3. Other numerical stability issues can exist such as division by zero where adding the epsilon can help. Another less obvious one if the square root who"s derivative can diverge if not properly simplified when dealing with finite precision numbers. Yet again I doubt this is the issue in the case of the DNNClassifier.

4. You may have an issue with the input data. Try calling `assert not np.any(np.isnan(x))` on the input data to make sure you are not introducing the nan. Also make sure all of the target values are valid. Finally, make sure the data is properly normalized. You probably want to have the pixels in the range [-1, 1] and not [0, 255].

5. The labels must be in the domain of the loss function, so if using a logarithmic-based loss function all labels must be non-negative (as noted by evan pu and the comments below).