  # Python | Root mean square error

mean | NumPy | Python Methods and Functions | square

Steps to find the MSE

1. Find the equation for the regression line.

(1) 2. Insert the X values ​​into the equation found in step 1 to get the corresponding Y values, i.e.

(2) 3. Now subtract the new Y values ​​(ie ) from the original values ​​of Y. Thus, the found values ​​are erroneous terms. This is also known as the vertical distance of a given point from the regression line.

(3) 4. Square the errors found in step 3.

(4) 5. Summarize all squares.

(5) 6. Divide the value found in step 5 by the total number of observations.

(6) Example:
Consider these points: (1,1), (2,1), (3 , 2), (4,2), (5,4)
You can use this online calculator to find the equation / regression line.

Regression line equation: Y = 0.7X — 0,1

X Y
110.6
2 1 1.29
3 2 1.99
422.69
5 4 3.4

Now using the formula found for MSE on step 6 above, we can get MSE = 0.21606

MSE using scikit — learn:

`  Output:  0.21606 `

MSE using Numpy module:

` `

 ` from ` ` sklearn.metrics ` ` import ` ` mean_squared_error ` ` `  ` # Specified values ​​` ` Y_true ` ` = ` ` [` ` 1 ` `, ` ` 1 ` `, ` ` 2 ` `, ` ` 2 ` `, ` ` 4 ` `] ` ` # Y_true = Y (original values ) `   ` # calculated values ​​` ` Y_pred ` ` = < / code> [ 0.6 , 1.29 , 1.99 , 2.69 , 3.4 ]  # Y_pred = Y & # 39; ``   # Calculation mean square error (MSE) mean_squared_error (Y_true, Y_pred) `
 ` import ` ` numpy as np ` ` `  ` # Specified values ​​` ` Y_true ` ` = ` ` [` ` 1 ` `, ` ` 1 ` `, ` ` 2 ` `, ` ` 2 ` `, ` ` 4 ` `] ` ` # Y_true = Y (original values) `   ` # Estimated values ​​` ` Y_pred ` ` = ` ` [` ` 0.6 ` ` , ` ` 1.29 ` `, ` ` 1.99 ` `, ` ` 2.69 ` `, ` ` 3.4 ` `] ` ` # Y_pred = Y & # 39; `    ` # Mean square error ` ` MSE ` ` = ` ` np.square (np.subtract (Y_true, Y_pred)). mean () `
` `

` `

`  Output:  0.21606 `

## Meaning of @classmethod and @staticmethod for beginner?

### Question by user1632861

Could someone explain to me the meaning of `@classmethod` and `@staticmethod` in python? I need to know the difference and the meaning.

As far as I understand, `@classmethod` tells a class that it"s a method which should be inherited into subclasses, or... something. However, what"s the point of that? Why not just define the class method without adding `@classmethod` or `@staticmethod` or any `@` definitions?

tl;dr: when should I use them, why should I use them, and how should I use them?

## What is the meaning of single and double underscore before an object name?

Can someone please explain the exact meaning of having single and double leading underscores before an object"s name in Python, and the difference between both?

Also, does that meaning stay the same regardless of whether the object in question is a variable, a function, a method, etc.?

## What does -> mean in Python function definitions?

I"ve recently noticed something interesting when looking at Python 3.3 grammar specification:

``````funcdef: "def" NAME parameters ["->" test] ":" suite
``````

The optional "arrow" block was absent in Python 2 and I couldn"t find any information regarding its meaning in Python 3. It turns out this is correct Python and it"s accepted by the interpreter:

``````def f(x) -> 123:
return x
``````

I thought that this might be some kind of a precondition syntax, but:

• I cannot test `x` here, as it is still undefined,
• No matter what I put after the arrow (e.g. `2 < 1`), it doesn"t affect the function behavior.

Could anyone accustomed with this syntax style explain it?

## What does the star and doublestar operator mean in a function call?

What does the `*` operator mean in Python, such as in code like `zip(*x)` or `f(**k)`?

1. How is it handled internally in the interpreter?
2. Does it affect performance at all? Is it fast or slow?
3. When is it useful and when is it not?
4. Should it be used in a function declaration or in a call?

## Get statistics for each group (such as count, mean, etc) using pandas GroupBy?

I have a data frame `df` and I use several columns from it to `groupby`:

``````df["col1","col2","col3","col4"].groupby(["col1","col2"]).mean()
``````

In the above way I almost get the table (data frame) that I need. What is missing is an additional column that contains number of rows in each group. In other words, I have mean but I also would like to know how many number were used to get these means. For example in the first group there are 8 values and in the second one 10 and so on.

In short: How do I get group-wise statistics for a dataframe?

## What does -1 mean in numpy reshape?

A numpy matrix can be reshaped into a vector using reshape function with parameter -1. But I don"t know what -1 means here.

For example:

``````a = numpy.matrix([[1, 2, 3, 4], [5, 6, 7, 8]])
b = numpy.reshape(a, -1)
``````

The result of `b` is: `matrix([[1, 2, 3, 4, 5, 6, 7, 8]])`

Does anyone know what -1 means here? And it seems python assign -1 several meanings, such as: `array[-1]` means the last element. Can you give an explanation?

## In Matplotlib, what does the argument mean in fig.add_subplot(111)?

Sometimes I come across code such as this:

``````import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
fig = plt.figure()
plt.scatter(x, y)
plt.show()
``````

Which produces: I"ve been reading the documentation like crazy but I can"t find an explanation for the `111`. sometimes I see a `212`.

What does the argument of `fig.add_subplot()` mean?

## What does it mean if a Python object is "subscriptable" or not?

### Question by Alistair

Which types of objects fall into the domain of "subscriptable"?

## What does "SyntaxError: Missing parentheses in call to "print"" mean in Python?

When I try to use a `print` statement in Python, it gives me this error:

``````>>> print "Hello, World!"
File "<stdin>", line 1
print "Hello, World!"
^
SyntaxError: Missing parentheses in call to "print"
``````

What does that mean?

## What does `ValueError: cannot reindex from a duplicate axis` mean?

I am getting a `ValueError: cannot reindex from a duplicate axis` when I am trying to set an index to a certain value. I tried to reproduce this with a simple example, but I could not do it.

Here is my session inside of `ipdb` trace. I have a DataFrame with string index, and integer columns, float values. However when I try to create `sum` index for sum of all columns I am getting `ValueError: cannot reindex from a duplicate axis` error. I created a small DataFrame with the same characteristics, but was not able to reproduce the problem, what could I be missing?

I don"t really understand what `ValueError: cannot reindex from a duplicate axis`means, what does this error message mean? Maybe this will help me diagnose the problem, and this is most answerable part of my question.

``````ipdb> type(affinity_matrix)
<class "pandas.core.frame.DataFrame">
ipdb> affinity_matrix.shape
(333, 10)
ipdb> affinity_matrix.columns
Int64Index([9315684, 9315597, 9316591, 9320520, 9321163, 9320615, 9321187, 9319487, 9319467, 9320484], dtype="int64")
ipdb> affinity_matrix.index
Index([u"001", u"002", u"003", u"004", u"005", u"008", u"009", u"010", u"011", u"014", u"015", u"016", u"018", u"020", u"021", u"022", u"024", u"025", u"026", u"027", u"028", u"029", u"030", u"032", u"033", u"034", u"035", u"036", u"039", u"040", u"041", u"042", u"043", u"044", u"045", u"047", u"047", u"048", u"050", u"053", u"054", u"055", u"056", u"057", u"058", u"059", u"060", u"061", u"062", u"063", u"065", u"067", u"068", u"069", u"070", u"071", u"072", u"073", u"074", u"075", u"076", u"077", u"078", u"080", u"082", u"083", u"084", u"085", u"086", u"089", u"090", u"091", u"092", u"093", u"094", u"095", u"096", u"097", u"098", u"100", u"101", u"103", u"104", u"105", u"106", u"107", u"108", u"109", u"110", u"111", u"112", u"113", u"114", u"115", u"116", u"117", u"118", u"119", u"121", u"122", ...], dtype="object")

ipdb> affinity_matrix.values.dtype
dtype("float64")
ipdb> "sums" in affinity_matrix.index
False
``````

Here is the error:

``````ipdb> affinity_matrix.loc["sums"] = affinity_matrix.sum(axis=0)
*** ValueError: cannot reindex from a duplicate axis
``````

I tried to reproduce this with a simple example, but I failed

``````In : import pandas as pd

In : import numpy as np

In : a = np.arange(35).reshape(5,7)

In : df = pd.DataFrame(a, ["x", "y", "u", "z", "w"], range(10, 17))

In : df.values.dtype
Out: dtype("int64")

In : df.loc["sums"] = df.sum(axis=0)

In : df
Out:
10  11  12  13  14  15   16
x      0   1   2   3   4   5    6
y      7   8   9  10  11  12   13
u     14  15  16  17  18  19   20
z     21  22  23  24  25  26   27
w     28  29  30  31  32  33   34
sums  70  75  80  85  90  95  100
``````

# Recommendation for beginners:

This is my personal recommendation for beginners: start by learning `virtualenv` and `pip`, tools which work with both Python 2 and 3 and in a variety of situations, and pick up other tools once you start needing them.

# PyPI packages not in the standard library:

• `virtualenv` is a very popular tool that creates isolated Python environments for Python libraries. If you"re not familiar with this tool, I highly recommend learning it, as it is a very useful tool, and I"ll be making comparisons to it for the rest of this answer.

It works by installing a bunch of files in a directory (eg: `env/`), and then modifying the `PATH` environment variable to prefix it with a custom `bin` directory (eg: `env/bin/`). An exact copy of the `python` or `python3` binary is placed in this directory, but Python is programmed to look for libraries relative to its path first, in the environment directory. It"s not part of Python"s standard library, but is officially blessed by the PyPA (Python Packaging Authority). Once activated, you can install packages in the virtual environment using `pip`.

• `pyenv` is used to isolate Python versions. For example, you may want to test your code against Python 2.7, 3.6, 3.7 and 3.8, so you"ll need a way to switch between them. Once activated, it prefixes the `PATH` environment variable with `~/.pyenv/shims`, where there are special files matching the Python commands (`python`, `pip`). These are not copies of the Python-shipped commands; they are special scripts that decide on the fly which version of Python to run based on the `PYENV_VERSION` environment variable, or the `.python-version` file, or the `~/.pyenv/version` file. `pyenv` also makes the process of downloading and installing multiple Python versions easier, using the command `pyenv install`.

• `pyenv-virtualenv` is a plugin for `pyenv` by the same author as `pyenv`, to allow you to use `pyenv` and `virtualenv` at the same time conveniently. However, if you"re using Python 3.3 or later, `pyenv-virtualenv` will try to run `python -m venv` if it is available, instead of `virtualenv`. You can use `virtualenv` and `pyenv` together without `pyenv-virtualenv`, if you don"t want the convenience features.

• `virtualenvwrapper` is a set of extensions to `virtualenv` (see docs). It gives you commands like `mkvirtualenv`, `lssitepackages`, and especially `workon` for switching between different `virtualenv` directories. This tool is especially useful if you want multiple `virtualenv` directories.

• `pyenv-virtualenvwrapper` is a plugin for `pyenv` by the same author as `pyenv`, to conveniently integrate `virtualenvwrapper` into `pyenv`.

• `pipenv` aims to combine `Pipfile`, `pip` and `virtualenv` into one command on the command-line. The `virtualenv` directory typically gets placed in `~/.local/share/virtualenvs/XXX`, with `XXX` being a hash of the path of the project directory. This is different from `virtualenv`, where the directory is typically in the current working directory. `pipenv` is meant to be used when developing Python applications (as opposed to libraries). There are alternatives to `pipenv`, such as `poetry`, which I won"t list here since this question is only about the packages that are similarly named.

# Standard library:

• `pyvenv` (not to be confused with `pyenv` in the previous section) is a script shipped with Python 3 but deprecated in Python 3.6 as it had problems (not to mention the confusing name). In Python 3.6+, the exact equivalent is `python3 -m venv`.

• `venv` is a package shipped with Python 3, which you can run using `python3 -m venv` (although for some reason some distros separate it out into a separate distro package, such as `python3-venv` on Ubuntu/Debian). It serves the same purpose as `virtualenv`, but only has a subset of its features (see a comparison here). `virtualenv` continues to be more popular than `venv`, especially since the former supports both Python 2 and 3.

You have four main options for converting types in pandas:

1. `to_numeric()` - provides functionality to safely convert non-numeric types (e.g. strings) to a suitable numeric type. (See also `to_datetime()` and `to_timedelta()`.)

2. `astype()` - convert (almost) any type to (almost) any other type (even if it"s not necessarily sensible to do so). Also allows you to convert to categorial types (very useful).

3. `infer_objects()` - a utility method to convert object columns holding Python objects to a pandas type if possible.

4. `convert_dtypes()` - convert DataFrame columns to the "best possible" dtype that supports `pd.NA` (pandas" object to indicate a missing value).

Read on for more detailed explanations and usage of each of these methods.

# 1. `to_numeric()`

The best way to convert one or more columns of a DataFrame to numeric values is to use `pandas.to_numeric()`.

This function will try to change non-numeric objects (such as strings) into integers or floating point numbers as appropriate.

## Basic usage

The input to `to_numeric()` is a Series or a single column of a DataFrame.

``````>>> s = pd.Series(["8", 6, "7.5", 3, "0.9"]) # mixed string and numeric values
>>> s
0      8
1      6
2    7.5
3      3
4    0.9
dtype: object

>>> pd.to_numeric(s) # convert everything to float values
0    8.0
1    6.0
2    7.5
3    3.0
4    0.9
dtype: float64
``````

As you can see, a new Series is returned. Remember to assign this output to a variable or column name to continue using it:

``````# convert Series
my_series = pd.to_numeric(my_series)

# convert column "a" of a DataFrame
df["a"] = pd.to_numeric(df["a"])
``````

You can also use it to convert multiple columns of a DataFrame via the `apply()` method:

``````# convert all columns of DataFrame
df = df.apply(pd.to_numeric) # convert all columns of DataFrame

# convert just columns "a" and "b"
df[["a", "b"]] = df[["a", "b"]].apply(pd.to_numeric)
``````

As long as your values can all be converted, that"s probably all you need.

## Error handling

But what if some values can"t be converted to a numeric type?

`to_numeric()` also takes an `errors` keyword argument that allows you to force non-numeric values to be `NaN`, or simply ignore columns containing these values.

Here"s an example using a Series of strings `s` which has the object dtype:

``````>>> s = pd.Series(["1", "2", "4.7", "pandas", "10"])
>>> s
0         1
1         2
2       4.7
3    pandas
4        10
dtype: object
``````

The default behaviour is to raise if it can"t convert a value. In this case, it can"t cope with the string "pandas":

``````>>> pd.to_numeric(s) # or pd.to_numeric(s, errors="raise")
ValueError: Unable to parse string
``````

Rather than fail, we might want "pandas" to be considered a missing/bad numeric value. We can coerce invalid values to `NaN` as follows using the `errors` keyword argument:

``````>>> pd.to_numeric(s, errors="coerce")
0     1.0
1     2.0
2     4.7
3     NaN
4    10.0
dtype: float64
``````

The third option for `errors` is just to ignore the operation if an invalid value is encountered:

``````>>> pd.to_numeric(s, errors="ignore")
# the original Series is returned untouched
``````

This last option is particularly useful when you want to convert your entire DataFrame, but don"t not know which of our columns can be converted reliably to a numeric type. In that case just write:

``````df.apply(pd.to_numeric, errors="ignore")
``````

The function will be applied to each column of the DataFrame. Columns that can be converted to a numeric type will be converted, while columns that cannot (e.g. they contain non-digit strings or dates) will be left alone.

## Downcasting

By default, conversion with `to_numeric()` will give you either a `int64` or `float64` dtype (or whatever integer width is native to your platform).

That"s usually what you want, but what if you wanted to save some memory and use a more compact dtype, like `float32`, or `int8`?

`to_numeric()` gives you the option to downcast to either "integer", "signed", "unsigned", "float". Here"s an example for a simple series `s` of integer type:

``````>>> s = pd.Series([1, 2, -7])
>>> s
0    1
1    2
2   -7
dtype: int64
``````

Downcasting to "integer" uses the smallest possible integer that can hold the values:

``````>>> pd.to_numeric(s, downcast="integer")
0    1
1    2
2   -7
dtype: int8
``````

Downcasting to "float" similarly picks a smaller than normal floating type:

``````>>> pd.to_numeric(s, downcast="float")
0    1.0
1    2.0
2   -7.0
dtype: float32
``````

# 2. `astype()`

The `astype()` method enables you to be explicit about the dtype you want your DataFrame or Series to have. It"s very versatile in that you can try and go from one type to the any other.

## Basic usage

Just pick a type: you can use a NumPy dtype (e.g. `np.int16`), some Python types (e.g. bool), or pandas-specific types (like the categorical dtype).

Call the method on the object you want to convert and `astype()` will try and convert it for you:

``````# convert all DataFrame columns to the int64 dtype
df = df.astype(int)

# convert column "a" to int64 dtype and "b" to complex type
df = df.astype({"a": int, "b": complex})

# convert Series to float16 type
s = s.astype(np.float16)

# convert Series to Python strings
s = s.astype(str)

# convert Series to categorical type - see docs for more details
s = s.astype("category")
``````

Notice I said "try" - if `astype()` does not know how to convert a value in the Series or DataFrame, it will raise an error. For example if you have a `NaN` or `inf` value you"ll get an error trying to convert it to an integer.

As of pandas 0.20.0, this error can be suppressed by passing `errors="ignore"`. Your original object will be return untouched.

## Be careful

`astype()` is powerful, but it will sometimes convert values "incorrectly". For example:

``````>>> s = pd.Series([1, 2, -7])
>>> s
0    1
1    2
2   -7
dtype: int64
``````

These are small integers, so how about converting to an unsigned 8-bit type to save memory?

``````>>> s.astype(np.uint8)
0      1
1      2
2    249
dtype: uint8
``````

The conversion worked, but the -7 was wrapped round to become 249 (i.e. 28 - 7)!

Trying to downcast using `pd.to_numeric(s, downcast="unsigned")` instead could help prevent this error.

# 3. `infer_objects()`

Version 0.21.0 of pandas introduced the method `infer_objects()` for converting columns of a DataFrame that have an object datatype to a more specific type (soft conversions).

For example, here"s a DataFrame with two columns of object type. One holds actual integers and the other holds strings representing integers:

``````>>> df = pd.DataFrame({"a": [7, 1, 5], "b": ["3","2","1"]}, dtype="object")
>>> df.dtypes
a    object
b    object
dtype: object
``````

Using `infer_objects()`, you can change the type of column "a" to int64:

``````>>> df = df.infer_objects()
>>> df.dtypes
a     int64
b    object
dtype: object
``````

Column "b" has been left alone since its values were strings, not integers. If you wanted to try and force the conversion of both columns to an integer type, you could use `df.astype(int)` instead.

# 4. `convert_dtypes()`

Version 1.0 and above includes a method `convert_dtypes()` to convert Series and DataFrame columns to the best possible dtype that supports the `pd.NA` missing value.

Here "best possible" means the type most suited to hold the values. For example, this a pandas integer type if all of the values are integers (or missing values): an object column of Python integer objects is converted to `Int64`, a column of NumPy `int32` values will become the pandas dtype `Int32`.

With our `object` DataFrame `df`, we get the following result:

``````>>> df.convert_dtypes().dtypes
a     Int64
b    string
dtype: object
``````

Since column "a" held integer values, it was converted to the `Int64` type (which is capable of holding missing values, unlike `int64`).

Column "b" contained string objects, so was changed to pandas" `string` dtype.

By default, this method will infer the type from object values in each column. We can change this by passing `infer_objects=False`:

``````>>> df.convert_dtypes(infer_objects=False).dtypes
a    object
b    string
dtype: object
``````

Now column "a" remained an object column: pandas knows it can be described as an "integer" column (internally it ran `infer_dtype`) but didn"t infer exactly what dtype of integer it should have so did not convert it. Column "b" was again converted to "string" dtype as it was recognised as holding "string" values.

## How to iterate over rows in a DataFrame in Pandas?

Iteration in Pandas is an anti-pattern and is something you should only do when you have exhausted every other option. You should not use any function with "`iter`" in its name for more than a few thousand rows or you will have to get used to a lot of waiting.

Do you want to print a DataFrame? Use `DataFrame.to_string()`.

Do you want to compute something? In that case, search for methods in this order (list modified from here):

1. Vectorization
2. Cython routines
3. List Comprehensions (vanilla `for` loop)
4. `DataFrame.apply()`: i) ¬†Reductions that can be performed in Cython, ii) Iteration in Python space
5. `DataFrame.itertuples()` and `iteritems()`
6. `DataFrame.iterrows()`

`iterrows` and `itertuples` (both receiving many votes in answers to this question) should be used in very rare circumstances, such as generating row objects/nametuples for sequential processing, which is really the only thing these functions are useful for.

Appeal to Authority

The documentation page on iteration has a huge red warning box that says:

Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is not needed [...].

* It"s actually a little more complicated than "don"t". `df.iterrows()` is the correct answer to this question, but "vectorize your ops" is the better one. I will concede that there are circumstances where iteration cannot be avoided (for example, some operations where the result depends on the value computed for the previous row). However, it takes some familiarity with the library to know when. If you"re not sure whether you need an iterative solution, you probably don"t. PS: To know more about my rationale for writing this answer, skip to the very bottom.

## Faster than Looping: Vectorization, Cython

A good number of basic operations and computations are "vectorised" by pandas (either through NumPy, or through Cythonized functions). This includes arithmetic, comparisons, (most) reductions, reshaping (such as pivoting), joins, and groupby operations. Look through the documentation on Essential Basic Functionality to find a suitable vectorised method for your problem.

If none exists, feel free to write your own using custom Cython extensions.

## Next Best Thing: List Comprehensions*

List comprehensions should be your next port of call if 1) there is no vectorized solution available, 2) performance is important, but not important enough to go through the hassle of cythonizing your code, and 3) you"re trying to perform elementwise transformation on your code. There is a good amount of evidence to suggest that list comprehensions are sufficiently fast (and even sometimes faster) for many common Pandas tasks.

The formula is simple,

``````# Iterating over one column - `f` is some function that processes your data
result = [f(x) for x in df["col"]]
# Iterating over two columns, use `zip`
result = [f(x, y) for x, y in zip(df["col1"], df["col2"])]
# Iterating over multiple columns - same data type
result = [f(row, ..., row[n]) for row in df[["col1", ...,"coln"]].to_numpy()]
# Iterating over multiple columns - differing data type
result = [f(row, ..., row[n]) for row in zip(df["col1"], ..., df["coln"])]
``````

If you can encapsulate your business logic into a function, you can use a list comprehension that calls it. You can make arbitrarily complex things work through the simplicity and speed of raw Python code.

Caveats

List comprehensions assume that your data is easy to work with - what that means is your data types are consistent and you don"t have NaNs, but this cannot always be guaranteed.

1. The first one is more obvious, but when dealing with NaNs, prefer in-built pandas methods if they exist (because they have much better corner-case handling logic), or ensure your business logic includes appropriate NaN handling logic.
2. When dealing with mixed data types you should iterate over `zip(df["A"], df["B"], ...)` instead of `df[["A", "B"]].to_numpy()` as the latter implicitly upcasts data to the most common type. As an example if A is numeric and B is string, `to_numpy()` will cast the entire array to string, which may not be what you want. Fortunately `zip`ping your columns together is the most straightforward workaround to this.

*Your mileage may vary for the reasons outlined in the Caveats section above.

## An Obvious Example

Let"s demonstrate the difference with a simple example of adding two pandas columns `A + B`. This is a vectorizable operaton, so it will be easy to contrast the performance of the methods discussed above. Benchmarking code, for your reference. The line at the bottom measures a function written in numpandas, a style of Pandas that mixes heavily with NumPy to squeeze out maximum performance. Writing numpandas code should be avoided unless you know what you"re doing. Stick to the API where you can (i.e., prefer `vec` over `vec_numpy`).

I should mention, however, that it isn"t always this cut and dry. Sometimes the answer to "what is the best method for an operation" is "it depends on your data". My advice is to test out different approaches on your data before settling on one.

* Pandas string methods are "vectorized" in the sense that they are specified on the series but operate on each element. The underlying mechanisms are still iterative, because string operations are inherently hard to vectorize.

## Why I Wrote this Answer

A common trend I notice from new users is to ask questions of the form "How can I iterate over my df to do X?". Showing code that calls `iterrows()` while doing something inside a `for` loop. Here is why. A new user to the library who has not been introduced to the concept of vectorization will likely envision the code that solves their problem as iterating over their data to do something. Not knowing how to iterate over a DataFrame, the first thing they do is Google it and end up here, at this question. They then see the accepted answer telling them how to, and they close their eyes and run this code without ever first questioning if iteration is not the right thing to do.

The aim of this answer is to help new users understand that iteration is not necessarily the solution to every problem, and that better, faster and more idiomatic solutions could exist, and that it is worth investing time in exploring them. I"m not trying to start a war of iteration vs. vectorization, but I want new users to be informed when developing solutions to their problems with this library.

This is the behaviour to adopt when the referenced object is deleted. It is not specific to Django; this is an SQL standard. Although Django has its own implementation on top of SQL. (1)

There are seven possible actions to take when such event occurs:

• `CASCADE`: When the referenced object is deleted, also delete the objects that have references to it (when you remove a blog post for instance, you might want to delete comments as well). SQL equivalent: `CASCADE`.
• `PROTECT`: Forbid the deletion of the referenced object. To delete it you will have to delete all objects that reference it manually. SQL equivalent: `RESTRICT`.
• `RESTRICT`: (introduced in Django 3.1) Similar behavior as `PROTECT` that matches SQL"s `RESTRICT` more accurately. (See django documentation example)
• `SET_NULL`: Set the reference to NULL (requires the field to be nullable). For instance, when you delete a User, you might want to keep the comments he posted on blog posts, but say it was posted by an anonymous (or deleted) user. SQL equivalent: `SET NULL`.
• `SET_DEFAULT`: Set the default value. SQL equivalent: `SET DEFAULT`.
• `SET(...)`: Set a given value. This one is not part of the SQL standard and is entirely handled by Django.
• `DO_NOTHING`: Probably a very bad idea since this would create integrity issues in your database (referencing an object that actually doesn"t exist). SQL equivalent: `NO ACTION`. (2)

Source: Django documentation

In most cases, `CASCADE` is the expected behaviour, but for every ForeignKey, you should always ask yourself what is the expected behaviour in this situation. `PROTECT` and `SET_NULL` are often useful. Setting `CASCADE` where it should not, can potentially delete all of your database in cascade, by simply deleting a single user.

It"s funny to notice that the direction of the `CASCADE` action is not clear to many people. Actually, it"s funny to notice that only the `CASCADE` action is not clear. I understand the cascade behavior might be confusing, however you must think that it is the same direction as any other action. Thus, if you feel that `CASCADE` direction is not clear to you, it actually means that `on_delete` behavior is not clear to you.

In your database, a foreign key is basically represented by an integer field which value is the primary key of the foreign object. Let"s say you have an entry comment_A, which has a foreign key to an entry article_B. If you delete the entry comment_A, everything is fine. article_B used to live without comment_A and don"t bother if it"s deleted. However, if you delete article_B, then comment_A panics! It never lived without article_B and needs it, and it"s part of its attributes (`article=article_B`, but what is article_B???). This is where `on_delete` steps in, to determine how to resolve this integrity error, either by saying:

• "No! Please! Don"t! I can"t live without you!" (which is said `PROTECT` or `RESTRICT` in Django/SQL)
• "All right, if I"m not yours, then I"m nobody"s" (which is said `SET_NULL`)
• "Good bye world, I can"t live without article_B" and commit suicide (this is the `CASCADE` behavior).
• "It"s OK, I"ve got spare lover, and I"ll reference article_C from now" (`SET_DEFAULT`, or even `SET(...)`).
• "I can"t face reality, and I"ll keep calling your name even if that"s the only thing left to me!" (`DO_NOTHING`)

I hope it makes cascade direction clearer. :)

Footnotes

(1) Django has its own implementation on top of SQL. And, as mentioned by @JoeMjr2 in the comments below, Django will not create the SQL constraints. If you want the constraints to be ensured by your database (for instance, if your database is used by another application, or if you hang in the database console from time to time), you might want to set the related constraints manually yourself. There is an open ticket to add support for database-level on delete constrains in Django.

(2) Actually, there is one case where `DO_NOTHING` can be useful: If you want to skip Django"s implementation and implement the constraint yourself at the database-level.

## Label vs. Location

The main distinction between the two methods is:

• `loc` gets rows (and/or columns) with particular labels.

• `iloc` gets rows (and/or columns) at integer locations.

To demonstrate, consider a series `s` of characters with a non-monotonic integer index:

``````>>> s = pd.Series(list("abcdef"), index=[49, 48, 47, 0, 1, 2])
49    a
48    b
47    c
0     d
1     e
2     f

>>> s.loc    # value at index label 0
"d"

>>> s.iloc   # value at index location 0
"a"

>>> s.loc[0:1]  # rows at index labels between 0 and 1 (inclusive)
0    d
1    e

>>> s.iloc[0:1] # rows at index location between 0 and 1 (exclusive)
49    a
``````

Here are some of the differences/similarities between `s.loc` and `s.iloc` when passed various objects:

<object> description `s.loc[<object>]` `s.iloc[<object>]`
`0` single item Value at index label `0` (the string `"d"`) Value at index location 0 (the string `"a"`)
`0:1` slice Two rows (labels `0` and `1`) One row (first row at location 0)
`1:47` slice with out-of-bounds end Zero rows (empty Series) Five rows (location 1 onwards)
`1:47:-1` slice with negative step three rows (labels `1` back to `47`) Zero rows (empty Series)
`[2, 0]` integer list Two rows with given labels Two rows with given locations
`s > "e"` Bool series (indicating which values have the property) One row (containing `"f"`) `NotImplementedError`
`(s>"e").values` Bool array One row (containing `"f"`) Same as `loc`
`999` int object not in index `KeyError` `IndexError` (out of bounds)
`-1` int object not in index `KeyError` Returns last value in `s`
`lambda x: x.index` callable applied to series (here returning 3rd item in index) `s.loc[s.index]` `s.iloc[s.index]`

`loc`"s label-querying capabilities extend well-beyond integer indexes and it"s worth highlighting a couple of additional examples.

Here"s a Series where the index contains string objects:

``````>>> s2 = pd.Series(s.index, index=s.values)
>>> s2
a    49
b    48
c    47
d     0
e     1
f     2
``````

Since `loc` is label-based, it can fetch the first value in the Series using `s2.loc["a"]`. It can also slice with non-integer objects:

``````>>> s2.loc["c":"e"]  # all rows lying between "c" and "e" (inclusive)
c    47
d     0
e     1
``````

For DateTime indexes, we don"t need to pass the exact date/time to fetch by label. For example:

``````>>> s3 = pd.Series(list("abcde"), pd.date_range("now", periods=5, freq="M"))
>>> s3
2021-01-31 16:41:31.879768    a
2021-02-28 16:41:31.879768    b
2021-03-31 16:41:31.879768    c
2021-04-30 16:41:31.879768    d
2021-05-31 16:41:31.879768    e
``````

Then to fetch the row(s) for March/April 2021 we only need:

``````>>> s3.loc["2021-03":"2021-04"]
2021-03-31 17:04:30.742316    c
2021-04-30 17:04:30.742316    d
``````

## Rows and Columns

`loc` and `iloc` work the same way with DataFrames as they do with Series. It"s useful to note that both methods can address columns and rows together.

When given a tuple, the first element is used to index the rows and, if it exists, the second element is used to index the columns.

Consider the DataFrame defined below:

``````>>> import numpy as np
>>> df = pd.DataFrame(np.arange(25).reshape(5, 5),
index=list("abcde"),
columns=["x","y","z", 8, 9])
>>> df
x   y   z   8   9
a   0   1   2   3   4
b   5   6   7   8   9
c  10  11  12  13  14
d  15  16  17  18  19
e  20  21  22  23  24
``````

Then for example:

``````>>> df.loc["c": , :"z"]  # rows "c" and onwards AND columns up to "z"
x   y   z
c  10  11  12
d  15  16  17
e  20  21  22

>>> df.iloc[:, 3]        # all rows, but only the column at index location 3
a     3
b     8
c    13
d    18
e    23
``````

Sometimes we want to mix label and positional indexing methods for the rows and columns, somehow combining the capabilities of `loc` and `iloc`.

For example, consider the following DataFrame. How best to slice the rows up to and including "c" and take the first four columns?

``````>>> import numpy as np
>>> df = pd.DataFrame(np.arange(25).reshape(5, 5),
index=list("abcde"),
columns=["x","y","z", 8, 9])
>>> df
x   y   z   8   9
a   0   1   2   3   4
b   5   6   7   8   9
c  10  11  12  13  14
d  15  16  17  18  19
e  20  21  22  23  24
``````

We can achieve this result using `iloc` and the help of another method:

``````>>> df.iloc[:df.index.get_loc("c") + 1, :4]
x   y   z   8
a   0   1   2   3
b   5   6   7   8
c  10  11  12  13
``````

`get_loc()` is an index method meaning "get the position of the label in this index". Note that since slicing with `iloc` is exclusive of its endpoint, we must add 1 to this value if we want row "c" as well.

The simplest way to get row counts per group is by calling `.size()`, which returns a `Series`:

``````df.groupby(["col1","col2"]).size()
``````

Usually you want this result as a `DataFrame` (instead of a `Series`) so you can do:

``````df.groupby(["col1", "col2"]).size().reset_index(name="counts")
``````

If you want to find out how to calculate the row counts and other statistics for each group continue reading below.

## Detailed example:

Consider the following example dataframe:

``````In : df
Out:
col1 col2  col3  col4  col5  col6
0    A    B  0.20 -0.61 -0.49  1.49
1    A    B -1.53 -1.01 -0.39  1.82
2    A    B -0.44  0.27  0.72  0.11
3    A    B  0.28 -1.32  0.38  0.18
4    C    D  0.12  0.59  0.81  0.66
5    C    D -0.13 -1.65 -1.64  0.50
6    C    D -1.42 -0.11 -0.18 -0.44
7    E    F -0.00  1.42 -0.26  1.17
8    E    F  0.91 -0.47  1.35 -0.34
9    G    H  1.48 -0.63 -1.14  0.17
``````

First let"s use `.size()` to get the row counts:

``````In : df.groupby(["col1", "col2"]).size()
Out:
col1  col2
A     B       4
C     D       3
E     F       2
G     H       1
dtype: int64
``````

Then let"s use `.size().reset_index(name="counts")` to get the row counts:

``````In : df.groupby(["col1", "col2"]).size().reset_index(name="counts")
Out:
col1 col2  counts
0    A    B       4
1    C    D       3
2    E    F       2
3    G    H       1
``````

### Including results for more statistics

When you want to calculate statistics on grouped data, it usually looks like this:

``````In : (df
...: .groupby(["col1", "col2"])
...: .agg({
...:     "col3": ["mean", "count"],
...:     "col4": ["median", "min", "count"]
...: }))
Out:
col4                  col3
median   min count      mean count
col1 col2
A    B    -0.810 -1.32     4 -0.372500     4
C    D    -0.110 -1.65     3 -0.476667     3
E    F     0.475 -0.47     2  0.455000     2
G    H    -0.630 -0.63     1  1.480000     1
``````

The result above is a little annoying to deal with because of the nested column labels, and also because row counts are on a per column basis.

To gain more control over the output I usually split the statistics into individual aggregations that I then combine using `join`. It looks like this:

``````In : gb = df.groupby(["col1", "col2"])
...: counts = gb.size().to_frame(name="counts")
...: (counts
...:  .join(gb.agg({"col3": "mean"}).rename(columns={"col3": "col3_mean"}))
...:  .join(gb.agg({"col4": "median"}).rename(columns={"col4": "col4_median"}))
...:  .join(gb.agg({"col4": "min"}).rename(columns={"col4": "col4_min"}))
...:  .reset_index()
...: )
...:
Out:
col1 col2  counts  col3_mean  col4_median  col4_min
0    A    B       4  -0.372500       -0.810     -1.32
1    C    D       3  -0.476667       -0.110     -1.65
2    E    F       2   0.455000        0.475     -0.47
3    G    H       1   1.480000       -0.630     -0.63
``````

### Footnotes

The code used to generate the test data is shown below:

``````In : import numpy as np
...: import pandas as pd
...:
...: keys = np.array([
...:         ["A", "B"],
...:         ["A", "B"],
...:         ["A", "B"],
...:         ["A", "B"],
...:         ["C", "D"],
...:         ["C", "D"],
...:         ["C", "D"],
...:         ["E", "F"],
...:         ["E", "F"],
...:         ["G", "H"]
...:         ])
...:
...: df = pd.DataFrame(
...:     np.hstack([keys,np.random.randn(10,4).round(2)]),
...:     columns = ["col1", "col2", "col3", "col4", "col5", "col6"]
...: )
...:
...: df[["col3", "col4", "col5", "col6"]] =
...:     df[["col3", "col4", "col5", "col6"]].astype(float)
...:
``````

Disclaimer:

If some of the columns that you are aggregating have null values, then you really want to be looking at the group row counts as an independent aggregation for each column. Otherwise you may be misled as to how many records are actually being used to calculate things like the mean because pandas will drop `NaN` entries in the mean calculation without telling you about it.

The idiomatic way to do this with Pandas is to use the `.sample` method of your dataframe to sample all rows without replacement:

``````df.sample(frac=1)
``````

The `frac` keyword argument specifies the fraction of rows to return in the random sample, so `frac=1` means return all rows (in random order).

Note: If you wish to shuffle your dataframe in-place and reset the index, you could do e.g.

``````df = df.sample(frac=1).reset_index(drop=True)
``````

Here, specifying `drop=True` prevents `.reset_index` from creating a column containing the old index entries.

Follow-up note: Although it may not look like the above operation is in-place, python/pandas is smart enough not to do another malloc for the shuffled object. That is, even though the reference object has changed (by which I mean `id(df_old)` is not the same as `id(df_new)`), the underlying C object is still the same. To show that this is indeed the case, you could run a simple memory profiler:

``````\$ python3 -m memory_profiler .	est.py
Filename: .	est.py

Line #    Mem usage    Increment   Line Contents
================================================
5     68.5 MiB     68.5 MiB   @profile
6                             def shuffle():
7    847.8 MiB    779.3 MiB       df = pd.DataFrame(np.random.randn(100, 1000000))
8    847.9 MiB      0.1 MiB       df = df.sample(frac=1).reset_index(drop=True)

``````

## Placing the legend (`bbox_to_anchor`)

A legend is positioned inside the bounding box of the axes using the `loc` argument to `plt.legend`.
E.g. `loc="upper right"` places the legend in the upper right corner of the bounding box, which by default extents from `(0,0)` to `(1,1)` in axes coordinates (or in bounding box notation `(x0,y0, width, height)=(0,0,1,1)`).

To place the legend outside of the axes bounding box, one may specify a tuple `(x0,y0)` of axes coordinates of the lower left corner of the legend.

``````plt.legend(loc=(1.04,0))
``````

A more versatile approach is to manually specify the bounding box into which the legend should be placed, using the `bbox_to_anchor` argument. One can restrict oneself to supply only the `(x0, y0)` part of the bbox. This creates a zero span box, out of which the legend will expand in the direction given by the `loc` argument. E.g.

`plt.legend(bbox_to_anchor=(1.04,1), loc="upper left")`

places the legend outside the axes, such that the upper left corner of the legend is at position `(1.04,1)` in axes coordinates.

Further examples are given below, where additionally the interplay between different arguments like `mode` and `ncols` are shown. ``````l1 = plt.legend(bbox_to_anchor=(1.04,1), borderaxespad=0)
l2 = plt.legend(bbox_to_anchor=(1.04,0), loc="lower left", borderaxespad=0)
l3 = plt.legend(bbox_to_anchor=(1.04,0.5), loc="center left", borderaxespad=0)
l4 = plt.legend(bbox_to_anchor=(0,1.02,1,0.2), loc="lower left",
l5 = plt.legend(bbox_to_anchor=(1,0), loc="lower right",
bbox_transform=fig.transFigure, ncol=3)
l6 = plt.legend(bbox_to_anchor=(0.4,0.8), loc="upper right")
``````

Details about how to interpret the 4-tuple argument to `bbox_to_anchor`, as in `l4`, can be found in this question. The `mode="expand"` expands the legend horizontally inside the bounding box given by the 4-tuple. For a vertically expanded legend, see this question.

Sometimes it may be useful to specify the bounding box in figure coordinates instead of axes coordinates. This is shown in the example `l5` from above, where the `bbox_transform` argument is used to put the legend in the lower left corner of the figure.

### Postprocessing

Having placed the legend outside the axes often leads to the undesired situation that it is completely or partially outside the figure canvas.

Solutions to this problem are:

One can adjust the subplot parameters such, that the axes take less space inside the figure (and thereby leave more space to the legend) by using `plt.subplots_adjust`. E.g.

``````  plt.subplots_adjust(right=0.7)
``````

leaves 30% space on the right-hand side of the figure, where one could place the legend.

• Tight layout
Using `plt.tight_layout` Allows to automatically adjust the subplot parameters such that the elements in the figure sit tight against the figure edges. Unfortunately, the legend is not taken into account in this automatism, but we can supply a rectangle box that the whole subplots area (including labels) will fit into.

``````  plt.tight_layout(rect=[0,0,0.75,1])
``````
• Saving the figure with `bbox_inches = "tight"`
The argument `bbox_inches = "tight"` to `plt.savefig` can be used to save the figure such that all artist on the canvas (including the legend) are fit into the saved area. If needed, the figure size is automatically adjusted.

``````  plt.savefig("output.png", bbox_inches="tight")
``````
• automatically adjusting the subplot params
A way to automatically adjust the subplot position such that the legend fits inside the canvas without changing the figure size can be found in this answer: Creating figure with exact size and no padding (and legend outside the axes)

Comparison between the cases discussed above: ## Alternatives

A figure legend

One may use a legend to the figure instead of the axes, `matplotlib.figure.Figure.legend`. This has become especially useful for matplotlib version >=2.1, where no special arguments are needed

``````fig.legend(loc=7)
``````

to create a legend for all artists in the different axes of the figure. The legend is placed using the `loc` argument, similar to how it is placed inside an axes, but in reference to the whole figure - hence it will be outside the axes somewhat automatically. What remains is to adjust the subplots such that there is no overlap between the legend and the axes. Here the point "Adjust the subplot parameters" from above will be helpful. An example:

``````import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0,2*np.pi)
colors=["#7aa0c4","#ca82e1" ,"#8bcd50","#e18882"]
fig, axes = plt.subplots(ncols=2)
for i in range(4):
axes[i//2].plot(x,np.sin(x+i), color=colors[i],label="y=sin(x+{})".format(i))

fig.legend(loc=7)
fig.tight_layout()
plt.show()
`````` Legend inside dedicated subplot axes

An alternative to using `bbox_to_anchor` would be to place the legend in its dedicated subplot axes (`lax`). Since the legend subplot should be smaller than the plot, we may use `gridspec_kw={"width_ratios":[4,1]}` at axes creation. We can hide the axes `lax.axis("off")` but still put a legend in. The legend handles and labels need to obtained from the real plot via `h,l = ax.get_legend_handles_labels()`, and can then be supplied to the legend in the `lax` subplot, `lax.legend(h,l)`. A complete example is below.

``````import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = 6,2

fig, (ax,lax) = plt.subplots(ncols=2, gridspec_kw={"width_ratios":[4,1]})
ax.plot(x,y, label="y=sin(x)")
....

h,l = ax.get_legend_handles_labels()
lax.axis("off")

plt.tight_layout()
plt.show()
``````

This produces a plot, which is visually pretty similar to the plot from above: We could also use the first axes to place the legend, but use the `bbox_transform` of the legend axes,

``````ax.legend(bbox_to_anchor=(0,0,1,1), bbox_transform=lax.transAxes)
lax.axis("off")
``````

In this approach, we do not need to obtain the legend handles externally, but we need to specify the `bbox_to_anchor` argument.

• Consider the matplotlib legend guide with some examples of other stuff you want to do with legends.
• Some example code for placing legends for pie charts may directly be found in answer to this question: Python - Legend overlaps with the pie chart
• The `loc` argument can take numbers instead of strings, which make calls shorter, however, they are not very intuitively mapped to each other. Here is the mapping for reference: The fundamental misunderstanding here is in thinking that `range` is a generator. It"s not. In fact, it"s not any kind of iterator.

You can tell this pretty easily:

``````>>> a = range(5)
>>> print(list(a))
[0, 1, 2, 3, 4]
>>> print(list(a))
[0, 1, 2, 3, 4]
``````

If it were a generator, iterating it once would exhaust it:

``````>>> b = my_crappy_range(5)
>>> print(list(b))
[0, 1, 2, 3, 4]
>>> print(list(b))
[]
``````

What `range` actually is, is a sequence, just like a list. You can even test this:

``````>>> import collections.abc
>>> isinstance(a, collections.abc.Sequence)
True
``````

This means it has to follow all the rules of being a sequence:

``````>>> a         # indexable
3
>>> len(a)       # sized
5
>>> 3 in a       # membership
True
>>> reversed(a)  # reversible
<range_iterator at 0x101cd2360>
>>> a.index(3)   # implements "index"
3
>>> a.count(3)   # implements "count"
1
``````

The difference between a `range` and a `list` is that a `range` is a lazy or dynamic sequence; it doesn"t remember all of its values, it just remembers its `start`, `stop`, and `step`, and creates the values on demand on `__getitem__`.

(As a side note, if you `print(iter(a))`, you"ll notice that `range` uses the same `listiterator` type as `list`. How does that work? A `listiterator` doesn"t use anything special about `list` except for the fact that it provides a C implementation of `__getitem__`, so it works fine for `range` too.)

Now, there"s nothing that says that `Sequence.__contains__` has to be constant time‚Äîin fact, for obvious examples of sequences like `list`, it isn"t. But there"s nothing that says it can"t be. And it"s easier to implement `range.__contains__` to just check it mathematically (`(val - start) % step`, but with some extra complexity to deal with negative steps) than to actually generate and test all the values, so why shouldn"t it do it the better way?

But there doesn"t seem to be anything in the language that guarantees this will happen. As Ashwini Chaudhari points out, if you give it a non-integral value, instead of converting to integer and doing the mathematical test, it will fall back to iterating all the values and comparing them one by one. And just because CPython 3.2+ and PyPy 3.x versions happen to contain this optimization, and it"s an obvious good idea and easy to do, there"s no reason that IronPython or NewKickAssPython 3.x couldn"t leave it out. (And in fact, CPython 3.0-3.1 didn"t include it.)

If `range` actually were a generator, like `my_crappy_range`, then it wouldn"t make sense to test `__contains__` this way, or at least the way it makes sense wouldn"t be obvious. If you"d already iterated the first 3 values, is `1` still `in` the generator? Should testing for `1` cause it to iterate and consume all the values up to `1` (or up to the first value `>= 1`)?

# Using a for loop, how do I access the loop index, from 1 to 5 in this case?

Use `enumerate` to get the index with the element as you iterate:

``````for index, item in enumerate(items):
print(index, item)
``````

And note that Python"s indexes start at zero, so you would get 0 to 4 with the above. If you want the count, 1 to 5, do this:

``````count = 0 # in case items is empty and you need it after the loop
for count, item in enumerate(items, start=1):
print(count, item)
``````

# Unidiomatic control flow

What you are asking for is the Pythonic equivalent of the following, which is the algorithm most programmers of lower-level languages would use:

``````index = 0            # Python"s indexing starts at zero
for item in items:   # Python"s for loops are a "for each" loop
print(index, item)
index += 1
``````

Or in languages that do not have a for-each loop:

``````index = 0
while index < len(items):
print(index, items[index])
index += 1
``````

or sometimes more commonly (but unidiomatically) found in Python:

``````for index in range(len(items)):
print(index, items[index])
``````

# Use the Enumerate Function

Python"s `enumerate` function reduces the visual clutter by hiding the accounting for the indexes, and encapsulating the iterable into another iterable (an `enumerate` object) that yields a two-item tuple of the index and the item that the original iterable would provide. That looks like this:

``````for index, item in enumerate(items, start=0):   # default is zero
print(index, item)
``````

This code sample is fairly well the canonical example of the difference between code that is idiomatic of Python and code that is not. Idiomatic code is sophisticated (but not complicated) Python, written in the way that it was intended to be used. Idiomatic code is expected by the designers of the language, which means that usually this code is not just more readable, but also more efficient.

## Getting a count

Even if you don"t need indexes as you go, but you need a count of the iterations (sometimes desirable) you can start with `1` and the final number will be your count.

``````count = 0 # in case items is empty
for count, item in enumerate(items, start=1):   # default is zero
print(item)

print("there were {0} items printed".format(count))
``````

The count seems to be more what you intend to ask for (as opposed to index) when you said you wanted from 1 to 5.

## Breaking it down - a step by step explanation

To break these examples down, say we have a list of items that we want to iterate over with an index:

``````items = ["a", "b", "c", "d", "e"]
``````

Now we pass this iterable to enumerate, creating an enumerate object:

``````enumerate_object = enumerate(items) # the enumerate object
``````

We can pull the first item out of this iterable that we would get in a loop with the `next` function:

``````iteration = next(enumerate_object) # first iteration from enumerate
print(iteration)
``````

And we see we get a tuple of `0`, the first index, and `"a"`, the first item:

``````(0, "a")
``````

we can use what is referred to as "sequence unpacking" to extract the elements from this two-tuple:

``````index, item = iteration
#   0,  "a" = (0, "a") # essentially this.
``````

and when we inspect `index`, we find it refers to the first index, 0, and `item` refers to the first item, `"a"`.

``````>>> print(index)
0
>>> print(item)
a
``````

# Conclusion

• Python indexes start at zero
• To get these indexes from an iterable as you iterate over it, use the enumerate function
• Using enumerate in the idiomatic way (along with tuple unpacking) creates code that is more readable and maintainable:

So do this:

``````for index, item in enumerate(items, start=0):   # Python indexes start at zero
print(index, item)
``````

## How to remove convexity defects in a Sudoku square?

I was doing a fun project: Solving a Sudoku from an input image using OpenCV (as in Google goggles etc). And I have completed the task, but at the end I found a little problem for which I came here.

I did the programming using Python API of OpenCV 2.3.1.

Below is what I did :

2. Find the contours
3. Select the one with maximum area, ( and also somewhat equivalent to square).
4. Find the corner points.

e.g. given below: (Notice here that the green line correctly coincides with the true boundary of the Sudoku, so the Sudoku can be correctly warped. Check next image)

5. warp the image to a perfect square

eg image: 6. Perform OCR ( for which I used the method I have given in Simple Digit Recognition OCR in OpenCV-Python )

And the method worked well.

Problem:

Check out this image.

Performing the step 4 on this image gives the result below: The red line drawn is the original contour which is the true outline of sudoku boundary.

The green line drawn is approximated contour which will be the outline of warped image.

Which of course, there is difference between green line and red line at the top edge of sudoku. So while warping, I am not getting the original boundary of the Sudoku.

My Question :

How can I warp the image on the correct boundary of the Sudoku, i.e. the red line OR how can I remove the difference between red line and green line? Is there any method for this in OpenCV?

## Is there a library function for Root mean square error (RMSE) in python?

I know I could implement a root mean squared error function like this:

``````def rmse(predictions, targets):
return np.sqrt(((predictions - targets) ** 2).mean())
``````

What I"m looking for if this rmse function is implemented in a library somewhere, perhaps in scipy or scikit-learn?

## What"s the difference between lists enclosed by square brackets and parentheses in Python?

``````>>> x=[1,2]
>>> x
2
>>> x=(1,2)
>>> x
2
``````

Are they both valid? Is one preferred for some reason?

## How do I calculate square root in Python?

Why does Python give the "wrong" answer?

``````x = 16

sqrt = x**(.5)  #returns 4
sqrt = x**(1/2) #returns 1
``````

Yes, I know `import math` and use `sqrt`. But I"m looking for an answer to the above.

## What do square brackets mean in pip install?

I see more and more commands like this:

``````\$ pip install "splinter[django]"
``````

What do these square brackets do?

## How do I calculate r-squared using Python and Numpy?

I"m using Python and Numpy to calculate a best fit polynomial of arbitrary degree. I pass a list of x values, y values, and the degree of the polynomial I want to fit (linear, quadratic, etc.).

This much works, but I also want to calculate r (coefficient of correlation) and r-squared(coefficient of determination). I am comparing my results with Excel"s best-fit trendline capability, and the r-squared value it calculates. Using this, I know I am calculating r-squared correctly for linear best-fit (degree equals 1). However, my function does not work for polynomials with degree greater than 1.

Excel is able to do this. How do I calculate r-squared for higher-order polynomials using Numpy?

Here"s my function:

``````import numpy

# Polynomial Regression
def polyfit(x, y, degree):
results = {}

coeffs = numpy.polyfit(x, y, degree)
# Polynomial Coefficients
results["polynomial"] = coeffs.tolist()

correlation = numpy.corrcoef(x, y)[0,1]

# r
results["correlation"] = correlation
# r-squared
results["determination"] = correlation**2

return results
``````

## What is the difference between using loc and using just square brackets to filter for columns in Pandas/Python?

I"ve noticed three methods of selecting a column in a Pandas DataFrame:

First method of selecting a column using loc:

``````df_new = df.loc[:, "col1"]
``````

Second method - seems simpler and faster:

``````df_new = df["col1"]
``````

Third method - most convenient:

``````df_new = df.col1
``````

Is there a difference between these three methods? I don"t think so, in which case I"d rather use the third method.

I"m mostly curious as to why there appear to be three methods for doing the same thing.

I think you"re almost there, try removing the extra square brackets around the `lst`"s (Also you don"t need to specify the column names when you"re creating a dataframe from a dict like this):

``````import pandas as pd
lst1 = range(100)
lst2 = range(100)
lst3 = range(100)
percentile_list = pd.DataFrame(
{"lst1Title": lst1,
"lst2Title": lst2,
"lst3Title": lst3
})

percentile_list
lst1Title  lst2Title  lst3Title
0          0         0         0
1          1         1         1
2          2         2         2
3          3         3         3
4          4         4         4
5          5         5         5
6          6         6         6
...
``````

If you need a more performant solution you can use `np.column_stack` rather than `zip` as in your first attempt, this has around a 2x speedup on the example here, however comes at bit of a cost of readability in my opinion:

``````import numpy as np
percentile_list = pd.DataFrame(np.column_stack([lst1, lst2, lst3]),
columns=["lst1Title", "lst2Title", "lst3Title"])
``````

There is a clean, one-line way of doing this in Pandas:

``````df["col_3"] = df.apply(lambda x: f(x.col_1, x.col_2), axis=1)
``````

This allows `f` to be a user-defined function with multiple input values, and uses (safe) column names rather than (unsafe) numeric indices to access the columns.

Example with data (based on original question):

``````import pandas as pd

df = pd.DataFrame({"ID":["1", "2", "3"], "col_1": [0, 2, 3], "col_2":[1, 4, 5]})
mylist = ["a", "b", "c", "d", "e", "f"]

def get_sublist(sta,end):
return mylist[sta:end+1]

df["col_3"] = df.apply(lambda x: get_sublist(x.col_1, x.col_2), axis=1)
``````

Output of `print(df)`:

``````  ID  col_1  col_2      col_3
0  1      0      1     [a, b]
1  2      2      4  [c, d, e]
2  3      3      5  [d, e, f]
``````

If your column names contain spaces or share a name with an existing dataframe attribute, you can index with square brackets:

``````df["col_3"] = df.apply(lambda x: f(x["col 1"], x["col 2"]), axis=1)
``````

# Distribution Fitting with Sum of Square Error (SSE)

This is an update and modification to Saullo"s answer, that uses the full list of the current `scipy.stats` distributions and returns the distribution with the least SSE between the distribution"s histogram and the data"s histogram.

## Example Fitting

Using the El Ni√±o dataset from `statsmodels`, the distributions are fit and error is determined. The distribution with the least error is returned.

### All Distributions ### Best Fit Distribution ### Example Code

``````%matplotlib inline

import warnings
import numpy as np
import pandas as pd
import scipy.stats as st
import statsmodels.api as sm
from scipy.stats._continuous_distns import _distn_names
import matplotlib
import matplotlib.pyplot as plt

matplotlib.rcParams["figure.figsize"] = (16.0, 12.0)
matplotlib.style.use("ggplot")

# Create models from data
def best_fit_distribution(data, bins=200, ax=None):
"""Model data by finding best fit distribution to data"""
# Get histogram of original data
y, x = np.histogram(data, bins=bins, density=True)
x = (x + np.roll(x, -1))[:-1] / 2.0

# Best holders
best_distributions = []

# Estimate distribution parameters from data
for ii, distribution in enumerate([d for d in _distn_names if not d in ["levy_stable", "studentized_range"]]):

print("{:>3} / {:<3}: {}".format( ii+1, len(_distn_names), distribution ))

distribution = getattr(st, distribution)

# Try to fit the distribution
try:
# Ignore warnings from data that can"t be fit
with warnings.catch_warnings():
warnings.filterwarnings("ignore")

# fit dist to data
params = distribution.fit(data)

# Separate parts of parameters
arg = params[:-2]
loc = params[-2]
scale = params[-1]

# Calculate fitted PDF and error with fit in distribution
pdf = distribution.pdf(x, loc=loc, scale=scale, *arg)
sse = np.sum(np.power(y - pdf, 2.0))

# if axis pass in add to plot
try:
if ax:
pd.Series(pdf, x).plot(ax=ax)
end
except Exception:
pass

# identify if this distribution is better
best_distributions.append((distribution, params, sse))

except Exception:
pass

return sorted(best_distributions, key=lambda x:x)

def make_pdf(dist, params, size=10000):
"""Generate distributions"s Probability Distribution Function """

# Separate parts of parameters
arg = params[:-2]
loc = params[-2]
scale = params[-1]

# Get sane start and end points of distribution
start = dist.ppf(0.01, *arg, loc=loc, scale=scale) if arg else dist.ppf(0.01, loc=loc, scale=scale)
end = dist.ppf(0.99, *arg, loc=loc, scale=scale) if arg else dist.ppf(0.99, loc=loc, scale=scale)

# Build PDF and turn into pandas Series
x = np.linspace(start, end, size)
y = dist.pdf(x, loc=loc, scale=scale, *arg)
pdf = pd.Series(y, x)

return pdf

# Load data from statsmodels datasets

# Plot for comparison
plt.figure(figsize=(12,8))
ax = data.plot(kind="hist", bins=50, density=True, alpha=0.5, color=list(matplotlib.rcParams["axes.prop_cycle"])["color"])

# Save plot limits
dataYLim = ax.get_ylim()

# Find best fit distribution
best_distibutions = best_fit_distribution(data, 200, ax)
best_dist = best_distibutions

# Update plots
ax.set_ylim(dataYLim)
ax.set_title(u"El Ni√±o sea temp.
All Fitted Distributions")
ax.set_xlabel(u"Temp (¬∞C)")
ax.set_ylabel("Frequency")

# Make PDF with best params
pdf = make_pdf(best_dist, best_dist)

# Display
plt.figure(figsize=(12,8))
ax = pdf.plot(lw=2, label="PDF", legend=True)
data.plot(kind="hist", bins=50, density=True, alpha=0.5, label="Data", legend=True, ax=ax)

param_names = (best_dist.shapes + ", loc, scale").split(", ") if best_dist.shapes else ["loc", "scale"]
param_str = ", ".join(["{}={:0.2f}".format(k,v) for k,v in zip(param_names, best_dist)])
dist_str = "{}({})".format(best_dist.name, param_str)

ax.set_title(u"El Ni√±o sea temp. with best fit distribution
" + dist_str)
ax.set_xlabel(u"Temp. (¬∞C)")
ax.set_ylabel("Frequency")
``````

TL;DR

``````def square_list(n):
the_list = []                         # Replace
for x in range(n):
y = x * x
the_list.append(y)                # these
return the_list                       # lines
``````

# do this:

``````def square_yield(n):
for x in range(n):
y = x * x
yield y                           # with this one.
``````

Whenever you find yourself building a list from scratch, `yield` each piece instead.

This was my first "aha" moment with yield.

`yield` is a sugary way to say

build a series of stuff

Same behavior:

``````>>> for square in square_list(4):
...     print(square)
...
0
1
4
9
>>> for square in square_yield(4):
...     print(square)
...
0
1
4
9
``````

Different behavior:

Yield is single-pass: you can only iterate through once. When a function has a yield in it we call it a generator function. And an iterator is what it returns. Those terms are revealing. We lose the convenience of a container, but gain the power of a series that"s computed as needed, and arbitrarily long.

Yield is lazy, it puts off computation. A function with a yield in it doesn"t actually execute at all when you call it. It returns an iterator object that remembers where it left off. Each time you call `next()` on the iterator (this happens in a for-loop) execution inches forward to the next yield. `return` raises StopIteration and ends the series (this is the natural end of a for-loop).

Yield is versatile. Data doesn"t have to be stored all together, it can be made available one at a time. It can be infinite.

``````>>> def squares_all_of_them():
...     x = 0
...     while True:
...         yield x * x
...         x += 1
...
>>> squares = squares_all_of_them()
>>> for _ in range(4):
...     print(next(squares))
...
0
1
4
9
``````

If you need multiple passes and the series isn"t too long, just call `list()` on it:

``````>>> list(square_yield(4))
[0, 1, 4, 9]
``````

Brilliant choice of the word `yield` because both meanings apply:

yield — produce or provide (as in agriculture)

...provide the next data in the series.

yield — give way or relinquish (as in political power)

...relinquish CPU execution until the iterator advances.

This is kind of overkill but let"s give it a go. First lets use statsmodel to find out what the p-values should be

``````import pandas as pd
import numpy as np
from sklearn import datasets, linear_model
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from scipy import stats

X = diabetes.data
y = diabetes.target

est = sm.OLS(y, X2)
est2 = est.fit()
print(est2.summary())
``````

and we get

``````                         OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.518
Method:                 Least Squares   F-statistic:                     46.27
Date:                Wed, 08 Mar 2017   Prob (F-statistic):           3.83e-62
Time:                        10:08:24   Log-Likelihood:                -2386.0
No. Observations:                 442   AIC:                             4794.
Df Residuals:                     431   BIC:                             4839.
Df Model:                          10
Covariance Type:            nonrobust
==============================================================================
coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        152.1335      2.576     59.061      0.000     147.071     157.196
x1           -10.0122     59.749     -0.168      0.867    -127.448     107.424
x2          -239.8191     61.222     -3.917      0.000    -360.151    -119.488
x3           519.8398     66.534      7.813      0.000     389.069     650.610
x4           324.3904     65.422      4.958      0.000     195.805     452.976
x5          -792.1842    416.684     -1.901      0.058   -1611.169      26.801
x6           476.7458    339.035      1.406      0.160    -189.621    1143.113
x7           101.0446    212.533      0.475      0.635    -316.685     518.774
x8           177.0642    161.476      1.097      0.273    -140.313     494.442
x9           751.2793    171.902      4.370      0.000     413.409    1089.150
x10           67.6254     65.984      1.025      0.306     -62.065     197.316
==============================================================================
Omnibus:                        1.506   Durbin-Watson:                   2.029
Prob(Omnibus):                  0.471   Jarque-Bera (JB):                1.404
Skew:                           0.017   Prob(JB):                        0.496
Kurtosis:                       2.726   Cond. No.                         227.
==============================================================================
``````

Ok, let"s reproduce this. It is kind of overkill as we are almost reproducing a linear regression analysis using Matrix Algebra. But what the heck.

``````lm = LinearRegression()
lm.fit(X,y)
params = np.append(lm.intercept_,lm.coef_)
predictions = lm.predict(X)

newX = pd.DataFrame({"Constant":np.ones(len(X))}).join(pd.DataFrame(X))
MSE = (sum((y-predictions)**2))/(len(newX)-len(newX.columns))

# Note if you don"t want to use a DataFrame replace the two lines above with
# newX = np.append(np.ones((len(X),1)), X, axis=1)
# MSE = (sum((y-predictions)**2))/(len(newX)-len(newX))

var_b = MSE*(np.linalg.inv(np.dot(newX.T,newX)).diagonal())
sd_b = np.sqrt(var_b)
ts_b = params/ sd_b

p_values =[2*(1-stats.t.cdf(np.abs(i),(len(newX)-len(newX)))) for i in ts_b]

sd_b = np.round(sd_b,3)
ts_b = np.round(ts_b,3)
p_values = np.round(p_values,3)
params = np.round(params,4)

myDF3 = pd.DataFrame()
myDF3["Coefficients"],myDF3["Standard Errors"],myDF3["t values"],myDF3["Probabilities"] = [params,sd_b,ts_b,p_values]
print(myDF3)
``````

And this gives us.

``````    Coefficients  Standard Errors  t values  Probabilities
0       152.1335            2.576    59.061         0.000
1       -10.0122           59.749    -0.168         0.867
2      -239.8191           61.222    -3.917         0.000
3       519.8398           66.534     7.813         0.000
4       324.3904           65.422     4.958         0.000
5      -792.1842          416.684    -1.901         0.058
6       476.7458          339.035     1.406         0.160
7       101.0446          212.533     0.475         0.635
8       177.0642          161.476     1.097         0.273
9       751.2793          171.902     4.370         0.000
10       67.6254           65.984     1.025         0.306
``````

So we can reproduce the values from statsmodel.

The problem is the use of `aspect="equal"`, which prevents the subplots from stretching to an arbitrary aspect ratio and filling up all the empty space.

Normally, this would work:

``````import matplotlib.pyplot as plt

ax = [plt.subplot(2,2,i+1) for i in range(4)]

for a in ax:
a.set_xticklabels([])
a.set_yticklabels([])

``````

The result is this: However, with `aspect="equal"`, as in the following code:

``````import matplotlib.pyplot as plt

ax = [plt.subplot(2,2,i+1) for i in range(4)]

for a in ax:
a.set_xticklabels([])
a.set_yticklabels([])
a.set_aspect("equal")

``````

This is what we get: The difference in this second case is that you"ve forced the x- and y-axes to have the same number of units/pixel. Since the axes go from 0 to 1 by default (i.e., before you plot anything), using `aspect="equal"` forces each axis to be a square. Since the figure is not a square, pyplot adds in extra spacing between the axes horizontally.

To get around this problem, you can set your figure to have the correct aspect ratio. We"re going to use the object-oriented pyplot interface here, which I consider to be superior in general:

``````import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8,8)) # Notice the equal aspect ratio
ax = [fig.add_subplot(2,2,i+1) for i in range(4)]

for a in ax:
a.set_xticklabels([])
a.set_yticklabels([])
a.set_aspect("equal")

``````

Here"s the result: How about using `numpy.vectorize`.

``````import numpy as np
x = np.array([1, 2, 3, 4, 5])
squarer = lambda t: t ** 2
vfunc = np.vectorize(squarer)
vfunc(x)
# Output : array([ 1,  4,  9, 16, 25])
``````

Use imap instead of map, which returns an iterator of processed values.

``````from multiprocessing import Pool
import tqdm
import time

def _foo(my_number):
square = my_number * my_number
time.sleep(1)
return square

if __name__ == "__main__":
with Pool(2) as p:
r = list(tqdm.tqdm(p.imap(_foo, range(30)), total=30))
``````

I"d like to shed a little bit more light on the interplay of `iter`, `__iter__` and `__getitem__` and what happens behind the curtains. Armed with that knowledge, you will be able to understand why the best you can do is

``````try:
iter(maybe_iterable)
print("iteration will probably work")
except TypeError:
print("not iterable")
``````

I will list the facts first and then follow up with a quick reminder of what happens when you employ a `for` loop in python, followed by a discussion to illustrate the facts.

# Facts

1. You can get an iterator from any object `o` by calling `iter(o)` if at least one of the following conditions holds true:

a) `o` has an `__iter__` method which returns an iterator object. An iterator is any object with an `__iter__` and a `__next__` (Python 2: `next`) method.

b) `o` has a `__getitem__` method.

2. Checking for an instance of `Iterable` or `Sequence`, or checking for the attribute `__iter__` is not enough.

3. If an object `o` implements only `__getitem__`, but not `__iter__`, `iter(o)` will construct an iterator that tries to fetch items from `o` by integer index, starting at index 0. The iterator will catch any `IndexError` (but no other errors) that is raised and then raises `StopIteration` itself.

4. In the most general sense, there"s no way to check whether the iterator returned by `iter` is sane other than to try it out.

5. If an object `o` implements `__iter__`, the `iter` function will make sure that the object returned by `__iter__` is an iterator. There is no sanity check if an object only implements `__getitem__`.

6. `__iter__` wins. If an object `o` implements both `__iter__` and `__getitem__`, `iter(o)` will call `__iter__`.

7. If you want to make your own objects iterable, always implement the `__iter__` method.

# `for` loops

In order to follow along, you need an understanding of what happens when you employ a `for` loop in Python. Feel free to skip right to the next section if you already know.

When you use `for item in o` for some iterable object `o`, Python calls `iter(o)` and expects an iterator object as the return value. An iterator is any object which implements a `__next__` (or `next` in Python 2) method and an `__iter__` method.

By convention, the `__iter__` method of an iterator should return the object itself (i.e. `return self`). Python then calls `next` on the iterator until `StopIteration` is raised. All of this happens implicitly, but the following demonstration makes it visible:

``````import random

class DemoIterable(object):
def __iter__(self):
print("__iter__ called")
return DemoIterator()

class DemoIterator(object):
def __iter__(self):
return self

def __next__(self):
print("__next__ called")
r = random.randint(1, 10)
if r == 5:
print("raising StopIteration")
raise StopIteration
return r
``````

Iteration over a `DemoIterable`:

``````>>> di = DemoIterable()
>>> for x in di:
...     print(x)
...
__iter__ called
__next__ called
9
__next__ called
8
__next__ called
10
__next__ called
3
__next__ called
10
__next__ called
raising StopIteration
``````

# Discussion and illustrations

On point 1 and 2: getting an iterator and unreliable checks

Consider the following class:

``````class BasicIterable(object):
def __getitem__(self, item):
if item == 3:
raise IndexError
return item
``````

Calling `iter` with an instance of `BasicIterable` will return an iterator without any problems because `BasicIterable` implements `__getitem__`.

``````>>> b = BasicIterable()
>>> iter(b)
<iterator object at 0x7f1ab216e320>
``````

However, it is important to note that `b` does not have the `__iter__` attribute and is not considered an instance of `Iterable` or `Sequence`:

``````>>> from collections import Iterable, Sequence
>>> hasattr(b, "__iter__")
False
>>> isinstance(b, Iterable)
False
>>> isinstance(b, Sequence)
False
``````

This is why Fluent Python by Luciano Ramalho recommends calling `iter` and handling the potential `TypeError` as the most accurate way to check whether an object is iterable. Quoting directly from the book:

As of Python 3.4, the most accurate way to check whether an object `x` is iterable is to call `iter(x)` and handle a `TypeError` exception if it isn‚Äôt. This is more accurate than using `isinstance(x, abc.Iterable)` , because `iter(x)` also considers the legacy `__getitem__` method, while the `Iterable` ABC does not.

On point 3: Iterating over objects which only provide `__getitem__`, but not `__iter__`

Iterating over an instance of `BasicIterable` works as expected: Python constructs an iterator that tries to fetch items by index, starting at zero, until an `IndexError` is raised. The demo object"s `__getitem__` method simply returns the `item` which was supplied as the argument to `__getitem__(self, item)` by the iterator returned by `iter`.

``````>>> b = BasicIterable()
>>> it = iter(b)
>>> next(it)
0
>>> next(it)
1
>>> next(it)
2
>>> next(it)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
StopIteration
``````

Note that the iterator raises `StopIteration` when it cannot return the next item and that the `IndexError` which is raised for `item == 3` is handled internally. This is why looping over a `BasicIterable` with a `for` loop works as expected:

``````>>> for x in b:
...     print(x)
...
0
1
2
``````

Here"s another example in order to drive home the concept of how the iterator returned by `iter` tries to access items by index. `WrappedDict` does not inherit from `dict`, which means instances won"t have an `__iter__` method.

``````class WrappedDict(object): # note: no inheritance from dict!
def __init__(self, dic):
self._dict = dic

def __getitem__(self, item):
try:
return self._dict[item] # delegate to dict.__getitem__
except KeyError:
raise IndexError
``````

Note that calls to `__getitem__` are delegated to `dict.__getitem__` for which the square bracket notation is simply a shorthand.

``````>>> w = WrappedDict({-1: "not printed",
...                   0: "hi", 1: "StackOverflow", 2: "!",
...                   4: "not printed",
...                   "x": "not printed"})
>>> for x in w:
...     print(x)
...
hi
StackOverflow
!
``````

On point 4 and 5: `iter` checks for an iterator when it calls `__iter__`:

When `iter(o)` is called for an object `o`, `iter` will make sure that the return value of `__iter__`, if the method is present, is an iterator. This means that the returned object must implement `__next__` (or `next` in Python 2) and `__iter__`. `iter` cannot perform any sanity checks for objects which only provide `__getitem__`, because it has no way to check whether the items of the object are accessible by integer index.

``````class FailIterIterable(object):
def __iter__(self):
return object() # not an iterator

class FailGetitemIterable(object):
def __getitem__(self, item):
raise Exception
``````

Note that constructing an iterator from `FailIterIterable` instances fails immediately, while constructing an iterator from `FailGetItemIterable` succeeds, but will throw an Exception on the first call to `__next__`.

``````>>> fii = FailIterIterable()
>>> iter(fii)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: iter() returned non-iterator of type "object"
>>>
>>> fgi = FailGetitemIterable()
>>> it = iter(fgi)
>>> next(it)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/path/iterdemo.py", line 42, in __getitem__
raise Exception
Exception
``````

On point 6: `__iter__` wins

This one is straightforward. If an object implements `__iter__` and `__getitem__`, `iter` will call `__iter__`. Consider the following class

``````class IterWinsDemo(object):
def __iter__(self):
return iter(["__iter__", "wins"])

def __getitem__(self, item):
return ["__getitem__", "wins"][item]
``````

and the output when looping over an instance:

``````>>> iwd = IterWinsDemo()
>>> for x in iwd:
...     print(x)
...
__iter__
wins
``````

On point 7: your iterable classes should implement `__iter__`

You might ask yourself why most builtin sequences like `list` implement an `__iter__` method when `__getitem__` would be sufficient.

``````class WrappedList(object): # note: no inheritance from list!
def __init__(self, lst):
self._list = lst

def __getitem__(self, item):
return self._list[item]
``````

After all, iteration over instances of the class above, which delegates calls to `__getitem__` to `list.__getitem__` (using the square bracket notation), will work fine:

``````>>> wl = WrappedList(["A", "B", "C"])
>>> for x in wl:
...     print(x)
...
A
B
C
``````

The reasons your custom iterables should implement `__iter__` are as follows:

1. If you implement `__iter__`, instances will be considered iterables, and `isinstance(o, collections.abc.Iterable)` will return `True`.
2. If the object returned by `__iter__` is not an iterator, `iter` will fail immediately and raise a `TypeError`.
3. The special handling of `__getitem__` exists for backwards compatibility reasons. Quoting again from Fluent Python:

That is why any Python sequence is iterable: they all implement `__getitem__` . In fact, the standard sequences also implement `__iter__`, and yours should too, because the special handling of `__getitem__` exists for backward compatibility reasons and may be gone in the future (although it is not deprecated as I write this).

There are lots of things I have seen make a model diverge.

1. Too high of a learning rate. You can often tell if this is the case if the loss begins to increase and then diverges to infinity.

2. I am not to familiar with the DNNClassifier but I am guessing it uses the categorical cross entropy cost function. This involves taking the log of the prediction which diverges as the prediction approaches zero. That is why people usually add a small epsilon value to the prediction to prevent this divergence. I am guessing the DNNClassifier probably does this or uses the tensorflow opp for it. Probably not the issue.

3. Other numerical stability issues can exist such as division by zero where adding the epsilon can help. Another less obvious one if the square root who"s derivative can diverge if not properly simplified when dealing with finite precision numbers. Yet again I doubt this is the issue in the case of the DNNClassifier.

4. You may have an issue with the input data. Try calling `assert not np.any(np.isnan(x))` on the input data to make sure you are not introducing the nan. Also make sure all of the target values are valid. Finally, make sure the data is properly normalized. You probably want to have the pixels in the range [-1, 1] and not [0, 255].

5. The labels must be in the domain of the loss function, so if using a logarithmic-based loss function all labels must be non-negative (as noted by evan pu and the comments below).