Python Methods and Functions | reset_index | reset_index


Pandas **reset_index()** is a method that resets the index of a data frame. It replaces the current index with a list of integers ranging from 0 to the length of the data minus one.

Syntax:

DataFrame.reset_index(level=None, drop=False, inplace=False, col_level=0, col_fill='')

Parameters:

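level: int, string, or a list selecting which index levels to remove from the index. By default, all levels are removed.

drop: Boolean value. If False, the old index is added to the data as a column; if True, it is discarded.

inplace: Boolean value. If True, the changes are made in the original data frame itself.

col_level: If the columns have multiple levels, selects the level into which the index labels are inserted.

col_fill: Object. If the columns have multiple levels, determines how the other levels are named.

Return type: DataFrame (or None if inplace is True)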


** Example # 1: ** Resetting the index

In this example, to reset the index, the Name column was first set as the column index, and then a new index was created using the reset index.
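The steps described above can be sketched as follows. This is only an illustration: the small frame is a hypothetical stand-in for the article's CSV data, and it assumes the Name column was appended to the default integer index (`append=True`), which is what would produce a `level_0` column on reset.

```python
import pandas as pd

# Hypothetical stand-in for the article's CSV data.
data = pd.DataFrame({"Name": ["Ann", "Bob", "Cid"], "Salary": [100, 200, 300]})

# Assumption: "Name" is appended to the existing integer index,
# giving a two-level index whose first level has no name.
indexed = data.set_index("Name", append=True)

# reset_index() moves every index level back into the columns.
# The unnamed integer level is labelled "level_0".
restored = indexed.reset_index()
print(restored.columns.tolist())
```

If the old labels are not needed afterwards, passing `drop=True` discards them instead of turning them into columns.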


** Output: **

As shown in the output, a new column called level_0 was created.


** Example # 2: ** Working with a multilevel index

In this example, 2 columns (Name and Gender) are added to the index column and then one level is removed using the .reset_index () method.
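A minimal sketch of removing a single level from a multilevel index is shown below. The frame is hypothetical, and the level is selected by name rather than by position, so it does not reproduce the article's exact call.

```python
import pandas as pd

# Hypothetical data with the two columns mentioned in the text.
data = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cid"],
    "Gender": ["F", "M", "M"],
    "Salary": [100, 200, 300],
})

# Build a two-level index from Name and Gender.
indexed = data.set_index(["Name", "Gender"])

# Move only the Gender level back into the columns; Name stays as the index.
partial = indexed.reset_index(level="Gender")
print(partial)
```

Selecting the level by name avoids having to count positions when other levels are added or removed later.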


**Output:** As shown in the output, the Gender column in the index was converted back to a regular column because its level was 2.


I have a dataframe from which I remove some rows. As a result, I get a dataframe in which the index is something like `[1,5,6,10,11]` and I would like to reset it to `[0,1,2,3,4]`. How can I do it?

The following seems to work:

```
df = df.reset_index()
del df["index"]
```

The following does not work:

```
df = df.reindex()
```
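As a sketch of the difference, the example below uses a made-up frame with the same index as in the question. Calling `reindex()` without arguments leaves the labels as they are, while `reset_index(drop=True)` renumbers the rows and avoids the extra `del df["index"]` step.

```python
import pandas as pd

# Made-up frame; keep only some rows so the index has gaps.
df = pd.DataFrame({"value": list("abcdefghijkl")})
df = df.loc[[1, 5, 6, 10, 11]]

# reindex() with no arguments keeps the existing labels.
unchanged = df.reindex()

# drop=True discards the old labels instead of storing them in an "index" column.
clean = df.reset_index(drop=True)

print(unchanged.index.tolist())
print(clean.index.tolist())
```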

The simplest way to get row counts per group is by calling `.size()`, which returns a `Series`:

```
df.groupby(["col1","col2"]).size()
```

Usually you want this result as a `DataFrame` (instead of a `Series`) so you can do:

```
df.groupby(["col1", "col2"]).size().reset_index(name="counts")
```

If you want to find out how to calculate the row counts and other statistics for each group continue reading below.

Consider the following example dataframe:

```
In [2]: df
Out[2]:
col1 col2 col3 col4 col5 col6
0 A B 0.20 -0.61 -0.49 1.49
1 A B -1.53 -1.01 -0.39 1.82
2 A B -0.44 0.27 0.72 0.11
3 A B 0.28 -1.32 0.38 0.18
4 C D 0.12 0.59 0.81 0.66
5 C D -0.13 -1.65 -1.64 0.50
6 C D -1.42 -0.11 -0.18 -0.44
7 E F -0.00 1.42 -0.26 1.17
8 E F 0.91 -0.47 1.35 -0.34
9 G H 1.48 -0.63 -1.14 0.17
```

First let's use `.size()` to get the row counts:

```
In [3]: df.groupby(["col1", "col2"]).size()
Out[3]:
col1 col2
A B 4
C D 3
E F 2
G H 1
dtype: int64
```

Then let's use `.size().reset_index(name="counts")` to get the row counts as a `DataFrame`:

```
In [4]: df.groupby(["col1", "col2"]).size().reset_index(name="counts")
Out[4]:
col1 col2 counts
0 A B 4
1 C D 3
2 E F 2
3 G H 1
```

When you want to calculate statistics on grouped data, it usually looks like this:

```
In [5]: (df
...: .groupby(["col1", "col2"])
...: .agg({
...: "col3": ["mean", "count"],
...: "col4": ["median", "min", "count"]
...: }))
Out[5]:
col4 col3
median min count mean count
col1 col2
A B -0.810 -1.32 4 -0.372500 4
C D -0.110 -1.65 3 -0.476667 3
E F 0.475 -0.47 2 0.455000 2
G H -0.630 -0.63 1 1.480000 1
```

The result above is a little annoying to deal with because of the nested column labels, and also because row counts are on a per column basis.

To gain more control over the output I usually split the statistics into individual aggregations that I then combine using `join`

. It looks like this:

```
In [6]: gb = df.groupby(["col1", "col2"])
...: counts = gb.size().to_frame(name="counts")
...: (counts
...: .join(gb.agg({"col3": "mean"}).rename(columns={"col3": "col3_mean"}))
...: .join(gb.agg({"col4": "median"}).rename(columns={"col4": "col4_median"}))
...: .join(gb.agg({"col4": "min"}).rename(columns={"col4": "col4_min"}))
...: .reset_index()
...: )
...:
Out[6]:
col1 col2 counts col3_mean col4_median col4_min
0 A B 4 -0.372500 -0.810 -1.32
1 C D 3 -0.476667 -0.110 -1.65
2 E F 2 0.455000 0.475 -0.47
3 G H 1 1.480000 -0.630 -0.63
```

The code used to generate the test data is shown below:

```
In [1]: import numpy as np
...: import pandas as pd
...:
...: keys = np.array([
...: ["A", "B"],
...: ["A", "B"],
...: ["A", "B"],
...: ["A", "B"],
...: ["C", "D"],
...: ["C", "D"],
...: ["C", "D"],
...: ["E", "F"],
...: ["E", "F"],
...: ["G", "H"]
...: ])
...:
...: df = pd.DataFrame(
...: np.hstack([keys,np.random.randn(10,4).round(2)]),
...: columns = ["col1", "col2", "col3", "col4", "col5", "col6"]
...: )
...:
...: df[["col3", "col4", "col5", "col6"]] = \
...:     df[["col3", "col4", "col5", "col6"]].astype(float)
...:
```

**Disclaimer:**

If some of the columns that you are aggregating have null values, then you really want to be looking at the group row counts as an independent aggregation for each column. Otherwise you may be misled as to how many records are actually being used to calculate things like the mean, because pandas will drop `NaN` entries in the mean calculation without telling you about it.
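The difference between the group size and the per-column count can be seen in a small sketch. The frame and column names here are invented for illustration.

```python
import numpy as np
import pandas as pd

# Invented data: group "a" has three rows, one of which is missing a value.
df = pd.DataFrame({
    "key": ["a", "a", "a", "b"],
    "val": [1.0, np.nan, 3.0, 4.0],
})
gb = df.groupby("key")

sizes = gb.size()            # rows per group, NaN rows included
counts = gb["val"].count()   # non-null values per group
means = gb["val"].mean()     # NaN entries are skipped silently

print(pd.DataFrame({"size": sizes, "count": counts, "mean": means}))
```

Here group `a` has a size of 3, but its mean is computed from only 2 values, which is exactly the discrepancy the disclaimer warns about.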

The idiomatic way to do this with Pandas is to use the `.sample` method of your dataframe to sample all rows without replacement:

```
df.sample(frac=1)
```

The `frac` keyword argument specifies the fraction of rows to return in the random sample, so `frac=1` means return all rows (in random order).

**Note:**
If you wish to shuffle your dataframe in-place and reset the index, you could do e.g.

```
df = df.sample(frac=1).reset_index(drop=True)
```

Here, specifying `drop=True` prevents `.reset_index` from creating a column containing the old index entries.
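A small sketch of the shuffle-and-reset pattern follows. The frame is made up, and `random_state=0` is added here only so that the shuffle is repeatable; it is not part of the original answer.

```python
import pandas as pd

# Made-up frame with a non-default index.
df = pd.DataFrame({"x": range(5)}, index=list("abcde"))

# Shuffle all rows, then replace the shuffled labels with 0..n-1.
# random_state is an illustrative addition for reproducibility.
shuffled = df.sample(frac=1, random_state=0).reset_index(drop=True)

print(shuffled)
```

Without `drop=True`, the old labels `a` to `e` would be kept in an extra `index` column.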

**Follow-up note:** Although it may not look like the above operation is *in-place*, python/pandas is smart enough not to do another malloc for the shuffled object. That is, even though the *reference* object has changed (by which I mean `id(df_old)` is not the same as `id(df_new)`), the underlying C object is still the same. To show that this is indeed the case, you could run a simple memory profiler:

```
$ python3 -m memory_profiler .\test.py
Filename: .\test.py
Line # Mem usage Increment Line Contents
================================================
5 68.5 MiB 68.5 MiB @profile
6 def shuffle():
7 847.8 MiB 779.3 MiB df = pd.DataFrame(np.random.randn(100, 1000000))
8 847.9 MiB 0.1 MiB df = df.sample(frac=1).reset_index(drop=True)
```

I would suggest using the duplicated method on the Pandas Index itself:

```
df3 = df3[~df3.index.duplicated(keep="first")]
```

While all the other methods work, `.drop_duplicates` is by far the least performant for the provided example. Furthermore, while the groupby method is only slightly less performant, I find the duplicated method to be more readable.

Using the sample data provided:

```
>>> %timeit df3.reset_index().drop_duplicates(subset="index", keep="first").set_index("index")
1000 loops, best of 3: 1.54 ms per loop
>>> %timeit df3.groupby(df3.index).first()
1000 loops, best of 3: 580 µs per loop
>>> %timeit df3[~df3.index.duplicated(keep="first")]
1000 loops, best of 3: 307 µs per loop
```

Note that you can keep the last element by changing the keep argument to `"last"`.

It should also be noted that this method works with `MultiIndex` as well (using df1 as specified in Paul's example):

```
>>> %timeit df1.groupby(level=df1.index.names).last()
1000 loops, best of 3: 771 µs per loop
>>> %timeit df1[~df1.index.duplicated(keep="last")]
1000 loops, best of 3: 365 µs per loop
```
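To make the effect of `keep` concrete, here is a small sketch on an invented frame with one repeated index label.

```python
import pandas as pd

# Invented frame: the label "a" appears twice in the index.
df3 = pd.DataFrame({"v": [1, 2, 3, 4]}, index=["a", "b", "a", "c"])

# index.duplicated() returns a boolean mask; ~ inverts it so that
# only the rows not flagged as duplicates are kept.
first = df3[~df3.index.duplicated(keep="first")]
last = df3[~df3.index.duplicated(keep="last")]

print(first)
print(last)
```

With `keep="first"` the earlier `a` row survives; with `keep="last"` the later one does, and the remaining rows keep their original order.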

`df.to_numpy()` is better than `df.values`, here's why. It's time to deprecate your usage of `values` and `as_matrix()`.

pandas `v0.24.0` introduced two new methods for obtaining NumPy arrays from pandas objects:

- `to_numpy()`, which is defined on `Index`, `Series`, and `DataFrame` objects, and
- `array`, which is defined on `Index` and `Series` objects only.

If you visit the v0.24 docs for `.values`, you will see a big red warning that says:

Warning: We recommend using `DataFrame.to_numpy()` instead.

See this section of the v0.24.0 release notes, and this answer for more information.

\* `to_numpy()` is my recommended method for any production code that needs to run reliably for many versions into the future. However, if you're just making a scratchpad in jupyter or the terminal, using `.values` to save a few milliseconds of typing is a permissible exception. You can always add the fit and finish later.

`to_numpy()`

In the spirit of better consistency throughout the API, a new method `to_numpy` has been introduced to extract the underlying NumPy array from DataFrames.

```
# Setup
df = pd.DataFrame(data={"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]},
index=["a", "b", "c"])
# Convert the entire DataFrame
df.to_numpy()
# array([[1, 4, 7],
# [2, 5, 8],
# [3, 6, 9]])
# Convert specific columns
df[["A", "C"]].to_numpy()
# array([[1, 7],
# [2, 8],
# [3, 9]])
```

As mentioned above, this method is also defined on `Index` and `Series` objects (see here).

```
df.index.to_numpy()
# array(["a", "b", "c"], dtype=object)
df["A"].to_numpy()
# array([1, 2, 3])
```

By default, a view is returned, so any modifications made will affect the original.

```
v = df.to_numpy()
v[0, 0] = -1
df
A B C
a -1 4 7
b 2 5 8
c 3 6 9
```

If you need a copy instead, use `to_numpy(copy=True)`.

If you're using pandas 1.x, chances are you'll be dealing with extension types a lot more. You'll have to be a little more careful that these extension types are correctly converted.

```
a = pd.array([1, 2, None], dtype="Int64")
a
<IntegerArray>
[1, 2, <NA>]
Length: 3, dtype: Int64
# Wrong
a.to_numpy()
# array([1, 2, <NA>], dtype=object) # yuck, objects
# Correct
a.to_numpy(dtype="float", na_value=np.nan)
# array([ 1., 2., nan])
# Also correct
a.to_numpy(dtype="int", na_value=-1)
# array([ 1, 2, -1])
```

This is called out in the docs.

If you need the `dtypes` in the result, `DataFrame.to_records` is a good way to do this, as shown in another answer.

```
df.to_records()
# rec.array([("a", 1, 4, 7), ("b", 2, 5, 8), ("c", 3, 6, 9)],
# dtype=[("index", "O"), ("A", "<i8"), ("B", "<i8"), ("C", "<i8")])
```

This cannot be done with `to_numpy`, unfortunately. However, as an alternative, you can use `np.rec.fromrecords`:

```
v = df.reset_index()
np.rec.fromrecords(v, names=v.columns.tolist())
# rec.array([("a", 1, 4, 7), ("b", 2, 5, 8), ("c", 3, 6, 9)],
# dtype=[("index", "<U1"), ("A", "<i8"), ("B", "<i8"), ("C", "<i8")])
```

Performance wise, it's nearly the same (actually, using `rec.fromrecords` is a bit faster).

```
df2 = pd.concat([df] * 10000)
%timeit df2.to_records()
%%timeit
v = df2.reset_index()
np.rec.fromrecords(v, names=v.columns.tolist())
12.9 ms ± 511 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
9.56 ms ± 291 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

`to_numpy()` (in addition to `array`) was added as a result of discussions under two GitHub issues, GH19954 and GH23623.

Specifically, the docs mention the rationale:

[...] with `.values` it was unclear whether the returned value would be the actual array, some transformation of it, or one of pandas custom arrays (like `Categorical`). For example, with `PeriodIndex`, `.values` generates a new `ndarray` of period objects each time. [...]

`to_numpy` aims to improve the consistency of the API, which is a major step in the right direction. `.values` will not be deprecated in the current version, but I expect this may happen at some point in the future, so I would urge users to migrate towards the newer API as soon as they can.

`DataFrame.values` has inconsistent behaviour, as already noted.

`DataFrame.get_values()` is simply a wrapper around `DataFrame.values`, so everything said above applies.

`DataFrame.as_matrix()` is deprecated now, do **NOT** use!

You can use the `DataFrame` constructor with `lists` created by `to_list`:

```
import pandas as pd
d1 = {"teams": [["SF", "NYG"],["SF", "NYG"],["SF", "NYG"],
["SF", "NYG"],["SF", "NYG"],["SF", "NYG"],["SF", "NYG"]]}
df2 = pd.DataFrame(d1)
print (df2)
teams
0 [SF, NYG]
1 [SF, NYG]
2 [SF, NYG]
3 [SF, NYG]
4 [SF, NYG]
5 [SF, NYG]
6 [SF, NYG]
```

```
df2[["team1","team2"]] = pd.DataFrame(df2.teams.tolist(), index= df2.index)
print (df2)
teams team1 team2
0 [SF, NYG] SF NYG
1 [SF, NYG] SF NYG
2 [SF, NYG] SF NYG
3 [SF, NYG] SF NYG
4 [SF, NYG] SF NYG
5 [SF, NYG] SF NYG
6 [SF, NYG] SF NYG
```

And for a new `DataFrame`:

```
df3 = pd.DataFrame(df2["teams"].to_list(), columns=["team1","team2"])
print (df3)
team1 team2
0 SF NYG
1 SF NYG
2 SF NYG
3 SF NYG
4 SF NYG
5 SF NYG
6 SF NYG
```

A solution with `apply(pd.Series)` is very slow:

```
#7k rows
df2 = pd.concat([df2]*1000).reset_index(drop=True)
In [121]: %timeit df2["teams"].apply(pd.Series)
1.79 s ± 52.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [122]: %timeit pd.DataFrame(df2["teams"].to_list(), columns=["team1","team2"])
1.63 ms ± 54.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

**UPDATE**

From v0.20, `melt` is available directly as a `DataFrame` method, so you can now use:

```
df.melt(id_vars=["location", "name"],
var_name="Date",
value_name="Value")
location name Date Value
0 A "test" Jan-2010 12
1 B "foo" Jan-2010 18
2 A "test" Feb-2010 20
3 B "foo" Feb-2010 20
4 A "test" March-2010 30
5 B "foo" March-2010 25
```

**OLD(ER) VERSIONS: <0.20**

You can use `pd.melt` to get most of the way there, and then sort:

```
>>> df
location name Jan-2010 Feb-2010 March-2010
0 A test 12 20 30
1 B foo 18 20 25
>>> df2 = pd.melt(df, id_vars=["location", "name"],
var_name="Date", value_name="Value")
>>> df2
location name Date Value
0 A test Jan-2010 12
1 B foo Jan-2010 18
2 A test Feb-2010 20
3 B foo Feb-2010 20
4 A test March-2010 30
5 B foo March-2010 25
>>> df2 = df2.sort(["location", "name"])
>>> df2
location name Date Value
0 A test Jan-2010 12
2 A test Feb-2010 20
4 A test March-2010 30
1 B foo Jan-2010 18
3 B foo Feb-2010 20
5 B foo March-2010 25
```

(Might want to throw in a `.reset_index(drop=True)`, just to keep the output clean.)

**Note**: `pd.DataFrame.sort` has been deprecated in favour of `pd.DataFrame.sort_values`.
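Putting the pieces together for current pandas versions, a sketch of the whole reshape might look like this. It rebuilds the sample frame from the answer and combines `melt`, `sort_values` and `reset_index(drop=True)`; the variable name `long_df` is an illustrative choice.

```python
import pandas as pd

# Sample frame from the answer above.
df = pd.DataFrame({
    "location": ["A", "B"],
    "name": ["test", "foo"],
    "Jan-2010": [12, 18],
    "Feb-2010": [20, 20],
    "March-2010": [30, 25],
})

long_df = (
    df.melt(id_vars=["location", "name"], var_name="Date", value_name="Value")
      .sort_values(["location", "name"])   # replaces the removed .sort()
      .reset_index(drop=True)              # tidy 0..n-1 index after sorting
)
print(long_df)
```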

I know `object` dtype columns make the data hard to convert with `pandas` functions. When I received data like this, the first thing that came to mind was to "flatten" or unnest the columns.

I am using `pandas` and `python` functions for this type of question. If you are worried about the speed of the above solutions, check user3483203's answer, since it's using `numpy` and most of the time `numpy` is faster. I recommend `Cython` and `numba` if speed matters.

**Method 0 [pandas >= 0.25]**

Starting from pandas 0.25, if you only need to explode *one* column, you can use the `pandas.DataFrame.explode` function:

```
df.explode("B")
A B
0 1 1
1 1 2
0 2 1
1 2 2
```

Consider a dataframe with an empty `list` or a `NaN` in the column. An empty list will not cause an issue, but a `NaN` will need to be filled with a `list`:
```
df = pd.DataFrame({"A": [1, 2, 3, 4],"B": [[1, 2], [1, 2], [], np.nan]})
df.B = df.B.fillna({i: [] for i in df.index}) # replace NaN with []
df.explode("B")
A B
0 1 1
0 1 2
1 2 1
1 2 2
2 3 NaN
3 4 NaN
```

**Method 1**

**apply + pd.Series** (easy to understand, but not recommended in terms of performance)

```
df.set_index("A").B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0:"B"})
Out[463]:
A B
0 1 1
1 1 2
0 2 1
1 2 2
```

**Method 2**

Using `repeat` with the `DataFrame` constructor, re-create your dataframe (good performance, but not well suited to multiple columns):

```
df=pd.DataFrame({"A":df.A.repeat(df.B.str.len()),"B":np.concatenate(df.B.values)})
df
Out[465]:
A B
0 1 1
0 1 2
1 2 1
1 2 2
```

**Method 2.1**

For example, suppose that besides A we have columns A.1, ..., A.n. If we still use **Method 2** above, it is hard to re-create the columns one by one.

Solution: `join` or `merge` with the `index` after unnesting the single column:

```
s=pd.DataFrame({"B":np.concatenate(df.B.values)},index=df.index.repeat(df.B.str.len()))
s.join(df.drop(columns="B"), how="left")
Out[477]:
B A
0 1 1
0 2 1
1 1 2
1 2 2
```

If you need the column order exactly the same as before, add `reindex` at the end.

```
s.join(df.drop(columns="B"), how="left").reindex(columns=df.columns)
```

**Method 3**

recreate the `list`

```
pd.DataFrame([[x] + [z] for x, y in df.values for z in y],columns=df.columns)
Out[488]:
A B
0 1 1
1 1 2
2 2 1
3 2 2
```

If more than two columns, use

```
s=pd.DataFrame([[x] + [z] for x, y in zip(df.index,df.B) for z in y])
s.merge(df,left_on=0,right_index=True)
Out[491]:
0 1 A B
0 0 1 1 [1, 2]
1 0 2 1 [1, 2]
2 1 1 2 [1, 2]
3 1 2 2 [1, 2]
```

**Method 4**

Using `reindex` or `loc`:

```
df.reindex(df.index.repeat(df.B.str.len())).assign(B=np.concatenate(df.B.values))
Out[554]:
A B
0 1 1
0 1 2
1 2 1
1 2 2
#df.loc[df.index.repeat(df.B.str.len())].assign(B=np.concatenate(df.B.values))
```

**Method 5**

when the list only contains unique values:

```
df=pd.DataFrame({"A":[1,2],"B":[[1,2],[3,4]]})
from collections import ChainMap
d = dict(ChainMap(*map(dict.fromkeys, df["B"], df["A"])))
pd.DataFrame(list(d.items()),columns=df.columns[::-1])
Out[574]:
B A
0 1 1
1 2 1
2 3 2
3 4 2
```

**Method 6**

Using `numpy` for high performance:

```
newvalues=np.dstack((np.repeat(df.A.values,list(map(len,df.B.values))),np.concatenate(df.B.values)))
pd.DataFrame(data=newvalues[0],columns=df.columns)
A B
0 1 1
1 1 2
2 2 1
3 2 2
```

**Method 7**

Using the `itertools` functions `cycle` and `chain`: a pure python solution, just for fun.

```
from itertools import cycle,chain
l=df.values.tolist()
l1=[list(zip([x[0]], cycle(x[1])) if len([x[0]]) > len(x[1]) else list(zip(cycle([x[0]]), x[1]))) for x in l]
pd.DataFrame(list(chain.from_iterable(l1)),columns=df.columns)
A B
0 1 1
1 1 2
2 2 1
3 2 2
```

**Generalizing to multiple columns**

```
df=pd.DataFrame({"A":[1,2],"B":[[1,2],[3,4]],"C":[[1,2],[3,4]]})
df
Out[592]:
A B C
0 1 [1, 2] [1, 2]
1 2 [3, 4] [3, 4]
```

Self-def function:

```
def unnesting(df, explode):
    # Repeat each index label once per list element.
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([
        pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx
    return df1.join(df.drop(columns=explode), how="left")

unnesting(df,["B","C"])
Out[609]:
B C A
0 1 1 1
0 2 2 1
1 3 3 2
1 4 4 2
```

All of the methods above deal with *vertical* unnesting (explode). If you need to expand the list horizontally instead, use the `pd.DataFrame` constructor:

```
df.join(pd.DataFrame(df.B.tolist(),index=df.index).add_prefix("B_"))
Out[33]:
A B C B_0 B_1
0 1 [1, 2] [1, 2] 1 2
1 2 [3, 4] [3, 4] 3 4
```

**Updated function**

```
def unnesting(df, explode, axis):
    if axis == 1:
        # Vertical unnesting: one row per list element.
        idx = df.index.repeat(df[explode[0]].str.len())
        df1 = pd.concat([
            pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
        df1.index = idx
        return df1.join(df.drop(columns=explode), how="left")
    else:
        # Horizontal unnesting: one column per list position.
        df1 = pd.concat([
            pd.DataFrame(df[x].tolist(), index=df.index).add_prefix(x) for x in explode], axis=1)
        return df1.join(df.drop(columns=explode), how="left")
```

Test Output

```
unnesting(df, ["B","C"], axis=0)
Out[36]:
B0 B1 C0 C1 A
0 1 2 1 2 1
1 3 4 3 4 2
```

Update 2021-02-17 with original explode function

```
def unnesting(df, explode, axis):
    if axis == 1:
        # Vertical unnesting with the built-in Series.explode.
        df1 = pd.concat([df[x].explode() for x in explode], axis=1)
        return df1.join(df.drop(columns=explode), how="left")
    else:
        # Horizontal unnesting: one column per list position.
        df1 = pd.concat([
            pd.DataFrame(df[x].tolist(), index=df.index).add_prefix(x) for x in explode], axis=1)
        return df1.join(df.drop(columns=explode), how="left")
```
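On newer pandas versions the helper may not be needed for the vertical case. The sketch below assumes pandas 1.3 or later, where `DataFrame.explode` accepts a list of columns (the lists in each row must have matching lengths), and adds `reset_index(drop=True)` to replace the repeated labels.

```python
import pandas as pd

# Same multi-column example as above.
df = pd.DataFrame({"A": [1, 2], "B": [[1, 2], [3, 4]], "C": [[1, 2], [3, 4]]})

# Assumes pandas >= 1.3: explode several list columns in one call,
# then swap the repeated index labels for 0..n-1.
out = df.explode(["B", "C"]).reset_index(drop=True)
print(out)
```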

You can `groupby` on cols "A" and "B" and call `size` and then `reset_index` and `rename` the generated column:

```
In [26]:
df1.groupby(["A","B"]).size().reset_index().rename(columns={0:"count"})
Out[26]:
A B count
0 no no 1
1 no yes 2
2 yes no 4
3 yes yes 3
```

**update**

A little explanation: grouping on the 2 columns groups rows where the A and B values are the same. We then call `size`, which returns the number of rows in each group:

```
In[202]:
df1.groupby(["A","B"]).size()
Out[202]:
A B
no no 1
yes 2
yes no 4
yes 3
dtype: int64
```

So now to restore the grouped columns, we call `reset_index`:

```
In[203]:
df1.groupby(["A","B"]).size().reset_index()
Out[203]:
A B 0
0 no no 1
1 no yes 2
2 yes no 4
3 yes yes 3
```

This restores the indices but the size aggregation is turned into a generated column `0`, so we have to rename this:

```
In[204]:
df1.groupby(["A","B"]).size().reset_index().rename(columns={0:"count"})
Out[204]:
A B count
0 no no 1
1 no yes 2
2 yes no 4
3 yes yes 3
```

`groupby` does accept the arg `as_index`, which we could have set to `False` so it doesn't make the grouped columns the index, but in older pandas versions this still generates a `Series` and you'd still have to restore the indices and so on:

```
In[205]:
df1.groupby(["A","B"], as_index=False).size()
Out[205]:
A B
no no 1
yes 2
yes no 4
yes 3
dtype: int64
```
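The rename step can also be folded into the reset. As a sketch, `Series.reset_index` takes a `name` argument for the values column; the frame below is invented so that it reproduces the group sizes shown in the output above.

```python
import pandas as pd

# Invented frame whose group sizes match the output shown above.
df1 = pd.DataFrame({
    "A": ["no", "no", "no", "yes", "yes", "yes", "yes", "yes", "yes", "yes"],
    "B": ["no", "yes", "yes", "no", "no", "no", "no", "yes", "yes", "yes"],
})

# size() returns a Series indexed by (A, B); name= labels the values
# column directly, so no separate rename(columns={0: "count"}) is needed.
counts = df1.groupby(["A", "B"]).size().reset_index(name="count")
print(counts)
```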

You can *reset* the index using `reset_index` to get back a default index of 0, 1, 2, ..., n-1 (and use `drop=True` to indicate you want to drop the existing index instead of adding it as an additional column to your dataframe):

```
In [19]: df2 = df2.reset_index(drop=True)
In [20]: df2
Out[20]:
x y
0 0 0
1 0 1
2 0 2
3 1 0
4 1 1
5 1 2
6 2 0
7 2 1
8 2 2
```

TLDR; No, `for` loops are not blanket "bad", at least, not always. It is probably **more accurate to say that some vectorized operations are slower than iterating**, versus saying that iteration is faster than some vectorized operations. Knowing when and why is key to getting the most performance out of your code. In a nutshell, these are the situations where it is worth considering an alternative to vectorized pandas functions:

- When your data is small (...depending on what you're doing),
- When dealing with `object`/mixed dtypes
- When using the `str`/regex accessor functions

Let's examine these situations individually.

Pandas follows a "Convention Over Configuration" approach in its API design. This means that the same API has been fitted to cater to a broad range of data and use cases.

When a pandas function is called, the following things (among others) must internally be handled by the function, to ensure working

- Index/axis alignment
- Handling mixed datatypes
- Handling missing data

Almost every function will have to deal with these to varying extents, and this presents an **overhead**. The overhead is less for numeric functions (for example, `Series.add`), while it is more pronounced for string functions (for example, `Series.str.replace`).

`for` loops, on the other hand, are faster than you think. What's even better is that list comprehensions (which create lists through `for` loops) are even faster, as they are optimized iterative mechanisms for list creation.

List comprehensions follow the pattern

```
[f(x) for x in seq]
```

Where `seq` is a pandas series or DataFrame column. Or, when operating over multiple columns,

```
[f(x, y) for x, y in zip(seq1, seq2)]
```

Where `seq1` and `seq2` are columns.

**Numeric Comparison**

Consider a simple boolean indexing operation. The list comprehension method has been timed against `Series.ne` (`!=`) and `query`. Here are the functions:

```
# Boolean indexing with Numeric value comparison.
df[df.A != df.B] # vectorized !=
df.query("A != B") # query (numexpr)
df[[x != y for x, y in zip(df.A, df.B)]] # list comp
```

For simplicity, I have used the `perfplot` package to run all the timeit tests in this post. The timings for the operations above are below:

The list comprehension outperforms `query` for moderately sized N, and even outperforms the vectorized not equals comparison for tiny N. Unfortunately, the list comprehension scales linearly, so it does not offer much performance gain for larger N.

Note

It is worth mentioning that much of the benefit of the list comprehension comes from not having to worry about index alignment, but this means that code that depends on index alignment will break. In some cases, vectorised operations over the underlying NumPy arrays can be considered as bringing in the "best of both worlds", allowing for vectorisation without all the unneeded overhead of the pandas functions. This means that you can rewrite the operation above as `df[df.A.values != df.B.values]`, which outperforms both the pandas and list comprehension equivalents.

NumPy vectorization is out of the scope of this post, but it is definitely worth considering, if performance matters.
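The alignment caveat is easy to overlook, so here is a small sketch of what it means. The two Series are invented, and the example uses addition rather than `!=` purely to keep the illustration simple.

```python
import pandas as pd

# Invented Series with the same labels in a different order.
s1 = pd.Series([1, 2, 3], index=[0, 1, 2])
s2 = pd.Series([10, 20, 30], index=[2, 1, 0])

aligned = s1 + s2                              # pandas pairs values by label
positional = [x + y for x, y in zip(s1, s2)]   # list comp pairs by position
raw = s1.values + s2.values                    # NumPy arrays pair by position too

print(aligned.to_dict())
print(positional)
print(raw.tolist())
```

When the two indexes are identical, the three results agree; when they are not, only the pandas operation respects the labels.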

**Value Counts**

Taking another example, this time with another vanilla python construct that is *faster* than a for loop: `collections.Counter`. A common requirement is to compute the value counts and return the result as a dictionary. This is done with `value_counts`, `np.unique`, and `Counter`:

```
# Value Counts comparison.
ser.value_counts(sort=False).to_dict() # value_counts
dict(zip(*np.unique(ser, return_counts=True))) # np.unique
Counter(ser) # Counter
```

The results are more pronounced: `Counter` wins out over both vectorized methods for a larger range of small N (~3500).

Note

More trivia (courtesy @user2357112). The `Counter` is implemented with a C accelerator, so while it still has to work with python objects instead of the underlying C datatypes, it is still faster than a `for` loop. Python power!

Of course, the takeaway from here is that the performance depends on your data and use case. The point of these examples is to convince you not to rule out these solutions as legitimate options. If these still don't give you the performance you need, there is always cython and numba. Let's add this test into the mix.

```
from numba import njit, prange
@njit(parallel=True)
def get_mask(x, y):
    result = [False] * len(x)
    for i in prange(len(x)):
        result[i] = x[i] != y[i]
    return np.array(result)

df[get_mask(df.A.values, df.B.values)] # numba
```

Numba offers JIT compilation of loopy python code to very powerful vectorized code. Understanding how to make numba work involves a learning curve.

`object` dtypes

**String-based Comparison**

Revisiting the filtering example from the first section, what if the columns being compared are strings? Consider the same 3 functions above, but with the input DataFrame cast to string.

```
# Boolean indexing with string value comparison.
df[df.A != df.B] # vectorized !=
df.query("A != B") # query (numexpr)
df[[x != y for x, y in zip(df.A, df.B)]] # list comp
```

So, what changed? The thing to note here is that **string operations are inherently difficult to vectorize.** Pandas treats strings as objects, and all operations on objects fall back to a slow, loopy implementation.

Now, because this loopy implementation is surrounded by all the overhead mentioned above, there is a constant magnitude difference between these solutions, even though they scale the same.

When it comes to operations on mutable/complex objects, there is no comparison. List comprehension outperforms all operations involving dicts and lists.

**Accessing Dictionary Value(s) by Key**

Here are timings for two operations that extract a value from a column of dictionaries: `map` and the list comprehension. The setup is in the Appendix, under the heading "Code Snippets".

```
# Dictionary value extraction.
ser.map(operator.itemgetter("value")) # map
pd.Series([x.get("value") for x in ser]) # list comprehension
```

**Positional List Indexing**

Timings for 3 operations that extract the 0th element from a column of lists (handling exceptions): `map`, the `str.get` accessor method, and the list comprehension:

```
# List positional indexing.
def get_0th(lst):
    try:
        return lst[0]
    # Handle empty lists and NaNs gracefully.
    except (IndexError, TypeError):
        return np.nan
```

```
ser.map(get_0th) # map
ser.str[0] # str accessor
pd.Series([x[0] if len(x) > 0 else np.nan for x in ser]) # list comp
pd.Series([get_0th(x) for x in ser]) # list comp safe
```

Note

If the index matters, you would want to do `pd.Series([...], index=ser.index)` when reconstructing the series.
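A short sketch of why the `index=` argument matters, using an invented Series of lists with string labels:

```python
import numpy as np
import pandas as pd

# Invented Series of lists with a non-default index.
ser = pd.Series([[1, 2], [], [3]], index=["x", "y", "z"])

# Without index=ser.index the result would be labelled 0, 1, 2
# and would no longer line up with ser.
firsts = pd.Series(
    [x[0] if len(x) > 0 else np.nan for x in ser],
    index=ser.index,
)
print(firsts)
```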

**List Flattening**

A final example is flattening lists. This is another common problem, and demonstrates just how powerful pure python is here.

```
# Nested list flattening.
pd.DataFrame(ser.tolist()).stack().reset_index(drop=True) # stack
pd.Series(list(chain.from_iterable(ser.tolist()))) # itertools.chain
pd.Series([y for x in ser for y in x]) # nested list comp
```

Both `itertools.chain.from_iterable` and the nested list comprehension are pure python constructs, and scale much better than the `stack` solution.

These timings are a strong indication of the fact that pandas is not equipped to work with mixed dtypes, and that you should probably refrain from using it to do so. Wherever possible, data should be present as scalar values (ints/floats/strings) in separate columns.

Lastly, the applicability of these solutions depends widely on your data. So, the best thing to do would be to test these operations on your data before deciding what to go with. Notice how I have not timed `apply` on these solutions, because it would skew the graph (yes, it's that slow).

`.str` Accessor Methods

Pandas can apply regex operations such as `str.contains`, `str.extract`, and `str.extractall`, as well as other "vectorized" string operations (such as `str.split`, `str.find`, `str.translate`, and so on) on string columns. These functions are slower than list comprehensions, and are meant to be more convenience functions than anything else.

It is usually much faster to pre-compile a regex pattern with `re.compile` and iterate over your data (also see Is it worth using Python's re.compile?). The list comp equivalent to `str.contains` looks something like this:

```
p = re.compile(...)
ser2 = pd.Series([x for x in ser if p.search(x)])
```

Or,

```
ser2 = ser[[bool(p.search(x)) for x in ser]]
```

If you need to handle NaNs, you can do something like

```
ser[[bool(p.search(x)) if pd.notnull(x) else False for x in ser]]
```

The list comp equivalent to `str.extract` (without groups) will look something like:

```
df["col2"] = [p.search(x).group(0) for x in df["col"]]
```

If you need to handle no-matches and NaNs, you can use a custom function (still faster!):

```
def matcher(x):
    m = p.search(str(x))
    if m:
        return m.group(0)
    return np.nan

df["col2"] = [matcher(x) for x in df["col"]]
```

The `matcher` function is very extensible. It can be fitted to return a list for each capture group, as needed. Just query the `group` or `groups` method of the match object.

For `str.extractall`, change `p.search` to `p.findall`.
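As a rough sketch of the `findall` variant, with an invented pattern and data: note that this yields one plain list of matches per row, whereas `str.extractall` returns a `DataFrame` with a `MultiIndex`, so the two are not drop-in replacements for each other.

```python
import re
import pandas as pd

# Invented pattern and data for illustration.
p = re.compile(r"\d+")
ser = pd.Series(["a1b22", "none", "333"])

# One list of matches per row; rows without a match get an empty list.
all_matches = [p.findall(x) for x in ser]
print(all_matches)
```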

**String Extraction**

Consider a simple filtering operation. The idea is to extract 4 digits if it is preceded by an upper case letter.

```
# Extracting strings.
p = re.compile(r"(?<=[A-Z])(\d{4})")
def matcher(x):
    m = p.search(x)
    if m:
        return m.group(0)
    return np.nan

ser.str.extract(r"(?<=[A-Z])(\d{4})", expand=False) # str.extract
pd.Series([matcher(x) for x in ser]) # list comprehension
```

**More Examples**

Full disclosure - I am the author (in part or whole) of these posts listed below.

As shown from the examples above, iteration shines when working with small rows of DataFrames, mixed datatypes, and regular expressions.

The speedup you get depends on your data and your problem, so your mileage may vary. The best thing to do is to carefully run tests and see if the payout is worth the effort.

The "vectorized" functions shine in their simplicity and readability, so if performance is not critical, you should definitely prefer those.

Another side note, certain string operations deal with constraints that favour the use of NumPy. Here are two examples where careful NumPy vectorization outperforms python:

Create new column with incremental values in a faster and efficient way - Answer by Divakar

Fast punctuation removal with pandas - Answer by Paul Panzer

Additionally, sometimes just operating on the underlying arrays via `.values` as opposed to on the Series or DataFrames can offer a healthy enough speedup for most usual scenarios (see the **Note** in the **Numeric Comparison** section above). So, for example, `df[df.A.values != df.B.values]` would show instant performance boosts over `df[df.A != df.B]`. Using `.values` may not be appropriate in every situation, but it is a useful hack to know.
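A minimal sketch of that trick on a made-up frame (I use `to_numpy()` here, which for plain numeric columns gives the same array as `.values`):

```python
import pandas as pd

# Hypothetical data, just to show the two masks select the same rows.
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [1, 0, 3, 0]})

via_pandas = df[df.A != df.B]                        # Series comparison
via_numpy = df[df.A.to_numpy() != df.B.to_numpy()]   # raw array comparison
print(via_numpy)
```

Both columns come from the same frame, so skipping index alignment is safe here; that would not hold for two Series with different indexes.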

As mentioned above, it's up to you to decide whether these solutions are worth the trouble of implementing.

```
import perfplot
import operator
import pandas as pd
import numpy as np
import re
from collections import Counter
from itertools import chain
```

```
# Boolean indexing with Numeric value comparison.
# get_mask is the numba function defined in the Numeric Comparison section.
perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.choice(1000, (n, 2)), columns=["A","B"]),
    kernels=[
        lambda df: df[df.A != df.B],
        lambda df: df.query("A != B"),
        lambda df: df[[x != y for x, y in zip(df.A, df.B)]],
        lambda df: df[get_mask(df.A.values, df.B.values)]
    ],
    labels=["vectorized !=", "query (numexpr)", "list comp", "numba"],
    n_range=[2**k for k in range(0, 15)],
    xlabel="N"
)
```

```
# Value Counts comparison.
perfplot.show(
    setup=lambda n: pd.Series(np.random.choice(1000, n)),
    kernels=[
        lambda ser: ser.value_counts(sort=False).to_dict(),
        lambda ser: dict(zip(*np.unique(ser, return_counts=True))),
        lambda ser: Counter(ser),
    ],
    labels=["value_counts", "np.unique", "Counter"],
    n_range=[2**k for k in range(0, 15)],
    xlabel="N",
    equality_check=lambda x, y: dict(x) == dict(y)
)
```

```
# Boolean indexing with string value comparison.
perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.choice(1000, (n, 2)), columns=["A","B"], dtype=str),
    kernels=[
        lambda df: df[df.A != df.B],
        lambda df: df.query("A != B"),
        lambda df: df[[x != y for x, y in zip(df.A, df.B)]],
    ],
    labels=["vectorized !=", "query (numexpr)", "list comp"],
    n_range=[2**k for k in range(0, 15)],
    xlabel="N",
    equality_check=None
)
```

```
# Dictionary value extraction.
ser1 = pd.Series([{"key": "abc", "value": 123}, {"key": "xyz", "value": 456}])
perfplot.show(
    setup=lambda n: pd.concat([ser1] * n, ignore_index=True),
    kernels=[
        lambda ser: ser.map(operator.itemgetter("value")),
        lambda ser: pd.Series([x.get("value") for x in ser]),
    ],
    labels=["map", "list comprehension"],
    n_range=[2**k for k in range(0, 15)],
    xlabel="N",
    equality_check=None
)
```

```
# List positional indexing.
# get_0th is defined in the Positional List Indexing section.
ser2 = pd.Series([["a", "b", "c"], [1, 2], []])
perfplot.show(
    setup=lambda n: pd.concat([ser2] * n, ignore_index=True),
    kernels=[
        lambda ser: ser.map(get_0th),
        lambda ser: ser.str[0],
        lambda ser: pd.Series([x[0] if len(x) > 0 else np.nan for x in ser]),
        lambda ser: pd.Series([get_0th(x) for x in ser]),
    ],
    labels=["map", "str accessor", "list comprehension", "list comp safe"],
    n_range=[2**k for k in range(0, 15)],
    xlabel="N",
    equality_check=None
)
```

```
# Nested list flattening.
ser3 = pd.Series([["a", "b", "c"], ["d", "e"], ["f", "g"]])
perfplot.show(
    # Use ser3 (defined above) rather than ser2 from the previous snippet.
    setup=lambda n: pd.concat([ser3] * n, ignore_index=True),
    kernels=[
        lambda ser: pd.DataFrame(ser.tolist()).stack().reset_index(drop=True),
        lambda ser: pd.Series(list(chain.from_iterable(ser.tolist()))),
        lambda ser: pd.Series([y for x in ser for y in x]),
    ],
    labels=["stack", "itertools.chain", "nested list comp"],
    n_range=[2**k for k in range(0, 15)],
    xlabel="N",
    equality_check=None
)
```

```
# Extracting strings.
# matcher is defined in the String Extraction section.
ser4 = pd.Series(["foo xyz", "test A1234", "D3345 xtz"])
perfplot.show(
    setup=lambda n: pd.concat([ser4] * n, ignore_index=True),
    kernels=[
        lambda ser: ser.str.extract(r"(?<=[A-Z])(\d{4})", expand=False),
        lambda ser: pd.Series([matcher(x) for x in ser])
    ],
    labels=["str.extract", "list comprehension"],
    n_range=[2**k for k in range(0, 15)],
    xlabel="N",
    equality_check=None
)
```

I have a dataframe from which I remove some rows. As a result, I get a dataframe in which index is something like that: `[1,5,6,10,11]`

and I would like to reset it to `[0,1,2,3,4]`

. How can I do it?

The following seems to work:

```
df = df.reset_index()
del df["index"]
```

The following does not work:

```
df = df.reindex()
```
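For what it's worth, `reset_index` takes a `drop` argument that avoids the extra `del` step. A small sketch with a made-up frame using the index from the question:

```python
import pandas as pd

# Hypothetical frame whose index has gaps after rows were removed.
df = pd.DataFrame({"x": list("abcde")}, index=[1, 5, 6, 10, 11])

# drop=True discards the old index instead of inserting it as a column.
df = df.reset_index(drop=True)
print(df.index.tolist())
```

`reindex()` called with no arguments has nothing to conform the frame to, so it leaves the index as it was, which is why it appears not to work.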

The simplest way to get row counts per group is by calling `.size()`, which returns a `Series`:

```
df.groupby(["col1","col2"]).size()
```

Usually you want this result as a `DataFrame` (instead of a `Series`) so you can do:

```
df.groupby(["col1", "col2"]).size().reset_index(name="counts")
```

If you want to find out how to calculate the row counts and other statistics for each group continue reading below.

Consider the following example dataframe:

```
In [2]: df
Out[2]:
col1 col2 col3 col4 col5 col6
0 A B 0.20 -0.61 -0.49 1.49
1 A B -1.53 -1.01 -0.39 1.82
2 A B -0.44 0.27 0.72 0.11
3 A B 0.28 -1.32 0.38 0.18
4 C D 0.12 0.59 0.81 0.66
5 C D -0.13 -1.65 -1.64 0.50
6 C D -1.42 -0.11 -0.18 -0.44
7 E F -0.00 1.42 -0.26 1.17
8 E F 0.91 -0.47 1.35 -0.34
9 G H 1.48 -0.63 -1.14 0.17
```

First let's use `.size()` to get the row counts:

```
In [3]: df.groupby(["col1", "col2"]).size()
Out[3]:
col1 col2
A B 4
C D 3
E F 2
G H 1
dtype: int64
```

Then let's use `.size().reset_index(name="counts")` to get the row counts as a `DataFrame`:

```
In [4]: df.groupby(["col1", "col2"]).size().reset_index(name="counts")
Out[4]:
col1 col2 counts
0 A B 4
1 C D 3
2 E F 2
3 G H 1
```

When you want to calculate statistics on grouped data, it usually looks like this:

```
In [5]: (df
...: .groupby(["col1", "col2"])
...: .agg({
...: "col3": ["mean", "count"],
...: "col4": ["median", "min", "count"]
...: }))
Out[5]:
col4 col3
median min count mean count
col1 col2
A B -0.810 -1.32 4 -0.372500 4
C D -0.110 -1.65 3 -0.476667 3
E F 0.475 -0.47 2 0.455000 2
G H -0.630 -0.63 1 1.480000 1
```

The result above is a little annoying to deal with because of the nested column labels, and also because row counts are on a per column basis.

To gain more control over the output I usually split the statistics into individual aggregations that I then combine using `join`. It looks like this:

```
In [6]: gb = df.groupby(["col1", "col2"])
...: counts = gb.size().to_frame(name="counts")
...: (counts
...: .join(gb.agg({"col3": "mean"}).rename(columns={"col3": "col3_mean"}))
...: .join(gb.agg({"col4": "median"}).rename(columns={"col4": "col4_median"}))
...: .join(gb.agg({"col4": "min"}).rename(columns={"col4": "col4_min"}))
...: .reset_index()
...: )
...:
Out[6]:
col1 col2 counts col3_mean col4_median col4_min
0 A B 4 -0.372500 -0.810 -1.32
1 C D 3 -0.476667 -0.110 -1.65
2 E F 2 0.455000 0.475 -0.47
3 G H 1 1.480000 -0.630 -0.63
```

The code used to generate the test data is shown below:

```
In [1]: import numpy as np
...: import pandas as pd
...:
...: keys = np.array([
...: ["A", "B"],
...: ["A", "B"],
...: ["A", "B"],
...: ["A", "B"],
...: ["C", "D"],
...: ["C", "D"],
...: ["C", "D"],
...: ["E", "F"],
...: ["E", "F"],
...: ["G", "H"]
...: ])
...:
...: df = pd.DataFrame(
...: np.hstack([keys,np.random.randn(10,4).round(2)]),
...: columns = ["col1", "col2", "col3", "col4", "col5", "col6"]
...: )
...:
...: df[["col3", "col4", "col5", "col6"]] = \
...:     df[["col3", "col4", "col5", "col6"]].astype(float)
...:
```

**Disclaimer:**

If some of the columns that you are aggregating have null values, then you really want to be looking at the group row counts as an independent aggregation for each column. Otherwise you may be misled as to how many records are actually being used to calculate things like the mean because pandas will drop `NaN` entries in the mean calculation without telling you about it.
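A small sketch of that pitfall, on made-up data with one missing value (the column names `key` and `val` are mine):

```python
import numpy as np
import pandas as pd

# Hypothetical data: group "a" has three rows, one of them NaN.
df = pd.DataFrame({
    "key": ["a", "a", "a", "b"],
    "val": [1.0, np.nan, 3.0, 4.0],
})
gb = df.groupby("key")

summary = pd.DataFrame({
    "rows": gb.size(),               # counts every row in the group
    "val_count": gb["val"].count(),  # counts only non-null values
    "val_mean": gb["val"].mean(),    # NaN is skipped silently
})
print(summary)
```

For group `a` the mean is computed from two values even though the group has three rows, which is exactly the discrepancy `size` alone would hide.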

The idiomatic way to do this with Pandas is to use the `.sample` method of your dataframe to sample all rows without replacement:

```
df.sample(frac=1)
```

The `frac` keyword argument specifies the fraction of rows to return in the random sample, so `frac=1` means return all rows (in random order).

**Note:**
If you wish to shuffle your dataframe in-place and reset the index, you could do e.g.

```
df = df.sample(frac=1).reset_index(drop=True)
```

Here, specifying `drop=True` prevents `.reset_index` from creating a column containing the old index entries.
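If you need the shuffle to be reproducible, `sample` also accepts a `random_state`. A minimal sketch on a made-up frame (the seed value 0 is arbitrary):

```python
import pandas as pd

df = pd.DataFrame({"x": range(5)})

# Same seed, same permutation; the index is renumbered 0..n-1 afterwards.
shuffled = df.sample(frac=1, random_state=0).reset_index(drop=True)
print(shuffled)
```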

**Follow-up note:** Although it may not look like the above operation is *in-place*, python/pandas is smart enough not to do another malloc for the shuffled object. That is, even though the *reference* object has changed (by which I mean `id(df_old)` is not the same as `id(df_new)`), the underlying C object is still the same. To show that this is indeed the case, you could run a simple memory profiler:

```
$ python3 -m memory_profiler .\test.py
Filename: .\test.py
Line # Mem usage Increment Line Contents
================================================
5 68.5 MiB 68.5 MiB @profile
6 def shuffle():
7 847.8 MiB 779.3 MiB df = pd.DataFrame(np.random.randn(100, 1000000))
8 847.9 MiB 0.1 MiB df = df.sample(frac=1).reset_index(drop=True)
```

I would suggest using the duplicated method on the Pandas Index itself:

```
df3 = df3[~df3.index.duplicated(keep="first")]
```

While all the other methods work, `.drop_duplicates` is by far the least performant for the provided example. Furthermore, while the groupby method is only slightly less performant, I find the duplicated method to be more readable.

Using the sample data provided:

```
>>> %timeit df3.reset_index().drop_duplicates(subset="index", keep="first").set_index("index")
1000 loops, best of 3: 1.54 ms per loop
>>> %timeit df3.groupby(df3.index).first()
1000 loops, best of 3: 580 µs per loop
>>> %timeit df3[~df3.index.duplicated(keep="first")]
1000 loops, best of 3: 307 µs per loop
```

Note that you can keep the last element by changing the keep argument to `"last"`.
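Since the sample data is not reproduced here, a tiny made-up frame illustrates the difference between the two `keep` settings:

```python
import pandas as pd

# Hypothetical frame with the label "a" appearing twice in the index.
df3 = pd.DataFrame({"v": [10, 20, 30, 40]}, index=["a", "b", "a", "c"])

first = df3[~df3.index.duplicated(keep="first")]  # keeps a=10
last = df3[~df3.index.duplicated(keep="last")]    # keeps a=30
print(first)
print(last)
```

Note that the surviving rows stay in their original positions, so with `keep="last"` the `a` row comes after `b`.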

It should also be noted that this method works with `MultiIndex` as well (using df1 as specified in Paul's example):

```
>>> %timeit df1.groupby(level=df1.index.names).last()
1000 loops, best of 3: 771 µs per loop
>>> %timeit df1[~df1.index.duplicated(keep="last")]
1000 loops, best of 3: 365 µs per loop
```

`df.to_numpy()` is better than `df.values`, here's why.

It's time to deprecate your usage of `values` and `as_matrix()`.

pandas `v0.24.0` introduced two new methods for obtaining NumPy arrays from pandas objects:

- `to_numpy()`, which is defined on `Index`, `Series`, and `DataFrame` objects, and
- `array`, which is defined on `Index` and `Series` objects only.

If you visit the v0.24 docs for `.values`, you will see a big red warning that says:

Warning: We recommend using `DataFrame.to_numpy()` instead.

See this section of the v0.24.0 release notes, and this answer for more information.

\* `to_numpy()` is my recommended method for any production code that needs to run reliably for many versions into the future. However, if you're just making a scratchpad in Jupyter or the terminal, using `.values` to save a few milliseconds of typing is a permissible exception. You can always add the fit and finish later.

`to_numpy()`

In the spirit of better consistency throughout the API, a new method `to_numpy` has been introduced to extract the underlying NumPy array from DataFrames.

```
# Setup
df = pd.DataFrame(data={"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]},
index=["a", "b", "c"])
# Convert the entire DataFrame
df.to_numpy()
# array([[1, 4, 7],
# [2, 5, 8],
# [3, 6, 9]])
# Convert specific columns
df[["A", "C"]].to_numpy()
# array([[1, 7],
# [2, 8],
# [3, 9]])
```

As mentioned above, this method is also defined on `Index` and `Series` objects (see here).

```
df.index.to_numpy()
# array(["a", "b", "c"], dtype=object)
df["A"].to_numpy()
# array([1, 2, 3])
```

By default, a view is returned, so any modifications made will affect the original.

```
v = df.to_numpy()
v[0, 0] = -1
df
A B C
a -1 4 7
b 2 5 8
c 3 6 9
```

If you need a copy instead, use `to_numpy(copy=True)`.
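A quick sketch of the copy behaviour, on a made-up two-column frame:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# copy=True guarantees the array does not share memory with the frame.
v = df.to_numpy(copy=True)
v[0, 0] = -1

print(df.loc[0, "A"])  # the frame still holds its original value
```

Relying on `copy=True` is the safer habit when you intend to mutate the array, since whether the default call returns a view depends on the dtypes involved and on your pandas version.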

If you're using pandas 1.x, chances are you'll be dealing with extension types a lot more. You'll have to be a little more careful that these extension types are correctly converted.

```
a = pd.array([1, 2, None], dtype="Int64")
a
<IntegerArray>
[1, 2, <NA>]
Length: 3, dtype: Int64
# Wrong
a.to_numpy()
# array([1, 2, <NA>], dtype=object) # yuck, objects
# Correct
a.to_numpy(dtype="float", na_value=np.nan)
# array([ 1., 2., nan])
# Also correct
a.to_numpy(dtype="int", na_value=-1)
# array([ 1, 2, -1])
```

This is called out in the docs.

If you need the `dtypes` in the result...

As shown in another answer, `DataFrame.to_records` is a good way to do this.

```
df.to_records()
# rec.array([("a", 1, 4, 7), ("b", 2, 5, 8), ("c", 3, 6, 9)],
# dtype=[("index", "O"), ("A", "<i8"), ("B", "<i8"), ("C", "<i8")])
```

This cannot be done with `to_numpy`, unfortunately. However, as an alternative, you can use `np.rec.fromrecords`:

```
v = df.reset_index()
np.rec.fromrecords(v, names=v.columns.tolist())
# rec.array([("a", 1, 4, 7), ("b", 2, 5, 8), ("c", 3, 6, 9)],
# dtype=[("index", "<U1"), ("A", "<i8"), ("B", "<i8"), ("C", "<i8")])
```

Performance wise, it's nearly the same (actually, using `rec.fromrecords` is a bit faster).

```
df2 = pd.concat([df] * 10000)
%timeit df2.to_records()
%%timeit
v = df2.reset_index()
np.rec.fromrecords(v, names=v.columns.tolist())
12.9 ms ± 511 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
9.56 ms ± 291 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

`to_numpy()` (in addition to `array`) was added as a result of discussions under two GitHub issues GH19954 and GH23623.

Specifically, the docs mention the rationale:

[...] with `.values` it was unclear whether the returned value would be the actual array, some transformation of it, or one of pandas custom arrays (like `Categorical`). For example, with `PeriodIndex`, `.values` generates a new `ndarray` of period objects each time. [...]

`to_numpy` aims to improve the consistency of the API, which is a major step in the right direction. `.values` will not be deprecated in the current version, but I expect this may happen at some point in the future, so I would urge users to migrate towards the newer API as soon as they can.

`DataFrame.values` has inconsistent behaviour, as already noted.

`DataFrame.get_values()` is simply a wrapper around `DataFrame.values`, so everything said above applies.

`DataFrame.as_matrix()` is deprecated now, do **NOT** use!

You can use the `DataFrame` constructor with `lists` created by `to_list`:

```
import pandas as pd
d1 = {"teams": [["SF", "NYG"],["SF", "NYG"],["SF", "NYG"],
["SF", "NYG"],["SF", "NYG"],["SF", "NYG"],["SF", "NYG"]]}
df2 = pd.DataFrame(d1)
print (df2)
teams
0 [SF, NYG]
1 [SF, NYG]
2 [SF, NYG]
3 [SF, NYG]
4 [SF, NYG]
5 [SF, NYG]
6 [SF, NYG]
```

```
df2[["team1","team2"]] = pd.DataFrame(df2.teams.tolist(), index= df2.index)
print (df2)
teams team1 team2
0 [SF, NYG] SF NYG
1 [SF, NYG] SF NYG
2 [SF, NYG] SF NYG
3 [SF, NYG] SF NYG
4 [SF, NYG] SF NYG
5 [SF, NYG] SF NYG
6 [SF, NYG] SF NYG
```

And for a new `DataFrame`:

```
df3 = pd.DataFrame(df2["teams"].to_list(), columns=["team1","team2"])
print (df3)
team1 team2
0 SF NYG
1 SF NYG
2 SF NYG
3 SF NYG
4 SF NYG
5 SF NYG
6 SF NYG
```

A solution with `apply(pd.Series)` is very slow:

```
#7k rows
df2 = pd.concat([df2]*1000).reset_index(drop=True)
In [121]: %timeit df2["teams"].apply(pd.Series)
1.79 s ± 52.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [122]: %timeit pd.DataFrame(df2["teams"].to_list(), columns=["team1","team2"])
1.63 ms ± 54.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

**UPDATE**

From v0.20, `melt` is available as a first-class `DataFrame` method, so you can now use:

```
df.melt(id_vars=["location", "name"],
var_name="Date",
value_name="Value")
location name Date Value
0 A "test" Jan-2010 12
1 B "foo" Jan-2010 18
2 A "test" Feb-2010 20
3 B "foo" Feb-2010 20
4 A "test" March-2010 30
5 B "foo" March-2010 25
```

**OLD(ER) VERSIONS: <0.20**

You can use `pd.melt` to get most of the way there, and then sort:

```
>>> df
location name Jan-2010 Feb-2010 March-2010
0 A test 12 20 30
1 B foo 18 20 25
>>> df2 = pd.melt(df, id_vars=["location", "name"],
var_name="Date", value_name="Value")
>>> df2
location name Date Value
0 A test Jan-2010 12
1 B foo Jan-2010 18
2 A test Feb-2010 20
3 B foo Feb-2010 20
4 A test March-2010 30
5 B foo March-2010 25
>>> df2 = df2.sort(["location", "name"])
>>> df2
location name Date Value
0 A test Jan-2010 12
2 A test Feb-2010 20
4 A test March-2010 30
1 B foo Jan-2010 18
3 B foo Feb-2010 20
5 B foo March-2010 25
```

(Might want to throw in a `.reset_index(drop=True)`, just to keep the output clean.)

**Note**: `pd.DataFrame.sort` has been deprecated in favour of `pd.DataFrame.sort_values`.
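Putting those pieces together for current pandas, here is a sketch that rebuilds the example frame from the output shown above and uses `sort_values` instead of `sort` (the name `tidy` is mine):

```python
import pandas as pd

df = pd.DataFrame({
    "location": ["A", "B"],
    "name": ["test", "foo"],
    "Jan-2010": [12, 18],
    "Feb-2010": [20, 20],
    "March-2010": [30, 25],
})

tidy = (
    df.melt(id_vars=["location", "name"], var_name="Date", value_name="Value")
      .sort_values(["location", "name"])
      .reset_index(drop=True)  # renumber rows 0..n-1 after sorting
)
print(tidy)
```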

I know that `object` dtype columns make the data hard to convert with a `pandas` function. When I received data like this, the first thing that came to mind was to "flatten" or unnest the columns.

I am using `pandas` and `python` functions for this type of question. If you are worried about the speed of the above solutions, check user3483203's answer, since it uses `numpy`, and most of the time `numpy` is faster. I recommend `Cython` and `numba` if speed matters.

**Method 0 [pandas >= 0.25]**

Starting from pandas 0.25, if you only need to explode *one* column, you can use the `pandas.DataFrame.explode` function:

```
df.explode("B")
A B
0 1 1
0 1 2
1 2 1
1 2 2
```

Consider a dataframe with an empty `list` or a `NaN` in the column. An empty list will not cause an issue, but a `NaN` will need to be filled with a `list`:

```
df = pd.DataFrame({"A": [1, 2, 3, 4],"B": [[1, 2], [1, 2], [], np.nan]})
df.B = df.B.fillna({i: [] for i in df.index}) # replace NaN with []
df.explode("B")
A B
0 1 1
0 1 2
1 2 1
1 2 2
2 3 NaN
3 4 NaN
```

**Method 1**

**apply + pd.Series** (easy to understand, but not recommended in terms of performance)

```
df.set_index("A").B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0:"B"})
Out[463]:
A B
0 1 1
1 1 2
0 2 1
1 2 2
```

**Method 2**

Using `repeat` with the `DataFrame` constructor, re-create your dataframe (good performance, but not well suited to multiple columns):

```
df=pd.DataFrame({"A":df.A.repeat(df.B.str.len()),"B":np.concatenate(df.B.values)})
df
Out[465]:
A B
0 1 1
0 1 2
1 2 1
1 2 2
```

**Method 2.1**

Suppose that, besides A, we have A.1, ..., A.n. If we still use **Method 2** above, it is hard to re-create the columns one by one.

Solution: `join` or `merge` with the `index` after unnesting the single column:

```
s=pd.DataFrame({"B":np.concatenate(df.B.values)},index=df.index.repeat(df.B.str.len()))
s.join(df.drop(columns="B"),how="left")
Out[477]:
B A
0 1 1
0 2 1
1 1 2
1 2 2
```

If you need the column order exactly the same as before, add `reindex` at the end.

```
s.join(df.drop(columns="B"),how="left").reindex(columns=df.columns)
```

**Method 3**

recreate the `list`

```
pd.DataFrame([[x] + [z] for x, y in df.values for z in y],columns=df.columns)
Out[488]:
A B
0 1 1
1 1 2
2 2 1
3 2 2
```

If more than two columns, use

```
s=pd.DataFrame([[x] + [z] for x, y in zip(df.index,df.B) for z in y])
s.merge(df,left_on=0,right_index=True)
Out[491]:
0 1 A B
0 0 1 1 [1, 2]
1 0 2 1 [1, 2]
2 1 1 2 [1, 2]
3 1 2 2 [1, 2]
```

**Method 4**

using `reindex` or `loc`

```
df.reindex(df.index.repeat(df.B.str.len())).assign(B=np.concatenate(df.B.values))
Out[554]:
A B
0 1 1
0 1 2
1 2 1
1 2 2
#df.loc[df.index.repeat(df.B.str.len())].assign(B=np.concatenate(df.B.values))
```

**Method 5**

when the list only contains unique values:

```
df=pd.DataFrame({"A":[1,2],"B":[[1,2],[3,4]]})
from collections import ChainMap
d = dict(ChainMap(*map(dict.fromkeys, df["B"], df["A"])))
pd.DataFrame(list(d.items()),columns=df.columns[::-1])
Out[574]:
B A
0 1 1
1 2 1
2 3 2
3 4 2
```

**Method 6**

using `numpy` for high performance:

```
newvalues=np.dstack((np.repeat(df.A.values,list(map(len,df.B.values))),np.concatenate(df.B.values)))
pd.DataFrame(data=newvalues[0],columns=df.columns)
A B
0 1 1
1 1 2
2 2 1
3 2 2
```

**Method 7**

using the base functions `cycle` and `chain` from `itertools`: a pure python solution, just for fun

```
from itertools import cycle,chain
l=df.values.tolist()
l1=[list(zip([x[0]], cycle(x[1])) if len([x[0]]) > len(x[1]) else list(zip(cycle([x[0]]), x[1]))) for x in l]
pd.DataFrame(list(chain.from_iterable(l1)),columns=df.columns)
A B
0 1 1
1 1 2
2 2 1
3 2 2
```

**Generalizing to multiple columns**

```
df=pd.DataFrame({"A":[1,2],"B":[[1,2],[3,4]],"C":[[1,2],[3,4]]})
df
Out[592]:
A B C
0 1 [1, 2] [1, 2]
1 2 [3, 4] [3, 4]
```

Self-def function:

```
def unnesting(df, explode):
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([
        pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx
    # drop(columns=...) replaces the positional axis argument removed in pandas 2.0
    return df1.join(df.drop(columns=explode), how="left")

unnesting(df,["B","C"])
Out[609]:
B C A
0 1 1 1
0 2 2 1
1 3 3 2
1 4 4 2
```

All of the above methods deal with *vertical* unnesting (exploding). If you need to expand the list horizontally, use the `pd.DataFrame` constructor:

```
df.join(pd.DataFrame(df.B.tolist(),index=df.index).add_prefix("B_"))
Out[33]:
A B C B_0 B_1
0 1 [1, 2] [1, 2] 1 2
1 2 [3, 4] [3, 4] 3 4
```

**Updated function**

```
def unnesting(df, explode, axis):
    if axis==1:
        idx = df.index.repeat(df[explode[0]].str.len())
        df1 = pd.concat([
            pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
        df1.index = idx
        return df1.join(df.drop(columns=explode), how="left")
    else :
        df1 = pd.concat([
            pd.DataFrame(df[x].tolist(), index=df.index).add_prefix(x) for x in explode], axis=1)
        return df1.join(df.drop(columns=explode), how="left")
```

Test Output

```
unnesting(df, ["B","C"], axis=0)
Out[36]:
B0 B1 C0 C1 A
0 1 2 1 2 1
1 3 4 3 4 2
```

Update 2021-02-17 with original explode function

```
def unnesting(df, explode, axis):
    if axis==1:
        df1 = pd.concat([df[x].explode() for x in explode], axis=1)
        return df1.join(df.drop(columns=explode), how="left")
    else :
        df1 = pd.concat([
            pd.DataFrame(df[x].tolist(), index=df.index).add_prefix(x) for x in explode], axis=1)
        return df1.join(df.drop(columns=explode), how="left")
```
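As a side note, and assuming pandas 1.3 or newer, `DataFrame.explode` itself accepts a list of columns, provided the lists in each row have matching lengths. A minimal sketch on the multi-column example frame from above:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [[1, 2], [3, 4]], "C": [[1, 2], [3, 4]]})

# Explode B and C together (assumes pandas >= 1.3), then renumber the rows.
out = df.explode(["B", "C"]).reset_index(drop=True)
print(out)
```

This covers only the vertical case; the horizontal expansion still needs the `pd.DataFrame` constructor approach.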

You can `groupby` on cols "A" and "B" and call `size` and then `reset_index` and `rename` the generated column:

```
In [26]:
df1.groupby(["A","B"]).size().reset_index().rename(columns={0:"count"})
Out[26]:
A B count
0 no no 1
1 no yes 2
2 yes no 4
3 yes yes 3
```

**update**

A little explanation: grouping on the 2 columns groups rows where the A and B values are the same. We then call `size`, which returns the number of rows in each group:

```
In[202]:
df1.groupby(["A","B"]).size()
Out[202]:
A B
no no 1
yes 2
yes no 4
yes 3
dtype: int64
```

So now to restore the grouped columns, we call `reset_index`:

```
In[203]:
df1.groupby(["A","B"]).size().reset_index()
Out[203]:
A B 0
0 no no 1
1 no yes 2
2 yes no 4
3 yes yes 3
```

This restores the indices but the size aggregation is turned into a generated column `0`, so we have to rename this:

```
In[204]:
df1.groupby(["A","B"]).size().reset_index().rename(columns={0:"count"})
Out[204]:
A B count
0 no no 1
1 no yes 2
2 yes no 4
3 yes yes 3
```
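The rename step can also be folded into `reset_index` through its `name` argument. A sketch on a small made-up `df1` (not the data from the question):

```python
import pandas as pd

# Hypothetical yes/no data, just for illustration.
df1 = pd.DataFrame({
    "A": ["no", "no", "no", "yes"],
    "B": ["no", "yes", "yes", "no"],
})

# name= labels the size column directly, so no rename is needed.
counts = df1.groupby(["A", "B"]).size().reset_index(name="count")
print(counts)
```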

`groupby` does accept the arg `as_index`, which we could have set to `False` so it doesn't make the grouped columns the index, but this generates a `series` and you'd still have to restore the indices and so on:

```
In[205]:
df1.groupby(["A","B"], as_index=False).size()
Out[205]:
A B
no no 1
yes 2
yes no 4
yes 3
dtype: int64
```

You can *reset* the index using `reset_index` to get back a default index of 0, 1, 2, ..., n-1 (and use `drop=True` to indicate you want to drop the existing index instead of adding it as an additional column to your dataframe):

```
In [19]: df2 = df2.reset_index(drop=True)
In [20]: df2
Out[20]:
x y
0 0 0
1 0 1
2 0 2
3 1 0
4 1 1
5 1 2
6 2 0
7 2 1
8 2 2
```

TLDR; No, `for` loops are not blanket "bad", at least, not always. It is probably **more accurate to say that some vectorized operations are slower than iterating**, versus saying that iteration is faster than some vectorized operations. Knowing when and why is key to getting the most performance out of your code. In a nutshell, these are the situations where it is worth considering an alternative to vectorized pandas functions:

- When your data is small (...depending on what you're doing),
- When dealing with `object`/mixed dtypes
- When using the `str`/regex accessor functions

Let"s examine these situations individually.

Pandas follows a "Convention Over Configuration" approach in its API design. This means that the same API has been fitted to cater to a broad range of data and use cases.

When a pandas function is called, the following things (among others) must be handled internally by the function to ensure it works correctly:

- Index/axis alignment
- Handling mixed datatypes
- Handling missing data

Almost every function will have to deal with these to varying extents, and this presents an **overhead**. The overhead is less for numeric functions (for example, `Series.add`), while it is more pronounced for string functions (for example, `Series.str.replace`).

`for` loops, on the other hand, are faster than you think. What's even better is that list comprehensions (which create lists through `for` loops) are even faster, as they are optimized iterative mechanisms for list creation.

List comprehensions follow the pattern

```
[f(x) for x in seq]
```

where `seq` is a pandas Series or DataFrame column. Or, when operating over multiple columns,

```
[f(x, y) for x, y in zip(seq1, seq2)]
```

where `seq1` and `seq2` are columns.
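As a concrete instance of both patterns, assuming a small made-up frame with columns `A` and `B`:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [1, 5, 3]})

# Single-column pattern: [f(x) for x in seq]
squares = [x ** 2 for x in df["A"]]

# Two-column pattern: [f(x, y) for x, y in zip(seq1, seq2)]
differs = [x != y for x, y in zip(df["A"], df["B"])]

print(squares, differs)
```

Both results are plain Python lists, so they can be assigned straight back as a new column or used as a boolean mask.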

**Numeric Comparison**

Consider a simple boolean indexing operation. The list comprehension method has been timed against `Series.ne` (`!=`) and `query`. Here are the functions:

```
# Boolean indexing with Numeric value comparison.
df[df.A != df.B] # vectorized !=
df.query("A != B") # query (numexpr)
df[[x != y for x, y in zip(df.A, df.B)]] # list comp
```

For simplicity, I have used the `perfplot` package to run all the timeit tests in this post. The timings for the operations above are below:

The list comprehension outperforms `query` for moderately sized N, and even outperforms the vectorized not equals comparison for tiny N. Unfortunately, the list comprehension scales linearly, so it does not offer much performance gain for larger N.

Note

It is worth mentioning that much of the benefit of list comprehensions comes from not having to worry about index alignment, but this means that if your code depends on index alignment, it will break. In some cases, vectorised operations over the underlying NumPy arrays can be considered as bringing in the "best of both worlds", allowing for vectorisation without all the unneeded overhead of the pandas functions. This means that you can rewrite the operation above as `df[df.A.values != df.B.values]`, which outperforms both the pandas and list comprehension equivalents.

NumPy vectorization is out of the scope of this post, but it is definitely worth considering, if performance matters.
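To make the alignment caveat concrete, here is a sketch with two made-up Series whose indexes are in opposite order:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=[0, 1, 2])
s2 = pd.Series([10, 20, 30], index=[2, 1, 0])

# pandas matches values by index label: 1+30, 2+20, 3+10
aligned = s1 + s2

# zip pairs values purely by position: 1+10, 2+20, 3+30
positional = [x + y for x, y in zip(s1, s2)]

print(aligned.tolist(), positional)
```

The same positional behaviour applies to `.values`-based arithmetic, so these shortcuts are only safe when both inputs are known to share the same row order.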

**Value Counts**

Taking another example - this time, with another vanilla python construct that is *faster* than a for loop - `collections.Counter`. A common requirement is to compute the value counts and return the result as a dictionary. This is done with `value_counts`, `np.unique`, and `Counter`:

```
# Value Counts comparison.
ser.value_counts(sort=False).to_dict() # value_counts
dict(zip(*np.unique(ser, return_counts=True))) # np.unique
Counter(ser) # Counter
```

The results are more pronounced: `Counter` wins out over both vectorized methods for a larger range of small N (up to about 3500).

Note

More trivia (courtesy @user2357112). The `Counter` is implemented with a C accelerator, so while it still has to work with python objects instead of the underlying C datatypes, it is still faster than a `for` loop. Python power!
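For reference, a small sketch showing that the three approaches produce the same mapping on a made-up Series (the short variable names are mine):

```python
from collections import Counter

import numpy as np
import pandas as pd

ser = pd.Series([3, 1, 3, 2, 3, 1])

a = ser.value_counts(sort=False).to_dict()                  # pandas
b = dict(zip(*np.unique(ser, return_counts=True)))          # NumPy
c = Counter(ser)                                            # pure Python

print(a, b, dict(c))
```

The key order can differ between the three, but as dictionaries they compare equal.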

Of course, the takeaway here is that performance depends on your data and use case. The point of these examples is to convince you not to rule out these solutions as legitimate options. If these still don't give you the performance you need, there is always cython and numba. Let's add this test into the mix.

```
from numba import njit, prange

@njit(parallel=True)
def get_mask(x, y):
    result = [False] * len(x)
    for i in prange(len(x)):
        result[i] = x[i] != y[i]
    return np.array(result)

df[get_mask(df.A.values, df.B.values)] # numba
```

Numba offers JIT compilation of loopy python code to very powerful vectorized code. Understanding how to make numba work involves a learning curve.

`object` dtypes

**String-based Comparison**

Revisiting the filtering example from the first section, what if the columns being compared are strings? Consider the same 3 functions above, but with the input DataFrame cast to string.

```
# Boolean indexing with string value comparison.
df[df.A != df.B] # vectorized !=
df.query("A != B") # query (numexpr)
df[[x != y for x, y in zip(df.A, df.B)]] # list comp
```

So, what changed? The thing to note here is that **string operations are inherently difficult to vectorize.** Pandas treats strings as objects, and all operations on objects fall back to a slow, loopy implementation.

Now, because this loopy implementation is surrounded by all the overhead mentioned above, there is a constant magnitude difference between these solutions, even though they scale the same.

When it comes to operations on mutable/complex objects, there is no comparison. List comprehension outperforms all operations involving dicts and lists.

**Accessing Dictionary Value(s) by Key**

Here are timings for two operations that extract a value from a column of dictionaries: `map` and the list comprehension. The setup is in the Appendix, under the heading "Code Snippets".

```
# Dictionary value extraction.
ser.map(operator.itemgetter("value")) # map
pd.Series([x.get("value") for x in ser]) # list comprehension
```

**Positional List Indexing**

Timings for 3 operations that extract the 0th element from a column of lists (handling exceptions): `map`, the `str.get` accessor method, and the list comprehension:

```
# List positional indexing.
def get_0th(lst):
    try:
        return lst[0]
    # Handle empty lists and NaNs gracefully.
    except (IndexError, TypeError):
        return np.nan
```

```
ser.map(get_0th) # map
ser.str[0] # str accessor
pd.Series([x[0] if len(x) > 0 else np.nan for x in ser]) # list comp
pd.Series([get_0th(x) for x in ser]) # list comp safe
```

Note

If the index matters, you would want to do `pd.Series([...], index=ser.index)` when reconstructing the series.
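A short sketch of that, assuming a made-up Series with a non-default index:

```python
import numpy as np
import pandas as pd

# Hypothetical column of lists with a custom index.
ser = pd.Series([["a", "b"], [], ["c"]], index=[10, 20, 30])

# Passing index=ser.index keeps the original labels on the result.
firsts = pd.Series(
    [x[0] if len(x) > 0 else np.nan for x in ser],
    index=ser.index,
)
print(firsts)
```

Without the `index` argument the result would be labelled 0, 1, 2 and would no longer line up with `ser` in later aligned operations.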

**List Flattening**

A final example is flattening lists. This is another common problem, and demonstrates just how powerful pure python is here.

```
# Nested list flattening.
pd.DataFrame(ser.tolist()).stack().reset_index(drop=True) # stack
pd.Series(list(chain.from_iterable(ser.tolist()))) # itertools.chain
pd.Series([y for x in ser for y in x]) # nested list comp
```

Both `itertools.chain.from_iterable` and the nested list comprehension are pure python constructs, and scale much better than the `stack` solution.
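A runnable sketch of the two pure-python variants, using the same three lists as `ser3` in the Appendix:

```python
from itertools import chain

import pandas as pd

ser = pd.Series([["a", "b", "c"], ["d", "e"], ["f", "g"]])

flat_chain = pd.Series(list(chain.from_iterable(ser.tolist())))
flat_comp = pd.Series([y for x in ser for y in x])

print(flat_comp.tolist())
```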

These timings are a strong indication of the fact that pandas is not equipped to work with mixed dtypes, and that you should probably refrain from using it to do so. Wherever possible, data should be present as scalar values (ints/floats/strings) in separate columns.

Lastly, the applicability of these solutions depends widely on your data. So, the best thing to do would be to test these operations on your data before deciding what to go with. Notice how I have not timed `apply` on these solutions, because it would skew the graph (yes, it's that slow).

`.str` Accessor Methods

Pandas can apply regex operations such as `str.contains`, `str.extract`, and `str.extractall`, as well as other "vectorized" string operations (such as `str.split`, `str.find`, `str.translate`, and so on) on string columns. These functions are slower than list comprehensions, and are meant to be more convenience functions than anything else.

It is usually much faster to pre-compile a regex pattern with `re.compile` and iterate over your data (also see "Is it worth using Python's re.compile?"). The list comp equivalent to `str.contains` looks something like this:

```
p = re.compile(...)
ser2 = pd.Series([x for x in ser if p.search(x)])
```

Or,

```
ser2 = ser[[bool(p.search(x)) for x in ser]]
```

If you need to handle NaNs, you can do something like:

```
ser[[bool(p.search(x)) if pd.notnull(x) else False for x in ser]]
```
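
To make the pattern above concrete, here is a minimal sketch with an assumed pattern and made-up data, checked against `str.contains(..., na=False)`:

```python
import re

import numpy as np
import pandas as pd

# Hypothetical data containing a missing value.
ser = pd.Series(["apple pie", "banana", np.nan, "pineapple"])
p = re.compile(r"apple")  # hypothetical pattern

# NaN-safe boolean mask built with a list comprehension.
mask = [bool(p.search(x)) if pd.notnull(x) else False for x in ser]
filtered = ser[mask]

# The accessor-based equivalent, treating NaN as "no match".
expected = ser[ser.str.contains(r"apple", na=False)]
```

Both selections keep the original index labels of the matching rows, because the mask is applied to `ser` rather than used to build a new Series.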

The list comp equivalent to `str.extract` (without groups) will look something like:

```
df["col2"] = [p.search(x).group(0) for x in df["col"]]
```

If you need to handle no-matches and NaNs, you can use a custom function (still faster!):

```
def matcher(x):
    # str(x) turns NaN into "nan", which simply fails to match.
    m = p.search(str(x))
    if m:
        return m.group(0)
    return np.nan

df["col2"] = [matcher(x) for x in df["col"]]
```

The `matcher` function is very extensible. It can be adapted to return a list for each capture group, as needed. Just call the `group` or `groups` method of the match object.
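
As one possible variation, here is a sketch that returns all capture groups as a list. The two-group pattern, the sample data, and the name `matcher_groups` are assumptions for illustration:

```python
import re

import numpy as np
import pandas as pd

# Hypothetical pattern with two capture groups: a letter, then 4 digits.
p = re.compile(r"([A-Z])(\d{4})")

def matcher_groups(x):
    # str(x) makes NaN safe to search; "nan" does not match the pattern.
    m = p.search(str(x))
    if m:
        return list(m.groups())
    return np.nan

df = pd.DataFrame({"col": ["test A1234", "foo xyz", np.nan, "D3345 xtz"]})
df["col2"] = [matcher_groups(x) for x in df["col"]]
```

Note that this stores Python lists in an object column; if each group should end up in its own column, building a DataFrame from the lists is usually more convenient.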

For `str.extractall`, change `p.search` to `p.findall`.
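
A short sketch of the `findall` variant, with an assumed pattern and made-up data:

```python
import re

import pandas as pd

p = re.compile(r"\d+")  # hypothetical pattern: runs of digits
ser = pd.Series(["a1b22", "no digits", "333"])

# One list of matches per row; rows without matches give an empty list.
all_matches = [p.findall(x) for x in ser]
```

Unlike `str.extractall`, which returns a DataFrame with one row per match, this yields a plain list of lists, so the result shape differs even though the matches are the same.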

**String Extraction**

Consider a simple filtering operation. The idea is to extract 4 digits if they are preceded by an uppercase letter.

```
# Extracting strings.
p = re.compile(r"(?<=[A-Z])(\d{4})")

def matcher(x):
    m = p.search(x)
    if m:
        return m.group(0)
    return np.nan

ser.str.extract(r"(?<=[A-Z])(\d{4})", expand=False)  # str.extract
pd.Series([matcher(x) for x in ser])                 # list comprehension
```
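
For a concrete run, the sketch below applies both approaches to the three sample strings used in the benchmark code at the end of this answer (the result names `via_accessor` and `via_listcomp` are illustrative):

```python
import re

import numpy as np
import pandas as pd

ser = pd.Series(["foo xyz", "test A1234", "D3345 xtz"])

# Four digits, only when immediately preceded by an uppercase letter.
p = re.compile(r"(?<=[A-Z])(\d{4})")

def matcher(x):
    m = p.search(x)
    return m.group(0) if m else np.nan

via_accessor = ser.str.extract(r"(?<=[A-Z])(\d{4})", expand=False)
via_listcomp = pd.Series([matcher(x) for x in ser])
```

The first string has no match and yields a missing value in both results; the other two yield `"1234"` and `"3345"`.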

**More Examples**

Full disclosure - I am the author (in part or whole) of these posts listed below.

As the examples above show, iteration shines when working with DataFrames that have few rows, mixed datatypes, and regular expressions.

The speedup you get depends on your data and your problem, so your mileage may vary. The best thing to do is to carefully run tests and see if the payoff is worth the effort.

The "vectorized" functions shine in their simplicity and readability, so if performance is not critical, you should definitely prefer those.

As another side note, certain string operations deal with constraints that favour the use of NumPy. Here are two examples where careful NumPy vectorization outperforms Python:

Create new column with incremental values in a faster and efficient way - Answer by Divakar

Fast punctuation removal with pandas - Answer by Paul Panzer

Additionally, sometimes just operating on the underlying arrays via `.values` as opposed to on the Series or DataFrames can offer a healthy enough speedup for most usual scenarios (see the **Note** in the **Numeric Comparison** section above). So, for example `df[df.A.values != df.B.values]` would show instant performance boosts over `df[df.A != df.B]`. Using `.values` may not be appropriate in every situation, but it is a useful hack to know.
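
A minimal sketch showing that the two forms select the same rows, using a made-up DataFrame (the names `via_series` and `via_arrays` are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical data: rows 1 and 3 differ between A and B.
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [1, 5, 3, 0]})

# Comparison on Series: pandas aligns the indexes before comparing.
via_series = df[df.A != df.B]

# Comparison on the raw NumPy arrays: purely positional, no alignment.
via_arrays = df[df.A.values != df.B.values]
```

Because the array comparison skips index alignment, it is only equivalent when both columns share the same row order, which is always the case for two columns of one DataFrame. In newer pandas versions `.to_numpy()` is the documented way to get the underlying array.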

As mentioned above, it's up to you to decide whether these solutions are worth the trouble of implementing.

```
import perfplot
import operator
import pandas as pd
import numpy as np
import re
from collections import Counter
from itertools import chain
```

```
# Boolean indexing with Numeric value comparison.
# `get_mask` is assumed to be the numba-compiled helper defined earlier.
perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.choice(1000, (n, 2)), columns=["A","B"]),
    kernels=[
        lambda df: df[df.A != df.B],
        lambda df: df.query("A != B"),
        lambda df: df[[x != y for x, y in zip(df.A, df.B)]],
        lambda df: df[get_mask(df.A.values, df.B.values)]
    ],
    labels=["vectorized !=", "query (numexpr)", "list comp", "numba"],
    n_range=[2**k for k in range(0, 15)],
    xlabel="N"
)
```

```
# Value Counts comparison.
perfplot.show(
    setup=lambda n: pd.Series(np.random.choice(1000, n)),
    kernels=[
        lambda ser: ser.value_counts(sort=False).to_dict(),
        lambda ser: dict(zip(*np.unique(ser, return_counts=True))),
        lambda ser: Counter(ser),
    ],
    labels=["value_counts", "np.unique", "Counter"],
    n_range=[2**k for k in range(0, 15)],
    xlabel="N",
    equality_check=lambda x, y: dict(x) == dict(y)
)
```

```
# Boolean indexing with string value comparison.
perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.choice(1000, (n, 2)), columns=["A","B"], dtype=str),
    kernels=[
        lambda df: df[df.A != df.B],
        lambda df: df.query("A != B"),
        lambda df: df[[x != y for x, y in zip(df.A, df.B)]],
    ],
    labels=["vectorized !=", "query (numexpr)", "list comp"],
    n_range=[2**k for k in range(0, 15)],
    xlabel="N",
    equality_check=None
)
```

```
# Dictionary value extraction.
ser1 = pd.Series([{"key": "abc", "value": 123}, {"key": "xyz", "value": 456}])
perfplot.show(
    setup=lambda n: pd.concat([ser1] * n, ignore_index=True),
    kernels=[
        lambda ser: ser.map(operator.itemgetter("value")),
        lambda ser: pd.Series([x.get("value") for x in ser]),
    ],
    labels=["map", "list comprehension"],
    n_range=[2**k for k in range(0, 15)],
    xlabel="N",
    equality_check=None
)
```

```
# List positional indexing.
ser2 = pd.Series([["a", "b", "c"], [1, 2], []])
perfplot.show(
    setup=lambda n: pd.concat([ser2] * n, ignore_index=True),
    kernels=[
        lambda ser: ser.map(get_0th),
        lambda ser: ser.str[0],
        lambda ser: pd.Series([x[0] if len(x) > 0 else np.nan for x in ser]),
        lambda ser: pd.Series([get_0th(x) for x in ser]),
    ],
    labels=["map", "str accessor", "list comprehension", "list comp safe"],
    n_range=[2**k for k in range(0, 15)],
    xlabel="N",
    equality_check=None
)
```

```
# Nested list flattening.
ser3 = pd.Series([["a", "b", "c"], ["d", "e"], ["f", "g"]])
perfplot.show(
    # Use ser3 (defined above) as the base data for this benchmark.
    setup=lambda n: pd.concat([ser3] * n, ignore_index=True),
    kernels=[
        lambda ser: pd.DataFrame(ser.tolist()).stack().reset_index(drop=True),
        lambda ser: pd.Series(list(chain.from_iterable(ser.tolist()))),
        lambda ser: pd.Series([y for x in ser for y in x]),
    ],
    labels=["stack", "itertools.chain", "nested list comp"],
    n_range=[2**k for k in range(0, 15)],
    xlabel="N",
    equality_check=None
)
```

```
# Extracting strings.
ser4 = pd.Series(["foo xyz", "test A1234", "D3345 xtz"])
perfplot.show(
    setup=lambda n: pd.concat([ser4] * n, ignore_index=True),
    kernels=[
        lambda ser: ser.str.extract(r"(?<=[A-Z])(\d{4})", expand=False),
        lambda ser: pd.Series([matcher(x) for x in ser])
    ],
    labels=["str.extract", "list comprehension"],
    n_range=[2**k for k in range(0, 15)],
    xlabel="N",
    equality_check=None
)
```
