
The time complexity of an algorithm is an important factor in deciding whether an application is reliable. Running a large computation as quickly as possible matters when producing real-time output. To help with this, Python has standard mathematical functions for quickly performing operations on entire arrays of data without having to write loops. One library that provides such functions is **NumPy**. Let's see how we can use these standard functions for vectorization.

**What is vectorization?**

Vectorization is a way to speed up Python code by avoiding explicit loops. Using vectorized functions can significantly reduce execution time. Several operations can be performed on vectors this way:

- the *dot product* (also known as the *inner product*), which multiplies two vectors of equal length and produces a single scalar;
- the *outer product*, which produces a rectangular matrix whose dimensions are the lengths of the two vectors (a square matrix when the lengths are equal);
- *element-wise multiplication*, which multiplies corresponding elements, leaving the dimensions of the matrix unchanged.

We will see how classical methods take longer than ordinary functions, calculating their processing time.

`outer(a, b)`: Compute the outer product of two vectors.

`multiply(a, b)`: Element-wise product of two arrays.

`dot(a, b)`: Dot product of two arrays.

`zeros((n, m))`: Return a matrix of given shape and type, filled with zeros.

`process_time()`: Return the value (in fractional seconds) of the sum of the system and user CPU time of the current process. It does not include time elapsed during sleep.

**Dot product:**

The dot product is an algebraic operation in which two vectors of equal length are multiplied so that they produce a single number. The dot product is often referred to as an **inner product**. This operation results in a scalar. Let's consider two vectors.

**Visual representation of the dot product —**

Below is the Python code:

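The original code block did not survive extraction. Below is a reconstruction consistent with the output that follows (the array sizes, 100,000 integers starting at 0 and at 100,000 respectively, are inferred from the printed result):

```python
# Dot product: classic loop vs numpy.dot
import time
import array
import numpy

# build two integer arrays of 100,000 elements each
a = array.array('q')
for i in range(100000):
    a.append(i)

b = array.array('q')
for i in range(100000, 200000):
    b.append(i)

# classic loop implementation of the dot product
tic = time.process_time()
dot = 0.0
for i in range(len(a)):
    dot += a[i] * b[i]
toc = time.process_time()
print("dot_product = " + str(dot))
print("Computation time = " + str(1000 * (toc - tic)) + "ms")

# vectorized implementation
n_tic = time.process_time()
n_dot_product = numpy.dot(a, b)
n_toc = time.process_time()
print("n_dot_product = " + str(n_dot_product))
print("Computation time = " + str(1000 * (n_toc - n_tic)) + "ms")
```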

**Output:**

```
dot_product = 833323333350000.0
Computation time = 35.59449199999999ms
n_dot_product = 833323333350000
Computation time = 0.1559900000000225ms
```

**Outer product:**

The *tensor product* of two coordinate vectors is called the *outer product*. Consider two vectors *a* and *b* with dimensions `n x 1` and `m x 1`; the outer product of the vectors then yields a rectangular matrix of shape `n x m`. If the two vectors have the same dimension, the resulting matrix is a square matrix, as shown in the figure.

**Visual representation of the outer product —**

Below is the Python code:

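This code block was also lost in extraction; below is a sketch consistent with the output that follows (vectors of 200 elements, starting at 0 and at 200, inferred from the printed matrix):

```python
# Outer product: classic nested loop vs numpy.outer
import time
import array
import numpy

a = array.array('q', range(200))         # 0 .. 199
b = array.array('q', range(200, 400))    # 200 .. 399

# classic nested-loop implementation
tic = time.process_time()
outer_product = numpy.zeros((200, 200))
for i in range(len(a)):
    for j in range(len(b)):
        outer_product[i][j] = a[i] * b[j]
toc = time.process_time()
print("outer_product = " + str(outer_product))
print("Computation time = " + str(1000 * (toc - tic)) + "ms")

# vectorized implementation
n_tic = time.process_time()
outer_product = numpy.outer(a, b)
n_toc = time.process_time()
print("outer_product = " + str(outer_product))
print("Computation time = " + str(1000 * (n_toc - n_tic)) + "ms")
```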

**Output:**

```
outer_product = [[    0.     0.     0. ...,     0.     0.     0.]
 [  200.   201.   202. ...,   397.   398.   399.]
 [  400.   402.   404. ...,   794.   796.   798.]
 ...,
 [39400. 39597. 39794. ..., 78209. 78406. 78603.]
 [39600. 39798. 39996. ..., 78606. 78804. 79002.]
 [39800. 39999. 40198. ..., 79003. 79202. 79401.]]
Computation time = 39.821617ms
outer_product = [[    0     0     0 ...,     0     0     0]
 [  200   201   202 ...,   397   398   399]
 [  400   402   404 ...,   794   796   798]
 ...,
 [39400 39597 39794 ..., 78209 78406 78603]
 [39600 39798 39996 ..., 78606 78804 79002]
 [39800 39999 40198 ..., 79003 79202 79401]]
Computation time = 0.2809480000000031ms
```

**Element-wise product:**

Element-wise multiplication of two matrices is an algebraic operation in which each element of the first matrix is multiplied by the corresponding element of the second matrix. The dimensions of the matrices must be the same.

Consider two matrices *a* and *b*. If the indices of an element in *a* are *i* and *j*, then *a(i, j)* is multiplied by *b(i, j)*, as shown in the picture below.

**Visual representation of the element-wise product —**

Below is the Python code:

```python
# Element-wise multiplication
import time
import array
import numpy

a = array.array('i')
for i in range(50000):
    a.append(i)

b = array.array('i')
for i in range(50000, 100000):
    b.append(i)

# classic element-by-element product implementation
vector = numpy.zeros(50000)
tic = time.process_time()
for i in range(len(a)):
    vector[i] = a[i] * b[i]
toc = time.process_time()

print("Element wise Product = " + str(vector))
print("Computation time = " + str(1000 * (toc - tic)) + "ms")

# vectorized implementation
n_tic = time.process_time()
vector = numpy.multiply(a, b)
n_toc = time.process_time()

print("Element wise Product = " + str(vector))
print("Computation time = " + str(1000 * (n_toc - n_tic)) + "ms")
```

**Output:**

```
Element wise Product = [0.00000000e+00 5.00010000e+04 1.00004000e+05 ..., 4.99955001e+09 4.99970000e+09 4.99985000e+09]
Computation time = 23.516678000000013ms
Element wise Product = [0 50001 100004 ..., 704582713 704732708 704882705]
Computation time = 0.2250640000000248ms
```

## How to iterate over rows in a DataFrame in Pandas?

Iteration in Pandas is an anti-pattern and is something you should only do when you have exhausted every other option. You should not use any function with "`iter`" in its name for more than a few thousand rows or you will have to get used to a **lot** of waiting.

Do you want to print a DataFrame? Use **DataFrame.to_string()**.

Do you want to compute something? In that case, search for methods in this order (list modified from here):

- Vectorization
- Cython routines
- List Comprehensions (vanilla `for` loop)
- `DataFrame.apply()`: i) reductions that can be performed in Cython, ii) iteration in Python space
- `DataFrame.itertuples()` and `iteritems()`
- `DataFrame.iterrows()`

`iterrows` and `itertuples` (both receiving many votes in answers to this question) should be used in very rare circumstances, such as generating row objects/namedtuples for sequential processing, which is really the only thing these functions are useful for.

**Appeal to Authority**

The documentation page on iteration has a huge red warning box that says:

Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is not needed [...].

_{* It's actually a little more complicated than "don't". df.iterrows() is the correct answer to this question, but "vectorize your ops" is the better one. I will concede that there are circumstances where iteration cannot be avoided (for example, some operations where the result depends on the value computed for the previous row). However, it takes some familiarity with the library to know when. If you're not sure whether you need an iterative solution, you probably don't. PS: To know more about my rationale for writing this answer, skip to the very bottom.}

A good number of basic operations and computations are "vectorised" by pandas (either through NumPy, or through Cythonized functions). This includes arithmetic, comparisons, (most) reductions, reshaping (such as pivoting), joins, and groupby operations. Look through the documentation on Essential Basic Functionality to find a suitable vectorised method for your problem.

If none exists, feel free to write your own using custom Cython extensions.

List comprehensions should be your next port of call if 1) there is no vectorized solution available, 2) performance is important, but not important enough to go through the hassle of cythonizing your code, and 3) you're trying to perform elementwise transformation on your code. There is a good amount of evidence to suggest that list comprehensions are sufficiently fast (and even sometimes faster) for many common Pandas tasks.

The formula is simple,

```
# Iterating over one column - `f` is some function that processes your data
result = [f(x) for x in df["col"]]
# Iterating over two columns, use `zip`
result = [f(x, y) for x, y in zip(df["col1"], df["col2"])]
# Iterating over multiple columns - same data type
result = [f(row[0], ..., row[n]) for row in df[["col1", ...,"coln"]].to_numpy()]
# Iterating over multiple columns - differing data type
result = [f(row[0], ..., row[n]) for row in zip(df["col1"], ..., df["coln"])]
```

If you can encapsulate your business logic into a function, you can use a list comprehension that calls it. You can make arbitrarily complex things work through the simplicity and speed of raw Python code.
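As a concrete illustration of the pattern, with a made-up function `f` (the data and logic here are invented for demonstration):

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3], "col2": [10, 20, 30]})

def f(x, y):
    # stand-in for arbitrary business logic
    return x * y + 1

# the two-column list comprehension pattern from above
result = [f(x, y) for x, y in zip(df["col1"], df["col2"])]
print(result)  # [11, 41, 91]
```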

**Caveats**

List comprehensions assume that your data is easy to work with - what that means is your data types are consistent and you don't have NaNs, but this cannot always be guaranteed.

- The first one is more obvious, but when dealing with NaNs, prefer in-built pandas methods if they exist (because they have much better corner-case handling logic), or ensure your business logic includes appropriate NaN handling logic.
- When dealing with mixed data types you should iterate over `zip(df["A"], df["B"], ...)` instead of `df[["A", "B"]].to_numpy()`, as the latter implicitly upcasts data to the most common type. As an example, if A is numeric and B is string, `to_numpy()` will cast the entire array to string, which may not be what you want. Fortunately `zip`ping your columns together is the most straightforward workaround to this.
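A small demonstration of this upcasting behaviour (shown here with int/float columns rather than strings, but the principle is the same):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [0.5, 1.5]})

# to_numpy() on the mixed selection upcasts every value to float64
arr = df[["A", "B"]].to_numpy()
print(arr.dtype)  # float64 - the ints were upcast

# zip preserves the per-column types: A's values stay integral
first_a, first_b = next(zip(df["A"], df["B"]))
print(type(first_a).__name__, type(first_b).__name__)
```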

_{*Your mileage may vary for the reasons outlined in the Caveats section above.}

Let's demonstrate the difference with a simple example of adding two pandas columns `A + B`. This is a vectorizable operation, so it will be easy to contrast the performance of the methods discussed above.

Benchmarking code, for your reference. The line at the bottom measures a function written in numpandas, a style of Pandas that mixes heavily with NumPy to squeeze out maximum performance. Writing numpandas code should be avoided unless you know what you're doing. Stick to the API where you can (i.e., prefer `vec` over `vec_numpy`).
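The benchmark itself is not reproduced here, but the kernels being compared presumably look something like the following (the function names follow the text; the toy DataFrame is an assumption):

```python
import numpy as np
import pandas as pd

# toy frame standing in for the benchmark's setup (an assumption)
df = pd.DataFrame(np.random.randint(0, 100, (1000, 2)), columns=["A", "B"])

def vec(df):
    # plain pandas API: vectorized column addition
    return df["A"] + df["B"]

def list_comp(df):
    # the list comprehension alternative
    return pd.Series([x + y for x, y in zip(df["A"], df["B"])])

def vec_numpy(df):
    # "numpandas" style: drop down to the raw NumPy arrays
    return pd.Series(df["A"].to_numpy() + df["B"].to_numpy())
```

All three produce the same values; they differ only in overhead.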

I should mention, however, that it isn't always this cut and dry. Sometimes the answer to "what is the best method for an operation" is "it depends on your data". My advice is to test out different approaches on your data before settling on one.

10 Minutes to pandas, and Essential Basic Functionality - useful links that introduce you to Pandas and its library of vectorized/cythonized functions.

Enhancing Performance - A primer from the documentation on enhancing standard Pandas operations

*Are for-loops in pandas really bad? When should I care?* - a detailed writeup by me on list comprehensions and their suitability for various operations (mainly ones involving non-numeric data)

*When should I (not) want to use pandas apply() in my code?* - `apply` is slow (but not as slow as the `iter*` family). There are, however, situations where one can (or should) consider `apply` as a serious alternative, especially in some `GroupBy` operations.

_{* Pandas string methods are "vectorized" in the sense that they are specified on the series but operate on each element. The underlying mechanisms are still iterative, because string operations are inherently hard to vectorize.}

A common trend I notice from new users is to ask questions of the form "How can I iterate over my df to do X?", showing code that calls `iterrows()` inside a `for` loop. Here is why. A new user to the library who has not been introduced to the concept of vectorization will likely envision the code that solves their problem as iterating over their data to do something. Not knowing how to iterate over a DataFrame, the first thing they do is Google it and end up here, at this question. They then see the accepted answer telling them how to, and they close their eyes and run this code without ever first questioning whether iteration is the right thing to do.

The aim of this answer is to help new users understand that iteration is not necessarily the solution to every problem, and that better, faster and more idiomatic solutions could exist, and that it is worth investing time in exploring them. I'm not trying to start a war of iteration vs. vectorization, but I want new users to be informed when developing solutions to their problems with this library.

I've tested all suggested methods plus `np.array(list(map(f, x)))` with `perfplot` (a small project of mine).

Message #1: If you can use numpy's native functions, do that.

If the function you're trying to vectorize already *is* vectorized (like the `x**2` example in the original post), using that is *much* faster than anything else (note the log scale):

If you actually need vectorization, it doesn't really matter much which variant you use.

Code to reproduce the plots:

```
import numpy as np
import perfplot
import math

def f(x):
    # return math.sqrt(x)
    return np.sqrt(x)

vf = np.vectorize(f)

def array_for(x):
    return np.array([f(xi) for xi in x])

def array_map(x):
    return np.array(list(map(f, x)))

def fromiter(x):
    return np.fromiter((f(xi) for xi in x), x.dtype)

def vectorize(x):
    return np.vectorize(f)(x)

def vectorize_without_init(x):
    return vf(x)

perfplot.show(
    setup=np.random.rand,
    n_range=[2 ** k for k in range(20)],
    kernels=[f, array_for, array_map, fromiter,
             vectorize, vectorize_without_init],
    xlabel="len(x)",
)
```

TLDR; No, `for` loops are not blanket "bad", at least, not always. It is probably **more accurate to say that some vectorized operations are slower than iterating**, versus saying that iteration is faster than some vectorized operations. Knowing when and why is key to getting the most performance out of your code. In a nutshell, these are the situations where it is worth considering an alternative to vectorized pandas functions:

- When your data is small (...depending on what you're doing),
- When dealing with `object`/mixed dtypes
- When using the `str`/regex accessor functions

Let's examine these situations individually.

Pandas follows a "Convention Over Configuration" approach in its API design. This means that the same API has been fitted to cater to a broad range of data and use cases.

When a pandas function is called, the following things (among others) must be handled internally by the function, to ensure things work correctly:

- Index/axis alignment
- Handling mixed datatypes
- Handling missing data

Almost every function will have to deal with these to varying extents, and this presents an **overhead**. The overhead is less for numeric functions (for example, `Series.add`), while it is more pronounced for string functions (for example, `Series.str.replace`).

`for` loops, on the other hand, are faster than you think. What's even better is that list comprehensions (which create lists through `for` loops) are even faster, as they are optimized iterative mechanisms for list creation.

List comprehensions follow the pattern

```
[f(x) for x in seq]
```

Where `seq` is a pandas series or DataFrame column. Or, when operating over multiple columns,

```
[f(x, y) for x, y in zip(seq1, seq2)]
```

Where `seq1` and `seq2` are columns.

**Numeric Comparison**

Consider a simple boolean indexing operation. The list comprehension method has been timed against `Series.ne` (`!=`) and `query`. Here are the functions:

```
# Boolean indexing with Numeric value comparison.
df[df.A != df.B] # vectorized !=
df.query("A != B") # query (numexpr)
df[[x != y for x, y in zip(df.A, df.B)]] # list comp
```

For simplicity, I have used the `perfplot` package to run all the timeit tests in this post. The timings for the operations above are below:

The list comprehension outperforms `query` for moderately sized N, and even outperforms the vectorized not-equals comparison for tiny N. Unfortunately, the list comprehension scales linearly, so it does not offer much performance gain for larger N.

Note

It is worth mentioning that much of the benefit of list comprehensions comes from not having to worry about index alignment, but this means that if your code depends on index alignment, this approach will break. In some cases, vectorised operations over the underlying NumPy arrays can be considered as bringing in the "best of both worlds", allowing for vectorisation without all the unneeded overhead of the pandas functions. This means that you can rewrite the operation above as `df[df.A.values != df.B.values]`, which outperforms both the pandas and list comprehension equivalents:

NumPy vectorization is out of the scope of this post, but it is definitely worth considering, if performance matters.

**Value Counts**

Taking another example - this time, with another vanilla python construct that is *faster* than a for loop - `collections.Counter`. A common requirement is to compute the value counts and return the result as a dictionary. This is done with `value_counts`, `np.unique`, and `Counter`:

```
# Value Counts comparison.
ser.value_counts(sort=False).to_dict() # value_counts
dict(zip(*np.unique(ser, return_counts=True))) # np.unique
Counter(ser) # Counter
```
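On a tiny hand-made series, all three approaches produce the same mapping:

```python
import numpy as np
import pandas as pd
from collections import Counter

ser = pd.Series([1, 2, 2, 3, 3, 3])

d1 = ser.value_counts(sort=False).to_dict()          # value_counts
d2 = dict(zip(*np.unique(ser, return_counts=True)))  # np.unique
d3 = dict(Counter(ser))                              # Counter

print(d1 == d2 == d3)  # True
```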

The results are more pronounced: `Counter` wins out over both vectorized methods for a larger range of small N (~3500).

Note

More trivia (courtesy @user2357112). The `Counter` is implemented with a C accelerator, so while it still has to work with Python objects instead of the underlying C datatypes, it is still faster than a `for` loop. Python power!

Of course, the takeaway from here is that performance depends on your data and use case. The point of these examples is to convince you not to rule out these solutions as legitimate options. If these still don't give you the performance you need, there is always Cython and Numba. Let's add this test into the mix.

```
from numba import njit, prange

@njit(parallel=True)
def get_mask(x, y):
    result = [False] * len(x)
    for i in prange(len(x)):
        result[i] = x[i] != y[i]
    return np.array(result)

df[get_mask(df.A.values, df.B.values)] # numba
```

Numba offers JIT compilation of loopy python code to very powerful vectorized code. Understanding how to make numba work involves a learning curve.

**`object` dtypes**

**String-based Comparison**

Revisiting the filtering example from the first section, what if the columns being compared are strings? Consider the same 3 functions above, but with the input DataFrame cast to string.

```
# Boolean indexing with string value comparison.
df[df.A != df.B] # vectorized !=
df.query("A != B") # query (numexpr)
df[[x != y for x, y in zip(df.A, df.B)]] # list comp
```

So, what changed? The thing to note here is that **string operations are inherently difficult to vectorize.** Pandas treats strings as objects, and all operations on objects fall back to a slow, loopy implementation.

Now, because this loopy implementation is surrounded by all the overhead mentioned above, there is a constant magnitude difference between these solutions, even though they scale the same.

When it comes to operations on mutable/complex objects, there is no comparison. List comprehension outperforms all operations involving dicts and lists.

**Accessing Dictionary Value(s) by Key**

Here are timings for two operations that extract a value from a column of dictionaries: `map` and the list comprehension. The setup is in the Appendix, under the heading "Code Snippets".

```
# Dictionary value extraction.
ser.map(operator.itemgetter("value")) # map
pd.Series([x.get("value") for x in ser]) # list comprehension
```
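A self-contained version of the snippet above, with sample data mirroring the Appendix setup:

```python
import operator
import pandas as pd

# sample data mirroring the Appendix setup
ser = pd.Series([{"key": "abc", "value": 123}, {"key": "xyz", "value": 456}])

out_map = ser.map(operator.itemgetter("value"))       # map
out_comp = pd.Series([x.get("value") for x in ser])   # list comprehension

print(out_map.tolist())  # [123, 456]
```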

**Positional List Indexing**

Timings for 3 operations that extract the 0th element from a column of lists (handling exceptions): `map`, the `str` accessor, and the list comprehension:

```
# List positional indexing.
def get_0th(lst):
    try:
        return lst[0]
    # Handle empty lists and NaNs gracefully.
    except (IndexError, TypeError):
        return np.nan
```

```
ser.map(get_0th) # map
ser.str[0] # str accessor
pd.Series([x[0] if len(x) > 0 else np.nan for x in ser]) # list comp
pd.Series([get_0th(x) for x in ser]) # list comp safe
```

Note

If the index matters, you would want to do `pd.Series([...], index=ser.index)` when reconstructing the series.

**List Flattening**

A final example is flattening lists. This is another common problem, and demonstrates just how powerful pure python is here.

```
# Nested list flattening.
pd.DataFrame(ser.tolist()).stack().reset_index(drop=True) # stack
pd.Series(list(chain.from_iterable(ser.tolist()))) # itertools.chain
pd.Series([y for x in ser for y in x]) # nested list comp
```

Both `itertools.chain.from_iterable` and the nested list comprehension are pure python constructs, and scale much better than the `stack` solution.
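A self-contained run of the three flattening approaches, on invented sample data (note that the `stack` route also silently drops the NaN padding produced by ragged lists):

```python
from itertools import chain
import pandas as pd

ser = pd.Series([["a", "b", "c"], ["d", "e"], []])

flat_stack = pd.DataFrame(ser.tolist()).stack().reset_index(drop=True)  # stack
flat_chain = pd.Series(list(chain.from_iterable(ser.tolist())))         # itertools.chain
flat_comp = pd.Series([y for x in ser for y in x])                      # nested list comp

print(flat_comp.tolist())  # ['a', 'b', 'c', 'd', 'e']
```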

These timings are a strong indication of the fact that pandas is not equipped to work with mixed dtypes, and that you should probably refrain from using it to do so. Wherever possible, data should be present as scalar values (ints/floats/strings) in separate columns.

Lastly, the applicability of these solutions depends widely on your data. So, the best thing to do would be to test these operations on your data before deciding what to go with. Notice how I have not timed `apply` on these solutions, because it would skew the graph (yes, it's that slow).

**`.str` Accessor Methods**

Pandas can apply regex operations such as `str.contains`, `str.extract`, and `str.extractall`, as well as other "vectorized" string operations (such as `str.split`, `str.find`, `str.translate`, and so on) on string columns. These functions are slower than list comprehensions, and are meant to be more convenience functions than anything else.

It is usually much faster to pre-compile a regex pattern and iterate over your data with `re.compile` (also see Is it worth using Python's re.compile?). The list comp equivalent to `str.contains` looks something like this:

```
p = re.compile(...)
ser2 = pd.Series([x for x in ser if p.search(x)])
```

Or,

```
ser2 = ser[[bool(p.search(x)) for x in ser]]
```

If you need to handle NaNs, you can do something like

```
ser[[bool(p.search(x)) if pd.notnull(x) else False for x in ser]]
```

The list comp equivalent to `str.extract` (without groups) will look something like:

```
df["col2"] = [p.search(x).group(0) for x in df["col"]]
```

If you need to handle no-matches and NaNs, you can use a custom function (still faster!):

```
def matcher(x):
    m = p.search(str(x))
    if m:
        return m.group(0)
    return np.nan

df["col2"] = [matcher(x) for x in df["col"]]
```
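A self-contained run of the `matcher` approach, borrowing the pattern and sample data from the String Extraction section below:

```python
import re
import numpy as np
import pandas as pd

# pattern and data from the String Extraction example
p = re.compile(r"(?<=[A-Z])(\d{4})")

def matcher(x):
    m = p.search(str(x))
    if m:
        return m.group(0)
    return np.nan

ser = pd.Series(["foo xyz", "test A1234", "D3345 xtz"])
out = pd.Series([matcher(x) for x in ser])
print(out.tolist()[1:])  # ['1234', '3345']; the first entry is NaN
```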

The `matcher` function is very extensible. It can be fitted to return a list for each capture group, as needed. Just query the `group` or `groups` attribute of the match object.

For `str.extractall`, change `p.search` to `p.findall`.

**String Extraction**

Consider a simple filtering operation. The idea is to extract 4 digits if they are preceded by an upper case letter.

```
# Extracting strings.
p = re.compile(r"(?<=[A-Z])(\d{4})")

def matcher(x):
    m = p.search(x)
    if m:
        return m.group(0)
    return np.nan

ser.str.extract(r"(?<=[A-Z])(\d{4})", expand=False)  # str.extract
pd.Series([matcher(x) for x in ser])                 # list comprehension
```

**More Examples**

Full disclosure - I am the author (in part or whole) of these posts listed below.

As shown from the examples above, iteration shines when working with small DataFrames, mixed datatypes, and regular expressions.

The speedup you get depends on your data and your problem, so your mileage may vary. The best thing to do is to carefully run tests and see if the payout is worth the effort.

The "vectorized" functions shine in their simplicity and readability, so if performance is not critical, you should definitely prefer those.

Another side note, certain string operations deal with constraints that favour the use of NumPy. Here are two examples where careful NumPy vectorization outperforms python:

Create new column with incremental values in a faster and efficient way - Answer by Divakar

Fast punctuation removal with pandas - Answer by Paul Panzer

Additionally, sometimes just operating on the underlying arrays via `.values` as opposed to on the Series or DataFrames can offer a healthy enough speedup for most usual scenarios (see the **Note** in the **Numeric Comparison** section above). So, for example, `df[df.A.values != df.B.values]` would show instant performance boosts over `df[df.A != df.B]`. Using `.values` may not be appropriate in every situation, but it is a useful hack to know.

As mentioned above, it's up to you to decide whether these solutions are worth the trouble of implementing.

```
import perfplot
import operator
import pandas as pd
import numpy as np
import re
from collections import Counter
from itertools import chain
```

```
# Boolean indexing with Numeric value comparison.
perfplot.show(
setup=lambda n: pd.DataFrame(np.random.choice(1000, (n, 2)), columns=["A","B"]),
kernels=[
lambda df: df[df.A != df.B],
lambda df: df.query("A != B"),
lambda df: df[[x != y for x, y in zip(df.A, df.B)]],
lambda df: df[get_mask(df.A.values, df.B.values)]
],
labels=["vectorized !=", "query (numexpr)", "list comp", "numba"],
n_range=[2**k for k in range(0, 15)],
xlabel="N"
)
```

```
# Value Counts comparison.
perfplot.show(
setup=lambda n: pd.Series(np.random.choice(1000, n)),
kernels=[
lambda ser: ser.value_counts(sort=False).to_dict(),
lambda ser: dict(zip(*np.unique(ser, return_counts=True))),
lambda ser: Counter(ser),
],
labels=["value_counts", "np.unique", "Counter"],
n_range=[2**k for k in range(0, 15)],
xlabel="N",
equality_check=lambda x, y: dict(x) == dict(y)
)
```

```
# Boolean indexing with string value comparison.
perfplot.show(
setup=lambda n: pd.DataFrame(np.random.choice(1000, (n, 2)), columns=["A","B"], dtype=str),
kernels=[
lambda df: df[df.A != df.B],
lambda df: df.query("A != B"),
lambda df: df[[x != y for x, y in zip(df.A, df.B)]],
],
labels=["vectorized !=", "query (numexpr)", "list comp"],
n_range=[2**k for k in range(0, 15)],
xlabel="N",
equality_check=None
)
```

```
# Dictionary value extraction.
ser1 = pd.Series([{"key": "abc", "value": 123}, {"key": "xyz", "value": 456}])
perfplot.show(
setup=lambda n: pd.concat([ser1] * n, ignore_index=True),
kernels=[
lambda ser: ser.map(operator.itemgetter("value")),
lambda ser: pd.Series([x.get("value") for x in ser]),
],
labels=["map", "list comprehension"],
n_range=[2**k for k in range(0, 15)],
xlabel="N",
equality_check=None
)
```

```
# List positional indexing.
ser2 = pd.Series([["a", "b", "c"], [1, 2], []])
perfplot.show(
setup=lambda n: pd.concat([ser2] * n, ignore_index=True),
kernels=[
lambda ser: ser.map(get_0th),
lambda ser: ser.str[0],
lambda ser: pd.Series([x[0] if len(x) > 0 else np.nan for x in ser]),
lambda ser: pd.Series([get_0th(x) for x in ser]),
],
labels=["map", "str accessor", "list comprehension", "list comp safe"],
n_range=[2**k for k in range(0, 15)],
xlabel="N",
equality_check=None
)
```

```
# Nested list flattening.
ser3 = pd.Series([["a", "b", "c"], ["d", "e"], ["f", "g"]])
perfplot.show(
setup=lambda n: pd.concat([ser3] * n, ignore_index=True),
kernels=[
lambda ser: pd.DataFrame(ser.tolist()).stack().reset_index(drop=True),
lambda ser: pd.Series(list(chain.from_iterable(ser.tolist()))),
lambda ser: pd.Series([y for x in ser for y in x]),
],
labels=["stack", "itertools.chain", "nested list comp"],
n_range=[2**k for k in range(0, 15)],
xlabel="N",
equality_check=None
)
```

```
# Extracting strings.
ser4 = pd.Series(["foo xyz", "test A1234", "D3345 xtz"])
perfplot.show(
setup=lambda n: pd.concat([ser4] * n, ignore_index=True),
kernels=[
lambda ser: ser.str.extract(r"(?<=[A-Z])(\d{4})", expand=False),
lambda ser: pd.Series([matcher(x) for x in ser])
],
labels=["str.extract", "list comprehension"],
n_range=[2**k for k in range(0, 15)],
xlabel="N",
equality_check=None
)
```

Generally, `iterrows` should only be used in very, very specific cases. This is the general order of precedence for performance of various operations:

```
1) vectorization
2) using a custom cython routine
3) apply
a) reductions that can be performed in cython
b) iteration in python space
4) itertuples
5) iterrows
6) updating an empty frame (e.g. using loc one-row-at-a-time)
```

Using a custom Cython routine is usually too complicated, so let's skip that for now.

1) Vectorization is ALWAYS, ALWAYS the first and best choice. However, there is a small set of cases (usually involving a recurrence) which cannot be vectorized in obvious ways. Furthermore, on a smallish `DataFrame`, it may be faster to use other methods.

3) `apply` *usually* can be handled by an iterator in Cython space. This is handled internally by pandas, though it depends on what is going on inside the `apply` expression. For example, `df.apply(lambda x: np.sum(x))` will be executed pretty swiftly, though of course, `df.sum(1)` is even better. However something like `df.apply(lambda x: x["b"] + 1)` will be executed in Python space, and consequently is much slower.

4) `itertuples` does not box the data into a `Series`. It just returns the data in the form of tuples.

5) `iterrows` DOES box the data into a `Series`. Unless you really need this, use another method.

6) Updating an empty frame a-single-row-at-a-time. I have seen this method used WAY too much. It is by far the slowest. It is probably commonplace (and reasonably fast for some Python structures), but a `DataFrame` does a fair number of checks on indexing, so updating a row at a time will always be very slow. Much better to create new structures and `concat`.
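To make point 6 concrete, a sketch of the slow pattern and its replacement (toy data, assumed for illustration):

```python
import pandas as pd

rows = [{"a": i, "b": i * 2} for i in range(5)]

# slow: grow a DataFrame one row at a time via loc
df_slow = pd.DataFrame(columns=["a", "b"])
for i, row in enumerate(rows):
    df_slow.loc[i] = [row["a"], row["b"]]

# better: collect the rows first, then build the frame once
# (or build larger pieces and pd.concat them)
df_fast = pd.DataFrame(rows)

print(df_fast["b"].tolist())  # [0, 2, 4, 6, 8]
```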
