# Python | NumPy matrix.astype()


With the `numpy.matrix.astype()` method we can convert the type of a matrix. Be aware that the conversion can lose data: converting float to int, for example, drops the fractional part of every element.

Syntax: `matrix.astype(dtype)`

Return: the matrix after type conversion.

Example #1:
In this example, we convert a float matrix to an int matrix using `matrix.astype()`.

``````# import the required module
import numpy as np

# make a matrix with NumPy
gfg = np.matrix([1.2, 2.8, 3.1, 4.5])

# apply the matrix.astype() method
geeks = gfg.astype(int)

print(geeks)
``````

Output:

``````[[1 2 3 4]]
``````

Example #2:

``````# import the required module
import numpy as np

# make a matrix with NumPy
gfg = np.matrix('[1.1, 2, 3.5; 4.2, 5.5, 6; 7, 8, 9.3]')

# apply the matrix.astype() method
geeks = gfg.astype(int)

print(geeks)
``````

Output:

``````[[1 2 3]
 [4 5 6]
 [7 8 9]]
``````
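Note that `astype(int)` truncates toward zero rather than rounding. A minimal sketch of the difference, rounding first with `np.rint` when rounding is what you actually want:

```python
import numpy as np

gfg = np.matrix([[1.2, 2.8, 3.1, 4.5]])

truncated = gfg.astype(int)         # fractional parts are simply dropped
rounded = np.rint(gfg).astype(int)  # round first (note: .5 rounds to the nearest even value)

print(truncated)  # [[1 2 3 4]]
print(rounded)    # [[1 3 3 4]]
```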

## Python | NumPy matrix.astype(): StackOverflow Questions

You have four main options for converting types in pandas:

1. `to_numeric()` - provides functionality to safely convert non-numeric types (e.g. strings) to a suitable numeric type. (See also `to_datetime()` and `to_timedelta()`.)

2. `astype()` - convert (almost) any type to (almost) any other type (even if it's not necessarily sensible to do so). Also allows you to convert to categorical types (very useful).

3. `infer_objects()` - a utility method to convert object columns holding Python objects to a pandas type if possible.

4. `convert_dtypes()` - convert DataFrame columns to the "best possible" dtype that supports `pd.NA` (pandas' object to indicate a missing value).

Read on for more detailed explanations and usage of each of these methods.

# 1. `to_numeric()`

The best way to convert one or more columns of a DataFrame to numeric values is to use `pandas.to_numeric()`.

This function will try to change non-numeric objects (such as strings) into integers or floating point numbers as appropriate.

## Basic usage

The input to `to_numeric()` is a Series or a single column of a DataFrame.

``````>>> s = pd.Series(["8", 6, "7.5", 3, "0.9"]) # mixed string and numeric values
>>> s
0      8
1      6
2    7.5
3      3
4    0.9
dtype: object

>>> pd.to_numeric(s) # convert everything to float values
0    8.0
1    6.0
2    7.5
3    3.0
4    0.9
dtype: float64
``````

As you can see, a new Series is returned. Remember to assign this output to a variable or column name to continue using it:

``````# convert Series
my_series = pd.to_numeric(my_series)

# convert column "a" of a DataFrame
df["a"] = pd.to_numeric(df["a"])
``````

You can also use it to convert multiple columns of a DataFrame via the `apply()` method:

``````# convert all columns of DataFrame
df = df.apply(pd.to_numeric) # convert all columns of DataFrame

# convert just columns "a" and "b"
df[["a", "b"]] = df[["a", "b"]].apply(pd.to_numeric)
``````

As long as your values can all be converted, that's probably all you need.

## Error handling

But what if some values can't be converted to a numeric type?

`to_numeric()` also takes an `errors` keyword argument that allows you to force non-numeric values to be `NaN`, or simply ignore columns containing these values.

Here"s an example using a Series of strings `s` which has the object dtype:

``````>>> s = pd.Series(["1", "2", "4.7", "pandas", "10"])
>>> s
0         1
1         2
2       4.7
3    pandas
4        10
dtype: object
``````

The default behaviour is to raise if it can't convert a value. In this case, it can't cope with the string "pandas":

``````>>> pd.to_numeric(s) # or pd.to_numeric(s, errors="raise")
ValueError: Unable to parse string
``````

Rather than fail, we might want "pandas" to be considered a missing/bad numeric value. We can coerce invalid values to `NaN` as follows using the `errors` keyword argument:

``````>>> pd.to_numeric(s, errors="coerce")
0     1.0
1     2.0
2     4.7
3     NaN
4    10.0
dtype: float64
``````

The third option for `errors` is just to ignore the operation if an invalid value is encountered:

``````>>> pd.to_numeric(s, errors="ignore")
# the original Series is returned untouched
``````

This last option is particularly useful when you want to convert your entire DataFrame but don't know which of your columns can be converted reliably to a numeric type. In that case, just write:

``````df.apply(pd.to_numeric, errors="ignore")
``````

The function will be applied to each column of the DataFrame. Columns that can be converted to a numeric type will be converted, while columns that cannot (e.g. they contain non-digit strings or dates) will be left alone.
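A minimal sketch of that per-column behaviour, hand-rolling the try/except (which also works on newer pandas versions, where `errors="ignore"` is deprecated):

```python
import pandas as pd

df = pd.DataFrame({"a": ["1", "2", "3"],        # every value parses as a number
                   "b": ["4.5", "oops", "6"]})  # contains a non-numeric string

def to_numeric_if_possible(col):
    # convert the column only if every value parses; otherwise leave it unchanged
    try:
        return pd.to_numeric(col)
    except (ValueError, TypeError):
        return col

out = df.apply(to_numeric_if_possible)
print(out.dtypes)  # a -> int64, b -> object
```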

## Downcasting

By default, conversion with `to_numeric()` will give you either an `int64` or `float64` dtype (or whatever integer width is native to your platform).

That's usually what you want, but what if you wanted to save some memory and use a more compact dtype, like `float32` or `int8`?

`to_numeric()` gives you the option to downcast to "integer", "signed", "unsigned", or "float". Here's an example for a simple series `s` of integer type:

``````>>> s = pd.Series([1, 2, -7])
>>> s
0    1
1    2
2   -7
dtype: int64
``````

Downcasting to "integer" uses the smallest possible integer that can hold the values:

``````>>> pd.to_numeric(s, downcast="integer")
0    1
1    2
2   -7
dtype: int8
``````

Downcasting to "float" similarly picks a smaller than normal floating type:

``````>>> pd.to_numeric(s, downcast="float")
0    1.0
1    2.0
2   -7.0
dtype: float32
``````
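To see the memory effect described above, a small sketch comparing the default dtype with the downcast result:

```python
import pandas as pd

s = pd.Series(range(1000))                    # int64 by default
small = pd.to_numeric(s, downcast="integer")  # these values fit in int16

print(s.dtype, s.memory_usage(index=False))      # 8 bytes per element
print(small.dtype, small.memory_usage(index=False))  # 2 bytes per element
```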

# 2. `astype()`

The `astype()` method enables you to be explicit about the dtype you want your DataFrame or Series to have. It's very versatile in that you can try to go from almost any type to almost any other.

## Basic usage

Just pick a type: you can use a NumPy dtype (e.g. `np.int16`), some Python types (e.g. bool), or pandas-specific types (like the categorical dtype).

Call the method on the object you want to convert and `astype()` will try and convert it for you:

``````# convert all DataFrame columns to the int64 dtype
df = df.astype(int)

# convert column "a" to int64 dtype and "b" to complex type
df = df.astype({"a": int, "b": complex})

# convert Series to float16 type
s = s.astype(np.float16)

# convert Series to Python strings
s = s.astype(str)

# convert Series to categorical type - see docs for more details
s = s.astype("category")
``````

Notice I said "try" - if `astype()` does not know how to convert a value in the Series or DataFrame, it will raise an error. For example, if you have a `NaN` or `inf` value, you'll get an error trying to convert it to an integer.

As of pandas 0.20.0, this error can be suppressed by passing `errors="ignore"`. Your original object will be returned untouched.
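For example, a `NaN` cannot be represented as a plain integer; a sketch of the failure, plus pandas' nullable `Int64` dtype as one way around it:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan])

try:
    s.astype(int)  # raises: NaN has no int64 representation
except (ValueError, TypeError) as exc:
    print("conversion failed:", exc)

# the nullable extension dtype can hold the missing value instead
print(s.astype("Int64"))  # 1, 2, <NA>
```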

## Be careful

`astype()` is powerful, but it will sometimes convert values "incorrectly". For example:

``````>>> s = pd.Series([1, 2, -7])
>>> s
0    1
1    2
2   -7
dtype: int64
``````

These are small integers, so how about converting to an unsigned 8-bit type to save memory?

``````>>> s.astype(np.uint8)
0      1
1      2
2    249
dtype: uint8
``````

The conversion worked, but the -7 was wrapped round to become 249 (i.e. 256 - 7)!

Trying to downcast using `pd.to_numeric(s, downcast="unsigned")` instead could help prevent this error.
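A sketch of the safer behaviour: `to_numeric` only downcasts when the values actually fit, so the negative value keeps the series at `int64` instead of wrapping around:

```python
import pandas as pd

s = pd.Series([1, 2, -7])

out = pd.to_numeric(s, downcast="unsigned")
print(out.dtype)     # int64 -- no unsigned type can hold -7, so no downcast happens
print(out.tolist())  # [1, 2, -7] -- the values are unchanged
```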

# 3. `infer_objects()`

Version 0.21.0 of pandas introduced the method `infer_objects()` for converting columns of a DataFrame that have an object datatype to a more specific type (soft conversions).

For example, here"s a DataFrame with two columns of object type. One holds actual integers and the other holds strings representing integers:

``````>>> df = pd.DataFrame({"a": [7, 1, 5], "b": ["3","2","1"]}, dtype="object")
>>> df.dtypes
a    object
b    object
dtype: object
``````

Using `infer_objects()`, you can change the type of column "a" to int64:

``````>>> df = df.infer_objects()
>>> df.dtypes
a     int64
b    object
dtype: object
``````

Column "b" has been left alone since its values were strings, not integers. If you wanted to try and force the conversion of both columns to an integer type, you could use `df.astype(int)` instead.

# 4. `convert_dtypes()`

Version 1.0 and above includes a method `convert_dtypes()` to convert Series and DataFrame columns to the best possible dtype that supports the `pd.NA` missing value.

Here "best possible" means the type most suited to hold the values. For example, this a pandas integer type if all of the values are integers (or missing values): an object column of Python integer objects is converted to `Int64`, a column of NumPy `int32` values will become the pandas dtype `Int32`.

With our `object` DataFrame `df`, we get the following result:

``````>>> df.convert_dtypes().dtypes
a     Int64
b    string
dtype: object
``````

Since column "a" held integer values, it was converted to the `Int64` type (which is capable of holding missing values, unlike `int64`).

Column "b" contained string objects, so was changed to pandas" `string` dtype.

By default, this method will infer the type from object values in each column. We can change this by passing `infer_objects=False`:

``````>>> df.convert_dtypes(infer_objects=False).dtypes
a    object
b    string
dtype: object
``````

Now column "a" remained an object column: pandas knows it can be described as an "integer" column (internally it ran `infer_dtype`) but didn"t infer exactly what dtype of integer it should have so did not convert it. Column "b" was again converted to "string" dtype as it was recognised as holding "string" values.

### How do I get row counts per group with pandas groupby?

The simplest way to get row counts per group is by calling `.size()`, which returns a `Series`:

``````df.groupby(["col1","col2"]).size()
``````

Usually you want this result as a `DataFrame` (instead of a `Series`) so you can do:

``````df.groupby(["col1", "col2"]).size().reset_index(name="counts")
``````

If you want to find out how to calculate the row counts and other statistics for each group continue reading below.

## Detailed example:

Consider the following example dataframe:

``````In : df
Out:
col1 col2  col3  col4  col5  col6
0    A    B  0.20 -0.61 -0.49  1.49
1    A    B -1.53 -1.01 -0.39  1.82
2    A    B -0.44  0.27  0.72  0.11
3    A    B  0.28 -1.32  0.38  0.18
4    C    D  0.12  0.59  0.81  0.66
5    C    D -0.13 -1.65 -1.64  0.50
6    C    D -1.42 -0.11 -0.18 -0.44
7    E    F -0.00  1.42 -0.26  1.17
8    E    F  0.91 -0.47  1.35 -0.34
9    G    H  1.48 -0.63 -1.14  0.17
``````

First let"s use `.size()` to get the row counts:

``````In : df.groupby(["col1", "col2"]).size()
Out:
col1  col2
A     B       4
C     D       3
E     F       2
G     H       1
dtype: int64
``````

Then let"s use `.size().reset_index(name="counts")` to get the row counts:

``````In : df.groupby(["col1", "col2"]).size().reset_index(name="counts")
Out:
col1 col2  counts
0    A    B       4
1    C    D       3
2    E    F       2
3    G    H       1
``````

### Including results for more statistics

When you want to calculate statistics on grouped data, it usually looks like this:

``````In : (df
...: .groupby(["col1", "col2"])
...: .agg({
...:     "col3": ["mean", "count"],
...:     "col4": ["median", "min", "count"]
...: }))
Out:
            col4                  col3
          median   min count      mean count
col1 col2
A    B    -0.810 -1.32     4 -0.372500     4
C    D    -0.110 -1.65     3 -0.476667     3
E    F     0.475 -0.47     2  0.455000     2
G    H    -0.630 -0.63     1  1.480000     1
``````

The result above is a little annoying to deal with because of the nested column labels, and also because row counts are on a per column basis.

To gain more control over the output I usually split the statistics into individual aggregations that I then combine using `join`. It looks like this:

``````In : gb = df.groupby(["col1", "col2"])
...: counts = gb.size().to_frame(name="counts")
...: (counts
...:  .join(gb.agg({"col3": "mean"}).rename(columns={"col3": "col3_mean"}))
...:  .join(gb.agg({"col4": "median"}).rename(columns={"col4": "col4_median"}))
...:  .join(gb.agg({"col4": "min"}).rename(columns={"col4": "col4_min"}))
...:  .reset_index()
...: )
...:
Out:
col1 col2  counts  col3_mean  col4_median  col4_min
0    A    B       4  -0.372500       -0.810     -1.32
1    C    D       3  -0.476667       -0.110     -1.65
2    E    F       2   0.455000        0.475     -0.47
3    G    H       1   1.480000       -0.630     -0.63
``````

### Footnotes

The code used to generate the test data is shown below:

``````In : import numpy as np
...: import pandas as pd
...:
...: keys = np.array([
...:         ["A", "B"],
...:         ["A", "B"],
...:         ["A", "B"],
...:         ["A", "B"],
...:         ["C", "D"],
...:         ["C", "D"],
...:         ["C", "D"],
...:         ["E", "F"],
...:         ["E", "F"],
...:         ["G", "H"]
...:         ])
...:
...: df = pd.DataFrame(
...:     np.hstack([keys,np.random.randn(10,4).round(2)]),
...:     columns = ["col1", "col2", "col3", "col4", "col5", "col6"]
...: )
...:
...: df[["col3", "col4", "col5", "col6"]] = df[["col3", "col4", "col5", "col6"]].astype(float)
...:
``````

Disclaimer:

If some of the columns that you are aggregating have null values, then you really want to be looking at the group row counts as an independent aggregation for each column. Otherwise you may be misled as to how many records are actually being used to calculate things like the mean because pandas will drop `NaN` entries in the mean calculation without telling you about it.
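A small sketch of that pitfall, with a hypothetical column `x` containing a missing value:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"g": ["A", "A", "B", "B"],
                   "x": [1.0, np.nan, 3.0, 4.0]})

gb = df.groupby("g")
print(gb.size())        # A: 2, B: 2  -- rows per group
print(gb["x"].count())  # A: 1, B: 2  -- non-null values per group
print(gb["x"].mean())   # A's mean is silently computed from a single value
```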

### How do I select rows from a DataFrame based on column values?

There are several ways to select rows from a Pandas dataframe:

1. Boolean indexing (`df[df["col"] == value]`)
2. Positional indexing (`df.iloc[...]`)
3. Label indexing (`df.xs(...)`)
4. `df.query(...)` API

Below I show you examples of each, with advice on when to use certain techniques. Assume our criterion is column `"A"` == `"foo"`.

(Note on performance: For each base type, we can keep things simple by using the Pandas API or we can venture outside the API, usually into NumPy, and speed things up.)

Setup

The first thing we"ll need is to identify a condition that will act as our criterion for selecting rows. We"ll start with the OP"s case `column_name == some_value`, and include some other common use cases.

Borrowing from @unutbu:

``````import pandas as pd, numpy as np

df = pd.DataFrame({"A": "foo bar foo bar foo bar foo foo".split(),
"B": "one one two three two two one three".split(),
"C": np.arange(8), "D": np.arange(8) * 2})
``````

# 1. Boolean indexing

Boolean indexing requires finding the true value of each row's `"A"` column being equal to `"foo"`, then using those truth values to identify which rows to keep. Typically, we'd name this series, an array of truth values, `mask`. We'll do so here as well.

``````mask = df["A"] == "foo"
``````

We can then use this mask to slice or index the data frame

``````df[mask]

A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14
``````

This is one of the simplest ways to accomplish this task and if performance or intuitiveness isn't an issue, this should be your chosen method. However, if performance is a concern, then you might want to consider an alternative way of creating the `mask`.

# 2. Positional indexing

Positional indexing (`df.iloc[...]`) has its use cases, but this isn"t one of them. In order to identify where to slice, we first need to perform the same boolean analysis we did above. This leaves us performing one extra step to accomplish the same task.

``````mask = df["A"] == "foo"
pos = np.flatnonzero(mask)
df.iloc[pos]

A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14
``````

# 3. Label indexing

Label indexing can be very handy, but in this case, we are again doing more work for no benefit

``````df.set_index("A", append=True, drop=False).xs("foo", level=1)

A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14
``````

# 4. `df.query()` API

`pd.DataFrame.query` is a very elegant/intuitive way to perform this task, but is often slower. However, if you pay attention to the timings below, for large data, the query is very efficient. More so than the standard approach and of similar magnitude as my best suggestion.

``````df.query('A == "foo"')

A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14
``````

My preference is to use the `Boolean` `mask`

Actual improvements can be made by modifying how we create our `Boolean` `mask`.

`mask` alternative 1 Use the underlying NumPy array and forgo the overhead of creating another `pd.Series`

``````mask = df["A"].values == "foo"
``````

I"ll show more complete time tests at the end, but just take a look at the performance gains we get using the sample data frame. First, we look at the difference in creating the `mask`

``````%timeit mask = df["A"].values == "foo"
%timeit mask = df["A"] == "foo"

5.84 µs ± 195 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
166 µs ± 4.45 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
``````

Evaluating the `mask` with the NumPy array is ~30 times faster. This is partly due to NumPy evaluation often being faster. It is also partly due to the lack of overhead necessary to build an index and a corresponding `pd.Series` object.

Next, we'll look at the timing for slicing with one `mask` versus the other.

``````mask = df["A"].values == "foo"
%timeit df[mask]

mask = df["A"] == "foo"
%timeit df[mask]

219 µs ± 12.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
239 µs ± 7.03 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
``````

The performance gains aren"t as pronounced. We"ll see if this holds up over more robust testing.

`mask` alternative 2 We could have reconstructed the data frame as well. There is a big caveat when reconstructing a dataframe: you must take care of the `dtypes` when doing so!

Instead of `df[mask]` we will do this

``````pd.DataFrame(df.values[mask], df.index[mask], df.columns).astype(df.dtypes)
``````

If the data frame is of mixed type, which our example is, then when we get `df.values` the resulting array is of `dtype` `object` and consequently, all columns of the new data frame will be of `dtype` `object`. Thus requiring the `astype(df.dtypes)` and killing any potential performance gains.

``````%timeit df[mask]
%timeit pd.DataFrame(df.values[mask], df.index[mask], df.columns).astype(df.dtypes)

216 µs ± 10.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.43 ms ± 39.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
``````

However, if the data frame is not of mixed type, this is a very useful way to do it.

Given

``````np.random.seed([3,1415])
d1 = pd.DataFrame(np.random.randint(10, size=(10, 5)), columns=list("ABCDE"))

d1

A  B  C  D  E
0  0  2  7  3  8
1  7  0  6  8  6
2  0  2  0  4  9
3  7  3  2  4  3
4  3  6  7  7  4
5  5  3  7  5  9
6  8  7  6  4  7
7  6  2  6  6  5
8  2  8  7  5  8
9  4  7  6  1  5
``````

``````%%timeit
mask = d1["A"].values == 7
d1[mask]

179 µs ± 8.73 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
``````

Versus

``````%%timeit
mask = d1["A"].values == 7
pd.DataFrame(d1.values[mask], d1.index[mask], d1.columns)

87 µs ± 5.12 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
``````

We cut the time in half.

`mask` alternative 3

@unutbu also shows us how to use `pd.Series.isin` to account for each element of `df["A"]` being in a set of values. This evaluates to the same thing if our set of values is a set of one value, namely `"foo"`. But it also generalizes to include larger sets of values if needed. Turns out, this is still pretty fast even though it is a more general solution. The only real loss is in intuitiveness for those not familiar with the concept.

``````mask = df["A"].isin(["foo"])
df[mask]

A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14
``````
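For instance, using the same setup DataFrame, `isin` with a larger set of values (here checking column "B" against two values) is a one-liner:

```python
import pandas as pd

df = pd.DataFrame({"A": "foo bar foo bar foo bar foo foo".split(),
                   "B": "one one two three two two one three".split()})

mask = df["B"].isin(["one", "three"])
print(df[mask])  # the five rows whose B value is "one" or "three"
```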

However, as before, we can utilize NumPy to improve performance while sacrificing virtually nothing. We'll use `np.in1d`:

``````mask = np.in1d(df["A"].values, ["foo"])
df[mask]

A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14
``````

Timing

I"ll include other concepts mentioned in other posts as well for reference.

Code Below

Each column in this table represents a different length data frame over which we test each function. Each column shows relative time taken, with the fastest function given a base index of `1.0`.

``````res.div(res.min())

10        30        100       300       1000      3000      10000     30000
mask_standard         2.156872  1.850663  2.034149  2.166312  2.164541  3.090372  2.981326  3.131151
mask_standard_loc     1.879035  1.782366  1.988823  2.338112  2.361391  3.036131  2.998112  2.990103
mask_with_values      1.010166  1.000000  1.005113  1.026363  1.028698  1.293741  1.007824  1.016919
mask_with_values_loc  1.196843  1.300228  1.000000  1.000000  1.038989  1.219233  1.037020  1.000000
query                 4.997304  4.765554  5.934096  4.500559  2.997924  2.397013  1.680447  1.398190
xs_label              4.124597  4.272363  5.596152  4.295331  4.676591  5.710680  6.032809  8.950255
mask_with_isin        1.674055  1.679935  1.847972  1.724183  1.345111  1.405231  1.253554  1.264760
mask_with_in1d        1.000000  1.083807  1.220493  1.101929  1.000000  1.000000  1.000000  1.144175
``````

You"ll notice that the fastest times seem to be shared between `mask_with_values` and `mask_with_in1d`.

``````res.T.plot(loglog=True)
``````

Functions

``````def mask_standard(df):
    return df[df["A"] == "foo"]

def mask_standard_loc(df):
    return df.loc[df["A"] == "foo"]

def mask_with_values(df):
    return df[df["A"].values == "foo"]

def mask_with_values_loc(df):
    return df.loc[df["A"].values == "foo"]

def query(df):
    return df.query('A == "foo"')

def xs_label(df):
    return df.set_index("A", append=True, drop=False).xs("foo", level=-1)

def mask_with_isin(df):
    return df[df["A"].isin(["foo"])]

def mask_with_in1d(df):
    return df[np.in1d(df["A"].values, ["foo"])]
``````

Testing

``````res = pd.DataFrame(
index=[
    "mask_standard", "mask_standard_loc",
    "mask_with_values", "mask_with_values_loc",
    "query", "xs_label", "mask_with_isin", "mask_with_in1d"
],
columns=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
dtype=float
)

from timeit import timeit

for j in res.columns:
    d = pd.concat([df] * j, ignore_index=True)
    for i in res.index:
        stmt = "{}(d)".format(i)
        setp = "from __main__ import d, {}".format(i)
        res.at[i, j] = timeit(stmt, setp, number=50)
``````

Special Timing

Looking at the special case when we have a single non-object `dtype` for the entire data frame.

Code Below

``````spec.div(spec.min())

10        30        100       300       1000      3000      10000     30000
mask_with_values  1.009030  1.000000  1.194276  1.000000  1.236892  1.095343  1.000000  1.000000
mask_with_in1d    1.104638  1.094524  1.156930  1.072094  1.000000  1.000000  1.040043  1.027100
reconstruct       1.000000  1.142838  1.000000  1.355440  1.650270  2.222181  2.294913  3.406735
``````

Turns out, reconstruction isn"t worth it past a few hundred rows.

``````spec.T.plot(loglog=True)
``````

Functions

``````np.random.seed([3,1415])
d1 = pd.DataFrame(np.random.randint(10, size=(10, 5)), columns=list("ABCDE"))

def reconstruct(df):
    v = df.values
    mask = df["A"].values == 7  # criterion value for the single-dtype test
    return pd.DataFrame(v[mask], df.index[mask], df.columns)

spec = pd.DataFrame(
    index=["mask_with_values", "mask_with_in1d", "reconstruct"],
    columns=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
    dtype=float
)
``````

Testing

``````for j in spec.columns:
    d = pd.concat([d1] * j, ignore_index=True)
    for i in spec.index:
        stmt = "{}(d)".format(i)
        setp = "from __main__ import d, {}".format(i)
        spec.at[i, j] = timeit(stmt, setp, number=50)
``````

### How do I combine two columns of text in a DataFrame?

## Small data sets (< 150 rows)

``````["".join(i) for i in zip(df["Year"].map(str),df["quarter"])]
``````

or slightly slower but more compact:

``````df.Year.str.cat(df.quarter)
``````

## Larger data sets (> 150 rows)

``````df["Year"].astype(str) + df["quarter"]
``````
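Putting that together as a minimal, self-contained sketch (assuming `Year` is an integer column, so it must be cast to `str` first):

```python
import pandas as pd

df = pd.DataFrame({"Year": [2014, 2015], "quarter": ["q1", "q2"]})

# Year is an integer column, so cast to str before concatenating
df["period"] = df["Year"].astype(str) + df["quarter"]
print(df["period"].tolist())  # ['2014q1', '2015q2']
```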

UPDATE: timing graph, Pandas 0.23.4. Let's test it on a 200K-row DataFrame:

``````In : df
Out:
Year quarter
0  2014      q1
1  2015      q2

In : df = pd.concat([df] * 10**5)

In : df.shape
Out: (200000, 2)
``````

UPDATE: new timings using Pandas 0.19.0

Timing without CPU/GPU optimization (sorted from fastest to slowest):

``````In : %timeit df["Year"].astype(str) + df["quarter"]
10 loops, best of 3: 131 ms per loop

In : %timeit df["Year"].map(str) + df["quarter"]
10 loops, best of 3: 161 ms per loop

In : %timeit df.Year.str.cat(df.quarter)
10 loops, best of 3: 189 ms per loop

In : %timeit df.loc[:, ["Year","quarter"]].astype(str).sum(axis=1)
1 loop, best of 3: 567 ms per loop

In : %timeit df[["Year","quarter"]].astype(str).sum(axis=1)
1 loop, best of 3: 584 ms per loop

In : %timeit df[["Year","quarter"]].apply(lambda x : "{}{}".format(x,x), axis=1)
1 loop, best of 3: 24.7 s per loop
``````

Timing using CPU/GPU optimization:

``````In : %timeit df["Year"].astype(str) + df["quarter"]
10 loops, best of 3: 53.3 ms per loop

In : %timeit df["Year"].map(str) + df["quarter"]
10 loops, best of 3: 65.5 ms per loop

In : %timeit df.Year.str.cat(df.quarter)
10 loops, best of 3: 79.9 ms per loop

In : %timeit df.loc[:, ["Year","quarter"]].astype(str).sum(axis=1)
1 loop, best of 3: 230 ms per loop

In : %timeit df[["Year","quarter"]].astype(str).sum(axis=1)
1 loop, best of 3: 230 ms per loop

In : %timeit df[["Year","quarter"]].apply(lambda x : "{}{}".format(x,x), axis=1)
1 loop, best of 3: 9.38 s per loop
``````

### How do I replace values in a column based on a condition?

You need to select that column:

``````In :
df.loc[df["First Season"] > 1990, "First Season"] = 1
df

Out:
Team  First Season  Total Games
0      Dallas Cowboys          1960          894
1       Chicago Bears          1920         1357
2   Green Bay Packers          1921         1339
3      Miami Dolphins          1966          792
4    Baltimore Ravens             1          326
5  San Franciso 49ers          1950         1003
``````

So the syntax here is:

``````df.loc[<mask (the boolean labels to index)>, <optional column(s)>]
``````

You can check the docs and also the 10 minutes to pandas guide, which shows the semantics.

EDIT

If you want to generate a boolean indicator, then you can just use the boolean condition to generate a boolean Series and cast the dtype to `int`; this will convert `True` and `False` to `1` and `0` respectively:

``````In :
df["First Season"] = (df["First Season"] > 1990).astype(int)
df

Out:
Team  First Season  Total Games
0      Dallas Cowboys             0          894
1       Chicago Bears             0         1357
2   Green Bay Packers             0         1339
3      Miami Dolphins             0          792
4    Baltimore Ravens             1          326
5  San Franciso 49ers             0         1003
``````

### How do I select by partial string from a pandas DataFrame?

This post is meant for readers who want to

• search for a substring in a string column (the simplest case)
• search for multiple substrings (similar to `isin`)
• match a whole word from text (e.g., "blue" should match "the sky is blue" but not "bluejay")
• match multiple whole words
• understand the reason behind "ValueError: cannot index with vector containing NA / NaN values"

...and would like to know more about what methods should be preferred over others.

(P.S.: I"ve seen a lot of questions on similar topics, I thought it would be good to leave this here.)

Friendly disclaimer, this is post is long.

### Basic Substring Search

``````# setup
df1 = pd.DataFrame({"col": ["foo", "foobar", "bar", "baz"]})
df1

col
0     foo
1  foobar
2     bar
3     baz
``````

`str.contains` can be used to perform either substring searches or regex-based search. The search defaults to regex-based unless you explicitly disable it.

Here is an example of regex-based search,

``````# find rows in `df1` which contain "foo" followed by something
df1[df1["col"].str.contains(r"foo(?!\$)")]

col
1  foobar
``````

Sometimes regex search is not required, so specify `regex=False` to disable it.

``````#select all rows containing "foo"
df1[df1["col"].str.contains("foo", regex=False)]
# same as df1[df1["col"].str.contains("foo")] but faster.

col
0     foo
1  foobar
``````

Performance wise, regex search is slower than substring search:

``````df2 = pd.concat([df1] * 1000, ignore_index=True)

%timeit df2[df2["col"].str.contains("foo")]
%timeit df2[df2["col"].str.contains("foo", regex=False)]

6.31 ms ± 126 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.8 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
``````

Avoid using regex-based search if you don't need it.

Addressing `ValueError`s
Sometimes, performing a substring search and filtering on the result will result in

``````ValueError: cannot index with vector containing NA / NaN values
``````

This is usually because of mixed data or NaNs in your object column,

``````s = pd.Series(["foo", "foobar", np.nan, "bar", "baz", 123])
s.str.contains("foo|bar")

0     True
1     True
2      NaN
3     True
4    False
5      NaN
dtype: object

s[s.str.contains("foo|bar")]
# ---------------------------------------------------------------------------
# ValueError                                Traceback (most recent call last)
``````

Anything that is not a string cannot have string methods applied on it, so the result is NaN (naturally). In this case, specify `na=False` to ignore non-string data,

``````s.str.contains("foo|bar", na=False)

0     True
1     True
2    False
3     True
4    False
5    False
dtype: bool
``````

How do I apply this to multiple columns at once?
The answer is in the question. Use `DataFrame.apply`:

# `apply` with the default axis calls the lambda once per column
df.apply(lambda col: col.str.contains("foo|bar", na=False))

A      B
0   True   True
1   True  False
2  False   True
3   True  False
4  False  False
5  False  False
``````

All of the solutions below can be "applied" to multiple columns using the column-wise `apply` method (which is OK in my book, as long as you don't have too many columns).

If you have a DataFrame with mixed columns and want to select only the object/string columns, take a look at `select_dtypes`.
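A small sketch of that approach, restricting the search to the object columns of a hypothetical mixed DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"text": ["foo", "bar baz", "qux"],
                   "num": [1, 2, 3]})

# run the substring search only on object (string) columns
obj_cols = df.select_dtypes(include="object")
hits = obj_cols.apply(lambda col: col.str.contains("ba", na=False))
print(hits)
```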

### Multiple Substring Search

This is most easily achieved through a regex search using the regex OR pipe.

``````# Slightly modified example.
df4 = pd.DataFrame({"col": ["foo abc", "foobar xyz", "bar32", "baz 45"]})
df4

col
0     foo abc
1  foobar xyz
2       bar32
3      baz 45

df4[df4["col"].str.contains(r"foo|baz")]

col
0     foo abc
1  foobar xyz
3      baz 45
``````

You can also create a list of terms, then join them:

``````terms = ["foo", "baz"]
df4[df4["col"].str.contains("|".join(terms))]

col
0     foo abc
1  foobar xyz
3      baz 45
``````

Sometimes, it is wise to escape your terms in case they have characters that can be interpreted as regex metacharacters. If your terms contain any of the following characters...

``````. ^ \$ * + ? { } [ ]  | ( )
``````

Then, you"ll need to use `re.escape` to escape them:

``````import re
df4[df4["col"].str.contains("|".join(map(re.escape, terms)))]

col
0     foo abc
1  foobar xyz
3      baz 45
``````

`re.escape` has the effect of escaping the special characters so they're treated literally.

``````re.escape(r".foo^")
# "\.foo\^"
``````

### Matching Entire Word(s)

By default, the substring search searches for the specified substring/pattern regardless of whether it is a full word or not. To only match full words, we will need to make use of regular expressions here; in particular, our pattern will need to specify word boundaries (`\b`).

For example,

``````df3 = pd.DataFrame({"col": ["the sky is blue", "bluejay by the window"]})
df3

col
0        the sky is blue
1  bluejay by the window

``````

Now consider,

``````df3[df3["col"].str.contains("blue")]

col
0        the sky is blue
1  bluejay by the window
``````

v/s

``````df3[df3["col"].str.contains(r"blue")]

col
0  the sky is blue
``````

### Multiple Whole Word Search

Similar to the above, except we add a word boundary (`\b`) to the joined pattern.

``````p = r"(?:{})".format("|".join(map(re.escape, terms)))
df4[df4["col"].str.contains(p)]

col
0  foo abc
3   baz 45
``````

Where `p` looks like this,

``````p
# "\b(?:foo|baz)\b"
``````

### A Great Alternative: Use List Comprehensions!

Because you can! And you should! List comprehensions are usually a little faster than string methods, because string methods are hard to vectorise and usually have loopy implementations.

Instead of

``````df1[df1["col"].str.contains("foo", regex=False)]
``````

Use the `in` operator inside a list comp,

``````df1[["foo" in x for x in df1["col"]]]

col
0  foo abc
1   foobar
``````

For regex-based searches, instead of

``````regex_pattern = r"foo(?!$)"
df1[df1["col"].str.contains(regex_pattern)]
``````

Use `re.compile` (to cache your regex) + `Pattern.search` inside a list comp,

``````p = re.compile(regex_pattern, flags=re.IGNORECASE)
df1[[bool(p.search(x)) for x in df1["col"]]]

col
1  foobar
``````

If "col" has NaNs, then instead of

``````df1[df1["col"].str.contains(regex_pattern, na=False)]
``````

Use,

``````def try_search(p, x):
    try:
        return bool(p.search(x))
    except TypeError:
        return False

p = re.compile(regex_pattern)
df1[[try_search(p, x) for x in df1["col"]]]

col
1  foobar

``````

### More Options for Partial String Matching: `np.char.find`, `np.vectorize`, `DataFrame.query`.

In addition to `str.contains` and list comprehensions, you can also use the following alternatives.

`np.char.find`
Supports substring searches (read: no regex) only.

``````df4[np.char.find(df4["col"].values.astype(str), "foo") > -1]

col
0     foo abc
1  foobar xyz
``````

`np.vectorize`
This is a wrapper around a loop, but with less overhead than most pandas `str` methods.

``````f = np.vectorize(lambda haystack, needle: needle in haystack)
f(df1["col"], "foo")
# array([ True,  True, False, False])

df1[f(df1["col"], "foo")]

col
0  foo abc
1   foobar
``````

Regex solutions possible:

``````regex_pattern = r"foo(?!$)"
p = re.compile(regex_pattern)
f = np.vectorize(lambda x: pd.notna(x) and bool(p.search(x)))
df1[f(df1["col"])]

col
1  foobar
``````

`DataFrame.query`
Supports string methods through the python engine. This offers no visible performance benefits, but is nonetheless useful to know if you need to dynamically generate your queries.

``````df1.query('col.str.contains("foo")', engine="python")

col
0  foo abc
1   foobar
``````

More information on the `query` and `eval` family of methods can be found at Dynamic Expression Evaluation in pandas using pd.eval().

### Recommended Usage Precedence

1. (First) `str.contains`, for its simplicity and ease of handling NaNs and mixed data
2. List comprehensions, for their performance (especially if your data is purely strings)
3. `np.vectorize`
4. (Last) `df.query`
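To see the claim in item 2 for yourself, here is a minimal benchmark sketch; the `df` below is made up for illustration, and absolute timings vary by machine and pandas version:

```python
import timeit

import pandas as pd

# Hypothetical all-string column, repeated to make timings measurable.
df = pd.DataFrame({"col": ["foo abc", "foobar", "bar", "baz 45"] * 1000})

# Both approaches produce the same boolean mask...
mask_str = df["col"].str.contains("foo", regex=False)
mask_lc = pd.Series(["foo" in x for x in df["col"]], index=df.index)

# ...so only speed differs; on pure-string data the list comp usually wins.
t_str = timeit.timeit(lambda: df["col"].str.contains("foo", regex=False), number=50)
t_lc = timeit.timeit(lambda: ["foo" in x for x in df["col"]], number=50)
print(f"str.contains: {t_str:.3f}s  list comp: {t_lc:.3f}s")
```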

This FutureWarning isn't from pandas; it is from NumPy, and the bug also affects matplotlib and others. Here's how to reproduce the warning nearer to the source of the trouble:

``````import numpy as np
print(np.__version__)     # e.g. 1.12.0
"x" in np.arange(5)       # FutureWarning thrown here

FutureWarning: elementwise comparison failed; returning scalar instead, but in the
future will perform elementwise comparison
False
``````

Another way to reproduce this bug using the double equals operator:

``````import numpy as np
np.arange(5) == np.arange(5).astype(str)    #FutureWarning thrown here
``````

An example of Matplotlib affected by this FutureWarning under their quiver plot implementation: https://matplotlib.org/examples/pylab_examples/quiver_demo.html

## What's going on here?

There is a disagreement between NumPy and native Python on what should happen when you compare a string to NumPy's numeric types. Notice that the left operand is Python's turf, a primitive string, while the right operand is NumPy's turf, an ndarray. Should the operation return a Python-style scalar or a NumPy-style ndarray of booleans? NumPy says ndarray of bool; Pythonic developers disagree. Classic standoff.

Should it be an elementwise comparison, or a scalar indicating whether the item exists in the array?

If your code or library uses the `in` or `==` operators to compare a Python string to a NumPy ndarray, they aren't compatible, so when you try it, the operation returns a scalar, but only for now. The warning indicates that this behavior might change in the future, so your code pukes all over the carpet if Python/NumPy decide to adopt the NumPy style.

## Submitted Bug reports:

Numpy and Python are in a standoff, for now the operation returns a scalar, but in the future it may change.

https://github.com/numpy/numpy/issues/6784

https://github.com/pandas-dev/pandas/issues/7830

## Two workaround solutions:

Either lock down your versions of Python and NumPy, ignore the warnings and expect the behavior not to change, or convert both the left and right operands of `==` and `in` so they come from the same camp: both NumPy types or both primitive Python types.
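A sketch of the conversion workaround, keeping both operands on the same turf (the values here are illustrative):

```python
import numpy as np

arr = np.arange(5)

# "x" in arr mixes a Python str with a numeric ndarray and trips the warning.

# Option 1: make both sides strings before comparing.
print("x" in arr.astype(str))   # False, no warning

# Option 2: make both sides numeric.
print(3 in arr)                 # True, no warning
```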

Suppress the warning globally:

``````import warnings
import numpy as np
warnings.simplefilter(action="ignore", category=FutureWarning)
print("x" in np.arange(5))   #returns False, without Warning
``````

Suppress the warning on a line by line basis.

``````import warnings
import numpy as np

with warnings.catch_warnings():
    warnings.simplefilter(action="ignore", category=FutureWarning)
    print("x" in np.arange(2))   # returns False, warning is suppressed

print("x" in np.arange(10))   #returns False, Throws FutureWarning
``````

Just suppress the warning by name, then put a loud comment next to it mentioning the current versions of Python and NumPy, saying this code is brittle and requires these versions, and put a link to here. Kick the can down the road.

TLDR: `pandas` are Jedi; `numpy` are the hutts; and `python` is the galactic empire.

In version 0.24+, pandas gained the ability to hold integer dtypes with missing values.

Pandas can represent integer data with possibly missing values using `arrays.IntegerArray`. This is an extension type implemented within pandas. It is not the default dtype for integers and will not be inferred; you must explicitly pass the dtype into `array()` or `Series`:

``````arr = pd.array([1, 2, np.nan], dtype=pd.Int64Dtype())
pd.Series(arr)

0      1
1      2
2    NaN
dtype: Int64
``````

To convert a column to nullable integers, use:

``````df["myCol"] = df["myCol"].astype("Int64")
``````
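A short sketch of why this matters: with NaN present the column is forced to float, and a plain `astype(int)` raises, while the nullable `"Int64"` dtype keeps both the integers and the missing value:

```python
import numpy as np
import pandas as pd

# NaN forces the column to float64; s.astype(int) would raise here.
s = pd.Series([1.0, 2.0, np.nan])
print(s.dtype)                      # float64

# The nullable extension dtype holds integers plus a missing-value marker.
nullable = s.astype("Int64")
print(nullable.dtype)               # Int64
print(int(nullable.isna().sum()))   # 1
```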

Given this df:

``````        date
0 2001-08-10
1 2002-08-31
2 2003-08-29
3 2006-06-21
4 2002-03-27
5 2003-07-14
6 2004-06-15
7 2003-08-14
8 2003-07-29
``````

and, if it's not already the case:

``````df["date"] = pd.to_datetime(df["date"])
``````

To show the count of dates by month:

``````df.groupby(df["date"].dt.month).count().plot(kind="bar")
``````

`.dt` allows you to access the datetime properties.

This will give you a bar chart of counts per month. You can replace month with year, day, etc.

If you want to distinguish year and month for instance, just do:

``````df.groupby([df["date"].dt.year, df["date"].dt.month]).count().plot(kind="bar")
``````
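If you only want the numbers behind the chart, the same groupby without `.plot` returns a Series of counts, one entry per (year, month) pair (dates taken from the df above):

```python
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime([
    "2001-08-10", "2002-08-31", "2003-08-29", "2006-06-21", "2002-03-27",
    "2003-07-14", "2004-06-15", "2003-08-14", "2003-07-29",
])})

# size() counts rows per (year, month) group; .plot(kind="bar") would chart it.
counts = df.groupby([df["date"].dt.year, df["date"].dt.month]).size()
print(counts)
```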

Which gives a bar chart with one bar per (year, month) pair.

Hope this helps!

First, to convert a Categorical column to its numerical codes, you can do this more easily with `dataframe["c"].cat.codes`.
Further, it is possible to automatically select all columns with a certain dtype in a dataframe using `select_dtypes`. This way, you can apply the above operation on multiple, automatically selected columns.

First making an example dataframe:

``````In : df = pd.DataFrame({"col1":[1,2,3,4,5], "col2":list("abcab"),  "col3":list("ababb")})

In : df["col2"] = df["col2"].astype("category")

In : df["col3"] = df["col3"].astype("category")

In : df.dtypes
Out:
col1       int64
col2    category
col3    category
dtype: object
``````

Then by using `select_dtypes` to select the columns, and then applying `.cat.codes` on each of these columns, you can get the following result:

``````In : cat_columns = df.select_dtypes(["category"]).columns

In : cat_columns
Out: Index(['col2', 'col3'], dtype='object')

In : df[cat_columns] = df[cat_columns].apply(lambda x: x.cat.codes)

In : df
Out:
   col1  col2  col3
0     1     0     0
1     2     1     1
2     3     2     0
3     4     0     1
4     5     1     1
``````
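One caveat worth sketching: `.cat.codes` throws away the labels, so keep the code-to-category mapping if you ever need to decode (the series here mirrors `col2` above):

```python
import pandas as pd

s = pd.Series(list("abcab"), dtype="category")

codes = s.cat.codes
# Enumerate the ordered categories to recover code -> label.
mapping = dict(enumerate(s.cat.categories))

print(codes.tolist())   # [0, 1, 2, 0, 1]
print(mapping)          # {0: 'a', 1: 'b', 2: 'c'}
```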