  # Filtering a list based on a list of booleans

I have a list of values which I need to filter given the values in a list of booleans:

``````list_a = [1, 2, 4, 6]
filter = [True, False, True, False]
``````

I generate a new filtered list with the following line:

``````filtered_list = [i for indx,i in enumerate(list_a) if filter[indx] == True]
``````

which results in:

``````print filtered_list
[1,4]
``````

The line works but looks (to me) a bit overkill and I was wondering if there was a simpler way to achieve the same.

1. Don't name a list `filter` as I did, because it shadows the built-in function.

2. Don't compare things to `True` as I did with `if filter[indx] == True`, since it's unnecessary. Just using `if filter[indx]` is enough.

You're looking for `itertools.compress`:

``````>>> from itertools import compress
>>> list_a = [1, 2, 4, 6]
>>> fil = [True, False, True, False]
>>> list(compress(list_a, fil))
[1, 4]
``````

## Timing comparisons (Python 3.x):

``````>>> list_a = [1, 2, 4, 6]
>>> fil = [True, False, True, False]
>>> %timeit list(compress(list_a, fil))
100000 loops, best of 3: 2.58 us per loop
>>> %timeit [i for (i, v) in zip(list_a, fil) if v]  #winner
100000 loops, best of 3: 1.98 us per loop

>>> list_a = [1, 2, 4, 6]*100
>>> fil = [True, False, True, False]*100
>>> %timeit list(compress(list_a, fil))              #winner
10000 loops, best of 3: 24.3 us per loop
>>> %timeit [i for (i, v) in zip(list_a, fil) if v]
10000 loops, best of 3: 82 us per loop

>>> list_a = [1, 2, 4, 6]*10000
>>> fil = [True, False, True, False]*10000
>>> %timeit list(compress(list_a, fil))              #winner
1000 loops, best of 3: 1.66 ms per loop
>>> %timeit [i for (i, v) in zip(list_a, fil) if v]
100 loops, best of 3: 7.65 ms per loop
``````

## List comprehension vs. lambda + filter

I happened to find myself having a basic filtering need: I have a list and I have to filter it by an attribute of the items.

My code looked like this:

``````my_list = [x for x in my_list if x.attribute == value]
``````

But then I thought, wouldn't it be better to write it like this?

``````my_list = filter(lambda x: x.attribute == value, my_list)
``````

It's more readable, and if needed for performance the lambda could be taken out to gain something.

Question is: are there any caveats in using the second way? Any performance difference? Am I missing the Pythonic Way™ entirely and should do it in yet another way (such as using `itemgetter` instead of the lambda)?
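One practical caveat worth noting: in Python 3 `filter` returns a lazy iterator rather than a list, so the two forms are only interchangeable once you materialize the result. A minimal sketch (the `attribute` field and item values are invented for illustration):

```python
from types import SimpleNamespace

# Hypothetical objects with an `attribute` field, mirroring the question
items = [SimpleNamespace(attribute=v) for v in (1, 2, 2, 3)]
value = 2

comp = [x for x in items if x.attribute == value]
filt = list(filter(lambda x: x.attribute == value, items))  # list() needed in Python 3

assert comp == filt  # both select the same two items
```

Either way works; the comprehension avoids the extra `lambda` call per element and reads more uniformly with the rest of the language.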

## How do I do a not equal in Django queryset filtering?

### Question by MikeN

In Django model QuerySets, I see that there is a `__gt` and `__lt` for comparative values, but is there a `__ne` or `!=` (not equals)? I want to filter out using a not equals. For example, for

``````Model:
bool a;
int x;
``````

I want to do

``````results = Model.objects.exclude(a=True, x!=5)
``````

The `!=` is not correct syntax. I also tried `__ne`.

I ended up using:

``````results = Model.objects.exclude(a=True, x__lt=5).exclude(a=True, x__gt=5)
``````
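For what it's worth, Django's `Q` objects support negation with `~`, which lets you spell the intended condition directly. An untested sketch using the model from the question:

```python
from django.db.models import Q

# "exclude rows where a is True AND x != 5", as the question describes
results = Model.objects.exclude(Q(a=True) & ~Q(x=5))
```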

## Filter dict to contain only certain keys?

I've got a `dict` that has a whole bunch of entries. I'm only interested in a select few of them. Is there an easy way to prune all the other ones out?
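A dict comprehension over the wanted keys is one common approach; a minimal sketch (the key names are made up):

```python
d = {"a": 1, "b": 2, "c": 3, "d": 4}
wanted = {"a", "c"}

# Keep only the selected keys; the `if k in d` guard skips keys absent from d
filtered = {k: d[k] for k in wanted if k in d}
```

Iterating over `wanted` (usually the smaller collection) rather than over `d.items()` keeps the work proportional to the number of keys you actually want.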

## How to filter Pandas dataframe using "in" and "not in" like in SQL

How can I achieve the equivalents of SQL's `IN` and `NOT IN`?

I have a list with the required values. Here's the scenario:

``````df = pd.DataFrame({"country": ["US", "UK", "Germany", "China"]})
countries_to_keep = ["UK", "China"]

# pseudo-code:
df[df["country"] not in countries_to_keep]
``````

My current way of doing this is as follows:

``````df = pd.DataFrame({"country": ["US", "UK", "Germany", "China"]})
df2 = pd.DataFrame({"country": ["UK", "China"], "matched": True})

# IN
df.merge(df2, how="inner", on="country")

# NOT IN
not_in = df.merge(df2, how="left", on="country")
not_in = not_in[pd.isnull(not_in["matched"])]
``````

But this seems like a horrible kludge. Can anyone improve on it?
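For reference, pandas has `Series.isin`, which covers both cases directly; a sketch using the question's data:

```python
import pandas as pd

df = pd.DataFrame({"country": ["US", "UK", "Germany", "China"]})
countries_to_keep = ["UK", "China"]

# IN
kept = df[df["country"].isin(countries_to_keep)]

# NOT IN: negate the boolean mask with ~
dropped = df[~df["country"].isin(countries_to_keep)]
```

This avoids the merge entirely and keeps the original index intact.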

## Filter dataframe rows if value in column is in a set list of values

I have a Python pandas DataFrame `rpt`:

``````rpt
<class "pandas.core.frame.DataFrame">
MultiIndex: 47518 entries, ("000002", "20120331") to ("603366", "20091231")
Data columns:
STK_ID                    47518  non-null values
STK_Name                  47518  non-null values
RPT_Date                  47518  non-null values
sales                     47518  non-null values
``````

I can filter the rows whose stock id is `"600809"` like this: `rpt[rpt["STK_ID"] == "600809"]`

``````<class "pandas.core.frame.DataFrame">
MultiIndex: 25 entries, ("600809", "20120331") to ("600809", "20060331")
Data columns:
STK_ID                    25  non-null values
STK_Name                  25  non-null values
RPT_Date                  25  non-null values
sales                     25  non-null values
``````

and I want to get all the rows of some stocks together, such as `["600809","600141","600329"]`. That means I want a syntax like this:

``````stk_list = ["600809","600141","600329"]

rst = rpt[rpt["STK_ID"] in stk_list] # this does not work in pandas
``````

Since pandas does not accept the above command, how can I achieve the target?

## pandas: filter rows of DataFrame with operator chaining

Most operations in `pandas` can be accomplished with operator chaining (`groupby`, `aggregate`, `apply`, etc.), but the only way I've found to filter rows is via normal bracket indexing:

``````df_filtered = df[df["column"] == value]
``````

This is unappealing, as it requires that I assign `df` to a variable before being able to filter on its values. Is there something more like the following?

``````df_filtered = df.mask(lambda x: x["column"] == value)
``````
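One option along these lines: `.loc` accepts a callable, which receives the intermediate frame and so allows filtering inside a chain without naming a temporary variable. A sketch with an invented frame and value:

```python
import pandas as pd

df = pd.DataFrame({"column": [1, 2, 2, 3]})
value = 2

# The lambda is called with the frame being indexed, so this works mid-chain too
df_filtered = df.loc[lambda d: d["column"] == value]
```

`df.query("column == @value")` is another chain-friendly spelling when the condition is a simple expression.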

## How can I filter a Django query with a list of values?

I'm sure this is a trivial operation, but I can't figure out how it's done.

There's got to be something smarter than this:

``````ids = [1, 3, 6, 7, 9]

for id in ids:
    MyModel.objects.filter(pk=id)
``````

I'm looking to get them all in one query with something like:

``````MyModel.objects.filter(pk=[1, 3, 6, 7, 9])
``````

How can I filter a Django query with a list of values?
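For reference, Django's field lookups include `__in`, which accepts any iterable of values; a short untested sketch using the model name from the question:

```python
MyModel.objects.filter(pk__in=[1, 3, 6, 7, 9])
```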

## Difference between filter and filter_by in SQLAlchemy

Could anyone explain the difference between `filter` and `filter_by` functions in SQLAlchemy? Which one should I be using?
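In short: `filter_by` takes keyword arguments matched as equality conditions against the queried model's columns, while `filter` takes full column expressions (which also allow `!=`, `>`, and conditions spanning models). An untested sketch, with `session` and `User` as assumed names:

```python
# filter_by: simple keyword equality on the queried model
session.query(User).filter_by(name="alice")

# filter: arbitrary column expressions
session.query(User).filter(User.name == "alice")
session.query(User).filter(User.age > 21)
```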

## How to use filter, map, and reduce in Python 3

`filter`, `map`, and `reduce` work perfectly in Python 2. Here is an example:

``````>>> def f(x):
...     return x % 2 != 0 and x % 3 != 0
>>> filter(f, range(2, 25))
[5, 7, 11, 13, 17, 19, 23]

>>> def cube(x):
...     return x * x * x
>>> map(cube, range(1, 11))
[1, 8, 27, 64, 125, 216, 343, 512, 729, 1000]

>>> def add(x, y):
...     return x + y
>>> reduce(add, range(1, 11))
55
``````

But in Python 3, I receive the following outputs:

``````>>> filter(f, range(2, 25))
<filter object at 0x0000000002C14908>

>>> map(cube, range(1, 11))
<map object at 0x0000000002C82B70>

>>> reduce(add, range(1, 11))
Traceback (most recent call last):
  File "<pyshell#8>", line 1, in <module>
NameError: name 'reduce' is not defined
``````

I would appreciate it if someone could explain to me why this is.
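For the record: in Python 3, `filter` and `map` return lazy iterators (wrap them in `list` to materialize the results), and `reduce` moved to `functools`. A sketch mirroring the examples above:

```python
from functools import reduce

def f(x):
    return x % 2 != 0 and x % 3 != 0

def cube(x):
    return x * x * x

def add(x, y):
    return x + y

odds = list(filter(f, range(2, 25)))   # materialize the filter iterator
cubes = list(map(cube, range(1, 11)))  # materialize the map iterator
total = reduce(add, range(1, 11))      # reduce now lives in functools
```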

## Get a filtered list of files in a directory

I am trying to get a list of files in a directory using Python, but I do not want a list of ALL the files.

What I essentially want is the ability to do something like the following, but using Python and not executing `ls`.

``````ls 145592*.jpg
``````

If there is no built-in method for this, I am currently thinking of writing a for loop to iterate through the results of an `os.listdir()` and to append all the matching files to a new list.

However, there are a lot of files in that directory and therefore I am hoping there is a more efficient method (or a built-in method).
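`glob.glob` does exactly this against the filesystem; the matching logic itself lives in `fnmatch`, which also works on any list of names. A sketch (the filenames are invented):

```python
import fnmatch
# Against a real directory, the one-liner is:
#   import glob; glob.glob("145592*.jpg")

names = ["145592_a.jpg", "145592_b.jpg", "145593_c.jpg", "notes.txt"]
matches = fnmatch.filter(names, "145592*.jpg")
```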

This post aims to give readers a primer on SQL-flavored merging with Pandas, how to use it, and when not to use it.

In particular, here's what this post will go through:

• The basics - types of joins (LEFT, RIGHT, OUTER, INNER)

• merging with different column names
• merging with multiple columns
• avoiding duplicate merge key column in output

What this post (and other posts by me on this thread) will not go through:

• Performance-related discussions and timings (for now). Mostly notable mentions of better alternatives, wherever appropriate.
• Handling suffixes, removing extra columns, renaming outputs, and other specific use cases. There are other (read: better) posts that deal with that, so look to those instead.

Note: Most examples default to INNER JOIN operations while demonstrating various features, unless otherwise specified.

Furthermore, all the DataFrames here can be copied and replicated so you can play with them. Also, see this post on how to read DataFrames from your clipboard.

Lastly, all visual representations of the JOIN operations were hand-drawn using Google Drawings; the figures are not reproduced here.

# Enough talk - just show me how to use `merge`!

### Setup & Basics

``````import numpy as np
import pandas as pd

np.random.seed(0)
left = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": np.random.randn(4)})
right = pd.DataFrame({"key": ["B", "D", "E", "F"], "value": np.random.randn(4)})

left

key     value
0   A  1.764052
1   B  0.400157
2   C  0.978738
3   D  2.240893

right

key     value
0   B  1.867558
1   D -0.977278
2   E  0.950088
3   F -0.151357
``````

For the sake of simplicity, the key column has the same name in both frames (for now).

An INNER JOIN keeps only the keys common to both frames (figure omitted here).

Note: this, along with the forthcoming figures, all follow this convention:

• blue indicates rows that are present in the merge result
• red indicates rows that are excluded from the result (i.e., removed)
• green indicates missing values that are replaced with `NaN`s in the result

To perform an INNER JOIN, call `merge` on the left DataFrame, specifying the right DataFrame and the join key (at the very least) as arguments.

``````left.merge(right, on="key")
# Or, if you want to be explicit
# left.merge(right, on="key", how="inner")

key   value_x   value_y
0   B  0.400157  1.867558
1   D  2.240893 -0.977278
``````

This returns only rows from `left` and `right` which share a common key (in this example, "B" and "D").

A LEFT OUTER JOIN, or LEFT JOIN (figure omitted), can be performed by specifying `how="left"`.

``````left.merge(right, on="key", how="left")

key   value_x   value_y
0   A  1.764052       NaN
1   B  0.400157  1.867558
2   C  0.978738       NaN
3   D  2.240893 -0.977278
``````

Carefully note the placement of NaNs here. If you specify `how="left"`, then only keys from `left` are used, and missing data from `right` is replaced by NaN.

And similarly, for a RIGHT OUTER JOIN, or RIGHT JOIN, specify `how="right"`:

``````left.merge(right, on="key", how="right")

key   value_x   value_y
0   B  0.400157  1.867558
1   D  2.240893 -0.977278
2   E       NaN  0.950088
3   F       NaN -0.151357
``````

Here, keys from `right` are used, and missing data from `left` is replaced by NaN.

Finally, for the FULL OUTER JOIN, specify `how="outer"`.

``````left.merge(right, on="key", how="outer")

key   value_x   value_y
0   A  1.764052       NaN
1   B  0.400157  1.867558
2   C  0.978738       NaN
3   D  2.240893 -0.977278
4   E       NaN  0.950088
5   F       NaN -0.151357
``````

This uses the keys from both frames, and NaNs are inserted for missing rows in both.

The documentation summarizes these various merges nicely.

### Other JOINs - LEFT-Excluding, RIGHT-Excluding, and FULL-Excluding/ANTI JOINs

If you need LEFT-Excluding JOINs and RIGHT-Excluding JOINs, you can perform them in two steps.

For a LEFT-Excluding JOIN (figure omitted), start by performing a LEFT OUTER JOIN and then filtering to rows coming from `left` only,

``````(left.merge(right, on="key", how="left", indicator=True)
     .query('_merge == "left_only"')
     .drop("_merge", axis=1))

key   value_x  value_y
0   A  1.764052      NaN
2   C  0.978738      NaN
``````

Where,

``````left.merge(right, on="key", how="left", indicator=True)

key   value_x   value_y     _merge
0   A  1.764052       NaN  left_only
1   B  0.400157  1.867558       both
2   C  0.978738       NaN  left_only
3   D  2.240893 -0.977278       both
``````

And similarly, for a RIGHT-Excluding JOIN,

``````(left.merge(right, on="key", how="right", indicator=True)
     .query('_merge == "right_only"')
     .drop("_merge", axis=1))

key  value_x   value_y
2   E      NaN  0.950088
3   F      NaN -0.151357
``````

Lastly, if you are required to do a merge that only retains keys from the left or right, but not both (in other words, performing an ANTI-JOIN), you can do this in similar fashion:

``````(left.merge(right, on="key", how="outer", indicator=True)
     .query('_merge != "both"')
     .drop("_merge", axis=1))

key   value_x   value_y
0   A  1.764052       NaN
2   C  0.978738       NaN
4   E       NaN  0.950088
5   F       NaN -0.151357
``````

### Different names for key columns

If the key columns are named differently, for example, `left` has `keyLeft` and `right` has `keyRight` instead of `key`, then you will have to specify `left_on` and `right_on` as arguments instead of `on`:

``````left2 = left.rename({"key":"keyLeft"}, axis=1)
right2 = right.rename({"key":"keyRight"}, axis=1)

left2

keyLeft     value
0       A  1.764052
1       B  0.400157
2       C  0.978738
3       D  2.240893

right2

keyRight     value
0        B  1.867558
1        D -0.977278
2        E  0.950088
3        F -0.151357
``````
``````left2.merge(right2, left_on="keyLeft", right_on="keyRight", how="inner")

keyLeft   value_x keyRight   value_y
0       B  0.400157        B  1.867558
1       D  2.240893        D -0.977278
``````

### Avoiding duplicate key column in output

When merging on `keyLeft` from `left` and `keyRight` from `right`, if you only want either of the `keyLeft` or `keyRight` (but not both) in the output, you can start by setting the index as a preliminary step.

``````left3 = left2.set_index("keyLeft")
left3.merge(right2, left_index=True, right_on="keyRight")

value_x keyRight   value_y
0  0.400157        B  1.867558
1  2.240893        D -0.977278
``````

Contrast this with the output of the command just before (that is, the output of `left2.merge(right2, left_on="keyLeft", right_on="keyRight", how="inner")`), and you'll notice `keyLeft` is missing. You can figure out which key column to keep based on which frame's index is set as the key. This may matter when, say, performing some OUTER JOIN operation.

### Merging only a single column from one of the `DataFrames`

For example, consider

``````right3 = right.assign(newcol=np.arange(len(right)))
right3
key     value  newcol
0   B  1.867558       0
1   D -0.977278       1
2   E  0.950088       2
3   F -0.151357       3
``````

If you are required to merge only `"newcol"` (without any of the other columns), you can usually just subset columns before merging:

``````left.merge(right3[["key", "newcol"]], on="key")

key     value  newcol
0   B  0.400157       0
1   D  2.240893       1
``````

If you're doing a LEFT OUTER JOIN, a more performant solution would involve `map`:

``````# left["newcol"] = left["key"].map(right3.set_index("key")["newcol"])
left.assign(newcol=left["key"].map(right3.set_index("key")["newcol"]))

key     value  newcol
0   A  1.764052     NaN
1   B  0.400157     0.0
2   C  0.978738     NaN
3   D  2.240893     1.0
``````

As mentioned, this is similar to, but faster than

``````left.merge(right3[["key", "newcol"]], on="key", how="left")

key     value  newcol
0   A  1.764052     NaN
1   B  0.400157     0.0
2   C  0.978738     NaN
3   D  2.240893     1.0
``````

### Merging on multiple columns

To join on more than one column, specify a list for `on` (or `left_on` and `right_on`, as appropriate).

``````left.merge(right, on=["key1", "key2"], ...)
``````

Or, in the event the names are different,

``````left.merge(right, left_on=["lkey1", "lkey2"], right_on=["rkey1", "rkey2"])
``````

### Other useful `merge*` operations and functions

This section only covers the very basics, and is designed to only whet your appetite. For more examples and cases, see the documentation on `merge`, `join`, and `concat` as well as the links to the function specifications.

## Truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()

The `or` and `and` Python statements require truth values. For `pandas` these are considered ambiguous, so you should use the "bitwise" `|` (or) or `&` (and) operations:

``````result = result[(result["var"]>0.25) | (result["var"]<-0.25)]
``````

These operators are overloaded for these kinds of data structures to yield the element-wise `or` (or `and`).

Just to add some more explanation to this statement:

The exception is thrown when you want to get the `bool` of a `pandas.Series`:

``````>>> import pandas as pd
>>> x = pd.Series()
>>> bool(x)
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
``````

What you hit was a place where the operator implicitly converted the operands to `bool` (you used `or` but it also happens for `and`, `if` and `while`):

``````>>> x or x
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
>>> x and x
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
>>> if x:
...     print("fun")
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
>>> while x:
...     print("fun")
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
``````

Besides these four statements, there are several Python functions that hide some `bool` calls (like `any`, `all`, `filter`, ...). These are normally not problematic with `pandas.Series`, but for completeness I wanted to mention them.

In your case, the exception isn't really helpful because it doesn't mention the right alternatives. For `and` and `or`, you can use (if you want element-wise comparisons):

• ``````>>> import numpy as np
>>> np.logical_or(x, y)
``````

or simply the `|` operator:

``````>>> x | y
``````
• ``````>>> np.logical_and(x, y)
``````

or simply the `&` operator:

``````>>> x & y
``````

If you're using the operators, then make sure you set your parentheses correctly because of operator precedence.

There are several logical numpy functions which should work on `pandas.Series`.

The alternatives mentioned in the exception are more suited if you encountered it when doing `if` or `while`. I'll shortly explain each of these:

• If you want to check if your Series is empty:

``````>>> x = pd.Series([])
>>> x.empty
True
>>> x = pd.Series([1])
>>> x.empty
False
``````

Python normally interprets the `len`gth of containers (like `list`, `tuple`, ...) as a truth value if they have no explicit boolean interpretation. So if you want the Python-like check, you could do `if x.size` or `if not x.empty` instead of `if x`.

• If your `Series` contains one and only one boolean value:

``````>>> x = pd.Series([100])
>>> (x > 50).bool()
True
>>> (x < 50).bool()
False
``````
• If you want to check the first and only item of your Series (like `.bool()`, but it works even for non-boolean contents):

``````>>> x = pd.Series([100])
>>> x.item()
100
``````
• If you want to check if all or any item is not-zero, not-empty or not-False:

``````>>> x = pd.Series([0, 1, 2])
>>> x.all()   # because one element is zero
False
>>> x.any()   # because one (or more) elements are non-zero
True
``````

## Convert a Django model instance to a dict

There are many ways to convert an instance to a dictionary, with varying degrees of corner-case handling and closeness to the desired result.

## 1. `instance.__dict__`

``````instance.__dict__
``````

which returns

``````{"_foreign_key_cache": <OtherModel: OtherModel object>,
"_state": <django.db.models.base.ModelState at 0x7ff0993f6908>,
"auto_now_add": datetime.datetime(2018, 12, 20, 21, 34, 29, 494827, tzinfo=<UTC>),
"foreign_key_id": 2,
"id": 1,
"normal_value": 1}
``````

This is by far the simplest, but it is missing `many_to_many`, `foreign_key` is misnamed (`foreign_key_id`), and it has two unwanted extra things in it (`_foreign_key_cache` and `_state`).

## 2. `model_to_dict`

``````from django.forms.models import model_to_dict
model_to_dict(instance)
``````

which returns

``````{"foreign_key": 2,
"id": 1,
"many_to_many": [<OtherModel: OtherModel object>],
"normal_value": 1}
``````

This is the only one with `many_to_many`, but is missing the uneditable fields.

## 3. `model_to_dict(..., fields=...)`

``````from django.forms.models import model_to_dict
model_to_dict(instance, fields=[field.name for field in instance._meta.fields])
``````

which returns

``````{"foreign_key": 2, "id": 1, "normal_value": 1}
``````

This is strictly worse than the standard `model_to_dict` invocation.

## 4. `query_set.values()`

``````SomeModel.objects.filter(id=instance.id).values()
``````

which returns

``````{"auto_now_add": datetime.datetime(2018, 12, 20, 21, 34, 29, 494827, tzinfo=<UTC>),
"foreign_key_id": 2,
"id": 1,
"normal_value": 1}
``````

This is the same output as `instance.__dict__` but without the extra fields. `foreign_key_id` is still wrong and `many_to_many` is still missing.

## 5. Custom Function

The code for Django's `model_to_dict` had most of the answer. It explicitly removed non-editable fields, so removing that check and getting the ids of foreign keys for many-to-many fields results in the following code, which behaves as desired:

``````from itertools import chain

def to_dict(instance):
    opts = instance._meta
    data = {}
    for f in chain(opts.concrete_fields, opts.private_fields):
        data[f.name] = f.value_from_object(instance)
    for f in opts.many_to_many:
        data[f.name] = [i.id for i in f.value_from_object(instance)]
    return data
``````

While this is the most complicated option, calling `to_dict(instance)` gives us exactly the desired result:

``````{"auto_now_add": datetime.datetime(2018, 12, 20, 21, 34, 29, 494827, tzinfo=<UTC>),
"foreign_key": 2,
"id": 1,
"many_to_many": [2],
"normal_value": 1}
``````

## 6. Use Serializers

Django Rest Framework's `ModelSerializer` allows you to build a serializer automatically from a model.

``````from rest_framework import serializers

class SomeModelSerializer(serializers.ModelSerializer):
    class Meta:
        model = SomeModel
        fields = "__all__"

SomeModelSerializer(instance).data
``````

returns

``````{"auto_now_add": "2018-12-20T21:34:29.494827Z",
"foreign_key": 2,
"id": 1,
"many_to_many": [2],
"normal_value": 1}
``````

This is almost as good as the custom function, but `auto_now_add` is a string instead of a datetime object.

## Bonus Round: better model printing

If you want a django model that has a better python command-line display, have your models child-class the following:

``````from django.db import models
from itertools import chain

class PrintableModel(models.Model):
    def __repr__(self):
        return str(self.to_dict())

    def to_dict(instance):
        opts = instance._meta
        data = {}
        for f in chain(opts.concrete_fields, opts.private_fields):
            data[f.name] = f.value_from_object(instance)
        for f in opts.many_to_many:
            data[f.name] = [i.id for i in f.value_from_object(instance)]
        return data

    class Meta:
        abstract = True
``````

So, for example, if we define our models as such:

``````class OtherModel(PrintableModel): pass

class SomeModel(PrintableModel):
    normal_value = models.IntegerField()
    foreign_key = models.ForeignKey(OtherModel, related_name="ref1")
    many_to_many = models.ManyToManyField(OtherModel, related_name="ref2")
``````

Calling `SomeModel.objects.first()` now gives output like this:

``````{"auto_now_add": datetime.datetime(2018, 12, 20, 21, 34, 29, 494827, tzinfo=<UTC>),
"foreign_key": 2,
"id": 1,
"many_to_many": [2],
"normal_value": 1}
``````

## What is the difference between "SAME" and "VALID" padding?

If you like ascii art:

• `"VALID"` = without padding:

``````   inputs:         1  2  3  4  5  6  7  8  9  10 11 (12 13)
                   |________________|                dropped
                                  |_________________|
``````
• `"SAME"` = with zero padding:

``````               pad|                                      |pad
   inputs:      0 |1  2  3  4  5  6  7  8  9  10 11 12 13|0  0
               |________________|
                              |_________________|
                                             |________________|
``````

In this example:

• Input width = 13
• Filter width = 6
• Stride = 5

Notes:

• `"VALID"` only ever drops the right-most columns (or bottom-most rows).
• `"SAME"` tries to pad evenly left and right, but if the amount of columns to be added is odd, it will add the extra column to the right, as is the case in this example (the same logic applies vertically: there may be an extra row of zeros at the bottom).

Edit:

• With `"SAME"` padding, if you use a stride of 1, the layer's outputs will have the same spatial dimensions as its inputs.
• With `"VALID"` padding, there are no "made-up" padding inputs. The layer only uses valid input data.

## Extracting selected columns into a new DataFrame

There is a way of doing this, and it actually looks similar to R:

``````new = old[["A", "C", "D"]].copy()
``````

Here you are just selecting the columns you want from the original data frame and creating a variable for those. If you want to modify the new dataframe at all, you'll probably want to use `.copy()` to avoid a `SettingWithCopyWarning`.

An alternative method is to use `filter` which will create a copy by default:

``````new = old.filter(["A","B","D"], axis=1)
``````

Finally, depending on the number of columns in your original dataframe, it might be more succinct to express this using a `drop` (this will also create a copy by default):

``````new = old.drop("B", axis=1)
``````

## How to deal with `SettingWithCopyWarning` in Pandas?

This post is meant for readers who,

1. Would like to understand what this warning means
2. Would like to understand different ways of suppressing this warning
3. Would like to understand how to improve their code and follow good practices to avoid this warning in the future.

Setup

``````import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.choice(10, (3, 5)), columns=list("ABCDE"))
df
A  B  C  D  E
0  5  0  3  3  7
1  9  3  5  2  4
2  7  6  8  8  1
``````

# What is the `SettingWithCopyWarning`?

To know how to deal with this warning, it is important to understand what it means and why it is raised in the first place.

When filtering DataFrames, it is possible to slice/index a frame to return either a view or a copy, depending on the internal layout and various implementation details. A "view" is, as the term suggests, a view into the original data, so modifying the view may modify the original object. On the other hand, a "copy" is a replication of data from the original, and modifying the copy has no effect on the original.

As mentioned by other answers, the `SettingWithCopyWarning` was created to flag "chained assignment" operations. Consider `df` in the setup above. Suppose you would like to select all values in column "B" where values in column "A" are > 5. Pandas allows you to do this in different ways, some more correct than others. For example,

``````df[df.A > 5]["B"]

1    3
2    6
Name: B, dtype: int64
``````

And,

``````df.loc[df.A > 5, "B"]

1    3
2    6
Name: B, dtype: int64
``````

These return the same result, so if you are only reading these values, it makes no difference. So, what is the issue? The problem with chained assignment is that it is generally difficult to predict whether a view or a copy is returned, so this largely becomes an issue when you are attempting to assign values back. To build on the earlier example, consider how this code is executed by the interpreter:

``````df.loc[df.A > 5, "B"] = 4
# becomes
df.__setitem__((df.A > 5, "B"), 4)
``````

This is a single `__setitem__` call on `df`. On the other hand, consider this code:

``````df[df.A > 5]["B"] = 4
# becomes
df.__getitem__(df.A > 5).__setitem__("B", 4)
``````

Now, depending on whether `__getitem__` returned a view or a copy, the `__setitem__` operation may not work.

In general, you should use `loc` for label-based assignment, and `iloc` for integer/positional based assignment, as the spec guarantees that they always operate on the original. Additionally, for setting a single cell, you should use `at` and `iat`.

More can be found in the documentation.

Note
All boolean indexing operations done with `loc` can also be done with `iloc`. The only difference is that `iloc` expects either integers/positions for index or a numpy array of boolean values, and integer/position indexes for the columns.

For example,

``````df.loc[df.A > 5, "B"] = 4
``````

Can be written as

``````df.iloc[(df.A > 5).values, 1] = 4
``````

And,

``````df.loc[1, "A"] = 100
``````

Can be written as

``````df.iloc[1, 0] = 100
``````

And so on.

# Just tell me how to suppress the warning!

Consider a simple operation on the "A" column of `df`. Selecting "A" and dividing by 2 will raise the warning, but the operation will work.

``````df2 = df[["A"]]
df2["A"] /= 2
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/IPython/__main__.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

df2
A
0  2.5
1  4.5
2  3.5
``````

There are a couple of ways of directly silencing this warning:

1. (recommended) Use `loc` to slice subsets:

`````` df2 = df.loc[:, ["A"]]
df2["A"] /= 2     # Does not raise
``````
2. Change `pd.options.mode.chained_assignment`
Can be set to `None`, `"warn"`, or `"raise"`. `"warn"` is the default. `None` will suppress the warning entirely, and `"raise"` will throw a `SettingWithCopyError`, preventing the operation from going through.

`````` pd.options.mode.chained_assignment = None
df2["A"] /= 2
``````
3. Make a `deepcopy`

`````` df2 = df[["A"]].copy(deep=True)
df2["A"] /= 2
``````

@Peter Cotton, in the comments, came up with a nice way of non-intrusively changing the mode (modified from this gist) using a context manager, to set the mode only as long as it is required, and then reset it back to the original state when finished.

``````class ChainedAssignment:
    def __init__(self, chained=None):
        acceptable = [None, "warn", "raise"]
        assert chained in acceptable, "chained must be in " + str(acceptable)
        self.swcw = chained

    def __enter__(self):
        self.saved_swcw = pd.options.mode.chained_assignment
        pd.options.mode.chained_assignment = self.swcw
        return self

    def __exit__(self, *args):
        pd.options.mode.chained_assignment = self.saved_swcw
``````

The usage is as follows:

``````# some code here
with ChainedAssignment():
    df2["A"] /= 2
# more code follows
``````

Or, to raise the exception:

``````with ChainedAssignment(chained="raise"):
    df2["A"] /= 2

SettingWithCopyError:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
``````

# The "XY Problem": What am I doing wrong?

A lot of the time, users attempt to look for ways of suppressing this exception without fully understanding why it was raised in the first place. This is a good example of an XY problem, where users attempt to solve a problem "Y" that is actually a symptom of a deeper rooted problem "X". Questions will be raised based on common problems that encounter this warning, and solutions will then be presented.

Question 1
I have a DataFrame

``````df
A  B  C  D  E
0  5  0  3  3  7
1  9  3  5  2  4
2  7  6  8  8  1
``````

I want to assign the values in column "A" that are > 5 to 1000. My expected output is

``````      A  B  C  D  E
0     5  0  3  3  7
1  1000  3  5  2  4
2  1000  6  8  8  1
``````

Wrong way to do this:

``````df.A[df.A > 5] = 1000         # works, because df.A returns a view
df[df.A > 5]["A"] = 1000      # does not work
df.loc[df.A > 5]["A"] = 1000  # does not work
``````

Right way using `loc`:

``````df.loc[df.A > 5, "A"] = 1000
``````

Question 2¹
I am trying to set the value in cell (1, "D") to 12345. My expected output is

``````   A  B  C      D  E
0  5  0  3      3  7
1  9  3  5  12345  4
2  7  6  8      8  1
``````

I have tried different ways of accessing this cell, such as `df["D"]`. What is the best way to do this?

1. This question isn't specifically related to the warning, but it is good to understand how to do this particular operation correctly, so as to avoid situations where the warning could potentially arise in the future.

You can use any of the following methods to do this.

``````df.loc[1, "D"] = 12345
df.iloc[1, 3] = 12345
df.at[1, "D"] = 12345
df.iat[1, 3] = 12345
``````
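All four are equivalent for a single cell; `at`/`iat` skip some of the indexing machinery of `loc`/`iloc`, so they are usually faster for scalar access. A quick sketch checking that the label-based and position-based forms agree (reusing the example frame, where "D" is column position 3):

```python
import pandas as pd

df = pd.DataFrame({"A": [5, 9, 7], "B": [0, 3, 6], "C": [3, 5, 8],
                   "D": [3, 2, 8], "E": [7, 4, 1]})

df.loc[1, "D"] = 12345          # label-based write
assert df.iat[1, 3] == 12345    # position-based read sees the same cell

df.iat[1, 3] = 54321            # position-based scalar write
assert df.at[1, "D"] == 54321   # label-based scalar read agrees
```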

Question 3
I am trying to subset values based on some condition. I have a DataFrame

``````   A  B  C  D  E
1  9  3  5  2  4
2  7  6  8  8  1
``````

I would like to assign values in "D" to 123 where "C" == 5. I tried

``````df2.loc[df2.C == 5, "D"] = 123
``````

Which seems fine but I am still getting the `SettingWithCopyWarning`! How do I fix this?

This is actually probably because of code higher up in your pipeline. Did you create `df2` from something larger, like

``````df2 = df[df.A > 5]
``````

? In this case, boolean indexing will return a view, so `df2` will reference the original. What you'd need to do is assign `df2` to a copy:

``````df2 = df[df.A > 5].copy()
# Or,
# df2 = df.loc[df.A > 5, :]
``````
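Put together, a minimal runnable sketch of the fix, using the example values from the frame above:

```python
import pandas as pd

df = pd.DataFrame({"A": [5, 9, 7], "B": [0, 3, 6], "C": [3, 5, 8],
                   "D": [3, 2, 8], "E": [7, 4, 1]})

# Explicit copy: df2 no longer shares data with df, so the later
# assignment on df2 is unambiguous and warning-free.
df2 = df[df.A > 5].copy()
df2.loc[df2.C == 5, "D"] = 123

print(df2["D"].tolist())  # [123, 8]
print(df["D"].tolist())   # [3, 2, 8]  -- original is untouched
```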

Question 4
I'm trying to drop column "C" in-place from

``````   A  B  C  D  E
1  9  3  5  2  4
2  7  6  8  8  1
``````

But using

``````df2.drop("C", axis=1, inplace=True)
``````

throws a `SettingWithCopyWarning`. Why is this happening?

This is because `df2` must have been created as a view from some other slicing operation, such as

``````df2 = df[df.A > 5]
``````

The solution here is to either make a `copy()` of `df`, or use `loc`, as before.
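For example, a sketch assuming `df2` came from a boolean slice of `df` as above:

```python
import pandas as pd

df = pd.DataFrame({"A": [5, 9, 7], "B": [0, 3, 6], "C": [3, 5, 8],
                   "D": [3, 2, 8], "E": [7, 4, 1]})

# Take an explicit copy before mutating in place; dropping on the copy
# cannot affect df, so pandas has nothing to warn about.
df2 = df[df.A > 5].copy()
df2.drop("C", axis=1, inplace=True)

print(list(df2.columns))  # ['A', 'B', 'D', 'E']
print("C" in df.columns)  # True -- the original keeps its column
```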

As of Django 1.8, refreshing objects is built in with `refresh_from_db()` (see the docs).

``````def test_update_result(self):
    obj = MyModel.objects.create(val=1)
    MyModel.objects.filter(pk=obj.pk).update(val=F("val") + 1)
    # At this point obj.val is still 1, but the value in the database
    # was updated to 2. The object's updated value needs to be reloaded
    # from the database.
    obj.refresh_from_db()
    self.assertEqual(obj.val, 2)
``````

You could use a loop:

``````conditions = (check_size, check_color, check_tone, check_flavor)
for condition in conditions:
    result = condition()
    if result:
        return result
``````

This has the added advantage that you can now make the number of conditions variable.
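For instance, with the loop wrapped in a function (the `check_*` names come from the question; their bodies here are invented for illustration):

```python
def check_size():
    return None  # no match

def check_color():
    return "red"  # first truthy result wins

def check_tone():
    return "warm"  # never reached

def first_match(conditions):
    """Return the first truthy result among the condition callables."""
    for condition in conditions:
        result = condition()
        if result:
            return result
    return None

print(first_match((check_size, check_color, check_tone)))  # red
```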

You could use `map()` + `filter()` (the Python 3 versions, use the `future_builtins` versions in Python 2) to get the first such matching value:

``````try:
    # Python 2
    from future_builtins import map, filter
except ImportError:
    # Python 3
    pass

conditions = (check_size, check_color, check_tone, check_flavor)
return next(filter(None, map(lambda f: f(), conditions)), None)
``````

but whether this is more readable is debatable.

Another option is to use a generator expression:

``````conditions = (check_size, check_color, check_tone, check_flavor)
checks = (condition() for condition in conditions)
return next((check for check in checks if check), None)
``````
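Because generator expressions are lazy, `next()` stops calling condition functions as soon as one returns a truthy value; the remaining checks never run. A small sketch demonstrating the short-circuiting (the stub functions are invented for illustration and record which checks were actually called):

```python
calls = []

def check_size():
    calls.append("size")
    return None  # falsy: keep looking

def check_color():
    calls.append("color")
    return "red"  # truthy: iteration stops here

def check_tone():
    calls.append("tone")  # never reached
    return "warm"

conditions = (check_size, check_color, check_tone)
checks = (condition() for condition in conditions)
result = next((check for check in checks if check), None)

print(result, calls)  # red ['size', 'color']
```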

Simplest of all solutions:

``````filtered_df = df[df["name"].notnull()]
``````

This keeps only the rows that don't have NaN values in the "name" column.

For multiple columns:

``````filtered_df = df[df[["name", "country", "region"]].notnull().all(1)]
``````
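A runnable sketch of both forms (the column names and values here are invented for illustration):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "name":    ["Alice", None, "Carol", "Dan"],
    "country": ["US", "DE", None, "FR"],
    "region":  ["West", "North", "South", np.nan],
})

# Single column: keep rows where "name" is not NaN/None.
filtered_df = df[df["name"].notnull()]

# Multiple columns: .all(1) requires every listed column to be non-null.
filtered_all = df[df[["name", "country", "region"]].notnull().all(1)]

print(filtered_df["name"].tolist())   # ['Alice', 'Carol', 'Dan']
print(filtered_all["name"].tolist())  # ['Alice']
```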

# Distribution Fitting with Sum of Square Error (SSE)

This is an update and modification to Saullo's answer that uses the full list of the current `scipy.stats` distributions and returns the distribution with the least SSE between the distribution's histogram and the data's histogram.
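The core of the approach for a single candidate distribution can be sketched with just the standard library: histogram the data as a density, evaluate the candidate PDF at the bin centers, and sum the squared differences. This sketch uses a normal PDF fitted by sample mean and standard deviation as a stand-in for `scipy.stats` fitting:

```python
import math
import random

random.seed(0)
data = [random.gauss(5.0, 2.0) for _ in range(10_000)]

def histogram_density(values, bins):
    """Return (bin centers, normalized densities), like np.histogram(density=True)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    centers = [lo + (i + 0.5) * width for i in range(bins)]
    density = [c / (len(values) * width) for c in counts]
    return centers, density

def normal_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# "Fit" the normal distribution by moment matching.
mu = sum(data) / len(data)
sigma = math.sqrt(sum((v - mu) ** 2 for v in data) / len(data))

# SSE between the data's histogram and the fitted PDF at the bin centers.
centers, density = histogram_density(data, bins=50)
sse = sum((d - normal_pdf(x, mu, sigma)) ** 2 for x, d in zip(centers, density))
print(round(sse, 4))  # small for a good fit
```

The full code below does the same thing, but loops over every continuous distribution in `scipy.stats` and keeps the candidates sorted by SSE.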

## Example Fitting

Using the El Niño dataset from `statsmodels`, the distributions are fit and the error is determined. The distribution with the least error is returned.

### All Distributions

(plot of all fitted distributions)

### Best Fit Distribution

(plot of the best-fit distribution)

### Example Code

``````%matplotlib inline

import warnings
import numpy as np
import pandas as pd
import scipy.stats as st
import statsmodels.api as sm
from scipy.stats._continuous_distns import _distn_names
import matplotlib
import matplotlib.pyplot as plt

matplotlib.rcParams["figure.figsize"] = (16.0, 12.0)
matplotlib.style.use("ggplot")

# Create models from data
def best_fit_distribution(data, bins=200, ax=None):
    """Model data by finding best fit distribution to data"""
    # Get histogram of original data
    y, x = np.histogram(data, bins=bins, density=True)
    x = (x + np.roll(x, -1))[:-1] / 2.0

    # Best holders
    best_distributions = []

    # Estimate distribution parameters from data
    for ii, distribution in enumerate([d for d in _distn_names if d not in ["levy_stable", "studentized_range"]]):

        print("{:>3} / {:<3}: {}".format(ii + 1, len(_distn_names), distribution))

        distribution = getattr(st, distribution)

        # Try to fit the distribution
        try:
            # Ignore warnings from data that can't be fit
            with warnings.catch_warnings():
                warnings.filterwarnings("ignore")

                # fit dist to data
                params = distribution.fit(data)

                # Separate parts of parameters
                arg = params[:-2]
                loc = params[-2]
                scale = params[-1]

                # Calculate fitted PDF and error with fit in distribution
                pdf = distribution.pdf(x, loc=loc, scale=scale, *arg)
                sse = np.sum(np.power(y - pdf, 2.0))

                # if axis passed in, add to plot
                try:
                    if ax:
                        pd.Series(pdf, x).plot(ax=ax)
                except Exception:
                    pass

                # track this distribution and its error
                best_distributions.append((distribution, params, sse))

        except Exception:
            pass

    # Sort candidates by SSE, best (smallest error) first
    return sorted(best_distributions, key=lambda x: x[2])

def make_pdf(dist, params, size=10000):
    """Generate distribution's Probability Distribution Function"""

    # Separate parts of parameters
    arg = params[:-2]
    loc = params[-2]
    scale = params[-1]

    # Get sane start and end points of distribution
    start = dist.ppf(0.01, *arg, loc=loc, scale=scale) if arg else dist.ppf(0.01, loc=loc, scale=scale)
    end = dist.ppf(0.99, *arg, loc=loc, scale=scale) if arg else dist.ppf(0.99, loc=loc, scale=scale)

    # Build PDF and turn into pandas Series
    x = np.linspace(start, end, size)
    y = dist.pdf(x, loc=loc, scale=scale, *arg)
    pdf = pd.Series(y, x)

    return pdf

# Load data from statsmodels datasets
data = pd.Series(sm.datasets.elnino.load_pandas().data.set_index("YEAR").values.ravel())

# Plot for comparison
plt.figure(figsize=(12, 8))
ax = data.plot(kind="hist", bins=50, density=True, alpha=0.5,
               color=list(matplotlib.rcParams["axes.prop_cycle"])[1]["color"])

# Save plot limits
dataYLim = ax.get_ylim()

# Find best fit distribution
best_distibutions = best_fit_distribution(data, 200, ax)
best_dist = best_distibutions[0]

# Update plots
ax.set_ylim(dataYLim)
ax.set_title(u"El Niño sea temp.\nAll Fitted Distributions")
ax.set_xlabel(u"Temp (°C)")
ax.set_ylabel("Frequency")

# Make PDF with best params
pdf = make_pdf(best_dist[0], best_dist[1])

# Display
plt.figure(figsize=(12, 8))
ax = pdf.plot(lw=2, label="PDF", legend=True)
data.plot(kind="hist", bins=50, density=True, alpha=0.5, label="Data", legend=True, ax=ax)

param_names = (best_dist[0].shapes + ", loc, scale").split(", ") if best_dist[0].shapes else ["loc", "scale"]
param_str = ", ".join(["{}={:0.2f}".format(k, v) for k, v in zip(param_names, best_dist[1])])
dist_str = "{}({})".format(best_dist[0].name, param_str)

ax.set_title(u"El Niño sea temp. with best fit distribution\n" + dist_str)
ax.set_xlabel(u"Temp. (°C)")
ax.set_ylabel("Frequency")
``````