NLP | Tokenizer training and filtering stop words in a sentence

File handling | filter | NLP | Python Methods and Functions | String Variables

Let's look at the following text to understand the concept. This kind of text is very common in any web text corpus.

  Example of TEXT:  A guy: So, what are your plans for the party? B girl: well! I am not going! A guy: Oh, but u should enjoy. 

To download the text file, click here .

Code # 1: Tutorial Tokenizer

# Loading libraries

from nltk.tokenize import PunktSentenceTokenizer

from nltk.corpus import webtext


text = webtext.raw ( 'C: Geeksforgeeks data_for_training_tokenizer .txt' )

sent_tokenizer  = PunktSentenceTokenizer (text)

sents_1 = sent_tokenizer.tokenize (text)


print (sents_1 [ 0 ])

print ( "" sents_1 [ 678 ])


 'White guy: So, do you have any plans for this evening?' 'Hobo: Got any spare change?' 

Code # 2: Default Offer Tokenizer

from nltk.tokenize import sent_tokenize

sents_2 = sent_tokenize (text)


print (sents_2 [ 0 ])

print ( "" sents_2 [ 678 ] )


 'White guy: So, do you have any plans for this evening? '' Girl: But you already have a Big Mac ... Hobo: Oh, this is all theatrical.' 

This difference in the second output is a good demonstration of why it can be useful to train your own sentence tokenizer, especially when your text is not in goes into a typical paragraph sentence structure.

How does learning work?
The PunktSentenceTokenizer class follows an unsupervised learning algorithm to find out what constitutes a break offers. This is unattended because there is no need to give any labeled training data, just raw text.

Filtering stop words in the tokenized sentence

Stop- words — these are ordinary words that appear in the text, but, as a rule, do not affect the meaning of the sentence. They are almost irrelevant for information retrieval and natural language processing purposes. For example — "A" and "a". Most search engines filter stop words from search queries and documents. 
The NLTK library comes with the nltk_data / corpora / stopwords / corpus of words —  nltk_data / corpora / stopwords / which contains wordlists for many languages.

Code # 3: Stopwords with Python

# Loading library

from nltk.corpus import stopwords

# Using stop words from English

english_stops = set (stopwords.words ( ' english' ))

# Print stop word list in English

words  = [ "Let's" , 'see ',' how '," it' s", 'working' ]


print ( " Before stopwords removal: " , words)

print ( " After stopwords removal: " ,

  [word for word in words if word not in english_stops])

Exit :

 Before stopwords removal: ["Let's", 'see',' how', "it's", 'working'] After stopwords removal: ["Let's", 'see',' working']? 

Code # 4: Complete list of languages ​​used in NLTK stopwords.

stopwords.fileids ()

Exit :

 ['danish',' dutch', 'english',' finnish', 'french',' german', 'hungarian' , 'italian',' norwegian', 'portuguese',' russian', 'spanish',' swedish', 'turkish'] 

NLP | Tokenizer training and filtering stop words in a sentence: StackOverflow Questions

List comprehension vs. lambda + filter

I happened to find myself having a basic filtering need: I have a list and I have to filter it by an attribute of the items.

My code looked like this:

my_list = [x for x in my_list if x.attribute == value]

But then I thought, wouldn"t it be better to write it like this?

my_list = filter(lambda x: x.attribute == value, my_list)

It"s more readable, and if needed for performance the lambda could be taken out to gain something.

Question is: are there any caveats in using the second way? Any performance difference? Am I missing the Pythonic Way‚Ñ¢ entirely and should do it in yet another way (such as using itemgetter instead of the lambda)?

How do I do a not equal in Django queryset filtering?

Question by MikeN

In Django model QuerySets, I see that there is a __gt and __lt for comparative values, but is there a __ne or != (not equals)? I want to filter out using a not equals. For example, for

    bool a;
    int x;

I want to do

results = Model.objects.exclude(a=True, x!=5)

The != is not correct syntax. I also tried __ne.

I ended up using:

results = Model.objects.exclude(a=True, x__lt=5).exclude(a=True, x__gt=5)

Filter dict to contain only certain keys?

I"ve got a dict that has a whole bunch of entries. I"m only interested in a select few of them. Is there an easy way to prune all the other ones out?

How to filter Pandas dataframe using "in" and "not in" like in SQL

How can I achieve the equivalents of SQL"s IN and NOT IN?

I have a list with the required values. Here"s the scenario:

df = pd.DataFrame({"country": ["US", "UK", "Germany", "China"]})
countries_to_keep = ["UK", "China"]

# pseudo-code:
df[df["country"] not in countries_to_keep]

My current way of doing this is as follows:

df = pd.DataFrame({"country": ["US", "UK", "Germany", "China"]})
df2 = pd.DataFrame({"country": ["UK", "China"], "matched": True})

# IN
df.merge(df2, how="inner", on="country")

not_in = df.merge(df2, how="left", on="country")
not_in = not_in[pd.isnull(not_in["matched"])]

But this seems like a horrible kludge. Can anyone improve on it?

Filter dataframe rows if value in column is in a set list of values

I have a Python pandas DataFrame rpt:

<class "pandas.core.frame.DataFrame">
MultiIndex: 47518 entries, ("000002", "20120331") to ("603366", "20091231")
Data columns:
STK_ID                    47518  non-null values
STK_Name                  47518  non-null values
RPT_Date                  47518  non-null values
sales                     47518  non-null values

I can filter the rows whose stock id is "600809" like this: rpt[rpt["STK_ID"] == "600809"]

<class "pandas.core.frame.DataFrame">
MultiIndex: 25 entries, ("600809", "20120331") to ("600809", "20060331")
Data columns:
STK_ID                    25  non-null values
STK_Name                  25  non-null values
RPT_Date                  25  non-null values
sales                     25  non-null values

and I want to get all the rows of some stocks together, such as ["600809","600141","600329"]. That means I want a syntax like this:

stk_list = ["600809","600141","600329"]

rst = rpt[rpt["STK_ID"] in stk_list] # this does not works in pandas 

Since pandas not accept above command, how to achieve the target?

pandas: filter rows of DataFrame with operator chaining

Most operations in pandas can be accomplished with operator chaining (groupby, aggregate, apply, etc), but the only way I"ve found to filter rows is via normal bracket indexing

df_filtered = df[df["column"] == value]

This is unappealing as it requires I assign df to a variable before being able to filter on its values. Is there something more like the following?

df_filtered = df.mask(lambda x: x["column"] == value)

How can I filter a Django query with a list of values?

I"m sure this is a trivial operation, but I can"t figure out how it"s done.

There"s got to be something smarter than this:

ids = [1, 3, 6, 7, 9]

for id in ids:

I"m looking to get them all in one query with something like:

MyModel.objects.filter(pk=[1, 3, 6, 7, 9])

How can I filter a Django query with a list of values?

Difference between filter and filter_by in SQLAlchemy

Could anyone explain the difference between filter and filter_by functions in SQLAlchemy? Which one should I be using?

How to use filter, map, and reduce in Python 3

filter, map, and reduce work perfectly in Python 2. Here is an example:

>>> def f(x):
        return x % 2 != 0 and x % 3 != 0
>>> filter(f, range(2, 25))
[5, 7, 11, 13, 17, 19, 23]

>>> def cube(x):
        return x*x*x
>>> map(cube, range(1, 11))
[1, 8, 27, 64, 125, 216, 343, 512, 729, 1000]

>>> def add(x,y):
        return x+y
>>> reduce(add, range(1, 11))

But in Python 3, I receive the following outputs:

>>> filter(f, range(2, 25))
<filter object at 0x0000000002C14908>

>>> map(cube, range(1, 11))
<map object at 0x0000000002C82B70>

>>> reduce(add, range(1, 11))
Traceback (most recent call last):
  File "<pyshell#8>", line 1, in <module>
    reduce(add, range(1, 11))
NameError: name "reduce" is not defined

I would appreciate if someone could explain to me why this is.

Screenshot of code for further clarity:

IDLE sessions of Python 2 and 3 side-by-side

Get a filtered list of files in a directory

I am trying to get a list of files in a directory using Python, but I do not want a list of ALL the files.

What I essentially want is the ability to do something like the following but using Python and not executing ls.

ls 145592*.jpg

If there is no built-in method for this, I am currently thinking of writing a for loop to iterate through the results of an os.listdir() and to append all the matching files to a new list.

However, there are a lot of files in that directory and therefore I am hoping there is a more efficient method (or a built-in method).

Answer #1

This post aims to give readers a primer on SQL-flavored merging with Pandas, how to use it, and when not to use it.

In particular, here"s what this post will go through:

  • The basics - types of joins (LEFT, RIGHT, OUTER, INNER)

    • merging with different column names
    • merging with multiple columns
    • avoiding duplicate merge key column in output

What this post (and other posts by me on this thread) will not go through:

  • Performance-related discussions and timings (for now). Mostly notable mentions of better alternatives, wherever appropriate.
  • Handling suffixes, removing extra columns, renaming outputs, and other specific use cases. There are other (read: better) posts that deal with that, so figure it out!

Note Most examples default to INNER JOIN operations while demonstrating various features, unless otherwise specified.

Furthermore, all the DataFrames here can be copied and replicated so you can play with them. Also, see this post on how to read DataFrames from your clipboard.

Lastly, all visual representation of JOIN operations have been hand-drawn using Google Drawings. Inspiration from here.

Enough talk - just show me how to use merge!

Setup & Basics

left = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": np.random.randn(4)})
right = pd.DataFrame({"key": ["B", "D", "E", "F"], "value": np.random.randn(4)})


  key     value
0   A  1.764052
1   B  0.400157
2   C  0.978738
3   D  2.240893


  key     value
0   B  1.867558
1   D -0.977278
2   E  0.950088
3   F -0.151357

For the sake of simplicity, the key column has the same name (for now).

An INNER JOIN is represented by

Note This, along with the forthcoming figures all follow this convention:

  • blue indicates rows that are present in the merge result
  • red indicates rows that are excluded from the result (i.e., removed)
  • green indicates missing values that are replaced with NaNs in the result

To perform an INNER JOIN, call merge on the left DataFrame, specifying the right DataFrame and the join key (at the very least) as arguments.

left.merge(right, on="key")
# Or, if you want to be explicit
# left.merge(right, on="key", how="inner")

  key   value_x   value_y
0   B  0.400157  1.867558
1   D  2.240893 -0.977278

This returns only rows from left and right which share a common key (in this example, "B" and "D).

A LEFT OUTER JOIN, or LEFT JOIN is represented by

This can be performed by specifying how="left".

left.merge(right, on="key", how="left")

  key   value_x   value_y
0   A  1.764052       NaN
1   B  0.400157  1.867558
2   C  0.978738       NaN
3   D  2.240893 -0.977278

Carefully note the placement of NaNs here. If you specify how="left", then only keys from left are used, and missing data from right is replaced by NaN.

And similarly, for a RIGHT OUTER JOIN, or RIGHT JOIN which is...

...specify how="right":

left.merge(right, on="key", how="right")

  key   value_x   value_y
0   B  0.400157  1.867558
1   D  2.240893 -0.977278
2   E       NaN  0.950088
3   F       NaN -0.151357

Here, keys from right are used, and missing data from left is replaced by NaN.

Finally, for the FULL OUTER JOIN, given by

specify how="outer".

left.merge(right, on="key", how="outer")

  key   value_x   value_y
0   A  1.764052       NaN
1   B  0.400157  1.867558
2   C  0.978738       NaN
3   D  2.240893 -0.977278
4   E       NaN  0.950088
5   F       NaN -0.151357

This uses the keys from both frames, and NaNs are inserted for missing rows in both.

The documentation summarizes these various merges nicely:

Enter image description here

Other JOINs - LEFT-Excluding, RIGHT-Excluding, and FULL-Excluding/ANTI JOINs

If you need LEFT-Excluding JOINs and RIGHT-Excluding JOINs in two steps.

For LEFT-Excluding JOIN, represented as

Start by performing a LEFT OUTER JOIN and then filtering (excluding!) rows coming from left only,

(left.merge(right, on="key", how="left", indicator=True)
     .query("_merge == "left_only"")
     .drop("_merge", 1))

  key   value_x  value_y
0   A  1.764052      NaN
2   C  0.978738      NaN


left.merge(right, on="key", how="left", indicator=True)

  key   value_x   value_y     _merge
0   A  1.764052       NaN  left_only
1   B  0.400157  1.867558       both
2   C  0.978738       NaN  left_only
3   D  2.240893 -0.977278       both

And similarly, for a RIGHT-Excluding JOIN,

(left.merge(right, on="key", how="right", indicator=True)
     .query("_merge == "right_only"")
     .drop("_merge", 1))

  key  value_x   value_y
2   E      NaN  0.950088
3   F      NaN -0.151357

Lastly, if you are required to do a merge that only retains keys from the left or right, but not both (IOW, performing an ANTI-JOIN),

You can do this in similar fashion—

(left.merge(right, on="key", how="outer", indicator=True)
     .query("_merge != "both"")
     .drop("_merge", 1))

  key   value_x   value_y
0   A  1.764052       NaN
2   C  0.978738       NaN
4   E       NaN  0.950088
5   F       NaN -0.151357

Different names for key columns

If the key columns are named differently—for example, left has keyLeft, and right has keyRight instead of key—then you will have to specify left_on and right_on as arguments instead of on:

left2 = left.rename({"key":"keyLeft"}, axis=1)
right2 = right.rename({"key":"keyRight"}, axis=1)


  keyLeft     value
0       A  1.764052
1       B  0.400157
2       C  0.978738
3       D  2.240893


  keyRight     value
0        B  1.867558
1        D -0.977278
2        E  0.950088
3        F -0.151357
left2.merge(right2, left_on="keyLeft", right_on="keyRight", how="inner")

  keyLeft   value_x keyRight   value_y
0       B  0.400157        B  1.867558
1       D  2.240893        D -0.977278

Avoiding duplicate key column in output

When merging on keyLeft from left and keyRight from right, if you only want either of the keyLeft or keyRight (but not both) in the output, you can start by setting the index as a preliminary step.

left3 = left2.set_index("keyLeft")
left3.merge(right2, left_index=True, right_on="keyRight")

    value_x keyRight   value_y
0  0.400157        B  1.867558
1  2.240893        D -0.977278

Contrast this with the output of the command just before (that is, the output of left2.merge(right2, left_on="keyLeft", right_on="keyRight", how="inner")), you"ll notice keyLeft is missing. You can figure out what column to keep based on which frame"s index is set as the key. This may matter when, say, performing some OUTER JOIN operation.

Merging only a single column from one of the DataFrames

For example, consider

right3 = right.assign(newcol=np.arange(len(right)))
  key     value  newcol
0   B  1.867558       0
1   D -0.977278       1
2   E  0.950088       2
3   F -0.151357       3

If you are required to merge only "new_val" (without any of the other columns), you can usually just subset columns before merging:

left.merge(right3[["key", "newcol"]], on="key")

  key     value  newcol
0   B  0.400157       0
1   D  2.240893       1

If you"re doing a LEFT OUTER JOIN, a more performant solution would involve map:

# left["newcol"] = left["key"].map(right3.set_index("key")["newcol"]))

  key     value  newcol
0   A  1.764052     NaN
1   B  0.400157     0.0
2   C  0.978738     NaN
3   D  2.240893     1.0

As mentioned, this is similar to, but faster than

left.merge(right3[["key", "newcol"]], on="key", how="left")

  key     value  newcol
0   A  1.764052     NaN
1   B  0.400157     0.0
2   C  0.978738     NaN
3   D  2.240893     1.0

Merging on multiple columns

To join on more than one column, specify a list for on (or left_on and right_on, as appropriate).

left.merge(right, on=["key1", "key2"] ...)

Or, in the event the names are different,

left.merge(right, left_on=["lkey1", "lkey2"], right_on=["rkey1", "rkey2"])

Other useful merge* operations and functions

This section only covers the very basics, and is designed to only whet your appetite. For more examples and cases, see the documentation on merge, join, and concat as well as the links to the function specifications.

Continue Reading

Jump to other topics in Pandas Merging 101 to continue learning:

*You are here.

Answer #2

The or and and python statements require truth-values. For pandas these are considered ambiguous so you should use "bitwise" | (or) or & (and) operations:

result = result[(result["var"]>0.25) | (result["var"]<-0.25)]

These are overloaded for these kind of datastructures to yield the element-wise or (or and).

Just to add some more explanation to this statement:

The exception is thrown when you want to get the bool of a pandas.Series:

>>> import pandas as pd
>>> x = pd.Series([1])
>>> bool(x)
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

What you hit was a place where the operator implicitly converted the operands to bool (you used or but it also happens for and, if and while):

>>> x or x
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
>>> x and x
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
>>> if x:
...     print("fun")
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
>>> while x:
...     print("fun")
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Besides these 4 statements there are several python functions that hide some bool calls (like any, all, filter, ...) these are normally not problematic with pandas.Series but for completeness I wanted to mention these.

In your case the exception isn"t really helpful, because it doesn"t mention the right alternatives. For and and or you can use (if you want element-wise comparisons):

  • numpy.logical_or:

    >>> import numpy as np
    >>> np.logical_or(x, y)

    or simply the | operator:

    >>> x | y
  • numpy.logical_and:

    >>> np.logical_and(x, y)

    or simply the & operator:

    >>> x & y

If you"re using the operators then make sure you set your parenthesis correctly because of the operator precedence.

There are several logical numpy functions which should work on pandas.Series.

The alternatives mentioned in the Exception are more suited if you encountered it when doing if or while. I"ll shortly explain each of these:

  • If you want to check if your Series is empty:

    >>> x = pd.Series([])
    >>> x.empty
    >>> x = pd.Series([1])
    >>> x.empty

    Python normally interprets the length of containers (like list, tuple, ...) as truth-value if it has no explicit boolean interpretation. So if you want the python-like check, you could do: if x.size or if not x.empty instead of if x.

  • If your Series contains one and only one boolean value:

    >>> x = pd.Series([100])
    >>> (x > 50).bool()
    >>> (x < 50).bool()
  • If you want to check the first and only item of your Series (like .bool() but works even for not boolean contents):

    >>> x = pd.Series([100])
    >>> x.item()
  • If you want to check if all or any item is not-zero, not-empty or not-False:

    >>> x = pd.Series([0, 1, 2])
    >>> x.all()   # because one element is zero
    >>> x.any()   # because one (or more) elements are non-zero

Answer #3

There are many ways to convert an instance to a dictionary, with varying degrees of corner case handling and closeness to the desired result.

1. instance.__dict__


which returns

{"_foreign_key_cache": <OtherModel: OtherModel object>,
 "_state": <django.db.models.base.ModelState at 0x7ff0993f6908>,
 "auto_now_add": datetime.datetime(2018, 12, 20, 21, 34, 29, 494827, tzinfo=<UTC>),
 "foreign_key_id": 2,
 "id": 1,
 "normal_value": 1,
 "readonly_value": 2}

This is by far the simplest, but is missing many_to_many, foreign_key is misnamed, and it has two unwanted extra things in it.

2. model_to_dict

from django.forms.models import model_to_dict

which returns

{"foreign_key": 2,
 "id": 1,
 "many_to_many": [<OtherModel: OtherModel object>],
 "normal_value": 1}

This is the only one with many_to_many, but is missing the uneditable fields.

3. model_to_dict(..., fields=...)

from django.forms.models import model_to_dict
model_to_dict(instance, fields=[ for field in instance._meta.fields])

which returns

{"foreign_key": 2, "id": 1, "normal_value": 1}

This is strictly worse than the standard model_to_dict invocation.

4. query_set.values()


which returns

{"auto_now_add": datetime.datetime(2018, 12, 20, 21, 34, 29, 494827, tzinfo=<UTC>),
 "foreign_key_id": 2,
 "id": 1,
 "normal_value": 1,
 "readonly_value": 2}

This is the same output as instance.__dict__ but without the extra fields. foreign_key_id is still wrong and many_to_many is still missing.

5. Custom Function

The code for django"s model_to_dict had most of the answer. It explicitly removed non-editable fields, so removing that check and getting the ids of foreign keys for many to many fields results in the following code which behaves as desired:

from itertools import chain

def to_dict(instance):
    opts = instance._meta
    data = {}
    for f in chain(opts.concrete_fields, opts.private_fields):
        data[] = f.value_from_object(instance)
    for f in opts.many_to_many:
        data[] = [ for i in f.value_from_object(instance)]
    return data

While this is the most complicated option, calling to_dict(instance) gives us exactly the desired result:

{"auto_now_add": datetime.datetime(2018, 12, 20, 21, 34, 29, 494827, tzinfo=<UTC>),
 "foreign_key": 2,
 "id": 1,
 "many_to_many": [2],
 "normal_value": 1,
 "readonly_value": 2}

6. Use Serializers

Django Rest Framework"s ModelSerialzer allows you to build a serializer automatically from a model.

from rest_framework import serializers
class SomeModelSerializer(serializers.ModelSerializer):
    class Meta:
        model = SomeModel
        fields = "__all__"



{"auto_now_add": "2018-12-20T21:34:29.494827Z",
 "foreign_key": 2,
 "id": 1,
 "many_to_many": [2],
 "normal_value": 1,
 "readonly_value": 2}

This is almost as good as the custom function, but auto_now_add is a string instead of a datetime object.

Bonus Round: better model printing

If you want a django model that has a better python command-line display, have your models child-class the following:

from django.db import models
from itertools import chain

class PrintableModel(models.Model):
    def __repr__(self):
        return str(self.to_dict())

    def to_dict(instance):
        opts = instance._meta
        data = {}
        for f in chain(opts.concrete_fields, opts.private_fields):
            data[] = f.value_from_object(instance)
        for f in opts.many_to_many:
            data[] = [ for i in f.value_from_object(instance)]
        return data

    class Meta:
        abstract = True

So, for example, if we define our models as such:

class OtherModel(PrintableModel): pass

class SomeModel(PrintableModel):
    normal_value = models.IntegerField()
    readonly_value = models.IntegerField(editable=False)
    auto_now_add = models.DateTimeField(auto_now_add=True)
    foreign_key = models.ForeignKey(OtherModel, related_name="ref1")
    many_to_many = models.ManyToManyField(OtherModel, related_name="ref2")

Calling SomeModel.objects.first() now gives output like this:

{"auto_now_add": datetime.datetime(2018, 12, 20, 21, 34, 29, 494827, tzinfo=<UTC>),
 "foreign_key": 2,
 "id": 1,
 "many_to_many": [2],
 "normal_value": 1,
 "readonly_value": 2}

Answer #4

If you like ascii art:

  • "VALID" = without padding:

       inputs:         1  2  3  4  5  6  7  8  9  10 11 (12 13)
                      |________________|                dropped
  • "SAME" = with zero padding:

                   pad|                                      |pad
       inputs:      0 |1  2  3  4  5  6  7  8  9  10 11 12 13|0  0

In this example:

  • Input width = 13
  • Filter width = 6
  • Stride = 5


  • "VALID" only ever drops the right-most columns (or bottom-most rows).
  • "SAME" tries to pad evenly left and right, but if the amount of columns to be added is odd, it will add the extra column to the right, as is the case in this example (the same logic applies vertically: there may be an extra row of zeros at the bottom).


About the name:

  • With "SAME" padding, if you use a stride of 1, the layer"s outputs will have the same spatial dimensions as its inputs.
  • With "VALID" padding, there"s no "made-up" padding inputs. The layer only uses valid input data.

Answer #5

There is a way of doing this and it actually looks similar to R

new = old[["A", "C", "D"]].copy()

Here you are just selecting the columns you want from the original data frame and creating a variable for those. If you want to modify the new dataframe at all you"ll probably want to use .copy() to avoid a SettingWithCopyWarning.

An alternative method is to use filter which will create a copy by default:

new = old.filter(["A","B","D"], axis=1)

Finally, depending on the number of columns in your original dataframe, it might be more succinct to express this using a drop (this will also create a copy by default):

new = old.drop("B", axis=1)

Answer #6

How to deal with SettingWithCopyWarning in Pandas?

This post is meant for readers who,

  1. Would like to understand what this warning means
  2. Would like to understand different ways of suppressing this warning
  3. Would like to understand how to improve their code and follow good practices to avoid this warning in the future.


df = pd.DataFrame(np.random.choice(10, (3, 5)), columns=list("ABCDE"))
   A  B  C  D  E
0  5  0  3  3  7
1  9  3  5  2  4
2  7  6  8  8  1

What is the SettingWithCopyWarning?

To know how to deal with this warning, it is important to understand what it means and why it is raised in the first place.

When filtering DataFrames, it is possible slice/index a frame to return either a view, or a copy, depending on the internal layout and various implementation details. A "view" is, as the term suggests, a view into the original data, so modifying the view may modify the original object. On the other hand, a "copy" is a replication of data from the original, and modifying the copy has no effect on the original.

As mentioned by other answers, the SettingWithCopyWarning was created to flag "chained assignment" operations. Consider df in the setup above. Suppose you would like to select all values in column "B" where values in column "A" is > 5. Pandas allows you to do this in different ways, some more correct than others. For example,

df[df.A > 5]["B"]
1    3
2    6
Name: B, dtype: int64


df.loc[df.A > 5, "B"]

1    3
2    6
Name: B, dtype: int64

These return the same result, so if you are only reading these values, it makes no difference. So, what is the issue? The problem with chained assignment, is that it is generally difficult to predict whether a view or a copy is returned, so this largely becomes an issue when you are attempting to assign values back. To build on the earlier example, consider how this code is executed by the interpreter:

df.loc[df.A > 5, "B"] = 4
# becomes
df.__setitem__((df.A > 5, "B"), 4)

With a single __setitem__ call to df. OTOH, consider this code:

df[df.A > 5]["B"] = 4
# becomes
df.__getitem__(df.A > 5).__setitem__("B", 4)

Now, depending on whether __getitem__ returned a view or a copy, the __setitem__ operation may not work.

In general, you should use loc for label-based assignment, and iloc for integer/positional based assignment, as the spec guarantees that they always operate on the original. Additionally, for setting a single cell, you should use at and iat.

More can be found in the documentation.

All boolean indexing operations done with loc can also be done with iloc. The only difference is that iloc expects either integers/positions for index or a numpy array of boolean values, and integer/position indexes for the columns.

For example,

df.loc[df.A > 5, "B"] = 4

Can be written nas

df.iloc[(df.A > 5).values, 1] = 4


df.loc[1, "A"] = 100

Can be written as

df.iloc[1, 0] = 100

And so on.

Just tell me how to suppress the warning!

Consider a simple operation on the "A" column of df. Selecting "A" and dividing by 2 will raise the warning, but the operation will work.

df2 = df[["A"]]
df2["A"] /= 2
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/IPython/ SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

0  2.5
1  4.5
2  3.5

There are a couple ways of directly silencing this warning:

  1. (recommended) Use loc to slice subsets:

     df2 = df.loc[:, ["A"]]
     df2["A"] /= 2     # Does not raise 
  2. Change pd.options.mode.chained_assignment
    Can be set to None, "warn", or "raise". "warn" is the default. None will suppress the warning entirely, and "raise" will throw a SettingWithCopyError, preventing the operation from going through.

     pd.options.mode.chained_assignment = None
     df2["A"] /= 2
  3. Make a deepcopy

     df2 = df[["A"]].copy(deep=True)
     df2["A"] /= 2

@Peter Cotton in the comments, came up with a nice way of non-intrusively changing the mode (modified from this gist) using a context manager, to set the mode only as long as it is required, and the reset it back to the original state when finished.

class ChainedAssignent:
    def __init__(self, chained=None):
        acceptable = [None, "warn", "raise"]
        assert chained in acceptable, "chained must be in " + str(acceptable)
        self.swcw = chained

    def __enter__(self):
        self.saved_swcw = pd.options.mode.chained_assignment
        pd.options.mode.chained_assignment = self.swcw
        return self

    def __exit__(self, *args):
        pd.options.mode.chained_assignment = self.saved_swcw

The usage is as follows:

# some code here
with ChainedAssignent():
    df2["A"] /= 2
# more code follows

Or, to raise the exception

with ChainedAssignent(chained="raise"):
    df2["A"] /= 2

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

The "XY Problem": What am I doing wrong?

A lot of the time, users attempt to look for ways of suppressing this exception without fully understanding why it was raised in the first place. This is a good example of an XY problem, where users attempt to solve a problem "Y" that is actually a symptom of a deeper rooted problem "X". Questions will be raised based on common problems that encounter this warning, and solutions will then be presented.

Question 1
I have a DataFrame

       A  B  C  D  E
    0  5  0  3  3  7
    1  9  3  5  2  4
    2  7  6  8  8  1

I want to assign values in col "A" > 5 to 1000. My expected output is

      A  B  C  D  E
0     5  0  3  3  7
1  1000  3  5  2  4
2  1000  6  8  8  1

Wrong way to do this:

df.A[df.A > 5] = 1000         # works, because df.A returns a view
df[df.A > 5]["A"] = 1000      # does not work
df.loc[df.A  5]["A"] = 1000   # does not work

Right way using loc:

df.loc[df.A > 5, "A"] = 1000

Question 21
I am trying to set the value in cell (1, "D") to 12345. My expected output is

   A  B  C      D  E
0  5  0  3      3  7
1  9  3  5  12345  4
2  7  6  8      8  1

I have tried different ways of accessing this cell, such as df["D"][1]. What is the best way to do this?

1. This question isn"t specifically related to the warning, but it is good to understand how to do this particular operation correctly so as to avoid situations where the warning could potentially arise in future.

You can use any of the following methods to do this.

df.loc[1, "D"] = 12345
df.iloc[1, 3] = 12345[1, "D"] = 12345
df.iat[1, 3] = 12345

Question 3
I am trying to subset values based on some condition. I have a DataFrame

   A  B  C  D  E
1  9  3  5  2  4
2  7  6  8  8  1

I would like to assign values in "D" to 123 such that "C" == 5. I tried

df2.loc[df2.C == 5, "D"] = 123

Which seems fine but I am still getting the SettingWithCopyWarning! How do I fix this?

This is actually probably because of code higher up in your pipeline. Did you create df2 from something larger, like

df2 = df[df.A > 5]

? In this case, boolean indexing will return a view, so df2 will reference the original. What you"d need to do is assign df2 to a copy:

df2 = df[df.A > 5].copy()
# Or,
# df2 = df.loc[df.A > 5, :]

Question 4
I"m trying to drop column "C" in-place from

   A  B  C  D  E
1  9  3  5  2  4
2  7  6  8  8  1

But using

df2.drop("C", axis=1, inplace=True)

Throws SettingWithCopyWarning. Why is this happening?

This is because df2 must have been created as a view from some other slicing operation, such as

df2 = df[df.A > 5]

The solution here is to either make a copy() of df, or use loc, as before.

Answer #7

As of Django 1.8 refreshing objects is built in. Link to docs.

def test_update_result(self):
    obj = MyModel.objects.create(val=1)
    MyModel.objects.filter("val") + 1)
    # At this point obj.val is still 1, but the value in the database
    # was updated to 2. The object"s updated value needs to be reloaded
    # from the database.
    self.assertEqual(obj.val, 2)

Answer #8

You could use a loop:

conditions = (check_size, check_color, check_tone, check_flavor)
for condition in conditions:
    result = condition()
    if result:
        return result

This has the added advantage that you can now make the number of conditions variable.

You could use map() + filter() (the Python 3 versions, use the future_builtins versions in Python 2) to get the first such matching value:

    # Python 2
    from future_builtins import map, filter
except ImportError:
    # Python 3

conditions = (check_size, check_color, check_tone, check_flavor)
return next(filter(None, map(lambda f: f(), conditions)), None)

but if this is more readable is debatable.

Another option is to use a generator expression:

conditions = (check_size, check_color, check_tone, check_flavor)
checks = (condition() for condition in conditions)
return next((check for check in checks if check), None)

Answer #9

Simplest of all solutions:

filtered_df = df[df["name"].notnull()]

Thus, it filters out only rows that doesn"t have NaN values in "name" column.

For multiple columns:

filtered_df = df[df[["name", "country", "region"]].notnull().all(1)]

Answer #10

Distribution Fitting with Sum of Square Error (SSE)

This is an update and modification to Saullo"s answer, that uses the full list of the current scipy.stats distributions and returns the distribution with the least SSE between the distribution"s histogram and the data"s histogram.

Example Fitting

Using the El Niño dataset from statsmodels, the distributions are fit and error is determined. The distribution with the least error is returned.

All Distributions

All Fitted Distributions

Best Fit Distribution

Best Fit Distribution

Example Code

%matplotlib inline

import warnings
import numpy as np
import pandas as pd
import scipy.stats as st
import statsmodels.api as sm
from scipy.stats._continuous_distns import _distn_names
import matplotlib
import matplotlib.pyplot as plt

matplotlib.rcParams["figure.figsize"] = (16.0, 12.0)"ggplot")

# Create models from data
def best_fit_distribution(data, bins=200, ax=None):
    """Model data by finding best fit distribution to data"""
    # Get histogram of original data
    y, x = np.histogram(data, bins=bins, density=True)
    x = (x + np.roll(x, -1))[:-1] / 2.0

    # Best holders
    best_distributions = []

    # Estimate distribution parameters from data
    for ii, distribution in enumerate([d for d in _distn_names if not d in ["levy_stable", "studentized_range"]]):

        print("{:>3} / {:<3}: {}".format( ii+1, len(_distn_names), distribution ))

        distribution = getattr(st, distribution)

        # Try to fit the distribution
            # Ignore warnings from data that can"t be fit
            with warnings.catch_warnings():
                # fit dist to data
                params =

                # Separate parts of parameters
                arg = params[:-2]
                loc = params[-2]
                scale = params[-1]
                # Calculate fitted PDF and error with fit in distribution
                pdf = distribution.pdf(x, loc=loc, scale=scale, *arg)
                sse = np.sum(np.power(y - pdf, 2.0))
                # if axis pass in add to plot
                    if ax:
                        pd.Series(pdf, x).plot(ax=ax)
                except Exception:

                # identify if this distribution is better
                best_distributions.append((distribution, params, sse))
        except Exception:

    return sorted(best_distributions, key=lambda x:x[2])

def make_pdf(dist, params, size=10000):
    """Generate distributions"s Probability Distribution Function """

    # Separate parts of parameters
    arg = params[:-2]
    loc = params[-2]
    scale = params[-1]

    # Get sane start and end points of distribution
    start = dist.ppf(0.01, *arg, loc=loc, scale=scale) if arg else dist.ppf(0.01, loc=loc, scale=scale)
    end = dist.ppf(0.99, *arg, loc=loc, scale=scale) if arg else dist.ppf(0.99, loc=loc, scale=scale)

    # Build PDF and turn into pandas Series
    x = np.linspace(start, end, size)
    y = dist.pdf(x, loc=loc, scale=scale, *arg)
    pdf = pd.Series(y, x)

    return pdf

# Load data from statsmodels datasets
data = pd.Series(sm.datasets.elnino.load_pandas().data.set_index("YEAR").values.ravel())

# Plot for comparison
ax = data.plot(kind="hist", bins=50, density=True, alpha=0.5, color=list(matplotlib.rcParams["axes.prop_cycle"])[1]["color"])

# Save plot limits
dataYLim = ax.get_ylim()

# Find best fit distribution
best_distibutions = best_fit_distribution(data, 200, ax)
best_dist = best_distibutions[0]

# Update plots
ax.set_title(u"El Niño sea temp.
 All Fitted Distributions")
ax.set_xlabel(u"Temp (°C)")

# Make PDF with best params 
pdf = make_pdf(best_dist[0], best_dist[1])

# Display
ax = pdf.plot(lw=2, label="PDF", legend=True)
data.plot(kind="hist", bins=50, density=True, alpha=0.5, label="Data", legend=True, ax=ax)

param_names = (best_dist[0].shapes + ", loc, scale").split(", ") if best_dist[0].shapes else ["loc", "scale"]
param_str = ", ".join(["{}={:0.2f}".format(k,v) for k,v in zip(param_names, best_dist[1])])
dist_str = "{}({})".format(best_dist[0].name, param_str)

ax.set_title(u"El Niño sea temp. with best fit distribution 
" + dist_str)
ax.set_xlabel(u"Temp. (°C)")