pandas get rows which are NOT in other dataframe

StackOverflow

I"ve two pandas data frames that have some rows in common.

Suppose dataframe2 is a subset of dataframe1.

How can I get the rows of dataframe1 which are not in dataframe2?

df1 = pandas.DataFrame(data = {"col1" : [1, 2, 3, 4, 5], "col2" : [10, 11, 12, 13, 14]}) 
df2 = pandas.DataFrame(data = {"col1" : [1, 2, 3], "col2" : [10, 11, 12]})

df1

   col1  col2
0     1    10
1     2    11
2     3    12
3     4    13
4     5    14

df2

   col1  col2
0     1    10
1     2    11
2     3    12

Expected result:

   col1  col2
3     4    13
4     5    14

Answer rating: 272

The currently selected solution produces incorrect results. To correctly solve this problem, we can perform a left-join from df1 to df2, making sure to first get just the unique rows for df2.

First, we need to modify the original DataFrame to add the row with data [3, 10].

df1 = pd.DataFrame(data = {"col1" : [1, 2, 3, 4, 5, 3], 
                           "col2" : [10, 11, 12, 13, 14, 10]}) 
df2 = pd.DataFrame(data = {"col1" : [1, 2, 3],
                           "col2" : [10, 11, 12]})

df1

   col1  col2
0     1    10
1     2    11
2     3    12
3     4    13
4     5    14
5     3    10

df2

   col1  col2
0     1    10
1     2    11
2     3    12

Perform a left-join, eliminating duplicates in df2 so that each row of df1 joins with exactly 1 row of df2. Use the parameter indicator to return an extra column indicating which table the row was from.

df_all = df1.merge(df2.drop_duplicates(), on=["col1","col2"], 
                   how="left", indicator=True)
df_all

   col1  col2     _merge
0     1    10       both
1     2    11       both
2     3    12       both
3     4    13  left_only
4     5    14  left_only
5     3    10  left_only

Create a boolean condition:

df_all["_merge"] == "left_only"

0    False
1    False
2    False
3     True
4     True
5     True
Name: _merge, dtype: bool

Why other solutions are wrong

A few solutions make the same mistake - they only check that each value is independently in each column, not together in the same row. Adding the last row, which is unique but has the values from both columns from df2 exposes the mistake:

common = df1.merge(df2,on=["col1","col2"])
(~df1.col1.isin(common.col1))&(~df1.col2.isin(common.col2))
0    False
1    False
2    False
3     True
4     True
5    False
dtype: bool

This solution gets the same wrong result:

df1.isin(df2.to_dict("l")).all(1)

Answer rating: 231

One method would be to store the result of an inner merge form both dfs, then we can simply select the rows when one column"s values are not in this common:

In [119]:

common = df1.merge(df2,on=["col1","col2"])
print(common)
df1[(~df1.col1.isin(common.col1))&(~df1.col2.isin(common.col2))]
   col1  col2
0     1    10
1     2    11
2     3    12
Out[119]:
   col1  col2
3     4    13
4     5    14

EDIT

Another method as you"ve found is to use isin which will produce NaN rows which you can drop:

In [138]:

df1[~df1.isin(df2)].dropna()
Out[138]:
   col1  col2
3     4    13
4     5    14

However if df2 does not start rows in the same manner then this won"t work:

df2 = pd.DataFrame(data = {"col1" : [2, 3,4], "col2" : [11, 12,13]})

will produce the entire df:

In [140]:

df1[~df1.isin(df2)].dropna()
Out[140]:
   col1  col2
0     1    10
1     2    11
2     3    12
3     4    13
4     5    14

Answer rating: 97

Assuming that the indexes are consistent in the dataframes (not taking into account the actual col values):

df1[~df1.index.isin(df2.index)]




Tutorials