Python | Pandas DataFrame.duplicated()

An important part of data analysis is finding duplicate values and removing them. The Pandas duplicated() method helps to identify duplicate values. It returns a Boolean Series that is True only for the elements it considers duplicates.

Syntax:

DataFrame.duplicated(subset=None, keep='first')

Parameters:

subset: Takes a column label or a list of column labels. Its default value is None, which compares entire rows. When columns are passed, only those columns are considered when identifying duplicates.
keep: Controls which occurrences are marked as duplicates. It accepts one of three values, and the default is 'first'.
-> If 'first', the first occurrence is considered unique and the rest of the same values are marked as duplicates.
-> If 'last', the last occurrence is considered unique and the rest of the same values are marked as duplicates.
-> If False, all occurrences of the same value are marked as duplicates.
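
The effect of each option is easiest to see on a few rows. The following is a minimal sketch, assuming a small made-up DataFrame (the names and the "Team" column are hypothetical and are not taken from employees.csv):

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame({
    "First Name": ["Ann", "Bob", "Ann", "Cid", "Bob", "Ann"],
    "Team": ["Sales", "Sales", "Sales", "Legal", "Legal", "Sales"],
})

# subset limits the comparison to the "First Name" column
first_flags = df.duplicated(subset="First Name", keep="first")
last_flags = df.duplicated(subset="First Name", keep="last")
all_flags = df.duplicated(subset="First Name", keep=False)

# subset=None (the default) compares whole rows instead
row_flags = df.duplicated()

print(first_flags.tolist())  # [False, False, True, False, True, True]
print(last_flags.tolist())   # [True, True, True, False, False, False]
print(all_flags.tolist())    # [True, True, True, False, True, True]
print(row_flags.tolist())    # [False, False, True, False, False, True]
```

Note that "Bob" at index 4 is a duplicate by name but not by whole row, because his "Team" value differs from the earlier "Bob" row.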

The examples below read their data from a CSV file named employees.csv.

Example #1: Returning a Boolean Series

In the following example, a Boolean Series is returned based on duplicate values in the First Name column.

# import the pandas package
import pandas as pd

# create a DataFrame from the CSV file
data = pd.read_csv("employees.csv")

# sort by first name so that repeated names sit next to each other
data.sort_values("First Name", inplace=True)

# create a Boolean Series: True where the name has already appeared
bool_series = data["First Name"].duplicated()

# display the first rows of the sorted data
data.head()

# display only the rows flagged as duplicates
data[bool_series]

Output:
Because the keep parameter was left at its default value, 'first', the first occurrence of each name is considered unique and every later occurrence of that name is flagged as a duplicate, so only those later rows are displayed.
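
The same steps can be tried without the CSV file. This is a rough sketch on a small in-memory DataFrame; the names and salaries are invented for illustration. It also checks that calling duplicated() on the column gives the same flags as calling DataFrame.duplicated() with subset set to that column.

```python
import pandas as pd

# Hypothetical stand-in for employees.csv
data = pd.DataFrame({
    "First Name": ["Maria", "Alan", "Maria", "Alan", "Zoe"],
    "Salary": [100, 200, 300, 400, 500],
})

# Sort by name, then flag every occurrence after the first
data.sort_values("First Name", inplace=True)
bool_series = data["First Name"].duplicated()

# Rows flagged as duplicates: the second "Alan" and the second "Maria"
repeats = data[bool_series]
print(repeats["First Name"].tolist())  # ['Alan', 'Maria']

# Series.duplicated() and DataFrame.duplicated(subset=...) agree here
frame_flags = data.duplicated(subset="First Name")
print(bool_series.tolist() == frame_flags.tolist())  # True
```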

Example #2: Removing duplicates
In this example, the keep parameter is set to False, so every row whose First Name occurs more than once is flagged as a duplicate. Inverting the result keeps only the names that occur exactly once and removes all the others from the data.

# import the pandas package
import pandas as pd

# create a DataFrame from the CSV file
data = pd.read_csv("employees.csv")

# sort by first name
data.sort_values("First Name", inplace=True)

# create a Boolean Series: True for every row whose name occurs more than once
bool_series = data["First Name"].duplicated(keep=False)

# display the Boolean Series
bool_series

# invert the Series with ~ to keep only the names that occur exactly once
data = data[~bool_series]

# display the result
data.info()
data

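Output:
Since the duplicated() method returns True for duplicates, and keep=False flags every occurrence of a repeated name, the Series is inverted with the NOT operator (~) so that only the rows whose First Name occurs exactly once remain in the DataFrame.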
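
As a cross-check, the inverted mask can be compared with pandas' drop_duplicates() method, which is a different call from the one used in the example above. The sketch below assumes a small made-up DataFrame rather than employees.csv. It also shows that inverting a keep='first' mask keeps one row per name instead of dropping repeated names entirely.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
data = pd.DataFrame({
    "First Name": ["Maria", "Alan", "Maria", "Zoe", "Omar"],
    "Team": ["Sales", "Legal", "Legal", "Sales", "Legal"],
})

# keep=False flags both "Maria" rows; ~ keeps everything else
bool_series = data["First Name"].duplicated(keep=False)
unique_only = data[~bool_series]
print(unique_only["First Name"].tolist())  # ['Alan', 'Zoe', 'Omar']

# drop_duplicates with the same subset and keep gives the same rows
dropped = data.drop_duplicates(subset="First Name", keep=False)
print(unique_only.equals(dropped))  # True

# With keep='first', the first "Maria" survives
one_per_name = data[~data["First Name"].duplicated(keep="first")]
print(one_per_name["First Name"].tolist())  # ['Maria', 'Alan', 'Zoe', 'Omar']
```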