
Dealing with Missing Data in Pandas

In Pandas, missing data is represented by two values:

  • None: a Python singleton object that is often used for missing data in Python code.
  • NaN (short for Not a Number): a special floating-point value recognized by all systems that use the standard IEEE floating-point representation.

Pandas treats None and NaN as essentially interchangeable for indicating missing or null values. To support this convention, Pandas provides several functions for detecting, removing, and replacing missing values in a DataFrame.
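As a minimal sketch of this interchangeability (the four-element series is made up for illustration), both markers end up as NaN in a numeric series and are detected by the same call:

```python
import numpy as np
import pandas as pd

# A series that mixes ordinary numbers with both missing-value markers
s = pd.Series([1.0, None, np.nan, 4.0])

# None is stored as NaN, so the series keeps a floating-point dtype
print(s.dtype)

# isnull() flags both markers in the same way
mask = s.isnull()
print(mask.tolist())
```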

Several examples in this article load their data from a CSV file named employees.csv.

Checking for missing values using isnull() and notnull()

To check for missing values in a Pandas DataFrame, we use the isnull() and notnull() functions. Both report, element by element, whether a value is missing (None or NaN). They can also be called on a Pandas Series.

Checking for missing values with isnull()

isnull() returns a DataFrame of the same shape that contains Boolean values: True where a value is missing and False otherwise.

Code # 1:

# import pandas and numpy
import pandas as pd
import numpy as np

# dictionary of lists; np.nan marks the missing scores
# (named "scores" so that the built-in dict is not shadowed)
scores = {'First Score': [100, 90, np.nan, 95],
          'Second Score': [30, 45, 56, np.nan],
          'Third Score': [np.nan, 40, 80, 98]}

# create a data frame from the dictionary
df = pd.DataFrame(scores)

# True marks the missing values
df.isnull()

Output:
A data frame of the same shape as df, holding True at the three NaN positions and False everywhere else.

Code # 2:

# import the pandas package
import pandas as pd

# create a data frame from the CSV file
data = pd.read_csv("employees.csv")

# Boolean series that is True where Gender is missing
bool_series = pd.isnull(data["Gender"])

# keep only the rows where Gender is missing
data[bool_series]

Output:
Only the rows whose Gender value is missing (NaN) are displayed.
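The same filtering can be tried without the CSV file. The sketch below uses a small made-up table as a stand-in for employees.csv (the names and values are hypothetical) and also counts the missing values per column, which is a common first check:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for employees.csv
data = pd.DataFrame({
    "First Name": ["Ann", "Bob", "Cara", "Dev"],
    "Gender": ["Female", np.nan, None, "Male"],
})

# Rows in which Gender is missing (both NaN and None count)
missing_gender = data[data["Gender"].isnull()]
print(missing_gender)

# Number of missing values in each column
missing_counts = data.isnull().sum()
print(missing_counts)
```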

Checking for missing values using notnull()

notnull() is the opposite of isnull(): it returns a DataFrame of Boolean values that are False where a value is missing and True otherwise.

Code # 3:

# import pandas and numpy
import pandas as pd
import numpy as np

# dictionary of lists; np.nan marks the missing scores
scores = {'First Score': [100, 90, np.nan, 95],
          'Second Score': [30, 45, 56, np.nan],
          'Third Score': [np.nan, 40, 80, 98]}

# create a data frame from the dictionary
df = pd.DataFrame(scores)

# False marks the missing values
df.notnull()

Output:
A data frame of the same shape as df, holding False at the three NaN positions and True everywhere else.

Code # 4:

# import the pandas package
import pandas as pd

# create a data frame from the CSV file
data = pd.read_csv("employees.csv")

# Boolean series that is True where Gender is present
bool_series = pd.notnull(data["Gender"])

# keep only the rows where Gender is not missing
data[bool_series]

Output:
Only the rows whose Gender value is present (not NaN) are displayed.
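Because notnull() is the element-wise opposite of isnull(), the two masks can be checked against each other. The sketch below does that on a two-column version of the score table and then uses the notnull() mask to keep only the complete rows:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"First Score": [100, 90, np.nan, 95],
                   "Second Score": [30, 45, 56, np.nan]})

present = df.notnull()
missing = df.isnull()

# notnull() is the element-wise negation of isnull()
masks_are_opposite = present.equals(~missing)
print(masks_are_opposite)

# Rows in which every value is present
complete_rows = df[present.all(axis=1)]
print(complete_rows)
```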

Filling in missing values with fillna(), replace() and interpolate()

To fill in missing values in a dataset, we use fillna(), replace() and interpolate(). fillna() and replace() substitute a value that you specify for NaN, while interpolate() estimates each missing value from its neighbours using an interpolation technique instead of a hard-coded value.
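As a rough side-by-side sketch of the difference (the three-element series and the fill values 0 and -1 are made up for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# fillna(): put a fixed value into the gap
with_fillna = s.fillna(0)

# replace(): swap NaN for another value, as with any other value
with_replace = s.replace(np.nan, -1)

# interpolate(): estimate the gap from its neighbours (linear by default)
with_interpolate = s.interpolate()

print(with_fillna.tolist(), with_replace.tolist(), with_interpolate.tolist())
```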

Code # 1: filling missing values with a single value

# import pandas and numpy
import pandas as pd
import numpy as np

# dictionary of lists; np.nan marks the missing scores
scores = {'First Score': [100, 90, np.nan, 95],
          'Second Score': [30, 45, 56, np.nan],
          'Third Score': [np.nan, 40, 80, 98]}

# create a data frame from the dictionary
df = pd.DataFrame(scores)

# fill every missing value with 0
df.fillna(0)

Output:
The same data frame with each of the three NaN values replaced by 0.

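Code # 2: filling missing values with the previous value

# import pandas and numpy
import pandas as pd
import numpy as np

# dictionary of lists; np.nan marks the missing scores
scores = {'First Score': [100, 90, np.nan, 95],
          'Second Score': [30, 45, 56, np.nan],
          'Third Score': [np.nan, 40, 80, 98]}

# create a data frame from the dictionary
df = pd.DataFrame(scores)

# fill each missing value with the previous value in its column;
# ffill() is the current spelling of fillna(method='pad'),
# which recent pandas versions deprecate
df.ffill()

Output:
The NaN in First Score becomes 90 and the NaN in Second Score becomes 56. The NaN in the first row of Third Score stays, because no earlier value exists to copy.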

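Code # 3: filling missing values with the next value

# import pandas and numpy
import pandas as pd
import numpy as np

# dictionary of lists; np.nan marks the missing scores
scores = {'First Score': [100, 90, np.nan, 95],
          'Second Score': [30, 45, 56, np.nan],
          'Third Score': [np.nan, 40, 80, 98]}

# create a data frame from the dictionary
df = pd.DataFrame(scores)

# fill each missing value with the next value in its column;
# bfill() is the current spelling of fillna(method='bfill'),
# which recent pandas versions deprecate
df.bfill()

Output:
The NaN in First Score becomes 95 and the NaN in Third Score becomes 40. The NaN in the last row of Second Score stays, because no later value exists to copy.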
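Each fill direction can leave a gap at one edge of a column: a forward fill cannot fill a leading NaN, and a backward fill cannot fill a trailing one. One possible way to close both, sketched here with the ffill() and bfill() methods on the same score table, is to chain the two calls. Whether copying values across rows like this is appropriate depends on the data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"First Score": [100, 90, np.nan, 95],
                   "Second Score": [30, 45, 56, np.nan],
                   "Third Score": [np.nan, 40, 80, 98]})

# Forward fill alone leaves the leading NaN in "Third Score"
forward = df.ffill()
print(forward)

# Chaining a backward fill closes the remaining leading gap
filled = df.ffill().bfill()
print(filled)
```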

Code # 4: filling missing values in a CSV file

# import the pandas package
import pandas as pd

# create a data frame from the CSV file
data = pd.read_csv("employees.csv")

# display rows 10 to 24 of the data frame
data[10:25]

Now we fill all missing values in the Gender column with "No Gender".

# import the pandas package
import pandas as pd

# create a data frame from the CSV file
data = pd.read_csv("employees.csv")

# fill the missing Gender values with fillna() and assign the
# result back, rather than modifying the selected column in place
data["Gender"] = data["Gender"].fillna("No Gender")

data

Output:
The Gender column no longer contains NaN; the former gaps now read "No Gender".

Code # 5: filling missing values with the replace() method

# import the pandas package
import pandas as pd

# create a data frame from the CSV file
data = pd.read_csv("employees.csv")

# display rows 10 to 24 of the data frame
data[10:25]

Now we replace every NaN value in the data frame with -99.

# import pandas and numpy (np.nan is used below)
import pandas as pd
import numpy as np

# create a data frame from the CSV file
data = pd.read_csv("employees.csv")

# replace NaN in the data frame with -99
data.replace(to_replace=np.nan, value=-99)

Output:
Every NaN value in the data frame now reads -99.
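Both steps can be tried without the CSV file. The sketch below uses a made-up two-column table as a stand-in for employees.csv (the column names and values are hypothetical): the text column is filled with a label, and the remaining numeric gap is replaced with a sentinel value.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for employees.csv
data = pd.DataFrame({
    "Gender": ["Female", np.nan, "Male", np.nan],
    "Salary": [61000.0, np.nan, 58000.0, 72000.0],
})

# Fill one column with a label, assigning the result back
data["Gender"] = data["Gender"].fillna("No Gender")

# Replace the remaining NaN values (numeric here) with a sentinel
filled = data.replace(to_replace=np.nan, value=-99)
print(filled)
```

A sentinel such as -99 is easy to spot, but it takes part in later arithmetic (sums, means) like any other number, so it is worth choosing with care.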

Code # 6: using the interpolate() function to fill in missing values linearly

# import the pandas package
import pandas as pd

# create a data frame; None marks the missing values
df = pd.DataFrame({"A": [12, 4, 5, None, 1],
                   "B": [None, 2, 54, 3, None],
                   "C": [20, 16, None, 3, 8],
                   "D": [14, 3, None, None, 6]})

# display the data frame
df

Let's interpolate the missing values using the linear method. Note that the linear method ignores the index and treats the values as equally spaced.

# interpolate the missing values, filling in the forward direction
df.interpolate(method='linear', limit_direction='forward')

Output:
As the output shows, the missing value in the first row of column B cannot be filled: the fill direction is forward, and there is no earlier value to interpolate from.
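To make the arithmetic concrete, here is a sketch that reruns the interpolation on the same table (written with np.nan instead of None) and reads off individual results. Each gap is filled along the straight line between its nearest known neighbours, so the gap between 16 and 3 in column C becomes 9.5, and the two gaps between 3 and 6 in column D become 4 and 5.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [12, 4, 5, np.nan, 1],
                   "B": [np.nan, 2, 54, 3, np.nan],
                   "C": [20, 16, np.nan, 3, 8],
                   "D": [14, 3, np.nan, np.nan, 6]})

out = df.interpolate(method="linear", limit_direction="forward")
print(out)

# The leading gap in B has no earlier value and stays NaN;
# the trailing gap in B is filled with the last known value
leading_b = out.loc[0, "B"]
trailing_b = out.loc[4, "B"]
print(leading_b, trailing_b)
```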

Removing missing values with dropna()

To remove null values from a data frame, we use dropna(), which drops the rows or columns that contain null values.

Code # 1: dropping rows with at least one null value

# import pandas and numpy
import pandas as pd
import numpy as np

# dictionary of lists; np.nan marks the missing scores
scores = {'First Score': [100, 90, np.nan, 95],
          'Second Score': [30, np.nan, 45, 56],
          'Third Score': [52, 40, 80, 98],
          'Fourth Score': [np.nan, np.nan, np.nan, 65]}

# create a data frame from the dictionary
df = pd.DataFrame(scores)

df

Now we drop the rows that contain at least one NaN (null) value.


# import pandas and numpy
import pandas as pd
import numpy as np

# dictionary of lists (the same data as above)
scores = {'First Score': [100, 90, np.nan, 95],
          'Second Score': [30, np.nan, 45, 56],
          'Third Score': [52, 40, 80, 98],
          'Fourth Score': [np.nan, np.nan, np.nan, 65]}

# create a data frame from the dictionary
df = pd.DataFrame(scores)

# drop every row that contains at least one NaN value
df.dropna()
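dropna() also accepts parameters that change what gets dropped. The sketch below applies a few of them to the four-column score table from this section: axis selects rows or columns, how chooses between "any" and "all" missing, and thresh sets a minimum number of present values. The variable names are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"First Score": [100, 90, np.nan, 95],
                   "Second Score": [30, np.nan, 45, 56],
                   "Third Score": [52, 40, 80, 98],
                   "Fourth Score": [np.nan, np.nan, np.nan, 65]})

# Default: drop rows that contain at least one NaN
rows_any = df.dropna()

# Drop columns (axis=1) that contain at least one NaN
cols_any = df.dropna(axis=1)

# Drop only rows in which every value is NaN (none here)
rows_all = df.dropna(how="all")

# Keep rows that have at least 3 non-missing values
rows_thresh = df.dropna(thresh=3)

print(rows_any.index.tolist(), cols_any.columns.tolist())
print(len(rows_all), rows_thresh.index.tolist())
```

None of these calls modifies df; each returns a new data frame.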
