Dealing with Missing Data in Pandas



In Pandas, missing data is represented by two values:

  • None: a Python singleton object that is often used to mark missing data in Python code.
  • NaN (short for Not a Number): a special floating-point value recognized by all systems that use the IEEE standard floating-point representation.

Pandas treats None and NaN as essentially interchangeable indicators of missing or null values. To support this convention, the Pandas DataFrame provides several useful functions for detecting, removing and replacing null values: isnull(), notnull(), dropna(), fillna(), replace() and interpolate().
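
As a minimal sketch of this convention (the Series below is made up for illustration), both markers end up flagged as missing once they sit in a numeric column:

```python
import numpy as np
import pandas as pd

# Illustrative data: one real number, one None and one NaN
s = pd.Series([1.0, None, np.nan])

# In a numeric Series, pandas stores None as NaN, so the dtype stays float64
print(s.dtype)               # float64

# Both markers are reported as missing
flags = s.isnull().tolist()
print(flags)                 # [False, True, True]

# NaN never compares equal to itself, so == cannot be used to detect it
print(np.nan == np.nan)      # False
```

This is why the detection functions described in this article are used instead of a plain equality test.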

Some of the examples in this article load their data from a CSV file named employees.csv.

Check missing values using isnull() and notnull()

To check for missing values in a Pandas DataFrame, we use the isnull() and notnull() functions. Both report, element by element, whether a value is missing (NaN or None). They can also be applied to a Pandas Series to find its null values.
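
For the Series case, a short sketch with invented values might look like this:

```python
import numpy as np
import pandas as pd

# Illustrative Series with one missing entry
marks = pd.Series([100, np.nan, 95])

missing = marks.isnull()     # True where the value is missing
present = marks.notnull()    # True where the value is present

print(missing.tolist())      # [False, True, False]
print(present.tolist())      # [True, False, True]
```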

Check for missing values with isnull()

To check for null values in a Pandas DataFrame, we use isnull(). This function returns a DataFrame of Boolean values that are True wherever the original value is NaN.

Code #1:

# import pandas and numpy
import pandas as pd
import numpy as np

# dictionary of lists
scores = {"First Score": [100, 90, np.nan, 95],
          "Second Score": [30, 45, 56, np.nan],
          "Third Score": [np.nan, 40, 80, 98]}

# create a data frame from the dictionary
df = pd.DataFrame(scores)

# using the isnull() function
df.isnull()

Output:

Code #2:

# import pandas package
import pandas as pd

# create data frame from CSV file
data = pd.read_csv("employees.csv")

# Boolean series that is True where Gender is NaN
bool_series = pd.isnull(data["Gender"])

# filter the data: display only the rows where Gender is NaN
data[bool_series]

Output:
Only the rows where Gender is null are displayed.
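
The employees.csv file is not reproduced in this article, so the sketch below uses a small in-memory stand-in. Only the Gender column name comes from the examples above; the First Name column and every value are invented for illustration. It applies the same Boolean-series filter and then counts the missing values per column, a common follow-up to isnull().

```python
import io
import pandas as pd

# Hypothetical stand-in for employees.csv (all values are made up)
csv_text = """First Name,Gender
Asha,Female
Ben,
Carla,Female
Dev,
"""
data = pd.read_csv(io.StringIO(csv_text))

# Boolean Series that is True where Gender is missing
bool_series = pd.isnull(data["Gender"])

# Keep only the rows where Gender is missing
missing_gender = data[bool_series]
print(missing_gender["First Name"].tolist())   # ['Ben', 'Dev']

# Count the missing values in each column
missing_counts = data.isnull().sum()
print(int(missing_counts["Gender"]))           # 2
```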

Check for missing values using notnull()

To check for non-null values in a Pandas DataFrame, we use the notnull() function. It returns a DataFrame of Boolean values that are False wherever the original value is NaN.

Code #3:

# import pandas and numpy
import pandas as pd
import numpy as np

# dictionary of lists
scores = {"First Score": [100, 90, np.nan, 95],
          "Second Score": [30, 45, 56, np.nan],
          "Third Score": [np.nan, 40, 80, 98]}

# create a data frame from the dictionary
df = pd.DataFrame(scores)

# using the notnull() function
df.notnull()

Output:

Code #4:

# import pandas package
import pandas as pd

# create data frame from CSV file
data = pd.read_csv("employees.csv")

# Boolean series that is True where Gender is not NaN
bool_series = pd.notnull(data["Gender"])

# filter the data: display only the rows where Gender is not NaN
data[bool_series]

Output:
Only the rows where Gender is not null are displayed.
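
As a small check of how the two functions relate, the sketch below (using two of the score columns from the earlier examples) confirms that notnull() is the element-wise opposite of isnull() and uses it to keep rows with a value present:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"First Score": [100, 90, np.nan, 95],
                   "Second Score": [30, 45, 56, np.nan]})

# notnull() is the element-wise opposite of isnull()
same = df.notnull().equals(~df.isnull())
print(same)                      # True

# Keep only the rows whose First Score is present
kept = df[df["First Score"].notnull()]
print(kept.index.tolist())       # [0, 1, 3]
```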

Filling in missing values with fillna(), replace() and interpolate()

To fill in null values in a dataset, we use the fillna(), replace() and interpolate() functions, which replace NaN values with a value of our choosing. The interpolate() function also fills NA values in a DataFrame, but instead of a hard-coded value it uses one of several interpolation techniques to estimate the missing values.
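
To see the difference at a glance, here is a minimal sketch on a three-element Series (values chosen for illustration) that applies each of the three functions to the same gap:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

filled = s.fillna(0)               # constant fill
replaced = s.replace(np.nan, 0)    # same result via replace()
estimated = s.interpolate()        # linear estimate from the neighbours

print(filled.tolist())       # [1.0, 0.0, 3.0]
print(replaced.tolist())     # [1.0, 0.0, 3.0]
print(estimated.tolist())    # [1.0, 2.0, 3.0]
```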

Code #1: filling null values with a single value

# import pandas and numpy
import pandas as pd
import numpy as np

# dictionary of lists
scores = {"First Score": [100, 90, np.nan, 95],
          "Second Score": [30, 45, 56, np.nan],
          "Third Score": [np.nan, 40, 80, 98]}

# create data frame from dictionary
df = pd.DataFrame(scores)

# fill the missing values with 0 using fillna()
df.fillna(0)

Output:
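
A single fill value is not always appropriate for every column. As a hedged variation on the example above, fillna() also accepts a dictionary that maps column names to fill values; the values 0 and -1 here are arbitrary choices for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"First Score": [100, 90, np.nan, 95],
                   "Second Score": [30, 45, 56, np.nan],
                   "Third Score": [np.nan, 40, 80, 98]})

# Different fill value for each listed column; unlisted columns are untouched
filled = df.fillna({"First Score": 0, "Second Score": -1})

print(filled["First Score"].tolist())              # [100.0, 90.0, 0.0, 95.0]
print(filled["Second Score"].tolist())             # [30.0, 45.0, 56.0, -1.0]
print(int(filled["Third Score"].isnull().sum()))   # 1
```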

Code #2: filling null values with the previous value

# import pandas and numpy
import pandas as pd
import numpy as np

# dictionary of lists
scores = {"First Score": [100, 90, np.nan, 95],
          "Second Score": [30, 45, 56, np.nan],
          "Third Score": [np.nan, 40, 80, 98]}

# create data frame from dictionary
df = pd.DataFrame(scores)

# fill each missing value with the previous value in its column;
# ffill() is equivalent to the older fillna(method="pad") spelling,
# which recent pandas releases deprecate
df.ffill()

Output:

Code #3: filling null values with the next value

# import pandas and numpy
import pandas as pd
import numpy as np

# dictionary of lists
scores = {"First Score": [100, 90, np.nan, 95],
          "Second Score": [30, 45, 56, np.nan],
          "Third Score": [np.nan, 40, 80, 98]}

# create data frame from dictionary
df = pd.DataFrame(scores)

# fill each missing value with the next value in its column;
# bfill() is equivalent to the older fillna(method="bfill") spelling,
# which recent pandas releases deprecate
df.bfill()

Output:
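
The difference between filling from the previous value and from the next value can be checked with a short sketch. The Series is invented so that it has a gap at the start, in the middle and at the end:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 40, np.nan, 98, np.nan])

forward = s.ffill()    # copy the previous valid value downwards
backward = s.bfill()   # copy the next valid value upwards

# Forward fill cannot fill the leading gap: nothing comes before it
print(forward.isnull().tolist())     # [True, False, False, False, False]
print(forward.iloc[1:].tolist())     # [40.0, 40.0, 98.0, 98.0]

# Backward fill cannot fill the trailing gap: nothing comes after it
print(backward.isnull().tolist())    # [False, False, False, False, True]
print(backward.iloc[:4].tolist())    # [40.0, 40.0, 98.0, 98.0]
```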

Code #4: filling null values in a CSV file

# import pandas package
import pandas as pd

# create data frame from CSV file
data = pd.read_csv("employees.csv")

# display rows 10 to 24 of the data frame
data[10:25]


Output:

Now we are going to fill all the null values in the Gender column with "No Gender".

# import pandas package
import pandas as pd

# create data frame from CSV file
data = pd.read_csv("employees.csv")

# fill the null values in the Gender column using fillna();
# assigning the result back avoids a chained in-place call
data["Gender"] = data["Gender"].fillna("No Gender")

data

Code #5: filling null values with the replace() method

# import pandas package
import pandas as pd

# create data frame from CSV file
data = pd.read_csv("employees.csv")

# display rows 10 to 24 of the data frame
data[10:25]

Output:

Now we are going to replace all the NaN values in the data frame with a value of -99.

# import pandas and numpy packages
import pandas as pd
import numpy as np

# create data frame from CSV file
data = pd.read_csv("employees.csv")

# replace NaN in the data frame with -99
data.replace(to_replace=np.nan, value=-99)

Output:
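
Because the CSV file is not reproduced here, the same replace() call can be tried on the small score table from the earlier examples. This is only a sketch on substitute data, and it also shows that replace() returns a new DataFrame rather than modifying the original:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"First Score": [100, 90, np.nan, 95],
                   "Second Score": [30, 45, 56, np.nan],
                   "Third Score": [np.nan, 40, 80, 98]})

# Replace every NaN with the sentinel value -99
replaced = df.replace(to_replace=np.nan, value=-99)

print(replaced["First Score"].tolist())      # [100.0, 90.0, -99.0, 95.0]
print(int(replaced.isnull().sum().sum()))    # 0

# The original frame still contains its three missing values
print(int(df.isnull().sum().sum()))          # 3
```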

Code #6: using the interpolate() function to fill in missing values linearly

# import pandas package
import pandas as pd

# create the data frame
df = pd.DataFrame({"A": [12, 4, 5, None, 1],
                   "B": [None, 2, 54, 3, None],
                   "C": [20, 16, None, 3, 8],
                   "D": [14, 3, None, None, 6]})

# display the data frame
df


Let's interpolate the missing values using the linear method. Note that the linear method ignores the index and treats the values as equally spaced.

# interpolate the missing values, filling in the forward direction
df.interpolate(method="linear", limit_direction="forward")

Output:

As we can see in the output, the missing value in the first row (column B) is not filled, because the fill direction is forward and there is no previous value to interpolate from.
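
The effect of the fill direction can be checked with a sketch on columns A and B of the frame above. The limit_direction="both" call is an addition for comparison and is not part of the original example:

```python
import pandas as pd

df = pd.DataFrame({"A": [12, 4, 5, None, 1],
                   "B": [None, 2, 54, 3, None]})

forward = df.interpolate(method="linear", limit_direction="forward")
both = df.interpolate(method="linear", limit_direction="both")

# The gap in A lies between 5 and 1, so it becomes 3.0
print(forward["A"].tolist())            # [12.0, 4.0, 5.0, 3.0, 1.0]

# Forward only: the leading gap in B stays missing
print(forward["B"].isnull().tolist())   # [True, False, False, False, False]

# Both directions: the leading gap takes the first valid value
print(both["B"].tolist())               # [2.0, 2.0, 54.0, 3.0, 3.0]
```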

Remove missing values with dropna()

To remove null values from a DataFrame, we use the dropna() function, which drops the rows or columns that contain null values.
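
The row and column behaviour can be sketched on a four-column score table. The axis and how arguments shown here are standard dropna() parameters, chosen to illustrate the options rather than to reproduce the article's own listing:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"First Score": [100, 90, np.nan, 95],
                   "Second Score": [30, np.nan, 45, 56],
                   "Third Score": [52, 40, 80, 98],
                   "Fourth Score": [np.nan, np.nan, np.nan, 65]})

# Default: drop every row that has at least one missing value
rows_dropped = df.dropna()
print(rows_dropped.index.tolist())       # [3]

# axis=1: drop every column that has at least one missing value
cols_dropped = df.dropna(axis=1)
print(cols_dropped.columns.tolist())     # ['Third Score']

# how="all": drop a row only if all of its values are missing
all_missing = df.dropna(how="all")
print(len(all_missing))                  # 4
```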

Code #1: dropping rows with at least one null value

# import pandas and numpy
import pandas as pd
import numpy as np

# dictionary of lists
scores = {"First Score": [100, 90, np.nan, 95],
          "Second Score": [30, np.nan, 45, 56],
          "Third Score": [52, 40, 80, 98],
          "Fourth Score": [np.nan, np.nan, np.nan, 65]}

# create data frame from dictionary
df = pd.DataFrame(scores)

df


Now we drop the rows that contain at least one NaN (null) value:



# import pandas as pd

import pandas as pd

 
# import numpy as np