ML | Forecasting rainfall using linear regression



Precipitation forecasting is the application of science and technology to predict the amount of precipitation in a region. Accurate rainfall estimates matter for efficient water use, crop productivity, and advance planning of water structures.

In this article, we will use linear regression to predict rainfall: given a day's weather readings, the model estimates how many inches of precipitation we can expect.

The dataset is a publicly available weather dataset for Austin, Texas, hosted on Kaggle.

Data cleansing:
Data comes in all forms, most of them messy and unstructured, and it is rarely ready to use. Datasets big and small come with many problems: invalid fields, missing and additional values, and values in forms other than what we need. To bring the data into a workable, structured form, we need to "cleanse" it and prepare it for use. Common cleaning steps include parsing, converting to one-hot encoding, removing unnecessary data, etc.

In our case, the data has several days on which some readings were not recorded. In addition, the precipitation amount (in inches, stored in the PrecipitationSumInches column) was marked as T when there was only a trace of precipitation. Our algorithm requires numbers, so we cannot work with the letters and symbols that appear in the data, and we need to clean it up before feeding it to our model.
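The idea can be tried on a tiny table before touching the real file. The sketch below is only an illustration: the values are made up, and only the two markers, "T" for trace precipitation and "-" for a missing reading, mirror what the article describes. It replaces both markers with 0.0 and then converts every column to a numeric type, assuming that treating missing readings as zero is acceptable for the task.

```python
import pandas as pd

# Hypothetical toy data: the numbers are made up; only the markers
# "T" (trace) and "-" (not recorded) mirror the Austin dataset.
# dtype=object keeps the mixed text values exactly as written.
raw = pd.DataFrame({
    "TempAvgF": ["60", "-", "75"],
    "PrecipitationSumInches": ["0.46", "T", "0"],
}, dtype=object)

# Replace both text markers with 0.0, then convert
# every column from strings to numbers.
cleaned = raw.replace(["T", "-"], 0.0).apply(pd.to_numeric)

print(cleaned.dtypes)
print(cleaned)
```

After the conversion both columns hold numbers, which is what a regression model needs as input.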

Cleaning up the data in Python (the cleanup script is the first code listing below):

Once the data has been cleaned, it can be used as input to our linear regression model. Linear regression is a linear approach to modeling the relationship between a dependent variable and a set of independent explanatory variables. This is done by fitting the line that best matches the scatter plot of our data, that is, the line with the smallest error. Substituting the independent values into the line equation then yields a prediction of the value, i.e. how much.
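To make the "best-fitting line" concrete, here is a minimal sketch with a single explanatory variable. The five points are made up and lie close to the line y = 2x + 1. The fit uses NumPy's least-squares solver, which picks the slope and intercept that minimize the sum of squared errors. This is an illustration of the principle, not the code used later in the article.

```python
import numpy as np

# Made-up points that lie close to the line y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

# Design matrix with a column of ones so the fit includes an intercept.
A = np.column_stack([x, np.ones_like(x)])

# Least squares: slope and intercept that minimize the sum of squared errors.
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)

# Prediction: substitute a new x value into the line equation.
x_new = 5.0
y_pred = slope * x_new + intercept
print(f"slope={slope:.2f}, intercept={intercept:.2f}, prediction={y_pred:.2f}")
```

With several explanatory variables the same idea applies: the line becomes a plane (or hyperplane) with one coefficient per variable.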

We will use scikit-learn's linear regression model to train on our dataset. Once the model is trained, we can supply our own values for the various columns, such as temperature, dew point, pressure, etc., to predict the precipitation from these attributes.

# import libraries
import pandas as pd

# read the data into a pandas DataFrame
data = pd.read_csv("austin_weather.csv")

# drop the columns we do not need
data = data.drop(['Events', 'Date', 'SeaLevelPressureHighInches',
                  'SeaLevelPressureLowInches'], axis=1)

# some values are 'T', which denotes trace precipitation;
# replace every occurrence of 'T' with 0
# so that the data can be used in our model
data = data.replace('T', 0.0)

# the data also contains '-', which means that
# the value is not available;
# replace these values with 0 as well
data = data.replace('-', 0.0)

# save the cleaned data to a CSV file
# (by default, to_csv also writes the row index as an unnamed first column)
data.to_csv('austin_final.csv')

# import libraries
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

# read the cleaned data
data = pd.read_csv("austin_final.csv")

# features, or x values:
# these columns are used to train the model;
# the last column, i.e. the precipitation column,
# will serve as the label
X = data.drop(['PrecipitationSumInches'], axis=1)

# output, or label
Y = data['PrecipitationSumInches']

# convert it to a 2D array with a single column
Y = Y.values.reshape(-1, 1)

 
# pick one day in the dataset
# to highlight in the plots
day_index = 798
days = list(range(Y.size))

# initialize the linear regression model
clf = LinearRegression()

# train the model on our input data
clf.fit(X, Y)

# a sample input to test our model:
# a 2D array with one value for each column of X,
# in the same order as the columns
inp = np.array([[74], [60], [45], [67], [49], [43], [33], [45],
                [57], [29.68], [10], [7], [2], [0], [20], [4], [31]])
inp = inp.reshape(1, -1)

# print the output
print('The precipitation in inches for the input is:', clf.predict(inp))

 
# plot the precipitation level
# against the day number;
# one day, shown in red, is highlighted:
# it has precipitation of approx. 2 inches
print("the precipitation trend graph: ")
plt.scatter(days, Y, color='g')
plt.scatter(days[day_index], Y[day_index], color='r')
plt.title("Precipitation level")
plt.xlabel("Days")
plt.ylabel("Precipitation in inches")

plt.show()

# select a few features (x values) to plot
x_vis = X.filter(['TempAvgF', 'DewPointAvgF', 'HumidityAvgPercent',
                  'SeaLevelPressureAvgInches', 'VisibilityAvgMiles',
                  'WindAvgMPH'], axis=1)

# plot each selected feature against the day number
# to observe trends, with the same day highlighted in red
print("Precipitation vs selected attributes graph:")

for i in range(x_vis.columns.size):
    col = x_vis.columns.values[i]
    plt.subplot(3, 2, i + 1)
    plt.scatter(days, x_vis[col], color='g')
    plt.scatter(days[day_index], x_vis[col][day_index], color='r')
    plt.title(col)

plt.show()

Output:

The precipitation in inches for the input is: [[1.33868402]]
the precipitation trend graph:

Precipitation vs selected attributes graph:

One day (shown in red) with about 2 inches of precipitation is tracked across several parameters such as temperature, pressure, etc. The x-axis denotes days, and the y-axis denotes the magnitude of the attribute, such as temperature or pressure. The graphs suggest that precipitation can be high when both the temperature and the humidity are high.
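A scatter plot gives only a visual impression. One way to put a number on it, sketched here on made-up values rather than the Austin data, is the Pearson correlation between each attribute and the precipitation column. The column names follow the dataset, but the numbers in the toy table are hypothetical and chosen only to show the mechanics.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration;
# these values are not taken from the Austin dataset.
toy = pd.DataFrame({
    "TempAvgF": [55, 60, 70, 80, 85, 90],
    "HumidityAvgPercent": [40, 55, 60, 75, 80, 85],
    "PrecipitationSumInches": [0.0, 0.1, 0.0, 0.8, 1.2, 2.0],
})

# Pearson correlation of every attribute with the precipitation column;
# the label's correlation with itself is dropped.
corr = toy.corr()["PrecipitationSumInches"].drop("PrecipitationSumInches")
print(corr.sort_values(ascending=False))
```

Running the same two lines on the cleaned data frame from the article would show how strongly each real attribute moves together with precipitation.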