Important features of scikit-learn:
- Simple and powerful data mining tools. It includes a variety of classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means, etc.
- Available for everyone and reusable in a variety of contexts.
- Built on top of NumPy, SciPy and matplotlib.
- Open source and commercially usable under the BSD license.
In this article, we will see how we can easily build a machine learning model using scikit-learn.
scikit-learn requires NumPy and SciPy as its dependencies.
Before installing scikit-learn, make sure you have NumPy and SciPy installed. Once you have a working installation of both, the easiest way to install scikit-learn is with pip:
pip install -U scikit-learn
Let's start with the modeling process now.
Step 1: Load dataset
A dataset is simply a collection of data. It usually has two main components:
- Features: (also known as predictors, inputs, or attributes) these are simply the variables of our data. There can be more than one, so they are represented by a feature matrix ("X" is the common notation for the feature matrix). The list of all the feature names is called feature names.
- Response: (also known as the target, label, or output) this is the output variable that depends on the feature variables. We usually have a single response column, represented by a response vector ("y" is the common notation for the response vector). All the possible values taken by the response vector are called target names.
Loading a sample dataset: scikit-learn comes with a few sample datasets, such as iris and digits for classification and the Boston house prices dataset for regression (the Boston dataset was removed in scikit-learn 1.2).
Below is an example of how a sample dataset can be loaded:
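A minimal sketch of loading the bundled iris dataset, assuming a recent scikit-learn release, might look like this. The variable names are illustrative, not taken from the article.

```python
from sklearn.datasets import load_iris

# Load the iris dataset that ships with scikit-learn (no download needed).
iris = load_iris()

# Feature matrix X and response vector y.
X = iris.data
y = iris.target

# Names of the feature columns and of the target classes.
feature_names = iris.feature_names
target_names = iris.target_names

print("Feature names:", feature_names)
print("Target names:", target_names)
print("Type of X is:", type(X))
print("First 5 rows of X:\n", X[:5])
```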
Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Target names: ['setosa' 'versicolor' 'virginica']
Type of X is: <class 'numpy.ndarray'>
First 5 rows of X:
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
Loading an external dataset: Now consider the case where we want to load an external dataset. For this purpose, we can use the pandas library to easily load and manipulate the dataset.
To install pandas, use the following pip command:
pip install pandas
In pandas, the important data types are:
Series: A one-dimensional labeled array capable of holding data of any type.
DataFrame: A two-dimensional labeled data structure with columns of potentially different types. You can think of it as a spreadsheet, a SQL table, or a dict of Series objects. It is generally the most commonly used pandas object.
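To make the two types concrete, here is a small sketch. The column names and values are made up for illustration only.

```python
import pandas as pd

# A Series: one-dimensional, labeled values (hypothetical data).
humidity = pd.Series(["high", "normal", "high"], name="Humidity")

# A DataFrame: a table whose columns can have different types.
df = pd.DataFrame({
    "Outlook": ["overcast", "rainy", "overcast"],
    "Windy": [False, True, True],
})

# Selecting a single column of a DataFrame gives back a Series.
windy = df["Windy"]
print(type(windy).__name__)
print(df.dtypes)
```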
Note: The CSV file used in the example below is named weather.csv.
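A sketch of loading such a file with pandas follows. Since weather.csv itself is not included here, the snippet reads an in-memory stand-in containing only the five rows visible in the output below, so its shape is (5, 5) rather than (14, 5). With the real file you would pass its path to pd.read_csv instead.

```python
import io
import pandas as pd

# Stand-in for weather.csv: only the five rows shown in the article's output.
csv_text = """Outlook,Temperature,Humidity,Windy,Play
overcast,hot,high,False,yes
overcast,cool,normal,True,yes
overcast,mild,high,True,yes
overcast,hot,normal,False,yes
rainy,mild,high,False,yes
"""

# With the real file this would be: data = pd.read_csv("weather.csv")
data = pd.read_csv(io.StringIO(csv_text))

# Every column except the last one is a feature; the last column is the response.
feature_cols = data.columns[:-1]
X = data[feature_cols]
y = data["Play"]

print("Shape:", data.shape)
print("Feature matrix:\n", X.head())
print("Response vector:\n", y.head())
```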
Shape: (14, 5)
Features: Index(['Outlook', 'Temperature', 'Humidity', 'Windy', 'Play'], dtype='object')
Feature matrix:
    Outlook Temperature Humidity  Windy
0  overcast         hot     high  False
1  overcast        cool   normal   True
2  overcast        mild     high   True
3  overcast         hot   normal  False
4     rainy        mild     high  False
Response vector:
0    yes
1    yes
2    yes
3    yes
4    yes
Name: Play, dtype: object
Step 2: Splitting the dataset
One important aspect of every machine learning model is determining its accuracy. One way to do this is to train a model on a given dataset, predict the response values for that same dataset using the model, and compare the predictions with the known responses.
But this method has several flaws:
- The goal is to estimate the likely performance of a model on out-of-sample data, which this method does not measure.
- Maximizing training accuracy rewards overly complex models that will not necessarily generalize.
- Unnecessarily complex models may overfit the training data.
A better option is to split our data into two parts: the first for training our machine learning model, and the second for testing it. To summarize:
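The problem with evaluating on the training data can be sketched as follows. This assumes the iris dataset and a k-nearest-neighbors classifier with k=1 (a choice made here for illustration). Such a model essentially memorizes its training points, so its training accuracy says little about new data.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

X, y = load_iris(return_X_y=True)

# Train on the full dataset...
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X, y)

# ...and then "test" on exactly the same rows.
y_pred = knn.predict(X)
training_accuracy = metrics.accuracy_score(y, y_pred)

# Each point's nearest neighbor is itself, so this score is optimistic.
print("Training accuracy:", training_accuracy)
```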
- Divide the dataset into two parts: the training set and the test set.
- Train the model on the training set .
- Test the model on a test set and see how well our model performed.
Advantages of the train/test split:
- The model can be trained and tested on different data.
- Response values are known for the test dataset, so predictions can be evaluated.
- Testing accuracy is a better estimate of out-of-sample performance than training accuracy.
Consider the example below:
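A minimal sketch of such a split, assuming the iris dataset from Step 1 and random_state=1 (an assumed seed), might look like this:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 40% of the rows for testing; the seed value is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1
)

# Shapes of the resulting feature matrices and response vectors.
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```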
(90, 4) (60, 4)
(90,) (60,)
The train_test_split function takes several arguments, which are described below:
- X, y : These are the feature matrix and the response vector that need to be split.
- test_size : This is the ratio of test data to given data. For example, setting test_size = 0.4 for 150 X rows results in test data of 150 x 0.4 = 60 rows.
- random_state : If you use random_state = some_number, then you can ensure that your split is always the same. This is useful if you want reproducible results, for example when testing for consistency in the documentation (so everyone can see the same numbers).
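The effect of random_state can be checked with a short sketch. The seed value 42 is arbitrary and chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Two independent calls with the same seed select the same rows.
_, X_test_a, _, y_test_a = train_test_split(X, y, test_size=0.4, random_state=42)
_, X_test_b, _, y_test_b = train_test_split(X, y, test_size=0.4, random_state=42)

same_split = np.array_equal(X_test_a, X_test_b) and np.array_equal(y_test_a, y_test_b)
print("Test rows:", len(X_test_a))
print("Identical splits:", same_split)
```

Without a fixed random_state, each call shuffles the rows differently, so accuracy figures can change from run to run.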
Step 3: Train the Model
Now it's time to train a prediction model using our dataset. Scikit-learn provides a wide range of machine learning algorithms that share a consistent interface for fitting, predicting, measuring accuracy, etc.
The example below uses the KNN (K nearest neighbors) classifier.
Note : we will not go into the details of how the algorithm works, since we are only interested in understanding its implementation.
Now consider the example below:
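A sketch of the full workflow is shown here. It combines the calls discussed in this step and assumes the iris dataset with the same split parameters as in Step 2. The random_state value is an assumption, so the exact accuracy may differ from the figure shown in the output.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

# Load the data and split it into training and test sets.
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1
)

# Train a k-nearest-neighbors classifier on the training set.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Predict on the held-out test set and compare with the true labels.
y_pred = knn.predict(X_test)
accuracy = metrics.accuracy_score(y_test, y_pred)
print("kNN model accuracy:", accuracy)

# Predict on new, out-of-sample measurements and map class ids to names.
sample = [[3, 5, 4, 2], [2, 3, 5, 4]]
preds = knn.predict(sample)
pred_species = [iris.target_names[p] for p in preds]
print("Predictions:", pred_species)
```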
kNN model accuracy: 0.983333333333
Predictions: ['versicolor', 'virginica']
Important points to note from the above code:
- We create a knn classifier object using:
knn = KNeighborsClassifier(n_neighbors=3)
- The classifier is trained on the X_train data. This process is called fitting. We pass in the feature matrix and the corresponding response vector.
knn.fit(X_train, y_train)
- Now we need to test our classifier on the X_test data. This is done with the knn.predict method, which returns the predicted response vector, y_pred.
y_pred = knn.predict(X_test)
- We are now interested in finding the accuracy of our model by comparing y_test and y_pred. This is done using the accuracy_score method of the metrics module:
print(metrics.accuracy_score(y_test, y_pred))
- Consider the case when you want your model to make predictions on out-of-sample data. The sample input can simply be passed in the same way as any feature matrix.
sample = [[3, 5, 4, 2], [2, 3, 5, 4]]
preds = knn.predict(sample)
- If you do not want to train your classifier again and again and would rather use a pre-trained classifier, you can save it with joblib. All you need to do is:
joblib.dump(knn, 'iris_knn.pkl')
- To load an already saved classifier, use:
knn = joblib.load('iris_knn.pkl')
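The save-and-load round trip can be sketched as below. This assumes the standalone joblib package (installed alongside scikit-learn) and writes to a temporary directory so that nothing is left on disk. The temporary path is an illustrative choice.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Train a classifier to persist.
X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "iris_knn.pkl")

    # Save the trained classifier, then load it back as a new object.
    joblib.dump(knn, model_path)
    restored = joblib.load(model_path)

# The restored classifier should make the same predictions as the original.
sample = [[3, 5, 4, 2], [2, 3, 5, 4]]
same_predictions = bool((knn.predict(sample) == restored.predict(sample)).all())
print("Same predictions after reload:", same_predictions)
```

Only load files saved this way from sources you trust, since they are pickle-based.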
As we approach the end of this article, here are some advantages of using scikit-learn over other machine learning libraries (such as R libraries):
- Consistent interface to machine learning models
- Provides many customization options, but with reasonable defaults
- Exceptional documentation
- Rich feature set for related tasks.
- Active community for development and support.
References:
- http://scikit-learn.org/stable/documentation.html
- https://github.com/justmarkham/scikit-learn-videos
This article was contributed by Nikhil Kumar.