Pandas Built-in Data Visualization | ML



In this tutorial, we will learn about the built-in data visualization capabilities of pandas. They are built on top of matplotlib, but baked into pandas for easier use.

Let's take a look!

Install
The easiest way to install pandas is to use pip:

 pip install pandas 

or download it from here

This article demonstrates how to use the built-in data visualization function in pandas by building various types of charts.

Importing the required libraries and data files:

The sample CSV files df1 and df2 used in this tutorial can be downloaded here.

import numpy as np
import pandas as pd

# These are some fake CSV data files
# that you can read in as dataframes
df1 = pd.read_csv('df1', index_col=0)
df2 = pd.read_csv('df2')
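If the sample CSV files are not at hand, the rest of the tutorial can still be followed with synthetic data. The sketch below is an assumption about what the files look like, inferred only from the calls used later in this article: df1 gets numeric columns A to D on a date index, and df2 gets small non-negative columns a to d. The shapes, the seed, and the date range are all made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the tutorial's CSV files. Column names are
# inferred from the calls used in this article; sizes are arbitrary.
rng = np.random.default_rng(seed=101)

# df1: date-indexed numeric columns A-D drawn from a standard normal.
df1 = pd.DataFrame(
    rng.standard_normal((1000, 4)),
    index=pd.date_range("2000-01-01", periods=1000, freq="D"),
    columns=["A", "B", "C", "D"],
)

# df2: a small table of non-negative values in columns a-d.
df2 = pd.DataFrame(rng.random((10, 4)), columns=["a", "b", "c", "d"])

print(df1.head())
print(df2.head())
```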

Style Sheets

Matplotlib has style sheets that you can use to make your plots look a little nicer. These style sheets include bmh, fivethirtyeight, ggplot and others. They basically create a set of style rules that your plots follow. We recommend using them: they make all your plots share the same look and feel more professional. You can even create your own if you want all of your company's plots to look the same (although that is a bit tedious).

Here's how to use them.

Before calling plt.style.use(), plots look like this:

df1['A'].hist()

Output:

Call the style:

Now the plots look like this after calling the ggplot style:

import matplotlib.pyplot as plt

plt.style.use('ggplot')

df1['A'].hist()

Output:

The plots look like this after calling the bmh style:

plt.style.use('bmh')

df1['A'].hist()

Output:

The plots look like this after calling the dark_background style:

plt.style.use('dark_background')

df1['A'].hist()

Output:

The plots look like this after calling the fivethirtyeight style:

plt.style.use('fivethirtyeight')

df1['A'].hist()

Output:
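To see which style names are installed, and to try a style on one plot without changing the global setting, something like the following sketch can help. It uses plt.style.available and plt.style.context from matplotlib; the Series here is synthetic data standing in for df1['A'], and the Agg backend line is only there so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# List the style names this matplotlib installation knows about.
print(sorted(plt.style.available))

# Synthetic data standing in for df1['A'] (an assumption).
s = pd.Series(np.random.default_rng(0).standard_normal(500), name="A")

# Apply a style to a single plot only; the previous settings are
# restored when the with-block exits.
with plt.style.context("ggplot"):
    ax = s.hist()

# Go back to matplotlib's defaults explicitly.
plt.style.use("default")
```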

Plot Types

There are several plot types built into pandas, most of which are statistical in nature:

  • df.plot.area
  • df.plot.barh
  • df.plot.density
  • df.plot.hist
  • df.plot.line
  • df.plot.scatter
  • df.plot.bar
  • df.plot.box
  • df.plot.hexbin
  • df.plot.kde
  • df.plot.pie

    You can also simply call df.plot(kind='hist'), or replace the kind argument with any of the key terms shown in the list above (for example, 'box', 'barh', etc.). Let's go through them!
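A minimal sketch of the two equivalent spellings, using a small synthetic DataFrame (the data and column names are assumptions for illustration only):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit this line in a notebook
import numpy as np
import pandas as pd

# Synthetic two-column frame, for illustration only.
df = pd.DataFrame(
    np.random.default_rng(1).standard_normal((200, 2)), columns=["A", "B"]
)

# Both calls draw the same kind of plot and return a matplotlib Axes.
ax_kind = df.plot(kind="hist", bins=20, alpha=0.5)
ax_accessor = df.plot.hist(bins=20, alpha=0.5)

print(type(ax_kind).__name__, type(ax_accessor).__name__)
```

The accessor form (df.plot.hist) tends to be easier to discover through tab completion, while the kind form is convenient when the plot type is chosen at run time.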

    1.) Area

    An area chart or area graph displays quantitative data graphically. It is based on a line chart. The area between the axis and the line is usually emphasized with colors, textures, or hatching. Area charts are commonly used to compare two or more quantities.

    df2.plot.area(alpha=0.4)

    Output:
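By default pandas stacks the columns of an area plot on top of each other; passing stacked=False overlays them instead. The sketch below uses a tiny hand-written DataFrame (an assumption, not the tutorial's df2) so the difference is easy to see on the y-axis.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit this line in a notebook
import pandas as pd

# Small made-up table; stacked area plots need values of a single sign.
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 2, 1, 3], "c": [1, 3, 2, 2]})

# Default: columns are stacked, so the top edge is the row-wise sum.
ax_stacked = df.plot.area(alpha=0.4)

# Overlaid: each column is drawn from zero, semi-transparent.
ax_overlaid = df.plot.area(stacked=False, alpha=0.4)

print(ax_stacked.get_ylim(), ax_overlaid.get_ylim())
```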

    2.) Barplots

    A bar chart or bar graph is a chart that presents categorical data with rectangular bars whose heights or lengths are proportional to the values they represent. The bars can be plotted vertically or horizontally. A vertical bar chart is sometimes called a column chart.

    df2.head()

    Output:

    df2.plot.bar()

    Output:

    df2.plot.bar(stacked=True)

    Output:
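For the horizontal variant mentioned above, df.plot.barh works the same way. A minimal sketch with a small made-up DataFrame (the values and labels are assumptions):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit this line in a notebook
import pandas as pd

# Made-up categorical data: one row per category, one bar per column.
df = pd.DataFrame({"a": [1, 2, 3], "b": [3, 2, 1]}, index=["x", "y", "z"])

# Vertical bars: bar height encodes the value.
ax_v = df.plot.bar()

# Horizontal stacked bars: bar width encodes the value.
ax_h = df.plot.barh(stacked=True)

heights = sorted(p.get_height() for p in ax_v.patches)
widths = sorted(p.get_width() for p in ax_h.patches)
print(heights, widths)
```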

    3.) Histograms

    A histogram is a plot that lets you discover and show the underlying frequency distribution (shape) of a set of continuous data. This allows you to inspect the data for its underlying distribution (e.g. normal distribution), outliers, skewness, etc.

    df1['A'].plot.hist(bins=50)

    Output:
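To make the role of bins concrete, the sketch below computes the same counts with np.histogram and compares them to the bar heights pandas draws. The Series is synthetic (an assumption standing in for df1['A']).

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit this line in a notebook
import numpy as np
import pandas as pd

# Synthetic stand-in for df1['A'].
values = pd.Series(np.random.default_rng(2).standard_normal(1000))

# 50 equal-width bins over the data range: 50 counts, 51 edges.
counts, edges = np.histogram(values, bins=50)

# The plotted bar heights are these same per-bin counts.
ax = values.plot.hist(bins=50)
heights = [p.get_height() for p in ax.patches]

print(counts.sum(), len(edges), len(heights))
```

More bins show finer detail but make each bar noisier; fewer bins smooth the shape at the cost of resolution.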

    4.) Line Plots

    A line plot connects successive data points with line segments, showing how a value changes along the x-axis. It works best when the data is a time series, and it is a quick and easy way to see trends in your data.

    df1.plot.line(y='B', figsize=(12, 3), lw=1)  # the index is used for x by default

    Output:
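As a sketch of the time-series case, the example below builds a made-up random walk on a daily date index and plots it. The data, dates, and column name are assumptions for illustration; the point is that the index supplies the x-axis when x is not given.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit this line in a notebook
import numpy as np
import pandas as pd

# Made-up time series: a random walk on a daily date index.
rng = np.random.default_rng(5)
dates = pd.date_range("2021-01-01", periods=100, freq="D")
ts = pd.DataFrame({"B": rng.standard_normal(100).cumsum()}, index=dates)

# No x argument: pandas plots column B against the DataFrame index.
ax = ts.plot.line(y="B", figsize=(12, 3), lw=1)

plotted_y = ax.lines[0].get_ydata()
print(len(plotted_y))
```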

    5.) Scatter Plots

    Scatter plots are used when you want to show the relationship between two variables. Scatterplots are sometimes called correlation plots because they show how two variables are correlated.

    df1.plot.scatter(x='A', y='B')

    Output:

    You can use c to color the points based on the value of another column. Use cmap to specify the color map to use. For all color maps, check: http://matplotlib.org/users/colormaps.html

    df1.plot.scatter(x='A', y='B', c='C', cmap='coolwarm')

    Output:

    Or use s to set the marker size based on another column. Here s is passed as an array of sizes rather than a column name (older pandas versions do not accept a column name for s):

    df1.plot.scatter(x='A', y='B', s=df1['C'] * 200)

    Output:
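Since a scatter plot is a visual check on correlation, it can help to print the correlation coefficient next to it. The sketch below uses synthetic columns (an assumption: B is built as a noisy multiple of A so the relationship is visible, and C is a non-negative column used for color and size).

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit this line in a notebook
import numpy as np
import pandas as pd

# Synthetic data: B depends linearly on A plus noise; C lies in [0, 1).
rng = np.random.default_rng(4)
a = rng.standard_normal(300)
df = pd.DataFrame({
    "A": a,
    "B": 0.8 * a + 0.3 * rng.standard_normal(300),
    "C": rng.random(300),
})

# Pearson correlation between the two plotted variables.
r = df["A"].corr(df["B"])
print(f"correlation between A and B: {r:.2f}")

# Color and size both driven by column C.
ax = df.plot.scatter(x="A", y="B", c="C", cmap="coolwarm", s=df["C"] * 200)
n_points = len(ax.collections[0].get_offsets())
```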

    6.) Box Plots

    A box plot draws a rectangle spanning the first to third quartiles (the middle half of the data), usually with a line inside to represent the median value. Whiskers extend from either side of the rectangle toward the smallest and largest values.
    A box plot is a standardized way of displaying the distribution of data based on the five-number summary (minimum, first quartile (Q1), median, third quartile (Q3), and maximum). It can tell you about your outliers and what their values are. It can also tell you whether your data is symmetric, how tightly it is grouped, and if and how it is skewed.

    df2.plot.box()  # You can also pass a by= argument for groupby

    Output:
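The five-number summary behind a box plot can be computed directly. The sketch below uses a small hand-written Series (an assumption, chosen so one value is clearly extreme) and flags outliers with the 1.5 × IQR rule, which is the default whisker length in matplotlib's box plots.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit this line in a notebook
import pandas as pd

# Made-up data with one obviously extreme value.
data = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 30], name="value")

# Five-number summary.
minimum, maximum = data.min(), data.max()
q1, median, q3 = data.quantile([0.25, 0.5, 0.75])
print(minimum, q1, median, q3, maximum)

# Points beyond 1.5 * IQR from the box are drawn as individual fliers.
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]
print(outliers.tolist())

ax = data.plot.box()
```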

    7.) Hexagonal bins

    Hexagonal binning is another way to handle the problem of having so many points that they start to overlap. A hexbin plot shows density rather than the individual points. The points are binned into gridded hexagons, and the distribution (the number of points per hexagon) is displayed using either the color or the area of the hexagons.
    It is useful for bivariate data and is an alternative to a scatter plot:

    df = pd.DataFrame(np.random.randn(1000, 2), columns=['a', 'b'])

    df.plot.hexbin(x='a', y='b', gridsize=25, cmap='Oranges')

    Output:
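To see that the colors really encode the number of points per hexagon, the per-hexagon counts can be read back from the plotted collection. This is a sketch that relies on the hexagons being the first collection on the returned Axes, which is an assumption about how the plot is assembled.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit this line in a notebook
import numpy as np
import pandas as pd

# Synthetic bivariate normal data, as in the tutorial's example.
rng = np.random.default_rng(6)
df = pd.DataFrame(rng.standard_normal((1000, 2)), columns=["a", "b"])

ax = df.plot.hexbin(x="a", y="b", gridsize=25, cmap="Oranges")

# Each entry is the number of points that fell into one hexagon
# (assumes the hexagon collection is the first one on the Axes).
counts = np.asarray(ax.collections[0].get_array())
print(int(counts.sum()), int(counts.max()))
```

A larger gridsize gives smaller hexagons and finer detail; a smaller one pools more points per hexagon.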

    8.) Kernel Density Estimation (KDE) Plot

    KDE is a technique that lets you create a smooth curve from a set of data.

    This can be useful if you only want to visualize the "shape" of some data, as a kind of continuous replacement for the discrete histogram. It can also be used to generate points that look like they came from a given dataset. This behavior can drive simple simulations, where the simulated objects are modeled on real data.

    df2['a'].plot.kde()

    Output:

    df2.plot.density()

    Output:
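The idea behind KDE can be sketched in plain NumPy: place a small Gaussian bump on every observation and average them. This is an illustrative sketch, not the code pandas runs internally. The sample is synthetic, the bandwidth uses Scott's rule of thumb, and the helper name gaussian_kde_1d is hypothetical. The last step shows the "generate similar points" use mentioned above.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic observations standing in for a real column (an assumption).
sample = rng.normal(loc=5.0, scale=2.0, size=500)

# Bandwidth from Scott's rule of thumb for one-dimensional data.
bandwidth = sample.std(ddof=1) * len(sample) ** (-1 / 5)

def gaussian_kde_1d(x, data, h):
    """Average of Gaussian kernels of width h centered on each data point."""
    z = (x[:, None] - data[None, :]) / h
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernels.sum(axis=1) / (len(data) * h)

# Evaluate the smooth density on a grid slightly wider than the data.
grid = np.linspace(sample.min() - 3 * bandwidth,
                   sample.max() + 3 * bandwidth, 400)
density = gaussian_kde_1d(grid, sample, bandwidth)

# A density should integrate to roughly 1 (simple Riemann sum).
area = float((density * (grid[1] - grid[0])).sum())
print(round(area, 3))

# Draw new points from the estimate: pick observations at random
# and jitter each one with Gaussian noise of width `bandwidth`.
new_points = rng.choice(sample, size=1000) + rng.normal(0, bandwidth, 1000)
```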

    That's it! Hopefully you can see why this way of plotting is much easier to use than full matplotlib: it balances ease of use with control over the figure. Many of the plot calls also accept the additional arguments of their parent matplotlib plt call.
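As a sketch of that last point, the example below passes a few matplotlib line arguments (color, linestyle, marker) straight through a pandas plot call, then keeps customizing the returned Axes. The data is made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit this line in a notebook
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({"x": [0, 1, 2, 3], "y": [0, 1, 4, 9]})

# color, linestyle and marker are forwarded to matplotlib's line drawing.
ax = df.plot.line(x="x", y="y", color="red", linestyle="--", marker="o",
                  title="y = x squared")

# The return value is a regular matplotlib Axes, so the usual
# matplotlib methods remain available for fine-tuning.
ax.set_xlabel("x value")
ax.set_ylabel("y value")

line = ax.lines[0]
print(line.get_color(), line.get_linestyle(), line.get_marker())
```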