Change language

# K stands for clustering — Introduction

|

We are provided with a dataset of elements with specific characteristics and values ​​for these elements (for example, a vector). The challenge is to classify these elements into groups. For this we will use the kMeans algorithm; unsupervised learning algorithm.

overview

(It helps if you think of objects as points in n-dimensional space). The algorithm will divide the elements into k similarity groups. To calculate this similarity, we will use Euclidean distance as a measurement.

The algorithm works as follows:

1. First, we initialize k points, called mean s, randomly.
2. We classify each element according to its nearest average and update its coordinates, which are the average of the elements classified in that value.
3. We repeat the process for a given number of iterations, and at the end we have our clusters.

The above "points" are called averages because they contain the average of the items classified in it. We have many options to initialize these tools. An intuitive method is to initialize mean s of random elements in the dataset. Another method is to initialize the mean s at random values ​​between the boundaries of the dataset (if for the object the elements have values ​​in [0.3], we initialize the mean s with values ​​in [0.3]).

Above algorithm in pseudocode:

` Initialize k  mean s with random values ​​For a given number of iterations: Iterate through items: Find the  mean  closest to the item Assign item to  mean  Update  mean  `

We receive input as a text file ("data.txt"). Each line represents an element and contains numeric values ​​(one for each function), separated by commas. You can find sample data here.

We will read data from the file, storing it as a list. Each list item is a different list containing item values ​​for objects. We do this with the following function:

 ` def ` ` ReadData (fileName): `   ` # Read the file, breaking down the lines ` ` f ` ` = ` ` open ` ` (fileName, ` ` ’r’ ` `); ` ` lines ` ` = ` ` f.read (). splitlines (); ` ` f.close (); `   ` items ` ` = ` ` []; `   ` for ` ` i ` ` in ` ` range ` ` (` ` 1 ` `, ` ` len ` ` (lines)): ` ` line ` ` = ` ` lines [i] .split (` ` ’,’ ` `); ` ` itemFeatures ` ` = ` ` []; `   ` for ` ` j ` ` in ` ` range ` ` (` ` len ` ` (line) ` ` - ` ` 1 ` `): ` ` v ` ` = ` ` float ` ` (line [j]) ; ` ` # Convert object value to float ` ` itemFeatures.append (v); ` ` # Add function value to dict `   ` items.append (itemFeatures); `   ` shuffle (items); `   ` return ` ` items; `

Initialize Means

We want to initialize the values ​​of each mean in a range of element feature values. To do this, we need to find the minimum and maximum values ​​for each function. We accomplish this with the following function:

 ` def ` ` FindColMinMax (items): ` ` n ` ` = ` ` len ` ` (items [` ` 0 ` `]); ` ` minima ` ` = ` ` [sys.maxint ` ` for ` ` i ` ` in ` ` range ` ` (n)]; ` ` maxima ` ` = ` ` [` ` - ` ` sys.maxint ` ` - ` ` 1 ` ` for ` ` i ` ` in ` ` range ` ` (n)]; `   ` for ` ` item ` ` in ` ` items: ` ` for ` ` f ` ` in ` ` range ` ` (` ` len ` ` (item)): ` ` ` ` if ` ` (item [f] "minima [f]): ` ` ` ` minima [ f] ` ` = ` ` item [f]; `   ` if ` ` (item [f]" maxima [f]): ` ` ` ` maxima [f] ` ` = ` ` item [f]; `   ` return ` ` minima, maxima; `

Variables are lists containing the minimum and maximum values ​​of the elements, respectively ... We initialize the values ​​of each mean randomly between the corresponding minimum and maximum in these two lists:

 ` def ` ` InitializeMeans (items, k, cMin, cMax): `   ` # Initialize mean s random numbers between ` ` # min and max of each column / function ` ` f ` ` = ` ` len ` ` (items [` ` 0 ` `]); ` ` # number of functions ` ` mean s ` ` = ` ` [[` ` 0 ` ` for ` ` i ` ` in ` ` range ` ` (f)] ` ` for ` ` j ` ` in ` ` range ` ` (k)]; `   ` for ` ` mean ` ` in ` ` mean s: ` ` for ` ` i ` ` in ` ` range ` ` (` ` len ` ` ( mean )): ` ` `  ` ` ` # Set the value to a random floating point number ` ` # (add + -1 to avoid wide placement of the middle) ` ` mean [i] ` ` = ` ` uniform (cMin [i] ` ` + ` ` 1 ` `, cMax [i] ` ` - ` ` 1 ` `); `   ` return ` ` mean s; `

Euclidean Distance

We will use Euclidean Distance as a measure of similarity for our dataset (note: depending on your subjects, you may use a different measure of similarity).

 ` def ` ` EuclideanDistance (x, y): ` ` S ` ` = ` ` 0 ` `; ` ` # Sum of squared differences of elements ` ` for ` ` i ` ` in ` ` range ` ` (` ` len ` ` (x)): ` ` S ` ` + ` ` = ` ` math. ` ` pow ` ` (x [i] ` ` - ` ` y [i], ` ` 2 ` `); `   ` return ` ` math. sqrt (S); ` ` # Square root of sum `

Update Tool

To update the average, we need to find the average for its function for all elements in the average / cluster. We can do this by adding all the values ​​and then dividing them by the number of elements, or we can use a more elegant solution. We will calculate a new average without having to re-add all values ​​by doing the following:

` m = (m * (n-1) + x) / n `

where — the member average, the number of members in the cluster, and the member value for the added member. We do the above for each function to get a new average.

 ` def ` ` UpdateMean (n, mean , item): ` ` for ` ` i ` ` in ` ` range ` ` (` ` len ` ` ( mean )): ` ` m ` ` = ` ` mean [i]; ` ` m ` ` = ` ` (m ` ` * ` ` (n ` ` - ` ` 1 ` `) ` ` + ` ` item [i]) ` ` / ` ` float ` ` (n); ` ` mean [i] ` ` = ` ` round ` ` (m, ` ` 3 ` `); `   ` return ` ` mean ; `

Classify items

Now we need to write a function to classify an item into a group / cluster. For a given element, we will find its similarity to each average and classify the element by the closest one.

 ` def ` ` Classify ( mean s, item): `   ` # Classify element by mean with minimum spacing ` ` minimum ` ` = ` ` sys.maxint; ` ` index ` ` = ` ` - ` ` 1 ` `; `   ` for ` ` i ` ` in ` ` range ` ` (` ` len ` ` ( mean s)): `   ` # Find the distance from the element to the middle ` ` dis ` ` = ` ` EuclideanDistance (item, mean s [i]); `   ` if ` ` (dis "minimum): ` ` ` ` minimum ` ` = ` ` dis; ` ` index ` ` = ` ` i; `   ` return ` ` index; `

Find funds

To actually find funds, we will iterate over all the items, classify them by the closest cluster, and update the cluster average. We will repeat the process for some fixed number of iterations. If between two iterations no element changes the classification, we stop the process, since the algorithm found the optimal solution.

The function below takes as input (the number of desired clusters), the elements and the number of maximum iterations and returns averages and clusters. The classification of the element is stored in an array, and the number of elements in the cluster is — c.

 ` def ` ` CalculateMeans (k, items, maxIterations ` ` = ` ` 100000 ` `): ` ` `  ` # Find highs and lows for columns ` ` cMin, cMax ` ` = ` ` FindColMinMax (items); `   ` # Initialize funds at random points ` ` ` ` mean s ` ` = ` ` InitializeMeans (items, k, cMin, cMax); `   ` # Initialize clusters, storage array ` ` ` ` # number of items in the class ` ` clusterSizes ` ` = ` ` [` ` 0 ` ` for ` ` i ` ` in ` ` range ` ` (` ` len ` ` ( mean s))]; `   ` # Array for storing the cluster that the element is in ` ` ` ` belongsTo ` ` = ` ` [` ` 0 ` ` for ` ` i ` ` in ` ` range ` ` (` ` len ` ` (items))]; `   ` # Calculate funds ` ` ` ` for ` ` e ` ` in ` ` range ` ` (maxIterations): `   ` # If no change occurs cluster, stop ` ` noChange ` ` = ` ` True ` `; ` ` for ` ` i ` ` in ` ` range ` ` (` ` len ` ` (items)): `   ` item ` ` = ` ` items [i]; `   ` # Classify item into cluster and update ` ` ` ` # appropriate tools. ` ` index ` ` = ` ` Classify ( mean s, item); `   ` clusterSizes [index] ` ` + ` ` = ` ` 1 ` `; ` ` cSize ` ` = ` ` clusterSizes [index]; ` ` mean s [index] ` ` = ` ` UpdateMean (cSize, mean s [index], item); `   ` # Item changed by cluster ` ` ` ` if ` ` (index! ` ` = ` ` belongsTo [i]): ` ` noChange ` ` = ` ` False ` `; `   ` belongsTo [i] ` ` = ` ` index; `   ` # Nothing changed, return ` ` ` ` if ` ` (noChange): ` ` ` ` break ` `; `   ` return ` ` mean s; `

Find Clusters

Finally, we want to find clusters given the mean s. We will loop through all the elements and classify each element by the closest cluster.

 ` def ` ` FindClusters ( mean s, items) : ` ` clusters ` ` = ` ` [[] ` ` for ` ` i ` ` in ` ` range ` ` (` ` len ` ` ( mean s))]; ` ` # Starter clusters `   ` for ` ` item ` ` in ` ` items: `   ` # Classify element into cluster ` ` index ` ` = ` ` Classify ( mean s, item); `   ` # Add item to cluster ` ` ` ` clusters [index] .append (item); `   ` return ` ` clusters; `

Other popular similarity measures are:

1. Cosine distance: defines the cosine of the angle between point vectors of two points in n-dimensional space

2. Manhattan Distance: computes the sum of the absolute differences between the coordinates of the two data points.

3. Minkowski distance: it is also known as generalized distance metric. Can be used for both ordinal and quantitative variables.

You can find all the code on my GitHub plus an example dataset and plotting function ... Thanks for reading.

This article contributed by Anthony Maronikolakis . If you are as Python.Engineering and would like to contribute, you can also write an article using contribute.python.engineering or by posting the article [email protected] ... See my article appearing on the Python.Engineering homepage and help other geeks.

## Shop

Learn programming in R: courses

\$

Best Python online courses for 2022

\$

Best laptop for Fortnite

\$

Best laptop for Excel

\$

Best laptop for Solidworks

\$

Best laptop for Roblox

\$

Best computer for crypto mining

\$

Best laptop for Sims 4

\$

Latest questions

NUMPYNUMPY

Common xlabel/ylabel for matplotlib subplots

NUMPYNUMPY

How to specify multiple return types using type-hints

NUMPYNUMPY

Why do I get "Pickle - EOFError: Ran out of input" reading an empty file?

NUMPYNUMPY

Flake8: Ignore specific warning for entire file

NUMPYNUMPY

glob exclude pattern

NUMPYNUMPY

How to avoid HTTP error 429 (Too Many Requests) python

NUMPYNUMPY

Python CSV error: line contains NULL byte

NUMPYNUMPY

csv.Error: iterator should return strings, not bytes

## Wiki

Python | How to copy data from one Excel sheet to another

Common xlabel/ylabel for matplotlib subplots

Check if one list is a subset of another in Python

How to specify multiple return types using type-hints

Printing words vertically in Python

Python Extract words from a given string

Cyclic redundancy check in Python

Finding mean, median, mode in Python without libraries

Why do I get "Pickle - EOFError: Ran out of input" reading an empty file?

Python - Move item to the end of the list

Python - Print list vertically