K-means clustering and its real use-case in security domain

Anirudh Grack
6 min readJul 19, 2021

--

What is K-means Clustering ?

K-means clustering is a method of separating data points into several similar groups, or “clusters,” characterized by their midpoints, which we call centroids.

Unlike supervised learning, clustering is considered an unsupervised learning method since we don’t have the ground truth to compare the output of the clustering algorithm to the true labels to evaluate its performance. We only want to try to investigate the structure of the data by grouping the data points into distinct subgroups.

What are the basic steps for K-means clustering?

  • Step 1: Choose the number of clusters k.
  • Step 2: Select k random points from the data as centroids.
  • Step 3: Assign all the points to the closest cluster centroid.
  • Step 4: Re-compute the centroids of newly formed clusters.
  • Step 5: Repeat steps 3 and 4.

How K-means works?

K-means is the most popular of the partitioning clustering algorithms and was first used in 1967 by James Macqueen. ‘K’ is just a parameter specifying the number of clusters in which data points will be grouped. K-means works through an iterative process in finding the best k-cluster groups for data points, in such a way that one data point can not belong to the same group/cluster(no overlapping). For each cluster, a centroid(using the mean) is calculated, then, data points closet by distance to these clusters are assigned as belonging to the same group. Data points are assigned to a cluster such that the sum of the squared distance between the data points and the cluster’s centroid (arithmetic mean of all the data points that belong to that cluster) is as minimal as possible. Hence the iterative process for the best clusters. The distance used is calculated using the Euclidean distance measure.
Fun fact: K + means = Number of clusters + mean of the clusters for grouping data points.

Just like every other algorithm, the K-means algorithm has objectives for each use case. The two primary objectives are:
->Calculating the distance between points
->Reducing the sum of squared distances or errors between centroids and points(SSE).

One important point to note is that choosing the optimal value of k is very important. While smaller k values lead to fewer clusters with data mismatch, too large k values lead to more clusters with likely clusters in separate groups. The optimal value of K will be at that point when there is no change in centroids i.e centroid values remain constant. One very common method for determining this is the ‘ Elbow method’.
The elbow method fits the data points in a range of value of k using the ‘within-cluster sum of square’ principle, such that WCSS is as minimal as possible.

Types of clustering

There are several types of clustering algorithms. The most common ones are:
Hierarchical algorithms
Partitioning algorithms

Hierarchical clustering is further subdivided into:
Agglomerative clustering
Divisive clustering

Partitioning clustering is further subdivided into:
K-Means clustering
Fuzzy C-Means clustering

K-means fall under a partitioning clustering algorithm, as it divides data points into several partitions/groups using a weight/center method, where data points are categorized as belonging to the same cluster based on a general center or mean(Hence the name, K-means).

How do I select the number of clusters (K)?

Good question. The most common way to do this is by using what’s referred to as an elbow plot. You can build an elbow plot by iteratively running the K-means algorithm, first with K=1, then K=2, and so on, and computing the total variation within clusters at each value of K.

A common way of computing variation is by looking at the sum of squared distances from each point to its cluster center. If we ran K-means with only one cluster, we would have one huge cluster, so the within-cluster variance would be very high, and alternatively, if each cluster consisted of only one point, the total variation would be zero. As we increase the value of K and the number of clusters goes up, the total variation within clusters will necessarily decrease.

The goal of the elbow method is to find the inflection point on the curve, or the “elbow.” After this point, additional clusters do not minimize the within cluster variance significantly enough to justify additional groupings in the dataset. There is no hard and fast rule here, as it’s often up to the discretion of the data scientist, but looking at an elbow plot tends to be a good place to start.

K-Means Clustering Use Case in Security Domain

This Usecase talks about hotspots and use of k means clustering for crime pattern detection. It first identifies significant attributes in the database. Unlike other, it then gives weights to attributes in the data set. The most important attribute (e.g. type of crime) is given the highest priority (weight) as compared to other attributes. They used are this feature of the research paper in our project. The data with missing values are made as test cases. It suggests dividing the database according to respective states, using classification, to make the data easier to analyze. In this project, they subdividing the data into different types of crime allowing the user to get information of those crimes easily (e.g. percentage of the particular crime in a particular year, the hotspot of that particular crime).

Objective was Development of a spatiotemporal crime prediction model based on geographical information systems coupled with spatial statistical methods. In this ,clustering analyses are used to identify hot spots. Cluster analysis aims to collect data into groups according to several algorithms which are Kmeans, spatiotemporal analysis of crime (STAC), fuzzy, ISODATA, and geographical analysis machine (GAM) clustering techniques. Clusters of STAC do include more homogenous areas than the other methods. STAC is not restricted to include all the observations hence STAC is able to indicate denser crime areas than other methods. This is important in crime prevention for allocating resources effectively.

The Cluster has a special meaning which refers to a special group of crime i.e. a lot of crimes in a given specified regions. These clusters can be represented using a geo plot of the crime described on the map of the police jurisdiction.

1. Crime Data Mining for Indian Police Information System
This is all about India’s crime analysis system. It gives ways to enhance the currently existing system in Indian Police System called Crime Criminal Information System (CCIS). It suggests dividing the database according to respective states, using classification, to make the data easier to analyze. In our project, we are subdividing the data into different types of crime allowing the user to get information of those crimes easily (e.g. percentage of the particular crime in a particular year, the hotspot of that particular crime)

2. An Optimal KD Model for Crime Pattern Detection Based on Semantic Link Analysis-A Data Mining Tool
The system finds the critical path of the serial killers who are striking over again and again and determines links between their crime activities locations occurred in the past, travel record, and background history, etc. These findings increase the chance of finding these repeat offenders. The formation to integrate information from various crime incidents and also multiple sources and also discover regular patterns for the structure, organization, operation and information in criminal databases. If a particular criminal uses a pattern of path to commit consecutive crimes, then the next crime location of this serial killer can be predicted from the pattern observed. e.g.: in DHOOM 2, the last location of crime of criminal HRITHIK ROSHAN was predicted by the pattern he formed from his previous crime locations.

3. Evolving Data Mining Algorithms on the Prevailing Crime Trend

The crime data is divided into days of the week, to observe Spatial temporal distribution of crime. To the clustered results, a classification algorithm was applied to predict the future crime pattern. It enables us to build a model on predicting next records using previous year’s data.

RESULTS OF CRIME PATTERN ANALYSIS
The different clusters or the crime patterns are color-coded. For each group, the legend provides the total number of crimes incidents included in the group along with the significant attributes that characterize the group.

--

--

No responses yet