K-Means Clustering in Cyber Security

Mahek Batra
5 min read · Jul 19, 2021

Introduction

With advances in technology and the growing number of digital sources, the quantity of data increases every day and, with it, the quantity of cyber security related data. Traditional security systems such as Intrusion Detection Systems (IDS) cannot handle such a growing volume of data in real time. Cyber security analytics is an alternative to such traditional security systems: it uses big data analytics techniques to provide a faster and more scalable framework for handling large amounts of cyber security related data in real time.

K-means clustering is one of the commonly used clustering algorithms in cyber security analytics. It aims to divide security related data into groups of similar entities, which in turn can help in gaining important insights about known and unknown attack patterns. This technique lets a security analyst focus the analysis on the data in specific clusters only.

K-Means Clustering

The k-means algorithm is a clustering algorithm. That means that you have a bunch of points in some space, and you want to guess what groups they seem to be in. For example, say we have these points:

Understanding of k-Means Clustering

As a human, you can easily look at those and say that the ones in the top right are a cluster and the ones in the bottom left are a cluster. But if there were lots more clusters, or if they overlapped, or if they were in a 3-dimensional or much higher dimensional space, it would be harder. For example, in the image below it is difficult to form the clusters on our own. For this kind of collection of data points we have to use the k-means approach.

With the k-means algorithm, you have to tell it how many clusters to look for (that’s the “k”), and you tell it some real data points and then it tries to guess a reasonable grouping of the points into k clusters.

So, let me explain how it works:

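  • The algorithm starts by choosing k random data points as the initial cluster centers.
  • Now we start clustering around the selected k points. We calculate the distance between every other point and each of the k selected points, and assign each point to the closest one. The distance here just means how far apart data points lie in a graph.
  • Each center is then moved to the mean of the points assigned to it, and the assignment and update steps repeat until the clusters stop changing.
  • The maximum possible number of clusters is equal to the number of observations in the dataset.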
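
To make the procedure concrete, here is a minimal sketch of the standard k-means loop in plain Python: pick k random points as centers, assign every point to its closest center, move each center to the mean of its points, and repeat until nothing changes. The function name `kmeans`, its parameters, and the six sample points are illustrative assumptions, not something taken from a particular library or data set.

```python
import math
import random


def kmeans(points, k, max_iters=100, seed=0):
    """Group `points` (a list of equal-length tuples) into k clusters."""
    rng = random.Random(seed)
    # Step 1: pick k random data points as the initial centers.
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iters):
        # Step 2: assign every point to its closest center (Euclidean distance).
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the clusters are stable
        labels = new_labels
        # Step 3: move each center to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centers, labels


# Two visually obvious groups: one in the bottom left, one in the top right.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]
centers, labels = kmeans(points, k=2)
print(centers, labels)
```

In practice you would more likely reach for an existing implementation, but the loop above is enough to see why k has to be chosen up front: the number of centers never changes once the algorithm starts.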

In clustering, we do not have a target to predict. We look at the data and then try to club similar observations and form different groups. Hence it is an unsupervised learning problem.

K-Means Algorithm in Intrusion Detection System

An intrusion detection system monitors the software and hardware of a system, and its practical value is high. It has already become a main network security management tool: it collects various kinds of information from the system and combines detection with automatic response. An intrusion detection system is essentially a behavior classifier, which works by judging whether observed behavior is intrusive or non-intrusive.

An intrusion detection system mainly distinguishes normal behavior from abnormal behavior and then takes corresponding measures. A data set can be used in such a system after simple preprocessing and system auditing, but this approach only works for simple analysis of normal and abnormal behavior, and it presupposes that the difference between abnormal data and normal data is already known.

A clustering algorithm can take data in which normal and abnormal records are not distinguished in advance, group the records by what they have in common, and then tell the groups apart. Therefore, applying unsupervised clustering algorithms to anomaly detection can improve the detection efficiency of an intrusion detection system, and the practical value is high.

In data mining, clustering algorithms need to be analyzed in detail and their use understood. Among clustering algorithms, k-means is one of the most commonly used and most practical. Next, we analyze the k-means algorithm. K-means first takes the input parameter k and then divides the n samples in the data into k classes, so that the data within the same cluster have high similarity.

Establishment of Intrusion Detection Model

A general intrusion detection model is set up in four steps. First, a collection system gathers connection records while the network is in use, which produces the data set for cluster analysis. Then a clustering algorithm groups the connection records so that normal and abnormal connection records can be distinguished. In this study, the k-means algorithm is used for the cluster analysis.

The clustering algorithm produces several clusters, each containing some connection records. Based on the properties of the connection records in a cluster, the clusters can be divided into two kinds: abnormal clusters and normal clusters. An abnormal cluster groups abnormal connection records, and a normal cluster groups normal connection records.

In real applications, if labeled data is not available, you cannot directly determine whether a connection record is normal or abnormal and tag the clusters accordingly. Typically a threshold is used instead: clusters whose number of connection records is above the threshold are tagged as normal, and the others are tagged as abnormal.
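
One reading of the threshold rule above is that it applies to cluster size, on the assumption that normal traffic greatly outnumbers attack traffic. The sketch below follows that reading: a cluster holding more than a given share of all connection records is tagged normal, the rest abnormal. The function name `label_clusters`, the 0.2 threshold, and the sample assignments are hypothetical choices for illustration only.

```python
from collections import Counter


def label_clusters(assignments, threshold=0.2):
    """Tag each cluster "normal" or "abnormal" by its share of all records.

    `assignments` holds one cluster id per connection record. The 0.2
    threshold is an illustrative assumption, not a value from the article.
    """
    sizes = Counter(assignments)
    total = len(assignments)
    return {
        cluster: "normal" if count / total > threshold else "abnormal"
        for cluster, count in sizes.items()
    }


# Hypothetical cluster ids for ten connection records.
assignments = [0, 0, 0, 0, 0, 1, 1, 1, 1, 2]
tags = label_clusters(assignments)
print(tags)
```

The right threshold depends on the traffic mix, so it would need tuning on real data; a rule like this also mislabels an attack that generates many similar records, such as a flood.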

To detect intrusions with the clustering result, a new connection record is first standardized. Then the cluster whose center is closest to the record is found, and the record is classified according to that cluster's tag.
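
The classification step can be sketched as follows, assuming z-score standardization (subtract the mean, divide by the standard deviation) and Euclidean distance; the article does not say which standardization or distance is meant, so both are assumptions. The feature names, statistics, cluster centers, and tags are all made-up values for illustration.

```python
import math


def standardize(record, means, stds):
    """Rescale each feature to zero mean and unit variance (z-score)."""
    return tuple((x - m) / s for x, m, s in zip(record, means, stds))


def classify(record, means, stds, centers, tags):
    """Return the tag of the cluster whose center is closest to the record."""
    z = standardize(record, means, stds)
    nearest = min(range(len(centers)), key=lambda c: math.dist(z, centers[c]))
    return tags[nearest]


# Hypothetical per-feature statistics from the training data
# (feature 1: duration in seconds, feature 2: bytes sent).
means = (2.0, 500.0)
stds = (1.0, 200.0)
# Hypothetical cluster centers in standardized space, with their tags.
centers = [(0.0, 0.0), (4.0, 5.0)]
tags = {0: "normal", 1: "abnormal"}

verdict = classify((6.5, 1400.0), means, stds, centers, tags)
print(verdict)
```

Standardizing with the same means and standard deviations that were used when the clusters were built matters here; otherwise a feature with a large numeric range, such as bytes sent, would dominate the distance.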

Conclusion

The vast amount of data generated in the Internet era undoubtedly challenges the technology of large-scale data processing and data mining. In this article, we looked at the problem of network security through the k-means clustering algorithm from data mining. Analyzing network security problems and building better-performing intrusion detection systems helps more people understand the many ways and means by which network intrusions occur. In this way, we can help keep network information secure at a time when information leaks are a serious problem.

Thanks for reading

Mahek Batra

BE 3rd year || Information Technology|| Dedicated|| Passionate||