Survey Paper – Clustering in Data Mining
Essay by mbilalayub • August 23, 2017 • Research Paper • 3,107 Words (13 Pages) • 1,341 Views
Muhammad Bilal Ayub
College of Information Technology, Universiti Tenaga Nasional, Putrajaya Campus,
Jalan IKRAM-UNITEN, 43000 Kajang, Selangor, Malaysia
mbilalayub@gmail.com
4-Sep-2016
Abstract
Data mining techniques extract useful information from massive amounts of data, which in most environments keep accumulating as they grow. Clustering this information is important for understanding the internal structure of the data in data mining packages and applications. Clustering gathers and sorts objects so that the objects within each group are relevant to one another. Transforming the data through several phases is part of the process. Mining methods differ in whether they supply criteria that guide the mining process or evaluate its results; the two main kinds are supervised and unsupervised learning, and clustering is categorized as unsupervised learning. A good clustering method constructs clusters with high similarity among the data within a cluster and very little resemblance to other clusters. There are different types of clustering algorithms for organizing different sets of data; the most popular are partition-based, hierarchical-based, grid-based and density-based algorithms.
A partitioning algorithm divides the data set and assigns the points to different partitions, where every partition represents a cluster. Hierarchical clustering separates the data set by constructing a hierarchy of interrelated clusters. Density-based algorithms discover clusters as regions that keep growing while the density of points stays high. Grid-density-based algorithms use populated grid cells to shape clusters; objects in the sparsely populated areas that separate clusters are generally treated as “noise” or “border” points. This survey assesses clustering and its diverse techniques in data mining.
Keywords: data mining, clustering, partition algorithm, hierarchical algorithm, density-based algorithm, grid-based algorithm
Introduction:
Data mining [1] is the examination and investigation of massive amounts of existing data in order to uncover hidden patterns and relationships. The ultimate goal is to find the best way to combine the capabilities of machine and human intelligence to discover something meaningful. It is worth noting that the larger the available data, the better the results that can be extracted.
“Data mining is the component of wider process called KDD (knowledge discovery from database)” [1]. Processing data with data mining techniques and their algorithms involves numerous steps: the input data is transformed and, after a set of manipulations, arranged so that the resulting structure suggests future directions in the data. The end result can be redirected to any conventional database.
Figure 1 - Process of Data Mining [11]
There are two types of data mining approaches: supervised learning and unsupervised learning.
In supervised learning, the expected end result is already known for the different streams of data, and the mapping from input to output is understood in advance.
The underlying problems are classified into "regression" and "classification" problems. In a regression problem, results are predicted as continuous values, that is, the input is mapped to a continuous function. In a classification problem, the input data or variables are mapped to a specific discrete output; in effect, the method tries to assign each input to one of a set of predefined categories.
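The difference between the two supervised problem types can be illustrated with a minimal sketch. The data, the function names and the choice of a least-squares line and a nearest-centroid rule are assumptions made for illustration only; they are not methods prescribed by this survey.

```python
# Supervised learning sketch: both tasks learn from labelled (input, output) pairs.
# All data and function names here are hypothetical.

def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b (regression: continuous output)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

def fit_centroids(xs, labels):
    """Mean input value per class (classification: output is a category)."""
    sums, counts = {}, {}
    for x, label in zip(xs, labels):
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def classify(x, centroids):
    """Assign x to the class whose centroid is closest."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))

# Regression: predict a continuous value for an unseen input.
a, b = fit_line([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
predicted = a * 5.0 + b

# Classification: place an unseen input into one of the predefined categories.
centroids = fit_centroids([1.0, 2.0, 9.0, 10.0], ["low", "low", "high", "high"])
category = classify(8.0, centroids)

print(predicted, category)
```

In both cases the labelled outputs are available during training, which is what distinguishes these tasks from clustering.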
In unsupervised learning, there is no predicted outcome and the end results are not known in advance. The structure is built from the data itself, and the effects of individual variables and their changes are not necessarily visible. The relationships among the data and its subsets play the major role in arriving at possible results. In other words, what a cluster might represent cannot be predicted beforehand.
Clustering:
Clustering in data mining rearranges objects so that similar objects are bound together in a structured way, making relationships easier to discover; the objects are not categorized before going through the data mining process. It is a popular approach for exploring data to find relationships that were not anticipated. It is particularly helpful when a substantial amount of data exists and must be reduced to a manageable and meaningful summary; that summary may then serve as input to another technique that refines the data further and reveals other dimensions. Imagine a large collection of pictures in which patterns are first identified and a clustering algorithm is then applied to group related patterns into appropriate clusters. Similarly, when a marketing department analyzes customer data for marketing purposes or to assess the possibility of introducing a new product, it has to group the data, identify the segments and apply the relevant marketing strategy to each.
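The customer segmentation scenario can be sketched with k-means, one well-known partition-based algorithm. This is only an illustration under stated assumptions: the customer records (annual spend, number of visits), the function name kmeans and the choice of initial centers are hypothetical, and other clustering algorithms could be used instead.

```python
import math

def kmeans(points, initial_centers, max_iterations=100):
    """Plain k-means: alternate assignment and center update until stable."""
    centers = [tuple(c) for c in initial_centers]
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        new_assignment = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the result is stable
        assignment = new_assignment
        # Update step: move each center to the mean of its members.
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignment, centers

# Hypothetical customers: (annual spend, number of visits).
customers = [(20, 2), (25, 3), (22, 2), (200, 20), (210, 22), (190, 18)]
labels, centers = kmeans(customers, initial_centers=[customers[0], customers[3]])
print(labels)
print(centers)
```

No labels are supplied to the algorithm; the two segments emerge from the distances between records alone, and each segment could then be given its own marketing strategy.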
Segmentation of the data is an essential phase in clustering and is tied to the grouping structure. The strategy can change across scenarios or across multi-tier clustering phases, in which one phase takes the output of a previous clustering algorithm as its input. Scaling of variables, for example from fractions to whole numbers, can have a marked effect on the results.
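The effect of variable scaling can be seen in a small sketch. It assumes Euclidean distance and min-max scaling, which are only one possible choice, and the customer records (income, age) are hypothetical.

```python
import math

def min_max_scale(rows):
    """Rescale every column to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]

def nearest_neighbour(rows, index):
    """Index of the row closest to rows[index] by Euclidean distance."""
    others = [i for i in range(len(rows)) if i != index]
    return min(others, key=lambda i: math.dist(rows[index], rows[i]))

# Hypothetical customers: (annual income, age).
customers = [(30000, 25), (30500, 60), (31000, 26), (40000, 40)]

raw_neighbour = nearest_neighbour(customers, 0)
scaled_neighbour = nearest_neighbour(min_max_scale(customers), 0)
print(raw_neighbour, scaled_neighbour)
```

On the raw values, income dominates the distance, so the first customer is matched with a much older customer of similar income. After scaling, age carries comparable weight and the nearest neighbour becomes the customer of similar age, so a clustering built on these distances would group the records differently.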
There are several requirements associated with clustering in data mining: “Scalability”, “Ability to deal with different types of attributes”, “Ability to handle dynamic data”, “Discovery of clusters with random shape”, “Minimal requirements for domain knowledge to determine input parameters”, “Able to deal with noise and outliers”, “Insensitive to order of input records”, “High dimensionality”, “Incorporation of user-specified constraints” and “Interpretability and usability”.
The types of data used in cluster analysis are “Interval scaled variables”, “Binary variables”, “Nominal, ordinal and ratio variables” and “Variables of mixed types” [2]. Five categories of clusters are used in clustering, distinguished by their characteristics: “Well-separated clusters”, “Center-based clusters”, “Contiguous clusters”, “Density-based clusters” and “Shared Property or Conceptual Clusters”. The objects within each cluster may be described by numerous attributes, sometimes thousands. Such diverse data is common in fields such as “computer vision applications”, “pattern recognition” and “molecular biology” [2][4], and these examples indicate that the data has many dimensions. This poses challenges for clustering algorithms: for example, when a large volume of data is placed in a high-dimensional space, distance measurements become less meaningful, because high dimensionality undermines the very notion of distance between points.
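The loss of meaning of distance in high dimensions can be made concrete with a small experiment. The sketch below is an illustration under assumptions: points are drawn uniformly at random from the unit hypercube, distance is Euclidean, and the function name relative_contrast and its parameters are hypothetical.

```python
import math
import random

def relative_contrast(dimensions, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one query point.

    A large value means near and far points are clearly distinguishable;
    a value close to zero means all points look almost equally distant.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dimensions)]
    distances = [
        math.dist(query, [rng.random() for _ in range(dimensions)])
        for _ in range(n_points)
    ]
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest

for d in (2, 10, 100, 1000):
    print(d, round(relative_contrast(d), 3))
```

Under these assumptions the contrast shrinks as the number of dimensions grows: the nearest and farthest points end up at almost the same distance from the query, which is why distance-based clustering becomes harder on high-dimensional data.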
...
...