Distributed Clustering for Scaling Classic Algorithms

by

Prodip Hore

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science, Department of Computer Science and Engineering, College of Engineering, University of South Florida.

Major Professor: Lawrence Hall, Ph.D.
Rafael Perez, Ph.D.
Dmitry Goldgof, Ph.D.

Date of Approval: July 1, 2004

Keywords: Ensemble, Merging, Filtering, Disputed Examples, Extrema

Copyright 2004, Prodip Hore

Acknowledgements

I am grateful to my major professor, Dr. Lawrence Hall, for his continuous support and help. I also appreciate the help of the other members of my committee: Dr. Rafael Perez and Dr. Dmitry Goldgof. I am also indebted to my parents, Suvash Chandra Hore and Rekha Rani Hore, and other members of my family for their support and inspiration.

Table of Contents

List of Tables
List of Figures
Abstract
Chapter 1 Introduction and Related Work
Chapter 2 Background
Chapter 3 Partition Merging
  3.1 Distance Matrix
  3.2 Local Chain Matrix
  3.3 Global Chain Matrix
  3.4 Distributed Combining Algorithm
  3.5 Computing Disputed Examples
Chapter 4 Centroid Filtering
Chapter 5 Data and Experiments
  5.1 Iris Plant Database
  5.2 MRI Database
  5.3 Synthetic Data
  5.4 Plankton Data
  5.5 Initialization of Centroids
  5.6 Experimental Setup
  5.7 Iris Experiments
  5.8 MRI Experiments
  5.9 Using Harmony to Filter Centroids in the MRI Data Set
  5.10 Synthetic Data Experiments
  5.11 Plankton Data Experiments
Chapter 6 Discussion and Conclusions
References

List of Tables

Table 5.1 Results of Clustering (Fuzzy) Iris Global Data Set and 50 Random Initializations of Centroids
Table 5.2 Results of Clustering (Hard) Iris Global Data Set and 50 Random Initializations of Centroids
Table 5.3 Results of our D-Combining Algorithm (Iris Data).
Fuzzy k-means Applied to Each Subset
Table 5.4 Results of our D-Combining Algorithm (Iris Data). Hard k-means Applied to Each Subset
Table 5.5 Results of Clustering (Fuzzy) MRI Global Data Set and 50 Random Initializations of Centroids
Table 5.6 Results of Clustering (Hard) MRI Global Data Set and 50 Random Initializations of Centroids
Table 5.7 Results of our D-Combining Algorithm (MRI Data). Fuzzy k-means Applied to Each Subset
Table 5.8 Results of our D-Combining Algorithm (MRI Data). Hard k-means Applied to Each Subset
Table 5.9 Using Harmony to Filter Centroids in the MRI Data Set
Table 5.10 Results of Clustering (Fuzzy and Hard) Artificial Global Data Set and 50 Random Initializations of Centroids
Table 5.11 Results of our D-Combining Algorithm without Filtering Cluster Centers
Table 5.12 Results of our D-Combining Algorithm after Filtering Cluster Centers Using Harmony

List of Figures

Figure 1.1 Overall Procedure of Distributed Clustering
Figure 2.1 Taxonomy of Clustering Algorithms
Figure 2.2 Example of Data Set
Figure 2.3 Example of Hierarchical Clustering
Figure 2.4 Example of Hard Clustering
Figure 2.5 Example of Fuzzy Clustering
Figure 2.6 Hard k-means Algorithm
Figure 2.7 Fuzzy k-means Algorithm
Figure 3.1 Example of Bipartite Matching
Figure 3.2 Example of Formation of Global Chain Matrix from 3 Local Chain Matrices
Figure 3.3 Consensus Chain Algorithm
Figure 3.4 Distributed (D-)Combining Algorithm
Figure 3.5 Algorithm for Confusion Matrix
Figure 3.6 Example Label Vector of Global Partition of 10 Examples
Figure 3.7 Example Label Vector Formed by D-Combining Algorithm
Figure 3.8 Confusion Matrix and the Matched Label Pairs
Figure 4.1 Example of Perfect Harmony among Centroids in a Consensus Chain
Figure 5.1 Synthetic 2D Gaussian Distributed Data
Figure 5.2 Experiments with Iris Data
Figure 5.3 Result of D-Combining Algorithm on Iris Data with Semi-Random Initialization
Figure 5.4 Result of D-Combining Algorithm on Iris Data with Pure-Random Initialization
Figure 5.5 Comparison of Global Fuzzy Clustering of MRI Data with D-Combining Algorithm
Figure 5.6 Plotting of Disputed Examples of MRI Data (Hard k-means, k=7)
Figure 5.7 Plotting of Disputed Examples of MRI Data (Fuzzy k-means, k=7)
Figure 5.8 Effect of Applying Harmony Algorithm to Consensus Chains of MRI Data (Using Hard k-means, k=7)
Figure 5.9 Effect of Applying Harmony Algorithm to the Consensus Chains of Synthetic Data (Using Fuzzy k-means, k=4)
Figure 5.10 Effect of Applying Harmony Algorithm to the Consensus Chains of Synthetic Data (Using Hard k-means, k=4)
Figure 5.11 Extrema of Global Clustering of Synthetic Data
Figure 5.12 Effect of Applying Harmony Algorithm to the Extrema Patterns of Synthetic Data with Pure-Random Initialization (Using Hard k-means, k=4)
Figure 5.13 Effect of Applying Harmony Algorithm to the Extrema Patterns of Synthetic Data with Semi-Random Initialization (Using Hard k-means, k=4)

Distributed Clustering for Scaling Classic Algorithms

Prodip Hore

ABSTRACT

Clustering large data sets has recently emerged as an important area of research. The ever-increasing size of data sets and the poor scalability of clustering algorithms have drawn attention to distributed clustering for partitioning large data sets. Centrally pooling the distributed data is sometimes expensive, and there may also be constraints on data sharing between different distributed locations due to the privacy, security, or proprietary nature of the data. In this work we propose an algorithm to cluster large-scale data sets without centrally pooling the data. Data at distributed sites are clustered independently, i.e. without any communication among the sites. After partitioning the local/distributed sites we send only the centroids of each site to a central location.
Thus there is very little bandwidth cost in a wide-area network scenario. The distributed sites/subsets exchange neither cluster labels nor individual data features, thus providing a framework for privacy-preserving distributed clustering. Centroids from each local site form an ensemble of centroids at the central site. Our assumption is that data in all distributed locations are from the same underlying distribution, and that the set of centroids obtained by partitioning the data in each subset/distributed location gives us partial information about the position of the cluster centroids in that distribution. The problem of finding a global partition using the limited knowledge of the ensemble of centroids can then be viewed as the problem of reaching a global consensus on the position of cluster centroids. A global consensus on the position of the cluster centroids of the global data, using only the very limited statistics of the centroid positions from each local site, is reached by grouping the centroids into consensus chains and computing the weighted mean of the centroids in each consensus chain to represent a global cluster centroid. We compute the Euclidean distance of each example from the global set of centroids and assign it to the nearest centroid. Experimental results show that the quality of the clusters generated by our algorithm is similar to the quality of the clusters generated by clustering all the data at once. We have shown that the disputed examples between the clusters generated by our algorithm and by clustering all the data at once lie on the borders of clusters, as expected. We also propose a centroid-filtering algorithm to improve the partitions formed by our algorithm.

Chapter 1
Introduction and Related Work

Unlabeled data can be grouped or partitioned into a set of clusters in many ways. There are hierarchical clustering algorithms [17], iterative clustering algorithms, single-pass clustering algorithms, and more [21].
Clustering data is often considered to be a slow process. This is especially true of iterative clustering algorithms such as the k-means family [25]. As larger unlabeled data sets become available, the scalability of clustering algorithms becomes important. In recent years a number of new clustering algorithms have been introduced to address the issue of scalability [8, 10, 11, 12, 13, 14]. Various methods of accelerating k-means have also been studied [4, 24, 25, 26, 30]. All the above algorithms assume that the clustering algorithm is applied to all the data centrally pooled in a single location. Distributed computing has also emerged as an important area of research, both for scaling the clustering process and because of inherent problems in pooling data from distributed locations into a centralized location for extracting knowledge. Combining multiple partitions has been studied a number of times [1, 2, 5, 6, 7, 27, 28, 29, 31]. In [9] a parallel version of k-means was proposed which requires synchronized communication during each iteration, which might become difficult and costly in a wide-area network. Moreover, there might be constraints under which data cannot be shared between different distributed locations due to the privacy, security, or proprietary nature of the data. Extracting knowledge from these types of distributed locations under restraints on data exchange is called privacy-preserving data mining. There has been some work where data in distributed form has been clustered independently, i.e. without any message passing among sites, and the multiple partitions combined using limited knowledge sharing [6, 7, 27, 28, 29]. A knowledge-reuse framework [3] has also been explored, where label vectors of different partitions are combined without using any feature values [1, 29].
In [6, 27] distributed clustering is discussed under two different strict settings that impose severe constraints on the nature of the data or knowledge shared between local data sites. In [7], local sites are first clustered using the DBSCAN algorithm, and then representatives from each local site are sent to a central site, where DBSCAN is applied again to find a global model. Another density-estimation-based distributed clustering approach is discussed in [28]. Some work on distributed data mining has also been done for association rule mining with limited knowledge sharing under the banner of privacy-preserving data mining [32, 34, 35].

In this thesis we propose a distributed clustering framework under strictly limited knowledge sharing, similar to the setting in [6], where distributed sites do not allow sharing of cluster labels or attributes of objects/examples among them. In [6] generative or probabilistic model parameters are sent to a central site, where virtual samples are generated and clustered using an EM algorithm [21, 23] to obtain the global model. In our approach we cluster each distributed location/local site independently, using a hard k-means or a fuzzy k-means algorithm, and send only the cluster centroids to a central location, where they form an ensemble of centroids. Our assumption is that data in all subsets or distributed locations are from the same underlying distribution, and that the set of centroids obtained by partitioning the data in each subset/distributed location gives us partial information about the position of the global cluster centroids. The problem of finding a global partition using the limited knowledge of the ensemble of centroids can then be viewed as the problem of reaching a global consensus on the position of cluster centroids.
We reach a global consensus on the position of the cluster centroids of the global data by integrating the very limited statistics of the centroid positions from each local site. Our approach also introduces an additional framework for filtering, or removing, inappropriate centroids from the merging process, whose inclusion would otherwise distort the global partition. In a real-life scenario this type of framework is useful because one or more distributed sites might be very noisy. It enables the central site to analyze the merging process rather than blindly combining all the partitions.

This approach can be applied to many existing algorithms, which might be labeled centroidal. That is, these algorithms iteratively produce cluster centroids, which are representative of the data assigned to each cluster. Examples of this type of clustering algorithm are the k-means clustering algorithms (hard and fuzzy) and the EM algorithm [15]. Almost any type of clustering algorithm can be made into a centroidal algorithm by simply creating cluster centroids to represent the clustered data. Figure 1.1 shows the overall procedure of our algorithm.

In Chapter 2, the particular clustering algorithms used in the experiments are discussed. Chapter 3 describes how the centroids from different subsets/distributed sites are integrated to form the centroids of the global partition. In Chapter 4 a centroid-filtering algorithm is discussed to filter out noisy centroids, and Chapter 5 presents the experimental data sets and results. Chapter 6 is a summary and discussion.

[Figure 1.1 Overall Procedure of Distributed Clustering: the global data set is divided into subsets 1 through m; fuzzy k-means or hard k-means is applied to each subset independently; the centroids formed by clustering the partial data are accessed and merged to form the global set of centroids, yielding the final global partition of the data.]

Chapter 2
Background

Clustering is one of the most important unsupervised learning problems. In unsupervised learning the data has no predefined labels. Clustering can be viewed as the process of organizing objects into groups whose members are similar in some way. Objects, examples, or patterns are described by the features that define them. Clustering algorithms group these examples into clusters such that examples within a cluster are more similar to each other than to examples in other clusters. There are many varieties of clustering algorithms, but generally most of them fall under one of two broad categories: hierarchical or partitional algorithms. Figure 2.1 shows a taxonomy of clustering algorithms. There are also many other varieties of clustering algorithms [23], but we show the taxonomy of the most commonly used ones.

[Figure 2.1 Taxonomy of Clustering Algorithms: clustering divides into hierarchical algorithms and partitional algorithms; the partitional branch includes hard k-means, fuzzy k-means, and EM algorithms.]

Hierarchical clustering algorithms do not form a final partition of the data but yield a dendrogram representing the nested grouping of patterns and the similarity levels at which groupings change [23].

Illustrative example: Consider the following 5 examples E1, E2, E3, E4, and E5 in 2-dimensional space (Figure 2.2), and let us assume that the spatial proximity among them is a measure of their similarity to each other.

[Figure 2.2 Example of Data Set: the five points E1 through E5 plotted in two dimensions, with axes X1 and X2.]

Applying a hierarchical algorithm to the above data set could yield a dendrogram as shown in Figure 2.3.

[Figure 2.3 Example of Hierarchical Clustering: a dendrogram over the five examples, with a similarity axis showing the levels at which groupings merge.]

We will not discuss hierarchical algorithms further because our work is based on scaling partitional algorithms in a distributed way.
Partitional clustering algorithms produce a single partition of the data by iteratively optimizing a clustering criterion function, or objective function [23]. Hard k-means results in a crisp partition of the data, while fuzzy k-means assigns a degree of membership to each example in the data set, i.e. how strongly an example is associated with a cluster. The membership of a particular example in a cluster indicates how well it fits in that cluster. Applying hard k-means to the data set (Figure 2.2) could yield the two clusters shown in Figure 2.4.

[Figure 2.4 Example of Hard Clustering: the five examples partitioned into two crisp clusters, Cluster 1 and Cluster 2, with each example belonging to exactly one cluster.]

Applying fuzzy k-means to the data set (Figure 2.2) could yield the two overlapping clusters shown in Figure 2.5. Here the example E5 belongs to both clusters, but its membership degree is higher for cluster 2 than for cluster 1.

[Figure 2.5 Example of Fuzzy Clustering: two overlapping clusters over the five examples, with E5 falling in the overlap region.]

Similarity measures among examples or patterns are of fundamental importance and must be chosen carefully [23]. We have used a well-known metric, the Euclidean distance, for measuring dissimilarity among examples whose features are all continuous values. If every example or pattern has dimension d, i.e.
d features, then the dissimilarity between any two examples x_i and x_j is measured as follows:

    d(x_i, x_j) = \left( \sum_{k=1}^{d} (x_{i,k} - x_{j,k})^2 \right)^{1/2} = \|x_i - x_j\|    (1)

The widely used criterion function, or objective function, which hard k-means and fuzzy k-means try to optimize is the squared-error function:

    \min J_m(U, V) = \sum_{i=1}^{c} \sum_{k=1}^{N} u_{ik}^m D_{ik}(x_k, v_i)    (2)

where U contains the cluster memberships, with u_{ik} \in [0, 1] if fuzzy or u_{ik} \in \{0, 1\} if hard, and \sum_{i=1}^{c} u_{ik} = 1; V = (v_1, v_2, ..., v_c), where v_i specifies the i-th cluster center of dimension p; m \geq 1 is a weighting exponent that controls the degree of fuzzification of U; and D_{ik}(x_k, v_i) = D_{ik} is the deviation of x_k from the i-th cluster prototype, with D_{ik} = \|x_k - v_i\|^2.

We now describe the hard k-means and fuzzy k-means algorithms using the Euclidean distance metric, in Figure 2.6 and Figure 2.7 respectively.

Input: Data set of n examples with s features, and the value of k (number of clusters).
Output: Partition of the input data into k regions.
1. Declare an n x k membership matrix U.
2. Randomly generate k cluster center locations within the range of the data, or randomly select k examples as initial cluster centroids. Let the centroids be c_1, c_2, ..., c_k.
3. Calculate the distance measure d_{i,j} = \|x_i - c_j\|^2 according to equation (1), for all cluster centroids j = 1 to k and data examples i = 1 to n.
4. Compute the membership matrix U as follows:
       u_{j,i} = 1 if \|x_i - c_j\|^2 \leq \|x_i - c_l\|^2 for all l \neq j; u_{j,i} = 0 otherwise
   (ties are broken randomly).
5. Compute new cluster centroids
       c_j = \sum_{i=1}^{n} u_{j,i}^m x_i / \sum_{i=1}^{n} u_{j,i}^m    for j = 1 to k.
   Note that the cluster centers in hard k-means are just the centroids of the points in a cluster.
6. Repeat steps 3 to 5 until the change in U in two successive iterations is less than a given threshold.

Figure 2.6 Hard k-means Algorithm

Input: Data set of n examples, the value of k (number of clusters), and a fuzzification value m > 1 (we have used the value 2).
Output: Partition of the input data into k regions.
1. Declare an n x k membership matrix U.
2.
Randomly generate k cluster center locations within the range of the data, or randomly select k examples as initial cluster centroids. Let the centroids be c_1, c_2, ..., c_k.
3. Calculate the distance measure d_{i,j} = \|x_i - c_j\|^2 according to equation (1), for all cluster centroids j = 1 to k and data examples i = 1 to n.
4. Compute the fuzzy membership matrix as follows:
       u_{j,i} = \left[ \sum_{l=1}^{k} (d_{i,j} / d_{i,l})^{1/(m-1)} \right]^{-1}  if d_{i,j} > 0;
       u_{j,i} = 1  if d_{i,j} = 0.
5. Compute new cluster centroids
       c_j = \sum_{i=1}^{n} u_{j,i}^m x_i / \sum_{i=1}^{n} u_{j,i}^m    for j = 1 to k.
6. Repeat steps 3 to 5 until the change in U in two successive iterations is less than a given threshold.

Figure 2.7 Fuzzy k-means Algorithm

Although we have not used the EM clustering algorithm in this work, one can also apply it to cluster data. In fact, any partitioning algorithm whose clusters can be represented by centroids can use our scaling method to cluster large-scale data. The EM algorithm hypothesizes that the patterns or examples of a data set are drawn from some distribution and proceeds with the goal of identifying that distribution by iteratively updating the hypothesis. The most commonly used distributional assumption is Gaussian.

Chapter 3
Partition Merging

After clustering is applied to each subset/distributed site, a set of centroids is available which describes each partitioned subset. Our assumption is that data in all subsets or distributed locations are from the same underlying distribution, and that the set of centroids obtained by partitioning the data in each subset/distributed location gives us partial information about the position of the cluster centroids of the global data. Clustering or partitioning a subset/distributed location (say i) will produce a set of centroids \{C_{i,j}\}_{j=1}^{k}, where k is the number of clusters. For m subsets/distributed locations we will have m sets of centroids, i.e. \{C_{1,j}\}_{j=1}^{k}, \{C_{2,j}\}_{j=1}^{k}, ..., \{C_{m,j}\}_{j=1}^{k}, forming an ensemble of centroids at the central site.
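The distributed stage just described can be sketched in a few lines of numpy. This is a minimal illustration, not the thesis implementation: the Lloyd-style hard k-means, the random initialization, and the helper names (`hard_kmeans`, `local_summaries`) are our assumptions. Each site ships only its centroids and subset size; no labels or raw features leave a site.

```python
import numpy as np

def hard_kmeans(X, k, iters=100, seed=0):
    """Plain Lloyd-style hard k-means, initialized with k random examples."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign every example to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each centroid as the mean of its assigned examples;
        # keep the old centroid if a cluster ends up empty
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids, labels

def local_summaries(subsets, k):
    """Cluster each subset independently; ship only (centroids, subset size)
    to the central site."""
    return [(hard_kmeans(S, k)[0], len(S)) for S in subsets]
```

The subset sizes returned alongside the centroids are exactly the weights later used when averaging a consensus chain into a global centroid.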
Now, the problem of finding a global partition using the limited knowledge of the ensemble of centroids can be viewed as the problem of reaching a global consensus on the position of the centroids of the global data. One way to reach a global consensus is to group the ensemble of centroids into k consensus chains, where each consensus chain contains m (the number of subsets/distributed locations) centroids c_n^1, ..., c_n^m, one from each partition, with n ranging from 1 to k. The aim is to group similar centroids in each consensus chain. The objective is to find the globally optimal assignment, out of all possible families f of centroid assignments to k consensus chains:

    f^* = \arg\min_f \sum_{n=1}^{k} cost(consensus\_chain_n)    (1)

where

    cost(consensus\_chain_n) = \sum_{i=1}^{m} D_n(i)    (2)

and

    D_n(i) = \frac{1}{2} \sum_{j=1, j \neq i}^{m} d(c_n^i, c_n^j)    (3)

where d(., .) is the distance function between centroid vectors in a consensus chain. We have used the Euclidean distance in computing the cost (3).

After the consensus chains are created, we simply compute the weighted arithmetic mean of the centroids in each consensus chain to represent a global centroid, where the weight of a centroid is determined by the size of the subset/distributed location from which it came.

In other words, if we represent the centroids of each partition by nonadjacent vertices of a graph, and the Euclidean distance between a centroid of one partition and those of the other partitions by weighted edges, then finding the globally optimal value of the objective function (2) reduces to the minimally weighted perfect m-partite matching problem, which is NP-complete for m > 2. So we need another approach. If the consensus chains are formed by optimizing the objective function (1), i.e. m-partite minimally weighted perfect matching, then the centroids in each consensus chain are assigned with minimum cost (2), giving optimal consensus among the centroids in a chain on a global centroid position. But optimizing the objective function (1) is intractable.
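Although the assignment itself is intractable to optimize exactly, the cost of any given family of chains is cheap to evaluate. A minimal numpy sketch of Equations (1)-(3) (the function names are ours): summing D_n(i) over a chain counts each unordered centroid pair once, which is half the sum of the full pairwise-distance matrix.

```python
import numpy as np

def chain_cost(chain_cents):
    """cost(consensus_chain_n) per Equations (2)-(3): half the sum of the
    full pairwise Euclidean distance matrix of the chain's centroids."""
    D = np.linalg.norm(chain_cents[:, None, :] - chain_cents[None, :, :],
                       axis=2)
    return 0.5 * D.sum()

def total_cost(chains):
    # the Equation (1) objective value for one family of k chains
    return sum(chain_cost(c) for c in chains)
```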
Thus we have used a heuristic algorithm to group the centroids into k consensus chains. For 2 partitions we have a polynomial-time algorithm, minimally weighted perfect bipartite matching [33], which globally optimizes the above objective function. We pick 2 partitions at random and group their centroids into k consensus chains by globally optimizing the objective function using minimally weighted perfect bipartite matching (Figure 3.1 shows an example). Each consensus chain then contains 2 centroids (a matched pair), one from each partition. Next we pick the centroids of one of these already-assigned partitions and a new partition, chosen randomly, again optimize the objective function for these two partitions, and put each centroid of the new partition in the same consensus chain as its matched centroid from the assigned partition. We continue grouping the centroids of the partitions into consensus chains one by one in this way until the partitions are exhausted. One can also use a greedy approach instead of minimally weighted bipartite matching for matching the centroids of two partitions. We have observed that the greedy approach and bipartite matching give the same result on average. In Section 3.1 we describe the greedy approach, and all experimental results use it; in the future we plan to replace it with the Hungarian method for perfect bipartite matching [33].

[Figure 3.1 Example of Bipartite Matching: two sets of vertices joined by edges with weights 7, 1, 6, 5, 6, 4, 3, 6, 9; the thick blue lines show the minimum-cost bipartite matching.]

3.1 Distance Matrix

The distance matrix stores the Euclidean distances among the centroid vectors of any two selected subsets. If there are C clusters in each subset, the dimension of the distance matrix will be C x C.
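In numpy terms, the C x C distance matrix between the centroid sets of two subsets can be formed in a single broadcast step (a sketch; the function name is ours):

```python
import numpy as np

def distance_matrix(cents_a, cents_b):
    # entry (i, j) holds the Euclidean distance between the i-th centroid
    # of the first subset and the j-th centroid of the second; for C
    # clusters per subset the result is C x C
    return np.linalg.norm(cents_a[:, None, :] - cents_b[None, :, :], axis=2)
```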
3.2 Local Chain Matrix

The local chain matrix stores the matched-centroid correspondence between any two selected subsets/distributed locations, i.e. which centroid in the first selected subset matched which centroid in the second. If there are C clusters in each subset, the dimension of the local chain matrix will be C x 2.

3.3 Global Chain Matrix

The global chain matrix uses the information from the local chain matrices to build the global consensus chains of centroids. Each row of the global chain matrix represents a consensus chain. If there are C clusters in each subset and M such subsets, the dimension of the global chain matrix will be C x M.

For a given pair of subsets, we first find the distances between all centroid vectors. This gives us a distance matrix where each entry (i, j) is the distance from the i-th centroid in one partition to the j-th centroid in the other; the Euclidean distance is used here. The smallest distance entry provides the first pair of matched centroids. The next smallest distance provides the second pair of matched centroids, and so on, until all centroids are paired. There is a difficulty if 2 centroids in one partition are closest to a single centroid in the other partition: in this case, the second centroid encountered will be matched with a centroid which is not its closest. We say a collision has occurred. So, every centroid in one partition is mapped to the closest centroid in the partition for which a correspondence is being generated; if the closest centroid has already been paired with another centroid, the mapping is made to the next closest centroid which has not previously been paired. In this way, each centroid of a partition is mapped to a corresponding centroid in the other partition. A global mapping is obtained by applying transitivity to the matched pairs of partitions, as described earlier.
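The greedy pairing with collision handling, and the transitive chaining across partitions, can be sketched as follows. This is an illustrative reconstruction, not the thesis code: it chains partitions in the given order rather than randomly, and the function names are ours.

```python
import numpy as np

def greedy_match(D):
    """Pair centroids of two partitions greedily: repeatedly take the
    smallest remaining entry of the distance matrix D, skipping entries
    whose row or column is already paired (a collision)."""
    C = D.shape[0]
    pairs, used_r, used_c = {}, set(), set()
    for i, j in sorted(((i, j) for i in range(C) for j in range(C)),
                       key=lambda ij: D[ij]):
        if i not in used_r and j not in used_c:
            pairs[i] = j                      # centroid i of A -> j of B
            used_r.add(i); used_c.add(j)
            if len(pairs) == C:
                break
    return pairs

def global_chains(centroid_sets):
    """Extend local matchings transitively: each row of the result is a
    consensus chain holding one centroid index per partition."""
    chains = [[j] for j in range(len(centroid_sets[0]))]
    for a, b in zip(centroid_sets, centroid_sets[1:]):
        D = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
        pairs = greedy_match(D)
        for chain in chains:
            chain.append(pairs[chain[-1]])    # follow the transitive link
    return chains
```

With three partitions whose centroids are permutations of one another, each resulting chain collects the indices of the same physical centroid across all partitions.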
Consider four partitions p_1, p_2, p_3, p_4. After finding a correspondence between p_1 and p_2, between p_2 and p_3, and between p_3 and p_4, by transitivity there is a mapping from each of the cluster centroids in p_1 to those in p_4.

Illustrative example: Consider the case where there are 4 subsets of data and each subset is grouped into 3 clusters. Let S1, S2, S3, and S4 be the subsets. Figure 3.2 (a) shows the three local chain matrices and Figure 3.2 (b) shows the global chain matrix. Figure 3.3 shows the consensus chain algorithm, which merges the local chain matrices to form the global chain matrix.

(a) The three local chain matrices; the arrows indicate the transitive relation between them:

    S1 S2      S2 S3      S3 S4
     2  2       3  1       3  1
     1  1       1  3       1  2
     3  3       2  2       2  3

(b) Global chain matrix formed from the above three local chain matrices; each row is a consensus chain:

    S1 S2 S3 S4
     2  2  2  3
     1  1  3  1
     3  3  1  2

Figure 3.2 Example of Formation of Global Chain Matrix from 3 Local Chain Matrices

We call each row of the global chain matrix a consensus chain. It tells us which centroid in which subset is matched to which centroid(s) in the other subsets. If the global chain matrix is formed without any collision at any stage, we say that the centroids of the subsets have mapped perfectly. The whole process is known as centroid mapping. The input to the algorithm (Figure 3.3) is only the set of centroids of the subsets. The number of centroids of a data set is generally not a function of the number of examples, so the algorithm generally takes constant time.

After solving the matching problem, a consensus chain contains centroids of similar type. In this work, as stated earlier, we weight each centroid by the number of examples in its subset and then create a weighted-average centroid. So, suppose we have 4 partitions and 3 clusters, and consider the first cluster of the first partition.
Assume that the subset it was created from contains 100 examples; it matches the second cluster of the second partition, which has 200 examples; this in turn matches the second cluster of the third partition, which has 50 examples; and this matches the first cluster of the fourth partition, which has 100 examples. Let the cluster centers be denoted c_{ij} for the j-th cluster of the i-th partition. Each cluster center is a vector of dimension s, for s features. So we could create one cluster center of the global partition, call it g_1:

    g_1 = (100 c_{11} + 200 c_{22} + 50 c_{32} + 100 c_{41}) / 450

After the global cluster centroids are created, all of the clustered examples can be assigned to the nearest centroid using a distance metric. In this work, the Euclidean distance is used. To avoid sending lots of data between processors, the cluster centroids themselves can be passed to the other processors and the cluster assignment of the examples made locally.

Large data sets may have many extrema or saddle points to which a clustering algorithm will converge, depending upon the initialization. If some of them are significantly different, this will present a problem in combining centroids from different partitions. Even if two partitions had exactly the same data but were initialized differently, the final partitions could be significantly different. In order to filter clusters that come from very different partitions, we developed the harmony algorithm. It removes noisy or poorly matching centroids from a chain of centroids.

Input: Centroids of subset partitions
Output: Global chain matrix
1. M = number of subsets, C = number of clusters in each subset, I = 1, J = 2. The global chain matrix is initialized to zero. Number the subsets/distributed sites 1 to M randomly.
2. While (J <= M) {
   2.1 Initialize the local chain matrix to zero.
Select subset I and subset J and compute the Euclidean distances among their centroid vectors; store them in the distance matrix such that position (row, col) of the distance matrix is the Euclidean distance between the row-th centroid vector of subset I and the col-th centroid vector of subset J.
   2.2 Find the minimum value in the distance matrix and record its position (row, col) in the local chain matrix.
   2.3 While (the local chain matrix is not completely filled) {
       2.3.1 Find the next minimum value in the distance matrix. If a collision does not occur, record the position (row, col) in the local chain matrix.
   }
   2.4 If (I == 1), the correspondence relation in the local chain matrix forms the first two columns of the global chain matrix. Else, fill the J-th column of the global chain matrix using the local chain matrix and the transitive relation it has with the I-th column of the global chain matrix.
   2.5 I = I + 1, J = J + 1.
}

Figure 3.3 Consensus Chain Algorithm

3.4 Distributed Combining Algorithm

We now present the complete algorithm (Figure 3.4) for clustering large-scale data without clustering all the data at once.

Input: Global data
Output: Partition of the global data
1. Divide the data into M subsets. In the case of distributed sites the data is already divided, so there is no need to perform this step.
2. Cluster each subset/distributed location with a standard fuzzy k-means or hard k-means algorithm.
3. Call the consensus chain algorithm to form the global chain matrix.
4. For each consensus chain, compute the weighted (by the number of examples in each subset) arithmetic mean of the centroids in that chain. The weighted mean of the centroids in each consensus chain represents the centroid of one cluster of the global partition.
5. Compute the Euclidean distance between each example and the global set of centroids. Assign the example to the nearest cluster centroid.
Figure 3.4 Distributed (D-)Combining Algorithm

Since we cluster only part of the data in each subset, a speedup is expected [4]. As mentioned earlier, step 3 generally takes constant time. Step 5 takes O(n) time, where n is the number of examples in the global data.

3.5 Computing Disputed Examples

To measure the quality of the clusters, the partition formed by our Distributed-Combining algorithm is compared to the partition formed by global clustering, and we call the examples that are placed in different clusters in these two partitions disputed examples. If C clusters are formed in a partition, then the examples in each cluster can be assigned a label number; a collection of such label numbers is called a label vector. If a partition has C clusters, its label numbers range from 1 to C. For computing disputed examples we need to solve the correspondence relation between the label vector produced by our D-Combining algorithm and the label vector of the most typical global partition. The correspondence relation between these two label vectors can be solved using logic like that of the consensus chain algorithm, with a small modification: the distance matrix is replaced by the confusion matrix formed from the two label vectors, and instead of finding the C non-colliding smallest values in a distance matrix, we find C non-colliding values in the confusion matrix to pair up the labels of the two partitions such that their sum is globally maximal (using the Hungarian method for max-cost assignment [33]). Disputed examples provide a measure of the quality of the partition found by our D-Combining algorithm, i.e. how much our partition differs from a global partition. The complete algorithm for computing disputed examples is shown in Figure 3.5.

Input: Two label vectors whose disputed examples are to be found.
Output: Number of disputed examples between the 2 input label vectors.
1.
Declare a C x C confusion matrix and initialize it to zero.
2. Read the first label number of the first input label vector and the first label number of the second input label vector.
3. Increment the (row, col) position of the confusion matrix by one, where row = label number from the first input label vector and col = label number from the second input label vector.
4. Read the next label number from each input label vector and perform step 3.
5. Continue steps 3 and 4 until all labels have been read from both files.
6. Select C non-colliding optimally matched labels using the Hungarian method for the (max-cost) assignment problem.
7. Add up all the values outside the positions of the matched labels in the confusion matrix to find the number of disputed examples.

Figure 3.5 Algorithm for Confusion Matrix

Illustrative Example: Suppose there are 10 examples in the global data, the most typical global partition produced 3 clusters, and its label vector is as shown in Figure 3.6. Assume that all examples of the first cluster have label 1, those of the second cluster label 2, and those of the third cluster label 3. Note that cluster label numbers are symbolic, i.e. cluster label 1 in one partition is not necessarily the same as cluster label 1 in another partition.

Example Number:  1  2  3  4  5  6  7  8  9  10
Label:           1  1  1  2  2  2  3  3  3  3

Figure 3.6 Example Label Vector of Global Partition of 10 Examples

Now assume the label vector produced by our D-combining algorithm is as shown in Figure 3.7:

Example Number:  1  2  3  4  5  6  7  8  9  10
Label:           3  3  3  2  2  2  2  1  1  1

Figure 3.7 Example Label Vector Formed by D-Combining Algorithm

The confusion matrix formed (as described in Figure 3.5) from these two label vectors (Figures 3.6 and 3.7) is shown in Figure 3.8 (a), and Figure 3.8 (b) shows the matched label pairs.
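Applied to the label vectors of Figures 3.6 and 3.7, the procedure of Figure 3.5 can be sketched as follows (for a small number of clusters C, a brute-force search over label permutations stands in for the Hungarian max-cost assignment; the function name is ours):

```python
from itertools import permutations

def disputed_examples(labels_a, labels_b, c):
    """Build the C x C confusion matrix from two label vectors (labels 1..C)
    and count the examples left outside the max-cost label matching."""
    conf = [[0] * c for _ in range(c)]
    for a, b in zip(labels_a, labels_b):
        conf[a - 1][b - 1] += 1
    # Brute-force max-cost assignment in place of the Hungarian method.
    best = max(permutations(range(c)),
               key=lambda p: sum(conf[i][p[i]] for i in range(c)))
    matched = sum(conf[i][best[i]] for i in range(c))
    return len(labels_a) - matched

# Worked example from Figures 3.6 and 3.7:
g = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
d = [3, 3, 3, 2, 2, 2, 2, 1, 1, 1]
print(disputed_examples(g, d, 3))  # -> 1
```

The single disputed example is example 7, which global clustering labels 3 but the D-combining partition labels 2, in agreement with the count 0+0+0+0+1+0 = 1 derived from Figure 3.8.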
The (i, j) entry of the table (Figure 3.8 a) is the number of times label i from the global partition co-occurs with label j from the D-combining algorithm. The rows represent the label numbers of the global partition (Label_g) and the columns the label numbers of the D-combining algorithm (Label_d).

            Label_d 1   Label_d 2   Label_d 3
Label_g 1       0           0           3
Label_g 2       0           3           0
Label_g 3       3           1           0

(a) Confusion matrix

Label_g 1 -> Label_d 3
Label_g 2 -> Label_d 2
Label_g 3 -> Label_d 1

(b) Matched label pairs

Figure 3.8 Confusion Matrix and the Matched Label Pairs

Applying the Hungarian max-cost assignment algorithm [33] (taken from Knuth's Stanford GraphBase) to the confusion matrix (Figure 3.8 a) finds 3 non-colliding values whose sum is maximum. The 3 selected values give the matched labels (Figure 3.8 b). The number of disputed examples computed (using Figure 3.8) is: 0+0+0+0+1+0 = 1.

Chapter 4 Centroid Filtering

The centroids of a consensus chain are clearly similar if each of them is the same distance from every other; this would be an unlikely extreme case. An information theoretic formulation is used to measure the harmony of the centroids in a chain, as shown in Equation 4. Initially all centroids are active, i.e. ready to participate in the merging process.

H(CH_i) = -(1 / log2(A)) * sum_{j=1}^{A} (d(j)/tot_d) * log2(d(j)/tot_d)    (4)

where A is the number of active centroids in a consensus chain, d(j) is the sum of the distances of the j-th active centroid vector from all other active centroid vectors in the chain, tot_d = d(1) + d(2) + ... + d(A), and CH_i denotes the chain for the i-th global cluster. After computing the harmony H using all the centroids in the chain, we check whether eliminating one of the centroids from the chain would increase the harmony value above a threshold. If it does, that centroid is made inactive. An inactive centroid will not be used in creating the global centroid.
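Equation 4 and the filtering loop can be sketched as follows (a sketch under the assumption that H is the base-2 entropy of the centroids' distance shares normalized by log2(A), which yields H = 1 for equidistant centroids as in Figure 4.1; the function names and the greedy removal order are our own):

```python
import math

def harmony(centroids):
    """Harmony of the active centroids in a chain (Equation 4): normalized
    entropy of each centroid's share of the total inter-centroid distance."""
    a = len(centroids)
    d = [sum(math.dist(ci, cj) for cj in centroids) for ci in centroids]
    tot = sum(d)
    if tot == 0:
        return 1.0  # all centroids identical: default harmony = 1
    return -sum((dj / tot) * math.log2(dj / tot) for dj in d) / math.log2(a)

def filter_chain(centroids, threshold=0.01):
    """Deactivate a centroid whenever its removal raises harmony by more than
    the threshold; halt when no removal helps or three centroids remain."""
    active = list(centroids)
    while len(active) > 3:
        base = harmony(active)
        gain, idx = max(
            (harmony(active[:i] + active[i + 1:]) - base, i)
            for i in range(len(active))
        )
        if gain <= threshold:
            break
        del active[idx]
    return active
```

A chain of four centroids at the corners of a unit square plus one far-away outlier illustrates the intent: the outlier makes the distance shares unequal (low harmony), so it is the centroid whose removal raises H the most and it is the one deactivated.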
After a centroid is removed, the algorithm can be applied to the shortened chain. It halts when no removal would increase the H value or when only three centroids are left. The filtering algorithm is applied when there is a collision in forming the consensus chain. If all centroids in a consensus chain are exactly the same, the default harmony is 1.

Illustrative Example: Consider the case of three centroids in a consensus chain. The three centroids are in perfect harmony (value 1) if the Euclidean distances among them are equal.

Figure 4.1 Example of Perfect Harmony among Centroids in a Consensus Chain. Three centroids (C1, C2, C3) are equidistant from each other; harmony is maximum, i.e. 1.

Chapter 5 Data and Experiments

We have performed experiments on the small Iris data set because it is tractable, and on an MRI data set with air removed. Experiments have also been done on an artificial data set to examine the effects of subsets that do not have balanced class distributions. An experiment has also been done on a medium-sized data set consisting of features extracted from underwater images of plankton [20].

5.1 Iris Plant Database

The Iris plant data set consists of 150 examples, each with 4 numeric attributes [19]. It consists of 3 classes of 50 examples each. One class is linearly separable from the other two.

5.2 MRI Database

The MRI data set consists of 22,320 examples, each with 3 numeric attributes. The attributes come from T1-weighted, T2-weighted, and PD (proton density) weighted images of the human brain.

5.3 Synthetic Data

500 examples were generated from each of 4 slightly overlapping Gaussian distributions in 2D (2,000 examples in total). The means of the Gaussians are (300, 300), (800, 300), (300, 800), and (800, 800). The standard deviation of each mixture component is 100 (the same in both dimensions). The data is shown in Figure 5.1.

Figure 5.1 Synthetic 2D Gaussian Distributed Data.
Different shades of the +, x, *, and o symbols indicate different clusters.

5.4 Plankton Data

The plankton data consists of 350,000 samples of plankton from the underwater SIPPER camera, which records 8 gray levels. 37 features were extracted. The samples were taken from the twelve most commonly encountered classes of plankton during the acquisition in the Gulf of Mexico. The class sizes range from about 9,000 to 44,000 examples.

5.5 Initialization of Centroids

Iterative clustering algorithms are sensitive to initialization. There are two ways one might approach initialization in a distributed clustering problem. Each subset of data might be given its own random initialization (Pure-random). Alternatively, all subsets of data could be given the same random initialization (Semi-random); this is a low-cost approach to providing a potentially more uniform set of partitions. We report experimental results with both initialization approaches.

5.6 Experimental Setup

The full data was clustered 50 times in a single memory with random centroid initialization within the range of the data for all data sets except the plankton data. We kept a count of the extrema (or saddle points; in the following we will simply say extrema), noting which occurred most often. The average number of examples put into different clusters (disputed examples) by clustering all the data versus distributing the data into smaller subsets and then merging the partitions was recorded. The standard deviation of the disputed examples, comparing each of the extrema with the most often occurring extremum, was also computed. Each data set, with the exception of the plankton data set, was broken into subsets, and each subset was clustered and combined as described above using 50 pure-random and 50 semi-random centroid initializations.
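The two initialization schemes of Section 5.5 differ only in whether the random draw is shared across subsets, as this small sketch illustrates (a hypothetical helper, not code from the thesis):

```python
import random

def init_centroids(n_subsets, k, dim, lo, hi, mode="semi", seed=0):
    """Pure-random: each subset draws its own k centroids; Semi-random: one
    draw within the data range (lo, hi) is shared by every subset."""
    rng = random.Random(seed)
    draw = lambda: [tuple(rng.uniform(lo, hi) for _ in range(dim))
                    for _ in range(k)]
    if mode == "semi":
        shared = draw()
        return [list(shared) for _ in range(n_subsets)]
    return [draw() for _ in range(n_subsets)]
```

Semi-random initialization costs a single extra broadcast of one centroid set, which is why the thesis describes it as a low-cost way to obtain more uniform subset partitions.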
We computed the average number of disputed examples, and their standard deviation, of the extrema found after combination relative to the most typical extremum found during global clustering. Thus we compare the results of our distributed algorithm with the most typical global partition. Since there can be more than one extremum in a data set, we have chosen the clusters of the most typical extremum as the reference frame and plotted the disputed examples of similar types of extrema produced by our combining algorithm. Disputed examples of other types of extrema produced by the D-combining algorithm can also be plotted by choosing the closest extremum (by J value) from the global clustering and making that the reference frame. For the Iris data we used 2 attributes (petal length and petal width) to plot the disputed examples, because these features contain most of the information about the data.

5.7 Iris Experiments

We performed experiments with the Iris data by dividing it randomly into 2 and 3 subsets and clustering each subset using the fuzzy k-means and the hard k-means algorithms. The number of clusters in each subset was set to 3. Tables 5.1 and 5.2 show the average number of disputed examples, and their standard deviation, found during global clustering and measured relative to the most typical extremum. Tables 5.3 and 5.4 show the results from applying the D-combining algorithm. Using 2 subsets and fuzzy k-means on each subset, the same clusters were obtained as with global clustering. With 3 subsets, we got only 2 disputed examples on average. The average standard deviation of disputed examples is 0, as with global clustering. Hence clustering of 3 subsets by the combining algorithm also produces just 1 type of extremum.
We plotted the 2 disputed examples in Figure 5.2 (a) and found that they (the encircled points) lie on the border of the global clusters in an area of overlap, as one might expect, because there is little difference in the positions of the cluster centroids. Clustering the global data with hard k-means gives more than one extremum (Figure 5.2 b). Similarly, our combining algorithm also produces more than one extremum (Figure 5.3 a, b), and the patterns of extrema are quite similar to those of Figure 5.2 b (the extrema of the global clustering). The disputed examples found using hard k-means in each subset also lie on the border (Figure 5.4 a), as expected. Comparison of Figure 5.4 (b) and Figure 5.3 (b) shows that clustering subsets with pure-random centroid initialization may produce more extrema than semi-random initialization for an equivalent number of subsets. This is because the likelihood of combining extrema from different subsets, whose centroids differ significantly, appears larger with pure-random initialization than with semi-random initialization: every subset is initialized with centroids independently of the others, so the probability that the objective function converges to a significantly different J value is high. It has been observed that under this condition the centroids of subsets have more collisions in the consensus chain algorithm.

Table 5.1 Results of Clustering (Fuzzy) Iris Global Data Set and 50 Random Initializations of Centroids (3 clusters, 150 examples)

        Average disputed examples   Standard deviation of disputed examples
Fuzzy   0                           0

Table 5.2 Results of Clustering (Hard) Iris Global Data Set and 50 Random Initializations of Centroids (3 clusters, 150 examples)

        Average disputed examples   Standard deviation of disputed examples
Hard    7.9                         20.63

Table 5.3 Results of Our D-Combining Algorithm (Iris Data).
Fuzzy k-means Applied to Each Subset (K=3 in each subset)

                    Avg. disputed (Semi-random)   Std. dev. (Semi-random)   Avg. disputed (Pure-random)   Std. dev. (Pure-random)
2 Subsets (Fuzzy)   0                             0                         0                             0
3 Subsets (Fuzzy)   2                             0                         2                             0

Table 5.4 Results of Our D-Combining Algorithm (Iris Data). Hard k-means Applied to Each Subset (K=3 in each subset)

                    Avg. disputed (Semi-random)   Std. dev. (Semi-random)   Avg. disputed (Pure-random)   Std. dev. (Pure-random)
2 Subsets (Hard)    4.82                          15.38                     8.74                          20.07
3 Subsets (Hard)    8.3                           11.69                     9.8                           14.43

(a) 2 disputed examples (encircled) for 3 subsets, hard k-means in each subset (Iris data). Different shades of the + and * symbols denote the clusters.
(b) Disputed examples of the extrema of Iris (global hard clustering). [Histogram: counts over disputed-example values per extremum]
Figure 5.2 Experiments with Iris Data

(a) Disputed examples of the extrema of Iris, 2 subsets, hard k-means in each subset (semi-random initialization). [Histogram: counts over disputed-example values 0, 1, 2, 64, 65, 66]
(b) Disputed examples of the extrema of Iris, 3 subsets, hard k-means in each subset (semi-random initialization). [Histogram]
Figure 5.3 Result of D-Combining Algorithm on Iris Data with Semi-Random Initialization

Different shades of the + and * symbols represent different clusters.
(a) Encircled examples showing the 8 disputed examples of Iris data (hard k-means applied to the data in each subset).
(b) Disputed examples of the extrema of Iris data using 3 subsets, hard k-means in each subset (pure-random initialization). [Histogram]
Figure 5.4 Result of D-Combining Algorithm on Iris Data with Pure-Random Initialization

5.8 MRI Experiments

We have performed experiments on the 22,320-example MRI data using 2, 3, 4, and 5 subsets. The MRI data has clusters of varying size and density; one could also use other clustering algorithms on each subset that better deal with clusters of differing size and density. We conducted experiments with the number of clusters, k, set to 7 in each subset. Tables 5.5 and 5.6 show the results of globally clustering the MRI data using fuzzy k-means and hard k-means, respectively. The results of combining the MRI data using 2, 3, 4, and 5 subsets, clustering with fuzzy k-means and hard k-means, are shown in Tables 5.7 and 5.8, respectively. The results show that the average number of disputed examples and the standard deviation of the extrema obtained by our D-combining algorithm with semi-random initialization are quite consistent with those obtained during global clustering of the MRI data (using k=7). This means that the pattern of extrema generated by our D-combining algorithm with semi-random initialization is quite similar to the pattern of extrema generated by clustering all the data at once. Figure 5.5 (a) shows the histogram of the disputed examples of all extrema that occurred during global clustering of the MRI data (using fuzzy k-means, k=7). It has many extrema whose disputed examples are 0 (most typical) or close to zero. There is also an equal number of extrema whose disputed examples are large (6796, at the other end of the histogram in Figure 5.5 a). The extrema generated by our algorithm using 5 subsets (fuzzy k-means in each subset) show a similar pattern (Figure 5.5 b).
We also found some other types of extrema in the middle (Figure 5.5 b), i.e. extrema with no similar counterpart in Figure 5.5 a. This is expected because the J values of many of the extrema in the global data are quite different (Figure 5.5 a); thus, sometimes our algorithm combines a highly diverse ensemble of centroids. With pure-random initialization this occurs more often. We got similar results using hard clustering and with k=10. We plotted the disputed examples found by our D-combining algorithm (using 5 subsets with hard k-means in each subset). Figure 5.6 (a) shows the disputed examples only, and Figure 5.6 (b) shows them with the global clusters in the background. It shows that the disputed examples lie on the spatial border of clusters, with small changes in the centroid locations. Similar plots were obtained by clustering the subsets with fuzzy k-means. Figure 5.7 (a) shows 47 disputed examples using 5 subsets, fuzzy k-means applied to the examples of each subset, and k=7. Figure 5.7 (b) shows the disputed examples with the global clusters in the background.

Table 5.5 Results of Clustering (Fuzzy) MRI Global Data Set and 50 Random Initializations of Centroids (7 clusters)

        Average disputed examples   Standard deviation of disputed examples
Fuzzy   2990.54                     3407.42

Table 5.6 Results of Clustering (Hard) MRI Global Data Set and 50 Random Initializations of Centroids (7 clusters)

        Average disputed examples   Standard deviation of disputed examples
Hard    4022.84                     3720.89

Table 5.7 Results of Our D-Combining Algorithm (MRI Data).
Fuzzy k-means Applied to Each Subset (7 clusters)

                    Avg. disputed (Semi-random)   Std. dev. (Semi-random)   Avg. disputed (Pure-random)   Std. dev. (Pure-random)
2 Subsets (Fuzzy)   3040.68                       3229.14                   3288.82                       2665.02
3 Subsets (Fuzzy)   2980.76                       3268.57                   3380.84                       2381.51
4 Subsets (Fuzzy)   3175.38                       3213.51                   3848.04                       1815.08
5 Subsets (Fuzzy)   3073.8                        3160.03                   3469.36                       1927.57

Table 5.8 Results of Our D-Combining Algorithm (MRI Data). Hard k-means Applied to Each Subset (7 clusters)

                    Avg. disputed (Semi-random)   Std. dev. (Semi-random)   Avg. disputed (Pure-random)   Std. dev. (Pure-random)
2 Subsets (Hard)    4077.88                       3630.29                   3755.46                       2947.79
3 Subsets (Hard)    4069.34                       3661.02                   3486.14                       2355.76
4 Subsets (Hard)    4053.10                       3278.49                   3872.98                       2434.57
5 Subsets (Hard)    4096.46                       3282.17                   3089.94                       1765.54

(a) The disputed examples of the extrema from clustering all the MRI data (k=7) using fuzzy k-means. [Histogram]
(b) The disputed examples of the extrema obtained using 5 subsets of MRI data (fuzzy, k=7 in each subset, semi-random initialization). [Histogram]
Figure 5.5 Comparison of Global Fuzzy Clustering of MRI Data with D-Combining Algorithm

(a) Plot of 1384 disputed examples of MRI data (T1, T2, PD axes). The number of subsets used is 5, k=7, with hard clustering in each subset.
(b) The disputed examples lie on the spatial border (green + symbols) of the clusters. Clusters are shown in different colors.
Figure 5.6 Plots of Disputed Examples of MRI Data (Hard k-means, k=7)

(a) Plot of 47 disputed examples of MRI data.
The number of subsets used is 5, k=7, with fuzzy clustering in each subset (T1, T2, PD axes).
(b) The 47 disputed examples lie on the spatial border (green + symbols) of the clusters, as expected. Clusters are shown in different colors.
Figure 5.7 Plots of Disputed Examples of MRI Data (Fuzzy k-means, k=7)

5.9 Using Harmony to Filter Centroids in the MRI Data

Table 5.9 shows the result of applying the Harmony algorithm with a 1% threshold. There is a significant reduction in the number of disputed examples; it is most dramatic for the MRI data, where the number of disputed examples drops from the thousands to the hundreds. We believe this indicates that these partitions are essentially as good as those obtained using all the centroids. Figure 5.8 (a) shows the 1384 disputed examples obtained without the Harmony algorithm. Figure 5.8 (b) shows that the number of disputed examples is reduced from 1384 to 346 after applying the Harmony algorithm, and Figure 5.8 (c) shows that the remaining disputed examples also lie on the spatial border of the clusters of the global partition. This shows that the Harmony algorithm effectively reduces the number of disputed examples on the spatial border of the clusters. We obtained similar plots with fuzzy k-means applied to each subset. It has been observed that the Harmony algorithm successfully removes noisy centroids when the noise is in the minority. It sometimes fails to filter centroids when a significant amount of noise is present in a consensus chain, i.e. when there is no clear majority of a particular set of harmonious centroids in the chain. In a real-life scenario one or more distributed sites/subsets may be noisy, so it is worth using the algorithm to fine-tune the final partition.

Table 5.9 Using Harmony to Filter Centroids in the MRI Data Set (boost threshold used = 1%)

MRI (hard, 7 clusters, 5 subsets): J1 value of global cluster (most often occurring) = 230359434.83; disputed examples from our D-combining algorithm = 1384; J1 value of D-combining = 234797809.09; disputed examples after applying the Harmony booster = 346; J1 value after applying the Harmony booster = 231125304.72.
MRI (fuzzy, 7 clusters, 5 subsets): J2 value of global cluster (most often occurring) = 106134159.51; disputed examples from our D-combining algorithm = 3395; J1 value of D-combining = 245836863.33; disputed examples after applying the Harmony booster = 174; J1 value after applying the Harmony booster = 236170371.82.

(a) Left: 1384 disputed examples of MRI data using 5 subsets, hard k-means in each subset (7 clusters). Right: the result after applying the Harmony booster; disputed examples decreased to 346 (boost threshold = 1%).
Figure 5.8 Effect of Applying the Harmony Algorithm to Consensus Chains of MRI Data (Using Hard k-means, k=7)

(c) The 346 disputed examples (green + symbols) lie on the spatial border of the clusters.
Figure 5.8 Continued

5.10 Synthetic Data Experiments

This data set was used to provide an indication of what might happen if the random subsets/distributed sites received a poor selection of examples. The data was broken into four subsets as follows. One subset was formed with 250 examples from class 1 and 250 examples from class 3, which correspond to the o symbols and one of the other symbol sets, respectively, in Figure 5.1. So this was a very non-representative subset. The other subsets got examples randomly selected from each of the classes with equal probability as long as an example was available; these subsets were slightly skewed. The data was clustered into four classes. In this case, one would expect the distributed partition to be at best a bit worse than the partition created with all of the data. Tables 5.10-5.12 show that this is so. However, after applying the Harmony algorithm to smooth the cluster chains, there are on average between 14 and 16 examples placed in different classes for hard k-means, as shown in Table 5.12, and 22 examples for fuzzy k-means applied to each subset.
While this is clearly more than the couple that change with different random initializations using all the data, we believe it shows that the final partition would be quite usable, because most of the time people are trying to get a general idea of how the data is grouped. Figures 5.9 (a) and (b) show the disputed examples before and after applying Harmony. Fuzzy k-means was used in each subset, and the figures clearly show that Harmony makes the partition better. Similar results are plotted in Figures 5.10 (a)-(b), where hard k-means was used in each subset. Figure 5.11 shows the pattern of extrema formed by clustering all the data at once, i.e. global clustering. It shows that all the extrema have between 0 and 4 disputed examples compared to the most typical global partition. Figure 5.12 (a) shows that combining partitions (hard k-means, pure-random initialization) without Harmony resulted in extrema patterns whose disputed examples vary from 56 to 99, while after applying Harmony the number of disputed examples on average reduces to between 11 and 17 (Figure 5.12 b). Similar results were obtained in Figures 5.13 (a)-(b) with semi-random centroid initialization. Note that in this experiment the results of semi-random and pure-random initialization do not differ much, because the global data does not appear to have multiple extrema that differ significantly among themselves (Figure 5.11); thus it is less sensitive to initialization.
Table 5.10 Results of Clustering (Fuzzy and Hard) Artificial Global Data Set and 50 Random Initializations of Centroids (4 clusters, 2000 examples)

        Average disagreement   Standard deviation
Hard    1.78                   1.58
Fuzzy   0                      0

Table 5.11 Results of Our D-Combining Algorithm without Filtering Cluster Centers (K=4 in each subset, 500 examples in each subset)

                    Avg. disagreement (Semi-random)   Std. dev. (Semi-random)   Avg. disagreement (Pure-random)   Std. dev. (Pure-random)
4 Subsets (Hard)    65.62                             9.93                      69.60                             11.14
4 Subsets (Fuzzy)   92                                0                         92                                0

Table 5.12 Results of Our D-Combining Algorithm after Filtering Cluster Centers Using Harmony (K=4 in each subset, 500 examples in each subset)

                    Avg. disagreement (Semi-random)   Std. dev. (Semi-random)   Avg. disagreement (Pure-random)   Std. dev. (Pure-random)
4 Subsets (Hard)    15.66                             1.75                      14.5                              2.31
4 Subsets (Fuzzy)   22                                0                         22                                0

(a) 92 disputed examples (+ symbols) of the synthetic data using 4 subsets, fuzzy k-means applied to each subset.
(b) The result after applying the Harmony algorithm: the number of disputed examples (+ symbols) decreased to 22 (boost threshold = 1%).
Figure 5.9 Effect of Applying the Harmony Algorithm to the Consensus Chains of Synthetic Data (Using Fuzzy k-means, k=4). One of the Subsets is Unbalanced. Different Colors of the . Symbol Denote Different Clusters

(a) 78 disputed examples (+ symbols) of the synthetic data using 4 subsets, hard k-means in each subset.
(b) The result after applying the Harmony algorithm: disputed examples (+ symbols) decreased to 17 (boost threshold = 1%).
Figure 5.10 Effect of Applying the Harmony Algorithm to the Consensus Chains of Synthetic Data (Using Hard k-means, k=4). One of the Subsets is Unbalanced. Different Colors of the . Symbol Denote Different Clusters

Figure 5.11 Extrema of Global Clustering of Synthetic Data.
It shows the disputed examples of the extrema of clustering all the synthetic data (k=4) using hard k-means (global clustering). [Histogram: count vs. number of disputed examples]

(a) Results after applying the D-combining algorithm to the synthetic data without using the Harmony algorithm to filter centroids (pure-random initialization). [Histogram]
(b) Results after applying the D-combining algorithm to the synthetic data with the Harmony algorithm to filter centroids (pure-random initialization). [Histogram]
Figure 5.12 Effect of Applying the Harmony Algorithm to the Extrema Patterns of Synthetic Data with Pure-Random Initialization (Using Hard k-means, k=4)

(a) Results after applying the D-combining algorithm to the synthetic data without using the Harmony algorithm to filter centroids (semi-random initialization). [Histogram]
(b) Results after applying the D-combining algorithm to the synthetic data with the Harmony algorithm to filter centroids (semi-random initialization). [Histogram]
Figure 5.13 Effect of Applying the Harmony Algorithm to the Extrema Patterns of Synthetic Data with Semi-Random Initialization (Using Hard k-means, k=4)

5.11 Plankton Data Experiments

Due to the size of this data set, we focus on the time saved by clustering in a distributed way. All experiments were run on a SUN Enterprise 3000 multiprocessor system using UltraSPARC II processors running at 248 MHz with 3 GB of RAM. All the data was clustered once, which took approximately 407 minutes (nearly seven hours). It took approximately 43 minutes to cluster the data in a distributed way using 14 equal-size subsets. The merging algorithm took about 4.28 minutes to form the global partition, so we got a speedup of 8.6 times. There were 17,028 (4.8%) disputed examples after merging the distributed partitions. Using the Harmony algorithm resulted in just 8147 (2.3%) disputed examples.
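The reported 8.6x figure is consistent with the quoted timings, counting both the subset clustering and the merge against the single global run:

```python
# Timings quoted for the plankton experiment (minutes).
global_minutes = 407.0   # clustering all 350,000 examples at once
subset_minutes = 43.0    # clustering the 14 equal-size subsets
merge_minutes = 4.28     # merging the 14 partitions

speedup = global_minutes / (subset_minutes + merge_minutes)
print(round(speedup, 1))  # -> 8.6
```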
Such an overall partition of the data should be quite useful given the typical instability of clustering algorithms and the fact that the general cluster characteristics are likely preserved.

Chapter 6 Discussion and Conclusions

Experimental results show that the difference between the average number of disputed examples produced by our D-combining algorithm and by global clustering is 0.8% (averaged over fuzzy and hard clustering, 3 subsets, semi-random initializations) on the Iris data and 0.34% (averaged over fuzzy and hard clustering, 5 subsets, semi-random initializations) on the MRI data set. We have observed that the D-combining algorithm produces extrema (or clusters) more similar to the global extrema if the centroids in different subsets map without any collision. If the global data set has many extrema whose J1 (J2) values differ significantly, then it is more probable that the centroids of different subsets will collide during centroid mapping. This likelihood becomes higher with pure-random initialization of the centroids in each subset. It has been observed that our D-combining algorithm produces more distinct extrema using pure-random centroid initialization than using semi-random centroid initialization. The Iris data set, when globally clustered using fuzzy k-means, has only 1 extremum (Table 5.1); similarly, our D-combining algorithm generates 1 extremum (Table 5.3). The MRI data, when clustered globally using the fuzzy k-means algorithm, produces extrema whose J2 (J1) values (and thus also disputed examples) differ significantly (Figure 5.5 a). Similarly, our Distributed-combining algorithm produces comparable patterns of extrema, which differ significantly (Figure 5.5 b). Plotting the disputed examples shows that they lie on the spatial border of the clusters of the global partition (Figures 5.2 a, 5.4 a, 5.6, 5.7, 5.8, 5.9, 5.10), as expected if the partitions are not radically different.
Sometimes our algorithm produced extrema that did not exist in the global partition. This happens when the J values of the extrema differ significantly in the global data and the subsets inherit that property. In summary, the quality of the partition formed is more often better with semi-random initialization than with pure-random initialization. Combining all the representatives in an ensemble of centroids in unsupervised learning is not always good, due to the unstable characteristics of the clustering algorithms or non-representative subsets. We have shown with the artificial data that even if the data in a subset is heavily skewed we can still recover good partitions: the average difference in disputed examples between the D-combining algorithm and global clustering is 3.85% (averaged over fuzzy and hard clustering, 4 subsets, semi-random initializations) without the filtering algorithm and 0.85% (same setting) after using the centroid filtering algorithm. Experimental results show that minimizing the amount of diversity among an ensemble of centroids may give a better partition than merging all of them (Table 5.9, Figure 5.8, Table 5.12, Figure 5.9, Figure 5.10, Figure 5.12, and Figure 5.13). Thus, on average, the quality of the partition improves by using the Harmony algorithm. These partitions should be compared with single-pass k-means in time and quality [8, 13]. Clustering very large-scale data using hard k-means or fuzzy k-means is very time consuming. Sometimes the data may be geographically distributed, or may be too large to fit in a single memory. There may be constraints such that data cannot be shared between different distributed locations due to privacy, security, or the proprietary nature of the data. Extracting knowledge from these types of distributed locations under restraints on data exchange is called privacy-preserving data mining.
In this thesis we proposed a distributed clustering algorithm that provides a framework for integrating an ensemble of centroids to form a global partition. We reach a global consensus on the positions of the centroids after merging an ensemble of centroids. As mentioned earlier, the results on two real data sets show that the quality of the clusters produced by our combining algorithm was similar to the quality of the clusters generated by global clustering (the difference in the average number of disputed examples between the partitions of D-combining and global clustering is within 1%, averaged over fuzzy and hard clustering). The general cluster structure is maintained. If the initial centroid assignments of the subsets are semi-random, our combining algorithm tends to produce an extrema pattern closer to the pattern of extrema found during global clustering. If pure-random initialization is used, the similarity of the extrema decreases a little, because the likelihood of a perfect mapping of the centroids of disjoint subsets is higher with semi-random centroid initialization than with pure-random centroid initialization. Data sets having many extrema whose J values differ heavily among themselves sometimes produce clusters dissimilar to those obtained from clustering all the data. To overcome this problem, we proposed a Harmony algorithm to smooth the cluster centroids to be combined; it tends to eliminate noisy or discordant centroids. Results show that use of the Harmony algorithm on the MRI data (Figure 5.8) reduces the number of disputed examples, which lie on the border of the global clusters, from 1384 to 346. A synthetic data set with 4 Gaussian classes was used to examine how this algorithm would perform in the case of a poor selection of data in the subsets.
Results showed that even in this case, in conjunction with the Harmony algorithm, a partition that was a reasonable approximation of the partition obtained from clustering with all the data was recovered (the difference in the average number of disputed examples between the partitions of D-Combining and global clustering is 0.85%, averaged over fuzzy and hard clustering), improving the quality of the clusters. A medium-sized real-world data set of 350,000 examples, 37 dimensions and 12 classes was used to look at both the potential speedup of this approach and how well it matched clustering with all the data on a bigger set. The clustering was accomplished significantly faster, approximately 9 times, using 14 subsets of data. The final partition matched the partition obtained after clustering with all the data within 2.3%. Our assumption behind merging the centroids of distributed partitions is that the data at all distributed sites are from the same underlying distribution. If this assumption is violated, that is, if the distributed partitions are radically different from each other, our merging algorithm will still merge those partitions, and the centroid filtering algorithm may not detect any noisy centroids, because in the consensus chain there may be no consensus for a global centroid. For example, a problem would arise if there were 6 clusters and 3 sites, each of which had data from 2 clusters, with all pairs of clusters disjoint. In the future we plan to detect scenarios like this. The approach presented here is a scalable, distributed approach that can be applied to well-understood iterative clustering algorithms. It also provides a framework for privacy-preserving data mining. We have shown it provides very representative clusterings, or data partitions.
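The centroid filtering referred to here operates on a consensus chain whose members are all supposed to describe the same global cluster. The Harmony algorithm of Chapter 4 defines its own criterion; as an illustrative stand-in, the sketch below drops chain members whose distance to the chain mean is large relative to the median such distance (the `factor` threshold is a hypothetical parameter, not one from the thesis):

```python
import math

def filter_chain(chain, factor=2.0):
    """Drop discordant centroids from one consensus chain (illustrative).

    Stand-in for the thesis's Harmony criterion: remove chain members whose
    distance to the chain mean exceeds `factor` times the median distance.
    """
    dim = len(chain[0])
    # Unweighted chain mean, one coordinate at a time.
    mean = tuple(sum(c[d] for c in chain) / len(chain) for d in range(dim))
    dists = [math.dist(c, mean) for c in chain]
    median = sorted(dists)[len(dists) // 2]
    # Keep only centroids that stay close to the chain consensus.
    return [c for c, d in zip(chain, dists) if d <= factor * median]
```

On a chain such as `[(0, 0), (0.1, 0), (0, 0.1), (50, 50)]` the outlying member `(50, 50)` is removed, which mirrors how a single discordant centroid would otherwise drag the weighted mean away from the true cluster center.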
The approach allows partitions to be built from very large data sets that closely approximate those obtained from clustering with all the data using algorithms such as hard k-means or fuzzy k-means.

References

[1] A. Strehl and J. Ghosh, Cluster ensembles: a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3, 2002, pp. 583-617.

[2] A. Topchy, A.K. Jain, and W. Punch, Combining Multiple Weak Clusterings. Proceedings of the IEEE Int'l Conf. on Data Mining, 2003, pp. 331-338.

[3] K.D. Bollacker and J. Ghosh, A Supra-Classifier Architecture for Scalable Knowledge Reuse. In Proc. Int'l Conf. on Machine Learning (ICML-98), 64.

[4] I. Davidson and A. Satyanarayana, Speeding up K-means Clustering by Bootstrap Averaging. To appear in the Workshop on Clustering Large Data Sets at IEEE ICDM 2004.

[5] A.L.N. Fred, Finding Consistent Clusters in Data Partitions. In Proc. 3rd Int. Workshop on Multiple Classifier Systems, Eds. F. Roli and J. Kittler, LNCS 2364, 2002, pp. 309-318.

[6] J. Ghosh and S. Merugu, Distributed Clustering with Limited Knowledge Sharing, Proceedings of the 5th International Conference on Advances in Pattern Recognition (ICAPR), pp. 48-53, Dec. 2003.

[7] E. Januzaj, H.-P. Kriegel, and M. Pfeifle, Towards Effective and Efficient Distributed Clustering, Proc. Int. Workshop on Clustering Large Data Sets, 3rd IEEE International Conference on Data Mining (ICDM), 2003, pp. 49-58.

[8] F. Farnstrom, J. Lewis, and C. Elkan, Scalability of Clustering Algorithms Revisited, SIGKDD Explorations, 2(1), 2000, pp. 51-57.

[9] I.S. Dhillon and D.S. Modha, A Data-Clustering Algorithm on Distributed Memory Multiprocessors, Proceedings of the Large-Scale Parallel KDD Systems Workshop, ACM SIGKDD, 1999, pp. 245-260.

[10] Ganti, V., Gehrke, J. and Ramakrishnan, R. (1999). Mining very large databases, Computer, August, 38-45.

[11] Zhang, T., Ramakrishnan, R. and Livny, M.
(1996). BIRCH: An Efficient Data Clustering Method for Very Large Databases, Proc. ACM SIGMOD Int'l. Conf. on Management of Data, ACM Press, NY, 103-114.

[12] Ganti, V., Ramakrishnan, R., Gehrke, J., Powell, A. L. and French, J. C. (1999). Clustering Large Datasets in Arbitrary Metric Spaces, Proc. 15th Int'l. Conf. on Data Engineering, IEEE CS Press, Los Alamitos, CA, 502-511.

[13] Bradley, P., Fayyad, U. and Reina, C. (1998). Scaling clustering algorithms to large databases, Proc. 4th Int'l. Conf. on Knowledge Discovery and Data Mining, AAAI Press, Menlo Park, CA, 9-15.

[14] Domingos, P. and Hulten, G. (2001). A General Method for Scaling Up Machine Learning Algorithms and its Application to Clustering, Proc. Eighteenth Int'l. Conf. on Machine Learning, 106-113.

[15] A.P. Dempster, N.M. Laird and D.B. Rubin. Maximum Likelihood from Incomplete Data via the EM Algorithm, Journal of the Royal Statistical Society, Series B, 39:1-38, 1977.

[16] J. Ghosh. Scalable clustering methods for data mining. In N. Ye, editor, Handbook of Data Mining, pp. 247-277. Lawrence Erlbaum, 2003.

[17] J. Tantrum, A. Murua and W. Stuetzle, Assessment and Pruning of Hierarchical Model Based Clustering, KDD 2003, pp. 197-205, 2003.

[18] W. Peter, J. Chiochetti, and C. Giandina, New Unsupervised Clustering Algorithm for Large Data Sets, KDD 2003, pp. 643-648, 2003.

[19] C.J. Merz and P.M. Murphy. UCI Repository of Machine Learning Databases, Univ. of CA., Dept. of CIS, Irvine, CA. http://www.ics.uci.edu/~mlearn/MLRepository.html

[20] Remsen, A., Samson, S., Hopkins, T. L., 2004. What you see is not what you catch: A comparison of concurrently collected net, optical plankton counter (OPC), and Shadowed Image Particle Profiling Evaluation Recorder (SIPPER) data from the northeast Gulf of Mexico. Deep Sea Research I 51(1), 129-151. http://www.marine.usf.edu/sipper/Remsen_DSRa.pdf

[21] A.K. Jain and R.C. Dubes. Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs, NJ, USA, 1988.

[22] Bezdek, J.
C., Keller, J. M., Krishnapuram, R. and Pal, N. R. (1999). Fuzzy Models and Algorithms for Pattern Recognition and Image Processing, Kluwer, Norwell, MA.

[23] A.K. Jain, M.N. Murty, and P.J. Flynn, Data Clustering: A Review, ACM Computing Surveys, 31, 3, 1999.

[24] F. Hoppner, Speeding up fuzzy c-means: using a hierarchical data organization to control the precision of membership calculation, Fuzzy Sets and Systems, 128, 2002, pp. 365-376.

[25] S. Eschrich, J. Ke, L.O. Hall and D.B. Goldgof, Fast Accurate Fuzzy Clustering through Data Reduction, IEEE Transactions on Fuzzy Systems, 11, 2, 2003, pp. 262-270.

[26] D. Pelleg and A. Moore, Accelerating Exact k-means Algorithms with Geometric Reasoning, Proceedings of the Fifth International Conference on Knowledge Discovery in Databases, 1999, pp. 277-281.

[27] J. Ghosh, A. Strehl, and S. Merugu, A Consensus Framework for Integrating Distributed Clusterings Under Limited Knowledge Sharing, Proceedings of the National Science Foundation (NSF) Workshop on Next Generation Data Mining, 2002, pp. 99-108.

[28] M. Klusch, S. Lodi, and G. Moro, Distributed Clustering Based on Sampling Local Density Estimates, Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, 2003, pp. 485-490.

[29] P.-E. Jouve and N. Nicoloyannis, A New Method for Combining Partitions, Applications for Distributed Clustering, In Proceedings of the International Workshop on Parallel and Distributed Machine Learning and Data Mining (ECML/PKDD 03), 2003, pp. 35-46.

[30] C. Elkan, Using the Triangle Inequality to Accelerate k-Means, Machine Learning, Proceedings of the Twentieth International Conference (ICML 2003), 2003, pp. 147-153.

[31] D. K. Tasoulis and M. N. Vrahatis, Unsupervised Distributed Clustering, In Proceedings of the IASTED International Conference on Parallel and Distributed Computing and Networks, Innsbruck, Austria.

[32] A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke.
Privacy Preserving Mining of Association Rules. In KDD, pp. 217, 2002.

[33] Kuhn, H.W., The Hungarian Method for the Assignment Problem. Naval Res. Logist. Quart., 2 (1955), 83-97.

[34] D. Agrawal and C. C. Aggarwal, On the Design and Quantification of Privacy Preserving Data Mining Algorithms, In Proceedings of the Twentieth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, 2001, pp. 247-255.

[35] B. Pinkas. Cryptographic Techniques for Privacy-Preserving Data Mining. SIGKDD Explorations, 4(2):12, 2002.

ABSTRACT: Clustering large data sets has recently emerged as an important area of research. The ever-increasing size of data sets and the poor scalability of clustering algorithms have drawn attention to distributed clustering for partitioning large data sets. Sometimes, centrally pooling the distributed data is also expensive. There may also be constraints on data sharing between different distributed locations due to the privacy, security, or proprietary nature of the data.
In this work we propose an algorithm to cluster large-scale data sets without centrally pooling the data. Data at distributed sites are clustered independently, that is, without any communication among the sites. After partitioning the local/distributed sites we send only the centroids of each site to a central location, so there is very little bandwidth cost in a wide-area-network scenario. The distributed sites/subsets exchange neither cluster labels nor individual data features, thus providing a framework for privacy-preserving distributed clustering. Centroids from each local site form an ensemble of centroids at the central site. Our assumption is that data in all distributed locations are from the same underlying distribution, and that the set of centroids obtained by partitioning the data in each subset/distributed location gives us partial information about the position of the cluster centroids in that distribution. The problem of finding a global partition using the limited knowledge of the ensemble of centroids can then be viewed as the problem of reaching a global consensus on the position of the cluster centroids. A global consensus on the position of the cluster centroids of the global data, using only the very limited statistics of the centroid positions from each local site, is reached by grouping the centroids into consensus chains and computing the weighted mean of the centroids in a consensus chain to represent a global cluster centroid. We compute the Euclidean distance of each example from the global set of centroids and assign it to the nearest centroid. Experimental results show that the quality of the clusters generated by our algorithm is similar to the quality of the clusters generated by clustering all the data at once. We have shown that the disputed examples between the clusters generated by our algorithm and those from clustering all the data at once lie on the border of clusters, as expected.
We also proposed a centroid-filtering algorithm to improve the partitions formed by our algorithm.

http://digital.lib.usf.edu/?e14.395