Introduction to Apache Mahout: an Unsupervised Machine Learning Library

Unsupervised machine learning is the task of inferring a function that describes the structure of “unlabeled” data (i.e. data that has not been classified or categorized). Since the examples given to the learning algorithm are unlabeled, there is no straightforward way to evaluate the accuracy of the structure that the algorithm produces; this is one feature that distinguishes unsupervised learning from supervised learning and reinforcement learning.

There are two canonical unsupervised learning problems:

• clustering
• dimensionality reduction

Apache Mahout is a project of the Apache Software Foundation that produces free implementations of distributed or otherwise scalable machine learning algorithms, focused primarily on collaborative filtering, clustering, and classification. Here we discuss only the clustering problem.

Clustering algorithms

There are many clustering algorithms: Hierarchical Clustering, K-Means Clustering, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), and more. Here we discuss only K-Means Clustering.

K-Means Clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster. The hard part is that you usually don’t know in advance how many clusters k you want, or which k of the n observations are the most suitable initial cluster centers.
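To make the “nearest mean” idea concrete, here is a minimal sketch of Lloyd’s algorithm for K-Means in plain Python (function and variable names are illustrative, not Mahout’s API). Note that it takes the initial centers as an argument, which is exactly the hard part described above:

```python
import math

def kmeans(points, centers, iterations=20):
    """Partition `points` into len(centers) clusters by nearest mean."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)),
                    key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        # Update step: each center moves to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers, clusters

# Two obvious groups; with sensible starting centers the means converge quickly.
centers, clusters = kmeans([(1, 1), (1.5, 2), (8, 8), (9, 8.5)],
                           centers=[(0, 0), (10, 10)])
```

With the starting centers above, the two means converge to (1.25, 1.5) and (8.5, 8.25); with poorly chosen starting centers the same algorithm can converge to a much worse partition, which is why the choice of initial centers matters.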

Solution:

Canopy Clustering is a very simple, fast and surprisingly accurate method for grouping objects into clusters. All objects are represented as a point in a multidimensional feature space. The algorithm uses a fast approximate distance metric and two distance thresholds T1 > T2 for processing. The basic algorithm is to begin with a set of points and remove one at random. Create a Canopy containing this point and iterate through the remainder of the point set. At each point, if its distance from the first point is < T1, then add the point to the cluster. If, in addition, the distance is < T2, then remove the point from the set. This way points that are very close to the original will avoid all further processing. The algorithm loops until the initial set is empty, accumulating a set of Canopies, each containing one or more points. A given point may occur in more than one Canopy.

So Canopy Clustering is often used as an initial step in more rigorous clustering techniques, such as K-Means Clustering, to determine the initial cluster centers. Starting from this initial clustering significantly reduces the number of more expensive distance measurements, since points outside a canopy can be ignored.

Next post will introduce how to use Canopy and K-Means together.