Machine learning -- kMeans clustering

Related concepts

Unsupervised learning

Unsupervised learning is machine learning that learns the statistical regularities or internal structure of data from unlabeled data; it mainly includes clustering, dimensionality reduction and probability estimation. Unsupervised learning can be used for data analysis or as pre-processing for supervised learning.

Clustering

Clustering is a data analysis problem in which a given set of samples is grouped into several clusters according to the similarity of, or distance between, their features. Intuitively, similar samples end up in the same cluster, and dissimilar samples are scattered across different clusters. Therefore, the similarity or distance between samples plays a central role.

Measurement of similarity and distance:

The sample matrix $X$ is written as:

$$X=\left[x_{ij}\right]_{m\times n}=\begin{bmatrix}x_{11}&x_{12}&\cdots&x_{1n}\\x_{21}&x_{22}&\cdots&x_{2n}\\\vdots&\vdots& &\vdots\\x_{m1}&x_{m2}&\cdots&x_{mn}\end{bmatrix}$$

where each column $x_j=\left(x_{1j},x_{2j},\cdots,x_{mj}\right)^T$ is a sample with $m$ features.

1: Minkowski distance: $d_{ij}=\left(\sum\limits_{k=1}^m\left|x_{ki}-x_{kj}\right|^p\right)^{\frac{1}{p}}$. In clustering, the sample set can be regarded as a set of points in a vector space, and the similarity between samples is expressed by their distance in this space. The larger the Minkowski distance, the smaller the similarity; the smaller the distance, the higher the similarity.
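
For illustration, a minimal Java helper that computes the Minkowski distance between two feature vectors could look like the sketch below (the method name minkowskiDistance is my own and is not part of the KMeans class developed later):

	/**
	 * The Minkowski distance between two vectors. p = 1 gives the Manhattan
	 * distance and p = 2 the Euclidean distance. A minimal sketch.
	 */
	public static double minkowskiDistance(double[] paraFirst, double[] paraSecond, double paraP) {
		double tempSum = 0;
		for (int k = 0; k < paraFirst.length; k++) {
			tempSum += Math.pow(Math.abs(paraFirst[k] - paraSecond[k]), paraP);
		} // Of for k
		return Math.pow(tempSum, 1.0 / paraP);
	}// Of minkowskiDistance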

2: Mahalanobis distance: $d_{ij}=\left[(x_i-x_j)^T S^{-1}(x_i-x_j)\right]^{\frac{1}{2}}$, where $S$ is the covariance matrix of the samples and $x_i=\left(x_{1i},x_{2i},\cdots,x_{mi}\right)^T$, $x_j=\left(x_{1j},x_{2j},\cdots,x_{mj}\right)^T$. The Mahalanobis distance is another commonly used similarity measure; it takes the correlation between features into account and does not depend on the scale of each component. The larger the Mahalanobis distance, the smaller the similarity, and vice versa.
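
A corresponding sketch for the Mahalanobis distance, assuming the inverse covariance matrix $S^{-1}$ has already been computed and is passed in as a double[][] (again an illustrative helper, not part of the class below):

	/**
	 * The Mahalanobis distance between two vectors, given the inverse of the
	 * covariance matrix. A minimal sketch.
	 */
	public static double mahalanobisDistance(double[] paraFirst, double[] paraSecond,
			double[][] paraInverseCovariance) {
		int tempLength = paraFirst.length;
		double[] tempDifference = new double[tempLength];
		for (int k = 0; k < tempLength; k++) {
			tempDifference[k] = paraFirst[k] - paraSecond[k];
		} // Of for k

		// The quadratic form (x_i - x_j)^T S^{-1} (x_i - x_j).
		double tempSum = 0;
		for (int row = 0; row < tempLength; row++) {
			for (int col = 0; col < tempLength; col++) {
				tempSum += tempDifference[row] * paraInverseCovariance[row][col] * tempDifference[col];
			} // Of for col
		} // Of for row
		return Math.sqrt(tempSum);
	}// Of mahalanobisDistance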

3: Correlation coefficient: the similarity between samples can also be expressed by the correlation coefficient. The closer its absolute value is to 1, the more similar the samples; the closer it is to 0, the more dissimilar they are. The correlation coefficient between samples $x_i$ and $x_j$ is defined as:

$$r_{ij}=\frac{\sum\limits_{k=1}^m(x_{ki}-\bar x_i)(x_{kj}-\bar x_j)}{\left[\sum\limits_{k=1}^m(x_{ki}-\bar x_i)^2\sum\limits_{k=1}^m(x_{kj}-\bar x_j)^2\right]^{\frac{1}{2}}}$$

where $\bar x_i=\frac{1}{m}\sum\limits_{k=1}^m x_{ki}$ and $\bar x_j=\frac{1}{m}\sum\limits_{k=1}^m x_{kj}$.
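
A direct translation of the correlation coefficient into Java might look as follows (an illustrative helper only, not part of the class below):

	/**
	 * The correlation coefficient between two vectors. A minimal sketch.
	 */
	public static double correlationCoefficient(double[] paraFirst, double[] paraSecond) {
		int tempLength = paraFirst.length;
		double tempMeanFirst = 0, tempMeanSecond = 0;
		for (int k = 0; k < tempLength; k++) {
			tempMeanFirst += paraFirst[k];
			tempMeanSecond += paraSecond[k];
		} // Of for k
		tempMeanFirst /= tempLength;
		tempMeanSecond /= tempLength;

		double tempNumerator = 0, tempSumFirst = 0, tempSumSecond = 0;
		for (int k = 0; k < tempLength; k++) {
			tempNumerator += (paraFirst[k] - tempMeanFirst) * (paraSecond[k] - tempMeanSecond);
			tempSumFirst += (paraFirst[k] - tempMeanFirst) * (paraFirst[k] - tempMeanFirst);
			tempSumSecond += (paraSecond[k] - tempMeanSecond) * (paraSecond[k] - tempMeanSecond);
		} // Of for k
		return tempNumerator / Math.sqrt(tempSumFirst * tempSumSecond);
	}// Of correlationCoefficient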

4: Included angle cosine:

The similarity between samples can also be expressed by the cosine of the angle between them. The closer the cosine is to 1, the more similar the samples; the closer it is to 0, the more dissimilar they are. The included angle cosine between samples $x_i$ and $x_j$ is defined as:

$$s_{ij}=\frac{\sum\limits_{k=1}^m x_{ki}x_{kj}}{\left[\sum\limits_{k=1}^m x_{ki}^2\sum\limits_{k=1}^m x_{kj}^2\right]^{\frac{1}{2}}}$$
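
And a matching sketch for the included angle cosine (illustrative helper only):

	/**
	 * The included angle cosine between two vectors. A minimal sketch.
	 */
	public static double cosineSimilarity(double[] paraFirst, double[] paraSecond) {
		double tempDotProduct = 0, tempNormFirst = 0, tempNormSecond = 0;
		for (int k = 0; k < paraFirst.length; k++) {
			tempDotProduct += paraFirst[k] * paraSecond[k];
			tempNormFirst += paraFirst[k] * paraFirst[k];
			tempNormSecond += paraSecond[k] * paraSecond[k];
		} // Of for k
		return tempDotProduct / Math.sqrt(tempNormFirst * tempNormSecond);
	}// Of cosineSimilarity
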
k-means clustering

k-means clustering is a partition-based clustering algorithm: a sample set $D$ is divided into $k$ subsets, forming $k$ clusters $\left\{C_l \mid l=1,2,\cdots,k\right\}$. The $n$ samples are divided into $k$ clusters such that each sample has the smallest distance to the center of its own cluster, and each sample belongs to exactly one cluster.

Using $d_{x,y}$ to denote the distance between two vectors, the loss function of the k-means algorithm is:

$$W(C)=\sum\limits_{l=1}^k\sum\limits_{x \in C_l}d_{x,\bar x_l}$$

where $\bar x_l=\frac{1}{|C_l|}\sum\limits_{x \in C_l} x$ is the center (mean) of cluster $C_l$.

This gives the optimization objective of the k-means algorithm:

$$C^*=\arg\min_C W(C)$$
When similar samples are grouped into the same cluster, the loss value is smallest. However, dividing $n$ samples into $k$ classes is a combinatorial optimization problem: the number of possible partitions grows exponentially. Therefore, a greedy strategy is adopted, and the solution is approximated by iteration.
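
To make the loss concrete, the sketch below evaluates $W(C)$ for a given assignment. It is my own illustration, not part of the class developed later: it assumes the samples are plain double[] vectors, uses squared Euclidean distance, and assumes every cluster is non-empty.

	/**
	 * Compute the k-means loss W(C): the sum of (squared Euclidean) distances
	 * from each sample to the mean of its cluster. paraAssignments[i] is the
	 * cluster index of sample i. A minimal sketch.
	 */
	public static double computeLoss(double[][] paraData, int[] paraAssignments, int paraNumClusters) {
		int tempNumAttributes = paraData[0].length;
		double[][] tempCenters = new double[paraNumClusters][tempNumAttributes];
		int[] tempCounts = new int[paraNumClusters];

		// Step 1. Compute the cluster means.
		for (int i = 0; i < paraData.length; i++) {
			tempCounts[paraAssignments[i]]++;
			for (int j = 0; j < tempNumAttributes; j++) {
				tempCenters[paraAssignments[i]][j] += paraData[i][j];
			} // Of for j
		} // Of for i
		for (int l = 0; l < paraNumClusters; l++) {
			for (int j = 0; j < tempNumAttributes; j++) {
				tempCenters[l][j] /= tempCounts[l];
			} // Of for j
		} // Of for l

		// Step 2. Sum the distances from the samples to their cluster means.
		double resultLoss = 0;
		for (int i = 0; i < paraData.length; i++) {
			for (int j = 0; j < tempNumAttributes; j++) {
				double tempDifference = paraData[i][j] - tempCenters[paraAssignments[i]][j];
				resultLoss += tempDifference * tempDifference;
			} // Of for j
		} // Of for i
		return resultLoss;
	}// Of computeLoss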

Code

Relevant member variables:

/**
	 * Manhattan distance.
	 */
	public static final int MANHATTAN = 0;

	/**
	 * Euclidean distance.
	 */
	public static final int EUCLIDEAN = 1;

	/**
	 * The distance measure.
	 */
	public int distanceMeasure = EUCLIDEAN;

	/**
	 * A random instance.
	 */
	public static final Random random = new Random();

	/**
	 * The number of clusters.
	 */
	int numClusters = 2;

	/**
	 * The whole data set.
	 */
	Instances dataset;

	/**
	 * The clusters.
	 */
	int[][] clusters;

In the constructor, read the data file and obtain the samples:
(the iris data set is used, but the label in the last column is not used for clustering)

	public KMeans(String paraFilename) {
		dataset = null;
		try {
			FileReader fileReader = new FileReader(paraFilename);
			dataset = new Instances(fileReader);
			fileReader.close();
		} catch (Exception ee) {
			System.out.println("Cannot read the file: " + paraFilename + "\r\n" + ee);
			System.exit(0);
		} // Of try
	}// Of the first constructor.

A setter function to set the number of clusters:

	/**
	 ********************
	 * A setter
	 *********************
	 */
	public void setNumClusters(int paraNumClusters) {
		numClusters = paraNumClusters;
	}// Of the setter

A random permutation of the indices is obtained to shuffle the data order, the same as in the earlier kNN code. Here, however, it is only used once, when initializing the cluster centers, so simply picking numClusters random samples would have the same effect (see the alternative sketch after the method):

	/**
	 ********************
	 * Get a random indices for data randomization.
	 *
	 * @param paraLength The length of the sequence.
	 * @return An array of indices.
	 *********************
	 */
	public static int[] getRandomIndices(int paraLength) {
		int[] resultIndices = new int[paraLength];

		// Step 1. Initialize.
		for (int i = 0; i < paraLength; i++) {
			resultIndices[i] = i;
		} // Of for i

		// Step 2. Randomly swap.
		int tempFirst, tempSecond, tempValue;
		for (int i = 0; i < paraLength; i++) {
			// Generate two random indices.
			tempFirst = random.nextInt(paraLength);
			tempSecond = random.nextInt(paraLength);

			// Swap
			tempValue = resultIndices[tempFirst];
			resultIndices[tempFirst] = resultIndices[tempSecond];
			resultIndices[tempSecond] = tempValue;

		} // Of for i
		return resultIndices;
	}// Of getRandomIndices
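
As mentioned above, only the first numClusters entries of the shuffled array are actually used. One possible alternative (my own sketch, not the author's code) leans on the standard library shuffle:

	/**
	 * An alternative to getRandomIndices using java.util.Collections.shuffle.
	 * A sketch only; it produces the same kind of random permutation.
	 */
	public static int[] getRandomIndicesAlternative(int paraLength) {
		java.util.List<Integer> tempList = new java.util.ArrayList<>();
		for (int i = 0; i < paraLength; i++) {
			tempList.add(i);
		} // Of for i
		java.util.Collections.shuffle(tempList, random);

		int[] resultIndices = new int[paraLength];
		for (int i = 0; i < paraLength; i++) {
			resultIndices[i] = tempList.get(i);
		} // Of for i
		return resultIndices;
	}// Of getRandomIndicesAlternative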

Calculate the distance from a sample to a center. Since the center may not be an actual sample point, the second parameter is an array:

	/**
	 ********************
	 * The distance between two instances.
	 *
	 * @param paraI     The index of the first instance.
	 * @param paraArray The array representing a point in the space.
	 * @return The distance.
	 *********************
	 */
	public double distance(int paraI, double[] paraArray) {
		double resultDistance = 0;
		double tempDifference;
		switch (distanceMeasure) {
		case MANHATTAN:
			for (int i = 0; i < dataset.numAttributes() - 1; i++) {
				tempDifference = dataset.instance(paraI).value(i) - paraArray[i];
				if (tempDifference < 0) {
					resultDistance -= tempDifference;
				} else {
					resultDistance += tempDifference;
				} // Of if
			} // Of for i
			break;

		case EUCLIDEAN:
			// The square root is omitted: the squared distance preserves the ordering,
			// which is enough for nearest-center comparisons.
			for (int i = 0; i < dataset.numAttributes() - 1; i++) {
				tempDifference = dataset.instance(paraI).value(i) - paraArray[i];
				resultDistance += tempDifference * tempDifference;
			} // Of for i
			break;
		default:
			System.out.println("Unsupported distance measure: " + distanceMeasure);
		}// Of switch

		return resultDistance;
	}// Of distance

The core of the code:

	/**
	 ********************
	 * Clustering.
	 *********************
	 */
	public void clustering() {
		int[] tempOldClusterArray = new int[dataset.numInstances()];
		// Make the old and new assignments differ so that the while loop runs at least once.
		tempOldClusterArray[0] = -1;
		int[] tempClusterArray = new int[dataset.numInstances()];
		Arrays.fill(tempClusterArray, 0);
		double[][] tempCenters = new double[numClusters][dataset.numAttributes() - 1];

		// Step 1. Initialize centers.
		int[] tempRandomOrders = getRandomIndices(dataset.numInstances());
		for (int i = 0; i < numClusters; i++) {
			for (int j = 0; j < tempCenters[0].length; j++) {
				tempCenters[i][j] = dataset.instance(tempRandomOrders[i]).value(j);
			} // Of for j
		} // Of for i

		int[] tempClusterLengths = null;
		while (!Arrays.equals(tempOldClusterArray, tempClusterArray)) {
			System.out.println("New loop...");
			tempOldClusterArray = tempClusterArray;
			tempClusterArray = new int[dataset.numInstances()];

			// Step 2.1 Minimization.Assign cluster to each instance.
			int tempNearestCenter;
			double tempNearestDistance;
			double tempDistance;

			for (int i = 0; i < dataset.numInstances(); i++) {
				tempNearestCenter = -1;
				tempNearestDistance = Double.MAX_VALUE;

				for (int j = 0; j < numClusters; j++) {
					tempDistance = distance(i, tempCenters[j]);
					if (tempDistance < tempNearestDistance) {
						tempNearestDistance = tempDistance;
						tempNearestCenter = j;
					} // Of if
				} // Of for j
				tempClusterArray[i] = tempNearestCenter;
			} // Of for i

			// Step 2.2 Mean. Find new centers.
			tempClusterLengths = new int[numClusters];
			Arrays.fill(tempClusterLengths, 0);
			double[][] tempNewCenters = new double[numClusters][dataset.numAttributes() - 1];
			for (int i = 0; i < dataset.numInstances(); i++) {
				for (int j = 0; j < tempNewCenters[0].length; j++) {
					tempNewCenters[tempClusterArray[i]][j] += dataset.instance(i).value(j);
				} // Of for j
				tempClusterLengths[tempClusterArray[i]]++;
			} // Of for i

			// Step 2.3 Now average.
			for (int i = 0; i < tempNewCenters.length; i++) {
				for (int j = 0; j < tempNewCenters[0].length; j++) {
					tempNewCenters[i][j] /= tempClusterLengths[i];
				} // Of for j
			} // Of for i

			System.out.println("Now the new centers are: " + Arrays.deepToString(tempNewCenters));
			tempCenters = tempNewCenters;
		} // Of while

		// Step 3. Form clusters.
		clusters = new int[numClusters][];
		int[] tempCounters = new int[numClusters];
		for (int i = 0; i < numClusters; i++) {
			clusters[i] = new int[tempClusterLengths[i]];
		} // Of for i

		for (int i = 0; i < tempClusterArray.length; i++) {
			clusters[tempClusterArray[i]][tempCounters[tempClusterArray[i]]] = i;
			tempCounters[tempClusterArray[i]]++;
		} // Of for i

		System.out.println("The clusters are: " + Arrays.deepToString(clusters));
	}// Of clustering

At the beginning, numClusters samples need to be selected at random as the initial centers. The randomization here is done through getRandomIndices:

		int[] tempRandomOrders = getRandomIndices(dataset.numInstances());
		for (int i = 0; i < numClusters; i++) {
			for (int j = 0; j < tempCenters[0].length; j++) {
				tempCenters[i][j] = dataset.instance(tempRandomOrders[i]).value(j);
			} // Of for j
		} // Of for i

To get a better feel for the iterative process, I want to show it through plots. The samples have four features, namely the length and width of the petals and the length and width of the sepals, so I use two plots, one per pair of features:

(Does the cluster numbered 1 look rather poor? I think there are two reasons: first, the initialization is random after all, so a center may land a bit off; second, I am using two 2-D plots to represent four dimensions, so it may not look like this in the real four-dimensional space. Don't panic yet; let's follow the process and see.)

The while loop iterates to approximate the optimal solution. When the cluster assignments no longer change (the samples of each cluster stay the same), the clustering has converged.
The following for loop assigns each sample to the nearest cluster (the one whose center is closest to the sample):

for (int i = 0; i < dataset.numInstances(); i++) {
	tempNearestCenter = -1;
	tempNearestDistance = Double.MAX_VALUE;

	for (int j = 0; j < numClusters; j++) {
		tempDistance = distance(i, tempCenters[j]);
		if (tempDistance < tempNearestDistance) {
			tempNearestDistance = tempDistance;
			tempNearestCenter = j;
		} // Of if
	} // Of for j
	tempClusterArray[i] = tempNearestCenter;
} // Of for i

After the samples have been reassigned as above, the membership of each cluster may have changed, so the centers are updated:

tempClusterLengths = new int[numClusters];
Arrays.fill(tempClusterLengths, 0);
double[][] tempNewCenters = new double[numClusters][dataset.numAttributes() - 1];
for (int i = 0; i < dataset.numInstances(); i++) {
	for (int j = 0; j < tempNewCenters[0].length; j++) {
		tempNewCenters[tempClusterArray[i]][j] += dataset.instance(i).value(j);
	} // Of for j
	tempClusterLengths[tempClusterArray[i]]++;
} // Of for i

// Step 2.3 Now average.
for (int i = 0; i < tempNewCenters.length; i++) {
	for (int j = 0; j < tempNewCenters[0].length; j++) {
		tempNewCenters[i][j] /= tempClusterLengths[i];
	} // Of for j
} // Of for i

After updating the centers, the assignments are updated again, and this repeats until the membership of each cluster no longer changes, at which point the while loop exits.
Now let's follow the update process:

(Figures omitted: the cluster plots during the iterations.)

The figures above end with the result of the final clustering. Now let's see what the plot drawn with the data's own labels looks like (figure omitted):

Here I did not compare the clustering result with the data's own labels to measure accuracy, because in a real setting we cannot obtain labels at all.
Still, it is not hard to see that the kMeans algorithm is quite powerful!
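
That said, since iris.arff does ship with labels, one way to sanity-check the result would be a simple purity measure. The following is only a sketch of my own (not part of the original code); it assumes the last attribute of the data set is the nominal class label:

	/**
	 * A rough purity measure: for each cluster, count the most frequent true
	 * label and divide the total by the number of instances. A sketch only;
	 * it assumes the last attribute is the (nominal) class label.
	 */
	public double computePurity() {
		int tempCorrect = 0;
		int tempNumLabels = dataset.attribute(dataset.numAttributes() - 1).numValues();
		for (int i = 0; i < clusters.length; i++) {
			int[] tempLabelCounts = new int[tempNumLabels];
			for (int j = 0; j < clusters[i].length; j++) {
				int tempLabel = (int) dataset.instance(clusters[i][j]).value(dataset.numAttributes() - 1);
				tempLabelCounts[tempLabel]++;
			} // Of for j
			int tempMax = 0;
			for (int j = 0; j < tempNumLabels; j++) {
				if (tempLabelCounts[j] > tempMax) {
					tempMax = tempLabelCounts[j];
				} // Of if
			} // Of for j
			tempCorrect += tempMax;
		} // Of for i
		return (double) tempCorrect / dataset.numInstances();
	}// Of computePurity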

Test code:

	/**
	 ********************
	 * Test clustering.
	 *********************
	 */
	public static void testClustering() {
		KMeans tempKMeans = new KMeans("F:/sampledataMain/iris.arff");
		tempKMeans.setNumClusters(3);
		tempKMeans.clustering();
	}// Of testClustering
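
To actually run the test, a main method is needed; a minimal version (the data path is the author's and would need to be adapted to your machine) could be:

	/**
	 ********************
	 * The entrance of the program.
	 *
	 * @param args Not used now.
	 *********************
	 */
	public static void main(String[] args) {
		testClustering();
	}// Of main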

Requirement: after obtaining the virtual centers, replace each of them with the nearest actual sample and use that as the real center, then cluster again.
I defined a member variable:

	/**
	 * The nearest instances.
	 */
	int[] nearestInstances;

Initialize and update values in clustering:

	/**
	 ********************
	 * Clustering.
	 *********************
	 */
	public void clustering() {
		int[] tempOldClusterArray = new int[dataset.numInstances()];
		tempOldClusterArray[0] = -1;
		int[] tempClusterArray = new int[dataset.numInstances()];
		Arrays.fill(tempClusterArray, 0);
		double[][] tempCenters = new double[numClusters][dataset.numAttributes() - 1];
		nearestInstances = new int[numClusters];

		// Step 1. Initialize centers.
		int[] tempRandomOrders = getRandomIndices(dataset.numInstances());
		for (int i = 0; i < numClusters; i++) {
			for (int j = 0; j < tempCenters[0].length; j++) {
				tempCenters[i][j] = dataset.instance(tempRandomOrders[i]).value(j);
			} // Of for j
			nearestInstances[i] = tempRandomOrders[i];
		} // Of for i

		int[] tempClusterLengths = null;
		while (!Arrays.equals(tempOldClusterArray, tempClusterArray)) {
			System.out.println("New loop...");
			tempOldClusterArray = tempClusterArray;
			tempClusterArray = new int[dataset.numInstances()];

			// Step 2.1 Minimization. Assign a cluster to each instance.
			int tempNearestCenter;
			double tempNearestDistance;
			double tempDistance;
			// The smallest sample-to-center distance seen so far for each cluster,
			// used to track the instance closest to each (virtual) center.
			double[] tempMinDiffDistance = new double[numClusters];
			Arrays.fill(tempMinDiffDistance, Double.MAX_VALUE);

			for (int i = 0; i < dataset.numInstances(); i++) {
				tempNearestCenter = -1;
				tempNearestDistance = Double.MAX_VALUE;

				for (int j = 0; j < numClusters; j++) {
					tempDistance = distance(i, tempCenters[j]);
					if (tempDistance < tempNearestDistance) {
						tempNearestDistance = tempDistance;
						tempNearestCenter = j;
					} // Of if
				} // Of for j
				tempClusterArray[i] = tempNearestCenter;

				// Update the nearest instance index of the assigned cluster.
				if (tempNearestDistance < tempMinDiffDistance[tempNearestCenter]) {
					tempMinDiffDistance[tempNearestCenter] = tempNearestDistance;
					nearestInstances[tempNearestCenter] = i;
				} // Of if
			} // Of for i

			// Step 2.2 Mean. Find new centers.
			tempClusterLengths = new int[numClusters];
			Arrays.fill(tempClusterLengths, 0);
			double[][] tempNewCenters = new double[numClusters][dataset.numAttributes() - 1];
			for (int i = 0; i < dataset.numInstances(); i++) {
				for (int j = 0; j < tempNewCenters[0].length; j++) {
					tempNewCenters[tempClusterArray[i]][j] += dataset.instance(i).value(j);
				} // Of for j
				tempClusterLengths[tempClusterArray[i]]++;
			} // Of for i

			// Step 2.3 Now average.
			for (int i = 0; i < tempNewCenters.length; i++) {
				for (int j = 0; j < tempNewCenters[0].length; j++) {
					tempNewCenters[i][j] /= tempClusterLengths[i];
				} // Of for j
			} // Of for i

			System.out.println("Now the new centers are: " + Arrays.deepToString(tempNewCenters));
			tempCenters = tempNewCenters;
		} // Of while

		// Step 3. Form clusters.
		clusters = new int[numClusters][];
		int[] tempCounters = new int[numClusters];
		for (int i = 0; i < numClusters; i++) {
			clusters[i] = new int[tempClusterLengths[i]];
		} // Of for i

		for (int i = 0; i < tempClusterArray.length; i++) {
			clusters[tempClusterArray[i]][tempCounters[tempClusterArray[i]]] = i;
			tempCounters[tempClusterArray[i]]++;
		} // Of for i

		System.out.println("The clusters are: " + Arrays.deepToString(clusters));
	}// Of clustering

Then, a method that forms the clusters using the actual nearest samples as centers is defined:

	/**
	 ********************
	 * Clustering with instance.
	 *********************
	 */
	public void clusteringWithInstance() {
		int tempNearestCenter;
		double tempNearestDistance;
		double tempDistance;
		int[] tempClusterArray = new int[dataset.numInstances()];
		int[] tempClusterLengths = new int[numClusters];

		// Convert the nearest instances into center vectors, since distance()
		// expects a double[] as its second argument.
		double[][] tempCenters = new double[numClusters][dataset.numAttributes() - 1];
		for (int i = 0; i < numClusters; i++) {
			for (int j = 0; j < tempCenters[0].length; j++) {
				tempCenters[i][j] = dataset.instance(nearestInstances[i]).value(j);
			} // Of for j
		} // Of for i

		for (int i = 0; i < dataset.numInstances(); i++) {
			tempNearestCenter = -1;
			tempNearestDistance = Double.MAX_VALUE;
			for (int j = 0; j < numClusters; j++) {
				tempDistance = distance(i, tempCenters[j]);
				if (tempDistance < tempNearestDistance) {
					tempNearestDistance = tempDistance;
					tempNearestCenter = j;
				} // Of if
			} // Of for j
			tempClusterArray[i] = tempNearestCenter;
			tempClusterLengths[tempNearestCenter]++;
		} // Of for i
		clusters = new int[numClusters][];
		int[] tempCounters = new int[numClusters];
		for (int i = 0; i < numClusters; i++) {
			clusters[i] = new int[tempClusterLengths[i]];
		} // Of for i

		for (int i = 0; i < tempClusterArray.length; i++) {
			clusters[tempClusterArray[i]][tempCounters[tempClusterArray[i]]] = i;
			tempCounters[tempClusterArray[i]]++;
		} // Of for i

		System.out.println("The clusters are: " + Arrays.deepToString(clusters));
	}//Of clusteringWithInstance

Update test code:

	/**
	 ********************
	 * Test clustering.
	 *********************
	 */
	public static void testClustering() {
		KMeans tempKMeans = new KMeans("F:/sampledataMain/iris.arff");
		tempKMeans.setNumClusters(3);
		tempKMeans.clustering();
		tempKMeans.clusteringWithInstance();
	}// Of testClustering

Comparison of the clustering results obtained with the virtual centers and with the actual nearest samples as centers (the blue marks show the result of clustering with the actual samples; figure omitted):
