Concepts

k-means is an unsupervised learning algorithm: we do not know in advance which class each data point belongs to, and the algorithm has to group the points itself. Its basic idea is to gather densely packed points into the same cluster.

The procedure:

  1. Randomly choose initial centers for the k clusters.
  2. In each of c iterations, compute the distance from every sample to each center and assign the sample to the cluster whose center is nearest.
  3. Update each center, e.g. as the mean of the points assigned to it.
  4. Stop early if two consecutive iterations produce the same assignments.

For example, if the three points (1, 1), (2, 3), (4, 6) form one cluster, the new center is ((1+2+4)/3, (1+3+6)/3); a center does not have to coincide with any of the original points.
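As a quick check of that update step, here is a minimal NumPy snippet using the three example points above:

import numpy as np

cluster = np.array([[1, 1], [2, 3], [4, 6]])  # points assigned to one cluster
print(np.mean(cluster, axis=0))               # [2.33333333 3.33333333]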

Implementation

import numpy as np

# Function: K Means
# -------------
# K-Means is an algorithm that takes in a dataset and a constant
# k and returns k centroids (which define clusters of data in the
# dataset which are similar to one another).
def kmeans(X, k, maxIt):  # maxIt is the maximum number of iterations
    numPoints, numDim = X.shape

    # Append one extra column to hold each point's cluster label.
    dataSet = np.zeros((numPoints, numDim + 1))
    dataSet[:, :-1] = X

    # Initialize centroids by picking k distinct data points at random.
    centroids = dataSet[np.random.choice(numPoints, k, replace=False), :]
    #centroids = dataSet[0:2, :]
    # Randomly assign labels to the initial centroids
    centroids[:, -1] = range(1, k + 1)

    # Initialize book keeping vars.
    iterations = 0
    oldCentroids = None

    # Run the main k-means algorithm
    while not shouldStop(oldCentroids, centroids, iterations, maxIt):
        print("iteration: \n", iterations)
        print("dataSet: \n", dataSet)
        print("centroids: \n", centroids)
        # Save old centroids for convergence test. Book keeping.
        oldCentroids = np.copy(centroids)
        iterations += 1

        # Assign labels to each datapoint based on centroids
        updateLabels(dataSet, centroids)

        # Assign centroids based on datapoint labels
        centroids = getCentroids(dataSet, k)

    # The label of each point is stored in the last column of dataSet.
    return dataSet

# Function: Should Stop
# -------------
# Returns True or False if k-means is done. K-means terminates either
# because it has run a maximum number of iterations OR the centroids
# stop changing.
def shouldStop(oldCentroids, centroids, iterations, maxIt):
    if iterations >= maxIt:
        return True
    return np.array_equal(oldCentroids, centroids)

# Function: Update Labels
# -------------
# Update the label for each piece of data in the dataset.
def updateLabels(dataSet, centroids):
    # For each element in the dataset, choose the closest centroid.
    # Make that centroid the element's label.
    numPoints, numDim = dataSet.shape
    for i in range(0, numPoints):
        dataSet[i, -1] = getLabelFromClosestCentroid(dataSet[i, :-1], centroids)


def getLabelFromClosestCentroid(dataSetRow, centroids):
    label = centroids[0, -1]
    minDist = np.linalg.norm(dataSetRow - centroids[0, :-1])
    for i in range(1, centroids.shape[0]):
        dist = np.linalg.norm(dataSetRow - centroids[i, :-1])
        if dist < minDist:
            minDist = dist
            label = centroids[i, -1]
    print("minDist:", minDist)
    return label


# Function: Get Centroids
# -------------
# Returns the k updated centroids, each of dimension n.
def getCentroids(dataSet, k):  # compute the new centroids
    # Each centroid is the geometric mean of the points that
    # have that centroid's label. Important: If a centroid is empty (no points
    # have that centroid's label), it is randomly re-initialized.
    result = np.zeros((k, dataSet.shape[1]))
    for i in range(1, k + 1):
        oneCluster = dataSet[dataSet[:, -1] == i, :-1]
        if oneCluster.shape[0] == 0:
            # Empty cluster: fall back to a random data point.
            oneCluster = dataSet[np.random.randint(dataSet.shape[0]), :-1].reshape(1, -1)
        result[i - 1, :-1] = np.mean(oneCluster, axis=0)
        result[i - 1, -1] = i

    return result


x1 = np.array([1, 1])
x2 = np.array([2, 1])
x3 = np.array([4, 3])
x4 = np.array([5, 4])
testX = np.vstack((x1, x2, x3, x4))

result = kmeans(testX, 2, 10)
print("final result:")
print(result)
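To sanity-check the implementation above, the same data can be clustered with scikit-learn's KMeans. This is a cross-check under the assumption that scikit-learn is installed; it is not part of the original code:

from sklearn.cluster import KMeans
import numpy as np

testX = np.vstack(([1, 1], [2, 1], [4, 3], [5, 4]))

# n_clusters plays the role of k; n_init restarts protect against bad seeds.
km = KMeans(n_clusters=2, n_init=10).fit(testX)
print(km.cluster_centers_)  # expected close to [[1.5, 1.0], [4.5, 3.5]]
print(km.labels_)           # the cluster index assigned to each point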

The SLIC algorithm

Generally speaking, k-means only considers the color space, whereas SLIC also takes spatial position into account; moreover, it clusters locally, which makes it faster.

SLIC works in the Lab color space: L is lightness, a spans the range from magenta to green, and b spans the range from yellow to blue.
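For reference, an RGB image can be converted to Lab with scikit-image (assuming it is installed; rgb2lab is part of its public API):

from skimage.color import rgb2lab
import numpy as np

rgb = np.random.rand(4, 4, 3)   # a tiny float RGB image with values in [0, 1]
lab = rgb2lab(rgb)              # channel 0: L (lightness), 1: a, 2: b
print(lab.shape)                # (4, 4, 3)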

The steps:

  • Initialize the seed points: given K cluster centers (i.e. K superpixels), distribute them uniformly over the image, so each superpixel initially occupies a square of side S = sqrt(N/K), where N is the number of pixels.
  • Re-select each seed within its n*n neighborhood (typically 3*3): compute the gradient at every pixel in that neighborhood and move the seed to the position with the smallest gradient. This keeps seeds off edges, which would otherwise degrade the result.
  • Assign labels to the pixels in the 2S*2S window around each seed, using the distance measure below (a code sketch of it follows this list):

$d_c = \sqrt{(l_j - l_i)^2 + (a_j - a_i)^2 + (b_j - b_i)^2}$
$d_s = \sqrt{(x_j - x_i)^2 + (y_j - y_i)^2}$
$D = \sqrt{\left(\frac{d_c}{N_c}\right)^2 + \left(\frac{d_s}{N_s}\right)^2}$

Here $N_s$ is the S defined above, and $N_c$ takes a value in [1, 40], typically 10.

  • Then recompute the seed centers and re-cluster, repeating until the iteration limit is reached or the change becomes negligible.
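As an illustration of that distance measure, here is a minimal sketch; the function slicDistance and its argument layout are made up for this example, not taken from any library:

import numpy as np

def slicDistance(p1, p2, Ns, Nc=10):
    # p1, p2 are 5-vectors (l, a, b, x, y): Lab color plus pixel position.
    dc = np.linalg.norm(p1[:3] - p2[:3])  # color distance d_c in Lab space
    ds = np.linalg.norm(p1[3:] - p2[3:])  # spatial distance d_s in the image plane
    # N_s (= S) normalizes the spatial term; N_c trades color off against space.
    return np.sqrt((dc / Nc) ** 2 + (ds / Ns) ** 2)

In practice a ready-made implementation such as skimage.segmentation.slic can be used; its compactness parameter plays roughly the role of $N_c$.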

The mean-shift algorithm

The biggest advantage of mean-shift is that it finds the number of clusters automatically, which reduces manual labeling effort, at the cost of considerably more computation.

It is based on the idea that the denser a region of points is, the more likely it contains a cluster center. We can pick an arbitrary starting point; the algorithm then moves it step by step toward denser regions.

The basic mean-shift vector

$M_h(x) = \frac{1}{K} \sum_{x_i \in S_h} (x_i - x)$

where $S_h$ is the set of points falling inside a high-dimensional ball of radius h around x, and K is their count. The formula means: whichever direction holds more points, shift that way. However, it treats every point within radius h equally, while in reality closer points should have more influence.
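A minimal sketch of this basic vector (basicMeanShiftVector is a made-up name for illustration):

import numpy as np

def basicMeanShiftVector(x, points, h):
    # Keep only the points inside the ball S_h of radius h around x.
    inBall = points[np.linalg.norm(points - x, axis=1) <= h]
    # Average offset toward those points: M_h(x) = (1/K) * sum(x_i - x).
    # Assumes at least one point falls inside the ball.
    return np.mean(inBall - x, axis=0)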

The improved form

Based on this idea, we can attach a weight to each point so that the vector is more sensitive to nearby points:

$M_h(x) = \frac{\sum_{i=1}^{n} G\left(\frac{x_i - x}{h}\right) w(x_i)\,(x_i - x)}{\sum_{i=1}^{n} G\left(\frac{x_i - x}{h}\right) w(x_i)}$

where G is a Gaussian kernel (the closer the distance is to 0, the larger its value), $w(x_i)$ is a weight function, and the denominator normalizes the sum.

Thus the point reached after each step is

$m_h(x) = x + M_h(x) = \frac{\sum_{i=1}^{n} G\left(\frac{x_i - x}{h}\right) w(x_i)\,x_i}{\sum_{i=1}^{n} G\left(\frac{x_i - x}{h}\right) w(x_i)}$

and the iteration repeats from there until the shift becomes negligible.
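A minimal sketch of this weighted iteration, assuming uniform weights $w(x_i) = 1$ (both function names are made up for illustration):

import numpy as np

def gaussianKernel(u):
    # Larger for offsets near 0, so nearby points weigh more.
    return np.exp(-0.5 * np.sum(u ** 2, axis=1))

def meanShiftPoint(x, points, h, steps=50, tol=1e-5):
    # Repeatedly move x to the weighted mean m_h(x) until it stabilizes.
    for _ in range(steps):
        g = gaussianKernel((points - x) / h)
        newX = g @ points / g.sum()         # m_h(x) with w(x_i) = 1
        if np.linalg.norm(newX - x) < tol:  # shift is negligible: converged
            break
        x = newX
    return x

# Two dense blobs; a start near either blob drifts to that blob's center.
pts = np.vstack((np.random.randn(50, 2), np.random.randn(50, 2) + 5))
print(meanShiftPoint(np.array([0.5, 0.5]), pts, h=1.0))

Running this from every data point and merging endpoints that coincide yields the clusters, which is how the number of clusters emerges automatically; sklearn.cluster.MeanShift offers a ready-made version of this procedure.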