Concepts

k-means is an unsupervised learning algorithm: we do not know in advance which class each data point belongs to, and the algorithm has to group the points itself. Its basic idea is to gather densely packed points into the same cluster.

The procedure:

  1. Randomly choose initial centers for the k clusters.
  2. In each of c iterations, compute the distance from every sample to each center and assign the sample to the cluster whose center is nearest.
  3. Update each center, e.g. as the mean of the points assigned to it.
  4. Stop early if two consecutive iterations produce the same assignments.

For example, if the three points (1, 1), (2, 3), (4, 6) form one cluster, the new center is ((1+2+4)/3, (1+3+6)/3); a center does not have to coincide with any of the original points.
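As a quick check of that update step, here is a minimal NumPy snippet using the three example points above:

import numpy as np

cluster = np.array([[1, 1], [2, 3], [4, 6]])  # points assigned to one cluster
print(np.mean(cluster, axis=0))               # [2.33333333 3.33333333]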

Implementation

import numpy as np

# Function: K Means
# -------------
# K-Means is an algorithm that takes in a dataset and a constant
# k and returns k centroids (which define clusters of data in the
# dataset which are similar to one another).
def kmeans(X, k, maxIt):  # maxIt is the maximum number of iterations
    numPoints, numDim = X.shape

    # Append one extra column to hold each point's cluster label.
    dataSet = np.zeros((numPoints, numDim + 1))
    dataSet[:, :-1] = X

    # Initialize centroids by picking k distinct data points at random.
    centroids = dataSet[np.random.choice(numPoints, k, replace=False), :]
    #centroids = dataSet[0:2, :]
    # Randomly assign labels to the initial centroids
    centroids[:, -1] = range(1, k + 1)

    # Initialize book keeping vars.
    iterations = 0
    oldCentroids = None

    # Run the main k-means algorithm
    while not shouldStop(oldCentroids, centroids, iterations, maxIt):
        print("iteration: \n", iterations)
        print("dataSet: \n", dataSet)
        print("centroids: \n", centroids)
        # Save old centroids for convergence test. Book keeping.
        oldCentroids = np.copy(centroids)
        iterations += 1

        # Assign labels to each datapoint based on centroids
        updateLabels(dataSet, centroids)

        # Assign centroids based on datapoint labels
        centroids = getCentroids(dataSet, k)

    # The label of each point is stored in the last column of dataSet.
    return dataSet

# Function: Should Stop
# -------------
# Returns True or False if k-means is done. K-means terminates either
# because it has run a maximum number of iterations OR the centroids
# stop changing.
def shouldStop(oldCentroids, centroids, iterations, maxIt):
    if iterations >= maxIt:
        return True
    return np.array_equal(oldCentroids, centroids)

# Function: Update Labels
# -------------
# Update the label for each piece of data in the dataset.
def updateLabels(dataSet, centroids):
    # For each element in the dataset, choose the closest centroid.
    # Make that centroid the element's label.
    numPoints, numDim = dataSet.shape
    for i in range(0, numPoints):
        dataSet[i, -1] = getLabelFromClosestCentroid(dataSet[i, :-1], centroids)


def getLabelFromClosestCentroid(dataSetRow, centroids):
    label = centroids[0, -1]
    minDist = np.linalg.norm(dataSetRow - centroids[0, :-1])
    for i in range(1, centroids.shape[0]):
        dist = np.linalg.norm(dataSetRow - centroids[i, :-1])
        if dist < minDist:
            minDist = dist
            label = centroids[i, -1]
    print("minDist:", minDist)
    return label


# Function: Get Centroids
# -------------
# Returns the k updated centroids, each of dimension n.
def getCentroids(dataSet, k):  # compute the new centroids
    # Each centroid is the geometric mean of the points that
    # have that centroid's label. Important: If a centroid is empty (no points
    # have that centroid's label), it is randomly re-initialized.
    result = np.zeros((k, dataSet.shape[1]))
    for i in range(1, k + 1):
        oneCluster = dataSet[dataSet[:, -1] == i, :-1]
        if oneCluster.shape[0] == 0:
            # Empty cluster: fall back to a random data point.
            oneCluster = dataSet[np.random.randint(dataSet.shape[0]), :-1].reshape(1, -1)
        result[i - 1, :-1] = np.mean(oneCluster, axis=0)
        result[i - 1, -1] = i

    return result


x1 = np.array([1, 1])
x2 = np.array([2, 1])
x3 = np.array([4, 3])
x4 = np.array([5, 4])
testX = np.vstack((x1, x2, x3, x4))

result = kmeans(testX, 2, 10)
print("final result:")
print(result)
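To sanity-check the implementation above, the same data can be clustered with scikit-learn's KMeans. This is a cross-check under the assumption that scikit-learn is installed; it is not part of the original code:

from sklearn.cluster import KMeans
import numpy as np

testX = np.vstack(([1, 1], [2, 1], [4, 3], [5, 4]))

# n_clusters plays the role of k; n_init restarts protect against bad seeds.
km = KMeans(n_clusters=2, n_init=10).fit(testX)
print(km.cluster_centers_)  # expected close to [[1.5, 1.0], [4.5, 3.5]]
print(km.labels_)           # the cluster index assigned to each point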

The SLIC algorithm

Generally speaking, k-means only considers the color space, whereas SLIC also takes spatial position into account; moreover, it clusters locally, which makes it faster.

SLIC works in the Lab color space: L is lightness, a spans the range from magenta to green, and b spans the range from yellow to blue.
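For reference, an RGB image can be converted to Lab with scikit-image (assuming it is installed; rgb2lab is part of its public API):

from skimage.color import rgb2lab
import numpy as np

rgb = np.random.rand(4, 4, 3)   # a tiny float RGB image with values in [0, 1]
lab = rgb2lab(rgb)              # channel 0: L (lightness), 1: a, 2: b
print(lab.shape)                # (4, 4, 3)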

The steps:

  • Initialize the seed points: given K cluster centers (i.e. K superpixels), distribute them uniformly over the image, so each superpixel initially occupies a square of side S = sqrt(N/K), where N is the number of pixels.
  • Re-select each seed within its n*n neighborhood (typically 3*3): compute the gradient at every pixel in that neighborhood and move the seed to the position with the smallest gradient. This keeps seeds off edges, which would otherwise degrade the result.
  • Assign labels to the pixels in the 2S*2S window around each seed, using the distance measure below (a code sketch of it follows this list):

$d_c = \sqrt{(l_j - l_i)^2 + (a_j - a_i)^2 + (b_j - b_i)^2}$
$d_s = \sqrt{(x_j - x_i)^2 + (y_j - y_i)^2}$
$D = \sqrt{\left(\frac{d_c}{N_c}\right)^2 + \left(\frac{d_s}{N_s}\right)^2}$

Here $N_s$ is the S defined above, and $N_c$ takes a value in [1, 40], typically 10.

  • Then recompute the seed centers and re-cluster, repeating until the iteration limit is reached or the change becomes negligible.
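As an illustration of that distance measure, here is a minimal sketch; the function slicDistance and its argument layout are made up for this example, not taken from any library:

import numpy as np

def slicDistance(p1, p2, Ns, Nc=10):
    # p1, p2 are 5-vectors (l, a, b, x, y): Lab color plus pixel position.
    dc = np.linalg.norm(p1[:3] - p2[:3])  # color distance d_c in Lab space
    ds = np.linalg.norm(p1[3:] - p2[3:])  # spatial distance d_s in the image plane
    # N_s (= S) normalizes the spatial term; N_c trades color off against space.
    return np.sqrt((dc / Nc) ** 2 + (ds / Ns) ** 2)

In practice a ready-made implementation such as skimage.segmentation.slic can be used; its compactness parameter plays roughly the role of $N_c$.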

The mean-shift algorithm

The biggest advantage of mean-shift is that it finds the number of clusters automatically, which reduces manual labeling effort, at the cost of considerably more computation.

It is based on the idea that the denser a region of points is, the more likely it contains a cluster center. We can pick an arbitrary starting point; the algorithm then moves it step by step toward denser regions.

The basic mean-shift vector

$M_h(x) = \frac{1}{K} \sum_{x_i \in S_h} (x_i - x)$

where $S_h$ is the set of points falling inside a high-dimensional ball of radius h around x, and K is their count. The formula means: whichever direction holds more points, shift that way. However, it treats every point within radius h equally, while in reality closer points should have more influence.
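A minimal sketch of this basic vector (basicMeanShiftVector is a made-up name for illustration):

import numpy as np

def basicMeanShiftVector(x, points, h):
    # Keep only the points inside the ball S_h of radius h around x.
    inBall = points[np.linalg.norm(points - x, axis=1) <= h]
    # Average offset toward those points: M_h(x) = (1/K) * sum(x_i - x).
    # Assumes at least one point falls inside the ball.
    return np.mean(inBall - x, axis=0)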

The improved form

Based on this idea, we can attach a weight to each point so that the vector is more sensitive to nearby points:

$M_h(x) = \frac{\sum_{i=1}^{n} G\left(\frac{x_i - x}{h}\right) w(x_i)\,(x_i - x)}{\sum_{i=1}^{n} G\left(\frac{x_i - x}{h}\right) w(x_i)}$

where G is a Gaussian kernel (the closer the distance is to 0, the larger its value), $w(x_i)$ is a weight function, and the denominator normalizes the sum.

Thus the point reached after each step is

$m_h(x) = x + M_h(x) = \frac{\sum_{i=1}^{n} G\left(\frac{x_i - x}{h}\right) w(x_i)\,x_i}{\sum_{i=1}^{n} G\left(\frac{x_i - x}{h}\right) w(x_i)}$

and the iteration repeats from there until the shift becomes negligible.
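A minimal sketch of this weighted iteration, assuming uniform weights $w(x_i) = 1$ (both function names are made up for illustration):

import numpy as np

def gaussianKernel(u):
    # Larger for offsets near 0, so nearby points weigh more.
    return np.exp(-0.5 * np.sum(u ** 2, axis=1))

def meanShiftPoint(x, points, h, steps=50, tol=1e-5):
    # Repeatedly move x to the weighted mean m_h(x) until it stabilizes.
    for _ in range(steps):
        g = gaussianKernel((points - x) / h)
        newX = g @ points / g.sum()         # m_h(x) with w(x_i) = 1
        if np.linalg.norm(newX - x) < tol:  # shift is negligible: converged
            break
        x = newX
    return x

# Two dense blobs; a start near either blob drifts to that blob's center.
pts = np.vstack((np.random.randn(50, 2), np.random.randn(50, 2) + 5))
print(meanShiftPoint(np.array([0.5, 0.5]), pts, h=1.0))

Running this from every data point and merging endpoints that coincide yields the clusters, which is how the number of clusters emerges automatically; sklearn.cluster.MeanShift offers a ready-made version of this procedure.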