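A Brief Guide to Python Implementations of Several Unsupervised Clustering Algorithms (Implementing Unsupervised Clustering Algorithms in Python, Explained)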

Original article by ithorizon, posted 7 months ago (10-20), filed under #Backend Development

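In machine learning and data mining, unsupervised clustering is an important technique: without any label information, it partitions data into groups based on the data's intrinsic characteristics. This article describes several commonly used unsupervised clustering algorithms and gives a Python implementation of each.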

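1. K-means Clustering

K-means is a distance-based clustering method. Its basic idea is to iteratively search for K cluster centers (centroids) so that the sum of squared distances from each data point to the centroid of its cluster is minimized.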
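
To make that objective concrete, here is a minimal sketch that evaluates it for a tiny hand-made dataset. The function name `kmeans_objective` and the sample points are illustrative assumptions, not part of the original article.

```python
import numpy as np

def kmeans_objective(data, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    diffs = data - centroids[labels]  # offset of each point from its own centroid
    return float(np.sum(diffs ** 2))

# Hypothetical sample data: two obvious groups
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 1.0], [10.0, 1.0]])

objective = kmeans_objective(points, labels, centroids)
print(objective)  # every point lies 1 unit from its centroid, so the total is 4.0
```

Assigning the points to the wrong centroids increases this value, which is why each K-means iteration reassigns every point to its nearest centroid.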

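1.1 Algorithm Steps

1. Randomly choose K initial centroids.

2. Compute the distance from each data point to every centroid, and assign the point to the cluster of its nearest centroid.

3. Update the centroid of each cluster to the mean of the points assigned to it.

4. Repeat steps 2 and 3 until the centroids stop changing or the maximum number of iterations is reached.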

1.2 Python Implementation


import numpy as np

def kmeans(data, k, max_iter=100):
    # Pick k distinct data points as the initial centroids
    centroids = data[np.random.choice(data.shape[0], k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: group each point under its nearest centroid
        clusters = {i: [] for i in range(k)}
        for x in data:
            distances = np.linalg.norm(x - centroids, axis=1)
            closest = int(np.argmin(distances))
            clusters[closest].append(x)

        # Update step: move each centroid to the mean of its points;
        # an empty cluster keeps its previous centroid so the shape stays (k, d)
        new_centroids = np.array([
            np.mean(clusters[i], axis=0) if clusters[i] else centroids[i]
            for i in range(k)
        ])

        # Stop once the centroids no longer move
        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids

    return centroids, clusters

data = np.random.rand(100, 2)
k = 3
centroids, clusters = kmeans(data, k)
print("Centroids: ", centroids)

2. DBSCAN Clustering

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering method. It can find clusters of arbitrary shape and can handle noisy data.

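2.1 Algorithm Steps

1. For each point in the dataset, count the points inside its ε-neighborhood.

2. If the ε-neighborhood of a point contains at least MinPts points, that point is a core point.

3. For each core point, collect all points that are density-reachable from it to form a cluster.

4. Repeat steps 2 and 3 until every point has been processed. Points that are not reachable from any core point are treated as noise.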

2.2 Python Implementation


import numpy as np
from sklearn.neighbors import NearestNeighbors

def dbscan(data, eps, min_samples):
    # For every point, find the indices of all points within distance eps
    # (each point is included in its own neighborhood)
    neighbors = NearestNeighbors(radius=eps).fit(data)
    indices = neighbors.radius_neighbors(data, return_distance=False)
    is_core = np.array([len(idx) >= min_samples for idx in indices])

    labels = np.full(len(data), -1)  # -1 marks noise or not yet assigned
    cluster_id = 0

    for i in range(len(data)):
        # Start a new cluster only from an unassigned core point
        if labels[i] != -1 or not is_core[i]:
            continue

        labels[i] = cluster_id
        seeds = [i]
        while seeds:
            current = seeds.pop()
            # Only core points spread the cluster; border points are
            # labeled but not expanded
            if not is_core[current]:
                continue
            for point_idx in indices[current]:
                if labels[point_idx] == -1:
                    labels[point_idx] = cluster_id
                    seeds.append(point_idx)

        cluster_id += 1

    return labels

data = np.random.rand(100, 2)
eps = 0.3
min_samples = 5
labels = dbscan(data, eps, min_samples)
print("Labels: ", labels)
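
For comparison, scikit-learn ships its own implementation as `sklearn.cluster.DBSCAN`. The sketch below runs it on a small hypothetical dataset (two tight groups plus one distant outlier, chosen here for illustration) to show how noise points are reported with the label -1.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: two tight groups and one far-away outlier
toy = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1],
    [20.0, 20.0],
])

# In scikit-learn, min_samples counts the point itself
toy_labels = DBSCAN(eps=0.3, min_samples=3).fit_predict(toy)
print(toy_labels)  # the last point has no neighbors within eps and gets label -1
```

With uniformly random data such as `np.random.rand(100, 2)` and `eps = 0.3`, most points tend to end up in a single cluster, so a structured dataset like this one makes the behavior easier to see.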

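3. Hierarchical Clustering

The agglomerative form of hierarchical clustering described here is a bottom-up method. It computes the distances between data points and step by step merges the closest points or clusters, eventually producing a cluster tree (dendrogram).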

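3.1 Algorithm Steps

1. Treat each data point as its own cluster.

2. Compute the distances between all clusters and merge the two closest clusters.

3. Update the distance matrix between clusters.

4. Repeat steps 2 and 3 until only one cluster remains.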

3.2 Python Implementation


import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

def hierarchical_clustering(data):
    # Build the merge tree with Ward linkage
    linked = linkage(data, 'ward')
    # Label the leaves 1..n
    label_list = list(range(1, len(data) + 1))
    plt.figure(figsize=(10, 7))
    dendrogram(linked, orientation='top', labels=label_list,
               distance_sort='descending', show_leaf_counts=True)
    plt.title('Hierarchical Clustering Dendrogram')
    plt.xlabel('Sample Index')
    plt.ylabel('Distance')
    plt.show()

data = np.random.rand(100, 2)
hierarchical_clustering(data)
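
The dendrogram shows the full merge history but does not assign cluster labels by itself. One common way to get flat labels is to cut the tree with SciPy's `fcluster`. The following is a minimal sketch on hypothetical toy data; the choice of two clusters is an assumption made for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical toy data: two well-separated groups of three points
toy = np.array([
    [0.0, 0.0], [0.2, 0.0], [0.0, 0.2],
    [5.0, 5.0], [5.2, 5.0], [5.0, 5.2],
])

linked = linkage(toy, 'ward')  # same linkage method as in the function above

# Cut the tree so that at most two flat clusters remain
flat_labels = fcluster(linked, t=2, criterion='maxclust')
print(flat_labels)  # fcluster numbers clusters starting from 1
```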

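4. Gaussian Mixture Model (GMM) Clustering

A Gaussian mixture model (GMM) is a probabilistic model that assumes the data was generated by a mixture of several Gaussian distributions. Its parameters are fitted by maximum likelihood estimation, typically with the expectation-maximization (EM) algorithm.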

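4.1 Algorithm Steps

1. Initialize the parameters (means, variances, and mixing coefficients).

2. For each data point, compute the probability that it belongs to each Gaussian component (the E-step).

3. Update the parameters based on these probabilities (the M-step).

4. Repeat steps 2 and 3 until the parameters converge.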
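
The steps above can be sketched by hand for the one-dimensional case. This is a simplified illustration, not how scikit-learn implements it: the function name `em_step`, the synthetic data, the starting parameters, and the fixed count of 50 iterations (instead of a convergence check) are all assumptions.

```python
import numpy as np

def em_step(x, means, variances, weights):
    """One EM iteration for a 1-D Gaussian mixture."""
    # E-step: probability that each component generated each point
    diff = x[:, None] - means[None, :]
    densities = np.exp(-0.5 * diff ** 2 / variances) / np.sqrt(2 * np.pi * variances)
    weighted = weights * densities
    resp = weighted / weighted.sum(axis=1, keepdims=True)

    # M-step: re-estimate the parameters from those probabilities
    nk = resp.sum(axis=0)
    new_means = (resp * x[:, None]).sum(axis=0) / nk
    new_vars = (resp * (x[:, None] - new_means) ** 2).sum(axis=0) / nk
    new_weights = nk / len(x)
    return new_means, new_vars, new_weights, resp

# Synthetic data drawn from two Gaussians centered at 0 and 8
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(8.0, 1.0, 200)])

# Step 1: rough initial guesses
means = np.array([1.0, 6.0])
variances = np.array([1.0, 1.0])
weights = np.array([0.5, 0.5])

# Steps 2 to 4: alternate E-step and M-step
for _ in range(50):
    means, variances, weights, resp = em_step(x, means, variances, weights)

print(means, variances, weights)
```

The estimated means should move from the initial guesses toward the true centers of the two groups.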

4.2 Python Implementation


import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_clustering(data, n_components):
    # Fit a mixture of n_components Gaussians with EM
    gmm = GaussianMixture(n_components=n_components)
    gmm.fit(data)
    # Assign each point to its most probable component
    labels = gmm.predict(data)
    return labels, gmm

data = np.random.rand(100, 2)
n_components = 3
labels, gmm = gmm_clustering(data, n_components)
print("Labels: ", labels)
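
Unlike K-means, a GMM can also report how confident each assignment is. The sketch below uses `predict_proba` on hypothetical toy data (two blobs generated with a fixed seed, an assumption for illustration) to show these soft assignments next to the hard labels.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical toy data: two 2-D blobs generated with a fixed seed
rng = np.random.default_rng(0)
toy = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])

gmm = GaussianMixture(n_components=2, random_state=0).fit(toy)

hard = gmm.predict(toy)        # one label per point
soft = gmm.predict_proba(toy)  # one probability per point and component
print(soft[:3].round(3))       # each row sums to 1
```

The hard label of a point is simply the component with the largest probability in its row of `soft`.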