[Clustering] DBSCAN

Dan-k 2021. 1. 1. 16:38

DBSCAN: a density-based clustering technique

-> Whereas methods such as k-NN and k-means group points by measuring a fixed distance from each point (or centroid),

DBSCAN forms clusters based on the density of the data, i.e. how tightly the points are packed together.

A key advantage of DBSCAN is that it can find non-linear (non-convex) cluster shapes.

Clustering is controlled by two parameters: epsilon and minPts.

  • epsilon: the neighborhood radius around each point

  • minPts: the minimum number of samples that must fall within the epsilon radius

Density is estimated using the chosen epsilon and minPts, and clusters are formed from the dense regions.

- Points that do not fall within any dense neighborhood are treated as noise points (see the short sketch below).
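A minimal sketch of how these two parameters map onto scikit-learn's DBSCAN; the make_moons toy data and the eps=0.2 / min_samples=5 values below are illustrative assumptions, not tuned settings.

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: a non-convex shape that k-means handles poorly.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps (epsilon) is the neighborhood radius; min_samples (minPts) is the number
# of samples required inside that radius for a point to count as a core point.
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

labels = db.labels_  # points labeled -1 are noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)
print('clusters:', n_clusters, 'noise points:', n_noise)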

 

Code

print(__doc__)

import numpy as np

from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler


# #############################################################################
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4,
                            random_state=0)

X = StandardScaler().fit_transform(X)

# #############################################################################
# Compute DBSCAN
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)

print('Estimated number of clusters: %d' % n_clusters_)
print('Estimated number of noise points: %d' % n_noise_)
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))
print("Adjusted Rand Index: %0.3f"
      % metrics.adjusted_rand_score(labels_true, labels))
print("Adjusted Mutual Information: %0.3f"
      % metrics.adjusted_mutual_info_score(labels_true, labels))
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X, labels))

# #############################################################################
# Plot result
import matplotlib.pyplot as plt

# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]

    class_member_mask = (labels == k)

    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)

    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
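In the resulting plot, core samples of each cluster are drawn with large markers, non-core (border) members with small markers, and noise points in black.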

 

Reference: scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html#sphx-glr-auto-examples-cluster-plot-dbscan-py
 


en.wikipedia.org/wiki/DBSCAN

 


 
