CHAPTER 7: Nonlinear Featurization via K-Means Model Stacking
Nonlinear dimensionality reduction
Nonlinear dimensionality reduction is also called nonlinear embedding or manifold learning. Nonlinear embeddings are useful for aggressively compressing high-dimensional data into a low-dimensional representation. They are often used for visualization in two or three dimensions.
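As a minimal sketch (not from the book's code), the snippet below uses scikit-learn's Isomap, one of its manifold learning estimators, to embed the 3-D Swiss roll into two dimensions; the choice of estimator and its parameters are illustrative assumptions.
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

# Embed the 3-D Swiss roll into 2-D with a nonlinear manifold learner
X_roll, color = make_swiss_roll(n_samples=1500)
embedding = Isomap(n_neighbors=10, n_components=2).fit_transform(X_roll)
print(embedding.shape)  # (1500, 2): each 3-D point now has a 2-D coordinate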
Swiss roll: nonlinear manifold
import numpy as np
tt0 = 3 * np.pi / 2 * (1 + 2 * np.arange(0, 1.25, 0.01))
hh = np.arange(0, 1.125, 0.125) * 30
xx = np.transpose(np.tile(np.multiply(tt0, np.cos(tt0)), (len(hh), 1)))
yy = np.tile(hh, (len(tt0), 1))
zz = np.transpose(np.tile(np.multiply(tt0, np.sin(tt0)), (len(hh), 1)))
cc = np.transpose(np.tile((tt0-tt0.min())/(tt0.max()-tt0.min()), (len(hh), 1)))
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
import seaborn as sns
%matplotlib notebook
%matplotlib inline
sns.set_style('whitegrid')
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(xx, yy, zz, rstride=1, cstride=1,
                linewidth=0, antialiased=False,
                facecolors=cm.seismic_r(cc))
Clustering
The goal of feature engineering, however, isn’t so much to make the feature dimensions as low as possible, but to arrive at the right features for the task.
Clustering algorithms are usually not presented as techniques for local structure learning. But they in fact enable just that. Points that are close to each other belong to the same cluster.
Compared to nonlinear embedding techniques, clustering may produce more features. But if the end goal is feature engineering instead of visualization, this is not a problem.
We will illustrate the idea of local structure learning with a common clustering algorithm called k-means. It is simple to understand and implement. Instead of nonlinear manifold reduction, it is more apt to say that k-means performs nonlinear manifold feature extraction.
K-means Clustering
- unsupervised
- hard clustering: each data point is assigned to one and only one cluster
- depends on a distance metric for closeness (e.g., Euclidean distance); see the short sketch below
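As a minimal sketch of these two properties (not from the book's code, with made-up points and centers), each point is assigned to exactly one cluster, namely the one whose center is closest in Euclidean distance:
import numpy as np

# Toy points and cluster centers, purely illustrative
points = np.array([[0.0, 0.0], [1.0, 1.0], [9.0, 9.0]])
centers = np.array([[0.5, 0.5], [10.0, 10.0]])

# Pairwise Euclidean distances, shape (n_points, n_centers)
dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=-1)

# Hard assignment: each point belongs to one and only one cluster
assignments = dists.argmin(axis=1)
print(assignments)  # [0 0 1]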
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib notebook
%matplotlib inline
n_data = 1000
seed = 1
n_centers = 4
# Generate random Gaussian blobs and run K-means
blobs, blob_labels = make_blobs(n_samples=n_data, n_features=2,
                                centers=n_centers, random_state=seed)
clusters_blob = KMeans(n_clusters=n_centers, random_state=seed).fit_predict(blobs)
# Generate data uniformly at random and run K-means
uniform = np.random.rand(n_data, 2)
clusters_uniform = KMeans(n_clusters=n_centers, random_state=seed).fit_predict(uniform)
# Matplotlib incantations for visualizing results
figure = plt.figure(figsize=(20,10))
plt.subplot(221)
plt.scatter(blobs[:, 0], blobs[:, 1], c=blob_labels, edgecolors='k', cmap='PuRd')
plt.title("(a) Four randomly generated blobs", fontsize=14)
plt.axis('off')
plt.subplot(222)
plt.scatter(blobs[:, 0], blobs[:, 1], c=clusters_blob, edgecolors='k', cmap='PuRd')
plt.title("(b) Clusters found via K-means", fontsize=14)
plt.axis('off')
plt.subplot(223)
plt.scatter(uniform[:, 0], uniform[:, 1], edgecolors='k')
plt.title("(c) 1000 randomly generated pois", fontsize=14)
plt.axis('off')
plt.subplot(224)
plt.scatter(uniform[:, 0], uniform[:, 1], c=clusters_uniform, edgecolors='k', cmap='PuRd')
plt.title("(d) Clusters found via K-means", fontsize=14)
plt.axis('off')
Clustering as Surface Tiling
When data is spread out fairly uniformly, there is no longer a correct number of clusters. In this case, the role of a clustering algorithm is vector quantization: approximating the data with a fixed number of representative points (see the short sketch below).
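A minimal sketch of vector quantization with k-means (not from the book's code): each point is snapped to the centroid of its cluster, so the whole dataset is approximated by k representative vectors.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X_uniform = rng.rand(1000, 2)  # uniformly spread 2-D data
km = KMeans(n_clusters=16, random_state=0).fit(X_uniform)

# Quantize: snap every point to its cluster center
X_quantized = km.cluster_centers_[km.labels_]
print(np.unique(X_quantized, axis=0).shape)  # (16, 2): only 16 distinct vectors remain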
from mpl_toolkits.mplot3d import Axes3D
from sklearn import manifold, datasets
#Generate a noisy Swiss roll dataset
X, color = datasets.make_swiss_roll(n_samples=1500) # generate 1500 points at random on the Swiss roll surface
#Approximate the data with 100 k-means clusters
clusters_swiss_roll = KMeans(n_clusters=100, random_state=seed).fit_predict(X) #asked k-means to approximate it with 100 clusters
#Plot the dataset with k-means cluster IDs as the color
fig2 = plt.figure()
ax = fig2.add_subplot(111, projection='3d')
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=clusters_swiss_roll, cmap='Spectral')
The problem is that if we pick a k that is too small, the results won't be so nice from a manifold learning perspective: we can clearly see data from very different sections of the manifold being mapped to the same clusters.
#Approximate the data with 10 k-means clusters
clusters_swiss_roll = KMeans(n_clusters=10, random_state=seed).fit_predict(X) #asked k-means to approximate it with 10 clusters
#Plot the dataset with k-means cluster IDs as the color
fig2 = plt.figure()
ax = fig2.add_subplot(111, projection='3d')
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=clusters_swiss_roll, cmap='Spectral')
Uniform distribution is the worst-case scenario for k-means. If data density is not uniform, then we will be able to represent more data with fewer clusters.
k-Means Featurization for Classification
If a target variable is also available, then we have the choice of giving that information as a hint to the clustering procedure.
k-Means Featurization
Clustering algorithms analyze the spatial distribution of data. Therefore, k-means featurization creates a compressed spatial index of the data which can be fed into the model in the next stage. This is an example of model stacking.
# Example 7-3. k-means featurizer
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import OneHotEncoder

class KMeansFeaturizer:
    """Transforms numeric data into k-means cluster memberships.

    This transformer runs k-means on the input data and converts each data point
    into the id of the closest cluster. If a target variable is present, it is
    scaled and included as input to k-means in order to derive clusters that
    obey the classification boundary as well as group similar points together.

    Parameters
    ----------
    k : integer, optional, default 100
        The number of clusters to group data into.

    target_scale : float, [0, infty], optional, default 5.0
        The scaling factor for the target variable. Set this to zero to ignore
        the target. For classification problems, larger `target_scale` values
        will produce clusters that better respect the class boundary.

    random_state : integer or numpy.RandomState, optional
        This is passed to k-means as the generator used to initialize the
        k-means centers. If an integer is given, it fixes the seed. Defaults to
        the global numpy random number generator.

    Attributes
    ----------
    cluster_centers_ : array, [k, n_features]
        Coordinates of cluster centers. n_features does not count the target
        column.
    """

    def __init__(self, k=100, target_scale=5.0, random_state=None):
        self.k = k
        self.target_scale = target_scale
        self.random_state = random_state
        self.cluster_encoder = OneHotEncoder().fit(np.array(range(k)).reshape(-1, 1))

    def fit(self, X, y=None):
        """Runs k-means on the input data and finds the centroids.

        If no target is given (`y` is None), run vanilla k-means on the input `X`.

        If a target `y` is given, include the target (weighted by `target_scale`)
        as an extra dimension for k-means clustering. In this case, run k-means
        twice: first with the target, then an extra iteration without it.

        After fitting, the attribute `cluster_centers_` is set to the k-means
        centroids in the input space represented by `X`.

        Parameters
        ----------
        X : array-like or sparse matrix, shape=(n_data_points, n_features)

        y : vector of length n_data_points, optional, default None
            If provided, will be weighted with `target_scale` and included in
            k-means clustering as a hint.
        """
        if y is None:
            # No target variable, just do plain k-means
            km_model = KMeans(n_clusters=self.k,
                              n_init=20,
                              random_state=self.random_state)
            km_model.fit(X)

            self.km_model = km_model
            self.cluster_centers_ = km_model.cluster_centers_
            return self

        # There is target information. Apply appropriate scaling and include
        # it in the input data to k-means.
        data_with_target = np.hstack((X, y[:, np.newaxis] * self.target_scale))

        # Build a pre-training k-means model on data and target
        km_model_pretrain = KMeans(n_clusters=self.k,
                                   n_init=20,
                                   random_state=self.random_state)
        km_model_pretrain.fit(data_with_target)

        # Run k-means a second time to get the clusters in the original space
        # without target info. Initialize using centroids found in pre-training
        # (keeping only the original feature columns of this 2-D data, i.e.,
        # dropping the target column). Go through a single iteration of cluster
        # assignment and centroid recomputation.
        km_model = KMeans(n_clusters=self.k,
                          init=km_model_pretrain.cluster_centers_[:, :2],
                          n_init=1,
                          max_iter=1)
        km_model.fit(X)

        self.km_model = km_model
        self.cluster_centers_ = km_model.cluster_centers_
        return self

    def transform(self, X, y=None):
        """Outputs a one-hot encoding of the closest cluster id for each input data point.

        Parameters
        ----------
        X : array-like or sparse matrix, shape=(n_data_points, n_features)

        y : vector of length n_data_points, optional, default None
            Target vector is ignored even if provided.

        Returns
        -------
        cluster_ids : sparse matrix, shape=(n_data_points, k)
            One-hot encoding of the closest cluster for each data point.
        """
        clusters = self.km_model.predict(X)
        return self.cluster_encoder.transform(clusters.reshape(-1, 1))

    def fit_transform(self, X, y=None):
        """Runs fit followed by transform."""
        self.fit(X, y)
        return self.transform(X, y)
# Example 7-4. k-means featurization with and without target hints
from scipy.spatial import Voronoi, voronoi_plot_2d
from sklearn.datasets import make_moons
from sklearn.preprocessing import OneHotEncoder
import sklearn
import scipy.sparse
import matplotlib.pyplot as plt
%matplotlib notebook
%matplotlib inline
seed = 1
training_data, training_labels = make_moons(n_samples=2000, noise=0.2, random_state=seed)
kmf_hint = KMeansFeaturizer(k=100, target_scale=10, random_state=seed).fit(training_data, training_labels)
kmf_no_hint = KMeansFeaturizer(k=100, target_scale=0, random_state=seed).fit(training_data, training_labels)
def kmeans_voronoi_plot(X, y, cluster_centers, ax):
    """Plots the Voronoi diagram of the k-means clusters overlaid with the data."""
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap='Set1', alpha=0.2)
    vor = Voronoi(cluster_centers)
    voronoi_plot_2d(vor, ax=ax, show_vertices=False, alpha=0.5)
The top panel shows that when the clustering algorithm is given target information, the cluster boundaries align much better along class boundaries.
fig = plt.figure()
ax = plt.subplot(211, aspect='equal')
kmeans_voronoi_plot(training_data, training_labels, kmf_hint.cluster_centers_, ax)
ax.set_title('K-Means with Target Hint')
ax2 = plt.subplot(212, aspect='equal')
kmeans_voronoi_plot(training_data, training_labels, kmf_no_hint.cluster_centers_, ax2)
ax2.set_title('K-Means without Target Hint')
Test the effectiveness of k-means features for classification
- support vector machine with radial basis function kernel (RBF SVM)
- k-nearest neighbors (kNN)
- random forest (RF)
- gradient boosting tree (GBT)
- logistic regression (LR)
# Example 7-5. Classification with k-means cluster features
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
# Generate some test data from the same distribution as training data
test_data, test_labels = make_moons(n_samples=2000, noise=0.3, random_state=seed+5)
# Use the k-means featurizer to generate cluster features
training_cluster_features = kmf_hint.transform(training_data)
test_cluster_features = kmf_hint.transform(test_data)
# Form new input features with cluster features
training_with_cluster = scipy.sparse.hstack((training_data, training_cluster_features))
test_with_cluster = scipy.sparse.hstack((test_data, test_cluster_features))
# Build the classifiers
lr_cluster = LogisticRegression(random_state=seed).fit(training_with_cluster, training_labels)
classifier_names = ['LR',
                    'kNN',
                    'RBF SVM',
                    'Random Forest',
                    'Boosted Trees']
classifiers = [LogisticRegression(random_state=seed),
               KNeighborsClassifier(5),
               SVC(gamma=2, C=1, random_state=seed),
               RandomForestClassifier(max_depth=5, n_estimators=10,
                                      max_features=1, random_state=seed),
               GradientBoostingClassifier(n_estimators=10, learning_rate=1.0,
                                          max_depth=5, random_state=seed)]
for model in classifiers:
    model.fit(training_data, training_labels)
A ROC curve shows the trade-off between true positives and false positives as we vary the classification decision boundary.
# Helper function to evaluate classifier performance using ROC
import sklearn.metrics

def test_roc(model, data, labels):
    if hasattr(model, "decision_function"):
        # Use the signed distance to the decision boundary as the score
        predictions = model.decision_function(data)
    else:
        # Otherwise use the predicted probability of the positive class
        predictions = model.predict_proba(data)[:, 1]
    fpr, tpr, _ = sklearn.metrics.roc_curve(labels, predictions)
    return fpr, tpr
# Plot results
plt.figure()
fpr_cluster, tpr_cluster = test_roc(lr_cluster, test_with_cluster, test_labels)
plt.plot(fpr_cluster, tpr_cluster, 'r-', label='LR with k-means')
for i, model in enumerate(classifiers):
    fpr, tpr = test_roc(model, test_data, test_labels)
    plt.plot(fpr, tpr, label=classifier_names[i])
plt.plot([0, 1], [0, 1], 'k--')
plt.legend()
plt.xlabel('False Positive Rate', fontsize=14)
plt.ylabel('True Positive Rate', fontsize=14)
There may be performance differences once the models are fully tuned, but at least this shows that it is possible for LR with k-means to be on a par with nonlinear classifiers.
This is a nice result because linear classifiers are much cheaper to train than nonlinear classifiers.
Lower computation cost allows us to try more models with different features in the same period of time, which increases the chance of ending up with a much better model.
Alternative Dense Featurization
Instead of one-hot cluster membership, a data point can also be represented by a dense vector of its inverse distance to each cluster center.
- one-hot cluster membership: results in a very lightweight, sparse representation, but one might need a larger k to represent data of complex shapes
- distance representation: dense, which could be more expensive for the modeling step, but one might be able to get away with a smaller k
A compromise between sparse and dense is to retain inverse distances for only p of the closest clusters. But now p is an extra hyperparameter to tune.
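A minimal sketch of this dense alternative (not from the book's code), assuming a fitted scikit-learn KMeans model and a data matrix; the helper name and the eps smoothing term are hypothetical choices.
import numpy as np
from sklearn.metrics import pairwise_distances

def inverse_distance_features(km, X, p=None, eps=1e-8):
    # Distance from every point to every cluster center, shape (n_points, k)
    dists = pairwise_distances(X, km.cluster_centers_)
    inv = 1.0 / (dists + eps)  # dense inverse-distance representation
    if p is not None:
        # Compromise: keep only the p closest clusters per point, zero out the rest
        keep = np.argsort(dists, axis=1)[:, :p]
        mask = np.zeros_like(inv, dtype=bool)
        np.put_along_axis(mask, keep, True, axis=1)
        inv = np.where(mask, inv, 0.0)
    return inv

# e.g. (hypothetical usage): dense_features = inverse_distance_features(kmf_hint.km_model, training_data, p=10)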
K-means clustering from scratch
import random
import math

# Euclidean distance between two d-dimensional points
def eucldist(p0, p1):
    dist = 0.0
    for i in range(0, len(p0)):
        dist += (p0[i] - p1[i]) ** 2
    return math.sqrt(dist)

# K-Means Algorithm
def kmeans(k, datapoints):
    # d - dimensionality of the datapoints
    d = len(datapoints[0])

    # Limit our iterations
    Max_Iterations = 1000
    i = 0

    cluster = [0] * len(datapoints)
    prev_cluster = [-1] * len(datapoints)

    # Randomly choose centers for the clusters
    cluster_centers = []
    for _ in range(0, k):
        # Alternative: random coordinates instead of a random data point
        # new_cluster = [random.randint(0, 10) for _ in range(d)]
        cluster_centers += [random.choice(datapoints)]

    # Sometimes the random centers are chosen poorly and some clusters end up empty.
    # In this particular implementation we want to force exactly k clusters.
    # To turn this feature off, simply remove "force_recalculation" from the while condition.
    force_recalculation = False

    while (cluster != prev_cluster or force_recalculation) and i < Max_Iterations:
        prev_cluster = list(cluster)
        force_recalculation = False
        i += 1

        # Update each point's cluster allegiance
        for p in range(0, len(datapoints)):
            min_dist = float("inf")

            # Check min_dist against all centers
            for c in range(0, len(cluster_centers)):
                dist = eucldist(datapoints[p], cluster_centers[c])
                if dist < min_dist:
                    min_dist = dist
                    cluster[p] = c  # Reassign point to new cluster

        # Update each cluster's position
        for c in range(0, len(cluster_centers)):
            new_center = [0] * d
            members = 0
            for p in range(0, len(datapoints)):
                if cluster[p] == c:  # If this point belongs to the cluster
                    for j in range(0, d):
                        new_center[j] += datapoints[p][j]
                    members += 1

            if members != 0:
                for j in range(0, d):
                    new_center[j] = new_center[j] / float(members)
            else:
                # The initial random assignment left this cluster empty;
                # re-seed it with a random data point to force k clusters.
                new_center = list(random.choice(datapoints))
                force_recalculation = True
                print("Forced Recalculation...")

            cluster_centers[c] = new_center

    print("======== Results ========")
    print("Clusters", cluster_centers)
    print("Iterations", i)
    print("Assignments", cluster)

# TESTING THE PROGRAM
if __name__ == "__main__":
    # 2D datapoints: a list of n d-dimensional vectors (here, 2D tuples).
    # Feel free to change to whatever size tuples you want.
    datapoints = [(3, 2), (2, 2), (1, 2), (0, 1), (1, 0), (1, 1), (5, 6), (7, 7),
                  (9, 10), (11, 13), (12, 12), (12, 13), (13, 13)]

    k = 2  # K - number of clusters
    kmeans(k, datapoints)