Feature Engineering · Dan-k · 2020. 5. 21. 17:04

CHAPTER 7: Nonlinear Featurization via K-Means Model Stacking

Nonlinear dimensionality reduction

: is also called nonlinear embedding or manifold learning. Nonlinear embeddings are useful for aggressively compressing high-dimensional data into low-dimensional data. They are often used for visualization in two or three dimensions.
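
As a quick, concrete illustration (a minimal sketch added for this writeup, not part of the original notebook; Isomap is just one of several manifold learners in scikit-learn), a Swiss roll can be unrolled from three dimensions into two:

# Minimal sketch: embed a noisy Swiss roll from 3D into 2D with Isomap.
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

X_roll, color = make_swiss_roll(n_samples=1500, noise=0.05, random_state=0)
X_2d = Isomap(n_neighbors=10, n_components=2).fit_transform(X_roll)
print(X_2d.shape)  # (1500, 2): each 3D point now has a 2D coordinate suitable for plotting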

 

Swiss roll: nonlinear manifold

In [0]:
import numpy as np


tt0 = 3 * np.pi / 2 * (1 + 2 * np.arange(0, 1.25, 0.01))
hh = np.arange(0, 1.125, 0.125) * 30
xx = np.transpose(np.tile(np.multiply(tt0, np.cos(tt0)), (len(hh), 1)))
yy = np.tile(hh, (len(tt0), 1))
zz = np.transpose(np.tile(np.multiply(tt0, np.sin(tt0)), (len(hh), 1)))
cc = np.transpose(np.tile((tt0-tt0.min())/(tt0.max()-tt0.min()), (len(hh), 1)))
In [0]:
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
import seaborn as sns
%matplotlib notebook
%matplotlib inline


sns.set_style('whitegrid')

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(xx, yy, zz, rstride=1, cstride=1, 
                linewidth=0, antialiased=False, 
                facecolors=cm.seismic_r(cc))
Out[0]:
<mpl_toolkits.mplot3d.art3d.Poly3DCollection at 0x7f8c4f090128>

[Figure: 3D surface plot of the Swiss roll, colored along the length of the roll]

Clustering

The goal of feature engineering, however, isn’t so much to make the feature dimensions as low as possible, but to arrive at the right features for the task.

Clustering algorithms are usually not presented as techniques for local structure learning. But they in fact enable just that. Points that are close to each other belong to the same cluster.

Compared to nonlinear embedding techniques, clustering may produce more features. But if the end goal is feature engineering instead of visualization, this is not a problem.

We will illustrate the idea of local structure learning with a common clustering algorithm called k-means. It is simple to understand and implement. Instead of nonlinear manifold reduction, it is more apt to say that k-means performs nonlinear manifold feature extraction.

 

K-means Clustering

  • unsupervised
  • hard clustering: each data point is assigned to one and only one cluster
  • depends on a distance metric to measure closeness (e.g., Euclidean distance)
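
For reference (a standard formulation that the original post does not spell out), k-means chooses cluster assignments C_1, ..., C_k and centroids mu_1, ..., mu_k that minimize the within-cluster sum of squared distances:

\min_{C_1,\dots,C_k,\ \mu_1,\dots,\mu_k} \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2

where \mu_i is the mean of the points assigned to cluster C_i. scikit-learn reports this quantity as the fitted model's inertia_ attribute.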
In [0]:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib notebook
%matplotlib inline
In [0]:
n_data = 1000
seed = 1
n_centers = 4

# Generate random Gaussian blobs and run K-means
blobs, blob_labels = make_blobs(n_samples=n_data, n_features=2, 
                                centers=n_centers, random_state=seed)
clusters_blob = KMeans(n_clusters=n_centers, random_state=seed).fit_predict(blobs)

# Generate data uniformly at random and run K-means
uniform = np.random.rand(n_data, 2)
clusters_uniform = KMeans(n_clusters=n_centers, random_state=seed).fit_predict(uniform)
In [0]:
# Matplotlib incantations for visualizing results
figure = plt.figure(figsize=(20,10))
plt.subplot(221)
plt.scatter(blobs[:, 0], blobs[:, 1], c=blob_labels, edgecolors='k', cmap='PuRd')
plt.title("(a) Four randomly generated blobs", fontsize=14)
plt.axis('off')

plt.subplot(222)
plt.scatter(blobs[:, 0], blobs[:, 1], c=clusters_blob, edgecolors='k', cmap='PuRd')
plt.title("(b) Clusters found via K-means", fontsize=14)
plt.axis('off')

plt.subplot(223)
plt.scatter(uniform[:, 0], uniform[:, 1], edgecolors='k')
plt.title("(c) 1000 randomly generated points", fontsize=14)
plt.axis('off')

plt.subplot(224)
plt.scatter(uniform[:, 0], uniform[:, 1], c=clusters_uniform, edgecolors='k', cmap='PuRd')
plt.title("(d) Clusters found via K-means", fontsize=14)
plt.axis('off')
Out[0]:
(-0.05333619811634985,
 1.055757657148305,
 -0.06505378647490812,
 1.0651541240789473)

[Figure: (a) four randomly generated blobs, (b) k-means clusters found on the blobs, (c) 1000 uniformly random points, (d) k-means clusters found on the uniform points]

Clustering as Surface Tiling

When data is spread out fairly uniformly, there is no longer a correct number of clusters. In this case, the role of the clustering algorithm is vector quantization: representing the data with a small number of well-placed points.
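
As a small illustrative sketch of that vector-quantization view (added here, not from the book; the variable names are arbitrary), every point can be approximated by the centroid of its assigned cluster, so k centroids stand in for the whole dataset:

# Minimal sketch: k-means as vector quantization of uniformly random 2D data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(1)
points = rng.rand(1000, 2)                    # uniform data: no "true" clusters
km = KMeans(n_clusters=16, random_state=1).fit(points)

quantized = km.cluster_centers_[km.labels_]   # replace each point by its cluster centroid
mean_sq_error = np.mean(np.sum((points - quantized) ** 2, axis=1))
print(mean_sq_error)                          # average quantization (reconstruction) error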

In [0]:
from mpl_toolkits.mplot3d import Axes3D
from sklearn import manifold, datasets

#Generate a noisy Swiss roll dataset
X, color = datasets.make_swiss_roll(n_samples=1500) # generate 1500 points at random on the Swiss roll surface

#Approximate the data with 100 k-means clusters
clusters_swiss_roll = KMeans(n_clusters=100, random_state=seed).fit_predict(X) #asked k-means to approximate it with 100 clusters

#Plot the dataset with k-means cluster IDs as the color
fig2 = plt.figure()
ax = fig2.add_subplot(111, projection='3d')
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=clusters_swiss_roll, cmap='Spectral')
Out[0]:
<mpl_toolkits.mplot3d.art3d.Path3DCollection at 0x7f8c44a500f0>

[Figure: Swiss roll points colored by their assignments to 100 k-means clusters]

The problem is that if we pick a k that is too small, then the results won’t be so nice from a manifold learning perspective.

We can clearly see data from very different sections of the manifold being mapped to the same clusters.

In [0]:
#Approximate the data with 10 k-means clusters
clusters_swiss_roll = KMeans(n_clusters=10, random_state=seed).fit_predict(X) #asked k-means to approximate it with 10 clusters

#Plot the dataset with k-means cluster IDs as the color
fig2 = plt.figure()
ax = fig2.add_subplot(111, projection='3d')
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=clusters_swiss_roll, cmap='Spectral')
Out[0]:
<mpl_toolkits.mplot3d.art3d.Path3DCollection at 0x7f8c4493db70>

[Figure: Swiss roll points colored by their assignments to 10 k-means clusters; distant parts of the manifold end up sharing the same cluster]

Uniform distribution is the worst-case scenario for k-means. If data density is not uniform, then we will be able to represent more data with fewer clusters.

 

k-Means Featurization for Classification

If a target variable is also available, then we have the choice of giving that information as a hint to the clustering procedure.

k-Means Featurization

Clustering algorithms analyze the spatial distribution of data. Therefore, k-means featurization creates a compressed spatial index of the data which can be fed into the model in the next stage. This is an example of model stacking.

In [0]:
# Example 7-3. k-means featurizer
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import OneHotEncoder  # needed by KMeansFeaturizer's cluster_encoder
In [0]:
class KMeansFeaturizer:
    """Transforms numeric data into k-means cluster memberships.
    
    This transformer runs k-means on the input data and converts each data point
    into the id of the closest cluster. If a target variable is present, it is 
    scaled and included as input to k-means in order to derive clusters that
    obey the classification boundary as well as group similar points together.

    Parameters
    ----------
    k: integer, optional, default 100
        The number of clusters to group data into.

    target_scale: float, [0, infty], optional, default 5.0
        The scaling factor for the target variable. Set this to zero to ignore
        the target. For classification problems, larger `target_scale` values 
        will produce clusters that better respect the class boundary.

    random_state : integer or numpy.RandomState, optional
        This is passed to k-means as the generator used to initialize the 
        kmeans centers. If an integer is given, it fixes the seed. Defaults to 
        the global numpy random number generator.

    Attributes
    ----------
    cluster_centers_ : array, [k, n_features]
        Coordinates of cluster centers. n_features does not include the target column.
    """

    def __init__(self, k=100, target_scale=5.0, random_state=None):
        self.k = k
        self.target_scale = target_scale
        self.random_state = random_state
        self.cluster_encoder = OneHotEncoder().fit(np.array(range(k)).reshape(-1,1))
        
    def fit(self, X, y=None):
        """Runs k-means on the input data and finds the centroids.

        If no target is given (`y` is None) then run vanilla k-means on input
        `X`. 

        If target `y` is given, then include the target (weighted by 
        `target_scale`) as an extra dimension for k-means clustering. In this 
        case, run k-means twice, first with the target, then an extra iteration
        without.

        After fitting, the attribute `cluster_centers_` is set to the k-means
        centroids in the input space represented by `X`.

        Parameters
        ----------
        X : array-like or sparse matrix, shape=(n_data_points, n_features)

        y : vector of length n_data_points, optional, default None
            If provided, will be weighted with `target_scale` and included in 
            k-means clustering as hint.
        """
        if y is None:
            # No target variable, just do plain k-means
            km_model = KMeans(n_clusters=self.k, 
                              n_init=20, 
                              random_state=self.random_state)
            km_model.fit(X)

            self.km_model = km_model
            self.cluster_centers_ = km_model.cluster_centers_
            return self

        # There is target information. Apply appropriate scaling and include
        # into input data to k-means            
        data_with_target = np.hstack((X, y[:,np.newaxis]*self.target_scale))

        # Build a pre-training k-means model on data and target
        km_model_pretrain = KMeans(n_clusters=self.k, 
                                   n_init=20, 
                                   random_state=self.random_state)
        km_model_pretrain.fit(data_with_target)

        # Run k-means a second time to get the clusters in the original space
        # without target info. Initialize using centroids found in pre-training.
        # Go through a single iteration of cluster assignment and centroid 
        # recomputation.
        km_model = KMeans(n_clusters=self.k, 
                          init=km_model_pretrain.cluster_centers_[:,:-1],  # drop the target column
                          n_init=1, 
                          max_iter=1)
        km_model.fit(X)
        
        self.km_model = km_model
        self.cluster_centers_ = km_model.cluster_centers_
        return self
        
    def transform(self, X, y=None):
        """Outputs a one-hot encoding of the closest cluster for each input data point.

        Parameters
        ----------
        X : array-like or sparse matrix, shape=(n_data_points, n_features)

        y : vector of length n_data_points, optional, default None
            Target vector is ignored even if provided.

        Returns
        -------
        cluster_onehot : sparse matrix, shape=(n_data_points, k)
        """
        clusters = self.km_model.predict(X)
        return self.cluster_encoder.transform(clusters.reshape(-1,1))
    
    def fit_transform(self, X, y=None):
        """Runs fit followed by transform.
        """
        self.fit(X, y)
        return self.transform(X, y)
In [0]:
# Example 7-4. k-means featurization with and without target hints
from scipy.spatial import Voronoi, voronoi_plot_2d
from sklearn.datasets import make_moons
from sklearn.preprocessing import OneHotEncoder
import sklearn.metrics   # for sklearn.metrics.roc_curve below
import scipy.sparse      # for scipy.sparse.hstack below

import matplotlib.pyplot as plt
%matplotlib notebook
%matplotlib inline
In [0]:
seed = 1
training_data, training_labels = make_moons(n_samples=2000, noise=0.2, random_state=seed)
In [0]:
kmf_hint = KMeansFeaturizer(k=100, target_scale=10, random_state=seed).fit(training_data, training_labels)
kmf_no_hint = KMeansFeaturizer(k=100, target_scale=0, random_state=seed).fit(training_data, training_labels)
In [0]:
def kmeans_voronoi_plot(X, y, cluster_centers, ax):
    """Plots the Voronoi diagram of the k-means clusters overlaid with the data"""
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap='Set1', alpha=0.2)
    vor = Voronoi(cluster_centers)
    voronoi_plot_2d(vor, ax=ax, show_vertices=False, alpha=0.5)
 

The top panel shows that when the clustering algorithm is given target information, the cluster boundaries align much better along class boundaries.

In [0]:
fig = plt.figure()
ax = plt.subplot(211, aspect='equal')
kmeans_voronoi_plot(training_data, training_labels, kmf_hint.cluster_centers_, ax)
ax.set_title('K-Means with Target Hint')
ax2 = plt.subplot(212, aspect='equal')
kmeans_voronoi_plot(training_data, training_labels, kmf_no_hint.cluster_centers_, ax2)
ax2.set_title('K-Means without Target Hint')
Out[0]:
Text(0.5, 1.0, 'K-Means without Target Hint')

[Figure: Voronoi diagrams of the cluster centers overlaid on the two-moons data, with the target hint (top) and without it (bottom)]

To test the effectiveness of k-means features for classification, we compare logistic regression (LR) trained on inputs augmented with the cluster features against the same set of classifiers trained on the raw inputs:

  • support vector machine with radial basis function kernel (RBF SVM)
  • k-nearest neighbors (kNN)
  • random forest (RF)
  • gradient boosting tree (GBT)
  • logistic regression (LR) without cluster features

In [0]:
# Example 7-5. Classification with k-means cluster features

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
In [0]:
# Generate some test data from the same distribution as training data
test_data, test_labels = make_moons(n_samples=2000, noise=0.3, random_state=seed+5)

# Use the k-means featurizer to generate cluster features
training_cluster_features = kmf_hint.transform(training_data)
test_cluster_features = kmf_hint.transform(test_data)

# Form new input features with cluster features
training_with_cluster = scipy.sparse.hstack((training_data, training_cluster_features))
test_with_cluster = scipy.sparse.hstack((test_data, test_cluster_features))
In [0]:
# Build the classifiers

lr_cluster = LogisticRegression(random_state=seed).fit(training_with_cluster, training_labels)

classifier_names = ['LR',
                    'kNN',
                    'RBF SVM',
                    'Random Forest',
                    'Boosted Trees']
classifiers = [LogisticRegression(random_state=seed),
               KNeighborsClassifier(5),
               SVC(gamma=2, C=1, random_state=seed),
               RandomForestClassifier(max_depth=5, n_estimators=10, 
                                      max_features=1, random_state=seed),
               GradientBoostingClassifier(n_estimators=10, learning_rate=1.0,
                                          max_depth=5, random_state=seed)]
for model in classifiers:
    model.fit(training_data, training_labels)
 

A ROC curve shows the trade-off between true positives and false positives as we vary the classification decision threshold.

In [0]:
# Helper function to evaluate classifier performance using ROC
def test_roc(model, data, labels):
    if hasattr(model, "decision_function"):
        predictions = model.decision_function(data)
    else:
        predictions = model.predict_proba(data)[:,1]
    fpr, tpr, _ = sklearn.metrics.roc_curve(labels, predictions)
    return fpr, tpr
In [0]:
# Plot results
plt.figure()
fpr_cluster, tpr_cluster = test_roc(lr_cluster, test_with_cluster, test_labels)
plt.plot(fpr_cluster, tpr_cluster, 'r-', label='LR with k-means')

for i, model in enumerate(classifiers):
    fpr, tpr = test_roc(model, test_data, test_labels)
    plt.plot(fpr, tpr, label=classifier_names[i])
    
plt.plot([0, 1], [0, 1], 'k--')
plt.legend()
plt.xlabel('False Positive Rate', fontsize=14)
plt.ylabel('True Positive Rate', fontsize=14)
Out[0]:
Text(0, 0.5, 'True Positive Rate')

[Figure: ROC curves on the test set for LR with k-means features and the five baseline classifiers]

There may be performance differences once the models are fully tuned, but at least this shows that it is possible for LR with k-means to be on a par with nonlinear classifiers.

This is a nice result because linear classifiers are much cheaper to train than nonlinear classifiers.

Lower computation cost allows us to try more models with different features in the same period of time, which increases the chance of ending up with a much better model.

 

Alternative Dense Featurization

Instead of one-hot cluster membership, a data point can also be represented by a dense vector of its inverse distance to each cluster center.

  • one-hot cluster membership

    • results in a very lightweight, sparse representation
    • one might need a larger k to represent data of complex shapes
  • distance representation

    • dense, which could be more expensive for the modeling step
    • one might be able to get away with a smaller k
 

A compromise between sparse and dense is to retain inverse distances for only p of the closest clusters. But now p is an extra hyperparameter to tune.
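
Below is a minimal sketch of this dense alternative (an illustrative addition, not from the book). It assumes the kmf_hint featurizer fitted above and uses the underlying k-means model stored in its km_model attribute; KMeans.transform returns each point's distance to every centroid, which can be inverted and, as the compromise described above, truncated to the p closest clusters (p = 5 here is an arbitrary choice):

# Dense inverse-distance cluster features, optionally truncated to the p nearest clusters.
import numpy as np

distances = kmf_hint.km_model.transform(test_data)  # shape (n_points, k): distance to each centroid
inv_dist = 1.0 / (distances + 1e-8)                  # inverse distance; epsilon avoids division by zero

p = 5                                                # keep only the p closest clusters per point
nearest = np.argsort(distances, axis=1)[:, :p]       # column indices of the p nearest centroids
truncated = np.zeros_like(inv_dist)
np.put_along_axis(truncated, nearest, np.take_along_axis(inv_dist, nearest, axis=1), axis=1)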

 

K-means Clustering from Scratch

In [4]:
import random
import math

#Euclidean distance between two d-dimensional points
def eucldist(p0,p1):
    dist = 0.0
    for i in range(0,len(p0)):
        dist += (p0[i] - p1[i])**2
    return math.sqrt(dist)


    
#K-Means Algorithm
def kmeans(k,datapoints):

    # d - Dimensionality of Datapoints
    d = len(datapoints[0]) 
    
    #Limit our iterations
    Max_Iterations = 1000
    i = 0
    
    cluster = [0] * len(datapoints)
    prev_cluster = [-1] * len(datapoints)
    
    #Randomly choose initial centers for the clusters
    #(use a separate loop variable so the iteration counter i is not clobbered)
    cluster_centers = []
    for c in range(0,k):
        cluster_centers += [random.choice(datapoints)]
    
    #Sometimes the random centers are chosen poorly and some clusters end up empty.
    #This implementation forces exactly k clusters: an empty cluster triggers a
    #recalculation with a fresh random center. To turn this off, remove
    #"force_recalculation" from the while conditional.
    force_recalculation = False
    
    #Iterate until assignments stop changing (or the iteration limit is reached),
    #and keep going whenever a recalculation was forced.
    while (cluster != prev_cluster or force_recalculation) and i < Max_Iterations:
        
        prev_cluster = list(cluster)
        force_recalculation = False
        i += 1
    
        #Update Point's Cluster Alligiance
        for p in range(0,len(datapoints)):
            min_dist = float("inf")
            
            #Check min_distance against all centers
            for c in range(0,len(cluster_centers)):
                
                dist = eucldist(datapoints[p],cluster_centers[c])
                
                if (dist < min_dist):
                    min_dist = dist  
                    cluster[p] = c   # Reassign Point to new Cluster
        
        
        #Update each cluster's center position
        #(use c as the loop variable so the parameter k is not shadowed)
        for c in range(0,len(cluster_centers)):
            new_center = [0] * d
            members = 0
            for p in range(0,len(datapoints)):
                if (cluster[p] == c): #If this point belongs to cluster c
                    for j in range(0,d):
                        new_center[j] += datapoints[p][j]
                    members += 1
            
            for j in range(0,d):
                if members != 0:
                    new_center[j] = new_center[j] / float(members)
                
                #An empty cluster means the initial random assignment was poorly chosen.
                #Replace the center with a new datapoint to actually force k clusters.
                else:
                    new_center = random.choice(datapoints)
                    force_recalculation = True
                    print("Forced Recalculation...")
            
            cluster_centers[c] = new_center
    
        
    print ("======== Results ========")
    print ("Clusters", cluster_centers)
    print ("Iterations",i)
    print ("Assignments", cluster)
    
    
#TESTING THE PROGRAM#
if __name__ == "__main__":
    #2D - Datapoints List of n d-dimensional vectors. (For this example I already set up 2D Tuples)
    #Feel free to change to whatever size tuples you want...
    datapoints = [(3,2),(2,2),(1,2),(0,1),(1,0),(1,1),(5,6),(7,7),(9,10),(11,13),(12,12),(12,13),(13,13)]

    k = 2 # K - Number of Clusters
      
    kmeans(k,datapoints) 
 
======== Results ========
Clusters [[10.666666666666666, 11.333333333333334], [1.8571428571428572, 2.0]]
Iterations 3
Assignments [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]