#!/usr/bin/env python
# coding: utf-8
# # Principal Component Analysis
# In[1]:
from IPython.display import Image
Image('../../../python_for_probability_statistics_and_machine_learning.jpg')
# The features from a particular dataset that will ultimately prove important for
# machine learning can be difficult to know ahead of time. This is especially
# true for problems that do not have a strong physical underpinning. The
# row-dimension of the input matrix ($X$) for fitting data in Scikit-learn is the
# number of samples and the column dimension is the number of features. There may
# be a large number of column dimensions in this matrix, and the purpose of
# dimensionality reduction is to reduce these to only the columns that
# are important for the machine learning task.
#
# Fortunately, Scikit-learn provides some powerful tools to help uncover the most
# relevant features. Principal Component Analysis (PCA) consists of taking
# the input $X$ matrix and (1) subtracting the mean, (2) computing the covariance
# matrix, and (3) computing the eigenvalue decomposition of the covariance
# matrix. For example, if $X$ has more columns than is practicable for a
# particular learning method, then PCA can reduce the number of columns to a more
# manageable number. PCA is widely used in statistics and other areas beyond
# machine learning, so it is worth examining what it does in some detail. First,
# we need the decomposition module from Scikit-learn.
# In[2]:
from sklearn import decomposition
import numpy as np
pca = decomposition.PCA()
# Let's create some very simple data and apply PCA.
# In[3]:
x = np.linspace(-1,1,30)
X = np.c_[x,x+1,x+2] # stack as columns
pca.fit(X)
print(pca.explained_variance_ratio_)
# **Programming Tip.**
#
# The `np.c_` idiom is a shortcut for creating stacked, column-wise arrays.
#
# In this example, the columns are just constant offsets of the first
# column. The *explained variance ratio* is the fraction of the total variance
# attributable to each of the transformed columns of `X`. You can think of this
# as the information that is relatively concentrated in each column of the
# transformed matrix `X`. [Figure](#fig:pca_001) shows the graph of this dominant
# transformed column in the bottom panel.
# In[4]:
get_ipython().run_line_magic('matplotlib', 'inline')
from matplotlib.pylab import subplots
fig,axs = subplots(2,1,sharex=True,sharey=True)
ax = axs[0]
_=ax.plot(x,X[:,0],'-k',lw=3)
_=ax.plot(x,X[:,1],'--k',lw=3)
_=ax.plot(x,X[:,2],':k',lw=3)
ax=axs[1]
_=ax.plot(x,pca.fit_transform(X)[:,0],'-k',lw=3)
#ax.tick_params(labelsize='x-large')
#
# <div id="fig:pca_001"></div>
#
# The top panel shows the columns of the feature matrix and the bottom panel
# shows the dominant component that PCA has extracted.
#
# To make this more interesting, let's change the slope of each of the
# columns as
# in the following,
# In[5]:
X = np.c_[x,2*x+1,3*x+2,x] # change slopes of columns
pca.fit(X)
print(pca.explained_variance_ratio_)
# Notice that changing the slopes did not change the explained variance
# ratio; there is still only one dominant column. This means that PCA is
# invariant to both constant offsets and scale changes of the input columns.
# This works for functions as well as simple lines,
# In[6]:
x = np.linspace(-1,1,30)
X = np.c_[np.sin(2*np.pi*x),
2*np.sin(2*np.pi*x)+1,
3*np.sin(2*np.pi*x)+2]
pca.fit(X)
print(pca.explained_variance_ratio_)
# Once again, there is only one dominant column, which is shown in the
# bottom panel of [Figure](#fig:pca_002). The top panel shows the individual
# columns of the feature matrix. To sum up, PCA is able to identify and eliminate
# features that are merely linear transformations of existing features. This also
# works when there is additive noise in the features, although more samples are
# needed to separate the uncorrelated noise from the underlying features.
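#
# As a quick check of this claim (a minimal sketch; the noise level below is an
# arbitrary choice, not from the text), we can perturb the columns with additive
# noise and refit,
# In[ ]:
Xn = X + np.random.normal(scale=0.1, size=X.shape) # add modest uncorrelated noise
pca.fit(Xn)
# the first ratio still dominates, but the others are now small instead of zero
print(pca.explained_variance_ratio_)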
#
# <div id="fig:pca_002"></div>
#
# The top panel shows the columns of the feature matrix and the bottom panel
# shows the dominant component that PCA has computed.
#
# In[7]:
fig,axs = subplots(2,1,sharex=True,sharey=True)
ax = axs[0]
_=ax.plot(x,X[:,0],'-k')
_=ax.plot(x,X[:,1],'--k')
_=ax.plot(x,X[:,2],':k')
ax=axs[1]
_=ax.axis(xmin=-1,xmax=1)
_=ax.plot(x,pca.fit_transform(X)[:,0],'-k')
# ax.tick_params(labelsize='x-large')
# In[8]:
fig,axs=subplots(1,2,sharey=True)
fig.set_size_inches((8,5))
ax=axs[0]
ax.set_aspect(1/1.6)
_=ax.axis(xmin=-2,xmax=12)
x1 = np.arange(0, 10, .01/1.2)
x2 = x1+ np.random.normal(loc=0, scale=1, size=len(x1))
X = np.c_[(x1, x2)]
good = (x1>5) | (x2>5)
bad = ~good
_=ax.plot(x1[good],x2[good],'ow',alpha=.3)
_=ax.plot(x1[bad],x2[bad],'ok',alpha=.3)
_=ax.set_title("original data space")
_=ax.set_xlabel("x")
_=ax.set_ylabel("y")
_=pca.fit(X)
Xx=pca.fit_transform(X)
ax=axs[1]
ax.set_aspect(1/1.6)
_=ax.plot(Xx[good,0],Xx[good,1]*0,'ow',alpha=.3)
_=ax.plot(Xx[bad,0],Xx[bad,1]*0,'ok',alpha=.3)
_=ax.set_title("PCA-reduced data space")
_=ax.set_xlabel(r"$\hat{x}$")
_=ax.set_ylabel(r"$\hat{y}$")
# To see how PCA can simplify machine learning tasks, consider
# [Figure](#fig:pca_003) wherein the two classes are separated along the diagonal.
# After PCA, the transformed data lie along a single axis where the two classes
# can be split using a one-dimensional interval, which greatly simplifies the
# classification task. The class identities are preserved under PCA because the
# principal component lies along the same direction in which the classes are separated.
# On the other hand, if the classes are separated along the direction
# *orthogonal* to the principal component, then the two classes become mixed
# under PCA and the classification task becomes much harder. Note that in both
# cases, the `explained_variance_ratio_` is the same because the explained
# variance ratio does not account for class membership.
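#
# As a quick sketch of the first case (not from the text), a single threshold on
# the first principal component from the previous cell classifies almost every
# point; the sign of a principal component is arbitrary, so we keep whichever
# orientation scores better,
# In[ ]:
pred = Xx[:,0] > 0 # PCA output is centered, so zero is a natural cut point
acc = max(np.mean(pred == good), np.mean(pred == bad)) # account for arbitrary sign
print('fraction correctly separated:', acc)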
#
# <div id="fig:pca_003"></div>
#
# The left panel shows the original two-dimensional data space of two easily
# distinguishable classes and the right panel shows the reduced data space
# transformed using PCA. Because the two classes are separated along the principal
# component discovered by PCA, the classes are preserved under the
# transformation.
#
# In[9]:
fig,axs=subplots(1,2,sharey=True)
ax=axs[0]
x1 = np.arange(0, 10, .01/1.2)
x2 = x1+np.random.normal(loc=0, scale=1, size=len(x1))
X = np.c_[(x1, x2)]
good = x1>x2
bad = ~good
_=ax.plot(x1[good],x2[good],'ow',alpha=.3)
_=ax.plot(x1[bad],x2[bad],'ok',alpha=.3)
_=ax.set_title("original data space")
_=ax.set_xlabel("x")
_=ax.set_ylabel("y")
_=pca.fit(X)
Xx=pca.fit_transform(X)
ax=axs[1]
_=ax.plot(Xx[good,0],Xx[good,1]*0,'ow',alpha=.3)
_=ax.plot(Xx[bad,0],Xx[bad,1]*0,'ok',alpha=.3)
_=ax.set_title("PCA-reduced data space")
_=ax.set_xlabel(r"\hat{x}")
_=ax.set_ylabel(r"\hat{y}")
#
# As compared with [Figure](#fig:pca_003), the two classes differ along the
# coordinate direction that is orthogonal to the principal component. As a result,
# the two classes are no longer distinguishable after transformation.
#
# PCA works by decomposing the covariance matrix of the data using the Singular
# Value Decomposition (SVD). This decomposition exists for all matrices and
# returns the following factorization for an arbitrary matrix $\mathbf{A}$,
#
# $$
# \mathbf{A} = \mathbf{U} \mathbf{S} \mathbf{V}^T
# $$
#
# Because of the symmetry of the covariance matrix, $\mathbf{U} =
# \mathbf{V}$. The elements of the diagonal matrix $\mathbf{S}$ are the singular
# values of $\mathbf{A}$ whose squares are the eigenvalues of $\mathbf{A}^T
# \mathbf{A}$. The eigenvector matrix $\mathbf{U}$ is orthogonal: $\mathbf{U}^T
# \mathbf{U} =\mathbf{I}$. The singular values are in decreasing order so that
# the first column of $\mathbf{U}$ is the axis corresponding to the largest
# singular value. This is the first dominant column that PCA identifies. The
# entries of the covariance matrix are of the form $\mathbb{E}(x_i x_j)$ where
# $x_i$ and $x_j$ are different features [^covariance]. This means that the
# covariance matrix is filled with entries that attempt to uncover mutually
# correlated relationships between all pairs of columns of the feature matrix.
# Once these have been tabulated in the covariance matrix, the SVD finds optimal
# orthogonal transformations to align the components along the directions most
# strongly associated with these correlated relationships. Simultaneously,
# because orthogonal matrices have columns of unit-length, the SVD collects the
# absolute squared lengths of these components into the $\mathbf{S}$ matrix. In
# our example above in [Figure](#fig:pca_003), the two feature vectors were
# obviously correlated along the
# diagonal, meaning that PCA selected that diagonal direction as the principal
# component.
#
# [^covariance]: Note that these entries are constructed from the data
# using an estimator of the covariance matrix because we do not have
# the full probability densities at hand.
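#
# As a minimal check (not from the text) of the description above, we can carry
# out the covariance/SVD steps by hand on the earlier linear example and compare
# the eigenvalue ratios with what Scikit-learn reports,
# In[ ]:
X = np.c_[x, 2*x+1, 3*x+2]   # reuse the earlier linear example
Xc = X - X.mean(axis=0)      # (1) subtract the mean
C = Xc.T @ Xc/(len(x)-1)     # (2) sample covariance matrix
U, S, Vt = np.linalg.svd(C)  # (3) SVD; the first column of U is the principal axis
print(S/S.sum())             # matches the explained variance ratio up to round-off
print(pca.fit(X).explained_variance_ratio_)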
#
# We have seen that PCA is a powerful dimension reduction method that is
# invariant to linear transformations of the original feature space. However,
# this method performs poorly with transformations that are nonlinear. In that
# case, there is a wide range of extensions to PCA, such as Kernel PCA, available
# in Scikit-learn, that allow parameterized nonlinearities to be embedded into
# the PCA at the risk of overfitting.
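#
# As a brief sketch of the interface (the RBF kernel and `gamma` value here are
# illustrative choices, not recommendations from the text), Kernel PCA is used
# much like `PCA`,
# In[ ]:
from sklearn.decomposition import KernelPCA
Xq = np.c_[x, x**2]          # columns related nonlinearly
kpca = KernelPCA(n_components=1, kernel='rbf', gamma=1.0)
Xk = kpca.fit_transform(Xq)  # a single nonlinear component
print(Xk.shape)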
#
# ## Independent Component Analysis
#
# Independent Component Analysis (ICA) via the `FastICA` algorithm is also
# available in Scikit-learn. This method is fundamentally different from PCA
# in that it is the small differences between components that are emphasized,
# not the large principal components. This method is adopted from signal
# processing. Consider a matrix of signals ($\mathbf{X}$) where the rows are
# the samples and the columns are the different signals. For example, these
# could be EKG signals from multiple leads on a single patient. The analysis
# starts with the following model,
#
#
#
#
# $$
# \begin{equation}
# \mathbf{X} = \mathbf{S}\mathbf{A}^T
# \label{eq:ICA} \tag{1}
# \end{equation}
# $$
#
# In other words, the observed signal matrix is an unknown mixture
# ($\mathbf{A}$) of some set of conformable, independent random sources
# $\mathbf{S}$,
#
# $$
# \mathbf{S}=\left[ \mathbf{s}_1(t),\mathbf{s}_2(t),\ldots,\mathbf{s}_n(t)\right]
# $$
#
# The distribution of the random sources is otherwise unknown, except
# that at most one of the sources can be Gaussian; otherwise, the mixing matrix
# $\mathbf{A}$ cannot be identified, for reasons we return to below. The problem in
# ICA is to find $\mathbf{A}$ in Equation [eq:ICA](#eq:ICA) and thereby un-mix the
# $s_i(t)$ signals, but this cannot be solved without a strategy to reduce the
# inherent arbitrariness in this formulation.
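#
# To see this arbitrariness concretely, here is a small sketch (the particular
# matrices are arbitrary illustrative choices): two different source/mixing pairs
# that produce exactly the same observations,
# In[ ]:
rng = np.random.RandomState(0)
S0 = rng.uniform(size=(5,2))              # candidate sources
A0 = np.array([[1., 2.],[3., 4.]])        # candidate mixing matrix
M  = np.array([[2., 1.],[0., 3.]])        # any invertible matrix
S1, A1 = S0 @ M, A0 @ np.linalg.inv(M).T  # a different, equally valid pair
print(np.allclose(S0 @ A0.T, S1 @ A1.T))  # True: identical observed matrix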
#
# To make this concrete, let us simulate the situation with the following code,
# In[10]:
np.random.seed(123456)
# In[11]:
from numpy import c_, sin, cos, pi
t = np.linspace(0,1,250)
s1 = sin(2*pi*t*6)
s2 = np.maximum(cos(2*pi*t*3),0.3)
s2 = s2 - s2.mean()
s3 = np.random.randn(len(t))*.1
# normalize each source to unit length
s1 = s1/np.linalg.norm(s1)
s2 = s2/np.linalg.norm(s2)
s3 = s3/np.linalg.norm(s3)
S = c_[s1,s2,s3] # stack sources as columns
# mixing matrix
A = np.array([[ 1,   1, 1],
              [0.5, -1, 3],
              [0.1, -2, 8]])
X = S @ A.T # do mixing
# In[12]:
fig,axs=subplots(3,2,sharex=True)
fig.set_size_inches((8,8))
X = np.array(X)
_=axs[0,1].plot(t,-X[:,0],'k-')
_=axs[1,1].plot(t,-X[:,1],'k-')
_=axs[2,1].plot(t,-X[:,2],'k-')
_=axs[0,0].plot(t,s1,'k-')
_=axs[1,0].plot(t,s2,'k-')
_=axs[2,0].plot(t,s3,'k-')
_=axs[2,0].set_xlabel('$t$',fontsize=18)
_=axs[2,1].set_xlabel('$t$',fontsize=18)
_=axs[0,0].set_ylabel('$s_1(t)$ ',fontsize=18,rotation='horizontal')
_=axs[1,0].set_ylabel('$s_2(t)$ ',fontsize=18,rotation='horizontal')
_=axs[2,0].set_ylabel('$s_3(t)$ ',fontsize=18,rotation='horizontal')
for ax in axs.flatten():
_=ax.yaxis.set_ticklabels('')
_=axs[0,1].set_ylabel(' $X_1(t)$',fontsize=18,rotation='horizontal')
_=axs[1,1].set_ylabel(' $X_2(t)$',fontsize=18,rotation='horizontal')
_=axs[2,1].set_ylabel(' $X_3(t)$',fontsize=18,rotation='horizontal')
_=axs[0,1].yaxis.set_label_position("right")
_=axs[1,1].yaxis.set_label_position("right")
_=axs[2,1].yaxis.set_label_position("right")
#
# <div id="fig:pca_008"></div>
#
# The left column shows the original signals and the right column shows the
# mixed signals. The object of ICA is to recover the left column from the
# right.
#
# The individual signals ($s_i(t)$) and their mixtures ($X_i(t)$) are
# shown in [Figure](#fig:pca_008). To recover the individual signals using ICA,
# we use the `FastICA` object and fit the parameters on the `X` matrix,
# In[13]:
from sklearn.decomposition import FastICA
ica = FastICA()
# estimate unknown S matrix
S_=ica.fit_transform(X)
# The results of this estimation are shown in [Figure](#fig:pca_009),
# which demonstrates that ICA is able to recover the original signals from the
# observed mixture. Note that ICA cannot determine the amplitudes, signs, or
# ordering of the recovered signals; only their shapes are identified.
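#
# As a quick numeric check (not from the text), the cross-correlation between the
# original sources and the recovered components has one entry near plus or minus
# one per row, which reflects the sign and ordering ambiguity,
# In[ ]:
# corr[i,j] is the correlation between source i and recovered column j of S_
corr = np.corrcoef(S.T, S_.T)[:3, 3:]
print(np.round(corr, 2))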
# In[14]:
fig,axs=subplots(3,2,sharex=True)
fig.set_size_inches((8,8))
X = np.array(X)
_=axs[0,1].plot(t,-S_[:,2],'k-')
_=axs[1,1].plot(t,-S_[:,1],'k-')
_=axs[2,1].plot(t,-S_[:,0],'k-')
_=axs[0,0].plot(t,s1,'k-')
_=axs[1,0].plot(t,s2,'k-')
_=axs[2,0].plot(t,s3,'k-')
_=axs[2,0].set_xlabel('$t$',fontsize=18)
_=axs[2,1].set_xlabel('$t$',fontsize=18)
_=axs[0,0].set_ylabel('$s_1(t)$ ',fontsize=18,rotation='horizontal')
_=axs[1,0].set_ylabel('$s_2(t)$ ',fontsize=18,rotation='horizontal')
_=axs[2,0].set_ylabel('$s_3(t)$ ',fontsize=18,rotation='horizontal')
for ax in axs.flatten():
_=ax.yaxis.set_ticklabels('')
_=axs[0,1].set_ylabel(r' $s_1^\prime(t)$',fontsize=18,rotation='horizontal')
_=axs[1,1].set_ylabel(r' $s_2^\prime(t)$',fontsize=18,rotation='horizontal')
_=axs[2,1].set_ylabel(r' $s_3^\prime(t)$',fontsize=18,rotation='horizontal')
_=axs[0,1].yaxis.set_label_position("right")
_=axs[1,1].yaxis.set_label_position("right")
_=axs[2,1].yaxis.set_label_position("right")
#
# <div id="fig:pca_009"></div>
#
# The left column shows the original signals and the right column shows the
# signals that ICA was able to recover. They match exactly, outside of a possible
# sign change.
#
# To develop some intuition as to how ICA accomplishes this feat, consider the
# following two-dimensional situation with two uniformly distributed independent
# variables, $u_x,u_y \sim \mathcal{U}[0,1]$. Suppose we apply the
# following orthogonal rotation matrix to these variables,
#
# $$
# \begin{bmatrix}
# u_x^\prime \\\
# u_y^\prime
# \end{bmatrix}=
# \begin{bmatrix}
# \cos(\phi) & -\sin(\phi) \\\
# \sin(\phi) & \cos(\phi)
# \end{bmatrix}
# \begin{bmatrix}
# u_x \\\
# u_y
# \end{bmatrix}
# $$
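#
# A minimal sketch of this rotation (the angle and sample size are arbitrary
# choices): the rotated pair is nearly uncorrelated, so second-order statistics
# alone cannot detect the mixing, yet `FastICA` recovers the axis-aligned
# coordinates, anticipating the discussion below,
# In[ ]:
u = np.random.uniform(size=(2000,2))           # independent uniform sources
phi = np.pi/6
Q = np.array([[np.cos(phi), -np.sin(phi)],
              [np.sin(phi),  np.cos(phi)]])    # orthogonal rotation
u_rot = u @ Q.T                                # rotated (now dependent) variables
print(np.round(np.corrcoef(u_rot.T)[0,1], 3))  # near zero despite the dependence
u_hat = FastICA(n_components=2).fit_transform(u_rot)
# each recovered column lines up with one of the original uniforms (up to sign/order)
print(np.round(np.corrcoef(u.T, u_hat.T)[:2, 2:], 2))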
#
#
# <div id="fig:pca_005"></div>
#
# The left panel shows two classes labeled on the $u_x,u_y$ uniformly
# distributed, independent random variables. The right panel shows these random
# variables after a rotation, which removes their mutual independence and makes
# it hard to separate the two classes along the coordinate directions.
#
# The so-rotated variables $u_x^\prime,u_y^\prime$ are no longer
# independent, as shown in [Figure](#fig:pca_005). Thus, one way to think about
# ICA is as a search through orthogonal matrices so that the independence is
# restored. This is where the prohibition against Gaussian distributions arises.
# The two-dimensional Gaussian density of independent variables is
# proportional to the following,
#
# $$
# f(\mathbf{x})\propto\exp(-\frac{1}{2}\mathbf{x}^T \mathbf{x})
# $$
#
# Now, if we similarly rotated the $\mathbf{x}$ vector as,
#
# $$
# \mathbf{y} = \mathbf{Q} \mathbf{x}
# $$
#
# the resulting density for $\mathbf{y}$ is obtained by plugging in
# the following,
#
# $$
# \mathbf{x} = \mathbf{Q}^T \mathbf{y}
# $$
#
# which holds because the inverse of an orthogonal matrix is its transpose. This
# gives
#
# $$
# f(\mathbf{y})\propto\exp(-\frac{1}{2}\mathbf{y}^T \mathbf{Q}\mathbf{Q}^T
# \mathbf{y})=\exp(-\frac{1}{2}\mathbf{y}^T \mathbf{y})
# $$
#
# In other words, the rotation leaves the density of the $\mathbf{y}$
# variable unchanged. This means that ICA cannot search over orthogonal
# transformations if it is blind to them, which explains the restriction to at
# most one Gaussian source. Thus, ICA is a method that seeks to maximize the
# non-Gaussianity of the transformed random variables. There are many methods
# for doing this, some of which involve cumulants and others that use the
# *negentropy*,
#
# $$
# \mathcal{J}(Y) = \mathcal{H}(Z)-\mathcal{H}(Y)
# $$
#
# where $\mathcal{H}(Z)$ is the information entropy of the
# Gaussian random variable $Z$ that has the same variance as $Y$. Further
# details would take us beyond our scope, but that is the outline of how
# the FastICA algorithm works.
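#
# As a small numeric illustration (using closed-form entropies rather than the
# approximations FastICA actually computes), the negentropy of a uniform variable
# is positive because the Gaussian with matching variance has larger entropy,
# In[ ]:
var_u = 1/12.                             # variance of U[0,1]
H_uniform = 0.0                           # differential entropy of U[0,1] is log(1) = 0
H_gauss = 0.5*np.log(2*np.pi*np.e*var_u)  # entropy of the matching-variance Gaussian
print('negentropy of U[0,1]:', H_gauss - H_uniform)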
#
# The implementation of this method in Scikit-learn includes two different ways
# of extracting more than one independent source component. The *deflation*
# method iteratively extracts one component at a time using an incremental
# normalization step. The *parallel* method also uses the single-component method
# but carries out the normalization of all the components simultaneously, instead
# of for just the newly computed component. Because ICA extracts
# independent components, a whitening step is applied beforehand to decorrelate
# and rescale the columns of the data matrix. Whereas PCA returns
# uncorrelated components along dimensions optimal for Gaussian random
# variables, ICA returns components that are as far from the Gaussian density
# as possible.
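#
# Both strategies are selected through the `algorithm` keyword of `FastICA`; a
# minimal sketch running each on the mixed signals from above,
# In[ ]:
S_defl = FastICA(algorithm='deflation').fit_transform(X)
S_par = FastICA(algorithm='parallel').fit_transform(X)
print(S_defl.shape, S_par.shape) # both recover three components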
#
# The left panel of [Figure](#fig:pca_005) shows the original uniform random
# sources. The white and black colors distinguish between two classes. The right
# panel shows the mixture of these sources, which is what we observe as input
# features. The top row of [Figure](#fig:pca_006) shows the PCA (left) and ICA
# (right) transformed data spaces. Notice that ICA is able to un-mix the two
# random sources whereas PCA transforms along the dominant diagonal. Because ICA
# is able to preserve the class membership, the data space can be reduced to two
# non-overlapping sections, as shown. However, PCA cannot achieve a similar
# separation because the classes are mixed along the dominant diagonal that PCA
# favors as the main component in the decomposition.
#
#
# <div id="fig:pca_006"></div>
#
# The panel on the top left shows two classes in a plane after a rotation. The
# bottom left panel shows the result of dimensionality reduction using PCA, which
# causes mixing between the two classes. The top right panel shows the ICA
# transformed output and the lower right panel shows that, because ICA was able to
# un-rotate the data, the lower-dimensional data maintains the separation between
# the classes.
#
# For a more thorough treatment of principal component analysis, see
# [[richert2013building]](#richert2013building),
# [[alpaydin2014introduction]](#alpaydin2014introduction),
# [[cuesta2013practical]](#cuesta2013practical), and
# [[izenman2008modern]](#izenman2008modern). Independent Component Analysis is
# discussed in more detail
# in [[hyvarinen2004independent]](#hyvarinen2004independent).