
Singular value decomposition (SVD) and principal component analysis (PCA) are two eigenvalue methods used to reduce a high-dimensional dataset into fewer dimensions while retaining important information. Articles online say that these methods are 'related' but never specify the exact relation.

What is the intuitive relationship between PCA and SVD? As PCA uses the SVD in its calculation, clearly there is some 'extra' analysis done. What does PCA 'pay attention' to differently than the SVD? What kinds of relationships do each method utilize more in their calculations? Is one method 'blind' to a certain type of data that the other is not?

  • 14
    SVD and PCA and "total least-squares" (and several other names) are the same thing. It computes the orthogonal transform that decorrelates the variables and keeps the ones with the largest variance. There are two numerical approaches: one by SVD of the (centered) data matrix, and one by eigendecomposition of this matrix "squared" (covariance).2014-06-10
  • 5
    Here is a link to a very similar thread on CrossValidated.SE: [Relationship between SVD and PCA. How to use SVD to perform PCA?](http://stats.stackexchange.com/questions/134282) It covers similar ground to J.M.'s answer (+1, by the way), but in somewhat more detail.2015-01-24
  • 0
    [how-to-find-straight-line-minimizing-the-sum-of-squares-of-euclidean-distances-f](http://stats.stackexchange.com/questions/52807/how-to-find-straight-line-minimizing-the-sum-of-squares-of-euclidean-distances-f) on stats.stackexchange has some links on the relationship between orthogonal regression and PCA.2015-08-30

4 Answers

271

(I assume for the purposes of this answer that the data has been preprocessed to have zero mean.)

Simply put, the PCA viewpoint requires that one compute the eigenvalues and eigenvectors of the covariance matrix, which is proportional to $\mathbf X\mathbf X^\top$, where $\mathbf X$ is the data matrix with features as rows and samples as columns (the $1/(n-1)$ normalization only rescales the eigenvalues and is omitted here for readability). Since the covariance matrix is symmetric, it is diagonalizable, and the eigenvectors can be normalized such that they are orthonormal:

$\mathbf X\mathbf X^\top=\mathbf W\mathbf D\mathbf W^\top$

On the other hand, applying SVD to the data matrix $\mathbf X$ as follows:

$\mathbf X=\mathbf U\mathbf \Sigma\mathbf V^\top$

and attempting to construct the covariance matrix from this decomposition gives

$\begin{align*} \mathbf X\mathbf X^\top&=(\mathbf U\mathbf \Sigma\mathbf V^\top)(\mathbf U\mathbf \Sigma\mathbf V^\top)^\top\\ \mathbf X\mathbf X^\top&=(\mathbf U\mathbf \Sigma\mathbf V^\top)(\mathbf V\mathbf \Sigma\mathbf U^\top) \end{align*}$

and since $\mathbf V$ is an orthogonal matrix ($\mathbf V^\top \mathbf V=\mathbf I$),

$\mathbf X\mathbf X^\top=\mathbf U\mathbf \Sigma^2 \mathbf U^\top$

and the correspondence is easily seen: the square roots of the eigenvalues of $\mathbf X\mathbf X^\top$ are the singular values of $\mathbf X$, and (when the eigenvalues are distinct) the eigenvector matrix $\mathbf W$ is the matrix of left singular vectors $\mathbf U$, up to a possible sign flip of each column.
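
Here is a minimal NumPy sketch (an added illustration, not part of the original answer) that runs both routes on a small zero-mean data matrix and checks that the singular values equal the square roots of the eigenvalues of $\mathbf X\mathbf X^\top$, and that $\mathbf W$ matches $\mathbf U$ up to column signs:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 200))        # 5 features (rows) x 200 samples (columns)
X -= X.mean(axis=1, keepdims=True)       # preprocess to zero mean, as assumed above

# PCA route: eigendecomposition of X X^T (the 1/(n-1) factor is omitted,
# matching the answer; it only rescales the eigenvalues)
evals, W = np.linalg.eigh(X @ X.T)
idx = np.argsort(evals)[::-1]            # eigh returns ascending order; sort descending
evals, W = evals[idx], W[:, idx]

# SVD route: X = U Sigma V^T
U, s, Vt = np.linalg.svd(X, full_matrices=False)

print(np.allclose(np.sqrt(evals), s))    # True: sqrt(eigenvalues of X X^T) = singular values
print(np.allclose(np.abs(W), np.abs(U))) # True: W equals U up to a sign flip per column
```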

In fact, using the SVD to perform PCA makes much better sense numerically than forming the covariance matrix to begin with, since the formation of $\mathbf X\mathbf X^\top$ can cause loss of precision. This is detailed in books on numerical linear algebra, but I'll leave you with an example of a matrix whose SVD can be computed stably, while forming $\mathbf X\mathbf X^\top$ is disastrous, namely the Läuchli matrix:

$\begin{pmatrix}1&1&1\\ \epsilon&0&0\\0&\epsilon&0\\0&0&\epsilon\end{pmatrix}^\top$

where $\epsilon$ is a tiny number.
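
To see the failure mode numerically, here is a small NumPy sketch (an added illustration, not from the original answer) with $\epsilon = 10^{-9}$, so that $\epsilon^2$ falls below double-precision roundoff relative to 1: the SVD recovers the two tiny singular values, while the $\mathbf X\mathbf X^\top$ route loses them.

```python
import numpy as np

eps = 1e-9                               # tiny: eps**2 is below double-precision roundoff relative to 1
X = np.array([[1.0, 1.0, 1.0],
              [eps, 0.0, 0.0],
              [0.0, eps, 0.0],
              [0.0, 0.0, eps]]).T        # the 3 x 4 (transposed) Läuchli matrix from above

# Exact singular values: sqrt(3 + eps^2), eps, eps
print(np.linalg.svd(X, compute_uv=False))      # ~ [1.732, 1e-09, 1e-09]: recovered correctly

evals = np.linalg.eigvalsh(X @ X.T)            # X X^T rounds to the all-ones 3x3 matrix in double precision
print(np.sqrt(np.clip(evals, 0.0, None)))      # ~ [0, 0, 1.732]: the tiny singular values are lost
```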

  • 3
    To give a *Mathematica* example: `a = SparseArray[{{i_, 1} -> 1, {i_, j_} /; i + 1 == j :> $MachineEpsilon}, {3, 4}];` and then compare `Sqrt[Eigenvalues[a.Transpose[a]]]` and `SingularValueList[a, Tolerance -> 0]`.2010-09-02
  • 0
    I see. So, in essence, SVD compares U to V, while PCA compares U to U itself. SVD gives you an eigenvector decomposition of the data, while PCA takes that decomposition and compares one side to itself to see which ones are more dominant.2010-10-29
  • 9
    If $X$ is the data matrix, isn't the covariance matrix defined by $E[XX^T] - E[X]E[X]^T$, not $XX^T$? And so PCA is equivalent to SVD on the mean centered data matrix.2011-05-03
  • 4
    @Jason: Ah yes, I was assuming that the data was already "zero mean" for the purposes of this answer. I'll edit in your note later.2011-05-04
  • 0
    @J.M. could you elaborate on why the $AA^T$ calculation is disastrous for the matrix you've given? I calculated $AA^T$ for the numbers you've specified, and I get a 3x3 matrix with all 1's. Am I doing something wrong?2013-01-21
  • 10
    I'm not sure about this, but I think the covariance matrix of an $n \times m$ data matrix (with $n$ samples each of size $m$) is $(X X^T) / (n - 1)$, so your example only works if the data matrix is standardized to have unit variance along each column. Somebody please correct me if I'm wrong.2013-01-29
  • 2
    @JeffTerrell Right so, but I think the normalizing factor was left out for the sake of readability.2013-02-13
  • 0
    @user6786, and you noticed that the matrix you got is singular, no? The true singular values, however, are tiny but not zero...2013-03-23
  • 2
    Note that in practice, the columns of $W$ and $U$ (the principal components via the eigendecomposition versus singular value decomposition) may differ from each other by a factor of -1.2014-01-16
  • 1
    @J.M. - It is a bit unclear whether the data matrix consists of row vectors or column vectors; it might be good to mention this so there is no misunderstanding.2014-01-21
  • 1
    @J. M., for completeness what is the mathematical relation between the `W` matrix defined in your PCA explanation and the `U` matrix defined in your SVD explanation?2014-09-04
  • 9
    This was a little confusing in that normally the data matrix has n rows of samples of data with d dimensions along columns, like a least squares design matrix. If that is true then the covariance is $X^TX$, and the SVD result is $V\Sigma V^T$. I was also confused by the lack of normalization initially. But altogether a pretty clear explanation.2015-03-03
  • 1
    Maybe we should add that the columns of $U$ are also the eigenvectors.2015-11-05
  • 0
    @J.M. Please provide some good reference for this statement: "forming the covariance matrix to begin with, since the formation of $\mathbf X\mathbf X^\top$ can cause loss of precision"2016-03-08
  • 1
    @sv_jan, did ya even try the example I gave, if you have to ask that?2016-03-08
  • 0
    @J.M. I wanted some reference such as a standard textbook or a research paper.2016-03-09
  • 1
    @sv_jan5, if memory serves, Golub/Van Loan mention the Läuchli example. A number of other numerical linear algebra books will have this discussion and similar examples; look around.2016-03-09
  • 0
    @J.M. You guessed it right. I found its reference in Golub/Van Loan on pg. 239. Thanks for help!2016-03-09
  • 0
    @J.M.isnotamathematician I have a question about the covariance matrix $C = XX^\top$. In this setting, you assume that the rows of $X$ are the features and the columns are the samples, right? Otherwise, if the samples are the rows and the features the columns, it should be $C = X^\top X$.2018-03-02
  • 0
    @sera, "In this settings, you assume that X rows are the features and columns the samples. right?" - yes. Note that the Läuchli example is actually a "wide" matrix and not a "tall" one due to the transposition.2018-03-02
  • 0
    @J.M.isnotamathematician Thank you. Could you add some sentences about how I can reduce my data using SVD à la PCA? The PCA-reduced data would be `Xreduced = np.dot(Wtransposed, X)`, right? Then how can I connect this with the SVD decomposition? Thank you in advance.2018-03-03
  • 0
    @sera, I would suggest asking a new question about that instead; explaining that is a long answer itself.2018-03-04

51

*A Tutorial on Principal Component Analysis* by Jonathon Shlens is a good introduction to PCA and its relation to SVD; see in particular Section VI, "A More General Solution Using SVD".

9

The question boils down to whether you want to subtract the means and divide by the standard deviation first. The same question arises in the context of linear and logistic regression, so I'll reason by analogy.

In many problems our features are positive values such as counts of words or pixel intensities. Typically a higher count or a higher pixel intensity means that a feature is more useful for classification/regression. If you subtract the means, then you force features whose original value was zero to take a negative value of large magnitude. In effect, feature values that are unimportant to the classification problem (the previously zero-valued ones) become as influential as the most important feature values (those with high counts or pixel intensities).

The same reasoning holds for PCA. If your features are least sensitive (informative) towards the mean of the distribution, then it makes sense to subtract the mean. If the features are most sensitive towards the high values, then subtracting the mean does not make sense.

SVD does not subtract the means; instead, its first component often ends up roughly aligned with the mean of all data points, so the SVD takes care of this global structure first.
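
As a small illustration of this point (an added sketch, not from the original answer), take two features where one has a large mean but little spread and the other has zero mean but a large spread: without centering, the leading singular vector is dominated by the mean offset, whereas after centering it picks out the direction of greatest variance.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
data = np.column_stack([rng.normal(10.0, 1.0, n),    # feature 1: large mean, small spread
                        rng.normal(0.0, 3.0, n)])    # feature 2: zero mean, large spread
                                                     # (rows are samples here)

_, _, Vt_raw = np.linalg.svd(data, full_matrices=False)
_, _, Vt_ctr = np.linalg.svd(data - data.mean(axis=0), full_matrices=False)

print(Vt_raw[0])   # roughly [1, 0] (up to sign): dominated by feature 1's offset from zero
print(Vt_ctr[0])   # roughly [0, 1] (up to sign): the direction of largest variance
```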

  • 1
    I think this answer may be a bit misleading. The fact that zero value numbers will be mapped to negative numbers of large magnitude after subtracting means doesn't mean that their influence on a statistical model is increased. Deviation from the mean *is* the information used by many (perhaps most?) statistical models to fit curves, sort items, etc. If you are concerned about a feature with a long distribution tail (e.g. counts), then there are ways of transforming that data (e.g. add 1 and take the log) so it plays nice with models based on symmetric distributions.2016-01-27

3

There is a way to do an SVD on a sparse matrix that treats missing features as missing (using gradient search). I don't know any way to do PCA on a sparse matrix except by treating missing features as zero.
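
Here is a minimal sketch of what such a gradient search might look like (an added illustration and only one possible interpretation of the idea, not a reference implementation): fit a rank-$k$ factorization $UV^\top$ by gradient descent on the squared error over the observed entries only, so that missing entries are genuinely ignored rather than imputed as zero. The matrix names, step size, and iteration count are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, k = 30, 20, 3
X_true = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))   # a rank-3 "ground truth"
mask = rng.random((m, n)) < 0.5                                      # True where an entry is observed

U = 0.1 * rng.standard_normal((m, k))   # factors to be learned
V = 0.1 * rng.standard_normal((n, k))
lr = 0.01

for _ in range(5000):
    R = (U @ V.T - X_true) * mask       # residual on observed entries only; missing ones contribute nothing
    gU, gV = R @ V, R.T @ U             # gradients of 0.5 * ||R||_F^2 with respect to U and V
    U -= lr * gU
    V -= lr * gV

err = np.abs(U @ V.T - X_true)
print(err[mask].mean(), err[~mask].mean())   # both should be small once converged: the unobserved
                                             # entries are recovered from the low-rank structure
```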