PCA is essentially an operation that rotates and scales the axes of the
variable space so as to maximize the variance along the new axes. Since
rotation and scaling are linear transformations, PCA preserves the linear
relationships in the data. PCA is basically an ordination technique that
consists of an eigenanalysis of the correlation or covariance matrix. The
underlying aim of PCA is to reduce the dimensionality of the original
dataset while retaining most of the original variability.
The resultant principal components are effectively new axes for the
data. Thus they are orthogonal and uncorrelated.
PCA is often motivated by the search for latent variables. Often it is
relatively easy to label the highest (first) or second-highest component,
but it becomes increasingly difficult as less relevant axes are examined.
The objects with the highest loadings (see below) or projections on the
axes (i.e. those which are placed towards the extremities of the axes) are
usually worth examining: the axis may be characterisable as a spectrum
running from a small number of objects with high positive loadings to
those with high negative loadings.
Also see Principal Components
Regression
The procedure to carry out a PCA on a given set of data involves finding
the eigenvectors and eigenvalues of the covariance or correlation matrix.
The data should be normalized (autoscaled) so as to remove the effects of
large differences in the order of magnitude of the variables. The
resultant eigenvectors are known as loadings and are the coefficients for
calculating linear combinations of the original variables (the scores).
The eigenvalues give the variances of the principal component scores. The
algorithm can be summarized as
- autoscale (normalize) the data,
- compute the covariance or correlation matrix of the autoscaled data,
- find the eigenvectors (loadings) and eigenvalues of this matrix.
Once we have the eigenvectors and eigenvalues we can transform the
original dataset by projecting the original points onto the new axes.
In the case of molecular descriptors the algorithm is as follows (a code
sketch is given after this list):
- Consider a molecule described by a vector of $p$ descriptors,
$\mathbf{x} = (x_1, x_2, \ldots, x_p)$. The loading matrix $\mathbf{L}$
obtained from the eigenanalysis will be a $p \times p$ matrix with the
eigenvectors in its columns (so the first column, $\mathbf{l}_1$, is the
eigenvector for the first PC).
- Thus the projection (score) of the original descriptor vector along the
first PC is given by the dot product
$t_1 = \mathbf{x} \cdot \mathbf{l}_1 = \sum_{j=1}^{p} x_j l_{j1}$.
- Repeat for all the molecules and all the PCs.
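To make the procedure concrete, here is a minimal sketch in Python/NumPy
(not part of the original text; the function name `pca_scores` and the
variable names are assumptions for the example). It autoscales the data,
eigenanalyzes the covariance (or correlation) matrix, and projects each
descriptor vector onto the loadings to obtain the scores.

```python
import numpy as np

def pca_scores(X, use_correlation=True):
    """PCA via eigendecomposition of the covariance/correlation matrix.

    X is an (n_molecules x p_descriptors) data matrix. Returns the
    loadings (eigenvectors in columns), the eigenvalues (variances of
    the PC scores) and the scores (projections onto the new axes).
    """
    # Autoscale: centre each descriptor, and divide by its standard
    # deviation when the correlation matrix is to be used.
    Xc = X - X.mean(axis=0)
    if use_correlation:
        Xc = Xc / X.std(axis=0, ddof=1)

    # Covariance matrix of the autoscaled data (equals the correlation
    # matrix of the raw data when the columns have been standardized).
    C = np.cov(Xc, rowvar=False)

    # Eigenanalysis; eigh is used because C is symmetric.
    eigenvalues, loadings = np.linalg.eigh(C)

    # Order the PCs by decreasing eigenvalue (variance explained).
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, loadings = eigenvalues[order], loadings[:, order]

    # The score of molecule i on PC k is the dot product x_i . l_k,
    # i.e. the projection of its descriptor vector onto the new axis.
    scores = Xc @ loadings
    return loadings, eigenvalues, scores
```

For a descriptor matrix `X`, `loadings[:, 0]` is then the eigenvector for
the first PC and `scores[:, 0]` contains the projections of all molecules
onto it.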
Once we have obtained the principal components (also termed factors) we
need to decide how many to retain. In general this is chosen empirically
as the number of components that account for 95% of the total variance.
However, there are two other methods (a brief sketch of these rules is
given after the list):
- The Kaiser criterion, which says that one should only retain factors
whose corresponding eigenvalues are greater than 1. This method can
sometimes lead to too many factors being retained.
- The scree method, in which we plot each eigenvalue against its serial
number. Looking at this plot, one finds the point to the right of which
the plot more or less flattens out; all components to the left of this
point (the point itself included) are chosen. This method can sometimes
lead to too few factors being chosen.
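A minimal sketch of the first two rules in Python/NumPy (not from the
original text; `eigenvalues` is assumed to be the descending-sorted
eigenvalue array, e.g. as returned by the routine above):

```python
import numpy as np

def n_components_variance(eigenvalues, threshold=0.95):
    """Smallest number of PCs whose cumulative fraction of the total
    variance reaches the given threshold (95% by default)."""
    frac = np.cumsum(eigenvalues) / np.sum(eigenvalues)
    return int(np.searchsorted(frac, threshold) + 1)

def n_components_kaiser(eigenvalues):
    """Kaiser criterion: retain factors whose eigenvalue exceeds 1.
    Sensible when the eigenvalues come from a correlation matrix, where
    each autoscaled variable contributes unit variance."""
    return int(np.sum(eigenvalues > 1.0))

# The scree method is graphical: plot each eigenvalue against its serial
# number and keep the components to the left of (and including) the point
# where the curve flattens out, e.g. with matplotlib:
#   plt.plot(np.arange(1, len(eigenvalues) + 1), eigenvalues, "o-")
```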
The defining characteristic that distinguishes the two factor-analytic
models is that in principal components analysis we assume that all the
variability in an item should be used in the analysis, while in principal
factors analysis we only use the variability that an item has in common
with the other items. In most cases these two methods yield very similar
results. However, principal components analysis is often preferred as a
method for data reduction, while principal factors analysis is often
preferred when the goal of the analysis is to detect structure.
If the correlation matrix contains variables that are 100% redundant, then
the inverse of the matrix cannot be computed. For example, if a variable
is the sum of two other variables selected for the analysis, then the
correlation matrix of those variables cannot be inverted, and the factor
analysis basically cannot be performed. The problem can be overcome by
artificially lowering all correlations in the correlation matrix: a small
constant is added to the diagonal of the matrix, which is then
restandardized. This procedure will usually yield a matrix that can be
inverted and thus factor-analyzed; moreover, the factor patterns should
not be affected by this procedure. However, note that the resulting
estimates are not exact.
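The fix amounts to adding a small ridge to the diagonal and rescaling back
to a unit diagonal; a minimal sketch in Python/NumPy (illustrative only;
the constant 0.01 is an arbitrary choice, not from the original text):

```python
import numpy as np

def regularize_correlation(R, eps=0.01):
    """Add a small constant to the diagonal of a correlation matrix R and
    restandardize it so the diagonal is 1 again. All off-diagonal
    correlations are lowered slightly, which usually makes the matrix
    invertible (at the cost of slightly inexact estimates)."""
    R = R + eps * np.eye(R.shape[0])   # inflate the diagonal
    d = np.sqrt(np.diag(R))            # new "standard deviations"
    return R / np.outer(d, d)          # rescale to unit diagonal
```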