There is grandeur in this view of life

martins bioblogg

Using R: Two plots of principal component analysis

with 5 comments

PCA is a very common method for exploration and reduction of high-dimensional data. It works by making linear combinations of the variables that are orthogonal, and is thus a way to change basis to better see patterns in data. You either do spectral decomposition of the correlation matrix or singular value decomposition of the data matrix and get linear combinations that are called principal components, where the weights of each original variable in the principal component are called loadings and the transformed data are called scores. Spurred by this question, I thought I’d share my favourite PCA plots. Of course, this example uses R and ggplot2, but you could use anything you like.

First, let us generate some nonsense data — 50 samples and 70 variables in groups of ten. Variables in the same group are related, and there is relationship between values of the variables and sample group numbers. I didn’t worry too much about the features of the data, except I wanted some patterns and quite a bit of noise. The first principal component explains approximately 20% of the variance.

sample.groups <- c(rep(1, 10), rep(2, 10), rep(3, 10),
  rep(4, 10), rep(5, 10))
variable.groups <- c(rep(1, 10), rep(2, 10), rep(3, 10),
  rep(4, 10), rep(5, 10), rep(6, 10),
  rep(7, 10))

data <- matrix(nrow=length(sample.groups), ncol=70) <- matrix(nrow=length(sample.groups), ncol=7)

for (j in 1:ncol( {
  mu <- rnorm(1, 0, 4)
  sigma <- runif(1, 5, 10)[,j] <- sample.groups*mu +
  rnorm(length(sample.groups), 0, sigma)

for (j in 1:ncol(data)) {
  mu <- runif(1, 0, 4)
  data[,j] <-[,variable.groups[j]] +
  rnorm(length(sample.groups), mu, 10)

Here is the typical correlation heatmap of the variables:


heatmap <- qplot(x=Var1, y=Var2, data=melt(cor(data)), geom="tile",

Maybe what we want to know is what variables go together, and if we can use a few of the principal components to capture some aspect of the data. So we want to know which variables have high loading in which principal components. I think that small multiples of barplots (or dotplots) of the first few principal components does this pretty well:


pca <- prcomp(data, scale=T)
melted <- cbind(, melt(pca$rotation[,1:9]))

barplot <- ggplot(data=melted) +
  geom_bar(aes(x=Var1, y=value,, stat="identity") +


As usual, I haven’t put that much effort into the look. If you were to publish this plot, you’d probably want to use something other than ggplot2 defaults, and give your axes sensible names. In cases where we don’t have a priori variable groupings we can just omit the fill colour. Maybe sorting the bars by loading could be useful to quickly identify the most influential variables.

In other applications we’re more interested in graphically looking for similarities between samples, and then we have more use for the scores. For instance, in genetics a scatterplot of the first principal components is typically used to show for patterns of genetic similarity between individuals drawn from different populations. This is a component of the so-called biplot.

scores <- data.frame(sample.groups, pca$x[,1:3])
pc1.2 <- qplot(x=PC1, y=PC2, data=scores, colour=factor(sample.groups)) +
pc1.3 <- qplot(x=PC1, y=PC3, data=scores, colour=factor(sample.groups)) +
pc2.3 <- qplot(x=PC2, y=PC3, data=scores, colour=factor(sample.groups)) +


In this case, small multiples are not as easily made with facets, but I used the multiplot function by Winston Chang.

Written by mrtnj

26 juni, 2013 den 17:50

Publicerat i data analysis, english

Tagged with

5 svar

Subscribe to comments with RSS.

  1. Hi Martin,

    great post. I recently did a similar post on using the SVD for identifying key variable, I used lattice graphics and had a slightly different approach, but basically the same idea.


    Max Gordon (@gordon_max)

    27 juni, 2013 at 21:48

    • Cool! It’s pretty funny to me that we did almost the same thing with almost completely different functions. I really want to learn a little lattice, so I’ll have a look at your code for pointers.




      29 juni, 2013 at 12:49

  2. […] script because it uses a very similar ggplot-base. However, there’s also a very nice posting over at Martin’s Bio Blog which show alternative approaches on plotting […]

  3. You probably want to add + coord_fixed() to the plots of the PCA scores in the last example. The axes should be scaled equally in such a plot.


    8 juli, 2013 at 18:54

    • Thanks! Yes, that’s probably a good thing to do, though it changes these plots very little.




      9 juli, 2013 at 17:25


Fyll i dina uppgifter nedan eller klicka på en ikon för att logga in: Logo

Du kommenterar med ditt Logga ut / Ändra )


Du kommenterar med ditt Twitter-konto. Logga ut / Ändra )


Du kommenterar med ditt Facebook-konto. Logga ut / Ändra )

Google+ photo

Du kommenterar med ditt Google+-konto. Logga ut / Ändra )

Ansluter till %s


Få meddelanden om nya inlägg via e-post.

Gör sällskap med 1 367 andra följare

%d bloggare gillar detta: