PCAtools package - RDocumentation (2024)

Kevin Blighe, Aaron Lun2021-07-23

Principal Component Analysis (PCA) is a very powerful technique that haswide applicability in data science, bioinformatics, and further afield.It was initially developed to analyse large volumes of data in order totease out the differences/relationships between the logical entitiesbeing analysed. It extracts the fundamental structure of the datawithout the need to build any model to represent it. This ‘summary’ ofthe data is arrived at through a process of reduction that can transformthe large number of variables into a lesser number that are uncorrelated(i.e.the ‘principal components’), while at the same time being capableof easy interpretation on the original data (Blighe and Lun 2019)(Blighe 2013).

PCAtools provides functions for data exploration via PCA, and allowsthe user to generate publication-ready figures. PCA is performed viaBiocSingular (Lun 2019) - users can also identify optimal number ofprincipal components via different metrics, such as elbow method andHorn’s parallel analysis (Horn 1965) (Buja and Eyuboglu 1992), which hasrelevance for data reduction in single-cell RNA-seq (scRNA-seq) and highdimensional mass cytometry data.

1. Download the package from Bioconductor

 if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager') BiocManager::install('PCAtools')

Note: to install development version direct from GitHub:

 if (!requireNamespace('devtools', quietly = TRUE)) install.packages('devtools') devtools::install_github('kevinblighe/PCAtools')

2. Load the package into R session

 library(PCAtools)

For this example, we will follow the tutorial (from Section 3.1) ofRNA-seq workflow: gene-level exploratory analysis and differentialexpression.Specifically, we will load the ‘airway’ data, where different airwaysmooth muscle cells were treated with dexamethasone.

 library(airway) library(magrittr) data('airway') airway$dex %<>% relevel('untrt')

Annotate the Ensembl gene IDs to gene symbols:

 ens <- rownames(airway) library(org.Hs.eg.db) symbols <- mapIds(org.Hs.eg.db, keys = ens, column = c('SYMBOL'), keytype = 'ENSEMBL') symbols <- symbols[!is.na(symbols)] symbols <- symbols[match(rownames(airway), names(symbols))] rownames(airway) <- symbols keep <- !is.na(rownames(airway)) airway <- airway[keep,]

Normalise the data and transform the normalised counts tovariance-stabilised expression levels:

 library('DESeq2') dds <- DESeqDataSet(airway, design = ~ cell + dex) dds <- DESeq(dds) vst <- assay(vst(dds))

Conduct principal component analysis (PCA):

 p <- pca(vst, metadata = colData(airway), removeVar = 0.1)
## -- removing the lower 10% of variables based on variance

A scree plot

 screeplot(p, axisLabSize = 18, titleLabSize = 22)

A bi-plot

Different interpretations of the biplot exist. In the OMICs era, formost general users, a biplot is a simple representation of samples in a2-dimensional space, usually focusing on just the first two PCs:

 biplot(p)

However, the original definition of a biplot by Gabriel KR (Gabriel1971) is a plot that plots both variables and observations (samples) inthe same space. The variables are indicated by arrows drawn from theorigin, which indicate their ‘weight’ in different directions. We touchon this later via the plotLoadings function.

 biplot(p, showLoadings = TRUE, labSize = 5, pointSize = 5, sizeLoadingsNames = 5)

Here, we will instead start with data from Gene ExpressionOmnibus. We will load breast cancergene expression data with recurrence free survival (RFS) from GeneExpression Profiling in Breast Cancer: Understanding the Molecular Basisof Histologic Grade To ImprovePrognosis.

First, let’s read in and prepare the data:

 library(Biobase) library(GEOquery) # load series and platform data from GEO gset <- getGEO('GSE2990', GSEMatrix = TRUE, getGPL = FALSE) mat <- exprs(gset[[1]]) # remove Affymetrix control probes mat <- mat[-grep('^AFFX', rownames(mat)),] # extract information of interest from the phenotype data (pdata) idx <- which(colnames(pData(gset[[1]])) %in% c('relation', 'age:ch1', 'distant rfs:ch1', 'er:ch1', 'ggi:ch1', 'grade:ch1', 'size:ch1', 'time rfs:ch1')) metadata <- data.frame(pData(gset[[1]])[,idx], row.names = rownames(pData(gset[[1]]))) # tidy column names colnames(metadata) <- c('Study', 'Age', 'Distant.RFS', 'ER', 'GGI', 'Grade', 'Size', 'Time.RFS') # prepare certain phenotypes of interest metadata$Study <- gsub('Reanalyzed by: ', '', as.character(metadata$Study)) metadata$Age <- as.numeric(gsub('^KJ', NA, as.character(metadata$Age))) metadata$Distant.RFS <- factor(metadata$Distant.RFS, levels = c(0,1)) metadata$ER <- factor(gsub('\\?', NA, as.character(metadata$ER)), levels = c(0,1)) metadata$ER <- factor(ifelse(metadata$ER == 1, 'ER+', 'ER-'), levels = c('ER-', 'ER+')) metadata$GGI <- as.numeric(as.character(metadata$GGI)) metadata$Grade <- factor(gsub('\\?', NA, as.character(metadata$Grade)), levels = c(1,2,3)) metadata$Grade <- gsub(1, 'Grade 1', gsub(2, 'Grade 2', gsub(3, 'Grade 3', metadata$Grade))) metadata$Grade <- factor(metadata$Grade, levels = c('Grade 1', 'Grade 2', 'Grade 3')) metadata$Size <- as.numeric(as.character(metadata$Size)) metadata$Time.RFS <- as.numeric(gsub('^KJX|^KJ', NA, metadata$Time.RFS)) # remove samples from the pdata that have any NA value discard <- apply(metadata, 1, function(x) any(is.na(x))) metadata <- metadata[!discard,] # filter the expression data to match the samples in our pdata mat <- mat[,which(colnames(mat) %in% rownames(metadata))] # check that sample names match exactly between pdata and expression data all(colnames(mat) == rownames(metadata))
## [1] TRUE

Conduct principal component analysis (PCA):

 p <- pca(mat, metadata = metadata, removeVar = 0.1)
## -- removing the lower 10% of variables based on variance

A bi-plot

 biplot(p)
 biplot(p, showLoadings = TRUE, lab = NULL)

One of the probes pointing downward is 205225_at, which targets theESR1 gene. This is already a useful validation, as the oestrogenreceptor, which is in part encoded by ESR1, is strongly represented byPC2 (y-axis), with negative-to-positive receptor status going fromtop-to-bottom.

More on this later in this vignette.

A pairs plot

 pairsplot(p)

A loadings plot

If the biplot was previously generated with showLoadings = TRUE, checkhow this loadings plot corresponds to the biplot loadings - they shouldmatch up for the tophits.

 plotloadings(p, labSize = 3)
## -- variables retained:## 215281_x_at, 214464_at, 211122_s_at, 210163_at, 204533_at, 205225_at, 209351_at, 205044_at, 202037_s_at, 204540_at, 215176_x_at, 214768_x_at, 212671_s_at, 219415_at, 37892_at, 208650_s_at, 206754_s_at, 205358_at, 205380_at, 205825_at## Warning: ggrepel: 17 unlabeled data points (too many overlaps). Consider## increasing max.overlaps

An eigencor plot

 eigencorplot(p, metavars = c('Study','Age','Distant.RFS','ER', 'GGI','Grade','Size','Time.RFS'))

Access the internal data

The rotated data that represents the observations / samples is stored inrotated, while the variable loadings are stored in loadings

 p$rotated[1:5,1:5]
## PC1 PC2 PC3 PC4 PC5## GSM65752 -30.24272 43.826310 3.781677 -39.536149 18.612835## GSM65753 -37.73436 -15.464421 -4.913100 -5.877623 9.060108## GSM65755 -29.95155 7.788280 -22.980076 -15.222649 23.123766## GSM65757 -33.73509 1.261410 -22.834375 2.494554 13.629207## GSM65758 -40.95958 -8.588458 4.995440 14.340150 0.417101
 p$loadings[1:5,1:5]
## PC1 PC2 PC3 PC4 PC5## 206378_at -0.0024336244 -0.05312797 -0.004809456 0.04045087 0.0096616577## 205916_at -0.0051057533 0.00122765 -0.010593760 0.04023264 0.0285972617## 206799_at 0.0005723191 -0.05048096 -0.009992964 0.02568142 0.0024626261## 205242_at 0.0129147329 0.02867789 0.007220832 0.04424070 -0.0006138609## 206509_at 0.0019058729 -0.05447596 -0.004979062 0.01510060 -0.0026213610

All functions in PCAtools are highly configurable and should covervirtually all basic and advanced user requirements. The followingsections take a look at some of these advanced features, and form asomewhat practical example of how one can use PCAtools to make aclinical interpretation of data.

First, let’s sort out the gene annotation by mapping the probe IDs togene symbols. The array used for this study was the Affymetrix U133a, solet’s use the hgu133a.db Bioconductor package:

 suppressMessages(require(hgu133a.db)) newnames <- mapIds(hgu133a.db, keys = rownames(p$loadings), column = c('SYMBOL'), keytype = 'PROBEID')
## 'select()' returned 1:many mapping between keys and columns
 # tidy up for NULL mappings and duplicated gene symbols newnames <- ifelse(is.na(newnames) | duplicated(newnames), names(newnames), newnames) rownames(p$loadings) <- newnames

Determine optimum number of PCs to retain

A scree plot on its own just shows the accumulative proportion ofexplained variation, but how can we determine the optimum number of PCsto retain?

PCAtools provides four metrics for this purpose:

  • Elbow method
  • Horn’s parallel analysis (Horn 1965) (Buja and Eyuboglu 1992).
  • Marchenko-Pastur limit
  • Gavish-Donoho method

Let’s perform Horn’s parallel analysis first:

 horn <- parallelPCA(mat) horn$n
## [1] 11

Now the elbow method:

 elbow <- findElbowPoint(p$variance) elbow
## PC8 ## 8

In most cases, the identified values will disagree. This is becausefinding the correct number of PCs is a difficult task and is akin tofinding the ‘correct’ number of clusters in a dataset - there is nocorrect answer.

Taking these values, we can produce a new scree plot and mark these:

 library(ggplot2) screeplot(p, components = getComponents(p, 1:20), vline = c(horn$n, elbow)) + geom_label(aes(x = horn$n + 1, y = 50, label = 'Horn\'s', vjust = -1, size = 8)) + geom_label(aes(x = elbow + 1, y = 50, label = 'Elbow method', vjust = -1, size = 8))

If all else fails, one can simply take the number of PCs thatcontributes to a pre-selected total of explained variation, e.g., inthis case, 27 PCs account for >80% explained variation.

 which(c*msum(p$variance) > 80)[1]
## PC27 ## 27

Modify bi-plots

The bi-plot comparing PC1 versus PC2 is the most characteristic plot ofPCA. However, PCA is much more than the bi-plot and much more than PC1and PC2. This said, PC1 and PC2, by the very nature of PCA, are indeedusually the most important parts of a PCA analysis.

In a bi-plot, we can shade the points by different groups and add manymorefeatures.

Colour by a metadata factor, use a custom label, add lines through origin, and add legend

 biplot(p, lab = paste0(p$metadata$Age, ' años'), colby = 'ER', hline = 0, vline = 0, legendPosition = 'right')
## Warning: ggrepel: 20 unlabeled data points (too many overlaps). Consider## increasing max.overlaps

Supply custom colours and encircle variables by group

The encircle functionality literally draws a polygon around each groupspecified by colby. It says nothing about any statistic pertaining toeach group.

 biplot(p, colby = 'ER', colkey = c('ER+' = 'forestgreen', 'ER-' = 'purple'), colLegendTitle = 'ER-\nstatus', # encircle config encircle = TRUE, encircleFill = TRUE, hline = 0, vline = c(-25, 0, 25), legendPosition = 'top', legendLabSize = 16, legendIconSize = 8.0)
## Warning: ggrepel: 26 unlabeled data points (too many overlaps). Consider## increasing max.overlaps
 biplot(p, colby = 'ER', colkey = c('ER+' = 'forestgreen', 'ER-' = 'purple'), colLegendTitle = 'ER-\nstatus', # encircle config encircle = TRUE, encircleFill = FALSE, encircleAlpha = 1, encircleLineSize = 5, hline = 0, vline = c(-25, 0, 25), legendPosition = 'top', legendLabSize = 16, legendIconSize = 8.0)
## Warning: ggrepel: 26 unlabeled data points (too many overlaps). Consider## increasing max.overlaps

Stat ellipses

Stat ellipses are also drawn around each group but have a greaterstatistical meaning and can be used, for example, as a strictdetermination of outlier samples. Here, we draw ellipses around eachgroup at the 95% confidence level:

 biplot(p, colby = 'ER', colkey = c('ER+' = 'forestgreen', 'ER-' = 'purple'), # ellipse config ellipse = TRUE, ellipseType = 't', ellipseLevel = 0.95, ellipseFill = TRUE, ellipseAlpha = 1/4, ellipseLineSize = 1.0, xlim = c(-125,125), ylim = c(-50, 80), hline = 0, vline = c(-25, 0, 25), legendPosition = 'top', legendLabSize = 16, legendIconSize = 8.0)
## Warning: ggrepel: 40 unlabeled data points (too many overlaps). Consider## increasing max.overlaps
 biplot(p, colby = 'ER', colkey = c('ER+' = 'forestgreen', 'ER-' = 'purple'), # ellipse config ellipse = TRUE, ellipseType = 't', ellipseLevel = 0.95, ellipseFill = TRUE, ellipseAlpha = 1/4, ellipseLineSize = 0, ellipseFillKey = c('ER+' = 'yellow', 'ER-' = 'pink'), xlim = c(-125,125), ylim = c(-50, 80), hline = 0, vline = c(-25, 0, 25), legendPosition = 'top', legendLabSize = 16, legendIconSize = 8.0)
## Warning: ggrepel: 40 unlabeled data points (too many overlaps). Consider## increasing max.overlaps

Change shape based on tumour grade, remove connectors, and add titles

 biplot(p, colby = 'ER', colkey = c('ER+' = 'forestgreen', 'ER-' = 'purple'), hline = c(-25, 0, 25), vline = c(-25, 0, 25), legendPosition = 'top', legendLabSize = 13, legendIconSize = 8.0, shape = 'Grade', shapekey = c('Grade 1' = 15, 'Grade 2' = 17, 'Grade 3' = 8), drawConnectors = FALSE, title = 'PCA bi-plot', subtitle = 'PC1 versus PC2', caption = '27 PCs ≈ 80%')

Modify line types, remove gridlines, and increase point size

 biplot(p, lab = NULL, colby = 'ER', colkey = c('ER+'='royalblue', 'ER-'='red3'), hline = c(-25, 0, 25), vline = c(-25, 0, 25), vlineType = c('dotdash', 'solid', 'dashed'), gridlines.major = FALSE, gridlines.minor = FALSE, pointSize = 5, legendPosition = 'left', legendLabSize = 14, legendIconSize = 8.0, shape = 'Grade', shapekey = c('Grade 1'=15, 'Grade 2'=17, 'Grade 3'=8), drawConnectors = FALSE, title = 'PCA bi-plot', subtitle = 'PC1 versus PC2', caption = '27 PCs ≈ 80%')

Let’s plot the same as above but with loadings:

 biplot(p, # loadings parameters showLoadings = TRUE, lengthLoadingsArrowsFactor = 1.5, sizeLoadingsNames = 4, colLoadingsNames = 'red4', # other parameters lab = NULL, colby = 'ER', colkey = c('ER+'='royalblue', 'ER-'='red3'), hline = 0, vline = c(-25, 0, 25), vlineType = c('dotdash', 'solid', 'dashed'), gridlines.major = FALSE, gridlines.minor = FALSE, pointSize = 5, legendPosition = 'left', legendLabSize = 14, legendIconSize = 8.0, shape = 'Grade', shapekey = c('Grade 1'=15, 'Grade 2'=17, 'Grade 3'=8), drawConnectors = FALSE, title = 'PCA bi-plot', subtitle = 'PC1 versus PC2', caption = '27 PCs ≈ 80%')

Colour by a continuous variable and plot other PCs

There are two ways to colour by a continuous variable. In the first way,we simply ‘add on’ a continuous colour scale viascale_colour_gradient:

 # add ESR1 gene expression to the metadata p$metadata$ESR1 <- mat['205225_at',] biplot(p, x = 'PC2', y = 'PC3', lab = NULL, colby = 'ESR1', shape = 'ER', hline = 0, vline = 0, legendPosition = 'right') + scale_colour_gradient(low = 'gold', high = 'red2')

We can also just permit that the internal ggplot2 engine picks thecolour scheme - here, we also plot PC10 versus PC50:

 biplot(p, x = 'PC10', y = 'PC50', lab = NULL, colby = 'Age', hline = 0, vline = 0, hlineWidth = 1.0, vlineWidth = 1.0, gridlines.major = FALSE, gridlines.minor = TRUE, pointSize = 5, legendPosition = 'left', legendLabSize = 16, legendIconSize = 8.0, shape = 'Grade', shapekey = c('Grade 1'=15, 'Grade 2'=17, 'Grade 3'=8), drawConnectors = FALSE, title = 'PCA bi-plot', subtitle = 'PC10 versus PC50', caption = '27 PCs ≈ 80%')

Quickly explore potentially informative PCs via a pairs plot

The pairs plot in PCA unfortunately suffers from a lack of use; however,for those who love exploring data and squeezing every last ounce ofinformation out of data, a pairs plot provides for a relatively quickway to explore useful leads for other downstream analyses.

As the number of pairwise plots increases, however, space becomeslimited. We can shut off titles and axis labeling to save space.Reducing point size and colouring by a variable of interest canadditionally help us to rapidly skim over the data.

 pairsplot(p, components = getComponents(p, c(1:10)), triangle = TRUE, trianglelabSize = 12, hline = 0, vline = 0, pointSize = 0.4, gridlines.major = FALSE, gridlines.minor = FALSE, colby = 'Grade', title = 'Pairs plot', plotaxes = FALSE, margingaps = unit(c(-0.01, -0.01, -0.01, -0.01), 'cm'))

We can arrange these in a way that makes better use of the screen spaceby setting ‘triangle = FALSE’. In this case, we can further control thelayout with the ‘ncol’ and ‘nrow’ parameters, although, the functionwill automatically determine these based on your input data.

 pairsplot(p, components = getComponents(p, c(4,33,11,1)), triangle = FALSE, hline = 0, vline = 0, pointSize = 0.8, gridlines.major = FALSE, gridlines.minor = FALSE, colby = 'ER', title = 'Pairs plot', titleLabSize = 22, axisLabSize = 14, plotaxes = TRUE, margingaps = unit(c(0.1, 0.1, 0.1, 0.1), 'cm'))

Determine the variables that drive variation among each PC

If, on the bi-plot or pairs plot, we encounter evidence that 1 or morePCs are segregating a factor of interest, we can explore further thegenes that are driving these differences along each PC.

For each PC of interest, ‘plotloadings’ determines the variables fallingwithin the top/bottom 5% of the loadings range, and then creates a finalconsensus list of these. These variables are then plotted.

The loadings plot, like all others, is highly configurable. To modifythe cut-off for inclusion / exclusion of variables, we userangeRetain, where 0.01 equates to the top/bottom 1% of the loadingsrange per PC.

 plotloadings(p, rangeRetain = 0.01, labSize = 4.0, title = 'Loadings plot', subtitle = 'PC1, PC2, PC3, PC4, PC5', caption = 'Top 1% variables', shape = 24, col = c('limegreen', 'black', 'red3'), drawConnectors = TRUE)
## -- variables retained:## POGZ, CDC42BPA, CXCL11, ESR1, SFRP1, EEF1A2, IGKC, GABRP, CD24, PDZK1## Warning: ggrepel: 1 unlabeled data points (too many overlaps). Consider## increasing max.overlaps

At least one interesting finding is 205225_at / ESR1, which is byfar the gene most responsible for variation along PC2. The previousbi-plots showed that this PC also segregated ER+ from ER- patients. Theother results could be explored. Also, from the biplots with loadingsthat we have already generated, this result is also verified in these.

With the loadings plot, in addition, we can instead plot absolute valuesand modify the point sizes to be proportional to the loadings. We canalso switch off the line connectors and plot the loadings for any PCs

 plotloadings(p, components = getComponents(p, c(4,33,11,1)), rangeRetain = 0.1, labSize = 4.0, absolute = FALSE, title = 'Loadings plot', subtitle = 'Misc PCs', caption = 'Top 10% variables', shape = 23, shapeSizeRange = c(1, 16), col = c('white', 'pink'), drawConnectors = FALSE)
## -- variables retained:## CXCL11, IGKC, CXCL9, 210163_at, 214768_x_at, 211645_x_at, 211644_x_at, IGHA1, 216491_x_at, 214777_at, 216576_x_at, 212671_s_at, IL23A, PLAAT4, 212588_at, 212998_x_at, KRT14, GABRP, SOX10, PTX3, TTYH1, CPB1, KRT15, MYBPC1, DST, CXADR, GALNT3, CDH3, TCIM, DHRS2, MMP1, CRABP1, CST1, MAGEA3, ACOX2, PRKAR2B, PLCB1, HDGFL3, CYP2B6, ORM1, 205040_at, HSPB8, SCGB2A2, JCHAIN, POGZ, 213872_at, DYNC2LI1, CDC42BPA

We can plot just this single PC and flip the plot on its side, if wewish:

 plotloadings(p, components = getComponents(p, c(2)), rangeRetain = 0.12, absolute = TRUE, col = c('black', 'pink', 'red4'), drawConnectors = TRUE, labSize = 4) + coord_flip()
## -- variables retained:## S100A8, PROM1, CXCL11, MMP1, FABP7, 205029_s_at, CXCL9, 210163_at, UBD, IGHG3, RARRES1, 206392_s_at, CXCL10, GBP1, ASPM, CDC20, NAT1, ESR1, SCUBE2## Warning: ggrepel: 2 unlabeled data points (too many overlaps). Consider## increasing max.overlaps

Correlate the principal components back to the clinical data

Further exploration of the PCs can come through correlations withclinical data. This is also a mostly untapped resource in the era of‘big data’ and can help to guide an analysis down a particular path.

We may wish, for example, to correlate all PCs that account for 80%variation in our dataset and then explore further the PCs that havestatistically significant correlations.

‘eigencorplot’ is built upon another function by the PCAtoolsdevelopers, namelyCorLevelPlot. Furtherexamples can be found there.

 eigencorplot(p, components = getComponents(p, 1:27), metavars = c('Study','Age','Distant.RFS','ER', 'GGI','Grade','Size','Time.RFS'), col = c('darkblue', 'blue2', 'black', 'red2', 'darkred'), cexCorval = 0.7, colCorval = 'white', fontCorval = 2, posLab = 'bottomleft', rotLabX = 45, posColKey = 'top', cexLabColKey = 1.5, scale = TRUE, main = 'PC1-27 clinical correlations', colFrame = 'white', plotRsquared = FALSE)

We can also supply different cut-offs for statistical significance,apply p-value adjustment, plot R-squared values, and specify correlationmethod:

 eigencorplot(p, components = getComponents(p, 1:horn$n), metavars = c('Study','Age','Distant.RFS','ER','GGI', 'Grade','Size','Time.RFS'), col = c('white', 'cornsilk1', 'gold', 'forestgreen', 'darkgreen'), cexCorval = 1.2, fontCorval = 2, posLab = 'all', rotLabX = 45, scale = TRUE, main = bquote(Principal ~ component ~ Pearson ~ r^2 ~ clinical ~ correlates), plotRsquared = TRUE, corFUN = 'pearson', corUSE = 'pairwise.complete.obs', corMultipleTestCorrection = 'BH', signifSymbols = c('****', '***', '**', '*', ''), signifCutpoints = c(0, 0.0001, 0.001, 0.01, 0.05, 1))

Clearly, PC2 is coming across as the most interesting PC in thisexperiment, with highly statistically significant correlation(p<0.0001) to ER status, tumour grade, and GGI (genomic Grade Index),an indicator of response. It comes as no surprise that the gene drivingmost variationn along PC2 is ESR1, identified from our loadings plot.

This information is, of course, not new, but shows how PCA is much morethan just a bi-plot used to identify outliers!

Plot the entire project on a single panel

 pscree <- screeplot(p, components = getComponents(p, 1:30), hline = 80, vline = 27, axisLabSize = 14, titleLabSize = 20, returnPlot = FALSE) + geom_label(aes(20, 80, label = '80% explained variation', vjust = -1, size = 8)) ppairs <- pairsplot(p, components = getComponents(p, c(1:3)), triangle = TRUE, trianglelabSize = 12, hline = 0, vline = 0, pointSize = 0.8, gridlines.major = FALSE, gridlines.minor = FALSE, colby = 'Grade', title = '', plotaxes = FALSE, margingaps = unit(c(0.01, 0.01, 0.01, 0.01), 'cm'), returnPlot = FALSE) pbiplot <- biplot(p, # loadings parameters showLoadings = TRUE, lengthLoadingsArrowsFactor = 1.5, sizeLoadingsNames = 4, colLoadingsNames = 'red4', # other parameters lab = NULL, colby = 'ER', colkey = c('ER+'='royalblue', 'ER-'='red3'), hline = 0, vline = c(-25, 0, 25), vlineType = c('dotdash', 'solid', 'dashed'), gridlines.major = FALSE, gridlines.minor = FALSE, pointSize = 5, legendPosition = 'none', legendLabSize = 16, legendIconSize = 8.0, shape = 'Grade', shapekey = c('Grade 1'=15, 'Grade 2'=17, 'Grade 3'=8), drawConnectors = FALSE, title = 'PCA bi-plot', subtitle = 'PC1 versus PC2', caption = '27 PCs ≈ 80%', returnPlot = FALSE) ploadings <- plotloadings(p, rangeRetain = 0.01, labSize = 4, title = 'Loadings plot', axisLabSize = 12, subtitle = 'PC1, PC2, PC3, PC4, PC5', caption = 'Top 1% variables', shape = 24, shapeSizeRange = c(4, 8), col = c('limegreen', 'black', 'red3'), legendPosition = 'none', drawConnectors = FALSE, returnPlot = FALSE) peigencor <- eigencorplot(p, components = getComponents(p, 1:10), metavars = c('Study','Age','Distant.RFS','ER', 'GGI','Grade','Size','Time.RFS'), cexCorval = 1.0, fontCorval = 2, posLab = 'all', rotLabX = 45, scale = TRUE, main = "PC clinical correlates", cexMain = 1.5, plotRsquared = FALSE, corFUN = 'pearson', corUSE = 'pairwise.complete.obs', signifSymbols = c('****', '***', '**', '*', ''), signifCutpoints = c(0, 0.0001, 0.001, 0.01, 0.05, 1), returnPlot = FALSE) library(cowplot) library(ggplotify) top_row <- plot_grid(pscree, ppairs, pbiplot, ncol = 3, labels = c('A', 'B Pairs plot', 'C'), label_fontfamily = 'serif', label_fontface = 'bold', label_size = 22, align = 'h', rel_widths = c(1.10, 0.80, 1.10)) bottom_row <- plot_grid(ploadings, as.grob(peigencor), ncol = 2, labels = c('D', 'E'), label_fontfamily = 'serif', label_fontface = 'bold', label_size = 22, align = 'h', rel_widths = c(0.8, 1.2)) plot_grid(top_row, bottom_row, ncol = 1, rel_heights = c(1.1, 0.9))

Make predictions on new data

It is possible to use the variable loadings as part of a matrixcalculation to ‘predict’ principal component eigenvectors in new data.This is elaborated in a posting by Pandula Priyadarshana: How to usePrincipal Component Analysis (PCA) to makePredictions.

The pca class, which is created by PCAtools, is not configured towork with stats::predict; however, trusty prcomp class isconfigured. We can manually create a prcomp object and then use thatin model prediction, as elaborated in the following code chunk:

 p <- pca(mat, metadata = metadata, removeVar = 0.1)
## -- removing the lower 10% of variables based on variance
 p.prcomp <- list(sdev = p$sdev, rotation = data.matrix(p$loadings), x = data.matrix(p$rotated), center = TRUE, scale = TRUE) class(p.prcomp) <- 'prcomp' # for this simple example, just use a chunk of # the original data for the prediction newdata <- t(mat[,seq(1,20)]) predict(p.prcomp, newdata = newdata)[,1:5]
## PC1 PC2 PC3 PC4 PC5## GSM65752 11.683293 71.0152986 10.677205 -75.97644152 29.7537169## GSM65753 -10.542633 -31.9953531 -2.753783 -19.59178967 14.9924713## GSM65755 6.585509 13.4975310 -40.370389 -29.38990525 47.7142845## GSM65757 1.498398 -0.1294115 -37.336278 0.08078156 22.3448232## GSM65758 -18.049833 -14.9445805 14.890320 16.57567005 3.4010033## GSM65760 8.073473 47.5491189 -18.016340 -9.73629569 -51.7330414## GSM65761 -3.689814 7.7199606 -35.476666 -35.31465087 -40.1455143## GSM65762 3.949911 -24.9428080 4.710631 2.71721065 43.2182093## GSM65763 -20.757238 -33.3085383 22.639443 7.41053224 -9.9339918## GSM65764 -12.287305 -12.7566718 13.813429 33.75583684 17.7938583## GSM65767 -4.209505 -13.9349129 -17.814569 -14.87200276 -82.4754172## GSM65768 3.547044 39.6095431 -28.424912 40.26444836 45.6591355## GSM65769 3.754370 30.0201461 12.415498 45.74502641 37.9905308## GSM65770 2.538593 -36.6517740 54.887990 5.94021104 -0.9545218## GSM65771 -7.382089 -8.5963702 27.749060 -21.50981794 -71.4524526## GSM65772 3.735223 43.2576570 26.995375 21.01817312 -68.8193200## GSM65773 15.775812 -19.4523339 4.419158 -6.47899302 -25.2479186## GSM65774 17.589719 -28.5666333 -52.875007 -16.82207768 37.8455365## GSM65775 -3.375783 -5.2950960 27.071957 49.10111537 55.0410908## GSM65776 1.562855 -22.0947718 12.797877 7.08296875 -4.9924828

The development of PCAtools has benefited from contributions andsuggestions from:

  • Krushna Chandra Murmu
  • Jinsheng
  • Myles Lewis
  • Anna-Leigh Brown
  • Vincent Carey
  • Vince Vu
  • Guido Hooiveld
  • pwwang
  • Pandula Priyadarshana
  • Barley Rose Collier Harris
  • Bob Policastro
  • Alan O’Callaghan
sessionInfo()
## R version 4.0.3 (2020-10-10)## Platform: x86_64-pc-linux-gnu (64-bit)## Running under: Ubuntu 16.04.7 LTS## ## Matrix products: default## BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0## LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0## ## locale:## [1] LC_CTYPE=pt_BR.UTF-8 LC_NUMERIC=C ## [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=pt_BR.UTF-8 ## [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=pt_BR.UTF-8 ## [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C ## [9] LC_ADDRESS=C LC_TELEPHONE=C ## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C ## ## attached base packages:## [1] parallel stats4 stats graphics grDevices utils datasets ## [8] methods base ## ## other attached packages:## [1] ggplotify_0.0.5 cowplot_1.1.1 ## [3] hgu133a.db_3.2.3 GEOquery_2.56.0 ## [5] DESeq2_1.28.1 org.Hs.eg.db_3.11.4 ## [7] AnnotationDbi_1.53.0 magrittr_2.0.1 ## [9] airway_1.8.0 SummarizedExperiment_1.18.2## [11] DelayedArray_0.14.1 matrixStats_0.57.0 ## [13] Biobase_2.48.0 GenomicRanges_1.40.0 ## [15] GenomeInfoDb_1.24.2 IRanges_2.22.2 ## [17] S4Vectors_0.26.1 BiocGenerics_0.34.0 ## [19] PCAtools_2.5.5 ggrepel_0.9.1 ## [21] ggplot2_3.3.3 ## ## loaded via a namespace (and not attached):## [1] bitops_1.0-6 bit64_4.0.5 ## [3] ash_1.0-15 RColorBrewer_1.1-2 ## [5] tools_4.0.3 R6_2.5.0 ## [7] irlba_2.3.3 KernSmooth_2.23-18 ## [9] DBI_1.1.1 colorspace_2.0-0 ## [11] withr_2.4.1 tidyselect_1.1.0 ## [13] ggalt_0.4.0 extrafontdb_1.0 ## [15] bit_4.0.4 curl_4.3 ## [17] compiler_4.0.3 cli_2.2.0 ## [19] xml2_1.3.2 labeling_0.4.2 ## [21] scales_1.1.1 proj4_1.0-10.1 ## [23] readr_1.4.0 genefilter_1.70.0 ## [25] stringr_1.4.0 digest_0.6.27 ## [27] rmarkdown_2.6 XVector_0.28.0 ## [29] pkgconfig_2.0.3 htmltools_0.5.1.1 ## [31] extrafont_0.17 maps_3.3.0 ## [33] fastmap_1.1.0 limma_3.44.3 ## [35] highr_0.8 rlang_0.4.10 ## [37] rstudioapi_0.13 RSQLite_2.2.3 ## [39] DelayedMatrixStats_1.10.1 gridGraphics_0.5-1 ## [41] generics_0.1.0 farver_2.0.3 ## [43] BiocParallel_1.22.0 dplyr_1.0.3 ## [45] RCurl_1.98-1.2 BiocSingular_1.4.0 ## [47] GenomeInfoDbData_1.2.3 Matrix_1.3-2 ## [49] Rcpp_1.0.6 munsell_0.5.0 ## [51] fansi_0.4.2 lifecycle_0.2.0 ## [53] stringi_1.5.3 yaml_2.2.1 ## [55] MASS_7.3-53 zlibbioc_1.34.0 ## [57] plyr_1.8.6 grid_4.0.3 ## [59] blob_1.2.1 dqrng_0.2.1 ## [61] crayon_1.3.4 lattice_0.20-41 ## [63] splines_4.0.3 annotate_1.66.0 ## [65] hms_1.0.0 locfit_1.5-9.4 ## [67] knitr_1.31 pillar_1.4.7 ## [69] geneplotter_1.66.0 reshape2_1.4.4 ## [71] XML_3.99-0.5 glue_1.4.2 ## [73] evaluate_0.14 BiocManager_1.30.10 ## [75] vctrs_0.3.6 Rttf2pt1_1.3.8 ## [77] gtable_0.3.0 purrr_0.3.4 ## [79] tidyr_1.1.2 assertthat_0.2.1 ## [81] cachem_1.0.1 xfun_0.20 ## [83] rsvd_1.0.3 xtable_1.8-4 ## [85] survival_3.2-7 tibble_3.0.1 ## [87] rvcheck_0.1.8 memoise_2.0.0 ## [89] ellipsis_0.3.1

Blighe and Lun (2019)

Blighe (2013)

Horn (1965)

Buja and Eyuboglu (1992)

Lun (2019)

Gabriel (1971)

Blighe, K. 2013. “Haplotype classification using copy number variationand principal components analysis.” The Open Bioinformatics Journal7:19-24.

Blighe, K, and A Lun. 2019. “PCAtools: everything Principal ComponentsAnalysis.” https://github.com/kevinblighe/PCAtools.

Buja, A, and N Eyuboglu. 1992. “Remarks on Parallel Analysis.”Multivariate Behav. Res. 27, 509-40.

Gabriel, KR. 1971. “The Biplot Graphic Display of Matrices withApplication to Principal Component Analysis 1.” Biometrika 58 (3):453–67. http://biomet.oxfordjournals.org/content/58/3/453.short.

Horn, JL. 1965. “A rationale and test for the number of factors infactor analysis.” Psychometrika 30(2), 179-185.

Lun, A. 2019. “BiocSingular: Singular Value Decomposition forBioconductor Packages.” R package version 1.0.0,https://github.com/LTLA/BiocSingular.

PCAtools package - RDocumentation (2024)
Top Articles
Latest Posts
Article information

Author: Geoffrey Lueilwitz

Last Updated:

Views: 6490

Rating: 5 / 5 (80 voted)

Reviews: 95% of readers found this page helpful

Author information

Name: Geoffrey Lueilwitz

Birthday: 1997-03-23

Address: 74183 Thomas Course, Port Micheal, OK 55446-1529

Phone: +13408645881558

Job: Global Representative

Hobby: Sailing, Vehicle restoration, Rowing, Ghost hunting, Scrapbooking, Rugby, Board sports

Introduction: My name is Geoffrey Lueilwitz, I am a zealous, encouraging, sparkling, enchanting, graceful, faithful, nice person who loves writing and wants to share my knowledge and understanding with you.