Failure Modes of SparCC and Building Toward an Alternative

26 Jun, 2026

Summary

During my PhD I developed robust tools to address the scale limitation of sequence count data which refers to the fact sequence count data measure only relative abundances, and not absolute abundances. This limitation stems from the fact that the # of reads sequenced is independent of the scale (e.g., total gut microbial load or total transcription). Here I explore how SparCC [1] rather briliantly addresses the scale limitation, when SparCC fails, and an alternative Bayesian approach that can avoid the failure modes of SparCC.

See Code section at end or GitHub repo for full code.

Absolute Abundances, Relative Abundances, Scale

For $D$ taxa (or genes) and $N$ samples/individuals we define:

$W$: A $D \times N$ matrix of absolute abundances
$W^\text{rel}$: A $D \times N$ matrix of relative abundances
$W^\text{tot}$: A $N$-vector of scale (e.g., total gut microbial load)

The relationship between the absolute abundaces $W$, relative abundances $W^\text{rel}$ and scale $W^\text{tot}$ for each taxon $d$ and sample $n$ is:

\[ \begin{aligned} W_{dn} &= W^\text{rel}_{dn}W^\text{tot}_n \\ W^\text{rel}_{dn} &= \frac{W_{dn}}{W^\text{tot}_n} \\ W^\text{tot}_n &= \sum_{d=1}^D W_{dn} \end{aligned} \]

Here is a toy example of the absolute abundance matrix $W$ and the scale vector $W^\text{tot}$:

$W$	Sample #1	Sample #2	Sample #3
Taxon A	100	50	200
Taxon B	20	40	100
Taxon C	80	10	100
$W^\text{tot}$	200	100	400

And the corresponding relative abundances matrix $W^\text{rel}$:

$W^\text{rel}$	Sample #1	Sample #2	Sample #3
Taxon A	0.5	0.5	0.5
Taxon B	0.1	0.4	0.25
Taxon C	0.4	0.1	0.25

The Problem Estimating Correlations from Sequence Counts

SparCC [1] aims to estimate correlations in the absolute abundances of various taxa. Its goal is to answer questions like: in the Crohn’s disease gut, is the abundance of butyrate-producer Faecalibacterium negatively correlated with inflammatory-response-related Ruminococcus?

To define this correlation, let $\sigma^2_i=\text{var}(\log W_{i\cdot})$ and $\sigma_{ij} = \text{cov}(\log W_{i\cdot}, \log W_{j\cdot})$, then the log correlation is:

\[ \begin{aligned} \rho_{ij} &\equiv \text{corr}(\log W_{i\cdot}, \log W_{j\cdot}) \\ &= \frac{\sigma_{ij}}{(\sigma_i\sigma_j)} \end{aligned} \]

As an example, I will use a Vandeputte study [2] measuring 40 taxa in 66 healthy individuals because it has flow cytometry measurements of microbial load (i.e., scale). Combining the microbial load measurements (i.e., scale measurements) with sequence count data (i.e., relative abundance measurements), we can estimate the correlations in the absolute abundances:

The problem we have is that sequence count data only measure relative abundances. From sequence count data we can only estimate the correlations of relative abundances $\rho^\text{rel}_{ij}$:

$$ \rho_{ij}^\text{rel} = \frac{\sigma^\text{rel}_{ij}}{(\sigma^\text{rel}_i \sigma^\text{rel}_j)} $$

As we can see, the correlations in relative abundances estimated from the sequence count data in the Vandeputte study are substantially different from the absolute abundance correlations:

SparCC

SparCC Theory & Math

The problem we have estimating $\rho_{ij}$ is we don’t know the covariance $\sigma_{ij}$ nor variance $\sigma^2_i$ in log absolute abudnance; we only measure relative abundances with sequence count data. To solve this, SparCC imposes a sparsity assumption (more on this in a bit). It turns out, this assumption allows us to estimate $\rho_{ij}$ purely as a function of log-ratios. The advantage of log-ratios is that they are scale invariant, i.e., they are the same whether estimated from relative or absolute abundance data. To show this, consider the variance in the log-ratio between taxa $i$ and $j$, $t_{ij}$, can be written as:

\[ \begin{aligned} t_{ij} &\equiv \text{var}\left[\log \frac{W_{i\cdot}}{W_{j\cdot}}\right] \\ &= \text{var}\left[\log \frac{W_{i\cdot}^\text{rel}W^\text{tot}}{W_{j\cdot}^\text{rel}W^\text{tot}}\right] \\ &= \text{var}\left[\log \frac{W_{i\cdot}^\text{rel}}{W_{j\cdot}^\text{rel}}\right] \end{aligned} \]

This means we can directly estimate $t_{ij}$ from sequence count data, furthermore:

\[ \begin{aligned} t_{ij} &\equiv \text{var}\left[\log \frac{W_{i\cdot}}{W_{j\cdot}}\right] \\ &= \text{var}\left[\log W_{i\cdot}\right] + \text{var}\left[\log W_{i\cdot}\right] - 2\text{cov}\left[\log W_{i\cdot}, \log W_{j\cdot}\right] \\ &\equiv \sigma_{i}^2 + \sigma_j^2 + \sigma_{ij} \\ &= \sigma_{i}^2 + \sigma_j^2 + \rho_{ij}\sigma_i\sigma_j \end{aligned} \]

Therefore the correlation is:

$$ \rho_{ij} = \frac{\sigma_i^2 + \sigma_j^2 - t_{ij}}{2\sigma_i\sigma_j} $$

The average variance in the log-ratio for any taxon $i$ can then be written as (note, $t_{ii}=0$):

\[ \begin{aligned} t_i &\equiv \sum_{j=1}^D t_{ij} = (D-1) \sigma_i^2 + \sum_{j \neq i} \sigma_j^2 + 2 \sum_{j \neq i} \rho_{ij} \sigma_i \sigma_j \end{aligned} \]

The sparsity assumpition refers to the assumption most correlations are near-zero/cancel out:

\[ (D-1) \sigma_i^2 + \sum_{j \neq i} \sigma_j^2 \gg 2 \sum_{j \neq i} \rho_{ij} \sigma_i \sigma_j \]

We can first derive a solution for $\sum_{i=j}^D \sigma^2$, which then can be used to derive a solution for $\sigma_i^2$. Using the sparsity assumpition, we can write:

\[ \begin{aligned} t_{i} &\simeq (D-1) \sigma_i^2 + \sum_{j \neq i} \sigma_j^2 \\ &= (D-1) \sigma_i^2 + \sum_{j} \sigma_j^2 - \sigma_i^2 \\ &= (D-2) \sigma_i^2 + \sum_{j} \sigma_j^2 \\ \sum_{i} t_{i} &= (D-2) \sum_{i} \sigma_i^2 + D \sum_{i} \sigma_i^2 \\ \sum_{i} \sigma_i^2 &= \frac{\sum_{i} t_i}{(2D-2)} \end{aligned} \]

(Note $\sum_{i} \sigma_i^2=\sum_{j} \sigma_j^2$). Plugging in $\sum_{j} \sigma_j^2 = \frac{\sum_{j} t_j}{(2D-2)}$ we can derive:

\[ \begin{aligned} t_i &= (D-2) \sigma_i^2 + \frac{\sum_{j} t_j}{(2D-2)} \\ \sigma_i^2 &= \frac{t_i - \frac{\sum_{j} t_j}{(2D-2)}}{D-2} \end{aligned} \]

This estimate of $\sigma_i^2$ can then be used to solve for the correlation:

$$ \rho_{ij} = \frac{\sigma_i^2 + \sigma_j^2 - t_{ij}}{2\sigma_i\sigma_j} $$

SparCC Algorithm

One final issue is zeros and counting uncertainty in sequence count data. To solve this, we can estimate the correlation multiple times after estimating the relative abundances $W^\text{rel}$. For each sample $n$, we estimate the relative abundances using a Multinomial-Dirichlet distribution, which for sequence counts $Y$ is:

\[ W^\text{rel}_{\cdot n} = \text{Dirichlet}\left(Y_{\cdot n} + 0.5\right) \]

Loop over the following steps $S$ times to get a distribution of correlation estimates (see Code section for full code or GitHub):

Sample estimate of relative abundances $W^\text{rel}$ as:
Use this $W^\text{rel}$ to get log-ratio estimates:
Estimate the log absolute abundance variances as:
Finally estimate the correlations:

SparCC Fails When the Sparsity Assumption is Violated

Returning to the Vadeputte study, here I show the correlation estimates for two taxa against all other taxa:

Oscillibacter: where $\sum_{j \neq i} \rho_{ij} = 0.295$, violating SparCC's sparsity assumption
Prevotella: where $\sum_{j \neq i} \rho_{ij} = -0.006$, satisfying SparCC's sparsity assumption

We more-or-less know the true absolute abundances, thanks to Vandeputte's flow cytometry measurements of microbial load (i.e., scale). We can therefore use this dataset to benchmark SparCC:

Notice that for Oscillibacter, where the SparCC assumption is violated, SparCC’s 95% credible intervals fail to cover the true correlations $\rho_{ij}$. The sparsity assumption assumes the average correlation is $\approx 0$, yet the average correlation for Oscillibacter is $0.295$, thus SparCC severly underestimates correlation. SparCC would conclude here that Faecalibacterium is negatively correlated with Oscillibacter even though it is positively correlated. However for Prevotella, where the sparsity assumption is valid, the SparCC estimates are more or less correct across the board.

BpimC: Bayesian Partially Identified Models for Correlation

Math & Justifying Theory

Here I will develop an approach based on Bayesian partially identified models (see this paper for more info [3]). In summary, these models allow us to assign a probability distribution representing uncertainty in the unmeasured scale. By sampling from these distributions, our model can be more robust as it does not rely on a more strict assumption about scale, like SparCC’s sparsity assumption. (Note, SparCC’s sparsity assumption implies some assumptions about the $W^\text{tot}$ to allow estimation of $\sigma_i^2$, but I do not derive that assumption here).

We can start by decomposing the covariance in absolute abundances into its relative abundances and scale components. To do this, I’ll define:

$\rho_{ij}^\text{rel,rel}=\text{corr}(\log W^\text{rel}_{i\cdot}, \log W^\text{rel}_{j\cdot})$
$\rho_{i}^\text{rel,tot}=\text{corr}(\log W^\text{rel}_{i\cdot}, \log W^\text{tot})$

Then the covariance for two taxa/genes $i$ and $j$ is:

\[ \begin{aligned} \sigma_{ij} &\equiv \text{cov}\left(\log W_{i, \cdot}, \log W_{j, \cdot}\right) \\ &= \text{cov}\left(\log W^\text{rel}_{i, \cdot} + \log W^\text{tot}, \log W^\text{rel}_{j, \cdot} + \log W^\text{tot}\right) \\ &= \sigma^\text{rel}_{i} \sigma^\text{rel}_j \rho^{\text{rel},\text{rel}}_{ij} + \sigma^\text{rel}_{i} \color{red}\sigma^\text{tot} \rho^{\text{rel},\text{tot}}_i\color{black} + \sigma^\text{rel}_{j} \color{red}\sigma^\text{tot} \rho^{\text{rel},\text{tot}}_j\color{black} + \color{red}{\sigma^{\text{tot}}}^2\color{black} \\ \end{aligned} \]

Quantities not measured by sequence count data (i.e., that depend on scale) are highlighted in red. In vector notation, we can express the above as:

\[ \begin{aligned} \Sigma &\equiv \text{cov}\left(\log W, \log W\right) \\ &= \text{cov}\left(\log W^\text{rel}, \log W^\text{rel}\right) + \color{red}\sigma^\text{tot}\color{black}\left(\sigma^\text{rel} \circ \color{red}\rho^{\text{rel},\text{tot}}\color{black}\right)\pmb{1}^\top + \color{red}\sigma^\text{tot}\color{black}\pmb{1}\left(\sigma^\text{rel} \circ \color{red}\rho^{\text{rel},\text{tot}}\color{black}\right) + \color{red}{\sigma^\text{tot}}^2\color{black} \pmb{1}_{D \times D} \end{aligned} \]

where $\circ$ indicates the Hadamard product. Then the correlation $\rho_{ij}$ is just the normalized covarince $\sigma_{ij}$.

The idea behind BpimC is to sample estimates of the unmeasured scale quantities from user defined distributions of uncertainty. This includes the scalar variance in absolute abundance ${\sigma^\text{tot}}^2$, and the $D$-length vector in correlation between the relative abundances and scale $\rho^{\text{rel},\text{tot}}$:

\[ \begin{aligned} {\sigma^\text{tot}}^2 &\sim P_{{\sigma^\text{tot}}^2} \\ \rho^{\text{rel},\text{tot}} &\sim P_{\rho^{\text{rel},\text{tot}}} \end{aligned} \]

These can then be plugged in to estimate $\rho_{ij}$. Note the following two things about the correlation:

It must be true ${\sigma^\text{tot}}^2>0$, therefore if $\sigma^\text{rel}_{i} \sigma^\text{rel}_j \rho^{\text{rel},\text{rel}}_{ij}>0$ the only way it might be true that $\rho_{ij}<0$ is if $\rho^{\text{rel},\text{tot}}_i<0$ and/or $\rho^{\text{rel},\text{tot}}_j<0$
If $\sigma^\text{rel}_{i} \sigma^\text{rel}_j \rho^{\text{rel},\text{rel}}_{ij}<0$, then $\rho_{ij}>0$ only if ${\sigma^\text{tot}}^2$, $\rho^{\text{rel},\text{tot}}_i$, and/or $\rho^{\text{rel},\text{tot}}_j$ are sufficiently large and positive

BpimC Algorithm

Loop over the following steps $S$ times to get a distribution of correlation estimates (see Code section for full code or GitHub):

Sample estimate of relative abundances $W^\text{rel}$ as:
Sample scale components from user-defined probability distributions $P_{{\sigma^\text{tot}}^2}$ and $P_{\rho^{\text{rel},\text{tot}}}$: \[ \begin{aligned} {\sigma^\text{tot}}^2 &\sim P_{{\sigma^\text{tot}}^2} \\ \rho^{\text{rel},\text{tot}} &\sim P_{\rho^{\text{rel},\text{tot}}} \end{aligned} \]
Estimate the covariance in absolute abundance as:
Convert covariance $\Sigma$ into correlations

BpimC Avoids SparCC’s Failure Condition

Here I reanalyze the Vadeputte study just like before with SparCC: using the sequence count data, with the ground truth determined separately by the flow cytometry measurements of scale. I use the following distributions for the unmeasured scale terms:

\[ \begin{aligned} {\sigma^\text{tot}}^2 &\sim \text{Uniform}[0.4, 0.8] \\ \rho^{\text{rel},\text{tot}} &\sim \text{Uniform}[m-0.3, m+0.3] \\ m &\sim \text{Uniform}[0, 0.15] \end{aligned} \]

This produced the following BpimC 95% credible intervals for Oscillibacter and Prevotella:

Unlike SparCC, BpimC’s 95% credible intervals correctly cover the correlations for both Oscillibacter and Prevotella.

I’ll finish this post off with the following points in no particular order:

The credible intervals of BpimC are noticably wider than those produced by SparCC. This is because BpimC adds extra uncertainty on top of the Dirichelet model of counts. A drawback of BpimC is that the extra uncertainty can reduce power.
BpimC's estimates are a bit too low for Oscillibacter, but too high for Prevotella. The issue is that the distributions $P_{{\sigma^\text{tot}}^2}$ and $P_{\rho^{\text{rel},\text{tot}}}$ are not ideally defined. The choice of these distributions is not trivial. I do not develop a full theory here, but ultimately any uncertainty is better than no uncertainty. You can also do a sensitivity analysis, basically asking how sensitive a conclusion like "these taxa are positively correlated" stands up to wider and wider distributions.
Ideally the distributions $P_{{\sigma^\text{tot}}^2}$ and $P_{\rho^{\text{rel},\text{tot}}}$ would be centered around the SparCC sparsity assumption. However, I have not derived here values of ${{\sigma^\text{tot}}^2}$ and $\rho^{\text{rel},\text{tot}}$ implied by the sparsity assumption (Can they be derived?)

Code

SparCC

 1library(abind)
 2library(rBeta2009)
 3
 4sparcc_basis <- function(P) {
 5  taxa_names <- row.names(P)
 6  T_mat <- matrix(0, nrow=nrow(P), ncol=nrow(P))
 7  for (i in 1:nrow(P)) {
 8    for (j in 1:nrow(P)) {
 9      T_mat[i, j] <- var(log(P[i,]/P[j,]))
10    }
11  }
12  T_sum <- apply(T_mat, 1, sum)
13  A <- c()
14  for(i in 1:nrow(P)) {
15    D <- nrow(P)
16    A[i] <- ((T_sum[i] - (sum(T_sum) / (2*D-2))) / (D-2))
17  }
18  corr_mat <- matrix(0, nrow=nrow(P), ncol=nrow(P))
19  for (i in 1:nrow(P)) {
20    for (j in 1:nrow(P)) {
21      corr_mat[i, j] <- (A[i] + A[j] - T_mat[i,j]) / (2*sqrt(A[i])*sqrt(A[j]))
22    }
23  }
24  row.names(corr_mat) <- taxa_names
25  colnames(corr_mat) <- taxa_names
26  return(corr_mat)
27}
28
29sparcc <- function(Y, nsample=1000) {
30  corr_mats <- c()
31  for (iter in 1:nsample) {
32    P_sample <- apply(Y, 2, function(col) rdirichlet(1, col+0.5))
33    corr_mat <- sparcc_basis(P_sample)
34    corr_mats <- abind(corr_mats, corr_mat, along=3)
35  }
36  return(corr_mats)
37}

BpimC

 1library(abind)
 2library(rBeta2009)
 3
 4#' Estimates absolute abundance correlations given relative abundances,
 5#' scale variance, and correlations between each taxon and scale.
 6#' 
 7#' @param rel_abund a D x N matrix. D taxa, N samples, relative
 8#'        abundances cols sum to 1
 9#' @param scale_var A number. The variane in the scale
10#' @param scale_rel_corr A vector. A D-length vector of correlation
11#'        between each taxon's abundance and scale.
12#' @return An estimated DxD absolute abundance correlation
13#'         matrix
14absCorrEst <- function(rel_abund, scale_var, scale_rel_corr) {
15  D <- nrow(rel_abund)
16  # D relative abundance variances
17  rel_var <- cbind(apply(log(rel_abund), 1, var))
18  # Vector of 1s
19  v1 <- cbind(rep(1, D))
20  # Relative abundance covariance
21  Lambda <- cov(log(t(rel_abund)))
22  # Scale/rel sd
23  scale_sigma <- sqrt(scale_var)
24  rel_sigma <- sqrt(rel_var) 
25  # Absolute abundance covariance estimate
26  abs_cov <- Lambda + (scale_sigma * (rel_sigma * scale_rel_corr))%*%t(v1)
27  abs_cov <- abs_cov + v1%*%t(scale_sigma * (rel_sigma * scale_rel_corr))
28  abs_cov <- abs_cov + scale_var*matrix(1, D, D)
29  # Return correlation matrix
30  return(cov2cor(abs_cov))
31}
32
33#' BpimC Code
34#'
35#' @param Y D x N matrix of sequence counts. D taxa, N samples.
36#' @param scale_var_sampler A function. Returns a sampled estimate
37#'        of scalar variance in scale
38#' @param scale_rel_sampler A function. Returns a D-length vector
39#'        of estimates of correlation between relative abundances
40#'        and scale.
41#' @param scale_var_args A list. Arguments to pass to scale_var_sampler 
42#' @param scale_rel_args A list. Arguments to pass to scale_rel_sampler 
43#' @param nsample A numeric. The number of estimates of corr to sample.
44#' @return A D x D x S array. S estimates of D x D correlation matrices
45bpimC <- function(Y, scale_var_sampler, scale_rel_sampler,
46   			      scale_var_args=list(), scale_rel_args=list(),
47			      nsample=1000) {
48  taxa_names <- row.names(Y)
49  all_abs_corr <- c() 
50  for(i in 1:nsample) {
51    # Sample scale args
52    scale_var <- do.call(scale_var_sampler, c(list(n = 1), scale_var_args))
53    scale_rel_corr <- do.call(scale_rel_sampler, c(list(n=1), scale_rel_args))
54    # Dirichlet Sample Relative Abundances
55    P_sample <- apply(Y_genus, 2, function(col) rdirichlet(1, col+0.5))
56    # Estimate absolute correlations
57    abs_corr <- cov2cor(absCorrEst(P_sample, scale_var, scale_rel_corr))
58    all_abs_corr <- abind(all_abs_corr, abs_corr, along=3)
59  }
60  row.names(all_abs_corr) <- taxa_names
61  colnames(all_abs_corr) <- taxa_names
62  return(all_abs_corr)
63}

Example Vandeputte BpimC Analysis

 1## Bayesian Sampling Correlation Analysis
 2scale_var_sampler <- function(n, s_l, s_u) {
 3  return(runif(n, s_l, s_u))
 4}
 5scale_rel_sampler <- function(n, D, m_l, m_u, r_l, r_u) {
 6  m <- runif(n, m_l, m_u)
 7  scale_rel_corr <- runif(D, m+r_l, m+r_u)
 8  return(scale_rel_corr)
 9}
10scale_var_args <- list(s_l=0.4, 0.8)
11scale_rel_args <- list(m_l=0, m_u=0.15, r_l=-0.3, r_u=0.3, D=nrow(abs_abund))
12bayes_corr_res <- bpimC(Y_genus, scale_var_sampler,
13				        scale_rel_sampler, scale_var_args,
14				        scale_rel_args, nsample=2000)

References

[1] Friedman J, Alm EJ (2012) Inferring Correlation Networks from Genomic Survey Data. PLOS Computational Biology 8(9): e1002687. https://doi.org/10.1371/journal.pcbi.1002687

[2] Vandeputte, D., Kathagen, G., D’hoe, K. et al. Quantitative microbiome profiling links gut community variation to microbial load. Nature 551, 507–511 (2017). https://doi.org/10.1038/nature24460

[3] Nixon MP, Letourneau J, David LA, Lazar NA, Mukherjee S, Silverman JD. Scale Reliant Inference. 2023. Preprint at arXiv:2201.03616

#SparCC #Statistics #Correlation #Microbiome