The powerbend distribution provides a unified model for the species abundance distribution across animals, plants and microbes

SAD data

To test the universality of SAD models in animals and plants, the species abundance data were downloaded from Baldridge et al. study¹⁸, which includes the Breeding Bird Survey (BBS)³³, Alwyn H. Gentry’s Forest Transect Data Set (GENTRY)³⁴, the Mammal Community Database (MCDB)³⁵, the Forest Inventory and Analysis (FIA)³⁶ and SAD data compiled from literature for an assortment of taxa³⁷. To test microbial SADs, the bacterial and archaeal species abundance data were downloaded from Shoemaker et al. study²⁵, which includes data from the Earth Microbiome Project (EMP)³⁸, the first phase of Human Microbiome Project (HMP1)³⁹, and the MG-RAST repository (MGRAST)⁴⁰. Because the rare tail of the SAD contains critical information for model fitting, we included species with a single read, known as singletons, in our analysis to avoid potential bias in model selection, consistent with the practice of Shoemaker et al.²⁵ Our simulations showed that when the number of observed species in a SAD is small or when the sampling effort is low, model selection using AIC does not have sufficient power to distinguish SAD models, often favoring the simpler models even when the simpler model is incorrect (Supplementary Fig. S1). The more complex the model is, the more data points (i.e., species) are required to recover the true model. To strike a balance between the number of SADs we can analyze and the power of model selection, and following the practice of the previous studies^17,18, we filtered out SADs with less than 10 species to test base SAD models for animal and plant SAD data, and SADs with less than 100 Operational Taxonomic Units (OTUs) to test compound SAD models for microbial SAD data. In addition, in the microbial SAD data, we identified 687 outlier SADs where the number of doubleton species exceeded the number of singleton species by more than 10-fold (Supplementary Fig. S3). These outliers all originated from the EMP dataset compiled by Shoemaker et al.²⁵ but are no longer present in the current EMP dataset (year 2017 version). Consequently, we excluded these outliers from our analyses. This resulted in a total of 13,819 animal and plant SADs and 15,329 microbial SADs (see Supplementary Table S2).

The animal and plant dataset encompasses diverse taxonomic groups (see Supplementary Table S2). To ensure equitable representation across these groups, we applied weighting to each SAD based on the size of the dataset it originated from when assessing the frequency of a model being the best by AIC or its goodness of fit. For instance, a SAD within the Mammal Community Database (MCDB, 103 samples) carried 26.9 times the weight of a SAD from the Breeding Bird Survey (BBS, 2,769 samples).

SAD models

The lognormal SAD model has the probability density function:

$${\varPhi }_{{lognorm}}\left({n;}\,\mu,\sigma \right)=\frac{1}{n\sigma \sqrt{2\pi }}{e}^{-\frac{{\left(\log \left(n\right)-\mu \right)}^{2}}{2{\sigma }^{2}}}\,\left(n\in {{\mathbb{R}}}^{+}\right)$$

(1)

where n represents the species abundance, and μ and σ are the mean and standard deviation of log-transformed n, respectively. The logarithm is calculated with the natural base e. Given that the lognormal distribution is continuous while the SAD is inherently discrete, it is necessary to convolute it with a sampling error when fitting it to SAD data.

The logseries distribution has the probability mass function:

$${\varPhi }_{{ls}}\left({n;}\lambda \right)=-\frac{1}{\log \left(1-{e}^{-\lambda }\right)}\cdot \frac{{e}^{-\lambda n}}{n}\,\left(n\in {{\mathbb{Z}}}^{+}\right)$$

(2)

where λ is the exponential rate at which the numerator decays.

The power law distribution has the probability mass function:

$${\varPhi }_{{power}}\left({n;s}\right)=\frac{{n}^{-s}}{\,\zeta \left(s\right)}\,\left(n\in {{\mathbb{Z}}}^{+}\right)$$

(3)

where s > 1 is the scaling parameter that controls the distribution’s decay rate, and $\zeta \left(s\right)$ is the Riemann zeta function, serving as a normalization factor to ensure the sum of the probabilities over all possible values of n equals 1.

The powerbend distribution takes a more general form compared to the logseries and power law distributions. It is a hybrid of a power law and an exponential function, in which the exponential function bends the power law by setting an upper bound to the power law¹⁶. It has the probability mass function:

$${\varPhi }_{{powbend}}\left({n;s},\lambda \right)=\frac{1}{Z}\cdot \frac{{e}^{-\lambda n}}{{n}^{s}}\,\left(n\in {{\mathbb{Z}}}^{+}\right)$$

(4)

where s is the order of the denominator, λ is the exponential rate at which the numerator decays, and Z is the probability normalizer. It should be noted that the powerbend distribution is a generalized distribution that can degenerate into the logseries, geometric, broken-stick, and power law distributions by setting its parameters to certain fixed values (Supplementary Note 1).

Modeling sampling effort in 16S rRNA survey

To relate the number of observed reads in 16S rRNA profiling data to the number of individual cells in the community, we explicitly modeled the sampling effort by convoluting a sampling error to the base SAD models. The 16S rRNA profile of a community typically comprises thousands of reads. Assuming no bias, the sampling error can be approximated by a Poisson distribution whose mean represents the expected number of reads given the number of individual cells. We used parameter η to denote the expected number of reads per individual cell, which represents the sampling effort. By convoluting the sampling error to the base SAD models, we derived the sampling-explicit SAD models defined on non-negative integers $k\in {0\cup {\mathbb{Z}}}^{+}$:

$${\varPhi }_{{poils}}\left({k;}\,\lambda,\eta \right)={\sum }_{n=1}^{\infty }\frac{{\left(\eta n\right)}^{k}{e}^{-\eta n}}{k!}\cdot {\varPhi }_{{ls}}\left({n;}\,\lambda \right)$$

(5)

$${\varPhi }_{{poipower}}\left({k;s},\eta \right)={\sum }_{n=1}^{\infty }\frac{{\left(\eta n\right)}^{k}{e}^{-\eta n}}{k!}\cdot {\varPhi }_{{power}}\left({n;s}\right)$$

(6)

$${\varPhi }_{{poipowbend}}\left({k;s},\lambda,\eta \right)={\sum }_{n=1}^{\infty }\frac{{\left(\eta n\right)}^{k}{e}^{-\eta n}}{k!}\cdot {\varPhi }_{{powbend}}\left({n;s},\lambda \right)$$

(7)

It should be noted that for Poisson lognormal distribution, because the number of individuals n and the sampling effort η always appear together in the formula as their product, there is no way to disentangle the two parameters. Therefore, the sampling effort η is fixed to 1, and the probability mass function of the Poisson lognormal distribution has the same number of parameters as the base distribution:

$${\varPhi }_{{poilog}}\left({k;}\,\mu,\sigma \right)={\int }_{0}^{\infty }\frac{{n}^{k}{e}^{-n}}{k!}\cdot {\varPhi }_{{lognorm}}\left({n;}\,\mu,\sigma \right)\cdot {dn}$$

(8)

In practice, there may be bias in 16S rRNA sequencing resulting from biases in DNA extraction and PCR amplification. Thus, a Poisson sampling error may not be sufficient to accurately relate the sequence reads to the number of individual cells. Because the direction and magnitude of such bias are often unknown, its effect on the distribution of sequence reads can be seen as an inflation of the variance or over-dispersion without changing the mean. As a result, we used the negative binomial distribution to model such bias⁴¹, and derived the corresponding probability mass function for SAD models:

$${\varPhi }_{{nblog}}\left({k;}\,\mu,\sigma,r\right)={\int }_{0}^{\infty }\frac{\varGamma \left(r+k\right)\cdot {r}^{k}\cdot {n}^{r}}{\varGamma \left(r\right)\cdot k!\cdot {\left(r+n\right)}^{r+k}}\cdot {\varPhi }_{{lognorm}}\left({n;}\,\mu,\sigma \right)\cdot {dn}$$

(9)

$${\varPhi }_{{nbls}}\left({k;}\,\lambda,\eta,r\right)={\sum }_{n=1}^{\infty }\frac{\varGamma \left(r+k\right)\cdot {r}^{k}\cdot {\left(\eta n\right)}^{r}}{\varGamma \left(r\right)\cdot k!\cdot {\left(r+\eta n\right)}^{r+k}}\cdot {\varPhi }_{{ls}}\left({n;}\,\lambda \right)$$

(10)

$${\varPhi }_{{nbpowber}}\left({k;s},\eta,r\right)={\sum }_{n=1}^{\infty }\frac{\varGamma \left(r+k\right)\cdot {r}^{k}\cdot {\left(\eta n\right)}^{r}}{\varGamma \left(r\right)\cdot k!\cdot {\left(r+\eta n\right)}^{r+k}}\cdot {\varPhi }_{{powber}}\left({n;s}\right)$$

(11)

$${\varPhi }_{{nbpowbend}}\left({k;s},\lambda,\eta,r\right)={\sum }_{n=1}^{\infty }\frac{\varGamma \left(r+k\right)\cdot {r}^{k}\cdot {\left(\eta n\right)}^{r}}{\varGamma \left(r\right)\cdot k!\cdot {\left(r+\eta n\right)}^{r+k}}\cdot {\varPhi }_{{powbend}}\left({n;s},\lambda \right)$$

(12)

In the above equations, r measures the degree of over-dispersion or bias in the negative binomial distribution and Γ is the gamma function.

Using 200 randomly selected microbial SADs, we compared the use of Poisson and negative binomial distributions to model the sampling effort in the 16S rRNA sequencing pipeline. Our result indicated that SAD models with the Poisson error structure are superior to those with the negative binomial error structure (Supplementary Table S8). Therefore, we used the Poisson error structure for testing the full microbial SAD dataset.

Supplementary Table S1 lists the models tested in this study and their free parameters.

Fitting SAD models to empirical data

Because all SAD models (both base models and their sampling-explicit counterparts) are formulated as probability distributions, we fitted them to the observed abundances in empirical data using the maximum likelihood (ML) framework. Because species with zero observations are not recorded in empirical data, the likelihood of a single species with k observations (denoted as L(k)) for sampling-explicit SAD models is normalized by the cumulative probability of any positive observation:

$$L\left(k\right)=\frac{P\left(k\right)}{{\sum }_{m=1}^{\infty }P\left(m\right)}$$

(13)

In the above formula, P(k) is the probability mass function for one of the sampling-explicit SAD models described in the previous section. We compared the performance of SAD models using AIC.

We fitted all models with a Poisson or negative binomial error structure using the R package ‘microSAD’. Additionally, we fitted the logseries, power law, powerbend, and Weibull models using the R package ‘sads’. For the gambin model, we used the R package ‘gambin’²³.

Evaluating the goodness of fit of SAD models

The goodness of fit of SAD models was evaluated by comparing predicted and observed species abundances in the rank-abundance distribution (RAD). Briefly, to generate the expected RAD for a SAD with S observed species, we placed S quantiles on the cumulative distribution function (CDF) of the fitted SAD model, evenly dividing the cumulative probability (the y-axis of the CDF curve). For comparison between the expected and the observed RADs, we used the modified coefficient of determination around the 1:1 line ${r}_{m}^{2}$ as described in previous studies^17,25,27 to quantify the goodness of fit:

$${r}_{m}^{2}=1-\frac{{\sum }_{i=1}^{S}{\left({log}\left({x}_{i}\right)- {log}\left({y}_{i}\right)\right)}^{2}}{{\sum }_{i=1}^{S}{\left({log}\left({x}_{i}\right)-\frac{1}{S}{\sum }_{j=1}^{S}{log}\left({x}_{j}\right)\right)}^{2}}$$

(14)

In the above formula, x and y are the observed and the predicted abundances (e.g., number of reads in 16S rRNA profiling data), respectively. The subscripts denote the rank of species, and S is the total number of observed species in the SAD. Because the coefficient of determination here is calculated against the 1:1 line with a fixed intercept of 0, it is possible that its value drops below zero, which indicates a poor fit.

The value of ${{r}}_{m}^{2}$ reaches 1 if and only if the observed and the predicted abundances match perfectly, indicating a perfect fit of SAD. To determine if the ${r}_{m}^{2}$ of a model fitted to a SAD (observed ${r}_{m}^{2}$) is significantly lower than 1, we established a baseline distribution of ${r}_{m}^{2}$ values from 1000 iterations of Monte-Carlo simulation. In each iteration, we simulated a RAD based on the fitted SAD model and calculated ${r}_{m}^{2}$ between the simulated RAD and the expected RAD. We then employed a one-sided test to assess the lack-of-fit by calculating the empirical frequency at which the baseline ${r}_{m}^{2}$ values were smaller than the observed ${r}_{m}^{2}$.

In addition, the goodness of fit of SAD models was evaluated by comparing observed and predicted SAD evenness, rareness, and dominance. Evenness of a SAD was measured using Shannon’s evenness E_H, also known as Pielou’s evenness J⁴². Rareness was measured by the log-modulo of skewness of a SAD, as described in refs. ^25,29, with log-transformed species abundances utilized for the skewness calculation. Dominance is measured simply as the abundance of the most abundant species (N_max) in the SAD, also as described in^25,29.

Evaluating the effects of observed species number and sampling effort on SAD model selection

We investigated the influence of the number of observed species on the efficacy of model selection by AIC. We generated simulated communities employing either a logseries ($\lambda$ = 10⁻²) or powerbend ($s=1.5,\,\lambda$ = 10⁻²) SAD model. The simulated communities varied in the observed species number, ranging from 10 to 100 species per community. For each level of species count, we simulated 100 communities under each SAD model. Subsequently, we fitted the logseries and powerbend models to the simulated data and recorded the frequency at which each model emerged as the best-fit model by AIC.

Likewise, we examined the effect of sampling effort on SAD model selection. We conducted simulations using communities generated according to either a powerbend (s = 1.5 and $\lambda$ = 10⁻⁵) or lognormal ($\mu$ = 0 and $\sigma$ = 3.25) SAD model. For each SAD model, we simulated 100 communities, each consisting of 10⁴ species, by randomly drawing species abundances from the respective SAD distribution. Next, we simulated the sampling process by randomly drawing the number of observations for each species from a Poisson distribution, whose mean was set to be the product of the species abundance and the sampling effort. We simulated 9 levels of sampling effort, evenly spaced from 10⁻¹ to 10⁻⁵ on a log-scale. Subsequently, we fitted the Poisson powerbend and Poisson lognormal models to the simulated data and recorded the frequency at which each model appeared as the best model at each level of sampling effort.

Evaluating the effect of OTU sequence identity threshold on SAD model selection

To investigate the effect of OTU sequence identity threshold on SAD model selection, we clustered OTU at different identity thresholds for the 565 SADs from the HMP1 dataset using Mothur (version 1.48)⁴³. Specifically, we extracted all aligned 16S rRNA gene sequences in the HMP1 dataset from the summary table of the Mothur pipeline. We calculated the pairwise distance between unique sequences using the function “dist.seqs” (with arguments cutoff=0.20 and output=lt) and clustered the unique sequences with the average neighbor algorithm using the function “cluster” (with arguments: method=average and cutoff=0.20). OTU tables were generated at the sequence identity threshold of 95% and 99% using the function “make.shared” (with argument: label=0.01-0.05). We fitted SAD models to these two datasets, conducted model selection through AIC, and compared the outcome with that of the original HMP1 dataset.

Evaluating the effect of 16S rRNA gene copy number (GCN) correction on SAD model selection

Because 16S rRNA GCN variation can bias the species composition estimated using 16S rRNA read counts⁴⁴, we assessed its impact on the outcome of SAD model selection by performing model selection on the 565 SADs from the HMP1 dataset. We predicted GCN for each OTU and then fitted models on the species abundance data corrected for GCN variation. Specifically, we selected the most abundant sequence in each OTU as its representative sequence. We then predicted the 16S GCN of each OTU using RasperGade16S⁴⁵, an R package that predicts 16S GCN based on phylogenetic relatedness. In SAD model fitting, we modeled 16S GCN as an OTU-specific multiplier to the sampling effort η. Consequently, only SAD models with a Poisson sampling error structure were included in this analysis. Model selection was performed through AIC, and the outcomes were compared with those obtained without 16S GCN correction.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Source link

The powerbend distribution provides a unified model for the species abundance distribution across animals, plants and microbes

SAD data

SAD models

Modeling sampling effort in 16S rRNA survey

Fitting SAD models to empirical data

Evaluating the goodness of fit of SAD models

Evaluating the effects of observed species number and sampling effort on SAD model selection

Evaluating the effect of OTU sequence identity threshold on SAD model selection

Evaluating the effect of 16S rRNA gene copy number (GCN) correction on SAD model selection

Reporting summary

More From Forest Beat

.tdi_122{margin-bottom:15px!important}@media (min-width:768px) and (max-width:1018px){.tdi_122{margin-bottom:8px!important}}@media (max-width:767px){.tdi_122{margin-bottom:8px!important}}Threat reduction and targeted recovery are both essential

.tdi_142{margin-bottom:15px!important}@media (min-width:768px) and (max-width:1018px){.tdi_142{margin-bottom:8px!important}}@media (max-width:767px){.tdi_142{margin-bottom:8px!important}}Book review: ‘The Dales Slipper: Past-Present’ by Paul Redshaw

.tdi_162{margin-bottom:15px!important}@media (min-width:768px) and (max-width:1018px){.tdi_162{margin-bottom:8px!important}}@media (max-width:767px){.tdi_162{margin-bottom:8px!important}}anti-colonialism, conservation and climate change

About

Service

Newsletter

Threat reduction and targeted recovery are both essential

Book review: ‘The Dales Slipper: Past-Present’ by Paul Redshaw

anti-colonialism, conservation and climate change