Study area
Our study areas were situated in central and southern Belize (Fig. 1). The northernmost study area, hereafter referred to as Manatee, comprised Manatee Forest Reserve, several smaller neighbouring protected areas, and several sites outside the protected areas (Fig. 1). The more southerly study area, hereafter referred to as Cockscomb, comprised Cockscomb Basin Wildlife Sanctuary, Mango Creek Forest Reserve and several sites outside the protected areas (Fig. 1). Generally, both areas are characterised by similar habitat types, dominated by broad-leaved moist forest on steep hills, with some lowland savannah27. However, Cockscomb has a much steeper elevational gradient, with a peak of 1041 m compared to 593 m in Manatee. Cockscomb also has greater overall forest cover compared to Manatee, which has experienced some forest loss due to hurricane damage and human activity. The final area, Tapir Mountain Nature Reserve (TMNR), was used only to collect training data for the machine learning classifier and was not used to model occupancy. This area is also dominated by broad-leaved moist forest on steep hills.
Figure 1. Location of autonomous recorder unit (ARU) deployments in Belize. Data collected in Tapir Mountain Nature Reserve (TMNR; depicted in an inset zoom panel) and Manatee were used for classifier training. Data collected in Manatee and Cockscomb Basin Wildlife Sanctuary (Cockscomb) were used for the occupancy modelling. Map created using QGIS 3.10 (QGIS.org, 2024. QGIS Geographic Information System. QGIS Association. http://www.qgis.org).
Study species
The Yucatán black howler monkey (Alouatta pigra) is a primate species native to Belize, eastern Guatemala, and south-eastern Mexico. A. pigra is an arboreal, forest-dependent species that lives in groups ranging from 2 to 12 individuals28. Home range sizes for this species vary widely, with research in Belize documenting home range sizes between 0.5 and 50 ha29,30,31. Howler monkeys emit loud, low-frequency calls, known as howls, which can be heard over long distances through the forest (> 1 km)32,33,34. Throughout the day, howler monkeys commonly engage in several bouts of howling, with each bout lasting on average around 10–15 min, and sometimes ranging up to an hour33,35. The rate of howling bouts is highest around dawn, followed by a smaller peak in the afternoon33,35.
Data collection
We deployed ARUs (AudioMoths36) at 10 points in TMNR from March 2018 to March 2019, and at 59 points across Manatee (n = 35) and Cockscomb (n = 24) between January 2021 and June 2021. ARUs were deployed along logging roads and trails, with a minimum separation distance of 2 km, except for in TMNR where device placement was determined by a probabilistic algorithm37. ARUs recorded simultaneously within each area, with no rotation of points.
ARUs were deployed in pairs using the same configuration. At each point, one ARU recorded during the day (07:00–17:00) and the other during the night (17:00–07:00). Devices were deployed primarily to monitor hunting, using custom firmware designed to detect gunshots38. The onboard classifier triggered a 4-s recording whenever it detected a gunshot-like sound, characterised by a sharp, broadband blast followed by rapid decay38. Howler monkey vocalisations share similar traits, including a sharp onset and a broadband signal followed by attenuation34. To maximise gunshot detections, the classifier was configured for high recall, resulting in low precision and a high rate of false positives. Consequently, howler monkey vocalisations frequently triggered the detection algorithm and appeared as false-positive gunshot detections in the recordings.
To avoid excessive triggering, a threshold of 100 recordings per hour was set, after which the device stopped recording for the rest of the hour. This threshold was reached roughly 25% of the time at each sensor, and there was an average of 69 min of recordings per sensor per day. While data collection was not optimised for howler monkey detection, ‘messy’ datasets such as these are a reality in conservation39,40, and data from surveys using monitoring technology increasingly contain by-catch of potential value to conservation41.
Howler monkey classifier
We trained a classifier to provide scores in the range [0, 1] for howler monkey presence within each sound file. We used a convolutional neural network (CNN) for our classification model, as these models have proven powerful for sound classification across a broad range of species, including howler monkeys10,19,42. We curated training data for the model by randomly selecting 1,000 4-s recordings from each of TMNR and Manatee, and reviewing the files for howler monkey presence by visually inspecting spectrograms and listening to the audio recordings. We supplemented these data with 7 recordings obtained from the Macaulay Library at the Cornell Lab of Ornithology, which were split into 40 4-s clips containing howler monkey vocalisations following manual review (see acknowledgements for file accreditations). We used this dataset to train a ResNet18 CNN classifier on spectrograms of the 4-s clips. We trained the classifier using OpenSoundscape v0.7.143 (Supplementary Information). We subsequently applied the classifier to the entire dataset, generating a machine learning score for each clip.
Occupancy modelling
Comparison of different frameworks
We compared occupancy frameworks using data collected within Cockscomb and Manatee. For consistency between points, we filtered the dataset to retain only data from the first 28 days of sampling at each recorder point, and additionally retained only data from the period coinciding with peak howling activity around dawn (04:00–07:00). We removed one sampling point that did not have 28 days of recordings, leaving 58 points for the analysis.
We considered the collection of recordings from each point as an individual ‘site’. In line with other occupancy studies utilising ARU data processed by machine learning classifiers15,24, we defined true occupancy at each site as the sample-at-hand occupancy, determined by whether the species was present in at least one of all the recordings from that site. We defined p as the probability of detecting the species at a site given the species was present.
Sample-at-hand occupancy is typically estimated by manually verifying all the data used within the model24. Due to the high volume of data used in our models (over 200 h of recordings), it was not feasible to verify all files, so we employed a sampling approach to obtain our best estimate of sample-at-hand occupancy. We used a combination of sampling methods to minimise the chance of underestimating sample-at-hand occupancy. Sampling included: (1) review of the ten highest-scoring files (global top ten) at each site, (2) review of the highest-scoring file at each site on every third day, (3) review of files above a decision threshold of 0.77 (on the softmax scale) at each site, with early stopping once a presence was detected, (4) scheduled listening to the first recording made every third day at each site, and (5) random listening to 2921 files (190 min). The threshold of 0.77 was chosen as it provided a good balance between recall and precision, with 85% recall and 92% precision on the validation data (Supplementary Fig. S1). This combination of random and targeted listening ensured that, in addition to high-scoring files, files spanning a wide range of machine learning scores were verified at each site (Supplementary Fig. S2).
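As an illustration, the targeted subsets could be extracted from the classifier output as in the following minimal R sketch (object and column names such as `scores`, `site` and `score` are hypothetical, not those of our scripts):

```r
library(dplyr)

# 'scores' is a hypothetical data frame of classifier output with one row
# per 4-s clip and columns: site, file, datetime, score

# (1) global top-ten highest-scoring files at each site
top_ten <- scores %>%
  group_by(site) %>%
  slice_max(score, n = 10, with_ties = FALSE) %>%
  ungroup()

# (3) files above the 0.77 decision threshold, queued for review by site;
# early stopping once a presence is confirmed is applied during listening
above_threshold <- scores %>%
  filter(score >= 0.77) %>%
  arrange(site, desc(score))
```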
In total we reviewed 4,829 files (~ 5.5 h of recordings), with a mean of 58 files (SD = 27.7) reviewed per site and a maximum of 214 files reviewed at a single site. We treated the information gained from this extensive manual verification as the best available estimate of the ‘true’ parameter values, against which models fitted to reduced levels of verification could be compared15. To assess the performance of each model we therefore calculated the difference between model-estimated occupancy and the estimated sample-at-hand occupancy. While this estimate is not perfect, it is the most accurate approximation of the ground truth achievable under the circumstances, and although we cannot completely rule out underestimating true occupancy, the diversity of verification methods was designed to minimise this risk.
For initial comparisons of methods involving thresholding (the binary and detection-count false-positive models described below) we used the decision threshold of 0.77. For continuous-score occupancy models we transformed classifier scores onto the logit scale to satisfy model assumptions (Supplementary Fig. S1). We did not include environmental covariates for occupancy, as we were primarily interested in comparing occupancy estimates and their inclusion would have vastly increased the complexity of our model comparisons. We did, however, include environmental covariates when modelling detection probability, as this was necessary to account for otherwise unexplained variation in detection probability in our dataset. Data exploration confirmed that detection probability was associated with environmental covariates including forest cover, and that it also differed between the two survey grids (Supplementary Information). Furthermore, the association with forest cover differed between the two survey grids. Consequently, we modelled detection probability as a function of ‘forest’ (the standardised mean value of forest height44 within a 2 km circular buffer around each point), ‘grid’ (a categorical covariate with two levels), and their interaction. The forest covariate ranged from 8.2 to 23.6 m in Cockscomb (mean = 19.6 m, SD = 5.1) and from 1.7 to 20.9 m in Manatee (mean = 14.1 m, SD = 5.2). We included this structure to account for heterogeneity in detection probability in all models going forward. For all false-positive models, we assumed that the false-positive rate was constant across sites. All models were run in R version 4.2.045.
Standard occupancy model
This approach involved manually verifying howler monkey presence in a small subset of files and using the resulting detection data in a standard single-season occupancy model (referred to throughout as the ‘standard’ occupancy model). We compared several approaches for selecting data to verify: (1) verification of the file with the highest classifier score on every third day at each site, resulting in ten files verified per site (referred to from here on as ‘top-ten listening’), (2) verification of all files above a threshold of 0.77 on every third day at each site (referred to from here on as ‘thresholded listening’), (3) random verification of a single file on every third day at each site (resulting in ten files verified per site), and (4) scheduled listening (as described above). The same set of randomly verified files was used wherever random verification is referred to from here on, and likewise for the top-ten files. These sampling approaches all resulted in verified files for ten separate days at each site. We subsequently constructed the detection histories by creating a separate survey replicate for each day and assigning a detection (1) to any site/day combination that had at least one file with a confirmed howler monkey detection.
Because howler monkey vocalisations are distinctive, we assumed that manual verification of files did not result in any false-positive detections24. These data were therefore suitable for the standard occupancy model4, in which site-specific occupancy, \(z_{i}\) (occupied = 1, not occupied = 0), is assumed to arise from a Bernoulli distribution:
$$z_{i} \sim Bernoulli\left( \psi \right)$$
(1)
where \(\psi\) is the probability of occurrence. As \(z\) is not directly observed, we model the observation process, where the observed detection, \(y\) at site \(i\) during survey replicate \(j\) also arises from a Bernoulli process conditional on the latent occurrence process if the species is present:
$$y_{i , j} \sim Bernoulli\left( {p_{i} z_{i} } \right)$$
(2)
$$logit\left( {p_{i} } \right) = \alpha_{0} + \alpha_{1} \times forest_{i} + \alpha_{2} \times grid_{i} + \alpha_{3} \times forest_{i} \times grid_{i}$$
(3)
where if the species is present at site \(i\), then \(p_{i}\) is the probability of detecting the species. If the species is absent at site \(i\), then \(p_{i}\) equals 0 such that we assume there is no potential for a false-positive detection. We fit the occupancy models using the R package Unmarked46 and evaluated model fit using the MacKenzie-Bailey goodness of fit test47.
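As a sketch of this step, the standard model can be fitted in unmarked roughly as follows (assuming a 58 × 10 detection matrix `det_hist` and a site covariate data frame `site_covs`; these object names are illustrative, not those of our scripts):

```r
library(unmarked)
library(AICcmodavg)

# det_hist: 58 x 10 matrix of 0/1 detections (one column per verified day)
# site_covs: data frame with 'forest' (standardised) and 'grid' (two-level factor)
umf <- unmarkedFrameOccu(y = det_hist, siteCovs = site_covs)

# detection modelled with forest, grid and their interaction; occupancy intercept-only
fit <- occu(~ forest * grid ~ 1, data = umf)

# MacKenzie-Bailey goodness-of-fit test via parametric bootstrap
gof <- mb.gof.test(fit, nsim = 1000)
```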
Binary false-positive occupancy model (Royle-Link model and extensions)
This approach involved extracting all files above a chosen decision threshold and declaring them as presences. As this process will result in false-positive detections, we used the Royle-Link false-positive occupancy model17, the simplest extension of the original occupancy model. In this model, we use the parameter \(p_{11}\) to describe the probability of detecting a species given it is present (\(y = 1 | z = 1\)), and \(p_{10}\) to describe the probability of a false-positive detection (\(y = 1|z = 0\)). The observation model is then altered to allow \(y\) to arise from a Bernoulli distribution even when \(z\) is 0:
$$y_{i , j} \sim Bernoulli\left( {p_{11, i} z_{i} + p_{10} \left( {1 - z_{i} } \right) } \right)$$
(4)
with detection probability given the species is present (\(p_{11}\)) modelled as a function of forest height and survey grid:
$$logit\left( { p_{11, i} } \right) = \alpha_{0} + \alpha_{1} \times forest_{i} + \alpha_{2} \times grid_{i} + \alpha_{3} \times forest_{i} \times grid_{i}$$
(5)
We constructed the detection history by creating a separate survey replicate for each day and assigning a detection (1) to any site/day combination that had at least one file with a classifier score above the decision threshold. This model has a multimodal likelihood, giving identical support to different sets of parameter values. We addressed this issue by constraining the parameters so that \(p_{11} > p_{10}\), an assumption supported by the validation data at the chosen threshold17. We imposed this constraint by providing arbitrarily chosen starting values consistent with this condition (0.7 for \(p_{11}\) and 0.1 for \(p_{10}\))23.
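A hedged sketch of this model in unmarked (using the illustrative objects defined above; `inits` is a hypothetical vector of starting values on the link scale, e.g. qlogis(0.7) for the \(p_{11}\) intercept and qlogis(0.1) for \(p_{10}\); see ?occuFP for the parameter ordering) is:

```r
# det_hist: 58 x 28 matrix of daily 0/1 'detections' (any file above the 0.77
# threshold that day); all occasions may contain false positives (type 2 data)
umf_fp <- unmarkedFrameOccuFP(y = det_hist,
                              siteCovs = site_covs,
                              type = c(0, ncol(det_hist), 0))

# p11 varies with forest, grid and their interaction; p10 and psi intercept-only
fit_rl <- occuFP(detformula   = ~ forest * grid,
                 FPformula    = ~ 1,
                 stateformula = ~ 1,
                 data   = umf_fp,
                 starts = inits)  # starting values chosen so that p11 > p10
```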
Extensions of this model accommodate additional sources of verified data26. We investigated two approaches for incorporating verified data: first, the multiple detection method (‘multi-method’) design, which treats verified data as ordinary occupancy data (i.e. subject only to false negatives) and models them alongside the false-positive data in a joint occupancy model26; and second, the multiple detection state (‘multi-state’) model, which models the probability of three different detection states (no detection, uncertain detection, and certain detection), with an additional parameter \(b\) giving the probability of an uncertain detection being classified as certain26. We implemented both models using the manually verified top-ten files from each site and the thresholded score data used for the Royle-Link model. For the multi-method model, we defined the method as a detection covariate as specified in Kéry and Royle (2020). We did not supply starting values for these models, as the verified data alleviated the multi-modality issues. We fit the models using the package Unmarked46.
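Under the same illustrative setup, the two extensions mainly change how occasions are typed in the unmarkedFrameOccuFP call (a sketch under those assumptions, not our exact script):

```r
# Multi-method design: 10 verified (certain) occasions placed first, followed by
# 28 threshold-based (uncertain) occasions; 'method' supplied as an observation
# covariate distinguishing the two (hypothetical matrices verified_hist, method_matrix)
y_mm   <- cbind(verified_hist, det_hist)
umf_mm <- unmarkedFrameOccuFP(y = y_mm,
                              siteCovs = site_covs,
                              obsCovs  = list(method = method_matrix),
                              type = c(ncol(verified_hist), ncol(det_hist), 0))

# Multi-state design: occasions scored 0 (no detection), 1 (uncertain detection)
# or 2 (certain detection) are supplied as type 3 data, and the parameter b is
# modelled through the Bformula argument of occuFP()
```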
False-positive occupancy model using detection counts
This model, first proposed by Chambert et al. (2018) and expanded on in Kéry and Royle (2020), uses counts of detections derived from automated classification of acoustic data, \(y_{i, j }^{a}\). The framework uses two independent data sources that are incorporated into separate observation models: binary false-positive data (\(y_{i , j}\)) and frequency data (\(y_{i, j }^{a}\)). In our case, detection data from the ARUs were used as the source of the frequency data, and as no independent data source was available for the binary false-positive component, this component was removed from the code. The observation model for \(y_{i, j }^{a}\) assumes that the frequency of detections is a Poisson random variable, with a baseline rate of true positives \(\lambda\) and a false-positive rate \(\omega\), where:
$$y_{i, j }^{a} \sim Poisson\left( {\lambda_{i} z_{i} + \omega } \right)$$
(6)
and \(\lambda\) is modelled as a function of the detection covariates, forest height and survey grid, as follows:
$$log\left( {\lambda_{i} } \right) = \alpha_{0} + \alpha_{1} \times forest_{i} + \alpha_{2} \times grid_{i} + \alpha_{3} \times forest_{i} \times grid_{i}$$
(7)
A further extension of this model allows a subset of the detection data to be manually verified to improve parameter estimates. However, model convergence with the addition of verification data was unstable; therefore, this version of the model was not implemented.
We created the detection history by summing the number of putative positive detections (i.e. all files above the chosen decision threshold) for each day. We ran the model in the R package jagsUI using Markov-chain Monte Carlo (MCMC) simulations48 (see Supplementary Information for details of model implementation). We ran the model using 3 chains with 33,000 samples, 3000 burn-in samples, and 5000 adaptation samples. We evaluated model convergence by inspecting trace plots and evaluating Gelman-Rubin statistics49, and assumed convergence when parameters had Gelman-Rubin statistics < 1.1.
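A schematic JAGS implementation of Eqs. (6)–(7), reflecting our simplification (no verified-data component) but with illustrative object names and priors of our own rather than the published script, is:

```r
library(jagsUI)

# counts: 58 x 28 matrix of daily counts of files above the 0.77 threshold
# forest: standardised forest height; grid: survey grid coded 0/1
count_fp_model <- "
model {
  psi ~ dunif(0, 1)                     # occupancy probability
  omega ~ dgamma(0.001, 0.001)          # false-positive detection rate
  for (a in 1:4) { alpha[a] ~ dnorm(0, 0.1) }

  for (i in 1:nsites) {
    z[i] ~ dbern(psi)
    log(lambda[i]) <- alpha[1] + alpha[2] * forest[i] +
                      alpha[3] * grid[i] + alpha[4] * forest[i] * grid[i]
    for (j in 1:nsurveys) {
      counts[i, j] ~ dpois(lambda[i] * z[i] + omega)
    }
  }
}
"
writeLines(count_fp_model, "count_fp_model.txt")

out <- jags(data = list(counts = counts, forest = forest, grid = grid,
                        nsites = nrow(counts), nsurveys = ncol(counts)),
            inits = function() list(z = rep(1, nrow(counts))),
            parameters.to.save = c("psi", "omega", "alpha"),
            model.file = "count_fp_model.txt",
            n.chains = 3, n.adapt = 5000, n.burnin = 3000, n.iter = 33000)
```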
Continuous-score occupancy models using classifier scores (Rhinehart et al. & Kéry and Royle)
These approaches use the classifier score of each file, which is related to the likelihood that the target sound is present, bypassing the need for a decision threshold. We compared two models using this approach, developed independently by Kéry and Royle (2020) and Rhinehart et al. (2022). Both models incorporate a model of classifier scores into the occupancy framework, whereby scores are assumed to be drawn from a normal distribution with a mean and standard deviation that vary by group \((g)\) (present or absent). If the file does not contain the species, then the score \(x_{k}\) for sample \(k\) is assumed to be assigned by the classifier as:
$$x_{k} | g = 0 \sim Normal\left( {\mu_{0} ,\sigma_{0} } \right)$$
(8)
where \(\mu_{0}\) and \(\sigma_{0}\) are the mean and standard deviation of the scores returned by the classifier for files that do not contain the target species. If the file does contain the species, then the score is assumed to be assigned by the classifier as:
$$x_{k} |g = 1 \sim Normal\left( {\mu_{1} ,\sigma_{1} } \right)$$
(9)
where \(\mu_{1}\) and \(\sigma_{1}\) are the mean and standard deviation of the scores returned by the classifier for files that do contain the target species. The parameters of the score distribution reflect classifier quality24. Classifiers with a large difference between \(\mu_{1}\) and \(\mu_{0}\) can better distinguish between the two groups; similarly, classifiers with smaller values of \(\sigma_{1}\) and \(\sigma_{0}\) distinguish between the groups with greater precision. Both models describe the overall distribution of scores as a two-component Gaussian mixture comprising negative and positive files. At sites where \(z_{i } = 1\) (the site is occupied by the species), both components (positive files and negative files) are present in the mixture, with the positive component weighted by the call rate (and, in the case of the Kéry model, also by the false-positive rate). At sites where \(z_{i } = 0\) (the site is not occupied), only the component describing negative files is present. For both models a subset of verified files can be added to improve performance. In the Kéry model, the mixture model for scores is incorporated as an additional level in the detection-count false-positive model (described above), whereas the Rhinehart model is structured more like the multiple detection methods model of Miller et al. (2011). For the Kéry model we used the same model adjustments detailed above for the detection-count model, including modelling the baseline detection rate of true positives (\(\lambda\)) as a function of the detection covariates. The Rhinehart model has an equivalent term, \(\theta\), which describes the probability that the species appears in a file at an occupied site, which we modelled as follows:
$$logit\left( {\theta_{i} } \right) = \alpha_{0} + \alpha_{1} \times forest_{i} + \alpha_{2} \times grid_{i} + \alpha_{3} \times forest_{i} \times grid_{i}$$
(10)
We implemented the models using the scripts provided in each publication, with some minor adjustments to accommodate our survey design and improve model convergence (Supplementary Information). The input data consisted of machine learning scores sampled from the full dataset by selecting the first file from each ten-minute interval. This approach was used to speed up model run time and to alleviate the model convergence issues that arose with the full dataset. As with the detection-count false-positive model, the Kéry continuous-score model failed to converge with the addition of verification data; therefore, this version of the model was not implemented. We ran the Kéry model with 3 chains of 12,000 samples, with 2000 burn-in, 1000 adaptation, and a thinning rate of 2. We ran the Rhinehart models (with and without verification data) with 3 chains of 20,000 samples and 3000 burn-in. We checked convergence of all models using the Gelman-Rubin diagnostic.
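For reference, the core of the score-mixture observation model (structured after Rhinehart et al. 2022, but with illustrative names and priors of our own rather than the published code) can be sketched in JAGS as:

```r
# score[k]: logit-transformed classifier score of file k; site[k]: its site index
score_mixture <- "
model {
  psi ~ dunif(0, 1)
  mu[1] ~ dnorm(0, 0.01)                 # mean score when the species is absent
  mu[2] ~ dnorm(0, 0.01)                 # mean score when the species is present
  # an ordering constraint on mu (e.g. mu[2] > mu[1]) helps avoid label switching
  for (c in 1:2) { tau[c] ~ dgamma(0.1, 0.1) }
  for (a in 1:4) { alpha[a] ~ dnorm(0, 0.1) }

  for (i in 1:nsites) {
    z[i] ~ dbern(psi)
    logit(theta[i]) <- alpha[1] + alpha[2] * forest[i] +
                       alpha[3] * grid[i] + alpha[4] * forest[i] * grid[i]
  }
  for (k in 1:nfiles) {
    g[k] ~ dbern(theta[site[k]] * z[site[k]])      # does file k contain the species?
    score[k] ~ dnorm(mu[g[k] + 1], tau[g[k] + 1])
    # verified files, where available, enter as fixed (data) values of g[k]
  }
}
"
```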
Influence of decision thresholds on occupancy estimates
We investigated the influence of decision threshold on occupancy estimates by constructing detection histories using incremental decision thresholds between 0.01 and 0.99 with an interval of 0.03. We ran the binary false-positive models (Royle-Link, the multi-state, and multi-method) and the detection-count false-positive model (without verification data) using each dataset (see Supplementary Information for details of model implementation).
We additionally explored the influence of decision threshold on the occupancy estimate of the standard occupancy model. Our baseline threshold for this test was 0.77, as this was the minimum threshold above which we verified data. We selected thresholds between 0.77 and 1 using an interval of 0.001, and re-ran the standard occupancy model using only verified data above the chosen threshold.
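A compact sketch of the threshold sweep for the Royle-Link model (reusing the illustrative `scores`, `site_covs` and `inits` objects above, and assuming a hypothetical `day` column) is:

```r
thresholds <- seq(0.01, 0.99, by = 0.03)

occ_by_threshold <- sapply(thresholds, function(thr) {
  # daily detection history: 1 if any file exceeds the threshold on that site/day
  # (site/days with no triggered recordings become NA, treated as missing)
  det_hist <- with(scores, tapply(score >= thr, list(site, day), max))
  umf <- unmarkedFrameOccuFP(y = det_hist, siteCovs = site_covs,
                             type = c(0, ncol(det_hist), 0))
  fit <- occuFP(detformula = ~ forest * grid, FPformula = ~ 1,
                stateformula = ~ 1, data = umf, starts = inits)
  plogis(coef(fit, type = "state"))   # intercept-only, so this is the psi estimate
})
```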
Influence of verification method on occupancy estimates
The false-positive models include extensions to accommodate manually verified data; however, it is not obvious which method should be used to select the verification data, or whether the choice of method influences estimates. Previous studies have used annotations from randomly selected files15,24. For common and frequently vocalising species such as the howler monkey, there is a relatively high probability that randomly selected files will contain positive detections. However, for rare and infrequently vocalising species this process is unlikely to confirm many positive detections without substantially increasing the number of randomly selected files to verify. For such species, an alternative strategy focusing on verifying high-scoring files, which are more likely to contain true positives, may be beneficial.
We compared model estimates for the false-positive occupancy models with the addition of (i) 10 randomly verified files, and (ii) top-ten verified files. We did not include the detection-count or the Kéry continuous score false-positive models in this comparison as they failed to converge with verification data. We used a decision threshold of 0.01 for the binary false-positive models to allow the detection histories to accommodate the same set of randomly chosen files.
Influence of temporal subsampling regime on occupancy estimates
Temporal subsampling of the data may be required to meet the models’ assumption of independence of surveys, but it is not evident how to subsample ARU and classifier score data, or how the approach used influences estimates. We compared two subsampling approaches: (i) selecting the first file within a time interval (systematic first-file subsampling), and (ii) selecting the highest scoring file within a time interval (maximum score subsampling). We explored time intervals of 10 min and 30 min for both approaches. We compared estimates from the three binary false-positive models, the detection-count false-positive model, and the Rhinehart continuous-score false-positive models without verified data using these subsampling approaches. We omitted the Kéry continuous-score false-positive model from this comparison due to instability of model convergence. We used a decision threshold of 0.77 to construct the detection history for the binary and detection-count false-positive models. A summary of all model comparisons is found in Supplementary Table S1.
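The two subsampling schemes can be expressed as a short R sketch (again assuming a hypothetical `scores` data frame, here with a POSIXct `datetime` column):

```r
library(dplyr)
library(lubridate)

# assign each file to a 10-min interval (use "30 minutes" for the coarser scheme)
scores <- scores %>%
  mutate(interval = floor_date(datetime, unit = "10 minutes"))

# (i) systematic first-file subsampling: earliest file per site and interval
first_file <- scores %>%
  group_by(site, interval) %>%
  slice_min(datetime, n = 1, with_ties = FALSE) %>%
  ungroup()

# (ii) maximum-score subsampling: highest-scoring file per site and interval
max_score <- scores %>%
  group_by(site, interval) %>%
  slice_max(score, n = 1, with_ties = FALSE) %>%
  ungroup()
```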
Efficiency of methods
We recorded computational running time for all methods. All computations were run on a MacBook with an Apple M1 Pro chip (8-core CPU), except for the CNN, which was trained in Google Colab using a single GPU.