An improved lightweight method based on EfficientNet for birdsong recognition


Data resources

The study employed two publicly available datasets: the BirdsData dataset, released by BirdsData Technology, and the World’s Wild Bird Sounds dataset obtained from the Xeno-canto website (https://xeno-canto.org). We acquired birdsong audio recordings corresponding to the BirdsData species from the Xeno-canto website and used them to augment the dataset. To ensure consistency, we employed the Pydub tool to uniformly slice and convert the downloaded audio, processing it into WAV files with a fixed duration of 2 s, in line with the BirdsData dataset. Detailed information on the scientific names, sample counts, and recording durations of the included bird species can be found in Table 1.
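The paper used Pydub for this preprocessing step; as a hedged illustration only, the same fixed-duration segmentation can be sketched with Python's standard-library wave module (the function name slice_wav and the drop-the-trailing-remainder policy are our assumptions, not the authors' exact procedure):

```python
import wave

def slice_wav(src_path, out_prefix, segment_s=2.0):
    """Split one recording into fixed-length WAV segments of segment_s
    seconds each; a trailing partial segment is dropped."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames_per_seg = int(params.framerate * segment_s)
        n_segments = params.nframes // frames_per_seg
        for i in range(n_segments):
            src.setpos(i * frames_per_seg)
            chunk = src.readframes(frames_per_seg)
            with wave.open(f"{out_prefix}_{i:03d}.wav", "wb") as dst:
                dst.setparams(params)  # nframes is corrected on close
                dst.writeframes(chunk)
    return n_segments
```

With Pydub itself, the equivalent operation slices an AudioSegment by milliseconds and exports each 2000 ms chunk as WAV.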

Table 1 Bird data sample table.

Experimental environment and evaluation metrics

To assess the effectiveness of our proposed model, we conducted experiments on an experimental platform equipped with an Intel i7-13700 F processor, 32GB of RAM, and an NVIDIA RTX 4070Ti graphics processor. The software environment consisted of a 64-bit Windows 11 operating system, and we utilized PyTorch version 2.0 as our deep learning framework.

To facilitate birdsong recognition across diverse species, we partitioned a dataset of 32,219 birdsong samples into training and test sets at a ratio of 7:3. The training set encompassed 22,553 samples, while the test set comprised 9,666 samples. The specific parameters employed in training the model are presented in Table 2.
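For illustration, a reproducible shuffle-and-split at a 7:3 ratio (the helper name and fixed seed are our own; the paper does not state its exact splitting code) recovers the reported set sizes:

```python
import random

def train_test_split(samples, test_ratio=0.3, seed=42):
    """Shuffle the sample list with a fixed seed, then hold out
    round(n * test_ratio) samples for testing."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_test = round(len(items) * test_ratio)
    return items[n_test:], items[:n_test]
```

Applied to 32,219 samples, this yields 22,553 training and 9,666 test samples, matching the counts above.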

Table 2 Model parameter table.

The model’s classification performance on birds was assessed using accuracy, precision, recall, and F1-score as evaluation metrics. Accuracy is the proportion of samples the model classifies correctly out of the total number of samples; precision is the proportion of samples predicted as positive that are truly positive; and recall indicates the model’s ability to recognize positive samples. The higher the F1-score, the better the classification effect of the model. Accuracy, precision, and recall are calculated with formulas (14), (15), and (16).

$$Accuracy=\frac{TP+TN}{TP+TN+FP+FN}$$

(14)

$$Precision=\frac{TP}{TP+FP}$$

(15)

$$Recall=\frac{TP}{TP+FN}$$

(16)

where TP denotes the number of correctly predicted positive instances, FP represents the number of incorrectly predicted positive instances, TN denotes the number of correctly predicted negative instances, and FN denotes the number of incorrectly predicted negative instances. The precision and recall obtained can be utilized to calculate the F1-Score, which serves as a comprehensive evaluation metric reflecting the overall performance of a model based on precision and recall rates. The closer the F1-Score is to 1, the better the model’s classification effect. The calculation formula is as follows:

$$F1=2\times\frac{Precision\times Recall}{Precision+Recall}$$

(17)
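Formulas (14)–(17) translate directly into code; a minimal sketch (function names are ours) over the confusion counts TP, TN, FP, FN:

```python
def accuracy(tp, tn, fp, fn):
    """Proportion of all samples classified correctly (formula 14)."""
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    """Proportion of predicted positives that are truly positive (formula 15)."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Proportion of true positives that are recognized (formula 16)."""
    return tp / (tp + fn)

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall (formula 17)."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)
```

For example, with TP = 8, TN = 88, FP = 2, FN = 2, accuracy is 0.96 and precision, recall, and F1 are each 0.8.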

CBAM position and parameters

The purpose of this study was to investigate the impact of parameters and the specific insertion point for integrating the CBAM attention mechanism on the performance of the model. To achieve this, we conducted experiments on Model-1 (EfficientNet-B0 + ECA + 3 × 3DW convolution), which had already undergone prior improvement.

  (1) We tested two values, 8 and 16, for the scaling parameter (r) of the MLP in the channel attention module (CAM) of the CBAM attention mechanism.

  (2) To investigate its optimal placement, we incorporated the CBAM attention mechanism into the middle layers of the MBConv structure through two different schemes.

We propose two schemes for the insertion position of CBAM in the intermediate layers. Option 1 replaced all ECA attention mechanisms with CBAM attention mechanisms in the intermediate layers (layers 3–5) of the MBConv structure in the model backbone. Option 2 replaced only the ECA attention mechanism in the first of these intermediate layers (layers 3–5) of the MBConv structure in the model backbone.
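The channel attention module (CAM) whose reduction ratio r is tuned here can be sketched as follows (a NumPy sketch with random stand-in weights w1, w2 for the learned shared MLP; this is illustrative, not the authors' implementation):

```python
import numpy as np

def cam(x, w1, w2):
    """CBAM channel attention: avg- and max-pooled channel descriptors
    pass through a shared two-layer MLP with reduction ratio
    r = C / w1.shape[0]; their sum is squashed by a sigmoid and
    rescales the channels. x: (C, H, W); w1: (C//r, C); w2: (C, C//r)."""
    avg = x.mean(axis=(1, 2))
    mx = x.max(axis=(1, 2))
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)  # ReLU hidden layer
    gate = 1.0 / (1.0 + np.exp(-(mlp(avg) + mlp(mx))))  # sigmoid, in (0, 1)
    return x * gate[:, None, None]
```

A smaller r widens the hidden layer (C//r units), trading extra parameters for a more expressive channel gate, which is exactly the trade-off the 8-versus-16 experiment probes.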

The experimental results, including recognition accuracy and parameter count, are presented in Table 3. The best results were achieved with a scaling parameter of 8 and the insertion position described in Option 2. Therefore, in this study we set the scaling parameter (r) to 8 and inserted the CBAM attention mechanism at that position within the intermediate layers (layers 3–5) of the MBConv structure during the experimental process.

Table 3 The influence of parameters and positions on experimental results.

Comparison of the effects of different models

The performance of the proposed model is compared with several popular lightweight deep learning models: MobileNetV2 (2018), MobileNetV3 (2019), EfficientNetV2 (2021), GhostNetV2 (2022), and ShuffleNet (2018). To ensure scientific rigor and comparability of the experimental results, all models used the same dataset and were implemented in the PyTorch 2.0 deep learning framework. Each model was trained in three independent experiments of 100 epochs each, and the highest result was taken as the final recognition accuracy. The experimental results of the different models are presented in Table 4.

Table 4 Comparison of experimental results of different models.

The experimental results show that the proposed model achieves an accuracy of 96.04%, surpassing other models proposed in the literature. In comparison to the MobileNetV2 model, our model exhibits a performance improvement of approximately 1.43% while reducing the parameter count by 4%. Similarly, when compared to the GhostNetV2 model, our model achieves a remarkable accuracy increase of 3.35% with a reduction in parameters by 31%. Compared to the MobileNetV3 small model and the ShuffleNet model, our model demonstrates an improvement in accuracy by 2.15% and 3.3%, respectively, despite an increase in the number of parameters by 32% and 47%. Furthermore, when compared to the official enhanced version of the EfficientNetV2 small model, our model exhibits an accuracy improvement of 1.42%, accompanied by a reduction in parameter count by 84%.

To comprehensively compare the models on both accuracy and parameter count, this study standardized the two metrics onto a unified dimensionless scale and computed overall scores as weighted averages. Accuracy was rescaled to the range 0–10, while the reciprocal of the parameter count fell within the range 0–1; the difference in unit between the metrics was set at 0.1 to keep the weighted score consistent. The initial weight ratio of accuracy to parameter count was set at 9:1, meaning accuracy counted for more than parameter count in the overall score. The proportion of accuracy was then progressively reduced while that of parameter count was increased (from 9:1 to 1:9), yielding model scores for nine groups of weight ratios. The specific scoring results are depicted in Fig. 6.
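A hypothetical reconstruction of this scoring sweep (the exact normalization constants are not given in the text, so the mapping below is our assumption):

```python
def score_sweep(acc_pct, params_m):
    """Sweep the accuracy : parameter-count weight ratio from 9:1 to 1:9.
    Accuracy (%) is rescaled to [0, 10]; model size enters as the
    reciprocal of the parameter count in millions (range (0, 1])."""
    acc = acc_pct / 10.0
    size = 1.0 / params_m
    return [((10 - k) * acc + k * size) / 10.0 for k in range(1, 10)]
```

Under this scheme a more accurate model dominates at the accuracy-heavy end of the sweep, while a smaller model can overtake it once parameter count receives most of the weight, which is the qualitative pattern Fig. 6 reports.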

The results clearly demonstrate that our model outperforms the others, achieving the highest scores in the first six of the nine scoring groups. In the seventh group, our model’s score is comparable to those of MobileNetV3 and ShuffleNet, all of which exceed the scores of the remaining three models. In the last two groups, our model’s score is slightly lower than those of MobileNetV3 and ShuffleNet owing to their smaller parameter counts. Overall, our model performs strongly in seven of the nine weighted scoring groups, showcasing its ability to adapt effectively to a wide range of requirements.

Fig. 6

Comparison of scores among models.

Figure 7 shows the variation in recognition accuracy of the six models over 100 epochs. The accuracy of all models increases gradually with the number of iterations. In the first five epochs, every model except EfficientNetV2 improves at a faster rate, producing noticeable changes in its curve. Beyond these initial epochs, the curves become less distinct and begin to fluctuate; after minor oscillations, the growth in accuracy decelerates, and by the 55th epoch the curves start to stabilize. The curve of our improved model surpassed those of the other models at the 15th epoch and from that point consistently remained at a markedly higher level until convergence.

Fig. 7

Model comparison diagram.

To further demonstrate the performance of the proposed model, we conducted a detailed comparison with four state-of-the-art sound recognition methods. All models were trained for 100 epochs, and the best-performing model for each method was selected on the validation set. Given the significant impact of hyperparameters on model performance, we used publicly available implementations with default hyperparameters wherever applicable to ensure a fair comparison. The results are shown in Table 5. They indicate that our model achieved the highest accuracy on the same dataset. Compared to Method 1, our model reduces the parameter count by 36.6% while gaining 4.22% in accuracy. Although our model has more parameters than Methods 2 and 4, its accuracy is higher by 3.25% and 0.26%, respectively. Compared to Method 3, our model has slightly more parameters but achieves the largest accuracy improvement among the five models, at 5.28%.

Table 5 Comparison of experimental results of existing models.

Ablation experiment

To evaluate the effectiveness of the individual improvements and their impact on the model’s performance, ablation experiments were performed using EfficientNet-B0 as the benchmark model. First, the SE attention mechanism in the MBConv structure of the EfficientNet-B0 backbone was replaced with the ECA attention mechanism. Next, the kernel size of the DW convolution in the MBConv structure was set to 3 for the 3rd, 5th, and 6th layers of the backbone. Then, the CBAM attention mechanism replaced the ECA attention mechanism in the first layer of the MBConv structure within the middle layers (layers 3–5) of the backbone. Lastly, building on the modifications above, the Adam optimization algorithm was employed. Three independent experiments, each of 100 epochs, were conducted for every enhancement strategy, and the highest recognition accuracy among the three runs was reported as the final result.
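The ECA substitution tested here can be sketched as follows (a NumPy illustration in which a fixed averaging kernel stands in for ECA's learned 1-D convolution weights; this approximates the module's structure, not its trained behavior):

```python
import numpy as np

def eca(x, k=3):
    """ECA attention: global average pooling gives one descriptor per
    channel; a k-wide 1-D convolution across neighboring channels
    (no dimensionality reduction, unlike SE) produces a sigmoid gate.
    x: (C, H, W)."""
    y = x.mean(axis=(1, 2))
    pad = k // 2
    yp = np.pad(y, pad, mode="edge")
    # fixed averaging weights stand in for the learned 1-D conv kernel
    conv = np.array([yp[i:i + k].mean() for i in range(y.size)])
    gate = 1.0 / (1.0 + np.exp(-conv))  # sigmoid
    return x * gate[:, None, None]
```

Because ECA replaces SE's two fully connected layers with a single k-parameter 1-D convolution, it saves parameters, which is consistent with the 0.64 M reduction reported for Method 2 below.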

Table 6 Ablation experiment.

Table 6 shows the experimental results of the methods described above. Method 2, which incorporates the ECA attention mechanism, achieves a recognition accuracy 0.46% higher and uses 0.64 M fewer parameters than Method 1 with the SE attention mechanism, demonstrating the superior parameter efficiency of ECA. The results of Methods 2 and 3 indicate that setting the DW convolution kernel to 3 × 3 raises recognition accuracy by a notable 2.1% while also reducing the parameter count, suggesting that a 3 × 3 DW convolution accommodates the ECA attention mechanism efficiently, expands the receptive field, and improves the model’s generalization ability, resulting in a significant improvement in recognition accuracy. The comparison of Methods 3 and 4 demonstrates that incorporating the CBAM attention mechanism improves recognition accuracy with minimal parameter increase, substantiating its efficacy. Finally, the results of Methods 4 and 5 (ours) show that adopting the Adam optimization algorithm further raises model accuracy. Hence, the improvement scheme proposed in this study is both effective and feasible.

Fig. 8

Confusion matrix of test set.

To analyze the classification results of the model, this study utilized a confusion matrix to visualize the test outcomes and assess the model’s effectiveness in classifying each bird species. A confusion matrix, comparing the predicted labels with the true labels, was plotted for 10 classifications, as shown in Fig. 8. The figure illustrates the true label values of the birdsong samples along the rows and the predicted label values along the columns. The color intensity represents prediction effectiveness, with darker shades indicating higher recognition accuracy. Correctly predicted results are predominantly located on the diagonal line within the figure. Moreover, this confusion matrix facilitates efficient calculation of precision, recall, and F1 score for each bird species. Table 7 presents detailed results.
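The per-species figures follow mechanically from the confusion matrix; a minimal sketch (rows = true labels, columns = predicted labels, matching the layout described for Fig. 8):

```python
import numpy as np

def per_class_metrics(cm):
    """Per-class precision, recall, and F1 from a confusion matrix
    whose rows are true labels and columns are predicted labels."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    prec = tp / cm.sum(axis=0)   # column sum: all samples predicted as the class
    rec = tp / cm.sum(axis=1)    # row sum: all samples truly of the class
    f1 = 2 * prec * rec / (prec + rec)
    return prec, rec, f1
```

Each diagonal entry supplies a class's true positives, while the rest of its column and row supply false positives and false negatives, respectively.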

The confusion matrix presented in Fig. 8 reveals that Tringa totanus and Himantopus himantopus exhibited a higher frequency of misclassified predictions and lower F1-scores than the other species, with the two birds frequently mistaken for each other during identification. This can be attributed to the two closely related wading birds possessing similar, short songs, producing comparable acoustic characteristics and thereby leading to misclassification.

Table 7 Analysis results of confusion matrix.

The results in Table 7 show a precision, recall, and F1-score of 1 for Passer, indicating accurate prediction of all 590 test samples. Anas platyrhynchos and Phasianus colchicus both exhibited a recall of 0.997, with one misclassified sample each: an Anas platyrhynchos was classified as Phasianus colchicus, and a Phasianus colchicus as Vanellus vanellus. Ardea cinerea and Cygnus cygnus both achieved a recall of 1.000, with all true samples predicted successfully, and a precision of 0.999; the confusion matrix reveals that a Phalacrocorax carbo was incorrectly classified as Cygnus cygnus, and a Tringa glareola as Ardea cinerea. The precision of 1.000 for Phalacrocorax carbo indicates that every sample predicted as that species was correct, with a recall of 0.995; the confusion matrix shows three errors among the Phalacrocorax carbo samples, which were classified as Cygnus cygnus, Vanellus vanellus, and Tringa totanus. By contrast, Himantopus himantopus and Tringa totanus, two closely related waders, scored lower than the other eight species, with approximately 15% of the samples of each species misclassified as the other.

In summary, through targeted adjustments to the model’s architecture and the integration of more effective attention mechanisms, we have enhanced its discriminative ability while also achieving a reduction in model complexity.

Model performance experiment

We performed a t-test on the baseline model and our model to establish the statistical significance of the experimental comparison. The detailed results are presented in Table 8.

Table 8 Analysis results of t-test.

As shown in Table 8, we obtained a t-statistic of 45.67 with a p-value < 0.0001, indicating that the accuracy difference between the improved model and the baseline model is highly statistically significant. Additionally, the mean difference is 3.06%, with a confidence interval of [2.91%, 3.21%], further confirming the reliability of the improvement.
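For reference, a two-sample t-statistic over per-run accuracies can be computed as below (the paper does not state which t-test variant was used, so the unequal-variance Welch form is our assumption):

```python
from statistics import mean, stdev

def welch_t(a, b):
    """Welch's t-statistic for two independent samples with
    (possibly) unequal variances."""
    var_a, var_b = stdev(a) ** 2, stdev(b) ** 2
    return (mean(a) - mean(b)) / (var_a / len(a) + var_b / len(b)) ** 0.5
```

A large positive t with a tiny p-value, as in Table 8, means the gap between the two accuracy distributions dwarfs their run-to-run variability.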

We also employed 5-fold cross-validation to evaluate the generalization ability of the model and eliminate potential biases. The detailed results are presented in Table 9. From these results we calculated the mean and standard deviation of accuracy, obtaining 96.04% (± 0.18% standard deviation). Accuracy fluctuates between 95.87% and 96.33% across folds, demonstrating the model’s strong robustness to varying data partitions, and the 0.18% standard deviation further confirms that the model effectively mitigates overfitting.
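The fold construction and the mean ± standard deviation summary can be sketched as follows (helper names and the seed are ours):

```python
import random
from statistics import mean, stdev

def kfold_indices(n, k=5, seed=0):
    """Shuffle sample indices and deal them into k disjoint folds;
    each sample serves as validation data in exactly one fold."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def summarize(fold_accs):
    """Mean and sample standard deviation of per-fold accuracies."""
    return mean(fold_accs), stdev(fold_accs)
```

Training k times, each with a different fold held out, and summarizing the k accuracies gives the mean ± std figure reported above.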

Table 9 Analysis results of five-fold cross-validation.

Comparison of convergence speeds

To enhance the convergence speed and accuracy of the model, we replaced the original SGD optimization algorithm with the Adam optimization algorithm. We conducted three independent experiments of 100 epochs each under the different strategies and, from the results, calculated each model’s average training time and average recognition accuracy.
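In PyTorch the swap is a one-line change, `torch.optim.Adam(model.parameters(), lr=0.001)` in place of `torch.optim.SGD`. The update rule behind Adam's faster convergence can be sketched in plain Python (a bare-bones scalar version for illustration, not the library implementation):

```python
def adam_minimize(grad, x0, lr=0.001, steps=1000,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimal scalar Adam: exponentially averaged first and second
    gradient moments, with bias correction, drive each update."""
    x, m, v = float(x0), 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g          # first moment estimate
        v = beta2 * v + (1 - beta2) * g * g      # second moment estimate
        m_hat = m / (1 - beta1 ** t)             # bias-corrected moments
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (v_hat ** 0.5 + eps)   # normalized step
    return x
```

The per-parameter step normalization by the second-moment estimate is what lets Adam make steady progress without the careful learning-rate tuning SGD requires, consistent with the shorter training times reported in Table 10.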

Table 10 Convergence speed comparison.

Table 10 displays the specific experimental outcomes. The results indicate that, across the tested learning rates, the Adam optimizer significantly reduces training time compared to the SGD optimizer. With a learning rate of 0.001, Adam achieves the best training performance, with an accuracy of 96.04% and a training duration of 809 s. Compared to SGD at the three learning-rate settings, training time is reduced by 28%, 33%, and 37%, while accuracy improves by 6.72%, 0.27%, and 0.22%, respectively. Figure 9 presents the error curves for the Adam (lr = 0.001) and SGD (lr = 0.01) optimizers. As the number of epochs increases, the error curve for Adam converges more rapidly than that of SGD, and Adam’s error is smaller once the curve levels off. These results indicate that the Adam optimization algorithm enhances both convergence speed and recognition accuracy.

Fig. 9

Error curves of the Adam and SGD optimizers.