Multi-modal Language models in bioacoustics with zero-shot transfer: a case study


  • Hautier, Y. et al. Anthropogenic environmental changes affect ecosystem stability via biodiversity. Science 348 (6232), 336–340 (2015).


  • Barlow, J. et al. Anthropogenic disturbance in tropical forests can double biodiversity loss from deforestation. Nature 535 (7610), 144–147 (2016).


  • Ripple, W. J. et al. Conserving the world’s megafauna and biodiversity: the fierce urgency of now. Bioscience 67 (3), 197–200 (2017).



  • Dirzo, R. et al. Defaunation in the Anthropocene. Science 345 (6195), 401–406 (2014).

  • O’Connell, A. F., Nichols, J. D. & Karanth, K. U. Camera Traps in Animal Ecology: Methods and Analyses (Springer Science & Business Media, 2010).


  • Burton, A. C. et al. Wildlife camera trapping: a review and recommendations for linking surveys to ecological processes. J. Appl. Ecol. 52 (3), 675–685 (2015).


  • Kitzes, J. & Schricker, L. The necessity, promise and challenge of automated biodiversity surveys. Environ. Conserv. 46 (4), 247–250 (2019).


  • Wang, D., Shao, Q. & Yue, H. Surveying wild animals from satellites, manned aircraft and unmanned aerial systems (UASs): a review. Remote Sens. 11 (11), 1308 (2019).


  • Kays, R., McShea, W. J. & Wikelski, M. Born-digital biodiversity data: millions and billions. Divers. Distrib. 26 (5), 644–648 (2020).


  • Keitt, T. H. & Abelson, E. S. Ecology in the age of automation. Science 373 (6557), 858–859 (2021).


  • Tuia, D. et al. Perspectives in machine learning for wildlife conservation. Nat. Commun. 13 (1), 792 (2022). https://doi.org/10.1038/s41467-022-27980-y

  • Laiolo, P. The emerging significance of bioacoustics in animal species conservation. Biol. Conserv. 143 (7), 1635–1645 (2010).


  • Marques, T. A. et al. Estimating animal population density using passive acoustics. Biol. Rev. 88 (2), 287–309 (2013).


  • Sugai, L. S. M., Silva, T. S. F., Ribeiro, J. W. Jr & Llusia, D. Terrestrial passive acoustic monitoring: review and perspectives. BioScience 69 (1), 15–25 (2019).


  • Dale, S. S. et al. Distinguishing sex of northern spotted owls with passive acoustic monitoring. J. Raptor Res. 56 (3), 287–299 (2022).


  • Roe, P. et al. The Australian acoustic observatory. Methods Ecol. Evol. 12 (10), 1802–1808 (2021).


  • Potamitis, I., Ntalampiras, S., Jahn, O. & Riede, K. Automatic bird sound detection in long real-field recordings: applications and tools. Appl. Acoust. 80, 1–9 (2014).


  • Stowell, D., Wood, M., Stylianou, Y. & Glotin, H. Bird detection in audio: a survey and a challenge, in IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP). IEEE, 1–6, (2016).

  • Stowell, D., Wood, M. D., Pamuła, H., Stylianou, Y. & Glotin, H. Automatic acoustic detection of birds through deep learning: the first bird audio detection challenge. Methods Ecol. Evol. 10 (3), 368–380 (2019).


  • Zhong, M. et al. Detecting, classifying, and counting blue whale calls with siamese neural networks. J. Acoust. Soc. Am. 149 (5), 3086–3094 (2021).


  • Zhong, M. et al. Acoustic detection of regionally rare bird species through deep convolutional neural networks. Ecol. Inf. 64, 101333 (2021).


  • Gupta, G., Kshirsagar, M., Zhong, M., Gholami, S. & Ferres, J. L. Comparing recurrent convolutional neural networks for large scale bird species classification. Sci. Rep. 11 (1), 17085 (2021).


  • Kahl, S., Wood, C. M., Eibl, M. & Klinck, H. Birdnet: a deep learning solution for avian diversity monitoring. Ecol. Inf. 61, 101236 (2021).


  • Stowell, D. Computational bioacoustics with deep learning: a review and roadmap. PeerJ 10, e13152 (2022).


  • Wäldchen, J. & Mäder, P. Machine learning for image based species identification. Methods Ecol. Evol. 9 (11), 2216–2225 (2018).

  • LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 86 (11), 2278–2324 (1998).

  • Vaswani, A. et al. Attention is all you need. Adv. Neural. Inf. Process. Syst., 30 (2017).

  • Guo, M. H. et al. Attention mechanisms in computer vision: a survey. Comput. Visual Media. 8 (3), 331–368 (2022).


  • Politis, A., Mesaros, A., Adavanne, S., Heittola, T. & Virtanen, T. Overview and evaluation of sound event localization and detection in DCASE 2019. IEEE/ACM Trans. Audio Speech Lang. Process. 29, 684–698 (2020).

  • Elizalde, B. M. Never-ending learning of sounds, Ph.D. dissertation, Carnegie Mellon University Pittsburgh, PA, (2020).

  • Heller, L. M., Elizalde, B., Raj, B. & Deshmukh, S. Synergy between human and machine approaches to sound/scene recognition and processing: An overview of ICASSP special session, arXiv preprint arXiv:2302.09719, (2023).

  • Norouzzadeh, M. S. et al. Automatically identifying, counting, and describing wild animals in camera-trap images with deep learning. Proc. Natl. Acad. Sci. 115 (25), E5716–E5725 (2018). https://www.pnas.org/content/115/25/E5716

  • Miao, Z. et al. Iterative human and automated identification of wildlife images. Nat. Mach. Intell. 3 (10), 885–895 (2021).


  • Miao, Z. et al. Challenges and solutions for automated avian recognition in aerial imagery. Remote Sens. Ecol. Conserv. 9(4), 439–453 (2023).

  • Hong, S. J., Han, Y., Kim, S. Y., Lee, A. Y. & Kim, G. Application of deep-learning methods to bird detection using unmanned aerial vehicle imagery. Sensors 19 (7), 1651 (2019).


  • Weinstein, B. G. et al. A general deep learning model for bird detection in high resolution airborne imagery. bioRxiv, (2021).

  • Pijanowski, B. C. et al. Soundscape ecology: the science of sound in the landscape. BioScience 61 (3), 203–216 (2011).

  • Farina, A. Soundscape Ecology (Springer Netherlands, 2014). https://doi.org/10.1007/978-94-007-7374-5

  • Radford, A. et al. Learning transferable visual models from natural language supervision, in International conference on machine learning. PMLR, 8748–8763 (2021).

  • Alayrac, J. B. et al. Flamingo: a visual language model for few-shot learning. Adv. Neural. Inf. Process. Syst. 35, 23716–23736 (2022).

  • Huang, S. et al. Language is not all you need: Aligning perception with language models, arXiv preprint arXiv:2302.14045, (2023).

  • Li, B. et al. Otter: A multi-modal model with in-context instruction tuning, arXiv preprint arXiv:2305.03726, (2023).

  • Liu, H., Li, C., Wu, Q. & Lee, Y. J. Visual instruction tuning, (2023).

  • OpenAI. GPT-4 technical report, (2023).

  • Arjovsky, M., Bottou, L., Gulrajani, I. & Lopez-Paz, D. Invariant risk minimization, arXiv preprint arXiv:1907.02893, (2019).

  • Wu, Z., Xiong, Y., Yu, S. X. & Lin, D. Unsupervised feature learning via non-parametric instance discrimination, in Proceedings of the IEEE conference on computer vision and pattern recognition, 3733–3742. (2018).

  • Gui, J. et al. A survey of self-supervised learning from multiple perspectives: Algorithms, theory, applications and future trends, arXiv preprint arXiv:2301.05712, (2023).

  • Wu, S., Fei, H., Qu, L., Ji, W. & Chua, T. S. Next-gpt: Any-to-any multimodal llm, (2023).

  • Sun, Q. et al. Generative pretraining in multimodality, arXiv preprint arXiv:2307.05222, (2023).

  • Elizalde, B., Deshmukh, S., Ismail, M. A. & Wang, H. Clap: Learning audio concepts from natural language supervision, arXiv preprint arXiv:2206.04769, (2022).

  • Hagiwara, M. et al. Beans: The benchmark of animal sounds, in ICASSP –2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 1–5. (2023).

  • Stowell, D. & Plumbley, M. D. An open dataset for research on audio field recording archives: freefield1010, arXiv preprint arXiv:1309.5275, (2013).

  • Lv, F., Chen, X., Huang, Y., Duan, L. & Lin, G. Progressive modality reinforcement for human multimodal emotion recognition from unaligned multimodal sequences, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2554–2562 (2021).

  • Li, J. et al. Align before fuse: vision and language representation learning with momentum distillation. Adv. Neural. Inf. Process. Syst. 34, 9694–9705 (2021).



  • Stafylakis, T. & Tzimiropoulos, G. Combining residual networks with lstms for lipreading, arXiv preprint arXiv:1703.04105, (2017).

  • Deng, J. et al. Imagenet: A large-scale hierarchical image database, http://www.image-net.org (2009).

  • Jia, C. et al. Scaling up visual and vision-language representation learning with noisy text supervision, in International Conference on Machine Learning. PMLR, 4904–4916 (2021).

  • Bommasani, R. et al. On the opportunities and risks of foundation models, arXiv preprint arXiv:2108.07258, (2021).

  • Chen, K. et al. Hts-at: A hierarchical token-semantic audio transformer for sound classification and detection, in ICASSP –2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 646–650 (2022).

  • Kong, Q. et al. Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 28, 2880–2894 (2020).

  • Fonseca, E., Favory, X., Pons, J., Font, F. & Serra, X. Fsd50k: an open dataset of human-labeled sound events. IEEE/ACM Trans. Audio Speech Lang. Process., (2022).

  • Drossos, K., Lipping, S. & Virtanen, T. Clotho: an audio captioning dataset, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (2020).

  • Kim, C. D., Kim, B., Lee, H. & Kim, G. AudioCaps: Generating Captions for Audios in The Wild, in NAACL-HLT, (2019).

  • Martín-Morató, I. & Mesaros, A. What is the ground truth? Reliability of multi-annotator data for audio tagging, in 2021 29th European Signal Processing Conference (EUSIPCO), (2021).

  • Koepke, A. S., Oncescu, A. M., Henriques, J., Akata, Z. & Albanie, S. Audio retrieval with natural language queries: a benchmark study. IEEE Trans. Multimedia, (2022).

  • Deshmukh, S., Elizalde, B. & Wang, H. Audio retrieval with wavtext5k and clap training, arXiv preprint arXiv:2209.14275, (2022).

  • Defferrard, M., Benzi, K., Vandergheynst, P. & Bresson, X. Fma: A dataset for music analysis, arXiv preprint arXiv:1612.01840, (2016).

  • Engel, J. et al. Neural audio synthesis of musical notes with wavenet autoencoders, (2017).

  • Zadeh, A. B., Liang, P. P., Poria, S., Cambria, E. & Morency, L. P. Multimodal language analysis in the wild: Cmu-mosei dataset and interpretable dynamic fusion graph, in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2236–2246 (2018).

  • Poria, S. et al. Meld: A multimodal multi-party dataset for emotion recognition in conversations, arXiv preprint arXiv:1810.02508, (2018).

  • Busso, C. et al. Iemocap: interactive emotional dyadic motion capture database. Lang. Resour. Evaluation. 42 (4), 335–359 (2008).

  • Lotfian, R. & Busso, C. Building naturalistic emotionally balanced speech corpus by retrieving emotional speech from existing podcast recordings. IEEE Trans. Affect. Comput. 10 (4), 471–483 (2017).


  • Jeong, I. Y. & Park, J. Cochlscene: Acquisition of acoustic scene data using crowdsourcing, in 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE, 17–21 (2022).

  • Gemmeke, J. F. et al. Audio set: An ontology and human-labeled dataset for audio events, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 776–780 (2017).

  • Kay, W. et al. The kinetics human action video dataset, arXiv preprint arXiv:1705.06950, (2017).

  • Akkermans, V. et al. Freesound 2: An improved platform for sharing audio clips, in Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR 2011), October 24–28, Miami, Florida, USA (2011).

  • Hanish, M. Pro sound effects’ hybrid sound effects library. TV Technol., (2015).

  • He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition, in Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778 (2016).

  • Szegedy, C. et al. Going deeper with convolutions, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1–9. https://ieeexplore.ieee.org/document/7298594 (2015).

  • Hestness, J. et al. Deep learning scaling is predictable, empirically, arXiv preprint arXiv:1712.00409, (2017).

  • Morfi, V. et al. Few-shot bioacoustics event detection: A new task at the dcase 2021 challenge. in DCASE. 145–149 (2021).

  • Chronister, L. M., Rhinehart, T. A., Place, A. & Kitzes, J. An annotated set of audio recordings of eastern north American birds containing frequency, time, and species information. Ecology, e03329 (2021).

  • LeBien, J. et al. A pipeline for identification of bird and frog species in tropical soundscape recordings using a convolutional neural network. Ecol. Inf. 59, 101113 (2020).


  • Katsis, L. K. et al. Automated detection of gunshots in tropical forests using convolutional neural networks. Ecol. Ind. 141, 109128 (2022).


  • Zhou, K., Yang, J., Loy, C. C. & Liu, Z. Conditional prompt learning for vision-language models, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16816–16825 (2022).

  • Zhou, K., Yang, J., Loy, C. C. & Liu, Z. Learning to prompt for vision-language models. Int. J. Comput. Vision. 130 (9), 2337–2348 (2022).


  • Lin, T. H. & Tsao, Y. Source separation in ecoacoustics: a roadmap towards versatile soundscape information retrieval. Remote Sens. Ecol. Conserv. 6 (3), 236–247 (2020).


  • Liu, Y. et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, (2023).

  • Shen, S. et al. K-lite: Learning transferable visual models with external knowledge. Adv. Neural. Inf. Process. Syst. 35, 15558–15573 (2022).

  • Berrios, W., Mittal, G., Thrush, T., Kiela, D. & Singh, A. Towards language models that can see: Computer vision through the lens of natural language, arXiv preprint arXiv:2306.16410, (2023).

  • Borsos, Z. et al. Audiolm: a language modeling approach to audio generation. IEEE/ACM Trans. Audio Speech Lang. Process., (2023).

  • Menon, S. & Vondrick, C. Visual classification via description from large language models, in The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=jlAjNL8z5cs (2023).


