Hautier, Y. et al. Anthropogenic environmental changes affect ecosystem stability via biodiversity. Science 348 (6232), 336–340 (2015).
Barlow, J. et al. Anthropogenic disturbance in tropical forests can double biodiversity loss from deforestation. Nature 535 (7610), 144–147 (2016).
Ripple, W. J. et al. Conserving the world’s megafauna and biodiversity: the fierce urgency of now. Bioscience 67 (3), 197–200 (2017).
Dirzo, R. et al. Defaunation in the Anthropocene. Science 345 (6195), 401–406 (2014).
O’Connell, A. F., Nichols, J. D. & Karanth, K. U. Camera Traps in Animal Ecology: Methods and Analyses (Springer Science & Business Media, 2010).
Burton, A. C. et al. Wildlife camera trapping: a review and recommendations for linking surveys to ecological processes. J. Appl. Ecol. 52 (3), 675–685 (2015).
Kitzes, J. & Schricker, L. The necessity, promise and challenge of automated biodiversity surveys. Environ. Conserv. 46 (4), 247–250 (2019).
Wang, D., Shao, Q. & Yue, H. Surveying wild animals from satellites, manned aircraft and unmanned aerial systems (UASs): a review. Remote Sens. 11 (11), 1308 (2019).
Kays, R., McShea, W. J. & Wikelski, M. Born-digital biodiversity data: millions and billions. Divers. Distrib. 26 (5), 644–648 (2020).
Keitt, T. H. & Abelson, E. S. Ecology in the age of automation. Science 373 (6557), 858–859 (2021).
Tuia, D. et al. Perspectives in machine learning for wildlife conservation. Nat. Commun. 13 (1), 792. https://doi.org/10.1038/s41467-022-27980-y (2022).
Laiolo, P. The emerging significance of bioacoustics in animal species conservation. Biol. Conserv. 143 (7), 1635–1645 (2010).
Marques, T. A. et al. Estimating animal population density using passive acoustics. Biol. Rev. 88 (2), 287–309 (2013).
Sugai, L. S. M., Silva, T. S. F., Ribeiro, J. W. Jr & Llusia, D. Terrestrial passive acoustic monitoring: review and perspectives. BioScience 69 (1), 15–25 (2019).
Dale, S. S. et al. Distinguishing sex of northern spotted owls with passive acoustic monitoring. J. Raptor Res. 56 (3), 287–299 (2022).
Roe, P. et al. The Australian acoustic observatory. Methods Ecol. Evol. 12 (10), 1802–1808 (2021).
Potamitis, I., Ntalampiras, S., Jahn, O. & Riede, K. Automatic bird sound detection in long real-field recordings: applications and tools. Appl. Acoust. 80, 1–9 (2014).
Stowell, D., Wood, M., Stylianou, Y. & Glotin, H. Bird detection in audio: a survey and a challenge, in IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP). IEEE, 1–6 (2016).
Stowell, D., Wood, M. D., Pamuła, H., Stylianou, Y. & Glotin, H. Automatic acoustic detection of birds through deep learning: the first bird audio detection challenge. Methods Ecol. Evol. 10 (3), 368–380 (2019).
Zhong, M. et al. Detecting, classifying, and counting blue whale calls with siamese neural networks. J. Acoust. Soc. Am. 149 (5), 3086–3094 (2021).
Zhong, M. et al. Acoustic detection of regionally rare bird species through deep convolutional neural networks. Ecol. Inf. 64, 101333 (2021).
Gupta, G., Kshirsagar, M., Zhong, M., Gholami, S. & Ferres, J. L. Comparing recurrent convolutional neural networks for large scale bird species classification. Sci. Rep. 11 (1), 17085 (2021).
Kahl, S., Wood, C. M., Eibl, M. & Klinck, H. BirdNET: a deep learning solution for avian diversity monitoring. Ecol. Inf. 61, 101236 (2021).
Stowell, D. Computational bioacoustics with deep learning: a review and roadmap. PeerJ 10, e13152 (2022).
Wäldchen, J. & Mäder, P. Machine learning for image-based species identification. Methods Ecol. Evol. 9 (11), 2216–2225 (2018).
LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 86 (11), 2278–2324 (1998).
Vaswani, A. et al. Attention is all you need. Adv. Neural. Inf. Process. Syst. 30 (2017).
Guo, M. H. et al. Attention mechanisms in computer vision: a survey. Comput. Visual Media. 8 (3), 331–368 (2022).
Politis, A., Mesaros, A., Adavanne, S., Heittola, T. & Virtanen, T. Overview and evaluation of sound event localization and detection in DCASE 2019. IEEE/ACM Trans. Audio Speech Lang. Process. 29, 684–698 (2020).
Elizalde, B. M. Never-ending learning of sounds. Ph.D. dissertation, Carnegie Mellon University, Pittsburgh, PA (2020).
Heller, L. M., Elizalde, B., Raj, B. & Deshmukh, S. Synergy between human and machine approaches to sound/scene recognition and processing: an overview of ICASSP special session. arXiv preprint arXiv:2302.09719 (2023).
Norouzzadeh, M. S. et al. Automatically identifying, counting, and describing wild animals in camera-trap images with deep learning. Proc. Natl. Acad. Sci. USA 115 (25), E5716–E5725. https://www.pnas.org/content/115/25/E5716 (2018).
Miao, Z. et al. Iterative human and automated identification of wildlife images. Nat. Mach. Intell. 3 (10), 885–895 (2021).
Miao, Z. et al. Challenges and solutions for automated avian recognition in aerial imagery. Remote Sens. Ecol. Conserv. 9 (4), 439–453 (2023).
Hong, S. J., Han, Y., Kim, S. Y., Lee, A. Y. & Kim, G. Application of deep-learning methods to bird detection using unmanned aerial vehicle imagery. Sensors 19 (7), 1651 (2019).
Weinstein, B. G. et al. A general deep learning model for bird detection in high resolution airborne imagery. bioRxiv (2021).
Pijanowski, B. C. et al. Soundscape ecology: the science of sound in the landscape. BioScience 61 (3), 203–216 (2011).
Farina, A. Soundscape Ecology (Springer Netherlands, 2014). https://doi.org/10.1007/978-94-007-7374-5
Radford, A. et al. Learning transferable visual models from natural language supervision, in International Conference on Machine Learning. PMLR, 8748–8763 (2021).
Alayrac, J. B. et al. Flamingo: a visual language model for few-shot learning. Adv. Neural. Inf. Process. Syst. 35, 23716–23736 (2022).
Huang, S. et al. Language is not all you need: aligning perception with language models. arXiv preprint arXiv:2302.14045 (2023).
Li, B. et al. Otter: a multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726 (2023).
Liu, H., Li, C., Wu, Q. & Lee, Y. J. Visual instruction tuning (2023).
OpenAI. GPT-4 technical report (2023).
Arjovsky, M., Bottou, L., Gulrajani, I. & Lopez-Paz, D. Invariant risk minimization. arXiv preprint arXiv:1907.02893 (2019).
Wu, Z., Xiong, Y., Yu, S. X. & Lin, D. Unsupervised feature learning via non-parametric instance discrimination, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3733–3742 (2018).
Gui, J. et al. A survey of self-supervised learning from multiple perspectives: algorithms, theory, applications and future trends. arXiv preprint arXiv:2301.05712 (2023).
Wu, S., Fei, H., Qu, L., Ji, W. & Chua, T. S. NExT-GPT: any-to-any multimodal LLM (2023).
Sun, Q. et al. Generative pretraining in multimodality. arXiv preprint arXiv:2307.05222 (2023).
Elizalde, B., Deshmukh, S., Ismail, M. A. & Wang, H. CLAP: learning audio concepts from natural language supervision. arXiv preprint arXiv:2206.04769 (2022).
Hagiwara, M. et al. BEANS: the benchmark of animal sounds, in ICASSP 2023 – IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 1–5 (2023).
Stowell, D. & Plumbley, M. D. An open dataset for research on audio field recording archives: freefield1010. arXiv preprint arXiv:1309.5275 (2013).
Lv, F., Chen, X., Huang, Y., Duan, L. & Lin, G. Progressive modality reinforcement for human multimodal emotion recognition from unaligned multimodal sequences, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2554–2562 (2021).
Li, J. et al. Align before fuse: vision and language representation learning with momentum distillation. Adv. Neural. Inf. Process. Syst. 34, 9694–9705 (2021).
Stafylakis, T. & Tzimiropoulos, G. Combining residual networks with LSTMs for lipreading. arXiv preprint arXiv:1703.04105 (2017).
Deng, J. et al. ImageNet: a large-scale hierarchical image database. http://www.image-net.org (2009).
Jia, C. et al. Scaling up visual and vision-language representation learning with noisy text supervision, in International Conference on Machine Learning. PMLR, 4904–4916 (2021).
Bommasani, R. et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021).
Chen, K. et al. HTS-AT: a hierarchical token-semantic audio transformer for sound classification and detection, in ICASSP 2022 – IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 646–650 (2022).
Kong, Q. et al. PANNs: large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 28, 2880–2894 (2020).
Fonseca, E., Favory, X., Pons, J., Font, F. & Serra, X. FSD50K: an open dataset of human-labeled sound events. IEEE/ACM Trans. Audio Speech Lang. Process. (2022).
Drossos, K., Lipping, S. & Virtanen, T. Clotho: an audio captioning dataset, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2020).
Kim, C. D., Kim, B., Lee, H. & Kim, G. AudioCaps: generating captions for audios in the wild, in NAACL-HLT (2019).
Martín-Morató, I. & Mesaros, A. What is the ground truth? Reliability of multi-annotator data for audio tagging, in 2021 29th European Signal Processing Conference (EUSIPCO) (2021).
Koepke, A. S., Oncescu, A. M., Henriques, J., Akata, Z. & Albanie, S. Audio retrieval with natural language queries: a benchmark study. IEEE Trans. Multimedia (2022).
Deshmukh, S., Elizalde, B. & Wang, H. Audio retrieval with WavText5K and CLAP training. arXiv preprint arXiv:2209.14275 (2022).
Defferrard, M., Benzi, K., Vandergheynst, P. & Bresson, X. FMA: a dataset for music analysis. arXiv preprint arXiv:1612.01840 (2016).
Engel, J. et al. Neural audio synthesis of musical notes with WaveNet autoencoders (2017).
Zadeh, A. B., Liang, P. P., Poria, S., Cambria, E. & Morency, L. P. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph, in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2236–2246 (2018).
Poria, S. et al. MELD: a multimodal multi-party dataset for emotion recognition in conversations. arXiv preprint arXiv:1810.02508 (2018).
Busso, C. et al. IEMOCAP: interactive emotional dyadic motion capture database. Lang. Resour. Eval. 42 (4), 335–359 (2008).
Lotfian, R. & Busso, C. Building naturalistic emotionally balanced speech corpus by retrieving emotional speech from existing podcast recordings. IEEE Trans. Affect. Comput. 10 (4), 471–483 (2017).
Jeong, I. Y. & Park, J. CochlScene: acquisition of acoustic scene data using crowdsourcing, in 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE, 17–21 (2022).
Gemmeke, J. F. et al. Audio Set: an ontology and human-labeled dataset for audio events, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 776–780 (2017).
Kay, W. et al. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017).
Akkermans, V. et al. Freesound 2: an improved platform for sharing audio clips, in Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR 2011), October 24–28, Miami, Florida, USA. University of Miami (2011).
Hanish, M. Pro Sound Effects' hybrid sound effects library. TV Technol. (2015).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778 (2016).
Szegedy, C. et al. Going deeper with convolutions, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1–9. https://ieeexplore.ieee.org/document/7298594 (2015).
Hestness, J. et al. Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409 (2017).
Morfi, V. et al. Few-shot bioacoustic event detection: a new task at the DCASE 2021 challenge, in DCASE, 145–149 (2021).
Chronister, L. M., Rhinehart, T. A., Place, A. & Kitzes, J. An annotated set of audio recordings of eastern North American birds containing frequency, time, and species information. Ecology e03329 (2021).
LeBien, J. et al. A pipeline for identification of bird and frog species in tropical soundscape recordings using a convolutional neural network. Ecol. Inf. 59, 101113 (2020).
Katsis, L. K. et al. Automated detection of gunshots in tropical forests using convolutional neural networks. Ecol. Ind. 141, 109128 (2022).
Zhou, K., Yang, J., Loy, C. C. & Liu, Z. Conditional prompt learning for vision-language models, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16816–16825 (2022).
Zhou, K., Yang, J., Loy, C. C. & Liu, Z. Learning to prompt for vision-language models. Int. J. Comput. Vision. 130 (9), 2337–2348 (2022).
Lin, T. H. & Tsao, Y. Source separation in ecoacoustics: a roadmap towards versatile soundscape information retrieval. Remote Sens. Ecol. Conserv. 6 (3), 236–247 (2020).
Liu, Y. et al. MMBench: is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281 (2023).
Shen, S. et al. K-LITE: learning transferable visual models with external knowledge. Adv. Neural. Inf. Process. Syst. 35, 15558–15573 (2022).
Berrios, W., Mittal, G., Thrush, T., Kiela, D. & Singh, A. Towards language models that can see: computer vision through the lens of natural language. arXiv preprint arXiv:2306.16410 (2023).
Borsos, Z. et al. AudioLM: a language modeling approach to audio generation. IEEE/ACM Trans. Audio Speech Lang. Process. (2023).
Menon, S. & Vondrick, C. Visual classification via description from large language models, in The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=jlAjNL8z5cs (2023).