Introduction
Scientists use automated systems to study large ecosystems. In the case of forest and jungle areas, autonomous recording units (ARUs) are used to record audio which can be used to help identify different species of animals and insects. This information can be used to develop a better understanding of the distribution of species within a given environment. In the case of birds, Google Research notes in their article Separating Birdsong in the Wild for Classification that “ecologists use birds to understand food systems and forest health — for example, if there are more woodpeckers in a forest, that means there’s a lot of dead wood.” Further, they note the value of audio-based identification: “[Since] birds communicate and mark territory with songs and calls, it’s most efficient to identify them by ear. In fact, experts may identify up to 10x as many birds by ear as by sight.”
Recently, the BirdCLEF+ 2025 competition launched on Kaggle under the umbrella of the ImageCLEF organization. ImageCLEF supports investigation into cross-language annotation and retrieval of images across a variety of domains. The goal of the competition is straightforward: design a classification model that can accurately predict the species of bird from an audio recording.
At first, the task seems trivial given the availability of the Google Bird Vocalization (GBV) Classifier, also known as Perch. The GBV classifier is trained on nearly 11,000 bird species and hence is an obvious choice as the classification model.
However, the competition includes bird species that lie outside of the GBV classifier training set. As a result, the GBV classifier only achieves ~60% accuracy on the BirdCLEF+ 2025 competition test dataset, and a custom model must be developed.
This guide details an approach to build your own bird vocalization classifier that can be used in conjunction with the GBV classifier to classify a wider selection of bird species. The approach employs the same basic techniques described in the Google Research article mentioned above. The design leverages the BirdCLEF+ 2025 competition dataset for training.
Training Data
The BirdCLEF+ 2025 training dataset, inclusive of supporting files, is approximately 12 GB. The main directories and files comprising the dataset structure are:
birdclef_2025
|__ train_audio
|__ train_soundscapes
|__ test_soundscapes
|__ recording_location.txt
|__ taxonomy.csv
|__ train.csv
train_audio
The train_audio directory is the largest component of the dataset, containing 28,564 training audio recordings in the .ogg audio format. Audio recordings are grouped in sub-directories which each represent a particular bird species, e.g.:
train_audio
|__amakin1
|__ [AUDIO FILES]
|__amekes
|__ [AUDIO FILES]
...
The taxonomy.csv file can be used to look up the actual scientific and common names of the bird species represented by the sub-directory names, e.g.:
| SUB-DIRECTORY NAME | SCIENTIFIC NAME | COMMON NAME |
| --- | --- | --- |
| amakin1 | Chloroceryle amazona | Amazon Kingfisher |
| amekes | Falco sparverius | American Kestrel |
| ... | ... | ... |
The competition dataset involves 206 unique bird species, i.e. 206 classes. As suggested in the Introduction, 63 of these classes are not covered by the GBV Classifier. These Non-GBV classes are generally labeled using a numeric class identifier:
1139490, 1192948, 1194042, 126247, 1346504, 134933, 135045, 1462711, 1462737, 1564122, 21038, 21116, 21211, 22333, 22973, 22976, 24272, 24292, 24322, 41663, 41778, 41970, 42007, 42087, 42113, 46010, 47067, 476537, 476538, 48124, 50186, 517119, 523060, 528041, 52884, 548639, 555086, 555142, 566513, 64862, 65336, 65344, 65349, 65373, 65419, 65448, 65547, 65962, 66016, 66531, 66578, 66893, 67082, 67252, 714022, 715170, 787625, 81930, 868458, 963335, grasal4, verfly, y00678
Some of the Non-GBV classes are characterized by:
- Limited training data. Class 1139490, for example, only contains 2 audio recordings. By contrast, class amakin1, which is a “known” GBV class, contains 89 recordings.
- Poor recording quality. Highlighting class 1139490 again, both training recordings are of poor quality, with one being particularly difficult to discern.
These 2 conditions lead to a significant imbalance among classes in terms of quantity of available audio and audio quality.
Many of the training audio recordings across both GBV and Non-GBV classes also include human speech, with the speaker annotating the recording with details such as the species of bird that was recorded and location of the recording. In most – but not all – cases, the annotations follow the recorded bird vocalizations.
Tactics used to tackle the class imbalance and presence of human speech annotations are discussed in the Building the Classifier section.
train_soundscapes
The train_soundscapes directory contains nearly 10,000 unlabeled audio recordings of birdsong. As will be discussed in the Building the Classifier section, these audio recordings can be incorporated into the training data via pseudo-labeling.
test_soundscapes
The test_soundscapes directory is empty except for a readme.txt file. This directory is populated with a hidden set of test audio when submitting prediction results to the BirdCLEF+ 2025 competition.
Building the Classifier
Basic Approach and Background
The basic approach used by Google Research to train their bird vocalization classifier is as follows:
- Split recorded audio into 5 second segments.
- Convert audio segments to mel spectrograms.
- Train an image classifier on the mel spectrograms.
The same approach will be followed in this guide. The image classifier that will be trained is Google’s EfficientNet B0 model. If you have familiarity with the EfficientNet family of models, you know that they were designed for efficient image processing.
However, before audio samples can be split and converted to mel spectrograms, we must deal with the class imbalance and human annotation problems mentioned in the Training Data section. Broadly, these problems will be addressed respectively via data augmentation and slicing the audio samples.
Before diving into the actual design, the following sub-sections provide some brief background information.
EfficientNet Models
Google Research introduced its family of EfficientNet models in 2019 as a set of convolutional neural network models that surpassed state-of-the-art models, at that time, with respect to both size and performance.
EfficientNetV2 models, released in 2021, offer even better performance and parameter efficiency.
Though trained on ImageNet data, EfficientNet models have demonstrated their utility when transferred to other datasets, making them an attractive choice as the classification technology for this project.
Mel Spectrograms
A mel spectrogram is a visual representation of an audio signal. It might be best analogized to a heatmap for sound.
The x-axis of a mel spectrogram represents the time dimension of the audio signal, and the y-axis represents the frequencies of the sounds within the signal. However, instead of displaying all frequencies along a continuous scale, frequencies are grouped into mel bands. These bands are, in turn, spaced out using the mel scale. The mel scale is a logarithmic scale that approximates the human auditory system and how humans perceive sound. The colors of the mel spectrogram represent the amplitude of the sounds within bands. Brighter colors represent higher amplitudes while darker colors represent lower amplitudes.
Design
My objective in discussing the design is to provide a high-level review of the approach without getting into too much detail. The main training (fine-tuning) logic is captured in this Kaggle notebook (“training notebook”) which is composed of 4 main sections:
- Section 1: Audio data loading.
- Section 2: Audio data processing.
- Section 3: Mel spectrogram generation and input preparation.
- Section 4: Model training.
You will note that the first 2 cells of each main section are (1) imports used by that section and (2) a Config cell defining constants used in that section and later sections.
The training notebook actually begins with Section 0 where base Python packages used throughout the notebook are imported. This section also includes the logic to login to Weights & Biases (“WandB”) for tracking training runs. You will need to attach your own WandB API key to the notebook as a Kaggle Secret using the name WANDB_API_KEY.
As discussed in the Training Data section, the unlabeled training soundscapes can be incorporated into the training data via pseudo-labeling. Use of pseudo-labeled data is discussed in the Section 3.5 – Pseudo-Labeling sub-section below. Keep in mind that Kaggle non-GPU environments are limited to 30 GiB of memory.
A trained model following the experimental setup described in the following sub-sections has been posted to Kaggle here. If desired, you can use this model without training your own and jump directly to the Running Inference section to run inference on birdsong audio.
Section 1 – Audio Data Loading
The Audio Data Loading section of the notebook:
- Extracts those classes in the BirdCLEF+ 2025 competition dataset that are not covered by the GBV classifier.
- Loads raw audio data via the load_training_audio method.
- Creates a processed_audio directory and saves a copy of loaded audio data as .wav files in that directory.
The Config cell of this section includes a MAX_FILES constant, which specifies the maximum number of audio files to load from a given class. It is arbitrarily set to the large value of 1000 to ensure that all audio files are loaded for the non-GBV classes. You may need to adjust this constant for your own experimental setup; for example, if you are loading audio data from all classes, you may need to set it to a lower value to avoid exhausting available memory.
The load_training_audio method can be called with a classes parameter, which is a list of classes whose audio will be loaded. For this project, the non-GBV classes are stored as a list and assigned to the variable missing_classes, which is subsequently passed to the load_training_audio method via the classes parameter.
# `missing_classes` list
['1139490', '1192948', '1194042', '126247', '1346504', '134933', '135045', '1462711', '1462737', '1564122', '21038', '21116', '21211', '22333', '22973', '22976', '24272', '24292', '24322', '41663', '41778', '41970', '42007', '42087', '42113', '46010', '47067', '476537', '476538', '48124', '50186', '517119', '523060', '528041', '52884', '548639', '555086', '555142', '566513', '64862', '65336', '65344', '65349', '65373', '65419', '65448', '65547', '65962', '66016', '66531', '66578', '66893', '67082', '67252', '714022', '715170', '787625', '81930', '868458', '963335', 'grasal4', 'verfly', 'y00678']
You can load all 206 BirdCLEF+ 2025 classes by passing an empty list as the classes parameter.
The load_training_audio method also accepts an optional boolean use_slice parameter. This parameter works with the LOAD_SLICE constant defined in the Config cell. The use_slice parameter and LOAD_SLICE constant are not used with this implementation. However, they can be used to load a specific amount of audio from each file. For example, to only load 5 seconds of audio from each audio file, set LOAD_SLICE to 160000, which is calculated as 5 times the sampling rate of 32000, and pass True to the use_slice parameter.
The load_training_audio method also accepts a boolean make_copy parameter. When this parameter is True, the logic creates a processed_audio directory and saves a copy of each audio sample as a .wav file to the directory. Audio copies are saved in sub-directories reflecting the class to which they belong. The processed_audio directory is used in the next section to save modified audio samples to disk without affecting the BirdCLEF+ 2025 dataset directories.
The load_training_audio method returns a dictionary of loaded audio data using the class names as keys. Each value in the dictionary is a list of tuples of the form (AUDIO_FILENAME, AUDIO_DATA):
{'1139490': [('CSA36389.ogg', tensor([[-7.3379e-06, 1.0008e-05, -8.9483e-06, ..., 2.9978e-06,
3.4201e-06, 3.8700e-06]])), ('CSA36385.ogg', tensor([[-2.9545e-06, 2.9259e-05, 2.8138e-05, ..., -5.8680e-09, -2.3467e-09, -2.6546e-10]]))], '1192948': [('CSA36388.ogg', tensor([[ 3.7417e-06, -5.4138e-06, -3.3517e-07, ..., -2.4159e-05, -1.6547e-05, -1.8537e-05]])), ('CSA36366.ogg', tensor([[ 2.6916e-06, -1.5655e-06, -2.1533e-05, ..., -2.0132e-05, -1.9063e-05, -2.4438e-05]])), ('CSA36373.ogg', tensor([[ 3.4144e-05, -8.0636e-06, 1.4903e-06, ..., -3.8835e-05, -4.1840e-05, -4.0731e-05]])), ('CSA36358.ogg', tensor([[-1.6201e-06, 2.8240e-05, 2.9543e-05, ..., -2.9203e-04, -3.1059e-04, -2.8100e-04]]))], '1194042': [('CSA18794.ogg', tensor([[ 3.0655e-05, 4.8817e-05, 6.2794e-05, ..., -5.1450e-05,
-4.8535e-05, -4.2476e-05]])), ('CSA18802.ogg', tensor([[ 6.6640e-05, 8.8530e-05, 6.4143e-05, ..., 5.3802e-07, -1.7509e-05, -4.8914e-06]])), ('CSA18783.ogg', tensor([[-8.6866e-06, -6.3421e-06, -3.1125e-05, ..., -1.7946e-04, -1.6407e-04, -1.5334e-04]]))] ...}
The method also returns basic statistics describing the data loaded for each class as a comma-separated-value string. You can optionally export these statistics to inspect the data.
class,sampling_rate,num_files,num_secs_loaded,num_files_loaded
1139490,32000,2,194,2
1192948,32000,4,420,4
1194042,32000,3,91,3
...
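For reference, a call to the load_training_audio method under this setup might look like the following sketch. The exact signature and return values are defined in the training notebook; the tuple unpacking shown here is an assumption based on the outputs described above.

# Hypothetical usage sketch of the notebook's `load_training_audio` method.
# Parameter names follow the description above; the actual signature may differ.
audio_by_class, stats_csv = load_training_audio(
    classes=missing_classes,  # non-GBV classes only; pass [] to load all 206 classes
    use_slice=False,          # LOAD_SLICE is not used in this implementation
    make_copy=True,           # write .wav copies under the processed_audio directory
)

# Optionally export the per-class loading statistics for inspection.
with open("load_stats.csv", "w") as f:
    f.write(stats_csv)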
Section 2 – Audio Data Processing
The Audio Data Processing section of the notebook:
- Optionally strips silent segments and slices audio to eliminate most human annotations from raw audio. Stripping silent segments eliminates irrelevant parts of the audio signal.
- Optionally augments audio for minority classes to help address the class imbalance. Audio augmentation consists of (1) adding a randomly generated noise signal, (2) changing the tempo of the raw audio, or (3) adding a randomly generated noise signal and changing the tempo of the raw audio.
Section 2.1 – Detecting Silent Segments
The detect_silence method is used to “slide” over each raw audio sample and identify silent segments by comparing the root-mean-square (RMS) value of a given segment to a specified threshold. If the RMS is below the threshold, the segment is identified as a silent segment. The following constants, specified in the Config cell of this section, control the behavior of the detect_silence method:
SIL_FRAME_PCT_OF_SR = 0.25
SIL_FRAME = int(SR * SIL_FRAME_PCT_OF_SR)
SIL_HOP = int(1.0 * SIL_FRAME)
SIL_THRESHOLD = 5e-5
SIL_REPLACE_VAL = -1000 # Value used to replace audio signal values within silent segments
The SIL_FRAME and SIL_HOP constants can be modified to adjust how the method “slides” over the raw audio. Similarly, the SIL_THRESHOLD value can be modified to make the method more aggressive or conservative with respect to identification of silent segments.
The method outputs a dictionary of silent segment markers for each file in each class. Audio files with no detected silent segments are identified by empty lists.
{'1139490': {'CSA36389.ogg': [0, 8000, 16000, 272000, 280000, 288000, 296000, 304000], 'CSA36385.ogg': [0, 8000, 16000, 24000, 240000, 248000, 256000]}, '1192948': {'CSA36388.ogg': [0, 8000, 16000, 24000, 256000, 264000, 272000, 288000], 'CSA36366.ogg': [0, 8000, 16000, 24000, 256000, 264000, 272000, 280000, 288000], 'CSA36373.ogg': [0, 8000, 16000, 24000, 256000, 264000, 272000, 288000], 'CSA36358.ogg': [8000]}, '1194042': {'CSA18794.ogg': [], 'CSA18802.ogg': [], 'CSA18783.ogg': [0, 8000, 16000, 24000, 600000, 608000, 616000]}, '126247': {'XC941297.ogg': [], 'iNat1109254.ogg': [], 'iNat888527.ogg': [], 'iNat320679.ogg': [0], 'iNat888729.ogg': [], 'iNat146584.ogg': []}, '1346504': {'CSA18803.ogg': [0, 8000, 16000, 24000, 3000000, 3008000, 3016000], 'CSA18791.ogg': [], 'CSA18792.ogg': [], 'CSA18784.ogg': [0, 8000, 16000, 1232000, 1240000, 1248000], 'CSA18793.ogg': [0, 8000, 16000, 24000, 888000]} ...}
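For illustration, the following is a minimal sketch of RMS-based silent segment detection using the constants above. It is not the notebook's exact detect_silence implementation, but it captures the sliding-window idea:

import numpy as np

SR = 32000
SIL_FRAME_PCT_OF_SR = 0.25
SIL_FRAME = int(SR * SIL_FRAME_PCT_OF_SR)  # 8000 samples (0.25 seconds)
SIL_HOP = int(1.0 * SIL_FRAME)             # non-overlapping frames
SIL_THRESHOLD = 5e-5

def detect_silence_sketch(audio: np.ndarray) -> list:
    """Return the start index of each frame whose RMS falls below the threshold."""
    markers = []
    for start in range(0, len(audio) - SIL_FRAME + 1, SIL_HOP):
        frame = audio[start:start + SIL_FRAME]
        rms = np.sqrt(np.mean(frame ** 2))
        if rms < SIL_THRESHOLD:
            markers.append(start)
    return markers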
Section 2.2 – Removing Silent Segments and Eliminating Human Annotations
The USE_REMOVE_SILENCE_AND_HUMAN_ANNOT constant defined in the Config cell of this section specifies if audio should be stripped of silent segments and sliced to remove most human annotations.
USE_REMOVE_SILENCE_AND_HUMAN_ANNOT = True
The remove_silence_and_human_annot method strips silent segments from audio samples using the output from the detect_silence method. Further, it implements logic to handle human annotations based on a simple observation: many audio samples, namely those with human annotations, tend to have the following structure:
| < 10s | ~1s | |
| BIRDSONG | SILENCE | HUMAN ANNOTATION |
The birdsong and human annotation sections themselves may contain silent segments. However, as seen in the diagram above, the bird vocalization recordings often occur within the first few seconds of audio. Therefore, a simple, if imperfect, approach to deal with human annotations is to slice audio samples at the first silent segment marker that occurs outside of a specified window, under the assumption that a human annotation follows that silent segment. The remove_silence_and_human_annot logic uses the ANNOT_BREAKPOINT constant in the Config cell to check if a silent segment marker lies outside the window specified by ANNOT_BREAKPOINT, expressed in number of seconds. If it does, the logic slices the raw audio at that marker and retains only the data that occurs before it.
A manual inspection of processed audio during experimentation showed this approach to be satisfactory. However, as mentioned in the Training Data section, there are some audio recordings where the human annotation precedes the birdsong; the logic described here does not address those cases. Some audio samples feature long sequences of recorded birdsong without silent segments; such samples are unaffected by this logic and are kept in their entirety.
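The following minimal sketch illustrates the slicing idea described above; it is not the notebook's exact code, and the ANNOT_BREAKPOINT value shown is an assumed placeholder:

SR = 32000
ANNOT_BREAKPOINT = 10  # assumed value, in seconds

def trim_human_annotation_sketch(audio, silence_markers):
    """Slice audio at the first silent segment marker beyond the breakpoint window."""
    breakpoint_sample = ANNOT_BREAKPOINT * SR
    for marker in silence_markers:  # markers are sample indices from silence detection
        if marker > breakpoint_sample:
            return audio[:marker]   # keep only the audio before the presumed annotation
    return audio                    # no marker beyond the window: keep the sample intact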
A second constant, SLICE_FRAME, can be optionally used in a final processing step to return an even more refined slice of the processed audio. Set SLICE_FRAME to the number of seconds of processed audio that you want to retain.
The remove_silence_and_human_annot method saves processed audio to disk under the processed_audio directory via the save_audio parameter, which is passed as True. The method returns a dictionary of the total seconds of processed audio for each class:
{'1139490': 14, '1192948': 29, '1194042': 24, '126247': 48, '1346504': 40, '134933': 32, '135045': 77, ...}
The get_audio_stats method is used following remove_silence_and_human_annot to get the average number of seconds of audio across all classes.
Section 2.3 – Calculating Augmentation Turns for Minority Classes
As mentioned in the Training Data section, the classes are not balanced. Augmentation is used in this notebook section to help address the imbalance, leveraging the average number of seconds of audio across all classes, as provided by the get_audio_stats method. Classes with total seconds of processed audio below the average are augmented. The get_augmentation_turns_per_class method determines the number of augmentation turns for each minority class using the average number of seconds per processed audio sample:
TURNS = (AVG_SECS_AUDIO_ACROSS_CLASSES - TOTAL_SECS_AUDIO_FOR_CLASS)/AVG_SECS_PER_AUDIO_SAMPLE
Minority classes further below the average will have more augmentation turns versus minority classes nearer the average which will have fewer augmentation turns.
The get_augmentation_turns_per_class method includes an AVG_SECS_FACTOR constant which can be used to adjust the value for the average number of seconds of audio across all classes. The constant can be used to make the logic more conservative or aggressive when calculating the number of augmentation turns.
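As a concrete illustration of the formula above, using made-up numbers rather than values from the actual dataset:

# Illustrative numbers only; the real values come from get_audio_stats.
AVG_SECS_AUDIO_ACROSS_CLASSES = 60.0  # average total seconds of audio per class
AVG_SECS_PER_AUDIO_SAMPLE = 10.0      # average length of one processed sample
TOTAL_SECS_AUDIO_FOR_CLASS = 20.0     # a minority class well below the average

turns = (AVG_SECS_AUDIO_ACROSS_CLASSES - TOTAL_SECS_AUDIO_FOR_CLASS) / AVG_SECS_PER_AUDIO_SAMPLE
print(int(turns))  # 4 augmentation turns for this class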
Section 2.4 – Running Augmentations
The USE_AUGMENTATIONS constant defined in the Config cell of this section specifies if audio should be augmented.
USE_AUGMENTATIONS = True
As mentioned earlier, audio augmentation consists of (1) adding a randomly generated noise signal, (2) changing the tempo of the raw audio, or (3) adding a randomly generated noise signal and changing the tempo of the raw audio. The add_noise and change_tempo methods encapsulate the logic for adding a noise signal and changing the tempo respectively. The noise signal range and tempo change range can be adjusted via the following constants in the Config cell:
NOISE_RNG_LOW = 0.0001
NOISE_RNG_HIGH = 0.0009
TEMPO_RNG_LOW = 0.5
TEMPO_RNG_HIGH = 1.5
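For illustration, minimal sketches of the two augmentations might look like the following, assuming librosa is used for the tempo change; the notebook's actual add_noise and change_tempo implementations may differ:

import numpy as np
import librosa

def add_noise_sketch(audio: np.ndarray) -> np.ndarray:
    """Add a low-amplitude, randomly generated noise signal to the audio."""
    noise_level = np.random.uniform(NOISE_RNG_LOW, NOISE_RNG_HIGH)
    noise = np.random.normal(0.0, noise_level, size=audio.shape)
    return audio + noise

def change_tempo_sketch(audio: np.ndarray) -> np.ndarray:
    """Speed up or slow down the audio by a random factor within the configured range."""
    rate = np.random.uniform(TEMPO_RNG_LOW, TEMPO_RNG_HIGH)
    return librosa.effects.time_stretch(audio, rate=rate)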
The run_augmentations method runs the augmentations using the output from the get_augmentation_turns_per_class method. For those classes that will be augmented, the logic:
- Randomly selects a processed audio sample (i.e. silent segments already removed) for augmentation.
- Randomly selects the augmentation to perform: (1) adding noise, (2) changing the tempo, or (3) adding noise and changing the tempo.
- Saves the augmented audio to disk under the appropriate class within the processed_audio directory.
While the notebook logic augments minority classes with total seconds of audio below the average, it ignores those classes with total seconds of audio above the average. This approach was taken to manage available memory and with the understanding that the class imbalance is further addressed through choice of the loss function.
Section 3 – Mel Spectrogram Generation and Input Preparation
The Mel Spectrogram Generation and Input Preparation section of the notebook:
- Splits processed audio data into training and validation lists.
- Splits audio into 5 second frames.
- Generates mel spectrograms for each 5 second audio frame.
- Resizes mel spectrograms to a target size of (224, 224).
- Optionally loads pseudo-labeled data samples to augment training data.
- One-hot encodes training data and validation data labels.
- Constructs TensorFlow Dataset objects from training and validation data lists.
- Optionally uses MixUp logic to augment training data.
Section 3.1 – Splitting Processed Audio Data
Processed audio data is loaded from the processed_audio folder. The data is split into 4 lists:
- training_audio
- training_labels
- validation_audio
- validation_labels
Labels are, of course, the class names associated with the audio examples. The SPLIT constant defined in the Config cell controls the split ratio between the training and validation data lists. Processed audio data is shuffled before splitting.
Section 3.2 – Splitting Audio into Frames
Audio is split into 5 second segments using the frame_audio method, which itself uses the TensorFlow signal.frame method to split each audio example. The following constants in the Config cell control the split operation:
FRAME_LENGTH = 5
FRAME_STEP = 5
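A minimal sketch of the framing step, assuming a sampling rate of 32000 and using tf.signal.frame as described above:

import tensorflow as tf

SR = 32000
FRAME_LENGTH = 5  # seconds
FRAME_STEP = 5    # seconds (non-overlapping frames)

def frame_audio_sketch(audio: tf.Tensor) -> tf.Tensor:
    """Split a 1-D audio tensor into non-overlapping 5 second frames."""
    return tf.signal.frame(
        audio,
        frame_length=FRAME_LENGTH * SR,
        frame_step=FRAME_STEP * SR,
        pad_end=True,  # zero-pad the final partial frame
    )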
Section 3.3 – Generating Mel Spectrograms
Mel spectrograms are generated for each 5 second audio frame generated in Section 3.2 via the audio2melspec method. The following constants in the Config cell specify the parameters used when generating the mel spectrograms, such as the number of mel bands, minimum frequency, and maximum frequency:
# Mel spectrogram parameters
N_FFT = 1024 # FFT size
HOP_SIZE = 256
N_MELS = 256
FMIN = 50 # minimum frequency
FMAX = 14000 # maximum frequency
The frequency band was chosen to reflect the potential range of most bird vocalizations. However, some bird species can vocalize outside this range.
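A minimal sketch of mel spectrogram generation using the constants above, assuming librosa; the notebook's audio2melspec implementation may differ in its details:

import numpy as np
import librosa

SR = 32000

def audio2melspec_sketch(audio: np.ndarray) -> np.ndarray:
    """Convert a 5 second audio frame to a log-scaled (dB) mel spectrogram."""
    mel = librosa.feature.melspectrogram(
        y=audio,
        sr=SR,
        n_fft=N_FFT,
        hop_length=HOP_SIZE,
        n_mels=N_MELS,
        fmin=FMIN,
        fmax=FMAX,
    )
    # Convert the power spectrogram to decibels for a better-behaved dynamic range.
    return librosa.power_to_db(mel, ref=np.max)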
Section 3.4 – Resizing Mel Spectrograms
The to_melspectrogram_image method is used to convert each mel spectrogram to a pillow Image object. Each Image object is subsequently resized to (224, 224), which is the input dimension expected by the EfficientNet B0 model.
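A minimal sketch of this conversion and resizing step, assuming a dB-scaled mel spectrogram as input; the notebook's to_melspectrogram_image implementation may differ:

import numpy as np
from PIL import Image

def to_melspectrogram_image_sketch(mel_db: np.ndarray, target_size=(224, 224)) -> Image.Image:
    """Scale a dB mel spectrogram to 0-255 and resize it to the EfficientNet B0 input size."""
    # Min-max normalize so the spectrogram can be treated as an 8-bit grayscale image.
    scaled = (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min() + 1e-8) * 255.0
    image = Image.fromarray(scaled.astype(np.uint8))
    return image.resize(target_size)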
Section 3.5 – Loading Pseudo-Labeled Data
As mentioned in the Training Data section, the train_soundscapes directory contains nearly 10,000 unlabeled audio recordings of birdsong. These audio recordings can be incorporated into the training data via pseudo-labeling. A simple process to create pseudo-labeled data is as follows:
- Train a classifier without pseudo-labeled data.
- Load training soundscape audio files.
- Segment each audio soundscape into 5 second frames.
- Generate mel spectrograms for each 5 second frame and resize to (224, 224).
- Run predictions on each resized mel spectrogram using the classifier that you trained in the first step.
- Keep the predictions above a desired confidence level and save the mel spectrograms for those predictions to disk under the predicted class label.
- Train your classifier again using the pseudo-labeled data.
Pseudo-labeled data can improve the performance of your classifier. If you want to generate your own pseudo-labeled data, you should continue with the remaining sections to train a classifier without pseudo-labeled data. Then, use your classifier to create your own set of pseudo-labeled data using the process outlined above. Finally, re-train your classifier using your pseudo-labeled data.
This implementation does not use pseudo-labeled data. However, you can modify the inference notebook referenced in the Running Inference section to generate pseudo-labeled data.
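If you do generate your own pseudo-labeled data, the confidence-filtering step might look like the following sketch, where the model, spectrogram batch, class name list, and threshold value are all assumptions:

import numpy as np

CONFIDENCE_THRESHOLD = 0.9  # assumed value; tune to taste

def pseudo_label_sketch(model, spectrograms: np.ndarray, class_names: list):
    """Yield (spectrogram, pseudo-label) pairs for high-confidence predictions only."""
    probs = model.predict(spectrograms)  # shape: (num_frames, num_classes)
    for spec, p in zip(spectrograms, probs):
        if p.max() >= CONFIDENCE_THRESHOLD:
            yield spec, class_names[int(p.argmax())]  # save to disk under this class label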
Set the USE_PSEUDO_LABELS constant in the Config cell to False to skip the use of pseudo-labeled data.
Section 3.6 – Encoding Labels
The process_labels method is used to one-hot encode labels. One-hot encoded labels are returned as NumPy arrays and added to the training label and validation label lists.
Section 3.7 – Converting Training and Validation Data Lists to TensorFlow Dataset Objects
The TensorFlow data.Dataset.from_tensor_slices method is used to create TensorFlow Dataset objects from the training and validation data lists. The shuffle method is called on the training Dataset object to shuffle training data before batching. The batch method is called on both Dataset objects to batch the training and validation datasets. The BATCH_SIZE constant in the Config cell controls the batch size.
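A minimal sketch of this step, reusing the list names from Section 3.1 (which at this point hold the resized spectrograms and their one-hot encoded labels):

import tensorflow as tf

train_ds = (
    tf.data.Dataset.from_tensor_slices((training_audio, training_labels))
    .shuffle(buffer_size=len(training_audio))  # shuffle before batching
    .batch(BATCH_SIZE)
)

validation_ds = (
    tf.data.Dataset.from_tensor_slices((validation_audio, validation_labels))
    .batch(BATCH_SIZE)
)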
Section 3.8 – Using MixUp to Augment Training Data
As you may already know, MixUp is a data augmentation technique that effectively mixes two images together to create a new data sample. The class for the blended image is a blend of the classes associated with the original 2 images. The mix_up method, along with the sample_beta_distribution method, encapsulates the optional MixUp logic.
This implementation uses MixUp to augment the training data. To use MixUp, set the USE_MIXUP constant in the Config cell to True.
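For reference, a typical MixUp implementation in TensorFlow looks like the sketch below; the notebook's mix_up and sample_beta_distribution methods follow the same idea, though the alpha value shown here is an assumption:

import tensorflow as tf

def sample_beta_distribution_sketch(size, alpha=0.2):
    """Sample MixUp mixing coefficients from a Beta(alpha, alpha) distribution."""
    gamma_1 = tf.random.gamma(shape=[size], alpha=alpha)
    gamma_2 = tf.random.gamma(shape=[size], alpha=alpha)
    return gamma_1 / (gamma_1 + gamma_2)

def mix_up_sketch(images, labels, alpha=0.2):
    """Blend a batch with a shuffled copy of itself; labels are blended with the same weights."""
    batch_size = tf.shape(images)[0]
    lam = sample_beta_distribution_sketch(batch_size, alpha)
    lam_img = tf.reshape(lam, [-1, 1, 1, 1])  # broadcast over (height, width, channels)
    lam_lab = tf.reshape(lam, [-1, 1])
    indices = tf.random.shuffle(tf.range(batch_size))
    mixed_images = lam_img * images + (1.0 - lam_img) * tf.gather(images, indices)
    mixed_labels = lam_lab * labels + (1.0 - lam_lab) * tf.gather(labels, indices)
    return mixed_images, mixed_labels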
Section 4 – Model Training
The Model Training section of the notebook:
- Initializes and configures a WandB project to capture training run data.
- Builds and compiles the EfficientNet B0 model.
- Trains the model.
- Saves the trained model to disk.
Section 4.1 – Initializing and Configuring WandB Project
Ensure that you have attached your own WandB API key as a Kaggle Secret to the notebook and that the WandB login method in Section 0 of the notebook has returned True.
The Config cell in this section includes logic to initialize and configure a new WandB project (if the project doesn’t already exist) that will capture training run data:
wandb.init(project="my-bird-vocalization-classifier")
config = wandb.config
config.batch_size = BATCH_SIZE
config.epochs = 30
config.image_size = IMG_SIZE
config.num_classes = len(LABELS)
Obviously, you can change the project name my-bird-vocalization-classifier to your desired WandB project name.
Section 4.2 – Building and Compiling the EfficientNet B0 Model
The build_model method is used to load the pre-trained EfficientNet B0 model with ImageNet weights and without the top layer:
model = EfficientNetB0(include_top=False, input_tensor=inputs, weights="imagenet")
The model is initially frozen to leverage the pre-trained ImageNet weights, with the objective of unfreezing (i.e. training) only the layers in the final stage of the model:
# Unfreeze last `unfreeze_layers` layers and add regularization
for layer in model.layers[-unfreeze_layers:]:
if not isinstance(layer, layers.BatchNormalization):
layer.trainable = True
layer.kernel_regularizer = tf.keras.regularizers.l2(L2_RATE)
The UNFREEZE_LAYERS constant in the Config cell specifies the number of layers to unfreeze.
The top of the model is rebuilt with a final Dense layer reflecting the number of bird species classes. Categorical focal cross-entropy is chosen as the loss function to help address the class imbalance. The LOSS_ALPHA and LOSS_GAMMA constants in the Config cell are used with the loss function.
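Pulling the pieces of this section together, a build-and-compile sketch might look like the following. The GlobalAveragePooling2D/Dropout head, the dropout rate, and the Adam optimizer are assumptions; the unfreezing loop mirrors the snippet above, and CategoricalFocalCrossentropy requires a recent Keras version:

import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.applications import EfficientNetB0

def build_model_sketch(num_classes, unfreeze_layers, l2_rate, loss_alpha, loss_gamma, lr):
    inputs = layers.Input(shape=(224, 224, 3))
    base = EfficientNetB0(include_top=False, input_tensor=inputs, weights="imagenet")
    base.trainable = False  # freeze all layers first

    # Unfreeze the last `unfreeze_layers` layers, skipping BatchNormalization layers.
    for layer in base.layers[-unfreeze_layers:]:
        if not isinstance(layer, layers.BatchNormalization):
            layer.trainable = True
            layer.kernel_regularizer = tf.keras.regularizers.l2(l2_rate)

    # Rebuild the top of the model with a classification head.
    x = layers.GlobalAveragePooling2D()(base.output)
    x = layers.Dropout(0.3)(x)  # assumed dropout rate
    outputs = layers.Dense(num_classes, activation="softmax")(x)

    model = tf.keras.Model(inputs, outputs)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
        loss=tf.keras.losses.CategoricalFocalCrossentropy(alpha=loss_alpha, gamma=loss_gamma),
        metrics=["accuracy"],
    )
    return model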
Section 4.3 – Model Training
The fit method is called on the compiled model from Section 4.2 to run training. Note that a learning rate scheduler callback, lr_scheduler, is used instead of a constant learning rate. An initial learning rate of 4.0e-4 is hardcoded into the callback, and the learning rate is decreased in 2 stages based on the epoch count. The number of training epochs is controlled by the EPOCHS constant in the Config cell.
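A sketch of the scheduler and the training call is shown below; the stage boundaries and decay factors are assumptions, as are the train_ds and validation_ds names carried over from the earlier sketch:

import tensorflow as tf

def lr_schedule_sketch(epoch, lr):
    """Two-stage decay from an initial learning rate of 4.0e-4 (boundaries are assumed)."""
    initial_lr = 4.0e-4
    if epoch < 10:
        return initial_lr
    elif epoch < 20:
        return initial_lr * 0.1
    return initial_lr * 0.01

lr_scheduler = tf.keras.callbacks.LearningRateScheduler(lr_schedule_sketch)

history = model.fit(
    train_ds,
    validation_data=validation_ds,
    epochs=EPOCHS,
    callbacks=[lr_scheduler],
)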
Section 4.4 – Model Saving
The save method is called on the compiled model following training to save the model to disk:
model.save("bird-vocalization-classifier.keras")
Training Results
Running the notebook should produce the following training results, assuming you used the experimental setup that was described in the Building the Classifier section:
As seen, training accuracy is just above 90% while validation accuracy is about 70% after training for 30 epochs. However, validation accuracy fluctuates significantly. This variation is partially attributed to the class imbalance, with available memory limiting the use of additional augmentations to fully address it. The results suggest that the model overfits the training data and does not generalize as well as hoped. Nonetheless, the model can be used for predictions alongside the GBV classifier in accordance with the original objective.
Running Inference
This Kaggle notebook (“inference notebook”) can be used for running inference. The inference notebook logic uses both the GBV classifier model and the model that you trained in the preceding section. It runs inference on the unlabeled soundscape files in the train_soundscapes directory, splitting each soundscape audio file into 5 second frames. The MAX_FILES constant defined in the Config cell of Section 0 of the notebook controls the number of soundscape audio files that are loaded for inference.
The inference notebook first generates predictions using the GBV classifier. The predictions for the 143 BirdCLEF+ 2025 competition dataset classes known to the GBV classifier are isolated. If the maximum probability among the 143 “known” classes is greater than or equal to GBV_CLASSIFIER_THRESHOLD, the GBV predicted class is selected as the true class. If the maximum probability among the 143 “known” classes is below GBV_CLASSIFIER_THRESHOLD, it is assumed that the true class is among the 63 classes “unknown” to the GBV classifier, i.e. the classes used to train the model in the preceding section. In that case, the logic runs predictions using the finetuned model and selects the predicted class from that prediction set as the true class.
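The per-frame decision logic can be summarized by the following sketch, where gbv_probs holds the GBV classifier's probabilities restricted to the 143 known competition classes; the function and variable names are illustrative rather than the inference notebook's actual code:

import numpy as np

def predict_frame_sketch(gbv_probs, gbv_classes, spectrogram, custom_model, custom_classes,
                         threshold=GBV_CLASSIFIER_THRESHOLD):
    """Return (predicted_class, probability) for a single 5 second frame."""
    if gbv_probs.max() >= threshold:
        # The GBV classifier is confident: take its prediction.
        return gbv_classes[int(gbv_probs.argmax())], float(gbv_probs.max())
    # Otherwise assume one of the 63 "unknown" classes and defer to the finetuned model.
    custom_probs = custom_model.predict(spectrogram[np.newaxis, ...])[0]
    return custom_classes[int(custom_probs.argmax())], float(custom_probs.max())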
The GBV_CLASSIFIER_THRESHOLD constant is defined in the Config cell of Section 5 of the inference notebook. Predictions are output to 2 files:
- A preds.csv file that captures the prediction and prediction probability for each 5-second soundscape slice.
- A submission.csv file that captures all class probabilities in the format required for the BirdCLEF+ 2025 competition.
Set the path to your finetuned model in the first cell of Section 4 of the inference notebook.
Future Work
The training notebook can be used to train a model on all 206 BirdCLEF+ 2025 classes, eliminating the need for the GBV classifier, at least with respect to the competition dataset. As mentioned earlier, passing an empty list, [], to the load_training_audio method will load audio data from all classes. The MAX_FILES and LOAD_SLICE constants can be used to limit the amount of loaded audio in order to work within the confines of a Kaggle notebook environment.
Of course, a more accurate model can be trained using a larger amount of training data. Ideally, a greater number of augmentations would be used to address the class imbalance. Additionally, other augmentation techniques, such as CutMix, could be implemented to further augment the training data. However, these strategies demand a more robust development environment.