Introduction
Scientists use automated systems to study large ecosystems. In the case of forest and jungle areas, autonomous recording units (ARUs) are used to record audio which can be used to help identify different species of animals and insects. This information can be used to develop a better understanding of the distribution of species within a given environment. In the case of birds, Google Research notes in their article Separating Birdsong in the Wild for Classification that “ecologists use birds to understand food systems and forest health — for example, if there are more woodpeckers in a forest, that means there’s a lot of dead wood.” Further, they note the value of audio-based identification: “[Since] birds communicate and mark territory with songs and calls, it’s most efficient to identify them by ear. In fact, experts may identify up to 10x as many birds by ear as by sight.”
Recently, the BirdCLEF+ 2025 competition launched on Kaggle under the umbrella of the ImageCLEF organization. ImageCLEF supports investigation into cross-language annotation and retrieval of images across a variety of domains. The goal of the competition is straightforward: design a classification model that can accurately predict the species of bird from an audio recording.
At first, the task seems trivial given the availability of the Google Bird Vocalization (GBV) Classifier, also known as Perch. The GBV classifier is trained on nearly 11,000 bird species and hence is an obvious choice as the classification model.
However, the competition includes bird species that lie outside of the GBV classifier training set. As a result, the GBV classifier only achieves ~60% accuracy on the BirdCLEF+ 2025 competition test dataset, and a custom model must be developed.
This guide details an approach to build your own bird vocalization classifier that can be used in conjunction with the GBV classifier to classify a wider selection of bird species. The approach employs the same basic techniques described in the Google Research article mentioned above. The design leverages the BirdCLEF+ 2025 competition dataset for training.
Training Data
The BirdCLEF+ 2025 training dataset, inclusive of supporting files, is approximately 12 GB. The main directories and files comprising the dataset structure are:
birdclef_2025
|__ train_audio
|__ train_soundscapes
|__ test_soundscapes
|__ recording_location.txt
|__ taxonomy.csv
|__ train.csv
train_audio
The train_audio directory is the largest component of the dataset, containing 28,564 training audio recordings in the .ogg audio format. Audio recordings are grouped in sub-directories which each represent a particular bird species, e.g.:
train_audio
|__amakin1
|__ [AUDIO FILES]
|__amekes
|__ [AUDIO FILES]
...
The taxonomy.csv file can be used to look up the actual scientific and common names of the bird species represented by the sub-directory names, e.g.:
| SUB-DIRECTORY NAME | SCIENTIFIC NAME | COMMON NAME |
| --- | --- | --- |
| amakin1 | Chloroceryle amazona | Amazon Kingfisher |
| amekes | Falco sparverius | American Kestrel |
| ... | ... | ... |
The competition dataset involves 206 unique bird species, i.e. 206 classes. As suggested in the Introduction, 63 of these classes are not covered by the GBV Classifier. These Non-GBV classes are generally labeled using a numeric class identifier:
1139490, 1192948, 1194042, 126247, 1346504, 134933, 135045, 1462711, 1462737, 1564122, 21038, 21116, 21211, 22333, 22973, 22976, 24272, 24292, 24322, 41663, 41778, 41970, 42007, 42087, 42113, 46010, 47067, 476537, 476538, 48124, 50186, 517119, 523060, 528041, 52884, 548639, 555086, 555142, 566513, 64862, 65336, 65344, 65349, 65373, 65419, 65448, 65547, 65962, 66016, 66531, 66578, 66893, 67082, 67252, 714022, 715170, 787625, 81930, 868458, 963335, grasal4, verfly, y00678
Some of the Non-GBV classes are characterized by:
- Limited training data. Class 1139490, for example, only contains 2 audio recordings. By contrast, class amakin1, which is a “known” GBV class, contains 89 recordings.
- Poor recording quality. Highlighting class 1139490 again, both training recordings are of poor quality, with one being particularly difficult to discern.
These 2 conditions lead to a significant imbalance among classes in terms of quantity of available audio and audio quality.
Many of the training audio recordings across both GBV and Non-GBV classes also include human speech, with the speaker annotating the recording with details such as the species of bird that was recorded and location of the recording. In most – but not all – cases, the annotations follow the recorded bird vocalizations.
Tactics used to tackle the class imbalance and presence of human speech annotations are discussed in the Building the Classifier section.
train_soundscapes
The train_soundscapes directory contains nearly 10,000 unlabeled audio recordings of birdsong. As will be discussed in the Building the Classifier section, these audio recordings can be incorporated into the training data via pseudo-labeling.
test_soundscapes
The test_soundscapes directory is empty except for a readme.txt file. This directory is populated with a hidden set of test audio when submitting prediction results to the BirdCLEF+ 2025 competition.
Building the Classifier
Basic Approach and Background
The basic approach used by Google Research to train their bird vocalization classifier is as follows:
- Split recorded audio into 5 second segments.
- Convert audio segments to mel spectrograms.
- Train an image classifier on the mel spectrograms.
The same approach will be followed in this guide. The image classifier that will be trained is Google’s EfficientNet B0 model. If you have familiarity with the EfficientNet family of models, you know that they were designed for efficient image processing.
However, before audio samples can be split and converted to mel spectrograms, we must deal with the class imbalance and human annotation problems mentioned in the Training Data section. Broadly, these problems will be addressed respectively via data augmentation and slicing the audio samples.
Before diving into the actual design, the following sub-sections provide some brief background information.
EfficientNet Models
Google Research introduced its family of EfficientNet models in 2019 as a set of convolutional neural network models that surpassed state-of-the-art models, at that time, with respect to both size and performance.
EfficientNetV2 models, released in 2021, offer even better performance and parameter efficiency.
Though trained on ImageNet data, EfficientNet models have demonstrated their utility when transferred to other datasets, making them an attractive choice as the classification technology for this project.
Mel Spectrograms
A mel spectrogram is a visual representation of an audio signal. It might be best analogized to a heatmap for sound.
The x-axis of a mel spectrogram represents the time dimension of the audio signal, and the y-axis represents the frequencies of the sounds within the signal. However, instead of displaying all frequencies along a continuous scale, frequencies are grouped into mel bands. These bands are, in turn, spaced out using the mel scale. The mel scale is a logarithmic scale that approximates the human auditory system and how humans perceive sound. The colors of the mel spectrogram represent the amplitude of the sounds within bands. Brighter colors represent higher amplitudes while darker colors represent lower amplitudes.
Design
My objective in discussing the design is to provide a high-level review of the approach without getting into too much detail. The main training (fine-tuning) logic is captured in this Kaggle notebook (“training notebook”) which is composed of 4 main sections:
- Section 1: Audio data loading.
- Section 2: Audio data processing.
- Section 3: Mel spectrogram generation and input preparation.
- Section 4: Model training.
You will note that the first 2 cells of each main section are (1) imports used by that section and (2) a Config cell defining constants used in that section and later sections.
The training notebook actually begins with Section 0 where base Python packages used throughout the notebook are imported. This section also includes the logic to login to Weights & Biases (“WandB”) for tracking training runs. You will need to attach your own WandB API key to the notebook as a Kaggle Secret using the name WANDB_API_KEY.
As discussed in the Training Data section, the unlabeled training soundscapes can be incorporated into the training data via pseudo-labeling. Use of pseudo-labeled data is discussed in the Section 3.5 – Pseudo-Labeling sub-section below. Keep in mind that Kaggle non-GPU environments are limited to 30 GiB of memory.
A trained model following the experimental setup described in the following sub-sections has been posted to Kaggle here. If desired, you can use this model without training your own and jump directly to the Running Inference section to run inference on birdsong audio.
Section 1 – Audio Data Loading
The Audio Data Loading section of the notebook:
- Extracts those classes in the BirdCLEF+ 2025 competition dataset that are not covered by the GBV classifier.
- Loads raw audio data via the load_training_audio method.
- Creates a processed_audio directory and saves a copy of loaded audio data as .wav files in that directory.
The Config cell of this section includes a MAX_FILES constant, which specifies the maximum number of audio files to load from a given class. It is arbitrarily set to the large value of 1000 to ensure that all audio files are loaded for the non-GBV classes. You may need to adjust this constant for your own experimental setup; for example, if you are loading audio data from all classes, you may need to set it to a lower value to avoid exhausting available memory.
The load_training_audio method can be called with a classes parameter, which is a list of classes whose audio will be loaded. For this project, the non-GBV classes are stored as a list and assigned to the variable missing_classes, which is subsequently passed to the load_training_audio method via the classes parameter.
# `missing_classes` list
['1139490', '1192948', '1194042', '126247', '1346504', '134933', '135045', '1462711', '1462737', '1564122', '21038', '21116', '21211', '22333', '22973', '22976', '24272', '24292', '24322', '41663', '41778', '41970', '42007', '42087', '42113', '46010', '47067', '476537', '476538', '48124', '50186', '517119', '523060', '528041', '52884', '548639', '555086', '555142', '566513', '64862', '65336', '65344', '65349', '65373', '65419', '65448', '65547', '65962', '66016', '66531', '66578', '66893', '67082', '67252', '714022', '715170', '787625', '81930', '868458', '963335', 'grasal4', 'verfly', 'y00678']
You can load all 206 BirdCLEF+ 2025 classes by passing an empty list as the classes parameter.
The load_training_audio method also accepts an optional boolean use_slice parameter. This parameter works with the LOAD_SLICE constant defined in the Config cell. The use_slice parameter and LOAD_SLICE constant are not used with this implementation. However, they can be used to load a specific amount of audio from each file. For example, to only load 5 seconds of audio from each audio file, set LOAD_SLICE to 160000, which is calculated as 5 times the sampling rate of 32000, and pass True to the use_slice parameter.
The load_training_audio method also accepts a boolean make_copy parameter. When this parameter is True, the logic creates a processed_audio directory and saves a copy of each audio sample as a .wav file to the directory. Audio copies are saved in sub-directories reflecting the class to which they belong. The processed_audio directory is used in the next section to save modified audio samples to disk without affecting the BirdCLEF+ 2025 dataset directories.
The load_training_audio method returns a dictionary of loaded audio data using the class names as keys. Each value in the dictionary is a list of tuples of the form (AUDIO_FILENAME, AUDIO_DATA):
{'1139490': [('CSA36389.ogg', tensor([[-7.3379e-06, 1.0008e-05, -8.9483e-06, ..., 2.9978e-06,
3.4201e-06, 3.8700e-06]])), ('CSA36385.ogg', tensor([[-2.9545e-06, 2.9259e-05, 2.8138e-05, ..., -5.8680e-09, -2.3467e-09, -2.6546e-10]]))], '1192948': [('CSA36388.ogg', tensor([[ 3.7417e-06, -5.4138e-06, -3.3517e-07, ..., -2.4159e-05, -1.6547e-05, -1.8537e-05]])), ('CSA36366.ogg', tensor([[ 2.6916e-06, -1.5655e-06, -2.1533e-05, ..., -2.0132e-05, -1.9063e-05, -2.4438e-05]])), ('CSA36373.ogg', tensor([[ 3.4144e-05, -8.0636e-06, 1.4903e-06, ..., -3.8835e-05, -4.1840e-05, -4.0731e-05]])), ('CSA36358.ogg', tensor([[-1.6201e-06, 2.8240e-05, 2.9543e-05, ..., -2.9203e-04, -3.1059e-04, -2.8100e-04]]))], '1194042': [('CSA18794.ogg', tensor([[ 3.0655e-05, 4.8817e-05, 6.2794e-05, ..., -5.1450e-05,
-4.8535e-05, -4.2476e-05]])), ('CSA18802.ogg', tensor([[ 6.6640e-05, 8.8530e-05, 6.4143e-05, ..., 5.3802e-07, -1.7509e-05, -4.8914e-06]])), ('CSA18783.ogg', tensor([[-8.6866e-06, -6.3421e-06, -3.1125e-05, ..., -1.7946e-04, -1.6407e-04, -1.5334e-04]]))] ...}
The method also returns basic statistics describing the data loaded for each class as a comma-separated-value string. You can optionally export these statistics to inspect the data.
class,sampling_rate,num_files,num_secs_loaded,num_files_loaded
1139490,32000,2,194,2
1192948,32000,4,420,4
1194042,32000,3,91,3
...
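For reference, a call to the load_training_audio method under this setup might look like the following sketch. The exact signature and return values are defined in the training notebook; the tuple unpacking shown here is an assumption based on the outputs described above.

# Hypothetical usage sketch of the notebook's `load_training_audio` method.
# Parameter names follow the description above; the actual signature may differ.
audio_by_class, stats_csv = load_training_audio(
    classes=missing_classes,  # non-GBV classes only; pass [] to load all 206 classes
    use_slice=False,          # LOAD_SLICE is not used in this implementation
    make_copy=True,           # write .wav copies under the processed_audio directory
)

# Optionally export the per-class loading statistics for inspection.
with open("load_stats.csv", "w") as f:
    f.write(stats_csv)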
Section 2 – Audio Data Processing
The Audio Data Processing section of the notebook:
- Optionally strips silent segments and slices audio to eliminate most human annotations from raw audio. Stripping silent segments eliminates irrelevant parts of the audio signal.
- Optionally augments audio for minority classes to help address the class imbalance. Audio augmentation consists of (1) adding a randomly generated noise signal, (2) changing the tempo of the raw audio, or (3) adding a randomly generated noise signal and changing the tempo of the raw audio.
Section 2.1 – Detecting Silent Segments
The detect_silence method is used to “slide” over each raw audio sample and identify silent segments by comparing the root-mean-square (RMS) value of a given segment to a specified threshold. If the RMS is below the threshold, the segment is identified as a silent segment. The following constants, specified in the Config cell of this section, control the behavior of the detect_silence method:
SIL_FRAME_PCT_OF_SR = 0.25
SIL_FRAME = int(SR * SIL_FRAME_PCT_OF_SR)
SIL_HOP = int(1.0 * SIL_FRAME)
SIL_THRESHOLD = 5e-5
SIL_REPLACE_VAL = -1000 # Value used to replace audio signal values within silent segments
The SIL_FRAME and SIL_HOP constants can be modified to adjust how the method “slides” over the raw audio. Similarly, the SIL_THRESHOLD value can be modified to make the method more aggressive or conservative with respect to identification of silent segments.
The method outputs a dictionary of silent segment markers for each file in each class. Audio files with no detected silent segments are identified by empty lists.
{'1139490': {'CSA36389.ogg': [0, 8000, 16000, 272000, 280000, 288000, 296000, 304000], 'CSA36385.ogg': [0, 8000, 16000, 24000, 240000, 248000, 256000]}, '1192948': {'CSA36388.ogg': [0, 8000, 16000, 24000, 256000, 264000, 272000, 288000], 'CSA36366.ogg': [0, 8000, 16000, 24000, 256000, 264000, 272000, 280000, 288000], 'CSA36373.ogg': [0, 8000, 16000, 24000, 256000, 264000, 272000, 288000], 'CSA36358.ogg': [8000]}, '1194042': {'CSA18794.ogg': [], 'CSA18802.ogg': [], 'CSA18783.ogg': [0, 8000, 16000, 24000, 600000, 608000, 616000]}, '126247': {'XC941297.ogg': [], 'iNat1109254.ogg': [], 'iNat888527.ogg': [], 'iNat320679.ogg': [0], 'iNat888729.ogg': [], 'iNat146584.ogg': []}, '1346504': {'CSA18803.ogg': [0, 8000, 16000, 24000, 3000000, 3008000, 3016000], 'CSA18791.ogg': [], 'CSA18792.ogg': [], 'CSA18784.ogg': [0, 8000, 16000, 1232000, 1240000, 1248000], 'CSA18793.ogg': [0, 8000, 16000, 24000, 888000]} ...}
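For illustration, the following is a minimal sketch of RMS-based silent segment detection using the constants above. It is not the notebook's exact detect_silence implementation, but it captures the sliding-window idea:

import numpy as np

SR = 32000
SIL_FRAME_PCT_OF_SR = 0.25
SIL_FRAME = int(SR * SIL_FRAME_PCT_OF_SR)  # 8000 samples (0.25 seconds)
SIL_HOP = int(1.0 * SIL_FRAME)             # non-overlapping frames
SIL_THRESHOLD = 5e-5

def detect_silence_sketch(audio: np.ndarray) -> list:
    """Return the start index of each frame whose RMS falls below the threshold."""
    markers = []
    for start in range(0, len(audio) - SIL_FRAME + 1, SIL_HOP):
        frame = audio[start:start + SIL_FRAME]
        rms = np.sqrt(np.mean(frame ** 2))
        if rms < SIL_THRESHOLD:
            markers.append(start)
    return markers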
Section 2.2 – Removing Silent Segments and Eliminating Human Annotations
The USE_REMOVE_SILENCE_AND_HUMAN_ANNOT constant defined in the Config cell of this section specifies if audio should be stripped of silent segments and sliced to remove most human annotations.
USE_REMOVE_SILENCE_AND_HUMAN_ANNOT = True
The remove_silence_and_human_annot method strips silent segments from audio samples using the output from the detect_silence method. Further, it implements logic to handle human annotations based on a simple observation: many audio samples, namely those with human annotations, tend to have the following structure:
| < 10s | ~1s | |
| BIRDSONG | SILENCE | HUMAN ANNOTATION |
The birdsong and human annotation sections themselves may contain silent segments. However, as seen in the diagram above, the bird vocalization recordings often occur within the first few seconds of audio. Therefore, a simple, if imperfect, approach to deal with human annotations is to slice audio samples at the first silent segment marker that occurs outside of a specified window, under the assumption that a human annotation follows that silent segment. The remove_silence_and_human_annot logic uses the ANNOT_BREAKPOINT constant in the Config cell to check if a silent segment marker lies outside the window specified by ANNOT_BREAKPOINT, expressed in number of seconds. If it does, the logic slices the raw audio at that marker and retains only the data that occurs before it.
A manual inspection of processed audio during experimentation showed this approach to be satisfactory. However, as mentioned in the Training Data section, there are some audio recordings where the human annotation precedes the birdsong; the logic described here does not address those cases. Some audio samples feature long sequences of recorded birdsong without silent segments; such samples are unaffected by this logic and are kept in their entirety.
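The following minimal sketch illustrates the slicing idea described above; it is not the notebook's exact code, and the ANNOT_BREAKPOINT value shown is an assumed placeholder:

SR = 32000
ANNOT_BREAKPOINT = 10  # assumed value, in seconds

def trim_human_annotation_sketch(audio, silence_markers):
    """Slice audio at the first silent segment marker beyond the breakpoint window."""
    breakpoint_sample = ANNOT_BREAKPOINT * SR
    for marker in silence_markers:  # markers are sample indices from silence detection
        if marker > breakpoint_sample:
            return audio[:marker]   # keep only the audio before the presumed annotation
    return audio                    # no marker beyond the window: keep the sample intact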
A second constant, SLICE_FRAME, can be optionally used in a final processing step to return an even more refined slice of the processed audio. Set SLICE_FRAME to the number of seconds of processed audio that you want to retain.
The remove_silence_and_human_annot method saves processed audio to disk under the processed_audio directory via the save_audio parameter, which is passed as True. The method returns a dictionary of the total seconds of processed audio for each class:
{'1139490': 14, '1192948': 29, '1194042': 24, '126247': 48, '1346504': 40, '134933': 32, '135045': 77, ...}
The get_audio_stats method is used following remove_silence_and_human_annot to get the average number of seconds of audio across all classes.
Section 2.3 – Calculating Augmentation Turns for Minority Classes
As mentioned in the Training Data section, the classes are not balanced. Augmentation is used in this notebook section to help address the imbalance, leveraging the average number of seconds of audio across all classes, as provided by the get_audio_stats method. Classes with total seconds of processed audio below the average are augmented. The get_augmentation_turns_per_class method determines the number of augmentation turns for each minority class using the average number of seconds per processed audio sample:
TURNS = (AVG_SECS_AUDIO_ACROSS_CLASSES - TOTAL_SECS_AUDIO_FOR_CLASS)/AVG_SECS_PER_AUDIO_SAMPLE
Minority classes further below the average will have more augmentation turns versus minority classes nearer the average which will have fewer augmentation turns.
The get_augmentation_turns_per_class method includes an AVG_SECS_FACTOR constant which can be used to adjust the value for the average number of seconds of audio across all classes. The constant can be used to make the logic more conservative or aggressive when calculating the number of augmentation turns.
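As a concrete illustration of the formula above, using made-up numbers rather than values from the actual dataset:

# Illustrative numbers only; the real values come from get_audio_stats.
AVG_SECS_AUDIO_ACROSS_CLASSES = 60.0  # average total seconds of audio per class
AVG_SECS_PER_AUDIO_SAMPLE = 10.0      # average length of one processed sample
TOTAL_SECS_AUDIO_FOR_CLASS = 20.0     # a minority class well below the average

turns = (AVG_SECS_AUDIO_ACROSS_CLASSES - TOTAL_SECS_AUDIO_FOR_CLASS) / AVG_SECS_PER_AUDIO_SAMPLE
print(int(turns))  # 4 augmentation turns for this class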
Section 2.4 – Running Augmentations
The USE_AUGMENTATIONS constant defined in the Config cell of this section specifies if audio should be augmented.
USE_AUGMENTATIONS = True
As mentioned earlier, audio augmentation consists of (1) adding a randomly generated noise signal, (2) changing the tempo of the raw audio, or (3) adding a randomly generated noise signal and changing the tempo of the raw audio. The add_noise and change_tempo methods encapsulate the logic for adding a noise signal and changing the tempo respectively. The noise signal range and tempo change range can be adjusted via the following constants in the Config cell:
NOISE_RNG_LOW = 0.0001
NOISE_RNG_HIGH = 0.0009
TEMPO_RNG_LOW = 0.5
TEMPO_RNG_HIGH = 1.5
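For illustration, minimal sketches of the two augmentations might look like the following, assuming librosa is used for the tempo change; the notebook's actual add_noise and change_tempo implementations may differ:

import numpy as np
import librosa

def add_noise_sketch(audio: np.ndarray) -> np.ndarray:
    """Add a low-amplitude, randomly generated noise signal to the audio."""
    noise_level = np.random.uniform(NOISE_RNG_LOW, NOISE_RNG_HIGH)
    noise = np.random.normal(0.0, noise_level, size=audio.shape)
    return audio + noise

def change_tempo_sketch(audio: np.ndarray) -> np.ndarray:
    """Speed up or slow down the audio by a random factor within the configured range."""
    rate = np.random.uniform(TEMPO_RNG_LOW, TEMPO_RNG_HIGH)
    return librosa.effects.time_stretch(audio, rate=rate)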
The run_augmentations method runs the augmentations using the output from the get_augmentation_turns_per_class method. For those classes that will be augmented, the logic:
- Randomly selects a processed audio sample (i.e. silent segments already removed) for augmentation.
- Randomly selects the augmentation to perform: (1) adding noise, (2) changing the tempo, or (3) adding noise and changing the tempo.
- Saves the augmented audio to disk under the appropriate class within the processed_audio directory.
While the notebook logic augments minority classes with total seconds of audio below the average, it ignores those classes with total seconds of audio above the average. This approach was taken to manage available memory and with the understanding that the class imbalance is further addressed through choice of the loss function.
Section 3 – Mel Spectrogram Generation and Input Preparation
The Mel Spectrogram Generation and Input Preparation section of the notebook:
- Splits processed audio data into training and validation lists.
- Splits audio into 5 second frames.
- Generates mel spectrograms for each 5 second audio frame.
- Resizes mel spectrograms to a target size of (224, 224).
- Optionally loads pseudo-labeled data samples to augment training data.
- One-hot encodes training data and validation data labels.
- Constructs TensorFlow Dataset objects from training and validation data lists.
- Optionally uses MixUp logic to augment training data.
Section 3.1 – Splitting Processed Audio Data
Processed audio data is loaded from the processed_audio folder. The data is split into 4 lists:
- training_audio
- training_labels
- validation_audio
- validation_labels
Labels are, of course, the class names associated with the audio examples. The SPLIT constant defined in the Config cell controls the split ratio between the training and validation data lists. Processed audio data is shuffled before splitting.
Section 3.2 – Splitting Audio into Frames
Audio is split into 5 second segments using the frame_audio method, which itself uses the TensorFlow signal.frame method to split each audio example. The following constants in the Config cell control the split operation:
FRAME_LENGTH = 5
FRAME_STEP = 5
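A minimal sketch of the framing step, assuming a sampling rate of 32000 and using tf.signal.frame as described above:

import tensorflow as tf

SR = 32000
FRAME_LENGTH = 5  # seconds
FRAME_STEP = 5    # seconds (non-overlapping frames)

def frame_audio_sketch(audio: tf.Tensor) -> tf.Tensor:
    """Split a 1-D audio tensor into non-overlapping 5 second frames."""
    return tf.signal.frame(
        audio,
        frame_length=FRAME_LENGTH * SR,
        frame_step=FRAME_STEP * SR,
        pad_end=True,  # zero-pad the final partial frame
    )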
Section 3.3 – Generating Mel Spectrograms
Mel spectrograms are generated for each 5 second audio frame generated in Section 3.2 via the audio2melspec method. The following constants in the Config cell specify the parameters used when generating the mel spectrograms, such as the number of mel bands, minimum frequency, and maximum frequency:
# Mel spectrogram parameters
N_FFT = 1024 # FFT size
HOP_SIZE = 256
N_MELS = 256
FMIN = 50 # minimum frequency
FMAX = 14000 # maximum frequency
The frequency band was chosen to reflect the potential range of most bird vocalizations. However, some bird species can vocalize outside this range.
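A minimal sketch of mel spectrogram generation using the constants above, assuming librosa; the notebook's audio2melspec implementation may differ in its details:

import numpy as np
import librosa

SR = 32000

def audio2melspec_sketch(audio: np.ndarray) -> np.ndarray:
    """Convert a 5 second audio frame to a log-scaled (dB) mel spectrogram."""
    mel = librosa.feature.melspectrogram(
        y=audio,
        sr=SR,
        n_fft=N_FFT,
        hop_length=HOP_SIZE,
        n_mels=N_MELS,
        fmin=FMIN,
        fmax=FMAX,
    )
    # Convert the power spectrogram to decibels for a better-behaved dynamic range.
    return librosa.power_to_db(mel, ref=np.max)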
Section 3.4 – Resizing Mel Spectrograms
The to_melspectrogram_image method is used to convert each mel spectrogram to a pillow Image object. Each Image object is subsequently resized to (224, 224), which is the input dimension expected by the EfficientNet B0 model.
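A minimal sketch of this conversion and resizing step, assuming a dB-scaled mel spectrogram as input; the notebook's to_melspectrogram_image implementation may differ:

import numpy as np
from PIL import Image

def to_melspectrogram_image_sketch(mel_db: np.ndarray, target_size=(224, 224)) -> Image.Image:
    """Scale a dB mel spectrogram to 0-255 and resize it to the EfficientNet B0 input size."""
    # Min-max normalize so the spectrogram can be treated as an 8-bit grayscale image.
    scaled = (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min() + 1e-8) * 255.0
    image = Image.fromarray(scaled.astype(np.uint8))
    return image.resize(target_size)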
Section 3.5 – Loading Pseudo-Labeled Data
As mentioned in the Training Data section, the train_soundscapes directory contains nearly 10,000 unlabeled audio recordings of birdsong. These audio recordings can be incorporated into the training data via pseudo-labeling. A simple process to create pseudo-labeled data is as follows:
- Train a classifier without pseudo-labeled data.
- Load training soundscape audio files.
- Segment each audio soundscape into 5 second frames.
- Generate mel spectrograms for each 5 second frame and resize to (224, 224).
- Run predictions on each resized mel spectrogram using the classifier that you trained in the first step.
- Keep the predictions above a desired confidence level and save the mel spectrograms for those predictions to disk under the predicted class label.
- Train your classifier again using the pseudo-labeled data.
Pseudo-labeled data can improve the performance of your classifier. If you want to generate your own pseudo-labeled data, you should continue with the remaining sections to train a classifier without pseudo-labeled data. Then, use your classifier to create your own set of pseudo-labeled data using the process outlined above. Finally, re-train your classifier using your pseudo-labeled data.
This implementation does not use pseudo-labeled data. However, you can modify the inference notebook referenced in the Running Inference section to generate pseudo-labeled data.
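If you do generate your own pseudo-labeled data, the confidence-filtering step might look like the following sketch, where the model, spectrogram batch, class name list, and threshold value are all assumptions:

import numpy as np

CONFIDENCE_THRESHOLD = 0.9  # assumed value; tune to taste

def pseudo_label_sketch(model, spectrograms: np.ndarray, class_names: list):
    """Yield (spectrogram, pseudo-label) pairs for high-confidence predictions only."""
    probs = model.predict(spectrograms)  # shape: (num_frames, num_classes)
    for spec, p in zip(spectrograms, probs):
        if p.max() >= CONFIDENCE_THRESHOLD:
            yield spec, class_names[int(p.argmax())]  # save to disk under this class label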
Set the USE_PSEUDO_LABELS constant in the Config cell to False to skip the use of pseudo-labeled data.
Section 3.6 – Encoding Labels
The process_labels method is used to one-hot encode labels. One-hot encoded labels are returned as NumPy arrays and added to the training label and validation label lists.
Section 3.7 – Converting Training and Validation Data Lists to TensorFlow Dataset Objects
The TensorFlow data.Dataset.from_tensor_slices method is used to create TensorFlow Dataset objects from the training and validation data lists. The shuffle method is called on the training Dataset object to shuffle training data before batching. The batch method is called on both Dataset objects to batch the training and validation datasets. The BATCH_SIZE constant in the Config cell controls the batch size.
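A minimal sketch of this step, reusing the list names from Section 3.1 (which at this point hold the resized spectrograms and their one-hot encoded labels):

import tensorflow as tf

train_ds = (
    tf.data.Dataset.from_tensor_slices((training_audio, training_labels))
    .shuffle(buffer_size=len(training_audio))  # shuffle before batching
    .batch(BATCH_SIZE)
)

validation_ds = (
    tf.data.Dataset.from_tensor_slices((validation_audio, validation_labels))
    .batch(BATCH_SIZE)
)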
Section 3.8 – Using MixUp to Augment Training Data
As you may already know, MixUp is a data augmentation technique that effectively mixes two images together to create a new data sample. The class for the blended image is a blend of the classes associated with the original 2 images. The mix_up method, along with the sample_beta_distribution method, encapsulates the optional MixUp logic.
This implementation uses MixUp to augment the training data. To use MixUp, set the USE_MIXUP constant in the Config cell to True.
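For reference, a typical MixUp implementation in TensorFlow looks like the sketch below; the notebook's mix_up and sample_beta_distribution methods follow the same idea, though the alpha value shown here is an assumption:

import tensorflow as tf

def sample_beta_distribution_sketch(size, alpha=0.2):
    """Sample MixUp mixing coefficients from a Beta(alpha, alpha) distribution."""
    gamma_1 = tf.random.gamma(shape=[size], alpha=alpha)
    gamma_2 = tf.random.gamma(shape=[size], alpha=alpha)
    return gamma_1 / (gamma_1 + gamma_2)

def mix_up_sketch(images, labels, alpha=0.2):
    """Blend a batch with a shuffled copy of itself; labels are blended with the same weights."""
    batch_size = tf.shape(images)[0]
    lam = sample_beta_distribution_sketch(batch_size, alpha)
    lam_img = tf.reshape(lam, [-1, 1, 1, 1])  # broadcast over (height, width, channels)
    lam_lab = tf.reshape(lam, [-1, 1])
    indices = tf.random.shuffle(tf.range(batch_size))
    mixed_images = lam_img * images + (1.0 - lam_img) * tf.gather(images, indices)
    mixed_labels = lam_lab * labels + (1.0 - lam_lab) * tf.gather(labels, indices)
    return mixed_images, mixed_labels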
Section 4 – Model Training
The Model Training section of the notebook:
- Initializes and configures a WandB project to capture training run data.
- Builds and compiles the EfficientNet B0 model.
- Trains the model.
- Saves the trained model to disk.
Section 4.1 – Initializing and Configuring WandB Project
Ensure that you have attached your own WandB API key as a Kaggle Secret to the notebook and that the WandB login method in Section 0 of the notebook has returned True.
The Config cell in this section includes logic to initialize and configure a new WandB project (if the project doesn’t already exist) that will capture training run data:
wandb.init(project="my-bird-vocalization-classifier")
config = wandb.config
config.batch_size = BATCH_SIZE
config.epochs = 30
config.image_size = IMG_SIZE
config.num_classes = len(LABELS)
Obviously, you can change the project name my-bird-vocalization-classifier to your desired WandB project name.
Section 4.2 – Building and Compiling the EfficientNet B0 Model
The build_model method is used to load the pre-trained EfficientNet B0 model with ImageNet weights and without the top layer:
model = EfficientNetB0(include_top=False, input_tensor=inputs, weights="imagenet")
The model is initially frozen to leverage the pre-trained ImageNet weights, with the objective of unfreezing (i.e. training) only the layers in the final stage of the model:
# Unfreeze last `unfreeze_layers` layers and add regularization
for layer in model.layers[-unfreeze_layers:]:
if not isinstance(layer, layers.BatchNormalization):
layer.trainable = True
layer.kernel_regularizer = tf.keras.regularizers.l2(L2_RATE)
The UNFREEZE_LAYERS constant in the Config cell specifies the number of layers to unfreeze.
The top of the model is rebuilt with a final Dense layer reflecting the number of bird species classes. Categorical focal cross-entropy is chosen as the loss function to help address the class imbalance. The LOSS_ALPHA and LOSS_GAMMA constants in the Config cell are used with the loss function.
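Pulling the pieces of this section together, a build-and-compile sketch might look like the following. The GlobalAveragePooling2D/Dropout head, the dropout rate, and the Adam optimizer are assumptions; the unfreezing loop mirrors the snippet above, and CategoricalFocalCrossentropy requires a recent Keras version:

import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.applications import EfficientNetB0

def build_model_sketch(num_classes, unfreeze_layers, l2_rate, loss_alpha, loss_gamma, lr):
    inputs = layers.Input(shape=(224, 224, 3))
    base = EfficientNetB0(include_top=False, input_tensor=inputs, weights="imagenet")
    base.trainable = False  # freeze all layers first

    # Unfreeze the last `unfreeze_layers` layers, skipping BatchNormalization layers.
    for layer in base.layers[-unfreeze_layers:]:
        if not isinstance(layer, layers.BatchNormalization):
            layer.trainable = True
            layer.kernel_regularizer = tf.keras.regularizers.l2(l2_rate)

    # Rebuild the top of the model with a classification head.
    x = layers.GlobalAveragePooling2D()(base.output)
    x = layers.Dropout(0.3)(x)  # assumed dropout rate
    outputs = layers.Dense(num_classes, activation="softmax")(x)

    model = tf.keras.Model(inputs, outputs)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
        loss=tf.keras.losses.CategoricalFocalCrossentropy(alpha=loss_alpha, gamma=loss_gamma),
        metrics=["accuracy"],
    )
    return model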
Section 4.3 – Model Training
The fit method is called on the compiled model from Section 4.2 to run training. Note that a learning rate scheduler callback, lr_scheduler, is used instead of a constant learning rate. An initial learning rate of 4.0e-4 is hardcoded into the callback, and the learning rate is decreased in 2 stages based on the epoch count. The number of training epochs is controlled by the EPOCHS constant in the Config cell.
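A sketch of the scheduler and the training call is shown below; the stage boundaries and decay factors are assumptions, as are the train_ds and validation_ds names carried over from the earlier sketch:

import tensorflow as tf

def lr_schedule_sketch(epoch, lr):
    """Two-stage decay from an initial learning rate of 4.0e-4 (boundaries are assumed)."""
    initial_lr = 4.0e-4
    if epoch < 10:
        return initial_lr
    elif epoch < 20:
        return initial_lr * 0.1
    return initial_lr * 0.01

lr_scheduler = tf.keras.callbacks.LearningRateScheduler(lr_schedule_sketch)

history = model.fit(
    train_ds,
    validation_data=validation_ds,
    epochs=EPOCHS,
    callbacks=[lr_scheduler],
)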
Section 4.4 – Model Saving
The save method is called on the compiled model following training to save the model to disk:
model.save("bird-vocalization-classifier.keras")
Training Results
Running the notebook should produce the following training results, assuming you used the experimental setup that was described in the Building the Classifier section:
As seen, training accuracy is just above 90% while validation accuracy is about 70% after training for 30 epochs. However, validation accuracy fluctuates significantly. This variation is partially attributed to the class imbalance, with available memory limiting the use of additional augmentations to fully address it. The results suggest that the model overfits the training data and does not generalize as well as hoped. Nonetheless, the model can be used for predictions alongside the GBV classifier in accordance with the original objective.
Running Inference
This Kaggle notebook (“inference notebook”) can be used for running inference. The inference notebook logic uses both the GBV classifier model and the model that you trained in the preceding section. It runs inference on the unlabeled soundscape files in the train_soundscapes directory, splitting each soundscape audio file into 5 second frames. The MAX_FILES constant defined in the Config cell of Section 0 of the notebook controls the number of soundscape audio files that are loaded for inference.
The inference notebook first generates predictions using the GBV classifier. The predictions for the 143 BirdCLEF+ 2025 competition dataset classes known to the GBV classifier are isolated. If the maximum probability among the 143 “known” classes is greater than or equal to GBV_CLASSIFIER_THRESHOLD, the GBV predicted class is selected as the true class. If the maximum probability among the 143 “known” classes is below GBV_CLASSIFIER_THRESHOLD, it is assumed that the true class is among the 63 classes “unknown” to the GBV classifier, i.e. the classes used to train the model in the preceding section. In that case, the logic runs predictions using the finetuned model and selects the predicted class from that prediction set as the true class.
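The per-frame decision logic can be summarized by the following sketch, where gbv_probs holds the GBV classifier's probabilities restricted to the 143 known competition classes; the function and variable names are illustrative rather than the inference notebook's actual code:

import numpy as np

def predict_frame_sketch(gbv_probs, gbv_classes, spectrogram, custom_model, custom_classes,
                         threshold=GBV_CLASSIFIER_THRESHOLD):
    """Return (predicted_class, probability) for a single 5 second frame."""
    if gbv_probs.max() >= threshold:
        # The GBV classifier is confident: take its prediction.
        return gbv_classes[int(gbv_probs.argmax())], float(gbv_probs.max())
    # Otherwise assume one of the 63 "unknown" classes and defer to the finetuned model.
    custom_probs = custom_model.predict(spectrogram[np.newaxis, ...])[0]
    return custom_classes[int(custom_probs.argmax())], float(custom_probs.max())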
The GBV_CLASSIFIER_THRESHOLD constant is defined in the Config cell of Section 5 of the inference notebook. Predictions are output to 2 files:
- A preds.csv file that captures the prediction and prediction probability for each 5-second soundscape slice.
- A submission.csv file that captures all class probabilities in the format required for the BirdCLEF+ 2025 competition.
Set the path to your finetuned model in the first cell of Section 4 of the inference notebook.
Future Work
The training notebook can be used to train a model on all 206 BirdCLEF+ 2025 classes, eliminating the need for the GBV classifier, at least with respect to the competition dataset. As mentioned earlier, passing an empty list, [], to the load_training_audio method will load audio data from all classes. The MAX_FILES and LOAD_SLICE constants can be used to limit the amount of loaded audio in order to work within the confines of a Kaggle notebook environment.
Of course, a more accurate model can be trained using a larger amount of training data. Ideally, a greater number of augmentations would be used to address the class imbalance. Additionally, other augmentation techniques, such as CutMix, could be implemented to further augment the training data. However, these strategies demand a more robust development environment.