Guide to YAMNet: Sound Event Classifier
Transfer learning is a popular machine learning technique in which a new model is trained by reusing the knowledge captured by an existing, pre-trained model. You have probably come across its common applications in the vision domain (image classification, object detection) and the text domain (sentiment analysis, question answering), and the list goes on.
In this guide, we will apply the same idea to a relatively new type of data, audio, by building a sound classifier. Sound classification has many important use cases, such as tracking whales and other animals that rely on sound to navigate, or protecting wildlife from poaching and encroachment.
With YAMNet, we can easily create a sound classifier in a few simple steps!
YAMNet (Yet Another Mobile Network), yes, that really is the full form, is a pretrained acoustic event detection model developed by Dan Ellis and trained on the AudioSet dataset, which contains labelled audio drawn from more than 2 million YouTube videos. It employs the MobileNet_v1 depthwise-separable convolution architecture. The pretrained model is readily available on TensorFlow Hub, along with TFLite (for mobile) and TF.js (for the web) versions.
Let’s better understand this amazing model with a practical use case and hands-on Python Implementation.
Importing the Dependencies
We import tensorflow_hub to load the pre-trained model and scipy.io.wavfile to read audio files.
IPython.display lets us play audio right here in the notebook.
import tensorflow as tf
import tensorflow_hub as hub
import numpy as np
import csv
import matplotlib.pyplot as plt
from IPython.display import Audio
from scipy.io import wavfile
import scipy.signal  # needed later for resampling in ensure_sample_rate
Loading the Model
We instantiate the pre-trained model into a variable using hub.load so it can be used in the cells below. The labels file is shipped with the model assets and can be found at model.class_map_path(); we will need it later to populate the class_names variable.
# Load the model.
model = hub.load('https://tfhub.dev/google/yamnet/1')
Helper_Function_1
This helper function loads the list of class names from the model's label map; we will use it later to find the name of the class with the top score when the scores are mean-aggregated across frames.
# Find the name of the class with the top score when mean-aggregated across frames.
def class_names_from_csv(class_map_csv_text):
  """Returns list of class names corresponding to score vector."""
  class_names = []
  with tf.io.gfile.GFile(class_map_csv_text) as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
      class_names.append(row['display_name'])
  return class_names

class_map_path = model.class_map_path().numpy()
class_names = class_names_from_csv(class_map_path)
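As a quick sanity check (this snippet is an illustrative addition, not part of the original notebook), we can confirm that the label map contains YAMNet's 521 AudioSet classes and peek at the first few of them.
# Illustrative sanity check: YAMNet predicts 521 AudioSet classes.
print(len(class_names))   # expected: 521
print(class_names[:5])    # e.g. ['Speech', 'Child speech, kid speaking', ...]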
Helper_Function_2
We add this helper to verify that the loaded audio has the sample rate YAMNet expects, 16 kHz, and to resample it if it does not. The model's authors note this requirement in the YAMNet documentation, and feeding audio at the wrong sample rate can adversely affect the results.
def ensure_sample_rate(original_sample_rate, waveform,
                       desired_sample_rate=16000):
  """Resample waveform if required."""
  if original_sample_rate != desired_sample_rate:
    desired_length = int(round(float(len(waveform)) /
                               original_sample_rate * desired_sample_rate))
    waveform = scipy.signal.resample(waveform, desired_length)
  return desired_sample_rate, waveform
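To see what the helper does, here is a small illustrative example; the 44.1 kHz sine wave below is made up for demonstration and is not part of the tutorial's data.
# Illustrative example: resample one second of a 44.1 kHz sine wave to 16 kHz.
demo_rate = 44100
demo_wave = np.sin(2 * np.pi * 440 * np.arange(demo_rate) / demo_rate)
new_rate, resampled = ensure_sample_rate(demo_rate, demo_wave)
print(new_rate, len(resampled))  # 16000, 16000 samples for one second of audio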
Downloading and Preparing the Sound File
The Colab notebook will have all links required; you just have to run the notebook provided.
!curl -O https://storage.googleapis.com/audioset/speech_whistling2.wav
!curl -O https://storage.googleapis.com/audioset/miaow_16k.wav
With the following snippet we can listen to one of the downloaded audio files and inspect some of its basic properties, such as its sample rate and duration.
# wav_file_name = 'speech_whistling2.wav'
wav_file_name = 'miaow_16k.wav'
sample_rate, wav_data = wavfile.read(wav_file_name)
sample_rate, wav_data = ensure_sample_rate(sample_rate, wav_data)
# Show some basic information about the audio.
duration = len(wav_data)/sample_rate
print(f'Sample rate: {sample_rate} Hz')
print(f'Total duration: {duration:.2f}s')
print(f'Size of the input: {len(wav_data)} samples')
# Listening to the wav file.
Audio(wav_data, rate=sample_rate)
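YAMNet expects a mono waveform. The sample files above are already mono, but your own recordings may be stereo; the following illustrative snippet (an addition to the original notebook) shows one way to down-mix before passing the audio on.
# Illustrative handling for stereo files: average the two channels into one.
if wav_data.ndim > 1:
  wav_data = wav_data.mean(axis=1).astype(wav_data.dtype)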
Running the Model
We normalise the int16 wav data to floats in the range [-1.0, 1.0] before feeding it to the pre-trained model. The model returns scores, embeddings and a log-mel spectrogram as output, which we can display later. Using the class names loaded by Helper_Function_1, the class with the highest mean score across frames for this clip turns out to be "Animal", i.e. the label the model considers most likely for this audio.
waveform = wav_data / tf.int16.max
# Run the model, check the output.
scores, embeddings, spectrogram = model(waveform)
scores_np = scores.numpy()
spectrogram_np = spectrogram.numpy()
inferred_class = class_names[scores_np.mean(axis=0).argmax()]
print(f'The main sound is: {inferred_class}')
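To get a feel for what the model returns, we can inspect the shapes of the three outputs. Per the YAMNet documentation, scores and embeddings have one row per audio frame, with 521 class scores and 1,024-dimensional embeddings respectively; the print below is an illustrative addition, not from the original notebook.
# Illustrative check of the output shapes:
# scores      -> (num_frames, 521)   one score per audio event class per frame
# embeddings  -> (num_frames, 1024)  feature vector per frame, useful for transfer learning
# spectrogram -> (num_bins, 64)      log-mel spectrogram of the whole clip
print(scores.shape, embeddings.shape, spectrogram_np.shape)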
Plotting the Output
We now plot the input waveform together with two of the model outputs: the log-mel spectrogram and the frame-by-frame scores for the top classes.
plt.figure(figsize=(10, 6))

# Plot the waveform.
plt.subplot(3, 1, 1)
plt.plot(waveform)
plt.xlim([0, len(waveform)])

# Plot the log-mel spectrogram (returned by the model).
plt.subplot(3, 1, 2)
plt.imshow(spectrogram_np.T, aspect='auto', interpolation='nearest', origin='lower')

# Plot and label the model output scores for the top-scoring classes.
mean_scores = np.mean(scores, axis=0)
top_n = 10
top_class_indices = np.argsort(mean_scores)[::-1][:top_n]
plt.subplot(3, 1, 3)
plt.imshow(scores_np[:, top_class_indices].T, aspect='auto', interpolation='nearest', cmap='gray_r')

# patch_padding = (PATCH_WINDOW_SECONDS / 2) / PATCH_HOP_SECONDS
# values from the model documentation
patch_padding = (0.025 / 2) / 0.01
plt.xlim([-patch_padding - 0.5, scores.shape[0] + patch_padding - 0.5])

# Label the top_N classes.
yticks = range(0, top_n, 1)
plt.yticks(yticks, [class_names[top_class_indices[x]] for x in yticks])
_ = plt.ylim(-0.5 + np.array([top_n, 0]))
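If you prefer a textual summary over the plot, this illustrative snippet (an addition to the original notebook) prints the top five classes together with their mean scores.
# Illustrative textual summary of the top predictions.
for i in top_class_indices[:5]:
  print(f'{class_names[i]}: {mean_scores[i]:.3f}')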
After running the above snippet, the output figure shows the waveform on top, the log-mel spectrogram in the middle, and the scores for the top ten classes at the bottom.
And there you have it: a sound classifier built on the pretrained YAMNet model. I recommend trying the model on different open-source datasets, including those linked in this blog. It is a somewhat unusual use case that will broaden your deep learning skillset.
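Coming back to the transfer learning idea from the introduction, YAMNet's 1,024-dimensional embeddings can serve as features for a classifier of your own. The sketch below is a minimal, illustrative example only; the two-class setup, the get_embeddings helper and the tiny Keras head are assumptions for demonstration and are not part of the original tutorial.
# Minimal sketch: train a small classifier on top of YAMNet embeddings.
# Assumes `model` (YAMNet) is already loaded as above, and that you have
# your own 16 kHz waveforms with labels, e.g. 0 = "dog", 1 = "cat".

def get_embeddings(waveform):
  """Runs YAMNet and averages the per-frame embeddings into one vector."""
  _, embeddings, _ = model(waveform)
  return tf.reduce_mean(embeddings, axis=0)

# Hypothetical training data: my_waveforms is a list of float arrays in [-1, 1].
# X = np.stack([get_embeddings(w).numpy() for w in my_waveforms])
# y = np.array(my_labels)

classifier = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(1024,)),            # YAMNet embedding size
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(2, activation='softmax')   # two example classes
])
classifier.compile(optimizer='adam',
                   loss='sparse_categorical_crossentropy',
                   metrics=['accuracy'])
# classifier.fit(X, y, epochs=10)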
The Google Colab notebook is present here for reference.




