Guide to YAMNet: Sound Event Classifier
Transfer learning is a popular machine learning technique in which a new model is trained by reusing the knowledge captured by an existing, pre-trained model. You have probably come across its common applications in the vision domain (image classification, object detection) and the text domain (sentiment analysis, question answering), and the list goes on.
In this guide, we will apply the same idea to a relatively new type of data, audio, by building a sound classifier. Sound classification has many important use cases, such as tracking whales and other animals that rely on sound to navigate, or protecting wildlife from poaching and encroachment.
With YAMNet, we can easily create a sound classifier in a few simple steps!
YAMNet (Yet Another Mobile Network), yes, that really is the full form, is a pretrained acoustic event detection model developed by Dan Ellis and trained on the AudioSet dataset, which contains labelled audio drawn from more than 2 million YouTube videos. It employs the MobileNet_v1 depthwise-separable convolution architecture. The pretrained model is readily available on TensorFlow Hub, along with TFLite (for mobile) and TF.js (for the web) versions.
Let’s better understand this amazing model with a practical use case and hands-on Python Implementation.
Importing the Dependencies
We import tensorflow_hub to load the pre-trained model and scipy.io.wavfile to read audio files.
IPython.display lets us play audio right here in the notebook.
import tensorflow as tf
import tensorflow_hub as hub
import numpy as np
import csv
import matplotlib.pyplot as plt
from IPython.display import Audio
from scipy.io import wavfile
import scipy.signal  # needed later for resampling in ensure_sample_rate
Loading the Model
We instantiate the pre-trained model into a variable using hub.load so it can be used in the cells below. The labels file is shipped with the model assets and can be found at model.class_map_path(); we will need it later to populate the class_names variable.
# Load the model.
model = hub.load('https://tfhub.dev/google/yamnet/1')
Helper_Function_1
This helper function loads the list of class names from the model's label map; we will use it later to find the name of the class with the top score when the scores are mean-aggregated across frames.
# Find the name of the class with the top score when mean-aggregated across frames.
def class_names_from_csv(class_map_csv_text):
  """Returns list of class names corresponding to score vector."""
  class_names = []
  with tf.io.gfile.GFile(class_map_csv_text) as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
      class_names.append(row['display_name'])
  return class_names

class_map_path = model.class_map_path().numpy()
class_names = class_names_from_csv(class_map_path)
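As a quick sanity check (this snippet is an illustrative addition, not part of the original notebook), we can confirm that the label map contains YAMNet's 521 AudioSet classes and peek at the first few of them.
# Illustrative sanity check: YAMNet predicts 521 AudioSet classes.
print(len(class_names))   # expected: 521
print(class_names[:5])    # e.g. ['Speech', 'Child speech, kid speaking', ...]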
Helper_Function_2
We add this helper to verify that the loaded audio has the sample rate YAMNet expects, 16 kHz, and to resample it if it does not. The model's authors note this requirement in the YAMNet documentation, and feeding audio at the wrong sample rate can adversely affect the results.
def ensure_sample_rate(original_sample_rate, waveform,
                       desired_sample_rate=16000):
  """Resample waveform if required."""
  if original_sample_rate != desired_sample_rate:
    desired_length = int(round(float(len(waveform)) /
                               original_sample_rate * desired_sample_rate))
    waveform = scipy.signal.resample(waveform, desired_length)
  return desired_sample_rate, waveform
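To see what the helper does, here is a small illustrative example; the 44.1 kHz sine wave below is made up for demonstration and is not part of the tutorial's data.
# Illustrative example: resample one second of a 44.1 kHz sine wave to 16 kHz.
demo_rate = 44100
demo_wave = np.sin(2 * np.pi * 440 * np.arange(demo_rate) / demo_rate)
new_rate, resampled = ensure_sample_rate(demo_rate, demo_wave)
print(new_rate, len(resampled))  # 16000, 16000 samples for one second of audio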
Downloading and Preparing the Sound File
The Colab notebook will have all links required; you just have to run the notebook provided.
!curl -O https://storage.googleapis.com/audioset/speech_whistling2.wav
!curl -O https://storage.googleapis.com/audioset/miaow_16k.wav
With the following snippet we can listen to one of the downloaded audio files and inspect some of its basic properties, such as its sample rate and duration.
# wav_file_name = 'speech_whistling2.wav'
wav_file_name = 'miaow_16k.wav'
sample_rate, wav_data = wavfile.read(wav_file_name)
sample_rate, wav_data = ensure_sample_rate(sample_rate, wav_data)
# Show some basic information about the audio.
duration = len(wav_data)/sample_rate
print(f'Sample rate: {sample_rate} Hz')
print(f'Total duration: {duration:.2f}s')
print(f'Size of the input: {len(wav_data)} samples')
# Listening to the wav file.
Audio(wav_data, rate=sample_rate)
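YAMNet expects a mono waveform. The sample files above are already mono, but your own recordings may be stereo; the following illustrative snippet (an addition to the original notebook) shows one way to down-mix before passing the audio on.
# Illustrative handling for stereo files: average the two channels into one.
if wav_data.ndim > 1:
  wav_data = wav_data.mean(axis=1).astype(wav_data.dtype)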
Running the Model
We normalise the int16 wav data to floats in the range [-1.0, 1.0] before feeding it to the pre-trained model. The model returns scores, embeddings and a log-mel spectrogram as output, which we can display later. Using the class names loaded by Helper_Function_1, the class with the highest mean score across frames for this clip turns out to be "Animal", i.e. the label the model considers most likely for this audio.
waveform = wav_data / tf.int16.max
# Run the model, check the output.
scores, embeddings, spectrogram = model(waveform)
scores_np = scores.numpy()
spectrogram_np = spectrogram.numpy()
inferred_class = class_names[scores_np.mean(axis=0).argmax()]
print(f'The main sound is: {inferred_class}')
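To get a feel for what the model returns, we can inspect the shapes of the three outputs. Per the YAMNet documentation, scores and embeddings have one row per audio frame, with 521 class scores and 1,024-dimensional embeddings respectively; the print below is an illustrative addition, not from the original notebook.
# Illustrative check of the output shapes:
# scores      -> (num_frames, 521)   one score per audio event class per frame
# embeddings  -> (num_frames, 1024)  feature vector per frame, useful for transfer learning
# spectrogram -> (num_bins, 64)      log-mel spectrogram of the whole clip
print(scores.shape, embeddings.shape, spectrogram_np.shape)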
Plotting the Output
We now plot the input waveform together with two of the model outputs: the log-mel spectrogram and the frame-by-frame scores for the top classes.
plt.figure(figsize=(10, 6))

# Plot the waveform.
plt.subplot(3, 1, 1)
plt.plot(waveform)
plt.xlim([0, len(waveform)])

# Plot the log-mel spectrogram (returned by the model).
plt.subplot(3, 1, 2)
plt.imshow(spectrogram_np.T, aspect='auto', interpolation='nearest', origin='lower')

# Plot and label the model output scores for the top-scoring classes.
mean_scores = np.mean(scores, axis=0)
top_n = 10
top_class_indices = np.argsort(mean_scores)[::-1][:top_n]
plt.subplot(3, 1, 3)
plt.imshow(scores_np[:, top_class_indices].T, aspect='auto', interpolation='nearest', cmap='gray_r')

# patch_padding = (PATCH_WINDOW_SECONDS / 2) / PATCH_HOP_SECONDS
# values from the model documentation
patch_padding = (0.025 / 2) / 0.01
plt.xlim([-patch_padding - 0.5, scores.shape[0] + patch_padding - 0.5])

# Label the top_N classes.
yticks = range(0, top_n, 1)
plt.yticks(yticks, [class_names[top_class_indices[x]] for x in yticks])
_ = plt.ylim(-0.5 + np.array([top_n, 0]))
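If you prefer a textual summary over the plot, this illustrative snippet (an addition to the original notebook) prints the top five classes together with their mean scores.
# Illustrative textual summary of the top predictions.
for i in top_class_indices[:5]:
  print(f'{class_names[i]}: {mean_scores[i]:.3f}')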
After running the above snippet, the output figure shows the waveform on top, the log-mel spectrogram in the middle, and the scores for the top ten classes at the bottom.
And there you have it: a sound classifier built on the pretrained YAMNet model. I recommend trying the model on different open-source datasets, including those linked in this blog. It is a somewhat unusual use case that will broaden your deep learning skillset.
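Coming back to the transfer learning idea from the introduction, YAMNet's 1,024-dimensional embeddings can serve as features for a classifier of your own. The sketch below is a minimal, illustrative example only; the two-class setup, the get_embeddings helper and the tiny Keras head are assumptions for demonstration and are not part of the original tutorial.
# Minimal sketch: train a small classifier on top of YAMNet embeddings.
# Assumes `model` (YAMNet) is already loaded as above, and that you have
# your own 16 kHz waveforms with labels, e.g. 0 = "dog", 1 = "cat".

def get_embeddings(waveform):
  """Runs YAMNet and averages the per-frame embeddings into one vector."""
  _, embeddings, _ = model(waveform)
  return tf.reduce_mean(embeddings, axis=0)

# Hypothetical training data: my_waveforms is a list of float arrays in [-1, 1].
# X = np.stack([get_embeddings(w).numpy() for w in my_waveforms])
# y = np.array(my_labels)

classifier = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(1024,)),            # YAMNet embedding size
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(2, activation='softmax')   # two example classes
])
classifier.compile(optimizer='adam',
                   loss='sparse_categorical_crossentropy',
                   metrics=['accuracy'])
# classifier.fit(X, y, epochs=10)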
The Google Colab notebook is present here for reference.




